ALM Coffee II – with Incidents and Problems

Proper ITIL needs Problem Management, and that starts with the difference between Incident and Problem. Most people are familiar with the term Incident. Why it makes sense to distinguish it from a Problem is explained below.

Incident

An incident is an unexpected disturbance of a service (impairment, interruption). To put it bluntly, the user cannot work. Incidents are always reactive. Incidents must be distinguished from service requests and request fulfillment, which usually involve standardized, known activities.

The incident is usually handled by 1st level support. The aim is to restore the service as quickly as possible: with a solution if the fault is found in the known-error database (knowledge base), otherwise with a workaround. The main thing is that the user can work again.

Incident handling is usually subject to SLAs and has defined escalation scenarios. Customer satisfaction has priority.
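The 1st level flow described above can be sketched roughly as follows. This is only an illustration, not how any particular service desk tool implements it: the class names `KnownError` and `Incident`, the function `triage`, and all sample symptoms and SLA values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class KnownError:
    """Hypothetical known-error database entry."""
    symptom: str
    workaround: Optional[str] = None
    solution: Optional[str] = None


@dataclass
class Incident:
    """Hypothetical incident record; the SLA clock starts at opened_at."""
    symptom: str
    opened_at: datetime
    sla: timedelta
    resolution: Optional[str] = None

    def sla_deadline(self) -> datetime:
        return self.opened_at + self.sla


def triage(incident: Incident, known_errors: Dict[str, KnownError]) -> str:
    """Restore service from the known-error database if possible.

    Returns "solved", "workaround", or "escalate".
    """
    entry = known_errors.get(incident.symptom)
    if entry is None:
        return "escalate"
    if entry.solution:
        incident.resolution = entry.solution
        return "solved"
    if entry.workaround:
        incident.resolution = entry.workaround
        return "workaround"
    return "escalate"


# Assumed sample data, for illustration only.
kedb = {
    "printer offline": KnownError("printer offline", workaround="use printer B"),
    "login loop": KnownError("login loop", solution="clear session cache"),
}
opened = datetime(2024, 1, 8, 9, 0)
printer = Incident("printer offline", opened, timedelta(hours=4))
login = Incident("login loop", opened, timedelta(hours=4))
unknown = Incident("report empty", opened, timedelta(hours=4))

results = [triage(i, kedb) for i in (printer, login, unknown)]
```

Note that `triage` never analyzes the cause: it either restores the service from what is already known or escalates.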

Problem

A Problem, on the other hand, targets the cause of a malfunction within the IT infrastructure. The aim is to fix that cause and to record the solution in the known-error database.

A problem can be raised reactively in response to one or more incidents, but Problem Management also includes proactive measures (regular reviews) to prevent (further) incidents. This includes actions that need to be taken within the IT infrastructure, which makes the Problem the interface to Change Management. Problems are usually handled by 2nd level support, changes by 3rd level support.

As long as the cause of a problem has not been found, new incidents will keep arising, and they are attached to the existing problem.

Since the duration of such a root cause analysis is not known in advance, a problem is usually not subject to SLAs. The quality of the solution found has priority.
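A minimal sketch of such a problem record, assuming a simple in-memory model: incidents are attached by ID, there is deliberately no SLA deadline, and resolving the problem writes the solution into the known-error database. The names `Problem`, `attach`, and `resolve` and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Problem:
    """Hypothetical problem record; note that it carries no SLA deadline."""
    title: str
    incident_ids: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None
    solution: Optional[str] = None

    def attach(self, incident_id: str) -> None:
        """Link a further incident of the same series to this problem."""
        if incident_id not in self.incident_ids:
            self.incident_ids.append(incident_id)

    @property
    def is_open(self) -> bool:
        return self.root_cause is None


def resolve(problem: Problem, root_cause: str, solution: str,
            known_errors: Dict[str, str]) -> None:
    """Record the root cause and publish the solution as a known error."""
    problem.root_cause = root_cause
    problem.solution = solution
    known_errors[problem.title] = solution


# Assumed sample data, for illustration only.
known_errors: Dict[str, str] = {}
problem = Problem("nightly job aborts")
for incident_id in ("INC-1", "INC-2", "INC-2", "INC-3"):
    problem.attach(incident_id)

open_before = problem.is_open
resolve(problem, "table lock during backup", "reschedule backup window",
        known_errors)
```

Each incident keeps its own lifecycle and SLA clock; the problem only collects references to them until the cause is found.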

Risks in case of missing distinction

Incident resolution activities can result in longer service interruptions if root cause analysis is performed instead of restoring normal operations.
Incidents are closed too early (SLA!) and no deeper action is taken to determine the root cause and its resolution. No entry is created in the known-error database, so similar incidents keep reappearing.
Incidents are left open so that a root cause analysis can be performed. As a result, it is usually no longer possible to determine when the service will be available again. SLA targets are missed even though the user may already be able to work again. These long-running incidents accumulate and have to be cleaned up regularly.
Requests arise to introduce additional statuses or priorities just to stop the SLA clock.
There are requests to attach further incidents of a series to the first one, which effectively turns it into a problem in disguise. In terms of implementation, this creates a monster over time.
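The SLA effect of leaving an incident open for root cause analysis can be shown with a small calculation. All figures here are assumptions chosen for illustration: a 4-hour SLA, a workaround after 1 hour, and a root cause analysis that takes 3 days.

```python
from datetime import datetime, timedelta


def sla_met(opened_at: datetime, closed_at: datetime, sla: timedelta) -> bool:
    """True if the incident was closed within the agreed time."""
    return closed_at - opened_at <= sla


# Assumed figures, for illustration only.
sla = timedelta(hours=4)
opened = datetime(2024, 1, 8, 9, 0)
service_restored = opened + timedelta(hours=1)   # workaround applied
root_cause_found = opened + timedelta(days=3)    # analysis finished

# Incident closed once the user can work again; analysis runs as a problem.
separated = sla_met(opened, service_restored, sla)

# Incident kept open until the root cause is known.
kept_open = sla_met(opened, root_cause_found, sla)
```

With these figures `separated` is `True` and `kept_open` is `False`, although the user was able to work again after one hour in both cases.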

Advantages of the distinction

If Incident and Problem are separated from each other, support staff can fulfill the goal of rapid recovery in the context of Incident Management and at the same time perform a root cause analysis and solve the problem in a separate, parallel Problem Management process.

Riccardo Escher

Riccardo is a Senior ALM Consultant and has in-depth knowledge of all areas of Solution Manager and ABAP development. He has extensive experience in the areas of ChaRM and the Solution Manager Test Suite.