ITIL Problem Management
ITIL defines a 'problem' as the cause of one or more incidents.
Problem Management includes the activities required to diagnose the root cause of incidents and to determine the resolution to those problems. It is also responsible for ensuring that the resolution is implemented through the appropriate control procedures, especially Change Management and Release Management.
Problem Management will also maintain information about problems and the appropriate workarounds and resolutions, so that the organization is able to reduce the number and impact of incidents over time. In this respect, Problem Management has a strong interface with Knowledge Management, and tools such as the Known Error Database will be used for both.
Although Incident and Problem Management are separate processes, they are closely related and will typically use the same tools, and may use similar categorization, impact and priority coding systems. This will ensure effective communication when dealing with related incidents and problems.
Problem Management - Problem detection
It is likely that multiple ways of detecting problems will exist in all organizations. These will include:
- Suspicion or detection of a cause of one or more incidents by the Service Desk, resulting in a Problem Record being raised - the desk may have resolved the incident but has not determined a definitive cause and suspects that it is likely to recur, so will raise a Problem Record to allow the underlying cause to be resolved. Alternatively, it may be immediately obvious from the outset that an incident, or incidents, has been caused by a major problem, so a Problem Record will be raised without delay.
- Analysis of an incident by a technical support group which reveals that an underlying problem exists, or is likely to exist.
- Automated detection of an infrastructure or application fault, using event/alert tools automatically to raise an incident which may reveal the need for a Problem Record.
- A notification from a supplier or contractor that a problem exists that has to be resolved.
- Analysis of incidents as part of proactive Problem Management - resulting in the need to raise a Problem Record so that the underlying fault can be investigated further.
Frequent and regular analysis of incident and problem data must be performed to identify any trends as they become discernible. This will require meaningful and detailed categorization of incidents/problems and regular reporting of patterns and areas of high occurrence. 'Top ten' reporting, with drill-down capabilities to lower levels, is useful in identifying trends.
Further details of how detected trends should be handled are included in the Continual Service Improvement publication.
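As an illustration of 'top ten' reporting with drill-down, the following is a minimal sketch only. It assumes incidents are already categorized at two levels; the sample records, category names and function names are hypothetical and do not come from ITIL.

```python
from collections import Counter

# Hypothetical incident records as (category, subcategory) pairs.
incidents = [
    ("Network", "Controller"),
    ("Network", "Addressing"),
    ("Application", "Login"),
    ("Network", "Controller"),
    ("Hardware", "Disk"),
    ("Application", "Login"),
    ("Network", "Controller"),
]


def top_categories(records, n=10):
    """Return the n most frequent top-level categories with their counts."""
    return Counter(category for category, _ in records).most_common(n)


def drill_down(records, category, n=10):
    """Return the n most frequent subcategories within one category."""
    return Counter(sub for cat, sub in records if cat == category).most_common(n)


top = top_categories(incidents)
network_detail = drill_down(incidents, "Network")
print(top)
print(network_detail)
```

Running such a report at regular intervals and comparing the results between periods is one simple way to make emerging trends visible.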
Problem Management - Problem logging
Regardless of the detection method, all the relevant details of the problem must be recorded so that a full historic record exists. This must be date and time stamped to allow suitable control and escalation.
A cross-reference must be made to the incident(s) which initiated the Problem Record and all relevant details must be copied from the Incident Record(s) to the Problem Record. It is difficult to be exact, as cases may vary, but typically this will include details such as:
- User details
- Service details
- Equipment details
- Date/time initially logged
- Priority and categorization details
- Incident description
- Details of all diagnostic or attempted recovery actions taken.
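The logging requirements above could be modelled along the following lines. This is a sketch under stated assumptions: the class names, field names and identifiers are hypothetical, and a real service management tool will have its own record schema.

```python
import copy
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    # Fields mirror the details listed above; the names are illustrative.
    incident_id: str
    user: str
    service: str
    equipment: str
    logged_at: datetime
    priority: int
    category: str
    description: str
    actions_taken: list = field(default_factory=list)


@dataclass
class ProblemRecord:
    problem_id: str
    logged_at: datetime          # date and time stamp for control and escalation
    description: str
    related_incident_ids: list   # cross-reference to the initiating incident(s)
    incident_details: list       # copies of the relevant Incident Records


def raise_problem(problem_id, description, incidents):
    """Create a time-stamped Problem Record that copies the incident details."""
    return ProblemRecord(
        problem_id=problem_id,
        logged_at=datetime.now(timezone.utc),
        description=description,
        related_incident_ids=[i.incident_id for i in incidents],
        incident_details=copy.deepcopy(list(incidents)),
    )


incident = IncidentRecord(
    incident_id="INC-1001",
    user="j.smith",
    service="Remote access",
    equipment="VPN gateway 2",
    logged_at=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    priority=2,
    category="Network",
    description="VPN login fails",
    actions_taken=["Restarted VPN client"],
)
problem = raise_problem("PRB-0001", "Recurring VPN login failures", [incident])
```

Copying the incident details, rather than only linking to them, keeps the historic record intact even if the Incident Record is later amended.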
Problem Categorization
Problems must be categorized in the same way as incidents (and it is advisable to use the same coding system) so that the true nature of the problem can be easily traced in the future and meaningful management information can be obtained.
Problem Prioritization
Problems must be prioritized in the same way and for the same reasons as incidents - but the frequency and impact of related incidents must also be taken into account. The coding system described earlier in Table 4.1 (which combines impact with urgency to give an overall priority level) can be used to prioritize problems in the same way that it might be used for incidents, though the definitions and guidance to support staff on what constitutes a problem, and the related service targets at each level, must obviously be devised separately. Problem prioritization should also take into account the severity of the problems. Severity in this context refers to how serious the problem is from an infrastructure perspective, for example:
- Can the system be recovered, or does it need to be replaced?
- How much will it cost?
- How many people, with what skills, will be needed to fix the problem?
- How long will it take to fix the problem?
- How extensive is the problem (e.g. how many CIs are affected)?
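A priority coding system that combines impact with urgency can be sketched as a simple lookup. Table 4.1 is not reproduced in this section, so the three-level scale and the resulting priority values 1 to 5 below are an assumed illustration, as is the choice to rank problems of equal priority by their number of related incidents.

```python
# Assumed three-level scale; 1 is the most serious level.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact, urgency):
    """Combine impact and urgency into a priority from 1 (highest) to 5."""
    return LEVELS[impact] + LEVELS[urgency] - 1


def work_order(problems):
    """Sort by priority, then by number of related incidents (most first)."""
    return sorted(
        problems,
        key=lambda p: (priority(p["impact"], p["urgency"]), -p["related_incidents"]),
    )


# Hypothetical open problems.
problems = [
    {"id": "PRB-1", "impact": "medium", "urgency": "high", "related_incidents": 3},
    {"id": "PRB-2", "impact": "high", "urgency": "high", "related_incidents": 1},
    {"id": "PRB-3", "impact": "high", "urgency": "medium", "related_incidents": 12},
]

ordered_ids = [p["id"] for p in work_order(problems)]
print(ordered_ids)
```

Severity factors such as cost, skills required and the number of CIs affected are not modelled here; an organization would need to decide how, if at all, to encode them.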
Problem Management - Problem resolution
Ideally, as soon as a solution has been found, it should be applied to resolve the problem - but in reality safeguards may be needed to ensure that this does not cause further difficulties. If any change in functionality is required, an RFC must be raised and approved before the resolution can be applied. If the problem is very serious and an urgent fix is needed for business reasons, then an Emergency RFC should be handled by the Emergency Change Advisory Board (ECAB). Otherwise, the RFC should follow the established Change Management process for that type of change - and the resolution should be applied only when the change has been approved and scheduled for release. In the meantime, the KEDB should be used to help quickly resolve any further occurrences of the incidents/problems.
Note: There may be some problems for which a Business Case for resolution cannot be justified (e.g. where the impact is limited but the cost of resolution would be extremely high). In such cases a decision may be taken to leave the Problem Record open but to use a workaround description in the Known Error Record to detect and resolve any recurrences quickly. Care should be taken to use the appropriate code to flag the open Problem Record so that it does not count against the performance of the team performing the process and so that unauthorized rework does not take place.
Problem Closure
When any change has been completed (and successfully reviewed), and the resolution has been applied, the Problem Record should be formally closed - as should any related Incident Records that are still open. A check should be performed at this time to ensure that the record contains a full historical description of all events - and if not, the record should be updated.
The status of any related Known Error Record should be updated to show that the resolution has been applied.
Problem Management - Problem Investigation and Diagnosis
An investigation should be conducted to try to diagnose the root cause of the problem. The speed and nature of this investigation will vary depending upon the impact, severity and urgency of the problem, but the level of resources and expertise applied to finding a resolution should be commensurate with the priority code allocated and the service target in place for that priority level.
There are a number of useful problem solving techniques that can be used to help diagnose and resolve problems - and these should be used as appropriate. Such techniques are described in more detail later in this section.
The CMS must be used to help determine the level of impact and to assist in pinpointing and diagnosing the exact point of failure. The Known Error Database (KEDB) should also be accessed and problem-matching techniques (such as key word searches) should be used to see if the problem has occurred before and, if so, to find the resolution.
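A key word search against the KEDB might look like the following minimal sketch. The KEDB entries, field names and the overlap threshold are hypothetical; a production tool would typically use its own search facilities rather than this simple word-overlap score.

```python
import re

# Hypothetical Known Error Records.
known_errors = [
    {"id": "KE-001", "symptoms": "Print queue stalls after spooler service restart"},
    {"id": "KE-002", "symptoms": "VPN login fails with certificate expired error"},
    {"id": "KE-003", "symptoms": "Email client crashes on startup after update"},
]


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric key words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_known_errors(description, kedb, min_overlap=2):
    """Return IDs of known errors sharing at least min_overlap key words,
    best match first."""
    query = tokenize(description)
    scored = []
    for entry in kedb:
        overlap = len(query & tokenize(entry["symptoms"]))
        if overlap >= min_overlap:
            scored.append((overlap, entry["id"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [entry_id for _, entry_id in scored]


matches = match_known_errors(
    "Users report VPN login fails, certificate error shown", known_errors
)
print(matches)
```

Requiring more than one shared key word is a crude guard against matches on common words; the threshold is an assumption to be tuned against real data.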
It is often valuable to try to recreate the failure, so as to understand what has gone wrong, and then to try various ways of finding the most appropriate and cost-effective resolution to the problem. To do this effectively without causing further disruption to the users, a test system will be necessary that mirrors the production environment.
There are many problem analysis, diagnosis and solving techniques available and much research has been done in this area. Some of the most useful and frequently used techniques include:
- Chronological analysis: When dealing with a difficult problem, there are often conflicting reports about exactly what has happened and when. It is therefore very helpful briefly to document all events in chronological order - to provide a timeline of events. This often makes it possible to see which events may have been triggered by others - or to discount any claims that are not supported by the sequence of events.
- Pain Value Analysis: This is where a broader view is taken of the impact of an incident or problem, or incident/problem type. Instead of just analysing the number of incidents/problems of a particular type in a particular period, a more in-depth analysis is done to determine exactly what level of pain has been caused to the organization/business by these incidents/problems. A formula can be devised to calculate this pain level. Typically this might include taking into account:
  - The number of people affected
  - The duration of the downtime caused
  - The cost to the business (if this can be readily calculated or estimated).
- Kepner and Tregoe: Charles Kepner and Benjamin Tregoe developed a useful way of problem analysis which can be used formally to investigate deeper-rooted problems. They defined the following stages:
  - defining the problem
  - describing the problem in terms of identity, location, time and size
  - establishing possible causes
  - testing the most probable cause
  - verifying the true cause.
- Brainstorming: It can often be valuable to gather together the relevant people, either physically or by electronic means, and to 'brainstorm' the problem - with people throwing in ideas on what the potential cause may be and potential actions to resolve the problem. Brainstorming sessions can be very constructive and innovative but it is equally important that someone, perhaps the Problem Manager, documents the outcome and any agreed actions and keeps a degree of control in the session(s).
- Ishikawa Diagrams: Kaoru Ishikawa (1915-89), a leader in Japanese quality control, developed a method of documenting causes and effects which can be useful in helping identify where something may be going wrong, or be improved. Such a diagram is typically the outcome of a brainstorming session where problem solvers can offer suggestions. The main goal is represented by the trunk of the diagram, and primary factors are represented as branches. Secondary factors are then added as stems, and so on. Creating the diagram stimulates discussion and often leads to increased understanding of a complex problem. An example diagram is given in Appendix D.
- Pareto Analysis: This is a technique for separating important potential causes from more trivial issues. The following steps should be taken:
  - Form a table listing the causes and their frequency as a percentage.
  - Arrange the rows in the decreasing order of importance of the causes, i.e. the most important cause first.
  - Add a cumulative percentage column to the table. By this step, the chart should look something like Table 4.2, which illustrates 10 causes of network failure in an organization.
Table 4.2 Network failures
Causes | Percentage of total | Computation | Cumulative %
Network Controller | 35 | 0% + 35% | 35
File corruption | 26 | 35% + 26% | 61
Addressing conflicts | 19 | 61% + 19% | 80
Server OS | 6 | 80% + 6% | 86
Scripting error | 5 | 86% + 5% | 91
Untested change | 3 | 91% + 3% | 94
Operator error | 2 | 94% + 2% | 96
Backup failure | 2 | 96% + 2% | 98
Intrusion attempts | 1 | 98% + 1% | 99
Disk failure | 1 | 99% + 1% | 100
This chart shows that three primary causes (network controller, file corruption and addressing conflicts) together account for 80% of network failures in the organization. These should therefore be targeted first.
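The Pareto steps can be sketched in a few lines using the percentages from Table 4.2. The function names are illustrative, and the 80% cut-off used to pick out the primary causes is an assumption based on the common 80/20 rule of thumb, not a figure prescribed by the text.

```python
# Percentage of total network failures per cause, taken from Table 4.2.
causes = {
    "Network Controller": 35,
    "File corruption": 26,
    "Addressing conflicts": 19,
    "Server OS": 6,
    "Scripting error": 5,
    "Untested change": 3,
    "Operator error": 2,
    "Backup failure": 2,
    "Intrusion attempts": 1,
    "Disk failure": 1,
}


def pareto_table(percentages):
    """Sort causes by decreasing percentage and add a cumulative column."""
    rows = []
    cumulative = 0
    ordered = sorted(percentages.items(), key=lambda kv: kv[1], reverse=True)
    for cause, pct in ordered:
        cumulative += pct
        rows.append((cause, pct, cumulative))
    return rows


def primary_causes(rows, threshold=80):
    """Return the leading causes needed to reach the cumulative threshold."""
    selected = []
    for cause, _, cumulative in rows:
        selected.append(cause)
        if cumulative >= threshold:
            break
    return selected


rows = pareto_table(causes)
primary = primary_causes(rows)
for cause, pct, cumulative in rows:
    print(f"{cause:<22}{pct:>4}{cumulative:>5}")
print(primary)
```

With the Table 4.2 data, the cumulative column reaches 80% after the third row, which matches the three primary causes identified above.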