ITIL Incident Management
There should be a close interface between the Incident Management process and the Problem Management and Change management processes as well as the function of Service Desk. If not properly controlled, Changes may introduce new Incidents. A way of tracking back is required. It is therefore recommended that the incident records should be held on the same CMDB as the Problem, Known Error and Change records, or at least linked without the need for re-keying, to improve the interfaces and easy interrogation and reporting.
Incident priorities and escalation procedures need to be agreed as part of the Service level Management process and documented in the SLAs.
The Problem Management process requires the accurate and comprehensive recording of Incidents in order to identify and efficiently the cause of the Incidents and trends. Problem Management also needs to liaise closely with the Availability Management process to identify these trends and instigate remedial action.
More on Incident Management
Read more on Incident Management here:
- Definition »
- Escalation »
- Incidents logging »
- Major incidents »
- Scope »
- Priority of incidents »
- Closing incidents »
Incident Management - What is Incident Management
In ITIL terminology, an 'incident' is defined as:
An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident, for example failure of one disk from a mirror set.
Incident Management is the process for dealing with all incidents; this can include failures, questions or queries reported by the users (usually via a telephone call to the Service Desk), by technical staff, or automatically detected and reported by event monitoring tools.
Purpose
The primary goal of the Incident Management process is to restore normal service operation as quickly as possible and minimize the adverse impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained. 'Normal service operation' is defined here as service operation within SLA limits.
Value to business
The value of Incident Management includes:
- The ability to detect and resolve incidents which results in lower downtime to the business, which in turn means higher availability of the service. This means that the business is able to exploit the functionality of the service as designed.
- The ability to align IT activity to real-time business priorities. This is because Incident Management includes the capability to identify business priorities and dynamically allocate resources as necessary.
- The ability to identify potential improvements to services. This happens as a result of understanding what constitutes an incident and also from being in contact with the activities of business operational staff.
- The Service Desk can, during its handling of incidents, identify additional service or training requirements found in IT or the business.
Incident Management is highly visible to the business, and it is therefore easier to demonstrate its value than most areas in Service Operation. For this reason, Incident Management is often one of the first processes to be implemented in Service Management projects. The added benefit of doing this is that Incident Management can be used to highlight other areas that need attention - thereby providing a justification for expenditure on implementing other processes.
Incident Management - Incident escalation
Incident escalation can be divided into:
- Functional escalation. As soon as it becomes clear that the Service Desk
is unable to resolve the incident itself (or when target times for first-point
resolution have been exceededwhichever comes first!) the incident
must be immediately escalated for further support.
If the organization has a second-level support group and the Service Desk believes that the incident can be resolved by that group, it should refer the incident to them. If it is obvious that the incident will need deeper technical knowledgeor when the second-level group has not been able to resolve the incident within agreed target times (whichever comes first), the incident must be immediately escalated to the appropriate third-level support group. Note that third-level support groups may be internalbut they may also be third parties such as software suppliers or hardware manufacturers or maintainers. The rules for escalation and handling of incidents must be agreed in OLAs and UCs with internal and external support groups respectively.
Note: Incident Ownership remains with the Service Desk! Regardless of where an incident is referred to during its life, ownership of the incident remains with the Service Desk at all times. The Service Desk remains responsible for tracking progress, keeping users informed and ultimately for Incident Closure. - Hierarchic escalation. If incidents are of a serious nature (for example
Priority 1 incidents) the appropriate IT managers must be notified, for
informational purposes at least. Hierarchic escalation is also used if the
'Investigation and Diagnosis' and 'Resolution and Recovery' steps are
taking too long or proving too difficult. Hierarchic escalation should
continue up the management chain so that senior managers are aware
and can be prepared and take any necessary action, such as allocating
additional resources or involving suppliers/maintainers. Hierarchic
escalation is also used when there is contention about to whom the
incident is allocated.
Hierarchic escalation can, of course, be initiated by the affected users or customer management, as they see fit - that is why it is important that IT managers are made
The exact levels and timescales for both functional and hierarchic escalation need to be agreed, taking into account SLA targets, and embedded within support tools which can then be used to police and control the process flow within agreed timescales.
The Service Desk should keep the user informed of any relevant escalation that takes place and ensure the Incident Record is updated accordingly to keep a full history of actions.
There may be many incidents in a queue with the same priority level - so it will be the job of the Service Desk and/or Incident Management staff initially, in conjunction with managers of the various support groups to which incidents are escalated, to decide the order in which incidents should be picked up and actively worked on. These managers must ensure that incidents are dealt with in true business priority order and that staff are not allowed to 'cherry-pick' the incidents they choose!
Incident Management - Logging
All incidents must be fully logged and date/time stamped, regardless of whether they are raised through a Service Desk telephone call or whether automatically detected via an event alert.
Note: If Service Desk and/or support staff visit the customers to deal with one incident, they may be asked to deal with further incidents 'while they are there'. It is important that if this is done, a separate Incident Record is logged for each additional incident handled - to ensure that a historical record is kept and credit is given for the work undertaken.
All relevant information relating to the nature of the incident must be logged so that a full historical record is maintained - and so that if the incident has to be referred to other support group(s), they will have all relevant information to hand to assist them.
The information needed for each incident is likely to include:
- Unique reference number
- Incident categorization (often broken down into between two and four levels of sub-categories)
- Incident urgency
- Incident impact
- Incident prioritization
- Date/time recorded
- Name/ID of the person and/or group recording the incident
- Method of notification (telephone, automatic, e-mail, in person, etc.)
- Name/department/phone/location of user
- Call-back method (telephone, mail, etc.)
- Description of symptoms
- Incident status (active, waiting, closed, etc.)
- Related CI
- Support group/person to which the incident is allocated
- Related problem/Known Error
- Activities undertaken to resolve the incident
- Resolution date and time
- Closure category
- Closure date and time.
If the Service Desk does not work 24/7 and responsibility for first-line incident logging and handling passes to another group, such as IT Operations or Network Support, out of Service Desk hours, then these staff need to be equally rigorous about logging of incident details. Full training and awareness needs to be provided to such staff on this issue.
Categorization
Part of the initial logging must be to allocate suitable incident categorization coding so that the exact type of the call is recorded. This will be important later when looking at incident types/frequencies to establish trends for use in Problem Management, Supplier Management and other ITSM activities.
Please note that the check for Service Requests in this process does not imply that Service Requests are incidents. This is simply recognition of the fact that Service Requests are sometimes incorrectly logged as incidents (e.g. a user incorrectly enters the request as an incident from the web interface). This check will detect any such requests and ensure that they are passed to the Request Fulfilment process.
Multi-level categorization is available in most tools - usually to three or four levels of granularity. All organizations are unique and it is therefore difficult to give generic guidance on the categories an organization should use, particularly at the lower levels. However, there is a technique that can be used to assist an organization to achieve a correct and complete set of categories - if they are starting from scratch! The steps involve:
- Hold a brainstorming session among the relevant support groups, involving the SD Supervisor and Incident and Problem Managers.
- Use this session to decide the 'best guess' top-level categories - and include an 'other' category. Set up the relevant logging tools to use these categories for a trial period.
- Use the categories for a short trial period (long enough for several hundred incidents to fall into each category, but not too long that an analysis will take too long to perform).
- Perform an analysis of the incidents logged during the trial period. The number of incidents logged in each higher-level category will confirm whether the categories are worth having - and a more detailed analysis of the 'other' category should allow identification of any additional higherlevel categories that will be needed.
- A breakdown analysis of the incidents within each higher-level category should be used to decide the lower-level categories that will be required.
- Review and repeat these activities after a further period - of, say, one to three months - and again regularly to ensure that they remain relevant. Be aware that any significant changes to categorization may cause some difficulties for incident trending or management reporting - so they should be stabilized unless changes are genuinely required.
If an existing categorization scheme is in use, but it is not thought to be working satisfactorily, the basic idea of the technique suggested above can be used to review and amend the existing scheme.
NOTE: Sometimes the details available at the time an incident is logged may be incomplete, misleading or incorrect. It is therefore important that the categorization of the incident is checked, and updated if necessary, at call closure time (in a separate closure categorization field, so as not to corrupt the original categorization)
Incident Management - Major incidents
A separate procedure, with shorter timescales and greater urgency, must be used for 'major' incidents. A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritization system - such that they will be dealt with through the major incident process.
Note: People sometimes use loose terminology and/or confuse a major incident with a problem. In reality, an incident remains an incident forever - it may grow in impact or priority to become a major incident, but an incident never 'becomes' a problem. A problem is the underlying cause of one or more incidents and remains a separate entity always!
Some lower-priority incidents may also have to be handled through this procedure - due to potential business impact - and some major incidents may not need to be handled in this way if the cause and resolutions are obvious and the normal incident process can easily cope within agreed target resolution times - provided the impact remains low!
Where necessary, the major incident procedure should include the dynamic establishment of a separate major incident team under the direct leadership of the Incident Manager, formulated to concentrate on this incident alone to ensure that adequate resources and focus are provided to finding a swift resolution. If the Service Desk Manager is also fulfilling the role of Incident Manager (say in a small organization), then a separate person may need to be designated to lead the major incident investigation team - so as to avoid conflict of time or priorities - but should ultimately report back to the Incident Manager.
If the cause of the incident needs to be investigated at the same time, then the Problem Manager would be involved as well but the Incident Manager must ensure that service restoration and underlying cause are kept separate. Throughout, the Service Desk would ensure that all activities are recorded and users are kept fully informed of progress.
Incident Management - Scope
Incident Management includes any event which disrupts, or which could disrupt, a service. This includes events which are communicated directly by users, either through the Service Desk or through an interface from Event Management to Incident Management tools.
Incidents can also be reported and/or logged by technical staff (if, for example, they notice something untoward with a hardware or network component they may report or log an incident and refer it to the Service Desk). This does not mean, however, that all events are incidents. Many classes of events are not related to disruptions at all, but are indicators of normal operation or are simply informational (see section 4.1).
Although both incidents and service requests are reported to the Service Desk, this does not mean that they are the same. Service requests do not represent a disruption to agreed service, but are a way of meeting the customer's needs and may be addressing an agreed target in an SLA. Service requests are dealt with by the Request Fulfilment process (see section 4.3).
Incident Management - Incident prioritization
Another important aspect of logging every incident is to agree and allocate an appropriate prioritization code as this will determine how the incident is handled both by support tools and support staff.
Prioritization can normally be determined by taking into account both the urgency of the incident (how quickly the business needs a resolution) and the level of impact it is causing. An indication of impact is often (but not always) the number of users being affected. In some cases, and very importantly, the loss of service to a single user can have a major business impact - it all depends upon who is trying to do what - so numbers alone is not enough to evaluate overall priority! Other factors that can also contribute to impact levels are:
- Risk to life or limb
- The number of services affected - may be multiple services
- The level of financial losses
- Effect on business reputation
- Regulatory or legislative breaches.
An effective way of calculating these elements and deriving an overall priority level for each incident is given in the table
Impact | ||||
High | Medium | Low | ||
High | 1 | 2 | 3 | |
Urgency | Medium | 2 | 3 | 4 |
Low | 3 | 4 | 5 | |
Priority code | Description | Target resolution time | ||
1 | Critical | 1 hour | ||
2 | High | 8 hours | ||
3 | Medium | 24 hours | ||
4 | Low | 48 hours | ||
5 | Planning | Planned |
In all cases, clear guidance - with practical examples - should be provided for all support staff to enable them to determine the correct urgency and impact levels, so the correct priority is allocated. Such guidance should be produced during service level negotiations.
However, it must be noted that there will be occasions when, because of particular business expediency or whatever, normal priority levels have to be overridden. When a user is adamant that an incident's priority level should exceed normal guidelines, the Service Desk should comply with such a request - and if it subsequently turns out to be incorrect this can be resolved as an off-line management level issue, rather than a dispute occurring when the user is on the telephone.
Some organizations may also recognize VIPs (high-ranking executives, officers, diplomats, politicians, etc.) whose incidents would be handled on a higher priority than normal - but in such cases this is best catered for and documented within the guidance provided to the Service Desk staff on how to apply the priority levels, so they are all aware of the agreed rules for VIPs, and who falls into this category.
It should be noted that an incident's priority may be dynamic - if circumstances change, or if an incident is not resolved within SLA target times, then the priority must be altered to reflect the new situation.
Note: some tools may have constraints that make it difficult automatically to calculate performance against SLA targets if a priority is changed during the lifetime of an incident. However, if circumstances do change, the change in priority should be made - and if necessary manual adjustments made to reporting tools. Ideally, tools with such constraints should not be selected.
Incident and Problem Management - Closing an incident
When a potential resolution has been identified, this should be applied and tested. The specific actions to be undertaken and the people who will be involved in taking the recovery actions may vary, depending upon the nature of the fault - but could involve:
- Asking the user to undertake directed activities on their own desk top or remote equipment
- The Service Desk implementing the resolution either centrally (say,rebooting a server) or remotely using software to take control of the user's desktop to diagnose and implement a resolution
- Specialist support groups being asked to implement specific recovery actions (e.g. Network Support reconfiguring a router)
- A third-party supplier or maintainer being asked to resolve the fault.
Even when a resolution has been found, sufficient testing must be performed to ensure that recovery action is complete and that the service has been fully restored to the user(s).
The Service Desk should check that the incident is fully resolved and that the users are satisfied and willing to agree the incident can be closed. The Service Desk should also check the following:
- Closure categorization. Check and confirm that the initial incident categorization was correct or, where the categorization subsequently turned out to be incorrect, update the record so that a correct closure categorization is recorded for the incident - seeking advise or guidance from the resolving group(s) as necessary.
- User satisfaction survey. Carry out a user satisfaction call-back or e-mail survey for the agreed percentage of incidents.
- Incident documentation. Chase any outstanding details and ensure that the Incident Record is fully documented so that a full historic record at a sufficient level of detail is complete.
- Ongoing or recurring problem? Determine (in conjunction with resolver groups) whether it is likely that the incident could recur and decide whether any preventive action is necessary to avoid this. In conjunction with Problem Management, raise a Problem Record in all such cases so that preventive action is initiated.
- Formal closure. Formally close the Incident Record.
Note: Some organizations may chose to utilize an automatic closure period on specific, or even all, incidents (e.g. incident will be automatically closed after two working days if no further contact is made by the user). Where this approach is to be considered, it must first be fully discussed and agreed with the users - and widely publicized so that all users and IT staff are aware of this. It may be inappropriate to use this method for certain types of incidents - such as major incidents or those involving VIPs, etc.
Rules for reitilfoundations.compening incidents
Despite all adequate care, there will be occasions when incidents recur even though they have been formally closed. Because of such cases, it is wise to have pre-defined rules about if and when an incident can be reitilfoundations.compened. It might make sense, for example, to agree that if the incident recurs within one working day then it can be reitilfoundations.compened - but that beyond this point a new incident must be raised, but linked to the previous incident(s).
The exact time threshold/rules may vary between individual organizations - but clear rules should be agreed and documented and guidance given to all Service Desk staff so that uniformity is applied.