ITIL Incident Management

There should be a close interface between the Incident Management process and the Problem Management and Change Management processes, as well as the Service Desk function. If not properly controlled, Changes may introduce new Incidents, so a way of tracking back is required. It is therefore recommended that the incident records should be held on the same CMDB as the Problem, Known Error and Change records, or at least linked without the need for re-keying, to improve the interfaces and to ease interrogation and reporting.

Incident priorities and escalation procedures need to be agreed as part of the Service Level Management process and documented in the SLAs.

The Problem Management process requires the accurate and comprehensive recording of Incidents in order to identify the cause of the Incidents and trends effectively and efficiently. Problem Management also needs to liaise closely with the Availability Management process to identify these trends and instigate remedial action.



 

Incident Management - What is Incident Management

In ITIL terminology, an 'incident' is defined as:

An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident, for example failure of one disk from a mirror set.

Incident Management is the process for dealing with all incidents; this can include failures, questions or queries reported by the users (usually via a telephone call to the Service Desk), by technical staff, or automatically detected and reported by event monitoring tools.

Purpose

The primary goal of the Incident Management process is to restore normal service operation as quickly as possible and minimize the adverse impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained. 'Normal service operation' is defined here as service operation within SLA limits.

Value to business

The value of Incident Management includes:

Incident Management is highly visible to the business, and it is therefore easier to demonstrate its value than most areas in Service Operation. For this reason, Incident Management is often one of the first processes to be implemented in Service Management projects. The added benefit of doing this is that Incident Management can be used to highlight other areas that need attention - thereby providing a justification for expenditure on implementing other processes.


 

Incident Management - Incident escalation

Incident escalation can be divided into functional escalation and hierarchic escalation.

The exact levels and timescales for both functional and hierarchic escalation need to be agreed, taking into account SLA targets, and embedded within support tools which can then be used to police and control the process flow within agreed timescales.
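As a rough illustration of how agreed escalation timescales might be embedded in a support tool, the sketch below checks which escalations are due for an unresolved incident. The `ESCALATION_AFTER` table, its time values and the `escalations_due` function are hypothetical; real timescales have to be agreed against the SLA targets.

```python
from datetime import datetime, timedelta

# Hypothetical escalation timescales per priority code. These values are
# placeholders only; real ones must be agreed against SLA targets.
ESCALATION_AFTER = {
    1: {"functional": timedelta(minutes=15), "hierarchic": timedelta(minutes=30)},
    2: {"functional": timedelta(hours=2), "hierarchic": timedelta(hours=4)},
}


def escalations_due(priority: int, logged_at: datetime, now: datetime) -> list[str]:
    """Return the escalation types whose agreed timescale has elapsed."""
    elapsed = now - logged_at
    limits = ESCALATION_AFTER.get(priority, {})
    return [kind for kind, limit in limits.items() if elapsed >= limit]
```

A tool could run a check like this on every open incident at regular intervals and notify the relevant support group or manager when the returned list is not empty.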

The Service Desk should keep the user informed of any relevant escalation that takes place and ensure the Incident Record is updated accordingly to keep a full history of actions.

There may be many incidents in a queue with the same priority level - so it will be the job of the Service Desk and/or Incident Management staff initially, in conjunction with managers of the various support groups to which incidents are escalated, to decide the order in which incidents should be picked up and actively worked on. These managers must ensure that incidents are dealt with in true business priority order and that staff are not allowed to 'cherry-pick' the incidents they choose!


 

Incident Management - Logging

All incidents must be fully logged and date/time stamped, regardless of whether they are raised through a Service Desk telephone call or whether automatically detected via an event alert.

Note: If Service Desk and/or support staff visit the customers to deal with one incident, they may be asked to deal with further incidents 'while they are there'. It is important that if this is done, a separate Incident Record is logged for each additional incident handled - to ensure that a historical record is kept and credit is given for the work undertaken.

All relevant information relating to the nature of the incident must be logged so that a full historical record is maintained - and so that if the incident has to be referred to other support group(s), they will have all relevant information to hand to assist them.

The information needed for each incident is likely to include:

If the Service Desk does not work 24/7 and responsibility for first-line incident logging and handling passes to another group, such as IT Operations or Network Support, out of Service Desk hours, then these staff need to be equally rigorous about logging of incident details. Full training and awareness needs to be provided to such staff on this issue.

Categorization

Part of the initial logging must be to allocate suitable incident categorization coding so that the exact type of the call is recorded. This will be important later when looking at incident types/frequencies to establish trends for use in Problem Management, Supplier Management and other ITSM activities.

Please note that the check for Service Requests in this process does not imply that Service Requests are incidents. This is simply recognition of the fact that Service Requests are sometimes incorrectly logged as incidents (e.g. a user incorrectly enters the request as an incident from the web interface). This check will detect any such requests and ensure that they are passed to the Request Fulfilment process.

Multi-level categorization is available in most tools - usually to three or four levels of granularity. All organizations are unique and it is therefore difficult to give generic guidance on the categories an organization should use, particularly at the lower levels. However, there is a technique that can be used to assist an organization to achieve a correct and complete set of categories - if they are starting from scratch! The steps involve:

  1. Hold a brainstorming session among the relevant support groups, involving the SD Supervisor and Incident and Problem Managers.
  2. Use this session to decide the 'best guess' top-level categories - and include an 'other' category. Set up the relevant logging tools to use these categories for a trial period.
  3. Use the categories for a short trial period (long enough for several hundred incidents to fall into each category, but not so long that an analysis will take too long to perform).
  4. Perform an analysis of the incidents logged during the trial period. The number of incidents logged in each higher-level category will confirm whether the categories are worth having - and a more detailed analysis of the 'other' category should allow identification of any additional higher-level categories that will be needed.
  5. A breakdown analysis of the incidents within each higher-level category should be used to decide the lower-level categories that will be required.
  6. Review and repeat these activities after a further period - of, say, one to three months - and again regularly to ensure that they remain relevant. Be aware that any significant changes to categorization may cause some difficulties for incident trending or management reporting - so they should be stabilized unless changes are genuinely required.
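The trial-period analysis in steps 4 and 5 can be sketched in a few lines. This is only an illustration: the incident layout (a dict with a `category` key), the function names and the `min_count` threshold are assumptions, not part of any particular tool.

```python
from collections import Counter


def category_counts(incidents):
    """Count the incidents logged under each top-level category."""
    return Counter(incident["category"] for incident in incidents)


def underused_categories(counts, min_count):
    """Categories (other than 'other') with fewer than min_count incidents."""
    return sorted(
        name for name, n in counts.items() if name != "other" and n < min_count
    )


def other_share(counts):
    """Fraction of all incidents that were logged as 'other'."""
    total = sum(counts.values())
    return counts.get("other", 0) / total if total else 0.0
```

A large share of incidents in 'other' suggests that top-level categories are missing, while categories that attract very few incidents may not be worth keeping.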

If an existing categorization scheme is in use, but it is not thought to be working satisfactorily, the basic idea of the technique suggested above can be used to review and amend the existing scheme.

NOTE: Sometimes the details available at the time an incident is logged may be incomplete, misleading or incorrect. It is therefore important that the categorization of the incident is checked, and updated if necessary, at call closure time (in a separate closure categorization field, so as not to corrupt the original categorization).
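One way to keep the original categorization intact is to store the closure categorization in its own field. The sketch below assumes a hypothetical `IncidentCategorization` record; the field and method names are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentCategorization:
    logged_category: str                     # set at logging time, never overwritten
    closure_category: Optional[str] = None   # set only when the call is closed

    def close_with(self, category: str) -> None:
        """Record the category confirmed at closure in the separate field."""
        self.closure_category = category

    @property
    def was_recategorized(self) -> bool:
        """True if the closure category differs from the one originally logged."""
        return (
            self.closure_category is not None
            and self.closure_category != self.logged_category
        )
```

Keeping both values allows later reporting to compare what was logged with what was confirmed at closure.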


 

Incident Management - Major incidents

A separate procedure, with shorter timescales and greater urgency, must be used for 'major' incidents. A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritization system - such that they will be dealt with through the major incident process.

Note: People sometimes use loose terminology and/or confuse a major incident with a problem. In reality, an incident remains an incident forever - it may grow in impact or priority to become a major incident, but an incident never 'becomes' a problem. A problem is the underlying cause of one or more incidents and remains a separate entity always!

Some lower-priority incidents may also have to be handled through this procedure - due to potential business impact - and some major incidents may not need to be handled in this way if the cause and resolutions are obvious and the normal incident process can easily cope within agreed target resolution times - provided the impact remains low!

Where necessary, the major incident procedure should include the dynamic establishment of a separate major incident team under the direct leadership of the Incident Manager, formulated to concentrate on this incident alone to ensure that adequate resources and focus are provided to finding a swift resolution. If the Service Desk Manager is also fulfilling the role of Incident Manager (say in a small organization), then a separate person may need to be designated to lead the major incident investigation team - so as to avoid conflict of time or priorities - but should ultimately report back to the Incident Manager.

If the cause of the incident needs to be investigated at the same time, then the Problem Manager would be involved as well but the Incident Manager must ensure that service restoration and underlying cause are kept separate. Throughout, the Service Desk would ensure that all activities are recorded and users are kept fully informed of progress.


 

Incident Management - Scope

Incident Management includes any event which disrupts, or which could disrupt, a service. This includes events which are communicated directly by users, either through the Service Desk or through an interface from Event Management to Incident Management tools.

Incidents can also be reported and/or logged by technical staff (if, for example, they notice something untoward with a hardware or network component they may report or log an incident and refer it to the Service Desk). This does not mean, however, that all events are incidents. Many classes of events are not related to disruptions at all, but are indicators of normal operation or are simply informational (see section 4.1).

Although both incidents and service requests are reported to the Service Desk, this does not mean that they are the same. Service requests do not represent a disruption to agreed service, but are a way of meeting the customer's needs and may be addressing an agreed target in an SLA. Service requests are dealt with by the Request Fulfilment process (see section 4.3).


 

Incident Management - Incident prioritization

Another important aspect of logging every incident is to agree and allocate an appropriate prioritization code as this will determine how the incident is handled both by support tools and support staff.

Prioritization can normally be determined by taking into account both the urgency of the incident (how quickly the business needs a resolution) and the level of impact it is causing. An indication of impact is often (but not always) the number of users being affected. In some cases, and very importantly, the loss of service to a single user can have a major business impact - it all depends upon who is trying to do what - so numbers alone are not enough to evaluate overall priority! Other factors that can also contribute to impact levels are:

An effective way of calculating these elements and deriving an overall priority level for each incident is given in the table below.

                 Impact
                 High    Medium  Low
         High    1       2       3
Urgency  Medium  2       3       4
         Low     3       4       5

Priority code   Description   Target resolution time
1               Critical      1 hour
2               High          8 hours
3               Medium        24 hours
4               Low           48 hours
5               Planning      Planned
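The table above can be expressed as two lookup tables. The following is a minimal sketch: the values are taken from the table, but the names `PRIORITY_MATRIX`, `PRIORITY_TARGETS` and `priority_for` are illustrative assumptions, and a real organization would use its own agreed levels and targets.

```python
# (urgency, impact) -> priority code, as in the table above.
PRIORITY_MATRIX = {
    ("high", "high"): 1, ("high", "medium"): 2, ("high", "low"): 3,
    ("medium", "high"): 2, ("medium", "medium"): 3, ("medium", "low"): 4,
    ("low", "high"): 3, ("low", "medium"): 4, ("low", "low"): 5,
}

# Priority code -> (description, target resolution time in hours).
# None stands for "Planned", i.e. no fixed hour target.
PRIORITY_TARGETS = {
    1: ("Critical", 1),
    2: ("High", 8),
    3: ("Medium", 24),
    4: ("Low", 48),
    5: ("Planning", None),
}


def priority_for(urgency: str, impact: str) -> int:
    """Look up the priority code for a given urgency and impact level."""
    key = (urgency.strip().lower(), impact.strip().lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"unknown urgency/impact combination: {key}")
    return PRIORITY_MATRIX[key]
```

For example, an incident with high urgency and medium impact maps to priority code 2, which the table describes as 'High' with an 8-hour target.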

In all cases, clear guidance - with practical examples - should be provided for all support staff to enable them to determine the correct urgency and impact levels, so the correct priority is allocated. Such guidance should be produced during service level negotiations.

However, it must be noted that there will be occasions when, for reasons of business expediency or otherwise, normal priority levels have to be overridden. When a user is adamant that an incident's priority level should exceed normal guidelines, the Service Desk should comply with such a request - and if it subsequently turns out to be incorrect this can be resolved as an off-line management level issue, rather than a dispute occurring when the user is on the telephone.

Some organizations may also recognize VIPs (high-ranking executives, officers, diplomats, politicians, etc.) whose incidents would be handled on a higher priority than normal - but in such cases this is best catered for and documented within the guidance provided to the Service Desk staff on how to apply the priority levels, so they are all aware of the agreed rules for VIPs, and who falls into this category.

It should be noted that an incident's priority may be dynamic - if circumstances change, or if an incident is not resolved within SLA target times, then the priority must be altered to reflect the new situation.

Note: some tools may have constraints that make it difficult to calculate performance against SLA targets automatically if a priority is changed during the lifetime of an incident. However, if circumstances do change, the change in priority should be made - with manual adjustments to reporting tools if necessary. Ideally, tools with such constraints should not be selected.


 

Incident and Problem Management - Closing an incident

When a potential resolution has been identified, this should be applied and tested. The specific actions to be undertaken and the people who will be involved in taking the recovery actions may vary, depending upon the nature of the fault - but could involve:

Even when a resolution has been found, sufficient testing must be performed to ensure that recovery action is complete and that the service has been fully restored to the user(s).

The Service Desk should check that the incident is fully resolved and that the users are satisfied and willing to agree the incident can be closed. The Service Desk should also check the following:

Note: Some organizations may choose to utilize an automatic closure period on specific, or even all, incidents (e.g. incident will be automatically closed after two working days if no further contact is made by the user). Where this approach is to be considered, it must first be fully discussed and agreed with the users - and widely publicized so that all users and IT staff are aware of this. It may be inappropriate to use this method for certain types of incidents - such as major incidents or those involving VIPs, etc.
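An automatic closure rule of this kind might be sketched as follows. The `ResolvedIncident` record and the function names are hypothetical, working days are assumed to be Monday to Friday, and public holidays are ignored for simplicity.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ResolvedIncident:
    last_user_contact: date
    is_major: bool = False
    is_vip: bool = False


def working_days_since(start: date, today: date) -> int:
    """Count Monday-Friday days after `start`, up to and including `today`."""
    days = 0
    current = start
    while current < today:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


def can_auto_close(incident: ResolvedIncident, today: date,
                   quiet_working_days: int = 2) -> bool:
    """Apply the closure rule, excluding major and VIP incidents."""
    if incident.is_major or incident.is_vip:
        return False
    return working_days_since(incident.last_user_contact, today) >= quiet_working_days
```

The two-day default mirrors the example in the note above; the actual period, and the incident types excluded from it, are whatever has been agreed with the users.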

Rules for reopening incidents

Despite all adequate care, there will be occasions when incidents recur even though they have been formally closed. Because of such cases, it is wise to have pre-defined rules about if and when an incident can be reopened. It might make sense, for example, to agree that if the incident recurs within one working day then it can be reopened - but that beyond this point a new incident must be raised, but linked to the previous incident(s).

The exact time threshold/rules may vary between individual organizations - but clear rules should be agreed and documented and guidance given to all Service Desk staff so that uniformity is applied.
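A pre-defined rule of this kind could be applied uniformly with a small helper such as the one below. This is a sketch under stated assumptions: the function name and return values are illustrative, and the 'one working day' example is simplified to 24 elapsed hours rather than a true working-day calculation.

```python
from datetime import datetime, timedelta

# Simplification: "one working day" is treated as 24 elapsed hours.
REOPEN_WINDOW = timedelta(days=1)


def handle_recurrence(closed_at: datetime, recurred_at: datetime,
                      window: timedelta = REOPEN_WINDOW) -> str:
    """Decide whether a recurrence reopens the old incident or raises a new one."""
    if recurred_at < closed_at:
        raise ValueError("recurrence cannot precede closure")
    if recurred_at - closed_at <= window:
        return "reopen"
    # Beyond the window a new incident is raised and linked to the previous one.
    return "new_linked_incident"
```

The threshold is passed as a parameter so that each organization can substitute the value it has agreed and documented.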