This post is the part two (and concluding part) of a series where we discuss the Major Incident Handling process and how to put one together. Previously we discussed the elements and considerations that should go into the process design. In this post, I have elaborated some of those considerations further with a sample process flow and a corresponding process design.
Sample Major Incident Handling Process Flow
Sample Major Incident Handling Process Design
A major incident generally imposes higher impact and requires special attention to resolve it. To summarize, I think an effective Major Incident Handling process design should clearly define at least the following who-does-what-by-when-and-how elements:
- What constitutes a major incident in your organization? What criteria do you use to quickly and effectively determine and declare a major incident?
- Who is accountable for coordinating and controlling the activities during a major incident exercise? The Major Incident Manager role can be fulfilled by a person or by a team, and she needs the proper authority to direct the activities and the people who are involved.
- How the resolution efforts will be coordinated and conducted? The exact details may vary from one organization to another, or even from one incident to another. The general approach should be worked out beforehand, and the Major Incident Manager should be trained to utilize the approach as consistently as possible.
- What escalation or communication approach will be used during and after the Major Incident?
- What metrics will be used to measure the effectiveness of the process? Keep them simple, easily understood and reasonably painless to collect the data.
- What format of communication and reporting will be used for the major incident? Who will get what type of information? Try to keep the contents appropriate for the intended audience.
I hope the information presented so far has been helpful. Please feel free to suggest options or other approaches that have worked for your organization.
Links to other posts in the series
In IT, incidents as a result of technology failure or human error can strike at any moment. Occasionally, we can have an incident that has a wide impact and poses serious risks to the business operations. Those major incidents need to be handled swiftly, so the IT service can be restored quickly with useful information captured that can be used for the root cause analysis afterward. If you have business critical services or applications under your management, having an organized approach to handling major incidents can save a lot of time and improve productivity. If you need to put a process together for your organization, here are some elements to take into consideration.
- Scope and Criteria: What characteristics would qualify an incident as being a “Major” Incident? This is very organization specific but generally there are two basic elements to consider, impact and urgency. Many organizations use the combination of those two elements to classify the priority level assigned to an incident, and that is a good starting point. Any incident that possesses a high degree of impact and high degree of urgency should probably be considered “major” and get the utmost attention. You may have other characteristics you want to define. For example, the outage of a particular application or for a particular line of business may trigger a “major” incident automatically. Since mobilizing the people and logistic necessary to handle a major incident is never a trivial exercise, clearly defined and agreed upon scope and criteria are mandatory.
- Roles and Responsibilities: Who will declare a major incident is in motion and own the process execution end-to-end? Since we are talking about major incidents, the Incident Management process owner in your organization will likely own this process as well. Will you have a person or a team designated as the “Major Incident Manager?” Will you rotate such role from individual to individual or from team to team? Depending on the nature of the technology failure or breakdown, how will the major incident manager find the appropriate technical resources to get involved? Will the major incident manager someone who is on stand-by waiting for the occasions to spring into action or will she have another “day-job” and wear the major incident manager hat when necessary? This will again depend on how your organization feels about this role. One thing I am certain of is that this role will require someone with the appropriate skills, environment know-how, and leadership experience to pull people together and execute the agreed-upon process. Another word, I do not believe this is a simple service desk phone dispatch type of role.
- Logistic and Facility: Everyone needs to know exactly what to do when the major incident process gets initiated. Will you have a dedicated meeting space or war room type of set up? Will people know what teleconference number to use in order to call in and to provide updates or to receive updates? Will you have a separate teleconference number to work through the technology aspect of incident recovery without cluttering with other non-technical discussions? Who will manage the conference call? What criteria determine when the conference calls start and end? In addition to the conference call, will you hold some kind of web meeting or online collaboration setup where people can share things on screen? Will you have some type of continual update via web or email, so people can stay informed? All these finer details should be planned upfront.
- Escalation and Communication: How will you define the communication interval and who will receive what communication at what point in time? How will the incident be escalated up the chain of command as long as the incident remains open? For example, you may define something simple as follow:
- At Hour 0: Major incident declared and the technical team contacted by phone. Director of the technical team and VP of IT notified via email.
- At Minute 30: Director of the technical team notified again via email with updates.
- At Hour 1: Major Incident Manager asks the Director of the technical team to join the conference in person. Another email update goes to the VP of IT.
- At Hour 2: Major Incident Manager asks the VP of IT asked to join the conference call for updates.
- At Hour 4: Major Incident Manager asks the business customer to join the conference call for updates and to discuss other recovery options.
- Other Considerations: How will this process connect with a downstream process such as Problem Management? Will you have the problem manager on the call as the incident progresses? What documentation or deliverables will the major incident process produce? Simple log of incident chronology, who participated the call when, important details shared at various point of the incident, official updates communicated, reasons for the incident closure, and other pertinent information about the incident probably should be documented at a minimum.
One thing for sure, all these considerations are too important not to get agreed upon beforehand. When the agreed upon details are not in place, it is simply not productive for everyone involved to try to figure out the process details during the heat of the battle. When that happens, most people have a tendency to go into the “headless chicken” mode – responsibility-dodging and finger-pointing start to spawn shortly afterward. In the next post, I will provide a sample process flow for further discussion.
Links to other posts in the series
This post is the part two (and concluding part) of a series where we discuss the Major Incident Review process and how to put one together. Previously we discussed the elements and considerations that should go into the process design. We elaborated those considerations further with a sample process flow. We will describe the process activities further along with a reporting template you can use to implement the process.
Sample Incident Report Template
Sample Process Design Document
The process design document provides a detailed description of the fields within the report template, so no plan to repeat. I think there are two factors to keep in mind when undertaking such process. First, don’t do the process just for the sake of doing it. Do it because your organization genuinely wants to improve service by eliminating as many of these incidents over the long-term as you can. If the organization chose not to implement certain solution for some reasons, costs, technical complexity, longevity of the technology, regulatory/compliance, or whatever, at least document the discussion. That way, it shows that the organization understood the risks and chose to accept them.
Second, perform meaningful measurements and, again, use the statistics to improve service. For example, if the majority of the incidents are reported by the end users, perhaps that is giving us a clue that we should be more proactive and beef up the automated monitoring? If a particular technology area has been experiencing more major incidents than the other areas, perhaps we should figure out what ills are plaguing the area and fix what are broken? If a particular business unit or segment has been experiencing more major incidents than the other segments, perhaps we owe it to the business communities to figure out what we can do to make things better? The business impact information we capture will enhance our understanding of the incidents and help us in formulating the solutions that make sense for the business.
Most organizations I know practice some type of incident review process, so I hope the information presented so far has been helpful. Please feel free to suggest other approaches that have worked for your organization.
Links to other posts in the series
One approach to improve just about anything is to learn from a mishap and do something to prevent similar mishaps from taking place. In the IT world, stuff happens frequently enough that we have the tendency to just fix things up and move on to the next incident or crisis. Having a disciplined approach to do root cause analysis (RCA) on incidents and putting permanent solutions in place, or Problem Management another word, can only help.
ITIL Service Operation handbook already provides an excellent overview of the Problem Management process with a suggested model and various analysis techniques, there is no need for me to reiterate. I am proposing a periodic Major Incident Review process that takes place after incidents have been resolve with service restored. People and tools aside, I think there are three things that can contribute to the effectiveness of this review process.
- Have a well-defined scope of incident you plan to review. Depending how an organization defines the impact and urgency of the incidents, the scope of incidents that get reviewed can vary. Some shops will choose to review only the most critical incidents with visible business impacts. Other might choose to review all incidents that took place. Many organizations will probably fall somewhere in between. My recommendation is to figure out what types of incidents the process stakeholders care about. We will discuss more about the stakeholder shortly.
- Have a clear exit plan on what to do with the root cause discovered and the permanent solutions. I think the situation to avoid is having the root causes identified but getting stuck on what to do next. The root cause to a number of incidents is a simple break-down of technology. For those straight-forward incidents, you identify what broke down, fix it, and move on. Sometimes a permanent solution could take a significant amount of time, people, or financial resource to achieve. For all incidents, a decision should be made to either
- Follow up using the same Major Incident Review process up to some point.
- Get this item off the Major Incident Review process but to use another process/procedure to track and to follow up on making sure the permanent solutions get implemented
- Decide nothing further will or can be done. Capture the lesson learned in a known error database or some knowledge management repository.
- Perform the process periodically, without exception. The frequency of the review can vary from one organization to another, monthly or bi-weekly for some or weekly for those busy shops. RCA activities can have a tendency to take a while to do because they often do not register at the same level of criticality as the incidents. By having this review process on a periodic basis, we are taking a position of … “Don’t procrastinate. Let’s figure out what went wrong. What were the causes? Decide what we plan to do about it and move on.”
In this post, I will discuss what you need in order to put a Major Incident Review process together for your organization. In addition to the three success factors that need to take into account, here are some additional elements to consider.
- Who will participate and who are the stakeholders? This review process will involve three groups of people at a minimum, the Incident Management team, the Problem Management team, and the technology/application support team. The actual organizational functions that perform the incident and problem management processes can vary. Frequently, the Service Desk is the owner for the Incident Management process, while the ownership for Problem Management rests somewhere outside of the Service Desk. Also, the IT/corporate management and the business users will likely become another set of stakeholders, since the results of the view often get shared with those constituents.
- Who will own this process? Since the review process will occur after incident resolution and spend a great deal time on RCA activities, the Problem Management process owner is the logical owner of this process. Even though the name of the process is called Major “Incident” Review, calling this process a Major “Problem” Review just sounds awkward. The convention you use in your organization should determine what is the most easily understood name you will use.
- What technology or tools will the process require? This process is straight forward enough where a spreadsheet tool should be sufficient. In addition, most organization will have some type of incident tracking tools that can feed the incident information into this review process. The output of this process could also feed into a Problem Management tool if you have one.
Sample Major Incident Review Process Flow
Here is an example of the review process flow. I have used a bi-weekly schedule for the review process. Depending on your organization’s requirement, this schedule could expand or contract. In part two, I will describe this process flow more in detail with a template you can use to capture the incident review details.
Links to other posts in the series
You must be logged in to post a comment.