Month: January 2012

Event Management Process Design – Part One

This post is the first part of a series in which we discuss the Event Management process and how to put one together. According to ITIL (quoted directly), Event Management is the process that monitors all events that occur through the IT infrastructure to allow for normal operation and also to detect and escalate exceptional conditions. In other words, Event Management picks up the alerts and events generated by devices and applications, figures out what to do with them, and follows up afterward to make sure they get due attention and are addressed properly. To begin putting together an Event Management process for your organization, here are some elements to think about.

  1. What events and alerts do you plan to trap and process? It may be a noble goal to design a process that traps and processes 100% of the alerts from the environment, but that is not always possible. Some events can be trapped and processed automatically by the tools you have on hand, and some alerts will require manual intervention. Where will the alerts/events be captured from, and where will they be recorded? ITIL suggests centralizing the event management process as much as possible, and it makes sense. If the alerts need to come from different technology stacks or devices, which they often do, can you at least centralize the location where the recording and processing activities take place? Determine the scope and what you can and cannot do, and have a clear idea of what you hope to get out of the process.
  2. Once you determine the set of events or alerts that can be picked up and fed through the process, you will need a set of rules on what to do with those events. The rules need to be explicit so there is little room for guessing or personal interpretation by those carrying out the process. The rules determine which conditions, once a threshold is met or exceeded, will trigger an event. For example, you may have a rule that says when server ABC’s CPU utilization reaches 90% and stays there consistently for over 10 minutes during business hours (6am to 6pm), an alert will be triggered. The rule will further stipulate what actions will be taken when the event is triggered. For example, you may have a rule that says the CPU alert will be escalated or handed over to the systems admin team for further evaluation via email or phone call. The rule will also call out what acknowledgement or interaction constitutes a successful escalation or hand-off.
  3. You should have a classification scheme for the incoming alerts/events. Not all alerts require the same handling actions. ITIL’s suggestion of classifying alerts as Informational, Warning, or Exception is a good starting point and more than sufficient for most organizations. For example, informational alerts usually get recorded for historical purposes and are not escalated anywhere else; only the warning and exception alerts get escalated further. Warning and exception alerts may in turn be escalated differently, to different teams and with different timing considerations. Furthermore, once the alert is escalated, the job of Event Management is not 100% done. We also need a standard rule or approach for following up while the alert condition is being addressed and for closing out the alerts once certain conditions are met (the incident is resolved, or the alerts cease to repeat within a 24- or 48-hour time frame).
  4. As you can see, determining what to do with an alert, making sure the alerts are handled correctly and efficiently, and following up to close the alerts properly all take some up-front thought and planning. The number of alerts monitored in a moderately complex IT environment can grow very quickly. Therefore, having heavily customized, individual alerts is not recommended, and really not necessary. My suggestion is to have a default event handling procedure that will work for 90-95% of the events you expect to process. For the remaining 5-10%, use the default handling procedure as the foundation and add some customized procedure on top so those events can be handled correctly.
  5. Who will be on point as the process owner, and who will be responsible for carrying out the Event Management process? If you are lucky enough to have a team in your organization whose primary responsibility is to monitor the environment and process the events, that team can be both accountable as the process owner and responsible for carrying out the process. If a dedicated team is not an option and multiple people/teams will be carrying out the process, at least designate a single process owner and have a consistent process in place for everyone else to follow.
  6. How will the process be measured for efficiency and effectiveness? What measurements does your organization care about? What actions will result from analyzing the measurement data? Measurements will mean very little if they are not acted upon to further improve the performance of the process.
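
To make elements 2 through 4 more concrete, here is a minimal Python sketch of what an explicit rule, a three-level classification, and a default handling procedure with a few overrides might look like. It is only an illustration: the class and function names, the metric name, and the routing targets are hypothetical, and a real monitoring tool will have its own way of expressing the same ideas.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from enum import Enum


class Severity(Enum):
    """The three ITIL-suggested alert classes."""
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical explicit rule: a metric must stay at or above a
    threshold for longer than `sustained`, inside a daily time window."""
    metric: str
    threshold: float
    sustained: timedelta
    window_start: time
    window_end: time
    severity: Severity

    def breached(self, samples):
        """`samples` is a list of (timestamp, value) pairs, oldest first.

        Returns True when the latest sample falls inside the window and the
        unbroken run of samples at or above the threshold, ending at the
        latest sample, spans more than `sustained`.
        """
        if not samples:
            return False
        latest_ts = samples[-1][0]
        if not (self.window_start <= latest_ts.time() < self.window_end):
            return False
        run_start = None
        for ts, value in reversed(samples):
            if value < self.threshold:
                break
            run_start = ts
        return run_start is not None and latest_ts - run_start > self.sustained


# Default handling for the bulk of events (target names are made up).
# None means "record only, do not escalate".
DEFAULT_ROUTES = {
    Severity.INFORMATIONAL: None,
    Severity.WARNING: "systems-admin-email",
    Severity.EXCEPTION: "systems-admin-phone",
}


def route(source, severity, overrides=None):
    """Use the default route unless a (source, severity) override exists."""
    overrides = overrides or {}
    return overrides.get((source, severity), DEFAULT_ROUTES[severity])


# The CPU rule from element 2: 90% for over 10 minutes, 6am to 6pm.
cpu_rule = ThresholdRule(
    metric="cpu_percent",
    threshold=90.0,
    sustained=timedelta(minutes=10),
    window_start=time(6, 0),
    window_end=time(18, 0),
    severity=Severity.WARNING,
)
```

Keeping the defaults in one table and the exceptions in a small override map mirrors the suggestion above: most events follow the default procedure, and only the few that need special handling carry extra configuration.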

That is a lot to think about for now. In part two, I will provide a sample list of Event Management process design requirements and a sample process flow for further discussion.

Links to other posts of the series