Events are, and have been for quite some time, the fundamental elements of real-time IT infrastructure monitoring. Any status change, threshold crossed in device usage, or step performed in a process generates an event that must be reported, analyzed, and acted upon by IT operations.
Historically, the lower layers of IT infrastructure (i.e., network components and hardware platforms) have been regarded as the most prone to hardware and software failures and have therefore attracted most of the attention and most of the management software investment. In reality, today's failures are much more likely to come from applications and from the management of platform and application updates than from the hardware platforms themselves. The increased infrastructure complexity has multiplied the events reported on IT management consoles.
Over the years, several solutions have been developed to extract the truth from the clutter of event messages. Network management pioneered techniques such as rule engines and codebook correlation, whose goal was to determine, among a group of related events, the one straw that broke the camel's back. We then moved on to more sophisticated statistical and pattern analysis: using historical data, we could determine what was normal at any given time for a group of parameters. This not only reduces the number of events, it also eliminates false alerts and enables predictive analysis based on how parameter values evolve over time.
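To make the statistical approach concrete, here is a minimal sketch of baseline-based anomaly detection. The function, the baseline values, and the three-sigma threshold are all illustrative assumptions, not any vendor's actual algorithm: a reading is flagged only when it deviates far from what history says is normal for that time slot.

```python
from statistics import mean, stdev

def is_anomalous(history, value, sigmas=3.0):
    """Flag a reading that deviates more than `sigmas` standard
    deviations from the historical baseline for this time slot."""
    mu = mean(history)
    sd = stdev(history)
    return abs(value - mu) > sigmas * sd

# Hypothetical baseline: CPU % observed at 9 a.m. on past Mondays.
baseline = [41, 44, 39, 42, 40, 43, 41, 45]

print(is_anomalous(baseline, 42))  # within the normal band: False
print(is_anomalous(baseline, 95))  # abnormal spike worth an alert: True
```

A static threshold ("alert if CPU > 80%") fires on every busy Monday; a baseline relative to the time slot is what suppresses those false alerts.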
The next step, which has been used in industrial process control and in business activities and is now finding its way into IT management solutions, is complex event processing (CEP).
About 10 years ago, I worked in IT operations at a company that was launching an IPO. Naturally, the number of hits on the company's Web site doubled and even tripled in the days before the launch. It is an obvious and simplistic example, but it leads to different conclusions depending on the event management perspective. Network management rule engines received events from the infrastructure: router memory low, packets dropped, server CPU usage too high, database server performance alerts, and so on. With no idea of what caused the sudden increase (of course we knew, but this is just a simple example), IT operations might have been tempted to permanently increase the Web site infrastructure capacity. Had IT operations been using predictive analysis, it would have detected an abnormal pattern and predicted the impending crash of the Web site in time to do something about it, but it would have fared no better in analyzing the true root cause of the problem. Complex event processing adds the business event dimension, for example, the fact that the company made its intention public at some point in time. Processing this business event together with the infrastructure events tells IT operations why the traffic increased, infers that it is most probably a transient phenomenon, and therefore recommends a temporary increase in capacity. If a private cloud is implemented, it could even trigger the provisioning of this extra, temporary capacity.
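The correlation described above can be sketched in a few lines. Everything here is hypothetical: the event records, the four-hour window, and the burst threshold are assumptions made for illustration, not a real CEP engine. The idea is simply that a burst of infrastructure alerts following a known business event gets attributed to that event, turning a root-cause hunt into a capacity recommendation.

```python
from datetime import datetime, timedelta

# Hypothetical event records: (timestamp, source, message)
infra_events = [
    (datetime(2010, 5, 3, 9, 15), "router-1", "memory low"),
    (datetime(2010, 5, 3, 9, 16), "web-42", "CPU usage too high"),
    (datetime(2010, 5, 3, 9, 18), "db-1", "performance alert"),
]
business_events = [
    (datetime(2010, 5, 3, 8, 0), "corp-comms", "IPO intention announced"),
]

def correlate(infra, business, window=timedelta(hours=4)):
    """If a burst of infrastructure alerts follows a business event
    within `window`, attribute the burst to that event and recommend
    a temporary capacity increase instead of a root-cause hunt."""
    recommendations = []
    for b_ts, b_src, b_msg in business:
        related = [e for e in infra if b_ts <= e[0] <= b_ts + window]
        if len(related) >= 3:  # arbitrary burst threshold for this sketch
            recommendations.append(
                f"{len(related)} alerts likely caused by '{b_msg}': "
                "provision temporary extra capacity")
    return recommendations

print(correlate(infra_events, business_events))
```

In a production CEP engine the same rule would run continuously over event streams; the last step, triggering the provisioning itself, is what a private cloud would automate.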
Several APM-BTM vendors have announced a CEP capability in their solutions. Building CEP rules is not simple and will require a good dose of analysis and cooperation between business and IT, but the reward is certainly worth it.
I hope I did some justice to CEP and welcome your comments.