Dana Gardner's BriefingsDirect: New ways emerge to head off datacenter problems while improving IT operational performance

Tuesday, February 5, 2008

Listen to the podcast. Or read a full transcript. Sponsor: Integrien.

Complexity in today's IT systems makes previous error prevention approaches for operators inefficient and costly. IT staffs are expensive to retain, and are increasingly hard to find. There is also insufficient information about what’s going on in the context of an entire systems setup.

Operators are using manual processes -- in reactive firefighting mode -- to maintain critical service levels. It simply takes too long to interpret and resolve IT failures and glitches. We now see more than 70 percent of the IT operations budget spent on labor costs.

IT executives are therefore seeking more automated approaches, not only to remediate problems but also to detect them earlier. These same operators don’t want to replace their systems management investments. They want to use them more cohesively, learn more from them, and better extract the information that these systems emit.

To help better understand the new solutions and approaches to detection and remediation of IT operations issues, I recently chatted with Steve Henning, the Vice President of Products for Integrien, in a sponsored BriefingsDirect podcast.

Here are some excerpts:

IT operations is being told to either keep their budgets static or to reduce them. Traditionally, the way that the vice president of IT operations has been able to keep problems from occurring in these environments has been by throwing more people at it.

This is just not scalable. There is no way ... (to) possibly hire the people to support that. Even with the budget, he couldn’t find the people today.

If you look at most IT environments today, the IT people will tell you that three or four minutes before a problem occurs, they will start to understand that little pattern of events that leads to the problem.

But most of the people that I speak to tell me that’s too late. By the time they identify the pattern that repeats and leads to a particular problem -- for example, a slowdown of a particular critical transaction -- it’s too late. Either the system goes down or the slowdown is such that they are losing business.

Service-oriented architecture (SOA) and virtualization increase the management problem by at least a factor of three. So you can see that this is a more complex and challenging environment to manage.

So it’s a very troubling environment these days. It’s really what’s pushing people toward looking at different approaches, of taking more of a probabilistic look, measuring variables, and looking at probable outcomes -- rather than trying to do things in a deterministic way, measuring every possible variable, looking at it as quickly as possible, and hoping that problems just don’t slip by.

If you look at the applications that are being delivered today, monitoring everything from a silo standpoint and hoping to be able to solve problems in that environment is absolutely impossible. There has to be some way for all of the data to be analyzed in a holistic fashion, understanding the normal behaviors of each of the metrics that are being collected by these monitoring systems. Once you have that normal behavior, you’re alerting only to abnormal behaviors that are the real precursors to problems.

One of the alternatives is separating the wheat from the chaff and learning the normal behavior of the system. If you look at Integrien Alive, we use sophisticated, dynamic thresholding algorithms. We have multiple algorithms looking at the data to determine that normal behavior and then alerting only to abnormal precursors of problems.

Once you've learned the normal behavior of the system, these abnormal behaviors far downstream of where the problem actually occurs are the earliest precursors to these problems. We can pick up that these problems are going to occur, sometimes an hour before the problem actually happens.
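To make the idea of learning normal behavior and alerting only on abnormal values more concrete, here is a minimal sketch in Python. It is not Integrien Alive's algorithm, which the interview does not describe in detail. It assumes a simple rolling mean and standard deviation band, and the class name, window size, and band width are all hypothetical choices made for illustration.

```python
from collections import deque
from statistics import mean, stdev


class DynamicThreshold:
    """Learns a rolling baseline for one metric and flags abnormal samples.

    Illustrative only: a k-sigma band around a rolling mean, not any
    vendor's actual thresholding algorithm.
    """

    def __init__(self, window=60, k=3.0, min_samples=10):
        self.history = deque(maxlen=window)  # recent samples treated as normal
        self.k = k                            # band width in standard deviations
        self.min_samples = min_samples        # samples required before alerting

    def observe(self, value):
        """Return True if value falls outside the learned band."""
        abnormal = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = stdev(self.history)
            # Guard against a zero-width band when the metric has been flat.
            abnormal = abs(value - mu) > self.k * max(sigma, 1e-9)
        if not abnormal:
            # Only normal samples update the baseline, so an incident
            # is not learned as normal behavior.
            self.history.append(value)
        return abnormal


def find_abnormal(samples, **kwargs):
    """Return the indexes of samples flagged as abnormal."""
    detector = DynamicThreshold(**kwargs)
    return [i for i, v in enumerate(samples) if detector.observe(v)]


# Hypothetical response times in milliseconds, with one spike.
response_ms = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 101, 250, 100]
print(find_abnormal(response_ms))
```

The threshold here moves with the data instead of being a fixed number set by an operator, which is the point of the approach: a value that is normal for one metric at one time of day can be abnormal for another. A production system would likely also account for time-of-day and day-of-week patterns, which this sketch ignores.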

The ability to get predictive alerts ... that’s kind of the nirvana of IT operations. Once you’ve captured models of the recurring problems in the IT environment, a product like Integrien Alive can see the incoming stream of real-time data and compare that against the models in the library.

If it sees a match with a high enough probability it can let you know ahead of time, up to an hour ahead of time, that you are going to have a particular problem that has previously occurred. You can also record exactly what you did to solve the problem, and how you have diagnosed it, so that you can solve it.
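The matching step described above can be sketched roughly as follows. This is an assumption-laden illustration, not the product's method: each stored model is reduced to a set of precursor metrics that went abnormal before a past problem, and the "probability" is approximated by the fraction of those precursors that are abnormal right now. The metric names, problem names, resolutions, and the 0.75 cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemModel:
    name: str
    precursors: frozenset  # metrics that went abnormal before this problem
    resolution: str        # what was recorded as the fix last time


def match_models(active_abnormal, library, min_score=0.75):
    """Return (score, model) pairs at or above min_score, best match first.

    The score is the share of a model's precursors that are currently
    abnormal. It is a simple stand-in for a real probability estimate.
    """
    active = set(active_abnormal)
    matches = []
    for model in library:
        if not model.precursors:
            continue  # an empty model can never be matched
        score = len(model.precursors & active) / len(model.precursors)
        if score >= min_score:
            matches.append((score, model))
    matches.sort(key=lambda m: m[0], reverse=True)
    return matches


# Hypothetical library of previously captured problems.
LIBRARY = [
    ProblemModel(
        name="checkout slowdown",
        precursors=frozenset(
            {"db.conn_wait", "app.queue_depth", "web.latency", "cache.miss_rate"}
        ),
        resolution="Recycle the connection pool on the order database.",
    ),
    ProblemModel(
        name="nightly batch stall",
        precursors=frozenset({"disk.io_wait", "db.lock_time"}),
        resolution="Reschedule the index rebuild.",
    ),
]

# Metrics currently flagged as abnormal by the monitoring layer.
active_now = {"db.conn_wait", "app.queue_depth", "web.latency"}

for score, model in match_models(active_now, LIBRARY):
    print(f"{model.name}: {score:.2f} match. Last fix: {model.resolution}")
```

Storing the resolution alongside the model reflects the point made in the interview: when a known problem is about to recur, the operator gets both the early warning and the record of how it was diagnosed and fixed before.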

We're actually enhancing the expertise of these folks. You're always going to need experts in there. You’re always going to need the folks who have the tribal knowledge of the application. What we are doing, though, is enabling them to do their job better with earlier understanding of where the problems are occurring by adding and solving this massive data correlation issue when a problem occurs.