Tuesday, December 16, 2008

Killing comatose servers: OK, but how?

One of the best sources of data center energy efficiency guidance available today is the Uptime Institute. Not only do they run some really focused, useful events on the topic, but their fearless leader, Ken Brill, is very visible and very direct with his recommendations. His recent article in Forbes took on one of those dirty little secrets in IT: there are a lot of servers in your data center doing absolutely nothing.

Given how serious a problem the current growth rate of data center energy usage will become, Ken and Uptime have given serious thought to how to curb it. Topping their list was the directive that served as the title of his article: "Kill comatose computers."

Call them orphan servers, comatose servers, idle servers, or whatever you like; Brill calls them "corporate enemy No. 1." "Unless you have a rigorous program of removing obsolete servers at the end of their lifecycle," he writes, "it is very likely that between 15% and 30% of the equipment running in your data center is comatose. It consumes electricity without doing any computing."

Yikes. Numbers tossed around by Paul McGuckin at the recent Gartner Data Center Conference are similarly high. The solution? Brill has a simple answer: "This dead equipment needs to be unplugged and removed."

No arguments so far from me. In fact, doing work like this is one of the steps we recommend toward revamping and improving how you run your data center. The more intriguing question is one Ken also asks: why hasn’t this already happened?

The answer, unfortunately, is that most shops have worked long and hard on the steps for standing up servers, installing new components, and the like, probably because adding things to the data center is always done with some sort of time or business pressure. People are watching and they want their stuff up now. Rarely is someone breathing down your neck to unplug something. In the frenetic everyday life of an IT ops person, the decommissioning bit is the part that can wait while you handle the urgent fire of the day.

The problem is that after today's crisis comes tomorrow's. And though the orphan servers are using up power to keep them running and air conditioning to keep them cool, removing them isn't a priority. But as more and more data centers start to hit the wall for power capacity, that's going to have to change.

So, what do you do? As Ken Brill points out in Forbes, "after weeks or months pass and employees turn over, the details of what can be removed will be forgotten, and it becomes a major research project to identify what is not needed."

Unfortunately, identifying orphan servers is not something that will take care of itself. Here are some of the things we've seen customers focus on to help solve this problem:

- Enable some sort of detailed monitoring on your servers
- Determine what time period is appropriate to watch for changes, based on your business and what the servers are likely to be doing
- Watch usage, users, processes, and other statistics that will be helpful in making decisions later
- Sift through the mountains of data you collect with someone who can translate it into useful information
- Engage the end users in the process
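
To make the monitoring and sifting steps a bit more concrete, here is a minimal Python sketch of how collected statistics might be turned into a shortlist of orphan candidates. Everything in it is an assumption made for illustration: the `Sample` fields, the `find_orphan_candidates` name, and the thresholds are hypothetical and do not describe any particular monitoring tool or Cassatt's own method. It flags a server only if every sample in the observation window looks idle, and it skips servers with too little data to judge.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring reading for one server (hypothetical schema)."""
    host: str
    cpu_percent: float   # average CPU utilization for the interval
    active_users: int    # logged-in users seen during the interval
    net_kbps: float      # network throughput for the interval


def find_orphan_candidates(samples, min_samples=3, cpu_max=2.0, net_max=5.0):
    """Return hosts whose every sample in the window looks idle.

    The thresholds are illustrative assumptions; pick values that fit
    your business and what the servers are likely to be doing.
    """
    by_host = defaultdict(list)
    for sample in samples:
        by_host[sample.host].append(sample)

    candidates = []
    for host, rows in sorted(by_host.items()):
        if len(rows) < min_samples:
            continue  # not enough data to judge this server yet
        quiet = all(
            r.cpu_percent <= cpu_max
            and r.active_users == 0
            and r.net_kbps <= net_max
            for r in rows
        )
        if quiet:
            candidates.append(host)
    return candidates
```

A list like this is only a starting point for the last step above. The end users still need to confirm that a flagged server is really unused before anyone unplugs it, since a quarterly batch job can look comatose for months.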

We've found that organizations sometimes want to do these steps themselves, and sometimes they don't. When customers ask Cassatt to help, it's often because they need the expertise or the tested tools and processes for finding out what their servers are doing, which they often don't have internally.

The other thing we generally bring to the process is experience we've had working with some very large customers. Through our recently announced Cassatt Active Profiling Service, we've helped customers identify orphan servers, recommended candidates for virtualization, found candidates for power management, and located other servers that they could start to use as a free pool of resources to support a move toward setting up a sort of "internal cloud" architecture.

I guess that's the good news: with the data you get from a project like this, you can start to really make some significant changes to the way you manage your IT infrastructure. So, not only can you follow Ken Brill's advice and begin to kill off those comatose servers and save yourself a great deal of power, but you can also arm yourself with some unexpectedly useful information.

For example, if you know what your servers are doing (and not doing) at different times of the day, the month, and the quarter, you can use that information to start to set up some automation to manage your infrastructure based on those profiles. You can set up shared services to allocate or pull back servers for whichever applications have the highest priority at any given time.
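
As a rough illustration of what a usage profile could look like, the Python sketch below averages CPU readings by hour of day and lists the hours in which a server sits idle. The function names, the `(hour_of_day, cpu_percent)` input shape, and the 5% threshold are all assumptions made for this example, not part of any specific product.

```python
from collections import defaultdict


def hourly_profile(readings):
    """Average CPU utilization per hour of day.

    `readings` is an iterable of (hour_of_day, cpu_percent) pairs,
    a hypothetical input shape chosen for this sketch.
    """
    totals = defaultdict(lambda: [0.0, 0])  # hour -> [sum, count]
    for hour, cpu in readings:
        totals[hour][0] += cpu
        totals[hour][1] += 1
    return {hour: total / count for hour, (total, count) in totals.items()}


def idle_hours(profile, cpu_max=5.0):
    """Hours whose average CPU is at or below the (assumed) idle threshold."""
    return sorted(hour for hour, avg in profile.items() if avg <= cpu_max)
```

Hours that show up as idle week after week are the natural places to look for power management, or for lending the server to a shared pool. The same idea extends to day-of-month or quarter-end profiles by changing what the first element of each pair represents.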

But I'm getting ahead of myself. The first step, then, is to find out what the servers in your data centers are doing. Then, if you don't like what they're doing, you can actually do something about it.
