Tuesday, March 31, 2009

No April Fools' Day joke: Data center managers don't know what their servers are doing

Given the arrival of my favorite Silicon Valley holiday, I wish I could brush aside the content of this post as a big April Fools' Day ruse. Unfortunately, it's not one.

Here are the facts: according to a couple of folks who should know, data centers are buying new equipment before making good use of what they already have. OK, maybe that's not news, but we've just added a couple of scary data points of our own. According to a new survey we did here at Cassatt, not only are data center managers making poor use of their equipment, they don't even know what some of that equipment is doing.

Sounds like a major disconnect. Here are some specifics from a couple big analyst firms and early feedback from our 2009 Cassatt Data Center Survey:

Gartner: Organizations struggle to quantify their data center capacity problems

Gartner analyst Rakesh Kumar published a paper (ID #G00165501) at the beginning of the month in which he dropped some pretty direct zingers (for Gartner, anyway) into his "key findings":

· 50% of data centers will face power, cooling, and floor space constraints within 3 years. (Our forthcoming survey, by the way, found similar problems: 46% of respondents said their data center is within 25% of its maximum power capacity.)
· Data centers use "inefficient, ad hoc approaches" rather than "continuous-improvement, process-driven" methods to get a handle on these problems (how's that for a nice way to scold the IT guys?).
· Most organizations can't quantify their capacity problems.

Now, we know that the processes in use in today's data centers, like the technology, have been cobbled together over time (especially in big organizations). So the ad hoc approaches Rakesh mentions don't surprise me. But not being able even to quantify the problem seems like an issue that has to get solved immediately.

Forrester: the bad economy means it's time to improve IT operations...or else

Forrester's Glenn O'Donnell made some similar points in a recent NetworkWorld article. Given the rocky economy, says Glenn, companies are making IT investments in anything that improves operational discipline, provided there are speedy results. He places the operational IT budget -- also known as the "keeping the lights on" money -- at around 75% of everything companies spend on IT. However, "30% to 50% of the energy we expend is wasted," he said, blaming "inefficient processes and poor decisions." And by energy, he means time and effort. And time = money. So he's talking about...money. The business world's equivalent of Darwin's natural selection will not be kind to companies wasting money in the current economic climate, says Glenn.

He points to "mean time to resolution" (MTTR) as a gauge of how a company is doing in its data center operational efficiency. One solution: "We must demonstrate...MTTR improvements with pilots of process and automation."
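
Glenn doesn't spell out how to compute MTTR in the article, but the arithmetic is simple enough. Here's a minimal sketch in Python, assuming you can export (opened, resolved) timestamp pairs from your ticketing system; the incident records below are invented for illustration:

```python
from datetime import datetime

# Hypothetical incident records exported from a ticketing system:
# (opened_at, resolved_at) timestamp pairs. Values invented for illustration.
incidents = [
    (datetime(2009, 3, 2, 9, 15), datetime(2009, 3, 2, 11, 45)),
    (datetime(2009, 3, 10, 14, 0), datetime(2009, 3, 11, 8, 30)),
    (datetime(2009, 3, 20, 22, 5), datetime(2009, 3, 21, 0, 20)),
]

# MTTR = average elapsed time from open to resolution across incidents.
total_hours = sum(
    (resolved - opened).total_seconds() / 3600
    for opened, resolved in incidents
)
mttr_hours = total_hours / len(incidents)
print(f"MTTR: {mttr_hours:.1f} hours")
```

Track that number quarter over quarter and you have exactly the kind of before-and-after evidence Glenn says those process and automation pilots need.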

Cassatt survey: people don't know what their servers are doing

Back for a moment to Rakesh's Gartner report. "Most organizations," Rakesh writes, "struggle with quantifying the scale and technical nature of their data center capacity problems because of organizational problems, and because of a lack of available information." There's the rub. You need real, actual data before you can do anything about it. And -- as our customers have been telling us -- that's not easy to come by.

Our new Cassatt 2009 Data Center Survey (due out in the next few weeks) shows how acute the problem is. It will show that over 75% of data center managers have only a general idea of their servers' current, dynamic usage profile. A couple of other somewhat disturbing stats we found:

· 7% said they don't have a very good handle on what their servers are doing
· 20% know what their servers were originally provisioned to do, but aren't certain that those machines are actually still involved in those tasks
· Only a bit more than 16% of those in IT ops have a detailed, minute-by-minute profile of what activity is being performed, the users involved, granular usage stats, interdependencies, and the like
· More than 20% of respondents thought that between 10% and 30% of their servers were "orphans" (servers that are powered on, but doing absolutely nothing). The actual number we have routinely found to be true orphans in our investigations with big customers, incidentally, is right around 11%. (For more on the "orphan" server topic, see my previous post; a rough sketch of how to flag orphan candidates follows this list.)
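
For what it's worth, once you do have utilization data, the basic orphan test is simple. Here's a rough sketch -- not Cassatt's actual method, and the sample data and 3% cutoff are my own assumptions -- that flags servers whose CPU never rises above idle across the whole observation window:

```python
# Hypothetical per-server CPU utilization samples (percent), e.g. hourly
# readings over an observation window, pulled from your monitoring tool.
samples = {
    "web-01":   [42, 55, 61, 38, 47, 52],
    "batch-07": [1, 0, 2, 1, 0, 1],    # barely registers: orphan candidate
    "db-03":    [70, 65, 80, 75, 68, 72],
}

IDLE_THRESHOLD = 3.0  # percent; an assumed cutoff for "doing nothing"

def orphan_candidates(samples, threshold=IDLE_THRESHOLD):
    """Flag servers that never rise above the idle threshold."""
    return [
        host for host, readings in samples.items()
        if max(readings) < threshold
    ]

print(orphan_candidates(samples))  # ['batch-07']
```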

Getting the information that IT ops needs

OK, so that's all pretty dire. I'm interested, though, in how we help end-user IT ops teams make progress in the face of this. I know from working with customers on data center efficiency projects that this lack of data doesn't cut it; it's much better to come to them with a solution. Or at least some suggestions.

I'm sure others have come up with different ways to solve this, too, but we at Cassatt realized we had to create a way to profile what a customer's environment is actually doing over time. So, as a step in the process of using our software to create an internal compute cloud -- complete with automated service-level management policies for their applications, VMs, and servers -- we put together an Active Profiling Service. To build it, we combined monitoring software with the smarts of our experts who know the ins and outs of data centers. The result: a look at what your data center is doing today, plus recommendations about what to do with that information.
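
The Active Profiling Service itself is our own offering, but the underlying idea -- building a time-based usage profile for each server -- can be sketched generically. Here's a toy version that rolls raw (timestamp, utilization) samples into an average hour-of-day profile; how you fetch the samples depends on your monitoring stack, so that part is assumed:

```python
from collections import defaultdict
from datetime import datetime

def hourly_profile(samples):
    """Average utilization by hour of day.

    `samples` is a list of (timestamp, cpu_percent) pairs for one server,
    collected over days or weeks by whatever monitoring tool you run.
    """
    buckets = defaultdict(list)
    for ts, util in samples:
        buckets[ts.hour].append(util)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}

# Invented samples: busy during business hours, near-idle overnight.
samples = [
    (datetime(2009, 3, 30, 10, 0), 65.0),
    (datetime(2009, 3, 30, 14, 0), 72.0),
    (datetime(2009, 3, 30, 2, 0), 1.5),
    (datetime(2009, 3, 31, 10, 0), 58.0),
    (datetime(2009, 3, 31, 2, 0), 2.0),
]

print(hourly_profile(samples))  # {2: 1.75, 10: 61.5, 14: 72.0}
```

The longer the observation window, the more the profile tells you -- a server that looks idle for a week might still wake up for month-end batch jobs.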

(If you're interested, we can show you some sample Cassatt Active Profiling Service reports: ping us at info@cassatt.com. Also, Steve Oberlin and Craig Vosburgh will be walking through some aspects of this in a webcast this week.)

Once you have the data: some data center optimization suggestions

Once you have some of the crucial profile data, what are some useful optimization suggestions?

Glenn from Forrester points to Harley-Davidson as a company headed in the right direction through a combination of "process, automation, hiring good people, and a determination to discard the destructive practices of the past."

Randy Ortiz of Data Center Journal suggests a little something called the "Data Center and IT Ops Diet." When you are on a diet, Randy notes, "you carefully examine what you take in and how much you burn off. The Data Center and IT Ops diet [which he describes as driven by the economic downturn] provides you with the necessities only: availability and efficiency. There is no room for large projects with long-reaching ROIs."

And what about Rakesh of Gartner? He has similar advice: before jumping into a new data center build-out, a "continuous process of data center improvement [should] be established." (In fact, his whole paper is called, appropriately enough, "Continuously Optimize Your Data Center Capacity Before Building or Buying More," which is what got me started on this whole rant in the first place.) "Too often," Rakesh writes, "the data center improvements are considered a project" and not a long-term, ongoing process. He suggests continually optimizing IT infrastructure because "sprawl in the infrastructure creates sprawl in the data center." The tactics he lists for consideration are pretty basic: consolidation, virtualization, and tossing out older hardware.

These are great starts. We're seeing customers do them all. Many of the customers we work with are implementing these ideas as an integral part of a data center optimization project that also includes our software and an active profiling engagement (so, yes, there's a sampling bias here). Still, what we've been helping these customers do is relevant:

· They can identify and decommission specific orphan servers -- the actual machines that are sitting around doing nothing.
· They can find specific candidates for virtualization: servers whose utilization is very low and whose workloads can be stacked together with others on fewer boxes.
· They can locate candidates for intelligent server power management -- hardware whose workloads are so cyclical that the servers can be completely shut down during off hours to save power.

From all of this can also come recommendations for optimizing your overall IT operations, including setting up policy-based IT infrastructure automation and an internal compute cloud. You could even use some of the decommissioned servers as spare capacity.
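
None of this requires exotic math once the profile data exists. As an illustration only -- the thresholds and categories below are my assumptions, not Cassatt's rules -- here's how the hour-of-day profiles from the earlier sketch might be triaged into those three action buckets:

```python
def classify(profile, idle=3.0, low=20.0):
    """Rough triage of a server's hour-of-day utilization profile.

    Thresholds are illustrative assumptions, not Cassatt's rules:
      - orphan: never rises above `idle` at any hour
      - power management: busy some hours, idle others (cyclical workload)
      - virtualization: always on, but mean utilization below `low`
      - otherwise: busy enough to leave alone
    """
    values = list(profile.values())
    if max(values) < idle:
        return "orphan: decommission or repurpose"
    if min(values) < idle:
        return "power management: shut down during the idle hours"
    if sum(values) / len(values) < low:
        return "virtualization: stack with other low-use workloads"
    return "busy: leave as-is"

# The profile from the earlier sketch: cyclical, so a power-management candidate.
print(classify({2: 1.75, 10: 61.5, 14: 72.0}))
```

Obviously, you'd want the interdependency and user detail our survey asks about before actually decommissioning or shutting down anything.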

As cool as any of this may sound, though, there's no need to get carried away. Start simply. Find out what your servers are doing. Or start experimenting with optimization in a limited corner of your environment where the stakes aren't very high.

But start.

The industry that spawns some of the most creative April Fools' Day jokes shouldn't be one itself.

If you're interested in getting a preview copy of our second annual Data Center Survey results under non-disclosure, let me know at jay.fry@cassatt.com.
