
Tuesday, May 17, 2011

Uptime: Eeyore forever, or can cloud computing help the facilities-IT gap?

I dropped by the Uptime Institute Symposium in Santa Clara last week. It was a chance to step outside of my IT-focused world for a moment and hear the discussions from the other side of the fence: the facilities folks. And, I have to say, the view was a bit different.

In general, I think it's safe to say that facilities is not yet comfortable with where the cloud computing conversation is taking them.

Cloud computing is a part of the conversation on both sides of the fence, but people are looking at the cloud from very different angles. IT, despite having its own running battle with business users about cloud, can at least see cloud as an opportunity. Facilities in many cases views it as a direct, job-endangering threat. And, while there were hints at alignment, there are definitely some ruffled feathers.

A history of disconnects: IT & facilities

But this shouldn't be a surprise. A few years back, before Cassatt Corp. was acquired by CA Technologies, those of us at Cassatt spent quite a bit of time and effort understanding the facilities world and working to connect what the IT guys were doing with what was going on in the facilities realm. At that point, cloud computing was barely even called that. But the beginnings were there. The prototype ideas that would become cloud computing were starting to find their way into the IT and data center efficiency conversations in some of the more forward-looking companies. (If you want some historical snapshots, check out the work of the Silicon Valley Leadership Group in this area, plus the early entries from this blog and those by Ken Oestreich, now cloud marketing guy at EMC).

One of the biggest issues we ran into again and again was a disconnect between IT and facilities. And, after a few days at the most recent Uptime Institute event, I think it’s safe to say that the rift is still there.

IT (at the urging of the business) is leading the cloud charge

To show how the two groups are still pretty far apart, I’ll highlight a couple of the presentations I heard at the event. Several 451 Group analysts had presentations throughout the week, providing the IT perspective. One was William Fellows’ rapid-fire survey of where things are with cloud today. His premise was that cloud is moving from the playground to production in both public and private cloud incarnations.

Fellows pointed out that providing cloud-enabling technologies for service providers was one of the hottest spaces at the moment – he’s tracking a list of some 80 vendors at this point. Demand is moving cloud from an “ad hoc developer activity to become a first-class citizen.” Production business applications are “creeping up in public cloud” because of the ability to flexibly scale.

Enterprise IT, said Fellows, “wants to unlock their inner service-provider selves. They want to use cloud as just another node in the system.” In other words, the IT guys are starting to make leaps forward in how they are sourcing IT service, and even in how they are thinking about the IT role.

But from the other Uptime sessions and discussions, these forward-looking glimpses seemed to be the exception rather than the rule.

Facilities is grappling with cloud’s implications and feeling uneasy

Contrast this with the keynote from AOL’s Mike Manos. Mike spent a chunk of his stage time on a self-described rant about how facilities people were feeling left out – “a bit gloomy” even – when it comes to the cloud computing discussion.

Manos compared facilities folks to Eeyore, the mopey character from the Winnie the Pooh children’s books. That prompted a few knowing chuckles in the crowd. But despite a predisposition to getting bummed out when the topic comes up, “you can’t duck your head,” said Manos, when the discussion turns to cloud computing.

He pointed out that the things that a Google keynoter from earlier in the conference had mentioned were not revolutionary, despite all they have accomplished. In fact, “Google is asking us to do the things we’ve been talking about [at Uptime conferences] for the past 10 years.”

The advice from Manos was good – and assertive. Facilities should step aggressively into the conversation about cloud computing. Don't worry that cloud means your data centers are suddenly going to disappear and take your job with them. It won't mean that, especially not if you play your cards right. Instead of dreading cloud, figure out how to be part of (or even lead) the business decisions.

“No matter what, you’re going to have a hybrid model” in which data centers from external cloud providers will provide some of your IT service, and your own data centers will provide some as well. And, once you’re in that situation, “you’re going to have to manage it,” Manos said.

Now, there is a big list of things the facilities guys will need to get going on before they can take this head-on. Manos listed things as basic as “knowing what you have” in your data center and what it’s doing, as well as things that aren’t normally taken into account, including “soft costs you hardly ever capture.”

The cloud computing challenge for facilities

The ironic thing in all this is that the big cloud providers are given lots of kudos for their IT operations and their ability to enable IT service to aggressively support their business. One of the reasons that Google, Amazon, and others have gotten good at IT service delivery is, in fact, that they are good at the facilities side of things, too. Their facilities teams are integral to their success. So, folks, it’s possible.

Manos left his audience with a challenge – a challenge to jump into the cloud computing conversation with both feet. It means an investment to get applications ready for what happens when infrastructure fails (which it does) and to understand the operational impact of moving to the cloud (which is too often overlooked). It means acknowledging that a move to the cloud requires a clearer understanding of the connection between how applications are architected and how data center facilities are run. Or at least an understanding of what you need to know when computing begins to happen both inside and outside your physical premises.

So, maybe cloud can actually help bridge the IT world and the facilities world. To some of us who have watched these two worlds dance around each other for a while, it's been a long time coming. And, for sure, it's not here yet. But Manos and others, in conjunction with the pressures facilities people are feeling from their business discussions about cloud computing, might just be providing the nudge they need.

Or, at the very least, a great nickname.

Monday, December 28, 2009

PG&E’s Bramfitt: data centers can be efficient and sustainable, but we must think beyond our current horizons

Before heading off for a little vacation, I posted the first part of a two-part interview with Pacific Gas & Electric’s Mark Bramfitt. Mark is best known in data center circles (you know, the kind of people that hang around this site and others like it) as the face of the Bay Area utility’s efforts to drive data center energy efficiency through innovative incentive programs. Now that I’m back, I’m posting part 2 of that discussion.

In the first part of the interview, Mark talked about what’s worked well and what hasn’t in PG&E's data center energy efficiency efforts, the impact of the recession on green IT projects, and how cloud computing is impacting the story.

After the interview appeared, Forrester analyst Doug Washburn wondered on Twitter if PG&E might kill the incentive programs Mark had been previously guiding. The thought is a natural one given how integral Mark has been to the program. From my interview, it didn't sound so far-fetched: Mark himself thought cost pressures might cause PG&E to "de-emphasize…some of the industry leadership activities undertaken by PG&E…we may not be able to afford to develop new programs and services if they won't deliver savings."

In the final part of the interview, posted below, Mark gets a little philosophical about energy efficiency metrics, highlights what’s really holding back data center efficiency work, and doles out a final bit of advice about creating "inherently efficient" IT infrastructure as he departs PG&E.

Jay Fry, Data Center Dialog: Mark, we’ve talked previously about the fact that a lot of the data center efficiency problem comes down to split incentives – often IT doesn’t pay or even see the power bill and the facilities guy doesn’t have control over the servers that are creating the problem. In addition, facilities is usually disconnected from the business reason for running those servers. How serious is this problem right now? What effective or creative ways have you seen for dealing with it?

Mark Bramfitt, PG&E: While I think we can all cite the split incentive issue as a problem, it’s not the only obstacle to success and it gets perhaps a bit too much air time. As much as I’d like to say that everyone’s focus on energy efficiency is making the split incentive obstacle moot, our experience is that there are only two ways that we get past that.

The first is when the IT management side simply has to get more done with less spend, leading them to think about virtualizing workloads. The second is when a facility runs out of capacity, throwing both IT and facility managers into the same boat, with energy efficiency measures often saving the day.

DCD: Many of the industry analyst groups (the 451 Group, Forrester, Gartner, etc.) have eco-efficient IT practices or focus areas now, a definite change from a few years ago. Is this a sign of progress and how can industry analyst groups be helpful in this process?


Mark Bramfitt: I can't argue that the focus on energy-efficient IT is anything but a good thing, but I've identified a big resource gap that hampers the ability of utilities to drive efficiency programs. We rely on engineering consultants who can accurately calculate the energy savings from implementing facility and IT improvements, and even here in the Bay Area, we have a hard time finding firms that have this competency – especially on the IT equipment side.

In my discussions with utilities across the U.S., this is probably the single biggest barrier to program adoption – they can’t find firms who can do the calculations, or resources to appropriately evaluate and review them.

So, I think there’s a real opportunity for IT service providers – companies that both sell solutions and services – to step into this market. I don’t know that analyst firms can do much about that.

DCD: I think it’s fair to say that measurement of the power and ecological impact of data centers has started and really does matter. In the (pre-CA) survey Cassatt did at the beginning of this year on the topic, we found that there were definitely areas of progress, but that old data center habits die hard. Plus, arguments among the various industry and governmental groups like the Green Grid and Energy Star over how and what to measure (PUE, DCiE, EUE) probably don’t help. How do you think people should approach measurement?


Mark Bramfitt: We’ve delivered a consistent message to both customers and to vendors of metering and monitoring systems that we think quantifying energy use can only have a positive impact on driving people to manage their data centers better. Metering and monitoring systems lead people to make simple changes, and can directly measure energy savings in support of utility incentive programs.

We also like that some systems are moving beyond just measurement into control of facility and IT equipment, and to the extent that they can do so, we can provide incentive funding to support implementation.

The philosophical distinctions being made around what metrics are best are understandable – the ones we have now place an emphasis on driving down the “overhead” energy use of cooling and power conditioning equipment, and have nothing to say about IT power load. I believe that the industry should focus on an IT equipment “utilization index” rather than holding out for the ideal efficiency metric, which is probably not conceivable given all of the different IT workloads extant in the marketplace.

DCD: One of the major initiatives you had was working with other utilities to create a coalition pushing for incentives similar to those PG&E has been offering. I've heard Data Center Pulse is helping work on that, too. How would you characterize where we are with this process currently? What's next?

Mark Bramfitt: The PG&E-sponsored and -led Utility IT Energy Efficiency Coalition [also mentioned in part 1 of Mark’s interview] now has almost 50 members, and I would say it has been a success in its core mission – to provide a forum for utilities to share program models and to discuss opportunities in this market space.

I think it's time for the Coalition, in whatever form it takes in the future, to expand into offering the IT industry a view of which utilities are offering programs and how to engage with them, as well as a place for IT and other companies to list their competencies.

I’ll be frank, though, in saying that I don’t know whether PG&E will continue to lead this effort, or if we need to think about another way to accomplish this work. I’ve had conversations with a lot of players to see how to maintain the existing effort as well as to extend it into new areas.

DCD: Watching this all from PG&E is certainly different than the perspective of vendors, IT departments, or facilities folks. Any comments on seeing this process through the vantage point you’ve had?

Mark Bramfitt: Utilities primarily look at the data center market as a load growth challenge: how can we provide new energy delivery capacity to a segment that is projected to double every 5 years? There's already immense national and global competition for locations where 20 or even 100 MW loads can be accommodated, and where customers want to build out facilities in months, not years.

PG&E’s answer to that is to actively work with customers to improve energy efficiency, extending our ability to accommodate new load growth without resorting to energy supply and delivery capacity that is expensive to build and not our best choice from an environmental sustainability perspective.

My larger view is that IT can deliver tremendous environmental benefits as it affects broad swaths of our activities – improving delivery systems, for example, and in my line of work enabling “smart grid” technologies that can improve utility operation and efficiency. But to get there, we need to have IT infrastructure that is inherently efficient and thereby sustainable, and we have a huge opportunity space to make that happen.

DCD: Do you have any advice you’d care to leave everyone with?

Mark Bramfitt: Advice? Every major achievement I've seen in this space has been due to people expanding their vision and horizons. It's IT managers taking responsibility for energy costs even if they don't roll up in their budget. It's IT companies supporting efficiency measures that might in some ways be at cross-purposes with their primary business objectives. And it's utilities that know that their mission can't just be about delivering energy; they need to support customers and communities in new ways.



Thanks, Mark, for taking the time for the interview, and best of luck with the new venture.

Wednesday, December 16, 2009

As Bramfitt departs PG&E, where will the new focus for data center energy efficiency efforts be?

If you were in the audience at the Silicon Valley Leadership Group’s Data Center Energy Efficiency Summit earlier this year, you were probably there (among other things) to hear Mark Bramfitt from Pacific Gas & Electric (PG&E). Mark has been the key figure in the Bay Area utility’s efforts to drive improvements in how data centers are run to cut energy costs for the data center owners and to reduce their burgeoning demand for power.

But, Mark had a surprise for his audience. During his presentation, he announced he was leaving PG&E.

“The reaction from the crowd was impressive…and for good reason,” said John Sheputis, CEO of Fortune Data Centers, in a story by Matt Stansberry at SearchDataCenter. “Mark is an excellent speaker, a very well-known entity in the Valley, and among the most outspoken people I know of regarding the broader engagement opportunities between data centers and electricity providers,” Sheputis said. “No one has done more to fund efficiency programs and award high tech consumers for efficient behavior.”

Mark has stayed on with PG&E for the months since then to help with the transition. Before he moves on to his Next Big Thing at the end of 2009, I thought I’d ask him for his thoughts on a few relevant topics. In the first part of the interview that I’m posting today, Mark talks about what’s worked well and what hasn’t in PG&E's data center energy efficiency efforts, the impact of the recession on green IT projects, and how cloud computing is impacting the story.

Jay Fry, Data Center Dialog: Mark, you've become synonymous with PG&E's data center efficiency work, and if the reaction to the announcement that you'll be leaving that role at the SVLG event is any indication, you'll be missed. Can you give some perspective on how things have changed in this area during your time on the project at PG&E?

Mark Bramfitt, PG&E: First, I truly appreciate the opportunity to offer some clarity around PG&E’s continued focus on this market, as well as my own plans, as I feel I’ve done something of a disservice to both the IT and utility industry nexus by not being more forthright regarding our plans.

My team and I have treated the data center and IT energy efficiency market as a start-up within PG&E’s much larger program portfolio, and we’ve seen a great growth curve over the past four years – doubling our accomplishments in 2008 compared to 2007, for example. We’ve built an industry-leading portfolio of programs and services, and I expect PG&E will continue to see great engagement from our customers in this space.

That being said, utilities in California are under tremendous pressure to deliver energy efficiency as cost effectively as possible, so some of the industry leadership activities undertaken by PG&E may have to be de-emphasized, and we may not be able to afford to develop new programs and services if they won’t deliver savings.

My personal goal is to see 20 or more utilities follow PG&E’s lead by offering comprehensive efficiency programs for data centers and IT, and I think I can best achieve that through targeted consulting support. I’ve been supporting utilities essentially in my “spare” time, in part through the Utility IT Energy Efficiency Coalition, but there are significant challenges to address in the industry, and I think my full-time focus as a consultant will lead to broader success.

DCD: Why are you leaving PG&E, why now, and what will you be working on?

Mark Bramfitt: It may sound trite, or worse, arrogant, but I want to amplify the accomplishments we’ve made at PG&E over the past few years, using my knowledge and skills to drive better engagement between the utility and IT industries in the coming years. PG&E now has a mature program model that can be executed well in Northern California, so I’d like to spend my time on the bigger challenge of driving nationwide activities that will hopefully yield big results.

DCD: You had some big wins at some big Silicon Valley data centers: NetApp being one of those. Can you talk about what came together to make some of those possible? What should other organizations focus on to get them closer to being able to improve their data center efficiency as well?

Mark Bramfitt: Our "big hit" projects have all been new construction engagements where PG&E provides financial incentives to help pay for the incremental costs of energy efficiency improvements – for measures like air- or water-side economizers, premium efficiency power conditioning and delivery equipment, and air flow isolation measures.

We certainly think our financial support is a big factor in making these projects work, but my project managers will tell you that the commitment of the project developer/owner is key. The design team has to want to work with the technical resource team PG&E brings to the table, and be open to spending more capital to realize expense savings down the road.

DCD: You made some comments onstage at the Gartner Data Center Conference last year, saying that "It's been slow going." Why do you think that's been the case, and what was most disappointing to you about this effort?

Mark Bramfitt: I don't have any real disappointments with how things have gone – we're just very focused on being as successful as we can possibly be, and we are introspective in thinking about what we could do better.

I’d characterize it this way: we’ve designed ways to support on the order of 25 energy efficiency technologies and measures, absolutely leading the utility industry. We’ve reached out to dozens of VARs and system integrators, all of the major IT firms, every industry group and customer association, made hundreds of presentations, delivered free training and education programs, the list goes on.

What has slowed us down, I think, is that the IT industry and IT managers had essentially no experience with utility efficiency programs three years ago. It simply has taken us far longer than we anticipated to get the utility partnership message out there to the IT community.

DCD: The green IT hype was pretty impressive in late 2007 and early 2008. Then the economic crisis really hit. How has the economic downturn affected interest in energy efficiency projects? Did it get lost in the crisis? My personal take is that it certainly didn’t get as much attention as it would have otherwise. Maybe the recession caused companies to be more practical about energy efficiency topics, but I’m not sure about that. What are your thoughts?

Mark Bramfitt: I don’t see that the message has been lost, but certainly the economy has affected market activity.

PG&E is not seeing the level of new data center construction that we had in '07 and '08, but the colocation community tells me demand is exceeding supply by 3-to-1. They just can't get financing to build new facilities.

On the retrofit side, we’re seeing interest in air flow management measures as the hot spot, perhaps because customers are getting the message that the returns are great, and it is an easy way to extend the life and capacity of existing facilities.

DCD: The other topic that’s taken a big share of the IT and facilities spotlight in the last year has obviously been cloud computing. How do you see the efficiency and cloud computing conversations playing together? Is the cloud discussion helping or hindering the efficiency discussion inside organizations in your opinion?

Mark Bramfitt: I've talked to some thought leaders on cloud computing and many seem to think that highlighting the potential energy efficiency advantages of shared services has merit. But with regard to our program delivery, the intersection has really been about how to serve colocation owners and tenants, rather than on the broader topic of migration to cloud services.



Be sure to come back for Part 2 of the interview. We'll cover a few other topics of note, including Mark’s thoughts on the philosophical differences over measurement approaches, the single biggest barrier to data center efficiency program adoption, and even a little bit of parting advice from Mark as he gets ready to leave PG&E.

Sunday, May 31, 2009

Old habits die hard: the bad news on data center energy efficiency

Despite the batch of pretty good news I reported in my previous post about the trends we see in how data center managers are approaching energy efficiency from 2008 to 2009, there is some bad news. Isn't there always?

But before we get too gloomy, it should be noted that much of the media, vendor, and customer discussion about energy efficiency over the past few years seems to have paid off in getting the word out. As I discussed last time, more IT folks have "green" initiatives to leverage, and more are measuring their power consumption. Sure, that sometimes means that what they measure is pretty inefficient or problematic ("Um, guys...do you know we're out of power in our New York data center?"), but at least they're measuring.

The conversation has led many (including my own esteemed colleague and Cassatt chief scientist Steve Oberlin) to point out that the current cloud computing discussions can absolutely be seen as one way that data centers -- or at least the organizations that own them -- can become more green.

The "green IT" messages just might be getting through, but...

All this discussion seems to mean that organizations like the EPA's Energy Star program, the Uptime Institute, the Green Grid, and others are getting through to people who run data centers. For example, in this year's Cassatt Data Center Survey, slightly more (63.2% v. 61.4% previously) know that the EPA recommends turning off servers when they aren't in use. Even though it's just a slight gain, I'll take it.

However, our survey also uncovered enough backsliding from 2008 to 2009 that I'm still forced to point out that some old habits die hard. Even if some of those habits are hurting the operation of your data center. In fact, some of the concepts that Cassatt has been strongly advocating are meeting some very stiff opposition. Chalk up a few points for operational inertia.

Example: many would consider shutting off idle servers, but the percentage has dropped

Here's an example of one case where the "business-as-usual" approach is still holding on pretty firmly: one of the simplest use cases for Cassatt Active Response has been to use our software to shut down servers when they become idle and then use our policy-based automation to turn them back on when needed. However, powering up and down servers has long been seen as a no-no in IT operations, despite a wave of sources (including the EPA and the Green Grid) recently advocating this approach.

Given the long-standing skepticism about fiddling with server power, our 2009 survey's result actually seems impressive: 55% of the folks who responded to our survey (IT operations, data center managers, and the like) could justify turning off servers. Seems significant, right? Well, it is, actually, if you have a good feel for the underlying conservatism of those charged with keeping the data center running. The sad news from our perspective, however, is that this number is down slightly from last year's figure (59%), showing that the entrenched management ideas are still very strong. That’s despite some pretty good savings estimates. (Our savings calculators conservatively show Active Power Management-driven cost reductions starting at around 23%, with potential for closer to 50% in many cases.)
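(If you're curious about the shape of the math behind those calculators, here's a back-of-the-envelope sketch. The server counts, wattage, utility rate, and idle assumptions below are made up for illustration; this is not Cassatt's actual calculator model.)

```python
# Back-of-the-envelope version of an idle-server power savings estimate.
# All inputs below are illustrative assumptions, not Cassatt's actual model.

def annual_power_cost(servers, watts_per_server, price_per_kwh, pue=2.0):
    """Yearly power cost, with PUE covering the facility overhead (cooling, etc.)."""
    kwh_per_year = servers * watts_per_server * 24 * 365 / 1000.0
    return kwh_per_year * pue * price_per_kwh

def power_management_savings(servers, watts_per_server, price_per_kwh,
                             idle_fraction=0.30, powered_off_share=0.8):
    """Estimate what you avoid if most idle hours are spent powered off."""
    baseline = annual_power_cost(servers, watts_per_server, price_per_kwh)
    return baseline, baseline * idle_fraction * powered_off_share

baseline, avoided = power_management_savings(500, 400, 0.12)
print(f"${baseline:,.0f}/yr baseline, ${avoided:,.0f}/yr avoided "
      f"({avoided / baseline:.0%})")  # lands near the low end of that 23%-50% range
```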

Similarly, the "deal breakers" for server power management remain similar to what they were last year: application availability is the most important. Impact on physical reliability and application stability were tied for the 2nd most important. This year's numbers do show a surge in worries about the physical reliability of machines (36.3% to 42.5%) and in the potential application downtime that people perceive server power management might cause (45.3% to 51.4%). But if you end up with a heat-induced outage like Last.fm had today, suddenly some proactive server shut-downs to avert a literal data center meltdown may not seem so scary.

So, any signs that the conservatism in IT operations groups can change?

Actually, yes. Respondents seemed to have grown more comfortable with determining ROI around the topic of server power management. (And, no, I don't think our aforementioned savings calculators can be credited with that.) Also, despite an increase in skepticism regarding using automation for server power management, 36.7% still said they would be OK using automation to power manage a majority of servers in their dev/test environments -- exactly the kind of advice we have been giving prospective customers. Interestingly, 27.2% even said they'd do this for low priority production servers.

Though the IT/facilities gap remains, it is shrinking

By the way, one of the issues alluded to in nearly all writings on the topic of data center energy efficiency -- the alignment gap between IT and facilities -- is still there. But the gap is closing. When asked how integrated facilities and IT planning are, 29.7% said there was "no gap" and they were "tightly aligned," and 37.0% said there was a "small gap" and that they speak with their counterparts in the other organization "somewhat."

Where was the improvement? Last year, 32.2% said they had either a "significant gap" in which IT and facilities touch base "infrequently" or a "large gap," meaning they "don't interact at all." This year those numbers dropped to 20.3%. Maybe these organizations are being brought together by smart companies looking for answers to their data center energy problems. Or maybe these guys are just taking the advice of Ken Brill of the Uptime Institute or various analysts and doing something as simple as taking their IT and facilities counterparts to lunch. I'm happy either way. It's amazing what a little communication can do.

So, are the 'experts' succeeding in being heard about data center energy efficiency?

One of the odd things we noticed in last year's survey was that, despite there being a great deal of independent, unbiased expert advice out there regarding data center energy efficiency, respondents got most of their information on the topic from entrenched system and power/cooling vendors. You know, the ones with a big stake in keeping the status quo. (Of course, it could be argued that these vendors have a good perspective on what's needed. However, these vendors just don't have the economic incentives to push radical change.)

And 2009? Same thing. Expert bloggers and media websites on the topic (like TechTarget's SearchDataCenter and others) did get significant mention. Peers and industry analysts did well, too. But the big guys are still the ones that folks are going to for guidance.

There was a curveball this year, too, however. The Uptime Institute, the Green Grid, the EPA, and even the Silicon Valley Leadership Group (which put on a great event about "green IT" last June) all did unexpectedly worse than last year when asked where folks go for data center energy efficiency guidance. Hmmmm.

Big problems take a long time to fix, so don’t expect instant improvement

The take-away from all this? I look at it this way: the data center power problem is a big, long-term one. The processes and approaches that have gotten us all into this situation are big, long-term ones. And, therefore, the solutions are going to need to be pretty fundamental, and, as a result, they will also take a long time to implement. So, we need to be in this for the long haul.

The good news, as I see it, is that the 2009 Cassatt Data Center Survey suggests that the magnitude of the problem is starting to be seen and understood. IT and facilities groups should be given credit for the initial actions they seem to be taking in the areas that they have under their control. It's a great start. Now, if you combine these small initial steps with a bit of economic pressure (which the outside world is adding to the mix quite effectively on its own right now), who knows? Maybe even some of these outdated habits will fall by the wayside.

If you'd like a copy of the 2009 Cassatt Data Center Survey results, e-mail me at jay.fry@cassatt.com.

Friday, February 27, 2009

Recoverability: How an internal cloud makes it easier to get your apps back up and running after a failure

Ok, back to talking about the "-ilities" this week and how cloud computing can help you address one of the key issues you are concerned with in your data center. On deck is the recoverability of your IT environment when run on an internal cloud infrastructure (like Cassatt Active Response).

As discussed in my last post, there can be a fair amount of organizational and operational change required to adopt an internal cloud infrastructure, but there are many benefits from taking on the task. The next couple of posts will outline one of the major benefits (recoverability) that comes from separating the applications from their network, computing, and storage resources, and how this separation allows for both intra-datacenter and inter-datacenter recoverability of your applications.

Five Internal Cloud Recovery Scenarios
To walk you through the discussion, I'm going to take you through a number of levels of recoverability as a function of the complexity of the application and the failure being addressed. Taking this approach, I've come up with five different scenarios that start with the recovery of a single application and end with recovery of a primary datacenter into a backup datacenter. The first four scenarios (intra-datacenter recovery) are covered in this post and the last one (inter-datacenter recovery) will be covered in my next post. So, enough background: let’s get into the discussion.

Application Recovery
Let's begin with the simplest level of recoverability and start with a single application (think a single workgroup server for a department that might be running things like a wiki, mail, or a web server). From talking to many admins over the years, the first thing they do when they find that a given server has dropped offline is to perform a "therapeutic reboot" to see if that gets everything back to a running state. The reality of IT is that many of the frameworks/containers/applications that you run leak memory slowly over time, and a reboot is the easiest way to clean up the issue.

In the case of a traditional IT environment, a monitoring tool would be used to monitor the servers under management and if the monitors go offline then an alert/page is generated to tell the admin that something is wrong that needs their attention. With an internal cloud infrastructure in place, the creation of monitors for each managed server comes for free (you just select in the UI how you want to monitor a given service from a list of supported monitors). In addition, when the monitor(s) drop offline you can configure policy to tell the system how you want the failure handled.

In the case of an application that you'd like rebooted prior to considering it failed, you simply instruct the internal cloud infrastructure to power cycle the node and to send an alert only if it doesn't recover correctly. While this simple level of recoverability is interesting in making an admin more productive (less chasing unqualified failures), it really isn't that awe-inspiring, so let's move on to the next level of recovery.
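To make that policy concrete, here's a minimal sketch of the reboot-before-alerting idea. The helper functions are hypothetical placeholders standing in for the configured monitor, power control, and alerting; they are not Cassatt Active Response's actual API.

```python
import time

# Hypothetical placeholders below; not the product's real API.
def check_service(node):
    """Stand-in for whatever monitor you selected for this service."""
    return False

def power_cycle(node):
    """Stand-in for out-of-band power control of the node."""
    print(f"power cycling {node}")

def page_admin(node, message):
    print(f"ALERT for {node}: {message}")

def handle_monitor_failure(node, reboot_wait_secs=300):
    """Try a therapeutic reboot first; only page the admin if the node stays down."""
    power_cycle(node)
    time.sleep(reboot_wait_secs)
    if not check_service(node):
        page_admin(node, "service did not recover after automatic reboot")
```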

Hardware Recovery
In this scenario, we'll pick up where the last example left off and add a twist. Assume that the issue in the previous example wasn't fixed by rebooting the node. With an internal cloud infrastructure, you can enable the policy for the service to not only reboot the service on failure (to see if that re-establishes the service) but also to swap out the current piece of hardware for a new piece (complete with reconfiguring the network and storage on the fly so that the new node meets the application's requirements).

Let's explore this a bit more. With current IT practices, if you are faced with needing to make a singleton application available (like a workgroup server), you probably start thinking about clustering solutions (Windows or U*nx) that allow you to have a single primary node running and a secondary backup node listening in case of failure. The problem with this approach is that it is costly (you have to buy twice the hardware), your utilization is poor (the back-up machine is sitting idle), and you have to budget for twice the cooling and power because the back-up machine sits powered on but idle.

Now contrast that with an internal cloud approach, where you have a pool of hardware shared among your applications. In this situation, you get single-application hardware availability for free (just configure the policy accordingly). You buy between 1/5th and 1/10th the backup hardware (depending on the number of concurrent failures you want to be resilient to) as the back-up hardware can be shared across many applications. Additionally, the spare hardware you do purchase sits powered off while awaiting a failure, so it consumes zero power and cooling.
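Here's a rough sketch of the shared-spare-pool idea; the names and policy details are assumptions for illustration, not Cassatt's implementation.

```python
# One small pool of powered-off spares backs many applications, instead of
# one idle standby per application. Names and details are assumptions.

spare_pool = ["spare-01", "spare-02"]   # powered off until a failure occurs

def replace_failed_node(app, failed_node, configure_node):
    """Pull a spare, reconfigure its network/storage/image to match the
    application's requirements, and return it; fail loudly if the pool is empty."""
    if not spare_pool:
        raise RuntimeError(f"no spares left to replace {failed_node} for {app}")
    replacement = spare_pool.pop()
    configure_node(replacement, app)    # network, storage, boot image for this app
    return replacement
```

Ten applications clustered 1:1 would need ten idle standbys; here, two shared spares can cover all ten, as long as you don't plan for more than two concurrent failures.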

Now, that level of recoverability is interesting, but IT shops typically have more complex n-tier applications where the application being hosted is spread across multiple application tiers (e.g. a database tier, a middleware/service tier and a web tier).

Multi-tier Application Recovery
In more complex application types (including those with four tiers, as is typical with JBoss, WebLogic, or WebSphere), an internal cloud infrastructure continues to help you out. For this example let's take JBoss as our working example as many people have experience with that toolset. JBoss' deployment model for a web application will typically consist of four tiers of services that work in concert to provide the desired application to the end user. There will be a database tier (where the user's data is stored), a service tier that provides the business logic for the application being hosted, and a web tier that will interact with the business logic and dynamically generate the desired HTML page. The fourth and final tier (which isn't in the IT environment) is the user's browser that actually renders the HTML to human-readable format. In this type of n-tier application stack there are implicit dependencies between the different tiers that are usually managed by the admins who know what the dependencies are for the various tiers and, as a result, the correct order for startup/shutdown/recovery (e.g. there is no point in starting up the business or web tiers if the DB is not running).

In the case of an n-tier application, an internal cloud computing infrastructure can help you with manageability and scalability, as well as recoverability (we're starting to pull out all the "-ilities" now…). We'll cover them in order and close with the recoverability item as that's this week's theme. On the manageability front, an internal cloud infrastructure can capture the dependencies between the tiers and orchestrate the orderly startup/shutdown of the tiers (e.g. first verify that the DB is running, then start the business logic, and finally finish with the web tier). This means that the specifics of the application are no longer kept in the admin's head, but rather in the tool where any admin can benefit from the knowledge.
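As a sketch of what capturing those dependencies in the tool might look like (the tier graph and start_tier function are illustrative, not the product's actual model):

```python
from graphlib import TopologicalSorter   # Python 3.9+

# Illustrative tier graph: business depends on db, web depends on business.
tier_dependencies = {"db": set(), "business": {"db"}, "web": {"business"}}

def start_tier(tier):
    print(f"starting {tier} tier")        # placeholder for the real orchestration

def orderly_startup():
    for tier in TopologicalSorter(tier_dependencies).static_order():
        start_tier(tier)                   # db, then business logic, then web

orderly_startup()
```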

On the n-tier scalability front, usually a horizontal rather than vertical scaling approach is used for the business and web tiers. With an internal cloud infrastructure managing the tiers and using the monitors to determine the required computing capacity (to do this with Cassatt Active Response, we use what we call Demand-Based Policies), the infrastructure will automatically increase/decrease the capacity in each tier as a function of the demand being generated against the tier.
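A rough sketch of that kind of demand-based policy is below; the thresholds and node limits are made-up values, not Cassatt's actual Demand-Based Policies logic.

```python
# Made-up thresholds for illustration only.
def rebalance_tier(current_nodes, avg_load,
                   scale_up_at=0.75, scale_down_at=0.30,
                   min_nodes=2, max_nodes=20):
    """Return the node count a tier should run at, given its observed load."""
    if avg_load > scale_up_at and current_nodes < max_nodes:
        return current_nodes + 1     # pull another node from the shared pool
    if avg_load < scale_down_at and current_nodes > min_nodes:
        return current_nodes - 1     # release a node (and its power) back to the pool
    return current_nodes

print(rebalance_tier(current_nodes=4, avg_load=0.82))   # -> 5
```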

Finally, on the recoverability front, everything outlined in the last recovery scenario applies (restart on failure and swap to a new node if that doesn't work), but now you also get the added value of being able to restart services in multiple dimensions. As an example, in many cases connection pooling is used in the business tier to increase the performance of accessing the database. One downside (depending on the solution used for managing the connection pool) is that if the database goes away, then the business tier has to be restarted to re-establish the connections. In a typical IT shop this would mean that the admin would have to manage the recovery across the various tiers. However, in an internal cloud computing environment, the infrastructure knows that if the DB went down, there is no point in trying to restart the failed business tier until the DB has been recovered. Likewise, there is no point in trying to recover the web tier when the business tier is offline. This means that even if the root failure cannot be addressed by the infrastructure (which can happen if the issue is not transient or hardware-related), the admin can focus on the recovery of the specific item that has failed and the system will take care of the busywork associated with restoring the complete service.
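In sketch form, that dependency-aware gating might look something like this (illustrative names only, not the product's internals):

```python
# Illustrative only: a failed tier is only worth restarting once the
# tiers it depends on are healthy again.
tier_dependencies = {"db": [], "business": ["db"], "web": ["business"]}

def ready_to_recover(tier, healthy_tiers):
    return all(dep in healthy_tiers for dep in tier_dependencies[tier])

print(ready_to_recover("business", healthy_tiers={"web"}))   # False: the DB is still down
print(ready_to_recover("business", healthy_tiers={"db"}))    # True: safe to restart
```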

Intra-datacenter recovery
Ok, we're now into hosting non-trivial applications within a cloud infrastructure so let's take that same n-tier application example, but add in the extra complexity that there are now multiple n-tier applications being managed. What we'll do, though, is throw in a datacenter infrastructure failure. This example would be relevant for issues like a Computer Room Air Conditioning (CRAC) unit failure, loss of one of the consolidated UPS units, or a water leak (all of which would not cause a complete failure in a datacenter, but would typically knock out a block of computing capacity).

Before we jump into the failure, we need to explore one more typical practice for IT shops. Specifically, as more applications are added to an IT environment, it is not uncommon for the IT staff to begin to stratify the applications into support levels that correspond to the importance the business places on the specific application in question (e.g. revenue systems and customer-facing systems typically have a higher availability requirement than, say, a workgroup server or test and development servers). For this example, let's say that the IT department has three levels of support/availability that they use, with level one being the highest priority and level three being the lowest. With Cassatt Active Response, you can put this type of policy directly into the application and allow it to optimize the allocation of your computing resources to applications per your defined priorities. With that as background, let's walk through the failure we outlined above and see what Cassatt Active Response will do for you in the face of a major failure in your datacenter (we're going to take the UPS example for this discussion).

We'll assume prior to the failure that the environment is in a steady state with all applications up and running at the desired levels of capacity. At this point, one of the shared UPS units goes offline, which affects all compute resources connected to that UPS unit. This appears to Cassatt Active Response as a number of node failures that go through the same recovery steps outlined above. However, as mentioned above, usually you will plan for a certain number of concurrent failures and you will keep that much spare capacity available for deployment. Unfortunately, when you lose something shared like a UPS, the number of failures quickly consumes the spare capacity available and you find yourself in an over-constrained situation.

This is where being on a cloud infrastructure really starts to shine. Since you have already identified the priority of the various applications you host in the tool, it can dynamically react to the loss in compute capacity and move resources as necessary to maintain your service levels on your higher priority applications. Specifically, in this example let's assume that 30% of the lost capacity was in your level 1 (most important) applications. The infrastructure will first try to shore up those applications from the excess capacity available, but when that is exhausted, it will start repurposing nodes from lower priority applications in support of re-establishing the service level of the higher priority applications. Now, because the cloud infrastructure can manage the power, network, and images of the applications, it can do all of this gracefully (existing lower priority applications get gracefully shut down prior to their hardware being repurposed) and without user interaction. Within a short period of time (think 10s of minutes), the higher priority applications have been re-established to their necessary operating levels.
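Here's an illustrative sketch of that priority-driven repurposing. The data structures and policy are assumptions made for the example, not the product's actual algorithm.

```python
# All names and numbers are illustrative.
def reallocate_after_failure(apps, spare_nodes):
    """apps: list of dicts with 'name', 'priority' (1 = highest), 'running',
    and 'desired' node counts. Shores up high-priority apps first from spares,
    then by gracefully reclaiming hardware from lower-priority apps."""
    for app in sorted(apps, key=lambda a: a["priority"]):
        while app["running"] < app["desired"]:
            if spare_nodes:
                spare_nodes.pop()                      # use excess capacity first
            else:
                donors = [a for a in apps
                          if a["priority"] > app["priority"] and a["running"] > 0]
                if not donors:
                    return                             # over-constrained: nothing left
                donor = max(donors, key=lambda a: a["priority"])
                donor["running"] -= 1                  # gracefully shut down one node
            app["running"] += 1

apps = [{"name": "billing", "priority": 1, "running": 6, "desired": 10},
        {"name": "wiki",    "priority": 3, "running": 5, "desired": 5}]
reallocate_after_failure(apps, spare_nodes=["spare-01"])
print(apps)   # billing is shored up first; wiki donates nodes once spares run out
```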

The final part of this example is what occurs when the issue causing the failure is fixed (in our example, the UPS is repaired and power is re-applied to the affected computing resources). With a cloud infrastructure managing your environment, your lower priority applications that were affected in support of shoring up the higher priority applications all just recover automatically. Specifically, once the power is reapplied, all you have to do is mark the hardware as available and the system will do the rest of the work to re-inventory and re-allocate the hardware back into those tiers that are running below the desired capacity levels.

Well, there you have it. We've walked through a variety of failure scenarios within a datacenter and discussed how an internal cloud infrastructure can offload much of the busy/mundane work of recovery. In the next post I'll take the example we just finished and broaden it to include recovery into a completely different datacenter. Until then…

Wednesday, February 11, 2009

VMware's Thiele: Never more data center change than right now

Making headway in running a data center is hard. Even if you've worked on it a lot. The guy I'm talking to in today's Cassatt Data Center Dialog interview is someone who -- despite the curveballs that IT and the business it supports can throw at you -- has been consistently making big strides in how data centers are run: Mark Thiele.

Mark is director of R&D business operations for virtualization giant VMware and as part of that job runs data centers in Palo Alto, Massachusetts, Washington, and Bangalore, totaling approximately 85,000 square feet. I've also seen Mark in action at his previous job, where his ideas and initiative helped shape parts of our Active Power Management technology and what became Cassatt Active Response, Standard Edition.

I last saw Mark speak at the Gartner Data Center Conference in December where he and Mark Bramfitt of PG&E talked about the state of green IT. I used that as a starting point for our interview:

Jay Fry, Data Center Dialog: In your panel at the Gartner Data Center Conference, you talked about how it's important to bring facilities and IT folks together for improved IT operations. You also touched on bringing operations and process management into the R&D organization and "stretching" people (including yourself) beyond their normal areas of expertise. Do you have some specific suggestions on how to do this or some examples of what's worked for you?

Mark Thiele, VMware: There's no simple answer here. I've been working in IT for a long time and as a result I'm biased in my belief that IT, when used appropriately, can solve almost any problem. All joking aside, the reality is that many of the folks that work in IT look at how things get done a little differently than everyone else. An IT person's first reaction to a job that needs to get done is "how can I write a script to automate that?" I've utilized this to my advantage by looking at what are seemingly intractable "non-IT" problems from the IT perspective and shining a new light on them. This IT-centric focus helps to stretch non-IT folks and IT folks alike. As they work together to solve shared problems they come to realize the benefit of the shared experience and uncommon backgrounds. Once this happens, improving operations between groups becomes less of a headache.

DCD: You discussed having a "bridging the gap" person who can look at a data center as a holistic system. How do you find someone to fill that role? What skills should folks look for?

Mark Thiele: This can be a very difficult role to fill. The ideal person is someone who has a strong understanding of IT infrastructure, but also an understanding of the importance of dealing with the entire data center as a system. In my case, I was able to identify a strong candidate and convince them to take on the new role by explaining the potential opportunity for improvement and bottom-line impact for the company. The data center has become one of the most commonly referenced areas of opportunity in business today. There has never been more focus and change in data centers than there is right now. This kind of change and business importance can be very enticing to forward-thinking and career-driven IT staff.

DCD: One of the things you mentioned in Vegas was that there is a 3-5 year gap between when something is proven to be a good idea for improving IT operations and when people are actually using it. You said you’d like to find a way to shrink that time period. Any specific examples of things that seem "proven" but aren't being used yet? Any ideas how to shrink that time gap?

Mark Thiele: The dynamics of why it often takes years to implement new technology in the data center are many. These dynamics include risk avoidance, cost of entry, myth, intractable staff, and/or inflexible data center facilities. However, the aforementioned factors still don't explain why it oftentimes takes 5 years or even more for proven technologies to be implemented.

The delays are associated with the inability to truly measure and understand the risk/reward of making change in the data center. As an industry we need to carry more responsibility for looking at the long term benefits of new technology vs. the short term potential for disruption in the environment.

Take virtualization as an example. You can pick any of 1,000 white papers that will explain why implementing a major virtualization strategy in your data centers is the best way to improve operations and drive down cost of doing business. Yet almost every day I talk to folks who say "VMware works great, [but] I just can't risk it in production" or "my software provider told me they won't support it on a VM." Thousands of world-class organizations have large "production" VMware solutions installed; how many more do we need before everyone agrees it works?

Part of this problem is aversion to any type of risk, perceived or real. If the potential benefit is better disaster preparedness, faster provisioning, higher availability, and lower cost of doing business, it should be OK to accept a certain amount of calculated risk. As leaders in business and the IT space, we should be obligated to ensure that our teams understand that intelligent risk is OK; in fact, it's expected.

DCD: You suggested that people get started on energy-efficiency projects in bite-sized pieces, to show some quick wins. Any specific suggestions about how someone should approach creating that list or any particular projects you would suggest they start with?

Mark Thiele: There is a mountain of information available that can help with identifying power/efficiency opportunities in the data center. There's the Green Grid, APC Data Center University, Emerson's Efficient Data Center portal, LinkedIn groups like Data Center Pulse and many more. Once you've gone through some of the information on what can be done, you need to audit your current data centers to identify and prioritize the gap resolution activities. This prioritized list of gaps or opportunities should then be built into a program. I would highly recommend that anyone initiating a large effort like this should ensure they capture current state relative to space, power, and cooling so that you can measure and report the improvements to your management.

DCD: Who else within someone's company or organization should the IT operations people ally themselves with (besides facilities) to make progress on the data center energy efficiency front?

Mark Thiele: Finance is your friend. If you can demonstrate the potential savings and long term cost of doing business improvements, they will likely become ardent supporters of the effort.

DCD: What has surprised you most about what’s going on in this space today?

Mark Thiele: That it's taken so long for data centers to get this much attention. It's about time for the IT Infrastructure folks to be getting the attention they deserve.

DCD: You mentioned Data Center Pulse, the group of data center operations people that you helped found via LinkedIn. Given the wide range of existing organizations focused on different aspects of the data center, why did you feel a need to create a new one?

Mark Thiele: Our primary driver for creating Data Center Pulse was to give data center owner/operators a chance to have a direct and immediate influence on the industry that supports them. We are effectively a working group that will deliver information and opportunity-for-improvement information to any and all vendors who support the data center space. I guess our primary difference from most other orgs is that we are not sponsored and we only allow owner/operators to join the group. We don't have any sales, marketing, business development, recruiting, or press folks in the group.

DCD: What are some of the immediate things you hope Data Center Pulse accomplishes?

Mark Thiele: In the near term, our first major accomplishments have revolved around making the group a functioning entity. We've established a board of directors, and we've grown the group to over 600 members. The members represent over 20 countries and virtually every industry. Our next big adventure is the upcoming Data Center Pulse Summit that will be held in February [next week, actually: Feb. 17-19, 2009 in Santa Clara, CA --Jay]. We will be presenting findings generated by the group at the following AFCOM chapter meeting and [Teladata's] Technology Convergence Conference.

...

Thanks, Mark, for the interview. In addition to the data center energy-efficiency resources that Mark mentioned, we have a few up on the Cassatt site as well, with some focused on more general green data center information and some focused on recommendations around server power management issues.

Rich Miller at Data Center Knowledge has also posted a more detailed overview of the Data Center Pulse Summit, if you want more information about the event.