Friday, February 27, 2009

Recoverability: How an internal cloud makes it easier to get your apps back up and running after a failure

OK, back to talking about the "-ilities" this week and how cloud computing can help you address one of the key issues you're concerned with in your data center. On deck is the recoverability of your IT environment when run on an internal cloud infrastructure (like Cassatt Active Response).

As discussed in my last post, there can be a fair amount of organizational and operational change required to adopt an internal cloud infrastructure, but there are many benefits from taking on the task. The next couple of posts will outline one of the major benefits (recoverability) that comes from separating the applications from their network, computing, and storage resources, and how this separation allows for both intra-datacenter and inter-datacenter recoverability of your applications.

Five Internal Cloud Recovery Scenarios
To structure the discussion, I'm going to take you through several levels of recoverability as a function of the complexity of the application and the failure being addressed. Taking this approach, I've come up with five different scenarios that start with the recovery of a single application and end with the recovery of a primary datacenter into a backup datacenter. The first four scenarios (intra-datacenter recovery) are covered in this post and the last one (inter-datacenter recovery) will be covered in my next post. So, enough background: let's get into the discussion.

Application Recovery
Let's begin with the simplest level of recoverability and start with a single application (think of a single workgroup server for a department that might be running things like a wiki, mail, or a web server). From talking to many admins over the years, the first thing they do when they find that a given server has dropped offline is perform a "therapeutic reboot" to see if that gets everything back to a running state. The reality of IT is that many of the frameworks/containers/applications you run leak memory slowly over time, and a reboot is the easiest way to clean up the issue.

In a traditional IT environment, a monitoring tool watches the servers under management, and if the monitors go offline an alert/page is generated to tell the admin that something is wrong and needs their attention. With an internal cloud infrastructure in place, the creation of monitors for each managed server comes for free (you just select in the UI how you want to monitor a given service from a list of supported monitors). In addition, when the monitor(s) drop offline, you can configure policy to tell the system how you want the failure handled.

In the case of an application that you'd like rebooted prior to considering it failed, you simply instruct the internal cloud infrastructure to power cycle the node and to send an alert only if it doesn't recover correctly. While this simple level of recoverability is interesting in making an admin more productive (less chasing unqualified failures), it really isn't that awe-inspiring, so let's move on to the next level of recovery.
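
To make the policy concrete, here's a minimal sketch of the reboot-then-alert idea in Python. This isn't Cassatt's policy language; the power_cycle and alert callables are hypothetical hooks into whatever power controller and paging system you already have.

```python
import socket
import time

def service_up(host, port=80, timeout=5):
    """Hypothetical monitor: can we open a TCP connection to the service?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def recover_single_app(host, power_cycle, alert, reboots=1, settle_secs=180):
    """Reboot-then-alert policy: power cycle the node on failure and only
    page the admin if the service still doesn't come back."""
    for _ in range(reboots):
        power_cycle(host)            # the "therapeutic reboot"
        time.sleep(settle_secs)      # give the node time to come back up
        if service_up(host):
            return True              # recovered; no page needed
    alert(f"{host} did not recover after {reboots} power cycle(s)")
    return False
```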

Hardware Recovery
In this scenario, we'll pick up where the last example left off and add an additional twist. Assume that the issue in the previous example wasn't fixed by rebooting the node. With an internal cloud infrastructure, you can set the policy for the service to not only reboot the service on failure (to see if that re-establishes the service) but also to swap out the current piece of hardware for a new one (complete with reconfiguring the network and storage on the fly so that the new node meets the application's requirements).

Let's explore this a bit more. With current IT practices, if you are faced with needing to make a singleton application available (like a workgroup server), you probably start thinking about clustering solutions (Windows or U*nx) that let you have a single primary node running and a secondary backup node listening in case of failure. The problem with this approach is that it is costly (you have to buy twice the hardware), your utilization is poor (the backup machine sits idle), and you have to budget for twice the power and cooling because the backup machine sits powered on but idle.

Now contrast that with an internal cloud approach, where you have a pool of hardware shared among your applications. In this situation, you get single-application hardware availability for free (just configure the policy accordingly). You buy between 1/5th and 1/10th the backup hardware (depending on the number of concurrent failures you want to be resilient to), since the backup hardware can be shared across many applications. Additionally, the spare hardware you do purchase sits powered off while awaiting a failure, so it consumes zero power and cooling.
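
A quick back-of-the-envelope comparison, with made-up numbers, shows where that 1/5th-to-1/10th figure comes from: the spare pool is sized to the concurrent failures you plan for, not to the number of applications.

```python
apps = 100                 # applications, each on its own server
concurrent_failures = 10   # failures you want to ride out at once

# Traditional active/passive clustering: one idle standby per app.
standby_servers = apps                # 100 extra machines, powered on

# Shared spare pool in an internal cloud: spares sized to the number of
# concurrent failures you plan for, powered off until needed.
spare_pool = concurrent_failures      # 10 extra machines, powered off

print(standby_servers / spare_pool)   # 10.0 -> one tenth the backup hardware
```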

Now, that level of recoverability is interesting, but IT shops typically have more complex n-tier applications where the application being hosted is spread across multiple application tiers (e.g. a database tier, a middleware/service tier and a web tier).

Multi-tier Application Recovery
In more complex application types (including those with four tiers, as is typical with JBoss, WebLogic, or WebSphere), an internal cloud infrastructure continues to help you out. Let's take JBoss as our working example, since many people have experience with that toolset. JBoss' deployment model for a web application will typically consist of four tiers of services that work in concert to provide the desired application to the end user. There will be a database tier (where the user's data is stored), a service tier that provides the business logic for the application being hosted, and a web tier that interacts with the business logic and dynamically generates the desired HTML page. The fourth and final tier (which isn't in the IT environment) is the user's browser, which actually renders the HTML into human-readable form. In this type of n-tier application stack there are implicit dependencies between the different tiers, usually managed by the admins who know what the dependencies are for the various tiers and, as a result, the correct order for startup/shutdown/recovery (e.g. there is no point in starting up the business or web tiers if the DB is not running).

In the case of an n-tier application, an internal cloud computing infrastructure can help you with manageability and scalability as well as recoverability (we're starting to pull out all the "-ilities" now…). We'll cover them in order and close with recoverability, as that's this week's theme. On the manageability front, an internal cloud infrastructure can capture the dependencies between the tiers and orchestrate their orderly startup/shutdown (e.g. first verify that the DB is running, then start the business logic, and finally finish with the web tier). This means that the specifics of the application are no longer kept in the admin's head, but rather in the tool, where any admin can benefit from the knowledge.
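
Here's a minimal sketch of that manageability piece: capture the inter-tier dependencies once and derive the startup and shutdown order from them, instead of keeping it in an admin's head. The three-tier layout is the hypothetical JBoss-style stack from above.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Tier -> tiers it depends on (a hypothetical JBoss-style stack).
dependencies = {
    "database": set(),
    "business_logic": {"database"},
    "web": {"business_logic"},
}

startup_order = list(TopologicalSorter(dependencies).static_order())
shutdown_order = list(reversed(startup_order))

print(startup_order)   # ['database', 'business_logic', 'web']
print(shutdown_order)  # ['web', 'business_logic', 'database']
```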

On the n-tier scalability front, usually a horizontal rather than vertical scaling approach is used for the business and web tiers. With an internal cloud infrastructure managing the tiers and using the monitors to determine the required computing capacity (to do this with Cassatt Active Response, we use what we call Demand-Based Policies), the infrastructure will automatically increase/decrease the capacity in each tier as a function of the demand being generated against the tier.
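
And here's a sketch of what a demand-based policy boils down to for a horizontally scaled tier: the monitors report load, and the tier grows or shrinks between the bounds you set. The thresholds and function names are illustrative, not Cassatt's actual API.

```python
def rebalance_tier(active_nodes, avg_load, min_nodes, max_nodes,
                   scale_up_at=0.75, scale_down_at=0.30):
    """Return how many nodes the tier should run, given the average load
    (0.0-1.0) reported by the tier's monitors."""
    if avg_load > scale_up_at and active_nodes < max_nodes:
        return active_nodes + 1      # provision another node from the pool
    if avg_load < scale_down_at and active_nodes > min_nodes:
        return active_nodes - 1      # release a node back to the pool
    return active_nodes

# e.g. a web tier at 82% average utilization on 4 nodes, allowed 2..10 nodes
print(rebalance_tier(4, 0.82, min_nodes=2, max_nodes=10))  # 5
```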

Finally, on the recoverability front, everything outlined in the last recovery scenario still applies (restart on failure and swap to a new node if that doesn't work), but now you also get the added value of being able to coordinate restarts across the tiers. As an example, in many cases connection pooling is used in the business tier to increase the performance of accessing the database. One downside (depending on the solution used for managing the connection pool) is that if the database goes away, the business tier has to be restarted to re-establish the connections. In a typical IT shop this would mean that the admin would have to manage the recovery across the various tiers. However, in an internal cloud computing environment, the infrastructure has enough knowledge to know that if the DB went down, there is no point in trying to restart the failed business tier until the DB has been recovered. Likewise, there is no point in trying to recover the web tier while the business tier is offline. This means that even if the root failure cannot be addressed by the infrastructure (which can happen if the issue is not transient or hardware-related), the admin can focus on recovering the specific item that has failed, and the system will take care of the busywork associated with restoring the complete service.
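
That recovery gating can be expressed with the same dependency map used for startup ordering: a failed tier is only eligible for restart once everything it depends on is healthy again. The healthy() lookup below is a stand-in for whatever monitors you actually have.

```python
dependencies = {
    "database": [],
    "business_logic": ["database"],
    "web": ["business_logic"],
}

def recoverable_now(tier, healthy):
    """A failed tier is only worth restarting once everything it
    depends on is healthy again."""
    return all(healthy(dep) for dep in dependencies[tier])

# If the database is still down, hold the business and web tiers:
healthy = {"database": False, "business_logic": False, "web": False}.get
print(recoverable_now("business_logic", healthy))  # False -> don't restart yet
```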

Intra-datacenter recovery
OK, we're now hosting non-trivial applications within a cloud infrastructure, so let's take that same n-tier application example, but add the extra complexity that there are now multiple n-tier applications being managed. What we'll do, though, is throw in a datacenter infrastructure failure. This example would be relevant for issues like a Computer Room Air Conditioning (CRAC) unit failure, the loss of one of the consolidated UPS units, or a water leak (all of which would not cause a complete datacenter failure, but would typically knock out a block of computing capacity).

Before we jump into the failure, we need to explore one more typical practice for IT shops. Specifically, as more applications are added to an IT environment, it is not uncommon for the IT staff to begin to stratify the applications into support levels that correspond to the importance the business places on each application (e.g. revenue systems and customer-facing systems typically have a higher availability requirement than, say, a workgroup server or test and development servers). For this example, let's say that the IT department has three levels of support/availability, with level one being the highest priority and level three being the lowest. With Cassatt Active Response, you can put this type of policy directly into the application and allow it to optimize the allocation of your computing resources to applications per your defined priorities. With that as background, let's walk through the failure we outlined above and see what Cassatt Active Response will do for you in the face of a major failure in your datacenter (we'll take the UPS example for this discussion).

We'll assume that prior to the failure the environment is in a steady state, with all applications up and running at the desired levels of capacity. At this point, one of the shared UPS units goes offline, which affects all compute resources connected to that UPS. This appears to Cassatt Active Response as a number of node failures, which go through the same recovery steps outlined above. However, as mentioned above, you will usually plan for a certain number of concurrent failures and keep that much spare capacity available for deployment. Unfortunately, when you lose something shared like a UPS, the number of failures quickly consumes the spare capacity available and you find yourself in an over-constrained situation.

This is where being on a cloud infrastructure really starts to shine. Since you have already identified the priority of the various applications you host in the tool, it can dynamically react to the loss in compute capacity and move resources as necessary to maintain the service levels of your higher-priority applications. Specifically, in this example let's assume that 30% of the lost capacity was in your level 1 (most important) applications. The infrastructure will first try to shore up those applications from the excess capacity available, but when that is exhausted, it will start repurposing nodes from lower-priority applications to re-establish the service level of the higher-priority applications. Now, because the cloud infrastructure can manage the power, network, and images of the applications, it can do all of this gracefully (existing lower-priority applications get gracefully shut down prior to their hardware being repurposed) and without user interaction. Within a short period of time (think tens of minutes), the higher-priority applications have been re-established at their necessary operating levels.
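
In rough Python, the reallocation logic looks something like this (the numbers and tier names are invented): drain the free pool first, then borrow from the lowest-priority tiers until the shortfall in the level 1 applications is covered.

```python
def repurpose(shortfall, free_pool, tiers_by_priority):
    """Cover a capacity shortfall in a priority-1 tier: drain the free
    pool first, then borrow nodes from the lowest-priority tiers."""
    takes = []
    while shortfall and free_pool:
        takes.append(free_pool.pop())
        shortfall -= 1
    for tier in reversed(tiers_by_priority):      # lowest priority first
        while shortfall and tier["nodes"] > tier["min_nodes"]:
            tier["nodes"] -= 1                    # gracefully shut one down
            takes.append(f"node borrowed from {tier['name']}")
            shortfall -= 1
    return takes

tiers = [  # ordered highest to lowest priority
    {"name": "revenue-app", "nodes": 10, "min_nodes": 10},
    {"name": "workgroup",   "nodes": 6,  "min_nodes": 2},
    {"name": "dev-test",    "nodes": 8,  "min_nodes": 0},
]
print(repurpose(shortfall=6, free_pool=["spare-1", "spare-2"],
                tiers_by_priority=tiers))
```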

The final part of this example is what occurs when the issue causing the failure is fixed (in our example, the UPS is repaired and power is re-applied to the affected computing resources). With a cloud infrastructure managing your environment, the lower-priority applications that were affected in support of shoring up the higher-priority applications all just recover automatically. Specifically, once the power is reapplied, all you have to do is mark the hardware as available and the system will do the rest of the work to re-inventory and re-allocate the hardware back into those tiers that are running below their desired capacity levels.

Well, there you have it. We've walked through a variety of failure scenarios within a datacenter and discussed how an internal cloud infrastructure can offload much of the busy/mundane work of recovery. In the next post I'll take the example we just finished and broaden it to include recovery into a completely different datacenter. Until then…

Wednesday, February 25, 2009

Sorry, VMware: you don't need virtualization for cloud computing

The VMworld Europe PR blitz is in full swing (hats off to many of my old BEA marketing compatriots!). And as it was at VMworld in Vegas back in September, VMworld Europe is all about the cloud. The only problem (if you're VMware) is that the cloud isn't all about virtualization.

In fact, you don't need virtualization for cloud computing. Despite what they'd like you to think. Blasphemy? Maybe, but let me explain...

While I was away on vacation last week a great discussion took place on this topic, started by Christofer Hoff's simple, incomplete thought that he lobbed to his readers: "How many of you assume that virtualization is an integral part of cloud computing? From your perspective do you assume one includes the other? Should you care?"

The thing that got him asking the question was a difference in the way Google and Amazon deliver and define their respective cloud services (PaaS v. IaaS, from my perspective). I, too, noticed an assumptive thread that consistently weaves its way through most conversations about cloud computing (both the internal and external varieties): the assumption is that when people are talking about dynamic, on-demand, cloud-style resources, of course there will always necessarily be virtualization underneath it all.

Not true, actually.

First, though, credit needs to be given to market-leader VMware (currently sipping their share of kir royales in Cannes, I'm sure) and all the virtualization providers, really, for changing the conversation about how a data center can be run. For years, siloed and static application stacks kept anything dynamic or elastic from getting very far. Virtualization changed that. It allowed a first step, a separation of underlying hardware from the software running on top.

Virtualization has really opened up the thinking of IT ops folks. Now that these previously inseparable items are able to be sliced, diced, moved around, expanded, contracted, and the like, all bets are off. In fact, it's probably the change in thinking brought about by virtualization that allows us to even consider talking about cloud computing in the first place.

However, virtualization isn't the silver bullet that enables cloud computing. In fact, it isn't even required. It's one of several types of technologies that can be employed to help deliver the service your business requires from a set of compute resources, either in your data center (an internal cloud), outside your data center (an external cloud), or both (a hybrid of the two).

Internal clouds will mean a mix of physical and virtual

When we at Cassatt talk to customers about implementing internal clouds, it's a discussion about using what you already have running in your data center -- apps, servers, virtual machines, networks -- with no changes, but applied in a different way. Instead of the compute resources being dedicated to particular apps, the apps (with the help of our software) pull what's needed from a big pool (cloud) of available compute supply. If our customers are any indication about what data centers really look like, some of those compute resources will be physical, some virtual. And, with the economy the way it is, people want to squeeze every bit of capability out of whatever mishmash they are running, even if it's not ideal. That means there will be a little bit of everything.

Alessandro Perilli of virtualization.info quoted VMware CEO Paul Maritz as saying that starting later this year, when the first generation of vSphere platforms will be out, there will be no technical reason not to virtualize 100% of your data center. I bet not. Why? We asked the VMware users that came by our booth at VMworld in September '08 that same question: "You're eventually going to virtualize everything, right?" Every one of them responded, "No, no, of course not." They cited cost, performance, management, and a host of other reasons. Translation: reality gets in the way. So, if you aren’t going to virtualize everything, aren't you still going to want to make the best use of all your data center resources? The answer is yes from what we hear.

External and hybrid clouds: will anybody have the same infrastructure?

External clouds might be a different story. You'll use whatever it is that your external service providers have. They could have a fully virtualized set-up. Or, they may have a more mixed environment. Again, Perilli quoted Maritz saying that virtualization is the only viable way to do cloud computing. Nope. It's an option. Of course, if you can apply your internal IT ops expertise to your external cloud work, too, it's a win (as Christofer Hoff notes in another post). It just won't always be possible.

When it comes to hybrid clouds -- moving from internal to external clouds and back, or some federated way of mixing and matching compute power from both -- the ability to be able to leverage physical and virtual, and even to leverage VMware, Citrix, Microsoft, Parallels Virtuozzo, etc., is going to be really important. Nobody is going to have the same stuff. The real world is heterogeneous.

Which brings me to some general commentary about VMware's European announcements:

Things that sounded good from the VMware announcements

There were definitely some items that sounded positive from VMworld Europe this week. Maritz talked about interoperability, and was quoted by Alex Barrett of SearchServerVirtualization.com as saying, "What we fear is the emergence of a couple highly proprietary uberclouds. There's an old joke about a California hotel that you can check in to but that you can't check out of. We don't think that should happen." (Of course, I always thought that was the Roach Motel, but maybe the point's the same.)

Kudos to Maritz for talking up internal clouds and the move to hybrid/federated clouds. Chuck Hollis from EMC underscored Maritz's validation of the private cloud concept, calling it a "big thing" in his post, because "of all the different cloud models I've heard...this is the first one that I think can work for real-world enterprise IT." In creating their "software mainframe," Maritz talks about wanting to help change the decision about whether to run a workload on an internal or external cloud from being a can't-change-it architectural one to being an operational one. Management will be at the "service level and not at the plumbing level." To me, that sounds like the right vision, and a great reason to base your infrastructure on policy-based automation. But I'm a little biased on that topic.

Some things VMware forgot to mention in Cannes (and not by accident)

As with the initial announcement of vCloud and the VDC-OS (now vSphere) back in September, though, there are some underlying problems when you try to fit the vision slides with the reality (Ken Oestreich did a good overview of these back then). First, you'll probably have more than one virtualization technology in your IT systems. This is either by design or by happenstance (like, say, from acquisitions), but it's going to be the reality in many cases. VMware's vision of "hyper-vising the data center" (as Chris Mellor at The Register calls it) ignores that.

Second, and more fundamentally (and as I noted already), you'll have both physical and virtual servers in your data center. You'll want to think through an internal cloud strategy that takes that into account.

Reuven Cohen’s post on the announcements hits the nail on the head on both of these points: "As for being interoperable, VMware is saying that its various management tools will only work on top of the VMware hypervisor. In other words, physical servers and servers virtualized by Microsoft, Citrix or any other vendor will not be compatible with the vCloud initiative. Summarized, we're interoperable as long as it's VMware."

It's not unexpected that VMware would ignore these points. But *you* need to keep them in mind in trying to reconcile cloud computing versus virtualization. Virtualization can certainly help deliver a cloud computing infrastructure either inside your data center, or via a cloud service provider. But it's only one of the components, not the main driver. Delivering your business requirements as efficiently as possible -- that's the important thing.

I recommend digging through the comments in Hoff's blog entry that I mentioned at the start of this post. There are a stack of good comments there from Alessandro Perilli of virtualization.info, Andre Gironda, Anthony Chaves, James Urquhart, and others. James says "abstraction is what is important to cloud computing, not virtualization -- a big difference." Mike Rothman of securityincite.com views "cloud computing as what is being delivered and virtualization as one of the ways to deliver it."

Great points. See what I miss when I go on vacation?

Monday, February 23, 2009

When an elephant sits on your blade center

Trade shows should be considered a full-contact sport. Especially if you are a blade center heading to a Cassatt booth.

To show off our software at the Gartner Data Center Conference or other industry data center gatherings, we not only ship a couple of smiling, knowledgeable employees, but we also send a mobile computing cluster. Thing is, we've had a few problems with it arriving in one piece.

And you thought trade show cocktail parties were rough.

Our mobile computing cluster is *supposed* to be an easy way to show off some of the cool tricks you can pull off with our software (or actually, to watch our software show itself off -- policy-based automation can do stuff like that). This mobile cluster is essentially a data-center-in-a-box. Unlike the big Sun, Rackable, and other "18-wheeler" or "mobile-home"-style containerized data centers, this is more like an oversized suitcase, stuffed with servers.

As you can see from the picture here, something happened on the way to the Forum. More on that in a moment.

How it's supposed to work

The idea for this mobile computing cluster is simple: put 4 servers in a 2' x 3' shipping case on wheels. Include a network (and switch) and voila!, you have a simplistic data center that you can wheel into a trade show booth for product demonstrations. One of the cluster's servers has the Cassatt Active Response controller software on it. The others are running permutations of Windows, Linux, and/or VMware ESX. All the servers in this case were Dell blade servers.

To begin the "showing off" part, then, all we usually have to do is set up Cassatt Active Response to show two scenarios. First is a simulation of a dev/test environment moving back and forth between used and unused states (say, from during the workday to after hours and back again). In response to the policies you set, our software gracefully shuts down the software and servers based on a time schedule. Second is a simulation that shows how the software dynamically handles changing demand on our sample applications. Applications are given priorities, and as load on those apps increases (one of our smiling, knowledgeable employees helps this along), Cassatt automatically provisions new physical or virtual servers to handle the load, turning on cold, bare metal and laying down the operating system and application image (plus virtual machine where appropriate) to enable the app to scale out. As the load on the app drops, Cassatt de-provisions unnecessary servers, turns them off, and returns them to the spare pool of resources. The fun you can have with demand-based policies, eh?
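
For the curious, the time-schedule half of that first scenario reduces to something like this sketch; the power_on/power_off hooks are placeholders for whatever actually flips the servers, and the hours are just an example.

```python
from datetime import datetime

WORK_HOURS = range(8, 18)   # 8am-6pm: when the dev/test pool should be up
WORK_DAYS = range(0, 5)     # Monday-Friday

def desired_state(now=None):
    """Time-based policy: is the dev/test environment supposed to be running?"""
    now = now or datetime.now()
    return now.weekday() in WORK_DAYS and now.hour in WORK_HOURS

def enforce(nodes, power_on, power_off):
    if desired_state():
        for n in nodes:
            power_on(n)      # bring the environment back for the workday
    else:
        for n in nodes:
            power_off(n)     # gracefully park it after hours
```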

What happened in Vegas nearly stayed in Vegas

As I mentioned at the outset, at the recent Gartner Data Center Conference in Vegas, things didn't quite go as planned. When we got to the booth, our sturdy data-center-in-a-box looked like it had been sat upon by a large elephant. Or had been on the receiving end of a very angry forklift. OK, so it wasn't anything like what the GoGrid guys show at nohardware.com, but something very heavy had definitely come into contact with the crate, warping the frame so much that the servers were no longer sitting squarely in their tracks. In fact, there wasn't much "squareness" left at all.

Of course, since we were in Vegas, we took bets on how many servers would actually even turn on.

I figured all was lost from a demo standpoint. I started thinking about what other flashing gizmos we could include in the booth to attract attention if the software was out of commission. But hang on a minute, our techie gurus said. This seems like exactly the kind of thing our software should be able to do: it should be able to help the apps in our mini-data center recover from minor setbacks like having all available servers dislodged by unidentified blunt trauma. You know, the kind of thing that happens in your data center every day.

So, the apps already had service levels assigned to them. The Cassatt control node had a pool of hardware to work with, uncertain though the quality was. We booted up the control node, crossed our fingers, and let it do its thing.

Truth be told, we also used the next few seconds to glance over to see where the fire extinguishers were, and how close our nearest usable exit actually was.

Software that finds and uses whatever available resources you have for your apps

The good news: the controller turned on. And, after our techies fiddled with a network cable or two, our software was talking to the power controllers for each of the other servers. It booted each server in turn as it looked for working compute hardware to support the demo application at the service level we had set. It got a couple of the servers working. The one remaining server, not so much. When one of the most damaged servers didn’t respond appropriately, Cassatt Active Response "quarantined" it: putting it into the maintenance pool for a smiling, knowledgeable human to investigate. Of course, we knew already that there was, um, a hardware problem.

So, a happy ending. We were able to show off our software and came away with a great little true-life application resiliency story out of the deal.

Even better, it turned out that our booth was right next to the bar. And we had glowing green swizzle sticks to hand out. But that was just the Vegas trade show gods trying to make it all up to us somehow, I think.

The really happy ending is that after we (carefully) shipped the damaged mobile cluster back to our offices and pulled each of the servers out, it turned out that each blade survived the ordeal. We should probably let our friends at Dell know this. Their blades are officially Cassatt trade-show proof.

The shipping crate, however, has been retired to a corner of our headquarters offices that we call The Dark Side, where it awaits its fate as a vaguely modern coffee table or other such creative use in which having its sides at 90 degree angles to each other is not a requirement.

Saturday, February 14, 2009

Who will save us now? Not Silicon Valley, apparently, but optimism lingers

So apparently new Treasury Secretary Timothy Geithner is not the superhero the stock market was expecting to rescue us all, banks included, from an economy in a downward spiral. At least, that's the message that the markets telegraphed (broadcast? texted?) following the introduction of the Obama administration's government bank bailout plan, part deux, earlier this week.

Uh oh. Now what? We'll have to look elsewhere for our savior.

Hey, how about Silicon Valley?

The credentials are certainly here. Silicon Valley was a contributor to a good deal of economic growth in times past: microchips in the ‘70s, PCs in the ‘80s, all things Web since then. The innovation engine of Silicon Valley even managed to right itself after that whole dot com bubble fiasco earlier this century.

So, how about this time around? The global economy certainly could use the help. And aren't we currently in the midst of inventing, and working the kinks out of, among other things, this cloud computing thing that should go great in leaner times: a new way to do IT that delivers only the computing power you need without all that capital outlay? (For more on cloud computing in this economy, you can see what the 451 Group, IDC's Al Gillen, and cloud journalist Derrick Harris have to say.) Sounds ideal, right?

Not likely, say a few folks who’ve watched this kind of thing before.

The recession may have arrived too early for cloud computing to help

According to Gartner analyst Mark Raskino, the Great Recession may have arrived just a little too early for cloud computing to have a role in helping us get out of it. "…[C]ompanies looking for a significant new round of IT cost cuts are having to tackle the challenge creatively," Mark wrote in his blog recently. "That makes many of them very interested in cloud ideas. …The problem is that the cloud isn't ready for corporate prime time. …[Vendors] are doing it as fast as they can -- but a lot of it just isn't ready for the mainstream, moderately risk averse centre of the market. Which is a shame." Understatement alert.

Looking beyond just the realm of cloud computing, Fortune's Jessi Hempel chronicled Silicon Valley’s growth-accelerating history in a recent article, and came to a similar conclusion: "Alas, economists and executives believe that this time tech won't lead the country out of its slump."

Deeper problems with Silicon Valley?

Part of the issue is the near-complete evaporation of money that can be borrowed, the fuel for a lot of the innovation in Silicon Valley. That piece of the puzzle wasn't working against us after the dot com bubble burst. It certainly is working against us now. With no fuel and no way off the highway (as in, no IPOs), the trip of an innovative start-up is a lot rougher right now. Even so, some companies (like business-model-challenged Twitter) are still able to attract new funding.

Other commentators, like a few highlighted in the December Steve Hamm cover story for BusinessWeek (which I don't think made him too popular with entrepreneurs or even his own staffers in the Valley), question Silicon Valley's innovation more fundamentally, citing "short-term thinking" and "risk-aversion." He quoted Andy Grove, former CEO of Intel, saying that today's start-ups "give us refinements, not breakthroughs." The man who said "only the paranoid survive" now says Silicon Valley doesn't worry enough (unless they're talking about an exit strategy).

Short-term pessimism, but still long-term optimism

Hamm and others acknowledge ups and downs in innovation in the Valley. The region has been one of the greatest examples of creative destruction (whereas Detroit, as discussed on Twitter recently by James Governor, Tim O’Reilly, and others: not so much).

Another business press editor that I had lunch with this week had a couple interesting comments on this whole situation. First off, the situation is indeed dire. "This is the kind of downturn where we come out different as a society," he said. But from that comes good news for Silicon Valley: "Every downturn forces a tech change. This is good news for people in cloud."

So back to the cloud computing move as a potential economic catalyst, then. Even Mark Raskino at Gartner speculates that the current timing misalignment between what cloud computing can do and what the economy needs now (two items that he described as “annoyingly adrift” of each other) might just accelerate the whole market evolution out of necessity.

Bill Coleman, our CEO here at Cassatt, has seen a downturn or two in his career at BEA, Sun, and a couple other places. He (quoted in the Fortune article I mentioned above) is somewhat pessimistic about Silicon Valley's ability to help launch the recovery in the short term, but reiterates his long-term belief that this place understands how to invent things -- and reinvent itself. "I have every confidence in the Valley," he says.

"The very, very early adopters, I believe, are the people who use a downturn to re-engineer themselves," Bill told Quentin Hardy of Forbes in a video interview posted this week. He rattled off Intel, Cisco, Goldman Sachs, and FedEx as examples of companies who have successfully increased investment during previous economic slides, and have been rewarded for it. Investing while everyone else is running for cover means "you get to buy everything at a discount," Bill said. (More tidbits from Bill on managing a business through a downturn turn up here in Forbes.com as well.)

And as for Silicon Valley, specifically? The Fortune article points out that even in the economic depths of late 2001, a couple of guys were already working on a start-up to try to make money on Web searches, creating in the process a company whose name is now not just a noun, but a verb. Says Jessi, "the moment that conventional wisdom suggests that innovation is dead is probably the perfect time to go peeking inside Palo Alto garages again."

If you're interested in jumping into the discussion about Silicon Valley's reaction to the recession, the Churchill Club is sponsoring a panel at the Stanford Law School on Wednesday, Feb. 18, 2009, featuring Lisa Lambert of Intel Capital, Bill Coleman from Cassatt, and Patricia Sueltz of LogLogic. Joseph Grundfest, co-director of the Rock Center on Corporate Governance and former commissioner of the SEC, is moderating.

Wednesday, February 11, 2009

VMware's Thiele: Never more data center change than right now

Making headway in running a data center is hard. Even if you've worked on it a lot. The guy I'm talking to in today's Cassatt Data Center Dialog interview is someone who -- despite the curveballs that IT and the business it supports can throw at you -- has been consistently making big strides in how data centers are run: Mark Thiele.

Mark is director of R&D business operations for virtualization giant VMware and as part of that job runs data centers in Palo Alto, Massachusetts, Washington, and Bangalore, totaling approximately 85,000 square feet. I've also seen Mark in action at his previous job, where his ideas and initiative helped shape parts of our Active Power Management technology and what became Cassatt Active Response, Standard Edition.

I last saw Mark speak at the Gartner Data Center Conference in December where he and Mark Bramfitt of PG&E talked about the state of green IT. I used that as a starting point for our interview:

Jay Fry, Data Center Dialog: In your panel at the Gartner Data Center Conference, you talked about how it's important to bring facilities and IT folks together for improved IT operations. You also touched on bringing operations and process management into the R&D organization and "stretching" people (including yourself) beyond their normal areas of expertise. Do you have some specific suggestions on how to do this or some examples of what's worked for you?

Mark Thiele, VMware: There's no simple answer here. I've been working in IT for a long time and as a result I'm biased in my belief that IT, when used appropriately, can solve almost any problem. All joking aside, the reality is that many of the folks who work in IT look at how things get done a little differently than everyone else. An IT person's first reaction to a job that needs to get done is "how can I write a script to automate that?" I've utilized this to my advantage by looking at seemingly intractable "non-IT" problems from the IT perspective and shining a new light on them. This IT-centric focus helps to stretch non-IT folks and IT folks alike. As they work together to solve shared problems, they come to realize the benefit of the shared experience and uncommon backgrounds. Once this happens, improving operations between groups becomes less of a headache.

DCD: You discussed having a "bridging the gap" person who can look at a data center as a holistic system. How do you find someone to fill that role? What skills should folks look for?

Mark Thiele: This can be a very difficult role to fill. The ideal person is someone who has a strong understanding of IT infrastructure, but also an understanding of the importance of dealing with the entire data center as a system. In my case I was able to identify a strong candidate and convince them to take on the new role by explaining the potential opportunity for improvement and the bottom-line impact for the company. The data center has become one of the most commonly referenced areas of opportunity in business today. There has never been more focus and change in data centers than there is right now. This kind of change and business importance can be very enticing to forward-thinking and career-driven IT staff.

DCD: One of the things you mentioned in Vegas was that there is a 3-5 year gap between when something is proven to be a good idea for improving IT operations and when people are actually using it. You said you’d like to find a way to shrink that time period. Any specific examples of things that seem "proven" but aren't being used yet? Any ideas how to shrink that time gap?

Mark Thiele: The dynamics of why it often takes years to implement new technology in the data center are many. These dynamics include risk avoidance, cost of entry, myth, intractable staff, and/or inflexible data center facilities. However, the aforementioned factors still don't explain why it oftentimes takes 5 years or even more for proven technologies to be implemented.

The delays are associated with the inability to truly measure and understand the risk/reward of making change in the data center. As an industry we need to carry more responsibility for looking at the long-term benefits of new technology vs. the short-term potential for disruption in the environment.

Take virtualization as an example. You can pick any of 1,000 white papers that will explain why implementing a major virtualization strategy in your data centers is the best way to improve operations and drive down cost of doing business. Yet almost every day I talk to folks who say "VMware works great, [but] I just can't risk it in production" or "my software provider told me they won't support it on a VM." Thousands of world-class organizations have large "production" VMware solutions installed; how many more do we need before everyone agrees it works?

Part of this problem is aversion to any type of risk, perceived or real. If the potential benefit is better Disaster Preparedness, faster provisioning, higher availability, and a lower cost of doing business, it should be OK to accept a certain amount of calculated risk. As leaders in business and the IT space we should be obligated to ensure that our teams understand that intelligent risk is OK; in fact, it's expected.

DCD: You suggested that people get started on energy-efficiency projects in bite-sized pieces, to show some quick wins. Any specific suggestions about how someone should approach creating that list or any particular projects you would suggest they start with?

Mark Thiele: There is a mountain of information available that can help with identifying power/efficiency opportunities in the data center. There's the Green Grid, APC Data Center University, Emerson's Efficient Data Center portal, LinkedIn groups like Data Center Pulse and many more. Once you've gone through some of the information on what can be done, you need to audit your current data centers to identify and prioritize the gap resolution activities. This prioritized list of gaps or opportunities should then be built into a program. I would highly recommend that anyone initiating a large effort like this should ensure they capture current state relative to space, power, and cooling so that you can measure and report the improvements to your management.

DCD: Who else within someone's company or organization should the IT operations people ally themselves with (besides facilities) to make progress on the data center energy efficiency front?

Mark Thiele:
Finance is your friend. If you can demonstrate the potential savings and long term cost of doing business improvements, they will likely become ardent supporters of the effort.

DCD: What has surprised you most about what’s going on in this space today?

Mark Thiele: That it's taken so long for data centers to get this much attention. It's about time for the IT Infrastructure folks to be getting the attention they deserve.

DCD: You mentioned Data Center Pulse, the group of data center operations people that you helped found via LinkedIn. Given the wide range of existing organizations focused on different aspects of the data center, why did you feel the need to create a new one?

Mark Thiele: Our primary driver for creating Data Center Pulse was to give data center owner/operators a chance to have a direct and immediate influence on the industry that supports them. We are effectively a working group that will deliver information and opportunity-for-improvement information to any and all vendors who support the data center space. I guess our primary difference from most other orgs is that we are not sponsored and we only allow owner/operators to join the group. We don't have any sales, marketing, business development, recruiting, or press folks in the group.

DCD: What are some of the immediate things you hope Data Center Pulse accomplishes?

Mark Thiele: In the near term, our first major accomplishments have revolved around making the group a functioning entity. We've established a board of directors, and we've grown the group to over 600 members. The members represent over 20 countries and virtually every industry. Our next big adventure is the upcoming Data Center Pulse Summit that will be held in February [next week, actually: Feb. 17-19, 2009 in Santa Clara, CA --Jay]. We will be presenting findings generated by the group at the following AFCOM chapter meeting and [Teladata's] Technology Convergence Conference.

...

Thanks, Mark, for the interview. In addition to the data center energy-efficiency resources that Mark mentioned, we have a few up on the Cassatt site as well, some focused on more general green data center information and some focused on recommendations around server power management issues.

Rich Miller at Data Center Knowledge has also posted a more detailed overview of the Data Center Pulse Summit, if you want more information about the event.

Tuesday, February 10, 2009

Is your organization ready for an internal cloud?

I'm going to take a post off from the "-ilities" discussion (Disaster Recoverability will be up next) and spend a little time talking about the technical and organizational challenges that many Fortune 1000 companies will face in their move to an internal cloud computing infrastructure.

Since I've lived through a number of these customer engagements over the past few years I thought I'd write up my cheat sheet that basically outlines the things that you need to be aware of and get worked out in your organization before you embark on a move to cloud computing. If you don't pay attention to these, I predict you'll be frustrated along the way by the organizational tension that a move such as this will cause.

Internal clouds are a game-changer on the economics front for companies that have large investments in IT. Creating a cloud-style architecture inside your own data center allows a company to unlock the excess capacity of its existing infrastructure (often locked into vertical stovepipes supporting specific applications). Once this excess capacity is released for use within the IT environment, the company can then decrease its capital purchasing until the newfound capacity is consumed by new applications. In addition to the capital savings from not having to purchase new hardware to run new applications, there are substantial power savings as well, since the excess capacity is no longer powered up all the time and is instead brought online only when needed.

Now, this sounds like motherhood and apple pie to me. Who wouldn't want to move to an IT environment that allowed this type of flexibility/cost savings? Well, it turns out that in many large companies the inertia of "business as usual" gets in the way of making this type of transformational change, even if they could see huge benefits in the form of business agility and cost savings (two things that almost everyone is looking for in this current economy to make them more competitive than the next guy).

If you find yourself reading the list below and going "no way, not in my company," then I'd posit that while you may desire the virtues of cloud computing, you'll really struggle to succeed. The limits you'll be placing on the solution due to the organizational issues will mean that many of the virtues of a cloud approach will be lost (I'll try to call out a few as I walk through the list to give you an idea of the trade-offs involved).

What you’ll need to be successful at internal cloud computing:
Organizational willingness to embrace change. To fully adopt an internal cloud infrastructure, the system, network, and storage admins -- along with application developers -- are all going to have to work together as each group brings their specialty to bear on the hosting requirements of the application being deployed into the cloud. Also, if your organization likes to know exactly where an application is running at all times, then cloud computing is only going to be frustrating. In cloud computing, the environment is continually being monitored and optimized to meet the business needs (we call them service-level agreements in Cassatt Active Response). This means that while at any point in time you can know what is running where, that information is only accurate for that instant. Why? In the next instant, something may have failed or an SLA may have been breached, causing the infrastructure to react and change the allocation of applications to resources. Bottom line: be willing to embrace change in processes, policies, roles, and responsibilities, or you'll never be successful with a cloud computing solution.

Willingness to network boot (PXE/BOOTP/DHCP) the computing resources. One of the major value propositions of an internal cloud computing approach is the ability to re-use hardware for different tasks at different times. To allow for this rapid re-deployment of resources, you can't use traditional approaches for imaging a node's local disk (it takes a long time to copy a multi-GB image across the network and, once done, if that node fails then the data is lost, since it resides on the local disk). Instead, the use of standard network protocols (NFS/iSCSI) allows for the real-time association of the image to the hardware. This approach also has the byproduct of allowing for very fast failure recovery times. Once a node is found to have failed, it only takes a few seconds to associate the image to a new node and start the boot (we recover failed nodes in the time it takes to boot a node, plus a few seconds to effect the re-association of the image to its new compute resource).

Computing resources that support remote power management, either through on-board or off-board power controllers. For all of this dynamism to work, the internal cloud infrastructure must be able to power-control the nodes under management so that a node can be powered down when not needed (saving power) and powered back up when necessary. Most recent computing hardware has on-board controllers specifically for this task (iLO on HP, DRAC on Dell, ALOM/ILOM on Sun…) and the cloud computing infrastructure simply uses these existing interfaces to effect the power operations (see the sketch after this list for what driving one of these controllers can look like). If you find yourself in an environment with older hardware that lacks this support, don't despair. There are numerous vendors that manufacture external Power Distribution Units (PDUs) that can provide the necessary power management for otherwise "dumb" compute nodes.

Understand that your current static Configuration Management Database (CMDB) becomes real-time/dynamic in an internal cloud computing world. I touched on this under the "embrace change" bullet above, but it's worth calling out specifically. In a cloud computing world where you have pools of capacity (computing, network, and storage) that are associated in real time with the applications that need that capacity to provide service, NOTHING is constant. As an example, depending on how you set up your policy, a failure of a node in a higher-priority service will cause a node to be allocated from the free pool. If one is not available that matches the application's requirements (number of CPUs, disks, network cards, memory…), then a suitable replacement may be "borrowed" from a lower-priority service. This means that your environment is always changing and evolving to meet your business needs. What this also means is that nothing is constant, and you'll only find yourself frustrated if you don't change the static-allocation mindset.

Understand that much of the network configuration will be handled by the internal cloud infrastructure. Now this doesn't necessarily mean your core switches have to be managed by the cloud infrastructure. However, if you want the ability to allocate new compute capacity to an application that has specific networking requirements (like a web server would have if you want it behind a load balancer), then the infrastructure must reconfigure the ports connected to the desired node to be on the right network. This issue can be a show-stopper to providing a full cloud computing solution so talk about this one early and often with your network team so they have input on how they want to architect the network and have time to become comfortable with the level of control required by the infrastructure.

Boot image storage on the IP network. As I mentioned above, for internal cloud computing to work, you have to separate the application image from the physical node so that the image can be relocated on the fly as necessary to meet your policies. We currently leverage NFS for this separation, as it can easily be configured to support this level of dynamism. Also, using a NAS allows you to leverage a single IP network and reduces cost/complexity, since redundant data paths only have to be engineered for the IP network rather than for both the IP and storage networks. I don't mention SAN for the boot device because it can be problematic to move LUNs around on the fly due to the myriad of proprietary vendor switch management APIs. In addition, every server out there ships with at least two on-board NICs, while SAN HBAs are aftermarket add-ons (to the tune of hundreds of dollars if you want redundant channel bonding). Now, this single-network approach comes with a downside that I'm going to be upfront about: currently, IP networks are in the 1 Gigabit range with 10 Gigabit on the horizon, while SAN networks are 2-4 Gigabit (you can bond or trunk multiple Ethernet links together to get better throughput, but for now we'll leave that aside for this discussion). If you have a very disk-intensive application, you'll need to architect the solution to work within a cloud infrastructure. Specifically, you don't want to be sending that traffic across to a NAS, as throughput can suffer due to the limited bandwidth. You should look to either use local tmp space (if you only need temporary storage) or locally attached SAN that is zoned to a specific set of nodes that can act as backups to one another in case of failure.

Applications that run on commodity hardware. Internal cloud computing provides much more benefit when the applications to be managed run on commodity hardware. This hardware could be x86, POWER, or SPARC, depending on your environment, but should be the lower- to mid-level servers. It doesn't make a lot of sense to take something like a Sun SPARC F25K and put it into a cloud infrastructure, as it is already built to address scalability/availability issues within the chassis and has built-in high-speed interconnects for fast data access. With commodity hardware comes numbers and that is where a cloud infrastructure really shines, as it dramatically increases the span of control for the operators as they manage more at the SLA level and less at the node level.
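
As promised in the power-management item above, here's a minimal sketch of driving a node's out-of-band controller. iLO, DRAC, and ILOM all speak IPMI, so the standard ipmitool CLI works against any of them; the host and credentials below are placeholders, not anything from a real environment.

```python
import subprocess

def ipmi_power(bmc_host, user, password, action):
    """Drive a node's out-of-band controller (iLO/DRAC/ILOM) over IPMI.
    action is one of: 'status', 'on', 'off', 'cycle'."""
    cmd = ["ipmitool", "-I", "lanplus",
           "-H", bmc_host, "-U", user, "-P", password,
           "chassis", "power", action]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# e.g. park an idle spare, then power it back up when the policy needs it:
# ipmi_power("10.0.0.42", "admin", "secret", "off")
# ipmi_power("10.0.0.42", "admin", "secret", "on")
```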

Well, that's a lot to digest, so I think I'll stop now. I'm not trying to scare anyone away from moving to an internal cloud computing infrastructure. On the contrary, I believe internal clouds are the future of IT computing. The current "best" practices are, to a large extent, responsible for the management and over-provisioning issues facing most IT shops today. However, for you to address the over-provisioning and benefit from a continuously optimized computing environment where your excess capacity is efficiently allocated to the applications that need it (instead of sitting idle in a specific stovepipe), you need to understand the fundamental issues that lie ahead of you. In your transition to internal cloud computing you will need to actively work to address these new issues, or else you will find yourself with just a different set of problems to chase than the ones you have now.

Thursday, February 5, 2009

Is it safe to discuss automating your data center yet?

I've been at Cassatt for over 3 years now, and during that time one of the words I've most wanted to use in describing what our software does is "automation." The only problem: it's a four-letter word for IT. Until now. Maybe.

In the strictest sense of the word, Cassatt's software is all about automation. The goal of the software is to run your data center more efficiently by doing most of the work for you. It balances the resources you have (your hardware, virtual servers, applications, networks) with the application demand on those resources using policies you set -- and it does the balancing and re-balancing automatically. You end up with a big pool of compute power that gets called into service to support your apps only as needed, without someone having to manually adjust anything.

(By the way, you should probably be saying about now: "hey, that sounds like a cloud, but one that uses the stuff you already have inside your own data center" -- but that's another topic altogether.)

So why be timid about using the word "automation," then, if it's accurate? Because, in fact, it's something that IT folks have proven to be very gun-shy about. It's a concept that, frankly, doesn't have a lot of fans. As an example, the InfoWorld article by Eric Knorr debating the legitimacy of internal clouds also expressed some of this skepticism about automation.

Why is everyone afraid of automation?

From our experiences with customers, there are probably three reasons. First, IT and data center operations groups fear change. And with good reason. Their job is to make sure stuff works. Their motto usually is: if it's working, don't mess with it.

Second, automation itself requires a great deal of trust. You are replacing what a thinking person has been doing with a set of code. An IT ops person is going to want to see it working in action before he or she feels good about that. It's the way these guys are wired. (See previous paragraph for the reason.)

Third, vendors have shot themselves in the foot repeatedly by overpromising and underdelivering. (This was Eric Knorr's point noted above.) When things don't live up to the hype and fail to work as promised, successful IT ops folks write that down in their little black book of "see, I told you sos."

However, the bad economy may be changing all that. The 451 Group and others deeply involved in this space (including us here at Cassatt) have been noting that change may be afoot. Why? Automation is something that higher-up execs begin to think about in a downturn, especially when lay-offs are happening all around them. Wouldn't it be great to be able to do more stuff with fewer people? Or worse: now that we have fewer people, how do we keep the lights on in the data center?

In the Great Recession, automating how you run your data center may be something you need to consider in order to survive.

Of course, there are several angles on what "automation" actually is. There's the HP/Opsware and BMC/BladeLogic flavor of automation, which sets up configurations of your software and apps on your servers. And then there's automating the run-time of your data center -- the day-to-day running of your IT systems after you have them configured, where a huge percentage of your operational expenses go -- which is what Cassatt does.

Either way, it's smart to get ahead of this. It's always a good idea to be one of the folks helping to push forward change rather than clinging to the current business-as-usual plans. Especially when business-as-usual might actually mean going-out-of-business. Or at least an extended "vacation" for some of your IT staff.

The thing that got me thinking about all this was a podcast that Mike Vizard of eWeek did with Cassatt CEO Bill Coleman this week about how data centers are run, why it has been so hard for IT to adopt automation to help them out, and how we might get out of this "rut of complacency."

The main problem running data centers, said Bill, "is not scale; it's complexity. We've had a good run with virtualization and data center consolidation, but they've attacked the scale problem and in doing so have added to the complexity problem. The issue going forward is that if we don't attack this, we'll get beyond human scale. I don't think there's any way to do that without automating the operations part, doing for data centers what the telephone switching system did for the telephone system."

So how do company IT departments begin to make these changes? "The only way to do this is an evolutionary process, not a revolutionary process," said Bill. "And I think we are seeing the beginning of that with cloud computing, both internally and externally."

You can listen to the whole podcast here. Total running time is about 18 minutes.