Wednesday, April 1, 2009

A test: Applying an internal cloud to disaster recovery

Amidst the talk about improving data center efficiency, a lot of things are on the table. You can move to virtualization, add automation, even try painting the roof (seriously... I heard Mike Manos of Microsoft talk about that one in his AFCOM keynote last year). There's usually a sacred cow, however. And usually that cow turns out to be one of the biggest culprits of inefficiency in the entire IT department.

Disaster recovery has been one of those sacred cows.

Being the iconoclasts we are here at Cassatt, we thought we should hit this bastion of IT ops conservatism head-on. Sure, the data center folks have many good reasons why they need to do things the way they do currently. And, sure, those same guys and gals have their necks in the noose if the word "disaster" ever appears describing their IT systems and it is not very quickly and relatively seamlessly followed by the word "recovery." We're talking about continuity of operations for business-critical IT systems, something that very likely could make or break a company (especially if things go horribly wrong with your data center and, say, the economy is already in the dumpster).

However, we wanted to apply the internal cloud concepts of resource sharing, policy-based automated provisioning, and energy efficiency (as in, don't have stuff on when it's not needed) to disaster recovery. So we did. We even found a couple willing users to try it out. I thought I'd explain our approach along with the before and after comparisons.

What disaster recovery approaches usually look like now

Data center disaster recovery solutions vary, but they usually require either a full duplicate set of servers dedicated as back-ups in case of a failure, or an outsourced service that guarantees the same within something like two hours, at costs somewhere in the neighborhood of $5,000 per system. Oh, and those servers (no matter where they are located) need to be on, consuming power and cooling 24x7. There are also less immediate, less responsive ways to set up DR (think trucks, tapes, and several days to restart). Usually, the more reliable the set-up, the more wasteful. We thought we should tackle one of the most wasteful approaches.

How we set up our test to show an internal cloud handling disaster recovery

We set up a very small test environment for this internal cloud approach to disaster recovery in conjunction with one of our customers. We placed two servers under Cassatt Active Response control and called this soon-to-fail environment the "Apples" data center (you’ll see where this is going shortly, and, no, it has nothing to do with your iPod). We put another two servers -- of slightly different configuration -- under a different Cassatt controller in what we called the "Oranges" data center.

We helped the customer set up the Cassatt Active Response management consoles to control the clouds of (two) servers, plus the related OSs (Solaris in this case), application software, and networks. We helped them create service-level policies and priorities for the applications under management. The underlying stateless infrastructure management handled by Cassatt was synchronized with data and storage technologies to make sure that the customer's applications not only have servers to run on, but that they also have the data to run as users require, despite the disaster. In this example, we worked with NetApp's SnapMirror product to handle the mirroring of the data between the Apples and the Oranges data centers.

Disaster is declared: the internal cloud begins to beg, borrow, and steal servers

Time for the actual disaster now. Here's what happened:

· We declared the Apples data center "dead." A human notified the Cassatt software managing Apples that it was OK to kick off the disaster recovery process.
· Because Cassatt Active Response was managing the cloud of IT resources, it knew what applications were running in the Apples data center and that, because of the disaster, they needed to be moved to the Oranges data center to keep business running.
· Cassatt checked to see what hardware was available at the Oranges site. It saw 2 Sun SPARC T2000s.
· Since the hardware configuration at the Oranges site was different from the Apples site, the Cassatt software used priorities that the customer set up to bring up the most important applications on the best-fit hardware.
· With these priorities, Cassatt provisioned the available hardware at the Oranges site with the operating systems, applications, and necessary networking configurations. (The Oranges systems were off when the "disaster" began, but if they had been running other apps, those apps would have been gracefully shut down, and the servers "harvested" for the more critical disaster recovery activities.)
· The applications that were once running at the Apples site came up, despite the hardware differences, on the Oranges site, ready to support the users.

Sample disaster recovery complete. The internal cloud approach worked: the apps were back up in under an hour. Now that's not appropriate for every type of app or business requirement, but for others this would work just fine.

If there had been fewer server resources in the fail-over site, by the way, Cassatt Active Response would have applied the priorities the customer set up to enable the best possible provisioning of the applications onto the reduced amount of hardware.
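To make the "priorities plus best-fit hardware" idea a little more concrete, here is a rough sketch in Python. To be clear, this is not how Cassatt Active Response is implemented; it's a toy illustration of the general technique. The Server and App classes, the plan_failover function, the application names, and all the CPU and memory numbers are made up for the example, and it assumes a lower priority number means a more important application.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    cpus: int
    memory_gb: int


@dataclass
class App:
    name: str
    priority: int       # assumption: lower number = more important
    min_cpus: int
    min_memory_gb: int


def plan_failover(apps, servers):
    """Place apps on servers in priority order, best fit first.

    Returns (placements, unplaced): a dict mapping app name to server
    name, and a list of app names that did not get a server.
    """
    free = list(servers)
    placements = {}
    unplaced = []
    for app in sorted(apps, key=lambda a: a.priority):
        # Only servers that meet the app's minimum requirements qualify.
        candidates = [
            s for s in free
            if s.cpus >= app.min_cpus and s.memory_gb >= app.min_memory_gb
        ]
        if not candidates:
            unplaced.append(app.name)
            continue
        # Best fit: the qualifying server with the least spare capacity.
        best = min(
            candidates,
            key=lambda s: (s.cpus - app.min_cpus,
                           s.memory_gb - app.min_memory_gb),
        )
        placements[app.name] = best.name
        free.remove(best)
    return placements, unplaced


# Hypothetical fail-over site with less hardware than there are apps.
oranges = [Server("oranges-1", 8, 16), Server("oranges-2", 4, 8)]
apps = [
    App("billing", priority=1, min_cpus=4, min_memory_gb=8),
    App("reporting", priority=3, min_cpus=2, min_memory_gb=4),
    App("orders", priority=2, min_cpus=6, min_memory_gb=16),
]

placements, unplaced = plan_failover(apps, oranges)
print(placements)
print(unplaced)
```

With only two servers for three applications, the two highest-priority apps get hardware and the lowest-priority one waits. The sketch leaves out the "harvesting" step described above (gracefully shutting down less critical apps already running at the fail-over site), which a real system would also need to handle.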

This was only a small test, but…

Yep, these are only very limited tests, but they start to show what's possible with an internal cloud. What's interesting here, and different from traditional DR, is this: the customers doing these tests used the compute resources that they already had in their data centers. With an internal cloud approach, the very logic that runs your data center day-by-day (and much more efficiently, I might add) is the same thing that brings it back to life from a disaster.

The other thing that makes this interesting is that by using a cloud-style approach, this disaster recovery scenario can also be easily adapted to more scheduled events, such as application migrations, or to move current workloads to the most energy-efficient server resources available. I'm sure there are a few other scenarios that customers will think of that we haven't.

We'll keep you updated as customers begin to deploy these and other internal cloud examples. And I promise next time we won't be comparing Apples to Oranges. (Yeah, that one hurt to write as well.)

Update: A week or so back, Craig Vosburgh did a technical overview of how a DR scenario might benefit from an internal cloud-based approach similar to what's mentioned above. You can read that here.
