Amidst the talk about improving data center efficiency, a lot of things are on the table. You can move to virtualization, add automation, even try painting the roof (seriously...I heard Mike Manos of Microsoft talk about that one in his AFCOM keynote last year). There's usually a sacred cow, however. And usually that cow turns out to be one of the biggest culprits of inefficiency in the entire IT department.
Disaster recovery has been one of those sacred cows.
Being the iconoclasts we are here at Cassatt, we thought we should hit this bastion of IT ops conservatism head-on. Sure, the data center folks have many good reasons why they need to do things the way they currently do. And, sure, those same guys and gals have their necks in the noose if the word "disaster" is ever used to describe their IT systems and isn't very quickly and relatively seamlessly followed by the word "recovery." We're talking about continuity of operations for business-critical IT systems, something that could very well make or break a company (especially if things go horribly wrong with your data center and, say, the economy is already in the dumpster).
However, we wanted to apply the internal cloud concepts of resource sharing, policy-based automated provisioning, and energy efficiency (as in, don't have stuff on when it's not needed) to disaster recovery. So we did. We even found a couple of willing users to try it out. I thought I'd explain our approach, along with the before-and-after comparisons.
What disaster recovery approaches usually look like now
Existing data center disaster recovery solutions vary, but they usually require either a full duplicate set of servers dedicated as back-ups in case of a failure, or an outsourced service that guarantees the same within something like two hours, at costs somewhere in the neighborhood of $5,000 per system. Oh, and those servers (no matter where they are located) need to be on, consuming power and cooling 24x7. There are less immediate, less responsive ways people set up their DR, too (think trucks, tapes, and several days to restart). Usually, the more reliable the set-up, the more wasteful. We thought we should tackle one of the most wasteful approaches.
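For a sense of scale, here's a rough back-of-envelope calculation of what an always-on, dedicated DR site costs to keep idle. Every figure (server count, power draw, electricity rate, cooling overhead, and treating the ~$5,000 per system as an annual fee) is an illustrative assumption, not a number from any customer:

```python
# Back-of-envelope cost of a traditional always-on DR site.
# Every figure below is an illustrative assumption, not customer data.

SERVERS = 100                  # duplicate servers dedicated to stand-by DR
WATTS_PER_SERVER = 400         # average draw of a powered-on, idle server
PUE = 2.0                      # cooling/power overhead multiplier
DOLLARS_PER_KWH = 0.10         # electricity rate
ANNUAL_FEE_PER_SYSTEM = 5000   # assuming the ~$5,000/system figure is per year

hours_per_year = 24 * 365
energy_kwh = SERVERS * WATTS_PER_SERVER * PUE * hours_per_year / 1000.0
energy_cost = energy_kwh * DOLLARS_PER_KWH
service_cost = SERVERS * ANNUAL_FEE_PER_SYSTEM

print(f"Energy to keep idle DR servers on: {energy_kwh:,.0f} kWh (~${energy_cost:,.0f}/yr)")
print(f"Per-system DR fees:                ~${service_cost:,.0f}/yr")
print(f"Total, before hardware and floor space: ~${energy_cost + service_cost:,.0f}/yr")
```

Swap in your own numbers; the point is that the waste scales with every duplicate server you leave powered on and idle.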
How we set up our test to show an internal cloud handling disaster recovery
We set up a very small test environment for this internal cloud approach to disaster recovery in conjunction with one of our customers. We placed two servers under Cassatt Active Response control and called this soon-to-fail environment the "Apples" data center (you'll see where this is going shortly, and, no, it has nothing to do with your iPod). We put another two servers -- of slightly different configuration -- under a different Cassatt controller in what we called the "Oranges" data center.
We helped the customer set up the Cassatt Active Response management consoles to control the clouds of (two) servers, plus the related OSs (Solaris in this case), application software, and networks. We helped them create service-level policies and priorities for the applications under management. The underlying stateless infrastructure management handled by Cassatt was synchronized with data and storage technologies to make sure that the customer's applications not only had servers to run on, but also had the data users required, despite the disaster. In this example, we worked with NetApp's SnapMirror product to handle the mirroring of the data between the Apples and the Oranges data centers.
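To give a feel for what that policy looks like, here's a minimal sketch in Python. The field names and values are hypothetical (Cassatt Active Response has its own console and formats, and the SnapMirror pairing is configured on the NetApp side); the point is simply the kind of priority, requirement, and replication information the controller works from:

```python
# Hypothetical sketch of the service-level policy and replication pairing.
# Field names and values are illustrative, not Cassatt's or NetApp's actual formats.

replication = {
    "technology": "NetApp SnapMirror",      # mirrors data between the two sites
    "source_site": "apples",
    "target_site": "oranges",
}

applications = [
    {
        "name": "customer-db",
        "priority": 1,                       # 1 = most critical, recovered first
        "os_image": "solaris-db",            # stateless image provisioned onto hardware
        "hardware": {"arch": "sparc", "min_ram_gb": 16},
        "depends_on": [],
    },
    {
        "name": "order-processing",
        "priority": 1,
        "os_image": "solaris-app",
        "hardware": {"arch": "sparc", "min_ram_gb": 8},
        "depends_on": ["customer-db"],       # must come up after the database
    },
    {
        "name": "internal-reporting",
        "priority": 3,                       # only recovered if hardware remains
        "os_image": "solaris-app",
        "hardware": {"arch": "sparc", "min_ram_gb": 4},
        "depends_on": ["customer-db"],
    },
]
```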
Disaster is declared: the internal cloud begins to beg, borrow, and steal servers
Time for the actual disaster now. Here's what happened:
· We declared the Apples data center "dead." A human notified the Cassatt software managing Apples that it was OK to kick off the disaster recovery process.
· Because Cassatt Active Response was managing the cloud of IT resources, it knew what applications were running in the Apples data center and that, because of the disaster, they needed to be moved to the Oranges data center to keep the business running.
· Cassatt checked to see what hardware was available at the Oranges site. It saw two Sun SPARC T2000s.
· Since the hardware configuration at the Oranges site was different from the Apples site, the Cassatt software used the priorities the customer had set up to bring up the most important applications on the best-fit hardware.
· With these priorities, Cassatt provisioned the available hardware at the Oranges site with the operating systems, applications, and necessary networking configurations. (The Oranges systems were off when the "disaster" began, but if they had been running other apps, those apps would have been gracefully shut down and the servers "harvested" for the more critical disaster recovery activities.)
· The applications that had been running at the Apples site came up, despite the hardware differences, on the Oranges site, ready to support the users.
Sample disaster recovery complete. The internal cloud approach worked: the apps were back up in under an hour. Now, an hour isn't appropriate for every type of app or business requirement, but for many others this would work just fine.
If there had been fewer server resources in the fail-over site, by the way, Cassatt Active Response would have applied the priorities the customer set up to enable the best possible provisioning of the applications onto the reduced amount of hardware.
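To make that concrete, here's a minimal sketch of priority-driven placement when the fail-over site has fewer servers than there are applications. It's my own simplification using made-up data, not Cassatt's actual allocation logic:

```python
# Illustrative sketch: three applications, but only two servers at the fail-over site.
# Priorities (1 = most critical) decide who gets hardware; the rest stay down.

applications = [
    {"name": "customer-db",        "priority": 1},
    {"name": "order-processing",   "priority": 1},
    {"name": "internal-reporting", "priority": 3},
]
available_servers = ["t2000-a", "t2000-b"]   # the two SPARC T2000s at "Oranges"

for app in sorted(applications, key=lambda a: a["priority"]):
    if available_servers:
        server = available_servers.pop(0)
        print(f"{app['name']} -> provisioned onto {server}")
    else:
        print(f"{app['name']} -> not recovered (insufficient hardware)")
```

The two critical apps land on the two available servers; the priority-3 app simply waits until the primary site is restored.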
This was only a small test, but…
Yep, these are only very limited tests, but they start to show what's possible with an internal cloud. The interesting difference from traditional DR is this: the customers doing these tests used the compute resources they already had in their data centers. With an internal cloud approach, the very logic that runs your data center day-to-day (and much more efficiently, I might add) is the same thing that brings it back to life after a disaster.
The other thing that makes this interesting is that by using a cloud-style approach, this disaster recovery scenario can also be easily adapted to more scheduled events, such as application migrations, or to move current workloads to the most energy-efficient server resources available. I'm sure there are a few other scenarios that customers will think of that we haven't.
We'll keep you updated as customers begin to deploy these and other internal cloud examples. And I promise next time we won't be comparing Apples to Oranges. (Yeah, that one hurt to write as well.)
Update: A week or so back, Craig Vosburgh did a technical overview of how a DR scenario might benefit from an internal cloud-based approach similar to the one described above. You can read that here.
Monday, March 23, 2009
Internal clouds and a better way to do recovery to a back-up datacenter
Posted by Craig Vosburgh at 8:13 AM
Last post, we talked about a variety of failures within a datacenter and how an internal cloud infrastructure would help you provide a better level of service to your customers at a lower cost. In this post, we're on to the final use case for our discussion of recovery capabilities enabled by using an internal cloud infrastructure -- and I think we've left the best for last.
In the wake of 9/11 and to respond to SOX compliance issues, many companies have been working on catastrophic disaster recovery solutions for the event that a datacenter becomes unavailable. This kind of massive failure is where a cloud computing infrastructure really shines, as it enables capabilities that to date were unattainable due to the cost and complexity of the available solutions. That said, we'll build on the previous example (multiple applications of varying importance to the organization), but this time the failure is going to be one in which the entire datacenter hosting the applications becomes unavailable (you can use your imagination as to what kinds of events cause these failures…).
Let's lay the groundwork for this example and describe a few more moving parts required to effect the solution, once again using a Cassatt implementation as the reference point. First, the datacenters must have a data replication mechanism in place, as the solution relies on the data/images being replicated from the primary site to the backup site. The ideal approach would be a two-phase commit, as this means no data loss on failure (other than transactions in flight, which will roll back), because writes to the primary datacenter are committed to the backup datacenter at the same time. While this is the preferred approach, if you can relax your data coherency requirements (such that the backup site's data is within 30-60 minutes of the primary site), then the required technology and cost can be simplified and reduced substantially by using one of the myriad non-realtime replication technologies offered by the storage vendors.
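The tradeoff between the two replication styles can be summed up in a few lines. This is just an illustration of the exposure window, and the numbers are assumptions rather than vendor figures:

```python
# Illustrative comparison of the two replication approaches described above.

def async_worst_case_loss_minutes(interval_min, transfer_min):
    """Non-realtime replication: anything written after the last snapshot that
    reached the backup site is lost if the primary dies, so the worst-case
    exposure is roughly one replication interval plus the transfer time."""
    return interval_min + transfer_min

def sync_worst_case_loss_minutes():
    """Two-phase commit style replication: a write isn't acknowledged until both
    sites have it, so only in-flight transactions (which roll back) are lost."""
    return 0

print(async_worst_case_loss_minutes(30, 15))  # ~45 minutes of potential data loss
print(sync_worst_case_loss_minutes())         # effectively zero, at higher cost
```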
The second requirement of the solution is that the IP addresses must stay the same across the recovery sites (meaning that when the primary site is recovered to the secondary site, it will come up in the same IP address space that it had when running in the primary datacenter). The reason for this requirement is that many applications write the IP address of the node into local configuration files, making those addresses very difficult to find and prohibitively complex to update during a failure recovery. (Think of updating thousands of these while in the throes of performing a datacenter recovery and how likely it is that at least a few mistakes will be made. Then add to that how difficult it would be to actually find those mistakes.) We've learned that we end up with a much more stable and easier-to-understand/debug recovery solution if we keep the addresses constant.
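As a small illustration of why rewriting addresses mid-recovery is so error-prone, here's a sketch of just the "find the hardcoded IPs" step. The root path and regex are assumptions, and the real trouble is everything a regex can't see (addresses in binary formats, in application data, or split across lines):

```python
import re
from pathlib import Path

# Illustrative sketch: merely *finding* hardcoded IP addresses is already messy.
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
CONFIG_ROOT = Path("/etc")   # assumption; real apps scatter config files widely

def find_hardcoded_ips(root):
    hits = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for ip in IP_PATTERN.findall(text):
            hits.append((str(path), ip))
    return hits

# Multiply this across thousands of nodes, then imagine rewriting every hit
# correctly during a disaster -- which is why keeping the address space
# constant across sites makes the recovery so much simpler.
for path, ip in find_hardcoded_ips(CONFIG_ROOT)[:20]:
    print(path, ip)
```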
On the interesting-topics front, there are two items that are perhaps unexpectedly not required for the solution to work. First, the backup datacenter is not required to have hardware identical to the primary site's (both the type of hardware and the quantities can differ). Second, the backup datacenter can be used to host other, lower-priority applications for the company when not being used for recovery (so your investment is not just sitting idle waiting for the rainy day to happen, but instead is contributing to generating revenue).
With those requirements and background out of the way, let's walk through the failure and see how the recovery works. Again, we'll start with the assumption that everything is up and running in a steady state in the primary datacenter when the failure occurs. For this failure, the failover process is manually initiated. (While we could automate the failover, recovering one datacenter into another just seems like too big a business decision to leave to automation, so we require the user to initiate the process.) Once the decision is made to recover the datacenter into the backup site, the operator simply runs a program to start the recovery process. This program performs the following steps (a rough sketch of such a driver script follows the list):
- Gracefully shut down any applications still running in the primary datacenter (depending on the failure, not all services may have failed, so we must start by quiescing the systems)
- Gracefully shut down the low-priority applications running in the backup datacenter in preparation for recovering the primary datacenter applications.
- Set aside the backup datacenter’s data so that we can come back to it later when the primary datacenter is recovered. When we want to migrate the payload back to the primary site, we'll want to recover the applications that were originally running in that backup datacenter. There isn't anything special being done in this step in terms of setting aside the data. In practice, this just means unmounting the secondary datacenter storage from the control node.
- Update the backup datacenter's network switch and routing information so that the switches know about the production site's network configuration. The backbone routers and related equipment also need to be updated so that they know about the change in location.
- Mount the replicated data store(s) into place. This gives the control node in our Cassatt-based example access to the application topology and requirements needed to recover the applications into the new datacenter.
- Remove all existing hardware definitions from the replicated database. We keep all of the user-defined policies that describe the server, storage, and networking requirements of the applications. However, because the database we are recovering includes the hardware definitions from the primary datacenter, and none of that hardware exists in the secondary datacenter, we must remove it prior to starting the recovery so that the system is forced to go through its hardware allocation steps. These steps are important because they map the application priorities and requirements to the hardware available in the backup datacenter.
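Here is a rough sketch of what such a driver script might look like. Every command is a placeholder for a site-specific mechanism (storage vendor tooling, network automation, Cassatt's own utilities); the names and paths are hypothetical, and only the ordering mirrors the steps above:

```python
#!/usr/bin/env python
"""Hypothetical recovery-driver sketch; each command is a site-specific placeholder."""

import subprocess

def run(cmd):
    """Run a step and stop the recovery loudly if it fails."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

def fail_over_to_backup_site():
    # 1. Quiesce whatever is still running at the (partially) failed primary site.
    run("quiesce-apps --site primary")                        # placeholder
    # 2. Gracefully stop the low-priority workload at the backup site.
    run("shutdown-apps --site backup --tier low-priority")    # placeholder
    # 3. Set the backup site's own data aside for later fail-back
    #    (in practice, just unmount it from the control node).
    run("umount /cassatt/controller-data")                    # illustrative mount point
    # 4. Update switches and backbone routing so the backup site answers for
    #    the production address space.
    run("apply-network-config --site backup --profile production")   # placeholder
    # 5. Mount the replicated data store so the control node sees the failed
    #    site's application topology and policy.
    run("mount filer:/replicated/controller-data /cassatt/controller-data")
    # 6. Scrub the primary site's hardware definitions from the database, forcing
    #    the controller to re-inventory and re-allocate the backup site's servers.
    run("scrub-hardware-definitions /cassatt/controller-data/db")     # placeholder

if __name__ == "__main__":
    fail_over_to_backup_site()
```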
Once these steps are completed, the recovery logic in the cloud infrastructure is started and the recovery begins. The first thing the cloud infrastructure controller must do is inventory the hardware in the secondary datacenter to determine the types and quantities available. Once the hardware is inventoried, the infrastructure takes the user-entered policy in the database and determines which applications have the highest priorities. It begins the allocation cycle to re-establish the SLAs on those applications. As hardware allocations complete, the infrastructure again consults the stored policy to keep the dependencies between the various applications intact, starting them in the required order so they recover successfully. This cycle of inventory, allocation, and activation continues until either all of the applications have been recovered (in priority/dependency order) or the environment becomes hardware constrained (meaning there is insufficient hardware of the correct type to meet the needs of the applications being recovered).
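A stripped-down sketch of that inventory/allocate/activate cycle is below. The application records and server data are illustrative assumptions, and the logic is my own simplification rather than the actual controller implementation:

```python
# Simplified sketch of the inventory -> allocate -> activate cycle described above.
# Application records and server data are illustrative assumptions.

policy = [  # user-defined policy recovered from the replicated database
    {"name": "customer-db",      "priority": 1, "min_ram_gb": 16, "depends_on": []},
    {"name": "order-processing", "priority": 1, "min_ram_gb": 8,  "depends_on": ["customer-db"]},
    {"name": "reporting",        "priority": 3, "min_ram_gb": 4,  "depends_on": ["customer-db"]},
]

def inventory_hardware():
    """Placeholder for discovering what the backup site actually contains."""
    return [{"id": "t2000-a", "ram_gb": 16}, {"id": "t2000-b", "ram_gb": 8}]

def recover(policy):
    free = inventory_hardware()
    started = set()
    pending = sorted(policy, key=lambda a: a["priority"])   # highest priority first
    progress = True
    while pending and progress:
        progress = False
        for app in list(pending):
            if not all(dep in started for dep in app["depends_on"]):
                continue                                    # dependency not up yet
            fits = [s for s in free if s["ram_gb"] >= app["min_ram_gb"]]
            if not fits:
                continue                                    # hardware constrained
            server = min(fits, key=lambda s: s["ram_gb"])   # best-fit allocation
            free.remove(server)
            started.add(app["name"])
            pending.remove(app)
            progress = True
            print(f"{app['name']}: provisioned onto {server['id']} and activated")
    for app in pending:
        print(f"{app['name']}: not recovered (hardware constrained or dependency down)")

recover(policy)
```

With only two servers available, the two priority-1 applications come up in dependency order and the priority-3 application is reported as hardware constrained, which mirrors the behavior described above.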
The same approach outlined above is done in reverse when the primary datacenter is recovered and the applications need to be migrated back to their original locations. Once the applications are recovered back to the primary datacenter, the applications that were originally running in the backup datacenter can be recovered by simply putting the storage mounts back in place and restarting the control node. In this case no extra scrubbing steps are required, as the hardware has not changed this time. After restarting the control node, the applications are recovered just as if a power outage had happened. Once restarted, the applications will pick up exactly where they had left off prior to the primary datacenter failure.
Thanks for taking the time to read and I hope this post has you thinking about some of the major transformational benefits that your organization can receive from adopting an internal cloud infrastructure for running your IT environment. My next installment will be a discussion of how an internal cloud infrastructure's auditing and tracking capabilities can provide your organization an unparalleled view into how your resources are being used. We'll then explore how this type of information can enable you to provide your business units with billing reports that show exactly what resources their applications used and when for any given month.