Monday, March 23, 2009

Internal clouds and a better way to do recovery to a backup datacenter

Last post, we talked about a variety of failures within a datacenter and how an internal cloud infrastructure can help you provide a better level of service to your customers at a lower cost. In this post, we're on to the final use case in our discussion of recovery capabilities enabled by an internal cloud infrastructure -- and I think we've saved the best for last.

In the wake of 9/11 and in response to SOX compliance requirements, many companies have been working on catastrophic disaster recovery solutions for the event that a datacenter becomes unavailable. This kind of massive failure is where a cloud computing infrastructure really shines, as it enables capabilities that until now were unattainable due to the cost and complexity of the available solutions. We'll build on the previous example (multiple applications of varying importance to the organization), but this time the failure is one in which the entire datacenter hosting the applications becomes unavailable (you can use your imagination about what kinds of events cause these failures…).

Let's lay the groundwork for this example and describe a few more moving parts required to effect the solution, once again using a Cassatt implementation as the reference point. First, the datacenters must have a data replication mechanism in place, as the solution relies on the data/images being replicated from the primary site to the backup site. The ideal approach is a two-phase commit, which means no data loss on failure (other than in-flight transactions, which roll back) because everything written to the primary datacenter is written to the backup datacenter at the same time. While this is the preferred approach, if you can relax your data coherency requirements (say, allowing the backup site's data to lag the primary site by 30-60 minutes), the required technology and cost can be reduced substantially by using one of the myriad non-realtime replication technologies offered by the storage vendors.
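
To make that relaxed-coherency option concrete, here's a minimal sketch (Python; the last_replicated timestamp and the 60-minute window are hypothetical values that your replication product and your business would supply) of the kind of sanity check an operator might run before trusting the backup copy:

```python
# Minimal sketch: verify that asynchronous replication lag at the backup site
# is within the relaxed coherency window before declaring the copy usable.
# The 60-minute window and the last_replicated timestamp are assumptions.
from datetime import datetime, timedelta

MAX_LAG = timedelta(minutes=60)  # acceptable staleness for the backup copy

def replication_within_window(last_replicated, now=None):
    """Return True if the backup site's data is no staler than MAX_LAG."""
    now = now or datetime.utcnow()
    return (now - last_replicated) <= MAX_LAG

# Example: a copy synchronized 45 minutes ago passes; one 3 hours old does not.
print(replication_within_window(datetime.utcnow() - timedelta(minutes=45)))  # True
print(replication_within_window(datetime.utcnow() - timedelta(hours=3)))     # False
```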

The second requirement of the solution is that the IP addresses must stay the same across the recovery sites (meaning that when the primary site is recovered to the secondary site, it comes up in the same IP address space it had when running in the primary datacenter). The reason for this requirement is that many applications write the IP address of the node into local configuration files, making them very difficult to find and prohibitively complex to update during a failure recovery. (Think of updating thousands of these while in the throes of performing a datacenter recovery, how likely it is that at least a few mistakes will be made, and then how difficult it would be to actually find those mistakes.) We've learned that we end up with a much more stable and easy-to-understand/debug recovery solution if we keep the addresses constant.
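
To give a feel for the scale of that problem, here's a rough sketch (Python; the /etc/myapp path is just a placeholder) of the audit you would have to run merely to find those embedded addresses before you could even think about rewriting them:

```python
# Rough sketch: locate configuration files that embed IPv4 literals.
# In practice these are scattered across thousands of files, which is why
# keeping addresses constant beats rewriting them during a recovery.
import re
from pathlib import Path

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def files_with_hardcoded_ips(config_root):
    """Yield (path, ip_list) for every file under config_root containing an IPv4 literal."""
    for path in Path(config_root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        hits = IPV4.findall(text)
        if hits:
            yield path, hits

# Example usage (hypothetical path): audit one application's config tree.
for path, hits in files_with_hardcoded_ips("/etc/myapp"):
    print(path, hits)
```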

On the interesting topics front, there are two items that are perhaps unexpectedly not required for the solution to work. First, the backup datacenter is not required to have hardware identical to the primary site (both the type of hardware and the quantities can differ). Second, the backup datacenter can host other lower-priority applications for the company when it is not being used for recovery (so your investment is not just sitting idle waiting for the rainy day, but is instead contributing to generating revenue).

With those requirements and background out of the way, let's walk through the failure and see how the recovery works. Again, we'll start with the assumption that everything is up and running in a steady state in the primary datacenter when the failure occurs. For this failure, the failover process is manually initiated: while we could automate the failover, the recovery of one datacenter into another just seems like too big a business issue to leave the decision to automation, so we require the user to initiate the process. Once the decision is made to recover the datacenter into the backup site, the operator simply runs a program to start the recovery process. This program performs the following steps (a simplified sketch of such a driver appears after the list):
  • Gracefully shut down any applications still running in the primary datacenter (depending on the failure, not all services may have failed, so we must start by quiescing the systems)

  • Gracefully shut down the low-priority applications running in the backup datacenter in preparation for recovering the primary datacenter applications.

  • Set aside the backup datacenter's data so that we can come back to it later, once the primary datacenter is restored and its payload has migrated back; at that point we'll want to recover the applications that were originally running in the backup datacenter. There is nothing special about setting the data aside: in practice, it just means unmounting the backup datacenter's storage from the control node.

  • Update the backup datacenter's network switch and routing information so that the switches know about the production site's network configuration, and update the backbone routers and other upstream equipment so they know about the change in location.

  • Mount the replicated data store(s) into place. This gives the control node in our Cassatt-based example access to the application topology and requirements needed to recover the applications into the new datacenter.

  • Remove all existing hardware definitions from the replicated database. We keep all of the user-defined policies that describe the server, storage, and networking requirements of the applications. However, because the database we are recovering includes the hardware definitions from the primary datacenter, and none of that hardware exists in the secondary datacenter, we must remove them prior to starting the recovery so that the system is forced to go through its hardware allocation steps. These steps are important because they map the application priorities and requirements to the hardware available in the backup datacenter.
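
To tie those steps together, here is a minimal sketch of what such a driver program might look like (Python; every step is a stub that merely logs the action, since the real commands depend entirely on your storage, network, and control-node tooling):

```python
# Hypothetical sketch of the operator-initiated failover driver described above.
# Each step is a stub that just logs what real tooling (service managers,
# storage mounts, network automation, database edits) would do in practice.

def step(description):
    """Stand-in for real automation; logs the action being taken."""
    print("[failover] " + description)

def fail_over_to_backup_site():
    # 1. Quiesce anything still running at the primary site.
    step("gracefully shut down remaining applications in the primary datacenter")
    # 2. Make room at the backup site by stopping its low-priority workloads.
    step("gracefully shut down low-priority applications in the backup datacenter")
    # 3. Set the backup site's own data aside for later recovery.
    step("unmount the backup site's storage from the control node")
    # 4. Point switches and upstream routers at the production network
    #    configuration so the original IP addresses keep working.
    step("update backup-site switches and backbone routers for the production layout")
    # 5. Attach the replicated data store so the control node can read the
    #    application topology, policies, and images from the primary site.
    step("mount the replicated primary-site data store(s) on the control node")
    # 6. Drop the primary site's hardware definitions so the controller is
    #    forced to re-inventory and re-allocate against local hardware.
    step("remove primary-site hardware definitions from the replicated database")
    # Hand off to the cloud infrastructure's recovery logic.
    step("start the recovery process on the control node")

if __name__ == "__main__":
    fail_over_to_backup_site()
```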

Once these steps are completed, the recovery logic in the cloud infrastructure is started and the recovery begins. The first thing the cloud infrastructure controller must do is to inventory the hardware in the secondary datacenter to determine the type and quantities available. Once the hardware is inventoried, the infrastructure takes the user-entered policy in the database and determines what applications have the highest priorities. It begins the allocation cycle to re-establish the SLAs on those applications. As hardware allocations complete, the infrastructure will again consult the stored policy from the database to keep intact the dependencies between the various applications, starting them in the required order to recover the applications successfully. This cycle of inventory, allocation, and activation will continue until either all of the applications have been recovered (in priority/dependency order) or until the environment becomes hardware constrained (meaning that there is insufficient hardware of the correct type to meet the needs of the applications being recovered).
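
For a feel of how that cycle plays out, here is a simplified sketch (Python; the application names, priorities, and server counts are invented for illustration, and a single pass stands in for the controller's ongoing inventory, allocation, and activation cycle):

```python
# Hypothetical sketch of priority- and dependency-ordered recovery that stops
# when the backup site becomes hardware constrained. Applications are reduced
# to dictionaries; real priorities and requirements come from the stored policy.

def recover_applications(applications, available_servers):
    """Bring applications up in priority order until hardware runs out.

    applications: list of dicts with 'name', 'priority' (lower = more important),
                  'depends_on' (list of names), and 'servers_needed'.
    available_servers: number of free servers inventoried at the backup site.
    """
    started = set()
    for app in sorted(applications, key=lambda a: a["priority"]):
        # Respect the dependency order captured in the stored policy.
        if not all(dep in started for dep in app["depends_on"]):
            print(f"skipping {app['name']}: dependencies not yet running")
            continue
        # Allocation step: stop when the environment is hardware constrained.
        if app["servers_needed"] > available_servers:
            print(f"stopping: insufficient hardware for {app['name']}")
            break
        available_servers -= app["servers_needed"]
        started.add(app["name"])
        print(f"activated {app['name']} on {app['servers_needed']} servers")
    return started

# Example: a database must start before the web tier; reporting is lowest priority
# and is left behind once the backup site runs out of servers.
apps = [
    {"name": "orders-db",  "priority": 1, "depends_on": [],            "servers_needed": 4},
    {"name": "orders-web", "priority": 2, "depends_on": ["orders-db"], "servers_needed": 6},
    {"name": "reporting",  "priority": 3, "depends_on": ["orders-db"], "servers_needed": 8},
]
recover_applications(apps, available_servers=12)
```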

The same approach outlined above is applied in reverse when the primary datacenter is restored and the applications need to be migrated back to their original locations. Once the applications are back in the primary datacenter, the applications that were originally running in the backup datacenter can be recovered by simply putting the storage mounts back in place and restarting the control node. No extra scrubbing steps are required in this case, since the hardware has not changed. After the control node restarts, the applications are recovered just as if a power outage had occurred, and they pick up exactly where they left off when they were shut down at the start of the failover.

Thanks for taking the time to read and I hope this post has you thinking about some of the major transformational benefits that your organization can receive from adopting an internal cloud infrastructure for running your IT environment. My next installment will be a discussion of how an internal cloud infrastructure's auditing and tracking capabilities can provide your organization an unparalleled view into how your resources are being used. We'll then explore how this type of information can enable you to provide your business units with billing reports that show exactly what resources their applications used and when for any given month.
