Friday, February 27, 2009

Recoverability: How an internal cloud makes it easier to get your apps back up and running after a failure

Ok, back to talking about the "-ilities" this week and how cloud computing can help you address one of the key issues you are concerned with in your data center. On deck is the recoverability of your IT environment when run on an internal cloud infrastructure (like Cassatt Active Response).

As discussed in my last post, there can be a fair amount of organizational and operational change required to adopt an internal cloud infrastructure, but there are many benefits from taking on the task. The next couple of posts will outline one of the major benefits (recoverability) that comes from separating the applications from their network, computing, and storage resources, and how this separation allows for both intra-datacenter and inter-datacenter recoverability of your applications.

Five Internal Cloud Recovery Scenarios
To walk you through the discussion, I'm going to take you through a number of levels of recoverability as a function of the complexity of the application and the failure being addressed. Taking this approach, I've come up with five different scenarios that start with the recovery of a single application and end with recovery of a primary datacenter into a backup datacenter. The first four scenarios (intra-datacenter recovery) are covered in this post and the last one (inter-datacenter recovery) will be covered in my next post. So, enough background: let’s get into the discussion.

Application Recovery
Let's begin with the simplest level of recoverability and start with a single application (think a single workgroup server for a department that might be running things like a wiki, mail, or a web server). From talking to many admins over the years, the first thing they do when they find that a given server has dropped offline is to perform a "therapeutic reboot" to see if that gets everything back to a running state. The reality of IT is that many of the frameworks/containers/applications that you run leak memory slowly over time and that a reboot is the easiest way to clean up the issue.

In the case of a traditional IT environment, a monitoring tool would be used to monitor the servers under management and if the monitors go offline then an alert/page is generated to tell the admin that something is wrong that needs their attention. With an internal cloud infrastructure in place, the creation of monitors for each managed server comes for free (you just select in the UI how you want to monitor a given service from a list of supported monitors). In addition, when the monitor(s) drop offline you can configure policy to tell the system how you want the failure handled.

In the case of an application that you'd like rebooted prior to considering it failed, you simply instruct the internal cloud infrastructure to power cycle the node and to send an alert only if it doesn't recover correctly. While this simple level of recoverability is interesting in making an admin more productive (less chasing unqualified failures), it really isn't that awe-inspiring, so let's move on to the next level of recovery.
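To make the "reboot first, page me only if that fails" policy concrete, here is a minimal Python sketch. It is not Cassatt Active Response's actual API: the class name, the `is_healthy` monitor probe, and the `power_cycle` action are all hypothetical stand-ins for whatever your infrastructure provides.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RebootThenAlertPolicy:
    """Hypothetical policy: power cycle a failed node, alert only if it stays down."""
    is_healthy: Callable[[], bool]    # assumed monitor probe
    power_cycle: Callable[[], None]   # assumed power-management action
    max_reboots: int = 1
    alerts: List[str] = field(default_factory=list)

    def handle_failure(self, node: str) -> str:
        # Try the "therapeutic reboot" up to max_reboots times.
        for _ in range(self.max_reboots):
            self.power_cycle()
            if self.is_healthy():
                return "recovered"
        # Only now does a human get paged.
        self.alerts.append(f"{node} still down after {self.max_reboots} reboot(s)")
        return "alerted"
```

The point of the design is that the alert list stays empty for failures a reboot can fix, so the admin only sees the ones that need real attention.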

Hardware Recovery
In this scenario, we'll pick up where the last example left off and add an additional twist. Assume that the issue in the previous example wasn't fixed by rebooting the node. With an internal cloud infrastructure, you can enable the policy for the service to not only reboot the service on failure (to see if that re-establishes the service) but also to swap out the current piece of hardware for a new piece (complete with reconfiguring the network and storage on the fly so that the new node meets the application's requirements).
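The escalation described here (reboot, then swap hardware) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `recover_service`, the `spare_pool` list, and the `boots_ok` callback (which stands in for "power the node on, reconfigure network and storage, and check the monitors") are hypothetical names, not part of any real product.

```python
from typing import Callable, List, Optional


def recover_service(
    node: str,
    spare_pool: List[str],
    boots_ok: Callable[[str], bool],
) -> Optional[str]:
    """Return the node now hosting the service, or None if recovery failed."""
    # Step 1: reboot the node that failed.
    if boots_ok(node):
        return node
    # Step 2: swap in spares from the shared pool until one comes up.
    while spare_pool:
        candidate = spare_pool.pop(0)
        # A real infrastructure would reconfigure network and storage here
        # so the spare meets the application's requirements.
        if boots_ok(candidate):
            return candidate
    # Nothing worked: this is where an alert would go out.
    return None
```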

Let's explore this a bit more. With current IT practices, if you are faced with needing to make a singleton application available (like a workgroup server) you probably start thinking about clustering solutions (Windows or U*nx) that allow you to have a single primary node running and a secondary backup node listening in case of failure. The problem with this approach is that it is costly (you have to buy twice the hardware), your utilization is poor (the backup machine is sitting idle), and you have to budget for twice the cooling and power because the backup machine sits powered on but idle.

Now contrast that with an internal cloud approach, where you have a pool of hardware shared among your applications. In this situation, you get single application hardware availability for free (just configure the policy accordingly). You buy between 1/5th and 1/10th the backup hardware (depending on the number of concurrent failures you want to be resilient to) as the backup hardware can be shared across many applications. Additionally, the spare hardware you do purchase sits powered off while awaiting a failure so it consumes zero power and cooling.
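To put rough numbers on that comparison, here is a back-of-the-envelope Python sketch. The function names and the 50-application figure are made up for illustration; the only inputs taken from the discussion above are "one backup node per application" for clustering and "one shared spare per 5 to 10 applications" for the pool.

```python
import math


def clustered_nodes(apps: int) -> int:
    """Primary plus a dedicated, idle backup for every application."""
    return 2 * apps


def pooled_nodes(apps: int, apps_per_spare: int) -> int:
    """One node per application plus shared spares (1 spare per N apps)."""
    return apps + math.ceil(apps / apps_per_spare)


apps = 50  # illustrative number of singleton applications
print(clustered_nodes(apps))    # 100 nodes, half of them idle but powered on
print(pooled_nodes(apps, 5))    # 60 nodes, spares powered off
print(pooled_nodes(apps, 10))   # 55 nodes, spares powered off
```

How many spares you actually hold is a policy choice: it should match the number of concurrent failures you want to ride through.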

Now, that level of recoverability is interesting, but IT shops typically have more complex n-tier applications where the application being hosted is spread across multiple application tiers (e.g. a database tier, a middleware/service tier and a web tier).

Multi-tier Application Recovery
In more complex application types (including those with four tiers, as is typical with JBoss, WebLogic, or WebSphere), an internal cloud infrastructure continues to help you out. For this example let's take JBoss as our working example as many people have experience with that toolset. JBoss' deployment model for a web application will typically consist of four tiers of services that work in concert to provide the desired application to the end user. There will be a database tier (where the user's data is stored), a service tier that provides the business logic for the application being hosted, and a web tier that will interact with the business logic and dynamically generate the desired HTML page. The fourth and final tier (which isn't in the IT environment) is the user's browser that actually renders the HTML to human-readable format. In this type of n-tier application stack there are implicit dependencies between the different tiers that are usually managed by the admins who know what the dependencies are for the various tiers and, as a result, the correct order for startup/shutdown/recovery (e.g. there is no point in starting up the business or web tiers if the DB is not running).

In the case of an n-tier application, an internal cloud computing infrastructure can help you with manageability, scalability, and recoverability (we're starting to pull out all the "-ilities" now…). We'll cover them in order and close with the recoverability item as that's this week’s theme. On the manageability front, an internal cloud infrastructure can capture the dependencies between the tiers and orchestrate the orderly startup/shutdown of the tiers (e.g. first verify that the DB is running, then start the business logic, and finally finish with the web tier). This means that the specifics of the application are no longer kept in the admin's head, but rather in the tool where any admin can benefit from the knowledge.
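One simple way to capture tier dependencies in a tool rather than in someone's head is a dependency graph plus a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the tier names and the `DEPENDS_ON` mapping are illustrative assumptions, not how any particular product stores this information.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each tier maps to the set of tiers that must be running before it starts.
DEPENDS_ON: Dict[str, Set[str]] = {
    "database": set(),
    "business": {"database"},
    "web": {"business"},
}


def startup_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Tiers in an order where every dependency starts first."""
    return list(TopologicalSorter(deps).static_order())


def shutdown_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Reverse of startup: dependents stop before what they depend on."""
    return list(reversed(startup_order(deps)))


print(startup_order(DEPENDS_ON))   # ['database', 'business', 'web']
print(shutdown_order(DEPENDS_ON))  # ['web', 'business', 'database']
```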

On the n-tier scalability front, usually a horizontal rather than vertical scaling approach is used for the business and web tiers. With an internal cloud infrastructure managing the tiers and using the monitors to determine the required computing capacity (to do this with Cassatt Active Response, we use what we call Demand-Based Policies), the infrastructure will automatically increase/decrease the capacity in each tier as a function of the demand being generated against the tier.
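As a rough idea of what a demand-driven capacity rule could look like, here is a small Python sketch. It is my own simplification and not how Demand-Based Policies are implemented: the function name, the "requests per node" capacity figure, and the min/max bounds are all assumptions for illustration.

```python
import math


def desired_nodes(
    demand: float,
    capacity_per_node: float,
    min_nodes: int,
    max_nodes: int,
) -> int:
    """Nodes a tier should run for the current demand, within policy bounds."""
    # Enough nodes to cover the measured demand...
    needed = math.ceil(demand / capacity_per_node)
    # ...but never below the floor or above the ceiling set by policy.
    return max(min_nodes, min(max_nodes, needed))


# Example: a web tier where each node is assumed to handle 100 requests/sec.
print(desired_nodes(450, 100, min_nodes=2, max_nodes=10))   # 5
print(desired_nodes(50, 100, min_nodes=2, max_nodes=10))    # 2 (floor)
print(desired_nodes(5000, 100, min_nodes=2, max_nodes=10))  # 10 (ceiling)
```

The infrastructure would compare this number against what the tier is currently running and add or release nodes to close the gap.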

Finally, on the recoverability front, everything outlined in the last recovery scenario applies (restart on failure and swap to a new node if that doesn't work), but now you also get the added value of being able to restart services in multiple dimensions. As an example, in many cases connection pooling is used in the business tier to increase the performance of accessing the database. One downside (depending on the solution used for managing the connection pool) is that if the database goes away, then the business tier has to be restarted to re-establish the connections. In a typical IT shop this would mean that the admin would have to manage the recovery across the various tiers. However, in an internal cloud computing environment, the infrastructure has sufficient knowledge to know that if the DB went down, there is no point in trying to restart the failed business tier until the DB has been recovered. Likewise, there is no point in trying to recover the web tier when the business tier is offline. This means that even if the root failure cannot be addressed by the infrastructure (which can happen if the issue is not transient or hardware related) the admin can focus on the recovery of the specific item that has failed and the system will take care of the busywork associated with restoring the complete service.
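The "don't bother restarting a tier until what it depends on is back" rule is easy to express in code. The following Python sketch is illustrative only: the tier names, the `deps` mapping, and the `restartable` helper are hypothetical, and a real infrastructure would drive this from its monitors.

```python
from typing import Dict, Set

# Each tier maps to the tiers it depends on (illustrative three-tier stack).
deps: Dict[str, Set[str]] = {
    "database": set(),
    "business": {"database"},
    "web": {"business"},
}


def restartable(deps: Dict[str, Set[str]], down: Set[str]) -> Set[str]:
    """Failed tiers whose dependencies are all up, so a restart can succeed."""
    return {tier for tier in down if not (deps[tier] & down)}


# Everything is down: only the database is worth restarting right now.
print(restartable(deps, {"database", "business", "web"}))  # {'database'}
# Once the database is back, the business tier becomes restartable.
print(restartable(deps, {"business", "web"}))              # {'business'}
```

Calling this after every recovery event gives you the next wave of tiers to restart, which is exactly the busywork the admin no longer has to do by hand.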

Intra-datacenter recovery
Ok, we're now into hosting non-trivial applications within a cloud infrastructure so let's take that same n-tier application example, but add in the extra complexity that there are now multiple n-tier applications being managed. What we'll do, though, is throw in a datacenter infrastructure failure. This example would be relevant for issues like a Computer Room Air Conditioning (CRAC) unit failure, loss of one of the consolidated UPS units, or a water leak (all of which would not cause a complete failure in a datacenter, but would typically knock out a block of computing capacity).

Before we jump into the failure, we need to explore one more typical practice for IT shops. Specifically, as more applications are added to an IT environment it is not uncommon for the IT staff to begin to stratify the applications into support levels that correspond to the importance the business places on the specific application in question (e.g. revenue systems and customer facing systems typically have a higher availability requirement than, say, a workgroup server or test and development servers). For this example let's say that the IT department has three levels of support/availability that they use, with level one being the highest priority and level three being the lowest. With Cassatt Active Response, you can put this type of policy directly into the application and allow it to optimize the allocation of your computing resources to applications per your defined priorities. With that as background, let’s walk through the failure we outlined above and see what Cassatt Active Response will do for you in the face of a major failure in your datacenter (we're going to take the UPS example for this discussion).

We'll assume prior to the failure that the environment is in a steady state with all applications up and running at the desired levels of capacity. At this point, one of the shared UPS units goes offline, which affects all compute resources connected to that UPS unit. This appears to Cassatt Active Response as a number of node failures that go through the same recovery steps outlined above. However, as mentioned above, usually you will plan for a certain number of concurrent failures and you will keep that much spare capacity available for deployment. Unfortunately, when you lose something shared like a UPS, the number of failures quickly consumes the spare capacity available and you find yourself in an over-constrained situation.

This is where being on a cloud infrastructure really starts to shine. Since you have already identified the priority of the various applications you host in the tool, it can dynamically react to the loss in compute capacity and move resources as necessary to maintain your service levels on your higher priority applications. Specifically, in this example let's assume that 30% of the lost capacity was in your level 1 (most important) applications. The infrastructure will first try to shore up those applications from the excess capacity available, but when that is exhausted, it will start repurposing nodes from lower priority applications in support of re-establishing the service level of the higher priority applications. Now, because the cloud infrastructure can manage the power, network, and images of the applications, it can do all of this gracefully (existing lower priority applications get gracefully shut down prior to their hardware being repurposed) and without user interaction. Within a short period of time (think tens of minutes), the higher priority applications have been re-established to their necessary operating levels.
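Here is a minimal Python sketch of that priority-driven reshuffle: fill shortfalls from the spare pool first, then take nodes from the lowest priority applications. Everything in it is an assumption made for illustration (the `App` record, the `rebalance` function, the application names and node counts); it is not the algorithm Cassatt Active Response actually runs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class App:
    name: str
    level: int    # 1 = highest priority, 3 = lowest
    desired: int  # nodes needed to meet the service level
    running: int  # healthy nodes currently assigned


def rebalance(apps: List[App], spares: int) -> int:
    """Fill shortfalls in priority order; return the spares left over."""
    for app in sorted(apps, key=lambda a: a.level):
        # Donors are strictly lower priority apps, lowest priority first.
        donors = sorted((d for d in apps if d.level > app.level),
                        key=lambda d: -d.level)
        while app.running < app.desired:
            if spares > 0:
                spares -= 1          # use excess capacity first
            else:
                donor = next((d for d in donors if d.running > 0), None)
                if donor is None:
                    break            # nothing left to repurpose
                donor.running -= 1   # graceful shutdown, then repurpose
            app.running += 1
    return spares


# After a UPS failure: billing lost 3 of 4 nodes, only 1 spare is left.
apps = [App("billing", 1, 4, 1), App("wiki", 2, 2, 2), App("test", 3, 3, 3)]
left = rebalance(apps, spares=1)
print([(a.name, a.running) for a in apps], left)
```

In this made-up example the level 1 application gets back to four nodes by taking the one spare plus two nodes from the level 3 application, while the level 2 application is left alone.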

The final part of this example is what occurs when the issue causing the failure is fixed (in our example, the UPS is repaired and power is re-applied to the affected computing resources). With a cloud infrastructure managing your environment, your lower priority applications that were affected in support of shoring up the higher priority applications all just recover automatically. Specifically, once the power is reapplied, all you have to do is mark the hardware as available and the system will do the rest of the work to re-inventory and re-allocate the hardware back into those tiers that are running below the desired capacity levels.
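The hand-back step can be sketched the same way. In this hypothetical Python example, nodes that have been marked available are dealt out to whichever applications are below their desired capacity. Serving the highest priority shortfall first is my assumption, and the `reallocate` function and the node and application names are invented for illustration.

```python
from typing import Dict, List, Tuple


def reallocate(
    returned_nodes: List[str],
    shortfalls: List[Tuple[int, str, int]],
) -> Dict[str, List[str]]:
    """Assign returned nodes to under-capacity apps.

    shortfalls holds (priority level, app name, nodes missing) tuples;
    lower level numbers are served first (an assumption of this sketch).
    """
    pool = list(returned_nodes)
    plan: Dict[str, List[str]] = {}
    for _level, app, missing in sorted(shortfalls):
        take = min(missing, len(pool))
        plan[app] = pool[:take]
        pool = pool[take:]
    return plan


# Three nodes come back after the UPS repair; the test tier is two short.
print(reallocate(["n1", "n2", "n3"], [(3, "test", 2)]))
```

Any nodes left over after every shortfall is covered simply stay in the spare pool, powered off, waiting for the next failure.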

Well, there you have it. We've walked through a variety of failure scenarios within a datacenter and discussed how an internal cloud infrastructure can offload much of the busy/mundane work of recovery. In the next post I'll take the example we just finished and broaden it to include recovery into a completely different datacenter. Until then…
