The folks working on our 3Tera AppLogic product revved up a short video that I thought was a good illustration of a couple of ways customers are using the product to help them.
Plus, honestly, I thought the team came up with some amusing names for the not-so-amusing quandaries that customers are in – the things they are using cloud computing to solve. Add a groovy beat behind it all, and it’s certainly not the worst way to spend 3 minutes and 36 seconds on YouTube.
See if any of these sound familiar for big enterprises:
Time machine. The business needs their applications released now. Sure, they didn’t ask IT to start working on this until, well, now. What they need is a time machine. Or at least a way to help dramatically accelerate their speed to market. “Delay is not an option.” Oh, gee, thanks.
New markets/old problems. You need your applications rolled out in new places around the world. Really, this kind of replication sounds like it should be simple. I mean, they are the same applications, after all. And it is simple -- unless you’re the guy trying to help Bangalore do all this remotely from Chicago.
Full plate. Those geniuses in marketing (hey!) are throwing requirements at IT that are going to stretch the infrastructure as it is. Then they add more. It’s a big problem that needs on-demand scalability. A lot of it.
(OK, so don’t expect it to be as amusing as the conference call spoof Dave Grady did that’s going around. But that’s pretty hard to live up to.)
Here’s the video:
Hint: I don’t think I’d be giving anything away if I told you that each of these scenarios has a happy ending. That’s why we brought the 3Tera guys onboard to be part of a cloud solution for customers, after all.
Any good ones they missed? Comments welcome.
Monday, March 23, 2009
Internal clouds and a better way to do recovery to a back-up datacenter
Posted by Craig Vosburgh at 8:13 AM
Last post, we talked about a variety of failures within a datacenter and how an internal cloud infrastructure would help you provide a better level of service to your customers at a lower cost. In this post, we're on to the final use case for our discussion of recovery capabilities enabled by using an internal cloud infrastructure -- and I think we've left the best for last.
In the wake of 9/11, and to respond to SOX compliance issues, many companies have been working on catastrophic disaster recovery solutions for the event that a datacenter becomes unavailable. This kind of massive failure is where a cloud computing infrastructure really shines, as it enables capabilities that to date were unattainable due to the cost and complexity of the available solutions. We'll build on the previous example (multiple applications of varying importance to the organization), but this time the failure will be one in which the entire datacenter hosting the applications becomes unavailable (you can use your imagination as to what causes these kinds of failures…).
Let's lay the groundwork for this example and describe a few more moving parts required to effect the solution, once again using a Cassatt implementation as the reference point. First, the datacenters must have a data replication mechanism in place, as the solution relies on the data/images being replicated from the primary site to the backup site. The ideal approach is a two-phase commit: because writes to the primary datacenter are committed to the backup datacenter at the same time, a failure loses no data (other than transactions in flight, which roll back). While this is the preferred approach, if you can relax your data coherency requirements (such that the backup site's data is within 30-60 minutes of the primary site's), then the required technology and cost can be reduced substantially by using one of the myriad non-realtime replication technologies offered by the storage vendors.
The second requirement of the solution is that the IP addresses must stay the same across the recovery sites (meaning that when the primary site is recovered to the secondary site, it comes up in the same IP address space it had when running in the primary datacenter). The reason for this requirement is that many applications write the node's IP address into local configuration files, which are very difficult to find and prohibitively complex to update during a failure recovery. (Think of updating thousands of these while in the throes of performing a datacenter recovery, how likely it is that at least a few mistakes will be made, and then how difficult it would be to actually find those mistakes.) We've learned that we end up with a much more stable and easier-to-debug recovery solution if we keep the addresses constant.
On the interesting-topics front, there are two items that are perhaps unexpectedly not required for the solution to work. First, the backup datacenter is not required to have hardware identical to the primary site's (both the type of hardware and the quantities can differ). Second, the backup datacenter can host other, lower-priority applications for the company when it is not being used for recovery (so your investment is not just sitting idle waiting for the rainy day to happen, but is instead contributing to generating revenue).
With those requirements and background out of the way, let's walk through the failure and see how the recovery works. Again, we'll start with the assumption that everything is up and running in a steady state in the primary datacenter when the failure occurs. For this failure, the failover process is manually initiated. (While we could automate the failover, recovering one datacenter into another just seems like too big a business decision to leave to automation, so we require the user to initiate the process.) Once the decision is made to recover the datacenter into the backup site, the operator simply runs a program that performs the following steps (sketched in code after the list):
- Gracefully shut down any applications still running in the primary datacenter (depending on the failure, not all services may have failed, so we must start by quiescing the systems).
- Gracefully shut down the low-priority applications running in the backup datacenter in preparation for recovering the primary datacenter applications.
- Set aside the backup datacenter's data so that we can come back to it later, once the primary datacenter is recovered and we migrate the payload back, at which point we'll want to recover the applications that were originally running in the backup datacenter. There isn't anything special being done to set the data aside; in practice, this just means unmounting the secondary datacenter's storage from the control node.
- Update the backup datacenter's network switch and routing information so that the switches know about the production site's network configuration, and update the backbone routers and related gear so they know about the change in location.
- Mount the replicated data store(s) into place. This gives the control node in our Cassatt-based example access to the application topology and requirements needed to recover the applications into the new datacenter.
- Remove all existing hardware definitions from the replicated database. We keep all of the user-defined policies that describe the server, storage, and networking requirements of the applications. However, because the database we are recovering includes the hardware definitions from the primary datacenter, and none of that hardware exists in the secondary datacenter, we must remove them prior to starting the recovery so that the system is forced to go through its hardware allocation steps. These steps are important because they map the application priorities and requirements onto the hardware actually available in the backup datacenter.
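Taken together, the steps read like a short runbook, so here's a minimal sketch of what the driver program might look like. Every helper name in it (shutdown_gracefully, update_network_fabric, delete_hardware_records, and so on) is an illustrative assumption, not a Cassatt API.

```python
# Hypothetical failover driver, mirroring the steps listed above.
def fail_over_to_backup_site(primary, backup, replicated_store):
    # 1. Quiesce anything still running at the (partially) failed primary site.
    for app in primary.reachable_applications():
        app.shutdown_gracefully()

    # 2. Make room at the backup site for the primary site's payload.
    for app in backup.low_priority_applications():
        app.shutdown_gracefully()

    # 3. Set the backup site's own data aside for later fail-back: in
    #    practice, just unmount its storage from the control node.
    backup.control_node.unmount(backup.local_store)

    # 4. Teach the backup site's switches and backbone routers the
    #    production network layout.
    backup.update_network_fabric(primary.network_config)

    # 5. Mount the replicated store so the control node can see the
    #    application topology, images, and policies from the primary site.
    backup.control_node.mount(replicated_store)

    # 6. Scrub primary-site hardware definitions (keeping user policy) so
    #    allocation is forced to map applications onto local hardware.
    replicated_store.database.delete_hardware_records()
```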
Once these steps are completed, the recovery logic in the cloud infrastructure is started and the recovery begins. The first thing the cloud infrastructure controller must do is to inventory the hardware in the secondary datacenter to determine the type and quantities available. Once the hardware is inventoried, the infrastructure takes the user-entered policy in the database and determines what applications have the highest priorities. It begins the allocation cycle to re-establish the SLAs on those applications. As hardware allocations complete, the infrastructure will again consult the stored policy from the database to keep intact the dependencies between the various applications, starting them in the required order to recover the applications successfully. This cycle of inventory, allocation, and activation will continue until either all of the applications have been recovered (in priority/dependency order) or until the environment becomes hardware constrained (meaning that there is insufficient hardware of the correct type to meet the needs of the applications being recovered).
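In pseudo-Python, that cycle might boil down to the sketch below; the controller object and its methods are stand-ins invented for illustration, and priority 1 is taken to be the most important.

```python
# Sketch of the inventory / allocation / activation cycle.
def recover_applications(controller, policy):
    available = controller.inventory_hardware()    # types and quantities on site
    for app in sorted(policy.applications, key=lambda a: a.priority):
        needed = app.hardware_requirements()
        if not available.can_satisfy(needed):
            break                                  # hardware-constrained: stop here
        nodes = available.allocate(needed)
        controller.wait_for_dependencies(app)      # honor inter-application ordering
        controller.activate(app, nodes)            # start tiers in the required order
```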
The same approach outlined above is done in reverse when the primary datacenter is recovered and the applications need to be migrated back to their original locations. Once the applications are recovered back to the primary datacenter, the applications that were originally running in the backup datacenter can be recovered by simply putting the storage mounts back in place and restarting the control node. In this case no extra scrubbing steps are required, as the hardware has not changed this time. After restarting the control node, the applications are recovered just as if a power outage had happened. Once restarted, the applications will pick up exactly where they had left off prior to the primary datacenter failure.
Thanks for taking the time to read and I hope this post has you thinking about some of the major transformational benefits that your organization can receive from adopting an internal cloud infrastructure for running your IT environment. My next installment will be a discussion of how an internal cloud infrastructure's auditing and tracking capabilities can provide your organization an unparalleled view into how your resources are being used. We'll then explore how this type of information can enable you to provide your business units with billing reports that show exactly what resources their applications used and when for any given month.
Friday, February 27, 2009
Recoverability: How an internal cloud makes it easier to get your apps back up and running after a failure
Posted by Craig Vosburgh at 9:16 AM
OK, back to talking about the "-ilities" this week and how cloud computing can help you address one of the key issues you are concerned with in your datacenter. On deck is the recoverability of your IT environment when run on an internal cloud infrastructure (like Cassatt Active Response).
As discussed in my last post, there can be a fair amount of organizational and operational change required to adopt an internal cloud infrastructure, but there are many benefits from taking on the task. The next couple of posts will outline one of the major benefits (recoverability) that comes from separating the applications from their network, computing, and storage resources, and how this separation allows for both intra-datacenter and inter-datacenter recoverability of your applications.
Five Internal Cloud Recovery Scenarios
To walk you through the discussion, I'm going to take you through a number of levels of recoverability as a function of the complexity of the application and the failure being addressed. Taking this approach, I've come up with five different scenarios that start with the recovery of a single application and end with recovery of a primary datacenter into a backup datacenter. The first four scenarios (intra-datacenter recovery) are covered in this post and the last one (inter-datacenter recovery) will be covered in my next post. So, enough background: let’s get into the discussion.
Application Recovery
Let's begin with the simplest level of recoverability and start with a single application (think of a single workgroup server for a department, running things like a wiki, mail, or a web server). From talking to many admins over the years, the first thing they do when they find that a given server has dropped offline is perform a "therapeutic reboot" to see if that gets everything back to a running state. The reality of IT is that many of the frameworks/containers/applications you run leak memory slowly over time, and a reboot is the easiest way to clean up the issue.
In a traditional IT environment, a monitoring tool watches the servers under management; if the monitors go offline, an alert/page is generated to tell the admin that something is wrong and needs their attention. With an internal cloud infrastructure in place, the creation of monitors for each managed server comes for free (you just select in the UI how you want to monitor a given service from a list of supported monitors). In addition, when the monitor(s) drop offline, you can configure policy to tell the system how you want the failure handled.
In the case of an application that you'd like rebooted prior to considering it failed, you simply instruct the internal cloud infrastructure to power cycle the node and to send an alert only if it doesn't recover correctly. While this simple level of recoverability is interesting in making an admin more productive (less chasing unqualified failures), it really isn't that awe-inspiring, so let's move on to the next level of recovery.
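To make that concrete, the policy for such a service might look something like this sketch; the field names are my own illustration of the idea, not the actual Cassatt Active Response schema.

```python
# Illustrative recovery policy for a single workgroup server.
wiki_policy = {
    "service": "dept-wiki",
    "monitors": ["icmp-ping", "http:80"],  # chosen from the supported monitor list
    "on_failure": [
        "power_cycle",                     # the "therapeutic reboot", automated
        "alert_if_still_down",             # page the admin only if that fails
    ],
}
```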
Hardware Recovery
In this scenario, we'll pick up where the last example left off and add a twist. Assume that the issue in the previous example wasn't fixed by rebooting the node. With an internal cloud infrastructure, you can set the policy for the service to not only reboot it on failure (to see if that re-establishes the service) but also to swap out the current piece of hardware for a new one (complete with reconfiguring the network and storage on the fly so that the new node meets the application's requirements).
Let's explore this a bit more. With current IT practices, if you are faced with needing to make a singleton application available (like a workgroup server), you probably start thinking about clustering solutions (Windows or U*nx) that give you a single primary node running and a secondary backup node listening in case of failure. The problem with this approach is that it is costly (you have to buy twice the hardware), your utilization is poor, and you have to budget for twice the power and cooling because the backup machine sits on, but idle.
Now contrast that with an internal cloud approach, where you have a pool of hardware shared among your applications. In this situation, you get single-application hardware availability for free (just configure the policy accordingly). You buy between 1/5th and 1/10th the backup hardware (depending on the number of concurrent failures you want to be resilient to), since the backup hardware can be shared across many applications. Additionally, the spare hardware you do purchase sits powered off while awaiting a failure, so it consumes zero power and cooling.
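The arithmetic behind those ratios is worth a quick sanity check, using assumed figures:

```python
# Back-of-the-envelope spare sizing, with assumed numbers.
apps = 100                # single-node applications to protect
failure_budget = 10       # simultaneous failures you want to survive

clustered_spares = apps           # 1:1 active/passive clustering
pooled_spares = failure_budget    # shared pool sized to the failure budget

print(pooled_spares / clustered_spares)   # 0.1 -> the "1/10th" figure above
```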
Now, that level of recoverability is interesting, but IT shops typically have more complex n-tier applications where the application being hosted is spread across multiple application tiers (e.g. a database tier, a middleware/service tier and a web tier).
Multi-tier Application Recovery
In more complex application types (including those with four tiers, as is typical with JBoss, WebLogic, or WebSphere), an internal cloud infrastructure continues to help you out. Let's take JBoss as our working example, since many people have experience with that toolset. JBoss' deployment model for a web application typically consists of four tiers of services that work in concert to provide the desired application to the end user. There is a database tier (where the user's data is stored), a service tier that provides the business logic for the application being hosted, and a web tier that interacts with the business logic and dynamically generates the desired HTML page. The fourth and final tier (which isn't in the IT environment) is the user's browser, which actually renders the HTML into human-readable form. In this type of n-tier application stack there are implicit dependencies between the tiers, usually managed by the admins who know what the dependencies are and, as a result, the correct order for startup/shutdown/recovery (e.g. there is no point in starting up the business or web tiers if the DB is not running).
In the case of an n-tier application, an internal cloud computing infrastructure can help you with manageability and scalability as well as recoverability (we're starting to pull out all the "-ilities" now…). We'll cover them in order and close with recoverability, as that's this week's theme. On the manageability front, an internal cloud infrastructure can capture the dependencies between the tiers and orchestrate their orderly startup/shutdown (e.g. first verify that the DB is running, then start the business logic, and finally finish with the web tier). This means the specifics of the application are no longer kept in the admin's head, but in the tool, where any admin can benefit from the knowledge.
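Here's a minimal sketch of that kind of dependency capture and ordered startup, using the tier names from the JBoss example; the simple topological sort is my illustration, not the product's implementation.

```python
# Tier -> list of tiers it depends on.
TIERS = {
    "database": [],
    "business-logic": ["database"],
    "web": ["business-logic"],
}

def start_order(tiers):
    """Start a tier only after everything it depends on has started."""
    started, order = set(), []
    while len(order) < len(tiers):
        for tier, deps in tiers.items():
            if tier not in started and all(d in started for d in deps):
                started.add(tier)
                order.append(tier)
    return order

print(start_order(TIERS))   # ['database', 'business-logic', 'web']
```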
On the n-tier scalability front, usually a horizontal rather than vertical scaling approach is used for the business and web tiers. With an internal cloud infrastructure managing the tiers and using the monitors to determine the required computing capacity (to do this with Cassatt Active Response, we use what we call Demand-Based Policies), the infrastructure will automatically increase/decrease the capacity in each tier as a function of the demand being generated against the tier.
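Boiled down, a Demand-Based Policy amounts to a control loop along these lines; the thresholds and helper names are assumptions made for the sketch.

```python
# Sketch of demand-driven horizontal scaling for one tier.
def rebalance_tier(tier, monitor):
    load = monitor.average_load(tier)     # e.g., requests/sec per node
    if load > tier.high_water_mark and tier.size < tier.max_nodes:
        tier.add_node()                   # pull a node from the free pool
    elif load < tier.low_water_mark and tier.size > tier.min_nodes:
        tier.remove_node()                # power it off, return it to the pool
```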
Finally, on the recoverability front, everything outlined in the last recovery scenario applies (restart on failure, and swap to a new node if that doesn't work), but now you also get the added value of being able to restart services in multiple dimensions. As an example, connection pooling is often used in the business tier to increase the performance of accessing the database. One downside (depending on the solution used for managing the connection pool) is that if the database goes away, the business tier has to be restarted to re-establish the connections. In a typical IT shop, the admin would have to manage the recovery across the various tiers. In an internal cloud computing environment, however, the infrastructure has sufficient knowledge to know that if the DB went down, there is no point in trying to restart the failed business tier until the DB has been recovered. Likewise, there is no point in trying to recover the web tier while the business tier is offline. This means that even if the root failure cannot be addressed by the infrastructure (which can happen if the issue is not transient or hardware-related), the admin can focus on recovering the specific item that failed, and the system will take care of the busywork associated with restoring the complete service.
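The gating logic is simple once the dependency map exists. Reusing the TIERS map from the startup sketch above (is_healthy and restart are again illustrative stand-ins):

```python
# Only attempt to restart a tier once everything it depends on is healthy.
def try_recover(name, tiers):
    if all(tiers[dep].is_healthy() for dep in TIERS[name]):
        tiers[name].restart()   # safe: upstream services are back
    # otherwise hold off -- restarting now would just fail again
```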
Intra-datacenter recovery
OK, we're now into hosting non-trivial applications within a cloud infrastructure, so let's take that same n-tier application example but add the extra complexity that there are now multiple n-tier applications being managed. What we'll do, though, is throw in a datacenter infrastructure failure. This example would be relevant for issues like a Computer Room Air Conditioning (CRAC) unit failure, loss of one of the consolidated UPS units, or a water leak (all of which would not cause a complete datacenter failure, but would typically knock out a block of computing capacity).
Before we jump into the failure, we need to explore one more typical practice for IT shops. As more applications are added to an IT environment, it is not uncommon for the IT staff to begin stratifying the applications into support levels corresponding to the importance the business places on each application (e.g. revenue systems and customer-facing systems typically have a higher availability requirement than, say, a workgroup server or test and development servers). For this example, let's say the IT department uses three levels of support/availability, with level one being the highest priority and level three the lowest. With Cassatt Active Response, you can put this type of policy directly into the tool and allow it to optimize the allocation of your computing resources to applications per your defined priorities. With that as background, let's walk through the failure we outlined above and see what Cassatt Active Response will do for you in the face of a major failure in your datacenter (we're going to take the UPS example for this discussion).
We'll assume that prior to the failure the environment is in a steady state, with all applications up and running at the desired levels of capacity. At this point, one of the shared UPS units goes offline, which affects all compute resources connected to that UPS unit. This appears to Cassatt Active Response as a number of node failures, which go through the same recovery steps outlined above. However, as mentioned above, you will usually plan for a certain number of concurrent failures and keep that much spare capacity available for deployment. Unfortunately, when you lose something shared like a UPS, the number of failures quickly consumes the available spare capacity and you find yourself in an over-constrained situation.
This is where being on a cloud infrastructure really starts to shine. Since you have already identified the priority of the various applications you host in the tool, it can dynamically react to the loss in compute capacity and move resources as necessary to maintain service levels on your higher-priority applications. Specifically, in this example let's assume that 30% of the lost capacity was in your level 1 (most important) applications. The infrastructure will first try to shore up those applications from the excess capacity available, but when that is exhausted, it will start repurposing nodes from lower-priority applications to re-establish the service level of the higher-priority ones. Because the cloud infrastructure can manage the power, network, and images of the applications, it can do all of this gracefully (existing lower-priority applications are gracefully shut down before their hardware is repurposed) and without user interaction. Within a short period of time (think tens of minutes), the higher-priority applications have been re-established to their necessary operating levels.
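Here's a sketch of that repurposing decision, with priority 1 as most important; as in the other examples, the object model is invented for illustration.

```python
# Shore up degraded applications, most important first, harvesting nodes
# from the least important applications once the free pool runs dry.
def restore_service_levels(pool, applications):
    degraded = sorted((a for a in applications if a.below_sla()),
                      key=lambda a: a.priority)
    for app in degraded:
        while app.below_sla():
            node = pool.take_matching(app.requirements())
            if node is None:
                donors = [a for a in applications
                          if a.priority > app.priority and a.size > 0]
                if not donors:
                    break                          # truly out of hardware
                donor = max(donors, key=lambda a: a.priority)
                node = donor.release_node()        # graceful shutdown first
            app.add_node(node)
```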
The final part of this example is what occurs when the issue causing the failure is fixed (in our example, the UPS is repaired and power is re-applied to the affected computing resources). With a cloud infrastructure managing your environment, the lower-priority applications that were affected in support of shoring up the higher-priority applications all just recover automatically. Specifically, once the power is reapplied, all you have to do is mark the hardware as available, and the system will do the rest of the work to re-inventory and re-allocate the hardware back into the tiers that are running below their desired capacity levels.
Well, there you have it. We've walked through a variety of failure scenarios within a datacenter and discussed how an internal cloud infrastructure can offload much of the busy/mundane work of recovery. In the next post I'll take the example we just finished and broaden it to include recovery into a completely different datacenter. Until then…
Tuesday, February 10, 2009
Is your organization ready for an internal cloud?
Posted by Craig Vosburgh at 11:17 AM
I'm going to take a post off from the "-ilities" discussion (Disaster Recoverability will be up next) and spend a little time talking about the technical and organizational challenges that many Fortune 1000 companies will face in their move to an internal cloud computing infrastructure.
Since I've lived through a number of these customer engagements over the past few years, I thought I'd write up my cheat sheet of the things you need to be aware of and get worked out in your organization before you embark on a move to cloud computing. If you don't pay attention to these, I predict you'll be frustrated along the way by the organizational tension that such a move will cause.
Internal clouds are a game-changer on the economics front for companies that have large investments in IT. Creating a cloud-style architecture inside your own datacenter allows a company to unlock the excess capacity of its existing infrastructure (often locked into vertical stovepipes supporting specific applications). Once this excess capacity is released for use within the IT environment, the company can decrease its capital purchasing until the newfound capacity is consumed by new applications. In addition to the capital savings from not having to purchase new hardware to run new applications, there are substantial power savings as well, since the excess capacity is no longer powered up all the time, and is instead brought online only when needed.
Now, this sounds like motherhood and apple pie to me. Who wouldn't want to move to an IT environment that allowed this type of flexibility/cost savings? Well, it turns out that in many large companies the inertia of "business as usual" gets in the way of making this type of transformational change, even if they could see huge benefits in the form of business agility and cost savings (two things that almost everyone is looking for in this current economy to make them more competitive than the next guy).
If you find yourself reading the list below and going "no way, not in my company," then I'd posit that while you may desire the virtues of cloud computing, you'll really struggle to succeed. The limits you'll be placing on the solution due to the organizational issues will mean that many of the virtues of a cloud approach are lost (I'll try to call out a few as I walk through the list to give you an idea of the trade-offs involved).
What you’ll need to be successful at internal cloud computing:
• Organizational willingness to embrace change. To fully adopt an internal cloud infrastructure, the system, network, and storage admins -- along with application developers -- are all going to have to work together, as each group brings its specialty to bear on the hosting requirements of the application being deployed into the cloud. Also, if your organization likes to know exactly where an application is running at all times, then cloud computing is only going to be frustrating. In cloud computing, the environment is continually being monitored and optimized to meet the business needs (we call them service-level agreements in Cassatt Active Response). This means that while at any point in time you can know what is running where, that information is only accurate for that instant. Why? In the next instant, something may have failed or an SLA may have been breached, causing the infrastructure to react and change the allocation of applications to resources. Bottom line: be willing to embrace change in processes, policies, roles, and responsibilities, or you'll never be successful with a cloud computing solution.
• Willingness to network boot (PXE/BOOTP/DHCP) the computing resources. One of the major value propositions of an internal cloud computing approach is the ability to re-use hardware for different tasks at different times. To allow for this rapid re-deployment of resources, you can't use traditional approaches that image a node's local disk (it takes a long time to copy a multi-GB image across the network, and once that's done, if the node fails the data is lost, since it resides on the local disk). Instead, the use of standard network protocols (NFS/iSCSI) allows for real-time association of the image to the hardware. This approach also has the byproduct of allowing for very fast failure recovery times: once a node is found to have failed, it only takes a few seconds to associate the image to a new node and start the boot. (We recover failed nodes in the time it takes to boot a node, plus a few seconds to effect the re-association of the image to its new compute resource; there's a sketch of this after the list.)
• Computing resources that support remote power management, either through on-board or off-board power controllers. For all of this dynamism to work, the internal cloud infrastructure must be able to power-control the nodes under management, so that a node can be powered down when not needed (saving power) and powered back up when necessary. Most recent computing hardware has on-board controllers specifically for this task (iLO on HP, DRAC on Dell, ALOM/ILOM on Sun…), and the cloud computing infrastructure simply uses these existing interfaces to effect the power operations (a power-control sketch also follows the list). If you find yourself in an environment with older hardware that lacks this support, don't despair: numerous vendors manufacture external Power Distribution Units (PDUs) that can provide the necessary power management for otherwise "dumb" compute nodes.
• Understand that your current static Configuration Management Database (CMDB) becomes real-time/dynamic in an internal cloud computing world. I touched on this under the "embrace change" bullet above, but it's worth calling out specifically. In a cloud computing world where you have pools of capacity (computing, network, and storage) that are associated in real time with the applications that need them, NOTHING is constant. As an example, depending on how you set up your policy, a failure of a node in a higher-priority service will cause a node to be allocated from the free pool. If one is not available that matches the application's requirements (number of CPUs, disks, network cards, memory…), then a suitable replacement may be "borrowed" from a lower-priority service. This means your environment is always changing and evolving to meet your business needs, and you'll only find yourself frustrated if you don't let go of the static-allocation mindset.
• Understand that much of the network configuration will be handled by the internal cloud infrastructure. Now this doesn't necessarily mean your core switches have to be managed by the cloud infrastructure. However, if you want the ability to allocate new compute capacity to an application that has specific networking requirements (like a web server would have if you want it behind a load balancer), then the infrastructure must reconfigure the ports connected to the desired node to be on the right network. This issue can be a show-stopper to providing a full cloud computing solution so talk about this one early and often with your network team so they have input on how they want to architect the network and have time to become comfortable with the level of control required by the infrastructure.
• Boot image storage on the IP network. As I mentioned above, for internal cloud computing to work, you have to separate the application image from the physical node so that the image can be relocated on the fly as necessary to meet your policies. We currently leverage NFS for this separation, as it can easily be configured to support this level of dynamism. Also, using a NAS allows you to leverage a single IP network, reducing cost and complexity since redundant data paths only have to be engineered for the IP network rather than for both the IP and storage networks. I don't mention SAN for the boot device because it can be problematic to move LUNs around on the fly due to the myriad of proprietary vendor switch-management APIs. In addition, every server out there ships with at least two on-board NICs, while SAN HBAs are aftermarket add-ons (to the tune of hundreds of dollars if you want redundant channel bonding). Now, this single-network approach comes with a downside that I'm going to be upfront about: currently, IP networks are in the 1 Gigabit range with 10 Gigabit on the horizon, while SAN networks are 2-4 Gigabit (you can bond or trunk multiple Ethernets together for better throughput, but we'll leave that aside for this discussion). If you have a very disk-intensive application, you'll need to architect the solution to work within a cloud infrastructure. Specifically, you don't want to be sending that traffic across to a NAS, as the throughput can suffer due to the limited bandwidth. You should look to use either local tmp space (if you only need temporary storage) or locally attached SAN that is zoned to a specific set of nodes that can act as backups to one another in case of failure.
• Applications that run on commodity hardware. Internal cloud computing provides much more benefit when the applications to be managed run on commodity hardware. This hardware could be x86, POWER, or SPARC, depending on your environment, but should be the lower- to mid-level servers. It doesn't make a lot of sense to put something like a Sun Fire E25K into a cloud infrastructure, as it is already built to address scalability/availability issues within the chassis and has built-in high-speed interconnects for fast data access. With commodity hardware comes numbers, and that is where a cloud infrastructure really shines, as it dramatically increases the span of control for the operators: they manage more at the SLA level and less at the node level.
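As promised in the network-boot bullet, here's a sketch of why that approach recovers nodes so quickly: re-pointing an image at a new node is metadata work, not a multi-gigabyte copy. Every name here (export_image, free_pool, and the rest) is a hypothetical stand-in, not a real API.

```python
# Hypothetical sketch: recovering a failed, network-booted node.
def recover_failed_node(controller, failed_node):
    image = failed_node.image                  # lives on NFS/iSCSI, not a local disk
    spare = controller.free_pool.take_matching(failed_node.profile)
    controller.export_image(image, to=spare)   # seconds: just an export change
    spare.power_on()                           # PXE boots straight into the image
```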
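And for the power-management bullet, a minimal power-control sketch. It shells out to ipmitool, a real CLI that can talk to most baseboard management controllers; the wrapper function itself is illustrative, and you'd adjust the interface and credentials for your environment.

```python
import subprocess

def set_power(bmc_host: str, user: str, password: str, state: str) -> None:
    """Power a node 'on' or 'off' via its baseboard management controller."""
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host,
         "-U", user, "-P", password, "chassis", "power", state],
        check=True,
    )

# Example: set_power("node42-bmc.example.com", "admin", "secret", "off")
```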
Well, that's a lot to digest, so I think I'll stop now. I'm not trying to scare anyone away from moving to an internal cloud computing infrastructure; on the contrary, I believe it is the future of IT computing. The current "best" practices are, to a large extent, responsible for the management and over-provisioning issues facing most IT shops. However, to address the over-provisioning and benefit from a continuously optimized computing environment, where your excess capacity is efficiently allocated to the applications that need it (instead of sitting idle in a specific stovepipe), you need to understand the fundamental issues that stand ahead of you. In your transition to internal cloud computing, you will need to actively work to address these new issues, or you will find yourself with just a different set of problems to chase than the ones you have now.
Since I've lived through a number of these customer engagements over the past few years I thought I'd write up my cheat sheet that basically outlines the things that you need to be aware of and get worked out in your organization before you embark on a move to cloud computing. If you don't pay attention to these, I predict you'll be frustrated along the way by the organizational tension that a move such as this will cause.
Internal clouds are a game-changer on the economics front for companies that have large investments in IT. Creating a cloud-style architecture inside your own data center allows the company to unlock the excess capacity of their existing infrastructure (often locked into vertical stovepipes supporting specific applications.) Once this excess capacity is released for use within the IT environment, the company can then decrease their capital purchasing until the newfound capacity is consumed by new applications. In addition to the capital savings as a result of not having to purchase new hardware to run new applications, there are substantial power savings to the company as well, since the excess capacity is no longer powered up all the time, and is instead brought online only when needed.
Now, this sounds like motherhood and apple pie to me. Who wouldn't want to move to an IT environment that allowed this type of flexibility/cost savings? Well, it turns out that in many large companies the inertia of "business as usual" gets in the way of making this type of transformational change, even if they could see huge benefits in the form of business agility and cost savings (two things that almost everyone is looking for in this current economy to make them more competitive than the next guy).
If you find yourself reading the list below and going "no way, not in my company," then I'd posit that while you may desire the virtues of cloud computing, you'll really struggle to succeed. The limits you'll be placing on the solution due to the organizational issues will mean that many of the virtues of a cloud approach will be lost (I'll try to call out a few while I walk though the list to give you an idea of the trade-offs involved).
What you’ll need to be successful at internal cloud computing:
• Organizational willingness to embrace change. To fully adopt an internal cloud infrastructure, the system, network, and storage admins -- along with application developers -- are all going to have to work together as each group brings their specialty to bear on the hosting requirements of the application being deployed into the cloud. Also, if your organization likes to know exactly where an application is running all the time then cloud computing is only going to be frustrating. In cloud computing, the environment is continually being monitored and optimized to meet the business needs (we call them service-level agreements in Cassat Active Response.) This means that while at any point in time you can know what is running where, that information is only accurate for that instant. Why? In the next instant, something may have failed or an SLA may have been breached, causing the infrastructure to react, changing the allocation of applications to resources. Bottom line, be willing to embrace change in processes, policies, roles and responsibilities or you'll never be successful with a cloud computing solution.
• Willingness to network boot (PXE/BOOTP/DHCP) the computing resources. One of the major value propositions of an internal cloud computing approach is the ability to re-use hardware for different tasks at different times. To allow for this rapid re-deployment of resources, you can't use traditional approaches for imaging a node's local disk (it takes a long time to copy a multi-GB image across the network, and once done, if that node fails, the data is lost, since it resides on the local disk). Instead, the use of standard network protocols (NFS/iSCSI) allows for the real-time association of the image to the hardware. This approach also has the byproduct of allowing for very fast failure recovery times. Once a node is found to have failed, it only takes a few seconds to associate the image to a new node and start the boot (we recover failed nodes in the time it takes to boot a node, plus a few seconds to effect the re-association of the image to its new compute resource). The first sketch after this list illustrates the idea.
• Computing resources that support remote power management, either through on-board or off-board power controllers. For all of this dynamism to work, the internal cloud infrastructure must be able to power-control the nodes under management, so that a node can be powered down when not needed (saving power) and powered back up when necessary. Most recent computing hardware has on-board controllers specifically for this task (iLO on HP, DRAC on Dell, ALOM/ILOM on Sun…), and the cloud computing infrastructure simply uses these existing interfaces to effect the power operations. If you find yourself in an environment with older hardware that lacks this support, don't despair. Numerous vendors manufacture external Power Distribution Units (PDUs) that can provide the necessary power management for otherwise "dumb" compute nodes (see the second sketch after this list).
• Understand that your current static Configuration Management Database (CMDB) becomes real-time/dynamic in an internal cloud computing world. I touched on this under the "embrace change" bullet above, but it's worth calling out specifically. In a cloud computing world where you have pools of capacity (computing, network, and storage) that are associated in real time with the applications that need that capacity to provide service, NOTHING is constant. As an example, depending on how you set up your policy, a failure of a node in a higher-priority service will cause a node to be allocated from the free pool. If one is not available that matches the application's requirements (number of CPUs, disks, network cards, memory…), then a suitable replacement may be "borrowed" from a lower-priority service (the third sketch after this list illustrates the borrowing). This means your environment is always changing and evolving to meet your business needs, and you'll only frustrate yourself if you hold on to a static-allocation mindset.
• Understand that much of the network configuration will be handled by the internal cloud infrastructure. Now, this doesn't necessarily mean your core switches have to be managed by the cloud infrastructure. However, if you want the ability to allocate new compute capacity to an application that has specific networking requirements (like a web server would have if you want it behind a load balancer), then the infrastructure must reconfigure the ports connected to the desired node to be on the right network (the fourth sketch after this list shows the idea in miniature). This issue can be a show-stopper for providing a full cloud computing solution, so talk about it early and often with your network team, so they have input on how they want to architect the network and have time to become comfortable with the level of control required by the infrastructure.
• Boot image storage on the IP network. As I mentioned above, for internal cloud computing to work, you have to separate the application image from the physical node so that the image can be relocated on the fly as necessary to meet your policies. We currently leverage NFS for this separation, as it can easily be configured to support this level of dynamism. Also, using a NAS lets you leverage a single IP network and reduces cost/complexity, since redundant data paths only have to be engineered for the IP network rather than for both the IP and storage networks. I don't mention SAN for the boot device because it can be problematic to move LUNs around on the fly due to the myriad of proprietary vendor switch management APIs. In addition, every server out there ships with at least two on-board NICs, while SAN HBAs are aftermarket add-ons (to the tune of hundreds of dollars if you want redundant channel bonding). Now, this single-network approach comes with a downside that I'm going to be upfront about: currently, IP networks are in the 1 Gigabit range with 10 Gigabit on the horizon, while SAN networks run 2-4 Gigabit (you can bond or trunk multiple Ethernets together for better throughput, but we'll leave that aside for this discussion). If you have a very disk-intensive application, you'll need to architect the solution to work within a cloud infrastructure. Specifically, you don't want to be sending that traffic across to a NAS, as throughput can suffer due to the limited bandwidth; the final sketch after this list works through the arithmetic. You should look to either use local tmp space (if you only need temporary storage) or locally attached SAN that is zoned to a specific set of nodes that can act as backups to one another in case of failure.
• Applications that run on commodity hardware. Internal cloud computing provides much more benefit when the applications to be managed run on commodity hardware. This hardware could be x86, POWER, or SPARC, depending on your environment, but should be the lower- to mid-range servers. It doesn't make a lot of sense to put something like a Sun SPARC F25K into a cloud infrastructure, as it is already built to address scalability/availability issues within the chassis and has built-in high-speed interconnects for fast data access. With commodity hardware come numbers, and that is where a cloud infrastructure really shines: it dramatically increases the operators' span of control, letting them manage more at the SLA level and less at the node level.
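First, a minimal sketch of the image re-association idea from the network-boot bullet. This is the concept, not Cassatt's implementation, and every name in it is hypothetical: because the image lives on networked storage rather than on a local disk, recovering from a node failure is just re-pointing the image at a healthy spare and booting it.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    name: str
    healthy: bool = True

@dataclass
class BootImage:
    name: str
    nfs_path: str                      # networked location the node mounts at boot
    node: Optional[Node] = None

def recover(image: BootImage, free_pool: List[Node]) -> None:
    # Re-associate a network-booted image with a healthy spare node.
    if image.node is not None and image.node.healthy:
        return                         # nothing to recover
    spare = next(n for n in free_pool if n.healthy)
    free_pool.remove(spare)
    image.node = spare                 # seconds, versus re-copying a multi-GB image
    print(f"PXE-booting {spare.name} from {image.nfs_path}")

node1, node2 = Node("node1"), Node("node2")
web = BootImage("web-tier", "nfs://filer/images/web", node=node1)
node1.healthy = False                  # node1 fails...
recover(web, [node2])                  # ...and the image boots on node2 instead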
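Second, a hedged sketch of the power-management abstraction: the cloud infrastructure only needs on/off, whether that comes from an on-board controller or an external PDU outlet. The class and method names are illustrative, not any vendor's actual API.

from abc import ABC, abstractmethod

class PowerController(ABC):
    @abstractmethod
    def power_on(self, node: str) -> None: ...
    @abstractmethod
    def power_off(self, node: str) -> None: ...

class OnBoardController(PowerController):
    # Newer hardware: talk to the node's own management processor (iLO/DRAC/ILOM).
    def power_on(self, node: str) -> None:
        print(f"[on-board] chassis power on -> {node}")
    def power_off(self, node: str) -> None:
        print(f"[on-board] chassis power off -> {node}")

class PduController(PowerController):
    # Older "dumb" hardware: switch the PDU outlet the node is plugged into.
    def __init__(self, outlet_map: dict):
        self.outlet_map = outlet_map
    def power_on(self, node: str) -> None:
        print(f"[PDU] outlet {self.outlet_map[node]} on for {node}")
    def power_off(self, node: str) -> None:
        print(f"[PDU] outlet {self.outlet_map[node]} off for {node}")

# The infrastructure can idle unused capacity regardless of hardware vintage.
for controller in (OnBoardController(), PduController({"node7": 12})):
    controller.power_off("node7")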
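Third, an illustrative sketch of the borrowing behavior from the CMDB bullet: satisfy a high-priority request from the free pool if a matching node exists, otherwise raid the lowest-priority tier that has a match. The policy idea is the point; all the names are made up.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpus: int
    mem_gb: int

def matches(node, want):
    return node.cpus >= want.cpus and node.mem_gb >= want.mem_gb

def allocate(want, free_pool, tiers):
    # tiers maps tier name -> (priority, nodes); a lower number is a lower priority.
    for node in list(free_pool):
        if matches(node, want):
            free_pool.remove(node)
            return node
    # No free match: borrow from the lowest-priority tier that has one.
    for _, (priority, nodes) in sorted(tiers.items(), key=lambda kv: kv[1][0]):
        for node in list(nodes):
            if matches(node, want):
                nodes.remove(node)     # borrowed: the "CMDB" just changed under you
                return node
    raise RuntimeError("resource-limited: no matching node anywhere")

free = [Node("spare1", cpus=2, mem_gb=8)]          # too small for the request
tiers = {"batch": (1, [Node("b1", 8, 32)]), "web": (5, [Node("w1", 8, 32)])}
print(allocate(Node("want", cpus=4, mem_gb=16), free, tiers).name)   # -> b1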
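Fourth, a toy version of the network reconfiguration step: before a freshly allocated node can serve in, say, a load-balanced web tier, the switch port it is cabled to has to be moved onto that tier's network. The port/VLAN mapping below is hypothetical; real switches each have their own management interface for this.

switch_ports = {"node3": {"port": "ge-0/0/17", "vlan": 99}}    # VLAN 99 = free pool
tier_vlans = {"web-behind-lb": 210, "batch": 310}

def join_tier(node: str, tier: str) -> None:
    # Re-home the node's switch port onto the tier's VLAN before activation.
    port = switch_ports[node]
    old_vlan = port["vlan"]
    port["vlan"] = tier_vlans[tier]
    print(f"{node}: {port['port']} moved from VLAN {old_vlan} to {port['vlan']} for {tier}")

join_tier("node3", "web-behind-lb")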
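Finally, some back-of-the-envelope math for the NAS-versus-SAN bandwidth trade-off, using era-appropriate link speeds and assuming roughly 80% of wire speed is usable (both numbers are illustrative):

def transfer_seconds(gigabytes, link_gbps, efficiency=0.8):
    # Seconds to move a payload across a link at the assumed usable rate.
    usable_bytes_per_sec = link_gbps * 1e9 / 8 * efficiency
    return gigabytes * 1e9 / usable_bytes_per_sec

for label, gbps in (("1 Gigabit IP (NAS)", 1.0), ("4 Gigabit SAN", 4.0)):
    print(f"{label}: 50 GB in ~{transfer_seconds(50, gbps):.0f} s")
# ~500 s over 1 Gigabit versus ~125 s over 4 Gigabit -- hence the advice to keep
# disk-intensive traffic on local or zoned storage instead of the shared NAS link.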
Well, that's a lot to digest, so I think I'll stop now. I'm not trying to scare anyone away from moving to an internal cloud computing infrastructure; on the contrary, I believe internal clouds are the future of IT computing. Today's "best" practices are, to a large extent, responsible for the management and over-provisioning issues facing most IT shops. However, to address that over-provisioning and benefit from a continuously optimized computing environment -- one where your excess capacity is efficiently allocated to the applications that need it instead of sitting idle in a specific stovepipe -- you need to understand the fundamental issues that lie ahead of you. In your transition to internal cloud computing, you will need to actively work to address these new issues, or you will find yourself with just a different set of problems to chase than the ones you have now.
Wednesday, January 21, 2009
Business Agility : Using internal cloud computing to create a computer animation render farm in less than a day
Posted by
Craig Vosburgh
at
3:33 PM
As I mentioned in my last post, cloud computing is all about the "-ilities" of computing environments, and I want to spend some time in this post talking about internal cloud computing and how it can dramatically enhance a company's business and IT agility.
Many of the customers I speak with have the same problem with their IT infrastructure. Specifically, they believe they have plenty of excess compute, storage, and network capacity in their existing environment to host additional applications -- they just can't get access to it. Due to the stove-piped approach often taken in allocating hardware to applications, this excess capacity is locked up, awaiting the day when the planned peak load finally comes along.
What they would like instead is to be able to dynamically shift resources in and out of the "hot" applications as needed. This reduces the need for idle hardware in fixed silos, cutting power/cooling costs (the idle resources can be powered down) and shrinking the overall number of spare resources, since the excess capacity is shared among the apps -- and not all of the apps will hit their peak load at the same time.
With an internal cloud computing infrastructure, that is exactly what you can do. I'll walk you through a little example of how we supported a completely new application in our lab in less than a day. First, a little background: it turns out that a few of us here at Cassatt are computer animation hobbyists, and one of us (Bob Hendrich) was trying to figure out where he could render a short film he was working on with his kids for school. Now, he could always pay for time on an existing render farm (respower, as an example), but as a test, we wanted to see how long it would take us to use Cassatt Active Response as an internal cloud computing infrastructure and re-purpose a portion of our test/IT resources to create an ad hoc render farm for rendering a Blender movie (you can check out a few seconds of a test render we did here).
In our case, as we already use Cassatt Active Response to manage our internal cloud computing environment, we didn't have to install our product. If we had needed to do an install, we probably would have tacked on a few days to set up a small environment, with most of that time going to racking and cabling the compute nodes (if you want to see more on how that's done, check this out). Anyway, as we already had a Cassatt-controlled environment set up, the first task was setting up a machine with all the software required for a render node. This step allowed us to configure/debug the render software and capture a baseline image for use within Cassatt Active Response.
A quick sidebar for those who may not know much about Active Response: it has the concept of capturing a single "golden" image for a given service and then replicating that image as necessary to turn up as many instances of the service as desired. As we wanted to build out a service made up of an arbitrary number of instances, we had to start by capturing the "golden" and then let the Cassatt software handle the replication (Active Response actually does not only the image replication but also most of the required instance-level configuration, like hostnames and IPs, for you).
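To make the golden-image idea concrete, here's a conceptual sketch. The fields and naming scheme are invented for illustration; in practice, Active Response performs this replication and per-instance configuration for you.

import copy

golden = {
    "name": "render-node-golden",
    "image_path": "nfs://filer/images/render-golden",
    "hostname": None,                  # instance-level settings stay blank...
    "ip": None,                        # ...until an instance is stamped out
}

def stamp_instance(golden_image, index):
    instance = copy.deepcopy(golden_image)        # replicate the image...
    instance["hostname"] = f"render{index:02d}"   # ...then personalize it
    instance["ip"] = f"10.0.5.{10 + index}"
    return instance

farm = [stamp_instance(golden, i) for i in range(1, 4)]
print([(inst["hostname"], inst["ip"]) for inst in farm])
# -> [('render01', '10.0.5.11'), ('render02', '10.0.5.12'), ('render03', '10.0.5.13')]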
OK, back to the main thread. To set up the image host, we snagged an existing CentOS 5.3 image and kickstarted it onto a spare box in the lab (elapsed time so far: 45 minutes).
Once CentOS was installed, we had to install/update the following to support the distributed Blender environment:
• nfs-utils
• Python
• Blender 2.48
• JRE 1.6
• Distriblend or FarmerJoe (both open source renderfarm utilities)
In addition to these software updates, we needed to export a directory via NFS into which the client nodes could dump their rendered files. The render software works like this: a single load manager distributes parts of each frame to the machines available in the farm. That means all the nodes need a commonly accessible file system, so the different parts of a frame can be coalesced back into a single coherent frame. In our case, we just exported the disk from the singleton distribution node to act as the collation point; the sketch below shows the pattern in miniature.
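This is a toy sketch of the manager/worker arrangement just described, with illustrative paths: workers render strips of a frame into one shared directory (standing in for the NFS export), and the manager stitches the strips back into a whole frame.

import os, tempfile

shared = tempfile.mkdtemp(prefix="render-shared-")   # stands in for the NFS export

def render_part(frame, part, worker):
    # A worker "renders" one strip of a frame into the shared directory.
    path = os.path.join(shared, f"frame{frame:04d}.part{part}")
    with open(path, "w") as f:
        f.write(f"pixels for frame {frame}, strip {part} (rendered by {worker})\n")

def coalesce(frame, nparts):
    # The manager stitches the strips back together into a single frame file.
    out = os.path.join(shared, f"frame{frame:04d}.full")
    with open(out, "w") as full:
        for part in range(nparts):
            with open(os.path.join(shared, f"frame{frame:04d}.part{part}")) as strip:
                full.write(strip.read())
    return out

workers = ["render01", "render02"]
for part in range(4):                  # the manager hands strips to available nodes
    render_part(1, part, workers[part % len(workers)])
print("assembled:", coalesce(1, 4))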
The final step in creating the golden image was to configure the required daemon processes to start up on boot. With this completed, we ran cccapture to peel the image off the image host and store it in the Cassatt Active Response Image Matrix (that's the fancy name for the place where Cassatt stores images, both goldens and instances). (Elapsed time: 3 hrs.)
With the image in Active Response, we moved on to building out the two required tiers needed to host the render farm (one for the distribution service and one for the rendering service). In Cassatt terms, the tier is where the user can configure all of the policy that they want enforced (things like how many nodes must be booted to provide service, what approach to use to grow/shrink capacity as a function of demand, how hardware should be handled on failure, and what, if any, network requirements the service has).
The first tier we created was for the singleton service that manages the render distribution. This tier was set up to allow only a single node per the render software architecture. In Cassatt speak that would be a min=one, target=one, max=one tier with no dynamic service-level agreement (SLA) configured.
With that tier created, we moved on to create the tier that would house the render nodes. In our case, we wanted Active Response to manage the allocation of the nodes based on the incoming load on the system. To do this, we configured our growth management policy as Load Balanced Allocation (LBA), using SNMP load1Average as the LBA control value. We then set the tier minimum to one node (the number needed to provide service), set the maximum to six nodes (the most we wanted to use in the render farm), and set the tier to idle off after three minutes of inactivity.
An aside here for anyone who doesn't know what LBA is or what an idle timer is used for. As we kick the renders off overnight (they take hours to complete), we want to save power and have the nodes all shut down automatically when the render is complete. LBA will scale a tier up/down automatically based on the load against the tier, but it will only ever shrink a tier to the user-defined minimum size (in our case, that minimum was one). As we didn't want even a single node running for multiple hours unnecessarily (we're a pretty green bunch here at Cassatt), we set up an idle timer in the tier policy that says to go ahead and shut down even the last node if it sits idle for a specified time period (in our case, three minutes). A simplified sketch of this policy follows.
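Here's a simplified, self-contained sketch of the policy from the last few paragraphs: grow while load1Average sits above a high-water mark, shrink toward the minimum when it falls below a low-water mark, and let an idle timer power off the last node after a quiet period. The thresholds and field names are illustrative, not Active Response's actual configuration.

from dataclasses import dataclass

@dataclass
class TierPolicy:
    min_nodes: int = 1
    max_nodes: int = 6
    grow_above: float = 2.0       # load1Average high-water mark
    shrink_below: float = 0.5     # load1Average low-water mark
    idle_off_after: int = 3       # quiet ticks (minutes, here) before power-off

def step(active, load, idle_ticks, policy):
    # One control-loop tick: returns (powered-on node count, idle counter).
    if load > policy.grow_above and active < policy.max_nodes:
        return active + 1, 0                  # overloaded: boot one more node
    if load < policy.shrink_below:
        if active > policy.min_nodes:
            return active - 1, 0              # drain a node back to the free pool
        if idle_ticks + 1 >= policy.idle_off_after:
            # Idle timer fired: power off the last node (it stays allocated to the tier).
            return 0, 0
        return active, idle_ticks + 1
    return active, 0

nodes, idle = 1, 0
for load in [6.0, 6.0, 5.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0]:   # an overnight render
    nodes, idle = step(nodes, load, idle, TierPolicy())
    print(f"load1Average={load:>4} -> {nodes} node(s) powered on")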
OK, back to the render tier config. We next specified that the tier only grab four-CPU nodes or better, to maximize performance of the render (the software is tuned to get the biggest bang for the buck on multi-CPU machines). Networking in our case was not an issue, as we just used the Active Response default network, so there wasn't any network-specific policy to enter. With the tier definitions completed, it was time to allocate nodes into the tiers and activate them so we could take the first test render for a spin. As we had set the minimums to one for both tiers, only two nodes were allocated and the services were brought online.
This concept of allocation may be a little foreign to folks not familiar with the cloud computing paradigm, so I'll explain. With a cloud computing infrastructure in place, the user doesn't manage specific hardware and its associated image (today's traditional model), but rather the service images and the available hardware separately. The infrastructure then handles all the work of binding the desired image(s) to the required hardware as needed by the service. This loose coupling is at the heart of cloud computing's dramatic business agility enhancements: the same hardware can be used by different services as user policy dictates (where that policy can be schedule-, demand-, or priority-based). (Elapsed time: 4 hrs.)
Now the cool part. With the two tiers up and running, we handed a job to the render distribution node; the render tier immediately picked up four frames to render and went to pretty much 100% on all four CPUs. Load1Average went through the roof, as depicted on the graph in the tier page. We had set the LBA parameters to monitor every 15 seconds and average over 60 seconds. Within just a couple of minutes, Active Response booted a second node, as the service was deemed overloaded. The second node, once booted, immediately grabbed frames from the distribution node to render, and it also pegged on CPU utilization. A couple more minutes passed and the Cassatt software booted a third node to try to address the service's load spike. As the render tier had a max of six instances, the system kept trying to get another piece of hardware to activate a fourth render node, but since none of the hardware available in the free pool matched the requirement, it was unsuccessful (Active Response let us know that the tier was resource-limited by declaring it in a warning state in the UI; had we set up email notifications, we would also have gotten an email to that effect).
Load1Average stayed pretty high, as expected, until the nodes started running out of work. At that point, load1Average started dropping, and when it got below the configured minimum value, the tier first dropped out of its warning state (it no longer needed nodes). A minute later, the tier started shutting down nodes and returning them to the free pool. Once all the frames were rendered, Active Response shut the remaining node down after it had sat idle for the three minutes we configured (but kept that node allocated to the tier, as the minimum was set to one).
Now, the really, really cool part. If we had access to another 100 machines, we would not have to do anything else to use them except create a wider render tier. We would use the same image, and Active Response would handle the instance creation just as it did for the initial six, for whatever tier max we set. Literally, in 15 minutes we could create another tier and run the test again with 100 nodes, and the system would use as many nodes as it needed to complete the job. In addition, in an environment of 100 nodes, we could use those nodes for other jobs or other applications; if we set the priority of the render tier above the others, the render tier could steal the nodes in use by lower-priority tiers, and those tiers would get the nodes back when our render was done. We would not have to touch anything to make this happen at runtime; Active Response would simply be enforcing the policy we set up.
Well, I thought I'd close this post out with a bit of a recap, as we covered a fair amount of ground. An internal cloud computing infrastructure is the enabler for a substantial increase in business and IT agility within your organization. By decoupling the physical resources (hardware, storage, and networking) from the specific applications they support (the application images), the infrastructure can provide capacity on demand to any managed application as required by the current load against it -- no more having to guess at the peak load and then keep all that spare capacity sitting idle, waiting for the rainy day to come. In addition, you as the user can update the allocation policy as needed to keep the allocation of resources in line with the importance of the application(s) being managed.
As an example, we took a new application (a render farm) and hosted it in Cassatt Active Response (the internal cloud computing infrastructure for our example). We were able not only to stand the application up in less than a day, but also to host it on existing IT/lab resources over the weekend and give them back by Monday morning for their normal duties. In other words, we hosted this new application simply by using the spare cycles already available in our IT infrastructure, rather than purchasing new hardware for the specific purpose, as is typically the approach taken today.
Next week, we're going to spend some time talking about disaster recovery and how Active Response, acting as your internal cloud computing infrastructure, can provide DR capabilities that to date you probably thought were unattainable.