Wednesday, January 21, 2009

Business Agility: Using internal cloud computing to create a computer animation render farm in less than a day

As I mentioned in my last post, cloud computing is all about the "-ilities" of computing environments. In this post, I want to spend some time talking about internal cloud computing and how it can dramatically enhance a company's business and IT agility.

Many of the customers I speak with have the same problem with their IT infrastructure: they believe they have plenty of excess compute, storage, and network capacity in their existing environment to host additional applications, but they just can't get access to it. Due to the stove-piped approach often taken in allocating hardware to applications, this excess capacity is locked up, awaiting the day when the planned peak load comes along.

What they would like instead is to be able to dynamically move resources in and out of the "hot" applications as needed. This reduces the need for idle hardware in fixed silos, cutting power and cooling costs (the idle resources can be powered down) and shrinking the number of overall spare resources, since the apps can share the excess capacity -- and not all of them will hit their peak load at the same time.

With an internal cloud computing infrastructure, that is exactly what you can do. I'll walk you through a little example of how we supported a completely new application in our lab in less than a day. First, a little background: it turns out that a few of us here at Cassatt are computer animation hobbyists, and one of us (Bob Hendrich) was trying to figure out where he could render a short film he was working on with his kids for school. Now, he could always pay for time on an existing render farm (ResPower, for example), but as a test, we wanted to see how long it would take us to use Cassatt Active Response as an internal cloud computing infrastructure and re-purpose a portion of our test/IT resources to create an ad hoc render farm for rendering a Blender movie (you can check out a few seconds of a test render we did here).

In our case, as we already use Cassatt Active Response to manage our internal cloud computing environment, we didn't have to install our product. If we had needed to do an install, we probably would have tacked on a few days to set up a small environment, with most of that time going to racking and cabling the compute nodes (if you want to see more on how that's done, check this out). Anyway, as we already had a Cassatt-controlled environment set up, the first task was setting up a machine with all the software required for a render node. This step allowed us to configure and debug the render software and capture a baseline image for use within Cassatt Active Response.

A quick sidebar for those who may not know much about Active Response: it has the concept of capturing a single "golden" image for a given service and then replicating that image as necessary to turn up as many instances of that service as desired. As we wanted to build out a service made up of an arbitrary number of instances, we had to start by capturing the golden image and then let the Cassatt software handle the replication (Active Response not only does the image replication but also does most of the required instance-level configuration, like hostnames and IPs, for you).

OK, back to the main thread. To set up the image host, we snagged an existing CentOS 5.3 image and kickstarted it onto a spare box in the lab (elapsed time so far: 45 minutes).

Once CentOS was installed, we had to install/update the following to support the distributed Blender environment (a rough sketch of this step follows the list):

• nfs-utils
• Python
• Blender 2.48
• JRE 1.6
• Distriblend or FarmerJoe (both open-source render farm utilities)
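
For the curious, the package setup amounted to something like the following. This is a hedged sketch, not our actual script: the archive names are illustrative, and Blender 2.48, the JRE, and the render farm utility weren't in the stock CentOS repos, so those tarballs had to be fetched by hand before running it.

```python
# Rough provisioning sketch for the render-node image host. Archive names
# here are placeholders; only nfs-utils and python came straight from yum.
import subprocess

def run(cmd: str) -> None:
    print("+ " + cmd)
    subprocess.check_call(cmd, shell=True)

run("yum -y install nfs-utils python")        # base packages from yum
run("tar xjf blender-2.48.tar.bz2 -C /opt")   # Blender binary tarball
run("tar xzf jre-1_6.tar.gz -C /opt")         # JRE 1.6 for the farm tooling
run("tar xzf farmerjoe.tar.gz -C /opt")       # FarmerJoe render farm utility
```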

In addition to these software updates, we needed to export a directory via NFS into which the client nodes could dump their rendered files. The render software works by having a single load manager distribute parts of each frame to the machines available in the farm. That means all the nodes need a commonly accessible file system so they can coalesce the different parts of a frame back into a single coherent image. In our case, we just shared out a disk from the singleton distribution node to act as the collation point.
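
To make the collation idea concrete, here's a toy sketch of what happens on that shared directory. Everything in it -- the paths, the file naming, the stitching itself -- is invented for illustration; in reality the farm software handles this for you.

```python
# Hypothetical collation step: each render node writes its horizontal
# strip of a frame to the shared NFS directory, and the distribution
# node stitches the strips back into one image.
import glob
from PIL import Image

SHARED_DIR = "/export/render"   # the NFS directory exported from the distribution node

def coalesce_frame(frame_no: int, width: int, height: int) -> None:
    parts = sorted(glob.glob(f"{SHARED_DIR}/frame{frame_no:04d}_part*.png"))
    frame = Image.new("RGB", (width, height))
    strip_h = height // len(parts)              # each node rendered one strip
    for i, path in enumerate(parts):
        frame.paste(Image.open(path), (0, i * strip_h))  # slot each strip into place
    frame.save(f"{SHARED_DIR}/frame{frame_no:04d}.png")
```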

The final step in creating the golden image was to configure the required daemon processes to start up on boot. With this completed, we ran cccapture to peel the image off the image host and store it in the Cassatt Active Response Image Matrix (the fancy name for the place Cassatt stores images, both goldens and instances). (Elapsed time: 3 hrs.)
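
The daemon setup was nothing exotic; on a CentOS 5-era box it was along these lines ("farmerjoe" is a placeholder for whatever init script the render farm utility installs):

```python
# Enable the required daemons at boot on the image host, using the
# SysV-era chkconfig tooling that ships with CentOS 5.
import subprocess

for svc in ("nfs", "farmerjoe"):        # render service name is assumed
    subprocess.check_call(["chkconfig", svc, "on"])
```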

With the image in Active Response, we moved on to building out the two required tiers needed to host the render farm (one for the distribution service and one for the rendering service). In Cassatt terms, the tier is where the user can configure all of the policy that they want enforced (things like how many nodes must be booted to provide service, what approach to use to grow/shrink capacity as a function of demand, how hardware should be handled on failure, and what, if any, network requirements the service has).

The first tier we created was for the singleton service that manages the render distribution. This tier was set up to allow only a single node, per the render software's architecture. In Cassatt-speak, that would be a min=1, target=1, max=1 tier with no dynamic service-level agreement (SLA) configured.

With that tier created, we moved on to create the tier that would house the render nodes. In our case, we wanted Active Response to manage the allocation of nodes based on the incoming load on the system. To do this, we configured our growth management policy as Load Balanced Allocation (LBA), using the SNMP load1Average as the LBA control value. We then set the tier minimum to one node (the number needed to provide service), the maximum to six nodes (the most we wanted to use in the render farm), and set the tier to idle off after three minutes of inactivity.

An aside here for anyone who doesn't know what LBA is or what an idle timer is used for. As we kick the renders off overnight (they take hours to complete), we want to save power and have the nodes automatically shut down when the render is complete. LBA will scale a tier up and down automatically based on the load against it, but it will only ever shrink a tier to the user-defined minimum size (in our case, one). As we didn't want even a single node running for hours burning power (we're a pretty green bunch here at Cassatt), we set up an idle timer in the tier policy that says to go ahead and shut down even the last node if it sits idle for a specified period (in our case, three minutes).
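
If it helps to see the growth policy as pseudologic, here is a minimal sketch of an LBA-style control loop with an idle timer. The thresholds, polling interface, and tier methods are all assumptions for illustration; the real Active Response policy engine is configured through the UI, not written by hand.

```python
# Minimal sketch of the tier growth policy described above (all method
# names and threshold values are assumed, not Cassatt's actual API).
import time

MIN_NODES, MAX_NODES = 1, 6      # tier min/max from our configuration
HIGH_LOAD, LOW_LOAD = 3.0, 0.5   # load1Average thresholds (assumed values)
IDLE_TIMEOUT = 180               # our three-minute idle timer, in seconds
POLL_INTERVAL = 15               # the SNMP sampling interval we configured

def control_loop(tier):
    idle_since = None
    while True:
        load = tier.load1_average()              # 60-second average via SNMP
        if load > HIGH_LOAD and tier.size() < MAX_NODES:
            tier.grow()                          # allocate a node from the free pool
            idle_since = None
        elif load < LOW_LOAD and tier.size() > MIN_NODES:
            tier.shrink()                        # power a node down, return it to the pool
        elif load < LOW_LOAD and tier.size() == MIN_NODES:
            idle_since = idle_since or time.time()
            if time.time() - idle_since >= IDLE_TIMEOUT:
                tier.power_off_last()            # idle timer: last node powers off...
                idle_since = None                # ...but stays allocated to the tier
        else:
            idle_since = None
        time.sleep(POLL_INTERVAL)
```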

OK, back to the render tier config. We next specified that the tier only grab four-CPU nodes or better, to maximize render performance (the software is tuned to get the biggest bang for the buck on multi-CPU machines). Networking in our case was not an issue, as we just used the Active Response default network, so there wasn't any network-specific policy to enter. With the tier definitions completed, it was time to allocate nodes into the tiers and activate them so we could take the first test render for a spin. As we had set the minimums to one for both tiers, only two nodes were allocated and the services were brought online.

This concept of allocation may be a little foreign to folks not familiar with the cloud computing paradigm, so I'll explain. With a cloud computing infrastructure in place, the user doesn't manage specific hardware and its associated image (today's traditional model), but rather manages the service images and the available hardware separately. The infrastructure then handles all the work of binding the desired image(s) to the required hardware as needed by the service. This loose coupling is at the heart of cloud computing's dramatic business agility enhancements: the same hardware can be used by different services as the user's policy dictates (where that policy can be schedule-, demand-, or priority-based). (Elapsed time: 4 hrs.)
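
To make the decoupling concrete, here is one way you might model it in a few lines. The classes and method names are hypothetical, not Cassatt's API; the point is simply that images and hardware live in separate pools until allocation binds them.

```python
# Illustrative data model for image/hardware decoupling (names assumed).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    hostname: str
    cpus: int
    image: Optional[str] = None     # no image bound until the node is allocated

@dataclass
class Tier:
    golden_image: str
    min_cpus: int                   # hardware policy, e.g. four CPUs or better
    nodes: List[Node] = field(default_factory=list)

def allocate(tier: Tier, free_pool: List[Node]) -> Optional[Node]:
    """Bind the tier's golden image to the first suitable node in the pool."""
    for node in free_pool:
        if node.cpus >= tier.min_cpus:      # honor the tier's hardware policy
            free_pool.remove(node)
            node.image = tier.golden_image  # replicate the golden image onto it
            tier.nodes.append(node)
            return node
    return None                             # no match: the tier is resource-limited
```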

Now the cool part. With the two tiers up and running, we handed a job to the render distribution node; the first render node immediately picked up four frames to render and went to pretty much 100% on all four CPUs. Load1Average went through the roof, as depicted on the graph on the tier page. We had set the LBA parameters to monitor every 15 seconds and average over 60 seconds. Within just a couple of minutes, Active Response booted a second node, as the service was deemed overloaded. The second node, once booted, immediately grabbed frames from the distribution node to render, and it too pegged on CPU utilization. A couple more minutes passed, and the Cassatt software booted a third node to try to address the service's load spike. As the render tier had a max of six instances, the system continued to try to get another piece of hardware to activate a fourth render node, but since none of the hardware available in the free pool matched the requirement, it was unsuccessful in allocating a fourth node. (Active Response let me know the tier was resource-limited by declaring it in a warning state in the UI; had we set up email notifications, we would also have gotten an email to that effect.)

Load1Average stayed pretty high, as expected, until the nodes started running out of work. At that point, load1Average began dropping, and when it fell below the configured minimum value, the tier first dropped out of its warning state (it no longer needed nodes). A minute later, the tier started shutting down and returning nodes to the free pool. Once all the frames were rendered, Active Response shut the remaining node down after it had been below the minimum for the three minutes we configured (but kept that node allocated to the tier, as the min was set to one).

Now, the really, really cool part. If we had access to another 100 machines, we would not have to do anything else to use them except create a wider render tier. We would use the same image, and Active Response would handle the instance creation just as it did for the initial six, but for whatever tier max we set. Literally, in 15 minutes we could create another tier and run the test again with 100 nodes, and the system would use as many nodes as it needed to complete the job. In addition, in an environment of 100 nodes, we could use those nodes for other jobs or other applications; if we set the priority of the render tier above the others, the render tier could steal the nodes in use by lower-priority tiers, and those tiers would get the nodes back when our render was done. We would not have to touch anything to make it happen at runtime; Active Response would simply be enforcing the policy we set up.
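
For a feel of how that priority-based stealing might work, here is a sketch in the same spirit as the earlier ones (again, names and logic are assumptions, not Cassatt's implementation):

```python
# Hypothetical priority arbitration: a resource-limited high-priority tier
# reclaims a node from the lowest-priority tier that can still shrink.
def steal_node(requester, all_tiers):
    donors = [t for t in all_tiers
              if t.priority < requester.priority and t.size() > t.min_nodes]
    if not donors:
        return None                   # nothing to steal; stay resource-limited
    donor = min(donors, key=lambda t: t.priority)
    node = donor.shrink()             # donor gives up a node (down to its min)
    requester.adopt(node)             # node is re-imaged for the requesting tier
    return node                       # donor grows back once the render is done
```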

Well, I thought I'd close this post out with a bit of a recap, as we covered a fair amount of ground. An internal cloud computing infrastructure is the enabler for a substantial increase in business and IT agility within your organization. By decoupling the physical resources (hardware, storage, and networking) from the specific applications they support (the application images), it allows the infrastructure to provide capacity on demand for any managed application, as required by the current load against that application. No more having to guess at the peak load and then keep all that spare capacity sitting idle, waiting for the rainy day to come. In addition, you as the user can update the allocation policy as needed to keep the allocation of resources in line with the importance of the applications being managed.

As an example, we took a new application (a render farm) and hosted it in Cassatt Active Response (the internal cloud computing infrastructure in our example). Not only were we able to host the application in less than a day, but we also hosted it on existing IT/lab resources over the weekend and gave them back by Monday morning for their normal uses. In other words, we hosted this new application simply by using the spare cycles already available in our IT infrastructure, rather than purchasing new hardware for the purpose, as is typically done today.

Next week, we're going to spend some time talking about disaster recovery (DR) and how Active Response, acting as your internal cloud computing infrastructure, can provide DR capabilities that to date you probably thought were unattainable.
