Tuesday, February 10, 2009

Is your organization ready for an internal cloud?

I'm going to take a post off from the "-ilities" discussion (Disaster Recoverability will be up next) and spend a little time talking about the technical and organizational challenges that many Fortune 1000 companies will face in their move to an internal cloud computing infrastructure.

Since I've lived through a number of these customer engagements over the past few years, I thought I'd write up my cheat sheet outlining the things you need to be aware of, and get worked out in your organization, before you embark on a move to cloud computing. If you don't pay attention to these, I predict you'll be frustrated along the way by the organizational tension that a move such as this will cause.

Internal clouds are a game-changer on the economics front for companies that have large investments in IT. Creating a cloud-style architecture inside your own data center allows a company to unlock the excess capacity of its existing infrastructure (often locked into vertical stovepipes supporting specific applications). Once this excess capacity is released for use within the IT environment, the company can decrease its capital purchasing until the newfound capacity is consumed by new applications. In addition to the capital savings from not having to purchase new hardware to run new applications, there are substantial power savings as well, since the excess capacity is no longer powered up all the time and is instead brought online only when needed.

Now, this sounds like motherhood and apple pie to me. Who wouldn't want to move to an IT environment that allowed this type of flexibility/cost savings? Well, it turns out that in many large companies the inertia of "business as usual" gets in the way of making this type of transformational change, even if they could see huge benefits in the form of business agility and cost savings (two things that almost everyone is looking for in this current economy to make them more competitive than the next guy).

If you find yourself reading the list below and going "no way, not in my company," then I'd posit that while you may desire the virtues of cloud computing, you'll really struggle to succeed. The limits you'll be placing on the solution due to the organizational issues will mean that many of the virtues of a cloud approach will be lost (I'll try to call out a few while I walk through the list to give you an idea of the trade-offs involved).

What you’ll need to be successful at internal cloud computing:
Organizational willingness to embrace change. To fully adopt an internal cloud infrastructure, the system, network, and storage admins, along with application developers, are all going to have to work together as each group brings its specialty to bear on the hosting requirements of the application being deployed into the cloud. Also, if your organization likes to know exactly where an application is running all the time, then cloud computing is only going to be frustrating. In cloud computing, the environment is continually being monitored and optimized to meet the business needs (we call them service-level agreements in Cassat Active Response). This means that while at any point in time you can know what is running where, that information is only accurate for that instant. Why? In the next instant, something may have failed or an SLA may have been breached, causing the infrastructure to react and change the allocation of applications to resources. Bottom line: be willing to embrace change in processes, policies, roles, and responsibilities, or you'll never be successful with a cloud computing solution.

Willingness to network boot (PXE/BOOTP/DHCP) the computing resources. One of the major value propositions of an internal cloud computing approach is the ability to re-use hardware for different tasks at different times. To allow for this rapid re-deployment of resources, you can't use traditional approaches for imaging a node’s local disk (it takes a long time to copy down a multi-GB image across the network and, once done, if that node fails then the data is lost, since it resides on the local disk). Instead, the use of standard network protocols (NFS/iSCSI) allows for the real-time association of the image to the hardware. This approach also has the byproduct of allowing for very fast failure recovery times. Once a node is found to have failed, it only takes a few seconds to associate the image to a new node and start the boot (we recover failed nodes in the time it takes to boot a node plus a few seconds to effect the re-association of the image to its new compute resource).
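To make the re-association idea concrete, here is a minimal sketch of the bookkeeping involved. It is only an illustration, not how Cassat Active Response is implemented: the `BootRegistry` class, the node names, and the image paths are all hypothetical, and a real system would also have to update the DHCP/PXE and NFS/iSCSI configuration and power on the replacement node.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BootRegistry:
    """Tracks which network-hosted boot image each node is bound to (hypothetical)."""
    bindings: Dict[str, str] = field(default_factory=dict)  # node name -> image path
    spares: List[str] = field(default_factory=list)         # idle nodes ready to boot

    def fail_over(self, failed_node: str) -> str:
        """Re-bind the failed node's image to a spare and return the spare's name."""
        if not self.spares:
            raise RuntimeError("no spare node available")
        # The image lives on the network, so nothing has to be copied:
        # recovery is just a change of association followed by a boot.
        image = self.bindings.pop(failed_node)
        replacement = self.spares.pop(0)
        self.bindings[replacement] = image
        return replacement


registry = BootRegistry(
    bindings={"node-a": "nfs://images/web-01"},  # illustrative path
    spares=["node-b"],
)
new_node = registry.fail_over("node-a")
```

Because the image never lived on the failed node's disk, the only state that changes is the mapping itself, which is why recovery time is dominated by the boot.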

Computing resources that support remote power management either through on-board or off-board power controllers. For all of this dynamism to work, the internal cloud infrastructure must be able to power-control the nodes under management so that a node can be powered down when not needed (saving power) and powered back up when necessary. Most recent computing hardware has on-board controllers specifically for this task (iLO on HP, DRAC on Dell, ALOM/ILOM on Sun…) and the cloud computing infrastructure simply uses these existing interfaces to effect the power operations. If you find yourself in an environment that has older hardware without this support, don't despair. There are numerous vendors that manufacture external Power Distribution Units (PDUs) that can provide the necessary power management for otherwise "dumb" compute nodes.
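As a rough illustration of what "using the existing interfaces" can look like, the sketch below builds an `ipmitool` command line for a power operation. It assumes the node's controller speaks IPMI over LAN, which many but not all on-board controllers do. The helper name, host address, and user are made up for the example. The function only constructs the argument list, and the `-E` flag tells `ipmitool` to read the password from the `IPMI_PASSWORD` environment variable.

```python
from typing import List

ALLOWED_ACTIONS = {"on", "off", "status"}


def ipmi_power_command(host: str, user: str, action: str) -> List[str]:
    """Build (but do not run) an ipmitool command for a chassis power action."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    return [
        "ipmitool",
        "-I", "lanplus",   # IPMI v2.0 over the network
        "-H", host,        # address of the node's management controller
        "-U", user,
        "-E",              # password comes from the IPMI_PASSWORD env variable
        "chassis", "power", action,
    ]


# Hypothetical controller address and account, for illustration only.
command = ipmi_power_command("10.0.0.5", "admin", "off")
```

The resulting list could be handed to `subprocess.run(command, check=True)`. An external PDU would need a different backend behind the same kind of helper.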

Understand that your current static Configuration Management Database (CMDB) becomes real-time/dynamic in an internal cloud computing world. I touched on this under the "embrace change" bullet above, but it's worth calling out specifically. In a cloud computing world where you have pools of capacity (computing, network, and storage) that are associated in real time to applications that need that capacity to provide service, NOTHING is constant. As an example, depending on how you set up your policy, a failure of a node in a higher-priority service will cause a node to be allocated from the free pool. If one is not available that matches the application's requirements (number of CPUs, disks, network cards, memory…) then a suitable replacement may be "borrowed" from a lower-priority service. This means that your environment is always changing and evolving to meet your business needs, and you'll only find yourself frustrated if you don't let go of the mindset of static allocation.
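The replacement policy described above (free pool first, then borrow from a lower-priority service) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the product's actual algorithm: the `Node` and `Service` types are hypothetical, only CPU count and memory are matched, and a larger `priority` number is assumed to mean a more important service.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpus: int
    memory_gb: int


@dataclass
class Service:
    name: str
    priority: int        # assumption: higher number = more important
    nodes: List[Node]


def find_replacement(free_pool: List[Node], services: List[Service],
                     needy: Service, min_cpus: int, min_memory_gb: int) -> Optional[Node]:
    """Give `needy` a suitable node: free pool first, then borrow from below."""
    def suitable(node: Node) -> bool:
        return node.cpus >= min_cpus and node.memory_gb >= min_memory_gb

    # Step 1: prefer an idle node from the free pool.
    for node in free_pool:
        if suitable(node):
            free_pool.remove(node)
            needy.nodes.append(node)
            return node

    # Step 2: otherwise borrow from strictly lower-priority services,
    # starting with the least important one.
    for donor in sorted(services, key=lambda s: s.priority):
        if donor.priority >= needy.priority:
            break
        for node in donor.nodes:
            if suitable(node):
                donor.nodes.remove(node)
                needy.nodes.append(node)
                return node

    return None  # nothing suitable anywhere; the shortfall remains


web = Service("web", priority=10, nodes=[])
batch = Service("batch", priority=1, nodes=[Node("n2", cpus=4, memory_gb=16)])
free = [Node("n1", cpus=2, memory_gb=4)]   # too small for the web tier
borrowed = find_replacement(free, [web, batch], web, min_cpus=4, min_memory_gb=8)
```

In this run the only free node is too small, so `n2` is taken from the batch service. Any record of "where batch runs" made a moment earlier is now stale, which is exactly why a static CMDB stops being useful.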

Understand that much of the network configuration will be handled by the internal cloud infrastructure. Now, this doesn't necessarily mean your core switches have to be managed by the cloud infrastructure. However, if you want the ability to allocate new compute capacity to an application that has specific networking requirements (as a web server would if you want it behind a load balancer), then the infrastructure must reconfigure the ports connected to the desired node to be on the right network. This issue can be a show-stopper to providing a full cloud computing solution, so talk about this one early and often with your network team so they have input on how they want to architect the network and have time to become comfortable with the level of control required by the infrastructure.
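A small sketch may help when discussing this with the network team. It shows the kind of decision the infrastructure has to make when a node is handed to an application on a particular network. Everything here is hypothetical (the function name, the port labels, the VLAN numbers), and it assumes networks are separated by VLAN. It only computes which edge ports would need to change and does not talk to any switch.

```python
from typing import Dict, List


def ports_to_reconfigure(node_ports: Dict[str, List[str]],
                         port_vlan: Dict[str, int],
                         node: str,
                         target_vlan: int) -> Dict[str, int]:
    """Return {port: target_vlan} for the node's ports not already on that VLAN."""
    return {
        port: target_vlan
        for port in node_ports[node]
        if port_vlan.get(port) != target_vlan
    }


# Hypothetical cabling and current VLAN assignments.
node_ports = {"n1": ["sw1/1", "sw1/2"]}
port_vlan = {"sw1/1": 10, "sw1/2": 20}

# n1 is being allocated to a web tier that lives on VLAN 20.
changes = ports_to_reconfigure(node_ports, port_vlan, "n1", target_vlan=20)
```

Only the edge ports attached to the node appear in the result, which matches the point above that the core switches need not be under the infrastructure's control.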

Boot image storage on the IP network. As I mentioned above, for internal cloud computing to work, you have to separate the application image from the physical node so that the image can be relocated on the fly as necessary to meet your policies. We currently leverage NFS for this separation as it can easily be configured to support this level of dynamism. Also, using a NAS lets you leverage a single IP network and reduces cost and complexity, as redundant data paths only have to be engineered for the IP network rather than for both the IP and the storage network. I don't mention SAN for the boot device because it can be problematic to move LUNs around on the fly due to the myriad of proprietary vendor switch management APIs. In addition, every server out there ships with at least two onboard NICs, while SAN HBAs are aftermarket add-ons (to the tune of hundreds of dollars if you want redundant channel bonding). Now, this single-network approach comes with a downside that I'm going to be upfront about: currently, IP networks are in the 1 Gigabit range with 10 Gigabit on the horizon, while SAN networks are 2-4 Gigabit (you can bond or trunk multiple Ethernets together to get better throughput, but we'll leave that aside for this discussion). If you have a very disk-intensive application, you'll need to architect the solution to work within a cloud infrastructure. Specifically, you don't want to be sending that traffic across to a NAS, as the throughput can suffer due to the limited bandwidth. You should look to use either local tmp space (if you only need temporary storage) or locally attached SAN that is zoned to a specific set of nodes that can act as backups to one another in case of failure.
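To put the bandwidth trade-off in perspective, here is a back-of-the-envelope calculation. It is a best case only: it assumes the link runs at its full nominal rate with no protocol overhead or contention, uses decimal units (1 GB = 8 gigabits), and the 100 GB data size is an arbitrary figure chosen for illustration.

```python
def transfer_seconds(size_gigabytes: float, link_gigabits_per_s: float) -> float:
    """Ideal time to move `size_gigabytes` over a link of the given nominal rate."""
    return size_gigabytes * 8 / link_gigabits_per_s


data_gb = 100  # illustrative data set size, not a measured workload

for label, rate in [("1 Gigabit Ethernet", 1), ("4 Gigabit SAN", 4), ("10 Gigabit Ethernet", 10)]:
    print(f"{label}: {transfer_seconds(data_gb, rate):.0f} s")
```

Under these assumptions the same data takes 800 seconds over 1 Gigabit Ethernet, 200 seconds over a 4 Gigabit SAN, and 80 seconds over 10 Gigabit Ethernet. Real throughput will be lower in every case, but the ratios show why a disk-intensive workload is a poor fit for a 1 Gigabit NAS path.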

Applications that run on commodity hardware. Internal cloud computing provides much more benefit when the applications to be managed run on commodity hardware. This hardware could be x86, POWER, or SPARC, depending on your environment, but should be the lower- to mid-level servers. It doesn't make a lot of sense to take something like a Sun SPARC F25K and put it into a cloud infrastructure, as it is already built to address scalability/availability issues within the chassis and has built-in high-speed interconnects for fast data access. With commodity hardware come numbers, and that is where a cloud infrastructure really shines, as it dramatically increases the span of control for the operators: they manage more at the SLA level and less at the node level.

Well, that's a lot to digest, so I think I'll stop now. I'm not trying to scare anyone away from moving to an internal cloud computing infrastructure. On the contrary, I believe internal clouds are the future of IT computing. The current “best” practices are, to a large extent, responsible for the management and over-provisioning issues facing most IT shops. However, for you to address the over-provisioning and benefit from a continuously optimized computing environment where the excess capacity you have is efficiently allocated to the applications that need it (instead of sitting idle in a specific stovepipe), you need to understand the fundamental issues that stand ahead of you. In your transition to internal cloud computing you will need to actively work to address these new issues, or else you will find yourself with just a different set of problems to chase than the ones you have now.
