Wednesday, December 29, 2010

Making 'good enough' the new normal

Looking back on the more insightful observations I heard about cloud computing in 2010, one kept coming up over and over again. In fact, it was reiterated by several analysts onstage at the Gartner Data Center Conference in Las Vegas earlier this month.

The thought went something like this:

IT is being weighed down by more and more complexity as time goes on. The systems are complex, the management of those systems is complex, and the underlying processes are, well, also complex.

The cloud seems to offer two ways out of this problem. First, going with a cloud-based solution allows you to start over, often leaving a lot of the complexity behind. But that’s been the same solution offered by any greenfield effort – it always seems deceptively easier to start over than to evolve what you already have. Note that I said “seems easier.” The real-world issues that got you into the complexity problem in the first place quickly return to haunt any such project. Especially in a large organization.

Cloud and the 80-20 rule

But I’m more interested in highlighting the second way that cloud can help. That way is more about the approach to architecture that is embodied in a lot of the cloud computing efforts. Instead of building the most thorough, full-featured systems, cloud-based systems are often using “good enough” as their design point.

This is the IT operations equivalent of the 80-20 rule. It’s the idea that not every system has to have full redundancy, fail-over, or other requirements. It doesn't need to be perfect or have every possible feature. You don't need to know every gory detail from a management standpoint. In most cases, going to those extremes means what you're delivering will be over-engineered and not worth the extra time, effort, and money. That kind of bad ROI is a problem.

“IT has gotten away from ‘good enough’ computing,” said Gartner’s Donna Scott in one of her sessions at the Data Center Conference. “There is a lot an IT dept can learn from cloud, and that’s one of them.”

The experiences of eBay

Speaking at the same conference about his experiences at eBay, Mazen Rawashdeh, vice president of eBay's technology operations, described his company’s need to understand what had the most impact on cost and efficiency and to optimize for those. That meant a lot of “good enough” decisions in other areas.

eBay IT developed metrics that helped drive the right decisions, and then focused, according to Rawashdeh, on innovation, innovation, innovation. They avoided the things that would weigh them down because “we needed to break the linear relationship between capacity growth and infrastructure cost,” said Rawashdeh. At the conference, he laid out a blueprint for a pretty dynamic IT operations environment, stress-tested by one of the bigger user bases on the Web.

Rawashdeh couched all of this IT operations advice in one of his favorite quotes from Charles Darwin: “It’s not the strongest of species that survive, nor the most intelligent, but the ones most responsive to change.” In the IT context, it means being resilient to lots of little changes – and little failures – so that the whole can still keep going. “The data center itself is our ‘failure domain,’” he said. Architecting lots of little pieces to be “good enough” lets the whole be stronger, and more resilient.

Everything I needed to know about IT operations I learned from my cloud provider

So who seems to be the best at “good enough” IT these days? Most would point to the cloud service providers, of course.

Many end-user organizations are starting to get this kind of experience, but aren’t very far along yet. Forrester’s James Staten says in his 2011 predictions blog that he believes end-user organizations will build private clouds in 2011, “and you will fail. And that’s a good thing. Because through this failure you will learn what it really takes to operate a cloud environment.” He recommends that you “fail fast and fail quietly. Start small, learn, iterate, and then expand.”

Most enterprises, Staten writes, “aren’t ready to pass the baton” – to deliver this sort of dynamic infrastructure – yet. “But service providers will be ready in 2011.” Our own Matt Richards agrees. He created a holiday-inspired list of some interesting things that service providers are using CA Technologies software to make possible.

In fact, Gartner’s Cameron Haight had a whole session at the Vegas event to highlight things that IT ops can learn from the big cloud providers.

Some highlights:

· Make processes experience-based, rather than set by experts. Just because it was done one way before doesn’t mean that’s the right way now. “Cloud providers get good at just enough process,” said Haight, especially in the areas of deployment and incident management.

· Failure happens. In fact, the big guys are moving toward a “recovery-oriented” computing philosophy. “Don’t focus on avoiding failure, but on recovery,” said Haight. The important stat with this approach is not mean-time-between-failures (MTBF), but mean-time-to-repair (MTTR). Reliability, in this case, comes from software, not the underlying hardware.

· Manageability follows from both software and management design. Management should lessen complexity, not add to it. Haight pointed toward tools trying to facilitate “infrastructure as code,” to enable flexibility.
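To put some numbers behind the MTBF-versus-MTTR point, here is a minimal sketch using the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The formula is textbook reliability math, not something Haight presented, and the two systems and their hour figures are hypothetical values picked purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system A: premium hardware that rarely fails,
# but needs a slow, manual repair when it does.
premium = availability(mtbf_hours=1000, mttr_hours=4)

# Hypothetical system B: commodity hardware that fails ten times as often,
# but recovers in minutes because recovery is automated in software.
commodity = availability(mtbf_hours=100, mttr_hours=0.1)

print(f"premium hardware, slow repair:    {premium:.4%}")
print(f"commodity hardware, fast recovery: {commodity:.4%}")
```

Under these assumed numbers, the system that fails more often is still up a larger share of the time, because availability depends only on the ratio of repair time to time between failures. Halving MTTR buys exactly as much availability as doubling MTBF, and it is often the cheaper of the two to achieve.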

Know when you need what

So, obviously, “good enough” is not always going to be, well, good enough for every part of your IT infrastructure. But it’s an idea that’s getting traction because of successes with cloud computing. Those successes are causing IT people to ask a few fundamental questions about how they can apply this approach to their specific IT needs. And that’s a useful thing.

In thinking about where and how “good enough” computing is appropriate, you need to ask yourself a couple of questions. First, how vital is the system I’m working on? What are its use, its tolerance for failure, its importance, and the like? The more critical it is, the more careful you have to be with your threshold of “good enough.”

Second, is speed of utmost importance? Cost? Security? Or a set of other things? Like Rawashdeh at eBay, know what metrics are important, and optimize for those.

Be honest with yourself and your organization about places you can try this approach. It’s one of the ideas that got quite a bit of attention in 2010 that’s worth considering.
