
Thursday, August 23, 2007

In Support of Non-Stop Software

Sometimes you say things, and measure the reaction. For example, tell a room of software professionals:

Software should never crash.

You will get unanimous agreement. Of course software does crash but it shouldn't. That's a given. But try another statement:

Software should never need to be restarted.

Now you will often find that a conversation ensues. "Never? Seems like a long time." Discussions of mitigating factors. No, never seems far too long. Need to put some time constraints on that. Daily is unreasonable but weekly isn't so bad. Why is it so bad that a service needs to be restarted once a week? After all, the service is stateless and load balanced. Should be okay, right?

But in the confines of software and the hardware it executes on, there is no explanation for deteriorating behavior. Granted, hardware can produce errors, but we'll not blame software for that. No, if software loses performance or stability over time, it can only be explained by a collection of bugs.

All software has bugs. That's a simple fact. Any complex piece of software will have hundreds of bugs associated with each release. The curiosity is why bugs that impact the stability of an application are not prioritized higher. In most organizations a bug that causes a crash is a top priority. A bug that reduces time between restarts to hours would also be considered high priority. But as that time starts moving into days the priority quickly drops. Why? Well, my theory is that far too many of us have become accustomed to this level of instability and accept it as a reality of software.

At one point I had a client whose operations team was unhappy with service start up time. I found that curious, because while a service that takes a while to launch is frustrating, it's rarely an operational handicap. The source of the frustration was that the service needed to be restarted every 2-3 days and the start time was becoming a challenge for maintaining availability. Now what was shocking to me is that I was being asked for advice on how to improve start time, NOT how to fix the problem forcing a restart every few days. That was accepted as a reality. This is precisely the complacency with regard to long running stability to which I refer.

The question then is how do we get past this? Well, it's actually quite simple. Stop accepting anything less than long term stability from your applications. Run burn in tests for longer periods of time. Stress test at full load for several days straight. Your application should be delivering the same performance with the same memory footprint after 5 days as it did on the first day. If it doesn't, consider that a high priority bug and fix it. Most importantly, instill a culture of expecting stability in your teams. Make it as important as delivering software that doesn't crash.
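To make the "same footprint after 5 days" rule concrete, here is a minimal sketch of a drift check that could run at the end of a burn in test. It assumes you already collect periodic samples of some metric (resident memory, p95 latency, whatever matters to you) as pairs of hours since start and value. The names `drift_per_day` and `is_stable` and the 1% per day tolerance are illustrative assumptions of mine, not part of any particular tool.

```python
from statistics import mean


def drift_per_day(samples):
    """Least-squares slope of a metric over time, in metric units per day.

    samples: list of (hours_since_start, value) pairs from a burn in run.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    days = [hours / 24.0 for hours, _ in samples]
    values = [value for _, value in samples]
    day_bar, value_bar = mean(days), mean(values)
    spread = sum((d - day_bar) ** 2 for d in days)
    if spread == 0:
        raise ValueError("samples must cover more than one point in time")
    covariance = sum(
        (d - day_bar) * (v - value_bar) for d, v in zip(days, values)
    )
    return covariance / spread


def is_stable(samples, tolerance_per_day=0.01):
    """True if the fitted trend moves less than tolerance_per_day per day.

    The tolerance is a fraction of the first sample's value; 0.01 (1% per
    day) is an arbitrary illustrative threshold, not a recommendation.
    """
    baseline = samples[0][1]
    return abs(drift_per_day(samples)) <= tolerance_per_day * abs(baseline)
```

Feed it five days of memory samples taken every few hours: a flat series passes, while a series that climbs a couple of megabytes an hour fails long before anyone would think to schedule a weekly restart. Fitting a trend rather than comparing only the first and last sample keeps one noisy reading from hiding or faking a leak.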

What is your opinion? I would love to hear what others have to say on this topic.


TrackBack

Listed below are links to weblogs that reference In Support of Non-Stop Software:

» Do we really need non-stop running service? from Libor.SOUCEK("WEBLog")
Dan Pritchett has brought up an interesting issue: Software should never need to be restarted. This is a pretty strong statement, and the question is when and how we shall realize it. These questions ... [Read More]

Comments

You'd probably find the Recovery-Oriented Computing (http://roc.cs.berkeley.edu) perspective interesting. If I hear people complain about having to deploy software, I'm usually more interested to dig for low-hanging fruit in their processes than I am eager to try to fix their system. No matter how good or "bad" a system is, it should be easy to deploy. (Of course, fixing the system is fun, too, but the process is probably more broken than the software.)

At IBM WebSphere, we also faced startup time concerns, but these were mostly about development needs and the MIPS burned when customers start a large number of JVMs on a big box at the same time. We implemented hot restart in WAS 6.0 by restarting singletons on an existing JVM when the original JVM fails. We've now added state replication with ObjectGrid to further reduce recovery time by failing over directly to the JVM that holds the state copy. With ObjectGrid, this brings failover times down to around a second or less.
Now, being able to restart/recover stateful singletons means that heisenbugs can be handled by periodically restarting the JVMs, since we've made recovery pretty fast. This goes down the recovery-focused computing path, but it's an interesting progression from where we started with pre-V6.0 WebSphere.

I have written a longer comment than can fit into the designated section. Details are available at http://lsblog.wordpress.com/2007/08/27/do-we-really-need-non-stop-running-service/

Given that you're striving for software that doesn't crash and is highly stable, you still can't remove hardware related issues (failure, maintenance, ...). How do you combat hardware related issues at eBay?

If the process of hard resetting the servers is automated, wouldn't you say that it is analogous in some way to garbage collection in the VM? And we all accept that, don't we?

Allan,

I should have worded that differently. What I should have written was:

In the eBay stack, do you do anything special to deal with hardware related issues?

Regards,
Al.

I have been frustrated by the very same issues you describe in your article. A common question that comes up in these conversations is "Would you fly in an airplane that was piloted by your software?" But you don't have to go nearly that far: most people wouldn't ride a pogo-stick that ran with the stability of most commercially developed software.

I have settled on the idea that while the bulk of software is still written by human hands, we will face errors. I think the best bang-for-the-buck is to design a robust system that can recover from and hide these issues. Run all services two-by-two (or five-by-five), load-balance all HTTP requests to web services, build into the system recoverability for lost messages, etc. In such an environment, restarting an ill-functioning server or grappling with services that take extended periods of time to start doesn't impact the system appreciably.

Ideally, we would all be using the most appropriate design patterns and exercising sound development methodologies. Projects such as Udi Dahan's NServiceBus will help to take the complicated plumbing out of the hands of mainstream developers and allow them to write simpler software components that are smaller in scope and easier to test. Until then, the best most enterprises can do is insulate the *outer* interfaces from such problems.

What do you think? Is it cheaper (and ultimately more productive) to build layers of protection into the system to keep bugs and failures hidden from the eyes of the consumer (especially in the context of non-stop computing)?

Thanks for your comments.
-Mike
