
Thursday, August 23, 2007


Paul Brown

You'd probably find the Recovery-Oriented Computing (https://roc.cs.berkeley.edu) perspective interesting. When I hear people complain about having to deploy software, I'm usually more interested in digging for low-hanging fruit in their processes than in trying to fix their system. No matter how good or "bad" a system is, it should be easy to deploy. (Of course, fixing the system is fun, too, but the process is probably more broken than the software.)

Account Deleted

At IBM WebSphere, we also faced startup-time concerns, mostly for development needs but also for the MIPS burned when a customer starts a large number of JVMs on a big box at the same time. We implemented hot restart in WAS 6.0 by restarting singletons on an existing JVM when the original JVM fails. We've since added state replication with ObjectGrid to further reduce recovery time by failing over directly to the JVM holding the state copy. This brings failover times with ObjectGrid to under a second.
Being able to restart and recover stateful singletons also means that heisenbugs can be handled by periodically restarting the JVMs, since recovery is now fast. This follows the recovery-oriented computing model, and it's an interesting progression from where we started in pre-6.0 WebSphere.
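The failover scheme described above can be sketched in plain Java. This is a minimal illustration, not the actual WebSphere or ObjectGrid API: the `Replica` and `FailoverGroup` names are hypothetical, and the "replication" here is just a synchronous copy of each write to every live backup, which is what lets a backup take over without replaying state.

```java
import java.util.*;

// Hypothetical sketch: Replica stands in for a JVM hosting a copy of a
// stateful singleton; FailoverGroup promotes a backup when the primary dies.
class Replica {
    final String jvmId;
    final Map<String, String> state = new HashMap<>();
    boolean alive = true;
    Replica(String jvmId) { this.jvmId = jvmId; }
}

class FailoverGroup {
    private final List<Replica> replicas = new ArrayList<>();
    private Replica primary;

    void add(Replica r) {
        replicas.add(r);
        if (primary == null) primary = r;
    }

    // Copy every write to all live replicas, so a backup already holds
    // the state when it is promoted -- the key to sub-second failover.
    void put(String key, String value) {
        for (Replica r : replicas) {
            if (r.alive) r.state.put(key, value);
        }
    }

    String get(String key) { return primary.state.get(key); }

    // On primary failure, promote a live backup that already has the
    // state copy instead of cold-starting a fresh JVM.
    void failPrimary() {
        primary.alive = false;
        for (Replica r : replicas) {
            if (r.alive) { primary = r; return; }
        }
        throw new IllegalStateException("no live replica left");
    }

    String primaryId() { return primary.jvmId; }
}
```

The same mechanism supports the periodic-restart idea: because a backup always holds current state, a deliberate restart of the primary looks identical to a failure and completes just as quickly.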


I have written a longer comment than fits in this section. Details are available at https://lsblog.wordpress.com/2007/08/27/do-we-really-need-non-stop-running-service/


Given that you're striving for software that doesn't crash and is highly stable, you still can't eliminate hardware-related issues (failures, maintenance, ...). How do you combat hardware-related issues at eBay?


If the process of hard-resetting the servers is automated, wouldn't you say it is analogous in some way to garbage collection in the VM? And we all accept that, don't we?



I should have worded that differently. What I should have written was:

In the eBay stack, do you do anything special to deal with hardware-related issues?



I have been frustrated by the very same issues you describe in your article. A common question that comes up in these conversations is "Would you fly in an airplane that was piloted by your software?" But you don't have to go nearly that far: most people wouldn't ride a pogo-stick that ran with the stability of most commercially developed software.

I have settled on the idea that as long as the bulk of software is written by human hands, we will face errors. I think the best bang for the buck is to design a robust system that can recover from and hide these issues: run all services two-by-two (or five-by-five), load-balance all HTTP requests to web services, build recoverability for lost messages into the system, etc. In such an environment, restarting an ill-functioning server or grappling with services that take a long time to start doesn't impact the system appreciably.
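The "run everything redundantly and hide the failure" idea can be sketched in a few lines of Java. This is an illustrative example, not any particular load balancer's API: the `callWithFailover` helper is hypothetical, and each replica is modeled as a simple function that either answers or throws.

```java
import java.util.*;
import java.util.function.Function;

// Illustrative sketch: try each redundant replica in turn, so one
// ill-functioning server stays hidden from the consumer.
class RedundantCall {
    static <T> T callWithFailover(List<Function<String, T>> replicas,
                                  String request) {
        RuntimeException last = null;
        for (Function<String, T> replica : replicas) {
            try {
                return replica.apply(request); // first healthy replica wins
            } catch (RuntimeException e) {
                last = e; // replica down or misbehaving; mask it, move on
            }
        }
        // Only surfaces when every replica has failed.
        throw new IllegalStateException("all replicas failed", last);
    }
}
```

A real deployment would add timeouts, health checks, and retry budgets, but the consumer-facing contract is the same: a single server failure never becomes a caller-visible error.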

Ideally, we would all be using the most appropriate design patterns and exercising sound development methodologies. Projects such as Udi Dahan's NServiceBus will help take the complicated plumbing out of the hands of mainstream developers and allow them to write simpler software components that are smaller in scope and easier to test. Until then, the best most enterprises can do is insulate the *outer* interfaces from such problems.

What do you think? Is it cheaper (and ultimately more productive) to build layers of protection into the system to keep bugs and failures hidden from the eyes of the consumer (especially in the context of non-stop computing)?

Thanks for your comments.
