Sometimes you say things, and measure the reaction. For example, tell a room of software professionals:
Software should never crash.
You will get unanimous agreement. Of course software does crash, but it shouldn't. That's a given. But try another statement:
Software should never need to be restarted.
Now you will often find that a conversation ensues. "Never? Seems like a long time." Discussions of mitigating factors. No, never seems far too long. Need to put some time constraints on that. Daily is unreasonable but weekly isn't so bad. Why is it so bad that a service needs to be restarted once a week? After all, the service is stateless and load balanced. Should be okay, right?
But in the confines of software and the hardware it executes on, there is no explanation for deteriorating behavior. Granted, hardware can produce errors, but we won't blame software for that. No, if software loses performance or stability over time, it can only be explained by a collection of bugs.
All software has bugs. That's a simple fact. Any complex piece of software will have hundreds of bugs associated with each release. The curiosity is why bugs that impact the stability of an application are not prioritized higher. In most organizations a bug that causes a crash is a top priority. A bug that reduces time between restarts to hours would also be considered high priority. But as that time starts moving into days the priority quickly drops. Why? Well, my theory is that far too many of us have become accustomed to this level of instability and accept it as a reality of software.
At one point I had a client whose operations team was unhappy with service startup time. I found that curious, because while a service that takes a while to launch is frustrating, it's rarely an operational handicap. The source of the frustration was that the service needed to be restarted every 2-3 days, and the start time was becoming a challenge for maintaining availability. Now what was shocking to me is that I was being asked for advice on how to improve start time, NOT how to fix the problem forcing a restart every few days. That was accepted as a reality. This is precisely the complacency with regard to long-running stability to which I refer.
The question then is how do we get past this? Well, it's actually quite simple. Stop accepting anything less than long-term stability from your applications. Run burn-in tests for longer periods of time. Stress test at full load for several days straight. Your application should be delivering the same performance with the same memory footprint after 5 days as it did on the first day. If it does anything else, consider it a high priority bug and fix it. Most importantly, instill a culture of expecting stability in your teams. Make it as important as delivering software that doesn't crash.
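To make the "same memory footprint after 5 days" check concrete, here is a minimal sketch of how a burn-in test might flag drift. It is one possible approach, not a prescribed method: the class name `SoakCheck`, the window size, and the 5% tolerance are all assumptions made up for illustration. The idea is to average the first few samples and the last few samples of a metric (heap in use, response time, whatever you track) and fail the test if the later average has grown beyond the tolerance.

```java
import java.util.Arrays;

// Hypothetical helper for a burn-in (soak) test. All names and thresholds
// here are illustrative assumptions, not taken from any real library.
public class SoakCheck {

    // Mean of samples in the half-open range [from, to).
    public static double mean(long[] samples, int from, int to) {
        return Arrays.stream(samples, from, to).average().orElse(0.0);
    }

    // Relative growth between the average of the first `window` samples
    // and the average of the last `window` samples. 0.10 means 10% growth.
    public static double drift(long[] samples, int window) {
        if (window <= 0 || samples.length < 2 * window) {
            throw new IllegalArgumentException("need at least 2 * window samples");
        }
        double first = mean(samples, 0, window);
        double last = mean(samples, samples.length - window, samples.length);
        if (first <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (last - first) / first;
    }

    // True when the metric has not grown by more than `tolerance`.
    public static boolean isStable(long[] samples, int window, double tolerance) {
        return drift(samples, window) <= tolerance;
    }

    // One way to sample heap in use from inside the JVM. The value is noisy
    // because of garbage collection, which is why drift() averages a window.
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        // Synthetic samples standing in for days of measurements.
        long[] flat = {100, 101, 99, 100, 100, 101};
        long[] leaking = {100, 110, 120, 130, 140, 150};

        System.out.println("flat stable: " + isStable(flat, 2, 0.05));
        System.out.println("leaking stable: " + isStable(leaking, 2, 0.05));
        System.out.println("heap in use now: " + usedHeapBytes() + " bytes");
    }
}
```

In a real burn-in run you would record a sample at a fixed interval while the service is under full load, then apply the same check to response times as well as memory. Anything that fails the check gets filed as a high priority bug.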
What is your opinion? I would love to hear what others have to say on this topic.