During one of my first few weeks at eBay I got involved in a conversation about mark down logic. Now, I had only been at the company for a short while and I was working for an e-commerce company, so I assumed mark down logic must be some business rules about price discounts. Certainly seemed like a reasonable line of thought. As it turns out, I was completely wrong and as a result got introduced to a concept that is critical to the ability of any site to achieve high levels of availability.
The term came into existence inside eBay because the DBAs wanted a mechanism to tell the applications that the database was down, regardless of the true state of the database. They wanted to mark the database state as down. The original motivation was to deal with challenges with the database listener. With hundreds of application servers (at that time) all waiting for the database to come back up, the moment the listener was turned on, a connection storm would hit and often cause the database to go down again. Rather than requiring an involved set of start-up procedures that would effectively be a total site reboot, the desire was to be able to control the rate of application connections.
There are actually two concepts that have to be considered here:
- The ability to mark an external resource state as up or down and have the application honor that state.
- The ability of the application to behave in a defined way when the resource is down and to return to the proper behavior when the resource returns, without being restarted.
There are nuances to supporting mark down, however. The first is deciding how you will change the state. For a small deployment, something as simple as an HTTP POST on an administrative listener could be used. I've also seen a configuration file with a watch for modifications. Tools like Puppet can then be used to push out state changes. This works well for small deployments. Larger deployments would benefit from a configuration service like ZooKeeper.
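As a minimal sketch of the watched-configuration-file approach, the snippet below polls a state file's modification time and reloads it when a tool like Puppet pushes out a change. The file path, format, and resource names are all assumptions for illustration, not eBay's actual mechanism:

```python
import json
import os

# Hypothetical state file pushed out by configuration management,
# e.g. {"user_db": "down", "search_service": "up"}
STATE_FILE = "/etc/app/markdown.json"

class MarkdownState:
    """Caches resource up/down state, re-reading the file when it changes."""

    def __init__(self, path=STATE_FILE):
        self.path = path
        self.mtime = 0.0
        self.state = {}

    def is_down(self, resource):
        try:
            mtime = os.path.getmtime(self.path)
            if mtime != self.mtime:  # file was modified: reload it
                with open(self.path) as f:
                    self.state = json.load(f)
                self.mtime = mtime
        except OSError:
            # Missing or unreadable state file: assume everything is up.
            self.state = {}
        return self.state.get(resource) == "down"
```

Application code then checks `is_down("user_db")` before touching the resource, and picks up a state change on the next call without being restarted.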
The second concept is much more involved. The challenge faced here is that applications need to behave in a predictable way when a resource becomes unavailable. I chose predictable here because the actual behavior is going to vary considerably with the application logic and the down resource. While simply returning HTTP status 500 may be predictable, that's not what I mean and is usually not sufficient to be considered robust.
One of the most challenging but important considerations is what the application can do without the resource that is down. The simple-minded approach is to state that it can do nothing and return the equivalent of "service temporarily unavailable." This may in fact be the only option, depending upon the scenario. A more robust approach, however, is to design applications to make as much forward progress as they can with the resource missing. Design the application with resilience to missing resources. Think about what it could do if you took the resource away. What functionality could still be provided?
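To make that concrete, here is a sketch of a page handler that degrades gracefully when an optional dependency is marked down. The names (`is_down`, `fetch_recommendations`, `recommendation_service`) are hypothetical and injected as parameters so the degradation logic is easy to see and test:

```python
def render_product_page(product_id, is_down, fetch_recommendations):
    """Render the core page even when the recommendations resource is
    unavailable. is_down(resource) and fetch_recommendations(id) are
    illustrative stand-ins for a mark down check and a backend call."""
    page = {"product_id": product_id, "recommendations": []}
    if not is_down("recommendation_service"):
        try:
            page["recommendations"] = fetch_recommendations(product_id)
        except ConnectionError:
            # Resource is misbehaving even though not marked down:
            # degrade rather than fail the whole page.
            pass
    # The core page still renders; only the optional feature is missing.
    return page
```

The buyer can still view and purchase the product; only the "you might also like" strip disappears while the resource is down.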
Equally challenging is managing state that might get confused if the resource becomes unavailable. When exceptions start coming back from database connections or REST services, the internal state of the application could become corrupt. The result is that even though the resource has returned, the application is unable to use it correctly and ultimately has to be restarted itself.
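One common defense, sketched below under the assumption of a simple connection pool, is to discard any connection whose operation fails instead of returning it to the pool. Corrupt handles never linger, so the application heals on its own once the resource comes back; `connect_fn` is an assumed factory such as a database driver's connect call:

```python
class SafeConnectionPool:
    """Sketch of a pool that throws away a connection whenever a call on
    it fails, so stale or corrupt handles never outlive an outage."""

    def __init__(self, connect_fn):
        self.connect_fn = connect_fn
        self.idle = []

    def run(self, operation):
        conn = self.idle.pop() if self.idle else self.connect_fn()
        try:
            result = operation(conn)
        except Exception:
            # Do NOT return the connection to the pool: it may be in an
            # undefined state. A fresh one is created on the next call,
            # so the app recovers as soon as the resource does.
            raise
        self.idle.append(conn)
        return result
```

The key property is that no application-level state survives a failed call; recovery requires no restart, only the next request.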
This brings me to another important point. The only way to make sure that your application can behave predictably and recover correctly from resources going down is to test it. Testing mark down needs to be a standard part of the application regression suite. Netflix has taken it to the ultimate state by creating a Chaos Monkey. They turn it loose in production with the sole purpose of randomly killing things and making sure their applications can survive. I'm a fan!
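A mark down regression need not be elaborate. A minimal, self-contained sketch: a fake service with an `up` flag stands in for the real resource, and the test asserts degraded behavior while it is down and full recovery, without a restart, when it comes back. All names here are illustrative:

```python
class FakeService:
    """Test double for a markable resource."""
    def __init__(self):
        self.up = True

    def query(self):
        if not self.up:
            raise ConnectionError("marked down")
        return "data"

def handle_request(service):
    try:
        return service.query()
    except ConnectionError:
        return "degraded"  # a predictable fallback, not a 500

def test_mark_down_recovery():
    svc = FakeService()
    assert handle_request(svc) == "data"      # normal operation
    svc.up = False                            # mark the resource down
    assert handle_request(svc) == "degraded"  # predictable degraded mode
    svc.up = True                             # the resource returns
    assert handle_request(svc) == "data"      # recovery without a restart
```

Run this on every build and you will know, before production does, whether your application honors mark down and comes back cleanly.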