I've been looking at how we typically achieve reliability in architectures. In this case I'm looking more at the reliability of persistent information. Quite simply reliability is achieved bottom up. Data is stored on a SAN. The SAN uses RAID to deal with reliability of the underlying hardware. Even the underlying hardware has various error management features (parity/ECC). The SAN is connected to the database servers through redundant connections. The database relies upon a complex set of ACID rules to insure data has been committed to the SAN.
There is no argument that this works. When the database transaction completes the data is stored (at least in one location). You can rest assured that a hardware failure has not compromised the data. It is safely available someplace, although there may be a delay before it is available. But this assurance comes at an incredible cost. Starting at the bottom of the stack, RAID can reduce capacity by as little as 20% (RAID 5 on 5 drives) to 50% (RAID 1). SAN adds more cost by putting the drives on their own network. This is critical for quick migration between database servers. This cost can be easily seen in the cost per gigabyte. Internal drives on workstations cost as little as $1-2/GB while the SAN storage costs $15/GB or more.
With an order of magnitude at play in storage cost, looking at alternatives seems reasonable. Is it possible to move reliability up to the application level, thereby removing the need for ultra reliable hardware? Certainly if it can be done, the cost savings could be significant. Replacing specialized hardware with commodity hardware always brings not only cost savings but flexibility in hardware utilization.
One technology I've looked at is Distributed Hash Tables (DHT). A DHT spreads the concept of a hash table across many nodes. The key space is divided across nodes making the table resilient to single node failures. Part of the key space may become unavailable but the remaining key space survives. The implementations I've looked at (Free Pastry and Bamboo) actually replicate keys across multiple nodes so a single node failure does not compromise the availability of keys, only the transactional capacity of that particular key partition.
DHT's are particularly interesting because the distribution and replication can span data centers as well. From an application level, this is useful as it solves disaster recoverability and makes the application more resilient, not only to individual hardware failures but even localized disasters that might take an entire data center offline. Solving both node level and data center level resilience with a single solution is a rare win.
In the case of DHT's this comes at a cost though. Access times are notoriously slow range from 10's to 100's of milliseconds. This creates challenges for many applications and may in fact make a DHT solution untenable. But that doesn't mean the whole concept of application level reliability is invalidated. In many cases, there are optimizations that can be made at the application level that will still permit the problem to be solved and preserve the use of commodity hardware.
Comments and ideas along this idea are most welcome!
Technorati Tags: architecture, engineering, performance, scalability, services, to_read, toread, web
One thing to think about early on with this is whether you are running this distributed hash table across slow WAN with untrusted hosts or a fast network with trusted hosts.
P2P systems like Pastry are designed to deal with untrusted hosts and a slower network. Response times are long because of the constraints of operating in that challenging environment.
Systems like Google's Bigtable are designed to run on trusted hosts on an internal network. They can achieve much higher performance.
It sounds like all you want is a distributed hash or map that runs across commodity hardware with cheap disks on your own network. If that's the case, I'd think you'd want something much more like Bigtable (or HBase) than Pastry.
Posted by: Greg Linden | Monday, September 17, 2007 at 11:35 AM
Regarding the transactional cost and/or latency of DHT--imagine a DHT implementation where clients that join the DHT network act as DHT nodes. Imagine a setup where key partitions are small and a node may contain multiple partitions--as well as partial partitions.
For reads, DHT nodes look in their local HT key partitions first (local access == fast), otherwise use nearest-neighbor queries to fill their local HT key partition for just the key that was accessed. (Think: peer-to-peer DHT.)
For writes, we propagate a write message to our connected peers. This can either replace the value in the local HT or just invalidate/free the value, based on a TTL.
"But what about transactional writes?!" As long as we avoid ugly things like XA and can tolerate dirty reads during the propagation delay, we're okay.
This description sounds a lot like a neural network ...
Posted by: Dossy Shiobara | Monday, September 17, 2007 at 03:54 PM
This is a very interesting discussion as there is definitely more than one way to skin a cat. I guess that the approach to design reliability from bottom up will inevitably lead to an expensive solution. The thing is that at the end of the day the complexity of the system is in the application and in case of bottom-up design you end up with the rest of the stack trying to overcome the deficiencies of the application layer as opposed to supporting it in its task. Of course, there is no silver bullet but for large web-based systems you are probably best off designing your application nodes as autonomous and stateless as possible and building data redundancy into the database layer using horizontal splitting and replication. This way you only need a very thin slice of the system to be reliable at any given time. Also, you don't rely on any particular component (database, disk, application server etc) to be extremely reliable (other slices can just step in and take over) and can thus use cheap off-the-shelf hardware. Of course, this has a performance penalty as everything needs to be persisted, replicated, queued and so forth. Which means that the key here is your database being very fit allowing for essential features like clustering and queues on a very low level. This in turn leads to the conclusion that you need to add simplicity to your data model by making your tables thin, relations simple and code robust.
Posted by: Andres Kütt | Tuesday, September 18, 2007 at 01:16 AM
See Brad Fitzpatrick's MogileFS http://danga.com/mogilefs
Posted by: Jay | Monday, September 24, 2007 at 06:14 PM
On my previous job I proposed a solution that utilized exactly that - application level reliability. I said "for new system we could use 10 servers $800 each and split work across them instead of using that ultra-expensive monster server". I could not persuade others that this is viable approach, although I still believe it was. However in that solution data and work partition had to be done via some dynamic configuration which is not so elegant as DHT.
There is a problem with application level reliability: it seems that is should be done in a unique way for each application and this is not affordable in many cases. Some generic ideas as that DHT are still rare. On the other hand database or storage levels are universal. So either you need to be smart or you have to be rich to achieve reliability :)
Posted by: Petr Gladkikh | Tuesday, September 25, 2007 at 03:27 AM
You might be very interested in Amazon's Dynamo system.
See Wernel Vogels' article which has a short explanation and a link to a paper they wrote.
Search for amazon dynamo.
Posted by: Tim Freeman | Monday, October 08, 2007 at 11:52 AM
if you are still interested in DHTs, take a look at scalaris:
--> http://code.google.com/p/scalaris/
it's built in erlang ...
Posted by: Tim | Wednesday, August 27, 2008 at 04:53 PM