« In Support of Non-Stop Software | Main | Virtual Conundrums »

Sunday, September 16, 2007


Greg Linden

One thing to think about early on with this is whether you are running this distributed hash table across slow WAN with untrusted hosts or a fast network with trusted hosts.

P2P systems like Pastry are designed to deal with untrusted hosts and a slower network. Response times are long because of the constraints of operating in that challenging environment.

Systems like Google's Bigtable are designed to run on trusted hosts on an internal network. They can achieve much higher performance.

It sounds like all you want is a distributed hash or map that runs across commodity hardware with cheap disks on your own network. If that's the case, I'd think you'd want something much more like Bigtable (or HBase) than Pastry.

Dossy Shiobara

Regarding the transactional cost and/or latency of DHT--imagine a DHT implementation where clients that join the DHT network act as DHT nodes. Imagine a setup where key partitions are small and a node may contain multiple partitions--as well as partial partitions.

For reads, DHT nodes look in their local HT key partitions first (local access == fast), otherwise use nearest-neighbor queries to fill their local HT key partition for just the key that was accessed. (Think: peer-to-peer DHT.)

For writes, we propagate a write message to our connected peers. This can either replace the value in the local HT or just invalidate/free the value, based on a TTL.

"But what about transactional writes?!" As long as we avoid ugly things like XA and can tolerate dirty reads during the propagation delay, we're okay.

This description sounds a lot like a neural network ...

Andres Kütt

This is a very interesting discussion as there is definitely more than one way to skin a cat. I guess that the approach to design reliability from bottom up will inevitably lead to an expensive solution. The thing is that at the end of the day the complexity of the system is in the application and in case of bottom-up design you end up with the rest of the stack trying to overcome the deficiencies of the application layer as opposed to supporting it in its task. Of course, there is no silver bullet but for large web-based systems you are probably best off designing your application nodes as autonomous and stateless as possible and building data redundancy into the database layer using horizontal splitting and replication. This way you only need a very thin slice of the system to be reliable at any given time. Also, you don't rely on any particular component (database, disk, application server etc) to be extremely reliable (other slices can just step in and take over) and can thus use cheap off-the-shelf hardware. Of course, this has a performance penalty as everything needs to be persisted, replicated, queued and so forth. Which means that the key here is your database being very fit allowing for essential features like clustering and queues on a very low level. This in turn leads to the conclusion that you need to add simplicity to your data model by making your tables thin, relations simple and code robust.


See Brad Fitzpatrick's MogileFS http://danga.com/mogilefs

Petr Gladkikh

On my previous job I proposed a solution that utilized exactly that - application level reliability. I said "for new system we could use 10 servers $800 each and split work across them instead of using that ultra-expensive monster server". I could not persuade others that this is viable approach, although I still believe it was. However in that solution data and work partition had to be done via some dynamic configuration which is not so elegant as DHT.

There is a problem with application level reliability: it seems that is should be done in a unique way for each application and this is not affordable in many cases. Some generic ideas as that DHT are still rare. On the other hand database or storage levels are universal. So either you need to be smart or you have to be rich to achieve reliability :)

Tim Freeman

You might be very interested in Amazon's Dynamo system.

See Wernel Vogels' article which has a short explanation and a link to a paper they wrote.

Search for amazon dynamo.


if you are still interested in DHTs, take a look at scalaris:

--> http://code.google.com/p/scalaris/

it's built in erlang ...

The comments to this entry are closed.