
Sunday, September 30, 2007

Virtual Conundrums

I read with great interest the posting on virtual computing as this is something I believe is the future of deployment models. Abstracting the physical nodes from the software architecture will definitely allow more flexibility. Of course there has been a natural progression of abstracting the hardware over the years. High level languages removed dependence on machine instructions. Virtual memory removed dependence upon physical memory constraints. Virtual machines complete with portable libraries like Java have further lifted software off dependence on operating systems and the underlying systems.

Each of these abstractions comes at a cost though. Making an aspect of the system more opaque allows for better portability and hopefully longer life for applications. But it also means the applications are less well adapted to the physical environment where they run. Java applications have improved in performance and resource usage dramatically over the years, but they still can't match a well-written C++ program in total resource utilization. In most cases, the development efficiencies achieved by Java are far more important than the incremental resource cost, so Java is the preferred language. But it is important to recognize that abstraction comes at the cost of efficiency in almost all cases.

Virtualizing techniques like Xen and Solaris Containers definitely provide much needed capabilities that will allow hardware utilization to be increased. Most applications won't be able to detect whether they are running in a single instance of the operating system on a system or they are one of many instances of an OS container. There are a few tricky places though where virtual containers will show up to confuse the developer.

Any application that is heavily biased towards hardware will potentially require careful design to not break in a virtualized world. Networking, storage, and resource monitoring components all carry certain expectations about how the hardware behaves when they interact with it. In some cases, these expectations are not met and the software gets surprised. Architecting for this can be a bit tricky with the current tools, as they have created a largely opaque view of the physical resources. Going forward, it may be necessary to allow certain applications to pierce the veil of the virtual container, at least sufficiently to calibrate themselves to the container. For example, if an application wants to throttle data based on processor or network utilization, the container's view of utilization may be insufficient to allow the application to truly avoid overrunning the available resources.
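As one illustration of calibrating to the container, here is a minimal Python sketch of a throttle decision that accounts for time taken by the hypervisor. It assumes a Linux guest, where the aggregate `cpu` line of /proc/stat includes a "steal" counter; the function names and the 0.8 limit are hypothetical choices for illustration, not something the tools above prescribe.

```python
def cpu_times(stat_line):
    """Parse an aggregate 'cpu' line in Linux /proc/stat format into tick counts."""
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:1 + len(names)]]
    # Older kernels omit trailing fields; treat missing ones as zero.
    values += [0] * (len(names) - len(values))
    return dict(zip(names, values))


def utilization(before, after):
    """Return (busy_fraction, steal_fraction) between two samples."""
    delta = {name: after[name] - before[name] for name in before}
    total = sum(delta.values())
    if total <= 0:
        return 0.0, 0.0
    steal = delta["steal"] / total
    # Busy time is what the guest itself consumed, excluding idle, I/O wait, and steal.
    busy = (total - delta["idle"] - delta["iowait"] - delta["steal"]) / total
    return busy, steal


def should_throttle(busy, steal, limit=0.8):
    """Hypothetical policy: stolen time counts against the capacity we can use."""
    return busy + steal >= limit


# Two hand-written samples standing in for successive reads of /proc/stat.
first = cpu_times("cpu 100 0 50 800 50 0 0 0")
second = cpu_times("cpu 150 0 70 850 60 0 0 70")
busy, steal = utilization(first, second)
print(busy, steal, should_throttle(busy, steal))
```

In this sample the guest looks only about a third busy from the inside, yet roughly another third of the interval went to other tenants on the same hardware. A throttle that looked at the guest's own busy time alone would overestimate the headroom.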

Abstracting the deployment definitely has promise. As we gain experience with this technology, the biggest challenge will be providing the right level of abstraction without hiding information that applications need. Resource usage, networking topologies, latency, and potentially other concrete information may be pertinent to certain classes of applications. Providing this while leveraging the advantages of virtual platforms will be critical to the overall success.


Sunday, September 16, 2007

Inverting the Reliability Stack

I've been looking at how we typically achieve reliability in architectures. In this case I'm looking more at the reliability of persistent information. Quite simply, reliability is achieved bottom up. Data is stored on a SAN. The SAN uses RAID to deal with reliability of the underlying hardware. Even the underlying hardware has various error management features (parity/ECC). The SAN is connected to the database servers through redundant connections. The database relies upon a complex set of ACID rules to ensure data has been committed to the SAN.

There is no argument that this works. When the database transaction completes, the data is stored (at least in one location). You can rest assured that a hardware failure has not compromised the data. It is safely available someplace, although there may be a delay before it is available. But this assurance comes at an incredible cost. Starting at the bottom of the stack, RAID reduces usable capacity by anywhere from 20% (RAID 5 on 5 drives) to 50% (RAID 1). The SAN adds more cost by putting the drives on their own network, which is critical for quick migration between database servers. The result is easy to see in the cost per gigabyte. Internal drives on workstations cost as little as $1-2/GB while SAN storage costs $15/GB or more.
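The arithmetic behind those figures can be sketched in a few lines of Python. The helper names are hypothetical, and the sketch assumes RAID 1 means a two-way mirror and RAID 5 gives up one drive's worth of capacity to parity.

```python
def raid_overhead(level, drives):
    """Fraction of raw capacity given up to redundancy."""
    if level == 1:
        # Assumes a two-way mirror: every block is stored twice.
        return 0.5
    if level == 5:
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        # One drive's worth of capacity holds parity.
        return 1 / drives
    raise ValueError("only RAID 1 and RAID 5 are covered in this sketch")


def cost_per_usable_gb(raw_cost_per_gb, overhead):
    """Price of a usable gigabyte once the redundancy overhead is paid."""
    return raw_cost_per_gb / (1 - overhead)


print(raid_overhead(5, 5))                              # 20% of raw capacity
print(raid_overhead(1, 2))                              # 50% of raw capacity
print(cost_per_usable_gb(2.0, raid_overhead(1, 2)))     # mirrored $2/GB drives
print(15.0 / 2.0, 15.0 / 1.0)                           # SAN vs. internal price ratio
```

Even if the $2/GB internal drives are mirrored, the usable gigabyte costs about $4, still well below the $15/GB SAN figure. The raw price ratio runs from 7.5x to 15x, which is where the order of magnitude comes from.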

With an order of magnitude at play in storage cost, looking at alternatives seems reasonable. Is it possible to move reliability up to the application level, thereby removing the need for ultra reliable hardware? Certainly if it can be done, the cost savings could be significant. Replacing specialized hardware with commodity hardware always brings not only cost savings but flexibility in hardware utilization.

One technology I've looked at is Distributed Hash Tables (DHTs). A DHT spreads the concept of a hash table across many nodes. The key space is divided across nodes, making the table resilient to single node failures. Part of the key space may become unavailable but the remaining key space survives. The implementations I've looked at (FreePastry and Bamboo) actually replicate keys across multiple nodes, so a single node failure does not compromise the availability of keys, only the transactional capacity of that particular key partition.
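To make the partitioning and replication idea concrete, here is a minimal Python sketch. It uses a simple consistent-hash ring with successor replication, which is a simplification chosen for brevity and not the routing scheme FreePastry or Bamboo actually implement. The `Ring` class, the node names, and the replica count of 3 are all hypothetical.

```python
import bisect
import hashlib


def ring_position(value):
    """Map a string onto the ring using SHA-1."""
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy key-space partitioning: each key lives on the next `replicas` nodes."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.points = sorted((ring_position(n), n) for n in nodes)

    def owners(self, key):
        """Return the nodes holding `key`, primary first."""
        if not self.points:
            return []
        # First node clockwise from the key's position on the ring.
        start = bisect.bisect_right(self.points, (ring_position(key), ""))
        count = min(self.replicas, len(self.points))
        return [self.points[(start + i) % len(self.points)][1] for i in range(count)]

    def remove(self, node):
        """Simulate a node failure."""
        self.points = [p for p in self.points if p[1] != node]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
before = ring.owners("customer:42")
ring.remove(before[0])          # the primary for this key fails
after = ring.owners("customer:42")
print(before, after)
```

After the primary fails, the former second replica becomes the primary and the key stays reachable. A real system would also copy the data to a new third replica, which this sketch does not model.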

DHTs are particularly interesting because the distribution and replication can span data centers as well. At the application level, this is useful as it solves disaster recoverability and makes the application more resilient, not only to individual hardware failures but even to localized disasters that might take an entire data center offline. Solving both node level and data center level resilience with a single solution is a rare win.

In the case of DHTs this comes at a cost though. Access times are notoriously slow, ranging from tens to hundreds of milliseconds. This creates challenges for many applications and may in fact make a DHT solution untenable. But that doesn't mean the whole concept of application level reliability is invalidated. In many cases, there are optimizations that can be made at the application level that will still permit the problem to be solved and preserve the use of commodity hardware.

Comments and ideas along these lines are most welcome!
