Sunday, October 14, 2007

Gartner's Top 10 Technologies for 2008

Network World has published an article on Gartner's Top 10 Technologies for 2008. I love predictions although I will admit that October is a bit early. But why not get yours out early and beat the holiday rush, right? Actually as predictions go these are not bad. Like most, they range from the obvious to the "as-if" but it's risky to stick your neck out and we shouldn't be harsh with any organization that does. I do have a few comments on the predictions though.

Starting off in the #1 spot is green computing. I think they are right, especially for large organizations. It's no surprise to anybody trying to run a large data center that power is now the constraining resource. Unfortunately, the message is that green computing is a hardware problem. They also fall short of tying #5, virtualization, to the challenges of green computing. But if we get people to use virtualization to drive up utilization without realizing it's also saving power, I'm okay.

I'm a bit more dubious about #2. Not that I wouldn't love to have some unification of my communication world, but for me it seems my communications are diverging, not converging. I have far too many mailboxes, phone numbers, and calendars now. Every social network a friend joins just brings another communication channel for me to manage. At some point we may all cry "Uncle!" but right now it feels like we are moving in any direction but unified communications.

The next one that captured my attention was #4, meta data. This is definitely a problem that needs more attention. It is usually left as an afterthought to the overall architecture, and yet any system of scale is virtually unusable without a good meta data system that is designed in, not added on. I couldn't agree more that this needs to become a serious focus in the year to come.

Virtualization is #5 on their list. Personally I'd move this higher. Abstracting deployments from physical instances is incredibly powerful. It is the next step in the logical progression of abstractions (machine language, high level language, virtual machines, and now virtual OS containers). Fortunately there is little for application developers to do on this one but if you aren't designing your applications to live happily inside a virtual container, start now.

It is interesting that they broke out their definition of computing fabrics into a separate item, #8. If my execution platform is sufficiently abstract, then a logical extension is that the hardware platform will be able to gain flexibility in how it allocates physical resources like processors and memory across boundaries within blades. The more interesting fabric to me is a better abstraction of systems like Beowulf so programming becomes more practical. There are an increasing number of social problems that scale far beyond the confines of one system but also don't split well due to the network nature of the data. Environments that let mere mortals program to loosely coupled clusters would help tremendously.

As always with predictions, it's fun to have a look back at the end of the year and see what you got right and what went far astray. I'm sure 2008 will follow several of these trends but will also bring us some interesting surprises. Feel free to share your predictions with comments!

Sunday, September 30, 2007

Virtual Conundrums

I read with great interest the posting on virtual computing as this is something I believe is the future of deployment models. Abstracting the physical nodes from the software architecture will definitely allow more flexibility. Of course there has been a natural progression of abstracting the hardware over the years. High level languages removed dependence on machine instructions. Virtual memory removed dependence upon physical memory constraints. Virtual machines complete with portable libraries like Java have further lifted software off dependence on operating systems and the underlying systems.

Each of these abstractions comes at a cost though. Making an aspect of the system more opaque allows for better portability and hopefully longer life for applications. But it also means the applications are less well adapted to the physical environment where they run. Java applications have improved in performance and resource usage dramatically over the years but they still can't match a well written C++ program in total resource utilization. In most cases, the development efficiencies achieved by Java are far more important than the incremental resource utilization, so Java is the preferred language. But it is important to recognize that achieving abstractions comes at the cost of efficiency in almost all cases.

Virtualizing techniques like Xen and Solaris Containers definitely provide much needed capabilities that will allow hardware utilization to be increased. Most applications won't be able to detect whether they are running in a single instance of the operating system on a system or they are one of many instances of an OS container. There are a few tricky places though where virtual containers will show up to confuse the developer.

Any application that is heavily biased towards hardware will potentially require careful design to not break in a virtualized world. Networking, storage, and resource monitoring components all have certain expectations when they interact with the hardware. In some cases, these expectations are not met and the software gets surprised. Architecting for this can be a bit tricky with the current tools as they have created a largely opaque view of the physical resources. Going forward, it may be necessary to allow certain applications to pierce the veil of the virtual container, at least sufficiently to calibrate themselves to the container. For example, if an application wants to throttle data based on processor or network utilization, the container's view of utilization may be insufficient to allow the application to truly avoid overrunning the available resources.

Abstracting the deployment definitely has promise. As we gain experience with this technology, the biggest challenge will be providing the appropriate level of abstraction without hiding information that applications need. Resource usage, networking topologies, latency, and potentially other concrete information may be pertinent to certain classes of applications. Providing this while leveraging the advantages of virtual platforms will be critical to the overall success.

Sunday, September 16, 2007

Inverting the Reliability Stack

I've been looking at how we typically achieve reliability in architectures. In this case I'm looking more at the reliability of persistent information. Quite simply, reliability is achieved bottom up. Data is stored on a SAN. The SAN uses RAID to deal with reliability of the underlying hardware. Even the underlying hardware has various error management features (parity/ECC). The SAN is connected to the database servers through redundant connections. The database relies upon a complex set of ACID rules to ensure data has been committed to the SAN.

There is no argument that this works. When the database transaction completes the data is stored (at least in one location). You can rest assured that a hardware failure has not compromised the data. It is safely available someplace, although there may be a delay before it is available. But this assurance comes at an incredible cost. Starting at the bottom of the stack, RAID can reduce capacity by as little as 20% (RAID 5 on 5 drives) to 50% (RAID 1). SAN adds more cost by putting the drives on their own network. This is critical for quick migration between database servers. This cost can be easily seen in the cost per gigabyte. Internal drives on workstations cost as little as $1-2/GB while the SAN storage costs $15/GB or more.
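The capacity figures above are simple arithmetic, and it can help to see how they feed into cost per usable gigabyte. The following is a minimal sketch, assuming equal-sized drives and single-parity RAID 5; the function names are made up for illustration.

```python
def raid5_overhead(drives: int) -> float:
    """Fraction of raw capacity given up to parity in single-parity RAID 5."""
    if drives < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return 1 / drives


def raid1_overhead(copies: int = 2) -> float:
    """Fraction of raw capacity given up to mirroring with `copies` copies."""
    if copies < 2:
        raise ValueError("mirroring needs at least two copies")
    return 1 - 1 / copies


def cost_per_usable_gb(raw_cost_per_gb: float, overhead: float) -> float:
    """Cost of one usable gigabyte once the redundancy overhead is paid."""
    return raw_cost_per_gb / (1 - overhead)


if __name__ == "__main__":
    print(raid5_overhead(5))               # 20% of raw capacity on 5 drives
    print(raid1_overhead())                # 50% of raw capacity when mirrored
    print(cost_per_usable_gb(2.0, 0.5))    # a $2/GB raw drive, mirrored
```

Note that redundancy alone does not explain the gap between internal drives and SAN storage; the rest comes from the dedicated network and the hardware around the drives.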

With an order of magnitude at play in storage cost, looking at alternatives seems reasonable. Is it possible to move reliability up to the application level, thereby removing the need for ultra reliable hardware? Certainly if it can be done, the cost savings could be significant. Replacing specialized hardware with commodity hardware always brings not only cost savings but flexibility in hardware utilization.

One technology I've looked at is Distributed Hash Tables (DHT). A DHT spreads the concept of a hash table across many nodes. The key space is divided across nodes making the table resilient to single node failures. Part of the key space may become unavailable but the remaining key space survives. The implementations I've looked at (Free Pastry and Bamboo) actually replicate keys across multiple nodes so a single node failure does not compromise the availability of keys, only the transactional capacity of that particular key partition.
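To make the idea concrete, here is a toy, in-process sketch of a key space spread across nodes with each key replicated to its successor on a hash ring. It is not how Free Pastry or Bamboo are implemented (there is no routing, no network, and no re-replication after a failure), and the class and method names are my own.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string onto a large integer ring position."""
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)


class ToyDHT:
    def __init__(self, node_names, replicas=2):
        self.replicas = replicas
        # Nodes sorted by ring position; each node owns the keys before it.
        self.ring = sorted((_hash(name), name) for name in node_names)
        self.stores = {name: {} for name in node_names}
        self.down = set()

    def _owners(self, key):
        """The `replicas` distinct nodes found clockwise from the key."""
        start = bisect.bisect(self.ring, (_hash(key), ""))
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def put(self, key, value):
        # Write to every owner that is currently alive.
        for node in self._owners(key):
            if node not in self.down:
                self.stores[node][key] = value

    def get(self, key):
        # Any surviving replica can answer the read.
        for node in self._owners(key):
            if node not in self.down and key in self.stores[node]:
                return self.stores[node][key]
        raise KeyError(key)

    def fail(self, node):
        """Simulate a node failure."""
        self.down.add(node)
```

With two replicas, failing the first owner of a key still leaves the key readable from the second, which is the property described above: a single node failure costs capacity for that partition, not availability.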

DHTs are particularly interesting because the distribution and replication can span data centers as well. From an application level, this is useful as it solves disaster recoverability and makes the application more resilient, not only to individual hardware failures but even localized disasters that might take an entire data center offline. Solving both node level and data center level resilience with a single solution is a rare win.

In the case of DHTs this comes at a cost though. Access times are notoriously slow, ranging from tens to hundreds of milliseconds. This creates challenges for many applications and may in fact make a DHT solution untenable. But that doesn't mean the whole concept of application level reliability is invalidated. In many cases, there are optimizations that can be made at the application level that will still permit the problem to be solved and preserve the use of commodity hardware.

Comments and ideas along these lines are most welcome!

Thursday, August 23, 2007

In Support of Non-Stop Software

Sometimes you say things, and measure the reaction. For example, tell a room of software professionals:

Software should never crash.

You will get unanimous agreement. Of course software does crash but it shouldn't. That's a given. But try another statement:

Software should never need to be restarted.

Now you will often find a conversation ensue. "Never? Seems like a long time." Discussions of mitigating factors. No, never seems far too long. Need to put some time constraints on that. Daily is unreasonable but weekly isn't so bad. Why is it so bad that a service needs to be restarted once a week? After all, the service is stateless and load balanced. Should be okay, right?

But in the confines of software and the hardware it executes on, there is no explanation for deteriorating behavior. Granted that hardware can produce errors, but we'll not blame software for that. No, if software loses performance or stability over time, it can only be explained by a collection of bugs.

All software has bugs. That's a simple fact. Any complex piece of software will have hundreds of bugs associated with each release. The curiosity is why bugs that impact the stability of an application are not prioritized higher. In most organizations a bug that causes a crash is a top priority. A bug that reduces time between restarts to hours would also be considered high priority. But as that time starts moving into days the priority quickly drops. Why? Well, my theory is that far too many of us have become accustomed to this level of instability and accept it as a reality of software.

At one point I had a client whose operations team was unhappy with service start up time. I found that curious, because while a service that takes a while to launch is frustrating, it's rarely an operational handicap. The source of the frustration was that the service needed to be restarted every 2-3 days and the start time was becoming a challenge for maintaining availability. Now what was shocking to me is I was being asked for advice on how to improve start time, NOT how to fix the problem forcing a restart every few days. That was accepted as a reality. This is precisely the complacency with regard to long running stability to which I refer.

The question then is how do we get past this? Well, it's actually quite simple. Stop accepting anything less than long term stability from your applications. Run burn in tests for longer periods of time. Stress test at full load for several days straight. Your application should be delivering the same performance with the same memory footprint after 5 days as it did on the first day. Anything else, consider it a high priority bug and fix it. Most importantly, instill a culture of expecting stability in your teams. Make it as important as delivering software that doesn't crash.
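A burn in test only helps if something checks the numbers at the end. Here is a minimal sketch of such a check, assuming you already collect throughput and memory samples during the run; the `Sample` fields and the 5% tolerance are assumptions of this example, not a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    hours: float             # time since the start of the test
    requests_per_sec: float  # measured throughput at full load
    resident_mb: float       # process memory footprint


def stability_regressions(samples, tolerance=0.05):
    """Compare every later sample against the first one.

    Returns a list of human readable problems; an empty list means the
    service looks the same at the end of the run as it did at the start.
    """
    baseline = samples[0]
    problems = []
    for s in samples[1:]:
        if s.requests_per_sec < baseline.requests_per_sec * (1 - tolerance):
            problems.append(f"throughput dropped at hour {s.hours:g}")
        if s.resident_mb > baseline.resident_mb * (1 + tolerance):
            problems.append(f"memory grew at hour {s.hours:g}")
    return problems
```

Anything this returns after a multi-day run would be filed as a high priority bug, in the same queue as a crash.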

What is your opinion? I would love to hear what others have to say on this topic.

Thursday, May 24, 2007

Chaotic Perspectives

I find myself looking at nondeterministic systems a lot lately. Many solutions for the challenges of extreme scale involve relaxing constraints and coping with the ensuing chaos. But humans aren't comfortable with chaos. We're wired to bring order to our surroundings. And software engineers may be more tightly wired for order than the average person.

I have been trying to get my head around this problem for a while now. Recently, I had the revelation that the amount of entropy in any software system is directly related to the breadth of the system being considered. If I start at machine instructions, the system is very predictable. I can isolate that component and define the behavior with mathematical precision. The entire component conforms to completely deterministic patterns of behavior.

But as I start looking at the system more broadly it becomes more chaotic. Introduce multiple threads of execution and I have introduced a stochastic uncertainty to my component. Behavior depends upon the randomness of the scheduler. B follows A is no longer assured. Add another processor and now B follows A or A follows B is no longer assured, as I may have A and B simultaneously.

The multi-threaded/multi-processor example is perfect to illustrate how the resistance to chaos begins. The first reaction to B failing to follow A predictably is to force it. Introduce a semaphore to ensure that A can never occur after or coincident with B. But just like reversing entropy is expensive in thermodynamics, it is expensive in software. Complexity rises. Throughput suffers. No, rather than attempting to impose order on A and B, the goal should be to relax the constraints on A and B to allow them to exist without any temporal relation. Of course this isn't always possible, but the point is to resist the first temptation of imposing order, and instead look for a solution that allows the chaos to exist.
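One way to relax the constraint, when the problem allows it, is to make the operations commutative so that arrival order stops mattering. The sketch below is a made-up illustration: the event shape `(amount, version)` and the state fields are hypothetical, chosen only because sum and max give the same answer in any order.

```python
from itertools import permutations


def apply_events(events):
    """Fold events into state using only order-independent operations."""
    state = {"total": 0, "latest_version": 0}
    for amount, version in events:
        state["total"] += amount  # addition commutes
        state["latest_version"] = max(state["latest_version"], version)  # so does max
    return state


if __name__ == "__main__":
    events = [(5, 1), (7, 3), (2, 2)]
    results = [apply_events(order) for order in permutations(events)]
    # Every interleaving of A, B, and C lands on the same state,
    # so no semaphore is needed to force one particular order.
    print(all(r == results[0] for r in results))
```

Not every operation can be reshaped this way, which is the caveat above, but where it can, the chaos is still there and simply no longer matters.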

Increasing chaos continues as the view expands. A web service can be well understood in the context of threads and processors, but the introduction of a client brings additional randomness. Clients represent events arriving at unpredictable intervals. Loads on the systems will be non-uniform. All possible combinations of requests (by type, by processing time, by memory size) will occur. The combinatorial effects are impossible to test, predict, and in most cases even reproduce. A component whose behavior was supposedly predictable becomes unpredictable as external stimuli are added.

At the highest systemic view, chaos is rampant. Asynchronous integrations, component failures, network latency variances, and a variety of other stimuli lead to a system that is completely unpredictable in behavior. The larger the system, the more chaotic it will be. And just like the earlier thread example, attempting to bring order is expensive and ultimately pointless. As a friend of mine says, reversing entropy is attempting to unscramble an egg. No, chaos is the reality and rather than preventing it, a software architecture has to not only survive but thrive on it.

Which brings me to the crux of my challenges. As software engineers we are trained to solve problems in a very linear fashion. We operate within a framework of deterministic components and well understood patterns and anti-patterns. None of these are well suited to the chaotic reality of large architectures. And that's why you have to be willing to discard them. Step back and look at the architecture from a new perspective.

One of my favorite sacred cows to pick on is ACID. Along comes BASE that challenges the conventional wisdom, fails to conform to existing patterns, and arguably violates several anti-patterns. Yet, to achieve extreme scales, BASE is a necessity. And BASE is a good example of a non-linear revelation that embraces the chaos of scaling large data sets rather than trying to force order into the system.

Those kinds of revelations require abandoning our preconceived notions of how to solve these problems (read, patterns) and embracing some chaotic thinking. Resist the temptation to bring order to your systems and instead seek out ways to make the chaos irrelevant. Try thinking about the system, pushing aside the sacred cows, and envisioning what it means to have pure chaos. What breaks? How can you tolerate it without eliminating it? Is it really more complex than trying to eliminate it?

I'd love to hear your ideas about how to cope with chaos. And how you bring the concepts of chaos to your organizations.

Tuesday, February 27, 2007

Build vs Buy - One Perspective

I'm an engineer. It's in my blood. As an engineer, I want to build things. So anytime the build vs buy discussion arises, I have to fight the urge to say build and get right to it. I am definitely happier building my own stuff than configuring and deploying something built by somebody else. But I'm also a pragmatic engineer who accepts that I should leverage existing products wherever possible. So why should we ever build in problem areas where products exist?

A build vs buy decision is primarily about determining if a vendor product can be sufficiently customized to solve the problem that your organization faces. Products are designed to solve a wide range of problems. They have to be or the product vendor limits their potential customer base. The need to solve a broad set of problems does limit the optimizations a vendor can make. This follows the adage that anything that does several things, does none of them well.

How is it that custom built solutions can be better than commercial products though? You will have fewer engineering resources and less testing than a commercial product. Shouldn't it be obvious that commercial solutions will have an edge in capabilities and quality? Commercial products lack one crucial element though. A specific problem.

Knowing your problem domain allows you to carefully choose your compromises. Every problem has unique constraints that can be leveraged to reduce complexity or improve performance. This is a reality that is inescapable and becomes the crux of the build vs buy decision process. What you are balancing is the set of optimizations you can achieve against the resources the vendor can offer in terms of support and ongoing R&D.

The largest challenge that I've seen is adequately defining the requirements for your current problem and being realistic about your future needs. Comparing feature sets from a vendor with your internally developed solutions is a pointless exercise. The goal is to find a good fit, not to have the broadest set of features. The requirements can be difficult to clearly derive though. There are often second order requirements that are less than obvious but in fact lead to the largest opportunities.

As an example, when looking at messaging systems, we knew we needed reliable delivery. The major revelation however was that we did not need exactly once delivery or ordered delivery. The majority of the information we were propagating possessed inherent keys for managing idempotent delivery and there were no inter-event dependencies. There is a great deal of complexity that can be eliminated and tremendous performance gains that can be achieved when you eliminate ordered, exactly once delivery from the messaging infrastructure. Of course, this is not a good idea in general, but this example illustrates how a careful analysis and understanding of the problem can lead to more clarity on the requirements.
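To show why inherent keys make exactly once and ordered delivery unnecessary, here is a minimal sketch of a receiver that tolerates duplicates and reordering. It is not the system we built; the `version` field in particular is an assumption of this example, standing in for whatever the data itself offers to tell newer from older.

```python
class IdempotentReceiver:
    """Applies messages at-least-once and in any order with the same result."""

    def __init__(self):
        self.records = {}  # natural key -> (version, payload)
        self.applied = 0   # how many deliveries actually changed state

    def handle(self, key, version, payload):
        current = self.records.get(key)
        if current is not None and current[0] >= version:
            # Duplicate or stale delivery: safe to drop silently.
            return False
        self.records[key] = (version, payload)
        self.applied += 1
        return True


if __name__ == "__main__":
    receiver = IdempotentReceiver()
    # The transport redelivers and reorders freely.
    deliveries = [
        ("item-7", 2, "price=12"),
        ("item-7", 1, "price=10"),  # arrives late, ignored
        ("item-7", 2, "price=12"),  # redelivered, ignored
        ("item-9", 1, "price=30"),
    ]
    for key, version, payload in deliveries:
        receiver.handle(key, version, payload)
    print(receiver.records)
```

Because the receiver converges to the same state however the messages arrive, the messaging layer only has to promise that each message gets there eventually.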

Okay, so I'm obviously advocating build over buy. No, not really. There are several factors to consider before building a custom solution that has clear overlap with a vendor product. Some of the key factors are:

  • Business Impact - How much does the problem you are trying to solve impact your bottom line? The less impact, the less critical a highly optimized solution becomes. I know this is obvious, but it is worth stating. Optimizing marginal problems is largely pointless.
  • Incremental Benefits - How big are the gains you are likely to achieve by building your own solution vs using a commercial product? In the example above, we discovered we could reduce the number of SQL statements per message by 9X which made it very compelling. Had the improvements been 30% or less, we probably would not have embarked on building a solution.
  • Holistic Costs - It's very easy to focus on the cost associated with the primary solution and ignore the overall life cycle costs. Not only does the NRE for the component need to be considered, but also all ongoing support costs as well as the second order infrastructure components that are required to support the solution.

If you consider these (and others which I would appreciate hearing about) and still find that the benefits outweigh the costs for a high impact problem, then build makes sense. One trap that I find organizations also fall into is assuming that product companies have smarter engineers. Ultimately, you should understand your problem better than anybody else. Therefore, you should be able to deliver the best solution to your problems. There are lots of reasons to rely upon vendors to deliver your solution, but presumed superior engineering talent should not be one of them.

As always, I welcome your feedback.

Friday, February 09, 2007

Latency Exists, Cope!

I put this line on a slide recently for a presentation at work. Looking around the room, I could tell that some of the people understood, some were perplexed, and some annoyed. What does latency have to do with architecture anyway? We're concerned with proper component factoring, interfaces, and a collection of "ilities". How could latency be relevant to any of these?

In any large system, there are a few inescapable facts:

  1. A broad customer base will demand reasonably consistent performance across the globe.
  2. Business continuity will demand geographic diversity in your deployments.
  3. The speed of light isn't going to change.

Given these facts, latency is a critical part of every system architecture. Yet making latency a first order constraint in the architecture is not that common. The result is systems that become heavily influenced by the distance between deployments and that limit the business's ability to serve its customers effectively and protect itself against localized disasters.
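The third fact puts a hard floor under every cross-site call, and it is easy to estimate. The sketch below assumes light in optical fiber covers roughly 200 km per millisecond (about two thirds of its speed in a vacuum) and that the fiber follows the great circle, which real routes never quite do, so actual numbers will be worse.

```python
import math

FIBER_KM_PER_MS = 200.0  # approximate speed of light in optical fiber


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


def min_round_trip_ms(distance_km):
    """Lower bound on one request/response exchange over that distance."""
    return 2 * distance_km / FIBER_KM_PER_MS


if __name__ == "__main__":
    # Two sites 4,000 km apart can never do better than this per round trip.
    print(min_round_trip_ms(4000))
```

A design that needs ten sequential round trips between those two sites has spent several hundred milliseconds before any work is done, which is why the number of cross-site interactions belongs in the architecture and not in tuning.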

So how do you design for latency? There are a few strategies that can be applied to your architecture that will allow you to deploy your components across diverse geographic locations. Here are the ones that I find particularly important.

Good Decomposition - Highly coupled, monolithic applications are the bane of any distributed architecture. Allowing components with little functional overlap to be coupled either in code or during deployment will pretty much kill any hope of distributing your architecture across a collection of global data centers. Do it badly enough and you will kill any hope of distributing your architecture across two cities in the same state. This sounds obvious, but there are plenty of enterprise level applications in use today that have forced themselves into data centers on the far edges of the same city as their only business contingency plan.

Asynchronous Interactions - This is more than just using messaging between components. It starts by setting the appropriate expectations on your external interfaces, be that SOA or a web page. Companies get tripped up here by exposing an early version of an interface that sets the client's expectation of synchronous, low latency interactions. As the interface becomes more heavily used it becomes more and more difficult to change that semantic. If the client has an expectation of a synchronous response, the likelihood of leveraging a collection of components with asynchronous interactions becomes low. Start with an expectation of asynchronous behavior and you can more readily add latency as needed to meet your deployment demands.
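Here is a minimal sketch of what setting that expectation can look like at the interface: the client gets a ticket back immediately and asks for the result later. The class, its method names, and the in-memory queue are all hypothetical; a real service would put durable messaging behind `submit`.

```python
import uuid
from collections import deque


class AsyncService:
    """Accepts work now, answers later. The contract never promises speed."""

    def __init__(self):
        self._pending = deque()
        self._results = {}

    def submit(self, request: str) -> str:
        """Queue the request and return a ticket without doing the work."""
        ticket = uuid.uuid4().hex
        self._pending.append((ticket, request))
        return ticket

    def poll(self, ticket: str):
        """Return (done, result); result is None until the work finishes."""
        if ticket in self._results:
            return True, self._results[ticket]
        return False, None

    def process_one(self) -> bool:
        """Stand-in for a backend worker, possibly in another data center."""
        if not self._pending:
            return False
        ticket, request = self._pending.popleft()
        self._results[ticket] = request.upper()  # placeholder for real work
        return True
```

Because the client already expects to wait, the worker behind `process_one` can move to another data center, or pick up a few more hops, without breaking the contract.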

Monolithic Data - You can decompose your applications into a collection of loosely coupled components, expose your services using asynchronous interfaces, and yet still leave yourself parked in one data center with little hope of escape. You have to tackle your persistence model early in your architecture and require that data can be split along both functional and scale vectors or you will not be able to distribute your architecture across geographies. I recently read an article where the recommendation was to delay horizontal data spreading until you reach vertical scaling limits. I can think of few pieces of worse advice for an architect. Splitting data is more complex than splitting applications. But if you don't do it at the beginning, applications will ultimately take short cuts that rely on a monolithic schema. These dependencies will be extremely difficult to break in the future.

Design for Active/Active - If you do a good job with the preceding recommendations, then you've most likely created an architecture that can service your customers from all of your locations simultaneously. This is a more efficient and responsive approach than an active/passive pattern where only one location is serving traffic at a time. Utilization of your resources will be higher and by placing services nearer your customers, you are better meeting their needs as well. Additionally, active/active designs handle localized geographic events better as traffic can simply be rebalanced from the impacted data center to your remaining data centers. Business continuity is improved.

Latency is another example of how what you don't take into consideration in your architecture will ultimately undo your design. It is one of the more difficult constraints to design for correctly. As such, it should be given more attention, early in your architectural process. Are there other aspects of this that you think are important? I'd love to hear them.

Monday, January 15, 2007

Compute Power, Is Your Architecture Green?

In a recent article, Mahope addresses the issue of how software efficiency is impacting data center planning. This article rang particularly true for me as I've recently been pushing the concept of optimizing TPS/watt. As the article states, data centers are becoming constrained by power. (Technically it is power and cooling, but if you reduce your power consumption you reduce your cooling problem, so I focus on power.)

How big is the power problem? Well, it's big enough that vendors are starting to market their energy efficiency as a feature. It's also big enough that a new law has been passed, mandating the EPA to study power consumption in data centers (government and private) in the United States. I'll skip the comments about how that will help, but needless to say, if it made the Congressional docket, there must be something to it.

What should we be doing in our software architecture to improve this? There are some concrete steps that I have started taking and hopefully other architects will join me.

Measure transactions/watt. This is a new metric for me. But if you don't measure it, you can't improve it. SPEC has formed a power and performance committee which will be useful for an initial calibration of vendor equipment, but ultimately the transactions you need to optimize are yours. It may be hard to initially set targets for transactions/watt but measuring and monitoring the metric will certainly lead to better awareness of how the application is doing over time.
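The metric itself is just throughput divided by power draw. A minimal sketch, assuming you can get a transaction count from your application and an average wattage from a metered power strip or the server's own instrumentation; the function name and the sample numbers are made up.

```python
def transactions_per_watt(transactions: int, seconds: float, average_watts: float) -> float:
    """Throughput (transactions per second) delivered per watt drawn."""
    if seconds <= 0 or average_watts <= 0:
        raise ValueError("seconds and average_watts must be positive")
    tps = transactions / seconds
    return tps / average_watts


if __name__ == "__main__":
    # One hour of traffic on a server drawing 250 W on average.
    print(transactions_per_watt(360_000, 3600, 250))
```

Tracked release over release, the same calculation shows whether a code change made the application cheaper or more expensive to run, independent of what hardware the vendor ships next.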

Drive up server utilization. Large deployments tend towards specialization of servers to ease management and to segment availability. This leads to situations where server utilization is incredibly low. It's not unusual to see server pools created to provide isolation of a new service. The initial traffic volume will use 50% of one server. Availability design requires three servers to meet SLA. So you now have 3 servers running at 17% utilization, essentially wasting 83% of their watts. There is an opportunity to leverage virtualization to provide logical isolation while driving utilization higher. Alternately, M+N fail over solutions where the fail over nodes are shared across many primaries can also help.
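The arithmetic behind the 17% figure, and what sharing does to it, can be sketched in a few lines. The six services, the two shared spares, and the 75% target utilization are invented numbers for illustration only.

```python
import math


def pool_utilization(load_in_servers: float, servers: int) -> float:
    """Average utilization when `load_in_servers` of work is spread over a pool."""
    return load_in_servers / servers


def shared_pool_size(loads, spare_servers: int, target_utilization: float) -> int:
    """M+N sizing: M servers carry the combined load, N spares are shared."""
    primaries = math.ceil(sum(loads) / target_utilization)
    return primaries + spare_servers


if __name__ == "__main__":
    # One new service: half a server of load on a dedicated pool of three.
    print(round(pool_utilization(0.5, 3), 2))

    # Six such services, dedicated pools versus one shared M+N pool.
    loads = [0.5] * 6
    dedicated = 3 * len(loads)
    shared = shared_pool_size(loads, spare_servers=2, target_utilization=0.75)
    print(dedicated, shared)
```

In this made-up case the shared pool needs a third of the servers for the same load, which is the opportunity virtualization or M+N fail over is meant to unlock.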

Use deployment patterns. Standardizing the patterns you use for software deployment improves the possibility of sharing hardware amongst multiple services. Design patterns so services can safely share one server, and require components to conform to the deployment pattern.

Optimize, optimize, optimize. There are always diminishing returns for optimizing but I believe the trend for the last several years has fallen way short of those inflection points. I won't be so bold as to declare a savings that can be realized but with the millions of dollars that companies face to solve their power crisis, I will say that I'm sure that a year of runway exists just through improving software efficiency.

As software architects, power consumption is now squarely in our camp to manage. There is plenty we can do to reduce the quantity of power our data centers consume. But this has to become a clear focus for 2007 and forward. This is not just a hardware problem any longer.

Wednesday, January 10, 2007

A Real eBay Architect Analyzes Part 3

In his ongoing interview, Duncan Cragg addresses business functions in Part 3. I have been stepping in for my imaginary friend but in this installment, I am going to switch to a different format. Rather than try to answer the questions directly, I will address specific aspects of the interview that I think are ripe for discussion.

Let me start out by stating that in general, I believe either the declarative or imperative model can be applied. eBay is looking at both because there is merit to both approaches. My issues instead are with claims that I believe are overstating the benefits of REST. Duncan makes assertions about common content types in a couple of places. The two quotes that I'll call out are:

We can read data at a URI with GET. We will usually understand that data when we get it, because it has a standard content type at a number of layers - perhaps from character set up to Microformat via XML and XHTML.

and

There's also the expectation of standard Content-Types, sub-types and schemas in GET and POST, rather than custom eBay WSDLs and schemas, that I mentioned before.

I've been on record for a while now with the assertion that as you move away from common concepts like media or messages the availability of common formats will decrease. Duncan's examples cite the resources User, Item, Offer, and Feedback. I might expect to find a common type for User. Item and Offer are unique to auctions (they differ from product and sale). While Feedback might exist in other systems, the semantics vary in each of those systems so expecting to find a common schema is a bit of a reach. I certainly wouldn't preclude eBay and the other auction sites working on a widely used format to represent an auction (which is not a product, so those formats don't work) but there isn't any current activity in that area.

A concrete example of my point is the current state of maps. There are at least two popular map interfaces that are completely incompatible. My assertion is that vendors will invent formats to serve their own needs, which causes divergence as you move away from common media. Roy Fielding has argued that consumers will drive the vendors back to standards, but I'm not sure I agree. Video is arguably a common media type with high consumer demand. Yet the pressure from consumers has been on clients to support the dozen or so video formats, not on the content producers to standardize.

Another point that Duncan continues to make is that REST offers better scalability than SOAP. From the article:

It's scalable because of all the reasons I mentioned before: the cacheability of the basic data operations and their parallelisability through partitioning.

Plus now we have parallelisability of the application of the business rules. There's nothing more parallelisable than a declarative system.

I don't believe the claim that declarative interactions are inherently more parallel than process-oriented ones. Partitioning is about how you architect your implementation; it is not inherent in the interaction style. We have created a massively parallel system that implements SOAP interfaces and scales horizontally to incredible levels of parallelism.

As a counter example, state style interactions can actually lead to lower levels of efficiency in the implementation. When a client makes an imperative statement like CompleteSale, we are completely clear on the intent of the operation. We can immediately go to work on the processing and manage it as efficiently as possible. But if the client passes back an Item (which consists of over 200 state elements) with some state changed, the first task we have to perform is determining the state transition. This will involve retrieving the item and potentially other state in the system. All of this is a precursor necessary to determine intent. This certainly increases the resource requirements.
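
To make the extra work concrete, here is a minimal Python sketch. The field names (`title`, `price`, `status`), the in-memory `STORE`, and the `deduce_intent` helper are all hypothetical and are not eBay's actual item model; a real item carries over 200 state elements, so the diff would be correspondingly larger.

```python
# Hypothetical stored items; this dict stands in for a database lookup.
STORE = {
    42: {"title": "Vintage lamp", "price": 25.0, "status": "active"},
}


def complete_sale(item_id):
    """Imperative style: the operation name itself states the intent."""
    STORE[item_id]["status"] = "sold"
    return "CompleteSale"


def deduce_intent(item_id, submitted):
    """State style: fetch the stored item and diff it to learn the intent."""
    current = STORE[item_id]  # extra read before any real work can start
    changed = {k for k, v in submitted.items() if current.get(k) != v}
    if changed == {"status"} and submitted["status"] == "sold":
        return "CompleteSale"
    if changed and "status" not in changed:
        return "ReviseItem"
    return "Unknown"
```

The imperative call can begin processing immediately, while the state-style path has to read and compare before it even knows which operation it is performing.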

We need to partition along functional as well as data lines. We have separate functional pools for revising an item and finalizing a sale due to the different load characteristics. Since I can't efficiently deduce intent from the REST POST, I have a new challenge: how to partition my functionality. So you can see that eliminating a clear statement of intent from the information passed by the client makes it more challenging to partition my architecture.
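
A rough Python sketch of the routing problem follows. The pool names and both routing functions are assumptions made up for illustration; they do not describe our actual infrastructure.

```python
# Hypothetical pool names; each pool would be sized for its own load profile.
POOLS = {
    "ReviseItem": "revise-pool",
    "CompleteSale": "sale-pool",
}


def route_imperative(operation):
    """The operation name alone picks the pool; no item lookup is needed."""
    return POOLS[operation]


def route_state_post(stored, submitted):
    """A state-style POST must be diffed against stored state before routing."""
    changed = {k for k, v in submitted.items() if stored.get(k) != v}
    operation = "CompleteSale" if "status" in changed else "ReviseItem"
    return POOLS[operation]
```

With the imperative style the router can be a cheap lookup in front of the pools. With the state style the router itself needs access to item state, which pulls data access into what should be a thin dispatch layer.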

As always, I'm looking forward to part 4!


Sunday, January 07, 2007

WSDL - Why Services Don't Launch

Last week was enlightening. I'm working on a project that is providing a set of services for internal consumption. The interface is actually pretty simple. We have four entities, with CRUD on three of them and two operations on the fourth. Conceptually the interface can be described in five minutes. It can be expressed rigorously in any OOP language in about an hour. (The service implementation is considerably more complex, but the interface is relatively straightforward.)

We opted to implement the services using SOAP. Enter WSDL. Translating the conceptual interface into WSDL took more than a day. Once the WSDL was finally validating, we were able to generate code in Axis. Then we moved on to C# and GSOAP. Neither of them would work without further modifications to the WSDL. Another day lost on compatibility. Once the servers were deployed, we ran into issues where GSOAP generated code that compiled but didn't work. There were namespace challenges. What took the engineers an hour to express in Java was taking days to express in WSDL. And I want to reiterate that this is a relatively simple interface.

Why didn't we just write the interface in Java and use java2wsdl to generate the WSDL then? Well, one reason is that the services have to be split for security reasons which leads to different ports in WSDL terms. Auto-generated WSDL would not have captured the common types with the appropriate file structures. And philosophically, if WSDL is my implementation neutral interface definition language, then it seems I should be writing it, not generating it from a specific implementation of the interface. Finally, from a stylistic point of view, I'm still waiting on the code generation process that makes files that are easy to read and understand by humans.

As long as we're on the subject of human readability, WSDL fails miserably on this count. It is similar to reading XML schemas, only harder. Some of you may decide at this point that I'm not much of a software engineer if I find XSD and WSDL difficult to read, and so be it. But I can read DTDs and RELAX NG schemas with ease. I would expect any specification that purports to let developers communicate interface semantics to be clearly understood by developers, and not to force them to rely upon tools to translate it into languages they know.

"So what", some will argue. WSDL is about allowing tools to generate interfaces and is not intended for human consumption anyway. I'll argue that it lacks sufficient constraints to allow that to work well either. What does a java.util.Date map into in C++? What does an unsigned long long map into in Java? The entire XSD type space is available in WSDL which leaves the door open for developers to create unusable interfaces, especially if these interfaces were code generated from an implementation interface.

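As a small illustration of the second question, the Python sketch below compares value ranges. It assumes the C++ `unsigned long long` is 64 bits wide and is exposed in the schema as `xsd:unsignedLong`; the helper name `fits_java_long` is made up for this example.

```python
# Upper bounds of the two types, taken from their definitions.
XSD_UNSIGNED_LONG_MAX = 2**64 - 1   # xsd:unsignedLong
JAVA_LONG_MAX = 2**63 - 1           # java.lang.Long.MAX_VALUE


def fits_java_long(value):
    """True if an xsd:unsignedLong value can be held in a signed 64-bit long."""
    return 0 <= value <= JAVA_LONG_MAX
```

Half of the legal values have no direct home in a Java `long`, so a generator has to choose between a wider type and silent wraparound, and the WSDL gives it no guidance on which.
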
Yes, I am a heretic. I dare to state boldly that WSDL is an impediment to building services. I'm sure the intentions were good and honorable, but the result misses the mark. There is a next generation WSDL in the works, but it doesn't appear that its primary goal is to address either of these issues. If anything, I am concerned that 2.0 will get further from human readability than 1.1.

What is the answer? I don't think the answer is to abandon the concept of specifying interfaces. Quite the contrary, I am a big fan of having a way to express the semantics of interfaces in a way that is better defined than natural languages. The specification should support more than just SOAP interactions though. Ideally, I would like an IDL that could accomplish the following goals.

  • Provide syntax and semantics that are more readily understood by developers. I think XML is a reasonable tool for such an IDL but the focus should be on readability. Think RELAX NG vs XSD and you have the idea.
  • Support a variety of interactions. SOAP already has WSDL, but there is no reason a new IDL couldn't easily express SOAP interactions as well. It should also be possible to describe REST interactions with this IDL.
  • Provide more control over the transport binding. The primary transport in use today is HTTP. My IDL would allow me to leverage all HTTP operations, specify headers that are meaningful as well as the content-type of the body. This would allow operations that supported any of the available media types to be expressed. Other transports could be supported but in each case, the binding semantics need to leverage the full richness of the transport.
  • Don't preclude code generation. There is nothing wrong with code generation, I just feel that it should go from the IDL to the implementation and not vice versa. Partial code generation should also be supported, say for example just type handling.
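
To give a feel for the kind of information such an IDL would capture, here is a very rough Python sketch of one operation description. Everything in it is hypothetical: the `Operation` type, its fields, the `ReviseItem` example, and the path are invented for illustration and are not a proposal for actual syntax.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Operation:
    """One operation bound directly to HTTP (all names are hypothetical)."""
    name: str
    method: str                      # any HTTP method, not just POST
    path: str
    content_type: str                # media type of the message body
    headers: dict = field(default_factory=dict)


def describe(op):
    """Render an operation as one human-readable line."""
    return f"{op.name}: {op.method} {op.path} ({op.content_type})"


revise_item = Operation("ReviseItem", "PUT", "/items/{id}", "application/xml")
```

Calling `describe(revise_item)` yields a single line a developer can read at a glance, which is the readability bar I have in mind; code generation would then run from a description like this toward the implementation.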

I have spent the past few weeks thinking about how such an IDL might look. I've also struggled with questions of usefulness and adoption. I'm certain that having nothing is the wrong answer and my experience to date with WSDL tells me it is a suboptimal solution.

What do you think? Leave your comments or trackbacks and let me know.
