I find myself looking at nondeterministic systems a lot lately. Many solutions for the challenges of extreme scale involve relaxing constraints and coping with the ensuing chaos. But humans aren't comfortable with chaos. We're wired to bring order to our surroundings. And software engineers may be more tightly wired for order than the average person.
I have been trying to get my head around this problem for a while now. Recently, I had the revelation that the amount of entropy in any software system is directly related to the breadth of the system being considered. If I start at machine instructions, the system is very predictable. I can isolate that component and define the behavior with mathematical precision. The entire component conforms to completely deterministic patterns of behavior.
But as I start looking at the system more broadly it becomes more chaotic. Introduce multiple threads of execution and I have introduced a stochastic uncertainty to my component. Behavior depends upon the randomness of the scheduler. B follows A is no longer assured. Add another processor and now B follows A or A follows B is no longer assured, as I may have A and B simultaneously.
The multi-threaded/multi-processor example is perfect to illustrate how the resistance to chaos begins. The first reaction to B failing to follow A predictably is to force it. Introduce a semaphore to ensure that A can never occur after or coincident with B. But just like reversing entropy is expensive in thermodynamics, it is expensive in software. Complexity rises. Throughput suffers. No, rather than attempting to impose order on A and B, the goal should be to relax the constraints on A and B to allow them to exist without any temporal relation. Of course this isn't always possible, but the point is to resist the first temptation of imposing order, and instead look for a solution that allows the chaos to exist.
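As a rough illustration of relaxing the constraint instead of enforcing it, here is a minimal Python sketch. The function names (`ordered_total`, `relaxed_total`) and the summing workload are hypothetical, chosen only for the example. The first version forces every update through a lock. The second gives each worker its own slot, so the order in which A and B run stops mattering.

```python
import threading


def ordered_total(amounts):
    """Impose order: every update to the shared total passes through one lock."""
    total = 0
    lock = threading.Lock()

    def worker(amount):
        nonlocal total
        with lock:  # workers are serialized here
            total += amount

    threads = [threading.Thread(target=worker, args=(a,)) for a in amounts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def relaxed_total(amounts):
    """Relax order: each worker writes only its own slot; slots are merged later."""
    slots = [0] * len(amounts)

    def worker(index, amount):
        slots[index] += amount  # no other thread ever touches this slot

    threads = [
        threading.Thread(target=worker, args=(i, a)) for i, a in enumerate(amounts)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Addition is commutative, so the interleaving of the workers is irrelevant.
    return sum(slots)
```

Both functions return the same total for the same input. The difference is that the second has no shared state to fight over, so there is no ordering left to impose. As noted above, this only works when the operations can be made independent.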
Increasing chaos continues as the view expands. A web service can be well understood in the context of threads and processors, but the introduction of a client brings additional randomness. Clients represent events arriving at unpredictable intervals. Loads on the systems will be non-uniform. All possible combinations of requests (by type, by processing time, by memory size) will occur. The combinatorial effects are impossible to test, predict, and in most cases even reproduce. A component whose behavior was supposedly predictable becomes unpredictable as external stimuli are added.
At the highest systemic view, chaos is rampant. Asynchronous integrations, component failures, network latency variances, and a variety of other stimuli lead to a system that is completely unpredictable in behavior. The larger the system, the more chaotic it will be. And just like the earlier thread example, attempting to bring order is expensive and ultimately pointless. As a friend of mine says, reversing entropy is attempting to unscramble an egg. No, chaos is the reality and rather than preventing it, a software architecture has to not only survive but thrive on it.
Which brings me to the crux of my challenges. As software engineers we are trained to solve problems in a very linear fashion. We operate within a framework of deterministic components and well understood patterns and anti-patterns. None of these are well suited to the chaotic reality of large architectures. And that's why you have to be willing to discard them. Step back and look at the architecture from a new perspective.
One of my favorite sacred cows to pick on is ACID. Along comes BASE that challenges the conventional wisdom, fails to conform to existing patterns, and arguably violates several anti-patterns. Yet, to achieve extreme scales, BASE is a necessity. And BASE is a good example of a non-linear revelation that embraces the chaos of scaling large data sets rather than trying to force order into the system.
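To make the BASE idea (basically available, soft state, eventually consistent) a little more concrete, here is a minimal sketch of a grow-only counter, one well-known eventually consistent data structure. I picked it only as an illustration. The class name `GCounter` and the replica ids are assumptions for the example, not a description of how any particular data store works. Each replica accepts writes on its own, the replicas may disagree for a while, and they converge once they exchange state, in whatever order that happens.

```python
class GCounter:
    """Grow-only counter: every replica tracks its own increments separately."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> highest count seen for that replica

    def increment(self, n=1):
        # A replica only ever updates its own entry, so no coordination is needed.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Taking the per-replica maximum makes merging safe to repeat
        # and safe to apply in any order.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())


east, west = GCounter("east"), GCounter("west")
east.increment(2)
west.increment(3)
# At this point the replicas disagree: that is the "soft state".

east.merge(west)
west.merge(east)
east.merge(west)  # a duplicate merge changes nothing
assert east.value() == west.value() == 5
```

Nothing here forces the two replicas to agree at every instant. The design just makes the disagreement harmless.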
Those kinds of revelations require abandoning our preconceived notions of how to solve these problems (read, patterns) and embracing some chaotic thinking. Resist the temptation to bring order to your systems and instead seek out ways to make the chaos irrelevant. Try thinking about the system, pushing aside the sacred cows, and envisioning what it means to have pure chaos. What breaks? How can you tolerate it without eliminating it? Is it really more complex than trying to eliminate it?
I'd love to hear your ideas about how to cope with chaos. And how you bring the concepts of chaos to your organizations.
I totally agree with what you are saying. Dealing with nondeterministic systems is a major mind shift for most people. The systems they work on, whilst often "big", do not actually require them to think in anything other than a linear, 2PC-style fashion. That's where the orthodoxy lies. Unfortunately, internet scale requires dumping those assumptions (or at least questioning them!)
Posted by: Ewan | Thursday, May 24, 2007 at 10:11 PM
This is a very revealing perspective, actually. And to comfort you, it is not only engineers who usually find it hard to deal with nondeterministic, non-linear systems; mathematicians do too. Among the most challenging courses I took during my studies were "non-linear dynamics and chaos" and "differential geometry", which both deal with what happens if we add complex input to complex systems. The mathematical tooling we have developed over the centuries is just not well suited to the analysis of these kinds of things. Which is probably related to the fact that a human being is not built to handle non-linear or parallel processes. Hence, the challenge of coping with systems of increasing complexity is fundamentally not one of engineering but of psychology, learning and mental models.
Posted by: Andres Kütt | Friday, May 25, 2007 at 12:34 AM
Bringing chaos to the organization is difficult since there isn't a well defined way. We need a list, in order of precedence, of things that should be considered along with their pros and cons. We need more reference points.
I think abandoning the traditional thinking is relatively easy once you have that "ah-ha" moment. Unfortunately, the difficulty lies not just with the implementation but with convincing peers and those higher up in the food chain. What compounds that is that those higher up are most often already set in their mindset.
Posted by: Ben Kruger | Friday, May 25, 2007 at 06:37 PM
This from the Inktomi guys in 1998:
http://www.ccs.neu.edu/groups/IEEE/ind-acad/brewer/sld001.htm
Posted by: WillSmith | Thursday, May 31, 2007 at 01:47 PM
What would it take for you to add the one.org banner to your blog to support charity? I have added it to my own and would love to see other bloggers amplify the need to stomp out poverty.
If the activism irritates you then I understand...
James McGovern
http://duckdown.blogspot.com/
Posted by: James | Friday, July 06, 2007 at 02:49 PM
This post was special to me because it very much applies to a system I am working on.
We have a parking system which allows a car to enter a parking lot after collecting Id(s), which can be a Ticket, License plate, Credit card or an RF Id. The same id(s) can be used to exit the parking lot. The patron enters through an entry lane and exits through an exit lane (If that wasn't obvious enough).
Mostly the entry lanes are unmanned and sometimes the exit lanes are too. This adds a lot of complexity to the system, as the software has to take corrective actions in case a device (such as the ticket issuing machine) fails. Also, each lane can have multiple cars (as many as three) interacting with the system at one time. There are about 11 devices in each lane and, depending on the car sensor location, a car can be in 20 different states (blocking sensors). This creates 220 ways a car can interact with a device, and the fact that there can be three cars just adds to the complexity.
We created an event driven state machine to handle these states, but as we soon found out, the event driven system works perfectly in the test lab but regularly breaks down in the real world. It turns out that we did not consider temporal factors and the continuous degradation of various equipment. Our initial reaction was exactly as you predicted: to add more constraints to the system and consider "time" as well as "events". The more temporal constraints we added, the more event constraints we broke.
We eventually gave up and ended up making assumptions for now (It's a version 1.0 system). For example, in ambiguous situations, we assume that the car is moving forward and the RF id can only be read when the car is at a certain location in the lane... etc.
I am sure there are ways to solve these problems without adding unnecessary constraints, but I just haven't had time to think about them.
Posted by: Rohit Gandhe | Friday, July 27, 2007 at 10:23 PM