The Misery of Distributed Systems

The more I learn about distributed systems, the more I start to think that nobody has any idea how to make them well. It is unbelievable how little we know about distributed systems design and development. And how many failures do we still need to suffer to learn at least something?

As far back as I remember, there was Sun RPC, CORBA, DCOM, RMI, ... And now there are Web Services and REST. This is at least 20 years of development, yet all of that approaches seems to manifest the same fallacy: they try to hide the network from the developer. I will use Java JAX-WS as an example, but this is in no way specific to Java. The JAX-WS provides a runtime for a web service and generates a web service client. Both are designed to use local Java calls, efficiently hiding the network boundary. It goes a great deal for a programmer to feel comfortable. For example it hides the network exceptions from the programmer (by making them runtime exceptions). This may seem like a good approach, but it is a bad think in the very principle.

Most people would be surprised how applicable is the theory of relativity to a design of software systems. And most developers would be really surprised how slow the light is. The light will travel only approx. 300km in one millisecond. If the system needs a response under 1 millisecond, it just cannot communicate with anything that is more distant than 150km. Add a TCP three-way handshake and you are pretty much down to a size of a city. Assuming Einstein was right there is nothing, absolutely nothing, that could be done about this. No amount of technology or money can speed it up.

One millisecond is unbelievably long time for a computer system. Even a cheap computer can execute more than a million instructions in 1 millisecond. Local memory access is a bit slower, but even with that slow-down the local call is incomparably faster than a call over a fast network. Faster by several orders of magnitude. How could anyone hope to hide such a difference?

Reliability is another problem. While local call cannot (reasonably) fail, network call can and often does. Hiding the errors from the engineer does a disservice. It usually means that the network problems are ignored or handled at the top-level of an application. It means that any serious error in network communication means a sudden death of the application. Applications that are written a little bit better still manifest serious usability problems in presence of network failures. I can see that well enough on my iPhone.

Network is a significant boundary. A boundary that just cannot be swept under the carpet of a software development framework. Anyone trying to do that will most likely fail. Fail miserably.