Last week in Why Erlang? Part 1, we covered some basics of Erlang, KAZOO, and single-server software. If you haven't read it already, make sure you do so before reading Part 2. Ready? Let's dive in!
Adding data centers
Now that the software has a shared database and the single points of failure (SPOFs) have been addressed, the software needs to consider communicating over a wide area network (WAN) - commonly the public Internet. See, developers make assumptions about how computers talk to each other - most just assume the packets from one computer will arrive at their destination in a quick and orderly fashion. Software is then written on these false assumptions. And while the software is running in a single data center (typically with pretty good networking cables and equipment), these assumptions seem to hold true.
But even data centers aren't impervious to becoming unreachable. A utility company accidentally cuts a fiber cable, a hurricane floods the data center, the bandwidth provider has an outage - these events happen more than most people might assume! The software then breaks in unexpected ways and it becomes a nightmare to figure out why.
Practical concerns
Now that the software must communicate over less reliable networks (inter-data center, public Internet) and over circuits that charge for usage (intra-network communications are generally free or heavily discounted), new issues become a reality, both technical (increased latency, bandwidth saturation, packet loss and re-transmission) and business (higher bandwidth costs, data center build-outs, perhaps hiring techs local to the new data centers).
For instance, servers on the local network may be programmed to sync their state (their individual view of the world) periodically to maintain a mostly coherent view. Depending on the system, this could result in many extra mega-, giga-, or even terabytes of bandwidth used just to sync! (And that isn't counting the bandwidth used to process actual work.) That data doesn't get transferred instantaneously, nor for free (data transfer pricing on Amazon EC2 can start at $0.09/GB). Imagine a phone call being delayed or choppy audio on existing calls because the servers are hogging the bandwidth to sync their state!
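As a rough, back-of-the-envelope illustration using that figure: a cluster that syncs 1 TB of state across data centers each month would spend about 1,000 GB × $0.09/GB ≈ $90 per month on synchronization traffic alone - before a single call is processed - and proportionally more as the cluster grows.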
Technical concerns
There are whole swaths of computer science dedicated to distributed computing and all the fun that comes with unreliable communication. Sadly, most programmers haven't dealt with these types of systems, nor do they have the tooling and infrastructure in place to build systems that handle these distributed computing problems.
Erlang itself is no panacea either; its default communication mechanisms do not work well over unreliable networks. However, several projects are addressing the gap, including Lasp, Partisan, and Riak process groups, among others. Other technologies, like RabbitMQ (which KAZOO uses), can help with this too.
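To make the "default communication mechanisms" point concrete, here is a minimal sketch (not KAZOO code; the node name is hypothetical) of watching a peer node with stock Erlang distribution:

```erlang
%% Minimal sketch of watching a peer with stock Erlang distribution.
%% Node names and module are invented for illustration.
-module(node_watcher).
-export([watch/1]).

watch(Node) ->
    true = erlang:monitor_node(Node, true),  %% ask the VM to report on Node
    receive
        {nodedown, Node} ->
            %% Over a flaky WAN link this can fire on transient partitions,
            %% not just real outages - one reason plain distribution
            %% struggles across data centers.
            io:format("lost contact with ~p~n", [Node])
    end.
```

Calling something like node_watcher:watch('kazoo@dc2.example.com') blocks until the VM decides the peer is unreachable - which, across the public Internet, happens far more often than within a single rack.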
Using these infrastructure components with software that wasn't designed for them requires a large investment to rewrite the software; more commonly, the software tries to emulate the functionality itself (and often quite poorly).
The problem with most telecom infrastructure
Simply put, it was not designed to operate across geographically-distributed networks and does not take into account the errors and issues inherent in communicating over unreliable networks. Those features are bolted on, typically under pressure from management and by a team inexperienced in distributed systems development, and as a result are often poorly implemented and riddled with bugs and inefficiencies.
However, for better or worse, that software works for the most part because the underlying infrastructure is robust - this breeds a false sense of security, and marketing teams start throwing around "reliability" and "redundancy" to describe their company's offerings.
When things do go wrong, however (and as Murphy's Law tells us, they always do), the price is paid many times over. "An ounce of prevention is worth a pound of cure" and all that.
So Why Erlang?
Given that things do go wrong and that even the smartest team can't anticipate every error, it stands to reason that the real work is handling known and unknown errors gracefully (fault tolerance), starting at the lowest level.
Erlang has a 'Let it crash' philosophy related to units of work - if the data is wrong, the external resource is unavailable, whatever: crash the worker (stop it) and let a supervising process handle the crash. That could mean starting a new worker, bubbling an error up to the API response, logging the crash and moving on, or whatever strategy is required. The value is that Erlang and OTP put these concerns front and center and give the programmer the tools to handle them in an orderly fashion.
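As a small, hypothetical sketch (module and child names invented for illustration, not taken from KAZOO), an OTP supervisor makes that restart strategy explicit and declarative:

```erlang
%% Hypothetical supervisor for a call-handling worker (illustration only).
-module(call_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    SupFlags = #{strategy => one_for_one,  %% restart only the child that crashed
                 intensity => 5,           %% tolerate 5 restarts...
                 period => 10},            %% ...within 10 seconds before giving up
    Children = [#{id => call_worker,
                  start => {call_worker, start_link, []},
                  restart => transient}],  %% restart only after abnormal exits
    {ok, {SupFlags, Children}}.
```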
Because it is trivial to start and stop workers, the code's architecture benefits tremendously from separating the "work" from the "management" and "supervision" of the work. As a consequence, the code becomes smaller and more concise (in Erlang, one would say "program the happy case"), which makes reasoning easier, testing easier, and should generally produce better code faster.
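The hypothetical call_worker referenced by the supervisor above sketches what "programming the happy case" can look like: the worker handles only the request it expects, and anything else crashes it so the supervisor decides what happens next.

```erlang
%% Hypothetical happy-path worker (illustration only, not KAZOO code).
-module(call_worker).
-behaviour(gen_server).
-export([start_link/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) -> {ok, #{}}.

%% Only the expected, well-formed request is handled; a malformed one has
%% no matching clause, so the worker crashes and supervision takes over.
handle_call({route, Number}, _From, State) when is_binary(Number) ->
    {reply, {ok, {routed, Number}}, State}.

handle_cast(_Msg, State) -> {noreply, State}.
```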
Wrapping Up
Erlang provides a language (Erlang), a framework (OTP), and a runtime (BEAM) for building fault-tolerant, resilient systems. Embracing the 'Let it crash' philosophy yields code that handles errors more gracefully, allowing the system as a whole to remain stable in the presence of these errors.
From this foundation, highly-available, fault-tolerant systems are built.
In future articles I'll take a deeper dive into some of the architectural decisions KAZOO makes that build on the bedrock of Erlang, and how those decisions impact a KAZOO cluster's ability to operate in the face of errors at all levels, from individual processes running code to data centers disappearing due to hurricane flooding.
Finally, I hope that when marketing teams trot out buzzwords about what their platforms are, the skeptic in all of us will say, "Sure, prove it. Show me the architecture you've built that allows that." Because this isn't easy to do, it's even harder to get right, and harder still to scale out. Let's pay attention to the man behind the curtain and stop listening to the great and wonderful Oz!