For a first pass, I wanted to talk about some of the high-level reasons Erlang was chosen for KAZOO back in 2010. Even if you aren't technical, I hope you'll find some value in reading this.
Some Erlang History
Erlang was developed at Ericsson starting in 1986 and was open-sourced in 1998. It was designed to improve the development of telephony applications within Ericsson.
Erlang can be broken up into three main parts:
- Erlang the language - a small language with relatively few rules and little syntax (compared to other modern languages)
- The Open Telecom Platform (OTP) - a collection of middleware, libraries, and tools
- The Erlang virtual machine (BEAM) - executes compiled Erlang code (similar to the JVM, if you're familiar with it)
The features of the language and OTP, combined with running on the BEAM, provide a powerful platform on which to build systems that require fault-tolerance and resiliency in the face of errors.
Some KAZOO/2600Hz History
What is now KAZOO grew out of Darren and Karl's experiences running telecom infrastructure, as well as those of the broader community. As 2600Hz consulted with more and more companies on their telecom infrastructure, patterns and common pain points emerged in how people were building these systems. KAZOO is designed to address these commonalities while also exposing functionality that lets operators and resellers provide real value-add features. At the end of the day, most of telecom is rote, routine, and commoditized - connecting two phones or providing voicemail won't win the hearts and minds of customers.
Under the hood
What customers, and even the management of service providers, may not understand or appreciate is all the complexity and difficulty in:
- writing software that handles errors gracefully
- scaling said software across multiple servers
- scaling said cluster of servers across multiple data centers
Single-server software
Most software (telecom included) is written to be run on a single server. As the load the server is handling increases, the software's ability to cope decreases, eventually hitting the limits of the hardware itself.
Imagine a growing town with only one grocery store and no way for residents to drive to other towns - those aisles are going to be packed with people and wait times in checkout lines will be horrendous (the store can only fit so many clerks and registers).
When things go wrong in the store (dropped eggs on the floor, for instance) it will take longer to fix and unrelated customers will be impacted. The store is not handling errors gracefully.
Erlang provides a way to manage this complexity and keep individual customers (processes) isolated from each other. Now one customer's dropped eggs won't impact anyone else! One customer's large order doesn't have to block all the other customers behind them; registers can be opened and closed dynamically to fit demand (still subject to the store's space constraints of course - we're not breaking the laws of physics here!).
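The isolation idea can be sketched in a few lines. The example below is Python, used purely for illustration, and everything in it (`serve_customer`, `run_store`, the sample orders) is hypothetical. Python threads are not Erlang processes and do not offer the same isolation guarantees; the sketch only shows the shape of the idea: each customer is handled as a separate unit of work, and one failure is contained instead of taking the others down.

```python
from concurrent.futures import ThreadPoolExecutor


def serve_customer(order):
    """Handle one customer's order; dropped eggs raise an error (hypothetical rule)."""
    if order == "dropped eggs":
        raise RuntimeError("cleanup needed at this register")
    return f"checked out: {order}"


def run_store(orders, registers=3):
    """Serve each customer as a separate task and record a per-customer outcome."""
    results = {}
    with ThreadPoolExecutor(max_workers=registers) as pool:
        # One task per customer; the pool size plays the role of open registers.
        futures = {name: pool.submit(serve_customer, order)
                   for name, order in orders.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except RuntimeError as err:
                # The failure stays with the customer who caused it.
                results[name] = f"failed: {err}"
    return results


orders = {"alice": "milk", "bob": "dropped eggs", "carol": "bread"}
outcomes = run_store(orders)
for name, outcome in outcomes.items():
    print(name, "->", outcome)
```

Bob's failure is recorded against Bob alone; Alice and Carol still check out normally.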
But software on a single server is a single point of failure (SPOF)! Any number of events - accidental, expected, or malicious - can make a server unreachable.
Adding servers
At a certain point it becomes painfully obvious a new store is needed. The conventional approach is to add a new server (grocery store) and direct work (customers) to one or the other, typically by way of another server running a load balancer. Load balancers introduce new complexity because units of work (customers) aren't identical.
Balancing work
The options are to use a "dumb" load-sharing algorithm like round-robin or to create a "smart" load balancer that has to communicate with the underlying servers to know which server has capacity to accept more work.
The big drawback to round-robin and similar strategies is unbalanced load sharing (one grocery store gets all the needy customers and gets bogged down while the other store gets all the "just need eggs" customers who are in and out, and the store stays mostly empty).
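To make the imbalance concrete, here is a minimal sketch in Python (used only for illustration). The `round_robin` function and the workload are hypothetical: job costs alternate between a big order (10 units) and a "just need eggs" order (1 unit), which is a deliberately bad case for strict rotation.

```python
from itertools import cycle


def round_robin(jobs, server_count):
    """Assign jobs to servers in strict rotation; return total cost per server."""
    loads = [0] * server_count
    for server, cost in zip(cycle(range(server_count)), jobs):
        loads[server] += cost
    return loads


# Hypothetical workload: big orders (10) alternate with "just need eggs" (1).
jobs = [10, 1] * 5
rr_loads = round_robin(jobs, 2)
print(rr_loads)  # [50, 5]
```

Both stores served five customers each, yet the first store did ten times the work of the second.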
Of course, with the "smart" option, those smarts need to be built by somebody and now you have a third server communicating (with all the nastiness of distributed systems that entails).
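As a rough sketch of what those "smarts" might look like, the Python below (illustrative only) sends each job to whichever server currently carries the least load. The `least_loaded` function is hypothetical, and it quietly assumes the balancer always has accurate, up-to-date load figures for every server. In a real deployment that knowledge has to travel over the network, which is exactly where the distributed-systems nastiness comes in.

```python
def least_loaded(jobs, server_count):
    """Send each job to the server with the smallest accumulated cost so far."""
    loads = [0] * server_count
    for cost in jobs:
        # Assumes perfect knowledge of every server's load (a big assumption).
        target = loads.index(min(loads))
        loads[target] += cost
    return loads


# Hypothetical workload: big orders (10) alternate with "just need eggs" (1).
jobs = [10, 1] * 5
ll_loads = least_loaded(jobs, 2)
print(ll_loads)  # [32, 23]
```

On this workload, strict rotation between two servers would give a 50/5 split, so the load-aware choice is far more even, at the price of the extra communication.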
Coordinating work
Another problem with software that isn't designed to run across servers is that the copies on each server don't coordinate with each other. In the grocery analogy, if the butcher at store #1 knows a customer comes in around 9:00 to pick up a particular cut of meat, the butcher can have it ready (pre-warming a cache) for that customer, leading to a faster, more pleasant experience for the customer (a phone call is connected faster, perhaps). If for some reason the customer deviates and goes to store #2, the service will be sub-optimal compared to previous visits.
Software typically gets around this by using a shared database (generally on a separate machine). Except now you have a SPOF again! If the database server can't be reached, the servers won't be able to function either. So now you need a second database server and some redundancy.
Erlang makes inter-server communication easier
Erlang provides the programmer with tools that make talking to processes on other servers (each running its own copy of the BEAM) as easy and transparent as talking to the local BEAM and its processes. If a server becomes unreachable, the remaining servers can be set up to receive these events and react accordingly. In the analogy, the butchers at both stores can communicate about the customer's needs and both be prepared regardless of which store the customer chooses.
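Erlang ships this kind of notification as part of the platform (for example, `net_kernel:monitor_nodes/1` delivers `nodeup` and `nodedown` messages to the subscribing process). To show the general shape of "subscribe to server-down events and react," here is a simplified sketch in Python, used only for illustration. The `NodeMonitor` class, its heartbeat scheme, and the node names are all hypothetical and are not how the BEAM implements it.

```python
class NodeMonitor:
    """Tracks the last heartbeat per node and notifies subscribers of silence."""

    def __init__(self, timeout):
        self.timeout = timeout    # allowed silence before a node counts as down
        self.last_seen = {}       # node name -> time of last heartbeat
        self.down = set()         # nodes already reported as down
        self.subscribers = []     # callbacks invoked with the node's name

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def heartbeat(self, node, now):
        self.last_seen[node] = now
        self.down.discard(node)   # a fresh heartbeat marks the node as up again

    def check(self, now):
        for node, seen in self.last_seen.items():
            if node not in self.down and now - seen > self.timeout:
                self.down.add(node)
                for callback in self.subscribers:
                    callback(node)


events = []
monitor = NodeMonitor(timeout=5)
monitor.subscribe(events.append)      # react by recording which node went down
monitor.heartbeat("store1", now=0)
monitor.heartbeat("store2", now=0)
monitor.heartbeat("store1", now=4)
monitor.check(now=7)                  # store2 has been silent longer than 5
print(events)  # ['store2']
```

Timestamps are passed in explicitly to keep the example deterministic. The point is that the surviving server learns about the failure and gets a chance to react, rather than discovering it only when a request fails.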
Make sure you subscribe to email updates above so you don't miss next week's Why Erlang Part 2! We'll dive into the problem with most telecom infrastructure, adding data centers, the practical and technical concerns with adding data centers, and a recap of why Erlang.