Across the communications industry, it’s fairly common to hear or read about “distributed systems”. Yet more often than not, once you look under the hood, the term doesn’t accurately describe the architecture or its capabilities. In reality, most vendors offer what is known as a “Hot Standby”. To better understand the differences between the two strategies, let’s first look at how and why carriers build their infrastructure the way they do.
Hot Standby:
Carriers begin by selecting a switch and deploying it in a data center. However, when dealing with technology, you always want a backup, or Plan B, so that if an accident or natural disaster occurs (e.g. a power outage, network connectivity failure, or broken hardware), call traffic is redirected to the backup system. A “Hot Standby” is therefore created, which is simply an exact replica of the original softswitch.
A more forward-thinking carrier installs the “Hot Standby” in a second data center. Most do not, which creates a single point of failure if that data center goes down. Judging by what happened with Vonage and Star2Star this last week, their interconnect appears to have been connected to only one data center. When it went down, their switch may still have been up, but the connection to the data center was disabled, leaving their customers offline.
Pros and cons of a Hot Standby:
Pros:
- Easy to set up
- Easy to quickly add additional features
Cons:
- Idle servers. If each server costs you $10k, then you have a $10k expense sitting idle, waiting to be used.
- Single point of failure
- When a failure happens, a manual reroute may be needed to the backup system
- Potential data loss, due both to the time required for failover and to discrepancies between the backup and live server state. Failover itself is a frequent cause of failure for these systems, because they cannot reliably detect whether a server is down or just running slowly.
- Stampeding herd. If the live server dies and hundreds of calls get retried (because people redial when disconnected), the standby server may not be able to handle the flood. While normal traffic may average 5-10 calls per second, a burst of 100 calls due to retries could easily overwhelm a standby server and cause it to crash. That could then cascade to other failover servers, potentially taking down the entire infrastructure.
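The "down or just slow" problem from the list above can be made concrete with a minimal heartbeat sketch. This is only an illustration: the `HeartbeatMonitor` class and both thresholds are hypothetical and not taken from any particular softswitch.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Classifies a server by how long it has been silent (hypothetical thresholds)."""

    slow_after: float = 2.0   # assumed: seconds of silence before the server counts as "slow"
    down_after: float = 10.0  # assumed: seconds of silence before failover is triggered
    last_seen: float = 0.0    # timestamp of the most recent heartbeat

    def record_heartbeat(self, now: float) -> None:
        """Remember when the live server last reported in."""
        self.last_seen = now

    def status(self, now: float) -> str:
        """Return "healthy", "slow", or "down" based on the silence so far."""
        silence = now - self.last_seen
        if silence >= self.down_after:
            return "down"
        if silence >= self.slow_after:
            return "slow"
        return "healthy"
```

The trade-off sits in `down_after`. A short value fails over quickly but risks abandoning a server that was only slow, while a long value widens the window in which calls and data can be lost.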
Distributed Approach:
2600Hz took a different approach and built a true distributed cluster across multiple servers in several data centers.
First, we made sure all of our servers are in use and not sitting idle. Second, by operating numerous servers in multiple data centers, we removed any single point of failure through geo-redundancy. Third, because call activity is distributed to the servers closest to the originating call, latency is limited and call quality is maximized. Fourth, the distributed cluster is treated as a monolithic switch, minimizing operational costs as it scales. Finally, you can add nodes around the world to your cluster to keep calls and media on-net, reducing cost and increasing call quality.
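As a rough illustration of routing each call to the nearest available server, the sketch below picks the geographically closest healthy node. The node names, coordinates, and the `pick_node` helper are hypothetical and are not taken from the 2600Hz implementation. A real deployment would more likely rank nodes by measured network latency than by map distance.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A hypothetical cluster node with a location and a health flag."""

    name: str
    lat: float
    lon: float
    healthy: bool = True


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, using the haversine formula."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def pick_node(nodes: list[Node], caller_lat: float, caller_lon: float) -> Node:
    """Return the closest healthy node, skipping any node marked unhealthy."""
    healthy = [n for n in nodes if n.healthy]
    if not healthy:
        raise RuntimeError("no healthy nodes available")
    return min(healthy, key=lambda n: distance_km(caller_lat, caller_lon, n.lat, n.lon))
```

With nodes in San Francisco, New York, and London, a caller in Paris lands on the London node. If London is marked unhealthy, the same call is routed to New York instead, with no standby server involved.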
So how do the pros and cons of a distributed system stack up against hot standbys?
Pros:
- All servers are in use
- Scaling to multiple data centers
- Geo-Redundancy
- Calls are routed to the closest data center
- Reduced lag
- Improved call quality
- Manage one cluster vs. single servers
- International expansion
- Keep more calls on-net
- Mitigates stampeding herds. Because retried calls will be distributed among remaining nodes, the impact on any given node will likely be negligible.
Cons:
- More complicated to set up initially
- Adding new call routing features can be tricky since they have to be distributed.
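The stampeding-herd point in the pros list comes down to simple arithmetic. The sketch below assumes retried calls are spread evenly over the surviving nodes, which is a simplification. The `retry_load_per_node` helper and the ten-node cluster size are hypothetical, while the burst of 100 retried calls is the figure used earlier.

```python
import math


def retry_load_per_node(retried_calls: int, surviving_nodes: int) -> int:
    """Worst-case retried calls per node, assuming an even spread (a simplification)."""
    if surviving_nodes < 1:
        raise ValueError("at least one surviving node is required")
    return math.ceil(retried_calls / surviving_nodes)


# Hot standby: the single backup server absorbs the whole burst.
standby_load = retry_load_per_node(100, 1)

# Assumed ten-node cluster that loses one node: nine nodes share the burst.
cluster_load = retry_load_per_node(100, 9)
```

Under these assumptions the standby server takes all 100 retried calls at once, while each surviving cluster node takes at most 12, which is close to the normal 5-10 calls per second mentioned earlier.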
Disruptions to your customers’ service are less than ideal, so why aren’t more carriers using a distributed cluster instead of a hot standby? Mostly because distributed systems are harder to build. However, once they are up and running, the long-term savings, performance, and resilience to failures outweigh the upfront cost. If there is an outage, will your system automatically fail over? Will your service continue uninterrupted? Will all your data be saved? When you’re comparing systems, vendors, or carriers, make sure to ask the right questions.