2600Hz Blog

Read about cutting edge telephony thought leadership, 2600Hz product updates, customer use cases and more!

How Squirrels Break DataCenters And Other Database Conjectures

This Q&A presentation was influenced by Kyle Kingsbury’s work on Jepsen, an exploration of modern databases. If you haven’t seen his work and you like this stuff, you should go check it out. It’s awesome.

We just did our Epic Database Expert Q&A featuring Sam Bisbee of Cloudant and Darren Schreiber of 2600hz. We covered a range of topics but focused on these three kinds of failures:

  • Network Partitions
  • Layer 1 Disasters
  • Flapping Internet (Special Class of Network Partitions)

Network Partitions

All public networks are unreliable; such is the reality of modern distributed database management (and let’s face it, because of AWS we’re all managing distributed databases, whether we like it or not). Sometimes, when these unreliable networks break down, a partition can form. These partitions, depending on your database configuration, can wreak havoc across a wide gamut of scenarios.

Arguably, most of what a database admin does is prepare for network partitions and how to resolve them.
-Joshua Goldbard, 2600hz

Yes, modern databases run fairly well when they’re not in a failure state, but, frankly, the only thing that matters is the failure state. During a partition, it’s important to understand your database behavior, which can vary wildly. At 2600hz, we leverage BigCouch, which uses a Master-Master replication strategy with Dynamo-style quorum. In plain English, that means every node is a master node, and BigCouch uses consistent hashing to redistribute the load in the event of a partition.
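To make the consistent hashing and quorum ideas a little more concrete, here is a minimal Python sketch. It is not BigCouch's code and makes no claim about how BigCouch lays out its shards; the class name `HashRing`, the `vnodes` parameter, the node names and the `has_write_quorum` helper are all made up for illustration. It only shows the general technique: keys map onto a ring of nodes, and when a node is unreachable the key's replicas shift to the next nodes on the ring.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring (illustrative only, not BigCouch's implementation)."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets several points ("virtual nodes") on the ring
        # so that load spreads more evenly.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def preference_list(self, key, n, down=()):
        """Walk clockwise from the key and return up to n distinct reachable nodes."""
        start = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in down and node not in chosen:
                chosen.append(node)
                if len(chosen) == n:
                    break
        return chosen


def has_write_quorum(acks, w):
    """A write counts as successful once at least w replicas acknowledge it."""
    return acks >= w


# Hypothetical node names; one node per line for readability.
ring = HashRing(["dc1-a", "dc1-b", "dc2-a", "dc2-b"])
owners = ring.preference_list("account/1234", n=3)
# Pretend the first owner is cut off by a partition.
survivors = ring.preference_list("account/1234", n=3, down={owners[0]})
```

With the first owner marked as down, the other two owners keep their copies and the next node on the ring picks up the third, so only the keys that lived on the lost node move.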

The best advice we can give here is to know the failure modes and behavior of your database and understand the partition realities of the software.
-Darren Schreiber, 2600hz

Layer 1 Disasters

Hurricanes, Earthquakes or Squirrels? Squirrels eating glass. Squirrels caught in HVAC units. Squirrels tampering with power lines. All of these are examples of Layer 1 Disasters, but we tend to think only about the really massive outages, not the unexpected ones that affect critical infrastructure.

Darren, the 2600hz CEO, has a lot of experience managing Datacenters. Here’s a quick story from back-in-the-day about managing racks in a DC:

Once upon a time a Datacenter vendor decided to give my company a couple of months notice that they were going to 10x our rates. They assumed we couldn’t migrate out of that Datacenter easily, and they were right. Because we were cheap, we did everything ourselves, which meant loading the racks into a pickup truck by hand that we drove in the rain to another Datacenter. Not my definition of Fun.
-Darren Schreiber, 2600hz

Contrast that with our experience during Sandy, when we were using BigCouch:

On the day before the storm, we just turned off the Datacenter. That was it.

We can evade storms, earthquakes and Squirrels because of Cloudant.
-Darren Schreiber, 2600hz

If a Datacenter gets into a Layer 1 issue, we just kill it and move on. When the disaster is mitigated we bring the service back up, but losing an entire DC (or even multiple DCs) is not an issue because of our database choice.

Protip: If you can’t predict disasters, have a plan to avoid them.

Flapping Internet

Is it up or is it down? Flapping internet is a special case of the Network Partition. Basically, a flapping connection is one that goes down, then comes up, then goes down again. This is actually worse than a server going hard down, because of the reconciliation process that happens every time the networks reconnect. We’ve got one answer for this and one only: Zombie Servers get Double Tapped.

Basically, if a Datacenter is flapping, it’s better to just disconnect that Datacenter manually until it can be confirmed as restored. There’s no easy way to say this: flapping is one of those scenarios that requires manual intervention. If the DC is flapping, you have to take it out or you may never get back online.
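One way to picture the "take it out until a human confirms it" rule is a small flap detector. The sketch below is only an illustration in Python, not something 2600hz described using: the class name `FlapDetector`, the thresholds and the idea of feeding it health-check results are all assumptions. It counts up/down transitions inside a time window and, once there are too many, keeps the Datacenter quarantined until someone releases it by hand.

```python
from collections import deque


class FlapDetector:
    """Quarantine a link that changes state too often (hypothetical sketch)."""

    def __init__(self, max_transitions=3, window_seconds=300.0):
        self.max_transitions = max_transitions
        self.window_seconds = window_seconds
        self._transitions = deque()  # timestamps of recent up/down changes
        self._last_state = None
        self.quarantined = False

    def observe(self, is_up, now):
        """Record one health-check result; return True while quarantined."""
        if self._last_state is not None and is_up != self._last_state:
            self._transitions.append(now)
        self._last_state = is_up
        # Forget transitions that have aged out of the window.
        while self._transitions and now - self._transitions[0] > self.window_seconds:
            self._transitions.popleft()
        if len(self._transitions) >= self.max_transitions:
            # Sticky on purpose: a later "up" reading does not clear it.
            self.quarantined = True
        return self.quarantined

    def release(self):
        """Manual step: an operator confirms the Datacenter is really back."""
        self._transitions.clear()
        self.quarantined = False


detector = FlapDetector(max_transitions=3, window_seconds=60.0)
readings = [(True, 0.0), (False, 10.0), (True, 20.0), (False, 30.0)]
results = [detector.observe(is_up, now) for is_up, now in readings]
```

The quarantine flag is deliberately sticky. A link that looks healthy again a few minutes later stays out of rotation until `release()` is called, which matches the manual-intervention advice above.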

Protip: If it flaps, Double Tap.

Final Thoughts

Darren chose to use the last few minutes to pontificate about how ridiculous life was before BigCouch. There was a point very early on where we simply could not get BigCouch to work and we thought we might have to fold the company. Thanks to incredible support from the Cloudant team, we got everything working and the rest is history.

It’s night and day. We just don’t spend any time on the database anymore… We just don’t have problems with the Software.
-Darren Schreiber, 2600hz

Sam chose to talk about write reliability, specifically the way in which other systems buffer writes and how they respond to concerns about write availability.

There are a lot of other databases out there that will reply “Write confirmed” when you buffer the write, NOT when it actually commits to disk. The practical effect of this is that if the disk dies before the write moves out of the buffer, you’re missing writes, which is death to a database.

Durable databases write to disk and confirm, they don’t just buffer. Databases that buffer can be very dangerous depending on the workload.
-Sam Bisbee, Cloudant
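The difference Sam describes can be shown at the file level with a short Python sketch. This is a generic illustration of buffered versus synced writes using the standard library, not the internals of Cloudant or any other database; the function names `buffered_append` and `durable_append` are made up for the example.

```python
import os


def buffered_append(path, record):
    """Return as soon as the bytes are handed to the operating system.

    The data may still sit in the OS page cache, so a power loss or disk
    failure at this point can lose a write the caller believes is confirmed.
    """
    with open(path, "ab") as f:
        return f.write(record)


def durable_append(path, record):
    """Return only after asking the OS to push the bytes to the storage device."""
    with open(path, "ab") as f:
        written = f.write(record)
        f.flush()              # empty Python's own buffer into the OS
        os.fsync(f.fileno())   # ask the OS to flush its cache to disk
        return written
```

Both functions leave the same bytes in the file when nothing goes wrong. The only difference is when it is safe to tell the client "Write confirmed", and the durable version pays for that guarantee with extra latency on every write.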

We had a blast doing this presentation with Cloudant and we can’t wait for the next Q&A Session on Border Controllers in two weeks. Click here to join us!

http://2600hzqa5.eventbrite.com/

Two weeks after that, we’re going to discuss DTMF and how all of that nonsense works in VoIP. Register for free here:

http://2600hzqa6.eventbrite.com/

Lastly, if you’d like to talk to our friends at Cloudant, check them out at Cloudant.com or in IRC on Freenode in channel #Cloudant.

Thanks so much for checking out our Q&A. If this all sounds like it’s too much work, you should call our Sales team at 8554642600 or sales@2600hz.com. We power some of the biggest infrastructures on the planet and we’d love to talk about how we can help your business eliminate the pains of operating communications infrastructure :).

Database Expert Q&A from 2600hz and Cloudant
