4 min read • Last updated Mar 23, 2023

The 18 ghosts in your infrastructure stack that can cause failure (and how to avoid them)

Following the recent attack on Dyn, which caused so much disruption because so many companies had a crucial single point of failure, I started thinking about how a trivial HTTP request to our API, typically completed in under 100ms, is covered by almost twenty separate strategies we employ to address single points of failure. The diagram below is far from exhaustive, but I hope it helps people think about how every point of failure needs to be addressed if you want to offer guaranteed service continuity.

Notes

Top-level domains (TLDs) do go down; see, for example, when the .io domain was unavailable. Our TLD fallback ensured the service remained up.
Big, reliable DNS providers go down too. Dyn went down and took a large chunk of Internet services with it.
DDoS prevention, especially at Layer 3/4, is generally handled better by an external edge provider such as CloudFlare or Akamai. However, that’s not going to protect you against simple attacks that are expensive for your servers to process. You need to think about how you can reject requests very cheaply. For example, if you have to make a database lookup for every auth request and then perform an expensive bit of cryptography to validate the auth details, it is trivial for an attacker to generate significant load with a few expensive requests, and most DDoS providers will be unable to help you with that.
Certificates can fail as well, so don’t assume they are not a single point of failure. GlobalSign recently issued an incorrect revocation list by mistake, which resulted in a lot of their certificates being treated as revoked and thus no longer usable. Think about fallback strategies.
Load balancers are a single point of failure if they don’t exist in multiple distinct datacenters and don’t autoscale to meet demand. If you are using Amazon, use ELB with multiple availability zones. Be aware, however, that whilst ELB is elastic and will scale, it only scales up within limits.
If you deem it your responsibility, rather than your customers’, to handle unexpected failures, then you need to handle in-flight HTTP requests and retry them at least once against another node that can satisfy the request. To do this, you’ll probably need your own routing layer that understands your business logic, so that you don’t retry every failing request when the system is simply overloaded, which would increase the load further.
Use rate limiting to prevent abuse. You’ll need to figure out what you can rate limit against: it could be an account, a user, or even an IP address (although that’s rarely a good idea). Without rate limiting, one user can adversely affect other users, even unintentionally.
If your application or data layer exists in only one region, you’ll struggle to offer guaranteed continuity. Network partitions occur, and entire regions go down (it’s happened with AWS). If you’re in only one datacenter (availability zone), then you’re very susceptible to failures. Try to spread your app and database across at least two datacenters, with each able to operate autonomously.
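
As a rough illustration of two of the notes above (rejecting requests cheaply with a per-account rate limit, and retrying a failed in-flight request once on another node without retrying when the system is merely overloaded), here is a minimal Python sketch. It is not Ably's implementation. The names `RateLimiter`, `send_with_fallback`, `Overloaded` and `NodeUnavailable`, the choice of a token-bucket algorithm, and the node and account identifiers are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Sequence, Tuple


class Overloaded(Exception):
    """Hypothetical error: the node is shedding load, so do not retry."""


class NodeUnavailable(Exception):
    """Hypothetical error: the node failed or was unreachable; safe to retry."""


class RateLimiter:
    """Token bucket per key (e.g. an account ID), held in memory.

    The check is a dictionary lookup and some arithmetic, so it can run
    before any database lookup or cryptographic validation.
    """

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate          # tokens added per second
        self._capacity = capacity  # maximum burst size
        self._clock = clock        # injectable so the logic can be tested
        self._buckets: Dict[str, Tuple[float, float]] = {}  # key -> (tokens, last)

    def allow(self, key: str) -> bool:
        now = self._clock()
        tokens, last = self._buckets.get(key, (self._capacity, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self._capacity, tokens + (now - last) * self._rate)
        allowed = tokens >= 1.0
        self._buckets[key] = (tokens - 1.0 if allowed else tokens, now)
        return allowed


def send_with_fallback(nodes: Sequence[str], send: Callable[[str], str],
                       max_attempts: int = 2) -> str:
    """Try nodes in order, moving to the next one only on NodeUnavailable.

    Overloaded is deliberately not caught: retrying an overloaded system
    elsewhere would only increase the load further.
    """
    last_error = None
    for node in nodes[:max_attempts]:
        try:
            return send(node)
        except NodeUnavailable as exc:
            last_error = exc  # this node failed; fall through to the next
    if last_error is None:
        raise NodeUnavailable("no nodes configured")
    raise last_error
```

In a real routing layer, `send` would wrap the actual HTTP call and translate its failures into these two error types, and the bucket table would need eviction (or shared storage) so that it does not grow without bound.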

We’ve been fortunate enough at Ably to have decided early on that addressing all single points of failure is key to our platform offering and is what we believe will set us apart. As a result, we’ve allocated considerable resources and time to solving these problems, mostly “just in case”. If you can make the time to at least form a strategy for removing single points of failure in your own platform, then when the next big Internet outage occurs or you have unexpected failures in your system, I hope that, like us, you’ll be thankful you invested the time upfront.