In the last two months at Ably we’ve spoken with hundreds of candidates for our distributed systems engineer roles. We’ve been surprised by how much knowledge varies from one candidate to the next, and we wonder if the underlying problem is that there is no clear definition of the skills this role requires. Given we’ve been working on our distributed platform for over four years now, we hope we’re qualified enough to take a stab at this.
If you want to become a distributed systems engineer, believe you are one or want to recruit one for your team, here’s our opinionated guide on the concepts you (or the recruitee referred to as “you”) should have a thorough understanding of:
A microservices or service-oriented architecture is not a distributed system
Take for example the following simplistic design of a service based architecture with horizontal scalability:
There’s not much “distributed” about this system. There are multiple hosts and network interconnections, but they are tightly coupled, and their network interactions are reliable, low latency, and predictable. Genuinely distributed, in our view, means systems where nodes are distributed globally and network interactions are unpredictable and can create partitions, but where those nodes nonetheless work together to create a predictable outcome. Distributed systems, at scale, involve state being distributed and rebalanced across the system as nodes are added and removed, and they do this in spite of the unpredictability that is inherent in a global system.
An understanding of a hash ring is a prerequisite
If you think a hash ring has something to do with a criminal cannabis organization, then that’s certainly amusing, but unfortunately means you’re missing knowledge of a common pattern used for distributed systems.
If the above doesn’t look familiar, then I recommend you start digging into how popular distributed systems work, since so many of them rely on the ideas behind a consistent hash ring. See for example:
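To make the idea concrete, here is a minimal sketch of a consistent hash ring in Python. It is an illustration only, not how Ably or any particular database implements it: the `HashRing` class, the choice of SHA-256, and the default of 100 virtual nodes per host are all assumptions made for this example.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes      # virtual positions per node (assumed default)
        self._positions = []      # sorted ring positions
        self._owner = {}          # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        # Map any string to a 64-bit position on the ring.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        # Each node claims several positions so load spreads more evenly.
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove(self, node):
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        # A key belongs to the first position clockwise from its hash,
        # wrapping around to the start of the ring if necessary.
        if not self._positions:
            raise KeyError("ring is empty")
        idx = bisect.bisect_right(self._positions, self._hash(key))
        return self._owner[self._positions[idx % len(self._positions)]]
```

The property worth noticing is what happens when the cluster changes. If you record `lookup()` for a set of keys, call `add("node-d")`, and look the keys up again, the only keys whose owner changes are the ones that move to the new node. Everything else stays where it was, which is what makes rebalancing state tractable as nodes come and go.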
Gossip protocols and consensus algorithms underpin it all
Large distributed systems usually have to track changes in cluster topology in response to network partitions, failures, and scaling events. Various protocols exist to ensure that this can happen, with varying levels of consistency and complexity. This needs to be dynamic and real time because nodes come and go in elastic systems, failures need to be detected quickly, and load and state need to be rebalanced in real time. With a stateful system like Ably, state additionally needs to be moved in real time from old nodes to new ones whilst providing continuity throughout.
If you have never worked with gossip or consensus algorithms, then I recommend you read up on:
- Gossip protocol
- Paxos protocol
- Raft consensus algorithm
- Popular consensus-backed systems like etcd and ZooKeeper, and gossip-backed systems like Serf.
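As a rough feel for how gossip spreads cluster membership, the sketch below simulates push-pull gossip with heartbeat counters in Python. It is a simplification made up for illustration: the `Node` class, the `gossip_round` function, and the `fanout` parameter are hypothetical, and this is not the protocol Serf or any other real system uses, which also handle failure suspicion, timeouts, and real network I/O.

```python
import random


class Node:
    """A cluster member that tracks the latest heartbeat seen per member."""

    def __init__(self, name):
        self.name = name
        self.heartbeat = 0
        self.view = {name: 0}   # member name -> highest heartbeat seen

    def tick(self):
        # Advance our own heartbeat so peers can tell we are still alive.
        self.heartbeat += 1
        self.view[self.name] = self.heartbeat

    def merge(self, remote_view):
        # Keep the freshest heartbeat for every member we hear about.
        for member, beat in remote_view.items():
            if beat > self.view.get(member, -1):
                self.view[member] = beat


def gossip_round(nodes, rng, fanout=1):
    """Every node exchanges views with `fanout` randomly chosen peers."""
    for node in nodes:
        node.tick()
        peers = [n for n in nodes if n is not node]
        for peer in rng.sample(peers, min(fanout, len(peers))):
            peer.merge(node.view)   # push our view to the peer
            node.merge(peer.view)   # pull the peer's view back
```

Running `gossip_round` repeatedly over a handful of `Node` objects with a seeded `random.Random` shows every node learning about every other node after only a few rounds, even though no node ever talks to the whole cluster. A real failure detector would then flag members whose heartbeat has stopped advancing.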
Eventually consistent data types and read/write consistencies
Generally in a distributed system, locks are impractical to implement and impossible to scale. As a result, trade-offs need to be made between the consistency and availability of data. In many cases, for example, availability can be prioritised, and consistency guarantees weakened to eventual consistency, with data structures such as CRDTs.
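One of the simplest CRDTs is the grow-only counter (G-Counter), sketched below in Python. The `GCounter` class and its method names are our own for this example and do not reflect the API of Riak, Phoenix, or any specific library.

```python
class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per replica."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}   # replica id -> count contributed by that replica

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        # A replica only ever writes to its own slot, so no lock is needed.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge whatever order or number of merges occur.
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)
```

Two replicas can accept increments independently during a partition and then call `merge` on each other in either order, any number of times. Both end up with the same total, which is the eventual consistency guarantee in miniature.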
If you’re not familiar with CRDTs or Operational Transform, or with the idea of choosing a consistency level for each read or write in a distributed data store, then you’ve got some reading to do:
- Operational Transform (OT), originally implemented by Google in their Wave product and now in Google Docs. It has uses in collaboration apps, but OT is complex and not widely implemented.
- Conflict-free Replicated Data Types (CRDTs), which provide an eventually consistent result so long as you stick to the data types available. Used by the Riak distributed database and by Presence in Phoenix.
- Consistency levels for both reads and writes in distributed databases like Cassandra.
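The arithmetic behind tunable consistency levels can be sketched in a few lines of Python. With N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to share at least one replica whenever R + W > N. The function names below are hypothetical, and the brute-force check is only there to demonstrate the rule on small clusters.

```python
from itertools import combinations


def quorum(n):
    """Majority quorum size for n replicas, e.g. 2 of 3 or 3 of 5."""
    return n // 2 + 1


def always_overlaps(n, r, w):
    """True if every possible read set of size r shares at least one
    replica with every possible write set of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )
```

For three replicas, quorum reads and writes (R = 2, W = 2) always overlap, so a read sees the latest acknowledged write. Reading and writing at a single replica (R = 1, W = 1) does not overlap in general, which trades that guarantee for lower latency and higher availability.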
Deep network protocol understanding
In a distributed system, you’ll almost certainly be working across all layers of the networking stack. We rely extensively on higher-level protocols such as HTTP, WebSockets and gRPC, as well as raw TCP sockets, but without a deep understanding of those protocols and of everything they depend on, all the way down to the OS itself, you’ll likely struggle to solve problems when things go wrong. Take, for example, a single HTTP request or WebSocket connection, which involves all of the following. At each layer, you should be confident in your understanding and in your ability to debug problems at a packet or frame level:
- DNS protocol and UDP for address lookup.
- File descriptors (on *nix) and buffers used for connections, NAT tables, conntrack tables etc.
- IP to route packets between hosts
- TCP to establish a connection
- TLS handshakes, termination and certificate authentication
- HTTP/1.1 or, more recently, HTTP/2, which gRPC uses as its transport.
- WebSocket upgrades over HTTP.
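As one example of the detail involved at a single layer, here is a Python sketch of the WebSocket opening handshake defined in RFC 6455. The client sends an ordinary HTTP/1.1 request carrying a random `Sec-WebSocket-Key`, and the server proves it understood the upgrade by returning a `Sec-WebSocket-Accept` value derived from that key. The function names, host, and path are hypothetical; the GUID constant and the derivation come from the RFC.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(key):
    """Derive the Sec-WebSocket-Accept value a server must send back."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def build_upgrade_request(host, path, key):
    """Build the raw HTTP/1.1 request a client sends to start the upgrade."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
    ]
    # HTTP header lines end with CRLF; a blank line ends the header block.
    return "\r\n".join(lines) + "\r\n\r\n"
```

When a connection fails to upgrade, comparing the bytes on the wire against what these two functions produce is often the quickest way to tell whether a proxy or load balancer in the path has stripped or rewritten the upgrade headers.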
And that’s not all…
From our perspective, a good working understanding of the concepts described above is what you should expect specifically from a distributed systems engineer. Before that, however, you also need to be a solid systems engineer. This requires you to already have the fundamentals in place: programming languages, general design patterns, version control, infrastructure management, and continuous integration and deployment systems.
Interested in solving distributed problems in a distributed team?
Apply now for a position at Ably. We’re actively looking for remote (in Europe) or onsite (in London) team members to join our distributed engineering team.
Know someone interested in solving distributed problems and want to earn $3k for a minute of your time?
Finding the right people is hard. So if you make a referral for an engineer we employ, we’ll send you $3k as a thanks. One email = $3k.
Not a distributed engineer but interested in working at Ably?
See our jobs board to see if we have any positions suitable.
Note: This is a rewrite of my original article “Rocking horse shit, and what it takes to be a distributed systems engineer”. Upon reflection, I felt that an article that talks about the skills aspiring distributed systems engineers need is generally more useful to the community than a rant from me about how hard it is to find a good distributed systems engineer for Ably!