Your browser has Javascript disabled. Please enable it to use this site. Hide this warning

  • Blog:

  • Home
  • Ably News
  • Ably Engineering
  • Realtime APIs
  • Hardest Aspects of Realtime Engineering
  •  •  5 min read

    What is a Distributed Systems Engineer?

    In the last few months at Ably we’ve spoken with hundreds of candidates for our Lead Distributed Systems Engineer and Distributed Systems Engineering roles. We’ve been surprised by how varied each candidate’s knowledge has been. It got us wondering if the challenge in finding the right people is that there is no clear definition of what skills are required to excel in this role.

    Give that we've been working on our distributed realtime messaging platform, global cloud network, and realtime APIs for over six years, we think we're qualified enough to take a stab at defining what a distributed systems engineer needs.

    The concepts a Distributed Systems Engineer needs to know

    1. Microservices or SoA is not a distributed system
    2. Understanding hash rings is a pre-requisite
    3. Gossip protocols and consensus algorithms underpin everything
    4. Eventually consistent data types and read/write consistencies
    5. Deep understanding of network protocols

    Microservices or SoA is not a distributed system

    Here's an example of a simplistic design of a service based architecture with horizontal scalability:

    A simplistic service based architecture that scales horizontally

    There’s not much “distributed” about this system. There are multiple hosts and network interconnections but they are tightly coupled. And their network interactions are reliable, have low-latency, and are predictable. Genuinely distributed, in our view, means:

    • Systems where nodes are distributed globally
    • Network interactions are unpredictable and can create partitions
    • Nonetheless those nodes work together to create a predictable outcome

    Distributed systems, at scale, involve state being distributed and re-balanced across the system, reacting as nodes are added and removed, and they do this in spite of the unpredictability that is inherent in a global system.

    Understanding hash rings is a pre-requisite

    If you think a hash ring has something to do with a criminal cannabis organization, then that’s certainly amusing, but unfortunately means you’re missing knowledge of a common pattern used for distributed systems.

    Hash Ring example. Credit: Mathias Meyer from Travis CI

    If the above doesn’t look familiar, then I recommend you start by diving into how popular distributed systems work, all of which rely on the ideas behind a consistent hash ring. See:

    Gossip protocols and consensus algorithms underpin everything

    Large distributed systems usually have to track changes in cluster topology in response to network partitions, failures, and scaling events. Various protocols exist to ensure that this can happen, with varying levels of consistency and complexity. This needs to be dynamic and real time because:

    • Nodes come and go in elastic systems
    • Failures need to be detected quickly
    • Load and state need to be rebalanced in real time

    With a stateful system like Ably, state also needs to be moved in real time between new and old nodes whilst providing continuity throughout.

    If you have never worked with Gossip or consensus algorithms, then I recommend you read up on:

    Eventually consistent data types and read and write consistencies

    Generally in a distributed system, locks are impractical to implement and impossible to scale. As a result, trade-offs need to be made between the consistency and availability of data. In many cases, availability can be prioritised and consistency guarantees weakened to eventual consistency, with data structures such as CRDTs.

    If you’re not familiar with CRDT or Operational Transform, the concepts of variable consistencies for queries or writes to data in a distributed data store, then you’ve got some reading to do:

    Deep understanding of network protocols

    In a distributed system, you’ll almost certainly be working within all layers of the networking stack. Ably relies extensively on various higher level protocols such as HTTP, WebSockets, gRPC, and TCP sockets. But without a deep understanding of those protocols and the full stack of protocols they rely on all the way down to the OS itself, you’ll likely struggle to solve problems in a distributed system when things go wrong.

    Take for example the following request or WebSocket connection which involves all of the following. At each layer you should be confident in your understanding and ability to debug problems at a packet or frame level:

    • DNS protocol and UDP for address lookup
    • File descriptors (on *nix) and buffers used for connections, NAT tables, conntrack tables etc.
    • IP to route packets between hosts
    • TCP to establish a connection
    • TLS handshakes, termination and certificate authentication
    • HTTP/1.1 or more recently 2.0 used extensively by gRPC
    • WebSocket upgrades over HTTP

    And that’s not all…

    From our perspective of operating a truly global and distributed system, a working understanding of the specific concepts described above is what we expect from a distributed systems engineer. Before that you need to also be a solid systems engineer. This requires you to have fundamentals in place such as programming languages, general design patterns, version control, infrastructure management, and continuous integration and deployment systems.

    Further reading


    About the author

    I’m Matt, technical CEO and co-founder of Ably. I'm very interested in realtime problems, realtime engineering, and where the industry is headed. That’s the reason I co-founded Ably, which provides cloud infrastructure and APIs to help developers simplify complex realtime engineering. Organizations build with Ably because we make it easy to power and scale realtime features in apps, or distribute  data streams to third-party developers as realtime APIs.

    We're hiring across our engineering and commercial teams, including Distributed Systems Engineering roles so check out our job board. Not looking for a role but know someone who is? Refer an engineer that we employ and we'll send you $3k as a thanks. One email = $3k.

    And if you’re interested in chatting to me about realtime problems, distributed systems, or this article, please do reach out to me @mattheworiordan or @ablyrealtime on Twitter.