For the past three and a half years, the Ably team have been living and breathing realtime messaging. Our mantra has always been to solve the tough problems so that our customers don’t need to. This approach has led us to solve not just the obvious problems, but also the difficult edge cases others have shied away from. As a result, Ably lives up to its promise: simply better realtime messaging.
In a series of ten articles, I will explore each of these tough technical problems, discussing use cases and considering why it matters when the problem is left unsolved. I will then share what we’ve learnt about how it is indeed possible to engineer and deliver a realtime platform that comprehensively addresses each challenge.
# 1 — Connection state and continuity
The significant and continuing improvements in the bandwidth and quality of network infrastructure have led to an illusion that an internet connection is universal and continuously available. However, mobile devices are displacing the desktop as the principal consumption device, and this means we no longer have a reliable, consistent transport to work with. Mobile devices are on the move: they change networks frequently from Wi-Fi to 3G, they become disconnected for periods of time as they move around, and particular transports such as WebSockets may suddenly become unavailable as a user joins their corporate network.
With the proliferation of mobile, a pub/sub realtime messaging system that allows messages to be lost whilst disconnected is just not good enough. A rigorous and robust approach to handling changing connectivity state is now a necessity if you want to rely on realtime messaging in your apps.
The challenge — pub/sub is loosely coupled
Most realtime messaging services implement a publish/subscribe pattern which, by design, decouples publishers from subscribers. This is a good thing as it allows a publisher to broadcast data to any number of subscribers listening for data on that channel.
However, this design is slightly at odds with the need for subscribers to have continuity of messages throughout brief periods of disconnection. If you want guaranteed message delivery to all subscribers, you typically need to remove that decoupling and know, at the time of publishing, who the subscribers are so that you can confirm which of them have received the message. If this approach is taken, however, you lose most of the benefits of the decoupled pub/sub model: the publisher is now responsible for keeping track of message deliverability and, importantly, is blocked (or at least must retain state) until all subscribers have confirmed receipt.
A common use case where connection state matters
Chat applications are a very common requirement and seemingly simple to build. However, most realtime messaging platforms that provide a pub/sub model just don’t work reliably for them. See the example below in the diagram:
The problem here is obvious. Without connection state, Kat, whilst moving through a tunnel, is disconnected from the service until she comes out of the tunnel. In the intervening time, all messages published are ephemeral and thus gone from Kat’s perspective. The problem with this is twofold:
- Kat has lost any messages published whilst she was disconnected.
- Worse, Kat has no way of knowing that she missed any messages whilst disconnected.
A better way — connection state stored in the realtime service
At Ably we have approached the problem of ensuring messages are delivered to temporarily disconnected clients in a different way. Instead of placing the onus on the publisher, or on components of the infrastructure representing the publisher, to know which clients have or have not received messages, our frontend servers retain the state of each client that is, or recently was, connected to Ably. Because the frontend servers retain connection state while clients are disconnected, the server effectively continues to receive and store messages on behalf of the disconnected client, as if it were still connected. When the client does eventually reconnect (by presenting a private connection key it was issued the first time it connected), we are able to replay what happened whilst it was disconnected.
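The idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Ably’s implementation: the `Frontend` class, its method names and the in-memory `pending` list are all hypothetical, and the naive per-client buffering it uses is exactly what the section on cursors further down argues against at scale.

```python
import secrets


class Frontend:
    """Hypothetical frontend that keeps state for disconnected clients."""

    def __init__(self):
        # connection key -> {"connected": bool, "pending": [messages]}
        self._connections = {}

    def connect(self):
        """New connection: issue a private connection key."""
        key = secrets.token_hex(16)
        self._connections[key] = {"connected": True, "pending": []}
        return key

    def disconnect(self, key):
        """Transport dropped: keep the state, just mark it disconnected."""
        self._connections[key]["connected"] = False

    def publish(self, message):
        """Deliver to connected clients; store for disconnected ones."""
        delivered = []
        for key, state in self._connections.items():
            if state["connected"]:
                delivered.append((key, message))
            else:
                state["pending"].append(message)
        return delivered

    def resume(self, key):
        """Client presents its key and gets what it missed, in order."""
        state = self._connections.get(key)
        if state is None:
            raise KeyError("unknown or expired connection key")
        missed, state["pending"] = state["pending"], []
        state["connected"] = True
        return missed


frontend = Frontend()
kat = frontend.connect()
frontend.disconnect(kat)            # Kat enters the tunnel
frontend.publish("Are you there?")
frontend.publish("Meet at 6pm")
missed = frontend.resume(kat)       # Kat leaves the tunnel
```

The publisher never learns who is connected: it calls `publish` once, and the frontend decides per connection whether to deliver immediately or hold the message for a later resume.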
So a simple group chat app with a realtime service that provides connection state now looks as follows:
So why is it hard to maintain connection state for disconnected clients?
Much like all the problems we have solved, the challenges are overwhelmingly more complicated when you have to think about the problem at scale, without any disruption, and with guarantees about continuity of service.
If connection state is stored only in volatile memory on the server that the client originally connected to, it is susceptible to data loss. For example, during scale-down events, deployments or recycling, the server process will be terminated. If the data exists only in memory, these routine day-to-day operations will result in data loss for clients.
We’ve solved this at Ably by storing all connection state in more than one location across at least two data centres.
Connection state migration
If a client connects to one of your frontend servers, the connection state will persist on the server it connected to. However, following an abrupt disconnection, a subsequently established connection may land on a different server and attempt to resume connection state that is stored elsewhere. A rather simple way to solve this is to use a shared key/value data store such as Redis; however, that is not an entirely robust solution. Redis may be recycled, overloaded, or unresponsive, and service continuity should remain unaffected when it is.
We’ve solved this at Ably in a number of ways:
- Firstly we have a routing layer that monitors the cluster state via gossip and routes connections being resumed or recovered to the frontend server that holds the state, if available.
- Secondly, if the frontend server or underlying cache for that connection has gone away in the intervening time, we are able to rebuild the state from the secondary fail-over cache on a new frontend server. When a new server rebuilds the state, it will also persist the state in a secondary location to ensure a subsequent fail-over will succeed.
- Lastly, if the router is unable to determine which frontend holds the connection state, any frontend can effectively take over the connection state, notifying the old frontend to discard its copy.
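The three cases above can be sketched as a single resume routine. Everything here is an assumption made for illustration: the `Cluster` class, its dictionaries and the frontend names are hypothetical, gossip is not modelled (the `live` set and `owner` map simply stand in for what the router would learn from it), and state is a plain dictionary.

```python
class Cluster:
    """Hypothetical cluster view used by the routing layer."""

    def __init__(self, frontends):
        self.live = set(frontends)  # frontends currently healthy
        self.owner = {}             # router's view: connection id -> frontend
        self.primary = {}           # frontend -> {connection id: state}
        self.secondary = {}         # fail-over copy: connection id -> state

    def store(self, frontend, conn_id, state):
        """Persist state on a frontend and in the secondary location."""
        self.primary.setdefault(frontend, {})[conn_id] = state
        self.secondary[conn_id] = dict(state)
        self.owner[conn_id] = frontend

    def fail(self, frontend):
        """Simulate a frontend being terminated: its memory is lost."""
        self.live.discard(frontend)
        self.primary.pop(frontend, None)

    def take_over(self, conn_id, new_frontend):
        """Case 3: claim the state and tell other frontends to discard it."""
        state = None
        for frontend, states in self.primary.items():
            if frontend != new_frontend and conn_id in states:
                state = states.pop(conn_id)  # old frontend discards its copy
        if state is None:
            state = dict(self.secondary[conn_id])
        self.store(new_frontend, conn_id, state)
        return state

    def resume(self, conn_id, candidate):
        """Return (frontend, state, how) for a resuming connection."""
        owner = self.owner.get(conn_id)
        if owner is None:
            # Case 3: the router cannot tell who holds the state.
            return candidate, self.take_over(conn_id, candidate), "taken over"
        if owner in self.live and conn_id in self.primary.get(owner, {}):
            # Case 1: route to the frontend that still holds the state.
            return owner, self.primary[owner][conn_id], "routed"
        # Case 2: the owner has gone; rebuild from the fail-over copy.
        state = dict(self.secondary[conn_id])
        self.store(candidate, conn_id, state)  # also re-persists the secondary
        return candidate, state, "rebuilt"


cluster = Cluster(["fe-1", "fe-2", "fe-3"])
cluster.store("fe-1", "conn-a", {"chat": 41})
first = cluster.resume("conn-a", candidate="fe-2")   # fe-1 is alive
cluster.fail("fe-1")
second = cluster.resume("conn-a", candidate="fe-2")  # fe-1 is gone
del cluster.owner["conn-a"]                          # router loses track
third = cluster.resume("conn-a", candidate="fe-3")
```

Note that `store` always writes the secondary copy as well, so a rebuild leaves the connection protected against a further fail-over.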
Some clients cannot use WebSockets and will have to rely on fallback transports such as Comet. Comet will quite likely result in frequent HTTP requests being made to different frontend servers. It is impractical to keep migrating the connection state for each request because Comet, by its very nature, generates new requests all the time. It is imperative that a routing layer exists to route all Comet requests to the frontend that is retaining the active connection state, so that it can wrap all Comet requests and treat them as one connection.
We’ve solved this at Ably by building our own routing layer that is aware of the cluster, can determine where a connection currently resides, and can route all Comet requests to the server that is currently maintaining the connection state for that client.
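A minimal sketch of that routing behaviour, assuming a hypothetical `CometRouter` that already knows which frontend owns each connection key. The class names, the `arrived_at` parameter and the polling model are illustrative only and are not taken from Ably’s routing layer.

```python
from collections import deque


class LogicalConnection:
    """One connection as seen by the frontend, however many requests carry it."""

    def __init__(self):
        self.outbox = deque()
        self.polls = 0

    def poll(self):
        """Serve one Comet request: hand over everything queued so far."""
        self.polls += 1
        batch = list(self.outbox)
        self.outbox.clear()
        return batch


class CometRouter:
    """Hypothetical router: sends every request for a connection to its owner."""

    def __init__(self, frontends):
        self.frontends = {name: {} for name in frontends}  # name -> {key: conn}
        self.owner = {}                                     # key -> frontend name

    def open(self, key, frontend):
        self.frontends[frontend][key] = LogicalConnection()
        self.owner[key] = frontend

    def deliver(self, key, message):
        self.frontends[self.owner[key]][key].outbox.append(message)

    def handle_poll(self, key, arrived_at):
        """Route by connection key; where the request arrived is ignored."""
        frontend = self.owner[key]
        return frontend, self.frontends[frontend][key].poll()


router = CometRouter(["fe-1", "fe-2"])
router.open("conn-a", "fe-1")
router.deliver("conn-a", "hello")
served_by_1, batch_1 = router.handle_poll("conn-a", arrived_at="fe-2")
router.deliver("conn-a", "world")
served_by_2, batch_2 = router.handle_poll("conn-a", arrived_at="fe-1")
```

Both polls are served by `fe-1` even though the first one arrived at `fe-2`, so the connection state never has to move between requests.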
Store cursors, not data
If the process that is spawned to retain connection state simply kept a copy of all data published whilst the client was disconnected, it would very soon become impractical. For example, assume a single frontend has 10k clients connected, and each client is receiving 5 messages per second of 10KB each. In 2 minutes alone, the frontend would need to store roughly 57GB of data. Added to this, if a client reconnects and has no concept of a serial number or cursor position, the frontend can never be sure which messages the client did or did not receive, so it could inadvertently send duplicate messages.
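The storage figure can be checked with a few lines of arithmetic. The only assumption is the unit convention: the 57GB figure works out if each message is 10 KiB (10 × 1024 bytes) and the total is expressed in GiB (1024³ bytes).

```python
clients = 10_000
messages_per_second = 5
message_bytes = 10 * 1024       # 10 KiB per message (assumed binary units)
seconds = 2 * 60

total_bytes = clients * messages_per_second * message_bytes * seconds
total_gib = total_bytes / 1024**3

print(f"{total_bytes:,} bytes is about {total_gib:.2f} GiB")
```

In decimal units the same workload is a little over 61 GB, so the order of magnitude holds whichever convention is used.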
We’ve solved this at Ably by providing deterministic message ordering, which in turn allows us to assign a serially-numbered id to each message. As such, instead of keeping a copy of all messages published for every subscriber, we simply keep track of a serial number for each channel for each disconnected client. When the client then reconnects, it announces the last serial it received before it was disconnected, and the frontends can use that serial to determine exactly which messages need to be replayed, without having to store them locally.
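A minimal sketch of cursor-based replay follows. It assumes a hypothetical `Channel` whose ordered history is kept once, shared by all subscribers, and a `Client` that remembers only the last serial it saw; where the history actually lives is not described in this article, so the in-memory list is purely illustrative.

```python
class Channel:
    """Hypothetical channel with one ordered, serially numbered history."""

    def __init__(self):
        self.history = []           # history[i] holds the message with serial i + 1

    def publish(self, data):
        self.history.append(data)
        return len(self.history)    # serial assigned to this message

    def since(self, serial):
        """All (serial, data) pairs published after the given serial."""
        return list(enumerate(self.history[serial:], start=serial + 1))


class Client:
    """Subscriber that tracks a cursor rather than relying on the server."""

    def __init__(self):
        self.last_serial = 0
        self.received = []

    def receive(self, serial, data):
        if serial <= self.last_serial:
            return                  # already seen: drop the duplicate
        self.received.append(data)
        self.last_serial = serial


chat = Channel()
kat = Client()
kat.receive(chat.publish("hi"), "hi")

# Kat disconnects; two messages are published while she is away.
chat.publish("in the tunnel?")
chat.publish("call me")

# On reconnect Kat announces her last serial and the gap is replayed.
replayed = chat.since(kat.last_serial)
for serial, data in replayed:
    kat.receive(serial, data)
```

The per-client cost is a single integer per channel, and because the serial identifies each message exactly, a replay can neither skip a message nor deliver one twice.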
After three and a half years in development, we have learnt the hard way that the devil is in the detail. If you want one reason why you should consider using Ably over a bespoke, open source or competitor project, it’s because we’ve spent those years solving these problems so you don’t have to.
In my next article, I will address another of the top ten technical challenges of building a realtime system. Please follow me on Medium if you would like to be kept up to date, or sign up to our Ably realtime roundup newsletter.
If you have any questions, feedback or corrections for this article, please do get in touch with me or the team.