    Realtime data delivery - good enough is no longer good enough

    Update October 2019: Ably is now distributed over 16 data centers and 175+ edge acceleration PoPs. Visit our network page for more.

    Since Ably launched less than six months ago, I’ve talked to many people using different realtime data delivery solutions. In this article I explore a common misconception I’ve seen that realtime transports cannot be relied on for integrity and continuity over an unreliable network.

    tl;dr: Realtime data delivery that probably arrives most of the time and usually in the order it was published is no longer good enough. Ably’s protocol, client libraries and platform solve this problem and provide a truly reliable transport to deliver data.

    Before I talk through how realtime data transports can in fact be reliable, I’d like to explore a lower level network protocol, namely TCP. TCP stands for Transmission Control Protocol. It is built on top of IP (Internet Protocol) and is used to deliver the bulk of internet traffic today.

    I want to dive into what TCP does and why we’ve all come to rely on it completely for our internet-enabled apps and services. Afterwards, I’ll review the design decisions we have taken at Ably for our own protocol, the commonalities with TCP, and why this approach allows us to deliver a truly reliable transport.


    Exploring TCP

    The design of TCP is focussed on establishing channels for data exchange between two devices. It does not attempt to understand how the data packets themselves are routed between IP addresses, and so provides a layer of abstraction over the lower level IP layer. TCP assumes IP “just works”, which keeps the complexity of the TCP protocol itself to a minimum (this does not mean it’s not complex, just not unnecessarily complex).

    Looking on Wikipedia, we can see that TCP is responsible for providing a reliable byte stream such that:

    • data arrives in-order
    • data has minimal error (i.e., correctness)
    • duplicate data is discarded
    • lost or discarded packets are resent
    • it includes traffic flow and congestion control

    Let’s look at each aspect and how that is achieved, from a high level.
    Please note this list is not exhaustive. TCP is an amazing bit of engineering, and I have intentionally oversimplified it for the purposes of this post!

    Data ordering

    TCP uses sequence numbers to identify the data in each packet and is thus able to reconstruct the stream of bytes in the order they were submitted. The receiver also sends ACKs (acknowledgements) to indicate that data has arrived, which allows the sender to keep sending and to discard its buffers for data already received. ACKs can be cumulative to minimise chatter, i.e. if 50 packets are received, the receiver can send one ACK for all 50.
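
    To make the idea concrete, here is a minimal sketch of a receiver that reorders data by sequence number and replies with a cumulative ACK. It is a toy model, not TCP itself: it numbers whole segments rather than bytes, and the class and method names are hypothetical.

```python
class InOrderReceiver:
    """Toy receiver: buffers out-of-order segments and delivers them in order."""

    def __init__(self):
        self.next_seq = 0     # next sequence number we expect
        self.buffer = {}      # out-of-order segments, keyed by sequence number
        self.delivered = []   # data handed to the application, in order

    def receive(self, seq, data):
        """Accept one segment and return the cumulative ACK."""
        if seq >= self.next_seq:
            self.buffer.setdefault(seq, data)   # segments already delivered are ignored
        # Deliver every contiguous segment that is now available.
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return self.next_seq  # "I have everything before this number"


receiver = InOrderReceiver()
# Segment 2 arrives before segment 1.
acks = [receiver.receive(seq, data) for seq, data in [(0, "a"), (2, "c"), (1, "b")]]
```

    The ACK stays at 1 while segment 1 is missing and then jumps straight to 3, which is the cumulative behaviour described above: one ACK covers everything received so far.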

    Error detection

    Detecting errors is primarily done using a checksum field for the data, i.e. if the data is corrupted, there is a good chance the checksum will not match. There is additional higher integrity (typically CRC) error detection at layer 2, and together these checks make it statistically unlikely that corrupted data goes undetected.
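
    As an illustration, the sketch below computes a 16-bit ones' complement checksum in the style of RFC 1071, the family of checksum TCP uses. It is simplified: real TCP also covers the header and a pseudo-header, whereas this hypothetical `internet_checksum` function covers only a payload.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"                            # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]      # add each 16-bit big-endian word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)   # fold carries back into the low 16 bits
    return ~total & 0xFFFF


payload = b"hello world!"
checksum = internet_checksum(payload)

# The receiver sums the payload together with the transmitted checksum.
# A result of zero means no error was detected.
intact = internet_checksum(payload + checksum.to_bytes(2, "big")) == 0
damaged = internet_checksum(b"hellp world!" + checksum.to_bytes(2, "big")) == 0
```

    Here `intact` is true and `damaged` is false, because the single changed byte alters the sum. A checksum this small cannot catch every possible corruption, which is why the layer 2 checks matter as well.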

    Data de-duplication

    As each packet has a sequence number, it is easy for the receiver to simply discard packets that have already been received.

    Resending lost packets

    TCP will resend lost packets if either the transmission times out (i.e. no ACK is received in time), or three duplicate ACKs are received (this is a special condition within TCP that indicates that newer packets are arriving yet a previous packet in the sequence is missing).
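
    The two triggers can be sketched as follows. This is a simplified model under stated assumptions: the `RetransmitTracker` name and its methods are hypothetical, time is passed in explicitly, and real TCP stacks adapt the timeout to measured round-trip times.

```python
DUP_ACK_THRESHOLD = 3   # repeated ACKs that trigger a resend before any timeout


class RetransmitTracker:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_ack = None
        self.dup_acks = 0

    def on_ack(self, ack):
        """Return the sequence number to resend, or None."""
        if ack == self.last_ack:
            self.dup_acks += 1
            if self.dup_acks == DUP_ACK_THRESHOLD:
                return ack            # the receiver is still waiting for this one
        else:
            self.last_ack = ack       # new data was acknowledged, reset the counter
            self.dup_acks = 0
        return None

    def timed_out(self, sent_at, now):
        """True if a packet sent at `sent_at` has waited too long for its ACK."""
        return now - sent_at >= self.timeout_s


tracker = RetransmitTracker(timeout_s=1.0)
# The receiver keeps asking for packet 5 while later packets arrive.
resends = [tracker.on_ack(5) for _ in range(4)]
```

    The first ACK for 5 is normal; the third repeat of it is what prompts the resend, so `resends` ends with 5.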

    Flow and congestion control

    Flow control ensures that the sender does not send data too fast for the receiver to receive and process. This is achieved by the receiver indicating how much data it is willing to accept before sending an ACK back to the sender, and it is then the responsibility of the sender to wait.
    Congestion control is far more complicated and involves a number of algorithms to achieve the highest rates of data in spite of load on the underlying network layer.
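
    Flow control is the simpler of the two to illustrate. The sketch below models a sender that never has more unacknowledged data in flight than the receiver has said it will accept. The `WindowedSender` class is hypothetical, and congestion control is deliberately left out.

```python
class WindowedSender:
    def __init__(self, data: bytes, window: int):
        self.data = data
        self.window = window   # bytes the receiver has said it will accept
        self.next_byte = 0     # index of the next byte to send
        self.acked = 0         # number of bytes acknowledged so far

    def sendable(self) -> bytes:
        """Return as much new data as the receiver's window currently allows."""
        limit = min(len(self.data), self.acked + self.window)
        chunk = self.data[self.next_byte:limit]
        self.next_byte += len(chunk)
        return chunk

    def on_ack(self, acked: int, window: int):
        """The receiver confirms `acked` bytes and advertises a new window."""
        self.acked = acked
        self.window = window


sender = WindowedSender(b"abcdefghij", window=4)
first = sender.sendable()          # fills the window
second = sender.sendable()         # window is full, so the sender must wait
sender.on_ack(acked=4, window=3)   # receiver catches up but shrinks the window
third = sender.sendable()
```

    The empty second send is the "responsibility of the sender to wait" in action: nothing more goes out until an ACK opens the window again.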

    Putting it all together

    As mentioned, there is a lot more to TCP than the above five features. However, these features are the core enablers of reliable delivery and, when combined and adhered to, allow applications to be built on top of TCP without having to worry about data integrity. Whilst TCP provides data integrity, it can of course still fail. For example, if data is corrupted in transit, there is only so much the protocol can do to resolve the problem, and if the underlying IP transport is unavailable for too long, the established connection may be unexpectedly disconnected.

    What TCP promises though is:

    • When a connection is active and working, then the data received is guaranteed to be in the order it was sent and without any missing packets
    • When there is a problem that cannot be fixed by the TCP protocol, the application using the TCP transport receives an error event indicating that there has been a loss of continuity and the developer must now take action to fix that problem i.e. most likely establish a new connection.

    So for a developer, this provides a few things:

    • A transport you can rely on
    • No need to additionally check data integrity thus simplifying your app
    • A uniform way to deal with failures with the onus on the developer to resolve failures i.e. reconnect
    • A guarantee that things will never fail silently. This is my biggest bugbear of all — not knowing

    This is why TCP is great, and this is why TCP is ubiquitous as an IP protocol. Developers can get on with their job without worrying about lower level network issues.


    Exploring the Ably protocol vs TCP

    Now that we’ve had a very high level look at TCP, I want to talk you through some of the design decisions we made at Ably with our realtime protocol.

    It’s important first to note that our needs are of course different from TCP’s. For example, we primarily operate on a pub/sub fan-out basis, publishers and subscribers are intentionally decoupled, and as such ACKs in both directions would defeat the purpose of the pub/sub decoupling. However, comparing the two is useful and, for those who are interested, will help you understand how we’ve managed to offer reliable realtime data delivery at scale.

    Data ordering — heavily inspired by TCP

    Like TCP, each protocol message (packet in TCP terms) is assigned a unique incrementing serial number. Ably’s ProtocolMessage type is an envelope that typically wraps underlying messages or presence events sent to Ably and provides, amongst other things, a means for the client to specify a serial number and a channel scope.

    The Ably service ensures an ACK is sent to clients for protocol messages published based on the serial number. Like TCP, we support cumulative ACKs as an optimisation i.e. we can ACK multiple messages at once by simply skipping ahead to the most recent published serial number.
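
    On the publishing side, this can be pictured as a queue of messages awaiting confirmation. The sketch below is illustrative only and is not taken from an Ably client library: the `PendingPublishes` class is hypothetical, and it assumes an ACK simply names the most recent serial number being confirmed.

```python
class PendingPublishes:
    def __init__(self):
        self.next_serial = 0
        self.pending = {}   # serial number -> message still awaiting an ACK

    def publish(self, message):
        """Assign the next serial number and remember the message until ACKed."""
        serial = self.next_serial
        self.pending[serial] = message
        self.next_serial += 1
        return serial

    def on_ack(self, up_to_serial):
        """Cumulative ACK: confirm everything up to and including this serial."""
        confirmed = [s for s in self.pending if s <= up_to_serial]
        for s in confirmed:
            del self.pending[s]
        return confirmed


publisher = PendingPublishes()
for text in ["a", "b", "c"]:
    publisher.publish(text)

confirmed = publisher.on_ack(1)   # one ACK covers serials 0 and 1
```

    After the ACK, only serial 2 is still pending. Anything left in this queue when a connection drops is what the client would need to send again.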

    Subscribers to messages always receive the messages in the order they were published — our service is responsible for ensuring the order of messages delivered to receivers is in the order the messages were published.

    Error detection — not entirely relevant

    Because we are able to rely on TCP to detect errors in the byte stream, very little error detection is required by Ably for an established connection. The only error detection we offer is for protocol failures that immediately result in the connection being closed. For example, if a client library was used that did not adhere to our protocol specification, then the integrity of that connection is immediately undermined and at that point we have no choice but to close the connection.

    Data de-duplication — conceptually similar

    There are two scenarios where data de-duplication is effectively performed:

    • When a message is published from a client to a server, the server inspects the serial number and if already received, it will discard the message. This scenario is expected when a message is published and received by Ably and the connection is abruptly terminated before the ACK is received. At this point when the connection is resumed (see connection state recovery for more info on resumes), the client library will correctly reattempt delivery of the message that was awaiting an ACK. When the server then receives a duplicate message, it will simply discard the message and resend the ACK that was previously not received.
    • When a client is abruptly disconnected, such as when changing networks from 3G to Wi-Fi, our client library specification requires that the client resumes the connection (using the private connection key) with the last known connection serial number. Whilst technically no de-duplication is done, the connection serial number ensures that the Ably service sends on the resumed connection only the messages that were not previously received. Note that unlike TCP, it is the responsibility of the client to present the last known connection serial number when reconnecting. As such, there is no need for the client to ACK received messages.
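
    Both scenarios can be sketched with a toy server. This is a heavily simplified model, not Ably's implementation: the `ToyServer` class is hypothetical, a single in-memory log stands in for the service, and the connection serial is assumed to be a position in that log.

```python
class ToyServer:
    def __init__(self):
        self.last_serial_by_client = {}   # publisher id -> highest serial accepted
        self.log = []                     # accepted messages, in publish order

    def publish(self, client_id, msg_serial, message):
        """Accept a publish and return the serial being ACKed."""
        last = self.last_serial_by_client.get(client_id, -1)
        if msg_serial > last:
            self.log.append(message)
            self.last_serial_by_client[client_id] = msg_serial
        # A duplicate is discarded, but the ACK is still sent again.
        return msg_serial

    def resume(self, last_connection_serial):
        """Return only the messages after the last one the subscriber saw."""
        return self.log[last_connection_serial + 1:]


server = ToyServer()
server.publish("publisher-1", 0, "a")
server.publish("publisher-1", 1, "b")

# Scenario 1: the ACK for serial 1 was lost, so the publisher retries.
retry_ack = server.publish("publisher-1", 1, "b")

# Scenario 2: a subscriber last saw connection serial 0 before dropping.
missed = server.resume(0)
```

    The retry is ACKed without adding a second copy to the log, and the resuming subscriber is sent only what it has not yet seen, without ever having ACKed anything itself.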

    Resending lost packets

    As our connections use a reliable TCP transport, the only time we need to perform the equivalent of resending packets is when a connection is abruptly disconnected and later resumed. At the point of resume, the Ably protocol ensures that all messages not already ACK’d in the client are resent to Ably, and Ably ensures that all messages received on all channels since the client last received a message are redelivered.

    If it is not possible to resume a connection or resend messages on a channel (for example due to the connection being disconnected for too long or the queue being too large), then the client is notified of this problem through the affected channels indicating loss of continuity. In our 0.8 spec this is achieved with a DETACHED event; in our 0.9 spec this is achieved with a state change event with a “false” resumed flag. Find out more about our upcoming 0.9 spec.
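
    The important property is that a failed resume is reported rather than hidden. The sketch below shows one way that could look, assuming a bounded message history on the server. The `BoundedHistory` class and its return values are hypothetical; only the idea of a resumed flag comes from the description above.

```python
from collections import deque


class BoundedHistory:
    def __init__(self, max_messages):
        # Oldest entries are dropped automatically once the queue is full.
        self.messages = deque(maxlen=max_messages)   # (serial, message) pairs
        self.next_serial = 0

    def append(self, message):
        self.messages.append((self.next_serial, message))
        self.next_serial += 1

    def resume(self, last_serial):
        """Return (resumed, missed). resumed is False if continuity was lost."""
        oldest = self.messages[0][0] if self.messages else self.next_serial
        if last_serial + 1 < oldest:
            return False, []   # a gap exists, so the client must be told
        missed = [m for s, m in self.messages if s > last_serial]
        return True, missed


history = BoundedHistory(max_messages=3)
for text in ["a", "b", "c", "d", "e"]:
    history.append(text)       # only c, d and e (serials 2 to 4) are retained

ok = history.resume(3)         # the client only missed "e"
lost = history.resume(0)       # serial 1 is gone, so continuity is broken
```

    The second client does not silently receive a partial stream. It gets an explicit signal and can decide how to recover, for example by reattaching and fetching fresh state.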

    Flow and congestion control

    Currently Ably only provides flow and some congestion control server-side. This is achieved through rate limiting and the ability to detach channels when the back-pressure is too great.

    We are considering ways to allow clients to manage their back-pressure and message queues in the next version of our specification which we expect to be 1.0.

    How does this compare to TCP?

    Ably provides similar guarantees to TCP, albeit at a higher level. Our protocol promises the following:

    • “When a connection is active and working, then the data received is guaranteed to be in the order it was sent and without any missing packets”.
      This is exactly what TCP promises.
    • “When there is a problem that cannot be fixed by the Ably protocol, the application using the Ably transport receives an error event indicating that there has been a loss of continuity and the developer must now take action to fix that problem i.e. most likely re-establish a new connection / re-attach a channel.”
      This is the equivalent of what TCP promises.

    So, just like TCP, the Ably protocol and platform provide developers with a few things:

    • A transport you can rely on
    • No need to additionally check data integrity thus simplifying your app
    • A uniform way to deal with failures with the onus on the developer to resolve failures i.e. reconnect or reattach
    • A guarantee that things will never fail silently. My biggest bugbear!

    This is why I believe the Ably protocol is great, and this is also why I hope that Ably will become ubiquitous as the definitive realtime protocol for people who care about deterministic behaviour and data continuity. Like TCP, developers can get on with their job without worrying about the lower level realtime data delivery issues and nuances.


    I believe that it may take a while for developers to realise that today’s accepted norm, where realtime data can be lost, messages may be delivered in the wrong order, and failures can be silent, is simply no longer good enough.

    We spent nearly 3.5 years getting things right at Ably before we came to market, because we believe developers deserve a realtime transport that they can rely on, like they do with TCP.

    I promise that good enough is not something you’ll ever hear from me or the team at Ably. Instead we aim to do things right.

    I hope you found this article useful. Please get in touch if you have any questions or feedback on this article.

    Matthew O’Riordan, CEO, Ably
    Kieran Kilbride-Singh

    Writer + marketer with enough technical know-how to be dangerous in GitHub repos. He's been writing about tech for five years, first flexing his fingers on topics like interoperability in IoT devices.