Converting a distributed system to FIFO and exactly-once message processing requires considerable user effort. This article looks at why this is the case, serving as a guide about what to bear in mind when implementing a FIFO and exactly-once message processing scenario using SQS FIFO queues.
Between the lines of SQS FIFO Queues
Reading Amazon's marketing around SQS FIFO queues, we're led to believe this provides exactly once and FIFO order processing as an out-of-the-box solution. Correspondingly one could easily believe that replacing any SQS queues in an existing distributed system with SQS FIFO queues would be the only change needed to convert the system to FIFO and exactly-once processing. It would indeed be really nice to believe Amazon engineers had done all the heavy lifting and this difficult functionality was wholly available at the flip of a switch.
The reality is somewhat more complicated. While Amazon delivers their side of the deal, it seems pretty impossible for Amazon to complete exactly what it says on the tin without user input. What does this user input entail?
What is FIFO Message Ordering?
First it's useful to break down the function of FIFO order processing. A FIFO SQS queue guarantees messages will be served in the same order in which they were received. (Alternatively, it could group messages according to their Group ID field and guarantee ordered delivery in the group, but not between different groups. Each group of messages could then be treated as FIFO queue in itself for the purposes of analysis that follows) . That, however, does not guarantee a FIFO delivery for the distributed system using the FIFO SQS queue, if we define the system to encompass the producer/sender(s) as well as the consumer/receiver(s) of the messages. For example, in the simple case of a single producer and a single consumer the system would look like:
What nuances should we know about?
It is indeed guaranteed that a message that gets in the SQS FIFO queue will not get served while there are older messages still in the queue. That, however, does not mean that the consumer/receiver will get the messages in the same order the producer/sender submitted them to the SQS FIFO queue. For this to be guaranteed two additional assumptions need to be satisfied:
- The queue must get messages in the same order they were produced by the producer/sender
- The consumer/receiver must get messages in the same order they come out of the SQS FIFO queue.
In order for these assumptions to be satisfied we need a single, synchronous producer/sender and a single, synchronous consumer/receiver transmitting data over a synchronous transport layer.
Adding asynchronicity to any part of the system removes the ordering guarantee for that part. The result is the ordering guarantee for the entire system is then removed as well. Looking into this in more detail - it is fairly obvious how an asynchronous transport layer could break message ordering. It is also to a large degree obvious that an asynchronous sender cannot guarantee the order of messages sent. An asynchronous receiver/consumer could break the FIFO order of messages under the assumption that it needs to complete processing each of them in FIFO order. Last but not least, having multiple senders or receivers virtually amounts to having single-instance, asynchronous ones, i.e. it would also break the system-wide FIFO order of messages!
Exactly-Once Processing - Guaranteed?
Exactly-once processing would mean that any duplicated messages would not affect the state of the system, i.e. they will be recognised as duplicates and ignored. A SQS FIFO queue would require a unique message deduplication ID (or it would create one by hashing the message contents) together with each message and use that ID to discount duplicate messages. At first glance it seems that using an SQS FIFO queue alone as opposed to a non-FIFO one would be enough for the entire system to attain the exactly-once processing property. Taking a deeper look, however, some subtle caveats emerge meaning exactly-once processing is not guaranteed all the time, under any conditions.
There are two limits introduced by the SQS FIFO queue implementation that make it impossible to guarantee exactly-once processing in all possible cases.
The first one is the max 5 minutes timeout on storing a given message deduplication ID. So if a duplicated message arrives more than 5 minutes after the original, the SQS FIFO queue would accept it and continue to process it. If exactly-once were a mission-critical requirement, other parts of the distributed system would have to take care of guaranteeing message uniqueness.
The second limit of the SQS FIFO queue implementation comes from the kind of “exactly-once” guarantee it gives to the consumer/receiver. It does not guarantee exactly-once delivery but it does guarantee exactly-once processing. What this means is that the SQS FIFO queue has to receive acknowledgement from the consumer that a message is processed before it stops returning it on future message requests. In other words the message might actually get delivered more than once. The way this works in SQS FIFO queue implementation is by keeping the message inaccessible to other consumers for a certain period of time after it has been delivered to a consumer. This gives the consumer who received it the chance to process it and delete it from the queue. (Deleting the message from the queue serves as acknowledgement the message has been processed.) This period of time, known as visibility timeout, can last for up to 12 hours. The assumption is that this period will be long enough to allow for receiving and processing the message and sending a delete request back to the queue. This, however, is just an assumption and it might not hold in all cases. So again, if exactly-once processing is mission-critical across the system, the fact you’re using SQS FIFO queue is not itself a guarantee. For exactly-once processing to come as guaranteed, the consumer of the messages will have to run its own independent bookkeeping.
Combining exactly-once processing with FIFO has other potential negative impacts on performance for the SQS FIFO queues. To guarantee FIFO ordering no subsequent message will be served until the previous one has been acknowledged as processed or its visibility timeout expires. (An SQS FIFO queue will deliver a group of messages instead of a single one but that has only a minimal impact - the next group will not be delivered until the queue knows the last one to be delivered has been processed by the consumer.) It’s easy to see how the system throughput would rapidly deteriorate in scenarios with subpar connection quality and large visibility timeouts.
It is evident that SQS FIFO queues guarantee in-order and exactly-once processing only within the queue itself, not in the entire distributed system that would use the queue. An SQS FIFO queue is simply a more convenient building block to use in a distributed system aiming at system-wide FIFO message delivery and exactly-once processing and nothing more than that.
Taking a step back - in the distributed computing field there exists the Total Order Broadcast problem. The Total Order Broadcast specifies that messages have to be delivered to all participants in any order, so long as it is the same for everybody. FIFO delivery of messages in a distributed system is actually a custom case of the problem with additional restriction imposed on the order. Research on the Total Order Broadcast problem aids our understanding of the limitations of possible solutions to the FIFO delivery problem.
Simply put, the Total Order Broadcast problem is equivalent to the problem of achieving distributed consensus, i.e. having all participants in a distributed system agree on message delivery order. The distributed consensus problem has proven to be impossible to solve in the general case, as described in a groundbreaking paper known as “the FLP result”.
As a consequence, the existing distributed consensus / total order broadcast algorithms that guarantee successful completion in all possible cases impose some restrictions on the distributed system they run on. This in turn limits the number of possible cases in which they can be applied. Lifting the restrictions means the algorithm will fail in some cases (which does not necessarily render it unusable in practice). Not surprisingly, the theory is demonstrated throughout our discussion so far. To guarantee FIFO/exactly-once delivery, we continued introducing various restrictions on the distributed system e.g. synchronous transport layer, transactions on the consumer side, etc.
To conclude - the problem of FIFO/exactly-once delivery in a distributed system that we hoped might be solved using SQS FIFO queues alone turns out to be much harder than simple, marketing-oriented definitions would suggest.
While an SQS FIFO queue is certainly a very useful tool in building a solution to this problem, it does not constitute a solution by itself. Guaranteeing exactly-once publishing outside the queue is a problem that frequently falls to of distributed system developers to solve.
Further reading about problems typically encountered when ensuring against duplication across distributed systems - and how we solved them - is available to read in our Deep Dive into Implementing Idempotency across distributed systems.
Ably Realtime provides cloud infrastructure and APIs to help developers simplify complex realtime engineering. We make it easy to power and scale realtime features in apps, or distribute data streams to third-party developers as realtime APIs.