Split

Split is the feature delivery platform that pairs the speed and reliability of feature flags with data to measure the impact of every feature. With Split, organizations like Twilio, Salesforce, and WePay have a secure way to release features, target them to customers, and measure the impact of features on their customer experience metrics. When Split decided to invest in a new streaming architecture with stringent latency and fault tolerance requirements, they chose Ably.

Features used:

  • Presence

  • Channel Metadata API for Occupancy Events

  • Server-Sent Events (SSE)

  • Active Traffic Management for proactive routing

  • Custom CNAME to whitelist Ably for Split’s client SDKs

  • Batch Publishing API

Split serves up features to tens of millions of client apps, sending over one trillion feature flag events per month. Delivering these flags with lightning speed and reliability is critical to prevent any negative impact on user experience. Until recently, Split relied on a stable and simple polling architecture to propagate all feature flag changes.

At an implementation level, Split provides SDKs that automatically retrieve feature flag state based on the environment and segment(s) each client belongs to. Using HTTP polling, the SDKs check for changes to this state and to the segment(s) a client is assigned to. Split caches the responses in a CDN, using the CDN’s network of PoPs (Points of Presence) to reduce latency between the cached asset and Split’s clients (their SDKs). This works well and keeps things simple, as the CDN doesn’t need to handle client state.
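
As a rough illustration of the polling model described above, the sketch below shows a client that fetches flag state and applies it only when the state has changed. The names (`FlagPoller`, `fetch_state`, the `version` field) are hypothetical and are not Split’s SDK API; `fetch_state` stands in for an HTTP request answered from a CDN cache.

```python
from typing import Callable, Dict, Optional


class FlagPoller:
    """Hypothetical polling client; not Split's actual SDK."""

    def __init__(self, fetch_state: Callable[[], Dict]) -> None:
        # fetch_state stands in for an HTTP GET served from a CDN cache.
        self._fetch_state = fetch_state
        self._version: Optional[int] = None
        self.flags: Dict[str, bool] = {}

    def poll_once(self) -> bool:
        """Fetch the current state; return True if the local flags changed."""
        state = self._fetch_state()
        if state["version"] == self._version:
            return False  # nothing new since the last poll
        self._version = state["version"]
        self.flags = dict(state["flags"])
        return True


# Simulated responses for three consecutive polls (illustrative data only).
responses = iter([
    {"version": 1, "flags": {"new_checkout": True}},
    {"version": 1, "flags": {"new_checkout": True}},
    {"version": 2, "flags": {"new_checkout": False}},
])
poller = FlagPoller(lambda: next(responses))
results = [poller.poll_once() for _ in range(3)]
print(results, poller.flags)
```

In this sketch the second poll does no useful work, and the change made before the third poll is only noticed once that poll runs. Because the client keeps its own version marker, the server side stays stateless.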

At the same time as providing a stable platform, Split continuously works to improve its architecture while honoring customer feedback. As Split has served more traffic over the years and customer needs have matured, it has become clear that immediately propagating changes to SDKs is extremely important. Doing this effectively means feature flag changes must reach SDKs in under a second, which requires a shift from a polling-based architecture to a dual polling/streaming architecture for a few core reasons:

  • Polling has a five-second lag, whereas an event-based streaming model is near-instantaneous, with flags arriving in under a second. For Split customers where speed is critical, such as in banking, reverting an issue in milliseconds rather than seconds makes a huge difference.

  • Polling can be inefficient. Feature flag triggers can range from just a few per day to more than 600, usually during local business hours. An event-based model is much more resource-efficient and cost-effective as it pushes flags only when a change occurs.

  • Frequent polling, at 85,000 requests per second between clients and the CDN, is a heavy and costly infrastructure burden. With the right platform, a streaming architecture is much more efficient.
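
To make the first and third points concrete, here is a back-of-the-envelope calculation. It assumes the five-second figure is the polling interval and that a change is equally likely at any moment within that interval. It ignores network and cache time, and the function names are illustrative only.

```python
from typing import Tuple

SECONDS_PER_DAY = 24 * 60 * 60


def polling_delay(interval_s: float) -> Tuple[float, float]:
    """Return (average, worst-case) wait before a poll observes a change.

    Assumes the change lands uniformly at random within the interval.
    """
    return interval_s / 2, interval_s


def requests_per_day(requests_per_second: int) -> int:
    """Total requests in a day at a constant request rate."""
    return requests_per_second * SECONDS_PER_DAY


avg_delay, worst_delay = polling_delay(5.0)   # assumed 5 s polling interval
daily_polls = requests_per_day(85_000)        # rate quoted in the text

print(f"average wait {avg_delay}s, worst case {worst_delay}s")
print(f"{daily_polls:,} polling requests per day")
```

Under these assumptions a change waits 2.5 seconds on average and up to 5 seconds before a client sees it, and 85,000 requests per second adds up to roughly 7.3 billion requests per day, almost all of which return nothing new.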

Driven by clear customer need and the limitations of a polling-based architecture, Split chose to invest in a streaming architecture powered by Ably that will become the default for its platform, with polling as a fallback.

Split’s stringent specifications

Split is a true technology company, born in the cloud and led by seasoned Silicon Valley engineers such as Pato Echagüe, Split’s CTO. He did what many engineers would do and initially explored building Split’s own realtime infrastructure:

“We spoke to engineers at companies like LinkedIn, Slack, and Box who’d already built this type of infrastructure themselves. Everyone told us it would take a significant amount of upfront engineering coupled with non-trivial operating costs.”

Pato knew the most challenging thing would be engineering and operating a performant and fault-tolerant realtime infrastructure layer as part of a mesh architecture:

“I just couldn’t imagine operating our own realtime infrastructure with our current DevOps resources while also delivering on all of our other ops requirements. Split is focused on delivering the best feature flag platform. We don’t want to distract ourselves by effectively getting into the realtime infrastructure business. It’s just not cost-effective.”

Pato and team recognized the engineering speed and operating cost benefits of adopting a realtime platform. They had a solid idea of what they needed:

  • Global sub-second latencies

  • Highly reliable infrastructure to support Split’s own rigorous commitments to reliability, including protocol fallback options

  • Ability to handle Split’s existing scale and effortlessly support rapid future growth - it took Split only a few months to grow from 500 million monthly events to one trillion monthly events

  • A feature-rich platform including support for Server-Sent Events (SSE) for current and future development needs

Ably was the only platform to provide the performance and dependability essential to Split while also offering a complete set of features, including existing support for SSE. After talking with Ably, Pato and his team quickly began testing.

“I was impressed by Ably’s level of specificity and openness when it came to specifications. Matthew, Ably’s CEO, pointed me to a GitHub repo with a history of changes. Other providers simply didn’t have that approach.”

The Ably Realtime Advantage:

  • A pub/sub messaging platform to build complete realtime functionality.

  • Predictable performance.

  • Integrity of data.

  • Reliability of infrastructure.

  • High availability of service.

A flexible, powerful platform that keeps architecture simple

Split’s streaming architecture needed to use the Server-Sent Events (SSE) protocol. SSE works over HTTPS and is consumed in browsers through the EventSource API, which for Split is an advantage over other realtime protocols like WebSockets as it keeps things lightweight. The realtime provider also needed to be able to fall back to polling if an SSE connection failed. Ably’s platform was able to fully support Split’s architecture requirements.
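
To show why SSE is considered lightweight, the sketch below parses the plain-text `text/event-stream` format and falls back to polling when the stream cannot be opened. It is a simplified illustration, not Split’s or Ably’s implementation: it handles only the `event` and `data` fields (the `id` and `retry` fields are ignored), and `updates_with_fallback`, `open_stream`, and `poll` are hypothetical names.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Event = Dict[str, str]


def parse_sse(lines: Iterable[str]) -> Iterator[Event]:
    """Yield events from the lines of a text/event-stream body (simplified)."""
    event_type, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield {"event": event_type, "data": "\n".join(data)}
            event_type, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space is not part of the value
            if field == "event":
                event_type = value
            elif field == "data":
                data.append(value)


def updates_with_fallback(
    open_stream: Callable[[], Iterable[str]],
    poll: Callable[[], List[Event]],
) -> Tuple[str, List[Event]]:
    """Prefer the SSE stream; use polling if the connection fails."""
    try:
        return "streaming", list(parse_sse(open_stream()))
    except ConnectionError:
        return "polling", poll()


sample = [
    ": keep-alive\n",
    "event: flag_update\n",
    'data: {"flag": "new_checkout", "on": false}\n',
    "\n",
]
events = list(parse_sse(sample))


def broken_stream() -> Iterable[str]:
    raise ConnectionError("SSE connection failed")


mode, fallback_events = updates_with_fallback(
    broken_stream, lambda: [{"event": "flag_update", "data": "polled"}]
)
print(events, mode)
```

A real client would keep the connection open and reconnect before giving up on streaming; the point here is only that the wire format is line-oriented text over a normal HTTP response, and that polling remains available as a second path.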

Global sub-second kill time

Split has an internal SLO that 99.99% of features need to be killed within one second. Any realtime provider must therefore enable all global feature flag changes to land consistently in under one second, ideally within 300ms. To benchmark Ably’s latencies against this SLO, Split ran several testing scenarios which measured:

  • Latencies from the time at which a feature flag (split) change was made

  • The time the push notification arrived

  • The time until the last piece of the message payload was received

Split ran this test several times from different locations, leaning on Ably’s global network of 15 datacenters and 205 edge acceleration Points of Presence (PoPs), to see how latency varies from one place to another. In all those scenarios, the push notifications arrived within a few hundred milliseconds, and the full message containing all the feature flag changes consistently arrived in under a second. This last measurement includes the time until the last byte of the payload arrives.
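
A check of this kind can be sketched as follows. The three timestamps mirror the measurements listed above; the `Measurement` type, the `slo_met` function, and all sample values are made up for illustration and are not Split’s test harness or results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Measurement:
    changed_at_ms: float    # when the feature flag change was made
    notified_at_ms: float   # when the push notification arrived
    last_byte_at_ms: float  # when the last byte of the payload arrived

    @property
    def notification_latency_ms(self) -> float:
        return self.notified_at_ms - self.changed_at_ms

    @property
    def full_latency_ms(self) -> float:
        return self.last_byte_at_ms - self.changed_at_ms


def slo_met(
    measurements: Sequence[Measurement],
    threshold_ms: float = 1000.0,
    target: float = 0.9999,
) -> bool:
    """True if at least `target` of changes fully arrive within the threshold."""
    if not measurements:
        raise ValueError("no measurements to evaluate")
    within = sum(1 for m in measurements if m.full_latency_ms <= threshold_ms)
    return within / len(measurements) >= target


# Illustrative data only: mostly fast deliveries plus a few slow outliers.
fast = Measurement(changed_at_ms=0.0, notified_at_ms=180.0, last_byte_at_ms=420.0)
slow = Measurement(changed_at_ms=0.0, notified_at_ms=300.0, last_byte_at_ms=1500.0)

passing_run = [fast] * 19_999 + [slow]      # 1 miss in 20,000
failing_run = [fast] * 9_998 + [slow] * 2   # 2 misses in 10,000

print(slo_met(passing_run), slo_met(failing_run))
```

The second run shows how tight a 99.99% target is: two slow deliveries in 10,000 are enough to miss it.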

Mission-critical reliability

Many critical apps rely on Split to safely roll out, or roll back, features. It was essential that Ably have adequate fault tolerance and SLAs to support Split’s own rigorous commitments to reliability and uptime. Reliability is one of the four core pillars that Ably is architected around. Our platform is fault tolerant at global and regional levels because we’ve designed around statistical risks of failure, building sufficient redundancy into our infrastructure to ensure continuity of service even in the face of multiple infrastructure failures. What really stood out to Pato was Ably’s commitment to global and regional redundancy:

“The fact that Ably had proactively thought of using multiple CDNs just like we do at Split was fantastic. It really reassured me that Ably took things seriously and that we could depend on its platform.”

Ability to handle rapid growth and huge scale with zero DevOps

Split’s rate of growth is immense: it went from 500 million monthly feature flag events to one trillion in the space of a few months. Their realtime provider needs to handle their current scale while also being able to effortlessly support rapid growth. We were quickly able to reassure Pato and Split that Ably can handle current and future scale requirements.

Ably sends billions of messages each day to millions of devices. Our platform is meticulously designed to be elastic and highly available, providing the uptime and scale required for stringent and demanding realtime requirements. A rigorous, mathematically grounded design means we can transparently share the operating boundaries we monitor to ensure capacity and therefore availability, which helped Split to understand the type of scale and elasticity possible with Ably.

Forming a strategic partnership for the long-term

Split looked at various realtime providers that could support their scale, growth rate, and commitment to reliability. Ably’s proven dependability made sense and stood out as an opportunity for a true strategic partnership. As Split grows, their realtime needs will mature. They’ll require additional functionality along with increased scale of infrastructure, all without sacrificing reliability.

“You miss so much by not using a platform like Ably. When you need to implement a new feature, the capabilities are there, ready to go. Or when you need to scale, the capacity is seamlessly available. There’s no need to even think about these things. Building on Ably was the only logical choice because we managed to bypass a hefty DevOps debt and rapidly ship our new streaming capabilities while keeping our architecture as simple and reliable as possible.”

Pato Echagüe

Chief Technical Officer / Split