Intro to Amazon Kinesis Data Streams

Copy link to clipboardKinesis services 

Data is all around us, whether it’s video, log files, e-commerce purchases, financial transactions, or in-game activities. This data provides insight about customers and applications, but its collection requires complex software and infrastructure which can be expensive to manage. AWS Kinesis is a service that helps gather, analyze, and process real-time video and data streams. It provides a way to react quickly once data has been processed, enabling new insights to be derived in seconds. Kinesis can ingest data such as video, audio, application logs, IoT telemetry and website clickstreams. The collected data can be used for analytics, machine learning, and other applications. It is a fully managed service that reduces the need to take care of the underlying infrastructure.

Kinesis has the following four services:

Kinesis Video Streams is a service that streams video from connected devices using the Kinesis Video Streams SDK. This can be used for machine learning, video playback, and other types of processing. It is a managed service that elastically scales the infrastructure as needed.

Kinesis Data Streams is a scalable and durable real-time data streaming service.  It can handle gigabytes of data per second from various sources, including financial transactions, social media feeds, logs, website clickstreams, database event streams, and location-tracking events. The data can be stored for a limited time enabling real-time analytics use cases, such as dashboard applications, dynamic pricing, and anomaly detection.

Kinesis Data Firehose is a service that loads streaming data into data lakes, data stores, and analytics services. It provides a way to capture, transform and deliver streaming data to other AWS services such as S3, Redshift, Elasticsearch Service, other service providers such as New Relic, MongoDB, Datadog, and Splunk, but also to generic HTTP endpoints.

Kinesis Data Analytics can provide actionable customer insights in real-time. It integrates the streamed data from services, such as Kinesis Data Streams, and makes it possible to filter, aggregate, and transform streaming data. Users can perform SQL queries using the collected data. The service provides templates and an interactive editor where users can build queries such as joins, aggregations over time windows, filters, etc. Java developers can also build streaming applications using open source libraries.

Copy link to clipboardTechnical overview of Kinesis Data Streams

Kinesis streams are divided into ordered shards. A shard is a horizontal partition of data in a database, where each` shard is placed on a separate database server instance in order to spread the load. Producers produce data, called records, and stream it to Kinesis, where it is divided into a shard. Consumers can then make use of the streamed data. By default, the data retention in Kinesis is 24 hours but can be changed to 7 days. Multiple apps can consume the same stream.

Kinesis Data Streams overview

One Kinesis Data Stream is made up of multiple shards, where the number of shards provisioned determines the billing. Producers create records of data and stream them to Kinesis. A record is a data blob: it is serialized as bytes up to 1MiB in size and can represent any kind of data. A record also contains a record key, which is used to group it into a specific shard. After being ingested into the stream, Kinesis adds a unique identifier for each record. The number of shards is unlimited. For each shard, all the records that are streamed to it are ordered.

Kinesis Data Streams Limits

  • Producers:

    • Each shard ingests up to 1 MiB/second and 1000 records/second, otherwise a ProvisionedThroughputException will be thrown.

  • Consumer:

    • Maximum total data read is 2MiB/second per shard.

    • 5 API calls/second per shard.

  • Data retention

    • By default 24 hours, extendable to 7 days.

Copy link to clipboardSending data to Kinesis Data Streams

There are several ways to send data to a stream. AWS provides SDKs for multiple different languages, each providing APIs for Kinesis Streams. There exist useful utilities that AWS has created for sending data to streams, such as Amazon Kinesis Agent and Amazon Kinesis Producer Library (KPL).

The KPL allows for higher write throughput to Kinesis streams. It’s a configurable library that’s installed on a host that sends data to streams. It collects records and writes them to multiple shards per request. It also has a configurable retry mechanism. In order to improve throughput, it can aggregate records to increase payload size. To monitor the performance of a producer, it can send metrics to Amazon CloudWatch.

Copy link to clipboardAdding records to a stream using the Kinesis API 

In Kinesis Data Streams, a record is a data structure containing a data blob. The API has two operations for adding records to a stream: PutRecords for sending multiple records in a single HTTP request and PutRecord for single records.

Because the APIs are exposed inside all AWS SDKs, available in many different programming languages, this method is the most flexible one. If one is unable to use the KPL or Kinesis Agent, for instance when sending data from a mobile application, using the API is the best choice. Using the API can also lower end-to-end latency.

Copy link to clipboardProcessing data in Kinesis Streams

A consumer is an application that reads and processes the data streamed to Kinesis. There are multiple ways of building a consumer, using Kinesis Analytics, Lambda, the Kinesis API, or the Kinesis Client Library (KCL).

The KCL can be used to create consumer applications for Kinesis Streams. It’s a Java library; support for other languages is provided via a multi-language interface. KCL takes care of load balancing across multiple instances, handling instance failures, reacting to resharding, and freeing the developer to focus on processing data from the stream.

Another way to process data streamed to Kinesis is using AWS Lambda, it runs code without managing or provisioning any servers. Lambda functions can subscribe to, read, and process batches of records that are sent to Kinesis Streams. Lambda polls the stream once per second, and when new records are found it invokes the Lambda function and passes the records as a parameter.

The downside of using Lambda is its statelessness, meaning that you can't make use of previous records easily when processing new records. Using KCL you can deploy the application to an EC2 server, where you have more control over things such as data persistence and state management.

Copy link to clipboardDownsides

Kinesis provides no access to fine-tuning of the service, as you don’t have access to the underlying OS, which is not the case when using a service such as Kafka. Therefore you can gain more performance when using Kafka instead of Kinesis. However, turning a Kafka solution into a production-ready environment is not easy for beginners.

Copy link to clipboardBenefits

Kinesis is a fully managed service. It is scalable and can handle a large amount of streaming data from hundreds of thousands of sources, with low latencies. A significant benefit of using Kinesis is that AWS takes care of the management of the service, including provisioning, cluster management, and failover. Kinesis reduces infrastructure issues, operational costs, human and machine costs.

Copy link to clipboardRead more