Kafka Streams is a library for building applications that process data stored in Kafka topics. Java Streams is a set of APIs for performing aggregate operations on collections of data.

Let’s see Kafka Streams in action. Imagine you have a Kafka topic named user-clicks with messages like this:

{"user_id": "user1", "page": "/home", "timestamp": 1678886400}
{"user_id": "user2", "page": "/products", "timestamp": 1678886405}
{"user_id": "user1", "page": "/about", "timestamp": 1678886410}
{"user_id": "user3", "page": "/home", "timestamp": 1678886415}
{"user_id": "user1", "page": "/contact", "timestamp": 1678886420}

You want to count how many times each user has visited a page within a 5-minute window. Here’s how you’d do it with Kafka Streams:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.WindowedSerdes;

import java.util.Properties;

public class UserClickCounter {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-click-counter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> rawClicks = builder.stream("user-clicks");

        KTable<String, Long> userClickCounts = rawClicks
            .groupByKey()
            .windowedBy(TimeWindows.ofSizeAndGrace(java.time.Duration.ofMinutes(5), java.time.Duration.ofMinutes(1)))
            .count();

        userClickCounts.toStream(WindowedSerdes.timeWindowedSerdeFromKey(Serdes.String(), java.time.Duration.ofMinutes(5)))
            .to("user-click-counts", org.apache.kafka.streams.kstream.Produced.with(WindowedSerdes.timeWindowedSerdeFromKey(Serdes.String(), java.time.Duration.ofMinutes(5)), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

This code defines a Kafka Streams application. It reads from the user-clicks topic, groups events by user ID, and then counts them within 5-minute tumbling windows with a 1-minute grace period. The results are written to a user-click-counts topic.

Java Streams, on the other hand, are for in-memory collections. If you had a List<ClickEvent> in your Java application, you’d process it like this:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class JavaStreamClickCounter {

    public static void main(String[] args) {
        List<ClickEvent> clickEvents = List.of(
            new ClickEvent("user1", "/home", 1678886400),
            new ClickEvent("user2", "/products", 1678886405),
            new ClickEvent("user1", "/about", 1678886410),
            new ClickEvent("user3", "/home", 1678886415),
            new ClickEvent("user1", "/contact", 1678886420)
        );

        Map<String, Long> userClickCounts = clickEvents.stream()
            .collect(Collectors.groupingBy(ClickEvent::getUserId, Collectors.counting()));

        userClickCounts.forEach((userId, count) ->
            System.out.println("User: " + userId + ", Count: " + count)
        );
    }

    static class ClickEvent {
        String userId;
        String page;
        long timestamp;

        public ClickEvent(String userId, String page, long timestamp) {
            this.userId = userId;
            this.page = page;
            this.timestamp = timestamp;
        }

        public String getUserId() {
            return userId;
        }
    }
}

This Java Streams code performs the same counting logic but operates on a static, in-memory list.

The core problem Kafka Streams solves is stateful stream processing over unbounded data. Unlike Java Streams which operate on finite, in-memory collections, Kafka Streams is designed to process continuous streams of data that never "end." It manages state (like the counts in our example) reliably and fault-tolerantly across multiple instances of your application. This means if an application instance crashes, another can take over seamlessly without losing data or state.

Internally, Kafka Streams uses Kafka’s partitioning and consumer group mechanisms. Each Kafka Streams application instance acts as a Kafka consumer, reading from one or more input topics. The library distributes the topic partitions among the running instances. For stateful operations (like aggregations), it uses local state stores (backed by RocksDB by default) which are continuously updated and backed up to Kafka changelog topics. This ensures that even if an instance fails, its state can be restored from the changelog.

The exact levers you control in Kafka Streams are primarily around:

  • Topology: This is the directed acyclic graph (DAG) of processors that defines your stream processing logic (e.g., stream(), filter(), groupByKey(), windowedBy(), join()).
  • Serdes (Serializer/Deserializer): How your data is converted to and from bytes for Kafka. You must configure these correctly for your message formats (JSON, Avro, Protobuf, etc.).
  • Windowing: For time-based operations, you define window types (tumbling, hopping, session) and their durations, as well as grace periods for late-arriving data.
  • State Stores: How state is managed, though defaults are usually sufficient.
  • Consumer Configuration: Standard Kafka consumer settings like auto.offset.reset and max.poll.records can impact processing behavior.

When you use groupByKey() in Kafka Streams, it doesn’t just group records in memory. Instead, it rebalances the data so that all records with the same key are routed to the same Kafka Streams application instance. This is crucial for stateful operations because that instance can then maintain a consistent, local state for that key. Without this guarantee, different instances would see different subsets of records for the same key, leading to inconsistent aggregated results.

The next concept you’ll likely encounter is joining streams in Kafka Streams, where you combine data from two different Kafka topics based on a common key.

Want structured learning?

Take the full Kafka-streams course →