Kafka Connect workers aren’t just passive conduits; they are stateful, fault-tolerant, and distributed systems designed to manage and execute your data integration tasks.

Let’s see a Kafka Connect worker in action, pulling data from a PostgreSQL database and pushing it to Kafka.

// POST /connectors
{
  "name": "postgres-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": 4,
    "database.hostname": "postgres.example.com",
    "database.port": 5432,
    "database.user": "debezium",
    "database.password": "dbpassword",
    "database.dbname": "mydatabase",
    "database.server.name": "my-pg-server",
    "topic.prefix": "dbserver1",
    "plugin.name": "pgoutput",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false"
  }
}

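This configuration tells Kafka Connect to run Debezium’s PostgreSQL connector against the mydatabase database on postgres.example.com:5432. Although tasks.max is set to 4, the Debezium PostgreSQL connector always runs as a single task, so only one task is created. The topic.prefix value does not change the events themselves; it prefixes the names of the Kafka topics they are written to, which follow the pattern dbserver1.<schema>.<table>. The unwrap transform then flattens each change event so that it carries only the new row state instead of the full before/after envelope, and setting drop.tombstones to false keeps the tombstone records that follow deletes.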
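
As a rough sketch of how such a configuration could be submitted, the Python snippet below builds the POST /connectors request with the standard library only. The worker address http://localhost:8083 and the helper names as_config_string and build_create_request are assumptions for illustration, and only a subset of the configuration is shown. The request is built but not sent.

```python
import json
import urllib.request


def as_config_string(value):
    """Render a Python value the way Connect configs usually spell it."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_create_request(base_url, name, config):
    """Build (but do not send) a POST /connectors request."""
    # Connector configs are string key/value pairs, so stringify values
    # up front instead of relying on server-side coercion.
    payload = {
        "name": name,
        "config": {key: as_config_string(val) for key, val in config.items()},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_request(
    "http://localhost:8083",  # assumed address of one Connect worker
    "postgres-source-connector",
    {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "tasks.max": 4,
        "database.port": 5432,
        "topic.prefix": "dbserver1",
    },
)
# To actually submit it: urllib.request.urlopen(request)
```

Any worker in the cluster can receive this request, since the configuration ends up in shared storage rather than on the worker that accepted it.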

The core problem Kafka Connect solves is decoupling data sources and sinks from your Kafka cluster. Instead of writing custom code for every producer and consumer, you configure pre-built connectors. These connectors run inside "worker" processes, which are separate from the Kafka brokers.

A Kafka Connect cluster, composed of multiple workers, provides scalability and fault tolerance. When you deploy a connector, Kafka Connect distributes its tasks across the available workers. If a worker fails, its tasks are automatically reassigned to other healthy workers. This is managed through Kafka itself: workers that share the same group.id form a group through Kafka’s group membership protocol, and they share state through internal Kafka topics named by three worker settings: config.storage.topic (connector and task configurations), offset.storage.topic (source connector offsets), and status.storage.topic (connector and task status).
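
The reassignment described above can be pictured with a toy model. The sketch below is not Connect’s actual assignment algorithm (which also tries to limit how many tasks move during a rebalance); it only shows the end result, with hypothetical worker and task names and a simple round-robin spread.

```python
def assign_tasks(task_ids, workers):
    """Spread tasks across live workers round-robin (toy model only)."""
    if not workers:
        raise ValueError("no live workers to run tasks")
    assignment = {worker: [] for worker in workers}
    for index, task_id in enumerate(sorted(task_ids)):
        assignment[workers[index % len(workers)]].append(task_id)
    return assignment


tasks = ["sink-0", "sink-1", "sink-2", "sink-3"]

# Three healthy workers share the four tasks.
before = assign_tasks(tasks, ["worker-a", "worker-b", "worker-c"])

# worker-b fails, so the two survivors take over all four tasks.
after = assign_tasks(tasks, ["worker-a", "worker-c"])
```

The point of the model is that no task is lost when a worker disappears: every task still has exactly one owner after the rebalance.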

The tasks.max setting caps parallelism. It is an upper bound, not a guarantee: the connector decides how many tasks to create, up to that limit, and some connectors always create just one. Each task runs in its own thread on a worker. For a source connector, the tasks poll the source system concurrently. For a sink connector, the tasks write to the target system concurrently. Each task operates independently, often processing a subset of the data (e.g., specific tables for a database source, or a subset of topic partitions for a sink).
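
To make the role of tasks.max concrete, here is a minimal sketch of how a hypothetical table-based source connector might split its work. In the real Java API a connector receives the limit through taskConfigs(int maxTasks) and returns one configuration map per task; the function name plan_task_configs and the "tables" key below are invented for illustration.

```python
def plan_task_configs(tables, max_tasks):
    """Split tables into at most max_tasks groups, one group per task."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    # Never create more tasks than there are units of work.
    num_tasks = min(max_tasks, len(tables))
    groups = [[] for _ in range(num_tasks)]
    for index, table in enumerate(tables):
        groups[index % num_tasks].append(table)
    return [{"tables": ",".join(group)} for group in groups]


tables = ["orders", "customers", "payments"]

# tasks.max = 4 but only three tables: three tasks are planned.
capped_by_work = plan_task_configs(tables, 4)

# tasks.max = 2: the three tables are packed into two tasks.
capped_by_limit = plan_task_configs(tables, 2)
```

Raising tasks.max beyond the number of independent units of work therefore adds no parallelism.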

You control the behavior of connectors through their configuration properties. These include connection details for the source/sink, the Kafka topics to use, data transformation logic, and error handling strategies. For example, transforms and transforms.<name>.type allow you to apply Single Message Transforms (SMTs) to modify records as they flow through Connect, like extracting specific fields or converting data formats.
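
Real SMTs are Java classes configured on the worker, but the effect of an unwrap-style transform is easy to picture in a few lines of Python. The function below is a simplified stand-in, not Debezium’s implementation: it ignores delete handling and metadata options, and the sample event is made up.

```python
def extract_new_record_state(envelope):
    """Toy stand-in for an unwrap-style transform on a change event value."""
    if envelope is None:
        return None  # tombstone record: pass through unchanged
    # Keep only the row state after the change; drop before/source/op.
    return envelope.get("after")


change_event = {
    "op": "u",
    "before": {"id": 1, "email": "old@example.com"},
    "after": {"id": 1, "email": "new@example.com"},
    "source": {"table": "customers"},
}

flattened = extract_new_record_state(change_event)
```

Downstream consumers then see a plain row-shaped record rather than the change envelope.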

Offset management is a critical piece of Kafka Connect’s resilience. For source connectors, Connect tracks the position in the source system (for example, a log position or file offset) up to which records have been published to Kafka, and stores it in the offsets topic. For sink connectors, it tracks the Kafka offsets of records that have been written to the target system, using ordinary consumer group offsets. In both cases the offsets are periodically committed back to Kafka, allowing a connector to resume from the correct point if it restarts or is rescheduled on a different worker.
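
The resume behavior can be sketched with a toy source task. Everything here is hypothetical (the class name, the "table:orders" partition key, the in-memory dict standing in for the offsets stored in Kafka); it only illustrates that a restarted task picks up from the last committed position.

```python
class ToySourceTask:
    """Reads rows from a list, starting at the last committed offset."""

    def __init__(self, rows, committed_offsets):
        self.rows = rows
        self.partition = "table:orders"  # hypothetical source partition key
        self.position = committed_offsets.get(self.partition, 0)

    def poll(self, batch_size):
        batch = self.rows[self.position:self.position + batch_size]
        self.position += len(batch)
        return batch


offset_store = {}  # stands in for the offsets kept in Kafka
rows = ["r0", "r1", "r2", "r3", "r4"]

first = ToySourceTask(rows, offset_store)
delivered = first.poll(3)
# Offsets are committed after the batch has been published.
offset_store[first.partition] = first.position

# The task is restarted, possibly on another worker, and resumes.
second = ToySourceTask(rows, offset_store)
resumed = second.poll(3)
```

If the first task had crashed after publishing but before the commit, the restarted task would read r0 to r2 again, which is why delivery is at-least-once by default.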

The most surprising aspect of Kafka Connect’s distributed mode is how it leverages Kafka itself for coordination and state management. The internal topics named by config.storage.topic, offset.storage.topic, and status.storage.topic are not just for logging; they are the shared, durable state of the entire Connect cluster, while group membership and task assignment go through Kafka’s group coordination protocol. When a worker starts, it reads the connector configurations from the config topic. While a task runs, its status changes are written to the status topic. This reliance on Kafka makes Connect scalable and fault-tolerant without needing a coordination service of its own, such as ZooKeeper.
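
The idea of a topic acting as shared state can be illustrated by replaying a keyed log: the latest value per key wins, and a null value deletes the key. The sketch below uses an in-memory list in place of a Kafka topic, and the keys and values are illustrative rather than Connect’s exact record format.

```python
def rebuild_state(log_records):
    """Replay keyed records in order; the last value per key wins."""
    state = {}
    for key, value in log_records:
        if value is None:
            state.pop(key, None)  # a null value removes the key
        else:
            state[key] = value
    return state


config_log = [
    ("connector-postgres-source-connector", {"tasks.max": "4"}),
    ("connector-old-sink", {"tasks.max": "1"}),
    ("connector-postgres-source-connector", {"tasks.max": "2"}),
    ("connector-old-sink", None),  # connector deleted
]

# What a newly started worker would see after reading the whole log.
current_configs = rebuild_state(config_log)
```

Because every worker replays the same log, they all converge on the same view of which connectors exist and how they are configured.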

Understanding how Connect handles failures and rebalances tasks is key to building robust data pipelines.
