Kafka Connect is a framework for streaming data between Apache Kafka and other systems. It’s designed to be extensible, allowing you to build custom connectors for any data source or sink you can imagine.
Let’s watch Kafka Connect in action. Imagine we have a PostgreSQL database and we want to stream all changes from a customers table into Kafka.
First, we need a PostgreSQL source connector. Here’s a snippet of a typical Connect configuration:
{
"name": "pg-source-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "postgres-host",
"database.port": "5432",
"database.user": "connect_user",
"database.password": "db_password",
"database.dbname": "mydatabase",
"table.include.list": "public.customers",
"topic.prefix": "dbserver1",
"plugin.name": "pgoutput"
}
}
When this connector is deployed to a Kafka Connect cluster, Debezium (a popular change data capture platform that provides the PostgresConnector) will connect to PostgreSQL. It uses the pgoutput logical decoding plugin to capture row-level changes (INSERT, UPDATE, DELETE) in real-time from the public.customers table. These changes are then published as individual Kafka messages to a topic named dbserver1.public.customers.
Now, let’s say we want to take that data from Kafka and write it into Elasticsearch. We’ll use an Elasticsearch sink connector:
{
"name": "es-sink-connector",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "1",
"topics": "dbserver1.public.customers",
"connection.url": "http://elasticsearch-host:9200",
"type.name": "customer_event",
"schema.enable": "false",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false"
}
}
This configuration tells Kafka Connect to consume messages from the dbserver1.public.customers Kafka topic. For each message, it will deserialize the JSON payload (using the specified converters) and index it as a document in an Elasticsearch index. By default, it might try to infer the index name from the topic name, or you could specify an index configuration property. The type.name is an older Elasticsearch concept, often mapped to the document type.
The mental model here is a data pipeline. Kafka Connect acts as the central orchestrator. You define what data to move (source connector) and where to put it (sink connector), and Connect handles the reliable, scalable, and fault-tolerant streaming of that data. It manages offsets, retries, and scaling out through its worker architecture.
Each connector has a connector.class which points to the Java class implementing the connector logic. tasks.max determines how many parallel tasks can run for that connector. For simple connectors or when dealing with single-partitioned topics, tasks.max: 1 is common. For high-throughput scenarios, you might increase this, provided the target system can handle parallel writes and the source can produce partitioned data.
Converters are crucial. They define how data is serialized and deserialized between Kafka’s raw bytes and Connect’s internal structured format. JsonConverter is common, but AvroConverter is preferred when using Kafka’s Schema Registry for robust schema evolution. Notice how schemas.enable: false is used with JsonConverter here; this means the JSON itself doesn’t contain schema information, which is typical for simple JSON payloads.
The real power comes from the ecosystem. Debezium for CDC, JDBC connectors for databases, S3 connectors for object storage, Elasticsearch, HDFS, file systems – the list is extensive. You can even write your own using the SourceConnector or SinkConnector base classes.
What most people don’t grasp is how Kafka Connect handles distributed state and fault tolerance. Each worker in a Connect cluster shares the responsibility of running connector tasks. If a worker fails, its tasks are automatically reassigned to other healthy workers. This is managed by Kafka itself: Connectors register their state (like which Kafka offsets they’ve processed) in Kafka topics, acting as a distributed, fault-tolerant log of progress. This ensures that no data is lost or processed twice, even during failures.
The next concept you’ll likely explore is managing Kafka Connect clusters, including deployment strategies, monitoring, and scaling them effectively for production workloads.