The Kafka Schema Registry’s primary job is to ensure that producers and consumers of Kafka messages agree on the structure of the data, even as that structure changes over time. This agreement is crucial for preventing data corruption and ensuring that applications can continue to process messages reliably.
Let’s see this in action. Imagine we have a simple Kafka topic, user-events, and we’re using Avro for serialization.
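First, we need to set up the Schema Registry. A common way to run it is using Docker. The registry stores its schemas in a Kafka topic, so the container must be told where to find a running broker. The command below assumes a broker reachable from the container at kafka:9092; that address is a placeholder, so adjust it to your setup:
docker run -p 8081:8081 \
-e SCHEMA_REGISTRY_HOST_NAME=schema-registry \
-e SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS=PLAINTEXT://kafka:9092 \
confluentinc/cp-schema-registry:latest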
Now, let’s define our initial Avro schema for a UserCreated event. This schema specifies the fields and their types.
{
"type": "record",
"name": "UserCreated",
"namespace": "com.example.users",
"fields": [
{"name": "user_id", "type": "long"},
{"name": "username", "type": "string"}
]
}
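We register this schema with the Schema Registry under the subject user-events-value. With the default naming strategy, the subject for a topic's message values is the topic name followed by -value. The Schema Registry assigns a schema ID to the schema and returns it in the response.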
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
--data '{"schema": "{\"type\": \"record\", \"name\": \"UserCreated\", \"namespace\": \"com.example.users\", \"fields\": [{\"name\": \"user_id\", \"type\": \"long\"}, {\"name\": \"username\", \"type\": \"string\"}]}"}' \
http://localhost:8081/subjects/user-events-value/versions
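The escaping in the curl command is easy to get wrong, because the Avro schema has to travel as a JSON string nested inside a JSON object. Here is a minimal Python sketch of the same request using only the standard library. The helper name build_register_request is hypothetical, and the URL assumes the registry is listening on localhost:8081 as in the Docker command earlier. The sketch builds the request without sending it.

```python
import json
import urllib.request

REGISTRY_URL = "http://localhost:8081"  # assumption: registry exposed on this port
SUBJECT = "user-events-value"

USER_CREATED_V1 = {
    "type": "record",
    "name": "UserCreated",
    "namespace": "com.example.users",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "username", "type": "string"},
    ],
}


def build_register_request(schema: dict, subject: str = SUBJECT) -> urllib.request.Request:
    """Build (but do not send) the POST that registers a schema version."""
    # The schema is serialized twice: once into a string, then again as the
    # value of the "schema" key in the request body.
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{REGISTRY_URL}/subjects/{subject}/versions",
        data=body,
        method="POST",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    )


request = build_register_request(USER_CREATED_V1)

# To send it against a running registry, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Letting json.dumps do the double encoding avoids the hand-written backslashes entirely.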
A producer application, configured to use the Schema Registry, will serialize messages using this schema and embed the schema ID in each message. A consumer application, also configured with the Schema Registry, will read that ID, fetch the matching schema from the registry (caching it locally), and deserialize the messages.
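To make the role of the schema ID concrete: Confluent's serializers prefix every message with a zero "magic byte" and the schema ID as a 4-byte big-endian integer, followed by the Avro-encoded payload. The sketch below shows only that framing. The function names frame and unframe are hypothetical, and the payload is treated as opaque bytes rather than produced by a real Avro encoder.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0  # first byte of every message in the Confluent wire format
HEADER_SIZE = 5  # 1 magic byte + 4-byte schema ID


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an already-encoded Avro payload with the magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes) -> Tuple[int, bytes]:
    """Split a framed message into (schema_id, avro_payload)."""
    if len(message) < HEADER_SIZE:
        raise ValueError("message too short to contain a schema ID header")
    magic, schema_id = struct.unpack(">bI", message[:HEADER_SIZE])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[HEADER_SIZE:]


# Placeholder payload bytes; a real producer would get these from an Avro encoder.
framed = frame(42, b"\x02\x06bob")
schema_id, payload = unframe(framed)
```

A consumer only needs those first five bytes to know which schema to request from the registry.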
Now, suppose we want to add a new field, email, to our UserCreated event. The key concept here is schema evolution. Avro defines rules for how schemas can change while maintaining backward and forward compatibility.
We define a new version of the schema:
{
"type": "record",
"name": "UserCreated",
"namespace": "com.example.users",
"fields": [
{"name": "user_id", "type": "long"},
{"name": "username", "type": "string"},
{"name": "email", "type": "string", "default": "no_email@example.com"}
]
}
Notice the default value for the new email field. This is crucial for backward compatibility. A consumer that has upgraded to the new schema may still encounter messages written with the old one, which carry no email. Because the field has a default value, Avro fills it in during deserialization and the consumer can process the message without error. The reverse case also works: a consumer still using the older schema (without email) simply ignores the extra field in messages produced with the new schema.
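Here is a deliberately simplified model of how Avro resolves a record between the writer's schema and the reader's schema. It works on plain Python dicts instead of Avro's binary encoding, handles only flat records, and the function name resolve_record is hypothetical. It is meant to illustrate the rule, not to replace an Avro library.

```python
V1_FIELDS = [
    {"name": "user_id", "type": "long"},
    {"name": "username", "type": "string"},
]
V2_FIELDS = V1_FIELDS + [
    {"name": "email", "type": "string", "default": "no_email@example.com"},
]


def resolve_record(written: dict, reader_fields: list) -> dict:
    """Project a written record onto the fields the reader expects."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in written:
            # Field present in the written data: take it as-is.
            resolved[name] = written[name]
        elif "default" in field:
            # Field missing from the written data: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields the reader does not know about are silently dropped.
    return resolved


old_message = {"user_id": 1, "username": "ada"}
new_message = {"user_id": 2, "username": "grace", "email": "grace@example.com"}

upgraded_consumer_view = resolve_record(old_message, V2_FIELDS)
legacy_consumer_view = resolve_record(new_message, V1_FIELDS)
```

The upgraded consumer sees the default email on the old message, while the legacy consumer sees the new message without the email field.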
We register this new schema:
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
--data '{"schema": "{\"type\": \"record\", \"name\": \"UserCreated\", \"namespace\": \"com.example.users\", \"fields\": [{\"name\": \"user_id\", \"type\": \"long\"}, {\"name\": \"username\", \"type\": \"string\"}, {\"name\": \"email\", \"type\": \"string\", \"default\": \"no_email@example.com\"}]}"}' \
http://localhost:8081/subjects/user-events-value/versions
The Schema Registry will validate this new schema against the previous one and, if compatible, assign it a new ID.
The Schema Registry’s power lies in its compatibility checks. By default, it enforces BACKWARD compatibility. This means consumers using the new version of a schema must be able to read data written with the previous version. If you try to register a schema that violates this rule (e.g., adding a field without a default), the registration will fail with an HTTP 409 error.
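As a rough illustration of what a BACKWARD check looks for, the sketch below compares the field lists of two flat record schemas. Under BACKWARD, the new schema acts as the reader and the previous schema as the writer, so every field the new schema adds needs a default. This is a simplified stand-in for the registry's real checker: it ignores type changes, aliases, and nested records, and the function name fields_breaking_backward is hypothetical.

```python
def record(*fields: dict) -> dict:
    """Small helper to build a UserCreated record schema from a list of fields."""
    return {
        "type": "record",
        "name": "UserCreated",
        "namespace": "com.example.users",
        "fields": list(fields),
    }


USER_ID = {"name": "user_id", "type": "long"}
USERNAME = {"name": "username", "type": "string"}

V1 = record(USER_ID, USERNAME)
V2_WITH_DEFAULT = record(
    USER_ID, USERNAME,
    {"name": "email", "type": "string", "default": "no_email@example.com"},
)
V2_NO_DEFAULT = record(USER_ID, USERNAME, {"name": "email", "type": "string"})


def fields_breaking_backward(new_schema: dict, old_schema: dict) -> list:
    """Return names of fields that stop the new schema from reading old data."""
    old_names = {field["name"] for field in old_schema["fields"]}
    return [
        field["name"]
        for field in new_schema["fields"]
        # A field absent from old data can only be filled in from a default.
        if field["name"] not in old_names and "default" not in field
    ]


accepted = fields_breaking_backward(V2_WITH_DEFAULT, V1)
rejected = fields_breaking_backward(V2_NO_DEFAULT, V1)
removal = fields_breaking_backward(V1, V2_WITH_DEFAULT)
```

Note that removing a field passes this check: a reader that no longer declares the field just skips it in older data.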
You can configure the compatibility level globally or per subject. For instance, to enforce FORWARD compatibility (consumers using the previous schema can read messages written with the new one) on our subject, you’d update the subject’s configuration:
curl -X PUT -H "Content-Type: application/vnd.schemaregistry.v1+json" \
--data '{"compatibility": "FORWARD"}' \
http://localhost:8081/config/user-events-value
The Schema Registry acts as a central authority, preventing incompatible schema changes from being deployed. It allows you to gradually roll out new schema versions, giving consumers time to adapt.
One of the most powerful, yet often overlooked, aspects of Avro schema evolution is the ability to change the type of a field, provided it’s done carefully. For example, you can widen a string to a union of string and null to make it nullable, or change an int to a long for wider range support. Both changes work in one direction only: a reader using the new schema can read data written with the old one, but not the other way around, so they pass a BACKWARD check and fail a FORWARD or FULL one. This is possible because Avro’s schema resolution rules define a fixed set of type promotions, and the Schema Registry validates type changes against them.
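The promotion rules can be sketched in a few lines. The table below lists the primitive promotions from the Avro specification's schema resolution section; the function name reader_accepts is hypothetical, and the sketch covers only primitive type names and unions written as lists, not records, enums, arrays, or maps.

```python
# Writer type -> reader types it may be promoted to (per the Avro specification).
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def reader_accepts(writer_type, reader_type) -> bool:
    """Can data written as writer_type be read as reader_type?"""
    if isinstance(writer_type, list):
        # Writer union: any branch may appear in the data, so all must be readable.
        return all(reader_accepts(branch, reader_type) for branch in writer_type)
    if isinstance(reader_type, list):
        # Reader union: one matching branch is enough.
        return any(reader_accepts(writer_type, branch) for branch in reader_type)
    return writer_type == reader_type or reader_type in PROMOTIONS.get(writer_type, set())


int_to_long = reader_accepts("int", "long")
long_to_int = reader_accepts("long", "int")
string_to_nullable = reader_accepts("string", ["null", "string"])
nullable_to_string = reader_accepts(["null", "string"], "string")
```

Swapping the arguments is what separates a BACKWARD check (new schema as reader) from a FORWARD check (old schema as reader), which is why widening a type passes one and fails the other.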
The next hurdle in managing schemas is often dealing with multiple topics and complex relationships between different event types.