Neo4j Change Data Capture (CDC) lets you stream changes made to your graph database in real time, so downstream systems can react to evolving data without constant polling.
Here’s a Neo4j CDC stream in action, showing a node being created and a relationship being added, as seen by a Kafka consumer:
{
  "before": null,
  "after": {
    "id": "87654",
    "labels": ["Person"],
    "properties": {
      "name": "Alice"
    }
  },
  "type": "NODE_CREATED",
  "timestamp": "2023-10-27T10:00:00Z",
  "source": {
    "db": "neo4j",
    "collection": "Person"
  }
}
{
  "before": {
    "id": "12345",
    "labels": ["Person"],
    "properties": {
      "name": "Bob"
    }
  },
  "after": {
    "id": "12345",
    "labels": ["Person"],
    "properties": {
      "name": "Bob"
    }
  },
  "type": "RELATIONSHIP_CREATED",
  "timestamp": "2023-10-27T10:01:15Z",
  "source": {
    "db": "neo4j",
    "collection": "FRIENDS"
  },
  "relationship": {
    "id": "98765",
    "type": "FRIENDS",
    "startNode": "87654",
    "endNode": "12345"
  }
}
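To make the payload shape concrete, here is a minimal Python sketch of how a consumer might parse these events and branch on the type field. It uses only the standard library and feeds in hard-coded strings instead of a real Kafka client. The describe helper is hypothetical, and the field names are taken from the sample payloads above, not from a verified schema.

```python
import json

# Two events shaped like the sample payloads above (trimmed for brevity).
RAW_EVENTS = [
    '{"before": null, "after": {"id": "87654", "labels": ["Person"], '
    '"properties": {"name": "Alice"}}, "type": "NODE_CREATED"}',
    '{"type": "RELATIONSHIP_CREATED", "relationship": {"id": "98765", '
    '"type": "FRIENDS", "startNode": "87654", "endNode": "12345"}}',
]


def describe(event: dict) -> str:
    """Return a one-line summary of a CDC event (hypothetical helper)."""
    kind = event.get("type")
    if kind == "NODE_CREATED":
        node = event["after"]
        return f"node {node['id']} created with labels {node['labels']}"
    if kind == "RELATIONSHIP_CREATED":
        rel = event["relationship"]
        return f"({rel['startNode']})-[:{rel['type']}]->({rel['endNode']}) created"
    # Unknown types are reported rather than raising, so new event kinds
    # do not crash the consumer.
    return f"unhandled event type: {kind}"


summaries = [describe(json.loads(raw)) for raw in RAW_EVENTS]
```

In a real deployment the strings would come from the message values returned by your Kafka client of choice.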
The core problem CDC solves is the impedance mismatch between a transactional database and systems that need to react to data changes asynchronously. Instead of periodically querying for differences, which is inefficient and prone to race conditions, CDC provides a direct, event-driven feed. This is invaluable for use cases like real-time analytics, data warehousing synchronization, microservice event sourcing, and building materialized views.
Internally, Neo4j CDC leverages the database’s transaction log. Every modification to the graph (node creation, property updates, relationship changes) is written to this log. The CDC component reads the log, transforms the raw transaction data into meaningful events (NODE_CREATED, NODE_PROPERTIES_UPDATED, RELATIONSHIP_DELETED, etc.), and publishes them to a configured destination, most commonly a message broker like Kafka. Because the log records every committed transaction, no committed change is skipped. Whether a consumer sees each event exactly once depends on the delivery guarantees of the broker and the consumer; with at-least-once delivery, the same event can arrive more than once.
You control the CDC process primarily through configuration. Key parameters include the destination endpoint (e.g., Kafka bootstrap servers, topic names), filtering rules to include or exclude specific labels or relationship types, and the retention period for events in the transaction log. You can also configure the format of the emitted events, such as whether to include the full before and after states of entities.
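The include/exclude idea can also be applied on the consumer side. The sketch below is one possible way to express such a rule in Python. EventFilter and its fields are hypothetical names introduced for illustration, not part of any Neo4j configuration surface, and the event shape is assumed to match the sample payloads earlier in this article.

```python
from dataclasses import dataclass, field


@dataclass
class EventFilter:
    """Label-based include/exclude rule (hypothetical, for illustration)."""
    include_labels: set = field(default_factory=set)  # empty means "allow all"
    exclude_labels: set = field(default_factory=set)

    def accepts(self, event: dict) -> bool:
        # Deletions carry only a "before" state, so fall back to it.
        entity = event.get("after") or event.get("before") or {}
        labels = set(entity.get("labels", []))
        if labels & self.exclude_labels:
            return False  # exclusion wins over inclusion
        if self.include_labels and not (labels & self.include_labels):
            return False
        return True


person_only = EventFilter(include_labels={"Person"}, exclude_labels={"Internal"})
```

Letting exclusion take priority over inclusion is a design choice made here; a real filter might resolve conflicts differently.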
The real power of CDC comes from understanding the before and after fields in the event payload. For a NODE_PROPERTIES_UPDATED event, the before field shows the state of the node before the update, and after shows the state after. This allows downstream systems to calculate deltas precisely, for instance, to understand which specific properties changed. If you only receive the after state, determining the exact modification requires a separate lookup or a more complex inference.
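As a rough illustration of that delta calculation, the following Python sketch compares the properties maps in before and after. The property_delta function is a hypothetical helper, and the event layout is assumed to follow the sample payloads shown earlier.

```python
def property_delta(event: dict) -> dict:
    """Split property changes into added, removed, and changed keys."""
    # Either side may be null (creation or deletion), so default to empty maps.
    before = (event.get("before") or {}).get("properties", {})
    after = (event.get("after") or {}).get("properties", {})
    return {
        "added": {k: after[k] for k in after.keys() - before.keys()},
        "removed": {k: before[k] for k in before.keys() - after.keys()},
        "changed": {
            k: (before[k], after[k])
            for k in before.keys() & after.keys()
            if before[k] != after[k]
        },
    }


update_event = {
    "type": "NODE_PROPERTIES_UPDATED",
    "before": {"id": "12345", "properties": {"name": "Bob", "city": "Oslo"}},
    "after": {"id": "12345", "properties": {"name": "Robert", "email": "bob@example.com"}},
}
delta = property_delta(update_event)
```

Here delta reports email as added, city as removed, and name as changed from "Bob" to "Robert".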
When Kafka is the destination, Neo4j CDC publishes events to topics. The examples in this article use a single topic named neo4j.event. You can also specify custom topic naming strategies, for example to route events by node label or relationship type, which simplifies downstream consumption by letting each service subscribe only to the data it cares about.
For example, to configure Kafka, you’d typically edit the neo4j.conf file and add or modify these settings:
dbms.transaction.log.rotation.size=100M
dbms.transaction.log.max.size=1G
dbcc.enabled=true
dbcc.sink.kafka.bootstrap.servers=localhost:9092
dbcc.sink.kafka.topic.name=neo4j.event
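These settings enable CDC, point it to your Kafka broker at localhost:9092, and specify the topic neo4j.event for publishing. The transaction log settings ensure that the logs are managed appropriately for CDC to read from. Setting names differ between Neo4j versions and CDC integrations, so check these keys against the documentation for your release before applying them.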
The dbcc.sink.kafka.topic.name value can also be a pattern. To send Person node events to neo4j.events.Person and Company node events to neo4j.events.Company, use the pattern neo4j.events.${label}, where ${label} is replaced by the node label. Similarly, neo4j.events.${type} routes events by event type.
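To show how such placeholder expansion behaves, here is a small Python sketch built on the standard library's string.Template, which happens to use the same ${name} syntax. It only illustrates the idea and is not how Neo4j resolves topic names. The resolve_topic helper, the choice of the first label, and the "unlabeled" fallback are all assumptions.

```python
from string import Template


def resolve_topic(pattern: str, event: dict) -> str:
    """Expand ${label} and ${type} in a topic pattern (hypothetical helper)."""
    entity = event.get("after") or event.get("before") or {}
    # Assumption: route on the first label; nodes may carry several.
    labels = entity.get("labels") or ["unlabeled"]
    return Template(pattern).substitute(
        label=labels[0],
        type=event.get("type", "UNKNOWN"),
    )


person_created = {
    "type": "NODE_CREATED",
    "before": None,
    "after": {"id": "87654", "labels": ["Person"], "properties": {"name": "Alice"}},
}
by_label = resolve_topic("neo4j.events.${label}", person_created)
by_type = resolve_topic("neo4j.events.${type}", person_created)
```

With the sample event, by_label is neo4j.events.Person and by_type is neo4j.events.NODE_CREATED.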
The most surprising part of Neo4j CDC is how it handles schema evolution. Unlike traditional RDBMS CDC, which might emit DDL events, Neo4j CDC focuses on the data mutations themselves. When you add a new property to a node, the CDC event for that node simply reflects the new property in the after state. Downstream consumers are responsible for understanding and adapting to these schema changes. This aligns with Neo4j’s schema-optional nature: schema interpretation is the job of whoever consumes the data stream.
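One way a consumer could adapt is to track which property keys it has seen per label and react when a new one appears. The Python sketch below shows that approach under the assumption that events follow the sample payload shape. SchemaTracker is a hypothetical class, and keeping its state in memory is a simplification.

```python
from collections import defaultdict


class SchemaTracker:
    """Remember property keys seen per label and report new ones."""

    def __init__(self):
        self.known = defaultdict(set)  # label -> set of property keys

    def observe(self, event: dict) -> dict:
        """Return {label: new_keys} for keys not seen before on that label."""
        node = event.get("after") or {}
        keys = set(node.get("properties", {}))
        discovered = {}
        for label in node.get("labels", []):
            new_keys = keys - self.known[label]
            if new_keys:
                discovered[label] = new_keys
                self.known[label] |= new_keys
        return discovered


tracker = SchemaTracker()
first = tracker.observe(
    {"after": {"labels": ["Person"], "properties": {"name": "Alice"}}}
)
second = tracker.observe(
    {"after": {"labels": ["Person"], "properties": {"name": "Alice", "email": "a@example.com"}}}
)
```

The second call reports only email as new for Person, which is the point at which a consumer could add a column, log a warning, or update a mapping.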
After successfully configuring and verifying your CDC stream, the next logical step is to implement robust error handling and idempotency in your consumer applications.
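A minimal sketch of an idempotent consumer in Python might look like the following. It assumes events can be identified by a key built from the event type, the entity id, and the timestamp, since the sample payloads carry no dedicated event id. IdempotentHandler and event_key are hypothetical names, and a production system would keep the set of seen keys in durable storage, not in memory.

```python
class IdempotentHandler:
    """Apply each event at most once, skipping redelivered duplicates."""

    def __init__(self, apply):
        self._apply = apply  # callback that performs the side effect
        self._seen = set()   # assumption: in-memory; use durable storage in practice

    @staticmethod
    def event_key(event: dict) -> tuple:
        # Assumption: (type, entity id, timestamp) identifies an event.
        entity = (
            event.get("relationship")
            or event.get("after")
            or event.get("before")
            or {}
        )
        return (event.get("type"), entity.get("id"), event.get("timestamp"))

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        key = self.event_key(event)
        if key in self._seen:
            return False
        # Record the key only after the callback succeeds, so a failed
        # event is retried when the broker redelivers it.
        self._apply(event)
        self._seen.add(key)
        return True


applied = []
handler = IdempotentHandler(applied.append)
event = {
    "type": "NODE_CREATED",
    "timestamp": "2023-10-27T10:00:00Z",
    "after": {"id": "87654", "labels": ["Person"], "properties": {"name": "Alice"}},
}
results = [handler.handle(event), handler.handle(event)]
```

The second call returns False and leaves applied with a single entry, so a redelivered event has no additional effect.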