Neo4j bulk loading is not just about shoveling data in; it’s about controlling how Neo4j writes data to disk so that I/O is minimized and throughput is maximized.

Let’s see it in action. Imagine you have two CSV files: users.csv and follows.csv.

users.csv:

id:ID,name,age:int
1,Alice,30
2,Bob,25
3,Charlie,35

follows.csv:

:START_ID,:END_ID,:TYPE
1,2,FOLLOWS
1,3,FOLLOWS
2,3,FOLLOWS

Here’s a simple neo4j-admin import command to get this done:

neo4j-admin import \
  --database=my_graph.db \
  --nodes=users.csv \
  --relationships=follows.csv \
  --skip-bad-relationships \
  --delimiter=, \
  --array-delimiter='|' \
  --id-type=integer

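This command tells Neo4j to create a new database my_graph.db, consuming nodes from users.csv and relationships from follows.csv. It treats IDs as integers and uses a comma as the field delimiter; --array-delimiter sets the separator used inside array-valued fields. --skip-bad-relationships skips relationships whose start or end ID does not match any imported node, instead of halting the import. The exact command syntax varies between Neo4j versions (newer releases use neo4j-admin database import full), so check the form your installation expects.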
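
If you generate the import from a script, it can help to assemble the command as an argument list rather than a hand-edited string. The following is a minimal sketch, not an official wrapper: build_import_command is a hypothetical helper, it only reuses the flags shown in the command above, and it builds the command without running it.

```python
import shlex


def build_import_command(database, node_files, rel_files, id_type="integer"):
    """Return the argv list for a `neo4j-admin import` run (nothing is executed).

    Hypothetical helper: it only arranges the flags used in this article.
    """
    argv = ["neo4j-admin", "import", f"--database={database}"]
    # One --nodes / --relationships flag per input file.
    argv += [f"--nodes={path}" for path in node_files]
    argv += [f"--relationships={path}" for path in rel_files]
    argv += [
        "--skip-bad-relationships",
        "--delimiter=,",
        "--array-delimiter=|",
        f"--id-type={id_type}",
    ]
    return argv


cmd = build_import_command("my_graph.db", ["users.csv"], ["follows.csv"])
# shlex.join quotes arguments that need it, so the line is safe to paste into a shell.
print(shlex.join(cmd))
```

Keeping the command as a list means you can pass it straight to subprocess.run without worrying about shell quoting of characters such as the pipe in the array delimiter.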

The core problem bulk loading solves is the overhead of individual transaction commits. A standard Cypher CREATE statement for a single node or relationship involves:

  1. Transaction Start: A new transaction is initiated.
  2. Data Parsing & Validation: Neo4j parses the Cypher, validates schema, and checks constraints.
  3. Index Updates: If indexes exist, they are updated.
  4. Node/Relationship Creation: The actual data structures for nodes and relationships are created in memory.
  5. Page Cache Writes: Data is written to Neo4j’s page cache.
  6. Transaction Commit: The transaction is committed, triggering a write to the transaction log (txlog).
  7. Page Cache Flushing: Eventually, modified pages in the cache are flushed to disk.

For millions of entities, this per-transaction overhead becomes a massive bottleneck. neo4j-admin import bypasses this entirely. It runs offline, reads your CSVs directly, and writes Neo4j’s store files itself in large sequential passes, without going through the transaction layer. This dramatically reduces the number of I/O operations and eliminates the transaction commit overhead. Note that the classic import tool has not historically created indexes or constraints as part of the import; you add them afterwards, once the database is started.
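
To put a rough number on the commit overhead, you can count how many transaction commits a transactional load needs. This is plain arithmetic rather than a benchmark; commit_count is a hypothetical helper and the entity count and batch size are illustrative assumptions.

```python
import math


def commit_count(entities, batch_size):
    """Number of transaction commits needed to write `entities` items in batches."""
    return math.ceil(entities / batch_size)


n = 10_000_000  # illustrative dataset size
print(commit_count(n, 1))       # one CREATE per transaction -> 10000000 commits
print(commit_count(n, 10_000))  # batched transactional load -> 1000 commits
```

The offline import sits at the far end of this scale: it performs no transaction commits at all.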

The mental model for bulk loading is that you’re not running Cypher. You’re feeding Neo4j raw data in a specific, structured format that it can directly map to its internal storage structures. You’re essentially hand-crafting the database files from scratch.

The key levers you control are:

  • File Format: CSV is standard, but the delimiter, quoting, and escaping must be correct.
  • Node/Relationship Files: You specify which file corresponds to which entity type.
  • ID Mapping: How Neo4j understands your external IDs and uses them to connect relationships to nodes. The --id-type flag and the ID fields in your headers (like id:ID in node files, :START_ID and :END_ID in relationship files) are crucial here.
  • Indexes: For performance, you’ll want indexes. With the classic import tool these are created after the import finishes and the database is started, not during it; check whether your Neo4j version offers schema creation as part of the import.
  • Database Target: The import builds a database from scratch. The target is expected to be new or empty; replacing an existing database, where your version allows it, discards the existing contents.

When importing relationships, Neo4j needs to know which nodes your relationships connect. This is why follows.csv includes the :START_ID and :END_ID columns. These columns must contain values that match the :ID column (here id:ID) in your node CSVs. If different node types have overlapping IDs, you can place them in separate ID spaces, for example id:ID(User) in the node header with :START_ID(User) and :END_ID(User) in the relationship header, so that each endpoint is looked up in the right set of IDs.
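
The lookup described above can be sketched in a few lines of Python. This is only an illustration of how the header fields line up, not how the importer is implemented; it assumes a node header of id:ID,name,age:int and a relationship header of :START_ID,:END_ID,:TYPE, and field_type, load, and resolve are hypothetical helpers.

```python
import csv
import io

USERS_CSV = """id:ID,name,age:int
1,Alice,30
2,Bob,25
3,Charlie,35
"""

FOLLOWS_CSV = """:START_ID,:END_ID,:TYPE
1,2,FOLLOWS
1,3,FOLLOWS
2,3,FOLLOWS
"""


def field_type(field):
    """Return the type part of a header field, e.g. 'id:ID(User)' -> 'ID'."""
    _, _, type_part = field.partition(":")
    return type_part.split("(")[0]


def load(text):
    """Parse CSV text into (field types, raw header, data rows)."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    return [field_type(f) for f in header], header, data


def resolve(users_text, follows_text):
    """Match each relationship's endpoints against the node ID column."""
    u_types, u_header, users = load(users_text)
    id_col, name_col = u_types.index("ID"), u_header.index("name")
    names = {row[id_col]: row[name_col] for row in users}

    r_types, _, rels = load(follows_text)
    s, e, t = (r_types.index(x) for x in ("START_ID", "END_ID", "TYPE"))
    return [(names[r[s]], r[t], names[r[e]]) for r in rels]


for start, rel, end in resolve(USERS_CSV, FOLLOWS_CSV):
    print(f"({start})-[:{rel}]->({end})")
```

A relationship whose ID is missing from the node file would raise a KeyError here, which is the situation --skip-bad-relationships is meant to tolerate during the real import.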

A common pitfall is assuming that the order of nodes or relationships in your CSVs matters for the import itself; it doesn’t. What matters is that the IDs used in relationship files correctly reference IDs present in node files. The --id-type flag is critical; if your IDs are strings, you’d use --id-type=string. If you have mixed ID types or complex ID schemes, you might need to pre-process your CSVs.
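
A cheap pre-flight check before a long import is to confirm that every relationship endpoint exists among the node IDs, and to see whether the IDs really are all integers. The sketch below assumes you have already read the IDs into memory as strings; dangling_references and infer_id_type are hypothetical helpers, not part of any Neo4j tooling.

```python
def dangling_references(node_ids, relationships):
    """Return the (start, end) pairs that mention an ID missing from node_ids."""
    known = set(node_ids)
    return [(s, e) for s, e in relationships if s not in known or e not in known]


def infer_id_type(node_ids):
    """Suggest a value for --id-type.

    Returns 'integer' only if every ID is a non-negative whole number written
    in ASCII digits; anything else falls back to 'string'.
    """
    all_integers = all(i.isascii() and i.isdigit() for i in node_ids)
    return "integer" if all_integers else "string"


node_ids = ["1", "2", "3"]
relationships = [("1", "2"), ("2", "3"), ("3", "4")]  # "4" is not a known node

print(dangling_references(node_ids, relationships))  # [('3', '4')]
print(infer_id_type(node_ids))                       # integer
print(infer_id_type(["u1", "2"]))                    # string
```

For very large files you would stream the IDs rather than hold them all in a set, but the check itself stays the same.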

If your CSV files are enormous, you do not run a separate import per file: with the classic tool each run produces one complete database, and the results of separate runs cannot be merged. Instead, split the data into multiple files and pass them all to a single invocation. The --nodes and --relationships flags can be repeated, and each accepts a comma-separated group of files, typically a small header-only file followed by the data chunks. The importer already spreads its work across the available cores. For data that has to be loaded into a running database, a transactional approach (like Apache Spark with the Neo4j connector) is the alternative.
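
Splitting a large CSV into a header-only file plus data-only chunks is straightforward to script. The following is a rough sketch under stated assumptions: split_csv and _write_chunk are hypothetical helpers, the file naming scheme is made up for illustration, and the demo uses a tiny temporary file in place of a real dataset.

```python
import csv
import tempfile
from pathlib import Path


def _write_chunk(out_dir, stem, part, rows):
    """Write one data-only chunk (no header row) and return its path."""
    path = out_dir / f"{stem}-part-{part}.csv"
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path


def split_csv(source, out_dir, rows_per_chunk):
    """Split `source` into a header-only file followed by data-only chunks."""
    source, out_dir = Path(source), Path(out_dir)
    paths = []
    with source.open(newline="") as f:
        reader = csv.reader(f)
        header_path = out_dir / f"{source.stem}-header.csv"
        with header_path.open("w", newline="") as h:
            csv.writer(h).writerow(next(reader))
        paths.append(header_path)

        chunk, part = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_chunk:
                paths.append(_write_chunk(out_dir, source.stem, part, chunk))
                chunk, part = [], part + 1
        if chunk:  # leftover rows that did not fill a whole chunk
            paths.append(_write_chunk(out_dir, source.stem, part, chunk))
    return paths


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "users.csv"
    src.write_text("id:ID,name,age:int\n1,Alice,30\n2,Bob,25\n3,Charlie,35\n")
    parts = split_csv(src, tmp, rows_per_chunk=2)
    # Header file first, then the chunks, joined into one --nodes group.
    nodes_flag = "--nodes=" + ",".join(p.name for p in parts)
    print(nodes_flag)
```

The resulting flag lists the header file first so that the importer reads the column definitions before any data rows.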

The next thing people often grapple with after a successful bulk import is how to efficiently update or add new data without re-importing the entire dataset.
