MongoDB Schema: Embed vs. Reference Deep Dive

MongoDB documents are far more flexible than relational rows, and that flexibility is its superpower, but also its biggest stumbling block when you’re coming from SQL. The most surprising thing is that the "best" document design isn’t about normalization; it’s about query patterns.

Let’s see this in action. Imagine you’re tracking user orders. A common SQL approach would be users and orders tables, linked by user_id.

In MongoDB, you could do this:

// User document
{
  "_id": ObjectId("60c72b2f9b1e8a0f3c8b4567"),
  "username": "alice",
  "email": "alice@example.com"
}

// Order document
{
  "_id": ObjectId("60c72b2f9b1e8a0f3c8b4568"),
  "user_id": ObjectId("60c72b2f9b1e8a0f3c8b4567"),
  "order_date": ISODate("2023-10-27T10:00:00Z"),
  "items": [
    {"product_id": "prod123", "quantity": 2, "price": 19.99},
    {"product_id": "prod456", "quantity": 1, "price": 5.50}
  ],
  "total": 45.48
}

This looks familiar, right? We’ve got user_id linking them. And it works for some queries. If you want to find all orders for a specific user, you’d query the orders collection:

db.orders.find({ "user_id": ObjectId("60c72b2f9b1e8a0f3c8b4567") })

This is perfectly fine if your primary access pattern is "give me all orders for this user." But what if your most frequent query is "show me the user’s name and email along with their recent orders"? With the design above, you’d have to:

Query orders for the user’s orders.
Get the user_id from those orders.
Query users using that user_id.
Manually combine the results in your application.

This is inefficient, especially if you have many orders per user.

This is where embedding comes in. Instead of linking, you put related data inside another document.

Consider this design, where orders are embedded within the user document:

// User document with embedded orders
{
  "_id": ObjectId("60c72b2f9b1e8a0f3c8b4567"),
  "username": "alice",
  "email": "alice@example.com",
  "orders": [
    {
      "order_id": ObjectId("60c72b2f9b1e8a0f3c8b4568"),
      "order_date": ISODate("2023-10-27T10:00:00Z"),
      "items": [
        {"product_id": "prod123", "quantity": 2, "price": 19.99},
        {"product_id": "prod456", "quantity": 1, "price": 5.50}
      ],
      "total": 45.48
    },
    {
      "order_id": ObjectId("60c72b2f9b1e8a0f3c8b4569"),
      "order_date": ISODate("2023-10-27T11:30:00Z"),
      "items": [
        {"product_id": "prod789", "quantity": 1, "price": 100.00}
      ],
      "total": 100.00
    }
  ]
}

Now, if you want to retrieve a user and all their orders, it’s a single, fast read:

db.users.findOne({ "_id": ObjectId("60c72b2f9b1e8a0f3c8b4567") })

This is the one-to-many embedding pattern. It’s ideal when the "many" side of the relationship is accessed with the "one" side, and the "many" data doesn’t grow to an unmanageable size within a single document.

What if an order can have hundreds of thousands of items? Embedding all of them in the user document would lead to massive documents, exceeding the 16MB BSON limit, and slow reads. In this case, you’d reference the items, similar to the relational model, but perhaps by embedding the order itself and referencing the items:

// User document
{
  "_id": ObjectId("60c72b2f9b1e8a0f3c8b4567"),
  "username": "alice",
  "email": "alice@example.com"
}

// Order document (references items)
{
  "_id": ObjectId("60c72b2f9b1e8a0f3c8b4568"),
  "user_id": ObjectId("60c72b2f9b1e8a0f3c8b4567"),
  "order_date": ISODate("2023-10-27T10:00:00Z"),
  "item_ids": [
    ObjectId("item_a"),
    ObjectId("item_b"),
    // ... potentially hundreds of thousands more
  ],
  "total": 45.48
}

// Item document
{
  "_id": ObjectId("item_a"),
  "product_name": "Widget",
  "price": 19.99
}

This is the one-to-few embedding (for the user-order relationship) and one-to-many referencing (for order-item relationship). The key is understanding your access patterns.

The many-to-many relationship is often handled with two-way referencing. For example, if users can "like" products, and you want to see which users liked a product and which products a user liked:

// User document
{
  "_id": ObjectId("60c72b2f9b1e8a0f3c8b4567"),
  "username": "alice",
  "email": "alice@example.com",
  "liked_product_ids": [
    ObjectId("prod123"),
    ObjectId("prod789")
  ]
}

// Product document
{
  "_id": ObjectId("prod123"),
  "name": "Super Widget",
  "price": 19.99,
  "liked_by_user_ids": [
    ObjectId("60c72b2f9b1e8a0f3c8b4567"),
    ObjectId("60c72b2f9b1e8a0f3c8b4568")
  ]
}

This duplicates information (the relationship is stored in both documents), but it makes queries like "show me all users who liked product X" or "show me all products liked by user Y" extremely fast, as they only require a single collection lookup.

When you have data that’s frequently updated by many users simultaneously, like a leaderboard or a real-time counter, embedding can lead to write contention. In such cases, denormalization with atomic operations becomes crucial. For example, to increment a user’s score:

db.leaderboards.updateOne(
  { "user_id": ObjectId("60c72b2f9b1e8a0f3c8b4567") },
  { "$inc": { "score": 1 } }
)

This allows multiple clients to update a score without reading the entire document, modifying it, and writing it back, which would be prone to race conditions.

The most counterintuitive aspect of MongoDB document design is the concept of limited embedding. While you can embed arrays, you should generally limit the size of those arrays. If an array grows too large, it can break the 16MB document size limit and degrade read performance. For instance, embedding an unbounded list of blog comments directly into a blog post document is usually a bad idea; you’d typically embed only the first few comments and reference the rest in a separate comments collection.

The next step is understanding how to efficiently query these embedded and referenced documents, which leads into indexing strategies.