Embedding documents is often the default choice in MongoDB, which surprises developers coming from relational databases, where the same data would be normalized into separate tables.
Let’s see what embedding looks like in practice. Imagine we have a users collection and a posts collection. If a user can have multiple posts, and we want to show a user’s name and profile picture alongside each of their posts, we’d embed the user’s core information directly into each post document.
// Example document in the 'posts' collection
{
  "_id": ObjectId("60b8d295f1d2e3a4b5c6d7e8"),
  "title": "My First MongoDB Post",
  "content": "This is the content of my first post...",
  "author": { // Embedded document
    "userId": ObjectId("5f9f1b9b9c9d9e9f9a9b9c9d"),
    "username": "mongo_master",
    "profilePictureUrl": "http://example.com/pics/mongo_master.jpg"
  },
  "createdAt": ISODate("2023-10-27T10:00:00Z")
}
Notice how author is not just an ID, but a full object containing details about the user. When we query for posts, the embedded author details come back along with them.
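// Querying for posts by an embedded author field
db.posts.find(
  // Filter: dot notation reaches into the embedded author document
  { "author.username": "mongo_master" },
  // Projection: return only the fields needed for display (_id is included by default)
  { "title": 1, "content": 1, "author.username": 1, "author.profilePictureUrl": 1 }
)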
This query retrieves the post’s title, content, and the embedded author’s username and profile picture URL all in a single find operation. No joins, no second lookup. This is the power of embedding: read efficiency for common access patterns.
Embedding solves a specific problem: the cost of joining related data that you almost always read together. In MongoDB, if you know posts will nearly always be displayed with the author’s basic info, embedding that info means a single query against a single collection returns everything the page needs, with no join and no second round trip to the database.
Internally, MongoDB stores documents in BSON format. When you embed, you’re just creating nested structures within a single BSON document. The levers you control are what you embed and how deeply you nest it.
The trade-off is write amplification and larger documents. If the embedded data (like the author object) changes, you have to update every document that holds a copy of it. If an author with thousands of posts changes their profile picture, every one of those posts documents has to be rewritten. This is where the "Reference" model comes in.
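The write cost described above can be sketched as a single multi-document update. This is a minimal sketch, assuming the embedded author shape from the earlier example. The helper name buildAuthorPictureUpdate and the new picture URL are hypothetical, and a plain string stands in for the ObjectId so the snippet runs outside the shell.

```javascript
// Hypothetical helper: builds the two arguments for db.posts.updateMany()
// that rewrite the embedded picture URL in every post by one author.
function buildAuthorPictureUpdate(userId, newUrl) {
  return {
    // Matches every post whose embedded author sub-document points at this user
    filter: { "author.userId": userId },
    // Dot notation changes one embedded field and leaves the rest of "author" intact
    update: { $set: { "author.profilePictureUrl": newUrl } },
  };
}

const { filter, update } = buildAuthorPictureUpdate(
  "5f9f1b9b9c9d9e9f9a9b9c9d", // in mongosh this would be ObjectId("5f9f...")
  "http://example.com/pics/mongo_master_v2.jpg" // assumed new URL
);

// In mongosh (not run here): db.posts.updateMany(filter, update)
// The number of documents written grows with the author's post count.
console.log(JSON.stringify({ filter, update }, null, 2));
```

One profile change becomes one write per post, which is why frequently changing fields are poor candidates for embedding.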
When you choose to reference documents, you’re essentially creating a link to another document, much like a foreign key in SQL. You’d store an authorId in each posts document, and then fetch the user’s details from the users collection with a join or a second query.
// Example document in the 'posts' collection with a reference
{
  "_id": ObjectId("60b8d295f1d2e3a4b5c6d7e8"),
  "title": "My First MongoDB Post",
  "content": "This is the content of my first post...",
  "authorId": ObjectId("5f9f1b9b9c9d9e9f9a9b9c9d"), // Reference to the user's document
  "createdAt": ISODate("2023-10-27T10:00:00Z")
}
// Example document in the 'users' collection
{
  "_id": ObjectId("5f9f1b9b9c9d9e9f9a9b9c9d"),
  "username": "mongo_master",
  "profilePictureUrl": "http://example.com/pics/mongo_master.jpg",
  "email": "mongo@example.com"
}
To get the post and author details, you’d typically use MongoDB’s $lookup aggregation stage (similar to a left outer join) or perform two separate queries.
// Using $lookup for a single query
db.posts.aggregate([
  {
    $match: { "authorId": ObjectId("5f9f1b9b9c9d9e9f9a9b9c9d") }
  },
  {
    $lookup: {
      from: "users",           // The collection to join with
      localField: "authorId",  // Field from the input documents (posts)
      foreignField: "_id",     // Field from the documents of the "from" collection (users)
      as: "authorDetails"      // Output array field name
    }
  },
  {
    // Flattens the authorDetails array into a single embedded object.
    // By default, posts with no matching user are dropped at this stage.
    $unwind: "$authorDetails"
  },
  {
    $project: {
      "title": 1,
      "content": 1,
      "authorUsername": "$authorDetails.username",
      "authorProfilePic": "$authorDetails.profilePictureUrl"
    }
  }
])
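The other option mentioned above, two separate queries, moves the join into application code. The following is a rough sketch of that approach rather than a definitive implementation: the helper names collectAuthorIds and attachAuthors are hypothetical, the sample documents use short string ids in place of ObjectIds, and the query results are hard-coded so the snippet runs without a database.

```javascript
// Stand-in for the result of query 1: db.posts.find({ ... })
const posts = [
  { _id: "p1", title: "My First MongoDB Post", authorId: "u1" },
  { _id: "p2", title: "Another Post", authorId: "u1" },
];

// Hypothetical helper: distinct author ids, so each user is fetched once.
// Keys are stringified because ObjectId instances do not compare by value.
function collectAuthorIds(postDocs) {
  const byKey = new Map();
  for (const post of postDocs) byKey.set(String(post.authorId), post.authorId);
  return [...byKey.values()];
}

// Filter for query 2. In mongosh (not run here): db.users.find(usersFilter)
const authorIds = collectAuthorIds(posts);
const usersFilter = { _id: { $in: authorIds } };

// Stand-in for the result of query 2
const users = [
  {
    _id: "u1",
    username: "mongo_master",
    profilePictureUrl: "http://example.com/pics/mongo_master.jpg",
  },
];

// Hypothetical helper: joins posts to users in memory
function attachAuthors(postDocs, userDocs) {
  const usersById = new Map(userDocs.map((user) => [String(user._id), user]));
  return postDocs.map((post) => {
    const author = usersById.get(String(post.authorId));
    return {
      title: post.title,
      // Unlike $unwind's default, posts without a matching user are kept
      authorUsername: author ? author.username : null,
      authorProfilePic: author ? author.profilePictureUrl : null,
    };
  });
}

console.log(attachAuthors(posts, users));
```

Batching the second query with $in keeps the cost at two round trips no matter how many posts are on the page, instead of one extra lookup per post.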
The critical insight most developers miss is that the decision isn’t binary ("embed or reference"). You can, and often should, do both. This is called "hybrid modeling." For example, you might embed the username and profilePictureUrl directly into the posts document for fast display, and also keep an authorId reference to the full user document. The users collection stays the source of truth: when the user’s profile picture changes, you update it there first. MongoDB does not refresh the embedded copies for you, so existing posts keep showing the old picture until your application propagates the change to them, for example with a background update keyed on authorId. You are trading a window of stale display data for single-query reads.
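To make the hybrid shape concrete, here is a small sketch of a post that carries both a reference and an embedded copy of the display fields. The helpers buildHybridPost and isSnapshotStale are hypothetical names introduced for illustration, the ids are plain strings rather than ObjectIds, and the second picture URL is an assumed value.

```javascript
// Hypothetical helper: builds a post holding a reference plus a display snapshot
function buildHybridPost(user, title, content) {
  return {
    title,
    content,
    authorId: user._id, // reference: the users document stays the source of truth
    author: {
      // embedded snapshot: only the fields needed to render the post
      username: user.username,
      profilePictureUrl: user.profilePictureUrl,
    },
    createdAt: new Date(),
  };
}

// Hypothetical helper: true when the embedded copy no longer matches the user document
function isSnapshotStale(post, user) {
  return (
    post.author.username !== user.username ||
    post.author.profilePictureUrl !== user.profilePictureUrl
  );
}

const user = {
  _id: "5f9f1b9b9c9d9e9f9a9b9c9d",
  username: "mongo_master",
  profilePictureUrl: "http://example.com/pics/mongo_master.jpg",
  email: "mongo@example.com",
};
const post = buildHybridPost(user, "My First MongoDB Post", "This is the content...");

// The user later changes their picture in the users collection only
const updatedUser = {
  ...user,
  profilePictureUrl: "http://example.com/pics/mongo_master_v2.jpg",
};

console.log(isSnapshotStale(post, user)); // false
console.log(isSnapshotStale(post, updatedUser)); // true

// In mongosh, refreshing the snapshots would look roughly like (not run here):
// db.posts.updateMany(
//   { authorId: updatedUser._id },
//   { $set: { "author.profilePictureUrl": updatedUser.profilePictureUrl } }
// )
```

Note that email is deliberately left out of the snapshot: it is not needed to display a post, so it lives only in the users document.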
The most common mistake is over-embedding. People embed everything just because they can, which leads to documents that are massive, slow to update, and at risk of hitting the BSON document size limit (currently 16 MB). The key is to embed only the data that is frequently accessed together with the parent document and rarely changes independently.
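One way to spot over-embedded documents before they approach the limit is to measure them. The sketch below builds an aggregation pipeline around the $bsonSize operator (available in MongoDB 4.4 and later). The helper names buildLargestDocsPipeline and fractionOfLimit are hypothetical, and projecting title assumes the posts shape used earlier.

```javascript
// MongoDB's maximum BSON document size: 16 MiB
const MAX_BSON_DOCUMENT_BYTES = 16 * 1024 * 1024;

// Hypothetical helper: pipeline that lists the largest documents in a collection
function buildLargestDocsPipeline(limit) {
  return [
    // $bsonSize reports the size in bytes of the whole document ($$ROOT) as BSON
    { $project: { title: 1, sizeBytes: { $bsonSize: "$$ROOT" } } },
    { $sort: { sizeBytes: -1 } },
    { $limit: limit },
  ];
}

// Hypothetical helper: how much of the size limit a document uses (0 to 1)
function fractionOfLimit(sizeBytes) {
  return sizeBytes / MAX_BSON_DOCUMENT_BYTES;
}

const pipeline = buildLargestDocsPipeline(5);

// In mongosh (not run here): db.posts.aggregate(pipeline)
console.log(JSON.stringify(pipeline));
console.log(fractionOfLimit(4 * 1024 * 1024)); // 0.25
```

Documents that keep growing toward the limit are usually a sign that an embedded array or sub-document should move to its own collection.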
The next complexity you’ll encounter is how to handle arrays of embedded documents and the performance implications of large arrays.