MongoDB’s collation feature lets you control how strings are compared, which is crucial for accurate case-insensitive and locale-aware sorting and querying.
Let’s see it in action. Imagine you have a collection of user names, and you want to find all users whose names start with "A", regardless of case, and you want this search to respect French sorting rules.
// Sample user data
db.users.insertMany([
  { name: "Alice" },
  { name: "alice" },
  { name: "Bob" },
  { name: "Émile" },
  { name: "émilie" }
]);
// Case-insensitive prefix match via $regex; the French collation governs the sort order
db.users.find({
  name: { $regex: "^A", $options: "i" }
}).sort({ name: 1 }).collation({ locale: "fr", strength: 2 });
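This query db.users.find({ name: { $regex: "^A", $options: "i" } }) finds "Alice" and "alice" on its own, because the i option makes the regular expression case-insensitive. Note that $regex matching is not collation-aware, so adding .collation({ locale: "fr", strength: 2 }) does not change which documents the pattern matches. What the collation does control is every ordinary string comparison in the operation (equality, range operators, and sort order): it tells MongoDB to compare using French language rules and, at strength 2, to ignore case while still distinguishing accents.
Here, an equality match such as { name: "alice" } under this collation would return both "Alice" and "alice", and "Émile" would compare equal to "émile" but not to "Emile"; to make accented and unaccented forms equivalent you would drop to strength 1. When sorting, "Émile" would appear right next to "Emile" instead of after every unaccented name, as it would in a binary sort. The $regex: "^A", $options: "i" clause remains responsible for the prefix match, while the collation governs comparisons and ordering around it.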
The core problem collation solves is that simple byte-wise string comparison is often not what users expect. "A" is not equal to "a" in raw bytes, and in many languages, accented characters like "é" or "ü" have specific sorting orders relative to their unaccented counterparts. Without collation, case-insensitive queries might fail on capitalized words, and sorting can be nonsensical across different languages.
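To make the difference concrete, here is a small sketch that runs in Node.js, not in MongoDB. It uses the built-in Intl.Collator, which is also backed by ICU, as a stand-in for a collation; the sensitivity: "accent" option is assumed here to approximate strength: 2, and the names array is purely illustrative.

```javascript
// Illustrative data; not read from MongoDB.
const names = ["Bob", "alice", "Émile", "Alice", "émilie"];

// Default sort compares UTF-16 code units, so every uppercase ASCII letter
// precedes every lowercase one, and accented letters land at the end.
const byCodeUnit = [...names].sort();

// Locale-aware sort: French rules, ignoring case but not accents
// (roughly what { locale: "fr", strength: 2 } asks MongoDB to do).
const french = new Intl.Collator("fr", { sensitivity: "accent" });
const byCollation = [...names].sort(french.compare);

console.log(byCodeUnit);  // [ 'Alice', 'Bob', 'alice', 'Émile', 'émilie' ]
console.log(byCollation); // [ 'alice', 'Alice', 'Bob', 'Émile', 'émilie' ]
```

In the collated result "alice" and "Alice" compare equal, so they simply keep their input order.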
MongoDB’s collation is a document that you can pass to operations such as find, update, delete, and aggregate; it also governs any sort those operations perform. The key fields are:
- locale: Specifies the language rules to use. Examples include "en" (English), "fr" (French), "es" (Spanish), "de" (German), "sv" (Swedish), and many more. The special value "simple" selects plain binary comparison.
- strength: Controls the level of sensitivity for comparisons.
  - strength: 1: Base letter only (e.g., 'a' == 'A' == 'á').
  - strength: 2: Base letter and accent (e.g., 'a' == 'A', 'á' != 'a').
  - strength: 3: Base letter, accent, and case (e.g., 'a' != 'A', 'á' != 'a'). This is the default and is what people usually mean by "case-sensitive, accent-sensitive" comparison.
  - strength: 4: Quaternary level. Adds a further distinction, mainly for punctuation when the alternate option is "shifted" and for some Japanese text (less common).
  - strength: 5: Identical level. The most sensitive setting; it acts as a final tie-breaker when all other levels compare equal (rarely needed).
- caseLevel: A boolean that, when true and strength is 1 or 2, makes the comparison case-sensitive. If false (the default), case is ignored at levels 1 and 2.
- numericOrdering: A boolean that, when true, sorts numbers embedded in strings numerically instead of lexicographically. For example, "file10" would sort after "file2" instead of before.
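The fields above can be explored without a database. The following is a minimal sketch for Node.js that assumes the usual correspondence between Intl.Collator options and ICU collation attributes (sensitivity "base", "accent", and "variant" for strengths 1 to 3, "case" for strength 1 with caseLevel, and numeric for numericOrdering). It illustrates the comparison rules only and is not MongoDB's own code path.

```javascript
// Helper: true when the collator treats the two strings as equal.
const equal = (collator, a, b) => collator.compare(a, b) === 0;

// Assumed correspondence between Intl.Collator options and MongoDB fields.
const strength1 = new Intl.Collator("en", { sensitivity: "base" });    // strength: 1
const strength2 = new Intl.Collator("en", { sensitivity: "accent" });  // strength: 2
const strength3 = new Intl.Collator("en", { sensitivity: "variant" }); // strength: 3
const caseLevel = new Intl.Collator("en", { sensitivity: "case" });    // strength: 1, caseLevel: true

console.log(equal(strength1, "a", "A"), equal(strength1, "a", "á")); // true true
console.log(equal(strength2, "a", "A"), equal(strength2, "a", "á")); // true false
console.log(equal(strength3, "a", "A"), equal(strength3, "a", "á")); // false false
console.log(equal(caseLevel, "a", "A"), equal(caseLevel, "a", "á")); // false true

// numericOrdering: true corresponds to the `numeric` option.
const files = ["file10", "file2", "file1"];
const lexicographic = [...files].sort(new Intl.Collator("en").compare);
const numeric = [...files].sort(new Intl.Collator("en", { numeric: true }).compare);

console.log(lexicographic); // [ 'file1', 'file10', 'file2' ]
console.log(numeric);       // [ 'file1', 'file2', 'file10' ]
```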
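You can also set a default collation on a collection, or give an individual index its own collation. Operations on a collection with a default collation inherit it unless they specify their own, and so do indexes created on that collection. An index with a collation only supports string comparisons for operations that use the same collation, so a query must specify (or inherit) a matching collation to benefit from it.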
// Setting a default collation for a collection
db.createCollection("products", {
  collation: { locale: "en", strength: 2 } // Case-insensitive by default
});
// Creating an index with its own collation
db.items.createIndex({ name: 1 }, {
  collation: { locale: "sv", strength: 2 } // Swedish-aware, case-insensitive index
});
When you use strength: 2 and locale: "fr", MongoDB uses the ICU (International Components for Unicode) collation rules for French. At this strength, case is ignored but accents are not: 'é' and 'É' compare equal, while 'é', 'è', 'ê', 'ë', and 'e' are all distinct. Only at strength: 1 are they treated as variations of 'e', so an equality match for "emile" would return "Émile" at strength 1 but not at strength 2.
The interplay between locale and strength is where the real power lies. For instance, in Swedish, 'ä' is a separate letter that sorts after 'z', whereas in German, 'ä' sorts alongside 'a' (or as 'ae' under phone-book ordering). Using locale: "sv" will respect the Swedish rule, and strength: 2 means that 'a' and 'A' are considered equal, but 'a' and 'ä' are different (their relative order is defined by the locale).
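The locale effect is easy to see in a quick Node.js sketch. As before, Intl.Collator stands in for MongoDB's ICU collation, the word list is made up for illustration, and the runtime is assumed to ship full ICU locale data (the default in current Node.js releases).

```javascript
// Illustrative words; "ärlig" is Swedish for "honest".
const words = ["zebra", "ärlig", "apa"];

// Swedish treats 'ä' as its own letter after 'z'; German sorts it with 'a'.
const swedish = [...words].sort(new Intl.Collator("sv").compare);
const german = [...words].sort(new Intl.Collator("de").compare);

console.log(swedish); // [ 'apa', 'zebra', 'ärlig' ]
console.log(german);  // [ 'apa', 'ärlig', 'zebra' ]

// Approximating { locale: "sv", strength: 2 }: case ignored, 'ä' still distinct.
const sv2 = new Intl.Collator("sv", { sensitivity: "accent" });
console.log(sv2.compare("a", "A") === 0); // true
console.log(sv2.compare("a", "ä") === 0); // false
```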
A subtle point is how strength interacts with caseLevel. Strength 3, the default, is both case-sensitive and accent-sensitive. If you want case-insensitivity but accent-sensitivity, use strength: 2 and leave caseLevel at false (its default). For the opposite combination, case-sensitive but accent-insensitive, use strength: 1 with caseLevel: true.
If you’re performing a case-insensitive query and find that accented characters are still causing mismatches (e.g., searching for "resume" doesn’t find "résumé"), it’s likely you need to adjust the strength parameter or ensure your locale is correctly specified. For broad case-insensitivity that also groups accented characters with their base letters, strength: 1 is the most aggressive. For case-insensitivity but distinct accented characters, strength: 2 is the common choice.
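A quick way to check which strength a search needs is to emulate the equality match outside the database. This sketch filters a hypothetical list of titles with Intl.Collator in Node.js, assuming sensitivity "base" and "accent" behave like strength 1 and 2; the titles array and the matching helper are illustrative names, not part of any MongoDB API.

```javascript
// Hypothetical values standing in for a string field in a collection.
const titles = ["resume", "Résumé", "résumé", "RESUME", "resumes"];

// Emulates an equality match: keep every title the collator considers equal to the target.
const matching = (collator, target) =>
  titles.filter((title) => collator.compare(title, target) === 0);

// Like strength: 1, ignoring both case and accents.
const loose = matching(new Intl.Collator("en", { sensitivity: "base" }), "resume");
// Like strength: 2, ignoring case only.
const caseOnly = matching(new Intl.Collator("en", { sensitivity: "accent" }), "resume");

console.log(loose);    // [ 'resume', 'Résumé', 'résumé', 'RESUME' ]
console.log(caseOnly); // [ 'resume', 'RESUME' ]
```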
The next challenge you’ll likely encounter is dealing with complex search requirements that involve fuzzy matching or stemming, where collation alone is not enough.