Before memorising APIs, lock in this mental model: MongoDB stores data the way your application thinks about it — as objects, not rows. The core insight is that your domain model and your storage model are the same shape.
A SQL database is like a spreadsheet — every row must have the same columns. MongoDB is like a filing cabinet of folders. Each folder (document) can contain whatever papers (fields) it needs. Some folders are thick, some thin. You don't have to pre-define what goes inside each folder — you just put it in.
- Rapidly evolving data shapes (product catalogs, CMS)
- Hierarchical or nested data (orders with line items)
- Horizontal scale-out across many servers
- High write throughput applications
- Real-time analytics on operational data
- Complex multi-table JOINs (use RDBMS instead)
- Strict ACID across many unrelated collections
- Highly relational data with many foreign keys
- Reporting-heavy workloads (data warehouse is better)
MongoDB is a document-oriented NoSQL database that stores data as flexible JSON-like documents (technically BSON). Unlike relational databases which organise data into rigid tables with rows and columns, MongoDB stores each record as a self-contained document that can have its own structure.
| SQL Concept | MongoDB Equivalent | Key Difference |
|---|---|---|
| Database | Database | Same concept |
| Table | Collection | Collections have no enforced schema |
| Row | Document | Documents can differ in structure |
| Column | Field | Fields can be nested objects or arrays |
| Primary Key | _id field | Auto-generated ObjectId by default |
| JOIN | $lookup (aggregation) | Prefer embedding over joining |
| Foreign Key | Manual reference (DBRef) | Not enforced at DB level |
| Index | Index | Same concept, more types available |
Relational databases normalise data — splitting it across many tables to avoid duplication, then using JOINs to reconstruct it. MongoDB denormalises by design — related data lives together in one document, avoiding the JOIN cost at read time.
-- SQL: data split across 3 tables, need JOIN to reconstruct SELECT o.id, u.name, p.title, oi.quantity FROM orders o JOIN users u ON o.user_id = u.id JOIN order_items oi ON oi.order_id = o.id JOIN products p ON oi.product_id = p.id; // MongoDB: one read, all data together db.orders.findOne({ _id: orderId }); // Returns: { _id, user: { name, email }, items: [{ title, qty, price }] }
"MongoDB is a document-oriented database that stores data as BSON documents — essentially flexible JSON objects. The fundamental difference from relational databases is that MongoDB avoids the table-column-row model in favour of documents that can contain nested data and arrays. This means you design schemas around your access patterns — the data your app reads together lives together — trading JOIN-heavy reads for potential data duplication."
BSON (Binary JSON) is a binary-encoded serialisation of JSON-like documents. MongoDB stores documents in BSON format internally and transmits them over the wire in BSON. Your application driver converts between BSON and native language objects transparently.
| Property | JSON | BSON |
|---|---|---|
| Format | Text (UTF-8 string) | Binary encoded |
| Human-readable | Yes | No |
| Parse speed | Slower (text parsing) | Faster (binary, length-prefixed) |
| Data types | String, Number, Boolean, Array, Object, null | All JSON types + Date, Binary, ObjectId, Int32, Int64, Decimal128, Regex, Timestamp |
| Size | Smaller for simple data | Can be larger due to type metadata |
| Traversal | Must parse entire string | Skip fields using length prefix — fast |
{
_id: ObjectId("507f1f77bcf86cd799439011"), // 12-byte unique ID
createdAt: new Date("2026-01-15"), // ISODate — NOT a string
price: Decimal128("19.99"), // Precise decimal (no float errors)
views: NumberLong(9999999999), // 64-bit integer
data: BinData(0, "base64string"), // Binary blob (images, files)
pattern: RegExp(/^mongo/i), // Stored regex
isActive: true // Boolean (same as JSON)
}
Date objects, never as strings. BSON Date enables proper date comparison queries ($gt, $lt) and range queries. Storing "2026-01-15" as a string means alphabetical comparison — completely wrong for dates.
"BSON is MongoDB's binary serialisation format — it's JSON's superset. The two key advantages over plain JSON are: firstly, it supports richer types like Date, Int64, Decimal128, and Binary that JSON doesn't have — this prevents precision issues with numbers and enables proper date queries. Secondly, BSON is length-prefixed so the driver can skip fields without parsing the entire document, making it faster to read."
A document is a set of key-value pairs stored in BSON format. It is the basic unit of data in MongoDB — analogous to a row in SQL but far more flexible. Documents can contain nested documents (subdocuments), arrays, and arrays of subdocuments.
- Maximum size: 16 MB per document (hard limit)
- _id field: Every document must have a unique
_id— auto-generated as ObjectId if not provided - Field names: Cannot start with
$or contain.(dot) — these are reserved for query operators - Nesting depth: No hard limit, but deeply nested documents are hard to query and index
- Field order: Field order is preserved in documents (unlike most JSON parsers)
{
_id: ObjectId("507f1f77bcf86cd799439011"), // auto-generated unique key
name: "MongoDB Mastery", // String
price: Decimal128("49.99"), // Precise decimal
inStock: true, // Boolean
tags: ["mongodb", "nosql", "database"], // Array
publishedAt: new Date("2026-01-01"), // BSON Date
author: { // Embedded subdocument
name: "Yogesh Tiwari",
email: "yogesh@example.com"
},
reviews: [ // Array of subdocuments
{ user: "alice", rating: 5, comment: "Great!" },
{ user: "bob", rating: 4, comment: "Good" }
],
metadata: null // null value
}
"A MongoDB document is a BSON key-value structure — think of it as a rich JSON object. Key constraints: maximum 16MB, every document needs a unique _id, and field names can't use dots or start with dollar signs. What makes documents powerful is that field values can be strings, numbers, booleans, dates, arrays, or even other embedded documents — you can represent an entire entity with all its relationships in one place."
A collection is a grouping of MongoDB documents — the equivalent of a SQL table, but without a fixed schema. Collections are created automatically when you first insert a document into them. No pre-definition (CREATE TABLE) required.
| Feature | SQL Table | MongoDB Collection |
|---|---|---|
| Schema | Fixed — every row has the same columns | Dynamic — each document can differ |
| Data types per field | Enforced by column definition | Any BSON type per field per document |
| Adding new fields | ALTER TABLE (can lock table) | Just insert a document with the new field |
| Creation | Explicit CREATE TABLE | Auto-created on first insert |
| Joins | Native JOIN across tables | $lookup in aggregation (not native) |
| Validation | Column constraints, triggers | Optional JSON Schema validation |
A special collection type with a fixed size — when full, it overwrites the oldest documents. Useful for logs, event streams, and audit trails where you only care about recent data.
// Create a capped collection — max 1000 docs or 5MB db.createCollection("auditLog", { capped: true, size: 5242880, // 5MB in bytes max: 1000 // optional: max document count });
"A collection is like a SQL table but schema-free — it groups related documents without enforcing a structure. The key differences: collections are created implicitly on first insert, documents within the same collection can have completely different fields, and adding new fields to documents doesn't require any schema migration. You can optionally add JSON Schema validation to a collection if you need structure enforcement."
createdAt field.const { ObjectId } = require('mongodb'); const id = new ObjectId(); console.log(id.toString()); // "507f1f77bcf86cd799439011" console.log(id.getTimestamp()); // 2026-01-01T00:00:00.000Z — embedded creation time! // IMPORTANT: always query with ObjectId, not a string db.collection('users').findOne({ _id: new ObjectId("507f1f77bcf86cd799439011") }); // NOT: { _id: "507f1f77bcf86cd799439011" } ← this won't match!
| Feature | SQL AUTO_INCREMENT | MongoDB ObjectId |
|---|---|---|
| Distributed generation | ❌ Requires central DB coordination | ✅ Generated client-side — no round-trip |
| Predictability | ❌ Sequential — easy to enumerate | ✅ Randomly seeded — harder to guess |
| Embedded metadata | ❌ Just a number | ✅ Contains creation timestamp |
| Sharding-safe | ❌ Conflicts across shards | ✅ Globally unique across shards |
| Sort by creation | Yes (by value) | Yes (sort by _id gives time order) |
ObjectId before querying. { _id: "507f1f..." } will never match a document whose _id is an ObjectId. This is the #1 ObjectId bug beginners hit.
"An ObjectId is a 12-byte unique identifier — 4 bytes for Unix timestamp, 5 random bytes unique per process, and 3 bytes for an incrementing counter. This design means ObjectIds are globally unique without coordination between servers, which is critical for sharding. A bonus: sorting by _id gives you documents in insertion order, and you can extract the creation timestamp from any ObjectId without a separate field."
MongoDB is often called "schema-less" but that's misleading. A more accurate term is schema-flexible or dynamic schema. MongoDB doesn't enforce a schema at the database level by default — but your application absolutely has a schema. It's just implicit in your code rather than enforced by the database engine.
SQL is like a government form — every field is printed and you must fill them all in. MongoDB is like a blank notebook — you can write whatever you want, but your team still has conventions about what goes on each page. "Schema-less" means the notebook doesn't yell at you if you skip a section — it doesn't mean you don't have rules.
// All 3 can coexist in the same collection — MongoDB won't complain db.users.insertMany([ { name: "Alice", email: "alice@example.com", age: 30 }, { name: "Bob", phone: "+91-9999" }, // no email or age { fullName: "Charlie D", age: "twenty-five" } // age as string! ]);
MongoDB supports optional JSON Schema validation — you can enforce that all documents in a collection meet certain rules, while still allowing fields to be added freely.
db.createCollection("users", { validator: { $jsonSchema: { bsonType: "object", required: ["name", "email"], properties: { name: { bsonType: "string", description: "must be a string" }, email: { bsonType: "string", pattern: "^.+@.+$" }, age: { bsonType: "int", minimum: 0, maximum: 150 } } } }, validationAction: "error" // "warn" just logs, "error" rejects invalid inserts });
"MongoDB is schema-flexible, not truly schema-less. The database doesn't enforce a schema by default — different documents in the same collection can have different fields. But in practice, your application has an implicit schema baked into your code. MongoDB lets you add JSON Schema validation rules to enforce required fields and data types when you need that discipline. The flexibility shines during rapid development and schema evolution — adding a new field is just inserting it, no ALTER TABLE required."
Nest related data inside the parent document as a subdocument or array. Everything is in one place — one read gets everything.
Store a reference ID to another collection. Related data lives in a separate document — requires a second read or $lookup.
// ── EMBEDDING: address inside user ── { _id: ObjectId("..."), name: "Alice", address: { // embedded — always fetched with user street: "123 Main St", city: "Chandigarh", zip: "160001" } } // ── REFERENCING: author as a separate document ── // posts collection { _id: ObjectId("post1"), title: "MongoDB Guide", authorId: ObjectId("user1") // reference — lookup separately } // users collection { _id: ObjectId("user1"), name: "Alice", email: "alice@example.com" }
| Use Embedding When… | Use References When… |
|---|---|
| Data is always read together with the parent | Data is read independently of the parent |
| The child data belongs to one parent only (1:1) | The child is shared by many parents (N:N) |
| The child data doesn't change frequently | The child is updated often and independently |
| The array won't grow unboundedly | The array could grow very large (thousands+) |
| Atomic updates to parent + child are needed | You need to query/filter the child independently |
"The rule is: embed when you read together, reference when you read independently. Address in a user document is a classic embed — you always want it with the user. But a user's blog posts should be referenced — you might want posts without the user, and there could be thousands of them. The key constraints are: never embed arrays that could grow without bounds, and only embed if atomic updates to both parent and child are needed."
- Flexible schema: Rapid iteration without ALTER TABLE migrations. New fields just appear in new documents.
- Developer-friendly data model: Documents map directly to objects in code — no ORM impedance mismatch.
- Horizontal scaling (sharding): Distribute data across many servers natively — harder to do with SQL.
- Rich query language: Supports ad hoc queries, aggregation pipeline, full-text search, geospatial queries, and time-series.
- High write throughput: WiredTiger's document-level locking allows far more concurrent writes than table-level locking.
- Native JSON ecosystem: Perfect for Node.js, REST APIs — data flows end-to-end without transformation.
- Built-in replication: Replica sets provide automatic failover and read scaling out of the box.
- No native JOINs:
$lookupexists but is slower than SQL JOINs — not ideal for relational data. - 16MB document limit: Can't store large files directly (requires GridFS workaround).
- Data duplication: Embedding denormalises data — updates to shared data require updating multiple documents.
- Memory-hungry: WiredTiger cache defaults to 50% of RAM — needs generous memory allocation.
- Multi-document transactions add overhead: Available since 4.0, but they're slower than single-document operations.
- No declarative referential integrity: No foreign key constraints — your application must maintain consistency.
"MongoDB's main advantages are flexible schema evolution, natural JSON data model for JavaScript apps, built-in horizontal scaling via sharding, and high write throughput from document-level locking. The downsides: it doesn't enforce referential integrity, multi-document JOINs are expensive, and denormalization means update anomalies require updating multiple documents. The right choice depends on whether your data is document-shaped and your access patterns favour reads-by-document over complex joins."
- Schema evolves rapidly: SaaS products early-stage, product catalogs with variable attributes
- Data is hierarchical: Orders with line items, blog posts with comments, user profiles with nested settings
- High write throughput: IoT telemetry, event logging, activity feeds, clickstream data
- Horizontal scale is a requirement: Multi-region, petabyte-scale datasets
- JavaScript / Node.js stack: JSON end-to-end — no serialisation overhead
- Geospatial queries: Location-based apps, maps, proximity searches
- Data is highly relational: Financial systems, ERP, complex many-to-many relationships
- Strict ACID across many entities: Banking, payments, inventory management
- Complex reporting and analytics: Ad hoc SQL queries, BI tools, data warehouses
- Referential integrity is non-negotiable: Foreign keys enforced at DB level
- Legacy system integration: Most enterprise software speaks SQL
"I choose MongoDB when the data is document-shaped, the schema evolves frequently, or the access pattern is 'give me this whole entity.' I choose PostgreSQL or MySQL when data is highly relational with many foreign keys, when I need strict ACID guarantees across multiple entities, or when complex reporting queries are the primary workload. In practice most production systems use both — MongoDB for operational workloads and a relational DB for transactional data."
WiredTiger is MongoDB's default (and recommended) storage engine. It replaced the older MMAPv1 engine and brought massive improvements in concurrency, compression, and performance.
| Feature | Detail | Why It Matters |
|---|---|---|
| Document-level concurrency | MVCC (Multi-Version Concurrency Control) — each write gets its own version | Multiple writers can work simultaneously without blocking each other |
| Compression | Snappy (default), zlib, or zstd for data; prefix for indexes | Reduces disk usage by 50–80% — also reduces I/O |
| Cache | Internal cache: 50% of (RAM − 1GB) by default, min 256MB | Keeps hot data in memory for fast reads |
| Write-ahead log (journal) | All writes go to the journal before being applied | Crash recovery — no data loss on sudden shutdown |
| Checkpoints | Every 60 seconds, WiredTiger flushes a consistent snapshot to disk | Limits how far back journal replay needs to go after a crash |
// Check the storage engine in use db.serverStatus().storageEngine // { name: "wiredTiger", ... } // Check WiredTiger cache statistics db.serverStatus().wiredTiger.cache // Check compression (collection stats) db.runCommand({ collStats: "users" }) // Shows: storageSize (compressed), totalIndexSize, wiredTiger.creationString
"WiredTiger has been MongoDB's default storage engine since 3.2. Its key advantages are document-level concurrency via MVCC — readers and writers don't block each other — and compression, which cuts disk usage significantly. Data flows from insert → journal (for durability) → in-memory cache (dirty page) → disk (at checkpoint every 60s). The internal cache defaults to 50% of RAM, so MongoDB is memory-hungry by design. If you're memory-constrained, tune wiredTigerCacheSizeGB in your config."
Think of CRUD as a four-phrase vocabulary: Create (insert), Read (find), Update (update/replace), Delete (delete). Every operation takes a filter (which documents to target) and optionally a payload (what to do). Get that pattern locked in and every method becomes predictable.
MongoDB is like a librarian. Insert = add a new book to the shelf. Find = "show me all books by author X published after 2020." Update = go to specific books and change their status to "checked out." Delete = remove books from the shelf. The filter is your search criteria — the more precise it is, the less work the librarian does.
{} matches all documents. Always double-check your filter before running updateMany or deleteMany in production — a wrong filter is not easily undone.
// Insert a single document db.users.insertOne({ name: "Alice", email: "alice@example.com", age: 30, createdAt: new Date() }); // Returns: { acknowledged: true, insertedId: ObjectId("...") } // Insert multiple documents in one round-trip db.users.insertMany([ { name: "Bob", email: "bob@example.com", age: 25 }, { name: "Charlie", email: "charlie@example.com", age: 35 } ], { ordered: false }); // ordered:false = continue on partial error
// Find one document matching filter db.users.findOne({ email: "alice@example.com" }); // Find all documents matching filter db.users.find({ age: { $gte: 25 } }) .sort({ name: 1 }) // 1 = ascending, -1 = descending .limit(10) .skip(0); // Count matching documents db.users.countDocuments({ age: { $gte: 25 } });
// Update one matching document db.users.updateOne( { email: "alice@example.com" }, // filter { $set: { age: 31, updatedAt: new Date() } } // update ); // Update all matching documents db.users.updateMany( { age: { $lt: 18 } }, { $set: { isMinor: true } } );
// Delete one document db.users.deleteOne({ email: "bob@example.com" }); // Delete all matching documents db.users.deleteMany({ isActive: false }); // ⚠ deleteMany({}) deletes EVERY document in the collection!
"MongoDB's CRUD maps to four method families: insertOne/insertMany for create, find/findOne for read, updateOne/updateMany/replaceOne for update, and deleteOne/deleteMany for delete. Every write operation takes a filter as its first argument — which documents to target — and all of them return an acknowledgement result. The key gotcha is always reviewing your filter before deleteMany or updateMany in production."
find() cursor work? What methods can you chain on it?find() does not immediately return all documents — it returns a cursor: a pointer to the result set on the server. Documents are fetched in batches (default 101 documents or 16MB, whichever comes first) as you iterate. This is critical for memory efficiency with large result sets.
A cursor is like a bookmark in a very long book. MongoDB holds the page open on the server. You ask for pages one at a time — you don't print the whole book at once. If you don't read the cursor within 10 minutes (default cursor timeout), MongoDB closes it automatically.
| Method | What it does | Example |
|---|---|---|
| .sort() | Order results by field(s). 1=asc, -1=desc | .sort({ age: -1 }) |
| .limit() | Cap the number of documents returned | .limit(20) |
| .skip() | Skip N documents (for offset pagination) | .skip(40) |
| .projection() | Include/exclude fields in the result | .project({ name:1, _id:0 }) |
| .count() | Return count of matching documents | .count() |
| .toArray() | Exhaust cursor into an in-memory array | await cursor.toArray() |
| .forEach() | Iterate documents one at a time | await cursor.forEach(fn) |
| .explain() | Show query execution plan | .explain("executionStats") |
const cursor = db.collection('users').find({ age: { $gte: 18 } }); // Pattern 1: toArray() — loads all into memory (fine for small sets) const users = await cursor.toArray(); // Pattern 2: async iterator — streams one at a time (large result sets) for await (const user of cursor) { processUser(user); // handle each document without loading all into memory } // Always close the cursor when done if not exhausted await cursor.close();
.skip(N) scans and discards N documents before returning results. For page 100 with 20 items per page, MongoDB scans 2,000 documents and throws away 1,980. For deep pagination, use cursor-based pagination — filter by the last seen _id instead.
"find() returns a cursor — a lazy pointer to the result set — not the documents themselves. Documents are batched from the server as you iterate. You chain methods on the cursor before iterating: sort, limit, skip, project. In Node.js, use toArray() for small sets or an async for-await loop for large sets to avoid loading everything into memory. One important gotcha: skip() degrades at scale — prefer cursor-based pagination using the last seen _id."
updateOne, updateMany, and replaceOne?| Method | Scope | Payload | Key Behaviour |
|---|---|---|---|
| updateOne | First matching document only | Update operators ($set, $inc…) |
Modifies specific fields — untouched fields remain |
| updateMany | All matching documents | Update operators | Same as updateOne but applies to every match — not atomic across documents |
| replaceOne | First matching document only | A full replacement document (no operators) | Overwrites the entire document except _id |
// Original document: // { _id: 1, name: "Alice", email: "alice@x.com", age: 30, role: "admin" } // updateOne — ONLY changes age, everything else untouched db.users.updateOne( { _id: 1 }, { $set: { age: 31 } } ); // Result: { _id: 1, name: "Alice", email: "alice@x.com", age: 31, role: "admin" } // replaceOne — ENTIRE document is replaced (only _id survives) db.users.replaceOne( { _id: 1 }, { name: "Alice", age: 31 } // no $set — this IS the new document ); // Result: { _id: 1, name: "Alice", age: 31 } ← email and role are GONE!
updateOne without an operator like $set, MongoDB will throw an error in newer versions — but in older versions it could silently replace the document. Always use update operators with updateOne/updateMany.
"The key distinction is payload type. updateOne and updateMany take update operator expressions like $set — they surgically modify specific fields while leaving the rest untouched. replaceOne takes a plain document — it overwrites the entire document, keeping only the _id. Use replaceOne when you want to reset a document to a completely new state; use updateOne when you want to change specific fields only."
$set, $unset, $inc, $push, $pull, $addToSet.// $set — add or overwrite specific fields db.products.updateOne({ _id: 1 }, { $set: { price: 29.99, updatedAt: new Date() } }); // $unset — remove a field entirely (value is irrelevant, use "") db.products.updateOne({ _id: 1 }, { $unset: { legacyField: "" } }); // $inc — atomically increment (or decrement with negative value) db.products.updateOne({ _id: 1 }, { $inc: { stock: -1, soldCount: 1 } // decrement stock, increment sales }); // $rename — rename a field db.users.updateMany({}, { $rename: { fullname: "name" } });
// $push — append an element to an array (allows duplicates) db.posts.updateOne({ _id: 1 }, { $push: { tags: "mongodb" } }); // $push with $each — append multiple elements at once db.posts.updateOne({ _id: 1 }, { $push: { tags: { $each: ["nosql", "database"] } } }); // $addToSet — append only if not already present (no duplicates) db.posts.updateOne({ _id: 1 }, { $addToSet: { tags: "mongodb" } // ignored if "mongodb" already exists }); // $pull — remove all elements matching a condition db.posts.updateOne({ _id: 1 }, { $pull: { tags: "nosql" } }); // $pop — remove first (-1) or last (1) element of an array db.posts.updateOne({ _id: 1 }, { $pop: { tags: -1 } // remove first element });
"Update operators are the verbs inside MongoDB write operations. $set and $unset add or remove fields. $inc is the go-to for counters — it's atomic, so two concurrent increments won't collide. For arrays: $push appends (allows duplicates), $addToSet appends only if unique — ideal for tag lists. $pull removes matching elements. Multiple operators can be combined in one update call, and the whole thing is atomic at the single-document level."
$eq, $ne, $gt, $gte, $lt, $lte, $in, $nin.| Operator | Meaning | SQL Equivalent |
|---|---|---|
| $eq | Equal to (default when no operator given) | = value |
| $ne | Not equal to | <> value |
| $gt | Greater than | > value |
| $gte | Greater than or equal to | >= value |
| $lt | Less than | < value |
| $lte | Less than or equal to | <= value |
| $in | Matches any value in an array | IN (a, b, c) |
| $nin | Does not match any value in an array | NOT IN (a, b, c) |
// $eq — explicit (same as shorthand) db.users.find({ age: { $eq: 30 } }); db.users.find({ age: 30 }); // shorthand — same result // $gt / $lte — range query (users aged 18–65) db.users.find({ age: { $gte: 18, $lte: 65 } }); // $in — match multiple values (like SQL IN) db.users.find({ status: { $in: ["active", "pending"] } }); // $nin — exclude values db.users.find({ role: { $nin: ["admin", "superuser"] } }); // $ne — not equal db.orders.find({ status: { $ne: "cancelled" } }); // Combined — products priced between $10 and $50, not out of stock db.products.find({ price: { $gte: 10, $lte: 50 }, status: { $ne: "out_of_stock" } });
$in is extremely useful for fetching multiple documents by ID in one query — much faster than looping and calling findOne per ID.
db.users.find({ _id: { $in: [id1, id2, id3] } })
"Comparison operators are used inside filter documents to express conditions. { age: 30 } is shorthand for { age: { $eq: 30 } }. Range queries combine operators: { age: { $gte: 18, $lte: 65 } }. $in is the MongoDB equivalent of SQL's IN and is very efficient when targeting a set of known values — especially useful for batch ID lookups."
$and, $or, $not, and $nor.| Operator | Meaning | SQL Equivalent |
|---|---|---|
| $and | All conditions must be true | WHERE A AND B |
| $or | At least one condition must be true | WHERE A OR B |
| $not | Inverts the condition (wraps a single operator expression) | WHERE NOT A |
| $nor | None of the conditions must be true | WHERE NOT (A OR B) |
// $and — implicit (comma-separated conditions default to AND) db.users.find({ age: { $gte: 18 }, isActive: true }); // implicit $and // $and — explicit (required when same field appears multiple times) db.products.find({ $and: [ { price: { $gte: 10 } }, { price: { $lte: 50 } } ] }); // Shorthand: { price: { $gte: 10, $lte: 50 } } (same result) // $or — match users who are admins OR have premium subscription db.users.find({ $or: [ { role: "admin" }, { subscription: "premium" } ] }); // $not — users whose age is NOT less than 18 db.users.find({ age: { $not: { $lt: 18 } } }); // Note: $not wraps an operator expression, not a plain value // $nor — exclude both banned and suspended users db.users.find({ $nor: [ { status: "banned" }, { status: "suspended" } ] });
$and and you don't need the explicit operator. You need explicit $and only when you have two conditions on the same field that can't be merged — for example when using two different operators on the same field that conflict.
"Logical operators combine multiple conditions. Comma-separated conditions in a filter are already an implicit $and. You use explicit $and only when you need two separate conditions on the same field. $or is the most common — use it when any one of several conditions should match. $not wraps an operator expression (not a value), and $nor is essentially NOT (A OR B) — neither condition can match."
// Sample document: { tags: ["mongodb", "nosql", "database"], scores: [85, 92, 78] } // Match documents where array CONTAINS a value db.posts.find({ tags: "mongodb" }); // finds docs where tags array includes "mongodb" // Match documents where array equals EXACTLY this array (same order) db.posts.find({ tags: ["mongodb", "nosql"] }); // exact match, order matters // $all — array must contain ALL of these values (any order) db.posts.find({ tags: { $all: ["mongodb", "nosql"] } }); // $size — array has exactly N elements db.posts.find({ tags: { $size: 3 } }); // $elemMatch — at least one element matches ALL conditions db.students.find({ scores: { $elemMatch: { $gte: 90, $lt: 100 } } });
// Sample document: { address: { city: "Chandigarh", zip: "160001" } } // Dot notation to query nested fields db.users.find({ "address.city": "Chandigarh" }); db.users.find({ "address.zip": { $in: ["160001", "160017"] } }); // Array of subdocuments — $elemMatch for subdoc conditions // Sample: { orders: [{ product: "Book", qty: 2 }, { product: "Pen", qty: 5 }] } db.users.find({ orders: { $elemMatch: { product: "Book", qty: { $gte: 2 } } } }); // Update a specific array element by index using dot notation db.users.updateOne( { _id: 1 }, { $set: { "address.city": "Delhi" } } );
$elemMatch, conditions on array subdocuments are evaluated independently — any element can satisfy one condition while a different element satisfies another. $elemMatch ensures a single element satisfies all conditions simultaneously. Always use $elemMatch when querying arrays of objects with multiple conditions.
"To query arrays: a plain equality filter like { tags: 'mongodb' } checks if the array contains that value. $all requires all listed values to be present. $elemMatch is key for arrays of subdocuments — it ensures all conditions match within a single array element. For nested documents, use dot notation: 'address.city': 'Chandigarh'. Dot notation also works in update operators: $set: { 'address.zip': '160001' }."
A projection is the second argument to find() that controls which fields are returned. Instead of fetching entire documents, you specify exactly which fields your application needs — reducing network payload and memory usage.
// INCLUSION mode (1) — only return specified fields + _id db.users.find({}, { name: 1, email: 1 }); // Returns: { _id: ..., name: "Alice", email: "alice@x.com" } // Suppress _id explicitly (the only field you can mix include + exclude) db.users.find({}, { name: 1, email: 1, _id: 0 }); // Returns: { name: "Alice", email: "alice@x.com" } // EXCLUSION mode (0) — return everything EXCEPT specified fields db.users.find({}, { password: 0, __v: 0 }); // Returns all fields except password and __v // Nested field projection with dot notation db.users.find({}, { "address.city": 1, name: 1 }); // Array element projection — return only first matching element db.posts.find( { tags: "mongodb" }, { "tags.$": 1 } // $ projects only the first matching array element );
_id. { name: 1, age: 0 } is invalid and throws an error. Pick one mode — include what you want, or exclude what you don't want.
Projections reduce the amount of data transferred from MongoDB to your application. For a collection with 50 fields per document, fetching only 3 needed fields can cut network traffic by 90%. Combined with an index, a projection that only requests indexed fields creates a covered query — the fastest possible read (no document access needed, answered entirely from the index).
"A projection is the second argument to find() — it controls what fields are returned. You're in either inclusion mode (specify what you want with 1) or exclusion mode (specify what you don't want with 0). You can't mix them except for _id, which can always be suppressed with _id: 0 even in inclusion mode. Projections matter for performance — they reduce network payload and are the key ingredient in covered queries."
Upsert is a combined update + insert operation. When { upsert: true } is passed as an option:
- If a document matches the filter → it is updated normally.
- If no document matches the filter → a new document is inserted using the filter fields as its initial values.
This eliminates the need for a read-then-write pattern, making it atomic and safe under concurrency.
// Upsert: update if exists, insert if not db.pageViews.updateOne( { url: "/home" }, // filter { $inc: { views: 1 }, $set: { lastSeen: new Date() } }, // update { upsert: true } // option ); // First time: inserts { url: "/home", views: 1, lastSeen: Date } // Every subsequent call: increments views by 1 // $setOnInsert — set fields ONLY on insert, not on update db.users.updateOne( { email: "alice@example.com" }, { $set: { lastLogin: new Date() }, $setOnInsert: { createdAt: new Date(), role: "user" } }, { upsert: true } );
- View/hit counters — increment counter, create if first visit
- User session tracking — update session, insert if new user
- Idempotent data sync — sync data from an external source without duplicates
- Shopping cart — update item quantity if in cart, add if not
- Config records — ensure a config document always exists with defaults
"Upsert is { upsert: true } on an update call. If the filter matches a document, it updates it. If not, it inserts a new document using the filter fields plus the update payload. It's atomic — no race condition between checking existence and inserting. The companion operator $setOnInsert lets you set fields like createdAt only during the insert path, leaving them untouched on subsequent updates."
findOneAndUpdate? How does it differ from updateOne, and when would you use it?| Method | Returns | When to use |
|---|---|---|
| updateOne | Result object: { matchedCount, modifiedCount } |
When you only care that the update happened — don't need the document back |
| findOneAndUpdate | The matched document itself (before or after update) | When you need the document AND want the update to happen atomically in one round-trip |
| findOneAndReplace | The matched document (before or after replace) | Same, but replaces the entire document |
| findOneAndDelete | The deleted document | Delete and get the document back atomically |
// Returns the document BEFORE the update (default) const before = await db.collection('orders').findOneAndUpdate( { status: "pending", assignedTo: null }, // filter { $set: { assignedTo: "worker-1", status: "processing" } }, { sort: { createdAt: 1 } } // pick oldest first ); // Returns the document AFTER the update const after = await db.collection('users').findOneAndUpdate( { email: "alice@example.com" }, { $inc: { loginCount: 1 } }, { returnDocument: "after", upsert: true } // returnDocument: "after" returns updated doc );
findOneAndUpdate is the backbone of atomic job claiming — grab the next available task and mark it as taken in a single atomic operation, with no risk of two workers claiming the same task:
async function claimNextJob(workerId) { return db.collection('jobs').findOneAndUpdate( { status: "queued" }, { $set: { status: "running", workerId, startedAt: new Date() } }, { sort: { priority: -1, createdAt: 1 }, returnDocument: "after" } ); // If null returned — no jobs available. Otherwise the claimed job. }
"findOneAndUpdate does what updateOne does but returns the document itself rather than just a result summary — and it does both in a single atomic operation. The key option is returnDocument: 'after' to get the post-update state. The classic use case is a job queue: find the next available job and mark it as claimed in one atomic call, preventing two workers from claiming the same task."
The aggregation pipeline is exactly what it sounds like: a series of stages that documents flow through, each stage transforming the data before passing it to the next. Think of it as a factory assembly line where raw materials (documents) enter one end and finished goods (results) exit the other.
Stage 1 ($match) is the loading dock — it only accepts materials that meet your specification, rejecting the rest. Stage 2 ($group) is the sorting machine — groups items by category and counts them. Stage 3 ($sort) is quality control — orders the output. The key insight: each stage gets smaller input than the previous one when you filter early. A factory that pre-sorts 1,000 items is slower than one that first rejects 900 at the loading dock and only sorts 100.
$match and $sort (on indexed fields) as early as possible. Stages that reduce document count ($match) or size ($project) should precede expensive stages ($group, $lookup).
The aggregation pipeline is a framework for data transformation and computation. Documents pass through a sequence of stages, each stage transforming the documents and passing the result to the next stage. It is analogous to Unix pipes: collection | stage1 | stage2 | stage3 → result.
| Capability | find() | Aggregation Pipeline |
|---|---|---|
| Filter documents | ✅ Yes (query) | ✅ Yes ($match) |
| Shape output | Limited (projection) | ✅ Full reshape ($project, $addFields) |
| Group & compute | ❌ No | ✅ Yes ($group + accumulators) |
| Join collections | ❌ No | ✅ Yes ($lookup) |
| Compute new fields | ❌ No | ✅ Yes ($addFields, $project expressions) |
| Sort & paginate | ✅ Yes (sort/skip/limit) | ✅ Yes ($sort/$skip/$limit stages) |
| Multi-pipeline output | ❌ No | ✅ Yes ($facet) |
| Stage | Purpose | SQL Analogue |
|---|---|---|
| $match | Filter documents | WHERE |
| $project | Reshape documents (include/exclude/compute fields) | SELECT |
| $group | Group by key and compute aggregates | GROUP BY |
| $sort | Sort documents | ORDER BY |
| $limit | Keep first N documents | LIMIT |
| $skip | Skip first N documents | OFFSET |
| $lookup | Left outer join from another collection | LEFT JOIN |
| $unwind | Deconstruct array — one doc per element | CROSS JOIN UNNEST |
| $addFields | Add/overwrite fields without replacing the document | SELECT *, new_col |
| $facet | Run multiple sub-pipelines on same input | No direct analogue |
db.orders.aggregate([ // Stage 1: filter — uses index on (status, createdAt) { $match: { status: "completed", createdAt: { $gte: new Date("2024-01-01") } } }, // Stage 2: group by customer, sum revenue { $group: { _id: "$customerId", total: { $sum: "$amount" }, count: { $sum: 1 } } }, // Stage 3: sort by total descending { $sort: { total: -1 } }, // Stage 4: top 5 customers only { $limit: 5 } ]);
"The aggregation pipeline is MongoDB's data transformation engine. Documents flow through a sequence of stages — each stage filters, reshapes, or computes data before passing results to the next. Unlike find(), which only filters and projects, the pipeline can group, join collections with $lookup, compute aggregates, and run multiple parallel sub-pipelines with $facet. It replaced the older MapReduce API, which was deprecated in MongoDB 5.0."
$match is the pipeline equivalent of a find() query filter. It accepts any valid MongoDB query predicate and keeps only documents that pass the condition. Operators like $eq, $gt, $in, $regex, and $text all work inside $match.
When $match is the first stage, MongoDB can use an index to scan only relevant documents, exactly like a find(). Placed after any other stage, MongoDB must process all documents through the preceding stages first — losing index access entirely.
$match on an indexed field at pipeline position [0] = index scan (fast). A $match at position [2] after $unwind = full scan of the unwound in-memory set with no index benefit.
// Simple equality { $match: { status: "active" } } // Range filter — uses compound index on (status, createdAt) { $match: { status: "completed", createdAt: { $gte: new Date("2024-01-01"), $lt: new Date("2025-01-01") } } } // $expr — compare two fields within the same document { $match: { $expr: { $gt: ["$revenue", "$cost"] } } } // Array contains element (uses multikey index) { $match: { tags: "featured" } } // $text search (requires text index) { $match: { $text: { $search: "mongodb aggregation" } } }
MongoDB automatically merges consecutive $match stages into a single match before execution — so splitting complex filters for readability has no performance cost.
// Two readable stages — optimizer merges into one: [{ $match: { status: "active" } }, { $match: { age: { $gte: 18 } } }] // Executed as: [{ $match: { status: "active", age: { $gte: 18 } } }]
"$match is the pipeline's filter stage — it accepts any query predicate and passes only matching documents downstream. The critical rule: put $match first so MongoDB can use an index at the collection level, just like find(). Placed later, $match operates on in-memory intermediate results with no index benefit. MongoDB's optimizer also merges adjacent $match stages automatically, so you can split complex matches for readability without a performance penalty."
$group groups input documents by a key (the _id field) and computes aggregate values over each group using accumulator operators. Every unique value of _id produces exactly one output document. Setting _id: null groups all documents into a single group — computing a grand total.
{ $group: {
_id: "<grouping expression>", // null = single group (grand total)
field1: { <accumulator>: "<expression>" },
field2: { <accumulator>: "<expression>" }
} }
// Example: revenue per product category
{ $group: {
_id: "$category",
totalRevenue: { $sum: "$price" },
avgPrice: { $avg: "$price" },
orderCount: { $sum: 1 }, // count docs in group
maxPrice: { $max: "$price" },
products: { $push: "$name" } // collect into array
} }
| Accumulator | Description | Example |
|---|---|---|
| $sum | Sums numeric values; use 1 to count documents | { $sum: "$amount" } or { $sum: 1 } |
| $avg | Arithmetic average of numeric values | { $avg: "$rating" } |
| $min | Minimum value in the group | { $min: "$price" } |
| $max | Maximum value in the group | { $max: "$price" } |
| $push | Collects all values into an array (allows duplicates) | { $push: "$tag" } |
| $addToSet | Collects unique values into an array (no duplicates) | { $addToSet: "$category" } |
| $first | First value in the group (meaningful after $sort) | { $first: "$createdAt" } |
| $last | Last value in the group (meaningful after $sort) | { $last: "$updatedAt" } |
| $count | Counts documents in group (MongoDB 5.0+) | { $count: {} } |
| $mergeObjects | Merges documents in the group into one object | { $mergeObjects: "$address" } |
// First group: daily totals per user // Second group: monthly total from daily buckets db.orders.aggregate([ { $match: { status: "completed" } }, { $group: { _id: { userId: "$userId", day: { $dateToString: { format: "%Y-%m-%d", date: "$createdAt" } } }, dailyTotal: { $sum: "$amount" } }}, { $group: { _id: "$_id.userId", monthlyTotal: { $sum: "$dailyTotal" }, activeDays: { $sum: 1 } }}, { $sort: { monthlyTotal: -1 } } ]);
$group is a blocking stage — it must process all input documents before emitting any output. For very large datasets, add { allowDiskUse: true } to spill to disk when the 100 MB in-memory limit is exceeded.
"$group collapses multiple documents into one per unique _id value, applying accumulator operators to compute aggregates per group. The _id field is the grouping key — set it to null to get a single grand total. The most-used accumulators are $sum (sum or count with 1), $avg, $min/$max, $push (array of all values), and $addToSet (unique values only). $group is a blocking stage — all input must be processed before any output is emitted — so use allowDiskUse: true for large aggregations."
$project is the pipeline's SELECT clause — it controls which fields appear in output documents. It can include fields (set to 1), exclude fields (set to 0), and compute new fields using expressions. Unlike find() projection, aggregation $project can create entirely new fields using arithmetic, string, date, or conditional expressions.
// Include specific fields (_id included by default) { $project: { name: 1, email: 1, createdAt: 1 } } // Exclude _id (common pattern) { $project: { _id: 0, name: 1, email: 1 } } // Compute a new field — fullName from first + last { $project: { fullName: { $concat: ["$firstName", " ", "$lastName"] }, isAdult: { $gte: ["$age", 18] }, discounted:{ $multiply: ["$price", 0.9] } } } // Rename a field { $project: { userId: "$_id", _id: 0 } } // Conditional field — $cond (ternary) { $project: { tier: { $cond: { if: { $gt: ["$spend", 1000] }, then: "gold", else: "standard" } } } }
| Stage | Behaviour | Use When |
|---|---|---|
| $project | Only fields explicitly listed survive — everything else is dropped | You want to select specific columns (like SELECT col1, col2) |
| $addFields | All existing fields pass through AND specified fields are added/overwritten | You want to add computed fields without dropping existing ones |
| $set | Alias for $addFields (MongoDB 4.2+) | Identical to $addFields — choose for readability |
$project early to strip unused fields reduces the memory footprint of documents flowing into later expensive stages (especially $lookup and $group). If each document shrinks from 5 KB to 500 bytes, a 1M-document pipeline uses 10× less memory mid-stream.
"$project is the pipeline's field selector and reshaper. Set fields to 1 to include, 0 to exclude, or to an expression to compute a new value. Unlike find() projection, it can create entirely new fields using MongoDB expressions — string concat, arithmetic, date math, conditionals. Use $addFields when you want to add computed fields without re-listing every existing field you want to keep."
$lookup performs a left outer join to another collection in the same database. For each input document, it fetches matching documents from the foreign collection and appends them as an array field. If no match exists, the array is empty — but the local document is still included (left outer join semantics).
// Join orders → users on orders.userId = users._id db.orders.aggregate([ { $lookup: { from: "users", // target collection localField: "userId", // field in orders foreignField: "_id", // field in users as: "user" // output array field }}, // Flatten single-element array → object { $unwind: { path: "$user", preserveNullAndEmptyArrays: true } } ]); // Result: { _id, amount, user: { name, email, ... } }
The pipeline form lets you filter, project, and transform joined documents before embedding — far more powerful than simple equality:
db.orders.aggregate([ { $lookup: { from: "products", let: { pid: "$productId", minQty: "$quantity" }, // expose local fields pipeline: [ { $match: { $expr: { $and: [ { $eq: ["$_id", "$$pid"] }, { $gte: ["$stock", "$$minQty"] } // only in-stock ]}}}, { $project: { name: 1, price: 1, _id: 0 } } ], as: "product" }} ]);
| Rule | Why It Matters |
|---|---|
| Index the foreignField | Without an index, $lookup does a full collection scan for each input doc — O(n×m) complexity |
| $match before $lookup | Reduce input set first — join fewer documents, not all of them |
| $project before $lookup | Strip large fields from input docs so joined documents consume less memory |
| Same database only | $lookup cannot join across databases; use Atlas Data Federation for cross-db joins |
$lookup runs a sub-query against the foreign collection per input document. 10,000 input documents = up to 10,000 sub-queries. Always filter before joining and index the foreign join field.
"$lookup performs a left outer join — for each input document, it queries a foreign collection and embeds matching documents as an array. The simple form uses localField/foreignField for equality joins. The pipeline form lets you pass local variables via 'let' and filter/project the joined documents before embedding — much more flexible. Key constraints: both collections must be in the same database, index the foreignField, and $match before $lookup to avoid N×M scan complexity."
$unwind deconstructs an array field — it outputs one document per array element, each with the same fields as the original but with the array field replaced by a single element. An order with 3 line items becomes 3 documents after $unwind: "$items".
// Input: { _id: 1, items: ["book", "pen", "notebook"] } { $unwind: "$items" } // Output — 3 documents: // { _id: 1, items: "book" } // { _id: 1, items: "pen" } // { _id: 1, items: "notebook" } // Full options form { $unwind: { path: "$items", includeArrayIndex: "itemIndex", // adds 0-based position field preserveNullAndEmptyArrays: true // keep doc if array is [] or null } }
Unwind normalises an embedded array into rows, then group re-aggregates. Example — count how many products carry each tag:
db.products.aggregate([ { $unwind: "$tags" }, // one doc per tag { $group: { _id: "$tags", count: { $sum: 1 } } }, // count per tag { $sort: { count: -1 } }, { $limit: 10 } ]);
| Gotcha | What Happens | Fix |
|---|---|---|
| Empty array [] | Document is silently dropped from output | preserveNullAndEmptyArrays: true |
| Null or missing field | Document is silently dropped from output | preserveNullAndEmptyArrays: true |
| Document explosion | Array with 1,000 elements = 1,000 output docs; blows up memory | $match early to reduce input; $project to trim array before unwinding |
"$unwind deconstructs an array field into separate documents — one per array element. It's how you pivot embedded arrays into rows you can group or filter on. The classic pattern is $unwind then $group: normalize the array, then re-aggregate. Key gotchas: documents with null or empty arrays are silently dropped unless you set preserveNullAndEmptyArrays: true. Also watch for document explosion — a large array multiplies your document count significantly, so $match early to limit the input set before unwinding."
$facet processes input documents through multiple independent sub-pipelines simultaneously, collecting each sub-pipeline's output into a named array field. All sub-pipelines receive the same input — so you get multiple aggregation results from a single collection scan.
const [result] = await db.collection('products').aggregate([ // Filter once — same set flows into ALL facets { $match: { category: "electronics", inStock: true } }, { $facet: { // Paginated product list products: [ { $sort: { price: 1 } }, { $skip: 0 }, { $limit: 20 }, { $project: { name: 1, price: 1, thumbnail: 1 } } ], // Total count for pagination controls totalCount: [ { $count: "total" } ], // Brand breakdown for sidebar filter byBrand: [ { $group: { _id: "$brand", count: { $sum: 1 } } }, { $sort: { count: -1 } } ], // Price histogram for range slider priceRanges: [ { $bucket: { groupBy: "$price", boundaries: [0, 50, 100, 250, 500, 1000], default: "1000+", output: { count: { $sum: 1 } } }} ] }} ]).toArray(); // result.products → 20 products for page 1 // result.totalCount → [{ total: 342 }] // result.byBrand → [{ _id: "Apple", count: 45 }, ...] // result.priceRanges → [{ _id: 0, count: 12 }, { _id: 50, count: 38 }, ...]
| Limitation | Detail |
|---|---|
| No $out or $merge inside | Sub-pipelines cannot write to collections |
| No nested $facet | Sub-pipelines cannot themselves contain $facet stages |
| 100 MB memory limit | Each sub-pipeline subject to the limit; use allowDiskUse: true for large inputs |
| Input buffered once | The matched set is held in memory before fanning out to sub-pipelines |
"$facet lets you run multiple independent sub-pipelines on the same input simultaneously. The key benefit: you $match once, then compute a product page + total count + sidebar filters in a single round-trip. Without $facet, that's 3–4 separate queries. The canonical use case is a search results page — one $facet call returns the product list, total for pagination, category breakdown, and price histogram all at once."
| Rule | Why | Example |
|---|---|---|
| 1. $match first | Enables index use at collection level; reduces docs flowing into later stages | Date/status filters before $group or $lookup |
| 2. $sort early on indexed fields | A $sort on an indexed field before $limit uses a top-K sort; avoids full in-memory sort | { $sort: { createdAt: -1 } } then { $limit: 100 } |
| 3. $project early to trim docs | Smaller documents = less memory in $lookup and $group stages | Drop unused fields before $lookup |
| 4. Index the $lookup foreignField | Without index, each $lookup does a full foreign collection scan — O(n×m) | Ensure users._id is indexed (it always is — use _id) |
| 5. allowDiskUse for large sets | $group and $sort are limited to 100 MB RAM; disk spill avoids OOM | db.orders.aggregate([...], { allowDiskUse: true }) |
| Optimization | What Happens |
|---|---|
| Adjacent $match merging | Two consecutive $match stages merged into one |
| $sort + $limit coalescence | $sort followed by $limit uses a top-K heap (keeps only K docs in memory) |
| $match pushdown after $lookup | A $match immediately after $lookup may be pushed into the $lookup pipeline to filter joined docs earlier |
| $project + $match reorder | $match predicates on unmodified fields can be moved before $project |
db.orders.explain("executionStats").aggregate([ { $match: { status: "completed", createdAt: { $gte: new Date("2024-01-01") } } }, { $group: { _id: "$customerId", total: { $sum: "$amount" } } } ]); // Key things to look for in output: // winningPlan.inputStage.stage → "IXSCAN" (good) or "COLLSCAN" (bad) // executionStats.nReturned vs totalDocsExamined // → ratio near 1.0 = efficient; large gap = missing index
"The most impactful optimisation is $match first so MongoDB uses an index — without this, every stage processes the entire collection. Then project away unused fields before $lookup to reduce memory per document. Index the foreignField used in $lookup or each lookup becomes a full collection scan. For large aggregations, add allowDiskUse: true to prevent 100 MB memory limit errors. Use explain('executionStats') to verify IXSCAN vs COLLSCAN and check nReturned vs totalDocsExamined."
const page = 3; const pageSize = 20; db.collection('products').aggregate([ { $match: { category: "books" } }, { $sort: { createdAt: -1 } }, { $skip: (page - 1) * pageSize }, // skip first (page-1)*pageSize docs { $limit: pageSize } ]);
const [result] = await db.collection('products').aggregate([ { $match: { category: "books" } }, { $sort: { createdAt: -1 } }, { $facet: { data: [{ $skip: 40 }, { $limit: 20 }], // page 3 total: [{ $count: "count" }] }} ]).toArray(); const total = result.total[0]?.count ?? 0; const totalPages = Math.ceil(total / 20); const data = result.data;
// Use the last doc's indexed fields as the cursor db.collection('products').aggregate([ { $match: { category: "books", $or: [ { createdAt: { $lt: lastDate } }, { createdAt: lastDate, _id: { $lt: lastId } } // tie-break ] }}, { $sort: { createdAt: -1, _id: -1 } }, { $limit: 20 } ]);
| Method | Pros | Cons | Best For |
|---|---|---|---|
| $skip + $limit | Simple; supports random page access | Slow on large offsets — scans and discards skipped docs | Small datasets, admin UIs |
| $facet + $count | Page + total in one round-trip | Same $skip performance issue; buffers all matched docs | UI requiring total page count |
| Keyset / Cursor | O(log n) via index; consistent across inserts/deletes | No random page access; more complex query | Infinite scroll, high-volume APIs |
"There are three patterns. $skip + $limit is simplest but degrades on large offsets — page 100 at 20 per page means scanning 1,980 docs to discard them. Use $facet to get data and total count in one round-trip for paginated UIs. For high-volume APIs or infinite scroll, cursor-based pagination using a keyset on indexed fields is the only O(log n) approach — it doesn't degrade as the dataset grows and is stable when documents are inserted between pages."
Given an orders collection where each document has a status, createdAt date, and an embedded items array (productId, name, category, qty, price) — build a report showing: total revenue per category per month, unique order count, average order value, and the top 3 products by revenue in each category.
{
_id: ObjectId("..."),
status: "completed",
createdAt: new Date("2024-03-15"),
items: [
{ productId: "p-1", name: "Laptop", category: "Electronics", qty: 1, price: 999 },
{ productId: "p-2", name: "Mouse", category: "Electronics", qty: 2, price: 29 }
]
}
const report = await db.collection('orders').aggregate([ // 1. Filter — uses index on (status, createdAt) { $match: { status: "completed", createdAt: { $gte: new Date("2024-01-01"), $lt: new Date("2025-01-01") } }}, // 2. Extract month from createdAt { $addFields: { month: { $dateToString: { format: "%Y-%m", date: "$createdAt" } } }}, // 3. Unwind items — one doc per line item { $unwind: "$items" }, // 4. Compute line-item revenue (qty × price) { $addFields: { "items.revenue": { $multiply: ["$items.qty", "$items.price"] } }}, // 5. Group by month + category { $group: { _id: { month: "$month", category: "$items.category" }, totalRevenue: { $sum: "$items.revenue" }, orderIds: { $addToSet: "$_id" }, // unique orders products: { $push: { name: "$items.name", id: "$items.productId", revenue: "$items.revenue" }} }}, // 6. Derive counts and averages from grouped data { $addFields: { orderCount: { $size: "$orderIds" }, avgOrderValue: { $divide: ["$totalRevenue", { $size: "$orderIds" }] } }}, // 7. Sort products array and take top 3 { $addFields: { top3Products: { $slice: [ { $sortArray: { input: "$products", sortBy: { revenue: -1 } } }, 3 ] } }}, // 8. Shape the final output { $project: { _id: 0, month: "$_id.month", category: "$_id.category", totalRevenue: { $round: ["$totalRevenue", 2] }, orderCount: 1, avgOrderValue: 1, top3Products: 1 }}, // 9. Final sort — chronological, highest revenue first within month { $sort: { month: 1, totalRevenue: -1 } } ], { allowDiskUse: true }).toArray();
| Stage | Input → Output | Why Here |
|---|---|---|
| $match | All orders → Q1 completed only | Uses index; reduces volume first |
| $addFields (month) | Orders → + month string | Extract grouping key from date before unwinding |
| $unwind | Orders with items[] → one doc per line item | Normalize to compute per-item revenue |
| $addFields (revenue) | Line items → + revenue field | Compute qty × price before grouping |
| $group | Line items → one doc per month + category | Aggregate revenue; collect products array |
| $addFields (counts) | Grouped docs → + orderCount, avgOrderValue | Derive metrics from the orderIds set |
| $addFields (top3) | Grouped docs → + top3Products | $sortArray + $slice sorts and trims in-place |
| $project | Enriched docs → clean output shape | Remove internal fields, rename _id compound |
| $sort | Clean docs → ordered results | Final presentation order |
"I'd walk through it stage by stage: $match first with indexed fields, then $addFields to extract the month, then $unwind to explode the items array so I can compute per-line revenue. Then $group by month + category — collecting unique order IDs as a set for accurate counting and $push to collect product data. Then $addFields to derive order count and average from the set. Then $addFields again using $sortArray + $slice to get the top 3 products by revenue. Finally $project to clean the output shape and $sort for presentation. allowDiskUse: true handles the $group stage on large datasets."
An index is a separate B-tree data structure that MongoDB maintains alongside your collection. Think of a book's index at the back: instead of reading every page to find "authentication," you jump straight to page 47. Without an index, MongoDB reads every document (COLLSCAN). With an index, it jumps to the matching entries (IXSCAN) and follows pointers back to the full documents.
A collection without indexes is a phone book with names in random order — to find "Smith, John" you read every entry. An index on lastName, firstName is the alphabetically sorted phone book — you jump to S, then to SM, then to Smith in milliseconds. A compound index is a phone book sorted by city first, then by last name — perfect if you always search by city, but useless if you only search by last name.
| Type | Created When | Use Case |
|---|---|---|
| Single Field | db.col.createIndex({ field: 1 }) | Queries filtering or sorting on one field |
| Compound | db.col.createIndex({ a: 1, b: -1 }) | Queries filtering/sorting on multiple fields together |
| Multikey | Auto-created when indexed field is an array | Queries on array fields (tags, categories) |
| Text | db.col.createIndex({ bio: "text" }) | Full-text keyword search with stemming and stop-word removal |
| Geospatial 2dsphere | db.col.createIndex({ loc: "2dsphere" }) | Geo queries — $near, $geoWithin on GeoJSON data |
| Hashed | db.col.createIndex({ _id: "hashed" }) | Even distribution for hashed shard keys; equality queries only |
| TTL | db.col.createIndex({ createdAt: 1 }, { expireAfterSeconds: N }) | Auto-delete documents after N seconds (logs, sessions, caches) |
| Partial | db.col.createIndex({ field: 1 }, { partialFilterExpression: {...} }) | Index only a subset of documents — smaller, faster index |
| Sparse | db.col.createIndex({ field: 1 }, { sparse: true }) | Index only documents where the field exists (skip nulls) |
| Wildcard | db.col.createIndex({ "$**": 1 }) | Dynamic/unpredictable fields (CMS content, user-defined attributes) |
| Clustered | db.createCollection("col", { clusteredIndex: {...} }) | Documents stored in _id order on disk — range scans on _id are faster |
Every MongoDB collection has a unique index on _id created automatically. It cannot be dropped. ObjectIds encode creation time, so the default _id index also gives you a free chronological sort — db.col.find().sort({ _id: -1 }) returns most-recent-first without a separate createdAt index.
// Create — rolling background build in production db.users.createIndex({ email: 1 }, { unique: true, name: "email_unique" }) // List all indexes on a collection db.users.getIndexes() // Get index sizes db.users.stats().indexSizes // Drop a specific index db.users.dropIndex("email_unique") // Hide an index (test impact before dropping — MongoDB 4.4+) db.runCommand({ collMod: "users", index: { name: "email_unique", hidden: true } })
"MongoDB supports single-field, compound, multikey (auto-created for array fields), text, geospatial 2dsphere, hashed, TTL, partial, sparse, and wildcard indexes. The most commonly used in production are compound indexes (to serve complex queries), TTL for automatic data expiry, partial indexes to index only a relevant subset, and text indexes for keyword search. Every collection gets a unique _id index automatically. In Atlas, Atlas Search provides far more powerful full-text capabilities than the native text index."
A compound index indexes multiple fields together in a single B-tree, sorted by the first field, then within ties sorted by the second field, and so on. One compound index can serve queries that filter on any left-anchored prefix of the indexed fields — this is the prefix rule.
// Index: { status: 1, customerId: 1, createdAt: -1 } db.orders.createIndex({ status: 1, customerId: 1, createdAt: -1 }) // ✅ Uses index — matches prefix (status) db.orders.find({ status: "completed" }) // ✅ Uses index — matches prefix (status, customerId) db.orders.find({ status: "completed", customerId: "c-1" }) // ✅ Uses index — full prefix match db.orders.find({ status: "completed", customerId: "c-1" }).sort({ createdAt: -1 }) // ❌ Does NOT use index — skips status (first field) db.orders.find({ customerId: "c-1" })
When designing compound indexes, follow ESR: Equality → Sort → Range. Fields with equality predicates come first (highest selectivity, most nodes eliminated), followed by sort fields (to avoid in-memory sort), followed by range predicates (last because they scan a range of values).
// Query: { status: "active", createdAt: { $gte: lastWeek } }, sort: { score: -1 } // ❌ Wrong order — range before sort db.users.createIndex({ status: 1, createdAt: 1, score: -1 }) // ✅ ESR order — equality(status), sort(score), range(createdAt) db.users.createIndex({ status: 1, score: -1, createdAt: 1 }) // Result: index provides sort order (no in-memory sort) and covers range
For a single-field index, direction (1 vs -1) doesn't matter — MongoDB can traverse a B-tree in either direction. For compound indexes, direction matters when the combination of sort orders doesn't match the index direction. { a: 1, b: 1 } supports sort({ a: 1, b: 1 }) and sort({ a: -1, b: -1 }) but NOT sort({ a: 1, b: -1 }).
"A compound index stores multiple fields in a single B-tree sorted left-to-right. The prefix rule means any query that matches the leftmost fields can use the index — but skipping the first field makes the index useless. When designing field order, follow the ESR rule: put equality predicates first (most selective, most nodes eliminated), then sort fields (avoids expensive in-memory sort), then range predicates last. This ordering lets the index both filter AND provide sort order for free."
A covered query is one where MongoDB can satisfy the query entirely from the index — without fetching the actual documents from disk. This is the fastest possible query because it never touches the collection; it reads only the index B-tree. A query is covered when every field in the query filter and projection exists in the index.
// Index contains: status, customerId, amount db.orders.createIndex({ status: 1, customerId: 1, amount: 1 }) // ✅ COVERED — filter and projection only use indexed fields // _id must be excluded (it's not in this index) db.orders.find( { status: "completed", customerId: "c-1" }, // filter { amount: 1, _id: 0 } // projection — exclude _id! ) // ❌ NOT covered — requests a field not in the index db.orders.find( { status: "completed" }, { amount: 1, createdAt: 1, _id: 0 } // createdAt not in index → fetch doc )
db.orders.find( { status: "completed", customerId: "c-1" }, { amount: 1, _id: 0 } ).explain("executionStats") // Look for in executionStats: // totalDocsExamined: 0 ← key indicator — no documents fetched // totalKeysExamined: N ← only index keys scanned // stage: "PROJECTION_COVERED" ← confirms covered execution // If totalDocsExamined > 0 → NOT covered → check if projection // requests any field outside the index (including _id by default)
_id is returned by default in every find(). If _id is not in your compound index, the query cannot be covered unless you explicitly exclude it with { _id: 0 }. This is the most common reason a query that looks covered is actually not.
"A covered query is one MongoDB satisfies entirely from the index, without touching the underlying documents — totalDocsExamined is 0 in explain(). To create one, every field in the filter and projection must be in the index. The classic gotcha is _id: MongoDB returns _id by default, and if _id isn't in your compound index, you must explicitly project it out with { _id: 0 } to achieve coverage. Covered queries are dramatically faster because they avoid disk I/O to the collection entirely."
| Mode | Returns | Use When |
|---|---|---|
| queryPlanner (default) | Winning plan chosen by the planner — does NOT run the query | Fast check: which index would be used? |
| executionStats | Winning plan + actual execution metrics (docs examined, time) | Diagnose slow queries in dev/staging |
| allPlansExecution | All candidate plans + their execution stats | Deep analysis of why the planner chose plan A over plan B |
const exp = db.orders.find({ status: "completed" }).explain("executionStats") // ── queryPlanner section ── exp.queryPlanner.winningPlan.stage // "COLLSCAN" → no index used (bad for large collections) // "IXSCAN" → index scan (good) // "FETCH" → fetched docs from collection after IXSCAN // "PROJECTION_COVERED" → served entirely from index (best) // "SORT" → in-memory sort (potential performance issue) exp.queryPlanner.winningPlan.inputStage.indexName // Which index was selected // ── executionStats section ── exp.executionStats.nReturned // docs returned to caller exp.executionStats.totalDocsExamined // docs fetched from collection exp.executionStats.totalKeysExamined // index entries scanned exp.executionStats.executionTimeMillis // total execution time // Efficiency ratios to check: // nReturned / totalDocsExamined → close to 1.0 = efficient // nReturned / totalKeysExamined → close to 1.0 = good index selectivity // totalDocsExamined = 0 → covered query (best possible)
| Symptom in explain() | Root Cause | Fix |
|---|---|---|
| stage: COLLSCAN | No usable index for this query | Create an index matching the filter fields |
| totalDocsExamined >> nReturned | Index has low selectivity — scanning many docs to return few | Add more fields to the index to narrow the scan |
| stage: SORT (in-memory) | Sort field not in index or index direction mismatch | Add sort field to compound index (ESR rule) |
| stage: FETCH after IXSCAN | Projection requests fields not in the index | Add those fields to index for coverage, or accept the fetch |
| executionTimeMillis high despite IXSCAN | Index is large, or document fetch is slow (large docs) | $project to reduce doc size; consider covered query |
.hint({ field: 1 }) to force a specific index, then re-run explain() to compare execution stats. Never use hint in production code without measuring — the planner usually knows best.
"explain('executionStats') is the primary tool. The two most important things to check are: the stage — IXSCAN is good, COLLSCAN means no index was used, and SORT means an expensive in-memory sort is happening. Then check the ratio of nReturned to totalDocsExamined — if you examined 50,000 docs to return 10, your index is poorly selective or the wrong fields are indexed. totalDocsExamined: 0 means a covered query, which is the best possible outcome."
A TTL (Time-To-Live) index automatically deletes documents after a specified number of seconds from a Date field. MongoDB's background TTL monitor thread checks the index every 60 seconds and removes expired documents. This is the native way to implement session expiry, log rotation, and cache invalidation without application-level cron jobs.
// Delete sessions 24 hours after createdAt db.sessions.createIndex( { createdAt: 1 }, { expireAfterSeconds: 86400 } // 24 × 60 × 60 ) // Delete at a specific time — set expiresAt to the exact expiry date // expireAfterSeconds: 0 means "expire at the date stored in the field" db.tokens.createIndex( { expiresAt: 1 }, { expireAfterSeconds: 0 } ) // Document: { token: "abc", expiresAt: new Date("2024-12-31") } // → deleted after Dec 31 2024 // Change TTL on existing index (no need to drop/recreate) db.runCommand({ collMod: "sessions", index: { keyPattern: { createdAt: 1 }, expireAfterSeconds: 3600 } })
| Limitation | Detail |
|---|---|
| ~60s precision | The TTL monitor runs every 60 seconds — documents expire within ~60s of their TTL, not exactly |
| Single Date field only | The indexed field must be a Date type or an array of Dates; other types are ignored |
| Not on compound indexes | TTL indexes must be single-field — you cannot create a compound TTL index |
| Not on capped collections | Capped collections do not support TTL indexes |
| High delete load | On very large collections, TTL deletes can create write pressure; monitor with db.currentOp() |
"A TTL index automatically deletes documents after N seconds from a Date field. It's built on a background thread that runs every ~60 seconds, so expiry is approximate not exact. The two patterns are: fixed duration (expireAfterSeconds: 86400 with a createdAt field) and exact expiry time (expireAfterSeconds: 0 with an expiresAt field set to the specific deletion date). Typical uses are session stores, password reset tokens, event logs, and cache documents. Key limitation: TTL must be a single-field index — you can't combine it with a compound index."
MongoDB automatically creates a multikey index when an indexed field contains an array. Instead of storing one index entry per document, it creates one entry per array element — so a document with a tags: ["mongodb", "nosql", "database"] field generates 3 index entries. This allows efficient querying on array contents.
// Create index on tags field db.products.createIndex({ tags: 1 }) // If any document has tags: ["a", "b"], this becomes a multikey index // Documents: // { _id: 1, tags: ["mongodb", "nosql"] } // { _id: 2, tags: ["mongodb", "atlas"] } // Index entries created: // "atlas" → doc 2 // "mongodb" → doc 1, doc 2 // "nosql" → doc 1 // These queries all use the multikey index: db.products.find({ tags: "mongodb" }) // ✅ element match db.products.find({ tags: { $in: ["atlas", "nosql"] } }) // ✅ $in db.products.find({ tags: { $all: ["mongodb", "nosql"] } }) // ✅ $all
| Limitation | Detail |
|---|---|
| Only one array field per compound index | A compound index { a: 1, b: 1 } cannot be multikey on both a AND b — only one of them can be an array |
| Cannot be used as shard key | Multikey indexes cannot be used as the shard key in a sharded cluster |
| Cannot be covered queries | Covered queries are not supported for multikey indexes |
| Index size grows with array size | A document with 1,000 array elements creates 1,000 index entries — watch for index bloat |
db.collection.getIndexes() and look for "isMultiKey": true in the output, or check the multiKeyPaths field to see which array field triggered the multikey state.
"A multikey index is automatically created when you index an array field — MongoDB creates one index entry per array element rather than one per document. This makes queries like { tags: 'mongodb' } efficient even though tags is an array. The key limitation is that a compound index can only have one multikey (array) field — you can't have two array fields in the same compound index. Also watch for index bloat on documents with very large arrays, since each element becomes a separate index entry."
A partial index only indexes documents that match a specified partialFilterExpression. Documents that don't match the filter are excluded from the index entirely. This means the index is smaller, cheaper to build, faster to query, and uses less RAM than a full index on the same field.
// Use case 1: Index only active users (hot data subset) // 10% of users are active — index only covers them, 10× smaller db.users.createIndex( { email: 1 }, { partialFilterExpression: { status: "active" } } ) // Query MUST include the filter expression to use the index: db.users.find({ email: "alice@example.com", status: "active" }) // ✅ uses index db.users.find({ email: "alice@example.com" }) // ❌ COLLSCAN // Use case 2: Sparse unique constraint — allow multiple docs with no email // (sparse: true is a simpler alias for this specific pattern) db.users.createIndex( { email: 1 }, { unique: true, partialFilterExpression: { email: { $exists: true } } } ) // Multiple docs with no email field can coexist; email uniqueness only // enforced where the field exists // Use case 3: Index completed orders only (most queries target completed) db.orders.createIndex( { customerId: 1, createdAt: -1 }, { partialFilterExpression: { status: "completed" } } )
| Type | Filters Out | Filter Expression |
|---|---|---|
| Sparse | Documents where the indexed field is absent or null | Implicit (no expression needed) |
| Partial | Documents that don't match any arbitrary expression | Explicit partialFilterExpression — much more flexible |
Partial indexes are strictly more powerful than sparse indexes. The MongoDB docs recommend using partial indexes over sparse indexes for new code — you can replicate sparse behaviour with { $exists: true }.
"A partial index only indexes documents matching a filter expression, skipping the rest. This makes it dramatically smaller and faster than a full index on the same field when your queries only ever target a subset of documents — like 'active users only' or 'completed orders only.' The catch: queries must include the filter expression in their predicate, otherwise MongoDB can't use the partial index and falls back to COLLSCAN. The most practical use case is indexing hot data — if only 10% of documents are 'active', the partial index is 10× smaller and fits more easily in RAM."
When MongoDB receives a query, the query planner evaluates candidate indexes. It runs trial executions of each candidate plan in parallel, scoring them by how many documents they examine. The plan that returns results fastest (least work) wins and is cached as the winning plan. The cache is invalidated when the collection grows significantly, indexes are added/dropped, or mongod restarts.
// View cached plans for a collection db.orders.getPlanCache().list() // Clear the plan cache (forces re-evaluation) db.orders.getPlanCache().clear() // Force a specific index (overrides planner — use for diagnosis only) db.orders.find({ status: "completed" }).hint({ status: 1, createdAt: -1 })
Index intersection allows MongoDB to combine two separate single-field indexes to satisfy a query that filters on both fields — instead of requiring a compound index. The engine performs an IXSCAN on each index separately, then intersects the document ID sets.
// Two separate indexes: db.orders.createIndex({ status: 1 }) db.orders.createIndex({ customerId: 1 }) // Query on both — planner MAY use AND_SORTED (index intersection) db.orders.find({ status: "completed", customerId: "c-1" }) // explain() would show: stage: "AND_SORTED" or "AND_HASH" // In practice, a compound index { status: 1, customerId: 1 } is // almost always faster than intersection — prefer compound indexes.
| Approach | Pros | Cons |
|---|---|---|
| Compound index | Single IXSCAN; predictable; smaller RAM footprint; supports sorting | Must redesign if query patterns change; uses more index space upfront |
| Index intersection | Two small single-field indexes can serve many query shapes | Two IXSCAN operations + set intersection = more work; cannot provide sort order; planner may not choose it |
"The query planner runs candidate plans in parallel, measures which returns results with the least work, caches the winner, and uses that plan until the cache is invalidated. Index intersection lets MongoDB combine two single-field indexes to satisfy a multi-field query by intersecting their document ID sets — useful when you can't predict all query patterns. But in practice, a purpose-built compound index is almost always faster than intersection because it's a single B-tree traversal vs two scans plus an intersection step. Use intersection as a fallback, not a design strategy."
A text index tokenises string fields, applies stemming (runs → run) and stop-word removal (the, a, is), and stores terms in a B-tree with relevance scores (tfidf). Queries use the $text operator with $search. Only one text index per collection is allowed, but it can cover multiple fields.
// Create text index on title and body fields db.articles.createIndex({ title: "text", body: "text" }) // Weight fields — title matches count 3× more than body matches db.articles.createIndex( { title: "text", body: "text" }, { weights: { title: 3, body: 1 } } ) // Basic search db.articles.find({ $text: { $search: "mongodb aggregation" } }) // Search with relevance score sorted db.articles.find( { $text: { $search: "mongodb aggregation" } }, { score: { $meta: "textScore" } } ).sort({ score: { $meta: "textScore" } }) // Exact phrase (wrap in quotes) db.articles.find({ $text: { $search: '"aggregation pipeline"' } }) // Exclude a word db.articles.find({ $text: { $search: "mongodb -mapreduce" } })
| Feature | Native Text Index | Atlas Search (Lucene) |
|---|---|---|
| Engine | MongoDB B-tree with tfidf scoring | Apache Lucene (industry standard) |
| Relevance ranking | Basic tfidf | BM25 + custom scoring / boosting |
| Fuzzy search | ❌ No | ✅ Yes (typo tolerance) |
| Autocomplete | ❌ No | ✅ Yes |
| Faceted search | ❌ No | ✅ Yes |
| Synonyms | ❌ No | ✅ Yes |
| Index per collection | 1 text index only | Multiple search indexes |
| Infrastructure | In-process (same mongod) | Separate search nodes (Atlas only) |
"Native text indexes support keyword search with stemming, stop-word removal, and basic tfidf relevance scoring. They're limited to one index per collection and lack fuzzy search, autocomplete, and faceting. Atlas Search — built on Apache Lucene — provides production-grade full-text search with fuzzy matching, BM25 scoring, autocomplete, synonyms, and faceted navigation. For a real product search or content search feature, I'd choose Atlas Search. The native text index is fine for simple admin search or internal tooling where Atlas is not available."
Every index speeds up reads but slows down writes. Each write operation (insert, update, delete) must update every index on the collection. A collection with 10 indexes pays 10× the index write overhead. In high-write systems, over-indexing is as dangerous as under-indexing.
| Principle | Detail |
|---|---|
| Index based on access patterns | Analyse your actual queries (slow query log, $currentOp, Profiler) before creating indexes — don't guess |
| Prefer compound over multiple singles | One compound index { a, b, c } is cheaper than three single-field indexes and serves more query shapes via prefixes |
| Remove unused indexes | Use db.collection.aggregate([ { $indexStats: {} } ]) to find indexes with ops: 0 and drop them |
| Build indexes in rolling fashion | On Atlas/replica sets, create indexes with createIndex() — MongoDB builds on secondaries first, then primary; no lock |
| Use partial indexes for hot data | If 90% of queries target "active" status, index only active docs — much smaller, faster, less write overhead |
| Monitor index cache hit rate | Indexes must fit in the WiredTiger cache (default: 50% of RAM). If indexes exceed RAM, MongoDB pages them in/out (slow) |
// See how many times each index has been used since last restart db.orders.aggregate([{ $indexStats: {} }]) // Output: { name, key, accesses: { ops, since } } // ops: 0 → index never used → candidate for removal // Check index sizes (are they fitting in RAM?) db.orders.stats({ indexDetails: true }).indexSizes // Hide index before dropping (MongoDB 4.4+ — safer than direct drop) // Hidden indexes are maintained but not used by the planner // Monitor for 1–2 weeks, then drop if no performance regression db.runCommand({ collMod: "orders", index: { name: "old_index", hidden: true } }) // Enable the query profiler to find slow queries (level 2 = all queries) db.setProfilingLevel(1, { slowms: 100 }) // log queries > 100ms db.system.profile.find().sort({ millis: -1 }).limit(5)
MongoDB's WiredTiger storage engine keeps frequently accessed indexes in memory (default: 50% of RAM). When the working set of indexes exceeds available RAM, MongoDB page-faults — fetching index pages from disk on each query. Symptoms: high page faults/s in mongostat, rising query latency despite IXSCAN plans. Fix: add RAM, reduce index count, or use partial indexes to shrink index size.
"Index design is a balancing act: more indexes speed up reads but slow down every write and consume RAM. My approach is: start with the query profiler and slow query log to understand actual access patterns, then build compound indexes following ESR ordering to serve the most common queries. I use $indexStats to find zero-use indexes and hide them for 1–2 weeks before dropping, to avoid surprise regressions. For high-write collections, I keep the index count minimal — 3–5 max — and use partial indexes to shrink hot-path indexes. Everything must fit in the WiredTiger cache or you're just doing disk I/O with extra steps."
In relational databases you model data to third normal form (3NF) and let JOINs reconstruct it at query time. In MongoDB you model data for how your application reads it. The central question is not "how is this data related?" but "what does my application need to read together, and how often?"
Embedding is like packing a suitcase for one trip — everything you need is in one place, no fetching required. Referencing is like storing items in a warehouse with a catalogue — you look up the catalogue first, then fetch each item separately. The right choice depends on how often you travel and whether you always need all items together, or just some of them.
2. How often does the data change? → Frequently-changing data may be better referenced.
3. What is the cardinality? → Unbounded arrays inside a document are a red flag (16 MB limit).
| Factor | Favour Embedding | Favour Referencing |
|---|---|---|
| Read together? | Data is almost always read together (order + items) | Data is often read independently (user profile vs orders) |
| Cardinality | One-to-few (address, phone numbers) | One-to-many or many-to-many (user → orders) |
| Update frequency | Sub-document rarely changes | Sub-document changes often and is shared across many parents |
| Duplication OK? | Some duplication is acceptable for read speed | Single source of truth is required |
| Document size | Embedded data stays well under 16 MB limit | Array could grow unbounded (tweets, log entries) |
| Atomic update | Need to update parent + child atomically (pre-4.0) | Multi-document transactions available (4.0+) |
// ✅ Good embed: line items always read with the order { _id: ObjectId("..."), customerId: "u-123", status: "completed", createdAt: new Date(), shippingAddress: { // ✅ embed — always read with order street: "123 Main St", city: "Austin", zip: "78701" }, items: [ // ✅ embed — bounded, read together { productId: "p-1", name: "Laptop", qty: 1, price: 999 }, { productId: "p-2", name: "Mouse", qty: 2, price: 29 } ] // customerId is a REFERENCE — user profile read separately }
// users collection { _id: "u-123", name: "Alice", email: "alice@example.com" } // orders collection — references user by ID { _id: ObjectId("..."), customerId: "u-123", amount: 1057 } // ✅ Why reference works here: // - User profile updated independently of orders // - One user can have thousands of orders (unbounded) // - Order list page and user profile page read separately
"The rule of thumb is: embed data that is read together and updated together; reference data that is read independently, changes frequently, or could grow unbounded. The 16 MB document limit is the hard constraint — any array that could grow without bound (all tweets from a user, all events in a log) must be referenced. For one-to-few relationships with bounded data (order line items, shipping address), embedding almost always wins because it eliminates a round-trip."
MongoDB offers three ways to model one-to-many. The cardinality of the relationship (one-to-few, one-to-many, one-to-squillions) determines which to use.
// Person → phone numbers (always < 5, read together) { _id: "u-1", name: "Alice", phones: [ { type: "mobile", number: "555-0101" }, { type: "work", number: "555-0102" } ] } // ✅ Single read, atomic update, no join needed // ❌ Cannot query phones independently
// Publisher → books (dozens of books per publisher) // publishers collection { _id: "pub-1", name: "O'Reilly", bookIds: ["b-1", "b-2", "b-3"] } // books collection { _id: "b-1", title: "MongoDB: The Definitive Guide", publisherId: "pub-1" } // ✅ Books queried independently; publisher list clean // ❌ Need $lookup or two round-trips to get full book data // ❌ bookIds array grows — risky if publisher has thousands of books
// Host → log events (millions of events per host) // hosts collection { _id: "host-1", hostname: "web-01", ip: "10.0.0.1" } // logs collection — each log references its host { _id: ObjectId("..."), hostId: "host-1", level: "error", msg: "...", ts: new Date() } // ✅ Scales to unlimited events per host // ✅ Each log queryable and indexable independently // ❌ Joining host info requires $lookup
| Cardinality | Pattern | Example |
|---|---|---|
| One-to-few (< ~20) | Embed array in parent | Person → phones, order → line items |
| One-to-many (dozens to hundreds) | Array of references in parent OR child reference | Publisher → books, user → orders |
| One-to-squillions (millions) | Child reference — store parent ID in child | Host → log events, user → activity feed |
"MongoDB gives three options for 1:N. For one-to-few (bounded, always read together), embed the array directly in the parent. For one-to-many (dozens), store an array of reference IDs on the parent — but this can bloat the parent document. For one-to-squillions (millions of children), flip it: store the parent's ID on each child document. The child-reference approach scales without bound and keeps the parent document small — essential for log events, activity feeds, or sensor readings."
In relational databases, N:N requires a junction table. In MongoDB the most common approach is to store arrays of references on both sides — or just one side depending on which direction you query most. There's also a denormalised embed approach for small, stable relationship data.
// Students ↔ Courses (students enrol in many courses, courses have many students) // students collection { _id: "s-1", name: "Alice", courseIds: ["c-1", "c-2", "c-3"] // which courses this student takes } // courses collection { _id: "c-1", title: "MongoDB Mastery", studentIds: ["s-1", "s-2", "s-3"] // which students are enrolled } // ✅ Find all courses for a student: index courseIds db.students.find({ _id: "s-1" }) // get courseIds, then fetch courses // ✅ Find all students in a course: index studentIds db.courses.find({ _id: "c-1" }) // get studentIds, then fetch students // ❌ Must update BOTH documents when a student enrols
// Products ↔ Tags — query "all products with tag X" most often // Store tag IDs on the product; tags collection stays clean // products collection { _id: "p-1", name: "Laptop", tagIds: ["tag-nosql", "tag-hardware"] } // tags collection { _id: "tag-nosql", label: "NoSQL", colour: "green" } // ✅ "Products with tag X" → db.products.find({ tagIds: "tag-nosql" }) // ✅ Only one document to update on enrol // ❌ "All tags for product X" needs $lookup or two queries
// Orders ↔ Products — embed just the fields you display { _id: ObjectId("..."), items: [ { productId: "p-1", // reference (for lookups) name: "Laptop", // snapshot of name at purchase time price: 999 // snapshot of price at purchase time } ] } // ✅ Order page renders without any join // ✅ Price is preserved as it was at purchase — correct // ❌ Product name change doesn't update old orders — by design
"MongoDB handles N:N without a junction table. The most common approach is referencing: store IDs on one side (or both sides) with a multikey index to query from either direction. If you query only one direction, put the array on the 'many' side you query most. For order line items, I use a partial embed — store both the reference ID and a snapshot of key fields (name, price) at the time of the relationship. This avoids joins at read time and correctly preserves historical data — an order should show the price when it was placed, not today's price."
When you have a high-frequency data stream (IoT sensors, stock ticks, metrics), storing one document per event creates millions of tiny documents — high index overhead, many B-tree nodes, slow range queries. The Bucket Pattern groups a fixed number of events into a single document, trading document count for document size.
// 1 document per sensor reading — 1M readings/day = 1M tiny docs { sensorId: "s-1", ts: new Date(), temp: 22.3, humidity: 45 } { sensorId: "s-1", ts: new Date(), temp: 22.4, humidity: 46 } // Problem: huge index overhead, slow hourly aggregations
// One bucket = one hour of readings for one sensor { _id: "s-1_2024-03-15T14", // sensorId + hour as natural key sensorId: "s-1", hour: new Date("2024-03-15T14:00:00Z"), count: 60, // number of readings in this bucket sum: { temp: 1342, humidity: 2730 }, // running totals for avg min: { temp: 21.8, humidity: 44 }, max: { temp: 23.1, humidity: 48 }, readings: [ // raw data — up to 60 per bucket { ts: new Date("2024-03-15T14:00:00Z"), temp: 22.3, humidity: 45 }, { ts: new Date("2024-03-15T14:01:00Z"), temp: 22.4, humidity: 46 } ] } // Insert a new reading — $push + update running totals atomically db.sensors.updateOne( { _id: "s-1_2024-03-15T14", count: { $lt: 60 } }, { $push: { readings: { ts: new Date(), temp: 22.5, humidity: 47 } }, $inc: { count: 1, "sum.temp": 22.5, "sum.humidity": 47 }, $min: { "min.temp": 22.5 }, $max: { "max.temp": 22.5 } }, { upsert: true } )
| Aspect | Naive (One Doc/Event) | Bucket Pattern |
|---|---|---|
| Document count | 1M docs / day / sensor | ~24 docs / day / sensor (1 per hour) |
| Index size | Huge — one entry per event | Small — one entry per bucket |
| Hourly average query | $group over 60 docs | sum.temp / count — single doc read |
| Write pattern | Simple insert | updateOne with upsert — slightly complex |
| Raw data access | Direct | Requires $unwind or array indexing |
db.createCollection("sensors", { timeseries: { timeField: "ts", metaField: "sensorId" } }) over manual bucketing.
"The Bucket Pattern solves the problem of storing high-frequency time series data. Instead of one document per event (millions of tiny docs with massive index overhead), you group N events into a single bucket document — say, one bucket per sensor per hour. The bucket stores raw readings as an array plus pre-computed running aggregates (sum, min, max, count) so hourly averages are a single document read rather than a $group over 60 documents. MongoDB 5.0+ Time Series collections implement this pattern natively, so for new projects I'd use those instead of manual bucketing."
The Extended Reference Pattern is a deliberate, controlled form of denormalisation. Instead of storing just a foreign key (reference), you embed a snapshot of the most-read fields from the referenced document alongside the ID. This eliminates the $lookup on the hot read path while keeping a pointer to the source of truth for anything else.
// ❌ Pure reference — order list page needs JOIN to show customer name { _id: ObjectId("..."), customerId: "u-123", // reference only amount: 1057 } // ✅ Extended reference — order list page renders without any join { _id: ObjectId("..."), customer: { _id: "u-123", // keep the reference name: "Alice Smith", // snapshot — displayed on every order email: "alice@example.com" // snapshot — used for notifications // NOT: address, preferences, etc. — only what THIS doc needs }, amount: 1057, status: "completed" } // When customer's name changes — update only recent/open orders // Historical orders can keep the old name (which may be intentional)
Only embed fields that are read in every query for this document AND change rarely. A customer's name changes very infrequently — safe to embed. A customer's loyalty points change daily — keep as pure reference only.
| Field | Embed? | Reason |
|---|---|---|
| Customer name | ✅ Yes | Shown on every order card; rarely changes |
| Customer email | ✅ Yes | Used for order confirmation; rarely changes |
| Customer address | Snapshot only | Embed the address AT ORDER TIME — historical accuracy |
| Customer loyalty points | ❌ No | Changes on every purchase — always look up live value |
| Customer phone | ❌ Usually no | Not shown on order cards; wasted space |
// When a customer changes their name, update recent open orders await db.collection('orders').updateMany( { "customer._id": "u-123", status: { $in: ["pending", "processing"] } // only open orders }, { $set: { "customer.name": "Alice Johnson" } } ) // Completed orders intentionally keep the old name — historical record
"The Extended Reference Pattern embeds a snapshot of the most-read fields from a referenced document alongside the foreign key. This eliminates $lookup on the hot read path — an order list page can render customer names without any join. The trade-off is controlled duplication: when the customer's name changes, you run an updateMany on open orders. The key discipline is to only embed fields that appear in every query on this document AND change rarely. Frequently-changing fields like loyalty points stay as pure references."
Most books have ~10 reviews. But a bestseller might have 100,000 reviews. If you embed reviews in the book document, 99.9% of books are fine — but the 0.1% of bestsellers hit the 16 MB document limit and corrupt your schema for everyone. Designing for the outlier ruins the normal case.
// Standard book document — reviews embedded (normal case) { _id: "book-1", title: "Learning MongoDB", hasOverflow: false, // sentinel flag reviews: [ { userId: "u-1", rating: 5, text: "Great book!" } // ... up to ~100 reviews embedded ] } // Outlier book — too many reviews to embed { _id: "book-99", title: "Harry Potter", hasOverflow: true, // sentinel: check overflow collection reviews: [ // first ~100 reviews still embedded (for fast initial display) ] } // book_reviews_overflow collection — only exists for outliers { _id: ObjectId("..."), bookId: "book-99", reviews: [/* batch of reviews */] }
async function getReviews(bookId, page = 1) { const book = await db.collection('books').findOne({ _id: bookId }); if (!book.hasOverflow) { return book.reviews; // fast path — embedded data, no extra query } // Slow path — fetch from overflow collection with pagination return db.collection('book_reviews_overflow') .find({ bookId }) .skip((page - 1) * 20) .limit(20) .toArray(); }
If you design the schema assuming all books are bestsellers (full reference pattern), every book page load requires an extra round-trip to the reviews collection — even for books with 3 reviews. The Outlier Pattern gives 99.9% of documents the fast embedded path and handles the 0.1% of outliers gracefully with a fallback.
"The Outlier Pattern handles data skew — when 99% of documents fit a schema but 1% don't. The solution is to design for the common case (embed reviews for most books) and add a sentinel flag like hasOverflow: true on documents that exceed your threshold. Application code checks the flag and takes a different code path for outliers. The fast path (no flag) is a simple embedded read; the slow path queries an overflow collection. This keeps the schema optimal for the majority while correctly handling extreme cases."
MongoDB supports JSON Schema validation via the $jsonSchema operator in the collection's validator. You define required fields, field types, value constraints, and patterns. The validator runs on insert and update operations; violations return a WriteError. Validation has two action modes: error (reject) and warn (allow but log).
db.createCollection("users", { validator: { $jsonSchema: { bsonType: "object", required: ["name", "email", "createdAt"], properties: { name: { bsonType: "string", minLength: 2, maxLength: 100, description: "Name is required and must be 2–100 chars" }, email: { bsonType: "string", pattern: "^[^@]+@[^@]+\\.[^@]+$" }, age: { bsonType: "int", minimum: 0, maximum: 130 }, role: { bsonType: "string", enum: ["admin", "editor", "viewer"] }, address: { bsonType: "object", required: ["city", "zip"], properties: { city: { bsonType: "string" }, zip: { bsonType: "string", pattern: "^[0-9]{5}$" } } } } } }, validationLevel: "strict", // "strict" (all writes) or "moderate" (existing docs exempt) validationAction: "error" // "error" (reject) or "warn" (log, allow) })
db.runCommand({ collMod: "users", validator: { $jsonSchema: { /* ... */ } }, validationLevel: "moderate", // existing docs exempt until next write validationAction: "warn" // start with warn to catch issues without blocking })
$jsonSchema validation runs in the database engine and catches every write regardless of how it's submitted. For data integrity guarantees, use both.
"MongoDB supports JSON Schema validation via $jsonSchema in the collection validator. You can enforce required fields, BSON types, string patterns, numeric ranges, and enum values. Validation runs at the database level — unlike Mongoose validation which only runs in the Node.js layer. I always start new collections with validationAction: 'warn' to discover what existing data violates the schema before switching to 'error' to reject invalid writes. validationLevel: 'moderate' lets existing documents stay, only validating on future writes."
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| Massive arrays (unbounded growth) | Arrays with thousands of elements; approaching 16 MB limit; slow updates | Child reference pattern or Bucket Pattern — store array elements in child docs |
| Massive number of collections | Hundreds of collections for similar data; collection scan on metadata | Consolidate with a discriminator type field; use one collection per entity type |
| Bloated documents | Large documents (hundreds of KB) rarely fully read; slow queries | Separate hot fields from cold fields — Subset Pattern |
| Unnecessary indexes | 10+ indexes; slow writes; indexes don't fit in RAM | $indexStats to find zero-use indexes; drop them |
| Case-insensitive query without index | { $regex: /alice/i } on large collections = COLLSCAN | Store normalised lowercase field; or use a collation index |
| Separating data that is read together | Multiple $lookup or application-side joins on every request | Embed the data — design for your read pattern |
// Product page always shows: name, price, 5 recent reviews, thumbnail // Full review history rarely accessed // products collection — hot data, small, fast to load { _id: "p-1", name: "Laptop Pro", price: 1299, thumbnail: "https://cdn.example.com/laptop.jpg", recentReviews: [ // only last 5 — always shown { userId: "u-1", rating: 5, text: "Excellent!" } ] } // product_reviews collection — cold data, full history { _id: ObjectId("..."), productId: "p-1", rating: 4, text: "...", createdAt: new Date() } // Only fetched when user clicks "See all 1,243 reviews"
// Find the largest documents in a collection db.products.aggregate([ { $project: { docSize: { $bsonSize: "$$ROOT" } } }, { $sort: { docSize: -1 } }, { $limit: 10 } ]) // docSize approaching 16,000,000 bytes = danger zone // Documents > 1 MB are candidates for Subset or Bucket Pattern
"The three most impactful anti-patterns are: unbounded arrays (a document that grows without limit will hit the 16 MB cap — use child references or the Bucket Pattern), bloated documents (reading a 500 KB document to display 5 fields is wasteful — use the Subset Pattern to split hot and cold data), and too many indexes (indexes have write overhead and must fit in RAM — use $indexStats to find and drop zero-use indexes). MongoDB's Performance Advisor in Atlas automatically flags these."
| Pattern | Storage | Find Ancestors | Find Subtree |
|---|---|---|---|
| Parent Reference | { parentId: "p-1" } on each node | Multi-hop queries (slow) | Must recurse application-side |
| Child References | { children: ["c-1","c-2"] } on parent | Multi-hop (slow) | Traverse array |
| Array of Ancestors | { ancestors: ["root","cat-1","cat-2"] } on each node | ✅ Single doc read | { ancestors: nodeId } — fast |
| Materialised Path | { path: ",root,cat-1,cat-2," } on each node | ✅ Regex or $regex | ✅ Regex prefix match + text index |
// Category tree: Electronics → Computers → Laptops { _id: "laptops", name: "Laptops", ancestors: ["electronics", "computers"], parentId: "computers" } { _id: "computers", name: "Computers", ancestors: ["electronics"], parentId: "electronics" } { _id: "electronics",name: "Electronics",ancestors: [], parentId: null } // Find all ancestors of "laptops" — single document read db.categories.find({ _id: "laptops" }) // returns ancestors array // Find all descendants of "electronics" (entire subtree) db.categories.find({ ancestors: "electronics" }) // Index on ancestors (multikey) makes this fast // Show breadcrumb: Electronics > Computers > Laptops db.categories.find({ _id: { $in: ["electronics", "computers"] } }) .sort({ /* by depth */ })
// Store full path as a string { _id: "laptops", path: ",electronics,computers,laptops," } { _id: "computers", path: ",electronics,computers," } // Find all descendants of "computers" — prefix match db.categories.find({ path: /^,electronics,computers,/ }) // ✅ Fast with a text or regex index on path
"The most practical pattern for category trees and org charts is Array of Ancestors — each node stores the full list of ancestor IDs as an array. This gives you subtree queries with a single indexed find({ ancestors: 'nodeId' }) and breadcrumb rendering with a single $in lookup. The trade-off is that moving a node requires updating ancestors on the node and all descendants. For very deep trees with many moves, Materialised Path is more flexible — store the full path string and use a regex prefix match. Both patterns are far better than Parent Reference for anything beyond two levels."
Before designing any schema, identify your read patterns. For a social feed:
| Operation | Frequency | Latency Requirement |
|---|---|---|
| Render home feed (latest posts from follows) | Very high — every page load | < 100ms |
| Show post with like count + top comments | Very high | < 50ms |
| Like / unlike a post | High | < 200ms |
| Follow / unfollow a user | Medium | < 500ms |
| Get follower/following counts | High (shown on profile) | < 50ms |
// ── users collection ── { _id: "u-1", username: "alice", displayName: "Alice Smith", avatar: "https://cdn.example.com/alice.jpg", followerCount: 1420, // denormalised counter — fast profile render followingCount: 382 } // ── follows collection — child-reference pattern (one-to-squillions) ── { followerId: "u-1", followeeId: "u-2", createdAt: new Date() } // Index: { followerId: 1, followeeId: 1 } unique // Index: { followeeId: 1 } — "who follows user X" // ── posts collection ── { _id: ObjectId("..."), authorId: "u-1", author: { // Extended Reference — avoids join on feed render username: "alice", displayName: "Alice Smith", avatar: "https://cdn.example.com/alice.jpg" }, content: "Just learned about MongoDB schema design!", mediaUrls: ["https://cdn.example.com/img1.jpg"], likeCount: 47, // denormalised counter commentCount: 8, // denormalised counter createdAt: new Date() } // Index: { authorId: 1, createdAt: -1 } — user's post history // Index: { createdAt: -1 } — global feed sort // ── likes collection — avoids unbounded array in post ── { postId: ObjectId("..."), userId: "u-1", createdAt: new Date() } // Index: { postId: 1, userId: 1 } unique — fast "did I like this?" check // ── comments collection ── { _id: ObjectId("..."), postId: ObjectId("..."), authorId: "u-1", author: { username: "alice", avatar: "..." }, // Extended Reference text: "Great post!", createdAt: new Date() } // Index: { postId: 1, createdAt: 1 } — comments for a post, oldest first
// Step 1: get list of users the current user follows const following = await db.collection('follows') .find({ followerId: currentUserId }) .map(f => f.followeeId) .toArray(); // Step 2: get recent posts from those users (fan-in read model) const feed = await db.collection('posts') .find({ authorId: { $in: following } }) .sort({ createdAt: -1 }) .limit(20) .toArray(); // author data already embedded — no $lookup needed on render
| Decision | Why |
|---|---|
| Separate likes collection (not array in post) | Unbounded — a viral post could have millions of likes; also enables "did I like this?" query with a unique index |
| Denormalised likeCount / commentCount on post | Feed renders fast — no aggregation query needed. Use $inc to keep counters consistent atomically |
| Extended Reference (author snapshot) in posts/comments | Feed and comment list render without $lookup; username/avatar rarely changes |
| Separate follows collection | Users can have millions of followers — can't embed in user doc; enables efficient "get my feed" and "get followers" queries |
"I'd start with access patterns, not entities. The hot path is feed rendering — it must be fast. I'd use separate collections for posts, likes, comments, and follows (all unbounded). Posts embed an Extended Reference for author info so the feed renders without joins. Like count and comment count are denormalised counters on the post (updated with $inc) so the feed never needs aggregation. The follows collection enables feed queries via $in on authorId. This fan-in read model works up to ~few thousand follows; beyond that you'd move to a pre-computed feed (fan-out on write)."
MongoDB has always provided atomic operations on a single document — this covers the majority of use cases because a well-designed document embeds related data together. Multi-document transactions (added in 4.0) exist for cases where atomicity across multiple collections is truly unavoidable — but they come with a performance cost.
Debiting Alice's account and crediting Bob's are two separate documents. Without a transaction, a crash between the two writes leaves Bob uncredited. With a transaction, both writes succeed or both roll back — the database is never in a half-updated state. But if you'd modelled both balances in one document (e.g., an internal ledger), a single atomic update would suffice without any transaction overhead.
| Property | Single Document | Multi-Document Transaction (4.0+) |
|---|---|---|
| Atomicity | ✅ Always — all field updates in one write succeed or fail together | ✅ All writes across documents/collections commit or all roll back |
| Consistency | ✅ Validators and unique indexes enforced per write | ✅ Validators run; unique constraints enforced across the transaction |
| Isolation | ✅ Snapshot isolation via WiredTiger MVCC | ✅ Snapshot isolation — reads see a consistent point-in-time snapshot |
| Durability | Depends on write concern — w:1 = durable after primary ack; w:majority = durable on majority | Same write concern rules apply to the commit operation |
Every MongoDB write operation (insert, update, delete, findAndModify) on a single document is atomic. Even if you update 20 fields simultaneously, observers either see all 20 changes or none — never a half-updated document. This is implemented by WiredTiger's MVCC (Multi-Version Concurrency Control) — readers see the committed version while a write is in progress.
// All three field changes are atomic — no partial state visible db.orders.updateOne( { _id: orderId }, { $set: { status: "shipped", shippedAt: new Date() }, $inc: { version: 1 } } ) // A concurrent reader sees either the old state OR the new state // — never status="shipped" with the old shippedAt
Because MongoDB documents can embed related data, many operations that would require a transaction in a relational DB are single-document writes in MongoDB:
| Scenario | SQL Approach | MongoDB Single-Doc Approach |
|---|---|---|
| Add order + line items | INSERT into orders + INSERT into order_items (transaction) | Single insertOne with items embedded |
| Update product + inventory | UPDATE products + UPDATE inventory (transaction) | Embed inventory in product doc; single updateOne |
| Publish post + update count | INSERT post + UPDATE user.postCount (transaction) | $inc postCount in same updateOne that sets published:true |
"Yes, MongoDB is fully ACID. At the single-document level it has always been atomic — all field changes in one write are committed together or not at all, implemented via WiredTiger's MVCC. Since MongoDB 4.0, multi-document ACID transactions are supported across collections and databases, with the same snapshot isolation semantics. The MongoDB philosophy is to design schemas that make single-document atomicity sufficient — embedded documents eliminate the need for transactions in most cases. Transactions are available when truly needed but carry a performance cost."
Write concern specifies how many replica set members must acknowledge a write before MongoDB reports it as successful. It is the dial between performance (fewer acknowledgements = faster) and durability (more acknowledgements = safer against data loss on failover).
| Write Concern | Meaning | Risk | Use Case |
|---|---|---|---|
| w: 0 (fire & forget) | No acknowledgement — returns immediately | Data loss if mongod crashes before writing | High-throughput logging where loss is acceptable |
| w: 1 (default) | Primary acknowledges after writing to its memory journal | Data loss if primary fails before replication completes | General purpose — balanced performance/safety |
| w: majority | Majority of replica set members acknowledge | Minimal — survives primary failover | Financial data, user records, anything that must not be lost |
| w: N (number) | N specific members acknowledge | Depends on N | Custom replication requirements |
// w:majority + j:true — strongest durability guarantee // Write survives majority failover AND primary crash before journal flush await db.collection('payments').insertOne( { amount: 500, currency: "USD" }, { writeConcern: { w: "majority", j: true, wtimeout: 5000 } } ) // High-throughput event logging — speed over safety await db.collection('events').insertOne( { type: "page_view", url: "/home" }, { writeConcern: { w: 0 } } ) // Set default write concern at client level (recommended) const client = new MongoClient(uri, { writeConcern: { w: "majority", j: true } })
With w:1: the primary acknowledges the write, but the secondaries haven't replicated it yet. If the primary crashes and a secondary is elected, the un-replicated write is rolled back — it never happened as far as the new primary knows. With w:majority: the write is on a majority of nodes before acknowledgement — it will survive any single-node failure, including the current primary failing.
"Write concern is MongoDB's durability dial. w:1 (default) means the primary acknowledges after writing to its own journal — fast, but if the primary crashes before replication completes, that write can be rolled back on failover. w:majority waits until a majority of replica set members have the write — it survives any single-node failure. For financial data or anything that must not be lost, use w:majority with j:true (journal flushed to disk). For high-throughput event logging where occasional loss is acceptable, w:1 or even w:0 is fine."
Read concern controls how current and consistent the data returned by a read must be. It answers: "can I read data that might be rolled back, and do I need to wait for majority confirmation before reading?"
| Level | Returns | Rollback Risk | Latency |
|---|---|---|---|
| local (default) | Most recent data on the queried node — may not be majority-committed | ⚠️ Yes — data may be rolled back on failover | Lowest |
| available | Same as local but for sharded clusters (may return orphaned chunks) | ⚠️ Yes | Lowest |
| majority | Only data that has been acknowledged by a majority of nodes | ✅ No rollback possible | Medium |
| linearizable | Most recent majority-committed data — guarantees real-time ordering | ✅ No rollback + causal consistency | Highest — waits for majority ack |
| snapshot | Consistent snapshot at transaction start time (transactions only) | ✅ Consistent across all reads in the transaction | Medium |
// Default — fast, may read un-committed data db.collection('products').findOne({ _id: id }) // Majority — only read data that won't be rolled back db.collection('payments').findOne( { _id: id }, { readConcern: { level: "majority" } } ) // Linearizable — strongest; single-document reads only // Waits for majority acknowledgment before returning db.collection('inventory').findOne( { _id: productId }, { readConcern: { level: "linearizable" }, maxTimeMS: 5000 } ) // Use when: "is there still stock?" before a purchase commitment
linearizable only works on primary reads of single documents. It cannot be used with aggregations, range queries, or reads from secondaries. Always set maxTimeMS to avoid blocking indefinitely if the primary loses contact with the majority.
"Read concern controls what data you're allowed to see. The default 'local' returns whatever is on the queried node, including writes that haven't been majority-replicated yet and could be rolled back on failover. 'majority' only returns data confirmed by a majority of nodes — safe from rollback. 'linearizable' is the strongest guarantee: it ensures you read the most recent majority-committed value and that any write you read cannot be rolled back — but it only works for single-document primary reads and adds latency waiting for majority confirmation. For financial reads — 'did this payment succeed?' — use majority or linearizable."
| Requirement | Detail |
|---|---|
| Replica set required | Transactions require a replica set or sharded cluster — standalone mongod does not support them |
| Session required | Every transaction runs inside a ClientSession — pass the session to every operation |
| 60-second limit | Transactions must commit within 60 seconds (configurable) — long transactions are killed |
| No DDL inside | Cannot create collections or indexes inside a transaction |
async function transferFunds(fromId, toId, amount) { const session = client.startSession(); try { await session.withTransaction(async () => { const accounts = db.collection('accounts'); // Step 1: Debit source account — check balance atomically const from = await accounts.findOneAndUpdate( { _id: fromId, balance: { $gte: amount } }, // only if sufficient funds { $inc: { balance: -amount } }, { session, returnDocument: "after" } ); if (!from) throw new Error("Insufficient funds"); // Step 2: Credit destination account await accounts.updateOne( { _id: toId }, { $inc: { balance: amount } }, { session } ); // Step 3: Write audit record await db.collection('transfers').insertOne( { fromId, toId, amount, ts: new Date() }, { session } ); }, { readConcern: { level: "snapshot" }, writeConcern: { w: "majority" }, maxCommitTimeMS: 5000 }); // withTransaction automatically retries transient errors // (TransientTransactionError) and commit errors (UnknownTransactionCommitResult) } finally { await session.endSession(); } }
| Approach | Pros | Cons |
|---|---|---|
| session.withTransaction(fn) | Automatically retries on TransientTransactionError and UnknownTransactionCommitResult; handles commit/abort | Less control over retry logic |
| Manual startTransaction / commitTransaction / abortTransaction | Full control over retry behaviour and error handling | Must implement retry logic yourself — easy to get wrong |
"Multi-document transactions require a ClientSession — every operation inside the transaction must receive the session object as an option. I always use session.withTransaction() rather than manual startTransaction/commit/abort because it automatically retries on transient network errors and unknown commit results, which are both expected in distributed systems. Key constraints: 60-second max duration, no DDL inside, requires a replica set. For the actual transaction I use readConcern snapshot and writeConcern majority for full ACID guarantees."
| Operator | Action | Atomic? |
|---|---|---|
| $set | Set a field to a value | ✅ Yes |
| $unset | Remove a field | ✅ Yes |
| $inc | Increment/decrement a numeric field | ✅ Yes — no read-modify-write needed |
| $mul | Multiply a numeric field | ✅ Yes |
| $min / $max | Update field only if new value is less/greater than current | ✅ Yes |
| $push | Append element to array | ✅ Yes |
| $addToSet | Append element to array only if not already present | ✅ Yes |
| $pull / $pop | Remove element(s) from array | ✅ Yes |
| $currentDate | Set field to current date/timestamp | ✅ Yes |
| $setOnInsert | Only set fields when upsert creates a new document | ✅ Yes |
// ❌ Race condition — two clients can both read 5, both write 6 const product = await findOne({ _id: id }); await updateOne({ _id: id }, { $set: { stock: product.stock - 1 } }); // ✅ Atomic — database decrements in one operation, no race await db.collection('products').updateOne( { _id: id, stock: { $gt: 0 } }, // guard: only decrement if in stock { $inc: { stock: -1 } } ) // If matchedCount === 0 → out of stock
// Claim a queued job — find-and-mark atomically const job = await db.collection('jobs').findOneAndUpdate( { status: "queued" }, { $set: { status: "running", workerId: workerId }, $currentDate: { startedAt: true }, $inc: { attempts: 1 } }, { sort: { priority: -1, createdAt: 1 }, returnDocument: "after" } ); // All three changes are atomic — no two workers claim the same job // If null → no queued jobs available
"MongoDB's atomic operators — $inc, $push, $addToSet, $min/$max — allow you to make complex updates without a read-modify-write cycle, eliminating entire classes of race conditions. The classic example is a stock counter: instead of reading the stock, decrementing in application code, and writing back (which has a race window), use $inc with a filter guard ({ stock: { $gt: 0 } }) — the decrement and the check happen atomically. Combined with findOneAndUpdate, you can claim queue jobs, enforce state machines, and update multiple fields in one atomic round-trip without needing a multi-document transaction."
Optimistic concurrency control (OCC) assumes conflicts are rare. Instead of locking, it allows multiple readers/writers to proceed concurrently, then detects conflicts at write time using a version field. If the version has changed since you read the document, your update is rejected — you re-read and retry. It avoids the performance overhead of pessimistic locks while preventing lost updates.
// Document has a __v (version) field // { _id: "p-1", stock: 10, price: 99, __v: 5 } async function updateProductPrice(productId, newPrice) { while (true) { // Step 1: Read current document including version const product = await db.collection('products').findOne({ _id: productId }); if (!product) throw new Error("Not found"); // Step 2: Business logic with the current state const newDoc = { ...product, price: newPrice }; // Step 3: Write — only succeeds if __v hasn't changed const result = await db.collection('products').updateOne( { _id: productId, __v: product.__v }, // version guard { $set: { price: newPrice }, $inc: { __v: 1 } } // increment version ); if (result.matchedCount === 1) return; // success // matchedCount === 0 → someone else updated — retry // In practice: add retry limit + exponential backoff } }
| Approach | Mechanism | Best For | Downside |
|---|---|---|---|
| Atomic operators ($inc) | Single-field atomic update — no version field needed | Counters, append-only arrays | Only works when operation is naturally atomic |
| Optimistic locking (__v) | Read + version-checked write; retry on conflict | Low-contention updates to complex fields | Retries under high contention; more application code |
| findOneAndUpdate | Atomic read + write in one operation | State machines, job queues, claiming resources | Entire update must fit in one operation |
| Multi-doc transaction | ACID across multiple documents/collections | Financial transfers, order + inventory atomicity | Performance overhead, 60s limit, replica set required |
optimisticConcurrency: true option and the versionKey (__v) field implement this pattern automatically. Every save() adds a version guard and increments the version — a concurrent save throws a VersionError.
"Optimistic concurrency control uses a version field (__v) on each document. Before writing, you include the current version in your filter — if the document has been modified by someone else, the filter won't match and matchedCount is 0, signalling a conflict that you retry. This avoids locks, scales well for low-contention scenarios, and is what Mongoose's optimisticConcurrency option implements. For high-contention scenarios (like flash sales), you'd use $inc with a filter guard instead — it's a single atomic operation with no retry loop."
Change streams are a real-time notification mechanism built on the oplog (replication log). They allow applications to subscribe to data changes on a collection, database, or entire cluster and receive a stream of change events without polling. Change streams guarantee at-least-once delivery and support resumability via a resume token.
const changeStream = db.collection('orders').watch( // Optional pipeline to filter events [{ $match: { operationType: { $in: ["insert", "update"] } } }], { fullDocument: "updateLookup" } // include full doc on updates ); for await (const change of changeStream) { console.log(change.operationType); // "insert" | "update" | "delete" | "replace" console.log(change.fullDocument); // the new document state console.log(change.updateDescription);// { updatedFields, removedFields } console.log(change._id); // resume token — store this } // Resumable stream — restart from a saved token after disconnect const resumableStream = db.collection('orders').watch([], { resumeAfter: savedResumeToken // pick up exactly where you left off });
| Use Case | How Change Streams Help |
|---|---|
| Real-time notifications | Watch orders collection; push WebSocket event when order status changes to "shipped" |
| Cache invalidation | Watch products; evict Redis cache entry when product price is updated |
| Cross-service sync | Watch users collection; replicate changes to Elasticsearch or a read replica |
| Audit logging | Watch sensitive collections; write every change to an immutable audit log |
| Event-driven microservices | Outbox pattern — watch an outbox collection and publish events to Kafka/RabbitMQ |
// Application writes order + outbox event in one transaction await session.withTransaction(async () => { await db.collection('orders').insertOne({ ...orderData }, { session }); await db.collection('outbox').insertOne( { event: "ORDER_PLACED", payload: orderData, published: false }, { session } ); }); // Separate publisher process watches outbox and publishes to Kafka const stream = db.collection('outbox').watch([ { $match: { operationType: "insert" } } ]); for await (const evt of stream) { await kafka.publish(evt.fullDocument.event, evt.fullDocument.payload); await db.collection('outbox').updateOne({ _id: evt.fullDocument._id }, { $set: { published: true } }); }
"Change streams let you subscribe to real-time data changes on a collection, database, or cluster without polling. They're built on the oplog and guarantee at-least-once delivery with resumability via a resume token — if your consumer crashes, it restarts from the saved token and processes no missed events. The most important production pattern is the transactional outbox: write your domain event to an outbox collection inside the same transaction as the business write, then a change stream publisher reliably forwards it to Kafka or RabbitMQ. This guarantees at-least-once event delivery without distributed transactions."
In distributed systems, networks are unreliable. A write might succeed on the server but the acknowledgement is lost — the client retries and the operation runs twice. An idempotent operation produces the same result whether executed once or N times. Without idempotency, retries cause duplicate records, double charges, or double-decremented inventory.
// ── Technique 1: Natural idempotent key with upsert ── // Payment identified by idempotencyKey — safe to retry await db.collection('payments').updateOne( { idempotencyKey: req.headers['idempotency-key'] }, { $setOnInsert: { amount: req.body.amount, customerId: req.body.customerId, createdAt: new Date() }}, { upsert: true } ) // $setOnInsert only runs if a new doc is created // If idempotencyKey already exists → matched, nothing changes // ── Technique 2: Conditional update — only update if state allows ── // State machine guard prevents double-processing const result = await db.collection('orders').updateOne( { _id: orderId, status: "pending" }, // guard: only from pending { $set: { status: "processing", processedAt: new Date() } } ) // Second call: status is already "processing" → filter doesn't match // matchedCount === 0 is NOT an error — it's already processed // ── Technique 3: $addToSet for deduplication ── // Tag can only be added once regardless of how many times called await db.collection('products').updateOne( { _id: productId }, { $addToSet: { tags: "featured" } } // idempotent — duplicate safe )
// Persist idempotency keys with TTL — expire after 24 hours db.createCollection("idempotency_keys", { validator: { $jsonSchema: { required: ["key", "result"] } } }) db.idempotency_keys.createIndex({ key: 1 }, { unique: true }) db.idempotency_keys.createIndex({ createdAt: 1 }, { expireAfterSeconds: 86400 }) async function processIdempotently(key, fn) { try { // Try to claim the key atomically await db.collection("idempotency_keys").insertOne({ key, createdAt: new Date() }); const result = await fn(); await db.collection("idempotency_keys").updateOne({ key }, { $set: { result } }); return result; } catch (e) { if (e.code === 11000) { // duplicate key — already processed const existing = await db.collection("idempotency_keys").findOne({ key }); return existing.result; // return cached result } throw e; } }
"Idempotency prevents duplicate effects when operations are retried — critical in payment processing and distributed systems. MongoDB gives you three tools: upsert with $setOnInsert (safe to re-run; only inserts on first call), conditional state-machine updates (the filter guard prevents applying the same transition twice), and $addToSet (adding the same tag is naturally idempotent). For external API calls like payments, I store an idempotency key in a collection with a unique index and TTL — a duplicate key error (code 11000) means it already ran and I return the cached result."
Without causal consistency, this sequence can happen: 1) You write a document to the primary. 2) You immediately read from a secondary to reduce primary load. 3) The secondary hasn't replicated your write yet — you read stale data and your own write appears to have vanished. This violates the "read your own writes" guarantee.
A causally consistent session tracks a cluster time and an operation time. When you read from a secondary, MongoDB ensures the secondary has applied at least up to the operation time of your last write before serving the read. This guarantees:
| Guarantee | Meaning |
|---|---|
| Read your own writes | A read always sees the result of writes in the same session |
| Monotonic reads | Reads never go back in time — you never see older data after seeing newer data |
| Monotonic writes | Writes in a session are applied in order |
| Writes follow reads | A write reflects the state seen by the most recent read in the session |
// Start a causally consistent session const session = client.startSession({ causalConsistency: true }); try { // Write to primary await db.collection('users').updateOne( { _id: userId }, { $set: { email: "new@example.com" } }, { session, writeConcern: { w: "majority" } } ); // Read from secondary — causal consistency ensures we see our write const user = await db.collection('users').findOne( { _id: userId }, { session, readConcern: { level: "majority" }, readPreference: "secondary" } ); // user.email is guaranteed to be "new@example.com" } finally { await session.endSession(); }
readConcern: majority and writeConcern: majority. Using weaker concerns breaks the guarantees — the session tracks cluster time but the secondary is not obligated to wait before serving reads.
"Causal consistency solves the 'read your own writes' problem when reading from secondaries. Without it, you write to the primary, immediately read from a secondary that hasn't replicated yet, and see stale data. A causally consistent session tracks operation time — when you read from a secondary, MongoDB waits until that secondary has applied at least up to your last write's timestamp before responding. It requires both readConcern and writeConcern set to majority, and the same session object passed to both operations."
A payment API endpoint must: (1) Debit the customer's wallet, (2) Create a payment record, (3) Emit an ORDER_PAID event to the outbox. All three must be atomic — partial failure leaves the system in an inconsistent state. The endpoint must be idempotent — the client may retry on timeout.
async function processPayment({ orderId, customerId, amount, idempotencyKey }) { const session = client.startSession(); try { let payment; await session.withTransaction(async () => { // ── Step 1: Idempotency check — claim the key inside the transaction ── const keyInsert = await db.collection('idempotency_keys').updateOne( { key: idempotencyKey }, { $setOnInsert: { key: idempotencyKey, createdAt: new Date() } }, { upsert: true, session } ); if (keyInsert.upsertedCount === 0) { // Key already exists — fetch and return existing payment payment = await db.collection('payments').findOne( { idempotencyKey }, { session } ); return; // abort the rest — already processed } // ── Step 2: Debit wallet — atomic check-and-debit ── const wallet = await db.collection('wallets').findOneAndUpdate( { customerId, balance: { $gte: amount } }, { $inc: { balance: -amount } }, { session, returnDocument: "after" } ); if (!wallet) throw Object.assign(new Error("Insufficient funds"), { code: "INSUFFICIENT_FUNDS" }); // ── Step 3: Create payment record ── payment = { _id: new ObjectId(), orderId, customerId, amount, status: "completed", idempotencyKey, createdAt: new Date() }; await db.collection('payments').insertOne(payment, { session }); // ── Step 4: Write to outbox (change stream will publish to Kafka) ── await db.collection('outbox').insertOne({ event: "ORDER_PAID", orderId, amount, paymentId: payment._id, createdAt: new Date() }, { session }); }, { readConcern: { level: "snapshot" }, writeConcern: { w: "majority", j: true } }); return payment; } finally { await session.endSession(); } }
| Decision | Why |
|---|---|
| Idempotency check inside transaction | If key doesn't exist yet, claim it and process atomically — no window between check and insert |
| findOneAndUpdate for debit | Balance check and decrement are atomic — eliminates the TOCTOU race condition |
| Outbox inside same transaction | Event is emitted if and only if the payment commits — no lost events, no phantom events |
| writeConcern majority + j:true | Payment is on disk on a majority of nodes — survives primary crash, power outage |
| withTransaction (not manual) | Automatic retry on TransientTransactionError — handles transient network partitions |
"I'd use a multi-document transaction with four writes: claim the idempotency key (using $setOnInsert + upsert), debit the wallet with a balance guard (findOneAndUpdate with $gte filter), insert the payment record, and write to the outbox. All four are in the same transaction — either all commit or all roll back. writeConcern majority + j:true ensures the payment survives any single-node failure. The idempotency key inside the transaction means a client retry sees the existing payment rather than double-charging. A change stream publisher then reliably forwards the outbox event to downstream services."
The native MongoDB driver gives you raw power — you talk directly to the database with no guardrails. Mongoose adds a structured layer on top: it enforces a schema, runs validation before every write, fires middleware hooks, and provides a cleaner API. Think of it as the difference between raw SQL and an ORM like Sequelize.
- Schema definition with types and constraints
- Built-in and custom validation before every save
- Pre/post middleware hooks (before save, after find…)
- Virtuals — computed properties not stored in DB
- populate() — automatic reference resolution
- Plugins — reusable schema extensions
- TypeScript-friendly with generics support
- Overhead — validation + middleware add latency
- Abstraction can hide what queries actually run
- populate() is N+1 queries under the hood
- Schema versioning (
__v) adds noise to documents - Some advanced MongoDB features need raw driver access
Mongoose is an Object Document Mapper (ODM) for MongoDB and Node.js. It provides a schema-based solution to model your application data, adds automatic validation, type casting, query building, and middleware hooks — things the raw MongoDB driver intentionally leaves to you.
// ── Raw MongoDB Driver ── const { MongoClient } = require('mongodb'); const db = client.db('myapp'); // No validation — any shape goes in await db.collection('users').insertOne({ name: 999, // number instead of string — driver won't care email: "invalid" // not an email — driver won't care }); // ── Mongoose ── const mongoose = require('mongoose'); const userSchema = new mongoose.Schema({ name: { type: String, required: true }, email: { type: String, required: true, match: /^.+@.+\..+$/ } }); const User = mongoose.model('User', userSchema); await new User({ name: 999, email: "invalid" }).save(); // Throws ValidationError: name must be a string, email is invalid
| Situation | Recommendation |
|---|---|
| MERN/Express application with structured data | Mongoose — validation + hooks save a lot of boilerplate |
| Microservice needing max performance, flexible docs | Raw driver — less overhead, more control |
| Complex aggregation pipelines | Mongoose supports them, but raw driver is cleaner for heavy pipeline work |
| Rapid prototyping with TypeScript | Mongoose — great TS support out of the box |
"Mongoose is an ODM — it sits between your Node.js code and MongoDB and adds a structured layer the raw driver doesn't have: schema definition, type casting, validation before every write, and middleware hooks. The raw driver lets you insert any shape of document; Mongoose enforces that documents conform to your schema. For MERN apps, Mongoose is almost always worth it for the structure and DX it provides, though it adds some overhead over the raw driver."
A Mongoose Schema is a blueprint for a MongoDB collection's documents. It defines the fields, their data types, default values, validators, and other constraints. The Schema itself does not write to MongoDB — it is passed to mongoose.model() to create a Model.
const { Schema, model } = require('mongoose'); const userSchema = new Schema( { name: { type: String, required: [true, 'Name is required'], // custom error message trim: true, // auto-trim whitespace minlength: 2, maxlength: 100 }, email: { type: String, required: true, unique: true, // creates a unique index lowercase: true, // auto-lowercase before save match: [/^.+@.+\..+$/, 'Invalid email'] }, age: { type: Number, min: 0, max: 120 }, role: { type: String, enum: ['user', 'admin', 'moderator'], // only these values allowed default: 'user' }, isActive: { type: Boolean, default: true }, tags: [String], // shorthand for array of strings address: { // nested subdocument city: String, country: { type: String, default: 'India' } }, profileId: { type: Schema.Types.ObjectId, ref: 'Profile' } // reference }, { timestamps: true, // auto adds createdAt and updatedAt versionKey: false // removes the __v field } );
{ timestamps: true } in schema options automatically adds createdAt and updatedAt fields managed by Mongoose. This saves you from writing pre-save hooks for this very common requirement.
"A Mongoose Schema defines the shape and rules for documents in a collection. Each field gets a type plus optional constraints like required, unique, min/max, enum, and default. The schema options object is equally important — timestamps: true auto-adds createdAt/updatedAt, and versionKey: false removes the noisy __v field. The Schema itself doesn't interact with MongoDB — you pass it to mongoose.model() to create a queryable Model."
| Schema | Model | |
|---|---|---|
| What it is | Blueprint / class definition | The compiled, MongoDB-connected class |
| Talks to MongoDB? | No | Yes — all CRUD goes through the Model |
| Analogy | Architecture blueprint of a house | The actual house built from that blueprint |
| Created with | new mongoose.Schema({...}) | mongoose.model('Name', schema) |
const userSchema = new Schema({ name: String, email: String }); // Model name 'User' → maps to collection 'users' (auto-pluralised, lowercased) const User = mongoose.model('User', userSchema); // ── CREATE ── const user = new User({ name: 'Alice', email: 'alice@x.com' }); await user.save(); // Shorthand (create = new + save) await User.create({ name: 'Bob', email: 'bob@x.com' }); // ── READ ── const users = await User.find({ isActive: true }); const oneUser = await User.findById(id); const byEmail = await User.findOne({ email: 'alice@x.com' }); // ── UPDATE ── await User.findByIdAndUpdate(id, { $set: { name: 'Alice B' } }, { new: true }); // ── DELETE ── await User.findByIdAndDelete(id);
findByIdAndUpdate returns the document before the update. Pass { new: true } to get the updated document back — the same as returnDocument: 'after' in the raw driver.
"A Schema is the blueprint — it defines fields, types, and validation rules but has no database connection. A Model is the compiled class created by passing a schema to mongoose.model() — it is what you use to create, query, update, and delete documents. The model name you pass (e.g., 'User') determines the collection name — Mongoose auto-pluralises it to 'users'."
Mongoose validation runs automatically before every save() and create(). It does not run automatically on updateOne, findByIdAndUpdate etc. unless you pass { runValidators: true } explicitly.
const productSchema = new Schema({ name: { type: String, required: true, minlength: 3, maxlength: 200 }, price: { type: Number, required: true, min: 0 }, category: { type: String, enum: ['electronics', 'books', 'clothing'] }, sku: { type: String, unique: true, match: /^[A-Z]{3}-\d{4}$/ } });
const userSchema = new Schema({ phone: { type: String, validate: { validator: function(v) { return /^\+?[\d\s\-]{10,15}$/.test(v); }, message: props => `${props.value} is not a valid phone number` } }, password: { type: String, validate: { validator: async function(v) { // async validator return v.length >= 8; }, message: 'Password must be at least 8 characters' } } }); // Trigger validators on update operations too await User.findByIdAndUpdate(id, update, { runValidators: true, new: true });
updateOne(), updateMany(), and findByIdAndUpdate() bypass Mongoose validation unless you pass { runValidators: true }. This is a common source of data quality bugs — always add this option if you rely on schema validators.
"Mongoose has built-in validators like required, min, max, enum, and match that run automatically before save(). For custom logic, you pass a validate object with a validator function — it can be sync or async. The critical gotcha: validation does NOT run on updateOne or findByIdAndUpdate by default — you must pass { runValidators: true } to enable it."
Mongoose middleware (also called hooks) are functions that run before or after specific operations — save, validate, remove, find, update etc. They let you inject logic into the document lifecycle without cluttering your route handlers.
const bcrypt = require('bcryptjs'); // Hash password BEFORE saving — classic use case userSchema.pre('save', async function(next) { if (!this.isModified('password')) return next(); // only hash if changed this.password = await bcrypt.hash(this.password, 12); next(); }); // Normalise email before saving userSchema.pre('save', function(next) { this.email = this.email.toLowerCase().trim(); next(); }); // Pre-find: automatically exclude deleted documents (soft delete) userSchema.pre(/^find/, function(next) { // regex matches find, findOne, findById… this.where({ isDeleted: { $ne: true } }); next(); });
// Post-save: send welcome email after user created userSchema.post('save', async function(doc) { if (doc.isNew) { await sendWelcomeEmail(doc.email); } }); // Post-find: log slow queries userSchema.post('find', function(result) { console.log(`Found ${result.length} users`); }); // Post hook with error handling userSchema.post('save', function(error, doc, next) { if (error.name === 'MongoServerError' && error.code === 11000) { next(new Error('Email already exists')); // duplicate key → friendly error } else { next(error); } });
"Mongoose middleware are hooks that intercept operations in the document lifecycle. Pre hooks run before — the most common use case is hashing passwords in a pre-save hook so the plain text never reaches the database. Post hooks run after — useful for side effects like sending emails or logging. A powerful pattern for soft deletes is a pre-find hook that automatically adds { isDeleted: { $ne: true } } to every query — transparent to the rest of the app."
Virtuals are computed properties on a document that are not stored in MongoDB. They exist only in application memory, derived from other fields. Virtuals are perfect for values that can always be calculated from persisted data — no need to duplicate them in the database.
const userSchema = new Schema({ firstName: String, lastName: String, birthYear: Number }, { toJSON: { virtuals: true }, toObject: { virtuals: true } }); // ↑ Must enable for virtuals to appear in JSON output // Virtual getter — computed from other fields userSchema.virtual('fullName').get(function() { return `${this.firstName} ${this.lastName}`; }); // Virtual setter — split fullName back into parts userSchema.virtual('fullName').set(function(v) { const [first, ...rest] = v.split(' '); this.firstName = first; this.lastName = rest.join(' '); }); // Age virtual — calculated from birthYear userSchema.virtual('age').get(function() { return new Date().getFullYear() - this.birthYear; }); // Usage const user = await User.findById(id); console.log(user.fullName); // "Alice Smith" — not stored in DB console.log(user.age); // 30 — computed each time
{ toJSON: { virtuals: true }, toObject: { virtuals: true } } in your schema options, otherwise calling res.json(user) will not include virtual fields. You cannot query or sort by virtuals — they are computed in memory only.
"Virtuals are computed document properties that never get stored in MongoDB. Common examples: fullName derived from firstName and lastName, or age from birthYear. They save storage and keep the DB as the source of truth for just the raw fields. The two things to remember: enable toJSON: { virtuals: true } in schema options or they won't appear in API responses, and you can't use them in MongoDB queries since they don't exist in the database."
populate() work in Mongoose? What are its performance implications?populate() automatically replaces a reference field (ObjectId) with the actual document from the referenced collection. It is Mongoose's answer to SQL JOINs — but under the hood it fires a separate query, not a database-level join.
// Schemas with references const postSchema = new Schema({ title: String, author: { type: Schema.Types.ObjectId, ref: 'User' }, // single ref tags: [{ type: Schema.Types.ObjectId, ref: 'Tag' }] // array of refs }); // Basic populate const post = await Post .findById(id) .populate('author'); // replaces ObjectId with full User doc // Populate with field selection (only return name and email) const post = await Post .findById(id) .populate({ path: 'author', select: 'name email -_id' }); // Populate multiple fields const post = await Post .findById(id) .populate('author') .populate('tags'); // Deep / nested populate (author's company) const post = await Post .findById(id) .populate({ path: 'author', populate: { path: 'company' } });
populate() fires additional queries — it is not a JOIN. For a list of 50 posts, populate('author') fires 1 query for posts + 1 query per unique author. In practice Mongoose batches the second query using $in, so it's 2 queries total — but for deeply nested or large data sets this adds up.
$lookup — it runs server-side in one query and is far more efficient at scale.
"populate() replaces an ObjectId reference with the actual document from another collection. Under the hood it fires a second query using $in — it's not a database JOIN. For most MERN apps this is fine, but for high-volume endpoints or deep nested populates, the extra round-trip becomes a bottleneck. In those cases I switch to the aggregation pipeline with $lookup, which resolves the join server-side in a single query."
A Mongoose plugin is a reusable function that adds functionality to a Schema — fields, methods, statics, virtuals, or middleware. Instead of repeating the same schema additions across multiple models, you encapsulate them in a plugin and apply it globally or per-schema.
// softDelete.plugin.js function softDeletePlugin(schema) { // Add fields to every schema that uses this plugin schema.add({ isDeleted: { type: Boolean, default: false }, deletedAt: { type: Date, default: null } }); // Add an instance method schema.methods.softDelete = async function() { this.isDeleted = true; this.deletedAt = new Date(); return this.save(); }; // Auto-exclude deleted documents from all finds schema.pre(/^find/, function(next) { this.where({ isDeleted: false }); next(); }); } module.exports = softDeletePlugin; // ── Usage ── const softDelete = require('./softDelete.plugin'); // Apply to a single schema userSchema.plugin(softDelete); // Apply GLOBALLY to all schemas in your app mongoose.plugin(softDelete);
"A Mongoose plugin is a reusable function that receives a schema and augments it — adding fields, instance methods, static methods, or middleware. The classic use case is a soft-delete plugin: add isDeleted and deletedAt fields, a softDelete() instance method, and a pre-find hook that filters out deleted records automatically. Apply it with schema.plugin(fn) or globally with mongoose.plugin(fn) to add it to every schema in the application."
// db.js — connection module const mongoose = require('mongoose'); const connectDB = async () => { try { await mongoose.connect(process.env.MONGO_URI, { maxPoolSize: 10, // max connections in pool (default 5) serverSelectionTimeoutMS: 5000, // fail fast if no server found socketTimeoutMS: 45000 // drop idle sockets after 45s }); console.log('MongoDB connected'); } catch (err) { console.error('MongoDB connection error:', err); process.exit(1); // crash the process — let PM2/K8s restart it } }; // Handle connection events mongoose.connection.on('disconnected', () => console.warn('MongoDB disconnected')); mongoose.connection.on('error', (err) => console.error('MongoDB error:', err)); module.exports = connectDB; // server.js — call BEFORE starting Express await connectDB(); app.listen(process.env.PORT, () => console.log('Server running'));
// Close connection gracefully on SIGTERM (Docker stop, K8s pod eviction) process.on('SIGTERM', async () => { await mongoose.connection.close(); process.exit(0); });
mongoose.connect() once at startup — Mongoose reuses pooled connections across all requests. Never open a new connection per request — that's a resource leak.
global._mongoose to avoid "too many connections" errors during development.
"Connect once at startup with mongoose.connect() and let Mongoose's internal connection pool handle concurrent requests — never open a new connection per request. Key options: maxPoolSize (default 5, tune based on load), serverSelectionTimeoutMS to fail fast if MongoDB isn't reachable. In production, listen to the disconnected event and handle SIGTERM for graceful shutdown so in-flight requests complete before the process exits."
lean() in Mongoose? When should you use it and what do you give up?By default, Mongoose wraps every returned document in a Mongoose Document object — a rich class instance with methods, virtuals, change tracking, and middleware support. .lean() tells Mongoose to skip this wrapping and return plain JavaScript objects directly from the driver.
// Standard query — returns Mongoose Document instances const users = await User.find({ isActive: true }); // Each user is a full Mongoose Document: ~3x more memory, slower users[0].save(); // ✅ available users[0].fullName; // ✅ virtual works // lean() query — returns plain JS objects const users = await User.find({ isActive: true }).lean(); // Each user is a plain object: ~2-3x faster, much less memory users[0].save(); // ❌ TypeError — not a Document users[0].fullName; // ❌ undefined — no virtuals
| Standard Query | lean() Query | |
|---|---|---|
| Speed | Slower (document hydration) | 2–3× faster |
| Memory | Higher | Much lower |
| .save(), .remove() | ✅ Available | ❌ Not available |
| Virtuals | ✅ Work | ❌ Not included (use lean-virtuals plugin) |
| Middleware | ✅ Fires on doc methods | ❌ Not relevant |
| Change tracking | ✅ isModified() works | ❌ Not available |
| Best for | Documents you will mutate | Read-only data: API responses, reports |
.lean() for any query whose result goes straight to an API response (res.json()) and won't be modified. Use standard queries when you need to call .save() or access virtuals.
"lean() makes Mongoose return plain JavaScript objects instead of full Mongoose Document instances. It's 2–3× faster and uses significantly less memory because Mongoose skips document hydration — no change tracking, no instance methods, no virtuals. Use it on any read-only query that goes straight to an API response. Avoid it when you need to call .save(), use virtuals, or rely on instance methods on the returned documents."