You can query data stored in Amazon S3 directly from MongoDB using Atlas Data Lake’s federated querying capabilities (in current Atlas releases, the federated query service is named Atlas Data Federation).

Here’s how it looks in action. Imagine you have Parquet files in an S3 bucket, and you want to query them as if they were part of your MongoDB database.

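First, you need to set up an S3 data source in Atlas. This involves providing your S3 bucket name, region, and AWS credentials. Treat the snippet below as a simplified illustration: in practice, Atlas is usually granted access through an IAM role rather than long-lived access keys, and real secrets should never be stored in a configuration file.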

{
  "name": "my-s3-bucket-source",
  "type": "s3",
  "aws": {
    "bucket": "my-data-lake-bucket",
    "region": "us-east-1",
    "credentials": {
      "accessKeyId": "AKIA...",
      "secretAccessKey": "your_secret_access_key..."
    }
  }
}
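If you do generate a definition like the one above from a script, one way to keep secrets out of source control is to read them from the environment. The sketch below is only an illustration: `buildS3Source` is a hypothetical helper, the field layout simply mirrors the snippet above (it is not a verified Atlas API shape), and the credential values are placeholders.

```javascript
// Hypothetical helper: assembles the data source definition shown above,
// reading credentials from environment variables instead of hardcoding them.
function buildS3Source(name, bucket, region, env = process.env) {
  const accessKeyId = env.AWS_ACCESS_KEY_ID;
  const secretAccessKey = env.AWS_SECRET_ACCESS_KEY;
  if (!accessKeyId || !secretAccessKey) {
    throw new Error("AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set");
  }
  return {
    name,
    type: "s3",
    aws: { bucket, region, credentials: { accessKeyId, secretAccessKey } },
  };
}

// Example with a stand-in environment object (values are placeholders).
const source = buildS3Source("my-s3-bucket-source", "my-data-lake-bucket", "us-east-1", {
  AWS_ACCESS_KEY_ID: "AKIA-PLACEHOLDER",
  AWS_SECRET_ACCESS_KEY: "placeholder-secret",
});
console.log(JSON.stringify(source, null, 2));
```

Failing fast when a variable is missing avoids submitting a half-filled definition.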

Next, you define a collection that maps to the S3 data source and declares a schema for the data. The schema tells MongoDB how to interpret the Parquet files.

{
  "name": "my-parquet-collection",
  "type": "collection",
  "dataSources": [
    {
      "storeName": "my-s3-bucket-source",
      "database": "s3",
      "collection": "my-parquet-collection",
      "physicalFilters": {
        "partitionFilters": [
          {"name": "year", "type": "int", "values": [2023]}
        ]
      },
      "schema": {
        "bsonType": "object",
        "properties": {
          "user_id": {"bsonType": "string"},
          "event_timestamp": {"bsonType": "date"},
          "event_type": {"bsonType": "string"},
          "details": {"bsonType": "object"}
        }
      }
    }
  ]
}

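With this setup, you can now query my-parquet-collection using the standard MongoDB query language. Because the collection name contains hyphens, access it in mongosh with db.getCollection() rather than dot notation. For example, to find all events of type 'login' from a specific user in 2023:

db.getCollection('my-parquet-collection').find({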
  user_id: "user123",
  event_type: "login",
  event_timestamp: { $gte: new Date("2023-01-01T00:00:00Z"), $lt: new Date("2024-01-01T00:00:00Z") }
})
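If you run this kind of query for different users or years, the filter can be built by a small function. This is a minimal sketch; `buildYearFilter` is a hypothetical name, and the field names are taken from the schema above. It uses the same half-open range (`$gte` start of year, `$lt` start of next year) as the query above, so no event on a year boundary is counted twice.

```javascript
// Hypothetical helper: builds the filter used above for any user, event type, and year.
function buildYearFilter(userId, eventType, year) {
  return {
    user_id: userId,
    event_type: eventType,
    event_timestamp: {
      $gte: new Date(Date.UTC(year, 0, 1)),     // inclusive start of the year (UTC)
      $lt: new Date(Date.UTC(year + 1, 0, 1)),  // exclusive start of the next year
    },
  };
}

const filter = buildYearFilter("user123", "login", 2023);
console.log(filter.event_timestamp.$gte.toISOString()); // 2023-01-01T00:00:00.000Z
console.log(filter.event_timestamp.$lt.toISOString());  // 2024-01-01T00:00:00.000Z
```

The resulting object can be passed to find() as is.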

Atlas Data Lake applies your query filters as close to the storage layer as it can, so that it reads only the S3 data that could match. S3 itself does not execute the query; the saving comes from fetching fewer objects. This is crucial for performance, especially with large datasets. The physicalFilters in the collection definition are key here: when your S3 data is partitioned (e.g., by year, month, or any other logical grouping), they tell Atlas which partitions are relevant, so it scans only those and drastically reduces the amount of data read.
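The idea behind partition pruning can be sketched in a few lines. This is a conceptual illustration only, not Atlas's actual implementation: it assumes a hypothetical Hive-style key layout (`year=YYYY/` in the object key), and `yearOfKey` and `pruneByYear` are made-up names.

```javascript
// Extracts the partition year from a key such as "events/year=2023/part-0000.parquet".
// Returns null when the key has no year partition segment.
function yearOfKey(key) {
  const match = /(?:^|\/)year=(\d{4})\//.exec(key);
  return match ? Number(match[1]) : null;
}

// Keeps only the keys whose partition year is in the allowed list.
function pruneByYear(keys, allowedYears) {
  const allowed = new Set(allowedYears);
  return keys.filter((key) => allowed.has(yearOfKey(key)));
}

const keys = [
  "events/year=2022/part-0000.parquet",
  "events/year=2023/part-0000.parquet",
  "events/year=2023/part-0001.parquet",
  "events/year=2024/part-0000.parquet",
];

const toScan = pruneByYear(keys, [2023]);
console.log(toScan); // only the two year=2023 objects remain
```

Here two of the four objects are never read, which is the effect the partition filter is meant to achieve at scale.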

The core problem this solves is the impedance mismatch between unstructured or semi-structured data in object storage and the structured querying capabilities of a database. Instead of running ETL jobs to copy all your S3 data into MongoDB, which can be costly and complex, you can query it in place. Atlas Data Lake acts as a query engine that understands both your MongoDB schema and the structure of your S3 data. It translates your MongoDB queries into targeted reads against S3 that take advantage of the structure of file formats like Parquet or ORC.

When you query my-parquet-collection, Atlas doesn’t load all the Parquet files into memory. Instead, it analyzes the query and the schema, identifies which S3 objects and which parts of those objects contain the relevant data (especially if partitioned), and then retrieves only that subset for processing. This "query at the source" approach is what makes it scalable.
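Parquet files record minimum and maximum values per column for each row group, which is what makes reading "parts of objects" possible in principle. The sketch below shows the general technique of skipping row groups by those statistics; it is an assumption-laden illustration, not a description of what Atlas does internally. The row-group metadata is invented, and `overlapsRange` and `selectRowGroups` are hypothetical names.

```javascript
// True when a row group's [min, max] range (inclusive) intersects the
// query range [gte, lt). All values are epoch milliseconds.
function overlapsRange(stats, gte, lt) {
  return stats.max >= gte && stats.min < lt;
}

// Returns the ids of row groups that may contain matching rows.
function selectRowGroups(rowGroups, gte, lt) {
  return rowGroups
    .filter((group) => overlapsRange(group.eventTimestamp, gte, lt))
    .map((group) => group.id);
}

// Invented metadata for four row groups of one file.
const rowGroups = [
  { id: 0, eventTimestamp: { min: Date.UTC(2022, 10, 1), max: Date.UTC(2022, 11, 31) } },
  { id: 1, eventTimestamp: { min: Date.UTC(2022, 11, 15), max: Date.UTC(2023, 0, 20) } },
  { id: 2, eventTimestamp: { min: Date.UTC(2023, 5, 1), max: Date.UTC(2023, 5, 30) } },
  { id: 3, eventTimestamp: { min: Date.UTC(2024, 0, 1), max: Date.UTC(2024, 1, 1) } },
];

const selected = selectRowGroups(rowGroups, Date.UTC(2023, 0, 1), Date.UTC(2024, 0, 1));
console.log(selected); // [ 1, 2 ]
```

Row group 1 is kept even though it starts in 2022, because statistics can only rule a group out, never confirm that every row matches.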

One aspect that often surprises people is how Atlas handles schema evolution. If your Parquet files in S3 gain new fields or change types, Atlas Data Lake can often adapt, and by default it might infer types for new fields. For precise control, especially with nested structures, define the schema explicitly in your collection configuration. An explicit schema acts as a contract: if the underlying S3 data drifts from it and the schema is not updated to match, queries may return unexpected results or fail, even though the data exists in S3.
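To see what "deviating from the contract" means in practice, a sample document can be compared against the declared schema before you rely on it. The following is a rough sketch under stated assumptions: `findSchemaDrift` is a hypothetical helper, it only understands the three bsonType values used in the schema above, and it checks top-level fields only.

```javascript
// Type checks for the bsonType values that appear in the schema above.
const typeChecks = {
  string: (v) => typeof v === "string",
  date: (v) => v instanceof Date,
  object: (v) =>
    v !== null && typeof v === "object" && !Array.isArray(v) && !(v instanceof Date),
};

const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Reports fields the schema does not declare and declared fields whose value
// has the wrong type. Fields absent from the document are not reported.
function findSchemaDrift(schema, doc) {
  const declared = schema.properties;
  const undeclared = Object.keys(doc).filter((field) => !has(declared, field));
  const mismatched = Object.keys(declared).filter((field) => {
    if (!has(doc, field)) return false;
    const check = typeChecks[declared[field].bsonType];
    return check ? !check(doc[field]) : false; // unknown bsonTypes are skipped
  });
  return { undeclared, mismatched };
}

const schema = {
  bsonType: "object",
  properties: {
    user_id: { bsonType: "string" },
    event_timestamp: { bsonType: "date" },
    event_type: { bsonType: "string" },
    details: { bsonType: "object" },
  },
};

// A sample row where user_id changed type and a new field appeared.
const drift = findSchemaDrift(schema, {
  user_id: 42,
  event_timestamp: new Date("2023-03-01T12:00:00Z"),
  event_type: "login",
  details: {},
  session_id: "abc",
});
console.log(drift); // { undeclared: [ 'session_id' ], mismatched: [ 'user_id' ] }
```

Running a check like this on a few sampled rows is a cheap way to notice drift before it surfaces as a query error.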

The next step is to explore how to integrate this data lake data with your existing MongoDB collections using $lookup stages in Atlas Aggregation Pipelines.
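As a preview, such a pipeline might look like the sketch below. The `users` collection, its `_id` join key, and the `buildLoginsWithProfiles` helper are all assumptions made for illustration; whether a cluster collection is reachable from the federated database depends on how that database is configured.

```javascript
// Hypothetical pipeline: 2023-style yearly login events from the S3-backed
// collection, each joined with a matching profile from an assumed "users" collection.
function buildLoginsWithProfiles(year) {
  return [
    {
      $match: {
        event_type: "login",
        event_timestamp: {
          $gte: new Date(Date.UTC(year, 0, 1)),
          $lt: new Date(Date.UTC(year + 1, 0, 1)),
        },
      },
    },
    {
      $lookup: {
        from: "users",          // assumed collection name
        localField: "user_id",  // field from the S3-backed documents
        foreignField: "_id",    // assumed to hold the same user identifier
        as: "profile",
      },
    },
    { $unwind: "$profile" },    // drops events with no matching user
  ];
}

const pipeline = buildLoginsWithProfiles(2023);
console.log(JSON.stringify(pipeline, null, 2));
```

In mongosh, the pipeline would be passed to db.getCollection('my-parquet-collection').aggregate(pipeline). Placing $match first keeps the join limited to the events that survive filtering.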
