The Hugging Face datasets library can stream data larger than your RAM, but not quite in the way you might expect: load_dataset hands back a lazy iterator over your data files, and rows are read only as you consume them.

Let’s see it in action. Imagine you have a massive CSV file, huge_dataset.csv, that’s 50GB and won’t fit into your 16GB RAM.

col1,col2,col3
"value1_row1","value2_row1","value3_row1"
"value1_row2","value2_row2","value3_row2"
...

Here’s how you’d load it using datasets:

from datasets import load_dataset

# This doesn't load the whole CSV into RAM!
dataset = load_dataset("csv", data_files="huge_dataset.csv", streaming=True)

# You can iterate over it
for i, example in enumerate(dataset["train"]):
    # Process each example
    print(example)
    if i >= 9:  # Stop after 10 examples to limit output
        break

When streaming=True, load_dataset returns an IterableDatasetDict whose splits (such as dataset["train"]) are IterableDataset objects rather than regular Dataset objects. Each one is essentially a manifest of your data files plus the logic to read them. It doesn’t read the file contents until you start iterating. Each item you iterate over is a dictionary representing a single row (or document, depending on the dataset format), and its data is read from disk just in time. Because there is no random access, indexing such as dataset["train"][0] is not available. This allows you to work with datasets that are orders of magnitude larger than your available memory.

The core problem streaming=True solves is the memory constraint for large datasets. Traditionally, loading a dataset meant reading all its data into RAM for fast access. For datasets exceeding available RAM, this is impossible. datasets with streaming circumvents this by treating the dataset as a sequence of individual data points that can be fetched on demand.

Internally, when you call load_dataset(..., streaming=True), the library sets up an iterator. This iterator knows the location of your data files and the structure of each record within those files. When you loop through dataset["train"], the iterator:

  1. Advances sequentially through the file(s), picking up where the previous read left off (there is no random access by index).
  2. Reads the next record, or a small batch of records depending on the format (e.g., rows from a CSV, JSON objects from a JSON Lines file).
  3. Parses that data into a dictionary or similar structure.
  4. Yields that single example to your loop.

This process repeats for each iteration of your loop. This is a form of lazy loading, where computation (reading and parsing) is deferred until the data is actually needed.
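The steps above can be approximated in plain Python. This is a minimal sketch of the lazy-loading idea using only the standard library, not the library's actual implementation; the helper name iter_csv_rows and the in-memory sample are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def iter_csv_rows(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one CSV row at a time as a dict.

    Nothing is read from the handle until the caller starts iterating.
    """
    reader = csv.DictReader(handle)  # the header is parsed on the first read
    for row in reader:               # lines are pulled from the handle as needed
        yield dict(row)


# An in-memory stand-in for a large file on disk (hypothetical sample data).
sample = io.StringIO(
    "col1,col2,col3\n"
    '"value1_row1","value2_row1","value3_row1"\n'
    '"value1_row2","value2_row2","value3_row2"\n'
)

rows = iter_csv_rows(sample)  # no parsing has happened yet
first = next(rows)            # reads the header and the first row only
print(first)
```

Because the generator holds only the current row, memory use stays flat no matter how many rows the file contains.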

The key levers you control are:

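  • streaming=True: This is the primary switch to enable streaming. Without it, datasets first converts the entire dataset into an Arrow cache on disk and memory-maps it, which costs time and disk space before you can read the first example.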
  • data_files: You can provide a single file path, a list of paths, or a dictionary mapping split names to file paths. For streaming, the library will iterate through these files sequentially.
  • Dataset builder arguments: Depending on the dataset format (CSV, JSON, Parquet, text, etc.), you might have specific arguments to load_dataset (e.g., sep for CSV, field for JSON) that influence how individual records are parsed after being read from disk.

When you use streaming=True, the IterableDataset object itself has a very small memory footprint. It holds metadata about the files and the dataset structure, but not the actual data. The memory usage comes from processing a single example (or a small batch) at a time within your loop. This makes it ideal for training models on massive datasets where you can’t afford to load everything at once. You can also perform operations like shuffle and take on streaming datasets, though the mechanics differ from in-memory datasets. For example, dataset.shuffle(buffer_size=1000) keeps a buffer of 1000 examples and yields a random one from it as new examples arrive, so examples are shuffled locally rather than across the entire dataset. When there are several data files, the order of the files is shuffled as well.
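To make the buffer-based shuffle concrete, here is a minimal standard-library sketch of the general technique. It is an approximation for illustration only, not the library's code; the function name buffered_shuffle and its seed parameter are assumptions.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximate shuffle of a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer before yielding anything
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered example
        buffer[idx] = item       # replace it with the newly read example
    rng.shuffle(buffer)          # drain whatever is left at the end
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=42))
print(shuffled)
```

With this scheme an example can be emitted early or late but never before it has been read, so a small buffer gives only local mixing. A larger buffer_size mixes more thoroughly at the cost of holding more examples in memory.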

The most surprising thing about datasets streaming is that .train_test_split() is not available on a streaming dataset. Instead, you carve out subsets with .take(n) and .skip(n). Neither call splits the data files themselves: each returns a new IterableDataset pointing to the same original data files, with internal logic to yield only the first n examples or only the examples after them. This means the underlying data isn’t duplicated or re-written; it’s just a different iterator over the same source.

After you’ve processed a streaming dataset and are comfortable with its behavior, the next logical step is often to consider how to make your processing itself more efficient, leading into distributed training or more advanced data preprocessing pipelines.
