The most surprising thing about LLM pretraining data is that "quality" isn’t just about how clean or factual the text is, but also about how diverse it is in style, complexity, and underlying task signals.

Let’s look at a hypothetical pretraining dataset being assembled. Imagine we’re using a combination of web scrapes, books, and code.

[
  {
    "source": "Common Crawl",
    "url": "https://example.com/article/12345",
    "content": "The quick brown fox jumps over the lazy dog. This sentence is often used for typing practice and demonstrates all letters of the English alphabet. It's a pangram."
  },
  {
    "source": "Project Gutenberg",
    "url": "urn:uuid:12345678-abcd-efgh-ijkl-mnopqrstuvwx",
    "content": "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair."
  },
  {
    "source": "GitHub",
    "url": "https://github.com/user/repo/blob/main/script.py",
    "content": "def greet(name):\n    return f'Hello, {name}!'\n\nprint(greet('World'))"
  }
]

This snippet shows three very different types of text: a simple factual statement, classic literature, and Python code. Each teaches the LLM something distinct. The web scrape teaches it about common language, factual recall, and potentially how to structure basic paragraphs. The book teaches it about narrative, complex sentence structures, figurative language, and historical context. The code teaches it about syntax, logic, and the structure of programming languages.

The core problem pretraining data curation solves is providing the LLM with a broad enough understanding of human language and knowledge to perform a vast array of downstream tasks without explicit task-specific fine-tuning. It’s about building a foundational "world model" from text. The LLM learns grammar, facts, reasoning patterns, and even stylistic nuances by predicting the next token across this massive, varied corpus.

Internally, pretraining is one massive next-token prediction task. Given a sequence of tokens (words, sub-words, or characters), the model outputs a probability distribution over the next token. The objective function (typically cross-entropy loss) penalizes the model for assigning low probability to the token that actually comes next in the training data. By doing this billions or trillions of times on diverse data, the model’s internal weights (parameters) adjust to capture the statistical regularities of language, knowledge, and reasoning present in the data.
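
To make the objective concrete, here is a minimal sketch of the next-token cross-entropy loss in plain Python. The four-word vocabulary and the model scores (logits) are made up for illustration; a real model produces scores over tens of thousands of tokens, and a real training loop would use a tensor library rather than lists.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Mean cross-entropy: the average of -log p(actual next token)."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical vocabulary and made-up model scores for two positions.
vocab = ["the", "quick", "brown", "fox"]
logits = [
    [0.1, 2.0, 0.3, 0.1],  # after "the": the model favors "quick"
    [0.2, 0.1, 0.1, 3.0],  # after "the quick": the model favors "fox"
]
# The tokens that actually come next in the training text.
targets = [vocab.index("quick"), vocab.index("brown")]

loss = next_token_loss(logits, targets)
uniform_loss = math.log(len(vocab))  # loss of a model that guesses uniformly
print(f"loss={loss:.3f}  uniform baseline={uniform_loss:.3f}")
```

The first position contributes a small loss because the model put most of its probability on the correct token; the second contributes a large one because it confidently predicted "fox" when the text continued with "brown". Training nudges the weights to shrink that second kind of error.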

The levers you control in data curation are manifold:

  • Source Selection: Deciding where to get data from (e.g., specific websites, academic papers, curated book collections, specific programming language repositories). Each source has inherent biases and quality characteristics.
  • Deduplication: Identifying and removing identical or near-identical documents. Without this, the model might overfit to repeated content, giving undue weight to certain phrases or facts. A common technique for finding near-duplicates is MinHash combined with locality-sensitive hashing (LSH).
  • Filtering: Removing low-quality content. This can include:
    • Language Identification: Ensuring documents are in the target language(s).
    • Toxicity/Hate Speech Detection: Removing harmful content.
    • Boilerplate Removal: Stripping out navigation menus, ads, footers from web pages.
    • Quality Scores: Using heuristics (e.g., sentence length, punctuation usage, presence of "stop words") or even smaller trained models to score document quality.
  • Data Mixing: Choosing the proportion of data drawn from each source. A 70% web, 20% books, 10% code split will yield a different model than a 50% web, 40% books, 10% code split. This is a critical hyperparameter.
  • Task Augmentation: Sometimes, specific "synthetic" tasks are injected into the pretraining data. For instance, adding question-answer pairs or summarization examples formatted as plain text to encourage specific downstream capabilities.
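
As a rough illustration of the deduplication lever, the sketch below builds MinHash signatures over 3-word shingles and drops any document whose estimated Jaccard similarity to an already-kept document crosses a threshold. It is a simplification under stated assumptions: it compares signatures pairwise instead of using LSH banding (which is what makes the approach scale), and the shingle size, the 64 hash functions, the 0.6 threshold, and all function names are illustrative choices, not values from any particular pipeline.

```python
import hashlib
import re

def shingles(text, k=3):
    """Set of k-word shingles from lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _hash(shingle, seed):
    """64-bit hash of a shingle; the seed-derived salt mimics a family of hash functions."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(text, num_hashes=64):
    """For each hash function, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(_hash(s, seed) for s in sh) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(docs, threshold=0.6):
    """Keep a document only if it is not too similar to any document kept so far."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept

base = ("the quick brown fox jumps over the lazy dog and then runs far away "
        "into the deep green forest near the river")
docs = [
    base,
    base + " today",  # near-duplicate: one extra word
    "def greet(name): return f'Hello, {name}!'",
]
unique = deduplicate(docs)
print(len(unique), "of", len(docs), "documents kept")
```

The pairwise loop is quadratic in the number of documents, which is fine for a demo but not for a web-scale corpus. LSH addresses that by bucketing signatures so that only likely matches are ever compared.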

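Data mixing can be sketched just as simply. The example below draws documents so that each source shows up in proportion to a configured weight, using the 70/20/10 split mentioned above. The source names, the placeholder documents, and the `sample_mixture` helper are all hypothetical; real pipelines usually mix at the token level and track how many times each source is repeated, which this sketch ignores.

```python
import random

def sample_mixture(sources, weights, num_docs, seed=0):
    """Draw (source, document) pairs so each source appears in proportion to its weight."""
    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    names = list(sources)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(num_docs):
        name = rng.choices(names, weights=probs, k=1)[0]
        # Sampling with replacement: small sources get repeated more often.
        batch.append((name, rng.choice(sources[name])))
    return batch

# Placeholder corpora standing in for the real web, book, and code data.
sources = {
    "web": ["web doc 1", "web doc 2", "web doc 3"],
    "books": ["book excerpt 1", "book excerpt 2"],
    "code": ["def f(): pass"],
}
weights = {"web": 0.7, "books": 0.2, "code": 0.1}

batch = sample_mixture(sources, weights, num_docs=10_000)
counts = {name: sum(1 for src, _ in batch if src == name) for name in sources}
print(counts)
```

Changing `weights` to the 50/40/10 split changes what the model sees most often without touching the underlying data, which is why the mixture is treated as a hyperparameter in its own right.
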
One particularly subtle aspect of data curation is how different levels of linguistic information are encoded. The model doesn’t just learn word meanings; it learns the statistical relationships between words, the typical sequences of words in different contexts (syntax), the common co-occurrences of ideas (semantics), and even the underlying intent or purpose of a piece of text based on its style and content. For example, a forum post discussing a bug might teach the model about problem-solving and technical jargon, while a poem teaches it about metaphor and emotional expression, all through the same next-token prediction mechanism. The model learns to imitate these different "modes" of communication by observing their patterns in the training data.

After curating high-quality pretraining data, the next hurdle is efficiently processing and tokenizing it for training at scale.
