The most surprising thing about LLM benchmarks is that they often measure competence in a way that’s fundamentally different from how humans learn and apply knowledge.

Let’s see MMLU in action. Imagine a student facing a multiple-choice test. The LLM, in this case, is that student. It’s presented with a question and a set of options, and its task is to pick the correct one.

{
  "question": "What is the primary function of the mitochondria in a eukaryotic cell?",
  "options": [
    "A. Protein synthesis",
    "B. Energy production (ATP synthesis)",
    "C. Waste removal",
    "D. DNA replication"
  ],
  "answer": "B"
}

When an LLM like GPT-4 is evaluated on MMLU (Massive Multitask Language Understanding), it’s fed thousands of such questions across 57 subjects, ranging from elementary mathematics and US history to professional law and medicine. The LLM doesn’t "understand" these subjects in a human sense; it selects the option it rates as most probable, based on patterns in its training data. The score is simply the percentage of questions it answers correctly.
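To make the scoring concrete, here is a minimal sketch of how an accuracy figure could be computed once the model's choices have been collected. The record layout (`subject`, `answer`, `prediction`) and the function name are assumptions made for illustration, not the format of any official MMLU tooling.

```python
from collections import defaultdict

def mmlu_accuracy(records):
    """Return (overall accuracy, per-subject accuracy) for graded records.

    Each record is a dict with hypothetical keys: 'subject', 'answer'
    (the gold letter) and 'prediction' (the letter the model chose).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["subject"]] += 1
        # Compare letters case-insensitively and ignore stray whitespace.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["subject"]] += 1
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_subject

# Made-up records, only to exercise the function.
records = [
    {"subject": "biology", "answer": "B", "prediction": "B"},
    {"subject": "biology", "answer": "C", "prediction": "A"},
    {"subject": "us_history", "answer": "D", "prediction": "d"},
    {"subject": "us_history", "answer": "A", "prediction": "A"},
]
overall, per_subject = mmlu_accuracy(records)
print(f"overall: {overall:.2f}")
print(per_subject)
```

Breaking the score down by subject is a deliberate choice here: a single overall percentage can hide the fact that a model is strong in one area and weak in another.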

HumanEval is less about factual recall and more about practical application. Think of it as a coding interview where the LLM has to write a function to solve a given problem.

def fibonacci(n):
    """
    Write a function that takes an integer n and returns the nth Fibonacci number.
    The first two Fibonacci numbers are 0 and 1.
    """
    # Everything above is the prompt the model sees.
    # The body below is one expected solution.
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        a, b = 0, 1
        # After each step, b holds the next Fibonacci number.
        for _ in range(n - 1):
            a, b = b, a + b
        return b

The LLM is given the function signature and a docstring describing the task, and it generates the Python code for the body. HumanEval checks each completion by running it against unit tests that the model never sees. The standard metric is "pass@k": a problem counts as solved if at least one of k generated samples passes every test, and the reported score is the fraction of problems solved. For example, pass@100 is the share of problems for which at least one of 100 generated code snippets is correct.
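Here is a rough sketch of both halves of that process: checking a generated sample against test cases, and turning the results into a pass@k number. The `pass_at_k` function follows the unbiased estimator commonly used with HumanEval (draw n samples, count the c that pass, compute 1 - C(n-c, k) / C(n, k)). The `passes_tests` helper, the test cases, and the two candidate strings are hypothetical and only for illustration; a real harness would sandbox execution rather than call `exec` directly.

```python
from math import comb

def passes_tests(candidate_src, tests, func_name="fibonacci"):
    """Run candidate source and check it on (argument, expected) pairs."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # unsafe outside a sandbox
        func = namespace[func_name]
        return all(func(arg) == expected for arg, expected in tests)
    except Exception:
        # Syntax errors, crashes, and missing functions all count as failures.
        return False

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

TESTS = [(0, 0), (1, 1), (2, 1), (10, 55)]

GOOD = '''
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
'''

BAD = '''
def fibonacci(n):
    return n
'''

samples = [GOOD, BAD, BAD, BAD]
n = len(samples)
c = sum(passes_tests(s, TESTS) for s in samples)
print(c, pass_at_k(n, c, k=1), pass_at_k(n, c, k=2))
```

This gives the estimate for a single problem; the benchmark score would be the average of that value over all problems.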

HellaSwag presents a more nuanced challenge, focusing on common sense reasoning. It asks the LLM to pick the most plausible continuation of a given scenario.

{
  "context": "A person is walking down the street and sees a dog tied to a lamppost. The dog starts barking and pulling on its leash. The person stops and looks at the dog.",
  "endings": [
    "A. The person walks over to the dog and pets it.",
    "B. The person decides to buy a new car.",
    "C. The person pulls out a book and starts reading.",
    "D. The person walks away quickly."
  ],
  "gold_label": "A"
}

Here, the LLM must use its understanding of typical human behavior and social situations to select the most logical next event. Option "A" is a common and expected reaction. "B" and "C" have nothing to do with the scene, and "D," while conceivable, is treated as less typical in this illustrative item. HellaSwag is designed to be difficult for models that rely solely on statistical co-occurrence of words: its wrong endings are machine-generated and adversarially filtered so that they look plausible on the surface, which forces models to engage with more sophisticated common sense.
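One common way to score this kind of task is to ask the model how likely each ending is given the context, then pick the highest-scoring one, often after normalizing for length. The sketch below assumes that setup. `toy_log_likelihood` is a hypothetical word-overlap stand-in for a real model's log-probability, so it is exactly the kind of shallow scorer the benchmark is meant to defeat; it only succeeds here because this illustrative item is much easier than real HellaSwag items.

```python
def pick_ending(context, endings, log_likelihood):
    """Return the index of the ending with the best length-normalized score."""
    scores = []
    for ending in endings:
        n_tokens = max(len(ending.split()), 1)  # whitespace tokens as a stand-in
        scores.append(log_likelihood(context, ending) / n_tokens)
    return max(range(len(endings)), key=scores.__getitem__)

def toy_log_likelihood(context, ending):
    """Hypothetical stand-in for a model: rewards words shared with the context."""
    context_words = set(context.lower().replace(".", "").split())
    words = ending.lower().replace(".", "").split()
    # A word already seen in the context is less "surprising" than a new one.
    return sum(-1.0 if w in context_words else -3.0 for w in words)

context = (
    "A person is walking down the street and sees a dog tied to a lamppost. "
    "The dog starts barking and pulling on its leash. "
    "The person stops and looks at the dog."
)
endings = [
    "The person walks over to the dog and pets it.",
    "The person decides to buy a new car.",
    "The person pulls out a book and starts reading.",
    "The person walks away quickly.",
]
choice = pick_ending(context, endings, toy_log_likelihood)
print("ABCD"[choice])
```

Dividing by the token count keeps the comparison from simply favoring shorter endings, since a raw log-likelihood gets more negative with every added word.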

The overall goal of these benchmarks is to provide a standardized way to compare the capabilities of different LLMs across various domains. MMLU tests broad knowledge, HumanEval tests coding ability, and HellaSwag tests common sense reasoning. By aggregating scores across these and other benchmarks, researchers can get a sense of a model’s overall capability and of its particular strengths and weaknesses.
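As a small illustration of aggregation, the sketch below takes an unweighted average across benchmarks and also reports the weakest one. The scores are made-up placeholders, not results for any real model, and an unweighted mean is just one possible choice; published leaderboards may weight or normalize differently.

```python
def macro_average(scores):
    """Unweighted mean of per-benchmark scores (each between 0 and 1)."""
    return sum(scores.values()) / len(scores)

def weakest(scores):
    """Name of the benchmark with the lowest score."""
    return min(scores, key=scores.get)

# Placeholder numbers for illustration only.
scores = {"MMLU": 0.70, "HumanEval": 0.50, "HellaSwag": 0.90}
print(f"average: {macro_average(scores):.2f}")
print(f"weakest: {weakest(scores)}")
```

Reporting the weakest benchmark next to the average is a reminder that one headline number can mask a large gap between domains.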

What’s often overlooked is how these benchmarks, by their very design, can incentivize models to become excellent pattern-matchers and knowledge regurgitators rather than agents that truly reason or understand. A model might score highly on MMLU by memorizing facts and their common associations, without any deeper comprehension of the underlying principles. Similarly, it can learn to generate functional code for HumanEval by recognizing common coding patterns and solutions from its training data, rather than through genuine algorithmic thought. The "common sense" in HellaSwag can also be learned as statistical likelihoods of word sequences in certain contexts, mimicking understanding without possessing it.

The next challenge in LLM evaluation involves assessing models on tasks that require genuine creativity and long-term planning.
