LLM-based structured data extraction is surprisingly bad at handling ambiguity: the model often defaults to the interpretation it finds most probable, which is sometimes wrong.

Let’s see how this plays out in practice. Imagine we have a piece of unstructured text like this:

Invoice #12345 issued on 2023-10-27 for Acme Corp. Total amount due is $500.00. Please remit payment within 30 days.

Our goal is to extract invoice_number, issue_date, company_name, and total_amount. A typical prompt for an LLM might look like this:

Extract the following information from the text:
- Invoice Number:
- Issue Date:
- Company Name:
- Total Amount:

Text:
Invoice #12345 issued on 2023-10-27 for Acme Corp. Total amount due is $500.00. Please remit payment within 30 days.

A well-tuned LLM might return:

{
  "invoice_number": "12345",
  "issue_date": "2023-10-27",
  "company_name": "Acme Corp.",
  "total_amount": "$500.00"
}

This looks great, but what happens when the text gets a little more complex?

Regarding order 7890 from Global Solutions, dated November 15th, 2023. The final bill came to £750.50.

If we use the same prompt, we might get:

{
  "invoice_number": "7890",
  "issue_date": "November 15th, 2023",
  "company_name": "Global Solutions",
  "total_amount": "£750.50"
}

This is also reasonable, although the date comes back in the source's format rather than a normalized one, and an order number has been mapped to invoice_number. The LLM is good at identifying patterns and common data formats. But the underlying mechanism isn’t truly "understanding" in a human sense. It’s predicting the most probable sequence of tokens that represent the requested fields based on its training data. This leads to brittle performance when faced with variations.

The core problem is that LLMs don’t inherently understand the constraints of structured data. They don’t know that an invoice_number should typically be alphanumeric, or that a date has specific formatting rules, unless those rules are heavily implied in the prompt or learned from vast amounts of training data. When confronted with ambiguity, they often pick the most frequent pattern they’ve seen, which can lead to errors.

Consider this:

Customer: John Doe
Order ID: 987654
Amount: 100 USD
Payment Date: 2023-11-20 (Received)

Here, "Payment Date" is distinct from an "Issue Date." Given a simple prompt, the model might misinterpret the "Payment Date" value as the issue_date.

The solution isn’t just better prompting, but a more robust system design. This involves:

  1. Schema Definition: Explicitly define your target schema. For example, invoice_number (string, alphanumeric, max 50 chars), issue_date (date, YYYY-MM-DD), total_amount (decimal, currency symbol optional).

  2. Few-Shot Examples: Provide several high-quality examples in your prompt that cover common variations and edge cases. This helps the LLM learn the desired output format and interpretation.

    Extract the following information from the text:
    - Invoice Number: (e.g., "INV-12345", "54321")
    - Issue Date: (YYYY-MM-DD format, e.g., "2023-10-27")
    - Company Name: (e.g., "Acme Corp.", "Global Solutions")
    - Total Amount: (e.g., "$500.00", "£750.50", "100 USD")
    
    Example 1:
    Text: Invoice #12345 issued on 2023-10-27 for Acme Corp. Total amount due is $500.00.
    Output: {"invoice_number": "12345", "issue_date": "2023-10-27", "company_name": "Acme Corp.", "total_amount": "$500.00"}
    
    Example 2:
    Text: Regarding order 7890 from Global Solutions, dated November 15th, 2023. The final bill came to £750.50.
    Output: {"invoice_number": "7890", "issue_date": "2023-11-15", "company_name": "Global Solutions", "total_amount": "£750.50"}
    
    Text:
    Customer: John Doe
    Order ID: 987654
    Amount: 100 USD
    Payment Date: 2023-11-20 (Received)
    

    With this, a better output might be the following, where issue_date is null (or some other indicator that it wasn't found) because the text only gives a payment date, and company_name could also be null, depending on how you want to treat customer names:

    {
      "invoice_number": "987654",
      "issue_date": null,
      "company_name": "John Doe",
      "total_amount": "100 USD"
    }
    
  3. Output Validation & Correction: After the LLM generates output, validate it programmatically. Check whether dates are in the correct format, whether amounts are numeric, whether the company name matches a known list, and so on. If validation fails, you can either flag the record or, as a more advanced option, make a second LLM call with a targeted prompt that corrects the identified error, passing it the original text and the erroneous output.

    For instance, if the LLM returned {"issue_date": "November 15th, 2023"} for Example 2, a validator would flag this. A correction prompt could be:

    The following JSON has an incorrectly formatted date. Please correct the 'issue_date' field to YYYY-MM-DD format.
    
    Original Text:
    Regarding order 7890 from Global Solutions, dated November 15th, 2023. The final bill came to £750.50.
    
    Incorrect JSON:
    {"invoice_number": "7890", "issue_date": "November 15th, 2023", "company_name": "Global Solutions", "total_amount": "£750.50"}
    

    A successful correction would yield:

    {"invoice_number": "7890", "issue_date": "2023-11-15", "company_name": "Global Solutions", "total_amount": "£750.50"}
    
  4. Ensemble Methods: For critical data, run the extraction through multiple LLMs or with different prompts and compare the results. Discrepancies can highlight areas needing human review.
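
The comparison in step 4 can be sketched in a few lines. This is a minimal illustration, not a prescribed method: it assumes each run has already been parsed into a dict, it uses a simple field-wise majority vote (one of several possible ways to reconcile results), and the names `FIELDS`, `compare_extractions`, and the sample `runs` are hypothetical.

```python
from collections import Counter

# Field names taken from the extraction goal described earlier.
FIELDS = ("invoice_number", "issue_date", "company_name", "total_amount")


def compare_extractions(candidates):
    """Field-wise majority vote over several extraction results.

    `candidates` is a list of dicts, one per model or prompt variant.
    Returns (consensus, disputed): the most common value per field, and
    the names of fields on which the candidates did not all agree.
    """
    consensus, disputed = {}, []
    for field in FIELDS:
        counts = Counter(c.get(field) for c in candidates)
        consensus[field] = counts.most_common(1)[0][0]
        if len(counts) > 1:
            disputed.append(field)
    return consensus, disputed


# Hypothetical outputs from three runs on the "Payment Date" text.
runs = [
    {"invoice_number": "987654", "issue_date": None,
     "company_name": "John Doe", "total_amount": "100 USD"},
    {"invoice_number": "987654", "issue_date": "2023-11-20",
     "company_name": None, "total_amount": "100 USD"},
    {"invoice_number": "987654", "issue_date": None,
     "company_name": "John Doe", "total_amount": "100 USD"},
]

consensus, disputed = compare_extractions(runs)
print(consensus)
print(disputed)  # fields to route to human review
```

In this sample the runs disagree on issue_date and company_name, so those two fields would be the ones sent for review, while invoice_number and total_amount pass through unchallenged.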

The most counterintuitive aspect of LLM extraction is that simply asking for "JSON output" doesn’t make the LLM understand JSON schema constraints any better than asking for comma-separated values. The model is still predicting tokens. The JSON structure is just another pattern it can mimic. Therefore, your explicit schema definitions and validation logic are far more critical than the LLM’s ability to format the output correctly.
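
To make the distinction concrete, here is a minimal sketch of a validator that uses only the Python standard library. It assumes the schema rules from step 1; the helper names (`validate`, `is_iso_date`), the choice to allow hyphens in invoice numbers (to accept values like "INV-12345"), and the way currency symbols are stripped are all illustrative assumptions rather than a definitive implementation.

```python
import json
import re
from datetime import date
from decimal import Decimal, InvalidOperation

# Assumed rules, adapted from the schema in step 1.
INVOICE_RE = re.compile(r"[A-Za-z0-9-]{1,50}")
DATE_RE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_iso_date(value):
    """True only for real calendar dates written as YYYY-MM-DD."""
    if not isinstance(value, str) or not DATE_RE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def validate(record):
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    number = record.get("invoice_number")
    if not isinstance(number, str) or not INVOICE_RE.fullmatch(number):
        problems.append("invoice_number: expected 1-50 alphanumeric characters")

    issued = record.get("issue_date")
    # null is allowed when the text contains no issue date.
    if issued is not None and not is_iso_date(issued):
        problems.append("issue_date: expected YYYY-MM-DD")

    amount = record.get("total_amount")
    # Keep digits and the decimal point so "$500.00" and "100 USD" both pass.
    digits = re.sub(r"[^0-9.]", "", amount) if isinstance(amount, str) else ""
    try:
        Decimal(digits)
    except InvalidOperation:
        problems.append("total_amount: no numeric value found")

    return problems


raw = ('{"invoice_number": "7890", "issue_date": "November 15th, 2023", '
       '"company_name": "Global Solutions", "total_amount": "£750.50"}')
record = json.loads(raw)   # parses without error: the JSON is well formed
print(validate(record))    # ['issue_date: expected YYYY-MM-DD']
```

The record parses cleanly as JSON yet still fails validation, which is the case a format-only check would miss. The returned problem list could then drive either a flag for human review or the correction prompt described in step 3.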

The next hurdle is handling truly novel or extremely noisy data where even humans struggle to extract information consistently.
