LlamaParse can ingest PDFs far more complex than what traditional OCR or simple text extraction can handle, because it leverages a vision-language model to understand the visual layout and context, not just read characters.

Let’s watch LlamaParse chew through a real-world document. Imagine a scanned invoice with tables, handwritten notes, and logos.

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import SummaryExtractor
from llama_index.core.ingestion import IngestionPipeline
from llama_index.llms.openai import OpenAI  # Or your preferred LLM
from llama_parse import LlamaParse  # pip install llama-parse

# Ensure you have your OpenAI API key set as an environment variable
# export OPENAI_API_KEY="sk-..."

# This example assumes 'complex_invoice.pdf' exists in a 'data' directory.
# LlamaParse needs a real PDF: a Document built from a plain string
# (e.g. Document(text="dummy")) never goes through the parser.

# Set up LlamaParse as the PDF parser
parser = LlamaParse(
    api_key="YOUR_LLAMAPARSE_API_KEY",  # Replace with your actual API key (or set LLAMA_CLOUD_API_KEY)
    result_type="markdown",  # "text" is the other common choice
)

# Use SimpleDirectoryReader to find the PDFs and hand each one to LlamaParse
# Make sure the 'data' directory contains 'complex_invoice.pdf'
reader = SimpleDirectoryReader(
    "./data",
    required_exts=[".pdf"],
    file_extractor={".pdf": parser},
)
documents = reader.load_data()

# Set up the pipeline that post-processes the parsed documents
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        SummaryExtractor(llm=OpenAI(model="gpt-3.5-turbo")),
    ],
)

# Run the pipeline
nodes = pipeline.run(documents=documents)

# Inspect the first node (example)
print(f"Number of nodes generated: {len(nodes)}")
print(f"Content of the first node:\n{nodes[0].get_content()}")
print(f"Metadata of the first node:\n{nodes[0].metadata}")

The magic happens in the LlamaParse parser. Instead of just running pdftotext or an OCR engine locally, it sends the PDF to the LlamaParse service. The service can use a multimodal model (such as GPT-4V) to interpret the PDF’s visual elements. It understands that a block of text next to a company logo is likely the sender’s address, that lines separating rows and columns form a table, and that specific formatting indicates header or footer information. It then reconstructs this understanding into structured text, typically Markdown, which is fed into your LlamaIndex pipeline.
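To make the value of structured output concrete, here is a minimal sketch of what downstream code can do once a table arrives as Markdown. The invoice text is made-up sample data, and parse_markdown_table is a hypothetical helper written for this illustration, not part of LlamaParse or LlamaIndex. It only handles simple pipe-delimited tables.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Return the rows of the first pipe-delimited table found in `markdown`."""
    header = None
    rows = []
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        is_table_line = len(line) >= 2 and line.startswith("|") and line.endswith("|")
        if not is_table_line:
            if header is not None:
                break  # the first table has ended
            continue
        # Drop the outer pipes, then split into trimmed cells.
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells
        elif all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # separator row such as |---|---:|
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Made-up Markdown resembling what a parsed invoice might contain.
sample = """\
Invoice 1042

| Item | Qty | Unit Price |
|------|----:|-----------:|
| Widget | 4 | 2.50 |
| Gasket | 10 | 0.75 |

Thank you for your business.
"""

rows = parse_markdown_table(sample)
total = sum(int(row["Qty"]) * float(row["Unit Price"]) for row in rows)
print(rows)
print(total)
```

With plain character extraction the same cells would arrive as an undifferentiated run of words and numbers, and the column each value belongs to would have to be guessed.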

The IngestionPipeline then takes these parsed, structured documents and applies further transformations. SentenceSplitter breaks the text into manageable chunks for an LLM, and SummaryExtractor uses an LLM to generate a concise summary for each chunk, enriching the data.
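The effect of chunk_size and chunk_overlap can be sketched with a simplified sliding window. This is an illustration only: the real SentenceSplitter counts tokens with a tokenizer and tries to respect sentence boundaries, whereas the hypothetical chunk_with_overlap below works on a pre-split list of tokens.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end
    return chunks


tokens = [str(i) for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary is still seen in context by at least one chunk.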

The core problem LlamaParse solves is the loss of information in complex documents. Traditional methods treat a PDF as a sequence of characters, failing to grasp spatial relationships, tables, or visual cues. LlamaParse bridges this gap by treating the PDF as both image and text, which lets it recover meaning that would otherwise be lost.

The key levers you control are:

  • api_key: Your LlamaParse API key, required for authentication. It can also be supplied through the LLAMA_CLOUD_API_KEY environment variable.
  • result_type: The output format. LlamaParse supports "text" and "markdown", and the client exposes JSON output through its get_json_result() method. The default "text" often provides a good balance, "markdown" preserves headings and tables, and JSON can be invaluable for explicitly structured data like tables.
  • language: Specify the language of the document for better OCR accuracy if it’s not English.
  • verbose: Set to True to see detailed progress logs during parsing.
  • webhook_url: For long-running jobs, you can specify a URL where LlamaParse will send a notification upon completion.

When you request JSON output, LlamaParse attempts to extract structured data, particularly tables, into a JSON format. With the llama_parse client, get_json_result() returns this as Python dictionaries organized by page, which you can process directly or serialize into the text of a node. For instance, a table might be represented as a list of dictionaries, where each dictionary is a row and keys are column headers. This allows downstream code and LLM calls to query structured data directly, rather than trying to parse it from free-form text.
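As a rough sketch of why the list-of-dictionaries shape is convenient, the snippet below filters rows from a table held as JSON. The JSON string is made-up sample data in the shape described above, not actual LlamaParse output, and rows_where is a hypothetical helper.

```python
import json
from typing import Any, Callable

# Made-up sample: one dictionary per table row, keyed by column header.
table_json = """
[
  {"Item": "Widget", "Qty": 4, "Unit Price": 2.50},
  {"Item": "Gasket", "Qty": 10, "Unit Price": 0.75},
  {"Item": "Bracket", "Qty": 1, "Unit Price": 12.00}
]
"""

Row = dict[str, Any]


def rows_where(rows: list[Row], predicate: Callable[[Row], bool]) -> list[Row]:
    """Return the rows for which `predicate` is true."""
    return [row for row in rows if predicate(row)]


table = json.loads(table_json)

# Which line items cost more than 5.00 in total?
big_lines = rows_where(table, lambda row: row["Qty"] * row["Unit Price"] > 5.0)
big_items = [row["Item"] for row in big_lines]
print(big_items)
```

The same question asked of free-form text would require the model to first work out which numbers are quantities and which are prices.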

The next step after accurately parsing and chunking complex PDFs is often integrating them into a retrieval-augmented generation (RAG) system for querying.
