LangChain’s document loaders are the bridge between raw, unstructured data and the Document objects that the rest of an LLM pipeline (splitters, embedding models, vector stores) works with.
Let’s see this in action. Imagine you have a PDF report and want to feed it into a retrieval-augmented generation (RAG) system.
from langchain_community.document_loaders import PyPDFLoader
# Load the PDF
loader = PyPDFLoader("path/to/your/report.pdf")
documents = loader.load()
# Each document is a page from the PDF
print(f"Loaded {len(documents)} documents.")
print(f"First document content (first 200 chars): {documents[0].page_content[:200]}...")
print(f"Metadata for first document: {documents[0].metadata}")
This snippet loads a PDF and returns one Document object per page. The Document object is fundamental: it holds page_content (the extracted text) and metadata (information about the source, which for PyPDFLoader includes the file path under source and the zero-based page index under page).
The core problem document loaders solve is data ingestion and initial parsing. LLMs don’t inherently understand how to read a PDF, scrape a website, or parse a JSON file. Loaders abstract away these complexities, providing a unified interface to get data into a format that can then be processed further (e.g., split into chunks, embedded, and stored in a vector database).
Internally, each loader delegates to a format-specific library. PyPDFLoader uses pypdf. WebBaseLoader fetches pages with requests and parses the HTML with BeautifulSoup. CSVLoader uses Python’s built-in csv module. Whatever the input source, every loader returns a list of Document objects, so downstream code never needs to know where the text came from.
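To make that pattern concrete, here is a minimal standard-library sketch of what a CSV-style loader does conceptually. SimpleDocument and load_csv_rows are hypothetical names invented for this illustration, not LangChain classes, and the one-document-per-row layout with source and row metadata keys is an assumption about how such a loader could be organized.

```python
import csv
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv_rows(path: str) -> list[SimpleDocument]:
    """Read a CSV file and return one SimpleDocument per data row."""
    documents = []
    with open(path, newline="", encoding="utf-8") as handle:
        for index, row in enumerate(csv.DictReader(handle)):
            # Render each row as "column: value" lines
            content = "\n".join(f"{key}: {value}" for key, value in row.items())
            documents.append(
                SimpleDocument(
                    page_content=content,
                    metadata={"source": path, "row": index},
                )
            )
    return documents


# Write a small sample file so the sketch runs on its own
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("name,role\nAda,engineer\nGrace,admiral\n")
    sample_path = tmp.name

csv_documents = load_csv_rows(sample_path)
os.remove(sample_path)

print(f"Loaded {len(csv_documents)} documents.")
print(csv_documents[0].page_content)
print(csv_documents[0].metadata)
```

The parsing is done entirely by the csv module; the loader’s own job is only to wrap each parsed record in the shared text-plus-metadata shape.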
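You control the process through the loader’s constructor arguments. For instance, PyPDFLoader accepts a password argument for encrypted PDFs, CSVLoader accepts csv_args (delimiter, quote character, and so on) that it passes through to the csv module, and WebBaseLoader accepts a requests_per_second parameter that throttles its concurrent fetches. Argument names vary between loaders and between releases, so check the signature in the version you have installed.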
The real power comes when you chain these loaders with other LangChain components. You might load a document, then split it into smaller chunks using a TextSplitter, and then embed those chunks.
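# Recent releases ship the splitters in the langchain-text-splitters package;
# older code imports the same class from langchain.text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter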
from langchain_community.document_loaders import WebBaseLoader
# Load content from a URL
loader = WebBaseLoader("https://www.example.com")
web_documents = loader.load()
# Initialize a text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
# Split the loaded documents
all_splits = text_splitter.split_documents(web_documents)
print(f"Split into {len(all_splits)} chunks.")
print(f"First chunk content (first 200 chars): {all_splits[0].page_content[:200]}...")
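This demonstrates how the output of a loader (web_documents) becomes the input for a splitter (text_splitter). With the default length function, chunk_size=1000 caps each chunk at about 1,000 characters, and chunk_overlap=200 lets neighboring chunks share up to 200 characters so that a sentence cut at a boundary still appears whole in one of them.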
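The interaction between chunk size and overlap is easier to see in a stripped-down form. The sketch below is a plain sliding-window splitter written with the standard library only. It is not how RecursiveCharacterTextSplitter works internally (that class prefers to break on separators such as paragraphs and sentences), and SimpleDocument and split_with_overlap are hypothetical names used just for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_with_overlap(documents, chunk_size=1000, chunk_overlap=200):
    """Cut each document into fixed-size windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for document in documents:
        text = document.page_content
        for start in range(0, len(text), step):
            # Copy the metadata so every chunk remembers its origin
            chunks.append(
                SimpleDocument(text[start:start + chunk_size], dict(document.metadata))
            )
            # Stop once this window has reached the end of the text
            if start + chunk_size >= len(text):
                break
    return chunks


page = SimpleDocument("abcdefghij" * 3, {"source": "https://www.example.com"})
toy_chunks = split_with_overlap([page], chunk_size=12, chunk_overlap=4)

for chunk in toy_chunks:
    print(repr(chunk.page_content), chunk.metadata)
```

Each window starts chunk_size minus chunk_overlap characters after the previous one, which is why the tail of one chunk reappears at the head of the next.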
The metadata field on a Document is often overlooked but is very useful for RAG. When you load a document, the loader populates metadata such as the source URL, file path, or page number. Splitters copy that metadata onto every chunk they produce, and vector stores keep it alongside each embedding. Later, when you retrieve relevant chunks, you can filter or sort on it, provided the vector store you use supports metadata filtering. For example, to pull information only from a specific report, or only from a certain range of pages, you restrict the query to chunks whose metadata matches.
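As a rough illustration of that filtering step, the sketch below applies a metadata filter to an in-memory list of chunks. Real vector stores expose their own filter syntax, so treat this only as the underlying idea; SimpleDocument, filter_by_metadata, and the sample file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(chunks, source=None, min_page=None, max_page=None):
    """Keep only chunks whose metadata matches the given source and page range."""
    selected = []
    for chunk in chunks:
        page = chunk.metadata.get("page")
        if source is not None and chunk.metadata.get("source") != source:
            continue
        # Chunks without a page number cannot satisfy a page constraint
        if min_page is not None and (page is None or page < min_page):
            continue
        if max_page is not None and (page is None or page > max_page):
            continue
        selected.append(chunk)
    return selected


retrieved_chunks = [
    SimpleDocument("Revenue grew.", {"source": "report_a.pdf", "page": 0}),
    SimpleDocument("Costs fell.", {"source": "report_a.pdf", "page": 1}),
    SimpleDocument("Appendix.", {"source": "report_a.pdf", "page": 5}),
    SimpleDocument("Other report.", {"source": "report_b.pdf", "page": 1}),
]

matching_chunks = filter_by_metadata(
    retrieved_chunks, source="report_a.pdf", min_page=0, max_page=1
)
for chunk in matching_chunks:
    print(chunk.metadata, chunk.page_content)
```

Filtering inside the vector store is usually preferable to filtering afterwards like this, because the store can then return the top matches among the eligible chunks rather than discarding results it has already ranked.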
Beyond the common formats, LangChain offers loaders for Slack messages, Notion pages, Google Drive documents, and many more. The unstructured library, for instance, powers many of the more complex loaders, capable of extracting text from a wide array of file types including Word documents, PowerPoint presentations, and even images with OCR.
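One shortcut worth knowing is the load_and_split() method, which combines loading and splitting into a single call. If you don’t pass a splitter, it falls back to a RecursiveCharacterTextSplitter with default settings. Recent LangChain releases describe this method as deprecated, so the explicit load-then-split pipeline shown earlier is the safer choice for new code.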
The next concept to explore is how to effectively manage and transform the data after it’s loaded and split, particularly dealing with different document structures and ensuring semantic coherence across chunks.