LangChain’s CSV agent lets you query your spreadsheets using natural language, but the real magic is that it generates Python code on the fly to do the heavy lifting.
Let’s see it in action. Imagine you have a CSV file named sales.csv with the following content:
product,region,sales,date
Laptop,North,1200,2023-01-15
Mouse,South,25,2023-01-16
Keyboard,North,75,2023-01-17
Laptop,West,1500,2023-01-18
Monitor,East,300,2023-01-19
Mouse,North,30,2023-01-20
Laptop,South,1100,2023-01-21
Monitor,West,350,2023-01-22
Keyboard,East,80,2023-01-23
Laptop,North,1300,2023-01-24
You can load this into a LangChain CSV agent and ask questions like:
import pandas as pd
from langchain_experimental.agents.agent_toolkits import create_csv_agent
from langchain_openai import ChatOpenAI
# Load the CSV file
df = pd.read_csv("sales.csv")
# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# Create the CSV agent
agent = create_csv_agent(
llm,
df,
verbose=True,
allow_dangerous_code=True
)
# Ask a question
response = agent.invoke("What is the total sales for Laptops in the North region?")
print(response)
When you run this, you’ll see the agent’s thought process, including the Python code it generated:
> Entering new AgentExecutor chain...
I need to filter the DataFrame to include only rows where the 'product' is 'Laptop' and the 'region' is 'North', and then sum the 'sales' column.
Tool Calls:
```python
print(df[(df['product'] == 'Laptop') & (df['region'] == 'North')]['sales'].sum())
{
"tool_code": "print(df[(df['product'] == 'Laptop') & (df['region'] == 'North')]['sales'].sum())"
}
2500
The total sales for Laptops in the North region is 2500.
Finished chain. {'output': 'The total sales for Laptops in the North region is 2500.'}
The core problem this agent solves is bridging the gap between unstructured natural language queries and structured data residing in CSV files. Traditionally, you'd need to write specific Python code (or SQL if it were a database) to extract and aggregate this information. The CSV agent automates this by:
1. **Understanding the Query:** The LLM interprets your natural language question.
2. **Inferring Schema:** It looks at the CSV's column names and data types to understand what information is available.
3. **Generating Code:** Based on the query and schema, it constructs a Python code snippet using the `pandas` library. This code is designed to directly answer your question.
4. **Executing Code:** LangChain then executes this generated Python code in a secure environment.
5. **Returning Answer:** The result of the code execution is returned as the answer to your question.
The `create_csv_agent` function is the entry point. It takes your LLM of choice (here, `ChatOpenAI` with `gpt-4o`) and the `pandas` DataFrame. The `verbose=True` flag is crucial for debugging as it shows the agent's internal reasoning and the code it generates. `allow_dangerous_code=True` is necessary because the agent is generating and executing arbitrary Python code; in production, you'd want to carefully consider the security implications and potentially use a sandboxed [execution environment](/interviews/lambda/execution-environment-and-sandbox-reuse/).
The agent's power lies in its ability to handle a wide range of `pandas` operations. For instance, you could ask:
* "What is the average sales per region?"
* "Show me the top 3 products by sales."
* "How many sales happened in January 2023?"
Each of these would trigger the generation of different `pandas` code. The LLM is effectively acting as a sophisticated `pandas` query builder. It's not just string matching; it understands the semantic meaning of your request and maps it to the appropriate DataFrame operations.
The agent operates by using a `PandasDataframeTools` object under the hood. This tool provides the LLM with access to `pandas` functions. When the LLM decides to use a tool, it outputs a specific JSON format that LangChain parses to execute the corresponding `pandas` operation. The `create_csv_agent` function orchestrates this process, setting up the agent with the necessary tools and [prompt engineering](/mlops/llmops/prompt-engineering-production-systems/) to guide the LLM's behavior.
One key aspect that often surprises users is the agent's ability to handle multi-step reasoning. If a query requires intermediate calculations or data transformations, the agent can generate a sequence of Python commands. For example, to find the sales per day, then average those daily sales, the agent might first generate code to group by date and sum sales, and then generate code to average the results of that grouping. The prompt engineering within `create_csv_agent` is designed to encourage this step-by-step problem-solving.
The next logical step after querying your CSVs is to start thinking about how to integrate this capability into a larger application, perhaps by building a chatbot interface that can access multiple CSV files or by automating reports based on dynamic queries.