The most surprising thing about using LLMs to query databases is that they don’t actually "understand" SQL in the way a human programmer does. Instead, they’re exceptionally good at pattern matching and predicting the next most likely sequence of characters, which, with the right prompting and tools, can look a lot like understanding.
Let’s see this in action. Imagine we have a simple PostgreSQL database with a products table:
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
category VARCHAR(100),
price DECIMAL(10, 2)
);
INSERT INTO products (name, category, price) VALUES
('Laptop', 'Electronics', 1200.00),
('Keyboard', 'Electronics', 75.00),
('Mouse', 'Electronics', 25.00),
('Desk Chair', 'Furniture', 300.00),
('Monitor', 'Electronics', 300.00);
Now, with LlamaIndex, we can query this database using natural language.
First, we set up our environment:
import os
from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine
from llama_index.llms.openai import OpenAI # Or your preferred LLM
# Assuming you have your database URL and OpenAI API key set as environment variables
# For PostgreSQL: export DATABASE_URL="postgresql://user:password@host:port/dbname"
# For OpenAI: export OPENAI_API_KEY="your_openai_api_key"
db_url = os.environ.get("DATABASE_URL")
if not db_url:
raise ValueError("DATABASE_URL environment variable not set.")
sql_database = SQLDatabase.from_uri(db_url)
Next, we initialize the query engine. The NLSQLTableQueryEngine is designed for this exact purpose. It takes the SQLDatabase object and an LLM.
# Configure your LLM
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
# Create the query engine
query_engine = NLSQLTableQueryEngine(
sql_database=sql_database,
llm=llm,
tables=["products"] # Specify which tables to consider
)
Now, we can ask questions in plain English. LlamaIndex will translate these into SQL queries, execute them against the database, and return the results.
response = query_engine.query("What are the names and prices of all electronics products?")
print(response)
The response object will contain the natural language answer alongside the generated SQL query and the actual data from the database.
Query: What are the names and prices of all electronics products?
SQL Query:
SELECT name, price FROM products WHERE category = 'Electronics'
Result:
(Laptop, 1200.00)
(Keyboard, 75.00)
(Mouse, 25.00)
(Monitor, 300.00)
The electronics products and their prices are:
- Laptop: $1200.00
- Keyboard: $75.00
- Mouse: $25.00
- Monitor: $300.00
This engine works by constructing a detailed prompt for the LLM. This prompt includes the database schema (table names, column names, types, and even sample data or descriptions), a description of the task, and examples of natural language queries mapped to their SQL equivalents. The LLM then uses this context to generate a SQL query that best matches the user’s natural language input.
The NLSQLTableQueryEngine internally uses a NLSQLTableGenerator which is responsible for creating the prompt. This prompt is crucial. It’s not just about sending the schema; it’s about framing the problem for the LLM. The prompt typically looks something like this (simplified):
You are a SQL expert. Given the following SQL table schema:
CREATE TABLE products (
product_id INTEGER NOT NULL,
name VARCHAR(255) NOT NULL,
category VARCHAR(100),
price DECIMAL(10, 2),
PRIMARY KEY (product_id)
);
And the following SQL query:
"What are the names and prices of all electronics products?"
Generate a SQL query that would answer the question.
SQL Query:
The LLM, trained on vast amounts of text and code, recognizes the structure of the schema and the intent of the question, then generates the SELECT name, price FROM products WHERE category = 'Electronics' statement. LlamaIndex then takes this generated SQL, executes it, and formats the output.
A key lever you control is the tables parameter when initializing NLSQLTableQueryEngine. By default, it inspects all tables. If you have many tables, specifying only the relevant ones (tables=["products", "orders"]) significantly speeds up schema retrieval and reduces the LLM’s context window usage, leading to more accurate and faster query generation. Another important aspect is the quality of your database schema’s comments and descriptions; if you’re using a database that supports it, adding comments to tables and columns can provide the LLM with more context to better understand your data.
When dealing with complex queries or joins, the LLM’s ability to generate correct SQL is heavily dependent on the prompt’s quality and the LLM’s capabilities. For instance, asking "Show me the total revenue per category" requires the LLM to infer a join between products and orders (if an orders table existed with product IDs and quantities) and perform aggregation. The prompt engineering here becomes critical, often involving few-shot examples of complex queries.
The next step in querying data with LLMs involves moving beyond single tables to complex relational schemas and even unstructured data integration.