LlamaIndex can retrieve information from both text and images in a single query. It does so by embedding images into a vector space that text queries can be compared against, so image content becomes searchable much like text.
Here’s how it works in practice:
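Let’s say you have a PDF containing both text and images. SimpleDirectoryReader’s default PDF reader extracts only the text, so you export the images as separate files (for example .png or .jpg) into the same directory and load everything into LlamaIndex.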
from llama_index.core import SimpleDirectoryReader
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core.schema import ImageNode
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.embeddings.openai import OpenAIEmbedding
import os

# Assume a 'data' directory that holds 'sample_document.pdf' plus the images
# from that PDF exported as separate image files (e.g. .png or .jpg).
# Ensure you have your OpenAI API key set as an environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

# Load documents: image files become image documents, other files become text documents
reader = SimpleDirectoryReader("data")
documents = reader.load_data()

# Configure multimodal LLM and text embedding model
# (substitute any vision-capable model name your account can access)
multi_modal_llm = OpenAIMultiModal(model="gpt-4-vision-preview")
embed_model = OpenAIEmbedding()

# Create a multimodal index
# Text nodes are embedded with embed_model. Image nodes are embedded with a separate
# image embedding model (CLIP by default, which needs the llama-index-embeddings-clip package).
index = MultiModalVectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Create a query engine
# Recent releases take the multimodal LLM as `llm`; some older releases used `multi_modal_llm`.
query_engine = index.as_query_engine(llm=multi_modal_llm)

# Now, let's query it with a question that requires understanding both text and image content.
# Suppose 'sample_document.pdf' has a paragraph describing a red car, and an image of a blue car.
# A query like "What color is the car?" should ideally point out the discrepancy or
# focus on the most relevant information based on the context of the question.
query = "Describe the vehicle shown in the document."
response = query_engine.query(query)
print(response)

# You can also inspect the source nodes to see what was retrieved
for source in response.source_nodes:
    kind = "image" if isinstance(source.node, ImageNode) else "text"
    print(kind, source.score, source.node.metadata.get("file_path"))
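When you query this engine, LlamaIndex first embeds your query, then searches the index for the most similar nodes. Crucially, this similarity search covers both text nodes and ImageNodes. LlamaIndex’s multimodal index (MultiModalVectorStoreIndex) keeps the two in separate vector stores: text nodes are embedded with the text embedding model, while ImageNodes are embedded by passing the image data through an image embedding model, CLIP by default. Because a CLIP-style model maps text and images into one shared space, the query can be embedded with that same model and compared directly against images. This means an image of a red car will have an embedding that is close to that of the text "a red car."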
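The ranking step can be illustrated with a toy example. This is a minimal sketch, not LlamaIndex's implementation: the `ToyNode` class, the three-dimensional vectors, and the node ids are all made up for illustration, and a single list stands in for the index. Real embeddings come from a model and have hundreds of dimensions.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyNode:
    node_id: str
    kind: str  # "text" or "image"
    embedding: list


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_nodes(query_embedding, nodes, top_k=2):
    """Return the top_k (score, node) pairs, most similar first."""
    scored = [(cosine_similarity(query_embedding, n.embedding), n) for n in nodes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hand-made vectors; the axes loosely mean (car-ness, red-ness, blue-ness).
nodes = [
    ToyNode("img_red_car", "image", [0.9, 0.8, 0.0]),
    ToyNode("img_blue_car", "image", [0.9, 0.0, 0.8]),
    ToyNode("txt_weather", "text", [0.0, 0.1, 0.3]),
]

# Stands in for the embedding of the query text "a red car".
query_embedding = [1.0, 0.9, 0.0]

top = rank_nodes(query_embedding, nodes)
for score, node in top:
    print(f"{node.node_id} ({node.kind}): {score:.3f}")
```

The red-car image ranks first because its vector points in almost the same direction as the query vector, even though one came from an image and the other from text.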
The multi_modal_llm (like gpt-4-vision-preview) is then used to synthesize an answer. When image nodes are among the retrieved results, LlamaIndex passes both the relevant text chunks and the image data itself to the multimodal LLM, so the answer can draw on both.
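Conceptually, the synthesis step splits the retrieved nodes by modality, joins the text chunks into a context string, and attaches the images next to the prompt. The sketch below shows that shape only. `RetrievedNode`, `build_multimodal_request`, the prompt wording, and the file path `data/car_photo.png` are hypothetical and are not LlamaIndex's internal prompt or types.

```python
from dataclasses import dataclass


@dataclass
class RetrievedNode:
    kind: str      # "text" or "image"
    content: str   # chunk text for text nodes, file path for image nodes
    score: float


def build_multimodal_request(question, retrieved):
    """Assemble one request: text chunks go into the prompt, images ride alongside."""
    text_chunks = [n.content for n in retrieved if n.kind == "text"]
    image_paths = [n.content for n in retrieved if n.kind == "image"]
    context = "\n\n".join(text_chunks)
    prompt = (
        "Context information is below.\n"
        f"{context}\n"
        "Using the context and the attached images, answer the question.\n"
        f"Question: {question}"
    )
    return {"prompt": prompt, "images": image_paths}


retrieved = [
    RetrievedNode("text", "The report describes a red car.", 0.82),
    RetrievedNode("image", "data/car_photo.png", 0.79),  # hypothetical path
]
request = build_multimodal_request(
    "Describe the vehicle shown in the document.", retrieved
)
print(request["prompt"])
print(request["images"])
```

Keeping the images out of the prompt string matters: the model receives the pixels, not a description of them, so it can notice that the picture shows a blue car even when the text says red.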
The core problem this solves is information retrieval from heterogeneous data sources where text and images are intermingled. Traditional RAG systems typically only index and retrieve from text. Multimodal RAG extends this by allowing you to ask questions and get answers that draw insights from visual content as well as textual descriptions.
Internally, the image embedding step is key. When the multimodal index encounters an image, it doesn’t just store a reference to the image file. By default it embeds the image directly with an image embedding model such as CLIP; a common alternative is to have a vision model generate a textual description and index that description as text. Either way, the result is an ImageNode that carries the image itself (or a path to it), and its embedding is stored in the vector index, making the visual content searchable using natural language queries.
The levers you control are primarily the choice of text embedding model, image embedding model, and multimodal LLM, along with how many text and image nodes the retriever returns (similarity_top_k and image_similarity_top_k). For instance, if you were dealing with very specific types of images (e.g., medical scans, technical diagrams), you might need specialized models or parsing techniques to extract the most relevant information.
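One of those levers is how many results to keep from each modality. The sketch below shows the idea of separate budgets for text and image hits; the function `select_per_modality` and the sample scores are invented for illustration. In LlamaIndex the corresponding retriever arguments are `similarity_top_k` and `image_similarity_top_k`.

```python
def select_per_modality(scored_nodes, text_top_k=2, image_top_k=1):
    """Keep the best text hits and the best image hits under separate budgets.

    scored_nodes is a list of (score, kind, node_id) tuples.
    """
    ranked = sorted(scored_nodes, key=lambda item: item[0], reverse=True)
    text_hits = [item for item in ranked if item[1] == "text"][:text_top_k]
    image_hits = [item for item in ranked if item[1] == "image"][:image_top_k]
    return text_hits, image_hits


# Made-up similarity scores for a single query.
scored_nodes = [
    (0.91, "text", "chunk_intro"),
    (0.40, "image", "img_logo"),
    (0.88, "text", "chunk_specs"),
    (0.75, "image", "img_car"),
    (0.86, "text", "chunk_history"),
]

text_hits, image_hits = select_per_modality(scored_nodes)
print([node_id for _, _, node_id in text_hits])
print([node_id for _, _, node_id in image_hits])
```

Separate budgets stop one modality from crowding out the other. Here every text chunk outscores the best image, so a single combined top-3 would have returned no image at all.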
What most people don’t realize is that the image embedding is used only for retrieval. Once an image has been retrieved, the multimodal LLM receives the image itself and interprets it in the context of your specific question, so the answer can draw on details the pre-computed embedding never captured.
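Handing the image itself to a model usually means sending its bytes, and vision APIs commonly accept these as a base64 data URL. The helper below is a sketch of that encoding step under that assumption. `to_data_url` is a hypothetical name, the placeholder bytes are only the PNG file signature rather than a real picture, and how LlamaIndex packages images internally is not shown here.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL that can be embedded in a request."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder payload: the 8-byte PNG signature, not a complete image.
placeholder_bytes = b"\x89PNG\r\n\x1a\n"
data_url = to_data_url(placeholder_bytes)
print(data_url[:22])
```

In practice you would read the bytes from the retrieved image file, for example with `open(path, "rb").read()`, before encoding them.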
The next step is to explore how to handle different types of multimodal documents and refine the retrieval process for complex visual queries.