The most surprising thing about multimodal retrieval is that the "meaning" of an image isn’t a fixed property, but rather a function of the query you’re asking.
Let’s see LlamaIndex ColPali in action. Imagine we have a collection of documents, each containing text and an image. We want to retrieve documents not just based on text similarity, but also on the visual content of their images.
Here’s a simplified setup:
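# NOTE: verify that this module exists in your installed LlamaIndex version;
# integration package names change between releases.
from llama_index.vector_stores.colbert import ColbertVectorStore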
from llama_index.core import VectorStoreIndex, StorageContext, Settings
from llama_index.readers.file import PDFReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import ImageNode
import os
# Ensure you have your OpenAI API key set as an environment variable
# os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
# 1. Setup Embeddings and LLM
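# Multimodal retrieval needs a model that can embed images as well as text.
# ColPali does this with a single vision-language model that emits one vector
# per image patch or query token (ColBERT-style late interaction).
# Note: bge-m3 is a multilingual text embedding model; it does not embed images,
# so here it only covers the text side of the pipeline.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")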
Settings.llm = OpenAI(model="gpt-3.5-turbo")
# 2. Load Documents
# In a real pipeline you would iterate over a directory of PDFs.
# For this demo we generate a single dummy PDF, 'example.pdf', with two lines of
# text and one embedded image. This requires reportlab and Pillow.
try:
    from reportlab.lib.units import inch
    from reportlab.lib.utils import ImageReader
    from reportlab.pdfgen import canvas
    from PIL import Image as PILImage

    def create_dummy_pdf(filename="example.pdf"):
        c = canvas.Canvas(filename)
        c.drawString(1 * inch, 10 * inch, "This is a document about cats.")
        c.drawString(1 * inch, 9.5 * inch, "It features a picture of a fluffy feline.")
        # Create a placeholder image (a red square) and draw it below the text.
        img_width, img_height = 100, 100
        img = PILImage.new("RGB", (img_width, img_height), color="red")
        c.drawImage(ImageReader(img), 1 * inch, 7.5 * inch, width=img_width, height=img_height)
        # For real PDFs, a library such as PyMuPDF gives finer control over
        # image extraction and embedding.
        c.save()
        print(f"Dummy PDF '{filename}' created.")

    create_dummy_pdf()
    reader = PDFReader()
    documents = reader.load_data("example.pdf")
except ImportError:
    print("reportlab or Pillow not found. Skipping dummy PDF creation. Please provide your own PDFs with images.")
    # Fallback: assume 'example.pdf' already exists.
    try:
        reader = PDFReader()
        documents = reader.load_data("example.pdf")
    except FileNotFoundError:
        print("example.pdf not found. Please place a PDF with images in the current directory.")
        documents = []
if not documents:
    raise SystemExit("No documents loaded. Exiting.")
# 3. Indexing with ColPali
# Multimodal retrieval needs each image linked to its surrounding text context.
# LlamaIndex's `ImageNode` is designed for this.
# PDFReader extracts text, so it will not normally return `ImageNode`s. A more
# robust pipeline would extract images (for example with PyMuPDF) and wrap them
# in `ImageNode`s during loading or node post-processing.
# Check whether any image nodes are present and warn if not.
has_image_node = any(isinstance(doc, ImageNode) for doc in documents)
if not has_image_node:
    print("No ImageNode detected. For multimodal retrieval, ensure your documents contain ImageNodes.")
    # The demo proceeds, but retrieval will be text-only without image nodes.
# Use a simple SentenceSplitter for text nodes.
node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
# The `ColbertVectorStore` is meant to hold the multi-vector index.
# The constructor arguments below are illustrative; check them against the
# vector store's documentation for your installed version.
vector_store = ColbertVectorStore(
    # Where the index files are stored. ":memory:" keeps the demo in memory;
    # use a directory such as "./colpali_index" to persist the index.
    db_path=":memory:",
    # Embedding model for the text side (bge-m3 here).
    embedding_model=Settings.embed_model,
    # Name of the late-interaction model checkpoint.
    colbert_model_name="duct/colbert-v0.1.0",
    # Dimensionality of the embeddings; bge-m3 dense vectors have 1024 dimensions.
    embedding_dim=1024,
    # Maximum sequence length for the encoder.
    max_seq_len=512,
)
# Create the index. LlamaIndex applies the node parser and writes the resulting
# nodes to the vector store.
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=StorageContext.from_defaults(vector_store=vector_store),
    transformations=[node_parser],
)
# 4. Querying
# Now we can query using text, images, or a combination.
# ColPali allows "interleaved" queries.
# Text query
query_engine_text = index.as_query_engine()
response_text = query_engine_text.query("Tell me about felines.")
print("\n--- Text Query Response ---")
print(response_text)
# Image query
# The snippet creates its own placeholder image, so no external file is needed.
img_path_for_query = "dog_image.png"
try:
    dog_img = PILImage.new("RGB", (60, 30), color="blue")
    dog_img.save(img_path_for_query)
    # Caveat: a text query engine treats the path below as ordinary text and
    # does not open the file. A true image query needs a retriever that
    # accepts image input.
    response_image = query_engine_text.query(
        f"What does this image depict? Image path: {img_path_for_query}"
    )
    print("\n--- Image Query Response ---")
    print(response_image)
except Exception as e:
    print(f"\nAn error occurred during image query: {e}")
finally:
    # Clean up the placeholder image.
    if os.path.exists(img_path_for_query):
        os.remove(img_path_for_query)
# Interleaved query (combining text and image)
# This is where ColPali is meant to shine.
# The query needs a real, recognizable image. Supply 'cat_image.png' yourself,
# since the dummy PDF does not contain a real photograph.
query_image_path = "cat_image.png"  # Replace with a path to a real cat image
if not os.path.exists(query_image_path):
    print("\n--- Interleaved Query Skipped ---")
    print(f"Image file '{query_image_path}' not found. Cannot perform interleaved query effectively.")
else:
    try:
        response_interleaved = query_engine_text.query(
            f"Show me documents related to felines that look like this image: {query_image_path}"
        )
        print("\n--- Interleaved Query Response ---")
        print(response_interleaved)
    except Exception as e:
        print(f"\nAn error occurred during interleaved query: {e}")
# Clean up dummy PDF
if os.path.exists("example.pdf"):
    os.remove("example.pdf")
The core idea is that ColPali does not rely on separate text and image encoders. A single vision-language model embeds each document page as an image, producing one vector per image patch, and embeds a text query as one vector per token. Retrieval then uses ColBERT-style late interaction: each query token vector is matched against its most similar patch vector, and the per-token maxima are summed into a page score. Because the page is encoded as it is rendered, text, figures, tables, and layout all contribute to the same set of vectors. As published, ColPali takes text queries; treating an image path inside the query string as an "interleaved" image query, as in the demo above, depends on extra tooling around the model rather than on ColPali itself.
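One way to score a query against a page is ColBERT-style late interaction (often called MaxSim), the scheme ColPali builds on. The following is a minimal sketch in plain Python with made-up 2-D vectors; the names `dot` and `maxsim_score` are illustrative and are not part of LlamaIndex or ColPali.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query vector, take its best match
    among the document vectors, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy embeddings, invented for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]   # two query token vectors
page_a = [[0.9, 0.1], [0.2, 0.8]]  # patches that match both query tokens
page_b = [[0.9, 0.1], [0.8, 0.2]]  # patches that match only the first token

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a, score_b)
```

Page A scores higher because every query token finds a strong match somewhere on the page, which is the behavior late interaction is designed to reward.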
The "meaning" of an image is determined by how it’s represented in the vector space. ColPali learns to map visual features into this space, and a query retrieves whatever lies nearby. The surprising part is that this mapping is learned from how images relate to text in the training data. An image of a dog will sit close to text describing "dogs," but if the training data frequently shows dogs in parks, the same image may also sit close to "park." A query for "dog in a park" then favors images of dogs in parks, while a query for "dog" alone ranks the same images by different features. The query determines which aspects of the image’s "meaning" are relevant.
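A toy example can illustrate this query dependence. The sketch below uses hand-written 3-D vectors whose axes loosely stand for "dog-ness", "park-ness", and "indoor-ness"; real embeddings have no such readable axes, and the file names and vectors here are invented for the example.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical image embeddings: (dog-ness, park-ness, indoor-ness).
images: Dict[str, List[float]] = {
    "dog_in_park.png": [0.7, 0.7, 0.0],
    "dog_on_sofa.png": [0.8, 0.0, 0.6],
    "empty_park.png": [0.0, 1.0, 0.0],
}

# Hypothetical query embeddings in the same space.
queries: Dict[str, List[float]] = {
    "a dog": [1.0, 0.0, 0.0],
    "a park": [0.0, 1.0, 0.0],
}


def rank(query_vec: List[float]) -> List[str]:
    """Return image names sorted from most to least similar to the query."""
    return sorted(images, key=lambda name: cosine(query_vec, images[name]), reverse=True)


ranking_dog = rank(queries["a dog"])
ranking_park = rank(queries["a park"])
print(ranking_dog)
print(ranking_park)
```

The same three images come back in a different order for each query: the sofa photo leads for "a dog" and comes last for "a park", even though nothing about the images changed.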
The system solves the problem of information retrieval where data exists in multiple modalities (text, images, potentially audio or video later). Traditional methods are often limited to a single modality, whereas ColPali allows a unified search across text and visual content. Internally, the model projects query tokens and image patches into a shared high-dimensional vector space. In the setup above, `ColbertVectorStore` is the component that manages these vectors, typically using a specialized index structure for efficient similarity search.
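To show what "managing vectors in a shared space" can look like, here is a deliberately small brute-force store. It is a sketch under stated assumptions, not how any LlamaIndex vector store is implemented: `Entry`, `ToyVectorStore`, and the sample vectors are hypothetical, and production stores replace the linear scan with an approximate nearest-neighbor index.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    node_id: str
    modality: str  # "text" or "image"
    vector: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


class ToyVectorStore:
    """Brute-force store holding text and image vectors in one space."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def add(self, node_id: str, modality: str, vector: List[float]) -> None:
        self.entries.append(Entry(node_id, modality, vector))

    def search(
        self, query: List[float], top_k: int = 2, modality: Optional[str] = None
    ) -> List[Tuple[str, float]]:
        """Return the top_k (node_id, score) pairs, optionally for one modality."""
        candidates = [e for e in self.entries if modality is None or e.modality == modality]
        scored = [(e.node_id, cosine(query, e.vector)) for e in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = ToyVectorStore()
store.add("cats_text", "text", [0.9, 0.1])
store.add("cat_photo", "image", [0.8, 0.3])
store.add("tax_form_text", "text", [0.1, 0.9])

query_vec = [1.0, 0.0]
top_any = store.search(query_vec, top_k=2)
top_images = store.search(query_vec, top_k=2, modality="image")
print([node_id for node_id, _ in top_any])
print([node_id for node_id, _ in top_images])
```

Because both modalities live in one space, a single query can return a text chunk and an image side by side, and a modality filter narrows the result when only one kind is wanted.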
The exact levers you control are primarily:
- The Embedding Model: The choice of `HuggingFaceEmbedding` (or another compatible model) is crucial. Note that `bge-m3` is a multilingual text model, so it governs the text side only; cross-modal matching requires a model trained on images as well.
- The ColPali Model: `colbert_model_name` specifies the underlying model used for encoding. Different versions or fine-tuned models will yield different retrieval capabilities.
- Query Formulation: How you combine text and image paths in your query string directly influences the retrieval outcome. More descriptive text combined with a relevant image tends to yield better results.
- Document Structure: Images must be correctly parsed and represented as `ImageNode`s within your LlamaIndex `documents`. The reader and node parser must handle image extraction.
A detail few consider is how the max_seq_len parameter interacts with multimodal processing. For images, this often translates to the maximum number of visual tokens or patches the image encoder can process. If an image is too large or too complex, it might be downsampled, cropped, or its features truncated to fit this length, potentially losing fine-grained details relevant to a specific query.
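The arithmetic behind this is easy to check. The sketch below assumes a vision encoder that tiles the image into non-overlapping square patches and simply truncates anything beyond `max_seq_len`; the 14-pixel patch size and the image resolutions are example values, and real models may resize or crop instead of truncating.

```python
def patch_count(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping square patches that tile an image."""
    return (height // patch_size) * (width // patch_size)


def tokens_kept(height: int, width: int, patch_size: int, max_seq_len: int) -> int:
    """Visual tokens that survive if the encoder truncates at max_seq_len."""
    return min(patch_count(height, width, patch_size), max_seq_len)


# Example values only: 14-pixel patches at two resolutions.
n_448 = patch_count(448, 448, 14)  # 32 * 32 patches
n_896 = patch_count(896, 896, 14)  # 64 * 64 patches

kept = tokens_kept(896, 896, 14, max_seq_len=512)
dropped_fraction = 1 - kept / n_896
print(n_448, n_896, kept, dropped_fraction)
```

Under these assumptions, a 896x896 image with a 512-token budget keeps only one patch in eight, which is why fine print or small chart labels can vanish from the representation.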
The next concept to explore is how to fine-tune these multimodal models for domain-specific retrieval.