Gemini’s multimodal capabilities can be leveraged with LlamaIndex to create RAG applications that go beyond simple text retrieval.

Let’s see Gemini and LlamaIndex in action. Imagine we have a collection of documents, including images, and we want to build a RAG system that can answer questions about them, even if the answer requires understanding the content of an image.

First, we need to set up our environment and install the necessary libraries:

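pip install llama-index llama-index-llms-gemini llama-index-multi-modal-llms-gemini llama-index-embeddings-clip python-dotenv Pillow requests
pip install git+https://github.com/openai/CLIP.git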

Next, get a Gemini API key from Google AI Studio and expose it as the GOOGLE_API_KEY environment variable, for example through a .env file that load_dotenv() reads.

import os
from dotenv import load_dotenv
from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.core.schema import ImageDocument
# Multimodal Gemini wrapper; ships in the llama-index-multi-modal-llms-gemini package
from llama_index.multi_modal_llms.gemini import GeminiMultiModal

# Read GOOGLE_API_KEY from a local .env file, if one exists
load_dotenv()
gemini_api_key = os.environ.get("GOOGLE_API_KEY")

# Initialize the multimodal Gemini model.
# gemini-pro-vision is the model this walkthrough was written for; substitute a
# current vision-capable model name if it is no longer served.
llm = GeminiMultiModal(model_name="models/gemini-pro-vision", api_key=gemini_api_key)
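
Because the key is read with os.environ.get, a missing variable silently becomes None and the failure only surfaces at the first API call. A small guard can fail earlier with a clearer message. The sketch below uses a hypothetical helper, require_env, which is not part of LlamaIndex or python-dotenv:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for this walkthrough, not a LlamaIndex API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "add it to your .env file or export it in your shell."
        )
    return value
```

Calling require_env("GOOGLE_API_KEY") after load_dotenv() would then replace the plain os.environ.get lookup.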

Now, let’s prepare our data. We’ll create a few ImageDocument objects. For a real-world scenario, you’d load these from a directory.

import base64
from io import BytesIO

import requests
from PIL import Image

# Example image URLs
image_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Montage_of_some_dogs.jpg/1024px-Montage_of_some_dogs.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Golden_Retriever_puppy.jpg/1024px-Golden_Retriever_puppy.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/2/2f/Culpeper_Animal_Welfare_League_-_Golden_Retriever_puppy_at_adoption_day_%2813600386784%29.jpg/1024px-Culpeper_Animal_Welfare_League_-_Golden_Retriever_puppy_at_adoption_day_%2813600386784%29.jpg"
]

# Some hosts reject requests without a descriptive User-Agent (illustrative value)
headers = {"User-Agent": "llamaindex-gemini-rag-demo/0.1"}

image_documents = []
for url in image_urls:
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()  # Raise an exception for bad status codes
        # Confirm the payload really is an image before indexing it
        Image.open(BytesIO(response.content)).verify()
        # ImageDocument stores image data as a base64 string, not a PIL object
        image_doc = ImageDocument(
            image=base64.b64encode(response.content).decode("utf-8"),
            image_url=url,  # Keep the source URL for reference
        )
        image_documents.append(image_doc)
    except requests.exceptions.RequestException as e:
        print(f"Error fetching image {url}: {e}")
    except (OSError, SyntaxError) as e:
        print(f"Error processing image from {url}: {e}")

# If you have local images, you can load them like this:
# from llama_index.core import SimpleDirectoryReader
# reader = SimpleDirectoryReader("./your_image_directory")
# documents = reader.load_data()
# image_documents = [doc for doc in documents if doc.metadata.get("file_type") in ("image/jpeg", "image/png")]
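
To make the data preparation step concrete, here is a minimal, library-free sketch of turning downloaded bytes into the fields an image document carries. The helper names (sniff_image_mimetype, to_image_fields) are hypothetical, and the dictionary keys mirror the image, image_url, and image_mimetype fields of ImageDocument as we understand them; treat that mapping as an assumption to check against your installed version.

```python
import base64
from typing import Dict, Optional

# Leading "magic" bytes for the two formats used in this walkthrough.
_SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def sniff_image_mimetype(data: bytes) -> Optional[str]:
    """Return a MIME type if the bytes start with a known image signature."""
    for signature, mimetype in _SIGNATURES.items():
        if data.startswith(signature):
            return mimetype
    return None


def to_image_fields(data: bytes, url: str) -> Dict[str, str]:
    """Build a base64 payload plus metadata for one downloaded image."""
    mimetype = sniff_image_mimetype(data)
    if mimetype is None:
        raise ValueError(f"{url} does not look like a JPEG or PNG image")
    return {
        "image": base64.b64encode(data).decode("ascii"),
        "image_url": url,
        "image_mimetype": mimetype,
    }
```

Checking the signature before encoding keeps HTML error pages or truncated downloads out of the index.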

With our image documents ready, we can turn them into nodes. LlamaIndex's built-in node parsers split text, and llama_index.core.node_parser does not provide an image-specific parser. Images are not chunked, so each ImageDocument is indexed as a single node.

# ImageDocument is already an image node type (it derives from ImageNode in the
# llama-index-core releases this walkthrough assumes), so no parsing step is needed.
# No OCR runs here; the image payload stays on the node for the multimodal model to read.
nodes = list(image_documents)

Now, we build our index. For multimodal RAG, we need an index that can store and retrieve image embeddings as well as text embeddings. LlamaIndex provides MultiModalVectorStoreIndex for this; it embeds images with a dedicated image embedding model.

from llama_index.core.indices import MultiModalVectorStoreIndex

# If you don't have an existing index, create a new one.
# The default text embedding model cannot embed images, so the multimodal index
# uses a separate image embedding model (CLIP by default). This assumes the
# llama-index-embeddings-clip package and OpenAI's CLIP are installed.
# Text is still embedded with Settings.embed_model, which defaults to OpenAI
# embeddings and needs OPENAI_API_KEY unless you configure another model.
index = MultiModalVectorStoreIndex(nodes=nodes)

# To reuse an index across runs, persist it and load it back:
# PERSIST_DIR = "./storage"
# if not os.path.exists(PERSIST_DIR):
#     index = MultiModalVectorStoreIndex(nodes=nodes)
#     index.storage_context.persist(persist_dir=PERSIST_DIR)
# else:
#     storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
#     # Load with the same embedding models that built the index;
#     # vectors produced by different models are not comparable.
#     index = load_index_from_storage(storage_context)
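
To make the role of the embeddings concrete, here is a toy, in-memory version of what a vector index does at query time: score every stored vector against the query vector and return the best matches. The three-dimensional vectors and the city_skyline.jpg entry are made up for illustration; a real index stores model-generated embeddings with hundreds of dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k (node_id, score) pairs most similar to the query vector."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings: the first two axes loosely stand for "dog" and "puppy",
# the third for "city". These numbers do not come from any real model.
toy_index = {
    "dog_montage.jpg": [0.9, 0.1, 0.0],
    "golden_puppy.jpg": [0.8, 0.6, 0.0],
    "city_skyline.jpg": [0.0, 0.1, 0.9],
}
```

With a query vector such as [1.0, 0.5, 0.0], the two dog images rank above the skyline, which is the behavior we rely on when a text question is matched against image embeddings.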

We then create a query engine backed by our Gemini model. For each question, the engine retrieves the most relevant nodes, images included, and passes them to the model along with the query text.

# Create a query engine
query_engine = index.as_query_engine(llm=llm)

Now, let’s ask a question that requires understanding an image.

# Example query
query_text = "What breed of dog is prominently featured in these images?"

# To ask a question about a specific image, you can pass it along with the text.
# For this example, we'll query the entire index, which contains multiple images.
# The Gemini Pro Vision model can analyze multiple images in a single prompt.

response = query_engine.query(query_text)

print(response)

The output will likely identify the Golden Retriever breed, demonstrating the model’s ability to process visual information.

The core idea is that LlamaIndex wraps each image in an ImageDocument and indexes it as a node. When you query the index with a multimodal model such as Gemini Pro Vision, the RAG system retrieves the relevant image nodes (and any associated text) and passes them to the model along with your text query, so the answer can draw on both modalities.
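
The retrieve-then-generate flow described above can be sketched without any library. Everything here is hypothetical scaffolding: RetrievedImage, answer_with_images, and the retrieve and generate callables stand in for the index retriever and the multimodal LLM, and are not LlamaIndex APIs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RetrievedImage:
    node_id: str
    image_ref: str  # a path, URL, or base64 payload
    score: float    # similarity between the query and this image


def answer_with_images(
    query: str,
    retrieve: Callable[[str], List[RetrievedImage]],
    generate: Callable[[str, List[str]], str],
    top_k: int = 2,
) -> str:
    """Retrieve candidate images, keep the best top_k, and ask the model."""
    candidates = retrieve(query)
    best = sorted(candidates, key=lambda hit: hit.score, reverse=True)[:top_k]
    # The model receives the text query together with the selected images.
    return generate(query, [hit.image_ref for hit in best])
```

In a real pipeline, retrieve would wrap the index's retriever and generate would wrap a call to the multimodal model; keeping them as plain callables makes each half easy to test on its own.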

Retrieval in a multimodal RAG system is not keyword matching against text. The retriever ranks nodes by embedding similarity, and the nodes it returns can carry raw image data. The LLM then acts as the interpreter: it "sees" the retrieved images and relates their content to your query.

When constructing your index from local image files, make sure your SimpleDirectoryReader actually loads the images. You might need to pass file_extractor, a mapping from file extension to reader, if you’re dealing with less common image formats or want to use a custom OCR engine.
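
Before handing a directory to a reader, it can help to check which files will be treated as images. The sketch below uses only the standard library; find_image_files is a hypothetical helper, and it relies on extension-based MIME guessing rather than inspecting file contents.

```python
import mimetypes
from pathlib import Path
from typing import List


def find_image_files(directory: str) -> List[Path]:
    """Return files under `directory` whose guessed MIME type is image/*."""
    matches = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        mimetype, _ = mimetypes.guess_type(path.name)
        if mimetype is not None and mimetype.startswith("image/"):
            matches.append(path)
    return matches
```

The resulting paths could then be passed to SimpleDirectoryReader through its input_files argument, so that only the files you have vetted are loaded.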

The next challenge you’ll likely encounter is optimizing retrieval for complex multimodal datasets, where balancing text and image relevance becomes critical for accurate responses.
