LLaVA doesn’t just understand images; it can actually reason about them in natural language.
Let’s see LLaVA in action, pulling it all together with Hugging Face’s transformers library.
```python
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import requests

# Load the processor and model
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Load an image (example from the web)
url = "https://www.ilankelman.org/stopsigns/australia-009.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Define the prompt
prompt = "USER: <image>\nWhat is the text on the stop sign?\nASSISTANT:"

# Process the image and prompt
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate a response
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode and print the output
print(processor.decode(outputs[0], skip_special_tokens=True))
```
This code snippet does a few key things:
- Loads Components: It fetches the `processor` (which handles turning images and text into the right format for the model) and the `model` itself from Hugging Face’s model hub. We’re using the `llava-1.5-7b-hf` version, a 7-billion parameter model.
- Gets Image: It downloads an image of a stop sign from the web and opens it with PIL, which handles the common image formats.
- Constructs Prompt: The prompt is crucial. `USER: <image>\nWhat is the text on the stop sign?\nASSISTANT:` tells the model that an image follows and then asks a specific question about it. The `<image>` token is a placeholder that marks where the image features go in the sequence.
- Processes Input: The `processor` takes the raw image and the text prompt and converts them into tensors (numerical representations) that the LLaVA model can understand: token IDs for the text and pixel values for the image. The model itself then embeds the image features alongside the text tokens.
- Generates Output: `model.generate()` feeds these processed inputs to the LLaVA model and asks it to produce a text response. `max_new_tokens=50` limits the length of the generated answer.
- Decodes Result: The `processor.decode()` function converts the model’s numerical output back into human-readable text. Because `generate()` returns the prompt tokens followed by the new tokens, the decoded string contains the prompt as well as the answer.
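Since the prompt template and the decoded output both follow the same `USER: ... ASSISTANT:` pattern, it can help to wrap them in two small helpers. This is a minimal sketch: the names `build_prompt` and `extract_answer` are hypothetical (they are not part of transformers), and the decoded string in the usage lines is an illustrative stand-in, not real model output.

```python
IMAGE_TOKEN = "<image>"  # placeholder used in the prompt above


def build_prompt(question: str) -> str:
    """Wrap a question in the USER/ASSISTANT template from the example above."""
    return f"USER: {IMAGE_TOKEN}\n{question.strip()}\nASSISTANT:"


def extract_answer(decoded: str) -> str:
    """Return only the text after the last 'ASSISTANT:' marker.

    If the marker is missing, the whole string is returned (stripped).
    """
    _, found, answer = decoded.rpartition("ASSISTANT:")
    return answer.strip() if found else decoded.strip()


prompt = build_prompt("What is the text on the stop sign?")
print(repr(prompt))

# Illustrative decoded string (hypothetical, not actual model output).
decoded = "USER:  \nWhat is the text on the stop sign?\nASSISTANT: STOP"
print(extract_answer(decoded))
```

Keeping the template in one place makes it harder to forget the `<image>` placeholder when you vary the question.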
The core problem LLaVA solves is bridging the gap between visual understanding and natural language generation. Traditional models were either good at image recognition (classifying objects) or text generation (writing stories), but not both in a deeply integrated way. LLaVA’s architecture allows it to "see" an image and then "talk" about what it sees, performing tasks like answering questions about the image content, describing it, or even following instructions related to visual elements.
Internally, LLaVA achieves this by combining a vision encoder (CLIP’s ViT) with a large language model (Vicuna, which is built on Llama). The vision encoder processes the image into a sequence of patch features, and a small projection layer maps them into the same embedding space as the LLM’s text tokens, producing "visual tokens." LLaVA’s connection strategy is simple but effective: the visual tokens are placed in the input sequence at the position of the `<image>` placeholder, so the LLM attends to visual and textual information together during generation. Training is done in two stages: pre-training on image-text pairs to align the modalities, followed by instruction tuning on visual instruction-following data, including visual question-answering (VQA) examples, to make the model adept at following visual instructions.
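To make the "project, then merge" idea concrete, here is a toy sketch in plain Python. It is not LLaVA’s actual implementation: the dimensions are tiny, the projection is a single linear map (a simplification), a single placeholder position stands in for the whole image, and `IMAGE_TOKEN_ID`, `project`, and `splice_image_tokens` are hypothetical names chosen for illustration.

```python
from typing import List

Vector = List[float]


def project(features: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Linear projection of each vision feature (size v) to the text embedding size (d).

    `weight` is a v x d matrix given as a list of rows.
    """
    d = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(d)]
        for f in features
    ]


def splice_image_tokens(
    token_ids: List[int],
    text_embeds: List[Vector],
    visual_embeds: List[Vector],
    image_token_id: int,
) -> List[Vector]:
    """Replace the placeholder position with the sequence of visual embeddings."""
    merged: List[Vector] = []
    for tok, emb in zip(token_ids, text_embeds):
        if tok == image_token_id:
            merged.extend(visual_embeds)  # one placeholder becomes many visual tokens
        else:
            merged.append(emb)
    return merged


IMAGE_TOKEN_ID = 99  # hypothetical id for the <image> placeholder

token_ids = [1, IMAGE_TOKEN_ID, 5, 6]                            # toy prompt
text_embeds = [[0.1, 0.1], [0.0, 0.0], [0.5, 0.5], [0.6, 0.6]]   # d = 2
patch_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]              # 2 patches, v = 3
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]                         # 3 x 2 projection

visual_embeds = project(patch_features, W)
merged = splice_image_tokens(token_ids, text_embeds, visual_embeds, IMAGE_TOKEN_ID)
print(len(token_ids), "input positions ->", len(merged), "merged positions")
```

The merged sequence is what the language model would attend over: ordinary text embeddings with the projected image patches sitting where the placeholder was.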
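The `max_new_tokens` parameter is a simple way to control output length. For more nuanced control, look at parameters like `temperature` (controls randomness; lower is more deterministic), `top_p` (nucleus sampling), and `num_beams` (beam search). Note that `temperature` and `top_p` only take effect when sampling is enabled with `do_sample=True`; by default `generate()` decodes greedily, or with beam search when `num_beams` is greater than 1. These settings influence the creativity and coherence of the generated text.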
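To see what temperature and nucleus sampling do to a next-token distribution, here is a small standalone sketch over a toy logit vector. It illustrates the general technique only; the function names are hypothetical, and the library’s own implementation may differ in detail.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token id after temperature scaling and nucleus filtering."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]  # toy scores for a 4-token vocabulary
print("T=1.0:", softmax_with_temperature(logits, 1.0))
print("T=0.5:", softmax_with_temperature(logits, 0.5))  # sharper distribution
print("top_p=0.8 nucleus:", top_p_filter(softmax_with_temperature(logits), 0.8))
print("sampled id:", sample_token(logits, temperature=0.7, top_p=0.9))
```

Lowering the temperature concentrates probability on the top token, while a smaller `top_p` removes the unlikely tail before sampling.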
When you set `return_dict_in_generate=True` in the `generate` call, the output becomes a richer object rather than a plain tensor of token IDs. It always includes `sequences` (the generated token IDs); if you also pass `output_scores=True` and `output_hidden_states=True`, it includes `scores` (the processed next-token scores at each generation step) and `hidden_states`. This allows for more advanced analysis, like estimating how confident the model was in a particular word or using the intermediate states for downstream tasks.
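As an illustration of what per-step scores can tell you, the sketch below turns a list of score vectors into the log-probability of each chosen token. It works on plain Python lists with made-up numbers, so treat it as an explanation of the arithmetic rather than a drop-in for real `generate()` output; transformers also offers a `compute_transition_scores` helper for this purpose. The names `log_softmax` and `chosen_token_log_probs` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one score vector."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def chosen_token_log_probs(step_scores, chosen_ids):
    """Log-probability of the token picked at each generation step.

    step_scores: one list of vocabulary scores per generated step.
    chosen_ids:  the token id selected at each of those steps.
    """
    return [log_softmax(scores)[tok] for scores, tok in zip(step_scores, chosen_ids)]


# Made-up scores for two steps over a 3-token vocabulary.
step_scores = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
chosen_ids = [0, 2]

for tok, lp in zip(chosen_ids, chosen_token_log_probs(step_scores, chosen_ids)):
    print(f"token {tok}: probability {math.exp(lp):.3f}")
```

A probability close to 1 means the model strongly preferred that token at that step; values near uniform suggest it was choosing among several plausible continuations.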
The next step is often to explore fine-tuning LLaVA on your own custom datasets to tailor its visual reasoning capabilities to specific domains or tasks.