LLM vision models don’t "see" images the way humans do. An image enters the model as a grid of numbers representing pixel values, and a vision encoder turns that grid into vectors that the same transformer architecture used for text can process.
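To make the "grid of numbers" idea concrete, here is a minimal sketch in plain Python. The 2x2 image and the `normalize` helper are made up for illustration; real pipelines use image libraries and far larger arrays, and the exact preprocessing varies by model.

```python
# A tiny 2x2 RGB "image": each pixel is a (red, green, blue) triple in 0..255.
# This image is a hypothetical example, not data from a real file.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def normalize(img):
    """Scale every channel value from the 0..255 range into 0.0..1.0."""
    return [[tuple(channel / 255 for channel in px) for px in row] for row in img]


height, width, channels = len(image), len(image[0]), len(image[0][0])
pixels = normalize(image)

print(height, width, channels)  # 2 2 3
print(pixels[0][0])             # (1.0, 0.0, 0.0)
```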

Let’s say we want to build an application that can describe images. We’ll use OpenAI’s GPT-4 Vision model for this. First, we need to set up our environment and get an API key.

import openai
import os

# Make sure to set your OpenAI API key as an environment variable
# export OPENAI_API_KEY='your-api-key'
openai.api_key = os.getenv("OPENAI_API_KEY")

Now, let’s imagine we have an image. For this example, we’ll use a URL to an image of a dog playing fetch.

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/Dog_in_a_park.jpg/1200px-Dog_in_a_park.jpg"

We can then send this image, along with a text prompt, to the GPT-4 Vision model. The model will analyze the image and generate a text description based on our prompt.

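response = openai.chat.completions.create(
    # "gpt-4-vision-preview" is a preview model name and may not be available
    # to your account; if so, substitute a current vision-capable model.
    model="gpt-4-vision-preview",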
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": image_url,
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)

The output might look something like this: "A golden retriever dog is running across a grassy park, with its mouth open and a ball in its mouth. The dog appears to be retrieving the ball. The park is green and there are trees in the background."

This demonstrates how vision models bridge the gap between visual information and natural language understanding. The core idea is that images are first transformed into a sequence of numerical representations (embeddings) that the LLM can process. This transformation is done by a vision encoder, often a Convolutional Neural Network (CNN) or a Vision Transformer (ViT), which learns to extract meaningful features from the image. These features are then concatenated with text embeddings and fed into the LLM’s transformer layers.
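The concatenation step can be sketched with toy data. Everything below is illustrative: the vectors are invented, `EMBED_DIM` and `build_input_sequence` are hypothetical names, and placing image vectors before text vectors is an assumption, since the ordering differs between models.

```python
EMBED_DIM = 4  # toy width; real models use far more dimensions

# Stand-ins for vision-encoder output: one vector per image region.
image_embeddings = [
    [0.1, 0.3, -0.2, 0.5],
    [0.0, 0.7, 0.4, -0.1],
    [0.2, -0.6, 0.9, 0.3],
]

# Stand-ins for the embedded text prompt: one vector per text token.
text_embeddings = [
    [0.5, 0.5, 0.0, -0.3],
    [-0.4, 0.1, 0.8, 0.2],
]


def build_input_sequence(image_vecs, text_vecs):
    """Join image and text vectors into one sequence for the transformer."""
    sequence = image_vecs + text_vecs
    for vec in sequence:
        if len(vec) != EMBED_DIM:
            raise ValueError("all embeddings must share the same width")
    return sequence


sequence = build_input_sequence(image_embeddings, text_embeddings)
print(len(sequence))  # 5
```

The width check is the important part: concatenation along the sequence only works when both kinds of vectors have the same number of dimensions.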

The correlation between image and text happens in the transformer’s attention mechanisms. In a multimodal model, attention operates over both text tokens and image tokens, either because the two are placed in one sequence or, in some architectures, through added cross-attention layers. This lets the model relate parts of the image to parts of the text prompt and vice versa. For instance, when asked "What is the dog doing?", the attention weights can concentrate on the image tokens that cover the dog and its action, linking them to the textual concept of "retrieving."
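A toy version of attention over a mixed sequence might look like the following. This is a sketch of standard scaled dot-product attention for a single query, not the internals of any particular model; the labels, key vectors, and query are invented so that the "dog" query lines up with the dog patch.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# One joint sequence holding image-patch tokens and text tokens (toy values).
labels = ["patch:grass", "patch:dog", "text:What", "text:dog"]
keys = [
    [0.1, 0.0, 0.9, 0.0],
    [1.0, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.9],
    [0.8, 0.9, 0.0, 0.1],
]
query = [1.0, 1.0, 0.0, 0.0]  # hypothetical query vector for the word "dog"

weights = attention_weights(query, keys)
best = max(range(len(weights)), key=weights.__getitem__)
print(labels[best])  # patch:dog
```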

Prompt engineering is crucial here, because the prompt guides the model’s interpretation of the image. For example, instead of "What is happening?", you could ask "Describe the breed of the dog and its environment." or "Is the dog happy?". The model’s ability to answer these questions depends on the richness of the visual features extracted and on its underlying language understanding capabilities.

Consider a more complex scenario: asking the model to compare two images. You would send both images and a prompt like "Which image shows a more active dog? Explain why." The vision encoder would typically encode each image separately, and the LLM would then attend across both sets of image tokens in the context of the prompt. It might note each dog’s posture, speed indicators (like motion blur, if present), and the surrounding environment to make its judgment.
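A two-image request can reuse the message format from the earlier call by adding a second `image_url` part to the same `content` list. The sketch below only builds the payload; `build_comparison_messages` is a hypothetical helper, and the two `example.com` URLs are placeholders rather than real images.

```python
def build_comparison_messages(prompt, image_urls):
    """Build a single user message with one text part and N image parts."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return [{"role": "user", "content": content}]


messages = build_comparison_messages(
    "Which image shows a more active dog? Explain why.",
    [
        "https://example.com/dog_running.jpg",  # placeholder URL
        "https://example.com/dog_resting.jpg",  # placeholder URL
    ],
)

print([part["type"] for part in messages[0]["content"]])
# ['text', 'image_url', 'image_url']
```

The resulting `messages` list can be passed as the `messages` argument of the same `openai.chat.completions.create(...)` call used for the single-image example.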

The key to building effective multimodal applications lies in understanding the interplay between the vision encoder and the LLM. The vision encoder’s job is to create a rich, context-aware representation of the image, and the LLM’s job is to interpret these representations in conjunction with textual prompts. The "tokens" for images aren’t raw pixels; they are patches of an image that have been processed by the vision encoder into a sequence of vectors. These vectors are then treated similarly to word embeddings by the LLM.
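The patch-splitting step that precedes the encoder can be sketched as follows. This is a simplified illustration on a 4x4 grayscale grid: `patchify` is a hypothetical helper, and a real vision encoder would go on to process these flattened patches through its own layers before they reach the LLM.

```python
def patchify(image, patch_size):
    """Split a 2D grid into non-overlapping square patches, each flattened."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches


# Toy 4x4 grayscale image; values are just the flat index of each pixel.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

patches = patchify(image, 2)
print(len(patches))  # 4
print(patches[0])    # [0, 1, 4, 5]

# Same arithmetic at a common ViT-style scale (assumed sizes): a 224x224
# image cut into 16x16 patches yields (224 // 16) ** 2 = 196 patches.
num_patches = (224 // 16) ** 2
```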

One aspect that often surprises developers is how much work the image tokenization and projection layers do. It’s not just a simple flattening of image features. The image features are projected into the same embedding space as the text tokens, often using a learned linear transformation, so that the LLM can combine visual and textual information in a single sequence. Without this shared embedding space, the LLM would struggle to relate the image content to the text prompt.
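The projection itself is just a linear map applied to every image feature vector. In the sketch below the weights are fixed toy numbers rather than learned values, and `project`, `VISION_DIM`, and `TEXT_DIM` are hypothetical names chosen for illustration.

```python
VISION_DIM, TEXT_DIM = 3, 4  # toy sizes; real widths are much larger


def project(features, weight, bias):
    """Apply y = W x + b to each feature vector (W has TEXT_DIM rows)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]


# In a real model these values are learned during training.
weight = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]
bias = [0.0, 0.0, 0.0, 0.1]

image_features = [[0.2, 0.4, 0.6], [1.0, 0.0, -1.0]]  # VISION_DIM wide
projected = project(image_features, weight, bias)     # TEXT_DIM wide

print(len(projected[0]))  # 4
```

After projection, each image vector has the same width as a text embedding, which is what allows the two to sit in one input sequence.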

The next frontier in multimodal LLMs involves not just understanding images but also generating them, or understanding video and audio, leading to truly integrated AI experiences.
