Open-source LLMs are fundamentally more about access than performance, and the performance gap is closing faster than most people realize.

Let’s see what that looks like in practice. Imagine you’re building a sentiment analysis tool. You could use a proprietary API like OpenAI’s gpt-3.5-turbo.

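from openai import OpenAI

# Written for openai>=1.0; the older openai.ChatCompletion.create call was removed in that release
client = OpenAI(api_key="YOUR_API_KEY")

response = client.chat.completions.create(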
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a helpful assistant that analyzes sentiment."},
    {"role": "user", "content": "I love this new coffee shop, the beans are so fresh!"}
  ]
)

print(response.choices[0].message.content)
# Typical output: Positive (the exact wording varies between runs)

This is incredibly easy to get started with: you send a prompt and get a response. But you pay per token, you are subject to rate limits, and the model’s internal workings are a black box.

Now, consider an open-source alternative. Let’s take Llama-2-7b, a popular choice. You’d download the weights and run it locally or on your own infrastructure. This requires more setup, but the control is yours.

Here’s a simplified conceptual example using the transformers library from Hugging Face:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

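# The weights download on first use. This repo is gated, so accept Meta's license on
# Hugging Face and log in first, or point model_name at a local copy of the weights.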
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16, # Use float16 for memory efficiency
    device_map="auto" # Automatically distribute across available GPUs
)

prompt = "I love this new coffee shop, the beans are so fresh!"
messages = [
    {"role": "system", "content": "You are a helpful assistant that analyzes sentiment."},
    {"role": "user", "content": prompt}
]

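# Format for Llama 2 chat. The tokenizer prepends the <s> (BOS) token on its own,
# so the string must not start with one or the prompt ends up with two.
def format_prompt(messages):
    formatted_string = ""
    system_block = ""
    for message in messages:
        if message["role"] == "system":
            # Llama 2 folds the system prompt into the first user turn
            system_block = f"<<SYS>>\n{message['content']}\n<</SYS>>\n\n"
        elif message["role"] == "user":
            formatted_string += f"[INST] {system_block}{message['content']} [/INST]"
            system_block = ""
        else: # Assistant role (for potential conversation history)
            formatted_string += f" {message['content']} </s><s>"
    return formatted_string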

formatted_input = format_prompt(messages)

inputs = tokenizer(formatted_input, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

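# generate() returns the prompt tokens followed by the new tokens, so drop the prompt before decoding
prompt_length = inputs["input_ids"].shape[1]
response_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
# You'd then parse response_text to extract the sentiment, often by looking for keywords or specific output formats.
print(response_text)
# Illustrative result after parsing: Positive (sampling is enabled, so the raw wording varies between runs)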

This code snippet shows the mechanics. You load a tokenizer and a model. The device_map="auto" setting matters for larger models: it places layers across the available GPUs, and spills to CPU memory if needed, so the model fits. It relies on the accelerate package being installed. torch_dtype=torch.float16 is another memory optimization, halving the footprint compared with float32 weights. The format_prompt function is essential because chat-tuned open-source models are trained on a specific prompt template, and they respond noticeably worse when the input deviates from it. You’re not just sending raw text; you’re sending tokens that represent a structured conversation.
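The parsing step mentioned in the code comment can be as simple as a keyword search over the reply. The sketch below is one minimal way to do it, not a robust classifier: the label set and the helper name extract_sentiment are assumptions made for illustration, and it will misread negations such as "not positive".

```python
import re

# Hypothetical label set, chosen for illustration only.
LABELS = ("positive", "negative", "neutral")

def extract_sentiment(response_text):
    """Return the sentiment label that appears first in the reply, or "unknown"."""
    text = response_text.lower()
    best_label, best_pos = "unknown", len(text)
    for label in LABELS:
        # Match the label as a whole word so "positively" does not count.
        match = re.search(rf"\b{label}\b", text)
        if match and match.start() < best_pos:
            best_label, best_pos = label, match.start()
    return best_label

print(extract_sentiment("Positive sentiment detected."))  # positive
```

In practice, asking the model in the system prompt to answer with a single word makes this kind of parsing far more reliable.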

The core problem open-source LLMs solve is the cost and control barrier of proprietary models. For businesses, this means predictable infrastructure costs instead of per-token API fees, which grow in step with usage and can become substantial at high volume. For researchers, it means the ability to inspect, fine-tune, and build upon existing architectures without needing to reverse-engineer or rely on vendor roadmaps. Proprietary models are usually larger and trained on more data, so they tend to lead on general-purpose benchmarks. Even so, the "performance gap" is often overstated for narrow use cases: a smaller open-source model fine-tuned for a specific task can often match or exceed a general-purpose proprietary model on that task.
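To make the cost trade-off concrete, here is a rough sketch comparing per-token billing with a flat monthly infrastructure bill. Every number in it is a hypothetical placeholder, not a real price from any vendor, and it ignores engineering time, idle capacity, and throughput limits.

```python
# Hypothetical figures for illustration only; substitute your own quotes.
API_PRICE_PER_MILLION_TOKENS = 2.0   # dollars per million tokens (assumed)
SELF_HOSTED_MONTHLY_COST = 600.0     # dollars per month for a GPU server (assumed)

def monthly_api_cost(tokens_per_month, price_per_million):
    """Per-token billing: cost grows in direct proportion to usage."""
    return tokens_per_month / 1_000_000 * price_per_million

def break_even_tokens(fixed_monthly_cost, price_per_million):
    """Monthly token volume at which the API bill equals the fixed cost."""
    return fixed_monthly_cost / price_per_million * 1_000_000

for tokens in (10_000_000, 100_000_000, 1_000_000_000):
    api_cost = monthly_api_cost(tokens, API_PRICE_PER_MILLION_TOKENS)
    cheaper = "API" if api_cost < SELF_HOSTED_MONTHLY_COST else "self-hosted"
    print(f"{tokens:>13,} tokens/month: API ${api_cost:,.2f} -> cheaper: {cheaper}")

threshold = break_even_tokens(SELF_HOSTED_MONTHLY_COST, API_PRICE_PER_MILLION_TOKENS)
print(f"Break-even at about {threshold:,.0f} tokens per month")
```

Below the break-even volume the API is the cheaper option; above it, the fixed cost of your own hardware starts to pay off, provided that hardware can actually serve the load.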

What most people miss is how much quantization matters for making these large models practical. Libraries such as bitsandbytes and methods such as GPTQ let you load weights at reduced precision (for example, 8-bit or 4-bit integers instead of 16-bit floats). This shrinks the VRAM requirements dramatically: at 4 bits, the weights take roughly a quarter of the space they need in float16. A model like Llama-2-70b drops from about 140 GB of weights to about 35 GB, which puts it within reach of a workstation with a pair of high-end consumer GPUs or a setup with CPU offloading, usually with a small drop in accuracy that varies by task. Without quantization, even a 13B-parameter model (about 26 GB in float16) exceeds the memory of most consumer GPUs.
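The memory savings follow from simple arithmetic: parameter count times bits per parameter. The sketch below estimates the space taken by the weights alone. It is a lower bound, since it leaves out activations, the KV cache, and the small overhead that quantization formats add for scaling factors; the function name is an assumption made for illustration.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Approximate size of the weights alone, in GB (1 GB = 10^9 bytes)."""
    # (params_billion * 1e9 params) * (bits / 8 bytes per param) / 1e9 bytes per GB
    return params_billion * bits_per_param / 8

for params in (7, 13, 70):
    sizes = ", ".join(
        f"{bits}-bit: {weight_memory_gb(params, bits):.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{params}B parameters -> {sizes}")
```

With the transformers library, 4-bit loading through bitsandbytes is typically requested by passing a BitsAndBytesConfig(load_in_4bit=True) as the quantization_config argument of from_pretrained, which requires a supported GPU and the bitsandbytes package.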

The next hurdle you’ll face is efficiently fine-tuning these models on your own data for domain-specific tasks.
