The cheapest LLM token isn’t always the one with the lowest advertised price.

Let’s see what that looks like in practice. Imagine we’re building a simple chatbot that summarizes user input.

import openai
import anthropic
import google.generativeai as genai

# Configure your API keys (replace with your actual keys)
openai.api_key = "YOUR_OPENAI_API_KEY"
anthropic_api_key = "YOUR_ANTHROPIC_API_KEY"
genai.configure(api_key="YOUR_GEMINI_API_KEY")

def summarize_text(text, model_name):
    if model_name == "openai":
        try:
            # openai>=1.0 removed openai.ChatCompletion; use chat.completions instead
            response = openai.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that summarizes text."},
                    {"role": "user", "content": f"Summarize the following text: {text}"}
                ],
                max_tokens=100,
                temperature=0.7
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"OpenAI Error: {e}"
    elif model_name == "anthropic":
        try:
            client = anthropic.Anthropic(api_key=anthropic_api_key)
            response = client.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=100,
                temperature=0.7,
                system="You are a helpful assistant that summarizes text.",
                messages=[
                    {"role": "user", "content": f"Summarize the following text: {text}"}
                ]
            )
            return response.content[0].text
        except Exception as e:
            return f"Anthropic Error: {e}"
    elif model_name == "gemini":
        try:
            model = genai.GenerativeModel('gemini-1.5-flash-latest')
            response = model.generate_content(f"Summarize the following text: {text}",
                                              generation_config=genai.GenerationConfig(
                                                  max_output_tokens=100,
                                                  temperature=0.7
                                              ),
                                              safety_settings=[
                                                  {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
                                                  {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
                                                  {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
                                                  {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
                                              ])
            return response.text
        except Exception as e:
            return f"Gemini Error: {e}"
    else:
        return "Unknown model"

# Example Usage
user_input = "This is a very long piece of text that needs to be summarized. It discusses the advancements in artificial intelligence, the ethical considerations surrounding its development, and the potential impact on various industries. The author also touches upon the need for robust regulatory frameworks to guide AI's future trajectory. Furthermore, it highlights the importance of interdisciplinary collaboration to address the complex challenges posed by AI. Finally, it concludes with a hopeful outlook on AI's ability to solve some of humanity's most pressing problems, provided it is developed and deployed responsibly."

print("OpenAI (gpt-3.5-turbo) Summary:")
print(summarize_text(user_input, "openai"))
print("\nAnthropic (claude-3-haiku) Summary:")
print(summarize_text(user_input, "anthropic"))
print("\nGemini (gemini-1.5-flash) Summary:")
print(summarize_text(user_input, "gemini"))

This code snippet shows how you might interact with three popular LLM providers: OpenAI (using gpt-3.5-turbo), Anthropic (using claude-3-haiku), and Google (using gemini-1.5-flash). Each call involves sending a prompt (the user’s text plus instructions) and receiving a completion (the summary). The cost, however, isn’t just about the number of tokens in your summary.
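To see what a call was actually billed for, it can help to read the token counts each provider reports back rather than estimating them. The sketch below assumes you keep the raw response object (not just the text that summarize_text returns) and that the usage fields are named as in each SDK at the time of writing. token_usage is a hypothetical helper, and the SimpleNamespace object only stands in for a real response.

```python
from types import SimpleNamespace


def token_usage(provider, response):
    """Return (input_tokens, output_tokens) as reported by the provider."""
    if provider == "openai":
        usage = response.usage
        return usage.prompt_tokens, usage.completion_tokens
    if provider == "anthropic":
        usage = response.usage
        return usage.input_tokens, usage.output_tokens
    if provider == "gemini":
        meta = response.usage_metadata
        return meta.prompt_token_count, meta.candidates_token_count
    raise ValueError(f"Unknown provider: {provider}")


# Stand-in for a real Anthropic response, so the example runs without an API key.
fake_response = SimpleNamespace(
    usage=SimpleNamespace(input_tokens=500, output_tokens=100)
)
print(token_usage("anthropic", fake_response))  # (500, 100)
```

Logging these two numbers per request gives you the real inputs for any cost calculation.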

The true cost of a request depends on both input and output tokens, each billed at its own rate. Providers typically charge different rates for input (prompt) tokens and output (completion) tokens; in the rates below, output tokens cost three to five times as much as input tokens. For instance, if your prompt is 500 tokens and the model generates a 100-token summary, you’re billed for 600 tokens in total, but the 500 and the 100 are priced separately, so the split matters as much as the sum.

Here’s a look at typical pricing (as of early to mid 2024; always check the latest official pricing pages):

  • OpenAI (gpt-3.5-turbo):

    • Input: ~$0.0005 per 1,000 tokens
    • Output: ~$0.0015 per 1,000 tokens
    • For our example (500 input + 100 output): (500/1000 * $0.0005) + (100/1000 * $0.0015) = $0.00025 + $0.00015 = $0.0004.
  • Anthropic (claude-3-haiku):

    • Input: ~$0.00025 per 1,000 tokens
    • Output: ~$0.00125 per 1,000 tokens
    • For our example (500 input + 100 output): (500/1000 * $0.00025) + (100/1000 * $0.00125) = $0.000125 + $0.000125 = $0.00025.
  • Google (gemini-1.5-flash):

    • Input: ~$0.000125 per 1,000 tokens
    • Output: ~$0.000375 per 1,000 tokens
    • For our example (500 input + 100 output): (500/1000 * $0.000125) + (100/1000 * $0.000375) = $0.0000625 + $0.0000375 = $0.0001.
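The arithmetic in the list above can be reproduced with a few lines of Python. This is a minimal sketch: the rates are the approximate figures quoted above, and RATES_PER_1K and request_cost are illustrative names rather than part of any provider SDK.

```python
# Approximate rates from the list above, in USD per 1,000 tokens.
RATES_PER_1K = {
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
    "claude-3-haiku": {"input": 0.00025, "output": 0.00125},
    "gemini-1.5-flash": {"input": 0.000125, "output": 0.000375},
}


def request_cost(model, input_tokens, output_tokens):
    """Cost of one request: input and output tokens are billed at separate rates."""
    rates = RATES_PER_1K[model]
    input_cost = (input_tokens / 1000) * rates["input"]
    output_cost = (output_tokens / 1000) * rates["output"]
    return input_cost + output_cost


# The 500-input, 100-output example from the list above.
for model in RATES_PER_1K:
    print(f"{model}: ${request_cost(model, 500, 100):.6f}")
# gpt-3.5-turbo: $0.000400
# claude-3-haiku: $0.000250
# gemini-1.5-flash: $0.000100
```

Swapping in current rates and your own token counts is usually enough for a first-pass estimate.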

At these rates, gemini-1.5-flash is the cheapest option in this scenario: it has the lowest input and output prices of the three, so the request costs a quarter of what it does on gpt-3.5-turbo and 40% of what it does on claude-3-haiku. claude-3-haiku is also cost-effective, especially on input, where it is half the price of gpt-3.5-turbo. gpt-3.5-turbo, while a capable model, is more expensive per token than either of these newer, lightweight models.

The choice of model isn’t just about raw performance or a single "best" model; it’s about matching the task to the cost structure. If your application involves very long prompts (e.g., RAG systems loading large documents) but short, concise answers, models with cheaper input tokens (like Gemini Flash or Claude Haiku) will offer a substantial cost advantage. Conversely, if your prompts are short and you expect lengthy, detailed outputs, the output token pricing becomes more critical.
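One way to see which rate matters for your workload is to compute the share of each request's cost that comes from input tokens. The sketch below uses the claude-3-haiku rates quoted earlier. The two workload shapes (8,000 in / 200 out for a RAG-style call, 200 in / 2,000 out for a generation-heavy call) are made-up illustrations, and input_share is a hypothetical helper.

```python
def input_share(input_tokens, output_tokens, input_rate_per_1k, output_rate_per_1k):
    """Fraction of a request's total cost that is attributable to input tokens."""
    input_cost = input_tokens / 1000 * input_rate_per_1k
    output_cost = output_tokens / 1000 * output_rate_per_1k
    return input_cost / (input_cost + output_cost)


# claude-3-haiku rates from the pricing list, in USD per 1,000 tokens.
IN_RATE, OUT_RATE = 0.00025, 0.00125

rag_style = input_share(8000, 200, IN_RATE, OUT_RATE)
generation_heavy = input_share(200, 2000, IN_RATE, OUT_RATE)

print(f"RAG-style call: {rag_style:.0%} of cost is input")                # 89%
print(f"Generation-heavy call: {generation_heavy:.0%} of cost is input")  # 2%
```

If the input share is high, compare models on input price first; if it is low, the output price is the one to watch.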

What many people overlook is that the "context window" isn’t just a limit; it’s a primary driver of cost in many RAG-like applications. When you’re processing a document, the entire document (or significant chunks of it) becomes input tokens. A larger context window lets you process more per request, but it also lets your input token count grow by orders of magnitude, making the input price the dominant factor. A model that can handle a 1 million token context window might seem powerful, but at an input price of $0.000125/1k tokens, filling that window costs about $0.125 per request (1,000,000/1000 * $0.000125) before any output is generated. That is more than a thousand times the $0.0001 summarization example above, and 1,000 such requests add up to $125.
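A quick way to check long-context input costs is to run the numbers directly: tokens divided by 1,000, times the per-1k input rate. In this sketch the rate is the gemini-1.5-flash input figure quoted earlier, the request volume of 1,000 per day is a hypothetical assumption, and input_cost is an illustrative helper.

```python
def input_cost(tokens, rate_per_1k):
    """Input-side cost in USD for a prompt of the given size."""
    return tokens / 1000 * rate_per_1k


INPUT_RATE_PER_1K = 0.000125   # USD per 1,000 input tokens, as quoted earlier
FULL_WINDOW_TOKENS = 1_000_000  # a prompt that fills a 1M-token context window
REQUESTS_PER_DAY = 1_000        # hypothetical volume, for illustration only

per_request = input_cost(FULL_WINDOW_TOKENS, INPUT_RATE_PER_1K)
per_day = per_request * REQUESTS_PER_DAY

print(f"Per request: ${per_request:.3f}")  # Per request: $0.125
print(f"Per day:     ${per_day:.2f}")      # Per day:     $125.00
```

The per-request figure looks small, but it scales linearly with both prompt size and request volume.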

Therefore, when comparing LLM costs, always calculate the total cost based on your expected prompt length and completion length, using the provider’s specific input and output token rates.
