Deploying large language models (LLMs) has become a cornerstone of modern AI applications, but the choice between cloud-based APIs and local inference presents a fundamental divergence in architecture, cost, and control.

To make the difference concrete, consider a single task: summarizing a lengthy document.

Cloud API Approach:

You send your document to an API endpoint provided by a cloud vendor like OpenAI, Anthropic, or Google.

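import os

import openai

# Note: openai.ChatCompletion is the interface of openai-python releases before 1.0.
# From 1.0 onward the equivalent call is OpenAI().chat.completions.create(...).
# Read the key from the environment rather than hard-coding it in source.
openai.api_key = os.environ["OPENAI_API_KEY"]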

document = "This is a very long document..." # Your document content here

response = openai.ChatCompletion.create(
  model="gpt-4",
  messages=[
    {"role": "system", "content": "You are a helpful assistant that summarizes text."},
    {"role": "user", "content": f"Please summarize the following document:\n\n{document}"}
  ]
)

summary = response.choices[0].message.content
print(summary)

The vendor’s infrastructure processes the request, runs the LLM, and returns the summarized text. You pay per token used, both for the input and the output.
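Per-token billing is easy to estimate ahead of time. The following is a minimal sketch, assuming input and output tokens are priced separately per 1,000 tokens; the function name and the prices are placeholders for illustration, not any vendor's actual rates.

```python
def estimate_request_cost(input_tokens, output_tokens,
                          input_price_per_1k, output_price_per_1k):
    """Return the cost of one request in dollars under per-token billing."""
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Placeholder prices for illustration only; check your vendor's current price list.
cost = estimate_request_cost(
    input_tokens=8000,        # a long document
    output_tokens=150,        # a short summary
    input_price_per_1k=0.01,
    output_price_per_1k=0.03,
)
print(f"Estimated cost per summary: ${cost:.4f}")
```

For summarization the input side tends to dominate, because the document is far longer than the summary it produces.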

Local Inference Approach:

You download a pre-trained LLM (e.g., Llama 2, Mistral) and run it on your own hardware, either on-premises or on a cloud virtual machine you manage.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

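# Load model and tokenizer (a 7B model: about 14 GB of weights in float16).
# Llama 2 weights are gated on the Hugging Face Hub: request access and log in first.
# device_map="auto" requires the accelerate package to be installed.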
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

document = "This is a very long document..." # Your document content here

prompt = f"Please summarize the following document:\n\n{document}"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate summary
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=150, num_return_sequences=1)

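# generate() returns the prompt tokens followed by the new tokens.
# Slice by token count: slicing the decoded string by len(prompt) can misalign,
# because decoding does not always reproduce the prompt text exactly.
prompt_length = inputs["input_ids"].shape[1]
generated_text = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
print(generated_text)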

You are responsible for the hardware, power, and maintenance. In exchange, there is no per-token fee once the model is running: the cost of each request is the electricity and hardware time it consumes.
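To put a number on what local inference actually costs per token, a rough sketch can amortize the hardware price, add electricity, and divide by throughput. Every figure below (purchase price, service life, power draw, electricity rate, tokens per second, utilization) is an illustrative assumption, not a measurement, and the function name is hypothetical.

```python
def local_cost_per_million_tokens(hardware_cost, amortization_years,
                                  power_watts, electricity_price_per_kwh,
                                  tokens_per_second, utilization):
    """Estimate dollars per million generated tokens on self-managed hardware."""
    hours_per_year = 24 * 365
    # Spread the purchase price evenly over the assumed service life.
    hardware_per_hour = hardware_cost / (amortization_years * hours_per_year)
    # Simplification: the machine draws power_watts around the clock.
    power_per_hour = power_watts / 1000 * electricity_price_per_kwh
    # Tokens produced per hour, given the fraction of time the GPU is busy.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return (hardware_per_hour + power_per_hour) / tokens_per_hour * 1_000_000


# Assumed values for a single consumer GPU workstation.
cost = local_cost_per_million_tokens(
    hardware_cost=2000,
    amortization_years=3,
    power_watts=350,
    electricity_price_per_kwh=0.15,
    tokens_per_second=30,
    utilization=0.5,
)
print(f"Approximate cost per million tokens: ${cost:.2f}")
```

The result is very sensitive to utilization: a GPU that sits idle most of the day still incurs the amortized hardware cost, so low-volume workloads can end up more expensive per token than an API.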

The core problem LLM deployment solves is making the immense computational power and knowledge encoded within these models accessible for practical use. Cloud APIs abstract away the complexity of managing powerful GPUs and scaling infrastructure. Local inference, conversely, offers a path to greater control, potentially lower long-term costs, and enhanced data privacy by keeping sensitive information within your own environment.

Internally, cloud vendors operate large fleets of GPUs, load-balanced and orchestrated to serve many requests concurrently. When you call openai.ChatCompletion.create, the request travels over HTTPS to the vendor, which routes it to an available instance running the specified model. The vendor handles model loading, inference, and response generation, and returns a JSON payload. Your control is limited to selecting the model, adjusting parameters such as temperature and max_tokens, and managing your API keys and billing.
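The request and response are ordinary JSON, which a small offline sketch can illustrate. It assumes the field names used by OpenAI-style chat completion endpoints (model, messages, temperature, max_tokens, choices, usage); other vendors use different schemas. The helper names and the sample response are made up for illustration, and no network call is made.

```python
import json


def build_chat_request(document, model="gpt-4", temperature=0.2, max_tokens=150):
    """Build the JSON body for a chat-completion style summarization request."""
    return {
        "model": model,
        "temperature": temperature,  # lower values give more deterministic output
        "max_tokens": max_tokens,    # upper bound on generated tokens
        "messages": [
            {"role": "system", "content": "You are a helpful assistant that summarizes text."},
            {"role": "user", "content": f"Please summarize the following document:\n\n{document}"},
        ],
    }


def parse_chat_response(raw_json):
    """Extract the summary text and token usage from a response body."""
    payload = json.loads(raw_json)
    summary = payload["choices"][0]["message"]["content"]
    usage = payload.get("usage", {})
    return summary, usage


# Hand-written sample response for illustration; a real response carries more fields.
sample_response = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "A short summary."}}],
    "usage": {"prompt_tokens": 42, "completion_tokens": 5, "total_tokens": 47},
})

request_body = build_chat_request("This is a very long document...")
summary, usage = parse_chat_response(sample_response)
print(json.dumps(request_body, indent=2))
print(summary, usage["total_tokens"])
```

The usage block is what billing is based on, so logging it per request is a simple way to track spend.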

Local inference involves downloading model weights (which can run from a few gigabytes to hundreds of gigabytes) and using libraries like Hugging Face’s transformers or llama.cpp to load and run these models on your chosen hardware. You control the hardware selection (CPU vs. GPU, specific GPU models), the software environment (Python versions, CUDA installations), and the optimization techniques (quantization, parallelization). The device_map="auto" argument in the transformers example relies on the accelerate library to place model layers across the available GPUs, spilling over to CPU memory when VRAM is insufficient, at a substantial cost in speed.
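Whether a model fits on a given GPU can be estimated before downloading anything: weight memory is roughly the parameter count times the bytes per parameter. The sketch below is a back-of-the-envelope check; the 20% headroom for activations and the KV cache is an assumed figure, and the function names are hypothetical.

```python
# Storage size of one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_in_vram(num_params, dtype, vram_gb, overhead_fraction=0.2):
    """Rough check; overhead_fraction is assumed headroom for activations and KV cache."""
    return weight_memory_gb(num_params, dtype) * (1 + overhead_fraction) <= vram_gb


for dtype in BYTES_PER_PARAM:
    size = weight_memory_gb(7e9, dtype)
    print(f"7B model in {dtype}: {size:.1f} GB, fits in 24 GB: {fits_in_vram(7e9, dtype, 24)}")
```

When the estimate exceeds the available VRAM, the options are a lower precision, a smaller model, or offloading part of the model to CPU memory.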

One of the more surprising aspects of local inference is how much performance can be squeezed out of consumer-grade hardware with careful optimization. 4-bit quantization (for example with the bitsandbytes library or the GPTQ method) cuts the memory needed for weights to roughly a quarter of the float16 footprint, so a model that would otherwise need multiple high-end server GPUs can often run on a single consumer GPU, usually with only modest accuracy degradation. This puts powerful LLMs within reach of individual developers and smaller organizations without massive cloud budgets.
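The idea behind quantization can be shown with a toy example. The sketch below uses plain absmax rounding to 4-bit signed integers with a single shared scale. It is a deliberately simplified illustration, not the scheme used by bitsandbytes or GPTQ, which rely on more refined methods to keep the error lower.

```python
def quantize_4bit(values):
    """Map floats to signed integers in [-7, 7] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-7, min(7, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_4bit(quantized, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [q * scale for q in quantized]


# A made-up handful of weights standing in for one row of a weight matrix.
weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.45, 0.02]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("max absolute error:", round(max_error, 4))
```

Each weight is stored in 4 bits instead of 16, and the rounding error per weight is bounded by half the scale. That bound is why a single outlier weight hurts: it inflates the scale for every other value sharing it.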

The next hurdle you’ll face is efficiently batching requests for local inference to maximize GPU utilization.
