Run GPU Workloads on Fly.io Machines for AI Inference (2026)

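Fly.io machines can run GPU workloads for AI inference, but it’s not a simple plug-and-play experience: you have to pick a GPU-enabled machine size, a region that offers it, and a container image whose CUDA libraries work with the provisioned hardware.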

Let’s see it in action. Imagine we want to run a small Llama 2 model for text generation on a Fly.io machine with a GPU.

First, we need a Dockerfile that provides the CUDA runtime libraries and installs PyTorch with GPU support. The NVIDIA kernel driver itself is not part of the image.

# Use a base image with CUDA pre-installed
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

# Install Python and pip
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Install PyTorch with CUDA support
RUN pip3 install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install Flask, which app.py imports
RUN pip3 install --no-cache-dir flask

# Copy your application code
COPY . /app

# Expose the port your application will listen on
EXPOSE 8000

# Command to run your application (e.g., a Flask or FastAPI app)
CMD ["python3", "app.py"]
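
Before wiring up the real application, it can help to confirm that PyTorch inside the container actually sees a GPU. The following is a minimal sketch of such a smoke test; the file name gpu_check.py and the function cuda_report are hypothetical names chosen for illustration, not part of the original setup.

```python
# gpu_check.py (hypothetical file name): report what PyTorch can see in the container.
import json


def cuda_report() -> dict:
    """Collect basic facts about the PyTorch/CUDA setup without raising."""
    report = {
        "torch_installed": False,
        "torch_version": None,
        "built_for_cuda": None,   # CUDA version the PyTorch wheel was built against
        "cuda_available": False,  # True only if a usable GPU and driver are visible
        "device_name": None,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is missing; return the defaults instead of crashing.
        return report

    report["torch_installed"] = True
    report["torch_version"] = str(torch.__version__)
    report["built_for_cuda"] = torch.version.cuda
    report["cuda_available"] = bool(torch.cuda.is_available())
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


print(json.dumps(cuda_report(), indent=2))
```

Running this once in the deployed machine (for example, by temporarily pointing CMD at it) shows whether the problem is the image or the hardware request: if cuda_available is false even though torch is installed, the machine most likely has no GPU attached or the CUDA libraries do not match the driver.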

Next, we need an app.py that loads the model and serves inference requests.

import torch
from flask import Flask, request, jsonify

app = Flask(__name__)

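# Load the model onto the GPU
# Replace with your actual model loading logic.
# On PyTorch 2.6+, torch.load defaults to weights_only=True, which rejects a
# fully pickled model object. weights_only=False unpickles arbitrary code, so
# only use it with checkpoint files you trust.
model = torch.load("your_model.pth", weights_only=False)
model.to("cuda")
model.eval()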

@app.route('/infer', methods=['POST'])
def infer():
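    # Reject requests that are not JSON or that lack the 'input' field
    data = request.get_json(silent=True) or {}
    if 'input' not in data:
        return jsonify({"error": "request body must be JSON with an 'input' field"}), 400
    input_data = data['input']

    # Perform inference
    with torch.no_grad():
        # Create the input tensor directly on the GPU
        input_tensor = torch.tensor(input_data, device="cuda")
        output = model(input_tensor)

    # Move output back to CPU and convert to nested lists for JSON serialization
    output_list = output.cpu().tolist()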

    return jsonify({"output": output_list})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
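
To exercise the /infer route, a small client is enough. The sketch below uses only the Python standard library; the function names build_infer_request and infer, the localhost URL, and the sample input are assumptions made for illustration. The shape of the input depends entirely on what your model expects.

```python
import json
import urllib.request


def build_infer_request(base_url: str, values: list) -> urllib.request.Request:
    """Build a POST request for the /infer route defined in app.py."""
    body = json.dumps({"input": values}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/infer",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def infer(base_url: str, values: list, timeout: float = 30.0) -> list:
    """Send the request and return the "output" field of the JSON response."""
    req = build_infer_request(base_url, values)
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["output"]


# Example call (not executed here); the URL and input are placeholders:
# infer("http://localhost:8000", [[0.1, 0.2, 0.3]])
```

Keeping request construction separate from sending makes it easy to inspect the payload without a running server.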

To deploy this to Fly.io with GPU support, we need a fly.toml that specifies the GPU hardware.

app = "your-gpu-app-name"
primary_region = "ord" # Choose a region that offers the GPU size requested below

[build]
  dockerfile = "Dockerfile"

# Route public HTTP traffic on port 80 to the Flask app listening on port 8000
[[services]]
  internal_port = 8000
  protocol = "tcp"

  [[services.ports]]
    port = 80
    handlers = ["http"]

  [[services.tcp_checks]]
    interval = "20s"
    timeout = "5s"

# This section requests the GPU hardware.
# The size below (an NVIDIA A100 with 40 GB) is an example. GPU sizes and the
# regions that offer them change, so confirm both in Fly.io's current
# documentation before deploying.
[[vm]]
  size = "a100-40gb"

The core problem this solves is running computationally intensive AI models that need GPUs for acceptable inference speeds. CPU-only machines are typically too slow for real-time use with such models. Because Fly.io can provision machines with specific GPU hardware, developers can host these models closer to their users, reducing latency. The [[vm]] section of fly.toml is where you tell Fly.io which GPU to attach to your machine. Without it, you get a standard CPU machine, and the first call that moves a model or tensor to cuda raises an error. The region matters just as much: the requested GPU size must be offered in primary_region, or the machine cannot be created there.

The system works by Fly.io provisioning a virtual machine on its infrastructure with direct access to a physical GPU. When your container starts, the CUDA libraries in your Docker image communicate with the attached GPU through the NVIDIA driver. Your application code, by using PyTorch’s to("cuda") method, directs tensor operations to the GPU for accelerated computation. The crucial part is that the size in the [[vm]] section must name a GPU-enabled machine size that is available in the region you’ve selected. Fly.io’s underlying infrastructure then ensures that the allocated VM has the necessary hardware passthrough configured.
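
Because a missing GPU only shows up when the first tensor is moved to the device, it can be useful to decide on the device explicitly at startup and fail with a readable message. This is a minimal sketch; select_device and its allow_cpu_fallback parameter are hypothetical names, and whether a CPU fallback is acceptable depends on your latency requirements.

```python
def select_device(cuda_available: bool, allow_cpu_fallback: bool = False) -> str:
    """Return the device string to pass to .to(), or fail with a clear message."""
    if cuda_available:
        return "cuda"
    if allow_cpu_fallback:
        return "cpu"
    raise RuntimeError(
        "No CUDA device is visible. Check that fly.toml requests a GPU size "
        "and that the image ships CUDA libraries the driver supports."
    )


# In app.py this could be used as:
#   device = select_device(torch.cuda.is_available())
#   model.to(device)
```

Taking the availability flag as an argument, rather than calling PyTorch inside the function, keeps the decision logic testable on a machine without a GPU.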

The most surprising part is how seamlessly the GPU hardware is abstracted. You don’t manage the physical hardware directly; you select a machine size that represents a GPU configuration. Your Docker image needs to contain CUDA libraries that are compatible with the NVIDIA driver on the machine Fly.io provisions. NVIDIA drivers are backward compatible with older CUDA releases, so the usual failure runs the other way: an image built for CUDA 12.x on a driver that only supports up to CUDA 11.8 will fail to initialize CUDA. The nvidia/cuda base images bundle the CUDA runtime and libraries (not the kernel driver), which makes this manageable, but you still have to pick an image tag that suits the target hardware.
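
The compatibility rule can be written down as a simple version comparison. The sketch below assumes you have read the driver's highest supported CUDA version from the "CUDA Version" field in the nvidia-smi header and know the CUDA version of your image (11.8 for the Dockerfile above). The function names are hypothetical, and the check is deliberately conservative: NVIDIA's minor-version compatibility can allow some combinations within the same major version that this check rejects.

```python
def parse_version(text: str) -> tuple:
    """Turn "11.8" or "12.2.1" into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports_runtime(driver_max_cuda: str, image_cuda: str) -> bool:
    """True if the driver's highest supported CUDA version covers the image's.

    Only major and minor components are compared.
    """
    return parse_version(image_cuda)[:2] <= parse_version(driver_max_cuda)[:2]


# A driver that supports up to CUDA 12.2 can run a CUDA 11.8 image,
# but a driver limited to CUDA 11.4 cannot.
print(driver_supports_runtime("12.2", "11.8"))
print(driver_supports_runtime("11.4", "11.8"))
```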

The next challenge you’ll likely encounter is optimizing your model for inference speed and memory usage on the specific GPU you’ve provisioned.
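
A useful first step is a back-of-the-envelope check of whether the model's weights fit in GPU memory at a given precision. The sketch below is an estimate under stated assumptions: the function names, the round 7-billion parameter count, and the 20% headroom are all illustrative, and the estimate covers weights only, not activations or any key-value cache.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory for the weights alone, in GiB (activations and caches are extra)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def fits(num_params: int, dtype: str, gpu_memory_gib: float, headroom: float = 0.2) -> bool:
    """True if the weights fit while leaving a fraction of memory as headroom."""
    return weight_memory_gib(num_params, dtype) <= gpu_memory_gib * (1 - headroom)


params = 7_000_000_000  # hypothetical round number for a "7B" model
print(round(weight_memory_gib(params, "float16"), 2))  # about 13.04 GiB
print(fits(params, "float16", gpu_memory_gib=40))
print(fits(params, "float32", gpu_memory_gib=24))
```

Halving the precision halves the weight memory, which is often the difference between needing a larger GPU size and fitting on a smaller one.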