Stable Diffusion can run inference on a single consumer-grade GPU costing under $500, making high-quality image generation widely accessible.
Let’s see it in action. We’ll use the diffusers library from Hugging Face, which provides a clean Python API for various diffusion models.
```python
import torch
from diffusers import StableDiffusionPipeline

# Use the GPU with half precision when available. Fall back to CPU with
# float32, because float16 inference is poorly supported on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
if device == "cpu":
    print("CUDA not available, running on CPU. This will be very slow.")

# Load the pipeline. The first run downloads the model weights (several GB),
# which can take a while; later runs reuse the local cache.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
)
pipe = pipe.to(device)

# Define your prompt
prompt = "a photograph of an astronaut riding a horse on the moon"

# Generate the image. Seeding a generator on the same device makes the
# result reproducible if you want to get the same image again.
generator = torch.Generator(device).manual_seed(42)
image = pipe(prompt, generator=generator).images[0]

# Save the image
image.save("astronaut_horse.png")
```
This code snippet demonstrates the core of Stable Diffusion inference. The StableDiffusionPipeline class abstracts away most of the complexity. When you call from_pretrained, it fetches a pre-trained model checkpoint and caches it locally. Stable Diffusion is a latent diffusion model: rather than generating images directly in pixel space, it operates in a compressed "latent space" and then decodes the latent representation into a full image. Because the latent tensor is much smaller than the image, the process is much more computationally efficient.
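To make the size difference concrete, here is a small arithmetic sketch. It assumes the Stable Diffusion v1.x layout, where the VAE downsamples each spatial dimension by a factor of 8 and the latent has 4 channels. The helper names latent_shape and compression_ratio are made up for illustration and are not part of diffusers.

```python
def latent_shape(height, width, downscale=8, latent_channels=4):
    """Return the (channels, height, width) of the latent for a given image size.

    Assumes the SD v1.x VAE: 8x spatial downsampling, 4 latent channels.
    """
    if height % downscale or width % downscale:
        raise ValueError("image dimensions must be multiples of the downscale factor")
    return (latent_channels, height // downscale, width // downscale)


def compression_ratio(height, width, image_channels=3):
    """How many times fewer values the latent holds compared with the RGB image."""
    c, h, w = latent_shape(height, width)
    return (image_channels * height * width) / (c * h * w)


print(latent_shape(512, 512))       # latent dimensions for a 512x512 image
print(compression_ratio(512, 512))  # pixel values per latent value
```

For a 512x512 RGB image, the denoising loop works on a 4x64x64 tensor, which holds 48 times fewer values than the image itself.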
The pipeline handles several steps internally:
- Text Encoding: Your text prompt is converted into a numerical representation (an embedding) using a text encoder, typically a CLIP model. This embedding captures the semantic meaning of your prompt.
- Noise Generation: A random noise tensor is generated in the latent space. This noise is the starting point for the diffusion process.
- Denoising Loop: The core of the process. The UNet model iteratively "denoises" the latent tensor. In each step, it predicts the noise present in the latent, guided by the text embedding, and a scheduler removes part of that noise. This loop runs for a specified number of inference steps (the num_inference_steps argument, 50 by default in this pipeline). More steps generally improve quality, with diminishing returns, but take longer.
- Image Decoding: Once the denoising is complete, the final latent representation is passed through a VAE (Variational Autoencoder) decoder to transform it back into the pixel space, producing the final image.
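The four stages above can be sketched as a toy control-flow example. This is purely illustrative: encode_text, predict_noise, and decode are hypothetical stand-ins written in plain Python, and they do not reproduce the real CLIP, UNet, scheduler, or VAE math. The point is only the order of operations.

```python
import random


def encode_text(prompt):
    # Stand-in for the text encoder: map the prompt to a tiny fixed-length vector.
    total = sum(ord(ch) for ch in prompt)
    return [(total * (i + 1) % 97) / 97.0 for i in range(4)]


def predict_noise(latent, embedding):
    # Stand-in for the UNet: treat the embedding as the "clean" target
    # and report how far the current latent is from it.
    return [x - e for x, e in zip(latent, embedding)]


def decode(latent):
    # Stand-in for the VAE decoder: map each latent value to an 8-bit "pixel".
    return [max(0, min(255, round(x * 255))) for x in latent]


def toy_pipeline(prompt, num_inference_steps=50, seed=42):
    embedding = encode_text(prompt)                     # 1. text encoding
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in embedding]   # 2. noise generation
    for step in range(num_inference_steps):             # 3. denoising loop
        noise = predict_noise(latent, embedding)
        remaining = num_inference_steps - step
        # Remove an equal share of the remaining noise at every step.
        latent = [x - n / remaining for x, n in zip(latent, noise)]
    return decode(latent)                               # 4. image decoding


print(toy_pipeline("a photograph of an astronaut riding a horse on the moon"))
```

Unlike the real model, this toy always lands on the same target regardless of the seed, because its stand-in noise predictor is a perfect oracle. In Stable Diffusion, the starting noise shapes the final image.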
You control the output primarily through the prompt. Experiment with different keywords, styles (e.g., "cinematic lighting," "digital art," "oil painting"), and negative prompts (using the negative_prompt argument in the pipe call) to steer the generation. The guidance_scale parameter (e.g., guidance_scale=7.5) controls how strongly the generation should adhere to the prompt. Higher values mean stronger adherence, but can sometimes lead to artifacts.
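To see what guidance_scale does numerically, here is a minimal sketch of classifier-free guidance, the technique Stable Diffusion pipelines use to combine two noise predictions at each step: one conditioned on the prompt and one on an empty (or negative) prompt. The function name apply_guidance and the plain-list representation are assumptions made for illustration; the real pipeline applies the same formula to tensors.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale=7.5):
    """Combine unconditional and text-conditioned noise predictions.

    Implements: uncond + guidance_scale * (text - uncond), element by element.
    A scale of 1.0 returns the text-conditioned prediction unchanged;
    larger values push further in the direction the prompt suggests.
    """
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


noise_uncond = [0.0, 1.0]  # prediction for the empty or negative prompt
noise_text = [1.0, 3.0]    # prediction for the actual prompt

print(apply_guidance(noise_uncond, noise_text, guidance_scale=1.0))
print(apply_guidance(noise_uncond, noise_text, guidance_scale=7.5))
```

A negative prompt takes the place of the empty prompt in the unconditional branch, so the result is pushed away from whatever it describes. In the pipeline these options are passed directly to the call, for example pipe(prompt, negative_prompt="blurry, low quality", guidance_scale=7.5).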
A key detail often overlooked is the torch_dtype argument. Using torch.float16 (half precision) instead of torch.float32 (single precision) roughly halves the VRAM needed for the model weights and significantly speeds up inference on modern GPUs, usually with minimal loss in image quality. This is why it’s specified in the from_pretrained call. Without it, the weights load as float32, which might exceed the memory of many consumer GPUs. Half precision is intended for GPU inference; on CPU, float32 is the safer choice.
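A back-of-the-envelope calculation shows where the savings come from. The parameter counts below are rounded, approximate figures for the Stable Diffusion v1.x components and should be treated as assumptions. The estimate covers the weights only, not activations or framework overhead, so real VRAM usage will be higher.

```python
# Approximate, rounded parameter counts for SD v1.x (assumed for illustration).
PARAMS = {
    "unet": 860e6,
    "text_encoder": 123e6,
    "vae": 84e6,
}

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(dtype):
    """Estimate the memory taken by the model weights alone, in GiB."""
    total_params = sum(PARAMS.values())
    return total_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gib(dtype):.2f} GiB")
```

Under these assumptions the weights take roughly 4 GiB in float32 and roughly 2 GiB in float16, which is the difference between fitting comfortably on a smaller consumer card and not.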
The generator with manual_seed is crucial for reproducibility. To generate the same image again, use the same seed and the same parameters (prompt, number of steps, guidance scale, image size), ideally on the same hardware and library versions, since results can differ slightly across devices. Without a seeded generator, each run produces a different image because the initial noise is random.
The next step is often exploring how to fine-tune these models for your specific needs or integrating them into more complex workflows.