Optuna vs Ray Tune vs Hyperband: Hyperparameter Tuning

Hyperparameter tuning isn’t about finding the best possible values. It’s about finding values that are good enough for your specific training run, given your time and resource constraints.

Imagine you’ve got a PyTorch model and you’re trying to train it on a dataset of cat and dog images. You’ve got a DataLoader set up, and your model is a simple nn.Module. You’re ready to start tuning learning_rate and weight_decay.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import optuna

# Dummy data and model
X = torch.randn(1000, 3, 224, 224)
y = torch.randint(0, 2, (1000,))
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=32)

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(16 * 112 * 112, 2)

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.flatten(x)
        x = self.fc(x)
        return x

criterion = nn.CrossEntropyLoss()

def objective(trial):
    # Define the search space (log scale, because both ranges span several orders of magnitude)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)

    # Build a fresh model for every trial so that weights learned in one
    # trial do not leak into the next and bias the comparison
    model = SimpleCNN()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=weight_decay)

    # Dummy training loop (replace with your actual training logic)
    model.train()
    for epoch in range(2):  # Train for a few epochs for demonstration
        for inputs, labels in dataloader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    # In a real scenario, you'd evaluate on a validation set and return a metric
    # For this example, we'll just return the loss of the last batch (a noisy proxy)
    return loss.item()

# Create a study object and run the optimization
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=10)

print("Best hyperparameters:", study.best_params)
print("Best value:", study.best_value)

This script sets up a basic PyTorch model and uses Optuna to explore different combinations of learning_rate and weight_decay. For each trial, Optuna suggests a learning_rate and a weight_decay from the ranges you defined. The objective function then creates an Adam optimizer with those values, runs a short training loop, and returns the loss of the final batch. Optuna records these losses and uses them to look for the hyperparameters that lead to the lowest loss.

The core problem hyperparameter tuning solves is the combinatorial explosion of possibilities. If you have just 5 hyperparameters, and you want to test 10 values for each, that’s 10^5 = 100,000 combinations. Trying them all by hand is impractical. Automated tuning searches this space more economically, and the common algorithms do so in different ways. Random Search samples configurations independently. Bayesian Optimization builds a probabilistic model of the objective function and uses it to decide which hyperparameters to try next, balancing exploration (trying new, uncertain areas) with exploitation (sampling near the current best). Hyperband takes a different angle: it allocates a small training budget to many configurations and stops the unpromising ones early, so more of the budget goes to the trials that are doing well.
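To make the arithmetic concrete, here is a minimal sketch that counts the grid from the example above and then runs a plain random search over a two-parameter space. The `toy_loss` function and its minimum near `lr=1e-3`, `wd=1e-4` are assumptions made up for illustration; they stand in for a real training run and are not taken from any measurement.

```python
import math
import random

# Size of an exhaustive grid: 5 hyperparameters, 10 candidate values each.
n_hyperparameters = 5
values_per_hyperparameter = 10
grid_size = values_per_hyperparameter ** n_hyperparameters
print(grid_size)  # 100000

def toy_loss(lr, wd):
    """Hypothetical stand-in for a training run (minimum at lr=1e-3, wd=1e-4)."""
    return (math.log10(lr) + 3) ** 2 + (math.log10(wd) + 4) ** 2

def random_search(n_trials, seed=0):
    """Sample both hyperparameters log-uniformly and keep the best result."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-5, -1)   # log-uniform in [1e-5, 1e-1]
        wd = 10 ** rng.uniform(-6, -2)   # log-uniform in [1e-6, 1e-2]
        loss = toy_loss(lr, wd)
        if best is None or loss < best[0]:
            best = (loss, lr, wd)
    return best

best_loss, best_lr, best_wd = random_search(n_trials=50)
print(f"best loss {best_loss:.3f} at lr={best_lr:.2e}, wd={best_wd:.2e}")
```

Fifty random samples are a tiny fraction of the 100,000-point grid, which is the budget trade-off the paragraph above describes.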

The trial object in Optuna is your interface to the tuning process. You define the search space by calling trial.suggest_float(), trial.suggest_int(), and trial.suggest_categorical() inside the objective function. Optuna calls your objective function repeatedly, passing in a new trial object each time, and each suggest call returns the value chosen for that trial. For a single-objective study like this one, your objective function must return a single number (the metric you want to optimize, such as validation accuracy or loss).

A common surprise is how often the best learning rate turns up at one of the extreme ends of the search space, which usually means the range was too narrow and should be widened. Another is that a slightly suboptimal learning rate can be dramatically improved by a small change in weight decay. The loss surface over hyperparameters is rarely smooth; you are looking for a "sweet spot" where the optimizer can make meaningful progress without diverging or getting stuck in shallow local minima. Hyperparameters often interact non-linearly, so a good learning rate might be terrible with the wrong weight decay, and vice versa. That is a reason to tune them jointly rather than one at a time.
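One quick check after a study is whether the best value sits close to an edge of its range. The helper below is a hypothetical sketch, not part of Optuna: `near_log_boundary` and its 5% tolerance are assumptions, and the `best_params` values are invented for the example rather than real results.

```python
import math

def near_log_boundary(value, low, high, tolerance=0.05):
    """Return True if value lies within `tolerance` (as a fraction of the
    log-scaled range) of either bound."""
    position = (math.log10(value) - math.log10(low)) / (
        math.log10(high) - math.log10(low)
    )
    return position <= tolerance or position >= 1.0 - tolerance

# Ranges from the script above
search_space = {
    "learning_rate": (1e-5, 1e-1),
    "weight_decay": (1e-6, 1e-2),
}

# Invented example values; in practice, use study.best_params
best_params = {"learning_rate": 9.1e-2, "weight_decay": 3e-4}

flags = {
    name: near_log_boundary(best_params[name], low, high)
    for name, (low, high) in search_space.items()
}
print(flags)  # {'learning_rate': True, 'weight_decay': False}
```

A flagged parameter suggests rerunning the study with that range extended past the bound.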

The study.optimize() method orchestrates the whole show. n_trials is the number of times your objective function will be called. The direction argument, which is passed to optuna.create_study() rather than to optimize(), tells Optuna whether you want to minimize (e.g., loss) or maximize (e.g., accuracy) the returned value. Under the hood, Optuna manages the state of each trial, records its result, and uses this history to inform its next suggestion, so later trials tend to concentrate on regions that have already produced good results.

What most people don’t realize is that the order in which trials are suggested and evaluated can affect the final result, especially when your objective function uses pruning or early stopping. A pruner typically judges a trial by comparing its intermediate results against those of earlier trials, so the same configuration can be kept or stopped depending on what ran before it. If a trial is pruned because it performs poorly after just a few epochs, it is discarded before more training could show whether it would have recovered. Slow-starting configurations, such as those with a low learning rate, are the most exposed. This is why understanding the interaction between your training dynamics and the tuning algorithm’s strategy is crucial for effective large-scale tuning.

The next challenge is implementing distributed hyperparameter tuning, where multiple machines or GPUs work in parallel to accelerate the search process.