MLflow + OpenAI: Track and Evaluate LLM Applications (2026)

OpenAI’s API is a stateless service: every response is ephemeral unless you record it. The value comes from attaching a stateful, auditable history to those responses.

Let’s see what happens when we actually use this. Imagine you’ve got a simple Python script that takes a user query, sends it to OpenAI’s gpt-4-turbo model, and prints the response.

import os

import openai

# Read the key from the environment rather than hardcoding it
openai.api_key = os.environ["OPENAI_API_KEY"]

def ask_llm(query):
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content

user_input = "What is the capital of France?"
llm_output = ask_llm(user_input)
print(f"User: {user_input}")
print(f"LLM: {llm_output}")

Now, how do we make this trackable? We wrap the core LLM interaction with MLflow’s autologging. MLflow will automatically capture the model, prompt, and response for every call.

import os

import mlflow
import openai

# Read the key from the environment rather than hardcoding it
openai.api_key = os.environ["OPENAI_API_KEY"]

# Enable OpenAI autologging before making any calls
mlflow.openai.autolog()

def ask_llm(query):
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content

# The context manager ends the run even if the API call raises
with mlflow.start_run():
    user_input = "What is the capital of France?"
    llm_output = ask_llm(user_input)
    print(f"User: {user_input}")
    print(f"LLM: {llm_output}")

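When you run this, MLflow records a run. Start the MLflow UI (run mlflow ui in your terminal) and you’ll find it under the active experiment, which is the Default experiment unless you set another one. Clicking into the run shows what autologging captured: the full request and response for each call, including arguments such as model and temperature when you pass them, along with how long the call took. Exactly where this appears depends on your MLflow version: recent releases record each call as a trace, so check the Traces view as well as the run’s artifacts.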

The problem MLflow solves here is the "black box" nature of LLM calls. Without tracking, each API call is an isolated event. You don’t know which prompt produced which response, what parameters were used, or how long the call took. That makes debugging, reproducing results, and comparing configurations difficult. MLflow adds the missing observability.

Internally, mlflow.openai.autolog() works by patching the OpenAI client’s chat completions create method. Before the original method is called, the patch records the input arguments (model, messages, temperature, and so on). After the method returns, the patch captures the response, measures the elapsed time, and attaches everything to the active MLflow run. The precise storage format (trace spans, parameters, or artifacts) varies between MLflow versions.
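To make the patching idea concrete, here is a minimal sketch of the general technique. It is not MLflow’s actual implementation: FakeCompletions is a hypothetical stand-in for the OpenAI client class, and patch_with_logging and call_log are names invented for this illustration.

```python
import functools
import time


class FakeCompletions:
    """Hypothetical stand-in for an API client; not the real OpenAI class."""

    def create(self, model, messages, **kwargs):
        return {"model": model, "content": f"echo: {messages[-1]['content']}"}


def patch_with_logging(cls, method_name, log):
    """Replace cls.method_name with a wrapper that records each call in `log`."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        result = original(self, *args, **kwargs)
        # Only keyword arguments are recorded in this sketch
        log.append({
            "inputs": kwargs,
            "output": result,
            "latency_s": time.perf_counter() - start,
        })
        return result

    setattr(cls, method_name, wrapper)


call_log = []
patch_with_logging(FakeCompletions, "create", call_log)

client = FakeCompletions()
reply = client.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    temperature=0.7,
)
```

The calling code is unchanged by the patch, which is why autologging needs only a single line to enable: the wrapper sees every argument and the full return value without the caller cooperating.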

The key levers you control are the standard OpenAI API parameters. When you set temperature=0.7 or max_tokens=150 in your openai.chat.completions.create call, those values are captured along with the rest of the request. By varying them across runs and comparing the logged metrics (such as response quality, if you log it yourself) and the recorded responses, you can tune these settings for your LLM application much as you would tune hyperparameters.
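One way to organize such a comparison is to give each parameter combination its own run. The sketch below makes some assumptions: config_grid and run_sweep are hypothetical helpers, and ask_fn is assumed to be a variant of ask_llm that forwards extra keyword arguments to the create call. Parameters are logged explicitly with mlflow.log_params so they appear as sortable columns regardless of how autologging stores them.

```python
import itertools


def config_grid(**options):
    """Return one dict per combination of the given parameter values."""
    names = list(options)
    return [
        dict(zip(names, values))
        for values in itertools.product(*options.values())
    ]


def run_sweep(ask_fn, query, configs):
    """Call ask_fn once per config, each inside its own MLflow run."""
    import mlflow  # imported lazily so config_grid works without MLflow installed

    for config in configs:
        with mlflow.start_run():
            mlflow.log_params(config)  # explicit, so values are sortable in the UI
            answer = ask_fn(query, **config)
            mlflow.log_text(answer, "response.txt")  # keep the raw response


# Two temperatures times two token limits gives four configurations
configs = config_grid(temperature=[0.0, 0.7], max_tokens=[150, 300])
```

Calling run_sweep(ask_fn, "What is the capital of France?", configs) would then produce four runs that can be compared in the UI.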

Beyond basic logging, MLflow provides tools for evaluating LLM outputs. You can log custom metrics during your run, such as a "quality score" based on human feedback or another LLM. For instance, after getting a response, you could have a separate function evaluate its helpfulness and log it:

# ... inside your run
llm_output = ask_llm(user_input)
quality = evaluate_response_quality(llm_output) # Assume this function exists
mlflow.log_metric("response_quality", quality)

This allows you to compare runs directly on metrics that matter to your application, not just latency or token count. You can then use MLflow’s UI to sort and filter runs by response_quality to find the best-performing configurations.
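The snippet above leaves evaluate_response_quality undefined. As a purely illustrative placeholder, the sketch below scores a response by the fraction of expected keywords it contains. The second parameter, expected_keywords, is an assumption added for this example; a real evaluator would more likely rely on human ratings or an LLM judge.

```python
def evaluate_response_quality(response, expected_keywords):
    """Return the fraction of expected keywords found in the response (0.0 to 1.0)."""
    if not expected_keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for keyword in expected_keywords if keyword.lower() in text)
    return hits / len(expected_keywords)


quality = evaluate_response_quality(
    "The capital of France is Paris.", expected_keywords=["Paris"]
)
```

The returned value can be passed straight to mlflow.log_metric("response_quality", quality) inside a run.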

What stands out is how much of the "LLM Ops" puzzle MLflow’s autologging solves out of the box for OpenAI. It goes beyond logging parameters: it captures the entire interaction payload, which is critical for debugging and for understanding why an LLM behaved a certain way. That captured payload is the record that any later analysis, evaluation, or fine-tuning dataset can draw on.

Once you’ve logged your LLM interactions, the next logical step is to compare different model versions or prompt strategies side-by-side using MLflow’s comparison features.