MLflow’s prompt tracking lets you version and compare prompts like code, but the name is a bit of a misnomer: you’re not just tracking text, you’re tracking the state of the LLM application that the prompt belongs to.

Let’s see how this plays out with a simple example. Imagine we’re building a sentiment analysis tool. Our initial prompt is straightforward:

import mlflow
import pandas as pd
from mlflow.models import infer_signature

# Assume you have a simple LLM function defined elsewhere
# For demonstration, we'll mock it with keyword matching.
def analyze_sentiment(text):
    lowered = text.lower()
    if "happy" in lowered or "joy" in lowered:
        return "positive"
    elif "sad" in lowered or "unhappy" in lowered:
        return "negative"
    else:
        return "neutral"

# Define our initial prompt
initial_prompt = "Analyze the sentiment of the following text: {text}"

# Wrap the mock in a pyfunc model so MLflow can log it.
# In a real scenario, predict() would render the prompt and call your LLM.
class SentimentModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input, params=None):
        # model_input is a pandas DataFrame with a 'text' column
        sentiments = [analyze_sentiment(text) for text in model_input["text"]]
        return pd.DataFrame({"sentiment": sentiments})

with mlflow.start_run(run_name="sentiment_v1") as run:
    # Log the prompt as a parameter
    mlflow.log_param("prompt_template", initial_prompt)
    mlflow.log_param("model_type", "simple_keyword")

    # Infer a signature from a sample input and its expected output
    sample_input = pd.DataFrame({"text": ["This is a happy day!"]})
    sample_output = pd.DataFrame({"sentiment": ["positive"]})
    signature = infer_signature(sample_input, sample_output)

    # Log the model
    mlflow.pyfunc.log_model(
        artifact_path="sentiment_analyzer",
        python_model=SentimentModel(),
        signature=signature,
        input_example=sample_input,
    )

    print(f"Logged run: {run.info.run_id}")

In this snippet, we’re not just logging the initial_prompt string. We’re logging it as a parameter within an MLflow run. This run also logs a sentiment_analyzer artifact, which is our mocked LLM application. The prompt is intrinsically linked to this run and its associated artifact.
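Note that the snippet logs the template but never fills in its {text} placeholder. As a minimal sketch of how the template might be rendered before it reaches a model, assuming two hypothetical helpers that are not part of MLflow or the original code (render_prompt, and call_llm as a keyword-matching stand-in for a real LLM client):

```python
# Hypothetical sketch: render the logged template, then pass it to a model.
initial_prompt = "Analyze the sentiment of the following text: {text}"

def render_prompt(template: str, text: str) -> str:
    """Fill the {text} placeholder of a prompt template."""
    return template.format(text=text)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client: keyword match on the prompt."""
    lowered = prompt.lower()
    if "happy" in lowered or "joy" in lowered:
        return "positive"
    if "sad" in lowered or "unhappy" in lowered:
        return "negative"
    return "neutral"

rendered = render_prompt(initial_prompt, "This is a happy day!")
print(rendered)
print(call_llm(rendered))
```

Logging the unrendered template, rather than each rendered prompt, keeps the parameter identical across inputs, which is what makes runs comparable by prompt.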

Now, let’s say we want to improve our sentiment analysis. We might refine the prompt and perhaps switch to a more sophisticated LLM.

import mlflow
import pandas as pd
from mlflow.models import infer_signature

# Assume a more advanced LLM function (mocked here)
def advanced_analyze_sentiment(text):
    # In a real case, this would call an LLM API like OpenAI's
    # For demonstration, we'll simulate a slightly better response
    lowered = text.lower()
    if "thrilled" in lowered or "ecstatic" in lowered:
        return "very positive"
    elif "disappointed" in lowered or "frustrated" in lowered:
        return "very negative"
    elif "happy" in lowered or "joy" in lowered:
        return "positive"
    elif "sad" in lowered or "unhappy" in lowered:
        return "negative"
    else:
        return "neutral"

# Our improved prompt (the label list matches what the mock can return)
improved_prompt = "Classify the sentiment of the following text. Respond with only 'very positive', 'positive', 'neutral', 'negative', or 'very negative'. Text: {text}"

class AdvancedSentimentModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input, params=None):
        # model_input is a pandas DataFrame with a 'text' column
        sentiments = [advanced_analyze_sentiment(text) for text in model_input["text"]]
        return pd.DataFrame({"sentiment": sentiments})

with mlflow.start_run(run_name="sentiment_v2") as run:
    mlflow.log_param("prompt_template", improved_prompt)
    mlflow.log_param("model_type", "advanced_llm_sim")  # Or actual LLM model name

    sample_input_v2 = pd.DataFrame({"text": ["I am absolutely thrilled with the service!"]})
    sample_output_v2 = pd.DataFrame({"sentiment": ["very positive"]})
    signature_v2 = infer_signature(sample_input_v2, sample_output_v2)

    mlflow.pyfunc.log_model(
        artifact_path="sentiment_analyzer",  # Same artifact path, stored under the new run
        python_model=AdvancedSentimentModel(),
        signature=signature_v2,
        input_example=sample_input_v2,
    )

    print(f"Logged run: {run.info.run_id}")

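Here, we’ve logged a new run (sentiment_v2) with a different prompt and a (mocked) different underlying model. Each run stores its own copy of the model under the sentiment_analyzer artifact path, so the two models are told apart by run ID. Reusing the artifact path does not by itself create a version in MLflow’s Model Registry. For that, register the logged model under a name, either by passing registered_model_name to log_model or by calling mlflow.register_model afterwards. Each registration under the same name creates a new version of that model.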
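As a sketch of turning the two logged models into numbered registry versions, assuming a tracking backend that supports the Model Registry and a registered model name, "sentiment-analyzer", chosen here purely for illustration:

```python
def model_uri(run_id: str, artifact_path: str = "sentiment_analyzer") -> str:
    """Build the runs:/ URI that points at a model logged inside a run."""
    return f"runs:/{run_id}/{artifact_path}"

def register_run_model(run_id: str, name: str = "sentiment-analyzer"):
    """Register the model logged in `run_id` under `name`.

    Each call with the same name adds the next version to the registry.
    The default name is a hypothetical choice for this sketch.
    """
    # Imported lazily so model_uri() stays usable without MLflow installed.
    import mlflow
    return mlflow.register_model(model_uri(run_id), name)

# Usage (replace the placeholders with the run IDs printed by the runs above):
# v1 = register_run_model("<sentiment_v1 run id>")
# v2 = register_run_model("<sentiment_v2 run id>")
# print(v1.version, v2.version)
```

Because the prompt is logged as a parameter on the source run, each registered version can be traced back to the prompt that produced it through its run ID.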

The power comes when you compare these runs. In the MLflow UI, select sentiment_v1 and sentiment_v2 and click Compare. Between the comparison view and each run’s own page, you can inspect:

  • The different prompt templates logged as parameters.
  • The different model artifacts (if their underlying code or dependencies changed).
  • Performance metrics (if you logged them, e.g., accuracy on a test set).
  • Input examples and signatures.
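Metrics only show up in the comparison if each run logs them. A minimal sketch of computing accuracy for both mocked analyzers follows; the labeled examples are invented for illustration, and the commented log_metric call marks where the value would be recorded inside each run:

```python
def analyze_sentiment(text):
    """Keyword mock used in the sentiment_v1 run."""
    lowered = text.lower()
    if "happy" in lowered or "joy" in lowered:
        return "positive"
    if "sad" in lowered or "unhappy" in lowered:
        return "negative"
    return "neutral"

def advanced_analyze_sentiment(text):
    """Keyword mock used in the sentiment_v2 run."""
    lowered = text.lower()
    if "thrilled" in lowered or "ecstatic" in lowered:
        return "very positive"
    if "disappointed" in lowered or "frustrated" in lowered:
        return "very negative"
    return analyze_sentiment(text)

# Hypothetical labeled examples, invented for this sketch.
test_set = [
    ("This is a happy day!", "positive"),
    ("I am absolutely thrilled with the service!", "very positive"),
    ("I feel sad about the news.", "negative"),
    ("I am so frustrated with the delays.", "very negative"),
    ("The package arrived on Tuesday.", "neutral"),
]

def accuracy(predict, examples):
    """Fraction of examples whose predicted label matches the expected one."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)

acc_v1 = accuracy(analyze_sentiment, test_set)
acc_v2 = accuracy(advanced_analyze_sentiment, test_set)
print(acc_v1, acc_v2)

# Inside the corresponding `with mlflow.start_run(...)` block you would add:
#     mlflow.log_metric("accuracy", acc_v1)   # or acc_v2 for the second run
```

Using the same test set for both runs is what makes the two accuracy values comparable.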

This isn’t just about comparing prompt strings. It’s about comparing the entire experimental setup that produced a certain outcome for a given prompt. You can see how changing the prompt, or the model behind it, affects performance.

The one thing most people don’t realize is that prompt tracking, as used here, is tied to model versioning. When you log a model that uses one prompt and then log another model that uses a different prompt, you create two distinct logged models, and they become numbered versions once you register them under the same name. The "prompt tracking" is the act of logging the prompt as a parameter on the run behind each version, which lets you differentiate and compare versions by their prompt.

This enables you to iterate on your LLM applications by treating prompts as first-class citizens in your experimentation workflow, allowing for structured A/B testing and regression analysis on your LLM outputs.
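One simple form of regression analysis is to run both variants on the same inputs and list the predictions that changed. The sketch below uses the two keyword mocks from this article and a hypothetical helper, changed_predictions; with a real LLM, each predict function would render its own prompt template and call the model:

```python
def analyze_sentiment(text):
    """Keyword mock used in the sentiment_v1 run."""
    lowered = text.lower()
    if "happy" in lowered or "joy" in lowered:
        return "positive"
    if "sad" in lowered or "unhappy" in lowered:
        return "negative"
    return "neutral"

def advanced_analyze_sentiment(text):
    """Keyword mock used in the sentiment_v2 run."""
    lowered = text.lower()
    if "thrilled" in lowered or "ecstatic" in lowered:
        return "very positive"
    if "disappointed" in lowered or "frustrated" in lowered:
        return "very negative"
    return analyze_sentiment(text)

def changed_predictions(old_predict, new_predict, texts):
    """Return (text, old_label, new_label) for every input whose label changed."""
    changes = []
    for text in texts:
        old, new = old_predict(text), new_predict(text)
        if old != new:
            changes.append((text, old, new))
    return changes

inputs = [
    "This is a happy day!",
    "I am absolutely thrilled with the service!",
    "The package arrived on Tuesday.",
]

changes = changed_predictions(analyze_sentiment, advanced_analyze_sentiment, inputs)
for text, old, new in changes:
    print(f"{text!r}: {old} -> {new}")
```

Reviewing only the changed rows keeps the comparison focused on what the new prompt or model actually altered.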

The next step is to explore how to use these versioned prompts within a deployed LLM application.
