MLOps latency SLAs are less about guaranteeing a specific response time and more about defining a contract with the consumers of your model: you commit to a latency threshold and to alerting them when you fail to meet it. That is a fundamentally different problem from making the model fast.

Let’s see it in action. Imagine you have a model serving endpoint, and you want to ensure it responds within 500 milliseconds 99% of the time.

Here’s a simplified Python Flask app serving a dummy model:

from flask import Flask, request, jsonify
import time
import random

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # Simulate model inference time
    inference_time = random.uniform(0.05, 0.6) # Between 50ms and 600ms
    time.sleep(inference_time)
    # Dummy prediction
    prediction = {"result": "positive" if random.random() > 0.5 else "negative"}
    return jsonify(prediction)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
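
Before wiring up any monitoring, it is worth checking what fraction of requests this dummy endpoint can serve within 500 ms at all. The simulated latency is drawn uniformly between 50 ms and 600 ms, so the expected fraction is (0.5 - 0.05) / (0.6 - 0.05), or about 0.82, well short of 99%. The sketch below confirms this with a quick simulation. It models only the sleep time and ignores Flask and network overhead, and the function names are illustrative.

```python
import random

LOW, HIGH = 0.05, 0.6      # same bounds as random.uniform(0.05, 0.6) in the app
THRESHOLD = 0.5            # the 500 ms latency target

def analytic_fraction(low, high, threshold):
    """Exact P(latency <= threshold) for a uniform(low, high) latency."""
    if threshold <= low:
        return 0.0
    if threshold >= high:
        return 1.0
    return (threshold - low) / (high - low)

def simulated_fraction(low, high, threshold, n=100_000, seed=0):
    """Monte Carlo estimate of the same fraction (sleep time only)."""
    rng = random.Random(seed)
    hits = sum(rng.uniform(low, high) <= threshold for _ in range(n))
    return hits / n

if __name__ == "__main__":
    print(f"analytic:  {analytic_fraction(LOW, HIGH, THRESHOLD):.3f}")
    print(f"simulated: {simulated_fraction(LOW, HIGH, THRESHOLD):.3f}")
```

In other words, this dummy service should be expected to miss a 99% target, which makes it a convenient way to test that monitoring and alerting work end to end.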

Now, how do we monitor and enforce an SLA on this? We need a separate system that observes requests to /predict and measures their round-trip time.

The Monitoring System:

This is typically done with a dedicated service that acts as a client to your model endpoint. It sends requests periodically, measures the time taken for each, and aggregates statistics. Prometheus is a common choice here, with exporters that can hit HTTP endpoints.

Let’s configure a Prometheus scrape job for our Flask app:

# prometheus.yml
scrape_configs:
  - job_name: 'model-service'
    static_configs:
      - targets: ['your-model-service-host:5000'] # Replace with your actual host and port
    metrics_path: /metrics # Assuming your Flask app exposes metrics at /metrics
    # For latency, we'll use a separate tool to hit the endpoint and expose metrics

Since our Flask app doesn’t expose any latency metrics on its own, we’ll use a tool like the Prometheus Blackbox Exporter (blackbox_exporter) or a custom Python script that sends requests and exposes metrics to Prometheus. A simple Python script using requests and prometheus_client could look like this:

from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, start_http_server
import time
import requests

app = Flask(__name__)

# Metrics
REQUEST_COUNT = Counter('model_requests_total', 'Total number of requests made to the model endpoint.')
REQUEST_LATENCY = Histogram('model_request_latency_seconds', 'Latency of requests to the model endpoint.', buckets=[0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 5.0]) # Buckets for latency

MODEL_ENDPOINT = "http://your-model-service-host:5000/predict" # Target model endpoint

def send_request_to_model():
    REQUEST_COUNT.inc()
    start_time = time.time()
    try:
        response = requests.post(MODEL_ENDPOINT, json={"data": "sample_input"}, timeout=5) # Add a timeout for the request itself
        response.raise_for_status() # Raise an exception for bad status codes
        # Here you'd parse the response if needed, but for latency we just care about the time
        latency = time.time() - start_time
        REQUEST_LATENCY.observe(latency)
        return True, latency
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
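        # Still record latency on failure. Caveat: a fast failure (e.g. connection refused)
        # lands in a low bucket, so track errors separately if that would skew the SLO.
        latency = time.time() - start_time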
        REQUEST_LATENCY.observe(latency)
        return False, latency

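# On-demand probe route. It is only reachable if the Flask app is also started
# (e.g. app.run on another port); the main block below runs only the probe loop.
@app.route('/probe')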
def probe():
    success, latency = send_request_to_model()
    # You could return metrics here directly or have a separate /metrics endpoint
    return jsonify({
        "success": success,
        "latency_seconds": latency,
        "status": "OK" if success else "FAIL"
    })

if __name__ == '__main__':
    # Expose the metrics defined above on port 8000 for Prometheus to scrape
    start_http_server(8000)
    print("Serving probe metrics on port 8000...")
    # For simplicity, probe in a blocking loop.
    # In a real scenario, use a background thread or a scheduler such as APScheduler or Celery Beat.
    while True:
        send_request_to_model()
        time.sleep(10) # Send a request every 10 seconds

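With this setup, Prometheus can scrape the /metrics endpoint of our probe server on port 8000. Note that the scrape job has to list the probe’s host and port as a target; the model-service job shown earlier points at port 5000, where the Flask app itself exposes no metrics.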
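
The probe script above sends its requests from a blocking loop. A background thread is one alternative, and a minimal standard-library sketch of it follows. The names run_periodically and start_background_probe are hypothetical, the probe callable stands in for send_request_to_model, and a production setup might prefer a scheduler such as APScheduler.

```python
import threading
import time

def run_periodically(probe, interval_seconds, stop_event):
    """Call probe() every interval_seconds until stop_event is set.

    Slots are computed from a monotonic clock so that the time spent inside
    probe() does not make the schedule drift.
    """
    next_run = time.monotonic()
    while not stop_event.is_set():
        probe()
        next_run += interval_seconds
        now = time.monotonic()
        if next_run < now:
            next_run = now  # probe overran; skip missed slots instead of bursting
        # wait() returns early if a stop is requested during the pause.
        stop_event.wait(next_run - now)

def start_background_probe(probe, interval_seconds):
    """Start the loop in a daemon thread and return (thread, stop_event)."""
    stop_event = threading.Event()
    thread = threading.Thread(
        target=run_periodically,
        args=(probe, interval_seconds, stop_event),
        daemon=True,
    )
    thread.start()
    return thread, stop_event

if __name__ == "__main__":
    calls = []
    thread, stop = start_background_probe(lambda: calls.append(time.monotonic()), 0.05)
    time.sleep(0.3)
    stop.set()
    thread.join()
    print(f"probe ran {len(calls)} times")
```

Passing send_request_to_model with an interval of 10 would keep the same probing cadence while leaving the main thread free, for example to serve the Flask app.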

Enforcing the SLA with Alerting:

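Now we express our SLA as Prometheus alerting rules. Prometheus evaluates the rules and hands firing alerts to Alertmanager, which routes the notifications. An SLA is typically built on one or more SLOs (Service Level Objectives). For latency, a common SLO is "99% of requests served in under 500ms".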
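
It can help to translate that objective into an absolute budget of slow requests. The sketch below does the arithmetic for a given request count. The 10-second probe interval comes from the probe script and the 5-minute window is the one used by the rule that follows; the helper name is hypothetical.

```python
import math

def allowed_slow_requests(total_requests, slo_target=0.99):
    """Largest number of slow requests that still keeps the ratio >= slo_target."""
    # The small epsilon guards against float error in (1 - slo_target).
    return math.floor(total_requests * (1 - slo_target) + 1e-9)

if __name__ == "__main__":
    probes_per_window = (5 * 60) // 10   # 5-minute window, one probe every 10 s
    print(probes_per_window)                          # 30
    print(allowed_slow_requests(probes_per_window))   # 0
    print(allowed_slow_requests(10_000))              # 100
    # Ratio after a single slow probe in the window:
    print(round((probes_per_window - 1) / probes_per_window, 3))
```

With only about 30 synthetic probes per window, a single slow probe already pulls the ratio to roughly 0.967, below the 99% target. At that sampling rate the objective is only meaningful when paired with a longer window or a sustained-duration condition on the alert.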

Here’s a Prometheus recording rule to calculate the fraction of requests within the threshold:

# prometheus-rules.yml
groups:
- name: model_latency_rules
  rules:
  - record: model_request_latency_500ms_percentage
    expr: |
      sum by (job) (rate(model_request_latency_seconds_bucket{le="0.5"}[5m]))
      /
      sum by (job) (rate(model_request_latency_seconds_count[5m]))

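This rule calculates, over a 5-minute window ([5m]), the fraction of requests that completed within 0.5 seconds: the rate of the le="0.5" bucket divided by the rate of all requests. Histogram buckets are cumulative, so the le="0.5" bucket counts every request at or below 0.5 seconds. Despite the metric name, the result is a ratio between 0 and 1, not a percentage.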
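
To see why the le="0.5" bucket alone is the right numerator, the pure-Python sketch below mimics cumulative bucket counting on a handful of latencies. It illustrates the idea only and is not how prometheus_client is implemented; the sample values are invented.

```python
import bisect

# Same upper bounds as the Histogram in the probe script, plus the implicit +Inf bucket.
BOUNDS = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 5.0, float("inf")]

def cumulative_buckets(latencies, bounds=BOUNDS):
    """Return {upper_bound: number of observations <= upper_bound}."""
    per_bucket = [0] * len(bounds)
    for value in latencies:
        # bisect_left finds the first bound >= value, i.e. the smallest bucket that fits.
        per_bucket[bisect.bisect_left(bounds, value)] += 1
    counts, running = {}, 0
    for bound, n in zip(bounds, per_bucket):
        running += n
        counts[bound] = running
    return counts

def ratio_within(latencies, threshold=0.5):
    """Fraction of observations at or below threshold (must be one of the bounds)."""
    buckets = cumulative_buckets(latencies)
    return buckets[threshold] / buckets[float("inf")]

if __name__ == "__main__":
    sample = [0.08, 0.12, 0.31, 0.45, 0.5, 0.52, 0.58, 0.22, 0.07, 0.49]  # invented values
    print(ratio_within(sample))  # 8 of the 10 values are <= 0.5, so 0.8
```

Because the threshold has to coincide with a bucket bound, an SLO threshold that is not in the histogram's bucket list cannot be measured exactly this way.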

And here’s the alerting rule:

# prometheus-rules.yml (continued)
  - alert: HighModelLatency
    expr: model_request_latency_500ms_percentage < 0.99 # Alert if less than 99% of requests are under 0.5s
    for: 10m # Only fire if the condition persists for 10 minutes
    labels:
      severity: warning
    annotations:
      summary: "Model service latency is too high"
      description: "Less than 99% of model requests are being served in under 500ms for job {{ $labels.job }}."

The for: 10m clause is critical. It prevents flapping alerts. An SLA isn’t broken by a single slow request; it’s broken by a sustained degradation.
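
To make the effect of the for clause concrete, the following sketch replays a series of rule evaluations and reports when an alert would move from pending to firing. It is a simplified model of the behaviour described above, not Prometheus's evaluator. The function name, the one-minute evaluation interval, and the sample ratios are assumptions for illustration.

```python
def alert_states(ratios, threshold=0.99, for_seconds=600, eval_interval=60):
    """Yield 'inactive', 'pending', or 'firing' for each evaluation.

    The condition (ratio < threshold) must hold at every evaluation for at
    least for_seconds before the state becomes 'firing'; any healthy
    evaluation resets the timer.
    """
    breached_since = None  # time at which the current breach started
    for i, ratio in enumerate(ratios):
        now = i * eval_interval
        if ratio < threshold:
            if breached_since is None:
                breached_since = now
            state = "firing" if now - breached_since >= for_seconds else "pending"
        else:
            breached_since = None
            state = "inactive"
        yield state

if __name__ == "__main__":
    blip = [1.0, 0.97, 1.0, 1.0]        # one bad evaluation, then recovery
    sustained = [1.0] + [0.95] * 12     # twelve consecutive bad evaluations
    print(list(alert_states(blip)))
    print(list(alert_states(sustained))[-3:])
```

The single bad evaluation only reaches the pending state and then clears, while the sustained breach starts firing once it has lasted the full ten minutes.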

The most counterintuitive aspect of latency SLAs is that they are fundamentally about observability and alerting, not performance optimization. You don’t optimize your model to meet an SLA; you build a robust monitoring system that alerts you when you’re about to violate it, giving you time to react. The actual performance tuning is a separate, ongoing engineering effort.

The next step is to investigate why the latency is high, which often involves diving into distributed tracing or profiling your model inference code.
