SageMaker endpoints are not just for deploying models; they’re a way to turn your ML models into managed, on-demand HTTPS APIs, backed by instances you choose, that can handle real-time inference at scale.

Let’s see what this looks like in practice. Imagine you’ve trained a scikit-learn model to predict customer churn. You’ve saved this model as a model.joblib file. Now, you want to deploy it to SageMaker.

First, you need to package your model artifact and any inference code. This typically involves creating a model.tar.gz file. If your model requires custom pre- or post-processing, you’ll also need an inference.py script.

Here’s a simple inference.py for a scikit-learn model:

import joblib
import os
import json
import pandas as pd

def model_fn(model_dir):
    """Loads the scikit-learn model from disk."""
    model = joblib.load(os.path.join(model_dir, "model.joblib"))
    return model

def input_fn(request_body, request_content_type):
    """Parses the input data."""
    if request_content_type == "application/json":
        data = json.loads(request_body)
        # Assuming input is a list of feature dictionaries
        return pd.DataFrame(data)
    else:
        raise ValueError(f"Unsupported content type: {request_content_type}")

def predict_fn(input_data, model):
    """Makes predictions using the loaded model."""
    return model.predict(input_data)

def output_fn(prediction, response_content_type):
    """Formats the prediction output."""
    if response_content_type == "application/json":
        return json.dumps(prediction.tolist()), response_content_type
    else:
        raise ValueError(f"Unsupported content type: {response_content_type}")

With your model.joblib and inference.py ready, you’d create the model.tar.gz:

tar -czvf model.tar.gz model.joblib inference.py
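If you would rather build the archive from Python, for example inside a training script, the standard-library tarfile module can produce an equivalent file. The following is a minimal sketch: build_model_archive and list_archive_members are hypothetical helper names, and the demo uses placeholder files in a temporary directory rather than a real model.

```python
import tarfile
import tempfile
from pathlib import Path


def build_model_archive(artifact_paths, archive_path):
    """Write each file at the root of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in artifact_paths:
            path = Path(path)
            # arcname drops the directory part so files land at the archive root
            archive.add(path, arcname=path.name)
    return Path(archive_path)


def list_archive_members(archive_path):
    """Return the member names stored in the archive, sorted."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(archive.getnames())


# Demo with placeholder files standing in for the real artifacts.
with tempfile.TemporaryDirectory() as workdir:
    workdir = Path(workdir)
    (workdir / "model.joblib").write_bytes(b"placeholder model bytes")
    (workdir / "inference.py").write_text("# placeholder handler\n")
    archive_file = build_model_archive(
        [workdir / "model.joblib", workdir / "inference.py"],
        workdir / "model.tar.gz",
    )
    demo_members = list_archive_members(archive_file)
    print(demo_members)  # ['inference.py', 'model.joblib']
```

Listing the members before uploading is a cheap way to confirm that the files sit at the archive root and not under a nested directory.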

Next, you upload this model.tar.gz to an S3 bucket. Let’s say it’s at s3://my-sagemaker-bucket/models/churn-model/model.tar.gz.
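The upload itself can be scripted with Boto3. This is a sketch under assumptions: split_s3_uri and upload_model_archive are hypothetical helpers, the bucket must already exist, and your AWS credentials need permission to write to it.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"Not a valid S3 object URI: {s3_uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def upload_model_archive(local_path, s3_uri):
    """Upload a local archive to the given S3 URI using Boto3."""
    import boto3  # imported lazily so the URI helper works without AWS set up

    bucket, key = split_s3_uri(s3_uri)
    boto3.client("s3").upload_file(str(local_path), bucket, key)
    return s3_uri


# Usage (requires AWS credentials and an existing bucket):
# upload_model_archive(
#     "model.tar.gz",
#     "s3://my-sagemaker-bucket/models/churn-model/model.tar.gz",
# )
```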

Now, you can create a SageMaker Model object using the SageMaker Python SDK, which is built on top of Boto3:

import boto3
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model

sagemaker_session = sagemaker.Session()
sagemaker_client = boto3.client("sagemaker")

model_name = "my-churn-model"
# Look up the pre-built scikit-learn image for the session's region
# instead of hard-coding a registry account ID
image_uri = image_uris.retrieve(
    framework="sklearn",
    region=sagemaker_session.boto_region_name,
    version="1.0-1",
    py_version="py3",
    instance_type="ml.m5.large",
)

model = Model(
    model_data="s3://my-sagemaker-bucket/models/churn-model/model.tar.gz",
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole", # Replace with your SageMaker execution role ARN
    name=model_name,
    sagemaker_session=sagemaker_session,
)

# Create the SageMaker Model resource
model.create()

This Model object represents your trained model artifact and the inference container it needs to run. The image_uri points to a pre-built Docker image from Amazon ECR that contains the necessary libraries (like scikit-learn) and the SageMaker inference toolkit.

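The real magic happens when you create an Endpoint Configuration and then an Endpoint. The Endpoint Configuration specifies which model to serve, the instance type, and the initial instance count. Auto-scaling policies are not part of it; they are attached separately once the endpoint exists.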

import boto3

sagemaker_client = boto3.client("sagemaker")

endpoint_config_name = f"{model_name}-config"
instance_type = "ml.m5.large"
initial_instance_count = 1

# Created through the low-level Boto3 client (CreateEndpointConfig API)
sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",  # any name; used later for scaling and traffic routing
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": initial_instance_count,
        }
    ],
)

Finally, you create the Endpoint itself, which is the actual live, invokable API.

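import boto3

sagemaker_client = boto3.client("sagemaker")

endpoint_name = f"{model_name}-endpoint"

# Created through the low-level Boto3 client (CreateEndpoint API)
sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

# Block until the endpoint reaches InService; provisioning takes several minutes
sagemaker_client.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)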
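Endpoint creation is asynchronous: the create call returns immediately while instances are still being provisioned. Boto3 ships an endpoint_in_service waiter for this, but the underlying idea is a simple polling loop around DescribeEndpoint. The sketch below shows that loop; wait_for_status is a hypothetical helper, and the polling interval and retry count are arbitrary assumptions.

```python
import time


def wait_for_status(describe, target="InService", failed=("Failed",),
                    poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll describe() until it reports the target status.

    describe is any zero-argument callable returning a dict with an
    "EndpointStatus" key, which is the shape DescribeEndpoint returns.
    """
    for _ in range(max_polls):
        status = describe()["EndpointStatus"]
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"Endpoint entered status {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"Endpoint did not reach {target} after {max_polls} polls")


# Usage against a real endpoint (requires AWS credentials):
# import boto3
# client = boto3.client("sagemaker")
# wait_for_status(
#     lambda: client.describe_endpoint(EndpointName="my-churn-model-endpoint")
# )
```

Passing the describe call in as a callable keeps the loop testable without AWS access.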

Once the endpoint is in the InService state, you can invoke it to get predictions.

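import json
import boto3

# Invocations go through the separate "sagemaker-runtime" client
runtime_client = boto3.client("sagemaker-runtime")

# Example payload
payload = [
    {"feature1": 1.0, "feature2": 0.5, "feature3": 2.3},
    {"feature1": 0.2, "feature2": 0.8, "feature3": 1.1}
]

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)

predictions = json.loads(response["Body"].read().decode("utf-8"))
print(predictions)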

This setup lets SageMaker manage the underlying infrastructure and expose a stable API for your ML models. Scaling with traffic is not on by default: the endpoint stays at its initial instance count until you register the variant with Application Auto Scaling, and high availability calls for at least two instances so SageMaker can spread them across Availability Zones. The ml.m5.large instance type, for example, provides a balance of compute and memory suitable for many general-purpose inference tasks. You can also specify ml.c5.xlarge for CPU-intensive workloads or ml.p3.2xlarge for GPU acceleration if your model requires it.
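Scaling with traffic is configured through Application Auto Scaling, not on the endpoint itself. Here is a hedged sketch of a target-tracking policy. The variant name AllTraffic, the capacity bounds, and the target of 70 invocations per instance per minute are all assumptions to replace with your own values; scaling_requests and apply_scaling are hypothetical helper names.

```python
def scaling_requests(endpoint_name, variant_name="AllTraffic",
                     min_instances=1, max_instances=4,
                     invocations_per_instance=70.0):
    """Build the two Application Auto Scaling requests for an endpoint variant."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Assumed target; tune it from load tests of your own model
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


def apply_scaling(endpoint_name, **kwargs):
    """Register the variant as a scalable target and attach the policy."""
    import boto3  # imported lazily; the request builder needs no AWS access

    autoscaling = boto3.client("application-autoscaling")
    target, policy = scaling_requests(endpoint_name, **kwargs)
    autoscaling.register_scalable_target(**target)
    autoscaling.put_scaling_policy(**policy)
```

Calling apply_scaling("my-churn-model-endpoint") after the endpoint is InService would let the variant grow from one to four instances as traffic increases.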

The key is that SageMaker handles the provisioning of EC2 instances, the deployment of your container, and the routing of inference requests. You don’t need to worry about managing servers, load balancers, or auto-scaling groups.

When you deploy a model using sagemaker.model.Model, it registers a SageMaker Model resource. This resource references your model artifacts in S3 and the inference container image. The endpoint configuration then defines how that model will be served: which instance type and how many instances. Auto-scaling policies are attached to the endpoint variant afterward through Application Auto Scaling. Finally, the Endpoint is the live, network-addressable resource that you can send requests to.

The serving layer within the container image (the SageMaker inference toolkit in the pre-built images) is what enables SageMaker to communicate with your model. It implements two standard HTTP routes on port 8080, GET /ping for health checks and POST /invocations for predictions, and SageMaker forwards traffic to them. Your inference.py script hooks into this layer via the model_fn, input_fn, predict_fn, and output_fn functions.
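To make the handler contract concrete: model_fn runs once when the container starts, and each request to /invocations then flows through input_fn, predict_fn, and output_fn in that order. The sketch below imitates that per-request chain with stand-in handlers and a toy model. It is an illustration of the call order, not the toolkit's actual code, and every demo_ name is hypothetical.

```python
import json


def handle_invocation(model, body, content_type, accept,
                      input_fn, predict_fn, output_fn):
    """Run one request through the three per-request handlers in order."""
    data = input_fn(body, content_type)
    prediction = predict_fn(data, model)
    return output_fn(prediction, accept)


# Stand-in handlers and model, for illustration only.
def demo_input_fn(body, content_type):
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(body)


def demo_predict_fn(rows, model):
    return [model(row) for row in rows]


def demo_output_fn(prediction, accept):
    return json.dumps(prediction), accept


def demo_model(row):
    # Toy rule: predict churn (1) when feature1 exceeds 0.5
    return 1 if row["feature1"] > 0.5 else 0


demo_body = json.dumps([{"feature1": 1.0}, {"feature1": 0.2}])
demo_response = handle_invocation(
    demo_model, demo_body, "application/json", "application/json",
    demo_input_fn, demo_predict_fn, demo_output_fn,
)
print(demo_response)  # ('[1, 0]', 'application/json')
```

Running your real handlers through a chain like this locally is a quick way to catch content-type and serialization mistakes before paying for an endpoint.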

The most surprising thing is that the image_uri doesn’t have to be from Amazon’s pre-built ECR images. You can build your own Docker container with your specific dependencies and push it to your own ECR repository, then point SageMaker to that custom image. As long as the container answers /ping and /invocations on port 8080, you have complete control over the inference environment and can use any Python version, custom libraries, or even non-Python runtimes.
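A custom container only has to speak a small HTTP contract: respond to GET /ping with a 200 when healthy, and to POST /invocations with a prediction, both on port 8080. The following standard-library sketch shows the shape of such a server. It is not production-grade (a real image would typically put a proper WSGI or ASGI server in front), and the "model" is a placeholder that counts the features in each row.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    def _send(self, status, body=b""):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: an empty 200 tells SageMaker the container is ready
        self._send(200 if self.path == "/ping" else 404)

    def do_POST(self):
        if self.path != "/invocations":
            self._send(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        rows = json.loads(self.rfile.read(length))
        predictions = [len(row) for row in rows]  # placeholder "model"
        self._send(200, json.dumps(predictions).encode("utf-8"))

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def start_server(host="0.0.0.0", port=8080):
    """Start the server in a background thread and return it."""
    server = HTTPServer((host, port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call(port, method, path, body=None):
    """Send one local request and return (status, response bytes)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request(method, path, body=body)
    response = conn.getresponse()
    result = (response.status, response.read())
    conn.close()
    return result
```

For a local check, start_server(host="127.0.0.1", port=0) binds to a free port, which you can read back from server.server_address[1] and pass to call.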

The next step is often to set up asynchronous inference for larger payloads or longer processing times. There, requests are queued and results are written to S3, so the caller does not hold a connection open while the model runs.
