MLflow’s default artifact storage is surprisingly ill-suited for models that exceed a few gigabytes, leading to frustrating timeouts and incomplete uploads.

The core issue is that MLflow’s artifact logging can end up sending a large file as one long-running transfer, for example as a single HTTP request when artifacts are proxied through the tracking server. For large files, this is inefficient and prone to network interruptions. When the upload fails, you can be left with a partially uploaded artifact, which makes it impossible to load the model later.

Here are the most common reasons this happens and how to fix them:

1. Default Artifact Store Limits

Diagnosis: Your artifact store (like S3, Azure Blob Storage, or GCS) has size limits for single-request uploads (S3, for example, caps a single PUT at 5 GB), or your network connection is not stable enough for a long, uninterrupted transfer. MLflow’s default behavior gives you little control over how large files are handled.

Cause: MLflow’s log_artifact and log_artifacts methods expose no options for chunk size or concurrency, so a multi-gigabyte model is uploaded with whatever defaults the storage client or proxy applies. Depending on the setup, that can mean one long-running request per file.

Fix: Make sure uploads take a chunked (multipart) path. For S3, MLflow uploads through boto3, whose managed transfer switches to multipart upload above multipart_threshold and can upload parts in parallel with use_threads=True. MLflow does not expose these parameters on log_artifact, so first confirm that your client writes directly to S3. Note that the tracking URI and the artifact location are separate settings:

import mlflow

# The tracking URI points at the tracking server (or a local store), not at S3.
mlflow.set_tracking_uri("http://localhost:5000")

# The S3 location is set per experiment, or server-wide with
# `mlflow server --default-artifact-root s3://...`.
experiment_name = "large-models"
if mlflow.get_experiment_by_name(experiment_name) is None:
    mlflow.create_experiment(
        experiment_name,
        artifact_location="s3://your-mlflow-bucket/your-mlflow-prefix/",
    )
mlflow.set_experiment(experiment_name)

# AWS credentials and region come from the usual boto3 sources
# (environment variables, ~/.aws/credentials, or an instance role).
# MLflow's S3 artifact store uploads through boto3, which handles multipart
# uploads itself, but log_artifact / log_artifacts do not accept storage-specific
# options such as a TransferConfig.
# If you need explicit control over chunk size or concurrency, a more robust
# solution is often to use a dedicated large file upload mechanism
# and then log the *path* to that uploaded file with the run.

Why it works: Chunked uploads break the large file into smaller pieces that are uploaded independently, so a network hiccup costs one part instead of the whole transfer. They also lift the single-request size limit: an S3 multipart upload can reach 5 TB, against 5 GB for a single PUT. When MLflow writes directly to S3, it relies on the boto3 client’s multipart upload capabilities.
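To make the chunking arithmetic concrete, here is a minimal sketch that works out how many parts a file would need under S3’s documented multipart limits (at most 10,000 parts, and at least 5 MiB per part except the last). The function name plan_multipart_upload and the 8 MiB default chunk are illustrative assumptions, not part of MLflow or boto3.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3

# Documented S3 multipart limits: at most 10,000 parts per upload, and every
# part except the last must be at least 5 MiB.
S3_MAX_PARTS = 10_000
S3_MIN_PART_SIZE = 5 * MIB


def plan_multipart_upload(file_size: int, chunk_size: int = 8 * MIB) -> tuple:
    """Return (chunk_size, part_count), growing the chunk size if needed."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    # Never go below the minimum part size.
    chunk_size = max(chunk_size, S3_MIN_PART_SIZE)
    # Grow the chunk until the file fits in the allowed number of parts.
    chunk_size = max(chunk_size, math.ceil(file_size / S3_MAX_PARTS))
    part_count = math.ceil(file_size / chunk_size)
    return chunk_size, part_count


if __name__ == "__main__":
    for size_gib in (1, 20, 200):
        chunk, parts = plan_multipart_upload(size_gib * GIB)
        print(f"{size_gib} GiB -> {parts} parts of up to {chunk / MIB:.1f} MiB")
```

With an 8 MiB chunk, a 200 GiB file would need more than 10,000 parts, so the sketch enlarges the chunk instead. The same arithmetic explains why a very large model may need a bigger chunk size than the client default.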

2. Inadequate Network Timeout Settings

Diagnosis: Network requests between your MLflow client and the artifact store are timing out before the upload can complete.

Cause: Default HTTP client timeouts are often too short for large file transfers, especially over slower or less reliable networks.

Fix: Increase the HTTP client timeout. MLflow’s client makes its REST calls with requests, and recent versions read the timeout (in seconds) from the MLFLOW_HTTP_REQUEST_TIMEOUT environment variable; check the environment variable reference for your version. If a tracking server or reverse proxy sits between the client and the artifact store, raise its timeouts as well.

If you write directly to S3, the transfer is handled by boto3 rather than by MLflow’s HTTP client. Chunk sizing is controlled by boto3.s3.transfer.TransferConfig (for example multipart_threshold=1024*1024*10 and multipart_chunksize=1024*1024*10), and socket timeouts by botocore’s Config(connect_timeout=..., read_timeout=...). log_artifact accepts neither object, so they only apply to an S3 client that you create and call yourself.

Why it works: Longer timeouts allow the entire file, or its chunks, to be transferred without the connection being prematurely terminated by the client or server.
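One way to pick a timeout is to derive it from the file size and the bandwidth you expect. The sketch below does that and exports the result. The helper estimate_timeout_seconds, the safety factor of 3, and the 200 Mbit/s figure are assumptions for illustration, and you should confirm that your MLflow version honors MLFLOW_HTTP_REQUEST_TIMEOUT before relying on it.

```python
import math
import os


def estimate_timeout_seconds(file_size_bytes: int, bandwidth_mbit_s: float,
                             safety_factor: float = 3.0,
                             floor_seconds: int = 120) -> int:
    """Estimate a timeout that covers one full transfer with headroom."""
    if bandwidth_mbit_s <= 0:
        raise ValueError("bandwidth_mbit_s must be positive")
    bytes_per_second = bandwidth_mbit_s * 1_000_000 / 8
    transfer_seconds = file_size_bytes / bytes_per_second
    # Never go below the floor, even for small files.
    return max(floor_seconds, math.ceil(transfer_seconds * safety_factor))


# A 10 GiB model over an assumed 200 Mbit/s link.
timeout = estimate_timeout_seconds(10 * 1024 ** 3, bandwidth_mbit_s=200)

# Set this before the MLflow logging calls run (assumed variable name, see above).
os.environ["MLFLOW_HTTP_REQUEST_TIMEOUT"] = str(timeout)
print(f"HTTP request timeout set to {timeout} seconds")
```

The safety factor is a judgment call: it should absorb slow periods on the link without hiding a genuinely stalled connection for hours.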

3. Insufficient Client-Side Memory or Disk Space

Diagnosis: The machine performing the upload runs out of RAM or temporary disk space during the upload process.

Cause: Logging a model usually means serializing it first. The object has to fit in memory while it is being saved, and MLflow writes the model to a local temporary directory before uploading it, so a large model needs roughly its own size in free temporary disk space.

Fix: Ensure the machine where you are running your MLflow training script has enough RAM to hold and serialize the model, and sufficient temporary disk space (/tmp on Linux/macOS or %TEMP% on Windows). If /tmp is small, point temporary files at a larger volume with the TMPDIR environment variable, which Python’s tempfile module honors. boto3 itself reads the file in chunks and does not need a staging directory.

# Example: point temporary files at a larger volume and define S3 transfer settings
import os
import tempfile

from boto3.s3.transfer import TransferConfig

# MLflow stages a model in a local temporary directory before uploading it.
# Python's tempfile module honors TMPDIR, so either set TMPDIR before the
# process starts or reset tempfile's cached value as shown here.
# "/mnt/large-volume/tmp" is a placeholder path.
os.environ["TMPDIR"] = "/mnt/large-volume/tmp"
os.makedirs(os.environ["TMPDIR"], exist_ok=True)
tempfile.tempdir = None  # force tempfile to re-read TMPDIR
print(f"Using temporary directory for uploads: {tempfile.gettempdir()}")

# Transfer parameters for an S3 client that you call yourself.
# Note: TransferConfig has no temp-directory option.
transfer_config = TransferConfig(
    multipart_threshold=1024 * 1024 * 5,  # switch to multipart above 5 MiB
    multipart_chunksize=1024 * 1024 * 5,  # 5 MiB parts (the S3 minimum part size)
    use_threads=True,                     # upload parts in parallel
)

# TransferConfig is passed per call; it is not attached to a client or resource:
# import boto3
# s3_client = boto3.client("s3", region_name="us-east-1")
# s3_client.upload_file("large_model.pkl", "your-mlflow-bucket",
#                       "your-mlflow-prefix/large_model.pkl", Config=transfer_config)

# MLflow's own S3 logging uses boto3's default transfer settings and does not
# accept a TransferConfig, so for log_artifact only the TMPDIR setting applies.

Why it works: By providing sufficient RAM and disk space, or directing temporary files to a larger partition, the client can handle the buffering and staging of file parts without crashing.
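A quick pre-flight check can catch the disk-space case before a long training run ends in a failed upload. This is a minimal standard-library sketch; the helper names and the 1.2 headroom factor are assumptions chosen for illustration.

```python
import os
import shutil
import tempfile
from typing import Optional


def directory_size(path: str) -> int:
    """Total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def has_staging_space(model_path: str, staging_dir: Optional[str] = None,
                      headroom: float = 1.2) -> bool:
    """Check that the staging directory can hold a copy of the model."""
    staging_dir = staging_dir or tempfile.gettempdir()
    if os.path.isfile(model_path):
        size = os.path.getsize(model_path)
    else:
        size = directory_size(model_path)
    # Require the model size plus headroom to be free on the staging volume.
    free = shutil.disk_usage(staging_dir).free
    return free >= size * headroom


if __name__ == "__main__":
    # Check the current directory as a stand-in for a saved model directory.
    print("Enough staging space:", has_staging_space("."))
```

Calling has_staging_space on the saved model directory before logging it gives an early, readable failure instead of an out-of-space error halfway through staging.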

4. MLflow Tracking Server Configuration (if applicable)

Diagnosis: If you are running an MLflow tracking server, its web server configuration (e.g., Nginx, Flask) might have limits on request body size or timeouts.

Cause: The MLflow tracking server acts as an intermediary. Its own web server can impose limits on how large a single HTTP request (which an artifact upload is) can be.

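Fix: For Nginx, increase client_max_body_size above your largest artifact, and raise proxy_send_timeout and proxy_read_timeout in your nginx.conf or site-specific configuration. proxy_connect_timeout only covers establishing the connection, and the Nginx documentation notes that it usually cannot exceed 75 seconds, so raising it does little for long uploads. If the MLflow server runs under Gunicorn, raise the worker timeout as well, for example with mlflow server --gunicorn-opts "--timeout 600".

# Example Nginx configuration snippet
http {
    # ... other settings ...
    client_max_body_size 20G;   # Size this above your largest artifact (0 disables the check)
    proxy_send_timeout 600s;    # Allow up to 10 minutes between writes to the upstream
    proxy_read_timeout 600s;    # Allow up to 10 minutes between reads from the upstream
    proxy_connect_timeout 60s;  # Connection setup only; usually capped at 75 seconds
    # ... other settings ...
}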

Why it works: These settings ensure the tracking server can accept and process large incoming HTTP requests from your MLflow client without timing out or rejecting them due to size constraints.

5. Artifact Store Backend Specifics (e.g., Azure Blob Storage, GCS)

Diagnosis: Different cloud storage providers have their own nuances and potential limits for large object uploads.

Cause: While S3’s multipart upload is well-established, other providers might have different recommended practices or default configurations that MLflow’s generic artifact store wrapper might not fully optimize for.

Fix: Consult the documentation for your specific artifact store. For Azure Blob Storage, ensure you’re using the wasbs URI scheme (wasbs://<container>@<storage-account>.blob.core.windows.net/<path>) and that the underlying SDK (which MLflow uses) is configured for efficient large uploads (often handled automatically by the SDK’s block blob upload methods). For GCS, similar to S3, chunked uploads are key and usually managed by the client library. MLflow’s documentation lists environment variables such as MLFLOW_GCS_UPLOAD_CHUNK_SIZE for tuning this; check that your version supports them, and otherwise configure the google-cloud-storage client library directly.

Why it works: Adhering to the specific best practices and configurations for your chosen artifact store ensures that MLflow leverages the most efficient and robust upload mechanisms available for that backend.
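As an illustration for GCS, resumable uploads expect the chunk size to be a multiple of 256 KiB. The sketch below rounds a requested size up to a valid value and exports it. The helper round_gcs_chunk_size is hypothetical, and MLFLOW_GCS_UPLOAD_CHUNK_SIZE is assumed to be honored by your MLflow version, so verify it against the artifact store documentation.

```python
import os

# Resumable upload chunks must be a multiple of 256 KiB.
GCS_CHUNK_MULTIPLE = 256 * 1024


def round_gcs_chunk_size(requested_bytes: int) -> int:
    """Round a requested chunk size up to the next multiple of 256 KiB."""
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    multiples = -(-requested_bytes // GCS_CHUNK_MULTIPLE)  # ceiling division
    return multiples * GCS_CHUNK_MULTIPLE


chunk_size = round_gcs_chunk_size(50 * 1024 ** 2)  # ask for roughly 50 MiB

# Assumed variable name; set it before the MLflow logging calls run.
os.environ["MLFLOW_GCS_UPLOAD_CHUNK_SIZE"] = str(chunk_size)
print(f"GCS upload chunk size: {chunk_size} bytes")
```

Smaller chunks mean less data to resend after a dropped connection, at the cost of more requests per file.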

6. Using mlflow.pyfunc.save_model Incorrectly

Diagnosis: If you’re saving a Python function model, mlflow.pyfunc.save_model might bundle dependencies or code in a way that creates a very large artifact that then fails to upload.

Cause: The python_model argument to save_model can be complex. If it includes large embedded data or has many dependencies, the resulting artifact directory can become massive.

Fix: Consider externalizing large data dependencies. Instead of embedding large files directly into your PythonModel, save them to your artifact store (or a separate location) first, and then have your PythonModel load them from a known path (e.g., using mlflow.artifacts.download_artifacts). This breaks the large artifact into smaller, manageable components.

import pickle

import mlflow
import mlflow.pyfunc
import numpy as np


class LargeDataModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Runs when the model is loaded; context.artifacts maps each key
        # to a local file path.
        with open(context.artifacts["large_model"], "rb") as f:
            self.model = pickle.load(f)
        self.data = np.load(context.artifacts["large_data"])

    def predict(self, context, model_input):
        # ... prediction logic ...
        pass

# In your training script:
# Log the pyfunc model and hand the large files to MLflow through `artifacts`.
# MLflow copies them into the model directory as separate files instead of
# pickling them into the PythonModel instance, and resolves their local paths
# again when the model is loaded.
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="my_large_model_pyfunc",
        python_model=LargeDataModel(),
        artifacts={"large_model": "large_model.pkl", "large_data": "large_data.npy"},
    )

Why it works: By keeping large data out of the pickled PythonModel and referencing it instead, you prevent MLflow from having to upload a single, monolithic file that contains both code and massive datasets. The artifacts dictionary passed to save_model or log_model registers each entry as a separate file in the model directory, and MLflow hands the resolved local paths back through context.artifacts when the model is loaded.

The next error you’ll likely encounter is a FileNotFoundError or KeyError when MLflow tries to access a model artifact that was only partially uploaded, indicating that one of the above issues still persists.
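A partial upload can be detected before anyone tries to load the model by comparing the local model directory with what the artifact store reports. The sketch below is one way to do that. The helper names are hypothetical, and it assumes your artifact store reports file sizes through MlflowClient.list_artifacts (entries without a size are skipped rather than flagged).

```python
import os
from typing import Dict, List, Optional


def local_manifest(root: str) -> Dict[str, int]:
    """Map each file's POSIX-style path relative to root to its size in bytes."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            manifest[rel] = os.path.getsize(full)
    return manifest


def find_incomplete(local: Dict[str, int],
                    remote: Dict[str, Optional[int]]) -> List[str]:
    """List files that are missing remotely or whose sizes differ."""
    problems = []
    for path, size in sorted(local.items()):
        if path not in remote:
            problems.append(f"missing: {path}")
        elif remote[path] is not None and remote[path] != size:
            problems.append(
                f"size mismatch: {path} (local {size}, remote {remote[path]})"
            )
    return problems


def remote_manifest(run_id: str, artifact_path: str) -> Dict[str, Optional[int]]:
    """Recursively list a run's artifacts; needs mlflow and a reachable server."""
    # Imported lazily so the helpers above stay standard-library only.
    from mlflow.tracking import MlflowClient

    client = MlflowClient()
    manifest = {}
    pending = [artifact_path]
    while pending:
        for info in client.list_artifacts(run_id, pending.pop()):
            if info.is_dir:
                pending.append(info.path)
            else:
                # info.path is relative to the run's artifact root.
                rel = info.path[len(artifact_path):].lstrip("/")
                manifest[rel] = info.file_size
    return manifest
```

A typical check after logging would be find_incomplete(local_manifest("my_large_model_pyfunc"), remote_manifest(run_id, "my_large_model_pyfunc")), where run_id is the run that logged the model; an empty list means every local file is present remotely with a matching size.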
