MergeKit lets you combine the weights of multiple Hugging Face models into a single model that can inherit strengths from each of them.

Let’s see MergeKit in action. Imagine we have two models: mistralai/Mistral-7B-v0.1 (a strong base model) and NousResearch/Nous-Hermes-2-Mistral-7B-DPO (a model fine-tuned for instruction following). We want to merge them so our new model retains Mistral’s foundational knowledge while gaining Hermes’s conversational abilities.

First, ensure you have MergeKit installed:

pip install mergekit

Next, you’ll need a configuration file. This YAML file tells MergeKit which models to merge, how to merge them, and what the output should be.

Here’s a sample configuration for merging Mistral-7B and Nous-Hermes-2-Mistral-7B:

models:
  - model: mistralai/Mistral-7B-v0.1
    parameters:
      weight: 0.5
  - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
    parameters:
      weight: 0.5
merge_method: linear
dtype: float16

Let’s break this down:

  • models: This lists the models we’re combining, each given as a Hugging Face repository ID.
  • parameters:
    • weight: For linear merging, this is the coefficient applied to that model’s weights. Here, 0.5 and 0.5 mean we’re taking a 50/50 average of the weights from mistralai/Mistral-7B-v0.1 and NousResearch/Nous-Hermes-2-Mistral-7B-DPO.
  • merge_method: linear: This specifies the merging algorithm. Linear merging is the simplest: it computes a weighted average of the models’ weights.
  • dtype: float16: This sets the data type for the merged model’s weights; half precision is a common choice for efficiency.

Two things are deliberately absent. Linear merging doesn’t need a base_model entry, although methods such as slerp and task_arithmetic require one to identify the foundational model. And the output location is not part of the configuration: it is passed to the command as its second argument.

One caveat: a weighted average only works when corresponding tensors have the same shape. Fine-tunes that add special tokens can change the vocabulary size, so if the merge fails on the embedding layers, look at the tokenizer_source option described in the MergeKit README.

Now, run the merge command, giving it the config file and the output directory:

mergekit-yaml config.yaml ./merged_mistral_hermes

This command will download the specified models (if not already cached), perform the weight averaging, and save the resulting model to ./merged_mistral_hermes. You can then load this merged model using your favorite Hugging Face library (like transformers) and test its capabilities. You would typically expect it to follow instructions better than the base Mistral model while retaining much of its general language understanding, but that is worth confirming on your own prompts rather than assuming.
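
To try the merged model, something like the following could work. It is a minimal sketch, assuming transformers and torch are installed and that ./merged_mistral_hermes contains both the weights and the tokenizer files. The helper names load_merged_model and generate are hypothetical, chosen only for this illustration.

```python
def load_merged_model(model_dir):
    """Load a tokenizer and causal LM from a local directory.

    Assumes `transformers` (and `torch`) are installed; they are imported
    lazily so this file can be imported without them.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # torch_dtype="auto" keeps the dtype stored in the checkpoint (e.g. float16).
    model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto")
    return tokenizer, model


def generate(model_dir, prompt, max_new_tokens=64):
    """Generate a short completion from the model stored in `model_dir`."""
    tokenizer, model = load_merged_model(model_dir)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (loads the full 7B model, so it needs sufficient memory):
# print(generate("./merged_mistral_hermes", "Explain model merging in one sentence."))
```

Running the same prompts through the base model and the merged model side by side is a simple way to see what the merge actually changed.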

The mental model for merging is about arithmetic on neural network weights. Each weight in a neural network is a parameter that has been learned during training. When you merge models, you’re essentially performing a weighted average or a more complex mathematical operation on these learned parameters. The goal is to create a new set of parameters that captures the essence of multiple training processes. Think of it like blending different flavors: you can combine the sweetness of one ingredient with the tanginess of another to create a new, unique taste. MergeKit automates this blending process for model weights.
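
The arithmetic behind a linear merge can be sketched in a few lines of plain Python. This is an illustration of the idea only, not MergeKit's implementation: the function linear_merge and the toy "state dicts" (parameter names mapped to short lists of floats) are assumptions made for the example, and real models hold tensors with millions of entries.

```python
def linear_merge(state_dicts, weights, normalize=True):
    """Weighted average of parameter dicts that share keys and shapes."""
    if len(state_dicts) != len(weights):
        raise ValueError("need exactly one weight per model")
    if normalize:
        total = sum(weights)
        if total == 0:
            raise ValueError("weights must not sum to zero")
        # Rescale so the coefficients sum to 1.
        weights = [w / total for w in weights]

    merged = {}
    for name in state_dicts[0]:
        tensors = [sd[name] for sd in state_dicts]
        # Combine the i-th entry of every model's tensor.
        merged[name] = [
            sum(w * t[i] for w, t in zip(weights, tensors))
            for i in range(len(tensors[0]))
        ]
    return merged


# Toy parameters standing in for two checkpoints with the same architecture.
base = {"layer.weight": [0.25, -1.0, 0.5], "layer.bias": [0.0, 1.0]}
tuned = {"layer.weight": [0.75, -0.5, 0.5], "layer.bias": [0.5, 1.0]}

merged = linear_merge([base, tuned], [0.5, 0.5])
print(merged)
# {'layer.weight': [0.5, -0.75, 0.5], 'layer.bias': [0.25, 1.0]}
```

Every merged parameter lands halfway between the two inputs, which is exactly what a 50/50 average means at the level of individual weights.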

MergeKit supports various merge methods beyond simple linear averaging, such as slerp (Spherical Linear Interpolation), task_arithmetic, and ties. (DPO, Direct Preference Optimization, is a training technique rather than a merge method; it appears in the Hermes model’s name because that model was trained with it.) Each method has different mathematical underpinnings and can yield different results, often depending on how the merged models relate to their shared base. For instance, task_arithmetic is designed to combine specific learned "tasks" or "abilities" by adding or subtracting task vectors (differences between fine-tuned and base weights) directly in weight space, rather than just averaging weights.
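
To make the difference between slerp and a plain average concrete, here is a minimal sketch of spherical linear interpolation on two flat vectors. It follows the textbook formula and is not a reproduction of MergeKit's code; the function name slerp, the eps threshold, and the fallback to linear interpolation for near-parallel vectors are assumptions made for the illustration.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Interpolate from v0 (t=0) to v1 (t=1) along the arc between them."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    if norm0 < eps or norm1 < eps:
        # A zero vector has no direction; fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    # Angle between the two vectors, clamped for numerical safety.
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_theta = max(-1.0, min(1.0, cos_theta))
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    if abs(sin_theta) < eps:
        # Nearly parallel vectors: the arc degenerates to a straight line.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    s0 = math.sin((1 - t) * theta) / sin_theta
    s1 = math.sin(t * theta) / sin_theta
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


midpoint = slerp(0.5, [1.0, 0.0], [0.0, 1.0])
length = math.sqrt(sum(x * x for x in midpoint))
print(midpoint, length)
```

For these two unit vectors the slerp midpoint keeps a length of about 1, whereas the plain average [0.5, 0.5] shrinks to a length of about 0.707. Preserving magnitude in this way is one reason slerp can behave differently from linear merging.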

A common misconception is that merging is always a simple average. While linear is straightforward, more advanced methods like slerp or task_arithmetic involve more sophisticated operations. For example, task_arithmetic subtracts the weights of a base model from a fine-tuned model to isolate the "task delta" (the changes made during fine-tuning), and then adds one or more such deltas, optionally scaled, back onto the base model. This allows for combining specific capabilities learned by different fine-tuned models without averaging all their learned features.
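
The "task delta" idea can be sketched as follows. This is a toy illustration under stated assumptions, not MergeKit's implementation: task_vector and apply_task_vectors are hypothetical helper names, and the parameter dicts hold two numbers each in place of real tensors.

```python
def task_vector(finetuned, base):
    """Per-parameter difference between a fine-tuned model and its base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def apply_task_vectors(base, vectors, weights):
    """Add scaled task vectors back onto the base parameters."""
    merged = {}
    for name, base_values in base.items():
        merged[name] = [
            b + sum(w * v[name][i] for w, v in zip(weights, vectors))
            for i, b in enumerate(base_values)
        ]
    return merged


# Toy checkpoints: two fine-tunes that each changed a different parameter.
base = {"w": [1.0, 2.0]}
chat = {"w": [1.5, 2.0]}
code = {"w": [1.0, 3.0]}

chat_delta = task_vector(chat, base)  # {'w': [0.5, 0.0]}
code_delta = task_vector(code, base)  # {'w': [0.0, 1.0]}

merged = apply_task_vectors(base, [chat_delta, code_delta], [1.0, 0.5])
print(merged)
# {'w': [1.5, 2.5]}
```

The chat change is applied in full and the code change at half strength, while parameters that neither fine-tune touched would stay at their base values. A plain average of the three models could not express that.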

The next step is often exploring more complex merge methods to achieve specific outcomes.
