
A guide to mergekit and making your own Large Language Models

Are you intrigued by the idea of crafting your own Large Language Model (LLM) but unsure where to begin? Model merging is an affordable way to build one with ease.

This article introduces the concept of model merging for Large Language Models. Follow the steps, try it out for yourself, and deploy your own customized models!

mergekit is a toolkit for merging pre-trained language models. It uses an out-of-core approach to perform unreasonably elaborate merges in resource-constrained situations: merges can run entirely on CPU or be accelerated with as little as 8 GB of VRAM. It lets users combine multiple pre-existing language models into a single model, drawing on the strengths of each to create something more powerful and versatile. Many merging algorithms are supported, as shown below.

Features:

  • Supports Llama, Mistral, GPT-NeoX and more
  • Many merge methods
  • GPU or CPU execution
  • Lazy loading of tensors for low memory use
  • Interpolated gradients for parameter values
  • Piecewise assembly of language models from layers ("Frankenmerging")

Here is a quick overview of the currently supported merge methods:

Method | merge_method value | Multi-Model | Uses base model
--- | --- | --- | ---
Linear (Model Soups) | linear | Yes | No
SLERP | slerp | No | Yes
Task Arithmetic | task_arithmetic | Yes | Yes
TIES | ties | Yes | Yes
DARE TIES | dare_ties | Yes | Yes
DARE Task Arithmetic | dare_linear | Yes | Yes
Passthrough | passthrough | No | No

In this section, we'll delve into three of the merge methods currently available in mergekit: SLERP, TIES, and DARE. Others, such as Linear and Task Arithmetic, are also supported. If you're intrigued by the idea of creating your own LLM through model merging, it's worth reading through this article and following the steps.

1. SLERP

Spherical Linear Interpolation (SLERP) is a method for smoothly interpolating between two vectors. It maintains a constant rate of change and preserves the geometric properties of the spherical space in which the vectors lie.

SLERP offers several advantages over traditional linear interpolation. In high-dimensional spaces, linear interpolation may cause a reduction in the magnitude of the interpolated vector, thereby diminishing the scale of weights. Furthermore, the change in direction of the weights often signifies more meaningful information, such as feature learning and representation, compared to the magnitude of change.

The implementation of SLERP entails the following steps (a minimal code sketch follows the list):

  1. Normalize the input vectors to unit length, ensuring they represent directions rather than magnitudes.
  2. Calculate the angle between these vectors using their dot product.
  3. If the vectors are nearly collinear, fall back to linear interpolation for efficiency. Otherwise, compute the SLERP scale factors from the interpolation factor t (t=0 returns 100% of the first model, t=1 returns 100% of the second) and the angle between the vectors.
  4. Utilize these factors to weigh the original vectors, which are then summed to obtain the interpolated vector.
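To make these steps concrete, here is a minimal NumPy sketch of SLERP between two flattened weight vectors. It is an illustration under assumed inputs (random vectors, a single t value), not mergekit's actual implementation, which applies the interpolation per tensor with its own edge-case handling.

import numpy as np

def slerp(v0, v1, t, eps=1e-8):
    # 1. Normalize the inputs so they represent directions rather than magnitudes.
    v0_n = v0 / (np.linalg.norm(v0) + eps)
    v1_n = v1 / (np.linalg.norm(v1) + eps)

    # 2. Angle between the vectors from their dot product.
    dot = np.clip(np.dot(v0_n, v1_n), -1.0, 1.0)

    # 3. Nearly collinear vectors: fall back to plain linear interpolation.
    if abs(dot) > 0.9995:
        return (1 - t) * v0 + t * v1

    # Otherwise compute the SLERP scale factors from t and the angle.
    theta = np.arccos(dot)
    s0 = np.sin((1 - t) * theta) / np.sin(theta)
    s1 = np.sin(t * theta) / np.sin(theta)

    # 4. Weigh the original vectors and sum them.
    return s0 * v0 + s1 * v1

# t=0 returns the first vector, t=1 the second, t=0.5 a spherical midpoint.
merged = slerp(np.random.randn(4096), np.random.randn(4096), t=0.5)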

While SLERP remains the most popular merging method, it's restricted to combining only two models at a time. Nonetheless, it's still feasible to hierarchically combine multiple models, as exemplified in Mistral-7B-Merge-14-v0.1.

Example Configuration:

slices:
  - sources:
      - model: (add Model A from Hugging Face Repo)
        layer_range: [0, 32]
      - model: (add Model B from Hugging Face Repo)
        layer_range: [0, 32]
merge_method: slerp
base_model: Model A (OpenPipe/mistral-ft-optimized-1218 for example)
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16

This configuration exemplifies a classic SLERP setup applied to every layer of both models. Notably, it supplies a gradient of values for the interpolation factor t: the self-attention and MLP layers use different mixes of OpenPipe/mistral-ft-optimized-1218 and model B, while the remaining layers are a 50/50 blend of the two. At t=0 the merge returns the base_model, and at t=1 it returns the other model.
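As a rough illustration of how such a value list can turn into per-layer interpolation factors, the sketch below spreads the anchors [0, 0.5, 0.3, 0.7, 1] across 32 layers with piecewise-linear interpolation. The layer count and the exact interpolation scheme are assumptions for illustration; mergekit handles this internally.

import numpy as np

anchors = [0, 0.5, 0.3, 0.7, 1]               # t values from the config above
num_layers = 32                               # assumed layer count (Mistral-7B style)
anchor_pos = np.linspace(0, 1, len(anchors))  # relative depth of each anchor
layer_pos = np.linspace(0, 1, num_layers)     # relative depth of each layer
t_per_layer = np.interp(layer_pos, anchor_pos, anchors)

print(t_per_layer[:3])   # early layers: t near 0, i.e. mostly the base model
print(t_per_layer[-3:])  # late layers: t near 1, i.e. mostly the other model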

2. TIES

TIES-Merging is designed to efficiently merge multiple task-specific models into a single multitask model, addressing two primary challenges:

  1. Redundancy in model parameters: It identifies and eliminates redundant parameters within task-specific models by focusing on the most significant changes made during fine-tuning and discarding the rest.
  2. Disagreement between parameter signs: Conflicts arising from different models suggesting opposing adjustments to the same parameter are resolved by creating a unified sign vector representing the most dominant direction of change across all models.

TIES-Merging comprises the following three steps:

  1. Trim: Reduces redundancy in task-specific models by retaining only a fraction of the most significant parameters (density parameter) and resetting the rest to zero.
  2. Elect Sign: Resolves sign conflicts across different models by creating a unified sign vector based on the most dominant direction (positive or negative) in terms of cumulative magnitude.
  3. Disjoint Merge: Averages parameter values aligned with the unified sign vector, excluding zero values.

Unlike SLERP, TIES can merge multiple models simultaneously. For more detail, check out the academic paper at https://arxiv.org/abs/2306.01708.
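As a toy illustration of the three steps, the sketch below applies trim, sign election, and disjoint merging to the delta vectors of a few fine-tuned models. It is a simplified NumPy version under assumed inputs, not mergekit's actual implementation, and it omits the per-model weights and normalization used in the config below.

import numpy as np

def ties_merge(base, finetuned, density=0.5):
    deltas = [m - base for m in finetuned]  # task vectors (differences from the base)

    # 1. Trim: keep only the top-`density` fraction of each delta by magnitude.
    trimmed = []
    for d in deltas:
        k = max(int(density * d.size), 1)
        threshold = np.sort(np.abs(d))[-k]
        trimmed.append(np.where(np.abs(d) >= threshold, d, 0.0))
    stacked = np.stack(trimmed)

    # 2. Elect sign: per parameter, keep the sign with the larger cumulative magnitude.
    pos = np.clip(stacked, 0, None).sum(axis=0)
    neg = -np.clip(stacked, None, 0).sum(axis=0)
    elected_sign = np.where(pos >= neg, 1.0, -1.0)

    # 3. Disjoint merge: average only the nonzero values that agree with the elected sign.
    agree = np.sign(stacked) == elected_sign
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_delta = (stacked * agree).sum(axis=0) / counts

    return base + merged_delta

base = np.random.randn(1000)
models = [base + 0.01 * np.random.randn(1000) for _ in range(3)]
merged = ties_merge(base, models, density=0.5)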

Example Configuration:

models:
  - model: mistralai/Mistral-7B-v0.1
    # no parameters necessary for the base model
  - model: OpenPipe/mistral-ft-optimized-1218
    parameters:
      density: 0.5
      weight: 0.5
  - model: Add your own to merge with base from hugging face repo
    parameters:
      density: 0.5
      weight: 0.3
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  normalize: true
dtype: float16

In this configuration, Mistral-7B serves as the base model used to compute the delta weights. The merge combines mistral-ft-optimized-1218 (weight 0.5) with a third model of your choice from the Hugging Face Hub (weight 0.3; NeuralHermes-2.5-Mistral-7B is a popular example), with normalization enabled. density is the fraction of the weight differences from the base model (the delta parameters) retained from each model, here 50%. The weights don't need to sum exactly to 1 because they are normalized internally.

3. DARE

DARE shares similarities with TIES but incorporates two primary distinctions:

  1. Pruning: DARE randomly resets fine-tuned weights to their original values, those of the base model.
  2. Rescaling: DARE rescales the weights to maintain the expectations of model outputs approximately unchanged. It achieves this by adding the rescaled weights of both (or more) models to the weights of the base model with a scale factor.

Mergekit's implementation of this method offers two flavors: with the sign election step of TIES (dare_ties) or without (dare_linear).
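Below is a toy NumPy sketch of the drop-and-rescale step on a single delta vector. It is illustrative only and assumes a drop rate of 0.9 (the density value in the configs corresponds to 1 minus this drop rate); mergekit's real implementation works per tensor and combines the rescaled deltas from all models.

import numpy as np

def dare(delta, drop_rate=0.9, seed=0):
    rng = np.random.default_rng(seed)

    # 1. Pruning: randomly reset a fraction `drop_rate` of the fine-tuned
    #    deltas to zero, i.e. back to the base model's values.
    mask = rng.random(delta.shape) >= drop_rate
    pruned = delta * mask

    # 2. Rescaling: divide the surviving deltas by (1 - drop_rate) so the
    #    expected contribution to the model's outputs stays roughly unchanged.
    return pruned / (1.0 - drop_rate)

base = np.random.randn(1000)
finetuned = base + 0.005 * np.random.randn(1000)
rescaled_delta = dare(finetuned - base, drop_rate=0.9)
merged = base + 0.5 * rescaled_delta  # 0.5 plays the role of the per-model weight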

Example Configuration:

models:
  - model: mistralai/Mistral-7B-v0.1
    # No parameters necessary for the base model
  - model: samir-fama/SamirGPT-v1
    parameters:
      density: 0.53
      weight: 0.4
  - model: abacusai/Slerp-CM-mist-dpo
    parameters:
      density: 0.53
      weight: 0.3
  - model: EmbeddedLLM/Mistral-7B-Merge-14-v0.2
    parameters:
      density: 0.53
      weight: 0.3
merge_method: dare_ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  int8_mask: true
dtype: bfloat16

In this configuration, we merge three models fine-tuned from Mistral-7B using dare_ties. Here the weights sum to 1 (in general, their sum should fall between 0.9 and 1.1). As noted above, DARE can be used either with the sign consensus algorithm of TIES (dare_ties) or without it (dare_linear).

For more information on DARE, check out the paper "Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch" (https://arxiv.org/abs/2311.03099). From the abstract:

In this paper, we uncover that Language Models (LMs), either encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of homologous models without retraining or GPUs. Typically, new abilities of LMs can be imparted by Supervised Fine-Tuning (SFT), reflected in the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters). We initially observe that by introducing a novel operation called DARE (Drop And REscale), most delta parameters can be directly set to zeros without affecting the capabilities of SFT LMs and larger models can tolerate a higher proportion of discarded parameters. Based on this observation, we further sparsify delta parameters of multiple SFT homologous models with DARE and subsequently merge them into a single model by parameter averaging. We conduct experiments on eight datasets from the GLUE benchmark with BERT and RoBERTa. We also merge WizardLM, WizardMath, and Code Alpaca based on Llama 2. Experimental results show that: (1) The delta parameter value ranges for SFT models are typically small, often within 0.005, and DARE can eliminate 99% of them effortlessly. However, once the models are continuously pre-trained, the value ranges can grow to around 0.03, making DARE impractical. We have also tried to remove fine-tuned instead of delta parameters and find that a 10% reduction can lead to drastically decreased performance (even to 0). This highlights that SFT merely stimulates the abilities via delta parameters rather than injecting new abilities into LMs; (2) DARE can merge multiple task-specific LMs into one LM with diverse abilities. For instance, the merger of WizardLM and WizardMath improves the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following ability while surpassing WizardMath's original 64.2 performance. Codes are available at https://github.com/yule-BUAA/MergeLM.

Merging Models Practical Example:

To merge models using Mergekit and upload the resulting model to the Hugging Face Hub, follow these steps:

1. Installing Mergekit

Install Mergekit directly from the source by running the following commands:

!git clone https://github.com/arcee-ai/mergekit.git
!cd mergekit && pip install -q -e .

2. Loading Merge Configuration

Before merging models, you need to prepare a configuration file in YAML format specifying the models to be merged and any parameters for the merge operation. Set the name of the merged model for later use and save the configuration as a YAML file. For instance:

import yaml

MODEL_NAME = "Add your model"
yaml_config = """
# Merge configuration goes here
slices:
  - sources:
      - model: model1
        layer_range: [0, 32]
      - model: model2
        layer_range: [0, 32]
merge_method: slerp
base_model: model1
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16

"""

# Save config as yaml file
with open('config.yaml', 'w', encoding="utf-8") as f:
    f.write(yaml_config)

  • slices define the segments of each model to include in the merge.
  • merge_method specifies the merging technique to use (e.g., SLERP, TIES, DARE).
  • base_model is the primary model onto which the others are merged.
  • parameters provide additional settings for the merge operation.

3. Running Merge Command

Run the merge command with specified parameters to merge the models:

!mergekit-yaml config.yaml merge --copy-tokenizer --allow-crimes --out-shard-size 1B --lazy-unpickle

This command merges the models according to config.yaml and writes the result to the merge output folder. Ensure you have the necessary permissions and resources to run it.
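As a quick sanity check (a sketch, assuming the command above completed successfully), the merged weights, tokenizer, and config should now sit in the local merge folder named in the command:

import os

# List the output folder produced by mergekit-yaml; expect config.json,
# tokenizer files, and one or more model weight shards.
print(sorted(os.listdir("merge")))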

4. Creating README File

Create a README file containing all the necessary information for reproducibility. Use a Jinja template to fill in the data from the merge configuration:

from huggingface_hub import ModelCard
from jinja2 import Template
import yaml

# Template for README
template_text = """
# {{ model_name }}

{{ model_name }} is a merge of the following models using [mergekit](https://github.com/arcee-ai/mergekit):

{%- for model in models %}
* [{{ model }}](https://huggingface.co/{{ model }})
{%- endfor %}

## 🧩 Configuration

```yaml
{{- yaml_config -}}
```
"""

# Build the Jinja template and collect the model names from the merge configuration
# (assumes the slice-based config defined above; adapt for models:-style configs)
jinja_template = Template(template_text)
data = yaml.safe_load(yaml_config)
models = [source["model"] for source in data["slices"][0]["sources"]]

# Fill the template
content = jinja_template.render(
    model_name=MODEL_NAME,
    models=models,
    yaml_config=yaml_config,
)

# Save the model card
card = ModelCard(content)
card.save('merge/README.md')

5. Pushing to Hugging Face Hub

Push the entire folder containing the merged model to the Hugging Face Hub:

# Example code
from google.colab import userdata
from huggingface_hub import HfApi

# Define username and API
username = "#############"
api = HfApi(token=userdata.get("HF_TOKEN"))  # outside Colab, use HfApi() with a token from `huggingface-cli login` or the HF_TOKEN env var

# Create repository and upload folder
api.create_repo(
    repo_id=f"{username}/{MODEL_NAME}",
    repo_type="model"
)
api.upload_folder(
    repo_id=f"{username}/{MODEL_NAME}",
    folder_path="merge",
)

Now, your merged model is available on the Hugging Face Hub!
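To try the model out, you can load it back from the Hub with transformers. This is a sketch that reuses the username and MODEL_NAME variables from the steps above and assumes accelerate is installed for device_map="auto".

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = f"{username}/{MODEL_NAME}"  # e.g. "your-username/your-merged-model"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype="auto", device_map="auto")

prompt = "Explain model merging in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))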