A guide to mergekit and making your own Large Language Models
This article introduces the concept of model merging for Large Language Models. Follow the steps, experiment as you go, and deploy your own customized models!
mergekit
is a toolkit for merging pre-trained language models. It uses an out-of-core approach to perform unreasonably elaborate merges in resource-constrained situations. Merges can be run entirely on CPU or accelerated with as little as 8 GB of VRAM, and users can combine multiple pre-existing language models into a single, larger model. By combining the strengths of different models, mergekit allows for the creation of more powerful and versatile language models. As shown below, many merging algorithms are supported.
Features:
- Supports Llama, Mistral, GPT-NeoX and more
- Many merge methods
- GPU or CPU execution
- Lazy loading of tensors for low memory use
- Interpolated gradients for parameter values
- Piecewise assembly of language models from layers ("Frankenmerging")
Here is a quick overview of the currently supported merge methods:
| Method | `merge_method` value | Multi-Model | Uses base model |
|---|---|---|---|
| Linear (Model Soups) | `linear` | ✅ | ❌ |
| SLERP | `slerp` | ❌ | ✅ |
| Task Arithmetic | `task_arithmetic` | ✅ | ✅ |
| TIES | `ties` | ✅ | ✅ |
| DARE TIES | `dare_ties` | ✅ | ✅ |
| DARE Task Arithmetic | `dare_linear` | ✅ | ✅ |
| Passthrough | `passthrough` | ❌ | ❌ |
In this section, we'll take a closer look at three of the merge methods currently available in mergekit: SLERP, TIES, and DARE. Note that additional methods exist, such as Linear and Task Arithmetic. If you're interested in creating your own LLM via model merging, read on and follow the steps.
1. SLERP
Spherical Linear Interpolation (SLERP) is a method for smoothly interpolating between two vectors. It maintains a constant rate of change while preserving the geometric properties of the spherical space in which the vectors reside.
SLERP offers several advantages over traditional linear interpolation. In high-dimensional spaces, linear interpolation may cause a reduction in the magnitude of the interpolated vector, thereby diminishing the scale of weights. Furthermore, the change in direction of the weights often signifies more meaningful information, such as feature learning and representation, compared to the magnitude of change.
The implementation of SLERP entails the following steps (a minimal code sketch follows the list):
- Normalize the input vectors to unit length, ensuring they represent directions rather than magnitudes.
- Calculate the angle between these vectors using their dot product.
- If the vectors are nearly collinear, default to linear interpolation for efficiency. Otherwise, compute SLERP scale factors based on the interpolation factor t (t=0 yields 100% of the first model, t=1 yields 100% of the second) and the angle between the vectors.
- Utilize these factors to weigh the original vectors, which are then summed to obtain the interpolated vector.
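To make these steps concrete, here is a minimal NumPy sketch of SLERP applied to two flattened weight vectors. It is an illustration of the algorithm described above, not mergekit's actual implementation:

```python
import numpy as np

def slerp(v0: np.ndarray, v1: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Spherically interpolate between two flattened weight vectors."""
    # Step 1: normalize copies to unit length so we compare directions
    v0_unit = v0 / np.linalg.norm(v0)
    v1_unit = v1 / np.linalg.norm(v1)

    # Step 2: angle between the vectors via their dot product
    dot = np.clip(np.dot(v0_unit, v1_unit), -1.0, 1.0)

    # Step 3: nearly collinear vectors -> fall back to linear interpolation
    if 1.0 - abs(dot) < eps:
        return (1 - t) * v0 + t * v1

    theta = np.arccos(dot)
    # SLERP scale factors for the interpolation factor t
    s0 = np.sin((1 - t) * theta) / np.sin(theta)
    s1 = np.sin(t * theta) / np.sin(theta)

    # Step 4: weigh the original vectors and sum them
    return s0 * v0 + s1 * v1

# Example: halfway blend of two toy "weight" vectors
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(slerp(a, b, 0.5))  # -> [0.7071 0.7071]
```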
While SLERP remains the most popular merging method, it's restricted to combining only two models at a time. Nonetheless, it's still feasible to hierarchically combine multiple models, as exemplified in Mistral-7B-Merge-14-v0.1.
Example Configuration:
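The following configuration merges two hypothetical 32-layer fine-tunes (the model names are placeholders). The interpolation factor t varies per layer and per tensor type, which is the "interpolated gradients" feature mentioned earlier:

```yaml
slices:
  - sources:
      - model: org-a/model-a-7b        # placeholder model names
        layer_range: [0, 32]
      - model: org-b/model-b-7b
        layer_range: [0, 32]
merge_method: slerp
base_model: org-a/model-a-7b
parameters:
  t:
    - filter: self_attn                # attention tensors blend on this gradient
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp                      # MLP tensors use the mirrored gradient
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5                       # all other tensors: equal blend
dtype: bfloat16
```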
2. TIES
TIES-Merging is designed to efficiently merge multiple task-specific models into a single multitask model, addressing two primary challenges:
- Redundancy in model parameters: It identifies and eliminates redundant parameters within task-specific models by focusing on the most significant changes made during fine-tuning and discarding the rest.
- Disagreement between parameter signs: Conflicts arising from different models suggesting opposing adjustments to the same parameter are resolved by creating a unified sign vector representing the most dominant direction of change across all models.
TIES-Merging comprises the following three steps (illustrated in the toy sketch after the list):
- Trim: Reduces redundancy in task-specific models by retaining only a fraction of the most significant parameters (density parameter) and resetting the rest to zero.
- Elect Sign: Resolves sign conflicts across different models by creating a unified sign vector based on the most dominant direction (positive or negative) in terms of cumulative magnitude.
- Disjoint Merge: Averages parameter values aligned with the unified sign vector, excluding zero values.
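The following toy NumPy sketch shows the three steps on flattened weight vectors. It is a simplified illustration assuming equal per-model weights, not mergekit's actual implementation:

```python
import numpy as np

def ties_merge(base: np.ndarray, finetuned: list, density: float = 0.5) -> np.ndarray:
    """Toy TIES merge of flattened weight vectors (illustration only)."""
    # Task vectors: each model's delta from the shared base
    deltas = [ft - base for ft in finetuned]

    # 1. Trim: keep only the top-`density` fraction of each delta by magnitude
    trimmed = []
    for d in deltas:
        k = max(int(density * d.size), 1)
        threshold = np.sort(np.abs(d))[-k]
        trimmed.append(np.where(np.abs(d) >= threshold, d, 0.0))
    stacked = np.stack(trimmed)

    # 2. Elect sign: dominant direction per parameter by cumulative magnitude
    elected = np.sign(stacked.sum(axis=0))

    # 3. Disjoint merge: average only the values agreeing with the elected sign
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_delta = (stacked * agree).sum(axis=0) / counts

    return base + merged_delta
```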
Unlike SLERP, TIES can merge multiple models simultaneously. For more details, see the original paper: https://arxiv.org/abs/2306.01708
Example Configuration:
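A TIES configuration merging two fine-tunes onto a shared base might look like this (model names are placeholders; density and weight are per-model knobs):

```yaml
models:
  - model: org-base/base-model-7b   # base model: no parameters needed
  - model: org-a/finetune-a-7b
    parameters:
      density: 0.5    # keep the top 50% of delta parameters (Trim step)
      weight: 0.5
  - model: org-b/finetune-b-7b
    parameters:
      density: 0.5
      weight: 0.3
merge_method: ties
base_model: org-base/base-model-7b
parameters:
  normalize: true     # renormalize so the model weights sum to 1
dtype: float16
```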
3. DARE
DARE shares similarities with TIES but incorporates two primary distinctions:
- Pruning: DARE randomly resets fine-tuned weights to their original values (those of the base model).
- Rescaling: DARE rescales the remaining weights to keep the expected model outputs approximately unchanged. It achieves this by adding the rescaled weights of both (or more) models to the base model's weights with a scale factor.
Mergekit's implementation of this method offers two flavors: with the sign election step of TIES (dare_ties) or without (dare_linear).
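A minimal NumPy sketch of the drop-and-rescale idea for a single model's delta (illustration only; the drop probability p corresponds to 1 - density in the configuration below):

```python
import numpy as np

def dare(base: np.ndarray, finetuned: np.ndarray, p: float = 0.9, seed: int = 0) -> np.ndarray:
    """Toy DARE: randomly drop fine-tuned deltas with probability p, rescale the rest."""
    rng = np.random.default_rng(seed)
    delta = finetuned - base

    # Drop: reset a random fraction p of the deltas to zero (back to base values)
    keep_mask = rng.random(delta.shape) >= p

    # Rescale: divide the survivors by (1 - p) so the expected delta is unchanged
    rescaled = (delta * keep_mask) / (1.0 - p)
    return base + rescaled
```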
Example Configuration:
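An example dare_ties configuration (placeholder model names; switch merge_method to dare_linear to skip the sign-election step):

```yaml
models:
  - model: org-base/base-model-7b
  - model: org-a/finetune-a-7b
    parameters:
      density: 0.53   # fraction of delta weights retained after the random drop
      weight: 0.4
  - model: org-b/finetune-b-7b
    parameters:
      density: 0.53
      weight: 0.6
merge_method: dare_ties
base_model: org-base/base-model-7b
parameters:
  int8_mask: true
dtype: bfloat16
```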
Merging Models: A Practical Example
To merge models using Mergekit and upload the resulting model to the Hugging Face Hub, follow these steps:
1. Installing Mergekit
Install Mergekit directly from the source by running the following commands:
!git clone https://github.com/cg1234567/mergekit.git
!cd mergekit && pip install -q -e .
2. Loading Merge Configuration
Before merging, you need to prepare a configuration file in YAML format. This file specifies the models to be merged, along with any parameters for the merge operation. Define the configuration, choose a name for the merged model for later use, and save everything as a YAML file. For instance:
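A minimal sketch, reusing the DARE TIES recipe from the previous section. The username and all model names are placeholders to replace with your own:

```python
# Placeholders: replace with your Hugging Face username and real model IDs
username = "your-username"
MODEL_NAME = "MyMerged-7B"

yaml_config = """
models:
  - model: org-base/base-model-7b
    # base model: no parameters needed
  - model: org-a/finetune-a-7b
    parameters:
      density: 0.53
      weight: 0.5
  - model: org-b/finetune-b-7b
    parameters:
      density: 0.53
      weight: 0.5
merge_method: dare_ties
base_model: org-base/base-model-7b
parameters:
  int8_mask: true
dtype: bfloat16
"""

# The non-base models, listed for the model card in step 4
models = ["org-a/finetune-a-7b", "org-b/finetune-b-7b"]

# Save the configuration to disk so mergekit can read it in the next step
with open("config.yaml", "w", encoding="utf-8") as f:
    f.write(yaml_config)
```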
3. Running Merge Command
Run the merge command with specified parameters to merge the models:
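A typical invocation looks like the following (run it where config.yaml lives; the exact flag set may vary between mergekit versions). It writes the merged model to the merge folder used in the remaining steps:

```bash
!mergekit-yaml config.yaml merge \
    --copy-tokenizer \
    --allow-crimes \
    --out-shard-size 1B \
    --lazy-unpickle
```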
4. Creating README File
Create a README file containing all the necessary information for reproducibility. Use a Jinja template to fill in the data from the merge configuration:
from huggingface_hub import ModelCard
from jinja2 import Template

# Jinja template for the README
template_text = """
# {{ model_name }}

{{ model_name }} is a merge of the following models using [mergekit](https://github.com/cg1234567/mergekit):

{%- for model in models %}
* [{{ model }}](https://huggingface.co/{{ model }})
{%- endfor %}

## 🧩 Configuration

```yaml
{{- yaml_config -}}
```
"""

# Compile the template, then fill it with the values defined in step 2
jinja_template = Template(template_text.strip())
content = jinja_template.render(
    model_name=MODEL_NAME,
    models=models,
    yaml_config=yaml_config,
    username=username,
)

# Save the model card next to the merged weights
card = ModelCard(content)
card.save('merge/README.md')
5. Pushing to Hugging Face Hub
Push the entire folder containing the merged model to the Hugging Face Hub:
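A minimal sketch using the huggingface_hub client. It assumes you are authenticated (e.g. via an HF_TOKEN environment variable or huggingface-cli login) and reuses the username and MODEL_NAME placeholders defined in step 2:

```python
from huggingface_hub import HfApi

api = HfApi()

# Create the target repository (no-op if it already exists)
api.create_repo(repo_id=f"{username}/{MODEL_NAME}", repo_type="model", exist_ok=True)

# Upload the entire folder produced by mergekit, including the README
api.upload_folder(
    repo_id=f"{username}/{MODEL_NAME}",
    folder_path="merge",
)
```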