Fine-tune Mixtral 8x7b on AWS SageMaker and Deploy to RunPod

A hands-on tutorial

Solano Todeschini
11 min read · Dec 22, 2023
Image Source: https://mistral.ai/

In this article, we will explore an example pipeline used to fine-tune the instruction version of the recently released Mixtral model from Mistral AI.

Unlike its predecessor, Mistral-7B, Mixtral is built on a sparse Mixture-of-Experts (MoE) Transformer architecture: a gating layer routes each token to a small subset of specialized expert networks, which increases model capacity without a proportional increase in compute per token.

The Mixtral 8x7B model comprises 8 experts of roughly 7 billion parameters each, with only 2 experts active for any given token.
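To make the routing idea concrete, here is a minimal, self-contained PyTorch sketch of top-2 expert routing. It is an illustration only, not Mixtral's actual implementation (the class and parameter names are made up):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTop2MoE(nn.Module):
    """Toy top-2 Mixture-of-Experts layer (illustration only, not Mixtral's real code)."""

    def __init__(self, hidden_size=64, ffn_size=256, num_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                            # x: (num_tokens, hidden_size)
        logits = self.gate(x)                        # (num_tokens, num_experts)
        weights, expert_idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # normalize over the selected experts only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e      # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# each token is processed by only 2 of the 8 experts
layer = ToyTop2MoE()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])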

In our example, we are going to use Hugging Face Transformers, Accelerate, and PEFT.

What this tutorial will cover:

1. Setup Development Environment
2. Load and prepare the dataset
3. Fine-Tune Mixtral with QLoRA on Sagemaker
Hardware requirements
4. Deploy Fine-tuned Mixtral on RunPod
5. Upload model to huggingface hub

Special thanks to Philipp Schmid for providing much of the educational material I used to develop this tutorial!

1. Setup Development Environment

In this tutorial, we are using a Python notebook outside of SageMaker Studio. You can skip the AWS authentication steps if you are already working inside the AWS environment.

pip install -U transformers datasets sagemaker

If you are not already authenticated with your AWS account, install the AWS CLI and configure your access:

pip install awscli
aws configure

The AWS CLI will ask for an Access Key ID and a Secret Access Key, so make sure to have them at hand.

Now, to download Mixtral, you must log in to your Hugging Face account using an access token:

huggingface-cli login --token YOUR_TOKEN

We then need access to an IAM role with the required permissions for SageMaker. You can find more about it in the SageMaker documentation.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

# get the execution role
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='AmazonSageMaker-ExecutionRole-20231209T154667')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Load and prepare the dataset

For the purpose of this tutorial, we will use dolly, an open-source dataset containing 15k instruction pairs.

Example record from dolly:

{
  "instruction": "Who was the first woman to have four country albums reach No. 1 on the Billboard 200?",
  "context": "",
  "response": "Carrie Underwood."
}

To load the databricks/databricks-dolly-15k dataset, we use the load_dataset() method from the 🤗 Datasets library.

from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

For fine-tuning our model through instruction-based learning, we must create a formatting function (create_prompt below) that processes each sample and outputs a string following Mistral's instruction format.

This format must be strictly respected, otherwise the model will generate sub-optimal outputs.

The template used to build a prompt for the Instruct model is defined as follows:

<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]
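As a sanity check, you can compare this template against the chat template that ships with the Mixtral tokenizer. The snippet below loads the tokenizer as mixtral_tokenizer and defines model_id, names we will reuse in later snippets:

from transformers import AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
mixtral_tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Who was the first woman to have four country albums reach No. 1 on the Billboard 200?"}]
# renders "<s>[INST] ... [/INST]" using the template bundled with the tokenizer
print(mixtral_tokenizer.apply_chat_template(messages, tokenize=False))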

And the function to format prompts:

# Build an instruction-formatted prompt from one dolly record
def create_prompt(sample):
    bos_token = "<s>"
    eos_token = "</s>"
    system_message = "[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
    question = sample["instruction"].strip()
    answer = sample["response"].strip()

    full_prompt = bos_token
    full_prompt += system_message
    full_prompt += "\n" + question
    full_prompt += " [/INST]\n\n"
    full_prompt += answer
    full_prompt += eos_token

    return full_prompt

Example of resulting formatted prompt:

<s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request.

Who was the first woman to have four country albums reach No. 1 on the Billboard 200?? [/INST]

Carrie Underwood.</s>
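The tokenization step below expects each record collapsed into a single text column. A minimal way to build it with the create_prompt function defined above (hf_df matches the name the next snippet uses):

# apply the prompt template to every record and keep only the resulting text column
hf_df = dataset.map(
    lambda sample: {"text": create_prompt(sample)},
    remove_columns=list(dataset.features),
)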

Finally, we can chunk and upload our dataset to S3.

from itertools import chain
from functools import partial

# empty dict to save the remainder from each batch, reused in the next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # use the global remainder variable to carry tokens over between batches
    global remainder
    # Concatenate all texts and add the remainder from the previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get the total number of tokens in the batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get the max number of full chunks in the batch
    if batch_total_length >= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split into chunks of chunk_length
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # save the remainder for the next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk the dataset
lm_dataset = hf_df.map(
    lambda sample: mixtral_tokenizer(sample["text"]), batched=True, remove_columns=list(hf_df.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# save train_dataset to S3
bucket_name = sess.default_bucket()  # or your own bucket, e.g. "mixtral-tutorial"
training_input_path = f's3://{bucket_name}/train'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

# uploaded data to:
# training dataset to: s3://mixtral-tutorial/train

3. Fine-Tune Mixtral with QLoRA on Sagemaker

To reduce the memory footprint of our training job, we will use QLoRA (Dettmers et al., 2023), which builds on LoRA, a method introduced by Microsoft Research in 2021.

QLoRA freezes the pre-trained weights and quantizes them to 4-bit, then injects small trainable low-rank adapter matrices whose outputs are added to those of the frozen layers. Only these new matrices are fine-tuned, not the original model, which leads to large memory savings.
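For intuition, here is a rough sketch of how a QLoRA setup typically looks with peft and bitsandbytes; the run_clm.py script referenced below does the equivalent inside the training job, so treat this only as an illustration (the target_modules list is an assumption, while r and lora_alpha match the values reported in the results section):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# load the frozen base model with 4-bit (NF4) quantized weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

# attach small trainable low-rank adapters; only these are updated during training
lora_config = LoraConfig(
    r=256,
    lora_alpha=128,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: attention projections only
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()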

To trigger the tuning process, we use this run_clm.py script in the same way as in this reference code from Philipp Schmid. The script merges the LoRA weights into the main model weights after training, so the model can be used normally without extra code. It is important to include requirements.txt together with the script in the source_dir folder, so that SageMaker installs the necessary libraries, including peft.

Tip: only include the most recent versions of the libraries in the requirements.txt file, especially the transformers library, since older releases may not yet include newly added models such as Mixtral.
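For reference, a minimal requirements.txt for the scripts folder could look like the following (versions are indicative; Mixtral support was added in transformers 4.36):

# scripts/requirements.txt
transformers>=4.36.0
peft
accelerate
bitsandbytes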

Hardware requirements

We ran our experiment on an ml.g5.24xlarge AWS instance, which contains 4x NVIDIA A10G GPUs, using a proprietary curated instruction dataset with domain-specific applications.

Check some of the instance specifications below:

| Compute                     | Value       |
| --------------------------- | ----------- |
| vCPUs | 96 |
| Memory (GiB) | 384.0 |
| Memory per vCPU (GiB) | 4.0 |
| GPU | 4 |
| GPU Architecture | nvidia a10g |
| Video Memory (GiB) | 96 |

To start the fine-tuning job on AWS SageMaker, we first define the training hyperparameters and create the HuggingFace estimator.
import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job
hyperparameters = {
    'model_id': model_id,                           # pre-trained model
    'dataset_path': '/opt/ml/input/data/training',  # path where sagemaker will save the training dataset
    'epochs': 1,                                    # number of training epochs
    'per_device_train_batch_size': 1,               # batch size for training
    'lr': 2e-4,                                     # learning rate used during training
    'hf_token': HfFolder.get_token(),               # huggingface token used to download Mixtral
    'merge_weights': True,                          # whether to merge LoRA into the model (needs more memory)
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.g5.24xlarge',  # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in the training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      = hyperparameters,   # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # cache models in /tmp
)

Then, the .fit() method is called, passing the S3 path of our training dataset.

# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

The final training logs:

2023-12-20 06:14:52 Uploading - Uploading generated training model
tokenizer_config.json:   0%|          | 0.00/1.46k [00:00<?, ?B/s]
tokenizer_config.json: 100%|██████████| 1.46k/1.46k [00:00<00:00, 16.6MB/s]
tokenizer.model: 0%| | 0.00/493k [00:00<?, ?B/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 114MB/s]
tokenizer.json: 0%| | 0.00/1.80M [00:00<?, ?B/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 7.47MB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 7.44MB/s]
special_tokens_map.json: 0%| | 0.00/72.0 [00:00<?, ?B/s]
special_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 1.00MB/s]
2023-12-20 06:14:51,359 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2023-12-20 06:14:51,359 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.
2023-12-20 06:14:51,359 sagemaker-training-toolkit INFO Reporting training SUCCESS

2023-12-20 06:18:53 Completed - Training job completed
Training seconds: 30500
Billable seconds: 30500

In our training, the SageMaker training job took 30,500 seconds, which is about 8.47 hours. The ml.g5.24xlarge instance we used costs about $10.18 per hour for on-demand training jobs, so the total cost for the job came to roughly $86.24.

Overall metrics and parameters:

Dataset size: 10k examples
Language: pt-BR
Domain: Healthcare
Num of epochs: 1
Batch size: 1
Learning rate: 2e-4
Lora config - r: 256
Lora config - alpha: 128
bf16: False

===========
Training time: 30500 seconds | 8.47 hours
Costs of training: 86.24 USD
Number of GPUs: 4x nvidia a10g
Instance type: ml.g5.24xlarge
===========

4. Deploy Fine-tuned Mixtral on RunPod

The last step in this tutorial is to deploy the fine-tuned version of Mixtral 8x7B. We chose RunPod (not a sponsor) because of its easy availability and good overall pricing for on-demand GPUs.

After creating a Pod with 2x A100 GPUs (80 GB each), we access a Jupyter notebook where we can download the model from AWS S3.
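The download snippet below references sess and s3_model_uri, which do not exist on a fresh RunPod machine. A minimal way to set them up, assuming your AWS credentials are configured on the pod (the bucket and job name in the URI are placeholders to replace with your own):

import sagemaker

# AWS credentials must be available on the pod (aws configure or environment variables)
sess = sagemaker.Session()

# S3 URI of the trained artifact; on the training side this is huggingface_estimator.model_data
s3_model_uri = "s3://<your-bucket>/<your-training-job>/output/model.tar.gz"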

import os
import tarfile
from sagemaker.s3 import S3Downloader

local_path = '/content/mixtral-tune'

os.makedirs(local_path, exist_ok=True)

# download model from S3
S3Downloader.download(
    s3_uri=s3_model_uri,       # s3 uri where the trained model is located
    local_path=local_path,     # local path where model.tar.gz is saved
    sagemaker_session=sess     # sagemaker session with access to the training bucket
)

# unzip model
tar = tarfile.open(f"{local_path}/model.tar.gz", "r:gz")
tar.extractall(path=local_path)
tar.close()
os.remove(f"{local_path}/model.tar.gz")

This process of downloading and unzipping the model takes a significant amount of time, since the final model is around 100 GB of files.

After it finishes, we can load the model using the from_pretrained method from the transformers library.

from torch import bfloat16
import transformers

local_path = '/content/mixtral-tune'  # same directory the model was extracted to

model = transformers.AutoModelForCausalLM.from_pretrained(
    local_path,
    trust_remote_code=True,
    torch_dtype=bfloat16,
    device_map='auto'
)
model.eval()

============

# MixtralForCausalLM(
#   (model): MixtralModel(
#     (embed_tokens): Embedding(32000, 4096)
#     (layers): ModuleList(
#       (0-31): 32 x MixtralDecoderLayer(
#         (self_attn): MixtralAttention(
#           (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
#           (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
#           (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
#           (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
#           (rotary_emb): MixtralRotaryEmbedding()
#         )
#         (block_sparse_moe): MixtralSparseMoeBlock(
#           (gate): Linear(in_features=4096, out_features=8, bias=False)
#           (experts): ModuleList(
#             (0-7): 8 x MixtralBLockSparseTop2MLP(
#               (w1): Linear(in_features=4096, out_features=14336, bias=False)
#               (w2): Linear(in_features=14336, out_features=4096, bias=False)
#               (w3): Linear(in_features=4096, out_features=14336, bias=False)
#               (act_fn): SiLU()
#             )
#           )
#         )
#         (input_layernorm): MixtralRMSNorm()
#         (post_attention_layernorm): MixtralRMSNorm()
#       )
#     )
#     (norm): MixtralRMSNorm()
#   )
#   (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
# )

After loading the model in memory, we must retrieve the original tokenizer so we can process the text of the queries we want to send to the model.

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

To pass queries, we must format the text the same way as for training, but without appending the </s> end-of-sequence token.

def instruction_format(sys_message: str, query: str):
    # note: do not add "</s>" at the end
    return f'<s> [INST] {sys_message} \n\n###Instruction: {query} [/INST]\n\n###Response: '

Example input prompt:

sys_msg = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
query = """Who was the first woman to have four country albums reach No. 1 on the Billboard 200??"""

input_prompt = instruction_format(sys_msg, query)

print(input_prompt)

# <s>[INST] Below is an instruction that describes a task. Write a response that appropriately completes the request.

# ###Instruction: Who was the first woman to have four country albums reach No. 1 on the Billboard 200?? [/INST]

# ###Response:

Now, it's very simple and straightforward to generate an output using the transformers pipeline method.

generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=False,  # if using langchain set True
    task="text-generation",
    max_new_tokens=512,      # max number of tokens to generate in the output
    repetition_penalty=1.1   # if output begins repeating, increase
)

Great! Now for the moment we have been waiting for: we can call the fine-tuned model!

res = generate_text(input_prompt)
print(res[0]["generated_text"])

Check out one result that came back from our model, which was trained on a dataset focused on the medical domain and written in Brazilian Portuguese:

- O paciente apresenta um quadro clínico sugestivo de neoplasia hepatobiliar, caracterizado por icterícia, acolia, colúria e prurido, além da presença de uma massa palpável no hipocôndrio direito, especificamente no ponto cístico.
- As alterações laboratoriais indicam disfunção hepática com elevação das enzimas transaminases (AST e ALT) e hiperbilirrubinemia predominantemente conjugada (bilirrubina direta), compatíveis com obstrução biliar.
- A localização da massa no ponto cístico sugere envolvimento do ducto cístico ou da vesícula biliar, reforçando a suspeita de neoplasia hepatobiliar.
- Neoplasias hepatobiliares incluem tumores primários do fígado, como o carcinoma hepatocelular e o colangiocarcinoma, bem como metástases para o fígado de outros tipos de câncer.
- O diagnóstico definitivo requer investigações adicionais, como ultrassonografia, tomografia computadorizada ou ressonância magnética, para confirmar a presença de uma lesão hepatobiliar e excluir outras causas de icterícia obstrutiva.

5. Upload model to huggingface hub

To upload our fine-tuned model to the Hugging Face Hub, we must first create a repository.

from huggingface_hub import create_repo

hf_username = "voa-engines"       # your username or organization on huggingface.co
repository_name = "Charcot-v0.1"  # repository name on huggingface.co

# requires being logged in (huggingface-cli login) or a token in the HF_TOKEN environment variable
repo_url = create_repo(f"{hf_username}/{repository_name}", private=True)

Now, with the repository created, we can push our files to the Hub using the upload_folder() method.

from huggingface_hub import HfApi

api = HfApi()

api.upload_folder(
    folder_path=local_path,                      # local directory with the model files
    repo_id=f"{hf_username}/{repository_name}",  # e.g. "voa-engines/Charcot-v0.1"
    repo_type="model",
)

This can also take a long time due to the size of the files, but once it finishes, our model will be accessible through the Hugging Face Hub.
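As a quick check, the uploaded model can then be loaded straight from the Hub. A sketch, assuming you are logged in with a token that can read the private repo:

import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
    "voa-engines/Charcot-v0.1",
    device_map="auto",
    torch_dtype="auto",
    token=True,  # needed for private repos; uses the token stored by huggingface-cli login
)
tokenizer = transformers.AutoTokenizer.from_pretrained("voa-engines/Charcot-v0.1", token=True)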

The final model uploaded to our HF repository. Source: The Author.

What we learned

In this tutorial, we learned to:

  1. Set up the environment, loaded an instruction dataset, and processed it into the format required to train Mixtral.
  2. Saved our dataset in chunks to an AWS S3 bucket.
  3. Fine-tuned Mixtral 8x7B on AWS SageMaker using 4x NVIDIA A10G GPUs.
  4. Downloaded the tuned model on RunPod and ran inference on a human query.

Thank you for reading!

  • Follow me on Linkedin!
  • Solano Todeschini, Founder of voahealth.com
