
Finetune and deploy the new Llama 3 8B model using SFT and vLLM 🤙¶
Welcome!
It's April 18th and Llama 3 just dropped! So let's figure out how to finetune and deploy it so you can start using it :). To get access, you'll need to request it on HuggingFace. It took me about 20 minutes to get approved, so go ahead and sign up to get your token enabled!
Meta released a couple different models today.
- Llama-3-8B
- Llama-3-8B-Instruct
- Llama-3-70B
- Llama-3-70B-Instruct
Llama 3 has an 8k context length, which is pretty small compared to some of the newer models that have been released, and it was trained on 15 trillion tokens using a 24k GPU cluster. Luckily, finetuning only needs a fraction of that compute.
In this notebook, we're going to walk through finetuning the base model from scratch using SFT and then deploying it with vLLM. We'll be releasing a guide that uses Direct Preference Optimization (DPO) as the finetuning method soon!
Note that we will be using the base model in this notebook and not the instruct model.
Help us make this tutorial better! Please provide feedback on the Discord channel or on X.¶
A note about running Jupyter Notebooks: Press Shift + Enter to run a cell. A * in the left-hand cell box means the cell is running. A number means it has completed. If your Notebook is acting weird, you can interrupt a too-long process by interrupting the kernel (Kernel tab -> Interrupt Kernel) or even restarting the kernel (Kernel tab -> Restart Kernel). Note restarting the kernel will require you to run everything from the beginning.
Table of Contents¶
- Install dependencies
- Load in Llama 3 and our dataset
- Set our training arguments
- Deploy as an OpenAI-compatible endpoint
1. Install Dependencies¶
!nvidia-smi
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install trl
import os
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
HfArgumentParser,
TrainingArguments,
pipeline,
logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
# Log in to Hugging Face so we can download the gated Llama 3 weights
from huggingface_hub import login
TOKEN = ""  # paste your Hugging Face access token here
login(TOKEN)
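If you'd rather not paste the token directly into the notebook, huggingface_hub also provides an interactive prompt you can use instead:
# Optional alternative: prompt for the token interactively instead of hard-coding it
# from huggingface_hub import notebook_login
# notebook_login()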
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
fsdp_plugin = FullyShardedDataParallelPlugin(
state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
2. Load in Llama 3 and our dataset¶
Because we are using the base model, there isn't an exact prompt template we have to follow. The dataset we're using follows Llama 3's chat template format, so it should work for downstream tasks that use the Llama 3 chat format. If you're bringing your own data, you can format it however you want, as long as you use the same formatting downstream.
Here's the official Llama 3 chat template for reference.
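A rough sketch of what the format looks like, with the special tokens written out literally. This is for orientation only; the guanaco-llama3-1k dataset we load below already stores its text in this shape, so you don't need to build it yourself.
# Rough illustration of the Llama 3 chat format (special tokens written out literally)
llama3_chat_example = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is the capital of France?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "Paris is the capital of France.<|eot_id|>"
)
print(llama3_chat_example)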
base_model_id = "meta-llama/Meta-Llama-3-8B"
dataset_name = "scooterman/guanaco-llama3-1k"
new_model = "brev-llama3-8b-SFT"
from datasets import load_dataset
dataset = load_dataset(dataset_name, split="train")
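It's worth peeking at one row to confirm the formatting matches the sketch above; the "text" field is what SFTTrainer will read later.
# Quick sanity check: inspect one formatted training example
print(dataset)
print(dataset[0]["text"][:500])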
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# Load the base model and shard it across the available GPUs
model = AutoModelForCausalLM.from_pretrained(base_model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_eos_token=True,
    add_bos_token=True,
)
# Llama 3 doesn't define a pad token, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token
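The imports above pull in BitsAndBytesConfig and LoraConfig even though this notebook does a full-weight finetune. If you're on a smaller GPU, a common variation is to load the model in 4-bit and attach a LoRA adapter instead. Here's a minimal sketch of that setup, not used in the rest of this notebook; if you go this route, reload the model with the quantization config and pass peft_config=peft_config to SFTTrainer.
# Optional 4-bit + LoRA variant for smaller GPUs (not used in the rest of this notebook)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# model = AutoModelForCausalLM.from_pretrained(
#     base_model_id,
#     quantization_config=bnb_config,
#     device_map="auto",
# )
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)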
3. Set our Training Arguments¶
A lot of tutorials simply paste a list of training arguments and leave it up to the reader to figure out what each one does. Below, each argument has a comment explaining its purpose!
# Output directory where results and checkpoints are stored
output_dir = "./results"
# Number of training epochs - how many times the model sees the whole dataset
num_train_epochs = 1  # increase this for a larger finetune
# Enable bf16 training. This sets the numeric precision used during training; since we are
# on an A100 (Ampere or newer), bf16 is supported, so we turn it on.
bf16 = True
# Batch size is the number of training examples processed in a single forward/backward pass (per device)
per_device_train_batch_size = 4
# Gradients are accumulated over multiple mini-batches before updating the model weights.
# This allows for effectively training with a larger batch size on hardware with limited memory
gradient_accumulation_steps = 2
# Gradient checkpointing is a memory optimization that reduces RAM usage during training by
# recomputing intermediate activations in the backward pass instead of storing all of them,
# trading extra compute time for lower memory consumption.
gradient_checkpointing = True
# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3
# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4
# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001
# Optimizer to use
optim = "paged_adamw_32bit"
# Number of training steps (overrides num_train_epochs). Kept very low here so the demo
# finishes quickly; set it to -1 to train for the full num_train_epochs instead.
max_steps = 5
# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03
# Group sequences of similar length into the same batch
# Saves memory and speeds up training considerably
group_by_length = True
# Save a checkpoint every X update steps
save_steps = 100
# Log every X update steps
logging_steps = 5
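One number worth keeping in mind: with gradient accumulation, the effective batch size per optimizer step is the per-device batch size times the accumulation steps.
# Effective batch size per device per optimizer step: 4 * 2 = 8 examples
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(f"Effective batch size per device: {effective_batch_size}")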
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    # report_to="wandb"  # enabled in the next section once we've logged in to Weights & Biases
)
Run our training job using WandB for logging¶
Weights & Biases is an industry-standard tool for monitoring and evaluating training jobs. I highly suggest setting up an account to monitor this run and using it for future ML jobs!
!pip install wandb
import wandb
wandb.login()
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpointing,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    report_to="wandb",
)
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
dataset_text_field="text",
tokenizer=tokenizer,
args=training_arguments,
)
trainer.train()
# Save trained model
trainer.model.save_pretrained(new_model)
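Before deploying, you can run a quick smoke test directly against the finetuned model. The prompt below is just one option; since this is a base-model finetune, the exact prompt format is up to you.
# Optional smoke test of the finetuned model before deploying it
test_prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is San Francisco?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
# add_special_tokens=False because the special tokens are already written into the prompt
inputs = tokenizer(test_prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))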
Deploy for inference!¶
To deploy this model for extremely fast inference, we use vLLM to host an OpenAI-compatible endpoint. You might have to restart the kernel first; after that, just run the cells below.
!nvidia-smi
!pip install vllm
# --tensor-parallel-size should match the number of GPUs you have; set it to 1 on a single GPU.
# We point --tokenizer at the base model repo because we only saved the finetuned weights above.
!python -O -u -m vllm.entrypoints.openai.api_server \
    --host=127.0.0.1 \
    --port=8000 \
    --model=brev-llama3-8b-SFT \
    --tokenizer=meta-llama/Meta-Llama-3-8B \
    --tensor-parallel-size=2
# Open up a terminal and run
#curl http://localhost:8000/v1/completions \
# -H "Content-Type: application/json" \
# -d '{
# "model": "brev-llama3-8b-SFT",
# "prompt": "What is San Francisco",
# "max_tokens": 30,
# "temperature": 0
# }'
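If you'd rather hit the endpoint from Python, the same request with the openai client (v1.x) looks like this; the API key can be any non-empty string since the server above doesn't require one.
# Query the vLLM server with the OpenAI Python client (run while the server is up)
# !pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")
completion = client.completions.create(
    model="brev-llama3-8b-SFT",
    prompt="What is San Francisco",
    max_tokens=30,
    temperature=0,
)
print(completion.choices[0].text)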