Run inference on Mistral 7B using NVIDIA TensorRT-LLM¶
Welcome!
In this notebook, we will walk through converting Mistral 7B to the TensorRT-LLM format and building an engine for it. TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM was recently featured in the Phind-70B release as their preferred framework for performing inference!
See the GitHub repo for more examples and documentation!
A note about running Jupyter Notebooks: Press Shift + Enter to run a cell. A * in the left-hand cell box means the cell is running; a number means it has completed. If your notebook is acting up, you can interrupt a too-long process by interrupting the kernel (Kernel tab -> Interrupt Kernel) or even restarting the kernel (Kernel tab -> Restart Kernel). Note that restarting the kernel will require you to run everything from the beginning.
Deployment powered by Brev.dev 🤙
Step 1 - Install TensorRT-LLM¶
We first install TensorRT-LLM and some additional packages that are used during the conversion process.
!pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
!pip uninstall -y mpmath
!pip install mpmath==1.3.0
!pip install ipywidgets
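As a quick sanity check that the installation succeeded, you can import the package and print its version (the exact version string will vary depending on when you run this):

import tensorrt_llm
print(tensorrt_llm.__version__)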
Step 2 - Convert Mistral to the TensorRT format¶
Next we use TensorRT-LLM's conversion and build scripts to first convert the model and then build the engine. Because Mistral 7B shares its architecture with Llama, we use the Llama example's convert_checkpoint.py script. TensorRT-LLM offers a plethora of features that you can enable during conversion (see more examples in the documentation here), including:
- FP8 KV Cache
- SmoothQuant
- INT8 KV Cache

A sketch of enabling one of these options is shown right after this list.
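For instance, once you download convert_checkpoint.py below, recent Llama-example versions expose flags such as --int8_kv_cache and --smoothquant. The exact flag names vary between TensorRT-LLM releases, so check the script's --help output first; the second command is a minimal sketch, assuming those flags exist in your version:

# List the options supported by your installed version first
!python convert_checkpoint.py --help
# Hypothetical: convert with an INT8 KV cache and SmoothQuant enabled
!python convert_checkpoint.py --model_dir mistralai/Mistral-7B-v0.1 --output_dir ./tllm_checkpoint_1gpu_mistral_int8 --dtype float16 --int8_kv_cache --smoothquant 0.5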
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/llama/convert_checkpoint.py -P .
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/run.py -P .
!wget https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/examples/utils.py -P .
!python convert_checkpoint.py --model_dir mistralai/Mistral-7B-v0.1 --output_dir ./tllm_checkpoint_1gpu_mistral --dtype float16
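The conversion produces a TensorRT-LLM checkpoint, typically a config.json plus one safetensors weight file per rank; you can inspect the output directory to confirm it was written:

!ls -lh ./tllm_checkpoint_1gpu_mistral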
!mkdir -p mistral_engine
!trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_mistral --output_dir ./mistral_engine --gemm_plugin float16 --max_input_len 32256
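Once the build finishes, the engine directory should hold the serialized engine alongside its config (file names can differ slightly between TensorRT-LLM versions):

!ls -lh ./mistral_engine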
!python3 run.py --max_output_len=50 --tokenizer_dir mistralai/Mistral-7B-v0.1 --engine_dir=./mistral_engine --max_attention_window_size=4096 --input_text "Swap memory is"
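Mistral 7B uses sliding-window attention with a 4096-token window, which is why --max_attention_window_size is capped at 4096 here. If you want to try your own prompts, you can check their token counts with the Hugging Face tokenizer (transformers is pulled in as a TensorRT-LLM dependency; this snippet is illustrative):

from transformers import AutoTokenizer

# Same tokenizer that run.py loads for this engine
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
prompt = "Swap memory is"
num_tokens = len(tokenizer(prompt).input_ids)
print(num_tokens)  # prompts must stay within --max_input_len (32256)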