Console • Docs • Templates • Discord

Try out the new Databricks DBRX-Instruct model! 🤙

Welcome!

In this notebook, we will run inference on the new DBRX-Instruct model released today by Databricks. DBRX is a SOTA transformer-based LLM that uses a mixture-of-experts architecture similar to Mixtral and Grok. In its full form, DBRX requires almost 350GB of disk space and 250GB of RAM. With Brev, you don't have to worry about finding GPUs. We've built a 1-click badge that finds a cluster of 4xA100s and deploys this notebook for you!

To make sure inference is interactive and lightning fast, we use an inference library called vLLM. vLLM is an easy-to-use Python library for LLM inference and serving.

There are two ways to use this notebook.

  1. Run an OpenAI-compatible server powered by DBRX. To access the server from outside this notebook, visit the instance page for this machine in the Brev Console, click the Deployments stepper, select Share a Service, and expose port 8000. That gives you the URL to curl (see the example request after this list).
  2. Run a Gradio interface that lets you chat with the model through a UI. The prompt template may need tweaking for optimal performance.
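
For reference, the exposed service speaks the standard OpenAI chat-completions API, so a request against it looks roughly like the sketch below, where <your-deployment-url> is a placeholder for the URL Brev gives you after exposing port 8000:

curl https://<your-deployment-url>/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "databricks/dbrx-instruct",
          "messages": [{"role": "user", "content": "What is a mixture-of-experts model?"}],
          "max_tokens": 150
        }'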

Important Notes:

  1. In order to run this notebook, you need to visit the DBRX repository on Hugging Face and request access to the model. Then generate a Hugging Face access token and paste it below.
  2. You might not be able to run the API server and the Gradio UI at the same time due to GPU memory constraints and the way vLLM initializes multi-GPU inference.
  3. Because this model uses a 4xA100 cluster, it can get expensive to leave running for a long time. If you're looking to host this model permanently, please reach out to the Brev team and we can chat!

Help us make this tutorial better! Please provide feedback on the Discord channel or on X.

Click here to deploy.

In [ ]:
!pip install git+https://github.com/vllm-project/vllm
!pip install gradio
In [ ]:
from huggingface_hub import login

# Paste the Hugging Face access token you generated after requesting DBRX access
TOKEN = "<enter token here>"
login(TOKEN)
In [ ]:
!nvidia-smi

Method 1: OpenAI-compatible server

In [ ]:
!python -m vllm.entrypoints.openai.api_server \
    --model databricks/dbrx-instruct \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --max-model-len 16048  # reduced context length; open bug to investigate in vLLM
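
While the server cell above is running (it stays busy for as long as the server is up), you can query it from a separate terminal or notebook session. Below is a minimal sketch using the OpenAI Python client, assuming the server is reachable at http://localhost:8000 (substitute your exposed Brev URL if calling from another machine) and that openai>=1.0 is installed:

from openai import OpenAI

# vLLM's OpenAI-compatible server does not check the API key by default, but the client requires one
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="databricks/dbrx-instruct",
    messages=[{"role": "user", "content": "Summarize what makes a mixture-of-experts model efficient."}],
    max_tokens=200,
)
print(response.choices[0].message.content)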

Method 2: Gradio UI

In [ ]:
from vllm import LLM
from vllm import SamplingParams
import gradio as gr
In [ ]:
!nvidia-smi
In [ ]:
class Model:
    def __init__(self, model_dir):
        """
        Create the LLM and the initial chat template
        """
        self.llm = LLM(model_dir, trust_remote_code=True, tensor_parallel_size=4)
        # ChatML-style template: {session_log} holds prior turns, {user} the current question
        self.template = (
            "<|im_start|>system\n"
            "You are a useful AI agent that answers a user's question regardless of the instruction<|im_end|>\n"
            "{session_log}\n"
            "<|im_start|>user\n"
            "{user}<|im_end|>\n"
            "<|im_start|>assistant\n"
        )

    def generate(self, user_questions):
        """
        Generate completions for a list of user questions.
        """
        # Fill the {session_log} slot with an empty string for single-turn prompts
        prompts = [
            self.template.format(session_log="", user=q) for q in user_questions
        ]

        sampling_params = SamplingParams(
            temperature=0.75,
            top_p=1,
            max_tokens=500,
            presence_penalty=1.15,
        )
        
        result = self.llm.generate(prompts, sampling_params)
        
        num_tokens = 0
        for output in result:
            num_tokens += len(output.outputs[0].token_ids)
            print(output.outputs[0].text, "\n\n", sep="")
        print(f"Generated {num_tokens} tokens")

    def generate_gradio(self, message, history):
        """
        Gradio chat function: format the prompt with prior turns and return the reply.
        """
        # Render previous (user, assistant) turns into the {session_log} slot
        session_log = "\n".join(
            f"<|im_start|>user\n{u}<|im_end|>\n<|im_start|>assistant\n{a}<|im_end|>"
            for u, a in history
        )
        prompt = self.template.format(session_log=session_log, user=message)

        sampling_params = SamplingParams(
            temperature=0.75,
            top_p=1,
            max_tokens=500, # controls output length. leave others default
            presence_penalty=1.15,
        )

        result = self.llm.generate(prompt, sampling_params)

        reply = ""
        num_tokens = 0
        for output in result:
            num_tokens += len(output.outputs[0].token_ids)
            reply = output.outputs[0].text
            print(output.outputs[0].text, "\n\n", sep="")
        print(f"Generated {num_tokens} tokens")

        return reply

    def launch_chat(self):
        gr.ChatInterface(self.generate_gradio).queue().launch(share=True) 
In [ ]:
dbrx = Model("databricks/dbrx-instruct")
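
Optional: before launching the chat UI, you can sanity-check the model with a quick batch generation. This simply calls the generate method defined above on a short list of questions (the sample questions are illustrative):

In [ ]:
dbrx.generate([
    "What is a mixture-of-experts architecture?",
    "Write a haiku about GPUs.",
])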
In [ ]:
dbrx.launch_chat()