Create Blazingly Fast LLM Applications with Groq 🔥
What is Groq and why does speed matter?
If you were asked how LLM applications like ChatGPT differ from traditional chatbots, one key feature might be the way they deliver their responses. The pace at which these applications stream their answers gives us time to digest the information, and that delay lends a human-like quality, making the interaction feel more authentic. The delay is governed by what's often referred to as 'inference speed': the rate at which an LLM generates the tokens that form its response. One reason ChatGPT can take a while is that the GPUs supplied by Nvidia, while optimized for a wide range of tasks, aren't specifically designed for hosting Large Language Models. This is where Groq offers a distinctly different approach…
AI chip architecture designed specifically for hosting LLMs
Groq, founded by ex-Google wizard Jonathan Ross (not to be confused with the UK radio and late-night talk show host!), has unleashed the first-ever 'Language Processing Unit.' And let me tell you, it's fast. Blazingly fast. To give you some perspective, they can run Llama-2 70B at over 300 tokens per second per user and some of Mistral's smaller models at more than 500 tokens per second. In contrast, GPT-3.5 chugs along at 30-50 tokens per second. The user experience with Groq? It's in a whole new stratosphere, where near real-time LLM applications are very much a reality.
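To put those figures into perspective, here's a quick back-of-the-envelope sketch of how long a full response would take to stream. It uses the throughput numbers quoted above and an assumed 500-token answer, purely for illustration:

# Rough illustration using the throughput figures quoted above.
# The token count and speeds are indicative, not benchmarks I've run.
response_tokens = 500  # an assumed medium-length answer

throughputs = {
    "Groq / Llama-2 70B": 300,            # tokens per second
    "Groq / smaller Mistral models": 500,
    "GPT-3.5 (typical)": 40,
}

for name, tokens_per_second in throughputs.items():
    seconds = response_tokens / tokens_per_second
    print(f"{name}: ~{seconds:.1f}s to stream the full answer")

At 300+ tokens per second the whole answer arrives in under two seconds, versus well over ten seconds at typical GPT-3.5 speeds, which is exactly the gap you feel when using Groq's playground.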
A key theme in the evolution of Large Language Models over the past few years is their ever-increasing size—where, for many tasks, bigger does indeed mean better. But as models grow, they demand more from the GPUs hosting them, often sacrificing speed for quality. While I wouldn't predict that Groq's LPUs will completely dethrone Nvidia in hosting LLMs, it's reasonable to think their burst onto the scene will prod Nvidia to fast-track their own LPU development.
How to get started with Groq?
Having read this far, you’re probably wondering how you can test Groq out for yourself. Well, you’re in luck: as of the 6th of May 2024, Groq offers a free tier on GroqCloud, where you can sign up and use the following models, both within their playground console as well as in code using their Python and JavaScript/TypeScript client libraries.
Model ID: llama3-8b-8192 | Developer: Meta | Context Window: 8,192 tokens
Model ID: llama3-70b-8192 | Developer: Meta | Context Window: 8,192 tokens
Model ID: mixtral-8x7b-32768 | Developer: Mistral | Context Window: 32,768 tokens
Model ID: gemma-7b-it | Developer: Google | Context Window: 8,192 tokens
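If you'd rather confirm programmatically which models your key can access, the Groq Python client also exposes a model-listing call. The snippet below is a minimal sketch and assumes the client mirrors the OpenAI-style models.list() endpoint; check the GroqCloud docs if your SDK version differs:

# Minimal sketch: list the models available to your API key.
# Assumes the groq client exposes an OpenAI-style models.list() endpoint.
import os
from dotenv import load_dotenv
from groq import Groq

load_dotenv()
client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

for model in client.models.list().data:
    print(model.id)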
Below, you'll find a side-by-side comparison of Groq running the recently released Llama 3-70b model and ChatGPT operating the latest version of GPT-4. Now, one major caveat here: while the exact size of GPT-4 isn't known, it's rumored to have over a trillion parameters (GPT-4 - Wikipedia), which might make this latency comparison a tad unfair. However, the inference speed that Groq achieves with this still sizable model is nothing short of impressive. It truly showcases what's possible in terms of near real-time LLM applications when using bespoke chip architecture and a smaller model.
How to get started using Python
As mentioned above, Groq has both Python and JavaScript/TypeScript libraries ready for you to start building with. There are numerous quick-start guides available within the GroqCloud console, and a major benefit of how Groq has designed its Chat API is that it is fully interoperable with the OpenAI Chat Completions API, so any applications or scripts that you have already built for OpenAI models will require minimal changes.
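One way to see that interoperability in action is to point the official openai Python client at Groq's OpenAI-compatible endpoint and simply swap the model ID. The sketch below assumes the base URL https://api.groq.com/openai/v1 (check the GroqCloud docs to confirm) and requires pip install openai:

# Sketch of the OpenAI-compatibility route: reuse the openai client,
# but point it at Groq's OpenAI-compatible base URL (verify against the
# GroqCloud docs) and pass your Groq API key instead of an OpenAI one.
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

client = OpenAI(
    api_key=os.environ.get("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="llama3-70b-8192",
    messages=[{"role": "user", "content": "Why is Groq so fast?"}],
)
print(response.choices[0].message.content)

For the rest of this post, though, we'll use Groq's own Python client.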
To build a simple streaming command-line example in Python, make sure that you have the following installed in your environment:
pip install groq python-dotenv
Then create a .env file in your working directory and add your Groq API key:
GROQ_API_KEY="{INSERT YOUR API KEY HERE}"
Create a new Python file in your working directory called streaming_groq.py and add the following imports:
import asyncio
import os
from dotenv import load_dotenv
from groq import AsyncGroq
We’re now going to add a single async main function that executes a hardcoded user prompt. To try out the various models available to you, simply change the model parameter to the model ID of the model you would like to try; in this example we’re using mixtral-8x7b-32768.
async def main():
    # Load the API key from the .env file and create an async client
    load_dotenv()
    client = AsyncGroq(
        api_key=os.environ.get("GROQ_API_KEY"),
    )

    # Request a streaming chat completion
    stream = await client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "you are a helpful assistant."
            },
            {
                "role": "user",
                "content": "Why is Groq so fast?",
            }
        ],
        model="mixtral-8x7b-32768",
        temperature=0.5,
        max_tokens=4000,
        top_p=1,
        stop=None,
        stream=True,
    )

    # Print each token as it arrives; the final chunk can have no content,
    # so guard against printing "None"
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")

asyncio.run(main())
Great, now simply run your Python file from the command line and see how blazingly fast Groq is! But what if we want to add a bit more flair to our chatbot? Well, the good news is that the Groq API also supports function/tool calling, which we’ll explore below.
How to use tools with Groq 🛠️
The interoperability with OpenAI extends beyond simple chat completions. You can use function calling with the Groq API in much the same way as you would with OpenAI models. In my own testing, I've found the Llama 3-70b model to be reliable for function calling, although it doesn't quite match the speed or quality of outputs that you get from the following OpenAI models, which have been fine-tuned specifically for function calling:
gpt-3.5-turbo (1106)
gpt-3.5-turbo (0125)
gpt-4 (1106-Preview)
gpt-4 (0125-Preview)
Below, I'll demonstrate how to integrate a web search tool with Llama 3-70b. Please note that this example requires a Bing Search API key. However, you can easily modify the search function to fit your own use case. To create a Bing Search resource, simply follow the steps provided here.
SpeedyFunctions is a command line application using Llama 3-70b with Bing Search
To get started, create a new Python file called “search_tool_example.py” and install the following dependencies into your Python environment:
pip install groq requests python-dotenv pyfiglet
Then add your Bing Search subscription key and URL to your .env file, which should look something like this:
GROQ_API_KEY="{INSERT YOUR GROQ API KEY HERE}"
bing_search_subscription_key="{INSERT YOUR BING SUBSCRIPTION KEY HERE}"
bing_search_url="https://api.bing.microsoft.com/v7.0/search"
At the top of your Python script, add the following imports and declare the Groq client as well as the Bing API parameters we’ll need for our search function:
from groq import Groq
import os
import json
import requests
from dotenv import load_dotenv
import datetime
import pyfiglet

# Load credentials from .env and set up the client and model
load_dotenv()
client = Groq(api_key=os.getenv("GROQ_API_KEY"))
MODEL = "llama3-70b-8192"
bing_search_subscription_key = os.environ.get("bing_search_subscription_key")
bing_search_url = os.environ.get("bing_search_url")
Next, we’re going to declare our search function. It accepts a query parameter, uses the requests library to call the Bing Search API, and returns a JSON string containing the results of the search.
def search(query: str) -> str:
    """
    Perform a Bing search against the given query

    @param query: Search query
    @return: JSON string of search results
    """
    headers = {"Ocp-Apim-Subscription-Key": bing_search_subscription_key}
    params = {"q": query, "textDecorations": False}
    response = requests.get(bing_search_url, headers=headers, params=params)
    response.raise_for_status()
    search_results = response.json()

    # Keep only the fields the model needs from each web result
    output = []
    for result in search_results["webPages"]["value"]:
        output.append(
            {
                "title": result["name"],
                "link": result["url"],
                "snippet": result["snippet"],
            }
        )
    return json.dumps(output)
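If you want to sanity-check the helper on its own before wiring it up to the model, a quick call like the one below works (the query and the truncated print are purely illustrative, and it does require a valid Bing key in your .env):

# Quick standalone check of the search helper (illustrative query).
results = search("specialty coffee shops in London")
print(results[:300])  # a JSON string of {"title", "link", "snippet"} objects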
Great, now we’re going to create the run_conversation and main functions. The basic flow is as follows:
The user asks a question, e.g. “Where can I find great coffee in London?”
The LLM then decides whether it has a tool available that could help answer the user’s question. If it does, it generates a JSON response containing the required parameters for each of the functions it believes could help.
We need to catch the initial response and check whether the LLM responded with a function call. If so, we call the function in our code and then pass the response back to the LLM.
The LLM then generates its final response, using the results from the function call(s) we have made.
Please note that whilst this example only uses one function, you can add many more. Keep in mind that the descriptions you provide for each function will determine how and when the model calls your function. Currently, you can add up to 128 functions. The snippet below sketches roughly what the model's tool-call response looks like.
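This is illustrative only: the id and query below are invented for the example, but the overall shape follows the OpenAI-style tool-calling response that Groq mirrors.

# Illustrative only: the id and query are made up for this example.
#
# response_message.tool_calls -> [
#     tool_call(
#         id="call_abc123",                # hypothetical id
#         type="function",
#         function.name="search",
#         function.arguments='{"query": "best coffee shops in London"}',
#     )
# ]
#
# Note that function.arguments arrives as a JSON *string*, which is why
# run_conversation below parses it with json.loads before calling search().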
def run_conversation(messages: list):
    # Describe the search tool so the model knows when and how to call it
    tools = [
        {
            "type": "function",
            "function": {
                "name": "search",
                "description": "Performs a Bing search to get up-to-date information from the web. You should use this tool when you need to get up-to-date information or when your training data cannot provide you with an accurate answer",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {
                            "type": "string",
                            "description": "The search query",
                        }
                    },
                    "required": ["query"],
                },
            },
        }
    ]

    while True:
        # Get message from user
        user_message = input("User > ")
        if user_message == "STOP":
            print(pyfiglet.figlet_format("Goodbye"))
            return messages

        # Add user input to messages list
        messages.append({"role": "user", "content": user_message})
        start_time = datetime.datetime.now()

        # Make first call to the LLM
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=tools,
            tool_choice="auto",
            max_tokens=4096,
        )
        response_message = response.choices[0].message

        # If the LLM has returned function calls, iterate through them and
        # append their output to the messages list
        tool_calls = response_message.tool_calls
        if tool_calls:
            available_functions = {"search": search}
            messages.append(response_message)
            for tool_call in tool_calls:
                function_name = tool_call.function.name
                print(f"\nI'm going to use {function_name}")
                function_to_call = available_functions[function_name]
                function_args = json.loads(tool_call.function.arguments)
                function_response = function_to_call(query=function_args["query"])
                messages.append(
                    {
                        "tool_call_id": tool_call.id,
                        "role": "tool",
                        "name": function_name,
                        "content": function_response,
                    }
                )

            # Make second call to the LLM with the results of the functions
            # it requested in the first response
            second_response = client.chat.completions.create(
                model=MODEL,
                messages=messages,
                temperature=0.5,
                max_tokens=8000,
            )
            messages.append(
                {"role": "assistant", "content": second_response.choices[0].message.content}
            )
            print(f"\nAssistant > {second_response.choices[0].message.content}")
        else:
            print(f"\nAssistant > {response_message.content}")

        end_time = datetime.datetime.now()
        duration = (end_time - start_time).total_seconds()
        print(f"\nThe response took {duration} seconds.\n")


def main():
    print(pyfiglet.figlet_format("Welcome to SpeedyFunctions"))
    welcome_message = "I use Llama 3 hosted on Groq to achieve amazing response times, even with function calling enabled, give me a go!"
    print(welcome_message)
    print("\n")
    system_message = [
        {
            "role": "system",
            "content": "You are a function calling LLM that uses data extracted from the search function to answer web-related questions. Include the main topic and relevant details in your response. Always cite your sources in your responses along with a link in markdown",
        },
    ]
    run_conversation(system_message)


if __name__ == "__main__":
    main()
When you run your script, you’ll be able to chat with the model, observe how long responses take to generate, and see when the model decides to use its web search function. If you wish to exit the chat at any stage, simply type “STOP”. I plan to cover how you can build a simple chat UI in React and serve it via FastAPI in a later blog post.
Should you build with Groq?
With the rapid pace at which new generative AI models, frameworks, and libraries are being released, it can be quite challenging to decide where to invest your time. My experience building with Groq has been impressively smooth. However, it's clear they are still in the process of scaling their infrastructure. Currently, there's no paid tier available, and the rate limiting on applications can be quite stringent. I have no doubt that as Groq scales and introduces paid tiers, it will become a strong contender for building future LLM applications where speed and near real-time responses are crucial.
Given the interoperability with OpenAI's APIs, I'd recommend trying out some of your scripts with the Groq API to experience the speed firsthand. However, it's clear there will be a significant wait before we can build production applications on Groq's platform.