How to get started with GPT-4 omni 🏌‍♀️

What’s all the fuss about…?

I’ve spent a little bit of time this week tinkering with OpenAI’s latest model and…am pretty blown away 🤓

Here is a sample application that I’ve hooked up to gpt-4o; it accepts images alongside text as prompts, as well as supporting regular text completions.

On Monday, the 13th of May, OpenAI stole a march on Google (whose own launch event was planned for the following day) by releasing a new version of its flagship GPT-4 model called GPT-4o, with the "o" standing for "omni". While it doesn’t deliver any features we haven’t seen before from a large language model, GPT-4o represents a significant step forward in performance and usability, especially when it comes to multi-modal inputs and outputs.

Before the release of this model, if you were developing with OpenAI models and wanted to support multi-modality, you would need to orchestrate your logic across several different models such as:

  • GPT-4 Turbo for text completions

  • GPT-4 Vision to interpret images (and sampled video frames) as inputs

  • Whisper to transcribe audio inputs (with a separate text-to-speech model needed to generate audio outputs)

Where GPT-4o differs is that it supports multi-modality from a single API endpoint, which makes developing multi-modal LLM applications much more streamlined: there is no longer any need to detect the type of input from the user and then call the appropriate model. To achieve this, OpenAI has folded many of the techniques learned from training the models above into what appears to be a brand-new model, which it has chosen to brand as a continuation of what came before, even though there is a clear break with how its previous models were trained.
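
To make that concrete, here is a rough sketch of the kind of routing you previously needed versus a single GPT-4o call. The route_request helper and the exact model names (gpt-4-turbo, gpt-4-vision-preview, whisper-1) are illustrative assumptions about how these pieces were typically wired together, not code published by OpenAI:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Before GPT-4o: route each modality to a different model (illustrative sketch)
def route_request(text=None, image_url=None, audio_file=None):
    if audio_file is not None:
        # Whisper transcribes audio; it does not generate speech
        return client.audio.transcriptions.create(model="whisper-1", file=audio_file).text
    if image_url is not None:
        return client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{"role": "user", "content": [
                {"type": "text", "text": text or "Describe this image."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]}],
        ).choices[0].message.content
    return client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": text}],
    ).choices[0].message.content

# With GPT-4o: one model and one endpoint for text and image inputs
def ask_gpt4o(content):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    ).choices[0].message.content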

GPT-4o now tops many of the leading text evaluation benchmarks, but given that many, if not all, LLMs are deliberately fine-tuned against the published evaluation frameworks, this shouldn’t be a huge surprise…

“With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.” Hello GPT-4o | OpenAI

Further to incorporating multi-modal capabilities into a single model, OpenAI has unlocked significant efficiency gains, largely thanks to a new tokeniser, which makes GPT-4o much faster and, at the same time, cheaper to run. It is so much cheaper to run that OpenAI is going to offer access to this model for free (with a reduced token limit). To me, this indicates two things:

  1. This new model uses significantly less compute than the previous GPT-4 model versions, making it financially viable to include the model in their free tier offering.

  2. There is a more advanced model coming! Introducing a product that clearly outperforms its predecessors while drastically cutting the cost suggests to me that a new model is on the way, one that will introduce a substantial step change which OpenAI will monetize.
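
On the tokeniser point above: you can see the effect locally with OpenAI’s tiktoken library, which in recent versions ships the o200k_base encoding used by GPT-4o alongside the cl100k_base encoding used by GPT-4 and GPT-4 Turbo. A minimal sketch; the sample sentence is arbitrary, and OpenAI reports the largest savings on non-English text:

import tiktoken

sample = "¿Cuál es el área del triángulo?"  # arbitrary non-English example

gpt4_tokens  = tiktoken.get_encoding("cl100k_base").encode(sample)  # GPT-4 / GPT-4 Turbo tokeniser
gpt4o_tokens = tiktoken.get_encoding("o200k_base").encode(sample)   # new GPT-4o tokeniser

print(f"GPT-4 tokeniser:  {len(gpt4_tokens)} tokens")
print(f"GPT-4o tokeniser: {len(gpt4o_tokens)} tokens")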

So, better quality outputs, faster response times, all within a single model—what's not to like?

How can I start building with it?

In the sample application above, you can see that we’re able to ask meaningful questions about an image and get quality responses very quickly. As mentioned above, this is not a new capability (GPT-4 Vision has allowed us to do this for some time), but the speed and improved quality of responses really transform this from a bit of a novelty into something that is genuinely useful and usable by our app consumers.

The developer experience for getting started with GPT-4o is very simple. Helpfully, all of your existing applications that use the OpenAI chat completions endpoint will continue to function as normal. However, when you want to provide an image as an input, the shape of the messages array we pass to the model changes: the content element can now take different types of input beyond plain text. Below, you can see a regular text completions call and the new multi-modal call using the Python SDK.

Regular text completions:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o"

completion = client.chat.completions.create(
  model=MODEL,
  messages=[
    {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"}, # <-- This is the system message that provides context to the model
    {"role": "user", "content": "Hello! Could you solve 2+2?"}  # <-- This is the user message for which the model will generate a response
  ]
)

print("Assistant: " + completion.choices[0].message.content)

Example Multi-modal completion when providing an image as input:

import base64

# Encode a local image file as base64 so it can be sent inline
# ("triangle.png" is a placeholder path; point this at your own image)
with open("triangle.png", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)

As you’ll see, we now have access to a type attribute 😎. Currently, the only type available is image_url, within which we can pass either a base64-encoded string or a publicly accessible URL. However, it is clear that this is how we’ll be able to pass in other formats, such as audio and video content, in the future. If you’d like to get started with some examples, check out the example notebook provided by OpenAI, which shows how to get started with images, along with a slight workaround for audio and video.
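
For reference, here is the same style of call passing a publicly accessible URL rather than a base64 string; the URL below is just a placeholder:

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {
                "url": "https://example.com/some-image.png"  # placeholder public URL
            }}
        ]}
    ],
)

print(response.choices[0].message.content)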

How Do I Access the Model?

You can easily access the model in the OpenAI Playground. To interact with it through code, you simply need to pass "gpt-4o" as your model parameter and call the endpoints as described above. If you’re an Azure OpenAI customer, you will need access to a workspace in the West US 3 or East US regions; you won’t be able to deploy the model yet, but you can try it out in an early-access playground.
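
As a quick sanity check that your API key can actually see the new model before you wire it into an application, you can retrieve it by ID with the standard Python SDK (a minimal sketch):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Raises an error if the model isn't available to your account
model = client.models.retrieve("gpt-4o")
print(model.id)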

What Does the Future Hold?

The range of applications that this model opens up is incredibly exciting for developers of LLM applications and users alike. It also shows that we should be careful not to commit too much time and resources to building apps that address obvious shortfalls in the LLMs currently available, as those shortfalls are being addressed at an incredible rate by the leading model makers. Personally, I am very much looking forward to the API supporting audio and video inputs, and perhaps multi-modal outputs as well.
