You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Repeated Rendering of Answers Causes Multiple Image Requests
Steps to Reproduce
Simulate a Markdown translation task that includes a link.
Send the following content to the LLM and observe the web console to see the numerous requests to https://placehold.co/600x400 for the image.
Please translate the following Markdown into Chinese
![https://placehold.co/600x400](https://placehold.co/600x400)
Hello GPT-4o
============
We’re announcing GPT-4o, our new flagship model that can reason across audio, vision, and text in real time.
All videos on this page are at 1x real time.
Guessing May 13th’s announcement.
GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to [human response time(opens in a new window)](https://www.pnas.org/doi/10.1073/pnas.0903616106) in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.
Model capabilities
------------------
Two GPT-4os interacting and singing.
Interview prep.
Rock Paper Scissors.
Sarcasm.
Math with Sal and Imran Khan.
Two GPT-4os harmonizing.
Point and learn Spanish.
Meeting AI.
Real-time translation.
Lullaby.
Talking faster.
Happy Birthday.
Dog.
Dad jokes.
GPT-4o with Andy, from BeMyEyes in London.
Customer service proof of concept.
Prior to GPT-4o, you could use [Voice Mode](https://openai.com/index/chatgpt-can-now-see-hear-and-speak) to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.
With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.
Explorations of capabilities
----------------------------
Select sample:
Visual Narratives - Robot Writer’s BlockVisual narratives - Sally the mailwomanPoster creation for the movie 'Detective'Character design - Geary the robotPoetic typography with iterative editing 1Poetic typography with iterative editing 2Commemorative coin design for GPT-4oPhoto to caricatureText to font3D object synthesisBrand placement - logo on coasterPoetic typographyMultiline rendering - robot textingMeeting notes with multiple speakersLecture summarizationVariable binding - cube stackingConcrete poetry
Model safety and limitations
----------------------------
GPT-4o has safety built-in by design across modalities, through techniques such as filtering training data and refining the model’s behavior through post-training. We have also created new safety systems to provide guardrails on voice outputs.
We’ve evaluated GPT-4o according to our [Preparedness Framework](https://openai.com/preparedness) and in line with our [voluntary commitments](https://openai.com/index/moving-ai-governance-forward/). Our evaluations of cybersecurity, CBRN, persuasion, and model autonomy show that GPT-4o does not score above Medium risk in any of these categories. This assessment involved running a suite of automated and human evaluations throughout the model training process. We tested both pre-safety-mitigation and post-safety-mitigation versions of the model, using custom fine-tuning and prompts, to better elicit model capabilities.
GPT-4o has also undergone extensive external red teaming with 70+ [external experts](https://openai.com/index/red-teaming-network) in domains such as social psychology, bias and fairness, and misinformation to identify risks that are introduced or amplified by the newly added modalities. We used these learnings to build out our safety interventions in order to improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as they’re discovered.
We recognize that GPT-4o’s audio modalities present a variety of novel risks. Today we are publicly releasing text and image inputs and text outputs. Over the upcoming weeks and months, we’ll be working on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities. For example, at launch, audio outputs will be limited to a selection of preset voices and will abide by our existing safety policies. We will share further details addressing the full range of GPT-4o’s modalities in the forthcoming system card.
Through our testing and iteration with the model, we have observed several limitations that exist across all of the model’s modalities, a few of which are illustrated below.
Examples of model limitations
We would love feedback to help identify tasks where GPT-4 Turbo still outperforms GPT-4o, so we can continue to improve the model.
Model availability
------------------
GPT-4o is our latest step in pushing the boundaries of deep learning, this time in the direction of practical usability. We spent a lot of effort over the last two years working on efficiency improvements at every layer of the stack. As a first fruit of this research, we’re able to make a GPT-4 level model available much more broadly. GPT-4o’s capabilities will be rolled out iteratively (with extended red team access starting today).
GPT-4o’s text and image capabilities are starting to roll out today in ChatGPT. We are making GPT-4o available in the free tier, and to Plus users with up to 5x higher message limits. We'll roll out a new version of Voice Mode with GPT-4o in alpha within ChatGPT Plus in the coming weeks.
Developers can also now access GPT-4o in the API as a text and vision model. GPT-4o is 2x faster, half the price, and has 5x higher rate limits compared to GPT-4 Turbo. We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.
Authors
-------
[OpenAI](/news/?author=openai#results)
Expected Behavior
I observed in the Chrome DevTools Network panel that the image at https://placehold.co/600x400 was requested over a hundred times.
Screenshots
Deployment Method
Docker
Vercel
Server
Desktop OS
OSX
Desktop Browser
Chrome
Desktop Browser Version
125.0.6422.142
Smartphone Device
No response
Smartphone OS
No response
Smartphone Browser
No response
Smartphone Browser Version
No response
Additional Logs
No response
The text was updated successfully, but these errors were encountered:
Bug Description
Repeated Rendering of Answers Causes Multiple Image Requests
Steps to Reproduce
Simulate a Markdown translation task that includes a link.
Send the following content to the LLM and observe the web console to see the numerous requests to https://placehold.co/600x400 for the image.
Expected Behavior
I observed in the Chrome DevTools Network panel that the image at https://placehold.co/600x400 was requested over a hundred times.
Screenshots
Deployment Method
Desktop OS
OSX
Desktop Browser
Chrome
Desktop Browser Version
125.0.6422.142
Smartphone Device
No response
Smartphone OS
No response
Smartphone Browser
No response
Smartphone Browser Version
No response
Additional Logs
No response
The text was updated successfully, but these errors were encountered: