
OpenAI Releases GPT-4o

OpenAI has released its newest model, GPT-4o.

GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction: it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time (https://www.pnas.org/doi/10.1073/pnas.0903616106) in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is also notably better at vision and audio understanding than existing models.

Prior to GPT-4o, you could use Voice Mode (https://openai.com/index/chatgpt-can-now-see-hear-and-speak) to talk to ChatGPT with average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). Voice Mode achieves this with a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information: it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter or singing, or express emotion.

With GPT-4o, OpenAI trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is the company’s first model combining all of these modalities, it is still just scratching the surface of what the model can do and where its limitations lie.

GPT-4o’s text and image capabilities are starting to roll out now in ChatGPT. OpenAI is making GPT-4o available in the free tier, and to Plus users with up to 5x higher message limits. A new version of Voice Mode with GPT-4o will roll out in alpha within ChatGPT Plus in the coming weeks.

Developers can also now access GPT-4o in the API as a text and vision model. GPT-4o is 2x faster, half the price, and has 5x higher rate limits compared to GPT-4 Turbo. OpenAI plans to launch support for GPT-4o’s new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.
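To make the difference between the older three-model Voice Mode pipeline and a single end-to-end model concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical and for illustration only: the `Audio` class, the stage functions, and the canned reply are stand-ins, not OpenAI's implementation or API. It only shows why non-text cues such as tone cannot survive a pipeline whose middle stage sees text alone.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    """Toy stand-in for an audio clip: spoken words plus non-text cues."""
    words: str
    tone: str = "neutral"
    speakers: int = 1


def transcribe(audio: Audio) -> str:
    # Stage 1 (hypothetical): speech-to-text keeps only the words,
    # so tone and speaker count are dropped here.
    return audio.words


def respond(text: str) -> str:
    # Stage 2 (hypothetical): a text-only language model.
    # A canned reply stands in for GPT-3.5 or GPT-4.
    return f"You said: {text}"


def synthesize(text: str) -> Audio:
    # Stage 3 (hypothetical): text-to-speech. No tone information
    # reached this stage, so the output uses the default tone.
    return Audio(words=text)


def cascaded_voice_mode(audio: Audio) -> Audio:
    # The three stages chained together, as in the older Voice Mode design.
    return synthesize(respond(transcribe(audio)))


def end_to_end(audio: Audio) -> Audio:
    # Toy single-model path: the whole input is visible, so the reply
    # can take tone into account (this sketch simply mirrors it).
    return Audio(words=f"You said: {audio.words}", tone=audio.tone)


if __name__ == "__main__":
    question = Audio(words="Is it raining?", tone="worried", speakers=2)
    print(cascaded_voice_mode(question))
    print(end_to_end(question))
```

In the cascaded path the reply always comes back with the default tone, whatever the input sounded like, because the text passed between stages has no field to carry it. The single-function path is only a cartoon of an end-to-end model, but it shows where the extra information becomes available.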

Editor