OpenAI's all-powerful GPT-4o stuns with real-time interaction; the era of science fiction has arrived.


Only 17 months after launching ChatGPT, OpenAI has already unveiled a super AI straight out of a science fiction movie, and it is completely free for everyone to use.

Authored by: Synced

It's truly mind-blowing!

While other tech companies are still catching up on multimodal capabilities for large models, busily integrating features like text summarization and image generation into smartphones, front-runner OpenAI has made a decisive move, releasing a product that even its own CEO, Sam Altman, marvels at: just like in the movies.

In the early hours of May 14 (Beijing time), at its first "Spring Update" event, OpenAI unveiled its new-generation flagship generative model GPT-4o and a desktop app, and demonstrated a series of new capabilities. This time, the technology has reshaped the product itself, and OpenAI has taught tech companies worldwide a lesson by example.

The event was hosted by OpenAI Chief Technology Officer Mira Murati, who said the day's focus was on three things:

  • First, going forward, OpenAI's products will be free first, with the aim of putting them in as many hands as possible.
  • Second, to that end, OpenAI has released a desktop app and a refreshed UI, making ChatGPT easier and more natural to use.
  • Third, a new large model has arrived after GPT-4, named GPT-4o. What makes GPT-4o special is that it brings GPT-4-level intelligence to everyone, including free users, through extremely natural interaction.

With this update, the large model can accept any combination of text, audio, and image as input and generate any combination of text, audio, and image as output, all in real time. This is the future of interaction.

ChatGPT recently became usable without registration, and today a desktop app was added. OpenAI's goal is to let people use it seamlessly anytime, anywhere, integrating ChatGPT into your workflow. This AI is now a genuine productivity tool.

GPT-4o is a brand-new large model built for the future of human-machine interaction: it understands text, speech, and images, reacts quickly, and expresses emotion, making it remarkably human-like.

At the event, OpenAI engineers demonstrated several key abilities of the new model using an iPhone. The most important is real-time voice conversation. Mark Chen said, "This is my first time at a live product launch, so I'm a bit nervous." ChatGPT replied, "Take a deep breath."

"Okay, I'll take a deep breath," Chen said.

ChatGPT immediately responded, "That won't do, you're breathing too heavily."

If you've used Siri or a similar voice assistant, the difference is obvious. First, you can interrupt the AI mid-sentence at any time and carry the conversation on without waiting for it to finish. Second, there is no lag: the model responds very quickly, sometimes faster than a human would. Third, the model fully picks up on human emotion and can express a range of emotions itself.

Next came the visual capabilities. Another engineer wrote an equation on paper, and instead of giving the answer outright, ChatGPT explained step by step how to solve it. It shows great potential for teaching people how to work through problems.

ChatGPT said, "Whenever you're struggling with math, I'm right here with you."

Next, they tried GPT-4o's coding abilities. With some code on screen, they opened the desktop version of ChatGPT on a computer and asked it by voice what the code did and what a particular function was for. ChatGPT answered every question fluently.

Running the code produces a temperature curve chart, and ChatGPT could field any question about the chart on the spot.
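
The code from the demo wasn't shown in full, but a minimal Python sketch of the kind of script described, plotting raw daily temperatures alongside a smoothed rolling average, might look like the following (all data, names, and parameters here are made up for illustration):

```python
# Hypothetical stand-in for the demoed script: plot daily temperatures
# and a smoothed rolling average on one chart.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fabricated sample data for illustration only.
days = pd.date_range("2024-04-01", periods=30, freq="D")
rng = np.random.default_rng(0)
temps = 15 + 8 * np.sin(np.linspace(0, 3, 30)) + rng.normal(0, 1.5, 30)

def rolling_mean(series: pd.Series, window: int = 7) -> pd.Series:
    """Smooth a series with a centered rolling average."""
    return series.rolling(window, center=True, min_periods=1).mean()

df = pd.DataFrame({"temp_c": temps}, index=days)
df["smoothed"] = rolling_mean(df["temp_c"])

df.plot(y=["temp_c", "smoothed"], title="Daily temperature: raw vs. smoothed")
plt.ylabel("Temperature (°C)")
plt.show()
```

In the demo, a question like "what does this function do?" would map to something like `rolling_mean` here: the smoothing step that turns jagged raw readings into the clean curve on the chart.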

OpenAI also took live requests from users on X/Twitter. One example was real-time voice translation, with the phone acting as an interpreter between Spanish and English.

Someone also asked: can ChatGPT recognize your facial expressions?

Judging by the demo, GPT-4o can already handle real-time video understanding.

Now let's take a closer look at the "nuclear bomb" OpenAI dropped today.

The All-Purpose Model GPT-4o

First up is GPT-4o, where the "o" stands for "omni," as in omnimodel.

For the first time, OpenAI has integrated all modalities into one model, significantly improving the practicality of large models.

OpenAI CTO Mira Murati stated that GPT-4o delivers "GPT-4 level" intelligence but improves on GPT-4's capabilities across text, vision, and audio, and will be rolled out "iteratively" across the company's products over the coming weeks.

"GPT-4o spans across speech, text, and vision," Muri Murati said. "We know these models are becoming increasingly complex, but we want the interaction experience to become more natural and simpler, allowing you to focus entirely on collaborating with GPT without having to worry about the user interface."

GPT-4o matches GPT-4 Turbo on English text and code, performs significantly better on non-English text, and its API is faster and 50% cheaper. Compared with existing models, GPT-4o particularly excels at visual and audio understanding.
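
For developers, GPT-4o is served through the same Chat Completions API as earlier GPT-4 models, under the model id "gpt-4o". As a rough sketch of the multimodal input described above, here is a minimal text-plus-image request using the official Python SDK (the prompt and image URL are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A single request mixing text and an image; both are placeholders.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```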

It can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, comparable to human response times in conversation. Before GPT-4o, users of ChatGPT's Voice Mode experienced average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4).

That Voice Mode was a pipeline of three separate models: a simple model transcribed audio to text, GPT-3.5 or GPT-4 took text in and put text out, and a third simple model converted the text back into audio. OpenAI found that this design loses a great deal of information: the main model cannot directly observe intonation, multiple speakers, or background noise, and it cannot produce laughter, singing, or emotional expression.
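
In code, that cascade looked roughly like the sketch below, a simplified illustration using OpenAI's separate speech-to-text, chat, and text-to-speech endpoints (the file names are placeholders, and the real Voice Mode internals were more involved). Each hop adds latency, and only plain text survives the middle step, which is exactly the information loss described above:

```python
from openai import OpenAI

client = OpenAI()

# Step 1: transcribe speech to text (intonation, speaker identity,
# and background noise are all lost at this point).
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("user_question.mp3", "rb"),
)

# Step 2: the language model only ever sees the plain-text transcript.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# Step 3: synthesize the text reply back into speech.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.stream_to_file("assistant_reply.mp3")
```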

With GPT-4o, OpenAI has instead trained a single new model end-to-end across text, vision, and audio, meaning every input and output is processed by the same neural network.

"From a technical perspective, OpenAI has found a way to directly map audio to audio as a primary modality and to stream video directly to the transformer. This requires some new research on tokenization and architecture, but overall, it's a data and system optimization problem (as most things are)," commented Jim Fan, a scientist at NVIDIA.

GPT-4o can perform real-time inference across text, audio, and video, marking an important step towards more natural human-machine interaction (even human-machine-machine interaction).

OpenAI President Greg Brockman also went "live" online, not only having two GPT-4o models converse in real time but also getting them to improvise a song. The melody was a touch "moving," but the lyrics covered the room's decor, what the people were wearing, and the little incidents that happened along the way.

In addition, GPT-4o understands and generates images far better than any existing model, making many previously impossible tasks "as easy as pie."

For example, you can ask it to help print OpenAI's logo on a coaster:

After a run of technical breakthroughs, OpenAI appears to have largely solved the long-standing problem of rendering legible text in generated images.

GPT-4o can also generate 3D visual content, performing 3D reconstruction from six generated images:

Given a poem, GPT-4o can typeset it in a handwritten style:

More complex formatting styles can also be handled:

Collaborating with GPT-4o, you only need to input a few paragraphs of text to get a series of continuous comic panels:

The following features should surprise many designers:

This is a stylized poster derived from two lifestyle photos:

There are also some niche features, such as "text to artistic font":

Performance Evaluation of GPT-4o

Members of the OpenAI technical team stated on X that the mysterious model "im-also-a-good-gpt2-chatbot," which caused widespread discussion on the LMSYS Chatbot Arena, is a version of GPT-4o.

On a set of challenging prompts, especially for coding, GPT-4o shows a particularly large improvement over OpenAI's previous best models.

Specifically, in multiple benchmark tests, GPT-4o achieved GPT-4 Turbo-level performance in text, reasoning, and coding intelligence, while also achieving new highs in multilingual, audio, and visual capabilities.

Reasoning improvements: GPT-4o set a new high score of 87.2% on 5-shot MMLU (general-knowledge questions). (Note: Llama 3 400B was still in training at the time.)

Audio ASR performance: GPT-4o significantly improves speech-recognition performance over Whisper-v3 across all languages, especially lower-resource ones.

Speech translation: GPT-4o sets a new state of the art, outperforming Whisper-v3 on the MLS benchmark.

The M3Exam benchmark is both a multilingual and a visual evaluation, consisting of standardized multiple-choice questions from multiple countries and regions, including figures and charts. GPT-4o outperforms GPT-4 on this benchmark across all languages.

Going forward, improvements in model capability will enable more natural real-time voice conversations, as well as real-time video conversations with ChatGPT; for example, you could show ChatGPT a live sports game and ask it to explain the rules.

ChatGPT users will receive more advanced features for free

More than 100 million people use ChatGPT every week. OpenAI announced that GPT-4o's text and image capabilities are rolling out free in ChatGPT, with Plus users getting up to 5x higher message limits.

Now, upon opening ChatGPT, it is evident that GPT-4o is already available for use.

With GPT-4o, free ChatGPT users now get the following: GPT-4-level intelligence, and responses drawn from both the model and the web.

Additionally, free users also have the following options:

Analyze data and create charts:

Have conversations about photos you take:

Upload files for assistance with summarization, writing, or analysis:

Discover and use GPTs and the GPT Store:

And utilize the memory function for a more helpful experience.

However, depending on usage and demand, the number of messages free users can send with GPT-4o may be limited. When the limit is reached, ChatGPT automatically switches to GPT-3.5 so the conversation can continue.

Furthermore, in the coming weeks OpenAI will launch an alpha of a new GPT-4o-powered Voice Mode in ChatGPT Plus, and will roll out more of GPT-4o's audio and video capabilities to a small group of trusted partners through the API.

Of course, extensive testing and iteration have shown that GPT-4o still has limitations in every modality, and OpenAI is working to improve these rough edges.

Opening up GPT-4o's audio modalities will inevitably introduce new risks. On safety, GPT-4o has protections built into its cross-modal design, through techniques such as filtering training data and refining the model's behavior after training, and OpenAI has also built a new safety system for voice outputs.

New Desktop App Simplifies User Workflow

For free and paid users alike, OpenAI has also launched a new ChatGPT desktop app for macOS. With a simple keyboard shortcut (Option + Space), users can instantly put a question to ChatGPT, and they can also take screenshots and discuss them directly in the app.

Users can now also hold voice conversations with ChatGPT straight from their computer; GPT-4o's audio and video features will arrive later. A voice conversation starts with a click on the headphone icon in the bottom-right corner of the desktop app.

Starting today, OpenAI will introduce the macOS application to Plus users and will make the application more widely available in the coming weeks. Additionally, later this year, OpenAI will release a Windows version.

Ultraman: "You Open Source, We Free"

After the launch, OpenAI CEO Sam Altman published a blog post sharing his reflections on the work behind GPT-4o:

There are two things I want to emphasize from today's announcement.

First, a key part of our mission is to put very capable AI tools in the hands of people for free (or at a great price). I am very proud that we have made the best model in the world available for free in ChatGPT, without ads or anything like that.

When we started OpenAI, our initial conception was that we would create AI and use it to create all sorts of benefits for the world. Instead, it now looks like we will create AI and then other people will use it to create all sorts of amazing things that we all benefit from.

Of course, we are a business and will find plenty of things to charge for, and that will help us provide free, outstanding AI service to (hopefully) billions of people.

Second, the new voice and video modes are the best computer interface I have ever used. It feels like the AI from the movies, and it still surprises me a little that it's real. Getting to human-level response times and expressiveness turns out to be a big change.

The original ChatGPT hinted at what was possible with language interfaces; this new thing (the GPT-4o version) feels viscerally different: it is fast, smart, fun, natural, and helpful.

Talking to a computer has never felt truly natural to me; now it does. As we add (optional) personalization, access to your information, the ability to take actions on your behalf, and more, I can really see an exciting future where we use computers to do far more than ever before.

Finally, a big thank you to the team for their tremendous efforts in achieving this goal!

It is worth mentioning that in an interview last week, Altman said that while universal basic income is hard to realize, we may achieve "universal basic compute": in the future, everyone will have free access to a slice of GPT's compute, to use, resell, or donate.

"The idea is that as AI becomes more advanced and embedded in all aspects of our lives, having large language model units like GPT-7 may be more valuable than money, and you have some productivity," Altman explained.

The release of GPT-4o is perhaps the beginning of OpenAI's efforts in this direction.

Yes, this is just the beginning.

Finally, it is worth noting that the video "Guessing May 13th's announcement." shown on OpenAI's blog today overlaps almost entirely with a teaser video Google put out for its I/O conference tomorrow, an unmistakable shot across Google's bow. After seeing today's OpenAI release, one has to wonder how much pressure Google is feeling.
