Author: Max, always on the road, 01Founder
If we had to summarize OpenAI's progress in 2025, many people would probably describe it as uneventful, or even somewhat passive.
Over the past year, they did steadily push forward on the reasoning track, releasing a series of reasoning models from o3-pro to o4-mini, as well as new base models such as GPT-4.5 and GPT-5.
However, in visual generation, the field ordinary users perceive most directly and the one most likely to spread on its own, their presence has been gradually fading.
After the initial shock of Sora's release, OpenAI seems to have entered a long period of silence in this field.
Meanwhile, the other players at the table were not idle.
In the open-source ecosystem, models like Flux have completely broken down the barriers to high-quality local image generation.
On the commercial side, not only do established rivals maintain an extreme aesthetic barrier, but new players like Nano-banana, which come with built-in online search capabilities, have also emerged.
In comparison, OpenAI's flagship image-generation model, GPT-Image-1.5, already looks dated.
Its image quality is mediocre, its layouts are rigid, and it frequently breaks down when asked to render complex text.
Gradually, a consensus formed within the industry:
OpenAI has encountered technical bottlenecks in visual generation and is struggling to keep up with the competition from various rivals.
Until a few weeks ago, the turning point appeared in a very subtle way.
A mysterious image model codenamed "Duct Tape" quietly appeared on LM Arena, the well-known blind-testing platform for large models.
Users who participated in the blind test quickly realized that something was wrong:
This model not only has extremely precise control over extreme aspect ratios, but can also output flawlessly laid-out posters containing large amounts of multilingual text. It even seems to run an invisible logical planning process before producing the image.
For a time, various technical communities were speculating about which company had secretly launched this major move, but OpenAI remained silent.
The truth finally came out early this morning.
Without a lengthy launch event or overwhelming marketing hype, OpenAI simply gave the model codenamed "Duct Tape" its official name, GPT-Image-2, and shipped it straight into ChatGPT and to market.
Also released was a somewhat suffocating Text-to-Image arena leaderboard.
GPT-Image-2 debuted at number one with an impressive score of 1512, leading the second-place model (Nano-banana-2, which features online search capabilities) by a full 242 points.
In large-model benchmarking, top models are usually separated by razor-thin margins, and even a single-digit lead gets treated as a big deal.
A lead of 242 points is unprecedented in the history of the arena.
This is not a minor version iteration; it's a brutal generational leap.
I spent most of the day carefully going through its various extreme capabilities and the latest API interface documentation.
My biggest feeling is just this:
OpenAI is still the same OpenAI.
When it decided to reclaim its lost ground, it did so by simply overturning the old card table.
Faced with this model, the visual design work we thought would take another two or three years to be fully replaced by AI has, for all practical purposes, reached that point today.
PART.01 Image Generation: From Models to Visual Agents
To understand why GPT-Image-2 can achieve such a dramatic score difference, we must first discard our preconceived notions about text-based image models.
Previously, when we used AI to draw pictures, it was essentially like opening a blind box. We would throw in a few prompt words and wait for it to arrange the pixels into roughly the shape we wanted.
But GPT-Image-2 is more like an intelligent agent with a built-in visual engine.
The most obvious change is that, mechanically, it splits into two completely different modes.
One is Instant Mode, which is open to all users.
This mode emphasizes rapid response and seamless integration with daily life and work workflows.
For example, if you send it a prompt from your phone, it can hand you a finished image within seconds.
It has extremely strong underlying visual understanding, but it mainly addresses high-frequency, one-off visual needs.
The other is Thinking Mode, available to paying users.
Before it actually starts rendering even a single pixel, it first goes through a period of logical reasoning and network search that lasts for more than ten seconds.
This mode solves a crucial yet extremely difficult problem:
For the first time, the model truly knows what it is supposed to draw.
To give the most intuitive example.
You type the following in the dialog box:
Please make a poster for me. Search online for people's opinions on the mysterious Duct Tape model and include the ChatGPT QR code.
If you use the old model, it has no idea what netizens are saying; it will only draw a poster with garbled characters and fake text, and the QR code is also a fake sticker that cannot be scanned.
However, in the thinking mode, its workflow is as follows:
It first pauses the drawing, fires up the web search tool, and pulls real comments from netizens on Reddit, Threads, or LinkedIn;
then it plans the poster's layout, white space, and font hierarchy;
finally, it generates a real, scannable QR code and renders the entire image.
This is no longer just drawing; it's actually a one-stop shop for independently completing research, planning, copywriting extraction, and layout design.
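If you wanted to reproduce roughly this workflow by hand, outside the model, it might look like the sketch below. This is purely illustrative: search_web and plan_layout are placeholder functions standing in for steps the model performs internally, and only the qrcode and Pillow calls are real library usage.

```python
# Purely illustrative sketch of the search -> plan -> QR -> render workflow described above.
# search_web() and plan_layout() are placeholders for steps the model runs internally;
# only the qrcode and Pillow calls are real library usage.
import qrcode                        # pip install qrcode[pil]
from PIL import Image, ImageDraw

def search_web(query: str) -> list[str]:
    # Placeholder: Thinking Mode performs this research step inside the model.
    return ["Example comment about Duct Tape pulled from Reddit",
            "Another example comment pulled from Threads"]

def plan_layout(comments: list[str]) -> dict:
    # Placeholder: decide headline, white space, and where the QR code sits.
    return {"headline": "What people are saying about Duct Tape", "qr_pos": (790, 790)}

def build_poster() -> Image.Image:
    comments = search_web("Duct Tape image model reviews")    # step 1: research
    layout = plan_layout(comments)                             # step 2: layout planning
    poster = Image.new("RGB", (1024, 1024), "white")
    draw = ImageDraw.Draw(poster)
    draw.text((40, 40), layout["headline"], fill="black")      # step 3: copy extraction
    for i, comment in enumerate(comments):
        draw.text((40, 120 + 40 * i), comment, fill="gray")
    qrcode.make("https://chat.openai.com").save("qr.png")      # step 4: a real, scannable QR code
    poster.paste(Image.open("qr.png").resize((200, 200)), layout["qr_pos"])
    return poster

build_poster().save("duct_tape_poster.png")
```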
A parallel comparison is needed here.
Those who follow the large-model community know that image-generation models with built-in web search were not invented by OpenAI.
Nano-banana, which ranks second on the leaderboard, already has this mechanism.
However, when actually using Nano-banana, you'll find that it seems a bit clunky in many ways.
Nano-banana's "thinking" is often mechanical and piecemeal.
For example, if you ask it to search for an industry trend to make a poster, it will indeed search for it, but usually it just awkwardly cuts out a sentence from Wikipedia and forcibly pastes it onto the image.
It easily gets lost when faced with instructions that require interpreting abstract business demands.
The feeling is like working with an intern who understands what you say but has no work experience whatsoever: they know how to execute but are completely clueless about strategy.
However, GPT-Image-2's performance in this regard can only be described as exaggerated.
Its thinking is not just going through the motions; it reflects a genuine understanding of the underlying cultural context and business intent.
During the test, I entered a very simple Chinese command: "Draw me a screenshot of Elon Musk selling Doubao in a Douyin livestream."
An old drawing model would most likely produce a white man who looks like Elon Musk holding a literal steamed bun against a blurry background, with no trace of what the Douyin interface actually looks like.
However, in Thinking Mode, the results from GPT-Image-2 are somewhat alarming.
It didn't simply piece together elements; instead, it autonomously drew upon its understanding of the Chinese internet to generate a screenshot of a Douyin livestream UI that was virtually a pixel-perfect replica.
The image not only features a realistic Elon Musk holding a neatly formatted promo board for the Doubao AI assistant; even more chilling are the details that were never mentioned in the prompt:
The top left corner features a "Follow" button and an hourly ranking; the top right corner shows 10.236 million online users; a standard product card pops up at the bottom, complete with a crossed-out price of 99, a special price of 69, and a "Buy Now" button with a countdown.
What's most chilling is the incredibly realistic scrolling of comments from netizens in the bottom left corner:
Tech newbie: What is Doubao? Is it useful?
Stars and the Sea: Support Musk! Support domestically developed AI!
No one told it what to write in the comments, what the product UI should look like, or how to set the price.
After analyzing just two tags, Douyin e-commerce and the Doubao large model, the model devised and executed a complete commercial UI design and operations plan on behalf of a human.
At this moment, the evaluation dimensions of large models in image generation have officially shifted from simply whether they can draw beautifully to whether they understand strategies and layout logic.
PART.02 Real-world testing of core capabilities
To test its limits, I tried it out using several high-frequency and complex scenarios, following the standards of commercial design.
The results showed that its problem-solving granularity was alarmingly fine.
First scenario: Visual understanding and business closed loop (dressing up a model)
In traditional e-commerce visuals or fashion planning, the execution cost from having an idea to seeing the effect of wearing the product is extremely high.
You need to find models, borrow clothes, set up a studio, and do post-production retouching.
Later, with the advent of AI, people began training LoRA models to lock in a specific face, but that still required dozens of images and a considerable learning curve.
In GPT-Image-2, this process is compressed to the extreme.
I tried uploading a casual selfie to it, telling it I was going on a beach vacation next month and asking it to help me put together a few outfits.
It first gave me eight summer outfits in completely different styles, with a layout that looked like a professional e-commerce lookbook, and each item even had the correct text label next to it.
More importantly, it accurately analyzed my facial features and body proportions in that instant.
When I told it I wanted to see how the first outfit looked and gave it some detailed pictures from different angles, it immediately identified the person in my selfie, dressed them in the summer outfit, and output pictures from different perspectives, including side and half-body shots.
This transition was remarkably smooth, which means the moat around basic outfit styling, try-on rendering, and outsourced model-fitting work has essentially been wiped out.
Second scenario: Resolving consistency and continuous narrative (generating comics in one sentence)
Anyone who has worked with AI-generated images knows that it's not difficult to get AI to draw a beautiful image, but it's difficult to get it to draw ten images of the same person, with the poses and perspectives remaining consistent.
This is the so-called consistency problem.
However, in this actual test, I saw a case that completely contradicted past experience.
You can simply upload a photo of you and your friend from yesterday, and then enter a very simple prompt:
Make us the main characters, draw a three-page Japanese-style comic, and decide the plot yourself.
A few seconds later, it directly output three pages of black and white comics with standard panel layout.
The most terrifying thing is the two characters, based on real people, across all the panels of those three pages:
Whether in a close-up, a long shot of them running, or a shot from behind, their facial features, hairstyle details, and even the wrinkles in their clothes remain perfectly consistent.
Even more outrageous, the plot of the comic is completely coherent, and the speech-bubble text forms a complete, logical story.
This kind of spatial and temporal consistency indicates that it has moved beyond single-image generation and now possesses the directing ability needed for continuous narrative.
The third scenario: Overcoming the final hurdle in text rendering (multilingual typography)
If consistency solves the narrative problem, then the accurate rendering of multilingual text truly puts graphic designers in a corner.
Previously, if there was even a little text in the image, the large model would start to scribble nonsense.
Because the model understands text as tokens (semantic chunks) while the generated image is made of pixels, the two were previously disconnected.
GPT-Image-2 completely solves this problem.
I had it generate a French fashion magazine cover, a Japanese restaurant menu full of hiragana and kanji, and even a densely typeset Russian annotation.
Every one came out print-ready on the first try, with zero spelling errors.
What's most disheartening is that it not only writes the words correctly, but it also knows how to match the local cultural aesthetics and font design according to the language.
For example, the kanji in the Japanese menu use very authentic retro Japanese display typefaces, and the hiragana layout follows Japanese vertical reading conventions.
Layout design used to be a private domain for graphic designers.
Adjusting letter spacing, establishing text hierarchy, and balancing text against the background all take years of practice.
But when AI can process so many languages with zero errors and has advanced typography aesthetics, then everyday posters, brochures, and news feed ads will no longer require people to manually draw reference lines for alignment.
The fourth scenario: Extreme aspect ratios and microscopic control (writing on a grain of rice).
Finally, to see just how terrifyingly obedient it is, I gave it a few very tricky commands.
I first tested its extreme aspect ratio.
Traditional diffusion models are extremely vulnerable to non-standard proportions.
Previously, if you stretched the image slightly, two heads would appear in the picture.
However, when I asked GPT-Image-2 to generate a 3:1 ultrawide image and a 1:3 vertical image, not only did it not break down, it even produced a 360-degree panorama that wraps around seamlessly, end to end.
When I added "shot on a disposable camera in 2015" to the prompt, even the old lens's distortion and the harsh flash reflections on the wall were faithfully reproduced.
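For developers, the call itself would presumably look no different from today's Images API. The sketch below follows the existing OpenAI Python SDK pattern; the model name is taken from this article, and the 3:1 size string is my assumption rather than a documented parameter.

```python
# Hedged sketch: requesting an ultrawide image through the OpenAI Python SDK.
# The call pattern mirrors the existing Images API; the model name comes from this
# article, and the 3:1 size value is an assumption, not a confirmed parameter.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-2",          # assumption: name as reported in this article
    prompt=(
        "A photo taken on a disposable camera in 2015: a seaside boardwalk at dusk, "
        "old-lens distortion, harsh flash reflection on the wall"
    ),
    size="3072x1024",             # assumption: a 3:1 ultrawide canvas
)

# gpt-image models return base64 image data rather than URLs
with open("ultrawide.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```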
Another way to better demonstrate its microscopic control is through a somewhat crazy rice grain test that the official team showcased at the launch event.
The researchers used the experimental 4K API, which is still in beta testing. They didn't use any fancy terms like macro photography or 8K ultra-high definition; they simply gave a very abstract, plain-language instruction:
A pile of rice. On one of the individual grains of rice in this pile, it says GPT Image 2.
When you magnify the image dozens of times on screen, right up to the point where pixelation sets in, you can actually find that one tiny grain in the pile with the words written on it.
The texture of this grain of rice still conforms to the laws of physics, and the text is precisely embedded on the surface following the tiny curve of the rice grain.
All the remaining work was completed automatically by the model in Thinking Mode: calling up a macro perspective, calculating depth of field, locating the grain's physical coordinates in latent space, and printing the words onto its surface.
This case vividly demonstrates that the model's understanding of spatial location has achieved pixel-level surgical precision.
This means that in future work you can modify any tiny detail of a design draft exactly where you point, instead of the old situation where trying to change a collar redrew the entire design.
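Mechanically, "change only what you point at" maps cleanly onto the mask-based edit endpoint that already exists in the Images API. A minimal sketch, assuming the new model is exposed through the same endpoint; the model name and file names here are hypothetical.

```python
# Minimal sketch of point-and-change editing via the existing Images edit endpoint.
# The transparent region of the mask marks the only area to repaint; the model name
# and file names are hypothetical.
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.edit(
    model="gpt-image-2",                    # assumption: not a confirmed model id
    image=open("design_draft.png", "rb"),   # the full design draft
    mask=open("collar_mask.png", "rb"),     # transparent pixels = just the collar
    prompt="Change the collar to a navy mandarin collar; leave everything else untouched",
)

with open("design_draft_v2.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```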
PART.03 Some Technical Details
Such extreme control and strategic intelligence cannot be achieved simply by mindlessly piling on computing power.
To figure out what its trump card is, I did some probe tests on GPT-Image-2.
As a result, we discovered a very interesting point.
Although the official documentation states that the overall knowledge base of GPT-Image-2 has been updated to December 2025, in my actual tests...
The training-data cutoff for Instant Mode remains the end of May 2024;
Thinking Mode, the one that deliberates at length, has a native knowledge cutoff of roughly June 2024 (anything more recent it fetches through its real-time web access).
Based on these two time points, the underlying structure of GPT-Image-2 seems to be traceable.
Let's start with Instant Mode, the one built for high-frequency image output.
The May 2024 cutoff means it is very likely built directly on o4-mini, or on a lightweight member of the GPT-5 family (GPT-5 mini, or even the tiny GPT-5 nano).
It is precisely because these lightweight bases have such strong spatial planning and the ability to understand complex instructions that the upper-level image generation can remain stable and not fall into chaos.
That extremely intelligent, business-savvy Thinking Mode, on the other hand, cannot be sitting on the main GPT-5 model, whose knowledge cutoff is September 2024 and simply doesn't match.
Thinking Mode is much more likely wired to one of the O-series reasoning models (such as o4, or an updated o3) that keep iterating in the background.
The large model first uses the O series' unique long-deliberation mechanism to calculate the business logic, audience psychology, and layout coordinates clearly in the latent space, and then hands it over to the visual module for final pixel rendering.
Of course, there is another possible path:
Given OpenAI's sophisticated compute-allocation machinery, the fast mode may fall back directly on GPT-5 nano, while Thinking Mode uses the slightly larger GPT-5 mini in combination with external tools.
But regardless of the underlying platform combination, if you've been following OpenAI's API ecosystem, you'll find that its underlying generation logic is completely different from Midjourney's.
PART.04 Pricing, the most important thing for everyone
But rather than guessing at the base model, developers and companies that actually want to wire it into their workflows should pay more attention to the very down-to-earth, and somewhat counterintuitive, API pricing table.
Previously, DALL-E 3 was charged per image (e.g., $0.04 per image).
However, starting with the first generation GPT-Image-1, OpenAI completely changed it to a token-based billing framework.
This time, GPT-Image-2 continues this standard, and not only that, it also offers more features at a lower price.
According to the official pricing table just released, the price per million tokens is as follows.
GPT-Image-2 image tokens: input $8.00, cached input $2.00, output $30.00.
For comparison, the previous generation GPT-Image-1.5 charged $32.00 per million output tokens.
The new model is actually cheaper.
Let's do the math.
In the past, generating a high-quality image required approximately 1,000 to 1,500 output tokens.
Based on a price of $30 per million output tokens, the actual cost of generating a single image is approximately $0.03 to $0.045 (roughly 0.2 to 0.3 RMB).
If you don't need instant responses and instead use the official Batch API mode, the price will be halved (output drops to $15.00).
At that rate, a single image can cost as little as about 1.5 US cents (roughly 0.1 RMB).
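A quick sanity check of those numbers, using only the per-million-token prices quoted above and the 1,000 to 1,500 token estimate per image:

```python
# Back-of-the-envelope cost check using the figures quoted above (USD per million tokens).
PRICE_OUTPUT = 30.00         # GPT-Image-2 output tokens
PRICE_OUTPUT_BATCH = 15.00   # Batch API, roughly half price

for tokens in (1_000, 1_500):                      # estimated output tokens per image
    standard = tokens / 1_000_000 * PRICE_OUTPUT
    batch = tokens / 1_000_000 * PRICE_OUTPUT_BATCH
    print(f"{tokens} tokens -> ${standard:.3f} standard, ${batch:.4f} via Batch API")

# 1000 tokens -> $0.030 standard, $0.0150 via Batch API
# 1500 tokens -> $0.045 standard, $0.0225 via Batch API
```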
The per-image price is already quite competitive, but the real killer feature is the cached input line in the pricing table.
In the past, when drawing comic strips or designing posters in the same series, you had to re-upload a pile of character reference images, recaps of earlier pages, and long prompts every time you regenerated, which was extremely costly.
However, under the current token-based billing model, if you have it generate eight consecutive comic pages in one go, the visual elements of the first image are cached directly as context.
From the second image onward, the input cost drops from $8.00 to $2.00 per million tokens (only 25% of the original price).
This means that its marginal cost will drop sharply when performing large-scale commercial batch drawing production or continuous generation requiring extremely high role consistency.
The smarter the model and the more drawings are made, the lower the cost per drawing.
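To make the caching effect concrete, here is an illustrative calculation for an eight-page series. The 20,000-token reused context and 1,500 output tokens per page are assumed figures; only the per-million-token prices come from the table above.

```python
# Illustrative only: the effect of cached-input pricing on an eight-page comic series.
# The reused-context and per-page output token counts are assumptions; the prices
# per million tokens come from the table quoted above.
PRICE_INPUT = 8.00          # fresh input tokens
PRICE_INPUT_CACHED = 2.00   # cached input tokens
PRICE_OUTPUT = 30.00        # output tokens
CONTEXT_TOKENS = 20_000     # assumed: character refs + long prompt reused on every page
OUTPUT_TOKENS = 1_500       # assumed: tokens per generated page
PAGES = 8

def series_cost(cache_hits: bool) -> float:
    total = 0.0
    for page in range(PAGES):
        cached = cache_hits and page > 0   # the first page always pays the full input price
        input_rate = PRICE_INPUT_CACHED if cached else PRICE_INPUT
        total += CONTEXT_TOKENS / 1e6 * input_rate + OUTPUT_TOKENS / 1e6 * PRICE_OUTPUT
    return total

print(f"Without caching: ${series_cost(False):.2f}")  # every page re-pays $8/M for the same context
print(f"With caching:    ${series_cost(True):.2f}")   # pages 2-8 pay $2/M for it instead
```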
This industrialized billing logic is what truly drives assembly-line artists to desperation.
PART.05 Behind the Scenes Team Revealed
Finally, let's look at the OpenAI internal vision dream team shown on stage in the launch livestream. With them in view, many of the capabilities that seemed outrageous earlier suddenly make perfect sense.
For example, how exactly does it solve the problems of complex multilingual typesetting and gibberish?
This is inseparable from Gabriel Goh, a senior scientist on the team.
In academia, he is best known as the core author of the groundbreaking multimodal model CLIP.
CLIP laid the foundation for modern AI to understand how human language and image pixels correspond.
With this scholar leading the team on cross-modal semantic mapping, GPT-Image-2 is no longer just guessing the shape of text, but is actually writing text at the pixel level.
For example, how can it understand three-dimensional spatial relationships, even create 360-degree panoramic images with extreme aspect ratios, and understand the macro light and shadow on a grain of rice?
This is thanks to another core member, Alex Yu.
Before joining OpenAI, he was the co-founder and former CTO of Luma AI, a star startup in the field of 3D generation, and a top scholar who dedicated himself to 3D neural rendering (NeRF, etc.).
With him around, GPT-Image-2 has actually transcended the traditional 2D pixel smearing.
It's very likely that it first creates a 3D scene in its mind, sets up the lighting, and then renders an accurate 2D slice for you.
How was such incredible consistency achieved across multiple pages of comics?
This corresponds to the young duo on the team who had just graduated from MIT CSAIL:
Boyuan Chen and Kiwhan Song.
Their core research areas in academia are called World Models and Embodied Intelligence.
Teaching machines to understand how the physical world works, and ensuring that characters maintain completely consistent features and do not deform in different time and space scenes, is precisely the problem that these two scholars have been trying to solve.
Finally, we have Nithanth Kudige (a key author of the O-series reasoning models) and Kenji Hata (a former Google researcher out of the Stanford Vision Lab), who have been dedicated to bridging the gap between large-scale reasoning models and the underlying logic of vision.
When this group of people come together, the underlying logical reasoning, 3D spatial rendering, perfect alignment of text and images, and the laws of the physical world are naturally stitched together into the same model.
PART.06 Boundaries of GPT-Image-2
Every model has boundaries.
OpenAI itself admits that it still struggles with certain extreme cases.
For example, origami guides that require precise physical spatial flipping, solving Rubik's Cubes, or highly repetitive details like extremely dense sand grains will still push its capabilities to the limit.
However, in the context of commercial applications, this is an extremely minor flaw.
For the design industry as a whole, there is no need to sell anxiety; this does not mean the demise of aesthetics.
People with good taste, business acumen, and strategic thinking can still use it to create excellent products.
However, the objective fact is that the moat protecting designers as a profession has been substantially eroded.
In the past, you could make a living by memorizing design-software shortcuts, knowing how to align type horizontally and vertically, how to set text by language, and how to do meticulous retouching and cutouts.
That will be difficult going forward, because skills that once commanded a clear market price have become basic capabilities anyone can invoke with a single sentence, essentially for free.
After a period of silence, OpenAI has once again demonstrated, in a very calm but extremely powerful way, who truly holds the cards at this poker table.
The old execution toolchain is breaking down, and the question left to the industry is no longer whether AI will replace us, but how we should adapt to this completely new production line.

