A 10,000-word analysis of Nvidia's 20-year rise: from two gaming graphics cards to a trillion-dollar empire.

  • 2006: NVIDIA develops CUDA, a risky investment that later becomes key for AI computing.
  • 2012: AlexNet's use of GPUs in ImageNet competition revolutionizes AI with GPU acceleration.
  • 2017: Volta architecture introduces Tensor Core, focusing on matrix multiplication and mixed precision.
  • 2018: Turing adds RT Core for ray tracing and DLSS, integrating AI into graphics.
  • 2020: Ampere features TF32, structural sparsity, and MIG for unified AI training and inference.
  • 2022: Hopper's Transformer Engine and FP8 enable large language models like GPT.
  • 2024: Blackwell supports FP4 with micro-tensor scaling for higher efficiency.
  • 2026: Rubin focuses on Agentic AI with innovations like HBM4 and Vera CPU.
  • Key factors: Jensen Huang's foresight, CUDA ecosystem, NVLink, and engineering prowess.
Summary

Author: Godot

Our story begins with a competition.

Fei-Fei Li was formerly a vice president at Google and chief scientist for Google Cloud AI/ML, as well as a professor at Stanford University. But she also has another identity—the founder of the ImageNet competition.

The ImageNet competition, officially known as ILSVRC (ImageNet Large Scale Visual Recognition Challenge), is the most influential academic competition in the field of computer vision.

In the 2012 ImageNet competition, Alex Krizhevsky, a student of Turing Award winner Geoffrey Hinton, shocked the world by reducing the image recognition error rate from 26% to 15.3% with the AlexNet neural network, leading the second place by an astonishing 10.8 percentage points.

The key point is that AlexNet did not use a supercomputer, but was trained using only two ordinary NVIDIA GTX 580 gaming graphics cards. This was the first time that AI used GPU acceleration on a large scale. Before this, training mainly relied on CPUs.

This result is tantamount to announcing to the world: AI deep learning + GPU = computing power revolution.

As researchers turned their attention to GPUs, they discovered that only NVIDIA's CUDA allowed them to write complex algorithms in a C-like language.

Jensen Huang's "Ten-Year Gamble"

Let's rewind to 2006. Back then, the GPU had only one responsibility: rendering game graphics.

But Jensen Huang wanted to make GPUs general-purpose computing tools. He firmly believed that Moore's Law was nearing its end for CPUs, and that the future of serial computing would inevitably be parallel computing.

So in 2006, Ian Buck, who had pioneered GPU computing with the Brook language at Stanford, led the development of CUDA (Compute Unified Device Architecture). At the time, however, almost nobody knew what it was for.

To support CUDA, NVIDIA embedded additional dedicated computing circuitry in every GPU chip. This meant increased chip area, higher power consumption, lower yields, and soaring costs.

Aside from a very small number of researchers, nobody bought into it. Before the explosion of deep learning, Nvidia even proactively sent graphics cards to top labs around the world for free and sent engineers to assist with optimization.

CUDA cost Nvidia approximately $500 million annually in R&D, while Nvidia's annual profit at the time was only a few hundred million dollars. The 2008 financial crisis caused Nvidia's stock price to plummet.

Despite the pressure of plummeting stock prices, Jensen Huang persevered for a full decade. He firmly believed that GPUs were not merely for rendering game graphics, but rather general-purpose parallel processors.

At that turning point in 2012, Intel was still busy maintaining its CPU dominance. Intel had long been convinced of the versatility of CPUs and believed that neural networks were just a passing fad. Even if computation was required, it could be solved by extending the CPU instruction set (such as AVX).

At the time, AMD was deeply mired in the growing pains of its ATI acquisition and was extremely stingy with software investment, which is why its AI software stack, ROCm, still lags behind CUDA in ease of use and stability to this day.

In the summer of 2012, Alex Krizhevsky was struggling with the millions of images in the ImageNet competition, finding that his CPUs couldn't handle them. He discovered that CUDA was extremely useful, wrote several thousand lines of code in its C-like language, and ran it on two GTX 580 GPUs.

The results sent shockwaves through the global academic community. Experiments that would normally take weeks to complete yielded results in just a few days on GPUs, with a significantly higher accuracy.

Abandoning mobile internet and fully shifting to GPU computing

In 2013, at the GTC conference, Jensen Huang made a decision that seemed almost crazy at the time: to shift the company’s focus entirely to GPU computing.

That was the golden age of mobile internet, with the smartphone wave at its peak. Although Nvidia suffered setbacks in the mobile phone market, it did not stubbornly cling to the mobile phone chip field. Instead, it decisively redirected all its resources back to bet on data center acceleration computing, which was still a very niche market at the time.

Around the same time, CUDA entered the 5.0/5.5 era, introducing Dynamic Parallelism, which allows the GPU to launch new work on its own without round-tripping through the CPU, significantly reducing communication latency.

Meanwhile, NVIDIA quietly began developing cuDNN, a CUDA library designed specifically for deep neural networks. It encapsulates the hardest-to-write convolution algorithms inside the underlying library, letting developers complete the operation with a single call.

However, on an AMD graphics card, the same functionality would require writing hundreds of lines of complex low-level code.
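The gap between hand-rolled loops and a tuned library call can be illustrated in miniature with plain NumPy. This is a sketch only: cuDNN's real kernels are hand-tuned GPU code, and the function names here are our own.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_naive(img, kernel):
    """Direct 'valid' cross-correlation with explicit loops."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

def conv2d_lib(img, kernel):
    """The same operation as a single vectorized call."""
    windows = sliding_window_view(img, kernel.shape)
    return np.einsum('ijkl,kl->ij', windows, kernel)

rng = np.random.default_rng(0)
img = rng.standard_normal((6, 6))
kernel = rng.standard_normal((3, 3))
assert np.allclose(conv2d_naive(img, kernel), conv2d_lib(img, kernel))
```

The two functions produce identical results; the point is that the second one is a single call into optimized library code, which is exactly the convenience cuDNN offered.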

The fierce competition among deep learning frameworks began around 2014. In 2015, Google open-sourced TensorFlow, and NVIDIA immediately deployed a large number of engineers to the open-source community to continuously optimize CUDA compatibility. When TensorFlow 1.0 was released, its performance on NVIDIA graphics cards was several times higher than on AMD cards.

"Buy Nvidia graphics cards" has begun to become an industry consensus.

Today, CUDA has evolved from a development tool into an industry-standard language. Hundreds of millions of AI codebases on GitHub rely on CUDA primitives, and almost all university courses are based on CUDA. This means that the new generation of engineers are already "natives" of the NVIDIA ecosystem before they even graduate.

On top of CUDA, there is a vast system of middleware and libraries.

A. cuDNN and cuBLAS

These deep neural network and linear algebra libraries have undergone more than a decade of hand-tuned, assembly-level optimization.

B. TensorRT

The inference optimization engine can automatically fuse operators, select the best kernel, and perform quantization calibration. After entering the Blackwell era, TensorRT-LLM has become the standard for deploying large language models, directly supporting extreme optimization of FP4/FP8, which is hard for competitors to match.

C. Triton Inference Server

It has become the de facto standard for cloud-native AI inference.

Jensen Huang, Elon Musk, OpenAI, "Attention Is All You Need"... in 2017, the gods of AI all appeared on stage.

In 2017, NVIDIA's Volta architecture was born, and the flagship product Tesla V100 was released. Tensor Cores appeared on this chip for the first time.

From this moment on, AI computing moved beyond vector operations and entered the era of matrix operations. AI computing power exploded; this was its breakout year.

Back in 2016, Jensen Huang personally delivered the world's first DGX-1, an AI supercomputer built around eight of NVIDIA's accelerator cards, to the then little-known OpenAI office.

Thus, the famous photograph was born. The person with his arms crossed in the photo is none other than Elon Musk, a co-founder and early funder of OpenAI. This machine later became the "ancestor" of the GPT series of models.

In 2017, a seemingly unrelated but actually pivotal event occurred that shaped today's landscape: Google published the paper "Attention is All You Need," which introduced the Transformer architecture.

This paper laid the foundation for today's large language models, completely changed the way AI processes information, and directly led to the creation of later large models such as ChatGPT, Claude, and Gemini.

The computations in the Transformer architecture consist almost entirely of matrix multiplication, making it extremely greedy in its demand for computing power.

Matrix multiplication, does it sound familiar? That's right, NVIDIA's Tensor Cores were designed specifically for matrix multiplication.

Thus began the "Cambrian Explosion" of AI.

From a macro perspective, Nvidia's dominant position is built on three pillars:

1) Tensor Core Architecture

It has achieved a leap from vector computation to matrix computation, and from general-purpose computation to deep learning-specific computation.

2) CUDA Software Ecosystem

NVIDIA's deepest moat is not just a programming language, but also a vast collection of libraries and tools, including cuDNN and cuBLAS, which makes migration extremely costly.

3) NVLink interconnect technology

A bridge for collaboration between GPUs.

To put it simply, the relationship between the three is as follows: Tensor Core is hardware innovation, CUDA is the software ecosystem, and NVLink is the interconnect channel, corresponding to performance, ecosystem, and composability, respectively.

Tensor Cores are the key to Nvidia's true dominance over its competitors and its establishment of AI supremacy. Without understanding Tensor Cores, one cannot understand modern AI chips.

Tensor Cores mark a complete transformation of GPUs from graphics rendering devices into dedicated AI computing platforms, sacrificing versatility in exchange for extreme performance in matrix multiplication, a core AI computation.

What is a Tensor Core?

Tensor Core can be further broken down into three core concepts:

1) Matrix Multiplication 2) Mixed Precision 3) Architectural Evolution

1) Matrix Multiplication

The shift from vector computation to matrix computation is the core logic behind Tensor Core's performance leap.

Traditional CUDA Cores perform scalar or vector operations, such as A + B. Even with concurrent execution, each cycle can only process a limited number of data points.

Tensor Cores are DSA (Domain Specific Architecture) modules embedded inside the GPU, which is equivalent to embedding ASIC-level dedicated acceleration units inside a general-purpose GPU architecture.

Tensor Cores are not designed to execute all types of instructions, but rather specialize in a specific operation—matrix multiplication and accumulation, i.e., D = A × B + C.

In layman's terms, vector calculation is like issuing calculation instructions line by line; matrix calculation, on the other hand, directly outputs an entire table (4×4 matrix).
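The contrast can be sketched in NumPy. This is illustrative only; a real Tensor Core performs the fused multiply-accumulate D = A × B + C as a single hardware instruction on low-precision inputs.

```python
import numpy as np

A = np.arange(16, dtype=np.float32).reshape(4, 4)
B = np.ones((4, 4), dtype=np.float32)
C = np.zeros((4, 4), dtype=np.float32)

# Vector style: one dot product (row times column) at a time
D_vec = np.zeros((4, 4), dtype=np.float32)
for i in range(4):
    for j in range(4):
        D_vec[i, j] = np.dot(A[i, :], B[:, j]) + C[i, j]

# Matrix style: the entire 4x4 result in one fused expression
D_mat = A @ B + C
```

Both paths compute the same table of numbers; the hardware advantage comes from producing the whole 4×4 block at once instead of issuing sixteen separate dot products.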

2) Mixed Precision – The Art of Blurring

The essence of AI is probability, not certainty.

When deciding whether an image contains a cat or a dog, a confidence of 98.0001% versus 98.0000000001% makes no practical difference. The choice of numerical precision, however, has a drastic impact on computing efficiency.

Mixed precision means using the lowest possible precision to achieve maximum efficiency without compromising the accuracy of the results.

A. How to measure accuracy?

Here we need to introduce a concept: FP, short for Floating Point.

Internally, a computer constructs all numbers using 0s and 1s (bits). A floating-point number typically consists of three parts:

1) Sign bit: Indicates whether the number is positive or negative. 2) Exponent: Determines the range of the number's magnitude. 3) Mantissa/Fraction: Determines the precision of the number, i.e., how many decimal places there are.

A common example is FP32, which uses 32 bits to record a number, making it extremely accurate but requiring a large amount of space.

FP16 halves the space and doubles the speed, but the accuracy and range are reduced accordingly; FP4 is extremely low precision, similar to pixel art, and can only record very blurry values.

In computer science, this is essentially about finding the optimal solution between effective information content (information entropy), computational throughput, and numerical stability.

B. How does mixed precision work?

a. Precision degradation

During computation, the Tensor Core forces the original 32-bit input to be converted to 16 bits.

FP32: 1 sign bit + 8 exponent bits + 23 mantissa bits. FP16: 1 sign bit + 5 exponent bits + 10 mantissa bits.

The mantissa was reduced from 23 to 10, which reduced the computational burden by more than four times during the matrix multiplication stage.
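The precision loss is easy to see in NumPy: 1/3 has an infinite binary expansion, so every format must round it, and the 10-bit mantissa of FP16 rounds far more coarsely than the 23-bit mantissa of FP32.

```python
import numpy as np

x = 1.0 / 3.0                 # infinite binary expansion: every format rounds
x32 = np.float32(x)           # 23 mantissa bits
x16 = np.float16(x)           # 10 mantissa bits

err32 = abs(float(x32) - x)
err16 = abs(float(x16) - x)   # roughly four orders of magnitude larger
```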

b. Cumulative protection

This is the most ingenious aspect of the Tensor Core design.

The input is FP16, but the accumulation uses FP32—note that the addition uses FP32.

The reason is that small errors are safe when multiplying, but if tiny values are continuously discarded across tens of thousands of additions, the error amplifies rapidly. By accumulating at high precision, NVIDIA ensures the accuracy of the final result.
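A NumPy sketch shows the failure mode directly (illustrative, not Tensor Core code): an FP16 accumulator silently stops growing once each new term falls below half a unit in the last place, while an FP32 accumulator over the same FP16 inputs stays accurate.

```python
import numpy as np

tiny = np.float16(1e-4)

# FP16 accumulator: once the running sum reaches ~0.25, each new term
# is below half a unit-in-the-last-place and gets rounded away
acc16 = np.float16(0.0)
for _ in range(20000):
    acc16 = np.float16(acc16 + tiny)

# FP32 accumulator over the same FP16 inputs (the Tensor Core approach)
acc32 = np.float32(0.0)
for _ in range(20000):
    acc32 = acc32 + np.float32(tiny)
```

The true sum is about 2.0; the pure-FP16 loop stalls at roughly 0.25, while the FP32 accumulator lands within a fraction of a percent.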

c. Loss scaling – combating underflow

In AI training, if FP16 is used throughout, the model will crash. This is because some key data are extremely small, and FP16 simply cannot represent them; this problem is called underflow.

The solution is to multiply the loss value by a huge coefficient (such as 1024) before calculation, forcibly pushing these tiny gradients back into the effective range that FP16 can express. After calculation, it is then divided by 1024 to restore the original value.
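The trick can be demonstrated in a few lines of NumPy (a sketch of the idea; real frameworks apply the scale to the loss before backpropagation and adjust it dynamically):

```python
import numpy as np

grad = 1e-8                        # a tiny FP32 gradient
naive = np.float16(grad)           # underflows to zero in FP16

scale = 1024.0
scaled = np.float16(grad * scale)  # pushed back into FP16's representable range
recovered = float(scaled) / scale  # unscale afterwards at higher precision
```

Without scaling the gradient is simply lost; with scaling it survives the FP16 round trip to within a fraction of a percent.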

C. The Limits of Mixed Precision – Microscaling Format (MX)

NVIDIA V100 supports FP16, H100 supports FP8, and B200 further reduces it to FP4.

While FP4 is significantly faster than FP16, it can represent only 2⁴ = 16 distinct values. Considering that an image contains far more than 16 color values, an AI restricted to naive FP4 would be unable to distinguish Van Gogh's "Sunflowers" from "Starry Night."

Therefore, in the Blackwell architecture, NVIDIA introduced the Microscaling Format, whose core idea is block floating point.

In layman's terms, within the same vector block of an AI network, the numerical values are often of similar magnitude. Instead of scaling each value individually, it is better to process them in batches: find the value with the largest absolute value in the batch, and use it to determine a common scaling factor.

The most challenging situation is when a block of data contains one very large value while the rest are tiny.

It's like a photograph that contains both the sun and a faint firefly. In certain layers of the AI Transformer, this kind of "outlier" appears often.

This is precisely why Nvidia did not completely abandon FP8 and FP16 in the Blackwell architecture, and why it invested heavily in smoothing out these outliers at the software level.
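A minimal sketch of block scaling (our own simplified symmetric int4 scheme, not NVIDIA's exact MX specification) shows both the benefit and the outlier problem:

```python
import numpy as np

def quantize_block_int4(block):
    # One shared scale per block, set by the largest absolute value
    scale = np.abs(block).max() / 7.0      # symmetric int4 range: -7..7
    q = np.clip(np.round(block / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Well-behaved block: values of similar magnitude survive quantization
block = np.array([0.12, -0.08, 0.05, 0.11], dtype=np.float32)
q, scale = quantize_block_int4(block)
good = dequantize(q, scale)

# Outlier block: the "sun" stretches the scale, crushing the "fireflies" to zero
outlier = np.array([100.0, 0.1, 0.05, -0.08], dtype=np.float32)
q2, scale2 = quantize_block_int4(outlier)
bad = dequantize(q2, scale2)
```

In the well-behaved block every value is recovered to within half a quantization step; in the outlier block the small values all collapse to zero, which is exactly the case the text describes.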

3) Architecture Evolution

Here's a very convenient way to remember it:

Volta was born—Ampere went mainstream—Hopper exploded—Blackwell is the hottest now.

The later the architecture, the lower the supported precision (the number after FP), the larger the matrix operations, and the more human-like the AI becomes.

2017 Volta (V100): An extremely risky gamble

The launch of Volta in 2017 marked a key turning point in Nvidia's development.

Prior to this, the Pascal architecture, such as the GTX 1080 Ti, primarily aimed to improve the visual appeal of games.

Starting with Volta, Jensen Huang made a decision that seemed extremely risky at the time but proved to be a stroke of genius in hindsight—to reduce precision in exchange for extreme AI computing efficiency, turning GPUs from general-purpose computing devices into dedicated AI platforms.

Before 2017, scientific computing fields such as weather simulation and nuclear explosion simulation required absolute accuracy, and everyone was competing on the computing power of FP32 single precision or even FP64 double precision.

But suddenly, AI exploded. And AI networks are surprisingly "noise-resistant".

Training AI is like teaching a child to recognize a cat. You don't need to tell the child that the cat's ears are 3.1415926 centimeters long; just saying "about 3 centimeters" is enough.

NVIDIA heavily promoted mixed precision on the V100: FP16 half precision is used for computation, while FP32 high precision is used for accumulation to prevent error buildup. It's like switching from regular script to cursive: the speed instantly doubles, while AI accuracy hardly decreases.

This was extremely risky at the time. To carve out a large area on an extremely expensive chip to create a dedicated circuit for matrix operations that were used by only a handful of people at the time was a very, very, very risky decision.

But Jensen Huang and Nvidia bet on the AI explosion, and they bet right.

This is why other competitors, such as Intel, have lagged behind to this day.

Turing (T4) in 2018 – A revolutionary advancement in game graphics: ray tracing and DLSS

Even at this point, the primary use case for chips was still game graphics rendering.

In 2018, Nvidia released the Turing architecture (RTX 2080 Ti). This was the first time in graphics card history that three completely different types of processors were packaged on the same silicon chip.

Let me first explain the background.

Prior to this, game graphics rendering used rasterization, which is essentially 2D texture mapping. Veteran gamers should be very familiar with this: water reflections, for example, are pre-drawn and then pasted on, so even when the player's perspective changes, the reflection remains perfectly still.

Ray tracing simulates the lighting and shadow effects of the real physical world. In the game, the light and reflections change in real time according to the player's perspective and the light source.

Ray tracing wasn't impossible before, but the computational load was too heavy, and the game would lag like a slideshow.

In the Turing architecture, there are three completely different types of processors: RT Core, CUDA Core, and Tensor Core.

1) RT Core (Ray Tracing Core)

This is a Turing innovation, specifically designed for ray-triangle intersection tests and BVH (bounding volume hierarchy) traversal. Its function is extremely singular, used solely for ray tracing calculations. By extracting these tedious geometric operations from the general-purpose core, efficiency is improved by tens of times.
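What an RT Core accelerates in hardware can be written out in software. Here is a plain-Python sketch of the classic Moller-Trumbore ray-triangle intersection test (just the geometry kernel, without the BVH part):

```python
import numpy as np

def ray_triangle_intersect(origin, direction, v0, v1, v2, eps=1e-8):
    """Moller-Trumbore: distance t along the ray to the triangle, or None on a miss."""
    e1, e2 = v1 - v0, v2 - v0
    p = np.cross(direction, e2)
    det = np.dot(e1, p)
    if abs(det) < eps:               # ray parallel to the triangle's plane
        return None
    inv_det = 1.0 / det
    s = origin - v0
    u = np.dot(s, p) * inv_det       # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = np.cross(s, e1)
    v = np.dot(direction, q) * inv_det
    if v < 0.0 or u + v > 1.0:
        return None
    t = np.dot(e2, q) * inv_det
    return t if t > eps else None

tri = [np.array(p, dtype=float) for p in [(0, 0, 0), (1, 0, 0), (0, 1, 0)]]
hit = ray_triangle_intersect(np.array([0.25, 0.25, 1.0]), np.array([0.0, 0.0, -1.0]), *tri)
miss = ray_triangle_intersect(np.array([2.0, 2.0, 1.0]), np.array([0.0, 0.0, -1.0]), *tri)
```

A rendered frame runs this test millions of times per second, which is why baking it into dedicated silicon pays off so dramatically.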

2) CUDA Core (General Purpose Computing Core)

It will continue to undertake traditional rasterization rendering tasks.

3) Tensor Core (Mixed Precision Computing Core)

Added support for INT8, INT4, and INT1, introducing low-precision inference capabilities and bringing Tensor Cores to consumer-grade graphics cards (the RTX 20 series) for the first time.

Hidden here is a great invention—DLSS (Deep Learning Super Sampling).

The logic is that ray tracing calculations are too laborious, so first render a 1080P image, and then use Tensor Cores to run a neural network to "fill in" the 1080P image into 4K.

This marks the first large-scale application of AI-generated content in the graphics field, proving that AI can become part of the traditional graphics pipeline.

Around 2018, traditional performance growth had reached its limit. Nvidia's aggressive push for ray tracing essentially redefined the standard for measuring the quality of graphics cards. Even if AMD or Intel wanted to follow suit, they lacked the efficient hardware like Tensor Cores to support it.

In other words, Nvidia has created a comprehensive blockade encompassing "algorithms + hardware + training data".

The combination of ray tracing and Tensor Cores has also unexpectedly opened the door to the metaverse and digital twins.

Since Tensor Cores can use AI to complete game visuals, could we "construct" a realistic 3D space directly from a few photos? This is NeRF (Neural Radiance Fields), a technology that has become very popular in recent years, enabling the generation of 3D models from video in just seconds.

Ampere (A100) in 2020 – the most successful AI chip in history

The term "usability revolution" perfectly encapsulates the A100. Before the A100, the computing field faced three problems: 1) Precision fragmentation: FP32 was too slow, and FP16 was too difficult to manage; 2) Computational power fragmentation: Training and inference cards were not interchangeable; 3) Resource fragmentation: Large models were underutilized, while small models were overloaded.

NVIDIA has made revolutionary improvements to the A100: 1) TF32 (TensorFloat-32) 2) Structural Sparsity 3) MIG (Multi-Instance GPU)

The combined effect of these three enabled a single chip to unify training and inference.

TensorFloat-32 (TF32)

This is a brilliant design. Remember how, as mentioned earlier, scenarios such as weather simulation, particle simulation, and nuclear explosion trajectory prediction all demanded high-precision computation?

TF32 allows developers who are used to writing high-precision FP32 code to enjoy Tensor Core's reduced-precision acceleration without modifying a single line of code.

TF32 is not a completely new storage format, but rather an intermediate format for computation.

Acceleration is achieved by "truncating" FP32, which is essentially a new mathematical format designed to balance computational accuracy and numerical range.

As mentioned earlier, any number inside a computer is composed of 0s and 1s (bits), and a floating-point number consists of three parts: a sign bit (positive or negative), an exponent (the number's range), and a mantissa (its precision). FP32 uses 32 bits and is extremely accurate but space-hungry; FP16 halves the space and doubles the speed at the cost of accuracy and range; FP4 is extremely low precision, like pixel art, and can only record very blurry values.

The brilliance of TF32 lies in combining the range of FP32 with the precision of FP16 to form a 19-bit format: 1 bit for the sign bit, 8 bits for the exponent (consistent with FP32), and 10 bits for the mantissa (consistent with FP16).

In other words, TF32 is the bridge between FP32 and FP16. Isn't that brilliant?!

Its workflow is as follows: the Tensor Core reads standard FP32 data from video memory, automatically truncates the mantissa from 23 bits to 10 bits in its hardware circuitry, converting the value to TF32 format; efficient multiplication is performed in this format; all intermediate products are then accumulated at FP32 precision; and the data written back to video memory is still standard FP32.
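The truncation step can be mimicked in NumPy by masking mantissa bits. This is a sketch of the idea only; the real conversion happens inside the Tensor Core's circuitry, whose exact rounding behavior may differ from plain truncation.

```python
import numpy as np

def to_tf32(x):
    """Keep sign (1 bit) and exponent (8 bits); truncate mantissa from 23 to 10 bits."""
    bits = np.float32(x).view(np.uint32)
    bits = bits & np.uint32(0xFFFFE000)   # zero the low 13 mantissa bits
    return bits.view(np.float32)

x = np.float32(np.pi)
t = to_tf32(x)   # slightly coarser than FP32, but same exponent range
```

Note that values whose mantissa already fits in 10 bits (like 1.0) pass through unchanged, and the relative error introduced is bounded by the 10-bit mantissa, exactly as with FP16, while the FP32 exponent range is fully preserved.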

More importantly, the conversion is completely automatic, and because TF32 keeps FP32's full 8-bit exponent, the numerical range is preserved, so the underflow problems that plague FP16 largely disappear.

Structural Sparsity

The essence of sparsity is to reduce unimportant weights to zero. Just as most pixels in a picture of a cat play no decisive role in recognizing it, most weights in a neural network contribute little to the result.

Nvidia stipulates that in every four consecutive weights, two must be set to 0. What originally required 64 bits of data now only requires about 34 bits, reducing the model's memory footprint by almost half.

For example, if a graphics card has 80GB of video memory, it can only hold a model with 40 billion parameters (40B). After enabling structured sparsity, it may be able to fit a model with close to 70 billion (70B) or even 80 billion (80B) parameters.

Moreover, performance doubles: dense TF32 computation achieves 156 TFLOPS (156 trillion operations per second), while sparse computation reaches 312 TFLOPS.

If we add the nearly 10-fold improvement of TF32 compared to traditional FP32 mentioned above, we can see that the A100 is a whole generation faster than older graphics cards from a few years ago when processing certain AI tasks.

As for the concern that all four weights in a group might be important and key information would be lost: first, while the model is not yet "finalized," the weights can still be adjusted during training.

Secondly, neural networks have extremely strong fault tolerance—although information is lost in a small local area, other layers can learn to make up for this loss.

Furthermore, sparsity is not achieved through random deletion, but rather through pruning based on weight.
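A magnitude-based 2:4 pruning pass can be sketched in a few lines of NumPy (our own illustration of the rule; NVIDIA ships real tooling for this in its training libraries):

```python
import numpy as np

def prune_2_of_4(weights):
    """Zero the two smallest-magnitude entries in every group of four (2:4 sparsity)."""
    w = weights.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(w), axis=1)[:, :2]   # two smallest |w| per group
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.array([0.9, -0.1, 0.05, 0.7,   0.2, 0.3, -0.6, 0.01], dtype=np.float32)
sparse_w = prune_2_of_4(w)   # keeps the two largest-magnitude weights per group
```

Each group of four ends up with exactly two zeros, and the survivors are always the largest-magnitude weights, which is what makes the scheme "structured" rather than random deletion.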

MIG (Multi-Instance GPU)

MIG is used for chip space management, performing "hard partitioning" of a single GPU at the physical circuit level. You read that right, physical partitioning.

On the A100, MIG can split the GPU into up to 7 independent instances, each with its own dedicated Tensor Core and memory path.

The partitioning method is flexible and diverse, such as partitioning into 7 small instances, or 1 large instance plus 3 small instances. The hardware has this "partitioning" capability at the factory, but how to partition and how many parts to partition can be controlled in real time through software commands after purchase.

In the A100's hardware architecture, the MIG primarily allocates three types of core resources: 1) SM (Streaming Multiprocessor): Computational cores, including CUDA Cores and Tensor Cores. 2) Memory System: Includes HBM2 video memory and L2 cache. 3) Bandwidth (Pathways): On-chip data transfer channels.

Each instance has its own independent and fixed memory address space and computing path. This means that when instance A is frantically reading and writing data, the electromagnetic signals and bus usage it generates will not interfere with instance B at all.

The benefits of this are obvious:

  • First, it significantly improves utilization and saves costs. An A100 card costs tens of thousands of dollars, which is too extravagant if only one PhD student is using it for experiments. With MIG, a company can have seven engineers conduct different experiments simultaneously on the same card, increasing efficiency by seven times.
  • Secondly, it is very popular in the cloud leasing market. Cloud service providers can flexibly rent out computing power on demand.

Looking back from a broader perspective:

  • Volta (2017): proved that general-purpose CUDA Cores are no longer the only protagonists, and that Tensor Cores for matrix computing are the crown jewel of the AI era.
  • Turing (2018): proved that higher precision is not always better, that low-precision INT8/INT4 is the way forward in the inference era, and that AI can contribute to computer graphics.
  • Ampere (2020): proved that fragmentation is inefficient and unification is the ultimate solution, integrating training and inference on the same silicon (A100), and demonstrated that sparsity and TF32 are more productive than "brute-force precision".

Before entering the Hopper (H100) era of 2022, we must first introduce another key innovation behind Nvidia's monopoly: NVLink.

If Tensor Cores are the heart of a chip, then NVLink is the major artery connecting tens of thousands of hearts.

NVLink: A high-speed point-to-point interconnect protocol between GPUs

NVLink is a high-speed point-to-point interconnect protocol between GPUs, building a highway between GPUs to allow them to communicate directly, bypassing the CPU.

The sole purpose of NVLink is to eliminate the PCIe bottleneck.

What is a PCIe bottleneck?

PCIe (Peripheral Component Interconnect Express) is a universal bus on the computer motherboard, originally designed to allow the CPU to connect to various peripherals, such as graphics cards, sound cards, network cards, and hard drives.

In AI scenarios, the bottlenecks are mainly reflected in:

1) Insufficient bandwidth. The theoretical bandwidth of the most advanced PCIe 5.0 x16 is about 63 GB/s, which sounds fast, but the H100's memory bandwidth is as high as 3,350 GB/s. The GPU computes extremely fast internally, yet data moves in and out roughly 50 times slower than the GPU can consume it.

2) High latency. PCIe data transfer requires CPU intervention. Data is first transferred from graphics card A to the CPU, and then forwarded by the CPU to graphics card B, resulting in significant latency.

Why this bottleneck? The main reason is that PCIe was originally designed for universal use.

On a side note, the trade-off between general-purpose and AI-specific technologies has been a constant throughout Nvidia's rise and is the core reason why Nvidia was able to overtake Intel. Intel's strength lies in its CPUs—powerful and versatile—but this is precisely what constitutes a bottleneck for AI computing.

Nvidia's rise is precisely because it dared to bet on the specialization of AI computing, and it made the right bet.

From three more specialized dimensions (topological inconsistency, protocol overhead, and physical limits) we can better understand the trade-off between generality and specialization.

1) Topological inconsistency. In PC or server architectures, all PCIe lanes ultimately converge at the CPU.

The CPU is like a traffic roundabout, where all vehicles must circle around it. Even with a powerful GPU, if the CPU can't keep up with the demands of processing, or if the bandwidth connected to the CPU is saturated, data exchange will be slowed down. This is what's known as the CPU-bound bottleneck.

2) Protocol overhead. When transmitting data packets, PCIe requires additional information such as message headers and checksums; after transmission is complete, an "interrupt request" must be sent to the CPU to allow the CPU to process subsequent logic.

3) Physical interference. The skin effect: the higher the frequency, the more the electrical signal flows along the surface of the wire, increasing resistance and attenuating the signal.

How does NVLink eliminate PCIe bottlenecks?

Going back to that statement: NVLink's sole purpose is to eliminate the PCIe bottleneck. How does it achieve this? Let's examine it step by step.

1) Topology Reconstruction

NVLink enables direct point-to-point communication between GPUs, completely bypassing the CPU and system memory.

2) Extremely simplified protocol

NVLink uses a memory-like transfer protocol, which has extremely low protocol overhead and a much higher payload ratio than PCIe.

3) Physical layer upgrade: multi-channel parallelism and high bandwidth

On the back of the H100 chip, NVIDIA has densely packed 18 NVLink links, achieving a total bidirectional bandwidth of 900 GB/s. In contrast, PCIe 5.0 x16 only offers 63 GB/s. NVLink's speed is more than 14 times that of PCIe.
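Putting the article's own bandwidth figures side by side:

```python
# Bandwidth figures as cited in the text, in GB/s
pcie5_x16 = 63      # PCIe 5.0 x16, theoretical
nvlink_h100 = 900   # 18 NVLink links on H100, total bidirectional
hbm_h100 = 3350     # H100 HBM memory bandwidth

print(f"NVLink vs PCIe: {nvlink_h100 / pcie5_x16:.1f}x")
print(f"HBM vs PCIe:    {hbm_h100 / pcie5_x16:.1f}x")
```

The two ratios, roughly 14x and 53x, are exactly the gaps the text has been describing: NVLink narrows the interconnect bottleneck, but even it remains well below the on-package memory bandwidth.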

4) Multi-GPU Integration: Memory Pooling and NVSwitch

Nvidia not only made the cables, but also a dedicated switch chip—NVSwitch. Inside the server, all GPUs are connected to the NVSwitch.

NVLink Networking: From Point-to-Point to Fully Connected

Furthermore, NVLink can connect multiple GPUs into a unified whole to load larger models. To understand this, three additional hard-core dimensions are needed.

1) NVSwitch – From highways to overpasses

The NVSwitch is not integrated inside the GPU chip, but is a separate switch chip mounted on the GPU substrate. If NVLink is a highway, then the NVSwitch is an overpass.

Before NVSwitch, GPUs were mainly connected in a point-to-point manner. With the introduction of NVSwitch (which first appeared in the V100-based DGX-2 and was greatly expanded in the A100 and H100 generations), GPUs moved from point-to-point communication into the network era, enabling multiple cards to be connected into a larger whole and to load larger models.

Imagine the limitations of point-to-point communication: with 8 cards, card A and card B may be physically connected via NVLink, but if card A wants to communicate with card D, the traffic must pass through B and C as intermediaries, consuming their bandwidth.

Taking the H100 as an example, there are 18 fourth-generation NVLink links on the bottom of the GPU, which are plugged into the NVLink backplane on the motherboard. Among the eight cards, there are 4 to 6 dedicated NVSwitch chips. All NVLink paths of each GPU are directly connected to these switches, rather than directly to another card.

This topology ensures communication between any two cards without going through the CPU or the PCIe bus on the motherboard.

More technically speaking, the core technical specification of NVSwitch is non-blocking full-duplex bandwidth, which ensures that any pair of GPUs can communicate at full speed simultaneously.

2) Network Computing (SHARP) – Enables switches to perform calculations while transferring data.

Network computing is another groundbreaking technology from NVIDIA that has changed the fundamental logic of computer communication: network switches no longer just move data, but directly perform mathematical calculations during transmission.

In the training of large AI models, there is one action that is repeated millions of times: gradient aggregation (All-Reduce).

In simple terms, gradient aggregation allows all GPUs involved in training to exchange their computational results, ultimately ensuring that each card has the exact same, aggregated, latest data.

Gradient aggregation is somewhat similar to distributed computing in blockchain. As the name suggests, it mainly includes two steps: "gradient" and "aggregation".

Training large models involves parallel computing. Each graphics card receives a portion of the data and calculates the error direction, or gradient, for its own portion.

Because each card sees different data, the calculated gradients are also different. If each card is updated directly, the models on the different cards will go completely in the wrong direction.

Therefore, before updating the weights, all cards must sum their gradients and calculate the average. After all cards receive this global average gradient, they are updated synchronously to ensure that the models across the eight cards are always identical.
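The averaging step above can be sketched in a few lines. This is a minimal NumPy simulation, with eight "GPUs" represented as plain arrays (the names are illustrative, not an NVIDIA API):

```python
import numpy as np

np.random.seed(0)
N_GPUS = 8

# Each "GPU" computes a gradient on its own data shard, so the gradients differ.
local_grads = [np.random.randn(4) for _ in range(N_GPUS)]

# All-Reduce: sum every card's gradient, then take the average.
global_grad = sum(local_grads) / N_GPUS

# Every card applies the same averaged gradient, so all replicas stay in sync.
lr = 0.1
weights = [np.zeros(4) - lr * global_grad for _ in range(N_GPUS)]

# After the update, all eight copies of the model are identical.
assert all(np.array_equal(weights[0], w) for w in weights)
```

The point of all the interconnect hardware is to make the `sum(local_grads)` step fast when the arrays hold billions of elements.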

How, specifically, is gradient aggregation computed?

The A100 uses Ring All-Reduce, which was the most bandwidth-efficient algorithm at the time. It cuts the data into N pieces and passes them around like a relay race.

SHARP employs tree-based aggregation (Tree All-Reduce), the approach NVIDIA is now heavily promoting, where data converges layer by layer like branches joining a trunk. Each GPU sends its data to the first layer, the NVSwitch. With SHARP, the switch chip performs the addition directly as it receives the data streams from multiple GPUs, then sends the result back to every GPU.
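To make the relay-race image concrete, here is a minimal simulation of Ring All-Reduce, the A100-era algorithm described above. It is pure NumPy; `ring_all_reduce` is an illustrative name, and the "nodes" are just lists of chunks:

```python
import numpy as np

def ring_all_reduce(chunks_per_node):
    """Simulate Ring All-Reduce. Each node's vector is pre-split into n chunks;
    after 2*(n-1) ring steps, every node holds the full element-wise sum."""
    n = len(chunks_per_node)
    nodes = [list(chunks) for chunks in chunks_per_node]

    # Phase 1: reduce-scatter. In step s, node i passes chunk (i - s) mod n
    # to its right neighbour, which adds it into its own copy of that chunk.
    for s in range(n - 1):
        outgoing = [(i, (i - s) % n, nodes[i][(i - s) % n].copy())
                    for i in range(n)]
        for i, c, payload in outgoing:
            nodes[(i + 1) % n][c] = nodes[(i + 1) % n][c] + payload

    # Now node i holds the complete sum of chunk (i + 1) mod n.
    # Phase 2: all-gather. Pass the finished chunks around the ring.
    for s in range(n - 1):
        outgoing = [(i, (i + 1 - s) % n, nodes[i][(i + 1 - s) % n].copy())
                    for i in range(n)]
        for i, c, payload in outgoing:
            nodes[(i + 1) % n][c] = payload
    return nodes

rng = np.random.default_rng(1)
n = 4
data = [rng.standard_normal(8) for _ in range(n)]   # one vector per "GPU"
chunks = [np.split(v, n) for v in data]             # cut into n relay pieces
result = ring_all_reduce(chunks)

expected = sum(data)
for node in result:                                 # every node has the full sum
    assert np.allclose(np.concatenate(node), expected)
```

Tree aggregation replaces these 2*(n-1) ring hops with log-depth summation inside the switches, which is why SHARP scales better as the card count grows.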

Why is gradient aggregation (All-Reduce) the lifeline of AI?

The standard for evaluating the quality of a GPU cluster is not how fast a single card is, but whether the gradient aggregation time can still be controlled within milliseconds when the number of cards increases to 1000.

If the GPU has strong computing power (such as the H100) but the network is weak, you'll find the GPU spending 70% of its time on gradient aggregation (that is, waiting for data) and only 30% actually performing AI calculations. This is what's known as being communication-bound.

The existence of technologies such as NVLink, NVSwitch, and SHARP is essentially to provide the fastest channel for gradient aggregation.

Next, our story arrives in 2022, when NVIDIA's groundbreaking H100 takes center stage.

2022 Hopper (H100) – The Transformer engine, the cornerstone of modern large-scale models.

In 2022, the H100 was launched, and it can be described as a nuclear bomb.

The H100 builds hardware support for the Transformer algorithm directly into the chip, making it purpose-built for handling large language models (LLMs) with up to trillions of parameters.


The Transformer architecture originated from Google's 2017 paper "Attention Is All You Need" and is the foundation of modern large language models. The Transformer engine is a physical module within the H100; it's not software, but a hard-wired circuit.

Meanwhile, the H100, utilizing FP8 precision, achieves training performance up to 9 times that of the A100. Combined with the NVLink Switch, it turns 256 GPUs into one giant super brain. Without the H100, there would be no ChatGPT and no era of trillion-parameter large models.

Research on H100 can be approached from four aspects:

1) Tensor Core introduces the Transformer engine and FP8;

2) Fourth-generation NVLink and NVSwitch achieve 900 GB/s bandwidth;

3) Introducing new CUDA features—DPX instruction set to accelerate dynamic programming;

4) The world's first GPU to support privacy computing.

Tensor Core introduces the Transformer engine and FP8

In H100, FP8 acts as the executor charging into battle. Most matrix multiplications for inference and training can be run on FP8.

FP16 acts like a shrewd and prudent civil official, keeping higher-precision copies so that updates are not lost to rounding, a bridging role that balances speed and stability.

FP32 is reserved for master-weight storage and weight updates, because subtle gradients get "rounded off" during low-precision accumulation; errors would otherwise pile up until learning stalls.

FP8 makes it possible to train trillion-parameter models within limited GPU memory, while doubling throughput.
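The division of labor among FP8, FP16, and FP32 can be demonstrated numerically. NumPy has no FP8 type, so this sketch uses float16 as a stand-in for the low-precision format; the failure mode it shows (tiny updates rounded away unless a full-precision master copy accumulates them) is exactly the one described above:

```python
import numpy as np

rng = np.random.default_rng(0)
w0 = rng.uniform(1.0, 2.0, 1000).astype(np.float32)  # weights of magnitude ~1
step = np.float32(1e-5)  # lr * gradient: far below float16's resolution near 1.0

# Low-precision-only updates: each tiny step rounds straight back to the old value.
low_w = w0.astype(np.float16)
for _ in range(100):
    low_w = (low_w - step).astype(np.float16)

# FP32 master copy: the same 100 tiny steps accumulate faithfully.
master_w = w0.copy()
for _ in range(100):
    master_w -= step

assert np.array_equal(low_w, w0.astype(np.float16))   # learning stalled
assert np.allclose(master_w, w0 - np.float32(0.001))  # learning progressed
```

Real FP8 training follows the same logic: FP8 runs the fast matrix math, while higher-precision copies preserve the small updates.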

DPX Instruction Set: Add-and-Compare, Made Easy

DPX is essentially a shortcut key that Nvidia soldered into the chip for "add two numbers, then compare which is smaller".

Imagine you're on a chessboard, moving from the top left corner to the bottom right corner. Each move has a cost, and you want to find the path with the lowest cost. So you look at the costs: coming from the top, coming from the left, and coming diagonally upwards, and choose the cheapest one.

Note the structure of this action: first add, then compare which is smaller.

The entire chessboard has millions or even billions of squares, and this action needs to be performed on every single square. This is the daily routine of dynamic programming.

The H100's DPX combines these two steps into one. The reason for using the word "soldering" is because it is indeed a hardware structure on the chip.

DPX does not require the addition of large dedicated cells like Tensor Cores; it simply adds a "convenient comparison" function to the existing integer computing path—with minimal chip area overhead but huge benefits.

For example, gene sequencing involves comparing billions of base pairs at a time, and this operation must be performed for each base pair. Saving one instruction multiplied by billions of operations results in a considerable amount of time saved.
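The "add, then compare" primitive that DPX fuses into one instruction is easy to see in a plain grid-path example. This is an illustrative Python version of the chessboard walk above, with a hypothetical helper name:

```python
import numpy as np

def min_cost_path(cost):
    """Cheapest path from the top-left to the bottom-right corner, moving
    down, right, or diagonally. Every cell performs 'add, then compare',
    the operation pair that DPX fuses into a single instruction."""
    h, w = cost.shape
    dp = np.full((h, w), np.inf)
    dp[0, 0] = cost[0, 0]
    for i in range(h):
        for j in range(w):
            if i == 0 and j == 0:
                continue
            best = np.inf
            if i > 0:
                best = min(best, dp[i - 1, j])       # coming from above
            if j > 0:
                best = min(best, dp[i, j - 1])       # coming from the left
            if i > 0 and j > 0:
                best = min(best, dp[i - 1, j - 1])   # coming diagonally
            dp[i, j] = cost[i, j] + best             # the add+compare step
    return dp[h - 1, w - 1]

grid = np.array([[1, 3, 1],
                 [1, 5, 1],
                 [4, 2, 1]])
assert min_cost_path(grid) == 5.0
```

The inner loop body is executed once per cell; on a billion-cell problem, collapsing its add and compare into one hardware instruction is where DPX's savings come from.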

Furthermore, the H100 is the world's first GPU to support hardware-level TEE, thus opening a new chapter in privacy computing.

TMA (Tensor Memory Accelerator): Asynchronous data transfer engine

TMA is one of the most significant changes in the H100 at the SM microarchitecture level, directly determining whether the Tensor Core and Transformer Engine can run at full capacity.

Simply put, TMA is Nvidia installing a dedicated data transporter inside the chip, so that the working threads no longer have to go to the warehouse to fetch data themselves.

The GPU's memory structure is divided into two layers:

1) Global Memory (HBM) has a large capacity (80GB), but it is far from the computing unit and access is slow, like a huge suburban warehouse;

2) Shared memory (SMEM) has a small capacity (maximum 228KB per SM), but it is close to the computing unit and has fast access, like a small cabinet next to the workstation.

All calculations require moving data from the suburban warehouse to the small cabinet at the workstation first, then moving it back after the calculation is complete. Moving data produces no useful results by itself, but without it, no calculation can happen.

TMA is a dedicated transport module that allows for more precise division of labor. It understands the shape of tensors, and crucially, it can be executed asynchronously.

TMA has another trump card: Multicast.

The H100 introduces Thread Block Cluster (multiple SMs form a cluster). TMA can not only move data to the shared memory of the SM that initiated the request, but also copy the same data to multiple SMs in the cluster at the same time.

To summarize,

The core contradiction of GPUs is that they can compute quickly but move things slowly. In the A100 era, the workers had to move the components themselves, and everyone had to stop and move them together before starting work again.

H100's TMA is like a dedicated deliveryman. You write down an address, paste it in, and the deliveryman handles it, while everyone else goes about their work. Moreover, this deliveryman understands the shape of tensors; regardless of the data's dimensionality, given coordinates, it can find it on its own.

If DPX "makes computation faster," doing two tasks with one instruction, then TMA "makes data transfer no longer a hindrance," allowing data transfer and computation to run in parallel without interfering with each other. It is through their collaboration that the H100 truly reaches its full computing power.
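The "deliveryman" pattern can be sketched with an ordinary producer-consumer queue. This is only a CPU-side analogy, not CUDA code: a dedicated mover thread plays the role of TMA, prefetching the next tile while the main loop computes on the current one:

```python
import threading
import queue
import numpy as np

rng = np.random.default_rng(0)
global_memory = [rng.random((256, 256)) for _ in range(8)]  # the "suburban warehouse"

def mover(tiles, outbox):
    """Plays TMA's role: copies tiles toward the compute loop asynchronously."""
    for tile in tiles:
        outbox.put(tile.copy())   # stands in for an HBM -> shared-memory copy
    outbox.put(None)              # sentinel: nothing left to fetch

inbox = queue.Queue(maxsize=2)    # two in-flight slots = double buffering
threading.Thread(target=mover, args=(global_memory, inbox), daemon=True).start()

total = 0.0
while (tile := inbox.get()) is not None:
    total += float(tile.sum())    # "compute" overlaps with the mover's prefetch
```

The compute loop never issues a copy itself; it just consumes whatever the mover has staged, which is the division of labor TMA creates between data transfer and the worker threads.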

From graphics card vendor to absolute ruler of AI infrastructure

In 2023, Nvidia underwent a complete transformation. It rose from a graphics card supplier to the absolute ruler of global artificial intelligence infrastructure, with its market value surpassing $1 trillion for the first time.

The company's earnings reports have significantly exceeded Wall Street expectations for three consecutive quarters. Data center revenue has replaced gaming revenue as the company's absolute core pillar.

From Silicon Valley giants like Microsoft, Meta, and Google to sovereign nations like Saudi Arabia and the UAE, the world is frantically stockpiling H100 chips. Due to limited CoWoS packaging capacity at TSMC, H100 chips have become extremely scarce, with the price of a single chip once soaring to over $40,000 on the secondhand market.

Almost all mainstream large-scale models, such as GPT-4 and Llama, are developed on the CUDA architecture. Even if AMD's hardware parameters are superior, it is difficult for developers to migrate because all the underlying optimizations and operator libraries are in the hands of NVIDIA.

At the same time, Nvidia began monetizing through software licensing. Hardware is a one-time sale, but software subscriptions bring in a continuous stream of cash flow.

At GTC 2023, Jensen Huang famously proclaimed, "AI's iPhone moment has arrived."

2024 Blackwell (B200) – Microtensor Scaling

At the GTC conference in March, NVIDIA released Blackwell (B200/GB200), which fuses two dies into one chip through the NV-HBI high-bandwidth interface, creating a "dual-die integrated" structure whose transistor count leaps to 208 billion.

On the software side, the B200 remains a unified whole.

We can analyze the B200 from three dimensions: Tensor Core, CUDA, and NVLink.

Fifth-generation Tensor Core: Supports FP4

The core breakthrough of the B200 Tensor Core lies in its support for FP4.

From the first-generation Tensor Core supporting FP16 in 2017, to the H100 supporting FP8 in 2022, and now to the fifth-generation Tensor Core of the B200 supporting FP4, the accuracy has been decreasing while the computing power has been increasing.

B200's FP4 is not a simple precision truncation, but rather introduces micro-tensor scaling.

In short, micro-tensor scaling is a data compression and quantization technique that represents each number in fewer bits without throwing away the information that matters.

Essentially, it is a collaboration between dynamic range management algorithms and hardware-level scaling, allowing a group of dozens of elements to have an independent scaling factor.

At the hardware level, micro-tensor scaling relies on the physical circuitry of Blackwell's second-generation Transformer Engine and fifth-generation Tensor Core to work together.

The second-generation Transformer Engine acts as the hardware scheduling hub, responsible for the dynamic range management algorithm, tracking the numerical distribution range of different network layers and different tensors in real time, and calculating the optimal common scaling ratio.

The fifth-generation Tensor Core adds native hardware support for FP4 at the physical level (hardware-level scaling) and is responsible for execution. Its arithmetic units can take in FP4 data together with the scaling factors and perform matrix multiplication directly in hardware.

FP4 data can be instantly aligned during computation to restore a high-precision dynamic range, thereby doubling computing power without losing key features. It is designed specifically for ultra-large-scale models.
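A minimal simulation shows why per-block scale factors matter. NumPy has no 4-bit float type, so this sketch quantizes to 15 signed integer levels as a stand-in for FP4's tiny range; the block size, level count, and function names are illustrative, not Blackwell's actual parameters:

```python
import numpy as np

BLOCK = 32    # elements sharing one scale factor (illustrative size)
LEVELS = 7    # signed levels [-7, 7], standing in for FP4's tiny range

def quantize_microscaled(x):
    """Each BLOCK-sized group gets its own scale, so blocks of tiny values
    and blocks of huge values both use the full quantization range."""
    blocks = x.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / LEVELS
    scales[scales == 0] = 1.0                        # guard all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)    # the low-bit payload
    return q, scales

def dequantize(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
# One block of tiny values sitting next to one block of large values.
x = np.concatenate([rng.normal(0, 1e-3, BLOCK), rng.normal(0, 10.0, BLOCK)])

# A single global scale flattens the tiny block to exactly zero...
global_scale = np.abs(x).max() / LEVELS
x_global = np.round(x / global_scale) * global_scale
assert np.all(x_global[:BLOCK] == 0)

# ...while per-block scales keep it, with error bounded by half a step.
q, scales = quantize_microscaled(x)
x_hat = dequantize(q, scales)
assert np.abs(x_hat - x).max() <= scales.max() / 2 + 1e-12
```

This is the essence of the dynamic range management described above: the scale factors carry the magnitude, so the 4-bit payloads only need to carry the shape of each small group of values.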

In addition, the introduction of the hardware decompression engine has indirectly improved the effective bandwidth utilization of PCIe and NVLink.

CUDA 13.0

The key is how to enable developers to seamlessly control the complex "dual-core integrated" structure of the B200.

Although the B200 is physically composed of two chips, CUDA, through NV-HBI (High-Bandwidth Interface), allows developers to see a unified entity with 192GB of video memory, eliminating the need for manual cross-chip data synchronization.

NVLink 5.0 and NVL72

The fifth-generation NVLink protocol increases the bidirectional bandwidth of a single GPU to 1.8 TB/s, twice that of the H100. The bandwidth between two chips is even higher, reaching 10 TB/s, making it completely imperceptible to the software layer that they are two separate chips.

Building on this, NVIDIA also launched the GB200 NVL72 rack, which integrates 36 Grace CPUs and 72 Blackwell GPUs, forming a massive resource pool delivering 1.4 exaFLOPS of AI compute.

The GB200 NVL72 had to adopt a liquid cooling design because fans were no longer effective. The back of the rack uses 5000 copper wires instead of fiber optics, significantly reducing power consumption while eliminating nanosecond-level latency caused by photoelectric conversion.

From then on, Nvidia began to use "server racks" as the smallest sales unit.

SHARP has also evolved to version 4, doubling its network computing power once again.

NIM (NVIDIA Inference Microservices): Software Closed Loop

In the past, deploying a large open-source model to a company's own server was an extremely painful manual task.

Engineers needed to configure the underlying environment, install CUDA, compile PyTorch, hand-write acceleration scripts, and finally wrap the interface themselves. The whole process often took several weeks.

NIM is a pre-installed software container with pre-optimized models. Enterprises only need to purchase NVIDIA cards to run it with a single click, eliminating the need for expensive algorithm teams to fine-tune each component individually.

Enterprises can deploy NIM within their own intranet. By leveraging NIM on cloud services such as AWS, enterprises can enjoy the latest models while maintaining absolute secure control over proprietary data and applications—data will never be leaked to third-party model providers.

In June 2024, Nvidia's market capitalization briefly surpassed that of Microsoft and Apple, making it the world's most valuable company.

However, in the same year, the market began to diverge. On the one hand, Nvidia's financial report was still phenomenal, with astonishingly high profit margins.

On the other hand, Silicon Valley is beginning to worry about the return on investment in AI. Microsoft and Google have spent hundreds of billions of dollars on GPUs, but revenue from value-added services has failed to cover costs, causing Nvidia's stock price to fluctuate wildly in August and September, despite its earnings still maintaining growth of several hundred percent.

In 2025, Nvidia's market capitalization once exceeded $5 trillion, firmly establishing itself as the world's most valuable company.

Despite the short-term impact at the beginning of the year from DeepSeek R1's claim of reducing reliance on top-tier chips, which caused a significant drop in market value in a single day, the market subsequently realized that the demand for high-performance computing power in AI training had not changed, and Nvidia's stock price became more resilient.

Nvidia's revenue for fiscal year 2025 reached $130.5 billion, a year-on-year increase of 114%, with data center business accounting for nearly 80%. Nvidia's earnings releases have replaced traditional economic indicators as a bellwether for the US stock market.

Nvidia also participated in OpenAI's $500 billion Stargate supercomputing project.

2025 in fact brought several important strategic shifts for Nvidia:

1) Business level: Exporting chips to sovereign states to build sovereign AI;

2) Technological approach: Shifting from generative AI to Agentic AI Swarm;

3) Cutting-edge applications: Deepening its push into robotics and digital twins.

In 2025, Nvidia also announced two major initiatives that received little attention but were of great significance: GR00T and Cosmos.

GR00T is the first open-source general-purpose basic model for humanoid robots, while Cosmos is a physics simulation platform that collaborates with companies such as Google and Disney.

The combination of the two allows robots to be trained in a digital twin world, simulating gravity, friction, fluid dynamics, and even the elasticity and light and shadow of materials in a computer virtual environment.

Leveraging the powerful computing capabilities of GPUs, the virtual world can operate at exponential speeds. A day in reality can complete a physical simulation process equivalent to decades or even centuries in the virtual world. A robot's AI brain experiences billions of falls and rises within an extremely short amount of real time.

This is equivalent to "one day in the human world being equivalent to ten years in the digital world".

The production release of the humanoid-robot foundation model Isaac GR00T N1 marks NVIDIA's official entry into the role of "brainstem supplier" for global robotics.

Jetson Thor is an onboard computing platform designed specifically for robots. It has already entered mass production and aims to become the brainstem of every moving "intelligent agent".

At the end of the year, Nvidia officially announced its next-generation Rubin architecture.

2026 Rubin (R100) – Ultra-Large-Scale Inference for Agentic AI Swarms

At the beginning of the year, NVIDIA delivered the Rubin R100, redesigning six key chips: CPU, GPU, NVSwitch, NIC, DPU, and SuperNIC. NVIDIA calls this concept Extreme Co-design.

Fourth-generation high-bandwidth memory HBM4 with 12-Hi stack

This involves three concepts: memory wall, stacking, and HBM. Together they form a complete chain of "identify the problem, propose an approach, deliver the solution": the memory wall is the problem, stacking is the approach, and HBM is the resulting solution.

What is a memory wall?

In short, the data transfer speed of RAM/video memory cannot keep up with the computing speed of GPU/CPU.

For example, a GPU can perform 1 million multiplications per second, but memory can only send 100,000 numbers per second, leaving the GPU idle for the remaining 90% of the time.

Models like ChatGPT have hundreds of billions of parameters, and each time a question is answered, these hundreds of billions of numbers have to be retrieved from memory and calculated. This creates a memory wall problem, rendering even the most powerful GPUs useless.
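A back-of-envelope calculation makes the wall tangible. The numbers below are illustrative, ballpark H100-class figures for single-stream generation, where every parameter must be read from memory once per token:

```python
# Memory wall, back of the envelope (illustrative H100-class numbers).
params = 70e9            # a 70B-parameter model
bytes_per_param = 2      # FP16 weights
bandwidth = 3.35e12      # ~3.35 TB/s of HBM bandwidth
flops = 990e12           # ~990 TFLOPS of dense FP16 Tensor Core compute

# Time to stream all weights from memory once (one generated token):
t_mem = params * bytes_per_param / bandwidth     # ~0.042 s
# Time for the ~2 FLOPs per parameter that the token's math needs:
t_compute = 2 * params / flops                   # ~0.00014 s

# The math finishes roughly 300x faster than memory can feed it:
ratio = t_mem / t_compute
assert 250 < ratio < 350
```

In this regime the GPU sits idle most of the time waiting on memory, which is exactly the starvation HBM's bandwidth is meant to relieve.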

Stacking: Breaking down the memory wall at the physical level

The simplest and most direct way to break the memory wall is to place the memory and GPU as close as possible and to use multiple memory modules.

However, the motherboard area around the GPU is limited. So engineers used TSVs (Through Silicon Vias) to drill tens of thousands of tiny holes in the memory chips, filled them with copper wires, and then stacked 4-layer, 8-layer, 12-layer, and even future 16-layer memory chips vertically together like stacking hamburgers. This is stacking.

HBM (High-Bandwidth Memory): The Highway in a Stack

HBM is a high-speed data road built with stacking technology, relying mainly on TSVs (Through-Silicon Vias, for vertical interconnection) and silicon interposers (for the horizontal interconnection to the GPU).

HBM4 (High Bandwidth Memory 4) is currently the world's most advanced fourth-generation high-bandwidth memory technology. 12-Hi stacking refers to using advanced packaging technology to vertically stack 12 layers of memory chips into a single chip, much like building a skyscraper.

Each Rubin chip natively integrates 288GB of HBM4 memory, achieving a staggering aggregate bandwidth of 22 TB/s. When handling mainstream ultra-large models with 10 trillion parameters, Rubin can improve training efficiency by 3.5 times and reduce inference costs by 10 times without increasing the number of GPUs.

Vera CPU – Natively Supports FP8

Let's first review the fundamental differences between CPUs and GPUs.

CPUs dedicate a large number of transistors to complex control units and caches, rather than computing units (ALUs).

This design is very effective for operating systems with complex logic, but when faced with the "rigid" large-scale mathematical operations of AI, the complex control unit is a pure waste with extremely low energy efficiency.

GPUs employ a SIMD (Single Instruction Multiple Data) or, more advancedly, a SIMT (Single Instruction Multiple Thread) architecture. A single control unit directs a large group of computing units.

Just like calisthenics, when the instructor (CU) shouts "Raise your hands," thousands of students (ALU) perform the movements simultaneously, greatly saving the transistor area used for "command" and converting it all into computing power for "doing the work."

This is the fundamental reason why GPUs are far more energy efficient than CPUs in AI tasks.
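The calisthenics analogy maps directly onto vectorized code. In NumPy, one vectorized expression is a single "command" that moves the whole array, versus a Python loop that issues a separate command per element. This is a software analogy for SIMD, not GPU code:

```python
import numpy as np

x = np.arange(100_000, dtype=np.float32)

def scalar_style():
    out = np.empty_like(x)
    for i in range(len(x)):   # one "command" issued per student
        out[i] = x[i] * 2.0
    return out

def simd_style():
    return x * 2.0            # one "command", every student moves at once

# Identical results; the vectorized form amortizes control overhead
# across all elements, which is the essence of SIMD/SIMT efficiency.
assert np.array_equal(scalar_style(), simd_style())
```

The GPU takes this to the extreme in silicon: one control unit, thousands of ALUs, and almost no transistors spent on per-element command dispatch.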

However, GPUs cannot run operating systems, cannot directly read files from disk, and cannot handle external network requests. They must work under a CPU, which dispatches tasks and prepares their data.

The Vera CPU is not a general-purpose processor for running Windows, but a data steward NVIDIA customized for Agentic AI, stably feeding data to the adjacent Rubin GPU with extremely low latency and extremely high bandwidth.

Essentially, it is a specialized processor designed to maximize GPU computing throughput. It abandons redundant functions in traditional general-purpose computing, using extreme memory bandwidth, extremely low single-threaded power consumption, and native low-precision data support to achieve absolute data scheduling efficiency in a single AI computing scenario.

Prior to 2022, Nvidia only manufactured GPUs. All AI servers used Intel or AMD x86 CPUs as the motherboard core and then plugged in Nvidia GPUs like USB drives. This led to the PCIe bottleneck mentioned earlier.

With the arrival of the Hopper (H100) era, NVIDIA developed its own ARM architecture Grace CPU and launched the GH200 (Grace Hopper Superchip), which for the first time packaged its own CPU and H100 GPU on the same super motherboard.

With Vera, the data barrier between the CPU and GPU was completely broken down.

Previously, GPUs were already performing calculations using extremely low precision (such as FP8), but CPUs have traditionally only been good at processing high-precision FP32/FP16 data. Data transfer between the two requires frequent format conversions, wasting a significant amount of bandwidth and time.

Vera is the industry's first CPU to natively support FP8 at the hardware level. It can perform FP8 preprocessing and alignment directly at the CPU level before the data is fed to the Rubin GPU, completely eliminating the latency overhead of data format conversion.

NVLink 6 and Silicon Photonics (CPO)

At the physics level, Nvidia has pushed several limits of engineering and materials science. The shift from copper wiring to silicon-photonics CPO, discussed next, is a microcosm of those limits.

Silicon photonics and CPO technology trade extremely high manufacturing costs and catastrophic maintenance difficulties for massive bandwidth and extremely low power consumption that break physical limits. Copper wires, on the other hand, make a last stand within a single rack with low cost and extremely high physical reliability.

However, R100 has already reached the limit of copper wire.

In the previous generation of Blackwell architecture racks, the backplane of the rack was crammed with more than 5,000 extremely heavy and thick copper cables to achieve the all-copper interconnection of 72 computing chips. The NVLink 6, released in 2026, will double the single-card interconnect bandwidth to 3.6 TB/s.

If the pure copper solution is continued, the number of copper cables inside the rack will exceed ten thousand. Not only will it be physically impossible to fit them in, but the extremely dense cabling will also completely block the entire rack's cooling airflow.

Even more critically, the resistance of copper wires causes severe signal attenuation during ultra-high frequency signal transmission. To force the electrical signal through, the system must consume enormous amounts of power. In the Rubin era, where single-rack power consumption was already extremely high, this unnecessary energy consumption due to signal attenuation was completely unacceptable.

Therefore, Nvidia's shift from copper wire to silicon photonics CPO is less a proactive choice and more a necessary trade-off.

NIM 2.0 and Inference Storage

The core keyword for R100 is "Agentic AI". The previous section introduced the hardware-level support for Agentic AI, while NIM is the collaboration between hardware and software.

NIM 2.0 is a standardized software container and scheduling bus designed specifically for multi-agent collaborative computing, enabling ultra-fast data interaction and computing power allocation between different AI models.

At the software level, different AI models are encapsulated and can call each other with extremely low latency, and complex tasks are broken down and distributed automatically.

Inference Storage is a physical multi-level memory architecture specifically built for large models and ultra-long contexts (KV Cache), which completely breaks the physical limit of the memory capacity of a single graphics card.

Context data during model inference no longer shuttles back and forth to main memory, but is dynamically cached along the network exchange path.

The combination of hardware and software solves the latency and memory overflow bottlenecks of Agentic AI when processing complex tasks involving millions of words.

Traditional inference service frameworks primarily focus on sequential optimization for single models (such as having a single LLM continuously generate text). However, in Agentic AI workflows, multiple models often need to collaborate concurrently at high frequency. NIM 2.0 is a software infrastructure restructured specifically for this purpose.

Furthermore, the GR00T and Cosmos, representing the future direction, have evolved to version 2.0. NVIDIA has established deep partnerships with factories such as BMW and Tesla, and by 2026, hundreds of thousands of collaborative robots powered by GR00T 2.0 will be able to operate in the cloud via the NVIDIA Isaac platform.

At this point, Nvidia's development trajectory has been fully outlined.

Postscript

In my research on Nvidia, I was deeply impressed by two aspects:

1) Jensen Huang's judgment

In the 2012 ImageNet competition, Alex Krizhevsky, using two ordinary NVIDIA GTX 580 gaming graphics cards, reduced the image recognition error rate from 26% to 15.3%, shocking the world with an astonishing 10.8-percentage-point lead over the second place.

In 2013, Jensen Huang shifted the company's focus entirely to AI.

It's important to note that this was four years before Google published "Attention Is All You Need," the paper that introduced the Transformer architecture and laid the foundation for modern large language models. At that time, competition in the chip industry was still centered on the more general-purpose CPU field.

After that, Jensen Huang made the correct choice at almost every key juncture.

In 2006, nobody knew what CUDA was for, but he kept burning through $500 million a year to keep investing in it.

In 2017, while the scientific computing community was still pursuing the absolute precision of FP64, he dared to allocate a large area on the most expensive chip for dedicated circuits for matrix operations, which were then used by only a few people.

In 2018, when the mobile internet wave was at its peak, he decisively abandoned mobile phone chips and bet all his resources on data centers.

In 2016, he personally delivered the first DGX-1 to the then little-known OpenAI office.

Every decision seemed almost insane at the time.

This judgment doesn't stem from prophetic predictions, but from a profound understanding of the underlying logic of technology. Jensen Huang has consistently asked one question: What is the future of computing? His answer has remained consistent: parallel computing will eventually replace serial computing, and specialized efficiency will ultimately triumph over general-purpose performance.

This belief has guided NVIDIA's entire development path, from CUDA to Tensor Core, and from NVLink to Rubin.

2) Nvidia's engineering capabilities

Nvidia's chip iterations have repeatedly pushed the limits of physics, and the innovations, trade-offs, and choices made in this process involve not only communications, materials, and optics, but also extend to the boundaries of quantum physics.

Mixed precision is a trade-off, sacrificing numerical exactness for speed.

Structured sparsity is a trade-off, pruning away model capacity in exchange for throughput.

The shift from copper wire to silicon photonics is a trade-off, accepting manufacturing difficulty in exchange for breaking transmission limits.

Each generation of architecture advancement is not simply about increasing the numbers, but about repeatedly seeking the optimal solution between precision and efficiency, generality and specialization, and cost and performance.

Behind this is an extremely large and deeply involved engineering team.

The convolution algorithm in cuDNN has undergone more than a decade of manual assembly-level optimization; TensorRT's operator fusion is precise down to the scheduling strategy of each kernel; TMA's asynchronous transport mechanism enables true parallelism between computation and data transmission. These unseen, underlying advancements are the deepest cornerstone of CUDA's ecosystem moat.

What's even more remarkable is that Nvidia has built an extremely robust bridge between hardware and software.

From CUDA to cuDNN, from TensorRT to NIM, from chips to racks to the entire data center, each layer is tightly integrated. Even if a competitor catches up at one layer, it is very difficult to catch up across the entire stack simultaneously.

This is not a company that only makes chips, but a system-level company that is pushing the boundaries of everything from transistors to software containers, from single cards to multi-card clusters, and from algorithms to physical laws.

Looking back at Nvidia's rise, what impresses me most is a simple truth: the real moat is never a single technology, but the compounding effect of countless correct decisions over time.

CUDA took ten years to witness the explosion of deep learning. Tensor Cores took five years to see the Transformer dominate. NVLink took three generations to evolve from point-to-point connections to a fully interconnected network. Each technology seemed ahead of its time, even superfluous, at its inception, but when the tide of history truly arrived, they were already in place.

This is probably the best explanation for what Jensen Huang often says:

"Our company is always only 30 days away from bankruptcy."

It was this sense of crisis that drove Nvidia to lay the groundwork a decade in advance at every moment when others thought it was "too early." And when the opportunity truly arrived, everyone realized that Nvidia was the only one left on the track.

Finally, a few words of reflection.

Besides Nvidia, what filled me with awe and even excitement during the research process was the wisdom displayed by humankind.

A single B200 chip integrates 208 billion transistors. To put that into perspective, the Milky Way is estimated to contain a few hundred billion stars. A chip the size of a fingernail holds a number of transistors on the same order of magnitude as the stars in our galaxy.

These 208 billion transistors weren't soldered on one by one; they were created photolithographically. Extreme ultraviolet light with a wavelength of only 13.5 nanometers is reflected off an extraordinarily precise photomask to project the circuit pattern onto a silicon wafer, "printing" it layer by layer. The alignment precision required for each layer is at the sub-nanometer level, equivalent to hitting a coin on the lunar surface with a laser aimed from Earth.

When the gate length of a transistor shrinks to 3 nanometers or even smaller, electron behavior no longer strictly follows classical physics: quantum tunneling appears, and electrons slip like ghosts through barriers that should be insulating. In other words, chip engineering has reached the boundaries set by quantum mechanics and the uncertainty principle.

This is precisely why the B200 has to adopt a dual-die design: a single die has already approached the reticle limit of current photolithography and the bounds of physics, and making it any larger would only cause yields to collapse.

So the engineers changed their approach. Since one die could not do it, they built two and stitched them into a whole with a 10 TB/s die-to-die interconnect (NV-HBI), making the seam completely invisible to the software layer.

From quantum physics to materials science, from optical engineering to packaging technology, the creation of a chip embodies the wisdom of almost all of humanity's cutting-edge disciplines.

I'm reminded of Stefan Zweig's book *Decisive Moments in History* (*Sternstunden der Menschheit*). We have created a thinking machine out of sand, and we use this machine to explore the universe, simulate physics, and even try to understand consciousness itself.

This is perhaps a story more worthy of being written than the rise of any other company.

Opinions belong to the column author and do not represent PANews.
