Disclaimer: Opinions shared in this and all my posts are mine, and mine alone. They do not reflect the views of my employer(s) and are not investment advice.
Over the last year, I have spent many hours following stories about Nvidia and their meteoric rise. This post is an attempt to consolidate all those learnings. My goal is to look back at Nvidia’s history to try and generalize the principles behind making a new computing architecture succeed.
A few caveats before I get into the post:
This is not a post about business strategy. There are many good resources covering that already. I have tried to stay away from generic advice like “to be successful, build the picks and shovels in a gold rush”.
This post is not forward-looking - I’m not claiming that Nvidia has won and everyone should invest in their stock. It is simply a case study in computer architecture. In fact, a lot of my learnings come from the early days of Nvidia.
With that out of the way, let me share what I think is key to successfully building a new architecture.
Pick the right domain
Architectural changes cannot happen overnight, so it’s very important to pick the right target domain. There are a few ideal characteristics:
It’s barely possible to execute applications with current architectures.
This means at least one of the following:
It takes too long to run
It needs very expensive compute resources
It needs advanced programming skills that only a few possess
When Nvidia was founded in 1993, they picked computer graphics. Realistic graphics needs fast compute. In the 1990s, 3D graphics was only possible using either:
Expensive workstations from companies like SGI
Advanced software rendering techniques in games like Doom
Neither was accessible to the average consumer or developer. In addition, 1992 saw two key developments that could enable graphics cards:
The PCI bus protocol was adopted as a standard by all computer manufacturers. This meant anyone could make a new chip, attach it to any computer, and it would work
Microsoft released Windows 3.1, which was a big jump in operating-system graphics
All this made computer graphics a great domain for Nvidia to try and disrupt.
There are general patterns in the workloads that can be exploited
I think the word “general” is key. Let me explain with the example of graphics.
Nvidia recognized, over time, that the key pattern in computer graphics is this: computations on each pixel are independent of the others. No other pattern is as important. This led them towards a massively parallel architecture that is still programmable.
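To make the pattern concrete, here is a minimal CUDA sketch of per-pixel independence (my own illustration, not Nvidia’s actual graphics pipeline): each thread brightens exactly one pixel and never reads or writes any other pixel, so the work parallelizes across thousands of cores with no coordination.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each thread handles exactly one pixel; no thread touches another
// pixel's data, so all pixels can be processed in parallel.
__global__ void brighten(unsigned char* pixels, int n, int delta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = pixels[i] + delta;
        pixels[i] = v > 255 ? 255 : v;   // clamp to the valid range
    }
}

int main() {
    const int n = 1920 * 1080;           // one 1080p grayscale frame
    unsigned char* d_pixels;
    cudaMalloc(&d_pixels, n);
    cudaMemset(d_pixels, 100, n);        // dummy frame data

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    brighten<<<blocks, threads>>>(d_pixels, n, 40);
    cudaDeviceSynchronize();

    cudaFree(d_pixels);
    printf("frame processed\n");
    return 0;
}
```

The same structure works whether the per-pixel operation is a simple brightness tweak or a full shading computation - which is why a general, programmable parallel machine beats hard-wiring any one step.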
There were architectures inspired by other approaches too. For example, early graphics chips focused heavily on fixed-function accelerators. Here, the idea was to pick a specific part of the graphics pipeline (like tessellation, rasterization, or shading) and make your architecture optimal for those steps. In fact, Nvidia also had a fixed-function approach for a long time.
There are two problems with going narrow with architecture decisions:
Algorithms in your domain can change. Your architecture may not optimally support those changes
Your architecture can never scale to other similar domains.
If you cannot find a general pattern in your domain, it is a bad idea to build an architecture around it. Often, such niche domains get commoditized as part of a bigger architecture - this is what Intel did to audio cards, and even to 3D graphics to an extent over the years. But Nvidia survived this because their architecture was sufficiently general by that point.
Solving for this domain should provide big opportunities
This is what Nvidia calls a “zero billion dollar” market. It’s hard to specify exactly which domains fall into this category, but a good approach is to ask this question: “If you build it, and they come, what can they do?”
When Nvidia first bet the company on computer graphics, their vision was: if graphics could be made accessible to everyone, it would be the greatest storytelling medium. This was around the time Jurassic Park was released, and the power of computer graphics was becoming evident.
I also found something else interesting in my research - Nvidia had AI applications in their GTC lineup as early as 2010. Although AI was still in its nascent stages, they were thinking about potential applications like computer vision, speech recognition, and robotics - the promise of a world powered by AI was massive, prompting Nvidia to pursue that domain aggressively.
Establishing an architecture needs a huge time investment - so it is very important that your target domain has huge potential.
Your architecture should fit your target workloads - not the other way around
If you want to build a standardized architecture, other developers will be building software for your architecture. This presents a major challenge to architecture companies. Usually, when there is a change in the algorithms or APIs that your architecture supports, there are two options:
Build a translation layer through your drivers or compiler to support the new workloads on the same architecture
Modify your architecture to better suit the new workloads
The first approach is often easier, and can be done on-the-fly (in the same chip generation). So this is the preferred method. However, a lot of chip companies ignore the second approach completely - if it works, why fix it? There are two stories from Nvidia about this which are interesting to study.
In 2003, Nvidia had established itself as the leader in computer graphics and was preparing to launch its next chip, the NV30. Around the same time, Microsoft had just released the new Direct3D 9 API. The NV30 did not fully support Direct3D 9, so Nvidia created software workarounds to support the new features. This turned out to be a disaster - the NV30 received bad reviews and had heating problems. It was a major learning for Nvidia.
Over the next few years, as Nvidia’s programmable shaders started to gain popularity, many researchers started using them for scientific computing. During the early days, the same GeForce cards used for graphics were being repurposed for scientific computing through some cumbersome programming. Although this was better than anything available at the time, it was still not optimal - much of the GPU architecture was still optimized towards graphics applications.
This time though, Nvidia saw the trend early - instead of keeping the same chip architecture for graphics and scientific computing, they started to build chips specifically meant for non-graphics applications, maximizing the number of parallel floating-point computations supported. In 2016, they also introduced FP16 support on their GPUs, because most deep learning workloads used FP16. These factors went on to play a massive role in their dominance in the data center, which today is mostly made up of AI workloads.
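As a rough illustration of why FP16 mattered, here is a hedged CUDA sketch (my own, not Nvidia’s code) of a half-precision axpy: storing values in FP16 halves their memory footprint and bandwidth, which is what many bandwidth-bound deep learning layers care about. It assumes a GPU with FP16 support (Pascal or later).

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

// y = a*x + y with FP16 storage: each element takes 2 bytes instead of 4,
// so the same memory bandwidth moves twice as many values per second.
__global__ void haxpy(const __half* x, __half* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r = a * __half2float(x[i]) + __half2float(y[i]);
        y[i] = __float2half(r);          // compute in FP32, store in FP16
    }
}

int main() {
    const int n = 1 << 20;
    __half *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(__half));
    cudaMalloc(&d_y, n * sizeof(__half));
    cudaMemset(d_x, 0, n * sizeof(__half));
    cudaMemset(d_y, 0, n * sizeof(__half));

    haxpy<<<(n + 255) / 256, 256>>>(d_x, d_y, 2.0f, n);
    cudaDeviceSynchronize();

    printf("FP16 buffers: %zu bytes each (FP32 would need %zu)\n",
           n * sizeof(__half), n * sizeof(float));
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```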
Have a unified and backward-compatible architecture
Two basic rules of life are: 1) Change is inevitable. 2) Everybody resists change.
To build a standardized architecture, making it backward compatible is key. Having a unified architecture naturally creates a flywheel:
All the libraries developed over the years work on your architecture
Developers are incentivized to write more libraries for your architecture
Getting developers onto a new architecture is key - that’s what Intel was able to do in PCs and ARM in mobile (PowerPC and DEC Alpha are examples where this did not happen). A unified architecture is very “developer friendly” - a community (ideally, open-source) starts to build around your architecture. This is exactly what Nvidia achieved with CUDA - since 2006, all Nvidia GPUs have been CUDA compatible, giving CUDA developers an install base of about 500 million devices, with about 300 libraries and 600 AI models.
CUDA is Nvidia’s moat, and a lot has been said about this already. For the purposes of this post, I want to talk specifically about some of the challenges of maintaining backward compatibility:
Clunky hardware: To support legacy architecture features, additional hardware complexity often needs to be maintained. This results in poorer Power, Performance, Area (PPA) metrics over time.
Restricting Innovation: Very often, something new and better cannot be implemented in your architecture because it breaks some legacy constraints.
Expensive Development: As the number of architectural features increases, it needs a bigger workforce, and more knowledge transfer - which costs time and money.
I think that over time, every architecture company gets hamstrung by its legacy architectural features - a case in point is Intel and the x86 architecture (I have an earlier post on ISAs with more details). This makes maintaining backward compatibility one of the most challenging aspects of computer architecture.
So far, Nvidia has managed this well, primarily owing to these factors:
Nvidia’s developers operate at a very high level of abstraction. CUDA by itself is a fairly high-level language (like C). Also, most developers build using CUDA libraries like cuBLAS and cuDNN, which are optimized for Nvidia’s architectures (see the code sketch below). This gives Nvidia more opportunities to improve their architecture while maintaining backward compatibility. In other words, they can maintain backward compatibility at the developer level, but break compatibility at the microarchitectural level - bridged by their driver and compiler. This differentiates them from Intel, who for the most part depended on Windows developers making good use of their architecture.
Their architecture is still fairly general and simple - it is centered around parallel programming and floating-point computations. A lot of the issues in Intel’s architecture stemmed from complexities added over the years, like variable-length instructions and complex arithmetic.
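To illustrate the first factor, here is a small sketch (my own, using the standard cuBLAS API) of how most developers actually touch Nvidia hardware: the code targets the cuBLAS GEMM interface, and whichever GPU generation, instruction set, or tensor-core path executes it underneath is chosen by Nvidia’s library and driver, not by the developer.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

// C = alpha*A*B + beta*C for 512x512 matrices. The developer writes against
// the library interface; the microarchitecture beneath is free to change.
int main() {
    const int n = 512;
    float *A, *B, *C;
    cudaMalloc(&A, n * n * sizeof(float));
    cudaMalloc(&B, n * n * sizeof(float));
    cudaMalloc(&C, n * n * sizeof(float));
    cudaMemset(A, 0, n * n * sizeof(float));
    cudaMemset(B, 0, n * n * sizeof(float));
    cudaMemset(C, 0, n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // The same call runs unchanged across GPU generations.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    printf("GEMM dispatched via cuBLAS\n");
    return 0;
}
```

Compiling with nvcc and linking against cuBLAS is all it takes; the library picks the best kernels for whatever GPU it finds at run time.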
So far, Nvidia has navigated the backward compatibility challenge well - but their workloads are fairly new, so it will be interesting to see how this continues in the long run.
Build infrastructure to move faster
The nature of workloads keeps changing all the time. One of the most impressive aspects of Nvidia is their ability to pivot when an opportunity presents itself. They started as a graphics card company, pivoted to programmable graphics, and then expanded into high-performance computing. Nvidia was able to grab opportunities better than anyone else because they could make big architectural changes quickly.
From the early days, Nvidia was a strong believer in simulation and automation. I think it stems from Jensen’s early days at LSI Logic, the company that pioneered many EDA innovations (I have covered many such stories in my EDA series). During his time at LSI, Jensen worked on a new chip architecture called “sea-of-gates” - which was a very early version of an FPGA emulator. Emulation would go on to play a big role in the development of the RIVA128 - widely regarded as the chip that saved Nvidia.
Traditionally, once a chip was designed, it was sent out to the fab, and a test chip was sent back. This test chip was used to run software, find bugs, and resolve them. There were multiple iterations of this until all bugs were resolved; then a final version of the chip was “taped out” and ordered in volume. This whole process usually took two years.
Nvidia’s first two chips, the NV1 and NV2, were poorly received. Nvidia needed to make major architectural changes in a very short time - which forced them to adopt emulation. With emulation, once the design is ready, it is loaded onto an emulator, which is then used to test software on the chip prototype even before it is manufactured. To do this, Nvidia went to a failing company called Ikos, invested heavily in their emulators (each one cost $1 million!), and essentially bet the company on emulation. It was a cumbersome process, but it worked - the RIVA128 delivered one of the biggest leaps in computer graphics and made Nvidia the leader in the field.
Although the RIVA128 story was born out of desperation, moving fast then became part of Nvidia’s culture. Nvidia continues to invest heavily in infrastructure to help them innovate faster than the competition. Nvidia also incorporated a feature called “virtualized objects” into their architecture. In simple terms, this was a mini-OS baked into the hardware (called the “Resource Manager”) that could be used to emulate late hardware features that could not make it in time for chip production. Although it incurred a minor performance cost, Nvidia adopted it because they greatly valued the ability to move quickly.
Generally at Nvidia, this obsession with efficiency is referred to as the “speed of light” approach, which says: every project must be executed at the fastest possible rate, and all obstacles in the process should be removed. This is an underrated aspect of chip design that a lot of companies neglect, making them slower and less receptive to new opportunities.
Understand bottlenecks outside your core architecture
If building a computing platform is like building a car, the architecture is the engine. Even if you have built the fastest engine, your car can’t go fast if you have weak tires or there is bumper-to-bumper traffic on the road. Although your architecture is just one part of the computing stack, all the end user cares about is how fast their workload runs. So it’s very important to analyze where the bottlenecks are and work on managing them better.
One fundamental bottleneck every architect must think about is transistor scaling - i.e., how Moore’s law is progressing at the time you are building your architecture (I covered Moore’s law in more detail in an earlier post). Nvidia learnt this the hard way.
Nvidia’s first graphics card, the NV1, was designed to render 3D surfaces on a 2D screen using quadrilaterals (4 vertices) instead of triangles (3 vertices). Nvidia had a good reason to do this - memory costs were very high, and using quadrilaterals would allow them to build their chip with less memory, letting them price their graphics card competitively. However, what they didn’t see was that Moore’s law started to accelerate around the same time, making memory much cheaper. The software ecosystem of the time standardized on triangles, and their competitors were able to support it using larger memories at comparable prices. Despite having one of the best architectures, Nvidia could not remain competitive because they had too little memory.
Many years later, in the datacenter business, Nvidia knew that their architecture alone could not push them to the top. Not a lot of people know this, but Nvidia was in the cloud business very early - in 2013, they released Nvidia Grid, an early version of their cloud gaming platform, known today as GeForce Now. Another fact that is not well known: Nvidia had an LLM of their own well before ChatGPT became popular - in 2019, they built an open-source LLM called Megatron.
Both these experiences taught Nvidia about all the components needed to make the best datacenter servers. As a result, Nvidia has expanded well beyond their GPU architecture in datacenters, to offer:
Very high bandwidth memory using CoWoS 2.5D stacking (very important for LLMs)
Faster interconnects between cards through NVLink, and faster networking across the datacenter through their acquisition of Mellanox
A very efficient ARM-based CPU for the datacenter (in fact, they tried to acquire ARM)
This gave them control of a bigger portion of the computing stack and allowed them to optimize it further. Vertical integration as an architecture company also buys you time - even if Nvidia’s GPU architecture is not the best in a given generation, the full solution they provide might still be better than anything else on offer. Finally, a full solution is much easier for enterprises to deploy.
Although most architecture companies cannot start here, it’s very important to move in that direction as your company evolves. This is how you move from a great technology to a great product.
References:
The Nvidia Way, a book about Nvidia’s story by Tae Kim
Nvidia episodes from Acquired
NVIDIA's Jensen Huang on AI Chip Design, Scaling Data Centers, and his 10-Year Bets
Ep17. Welcome Jensen Huang | BG2 w/ Bill Gurley & Brad Gerstner