The most important law every chip designer should know about
It's not Moore's Law. It's not Huang's Law.
Disclaimer: Opinions shared in this and all my posts are mine, and mine alone. They do not reflect the views of my employer(s) and are not investment advice.
Story time
In 1989, a chip design team started work on one of the most ambitious PC processors of the time - one with a superscalar architecture, branch prediction, and a much faster floating point unit. Over the next two years, hundreds of engineers were involved in the design, simulation, and extensive verification, which culminated in a test chip. The testing continued - by running major operating systems with the most popular applications of the time - for an estimated 100 million clock cycles. After all the known bugs were resolved in 1993, the final chip was shipped to all the biggest computer OEMs of the time under a new brand name: “Pentium”. The chip was an instant hit - the reviews were positive, and as a result, Intel’s microprocessor sales doubled in 1993. The new floating point unit impressed, delivering a 10x improvement over the previous generation of Intel chips. It was another feather in Intel’s cap, extending their dominance in the PC processor market. It seemed like they had done everything right, and nothing could go wrong…
The Pentium chip had millions of users over the years, but one of them has a special place in history. Dr. Thomas Nicely and Intel couldn’t have been further apart - he was a mathematics professor at Lynchburg College in Virginia, far from the hustle and loud marketing that Silicon Valley was known for. About 18 months after the release of the Pentium processor, Dr. Nicely noticed something - the results from his research on twin prime numbers were slightly off. Like a typical academic, instead of dismissing it as a “random error”, he dug deeper - which led him to the conclusion that the Pentium processor was making mistakes on certain floating point division operations. He posted his findings online, and this opened up a can of worms that would cost Intel dearly.
Triaging the bug
Remember the long division method from high school? Microprocessors in the 1990s used a similar digit-by-digit method to divide floating point numbers. Although it worked, this method was quite slow - making floating point division one of the slowest arithmetic operations in a microprocessor. To make this operation execute faster, Intel implemented a different approach called the Sweeney-Robertson-Tocher (SRT) method. I’ll skip the details here, but for the purposes of this post, all you need to know is that implementing SRT in hardware requires a lookup table that supplies each quotient digit based on a few bits of the partial remainder and the divisor. It was in this lookup table that the now famous Pentium FDIV bug originated.
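To make the idea concrete, here is a toy Python sketch of digit-recurrence division, where each iteration produces one radix-4 quotient digit. It is only an illustration under simplified assumptions: real SRT hardware uses a redundant digit set (-2 to +2) and selects each digit from a truncated estimate of the partial remainder and divisor - that selection step is exactly what the Pentium’s lookup table implemented.

```python
# Toy radix-4 digit-recurrence division (illustration only). Real SRT hardware
# uses a redundant digit set {-2..+2} and picks each digit from a small lookup
# table indexed by a few bits of the partial remainder and divisor -- the table
# that was mis-programmed in the Pentium.

def divide_radix4(dividend: int, divisor: int, steps: int = 16) -> float:
    """Approximate dividend/divisor (with dividend < divisor) one base-4 digit at a time."""
    assert 0 < dividend < divisor
    remainder, quotient = dividend, 0
    for _ in range(steps):
        remainder *= 4                   # shift to the next radix-4 digit position
        digit = remainder // divisor     # digit selection: exact here, a table lookup in hardware
        remainder -= digit * divisor     # keep 0 <= remainder < divisor
        quotient = quotient * 4 + digit  # append the digit to the quotient
    return quotient / 4 ** steps         # interpret the digits as a base-4 fraction

print(divide_radix4(1, 3))   # ~0.3333333...
```

In the Pentium, that digit-selection step was implemented as a hardware lookup table - and that is where the bad entries lived.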
In order to implement this lookup table, the Pentium team used a Programmable Logic Array, or PLA. A PLA plays a role similar to a Read-Only Memory (ROM), but it can store structured data in a smaller area by representing the data as a two-level logic function instead of storing every entry explicitly. In the 1990s, memory was still very expensive, which made PLAs an attractive alternative. The word “programmable” in PLA can be misleading in today’s context - programming a PLA means permanently fixing which transistor connections exist (often described as “fusing”), so that current flows through the right transistors to produce the desired logic. This is done before the chip ships to the customer. This transistor mapping (i.e. which connections exist) determines the final output. For example, the figure below shows a 3-input, 2-row, 2-output PLA, where different programming (blue and green dots) results in different output logic.
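As a rough software analogy, a PLA can be modeled as an AND plane of product terms feeding an OR plane of outputs, where the “programming” is simply which connections exist in each plane. The sketch below is a hypothetical 3-input, 2-term, 2-output example for illustration only - it does not reproduce the programming shown in the figure.

```python
# Hypothetical 3-input, 2-term, 2-output PLA model. "Programming" the PLA means
# choosing which connections exist in the AND plane and the OR plane.

AND_PLANE = [
    # (mask, values): a product term fires when every input with mask=1 matches its value
    ((1, 1, 0), (1, 0, 0)),   # term0: fires when a=1 and b=0 (c is ignored)
    ((0, 1, 1), (0, 1, 1)),   # term1: fires when b=1 and c=1 (a is ignored)
]
OR_PLANE = [
    (1, 0),                   # output0 is driven by term0
    (0, 1),                   # output1 is driven by term1
]

def pla(a: int, b: int, c: int) -> tuple:
    inputs = (a, b, c)
    terms = [all(inp == val for inp, m, val in zip(inputs, mask, vals) if m)
             for mask, vals in AND_PLANE]
    return tuple(int(any(t for t, sel in zip(terms, row) if sel)) for row in OR_PLANE)

print(pla(1, 0, 1))   # (1, 0): only term0 fires, so only output0 is high
```

Change a single connection in either plane and the outputs change - which is why a handful of wrongly programmed positions can silently corrupt the stored table.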
To implement the SRT division method, the Pentium team used a 22-input, 120-row, 2-output PLA - which means there were 2880 potential transistor positions to program. (Of these, only 2048 values were used.) Unfortunately for Intel, 5 of these were programmed incorrectly. Whenever a division picked up one of these incorrect values from the lookup table, the result of the floating point division was wrong. This was the cause of the FDIV bug in the Pentium.
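One widely circulated way to demonstrate the bug was a single division whose operands land on the bad table entries - the pair below is the well-known published test case from the time. On any correct FPU (including whatever you run this on today), the expression is exactly zero; a flawed Pentium famously returned 256.

```python
# The classic FDIV check. On a correct FPU this prints 0.0; on a flawed Pentium
# it returned 256, because 4195835/3145727 hits one of the bad table entries.
x, y = 4195835.0, 3145727.0
print(x - (x / y) * y)   # 0.0 on a correct FPU; 256 on a buggy Pentium
```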
How did this bug go unnoticed?
Looking back, this sounds like a trivial issue that should have been caught much earlier in the testing process. But remember, we are talking about the best microprocessor company of its time - which raises the question: how did they miss this bug? There are a few theories floating around:
Intel’s whitepaper claims this was a clerical error - the C program an engineer wrote to load the final transistor mapping into the PLA had a bug, which caused 5 entries to be left out of the PLA. The values were verified before being loaded into the PLA, but the contents of the PLA itself were never checked.
Robert Colwell, architect of the Pentium Pro, mentions in his book “The Pentium Chronicles” that the error was caused by a last-minute request (an order, really) from management to shrink the size of the PLA, which pushed the engineers into an optimization that was not properly verified.
Some postmortem studies claim that the engineers misunderstood the SRT method and applied the wrong rule when lowering thresholds - which would mean this was not an error introduced in any step of the chip design process; the table was mathematically wrong in the first place.
Aliens manipulated the final chip layout in order to scale back human progress (I’m not kidding. Intel made a movie about this, called “Intel: The Journey Inside” with this plot.)
Irrespective of the reason, it’s interesting that the bug went unnoticed for close to 2 years after the chip was released. (Although Intel claims they were aware of the problem by the summer of 1994, months before Dr. Nicely made it public.) This is because the odds of hitting this bug were extremely low:
The division must hit one of the 5 incorrect values out of the 2048 entries in the PLA - roughly a 0.24% chance
Intel claimed a typical user would encounter this problem once every 27,000 years - another way of putting it is that an error would occur roughly once in every 9 billion random divisions (a quick sanity check of these numbers follows this list)
The error typically shows up in the 9th or 10th decimal digit (at worst, the 4th). Very few applications require such high precision
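Here is a rough sanity check of the figures above. The 1,000-divides-per-day usage rate is an assumption on my part (roughly the kind of “typical user” profile Intel’s analysis relied on); the other numbers come straight from the list.

```python
# Back-of-the-envelope check of the published figures. The divides-per-day rate
# is an assumed "typical user" figure, not something from the post itself.
bad_entries, total_entries = 5, 2048
print(f"share of bad PLA entries: {bad_entries / total_entries:.2%}")   # ~0.24%

divides_per_error = 9e9      # Intel's estimate for random operands
divides_per_day = 1_000      # assumed typical usage rate
years_to_hit = divides_per_error / divides_per_day / 365
print(f"mean time to encounter the bug: ~{years_to_hit:,.0f} years")    # ~25,000 years
```

Under that assumption the numbers land in the same ballpark as Intel’s quoted 27,000 years.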
Although a lot of people claimed to have been impacted, Dr. Nicely was the only person known to have noticed the bug in regular use. (All other scenarios seem to have been artificially designed to hit the bug.) When it comes to bugs, Intel won the lottery - if winning the lottery means creating a bug that is still being talked about 30 years later.
Consequences
The bug was clearly not great news for Intel. But in a way, they got lucky - the odds of this bug having any real impact were ridiculously low. (If we go by Intel’s numbers, the odds are lower than those of a plane crash.) Yet, Intel did not get away with it.
Intel’s early response was to shrug it off as an unlikely scenario, which drew a lot of flak. News of Dr. Nicely’s findings spread like wildfire, and even CNN ran a segment on the issue. IBM, simultaneously Intel’s biggest customer and a competitor (with the PowerPC), claimed that the issue was more common than Intel estimated. AMD ran an advertisement saying their chips “can actually handle the rigors of complex calculations like division” - a clear dig at Intel’s bug.
Ultimately, Intel had to issue a public apology and offer a free replacement to anyone with a Pentium processor who asked for one. This was estimated to have cost Intel about $500 million - along with a lot of damage to their reputation.
This is the impact that even a seemingly harmless chip design bug can have.
The most important law in chip design…
This brings me to the title of this post, and the reason I told the story of Intel’s famous FDIV bug. This is not just a story about Intel’s mistakes. Every major chip company in existence has had a similar story, and I’m sure every future chip company will too. There is a lesson to learn from stories like this.
It is in the nature of chip design and computing that even the unlikeliest outcomes can occur, and should therefore be accounted for. Over the years, architectures have changed, workloads have changed, but this fact continues to remain true. Hence, the most important law in chip design is actually Murphy’s law: If something can go wrong, it will go wrong.
We talk a lot about transistor nodes and chip performance, but it is important to remember that the first priority will always be: build a chip without bugs. Maintaining such high standards at the scale of billions of transistors, each measured in nanometers and switching billions of times per second, is what makes the chip design industry truly unique.
References
Few facts about the Pentium processor
Dr. Thomas Nicely notes on his FDIV bug discovery
About the SRT division method
A great deep dive into the bug
Explanations for the bug
Media coverage after the bug became mainstream