The past few years have been rough for chipmaker AMD. Ever since Intel launched the Core 2 series, they’ve been playing a constant game of catch-up as far as CPU performance goes. In 2006, they purchased GPU maker ATI, and in 2009 they spun off their fabrication plants to create a new entity, GlobalFoundries. AMD had lofty goals: to merge the CPU and the GPU under the Fusion initiative, and to revolutionize the industry with an entirely new microarchitecture called Bulldozer.
Bulldozer was originally slated for a 45nm process, but that version was canceled and the design was pushed back to 32nm. AMD had talked up its clustered multithreading (CMT) design, which allows for more concurrent operations without the die-size cost of adding more full cores. In Bulldozer, each module contains two integer cores that share building blocks such as the fetch unit, the decode unit, cache, and a double-wide floating point unit (FPU). In theory, this allows up to an 80% increase in multithreaded performance with minimal die area increases.
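To make that 80% figure concrete, here is a rough back-of-the-envelope sketch in Python. It assumes one reading of the claim: that the second thread in a module adds up to 80% of the throughput of the first. The function names and the four-module example are illustrative only, not anything published by AMD.

```python
def module_throughput(second_thread_gain: float = 0.8) -> float:
    """Throughput of one two-thread module, in units of a single thread.

    ASSUMPTION: the second thread adds `second_thread_gain` of a single
    thread's throughput (0.8 is the theoretical best case from the text).
    """
    return 1.0 + second_thread_gain


def chip_throughput(modules: int, second_thread_gain: float = 0.8) -> float:
    """Idealized whole-chip throughput, ignoring clocks, cache, and memory."""
    return modules * module_throughput(second_thread_gain)


# A hypothetical four-module (eight-thread) chip under the best-case figure.
best_case = chip_throughput(4)
print(f"4 modules: about {best_case:.1f} single-thread equivalents")
```

Under these assumptions a four-module chip behaves like roughly 7.2 independent threads rather than 8, which is the trade CMT makes in exchange for the smaller die.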
In practice, the launch of Bulldozer was a disaster for AMD. Because of its long pipeline, it required higher clock speeds to reach the same level of performance as the previous K10 architecture found in Phenom II. The CMT architecture also gave each individual thread fewer resources than traditional architectures, cache had higher latency, and the wide FPU could only execute two threads at once using newer software extensions that most software simply didn’t use. AMD also had the unfortunate problem that the Phenom II X6 was an affordable hex-core on a proven process, so it was used as a point of comparison more often than the Phenom II X4. Windows 7 also had issues with scheduling threads on a per-module basis, which prevented the Turbo feature from kicking in.
Launching a new architecture on a new process introduces two sources of uncertainty, which is why Intel sticks to its famous tick-tock cycle, in which the previous architecture is adapted to a new process, and a new architecture is introduced only after that process has matured. However, having spun off their fabrication plants, AMD was at the mercy of whatever process GlobalFoundries could put out. The result was a hot chip that did not live up to expectations. AMD didn’t even bother to put it in mobile chips; they stuck with their older Phenom II-era cores in the form of Llano.
AMD noted that Bulldozer was the first of at least four planned generations of this architecture. In 2012, they launched Piledriver, which was largely Bulldozer with cache enhancements, bug fixes, and a more mature process, capable of sustaining higher clocks while using the same amount of power. Piledriver did make it into mobile, in the form of Trinity and Richland, both of which paired Piledriver modules with AMD’s VLIW4 GPU architecture, found on Radeon HD 6900 series cards.
However, AMD’s exclusivity agreement with GlobalFoundries meant that Piledriver was still on a 32nm process, even as Intel hit 22nm and TSMC hit 28nm. GlobalFoundries has recently been able to get its 28nm process going, and the result is Kaveri, a chip with two Steamroller modules paired with AMD’s latest Graphics Core Next (GCN) architecture on a single die.
While Piledriver was largely a refinement of Bulldozer, it was referred to as a second-generation version, and carried the internal name of bdver2. However, it did not introduce any large architectural changes, and given the new process, I would say that Steamroller is more accurately referred to as the second generation of Bulldozer, with Excavator being a future, further refinement. The biggest change that AMD made was dropping the shared decoder, which was a key bottleneck in single-threaded performance on Bulldozer and Piledriver. Each thread now gets its own discrete decode unit.
However, by virtue of being on a 28nm process rather than a 22nm process, AMD only gets the density advantage of a half-node, not a full node. The chip landscape has also changed since AMD started the design. Many foundries are now optimizing their processes for low-voltage, low-leakage mobile use, with the focus on performance per watt at low power. The entire Bulldozer family was designed in an era when AMD expected high-wattage processes, with raw performance as the main goal and performance per watt measured at higher wattages, to remain the norm.
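To put rough numbers on the half-node versus full-node gap, here is a small Python sketch using the textbook idealization that die area scales with the square of the feature size. That is an assumption for illustration only: node names are partly marketing labels, and real-world density gains vary by foundry and by design.

```python
def ideal_area_scale(old_nm: float, new_nm: float) -> float:
    """Idealized area factor when shrinking from old_nm to new_nm.

    ASSUMPTION: area scales with the square of the linear feature size.
    Real processes rarely hit this number exactly.
    """
    return (new_nm / old_nm) ** 2


half_node = ideal_area_scale(32, 28)  # the shrink AMD actually gets
full_node = ideal_area_scale(32, 22)  # the shrink Intel got

print(f"32nm -> 28nm: {half_node:.2f}x the original area")
print(f"32nm -> 22nm: {full_node:.2f}x the original area")
```

Under this idealization, the same design shrinks to roughly three quarters of its 32nm area at 28nm, versus just under half at 22nm.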