The Advent of Huang's Law
It’s been known for some time that Moore’s law is dying. Transistor densities no longer rise at the rates they once did [1]. As a result, the last decade has trained computer scientists not to expect their code to get faster without effort. Multicore CPU systems remain hard to program, and extracting performance from them typically requires significant tuning by a skilled programmer. At the same time, the growth of mobile computing has led to a Cambrian explosion in the applications of deployed programs. But programming for mobile is an exercise in constructing useful programs with light footprints, a very different world in which code only gets faster with hard work and tuning.
However, despite the imminent passing of Moore’s law, we are beginning to see a new and interesting phenomenon. For a certain class of programs, namely deep networks, code does seem to get faster almost by magic. It’s now possible to train a complicated convolutional network on the ImageNet dataset in 22 minutes [2]! For those of us who remember that the original models took a week to train [3], this feat seems almost magical. It’s not a fair comparison, of course; more hardware and improved algorithms were used in tandem to achieve the speedup. The original code by itself can’t achieve anywhere near these speeds on today’s hardware.
But, with the advent of TensorFlow and similar mature deep learning packages, this state of affairs is beginning to change. Code written in earlier versions of TensorFlow does tend to get faster! The original versions of TensorFlow were quite slow [4], but significant effort by Google’s team has mostly resolved those issues. Continued improvements by NVIDIA on both the software and hardware sides mean that TensorFlow code written some time back, when run on the latest generation of GPUs with the latest version of TensorFlow, does tend to run faster in practice.
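To make this concrete, here’s a minimal sketch of the kind of measurement behind that claim (written in the current TensorFlow 2 style; the matrix size and iteration count are illustrative choices of mine, not from any particular benchmark). The source never changes, yet its wall-clock time improves as the GPU and the TensorFlow runtime beneath it improve.

```python
import time
import tensorflow as tf

# Toy benchmark: this source stays fixed, but its wall-clock time
# depends on the TensorFlow version and the hardware underneath it.
x = tf.random.normal((4096, 4096))

@tf.function
def matmul_step(a):
    return tf.matmul(a, a)

matmul_step(x)  # warm-up: trace the graph and initialize the device

start = time.time()
for _ in range(100):
    y = matmul_step(x)
_ = y.numpy()  # block until the device finishes before stopping the clock
elapsed = time.time() - start

devices = tf.config.list_physical_devices('GPU') or ['CPU']
print(f"100 matmuls of 4096x4096: {elapsed:.3f}s on {devices}")
```

Run the identical script on a newer GPU with a newer TensorFlow release and the number simply drops; that, in miniature, is the dynamic described above.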
Technology commentators have begun to notice this new state of affairs. IEEE Spectrum ran a recent article [5] noting that GPU performance continues to scale dramatically, delivering significant boosts generation over generation, an observation they term “Huang’s Law”. Nvidia CEO Jensen Huang has been making this point for a few years now. I remember hearing him make a similar case at a Stanford colloquium a couple of years ago, but I didn’t take the analysis seriously at the time, since Nvidia has a strong financial incentive to make such arguments. However, the evidence is building, and I think there’s something to this new “Huang’s law.”
It’s not just in deep learning. Custom architectures have started to demonstrate real and practical improvements in a number of fields. Microsoft has experimented with FPGAs for deploying machine learning systems for quite some time [6]. In the blockchain world, Bitmain has led an assault on many cryptocurrencies by designing custom mining ASICs (application-specific integrated circuits), even for systems where such designs were thought to be infeasible [7], [8]. Google has been working on its TPUs (tensor processing units) for some time and has used them to dramatic effect in the battle of AlphaGo vs. Lee Sedol [9] and in the deployment of its new machine translation systems [10]. Nvidia’s grand achievement, however, is in making the case that these architectural improvements are not merely isolated victories for specific applications but are perhaps broadly applicable to all of computer science.
This case is buttressed by the broad adoption of GPUs; much modern software is beginning to run partially on the GPU stack instead of the CPU stack. This shift has been aided and abetted by the dramatic growth of deep learning architectures. A number of commentators, including Andrej Karpathy in his “Software 2.0” essay [11], and Reza Zadeh and myself in “TensorFlow for Deep Learning” [12], have made the case that deep learning represents a fundamental paradigm shift in software engineering. Work from Google and others [13] has begun to demonstrate that even foundational systems software tools can be repurposed and improved with deep networks. The magic powering the new dynamics of Huang’s law is that as deep-learning-powered software becomes widespread, the gains from GPU scaling, and more generally from architectural advances, will translate into noticeable improvements in the performance and behavior of modern software stacks.
This combination of factors will make this new Huang’s-law-style scaling a driving force in computer science for at least the next decade or so. We lack the data at the moment to even hypothesize that Huang’s law will have the broad influence of Moore’s law, which held for over 50 years. In particular, it’s not clear when architectural improvements will stop yielding tangible gains in scaling. I suspect that there’s only so much that can be optimized out of architectural designs without fundamental physical advances. But I also suspect that there’s far, far more scope for improvement than previously thought. It’s going to be interesting to see how the next few years of the technology industry play out.