Dan Luu

Why do so many great engineers hold Alan Kay in contempt?

Added 2021-08-02 13:01:04 +0000 UTC

To answer this question, let's look at a claim that's representative of Alan Kay's "systems" claims.

In this ACM interview right before 2005 (https://queue.acm.org/detail.cfm?id=1039523), Alan Kay claimed that computers would be 1000x faster if we listened to him and designed computers his way:

> Neither Intel nor Motorola nor any other chip company understands the first thing about why that architecture was a good idea.

> Just as an aside, to give you an interesting benchmark—on roughly the same system, roughly optimized the same way, a benchmark from 1979 at Xerox PARC runs only 50 times faster today. Moore’s law has given us somewhere between 40,000 and 60,000 times improvement in that time. So there’s approximately a factor of 1,000 in efficiency that has been lost by bad CPU architectures.

> The myth that it doesn’t matter what your processor architecture is—that Moore’s law will take care of you—is totally false.
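As a quick sanity check on the arithmetic inside the quote (not on whether the claim is right), here's a minimal sketch that uses only the numbers Kay gives. The variable names are mine, chosen for illustration.

```python
import math

# Figures taken directly from the quote above.
measured_speedup = 50                    # "runs only 50 times faster today"
moore_low, moore_high = 40_000, 60_000   # claimed Moore's law improvement

# Implied "lost" efficiency: expected speedup divided by measured speedup.
lost_low = moore_low / measured_speedup
lost_high = moore_high / measured_speedup
print(lost_low, lost_high)

# Number of doublings that a 40,000x to 60,000x improvement corresponds to.
doublings_low = math.log2(moore_low)
doublings_high = math.log2(moore_high)
print(round(doublings_low, 1), round(doublings_high, 1))
```

This gives 800x to 1200x, which is where the "approximately a factor of 1,000" comes from, and about 15.3 to 15.9 doublings between 1979 and the interview.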

He then goes on to claim that what really killed Smalltalk (as well as Lisp) was these poorly designed, slow computers:

> Yes, actually both Lisp and Smalltalk were done in by the eight-bit microprocessor—it’s not because they’re eight-bit micros, it’s because the processor architectures were bad, and they just killed the dynamic languages. Today these languages run reasonably because even though the architectures are still bad, the level 2 caches are so large that some fraction of the things that need to work, work reasonably well inside the caches; so both Lisp and Smalltalk can do their things and are viable today. But both of them are quite obsolete, of course.

And then he makes one of his trademark comments about how programming is pop culture and kids nowadays don't know anything:

> Like I said, it’s a pop culture. A commercial hit record for teenagers doesn’t have to have any particular musical merits. I think a lot of the success of various programming languages is expeditious gap-filling. Perl is another example of filling a tiny, short-term need, and then being a real problem in the longer term. Basically, a lot of the problems that computing has had in the last 25 years comes from systems where the designers were trying to fix some short-term thing and didn’t think about whether the idea would scale if it were adopted. There should be a half-life on software so old software just melts away over 10 or 15 years.

> It was a different culture in the ’60s and ’70s; the ARPA (Advanced Research Projects Agency) and PARC culture was basically a mathematical/scientific kind of culture and was interested in scaling, and of course, the Internet was an exercise in scaling. There are just two different worlds, and I don’t think it’s even that helpful for people from one world to complain about the other world—like people from a literary culture complaining about the majority of the world that doesn’t read for ideas. It’s futile.

> I don’t spend time complaining about this stuff, because what happened in the last 20 years is quite normal, even though it was unfortunate. Once you have something that grows faster than education grows, you’re always going to get a pop culture.

One thing that I think is interesting is that he's been going on about this for decades at this point (e.g., in a USENET message posted to comp.arch in 1998, John Mashey notes that he talked to Alan Kay about a decade ago, so around 1988, about his ideas and could not find anything useful in them). Nowadays, when I see systems or hardware folks reference Alan Kay's ideas on hardware, from the tone, I think they hold him in contempt and find his ideas risible. For example, in https://twitter.com/rygorous/status/1420894393951653893 Fabian Giesen offhandedly dismisses Alan Kay's ideas as even worse than something he's shitting on, saying "unlike the Alan Kay thing, the actual technical stuff he says in there is all sound, but still there's the implicit expectation that newer HW was supposed to make every workload faster forever and everything less is failure" and in https://twitter.com/pervognsen/status/1420870430047358977, Per Vognsen says "Alan Kay's opinions about hardware are infuriating. Also a good reminder of what happens to a smart person who is so disgusted with hands-on work that he retreated to the realm of pure ideas and armchair philosophy at the soonest possibility", followed by "Even if you want other people to build all your shit (aka you're a computer architect) you need to understand how things work to the level where you can quantitatively estimate design opportunities based on first principles reasoning rather than reasoning by analogy." To that, John McCall responds with, "Yeah, Kay’s opinions here are useful mostly as a way of easily detecting contrarians who lack the ability to engage with the subject". Elsewhere, Yossi Kreinin has said "If Alan Kay has a 1000x better arch he'd be leveraging a lot of that fiendish cleverness – such as EDA tools and chip fabs. At worst the 1000x would degrade to 100x and still he'd steamroll the competition and the reason he doesn't do it is he's full of shit ...
it's never been cheaper, not necessarily to make a chip but certainly to make a synthesizable core for someone else to manufacture, you don't really need that much manpower and the tools are there and they're great, look for instance at Adapteva's Epiphany (which will remind of you of Erlang but – surprise surprise! it runs C not Erlang whose creator said he "doesn't give a hoot about efficiency" and, well, it's compute throughput was, is and will be in the shitter.) This company is five people who're not as full of shit as Alan Kay and who unlike him put their money where their mouth is.". I also remember a similar, but more politely phrased, comment from Marc Brooker that I can't seem to find at the moment.

I don't believe these are cherry picked comments; this is what I generally hear from systems and hardware folks. This is a very different set of opinions from what I hear from folks who don't work with hardware and I'd like to explain why folks who understand hardware tend to agree with people like Fabian, Per, and John.

Fortunately, Kragen Sitaker already looked into this, which saves a lot of legwork. In https://web.archive.org/web/20131029105111/http://lists.canonical.org/pipermail/kragen-tol/2007-March/000850.html, he looked at some actual performance numbers, estimated the cost of the Dorado, and came to the conclusion that Alan Kay is wrong and that, adjusting for Moore's law, a modern computer from 2005 delivers the expected level of performance. Moreover, the Dorado would have cost an estimated $100k in 1979 and, since we've already accounted for Moore's law, Alan Kay's vaunted Dorado was actually extremely expensive for the level of performance it would deliver if a modern version were built today. Rather than get 1000x better performance, we'd get the same performance at 1000x the cost if we don't adjust for inflation.

However, Kragen isn't really a hardware engineer and, looking at the numbers Kragen provided and the analysis with a hardware engineer's eye, I think that Kragen is overestimating how good the Dorado comes out in this comparison.

Alan Kay claimed that a modern machine from December '04 performed only 50x better than the Dorado in '79. Kragen did the heavy lifting here on getting benchmark numbers and found that a 600 MHz machine was 100x faster (with the rest of the 1000x, 10x, accounted for by not stacking the deck against the modern machine by running Squeak, and instead using StrongTalk, which was 3x to 10x faster than Squeak). He says that this accounts for the 1000x, which is true, but now let's look at a few things that jumped out at me:

1. The reference computer for the modern benchmark was a 600 MHz ARM machine, which was very slow by the standards of the day. Since Alan Kay is making a claim that computers should be 40000x to 60000x faster due to exponential growth in performance due to Moore's law, we really need to find when a computer with similar performance was released so we can divide out the difference from the 60k-fold claim to avoid unfair extra doublings in performance. Unfortunately, the message that Kragen referenced with the benchmark seems to have been lost (https://web.archive.org/web/20160401000000*/http://lists.squeakfoundation.org/pipermail/squeak-dev/2005-April/091215.html), but if the reference was to a 600 MHz ARM in 2005, a decent reference point for that would've been a 600 MHz ARM9 (not to be confused with the later released and much faster ARM Cortex-A9). This is not a processor that you'd expect in a "real" computer; it's a low-power architecture introduced in 1997, intended for use in things like cell phones (before smart phones), pagers, set-top boxes, etc., where the chip's power budget would be well under 10 mW. By comparison, high-end Intel desktop chips in 2005 were burning 100W, or 10000x more power, with the expectation that they'd also deliver much higher performance.

To find a time when serious processors were as slow as a 600 MHz ARM9, we probably have to go back to September 1992 with something like DEC's EV4. That's 12.3 years, or 12.3 * 12 / 18 = 8.2 generations of Moore's Law, so Alan Kay incorrectly added an extra 2**8.2 = 294x performance to the Dorado. But this is actually an unfair comparison since the Dorado in 1979 was a prototype and the EV4 was shipping in volume in 1992. Changes were made to the Dorado so it would be cheaper and more amenable to making in small batches (on the order of tens), which slowed it down by 15% and also took time. Let's say it would've taken three years to really be able to ship in volume; that's another 2 generations of Moore's law or another 4x error, bringing us to 1176x error on Kay's claimed 1000x speedup.

If we add in Kragen's note that, if we don't stack the deck against the modern machine by using a slow compiler, we get another 3x to 10x speedup, making the error 3528x to 11760x on Kay's claimed 1000x speedup. For reasons that I'll explain later, this compiler speedup wouldn't be, in general, available to a modern Dorado.

And BTW, if you think the EV4 comparison is unfair, you could re-do the computation with something like a 400MHz EV5 from 1996, which should unambiguously smoke a 600MHz ARM9; that reduces the adjustment by 2.6 generations or 6x and won't really change the final conclusion.

2. Alan Kay's naive application of Moore's law to compute speedup assumes that the application is 100% CPU bound. Of course, that's very rare, and Moore's law did not speed up memory latency by nearly as much as it sped up CPUs. That's almost certainly at least another 2x right there (unless you rely very heavily on caching, but note that Alan Kay claimed that caching was something you only needed to make up for weakness in architecture, so it's not something a modern design of his would include), so now we're up to at least 7056x to 23520x error on Alan Kay's claimed 1000x speedup (and realistically, probably much more).

3. Kragen generously did not adjust for inflation when comparing his estimated cost of a Xerox Dorado in 1979 to a modern machine. Kragen does note that he's not sure how to compute the cost of a custom ECL machine in 1979, but his $100k estimate seems like it's in the right ballpark. The less ambitious Xerox Alto cost $32k at the time, and comparing to other hardware that was available at the time, $100k would be in the ballpark of but cheaper than the PDP-1, 360/30, 1401, and VAX and in the ballpark of but more expensive than the 1620 and Apollo, which is a plausible range to be in. For reference, in 1979, the median house in the U.S. cost $53k. Since there's significant slop in the estimate, let's say that the 1979 Dorado cost 1-3 median houses. That's arguably unfair since home prices have increased faster than CPI; if you want to use CPI instead, that would have been $279k in December '04 dollars or $398k in June '21 dollars for a computer that, according to Alan Kay's methodology when the numbers are corrected for errors, would be an order of magnitude slower than a comparable modern machine.

4. The Dorado architecture is actually highly specialized by language. This is one of the big beefs Alan Kay has with modern hardware designers. He believes that we should build language-specific hardware, since it will obviously be faster to put things into hardware. https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.html and http://yosefk.com/blog/the-high-level-cpu-challenge.html are nice refutations of that idea, but the relevant bit here is that, to achieve its performance claims, the Dorado needed a custom-configured decoder that could run one of BCPL, Mesa, Interlisp, or Smalltalk. There was a mechanism to add more languages, but only one customization could run at once if you have a big language. So, for the price of 1-3 houses, you could get a machine that, adjusted for Moore's law, is roughly one order of magnitude slower at executing one language and is even slower than that if you need to execute programs that switch languages, such as a general purpose operating system or even something that calls a library written in another language.

And BTW, this specialization is why I think it's not unfair to say that the Dorado would not benefit from compiler speedups the same way a modern chip does with StrongTalk. If you move interpreter logic into hardware, there's no known way to move the kind of optimization passes a compiler does into hardware. There have been a number of serious attempts to do that kind of thing, including Transmeta and Nvidia's Denver and no attempt has even come close to either hitting its performance claims or producing a chip that performs reasonably for its power envelope and manufacturing cost. Transmeta spent nearly $1B and still couldn't do it and Nvidia drafted off of that effort and bought the team and tried to build off of the technology and that still didn't work. That's not to say that it can't work, but doing it reasonably well is still an open research problem.
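For anyone who wants to re-run the numbers, the running error estimate from points 1 and 2 above can be sketched as follows. This is just the post's arithmetic written out: the 18-month doubling period, the 12.3-year and 3-year gaps, and the 3x to 10x and 2x factors all come from the text, the rounding mirrors the rounding used there, and the function and variable names are mine.

```python
MONTHS_PER_GENERATION = 18  # doubling period used in the post's arithmetic


def generations(years):
    """Number of Moore's law doublings in a span of `years`."""
    return years * 12 / MONTHS_PER_GENERATION


# Point 1: a 600 MHz ARM9 is treated as EV4-class (September 1992),
# i.e. 12.3 years older than the December 2004 claim.
ev4_generations = round(generations(12.3), 1)
ev4_factor = round(2 ** ev4_generations)  # rounded to a whole number, as in the post

# Prototype vs. volume shipping: three more years, as assumed in the post.
volume_factor = 2 ** generations(3)

error_after_point_1 = ev4_factor * volume_factor

# Compiler speedup on the modern machine (3x to 10x, per Kragen's note).
compiler_low, compiler_high = 3, 10
error_with_compiler = (error_after_point_1 * compiler_low,
                       error_after_point_1 * compiler_high)

# Point 2: at least another 2x because workloads are not purely CPU bound.
memory_factor = 2
error_after_point_2 = tuple(e * memory_factor for e in error_with_compiler)

print(ev4_generations, ev4_factor)
print(error_after_point_1)
print(error_with_compiler)
print(error_after_point_2)
```

Changing `MONTHS_PER_GENERATION` or the year gaps is an easy way to see how sensitive the final 7056x to 23520x range is to those assumptions.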

We've now seen that Alan Kay's claims are bogus, but this doesn't really explain the contempt that's on display in the quotes above. For that, we need a couple more ingredients.

First, Alan Kay is not just wrong, he's wrong in a very boring way that hardware engineers hear all the time. There are a handful of obvious ideas that no one knows how to make work, that have never worked in modern times, and that programmers who don't understand hardware "discover" all the time; you see them all over internet comments, you hear them at meetups, etc. Alan Kay's hardware ideas consist of those ideas. Second, Alan Kay repeatedly insists on calling anyone who doesn't agree with him some kind of "pop culture" bozo, saying that they don't know history, don't understand scaling, etc., and since he's famous, he has a large platform to broadcast such claims. I don't think the second part needs more evidence than the quotes from Alan Kay that I've included in this post, but if you'd like more, you can read the whole interview as well as find other comments Alan has made about people who disagree with his brilliant ideas.

For the first bit, I'm just going to present an analogy since we already have a decent refutation of some of the boring ideas in the previously linked https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.html and http://yosefk.com/blog/the-high-level-cpu-challenge.html and the other ones are no more interesting. And to be clear, this doesn't mean that these repeat ideas can never work or are wrong; sometimes, old "refuted" ideas do come back and really change how things are done. Neural nets are a relatively recent example of this. Until the 2012 AlexNet paper, these were considered sort of useless. In school, I audited an ML class and these were "proved" to be useless because it was proven that perceptrons (single-layer neural nets) can't learn "xor" and therefore clearly cannot learn many things. But it turns out other neural net architectures are useful.

However, when there are ideas that every 19-year-old freshman with no experience and no expertise "invents", if the only reason someone can present for something is the same reason the inexperienced 19-year-old would present, hearing that gets old quickly. It would be the equivalent of hearing in, say, 2010, "we should do machine learning with neural nets because brains are neural nets and brains are good at learning". Perhaps we should do machine learning with neural nets, but an argument by analogy that brains are like neural nets isn't going to get us there and hearing that for the 100th time is really boring.

The hardware ideas that you hear proposed over and over and over again by people who appear to have no understanding of hardware (and in some cases, like Alan Kay, who make comments that clearly indicate that they don't understand hardware) are analogous to someone saying "we should do machine learning with neural nets because brains are neural nets and brains are good at learning" before people figured out how we can do deep learning. Maybe true, and yet useless and banal.

And then when someone goes on and on about that sort of thing and calls everyone with knowledge in the field a pop culture bozo, well, perhaps you can see why people with an understanding of the field hold that person in contempt.

And BTW, this post shouldn't be read as a trashing of the Dorado. The Dorado, like other machines of its generation, was a product of its time, its environment, and its constraints and it contains a lot of cool ideas. However, when taking lessons from the design of past machines, it's important to understand which aspects of its design came about as a result of environmental constraints being different. For example, having access to, by today's standards, zero latency memory that's truly random access, instead of having 450 cycle access time if you try to do truly random access. Or, to pick another example, costing as much as two houses. Alan Kay complains that the dumb 8-bit architectures that people produced killed Lisp and Smalltalk because they were so slow. Well, yes, it turns out that when you downgrade from a computer that costs as much as two houses to a computer that costs as much as two mattresses, you end up with a slower computer. The Dorado's keyboard controller used a 6502, the same chip that would be the main CPU in the Apple II, which was not a cheap computer by today's standards (well over $5k in 2021 dollars). A reason Dell was successful was that it started producing computers that were under half the price Apple wanted for their computers. The Dorado was such an expensive computer that its keyboard controller was the main CPU in a *high end* personal computer.

Saying that people should have designed machines that had comparable performance to a machine that's roughly two orders of magnitude more expensive so that Lisp and Smalltalk would have had acceptable performance is a very "let them eat cake" attitude. Lisp Machines did try to do that, and they created computers that ran Lisp with acceptable performance for "only" $80k in 1984. Much more affordable than the Dorado (for reference, median U.S. home price was $66k in 1984), but still out of reach as a personal computer for home use for anyone who wasn't extraordinarily wealthy. If a language needs a computer that costs more than a house to deliver acceptable performance, I wouldn't blame the death of the language on computer designers who refused to design only computers that cost as much as houses, nor would I propose that mainstream computers be designed analogously decades later.
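To put the prices quoted in this post on one scale, here's a rough sketch that expresses each machine's price in median U.S. houses of the same year. It uses only the nominal dollar figures given above (the Dorado figure is Kragen's estimate, and I'm treating the Alto price as a 1979 price); the names are mine and nothing here is adjusted for inflation.

```python
# Median U.S. house prices quoted in the post, in nominal dollars.
median_house = {1979: 53_000, 1984: 66_000}

# (machine, year, price in nominal dollars), all taken from the post.
machines = [
    ("Xerox Alto", 1979, 32_000),
    ("Xerox Dorado (Kragen's estimate)", 1979, 100_000),
    ("Lisp Machine", 1984, 80_000),
]

# Price of each machine measured in same-year median houses.
price_in_houses = {
    name: price / median_house[year] for name, year, price in machines
}

for name, ratio in price_in_houses.items():
    print(f"{name}: {ratio:.2f} median houses")
```

With these inputs the Alto comes out at about 0.6 houses, the Dorado estimate at about 1.9, and the Lisp Machine at about 1.2, which is consistent with the "1-3 median houses" slop range and the "costs more than a house" framing.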

P.S. If you're going to pick a machine that you say pop culture hardware engineers overlook, the B5000 is a truly bizarre choice (I didn't include this part of Alan's rant in the post, but you can see it in the full interview). It's a well known and well studied architecture among computer architects.