Dan Luu

Some reasons to measure

Added 2021-07-27 00:16:26 +0000 UTC

A question I get asked with some frequency is: why bother measuring X, why not build something instead? More bluntly, in a recent conversation with a newsletter author, his response to some future measurement projects I wanted to do (in the same vein as other projects like [keyboard vs. mouse](https://danluu.com/keyboard-v-mouse/), [keyboard](https://danluu.com/keyboard-latency/), [terminal](https://danluu.com/term-latency/) and [end-to-end](https://danluu.com/input-lag/) latency measurements) was, "so you just want to get to the top of Hacker News?"

The implication for the former is that measuring is less valuable than building and for the latter that measuring isn't valuable at all (perhaps other than for fame), but I don't see measuring as lesser, let alone worthless. If anything, because measurement is, [like writing](https://twitter.com/danluu/status/1082321431109795840), not generally valued, it's much easier to find high ROI measurement projects than high ROI building projects.

Let's start by looking at a few examples of high impact measurement projects. My go-to example for this is Kyle Kingsbury's work with [Jepsen](https://jepsen.io). Before Jepsen, a handful of huge companies (the now $1T+ companies that people are calling "hyperscalers") had decently tested distributed systems. They mostly didn't talk about testing methods in a way that really caused the knowledge to spread to the broader industry. Outside of those companies, most distributed systems were, [by my standards](/testing/), not particularly well tested.

At the time, a common pattern in online discussions of distributed correctness was:

**Person A**: Database X corrupted my data.

**Person B**: It works for me. It's never corrupted my data.

**A**: How do you know? Do you ever check for data corruption?

**B**: What do you mean? I'd know if we had data corruption.

Kyle's early work found critical flaws in nearly everything he tested, despite Jepsen being much less sophisticated then than it is now:

* [Redis Cluster / Redis Sentinel](https://aphyr.com/posts/283-call-me-maybe-redis): "we demonstrate Redis losing 56% of writes during a partition"

* [MongoDB](https://aphyr.com/posts/284-call-me-maybe-mongodb): "In this post, we’ll see MongoDB drop a phenomenal amount of data"

* [Riak](https://aphyr.com/posts/285-call-me-maybe-riak): "we’ll see how last-write-wins in Riak can lead to unbounded data loss"

* [NuoDB](https://aphyr.com/posts/292-call-me-maybe-nuodb): "If you are considering using NuoDB, be advised that the project’s marketing and documentation may exceed its present capabilities"

* [Zookeeper](https://aphyr.com/posts/291-call-me-maybe-zookeeper): the one early Jepsen test of a distributed system that didn't find a catastrophic bug

* [RabbitMQ clustering](https://aphyr.com/posts/315-call-me-maybe-rabbitmq): "RabbitMQ lost ~35% of acknowledged writes ... This is not a theoretical problem. I know of at least two RabbitMQ deployments which have hit this in production."

* [etcd & Consul](https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul): "etcd’s registers are not linearizable . . . 'consistent' reads in Consul return the local state of any node that considers itself a leader, allowing stale reads."

* [ElasticSearch](https://aphyr.com/posts/317-call-me-maybe-elasticsearch): "the health endpoint will lie. It’s happy to report a green cluster during split-brain scenarios . . . 645 out of 1961 writes acknowledged then lost."
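The check that Person B isn't doing in the exchange above can be sketched in a few lines. This is not how Jepsen is implemented (Jepsen is a far more sophisticated tool); it's a minimal illustration, with a hypothetical function name and made-up data, of comparing the writes a client saw acknowledged against what a final read actually returns:

```python
def check_acknowledged_writes(acknowledged, final_read):
    """Compare writes the database acknowledged with what a final read returns.

    Both arguments are iterables of hashable, sortable write identifiers.
    The function name and report layout are hypothetical, for illustration only.
    """
    acknowledged = set(acknowledged)
    final_read = set(final_read)

    # Writes the client was told succeeded but that are gone afterwards.
    lost = acknowledged - final_read
    # Writes that are present but were never acknowledged, e.g. a write whose
    # ack timed out. These are not necessarily bugs.
    unexpected = final_read - acknowledged

    return {
        "acknowledged": len(acknowledged),
        "lost": sorted(lost),
        "unexpected": sorted(unexpected),
        "lost_fraction": len(lost) / len(acknowledged) if acknowledged else 0.0,
    }


# Made-up data: the client saw acks for writes 0..9, but three are missing
# from the final read.
report = check_acknowledged_writes(range(10), [0, 1, 2, 4, 5, 7, 9])
print(report["lost"], report["lost_fraction"])  # [3, 6, 8] 0.3
```

Without a record of which writes were acknowledged, there's nothing to compare the final state against, which is why "I'd know if we had data corruption" isn't much of an answer.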

Many of these problems had existed for quite a while:

> What’s really surprising about this problem is that it’s gone unaddressed for so long. The original issue was reported in July 2012; almost two full years ago. There’s no discussion on the website, nothing in the documentation, and users going through Elasticsearch training have told me these problems weren’t mentioned in their classes.

Kyle then quotes a number of users who ran into issues in production and then dryly notes:

> Some people actually advocate using Elasticsearch as a primary data store; I think this is somewhat less than advisable at present

Although we don't have an A/B test of universes where Kyle exists vs. not and can't say how long it would've taken for distributed systems to get serious about correctness in a universe where Kyle didn't exist, from having spent many years looking at how developers treat correctness bugs, I would bet on distributed systems having rampant correctness problems until someone like Kyle came along. The typical response that I've seen when a catastrophic bug is reported is that the project maintainers will assume that the bug report is incorrect (and you can see many examples of this if you look at responses from the first few years of Kyle's work). When the reporter doesn't have a repro for the bug, which is quite common when it comes to distributed systems, the bug will be written off as non-existent.

When the reporter does have a repro, the next line of defense is to argue that the behavior is fine (you can also see many examples of this from looking at responses to Kyle's work). Once the bug is acknowledged as real, the next defense is to argue that the bug doesn't need to be fixed because it's so uncommon (e.g., "[It can be tempting to stand on an ivory tower and proclaim theory, but what is the real world cost/benefit? Are you building a NASA Shuttle Crawler-transporter to get groceries?](https://news.ycombinator.com/item?id=5913610)"). And then, after it's acknowledged that the bug should be fixed, the final line of defense is to argue that the project takes correctness very seriously and there's really nothing more that could have been done; development and test methodology doesn't need to change because it was just a fluke that the bug occurred, and analogous bugs won't occur in the future even without changes in methodology.

Kyle's work blew through these defenses and, without something like it, my opinion is that we'd still see these as the main defense used against distributed systems bugs (as opposed to test methodologies that can actually produce pretty reliable systems).

That's one particular example, but I find that it's generally true that, in areas where no one is publishing measurements/benchmarks of products, the products are generally terrible. Here are a few examples:

* Keyboards: after I published [this post on keyboard latency](https://danluu.com/keyboard-latency/), at least one major manufacturer that advertises high-speed gaming devices actually started optimizing for latency; most users probably don't care much about keyboard latency, but it would be nice if manufacturers lived up to their claims

* Vehicle headlights: Jennifer Stockburger has noted that, when Consumer Reports started testing headlights, engineers at auto manufacturers thanked CR for giving them the ammunition they needed to force their employers to let them engineer better headlights; previously, they would often lose the argument to designers who wanted nicer looking but less effective headlights

* Vehicle [ABS](https://en.wikipedia.org/wiki/Anti-lock_braking_system): after Consumer Reports and Car and Driver found that the Tesla Model 3 had extremely long braking distances (152 ft. from 60mph and 196 ft. from 70mph), Tesla updated the algorithms used to modulate the brakes

* Vehicle impact safety: other than Volvo, car manufacturers generally design their cars to get the highest possible score on published crash tests; [they'll add safety as necessary to score well on new tests when they're published, but not before](https://danluu.com/car-safety/)

This post has made some justifications for why it's not unreasonable to measure things. But, to be honest, for non-work projects, I don't really need an extrinsic reason. I just want to know the answer to a question.

[this is another draft of a blog post I might publish sometime; this is basically unedited at the moment]