bcachefs

Notes on Phoronix benchmarks

Added 2019-06-26 19:04:54 +0000 UTC

Phoronix posted some bcachefs benchmarks: https://www.phoronix.com/scan.php?page=article&item=bcachefs-linux-2019

The results are actually pretty encouraging, even if they might not look it on the surface: they're about what you'd expect at this point. In a large enough codebase, even if 95% of it is thoroughly optimized, a couple of fast paths with performance issues are enough that any benchmark hitting them will show the effects of the areas that still need work.

There are two main things that still need to be addressed:

Bcachefs stores checksums at the granularity of entire extents (up to 128k by default), not per 4k block. The reason we do it this way is that it makes our metadata significantly smaller, and the vast majority of applications do buffered IO where we're reading large chunks into the page cache - it's easy to just ensure we're always reading entire extents into the page cache. So it's a good tradeoff, most of the time. The cost shows up for databases and other applications that do small random reads with direct IO, or small random reads where the file is too big to be cached: the checksum covers the whole extent, so even a 4k read has to read and verify the entire extent. For those, we need to add a way to specify a checksum granularity for a file.
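To put a rough number on that tradeoff, here is a minimal sketch of the read amplification for a small read. It is a back-of-the-envelope model, not bcachefs code: the function names are made up for illustration, and it assumes the request is aligned and the data sits in full-size checksum units.

```python
# Back-of-the-envelope model only; names are hypothetical, not bcachefs APIs.
EXTENT_SIZE = 128 * 1024  # default maximum extent size mentioned above
BLOCK_SIZE = 4 * 1024     # typical block size


def bytes_read_for_request(request_bytes, checksum_granularity):
    """Bytes that must be read and verified to serve an aligned request.

    A checksum can only be verified over the whole unit it covers, so the
    request is rounded up to a whole number of checksum units.
    """
    units = -(-request_bytes // checksum_granularity)  # ceiling division
    return units * checksum_granularity


def read_amplification(request_bytes, checksum_granularity):
    """Ratio of bytes actually read to bytes the application asked for."""
    return bytes_read_for_request(request_bytes, checksum_granularity) / request_bytes


if __name__ == "__main__":
    for granularity in (EXTENT_SIZE, BLOCK_SIZE):
        amp = read_amplification(BLOCK_SIZE, granularity)
        print(f"4k read, {granularity // 1024}k checksum granularity: {amp:.0f}x")
```

Under these assumptions a 4k random read against a full 128k extent reads 32 times more data than it needs, which is the kind of workload a per-file checksum granularity would help.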

The other main thing to be addressed is that multithreaded workloads can be affected by lock contention on the inode and alloc btrees. Because btree nodes are big and inodes are small in bcachefs, many inodes share a single node, so it's much more common for unrelated threads to be touching the same btree nodes when doing inode updates.

To address this I already have code that essentially puts a write cache in front of the btree for inode updates: when we update the inode we only journal the update, we don't update the btree, and the btree is only updated when triggered by journal reclaim. The code works and is stable now, but it's off by default because journal reclaim still needs tuning. It's tricky to make sure journal reclaim doesn't run too much, so that we make good use of space in the journal, while still always running just in time to keep the journal from filling up. Running out of space in the journal and having to wait on reclaim is very bad for performance.
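The idea can be sketched with a toy model. This is not the bcachefs implementation: the class, its fields, and the two watermarks are hypothetical, journal space is counted in entries rather than bytes, and reclaim runs synchronously here instead of in the background.

```python
from collections import deque


class JournalledInodeCache:
    """Toy write cache in front of a btree (hypothetical, for illustration)."""

    def __init__(self, reclaim_at, reclaim_to):
        self.btree = {}          # stand-in for the inode btree
        self.pending = {}        # inode number -> latest journalled value
        self.journal = deque()   # inode numbers of journalled updates, oldest first
        self.reclaim_at = reclaim_at  # start reclaim above this many entries
        self.reclaim_to = reclaim_to  # reclaim until this many entries remain
        self.btree_writes = 0

    def update(self, inum, inode):
        # Only journal the update and remember it; the btree is not touched.
        self.journal.append(inum)
        self.pending[inum] = inode
        if len(self.journal) > self.reclaim_at:
            self.reclaim(self.reclaim_to)

    def lookup(self, inum):
        # Pending updates are newer than the btree, so check them first.
        if inum in self.pending:
            return self.pending[inum]
        return self.btree.get(inum)

    def reclaim(self, target_entries):
        # Free the oldest journal entries. An entry can be dropped once the
        # latest value for its inode has been written to the btree.
        while len(self.journal) > target_entries:
            inum = self.journal.popleft()
            if inum in self.pending:
                self.btree[inum] = self.pending.pop(inum)
                self.btree_writes += 1


if __name__ == "__main__":
    cache = JournalledInodeCache(reclaim_at=8, reclaim_to=4)
    for size in range(100):
        cache.update(1, {"size": size})
    print("updates: 100, btree writes:", cache.btree_writes)
```

Repeated updates to the same inode collapse into far fewer btree writes, which is where the reduction in btree lock contention would come from. The gap between the two watermarks is the tuning problem in miniature: reclaiming too eagerly throws away that batching, while reclaiming too late risks the journal filling up, a stall this sketch does not model.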

But I have benchmarked it, and on multithreaded, inode-update-heavy workloads it made bcachefs about as fast as xfs.

In the future I need to generalize that code so it can be used for the alloc btree - I wrote it prior to finishing fully persistent transactional alloc info, so now that we're doing a lot more updates to the alloc btree (it's being updated on every extent btree update now) the alloc btree will need the same treatment.

Right now though I'm finishing off reflink - reflink is _almost_ finished! So I'm excited for that.