Status update
Added 2021-12-28 20:58:14 +0000 UTC
Bacon and eggs, tea, sitting down to work on finishing the BTREE_ITER_WITH_JOURNAL patch, but before I can get to the interesting and necessary algorithmic work I've got like half a dozen bug reports and problems to respond to. And people wonder why I haven't upstreamed yet...
BTREE_ITER_WITH_JOURNAL is something I'm pretty excited about: it teaches the btree iterator code how to overlay the keys from journal replay over the btree, which means we'll be able to use all of the standard btree interface prior to journal replay finishing.
Previously, walking the btree prior to journal replay had to be done with a separate, not nearly as nice callback-based interface - bch2_btree_and_journal_walk(), in recovery.c. So most of that code will get deleted, and some code that had to be duplicated for that interface and the normal btree API will get cleaned up. But the real reason for doing this work is that the allocator has to start running before journal replay finishes (because journal replay requires allocating new btree nodes), and the allocator has to update the btree to run (because allocation information is stored in a btree, naturally).
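A rough way to picture the overlay idea is a merge of two sorted key streams in which the journal's version of a key wins. The sketch below is a toy Python model, not bcachefs code: `overlay_iter`, the integer keys, and the use of `None` to stand for a deletion recorded in the journal are all assumptions made for illustration.

```python
from typing import Iterator, List, Optional, Tuple

# (key, value); a value of None models a deletion recorded in the journal.
Entry = Tuple[int, Optional[str]]


def overlay_iter(btree_keys: List[Entry],
                 journal_keys: List[Entry]) -> Iterator[Tuple[int, str]]:
    """Yield keys in sorted order, with journal entries overriding btree entries.

    Both inputs must be sorted by key, with no duplicate keys within a list.
    """
    i = j = 0
    while i < len(btree_keys) or j < len(journal_keys):
        take_journal = j < len(journal_keys) and (
            i >= len(btree_keys) or journal_keys[j][0] <= btree_keys[i][0]
        )
        if take_journal:
            key, val = journal_keys[j]
            # The journal is newer: skip the btree's version of the same key.
            if i < len(btree_keys) and btree_keys[i][0] == key:
                i += 1
            j += 1
        else:
            key, val = btree_keys[i]
            i += 1
        if val is not None:  # a journal deletion hides the key entirely
            yield key, val


btree = [(1, "a"), (3, "b"), (5, "c")]
journal = [(2, "new"), (3, "B"), (5, None)]
merged = list(overlay_iter(btree, journal))
```

The caller sees a single sorted stream and never needs to know which source a key came from, which is the property that lets ordinary iterator-based code run before replay has finished.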
Today we pull this off with a combination of mirroring allocation information in memory, in the bucket array (long long ago, this was the only place allocation information was kept!), and also with the help of the btree key cache, which is a writeback cache over the btree set up as a hash table. Before the btree key cache, lock contention on the alloc btree was a major performance problem on workloads with heavy multithreaded writing: in the btree, a single btree node lock covers a large range of keys, so different threads that were writing to different buckets would end up contending on the same lock, but in the key cache we have one item, and one lock, per key.
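The difference in lock granularity can be shown with a small Python model. This is only a sketch under assumptions: `NodeLocked`, `KeyCache`, and `KEYS_PER_NODE` are hypothetical names, and the real key cache does much more (caching values, writeback) than hand out locks.

```python
import threading
from collections import defaultdict

KEYS_PER_NODE = 1000  # hypothetical: how many bucket keys one btree node covers


class NodeLocked:
    """One lock per btree node: all keys in a node's range share a lock."""

    def __init__(self):
        self.locks = defaultdict(threading.Lock)

    def lock_for(self, bucket: int):
        return self.locks[bucket // KEYS_PER_NODE]


class KeyCache:
    """One item, and one lock, per key, held in a hash table."""

    def __init__(self):
        self.locks = {}

    def lock_for(self, bucket: int):
        # setdefault returns the existing lock if this key is already cached.
        return self.locks.setdefault(bucket, threading.Lock())


node_locked = NodeLocked()
key_cache = KeyCache()

# Two writers touching different buckets that live in the same node:
same_node_lock = node_locked.lock_for(10) is node_locked.lock_for(20)
same_key_lock = key_cache.lock_for(10) is key_cache.lock_for(20)
```

With node-level locking the two writers serialize on one lock even though they touch unrelated buckets; with per-key locking they only collide when they update the same bucket.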
But, the in-memory bucket array has become the limiting factor if we want to scale to filesystems in the petabytes, so it needs to go away, and other scalability work in the pipeline also means the allocator thread really needs to be able to use the standard full btree interface.
What other scalability work, you ask? Getting rid of the in memory bucket array is just the beginning (and will probably be the last item actually finished...)
- There are a number of places where we scan the full list of buckets at runtime - these all need to go away. The allocator thread in particular needs to do this periodically to build up a heap of eligible buckets for LRU replacement. This becomes a real issue when the filesystem is close to full, where it becomes O(n^2) - some people have noticed issues with the allocator thread(s) spinning, and this is what's getting fixed here. I'm adding both an LRU btree and an extent-style freespace btree for the allocator thread.
- Also, discard support is going to get revamped. Right now, buckets are discarded shortly prior to being reused - this part of the design dates from when bcache was just a cache, and in normal operation all disk space would be used as cache. But now we're a filesystem, so we want to be discarding buckets right after they become empty. So I'll be adding a btree for newly empty buckets ready to be discarded, after the journal commits the updates that emptied them.
- Even better: it's time to (finally!) add btrees for backpointers - which will mean copygc will (finally!) no longer have to scan the entire extents and reflink btrees, and also a rebalance-work btree - which will mean rebalance won't have to scan, either.
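The common thread in the items above is replacing a full scan with an index keyed by the thing you want to look up. As a minimal sketch of the backpointer case, here is a toy Python model; `ExtentIndex` and its methods are hypothetical, and it assumes each extent lives in exactly one bucket, which is a simplification.

```python
from collections import defaultdict


class ExtentIndex:
    """Toy model: a forward map of extents plus a reverse (backpointer) map."""

    def __init__(self):
        self.extents = {}                     # (inode, offset) -> bucket
        self.backpointers = defaultdict(set)  # bucket -> {(inode, offset)}

    def insert(self, inode: int, offset: int, bucket: int) -> None:
        pos = (inode, offset)
        old = self.extents.get(pos)
        if old is not None:
            # Keep the reverse map in sync when an extent moves.
            self.backpointers[old].discard(pos)
        self.extents[pos] = bucket
        self.backpointers[bucket].add(pos)

    def extents_in_bucket_scan(self, bucket: int):
        """Without backpointers: walk every extent to find the matches."""
        return sorted(p for p, b in self.extents.items() if b == bucket)

    def extents_in_bucket(self, bucket: int):
        """With backpointers: a single lookup keyed by bucket."""
        return sorted(self.backpointers.get(bucket, ()))


idx = ExtentIndex()
idx.insert(1, 0, bucket=7)
idx.insert(1, 8, bucket=9)
idx.insert(2, 0, bucket=7)
```

Both lookups return the same answer, but the scan costs time proportional to the total number of extents while the backpointer lookup costs time proportional to what is actually in the bucket. The price is that every update has to maintain the second index.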
Those are going to fix the #1 user complaints I've seen. It's a lot of work to be done... once I get through the bug reports... but it's exciting stuff and it'll be well worth the wait.
Do keep the bug reports coming, though! They're always appreciated. Also, as a reminder - any time fsck finds an inconsistency, even if it's able to correct it, that inconsistency was the result of a bug and I want to know about it.
Thanks!
Comments
I hope so too.
Demi Obenour
2022-03-03 05:18:42 +0000 UTC
Nice.. Let's hope 2022 is when it is good enough to get mainlined and you can get real world testing.
veritanuda
2021-12-28 21:18:11 +0000 UTC