bcachefs



Status update

 - Interior btree node updates are now journalled, removing the need for btree writes to be FUA

 - Interior btree node updates are now fully transactional; we no longer have to do any metadata scanning after unclean shutdown

 - Btree key cache code has been merged

 - Major rework of journal replay finally finished

 - Lots of bug fixing

So, some background:

Historically, the btree and the journal in bcache/bcachefs have been fairly separate entities; the btree has always been internally consistent on disk without anything from the journal, and the journal just contained updates to leaf nodes, and journal replay just meant redoing all those updates, in the same order as they occur in the journal.

The downside of this was that any time we updated an interior btree node (because we split or compacted a leaf node), we'd have to write out the update to the interior node right away - and it meant we had to use FUA (force unit access, i.e. the write bypasses the drive's write cache) for all btree node writes.

That's a disadvantage because consumer drives tend to either not support FUA (meaning it has to be emulated by the block layer with cache flushes), or they internally flush the whole cache when they receive a FUA write - or worse, have buggy FUA support. It turns out other filesystems have also been bitten by drives with buggy FUA support, and some of the bug reports I'd been seeing seemed to indicate that it was happening to us too. So several months ago I finally got around to a long contemplated project - journalling updates to interior btree nodes, not just leaf nodes.

The changes to the interior btree update code went pretty smoothly, as did tweaking journal replay to replay updates to interior nodes first - but at the time, I missed the full implications of having to start the allocator threads before journal replay had made the btree consistent again. Oops.
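To make the replay ordering concrete, here is a minimal toy sketch in Python, not bcachefs's actual code. The `JournalEntry` type, its `level` field (0 for a leaf key, greater than 0 for an interior node) and the `replay_order` helper are all hypothetical names made up for illustration. It also assumes that interior updates are replayed in journal order among themselves, which the post does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalEntry:
    seq: int     # journal sequence number
    level: int   # assumed encoding: 0 = leaf key, >0 = interior node update
    key: str
    value: str


def replay_order(entries):
    """Return entries with interior-node updates first, then leaf updates.

    Within each group the original journal order (by seq) is preserved.
    """
    interior = sorted((e for e in entries if e.level > 0), key=lambda e: e.seq)
    leaves = sorted((e for e in entries if e.level == 0), key=lambda e: e.seq)
    return interior + leaves


journal = [
    JournalEntry(1, 0, "inode:42", "size=4096"),
    JournalEntry(2, 1, "node:A", "split into A and B"),
    JournalEntry(3, 0, "inode:42", "size=8192"),
    JournalEntry(4, 1, "node:B", "new child pointer"),
]

ordered = replay_order(journal)
print([e.seq for e in ordered])  # [2, 4, 1, 3]
```

The point of the two-pass order is that the btree structure is put back together before any leaf keys are reinserted into it.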

So that took a while to sort out - hence the long delay in updates; recovery from unclean shutdown was somewhat broken for quite a while. But, at long last, it's finished: the last major piece required was merging in the btree key cache code, which I'd been working on for quite a while but hadn't quite finished.

The btree key cache code acts as a write cache for the btree, for keys that are going to be updated frequently in a short span of time (e.g. inodes and keys in the alloc btree). Normally when we do a btree update, we update the journal and the btree at the same time - but there's no real requirement in e.g. the on-disk format that we update the btree at the same time; the btree just has to be updated before releasing the pin on the relevant journal entry. Deferring the btree update lets us skip the relatively expensive btree traversal, and helps with lock contention since a single btree leaf node can hold many keys.
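As a rough illustration of the idea, here is a toy write cache in Python. It is only a sketch under stated assumptions: the `ToyKeyCache` class and its methods are hypothetical, the "btree" is a plain dict, and a journal pin is modelled as the sequence number of the oldest journal entry whose update has not yet reached the btree.

```python
class ToyKeyCache:
    """Toy model: updates hit the journal and the cache, not the btree."""

    def __init__(self):
        self.btree = {}    # stands in for the real btree
        self.journal = []  # list of (seq, key, value)
        self.dirty = {}    # key -> (latest value, oldest pinning journal seq)

    def update(self, key, value):
        seq = len(self.journal) + 1
        self.journal.append((seq, key, value))
        # Keep the oldest pin: that journal entry cannot be released
        # until this key has been written back to the btree.
        pin = self.dirty[key][1] if key in self.dirty else seq
        self.dirty[key] = (value, pin)
        return seq

    def lookup(self, key):
        # The cache holds the newest value, so it is checked first.
        if key in self.dirty:
            return self.dirty[key][0]
        return self.btree.get(key)

    def reclaim(self, up_to_seq):
        """Flush cached keys pinning journal entries <= up_to_seq."""
        flushed = 0
        for key, (value, pin) in list(self.dirty.items()):
            if pin <= up_to_seq:
                self.btree[key] = value
                del self.dirty[key]
                flushed += 1
        return flushed


cache = ToyKeyCache()
for size in (4096, 8192, 12288):
    cache.update("inode:42", size)   # three journal entries, no btree write
cache.update("alloc:7", "dirty")

flushed = cache.reclaim(up_to_seq=3)  # only inode:42 pins entries 1..3
print(flushed, cache.btree)
```

In this model three updates to the same inode cost one btree write, and the write happens only when the journal space is wanted back.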

This helps us with journal replay because it means the allocator threads can do their thing without actually updating the alloc btree - they just update keys in the btree write cache, which will be flushed back to the btree by journal reclaim; all we have to do now is not start journal reclaim until we've finished replaying all the updates to interior btree nodes.
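The sequencing described above can be sketched as a small gate on reclaim. Again this is a toy model in Python with hypothetical names (`ToyRecovery`, `allocator_update`, `journal_reclaim`, `finish_interior_replay`), not the real recovery path: it only shows the allocator writing into the key cache while reclaim refuses to touch the btree until interior node replay is done.

```python
class ToyRecovery:
    """Toy model of recovery: reclaim is gated on interior-node replay."""

    def __init__(self):
        self.interior_replay_done = False
        self.key_cache = {}    # bucket -> state, updated by the allocator
        self.alloc_btree = {}  # stands in for the alloc btree

    def allocator_update(self, bucket, state):
        # The allocator only touches the key cache, never the btree.
        self.key_cache[bucket] = state

    def journal_reclaim(self):
        # The btree is not consistent yet: leave the keys in the cache.
        if not self.interior_replay_done:
            return 0
        flushed = len(self.key_cache)
        self.alloc_btree.update(self.key_cache)
        self.key_cache.clear()
        return flushed

    def finish_interior_replay(self):
        self.interior_replay_done = True


r = ToyRecovery()
r.allocator_update(7, "allocated")     # allocator runs early in recovery
early = r.journal_reclaim()            # gated: nothing reaches the btree
r.finish_interior_replay()
late = r.journal_reclaim()             # now the cached key is flushed
print(early, late, r.alloc_btree)
```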

So, with all that done, there should be some performance improvements due to both not doing FUA btree node writes anymore, and also having the btree key cache enabled for the alloc btree. It's not enabled for the inodes btree yet - that patch still needs a bit more work.

Next up: I think I'm going to see what I can get done with erasure coding.

And keep the bug reports coming!

Comments

Correct. No new upstreaming news yet.

Kent Overstreet

So yeah we still do need _some_ amount of write ordering, but now it's only journal writes that are flush+fua, and those are generally much less frequent than btree node writes

Kent Overstreet

This sounds great! I do wonder: while FUA is no longer needed, I can imagine the order of disk writes still matters? So if you have a bunch of changes and the disk controller decides to mess with the ordering, can this also lead to an inconsistent state after power loss?

I've been holding off on building a new file server waiting for bcachefs to get mainlined.

SSailor67

Erasure coding would mean RAID-5, RAID-6, etc equivalent, right? That would be awesome! Any news on upstreaming lately?

Gordon Dexter

> phoronix numbers don't seem to align at all with how bcachefs feels in actual use

Which kind of benchmarks would give a more representative picture?

tuxayo

Erasure coding would be amazing! Thanks for the update

John Smith

Maybe small block DIO is relevant to database workloads, not just benchmarking? What do you think?

Thanks Kent!

His Dad

Firstly - the phoronix numbers don't seem to align at all with how bcachefs feels in actual use; I and plenty of other users find common operations tend to run quite a bit faster on bcachefs than other filesystems. Which is not to say the phoronix numbers are wrong - but they don't seem to be very representative in a way that hits bcachefs more than other filesystems. That said - this does partly address where we're known to be slower than other filesystems. The other big one is small block DIO random read performance - we store checksums at larger than block granularity, and that hurts DIO random read performance. That one still hasn't been addressed, but it's not something that comes up much outside benchmarking.

Kent Overstreet

WRT performance as seen on Phoronix a few months back, is your git kernel more performance oriented with the recent changes??

His Dad

Thanks Kent.

His Dad

