Status update - fast mount times, reflink
Added 2018-11-30 19:04:29 +0000 UTC
So for now, I'm leaving off the remaining parts of erasure coding - the important part was getting everything done that impacts both the on disk format, and the rest of the design. There's some commonality between erasure coding and some of the other upcoming features, so getting erasure coding mostly done now was very useful because it was a good angle for working on that common functionality.
What I really want to be working on next is reflink, but it turns out in order to do reflink I pretty much have to do fully persistent allocation information first - so that's what I'm working on right now.
One upside of that is that fast mount times will be coming sooner than I expected - yay!
The reason persistent allocation information needs to be done first is that for reflink, the allocation information that needs to be updated whenever we update the extents btree will become refcounts on entries in the new reflink btree. Currently, extents can only point to buckets or stripes, and those are big (2M, typically) and we're only counting sectors used within a bucket or stripe - so it's not a big deal to keep all that pinned in memory. But that will no longer be practical with reflink, which means we need to be keeping the refcounts in the reflink btree, and updating the reflink btree whenever we update an extent that points to the reflink btree.
And the infrastructure for doing that is exactly what we need for maintaining consistent persistent allocation information for everything else, so it'll be easier to just implement that first for the alloc and stripes btrees, and then do reflink afterwards, instead of implementing and debugging it all at once for reflink.
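To make the refcounting idea concrete, here is a minimal toy model in Python. It is only a sketch, not bcachefs code: the class `ToyReflinkFS`, its method names, and the dict-based "btrees" are all hypothetical, and it assumes every extent is an indirect pointer into the reflink table, which is a simplification. The point it illustrates is that any update to the extents table has to adjust a refcount in the reflink table as part of the same operation.

```python
class ToyReflinkFS:
    """Toy model: dicts stand in for the extents and reflink btrees."""

    def __init__(self):
        self.extents = {}   # (inode, offset) -> reflink key
        self.reflink = {}   # reflink key -> [refcount, sectors]
        self.next_key = 0

    def delete(self, inode, offset):
        """Remove an extent and drop its reference on the reflink entry."""
        key = self.extents.pop((inode, offset), None)
        if key is None:
            return
        entry = self.reflink[key]
        entry[0] -= 1
        if entry[0] == 0:
            # Last reference gone: the space can be freed.
            del self.reflink[key]

    def write(self, inode, offset, sectors):
        """Write new data: a fresh reflink entry with refcount 1."""
        self.delete(inode, offset)
        key = self.next_key
        self.next_key += 1
        self.reflink[key] = [1, sectors]
        self.extents[(inode, offset)] = key

    def clone(self, src_inode, dst_inode, offset):
        """Share src's data with dst without copying it."""
        key = self.extents[(src_inode, offset)]
        self.reflink[key][0] += 1        # take the new reference first
        self.delete(dst_inode, offset)   # then release what dst pointed to
        self.extents[(dst_inode, offset)] = key

    def sectors_used(self):
        """Space in use is counted once per reflink entry, not per extent."""
        return sum(sectors for _, sectors in self.reflink.values())
```

In this toy, cloning an 8-sector extent leaves `sectors_used()` unchanged, and the shared data is only released once both files have dropped their reference to it.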
There's some other nifty improvements I'm making along the way, most of which aren't super interesting to end users but are interesting from a design point of view. One change I made recently was to mark and sweep gc, i.e. the thing that walks all metadata and figures out what disk space is currently allocated. One cool thing bcachefs can do is run that at runtime, while the filesystem is in use - we haven't _needed_ to run that at runtime in a very long time, it's effectively just been a relic from bcache. But, as prep work for persistent allocation information, I changed mark and sweep gc to verify the existing allocation information when it finishes, instead of just clearing it when it starts and recomputing everything.
What that means is that a good chunk of online fsck - the most important part - is done and working :)
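The difference between "clear and recompute" and "recompute and verify" can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the real gc code: the function names `mark` and `gc_verify` are hypothetical, extents are modelled as plain `(bucket, sectors)` pairs, and the stored allocation information is a dict of sectors used per bucket.

```python
from collections import Counter


def mark(extents):
    """Walk every extent and count the sectors used in each bucket."""
    counts = Counter()
    for bucket, sectors in extents:
        counts[bucket] += sectors
    return counts


def gc_verify(extents, stored):
    """Recompute usage from the extents and compare it with what is stored.

    Returns {bucket: (stored_sectors, recomputed_sectors)} for every bucket
    where the two disagree. An empty dict means the stored information
    was consistent.
    """
    recomputed = mark(extents)
    mismatches = {}
    for bucket in sorted(set(stored) | set(recomputed)):
        have = stored.get(bucket, 0)
        want = recomputed.get(bucket, 0)
        if have != want:
            mismatches[bucket] = (have, want)
    return mismatches
```

Because the stored counts are compared rather than overwritten, any mismatch that turns up is evidence of an inconsistency, which is what makes the pass usable as a consistency check.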
I'm also making some changes to the filesystem-level disk space accounting - before I write the code to start persisting that stuff, I'm going to change it to be more fine grained. One of the things that will be nice about this is that it will make most of the code in replicas.c obsolete.
What the code in replicas.c does is track which combinations of disks have replicated data, or stripes - we need this to know when it's safe to mount in degraded mode, and it's also how we know when we can remove a disk. But, the entries it maintains aren't refcounted, which means we need to garbage collect them in order to remove them - and since this doesn't happen automatically in every situation, it's possible that we'd refuse a degraded mount that was actually safe. By making the filesystem-level disk space accounting fine grained enough, it can replace the tracking replicas.c does, and it'll always be up to date. So that'll be cool - it's always nice when the changes you need to make to your code end up making the overall design simpler and cleaner.
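As a rough sketch of why fine-grained accounting can stand in for a separate replicas table, consider the toy Python model below. Everything in it is an assumption made for illustration: the class `ReplicasAccounting` and its methods are hypothetical, devices are plain integers, stripes are ignored, and "safe to remove" is simplified to "no data is accounted to that device". Sector counts are kept per combination of devices, so an entry disappears on its own once its count reaches zero, with no separate garbage collection pass.

```python
from collections import defaultdict


class ReplicasAccounting:
    """Toy model: sectors accounted per set of devices holding the copies."""

    def __init__(self):
        # frozenset of device ids -> sectors stored with that layout
        self.sectors = defaultdict(int)

    def add(self, devices, sectors):
        """Account a write (positive sectors) or a delete (negative)."""
        key = frozenset(devices)
        self.sectors[key] += sectors
        if self.sectors[key] < 0:
            raise ValueError("accounting went negative")
        if self.sectors[key] == 0:
            # Nothing left with this layout: the entry vanishes by itself.
            del self.sectors[key]

    def can_mount_degraded(self, missing):
        """Safe if every layout still has at least one device present."""
        missing = set(missing)
        return all(key - missing for key in self.sectors)

    def can_remove(self, device):
        """Simplified rule: removable once no data is accounted to it."""
        return all(device not in key for key in self.sectors)
```

With data replicated on devices 0 and 1 and an unreplicated extent on device 2, a mount without device 2 is refused; once that extent is deleted the entry for `{2}` is gone and the same mount is allowed, without anything needing to be garbage collected first.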
Comments
If I may make a suggestion, I think it would be helpful to upstream the core kernel changes beforehand. That way new users would be able to use bcachefs from a dkms build rather than requiring them to recompile their whole kernel.
2018-12-03 05:13:18 +0000 UTC
Yeah, it's not atomic. Not sure - I really dislike the idea of doing that approach in the kernel, but there might be tricks to make that approach better. It'll be something to think about when it comes time to start thinking about snapshots again.
Kent Overstreet
2018-12-03 03:29:10 +0000 UTC
cp -a --reflink isn't atomic though, right? Would it be much effort to have a "poor man's snapshot" in the kernel that simply holds a lock on the filesystem until it completes? (btrfs's snapshot performance feels like that's basically what it does anyway)
2018-12-02 10:43:36 +0000 UTC
My current plan is to start working on upstreaming again after reflink is done, which ought to be within the next 3-6 months.
Kent Overstreet
2018-12-02 00:36:11 +0000 UTC
How is upstreaming looking?
Jonatan R
2018-12-02 00:21:14 +0000 UTC
No, full snapshots will be a lot more work. But I think reflink will perform well enough that at least some people will be able to use it instead of snapshots - e.g. by just doing cp -a --reflink. I'm actually pretty excited to get started on reflink, I think the design is going to turn out really nicely.
Kent Overstreet
2018-11-30 21:13:29 +0000 UTC