bcachefs

Filesystem metadata operations are now all fully atomic

Added 2018-07-17 14:18:18 +0000 UTC

In the last post, I wrote about some new transaction infrastructure I was working on that would make it practical to make all the high level filesystem operations (e.g. create, link, unlink) fully atomic - that work is now finished and merged in.

The main benefit from this work is that now, on unclean shutdown, we don't have to walk the filesystem heirarchy (i.e. all the dirents and inodes) to recalculate every inode's link count. Also, there were a few other operations, unrelated to i_nlink and mount time performance, that weren't fully atomic and really should have been - for example, setting an ACL on a file involves updating a key in the xattrs btree, but it may also require updating the inode's regular permissions. Previously, the acl update and the inode update weren't atomic, but with the new transaction infrastructure fixing that was pretty trivial.

So we're now pretty much done as far as filesystem atomicity/consistency guarantees - the only operation I know of that isn't fully atomic or transactional, and that we would like to be fully atomic, is fallocate in collapse mode (insert range isn't implemented yet). Making that transactional is going to take yet another type of new transaction machinery for long running logged operations (and that machinery is going to be quite a bit simpler and more straightforward that what I just did).

This is a pretty big milestone! To get fast mount times, there were three big work items - now, there's just two left.

The two things that are left are:

The journal sequence number blacklist machinery needs some changes. This is the thing that guarantees ordering of btree updates is preserved after a crash, even if a btree write happens with keys newer than the newest journal entry that was successfully written. Right now, it requires that we read every btree node (at least after an unclean shutdown) before we write the first journal entry, so that we know whether there are btree node entries that must be blacklisted - if there are, we have to skip a few journal entry sequence numbers and emit a new blacklist entry. Also, that blacklist entry has to be kept in the journal until no btree nodes refer to that journal sequence number - which means we have to read every btree node before the journal reclaim machinery can drop those blacklist entries from the journal.

Fortunately, there's a pretty straightforward solution for this one. The plan is that when recovering from an unclean shutdown, we'll just unconditionally blacklist the next few journal sequence numbers, and then we'll be a bit lazier about reclaiming them from the journal - if journal reclaim needs to flush a blacklist entry from the journal, and we haven't yet scanned every btree node this mount, we'll just rejournal that blacklist entry (i.e. write it again in the next journal entry). Rejournalling journal entries is going to require a new reservation mechanism so we don't deadlock (we don't want the journal to be entirely filled up with things that can't be evicted, just written again) - but several other things I'm working on are going to need that functionality too.

The other work item is persisting all allocation information so that we don't have to scan the extents btree and regenerate it on mount. This is the hard one.

By allocation information, we primarily mean the per-bucket counts of how much live dirty and cached data is in each bucket - this is the equivalent of block allocation bitmaps in other filesystems.

We already have an alloc btree with one key for each bucket, but right now it's mostly just used for storing each bucket's current generation number. So, what we need to do is add fields for the sector counts to the alloc keys, and then update those keys atomically with the other btree updates we do that change those sector counts (i.e. changes to the extents btree).

It would be possible to just use the existing mechanisms for atomic btree transactions for this - i.e. just update whichever key(s) in the alloc btree are changing whenever we're updating the extents btree. That would be pretty straightforward - the sector counts we need to track are defined to be strictly consistent with the current contents of the btree and nothing else, which means the current semantics of disk space accounting are already exactly what we want and don't have to change. Solving it this way would literally mean just using the existing numbers we're already tracking, and just updating them in the alloc btree whenever we update them in memory.

Unfortunately, doing it that way would mean significant lock contention on multithreaded write workloads. The problem is that every write that updates the extents btree would now also be locking leaf nodes in the alloc btree, and unrelated writes from different threads will very often be hitting the same nodes in the alloc btree . The lock contention would be a lot worse than in the extents btree because the alloc btree is a lot smaller (and thus all the keys fit into a much smaller number of nodes), and also because unrelated writes will generally touch locations in the extents btree that are far apart from each other (since they'll be touching different files, and the extents btree is indexed by inode number first), but writes that happened at about the same point in time will touch about the same positions in the alloc btree (since the allocator hands out buckets in roughly sequential order).

This is already a problem with the inodes btree on some workloads. Append writes (and writes to unallocated space) always have to update the inode in order to update i_size and/or i_blocks, and the inodes btree is much smaller than the extents btree so lock contention is more likely - the last time I was doing detailed benchmarking of bcachefs vs. other filesystems, multithreaded append was the only benchmark where bcachefs was behind.

So, given that the lock contention would be even worse on the alloc btree, that solution isn't good enough.

Another possible solution would be adding a mechanism for doing an update where we add the new key to the journal, as we normally do, but defer the btree update. Currently, the journal is an exact log of btree updates - journalling updates is done by the btree update path, and we always journal precisely the same updates as we do to the btree, so they're always exactly in sync.

Doing it this way has a number of advantages - not least is that it makes things a whole lot easier to reason about. It makes journal replay dead simple - we just insert every key we find in the journal into the appropriate btree, in the same order as the keys were written to the journal. And, it helps to make the journal reclaim machinery efficient: every time we write something to the journal, we have to add a pin to that journal entry and release that pin when the entry we journalled is no longer needed (i.e. because the btree node with that new key has been written). That pin is what ensures that journal entry will still be kept around and replayed if we crash, and it also has a callback so that when the journal reclaim machinery needs to flush stuff from the journal to free up space for new journal entries, it can kick whatever is holding the pin (i.e. tell the btree code it needs to write that node). By journalling updates only when we update the btree, we can track these pins on a per btree node basis, instead of per key. More importantly, it means that when journal reclaim needs to flush things from the journal, it's a pretty simple operation - just a single btree node write, which can be done with just a read lock. It's somewhat important that flushing things from the journal be simple, not complicated operations that might have high latency - since if the journal is close to full getting a new journal reservation will block on journal reclaim, so high latencies there could translate into high latencies for the entire system.

So, there's a lot of good reasons for the way btree updates and journal updates are tied together now, but it's not strictly necessary - we definitely can add a mechanism for doing btree updates where the journal update happens immediately but the btree update is deferred. If we used that for updating keys in the alloc btree that would definitely solve the lock contention issues - the only overhead this would add to the fastpath we care about (updating the extents btree after a write) would be taking bigger journal reservations and writing all the alloc btree updates to the journal.

Unfortunately, there's a problem with that approach...

When we journal the updated sector counts for a given bucket, we need to be journalling the correct sector counts for that point in time _as defined by where the update appears in the journal_, which is defined by the order in which threads got their journal reservations. In order to ensure things end up in the journal in the correct order (i.e. in the same order as the btree updates happened), we have to get the journal reservation _after_ we have all the btree nodes we'll be updating locked.

But there's no lock for updating bucket sector counts in memory, that path is entirely lockless too. That means that two different threads can be updating the sector counts for the same bucket at the same time - i.e. they're updating different extents in different btree nodes that happen to point to the same bucket, and there's no guarantee that the updates of the bucket sector counts in memory will happen in the same order as those updates appear in the journal.

That means if we update the sector counts for a bucket and then journal the new sector counts, we may very well be journalling the _wrong_ sector counts - the in memory sector counts might already been updated by another update that according to the journal hasn't happened yet, or an update that according to the journal _does_ happen prior to the current update might not have finished yet and might not have updated the bucket sector counts. The solution would be to add locks for the bucket sector counts, but the new locks would be serving the same purpose as the btree node locks that we wanted to avoid taking because of lock contention. We would be able to make the locks much more fine grained than btree node locks - i.e. one lock per bucket - but, I'm pretty sure plumbing these locks through would be pretty gross in terms of what it would do to the code (they have to be taken before we get a journal reservation, but when we get a journal reservation we don't necessarily know all the buckets we'll be touching - because we also have to do updates for extents we're overwriting).

So, I'm not planning on going with that approach. But, there actually is a way to get consistent sector counts for a given point in time without adding locks:

The trick relies on the fact that whenever we're doing an update, once we've got a journal reservation we do know when it will occur in the journal - it's the sequence number of the entry we got a reservation on, and the offset into that entry; we could determine journal time to be a u64 where the bottom 20 bits are the offset into the entry and the rest of the bits are the sequence number. So, given two updates, we can compare them and determine which is newer.

Back to the problem at hand: in order to write a bucket's sector counts to the journal, we need the sector counts for the time at which our entry will appear in the journal.

So here's the procedure:

- Stash a copy of the bucket's current sector counts somewhere, and add a pointer so that threads updating that bucket can also update our copy; flag that bucket so all updates are also done to our stashed copy.

- Get a journal reservation

- Atomically with getting the journal reservation, write the journal time of our reservation to our stashed copy of the bucket sector counts, and flag the bucket so that other threads updating that bucket's sector count _don't_ update our stashed copy if they are newer than our journal reservation.

At this point, our stashed copy of the bucket sector counts is partially consistent with our journal reservation's time: it doesn't have any updates newer than our journal reservation, but it might not yet have some updates that are older than our journal reservation.

- Wait for all references on the journal entry our reservation is in to be dropped, besides ours (and the previous one if it still had outstanding reservations).

Now our stashed copy of the bucket sector counts is what we want it to be, and we can journal them in the journal reservation we got earlier.

Sequence numbers are a powerful tool!

The procedure I just outlined is fairly heavy though, and definitely not something we'd want to do for every update that changes bucket sector counts, so by itself that isn't a practical way of implementing persistent allocation information.

But - there's no real reason we have to journal the new sector counts every time we update them. Updating sector counts is done when we update something else in the btree - so during journal replay when we're redoing that update, if we just update the sector counts the same way we did the first time we'll get the same result.

The only problem is if when we're replaying that update we have sector counts for the bucket being updated that are newer than the update itself - if that happens, we'll end up applying the changes from that update twice (adding or subtracting X number of sectors twice, instead of once).

But that's easy enough to solve - on startup, when we're reading in allocation information, we can just remember for each bucket the journal time of the entry we read its sector counts from (or 0 if it wasn't in the journal, just the btree; any key in the journal is newer than any key not in the journal). Then when we're doing journal replay, we can just skip updating the bucket sector counts if those counts are newer than the update we're replaying.

But actually, we probably don't even need to do that. If we're replaying an update and we want to skip the bucket update because the bucket sector counts are newer - the bucket sector counts are newer because there is a alloc btree key in the journal newer than the key we're currently replaying with new values. So if we don't skip the bucket update, the bucket sector counts will be temporarily wrong - but later in replay we're going to overwrite them when we get to that alloc btree key with the newer sector counts.