bcachefs

The bug squashing continues...

Added 2018-06-01 18:07:57 +0000 UTC

Been squashing quite a few bugs lately, but this latest one has been quite a trip down the rabbit hole...

Initial symptom was that on xfstest generic/475, very occasionally we'd see an extent past the end a file's current i_size (the test runs a filesystem stress test while injecting IO errors and then checking that the filesystem is consistent, it's quite the torture test).

After several days of head scratching and shotgunning assertions all over the place trying to figure out where the extents were coming from (along with detours to fix locking related to updating the inode, which I'm still not finished with, but turned out not to be the actual issue), I finally realize - it's because truncate isn't removing them because it thinks it isn't shrinking the file.

You see, truncate has somewhat interesting semantics, especially with regards to how it interacts with fallocate. What the truncate system call really means "set the size of this file to x", and x can be bigger or smaller than the current size - i.e. it doesn't necessarily mean truncate.

Fallocate, for those who aren't familiar, is used to reserve space in a file - if it succeeds, it guarantees that future writes within some range of a file won't return -ENOSPC. But, the range you reserve doesn't have to be within a file's current size - this is important if you're going to be appending.

So, a file might have reservations past the end of its current size - and this is perfectly legal. What this means is that truncate isn't supposed to delete those reservations if it's not being used to shrink a file - also perfectly reasonable.

But, what this means is that the implementation of truncate has to check whether it's shrinking a file or not - if it is shrinking, it blows away anything past the end of the new i_size, if not, it doesn't (if it's not shrinking, the only thing past the end of i_size should be reservations, and we want to keep those).

And when it checks this, it's going off of i_size for the inode in memory - which may be different from what i_size is on disk, due to write buffering. Which also should be fine, i_size in memory should always be greater than or equal to i_size on disk.

But what if something screws up i_size and we end up with the in memory i_size smaller than on disk i_size? Then truncate thinks it's not shrinking when it should be, and it doesn't remove extents it should, and we end up with extents past the end of i_size. Oops.

So I figured that much out a few days ago, and since then I'd been scratching my head trying to figure out what was updating i_size incorrectly. It's not updated in very many places, and I stuck assertions around all of them, and I wasn't finding anything.

Until earlier this morning, that is. Turns out, it was truncate itself that was screwing up i_size - specifically, an error path (the test injects write errors, so go figure).

There's another fun thing truncate has to deal with. Suppose we create a new file, and then we do a single 4k write to it - at offset 1M, so we've got a sparse file. Then, before that data gets written out, we truncate it to 512k.

Since the data hasn't been written out yet, i_size on disk will still be 0 - writes that append to or extend a file are supposed to happen atomically with the i_size update, so after a crash we should never see i_size 1M + 4k and the data not there, and we definitely shouldn't have the data write happen but the i_size update not happen.

So, basically that means when a truncate happens, it needs to be updating i_size on disk too, regardless of the current on disk i_size is above or below the new i_size - if the current on disk i_size is below the new i_size, we're probably cancelling whatever appends were going to extend i_size past the new i_size.

So no big deal so far - but what if we did a 4k write to offset 0, then a 4k write to offset 1M, then truncated to 512k before any of that was written out?

In that case, that first write was an append, and if we crash after truncate sets the file size to 512k but before the data write happens we've broken atomicity for that append - the file has 0s where the data should be. Oops.

So that means that truncate has to fsync the part of the file between the current on disk i_size and the new i_size. A bit annoying, but the only sane solution I've been able to come up with (and I was just reading the xfs code, they actually do the same thing).

But, what if you get an error when doing that fsync (test injects IO errors...) but you've already updated i_size in memory? That's the the "oh shit, we're screwed" moment - now we have the in memory i_size smaller than the on disk i_size (because we bailed out before doing the rest of the truncate). That's the bug.

And the problem is, you kind of need to do things in this order, for complicated reasons relating to truncates that land in the middle of a block (you have to zero part of the block and write it out again). Eesh.

Trying to figure out if I can solve this without implementing full blown transactions first (bcachefs has atomic updates of multiple btree keys, which suffices for most things - i.e. rename - but we don't have long running transactions yet). Fun stuff...