Vote for the next deep dive topic!
Added 2018-08-06 22:30:21 +0000 UTC
I've gotten a few comments that people have been enjoying my technical deep dives into things I'm working on.
There's a lot of other things I could write about as well, not just bcachefs but perhaps also other kernel and storage topics. I'd like to hear what people are interested in, though. If you've got an idea of something you'd like to learn more about, post it below.
Comments
Fundamentally, the primary job of any filesystem is to map from one address space to another - inode:offset -> offset on the raw block device. So yes, with Postgres on top of bcachefs you have a btree on top of a btree, same as with any other btree based filesystem. With block based filesystems (ext3, zfs), you instead have a btree on top of a trie.

The overhead isn't fundamentally different from that of any other filesystem (if you look at benchmarks, it's in the same ballpark as existing filesystems), and the alignment restrictions are no different. Bcachefs is a major departure from existing filesystems in how it implements its indexes and manages its own metadata, but fundamentally it's doing the same job, and from the outside the observed behavior is pretty much the same, excepting performance differences.

One thing to keep in mind is that _data_ is not stored in the btree; the extents btree just stores pointers to the actual data on the block device. From a theoretical point of view, indexing the data vs. indexing pointers to the data is completely equivalent, but from a practical standpoint it's a huge difference - it means the extents btree isn't all that big (because extents are as big as we can make them), and it makes it clear that we only touch the extents btree when we're reading from or writing to the block device - we're not touching the extents btree on every data access, since most data accesses are served from the page cache.

Btree performance is still kind of important, though - imagine using those crazy high end and expensive enterprise grade SSDs that can do a million+ IOPS: if you want to max them out, you need a btree (or some other data structure) that can do a million lookups/updates per second. But that's kind of unusual.

Reiserfs fsck would sometimes have to rebuild the btree by just scanning the entire device for btree nodes, if the root node got corrupted. With bcachefs...
a) I haven't had to implement that kind of "scan the device for anything we can find" type of repair, because my btree is better debugged than theirs, but also btree node magic numbers are xor'd with the filesystem superblock UUID (the internal one that users can't change) - so we can tell definitively "is this actually a btree node for _this_ filesystem?"
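To make the two ideas in this reply concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not bcachefs code: a sorted list stands in for the extents btree, and `Extent`, `ExtentIndex`, `BASE_NODE_MAGIC`, `node_magic` and `belongs_to` are all hypothetical names and values. The first part maps inode:offset to a device offset through extent pointers. The second shows how xor'ing a node magic with bits of the superblock UUID lets a scan tell an outer filesystem's nodes from those of a filesystem image stored inside it.

```python
import bisect
import uuid
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Extent:
    """A pointer to data, not the data itself (all fields are hypothetical)."""
    inode: int
    start: int       # first file offset covered, in sectors
    length: int      # number of sectors covered
    dev_offset: int  # where the data starts on the block device

    @property
    def end(self) -> int:
        return self.start + self.length


class ExtentIndex:
    """Sorted list standing in for the extents btree."""

    def __init__(self, extents):
        self._extents = sorted(extents)
        self._keys = [(e.inode, e.start) for e in self._extents]

    def lookup(self, inode: int, offset: int):
        """Map inode:offset -> device offset, or None for a hole."""
        # Find the last extent whose (inode, start) is <= (inode, offset).
        i = bisect.bisect_right(self._keys, (inode, offset)) - 1
        if i < 0:
            return None
        e = self._extents[i]
        if e.inode != inode or offset >= e.end:
            return None
        return e.dev_offset + (offset - e.start)


# Hypothetical base constant; the real on-disk value is not taken from the post.
BASE_NODE_MAGIC = 0x0123456789ABCDEF


def node_magic(sb_uuid: uuid.UUID) -> int:
    """Per-filesystem magic: base constant xor'd with 64 bits of the UUID."""
    uuid_bits = int.from_bytes(sb_uuid.bytes[:8], "little")
    return BASE_NODE_MAGIC ^ uuid_bits


def belongs_to(magic_on_disk: int, sb_uuid: uuid.UUID) -> bool:
    """Is this btree node from the filesystem with this superblock UUID?"""
    return magic_on_disk == node_magic(sb_uuid)
```

A possible use: build `ExtentIndex([Extent(1, 0, 8, 1000), Extent(1, 16, 8, 5000)])` and call `lookup(1, 3)` to resolve an offset inside the first extent, or `lookup(1, 10)` to hit the hole between the two. Because one extent covers many sectors, the index stays small relative to the data it describes.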
Kent Overstreet
2018-08-11 22:12:49 +0000 UTC
I'd be happy with any topic. For a specific proposal: How much does fragmentation actually matter for the design in the age of SSDs? Does bcache/fs do anything special to merge/rearrange fragments of data or metadata? Is it worth it normally, or just in low-free-space situations?
2018-08-10 14:37:42 +0000 UTC
I would be interested in understanding the compression implementation and its roadmap: how it functions and what decisions have been implemented in the design so far. How will it function with erasure coding? And will it be implemented differently between tiers / disk groups (i.e. foreground-->uncompressed, promote-->uncompressed, background-->compressed)? Thanks for taking the time to dive deep :) regardless of the topic you pick.
2018-08-07 14:56:52 +0000 UTC
Can you explain how bcache/bcachefs detects sequential I/O? And how does it deal with partially overwriting data that has not yet been written to the backing device? For example, I write a 2 MB chunk that goes to the cache device, and then I write a 1 MB chunk that overwrites 512K of the previous chunk and appends another 512K.
2018-08-07 10:28:29 +0000 UTC
I'd love to read about how snapshotting will work with bcachefs, and the technical issues that brings.
2018-08-07 08:22:16 +0000 UTC
You can do that but it's pretty slow. https://gist.github.com/proudlygeek/5721498
2018-08-07 08:19:28 +0000 UTC
I'm curious how bcache(fs) works with databases. You mention that it's a btree at the core, and, say, Postgres indexes are also btrees. Does that mean it's a btree holding a btree? :) What are the overheads in such a layout? What needs alignment, and are there possible shortcuts? What if we run bcachefs on top of bcachefs, say, on a loop device or in a virtual machine? :) I think Reiserfs fsck would sometimes merge the main filesystem with the one in an image lying on the main filesystem. What would have to go wrong to see something like that in bcache(fs)? Thank you for the tech posts, I enjoy them very much :)
2018-08-07 06:15:58 +0000 UTC
You know, in hindsight that's another thing that really benefited from starting out with just a block layer cache and slowly growing it into a filesystem. For just a block layer cache, the operations you support are pretty simple - just read and write, perhaps trim. So you need a b-tree (or some other indexing mechanism), and that part is relatively complicated - but it's a relatively self contained chunk of machinery, and besides that there's not much else. So the total amount of code you need to make fully robust w.r.t. memory allocation failures isn't too big, and you really need to, because block devices aren't supposed to return -ENOMEM. And the part that you need to handle in the simple (bcache) case is fairly tricky - it's making sure btree operations (which require multiple btree nodes to all be in memory at the same time) can run in a fixed amount of memory and without deadlocking. Bcache has some interesting machinery there (mca_cannibalize_lock) that always felt pretty hacky, but it seems to work well enough. Anyways, since bcache has grown into bcachefs there hasn't been anything else that was hard to deal with, w.r.t. handling memory allocation failures. Since then it's just been a matter of being disciplined about using mempools where necessary.
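The mempool idea mentioned above can be illustrated with a toy model. This is a sketch of the general concept only: it is not the kernel's mempool API and not how `mca_cannibalize_lock` works, and `ReservePool` and `FlakyAllocator` are hypothetical names. It assumes a reserve of preallocated buffers sized to the most that one operation needs at once, so the operation can still proceed when the normal allocator fails.

```python
from typing import Callable, List, Optional


class FlakyAllocator:
    """Stand-in for a normal allocator that can be switched to fail."""

    def __init__(self) -> None:
        self.failing = False

    def __call__(self) -> bytearray:
        if self.failing:
            raise MemoryError("simulated allocation failure")
        return bytearray(4096)


class ReservePool:
    """Toy reserve pool: keeps up to `min_nr` preallocated buffers in reserve."""

    def __init__(self, min_nr: int, alloc: Callable[[], bytearray]) -> None:
        self._alloc = alloc
        self._min_nr = min_nr
        # Fill the reserve up front, while allocation is assumed to succeed.
        self._reserve: List[bytearray] = [alloc() for _ in range(min_nr)]

    def get(self) -> Optional[bytearray]:
        """Try the normal allocator first; fall back to the reserve."""
        try:
            return self._alloc()
        except MemoryError:
            if self._reserve:
                return self._reserve.pop()
            # A real pool would typically wait here until a buffer is
            # returned; this sketch just reports that nothing is available.
            return None

    def put(self, buf: bytearray) -> None:
        """Return a buffer, refilling the reserve before dropping extras."""
        if len(self._reserve) < self._min_nr:
            self._reserve.append(buf)

    def reserved(self) -> int:
        return len(self._reserve)
```

In this model, forward progress depends on two things: the reserve is at least as large as the number of buffers one operation holds at the same time, and every buffer is returned with `put` when the operation finishes.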
Kent Overstreet
2018-08-07 00:09:38 +0000 UTC
One of my favourite parts of bcache was how it dealt with low memory. All of the steps it takes to ensure forward progress even when allocations are failing.
Adam Berkan
2018-08-06 23:59:10 +0000 UTC