bcachefs

Vote for the next deep dive topic!

Added 2018-08-06 22:30:21 +0000 UTC

I've gotten a few comments that people have been enjoying my technical deep dives into things I'm working on.

There are a lot of other things I could write about as well, not just bcachefs but perhaps also other kernel and storage topics. I'd like to hear what people are interested in, though. If you've got an idea of something you'd like to learn more about, post it below.

Comments

Fundamentally, the primary job of any filesystem is to map from one address space to another - inode:offset -> offset on the raw block device. So yes, with Postgres on top of bcachefs you have a btree on top of a btree, same as with any other btree based filesystem. With block based filesystems (ext3, zfs), you instead have a btree on top of a trie.

The overhead isn't fundamentally any different from any other filesystem (if you look at benchmarks, it's in the same ballpark as existing filesystems), and the alignment restrictions are no different. Bcachefs is a major departure from existing filesystems in how it implements its indexes and manages its own metadata, but fundamentally it's doing the same job and from the outside the observed behavior is pretty much the same, excepting performance differences.

One thing to keep in mind is that _data_ is not stored in the btree, the extents btree just stores pointers to the actual data on the block device. From a theoretical point of view indexing the data vs. indexing pointers to the data is completely equivalent, but from a practical standpoint it's a huge difference - it means the extents btree isn't all that big (because extents are as big as we can make them), and it makes it clear that we only touch the extents btree when we're reading or writing to the block device - we're not touching the extents btree on every data access, since most data accesses are served from the page cache.

Btree performance still is kind of important, though - imagine using those crazy high end and expensive enterprise grade SSDs that can do a million+ IOPS: if you want to max them out you need a btree (or some other data structure) that can do a million lookups/updates per second. But that's kind of unusual.

Reiserfs fsck would sometimes have to rebuild the btree by just scanning the entire device for btree nodes, if the root node got corrupted. With bcachefs...

a) I haven't had to implement that kind of "scan the device for anything we can find" type of repair because my btree is better debugged than theirs, but also btree node magic numbers are xor'd with the filesystem superblock UUID (the internal one that users can't change) - so we can tell definitively "is this actually a btree node for _this_ filesystem?"
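To make the two ideas in this comment concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not bcachefs code: a sorted list stands in for the extents btree, extents are assumed not to overlap, and every name in it (`Extent`, `ExtentIndex`, `BASE_MAGIC`, `node_magic`, `belongs_to`) is hypothetical. The constant and the choice to fold in only the first 8 bytes of the UUID are placeholders, not the real on-disk format.

```python
import uuid
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Extent:
    inode: int       # file the extent belongs to
    start: int       # first file offset covered by the extent
    length: int      # number of units covered
    dev_offset: int  # where the data starts on the block device


class ExtentIndex:
    """Sorted list standing in for an extents btree (assumes no overlaps)."""

    def __init__(self) -> None:
        self._keys: List[Tuple[int, int]] = []  # (inode, start), kept sorted
        self._extents: List[Extent] = []

    def insert(self, ext: Extent) -> None:
        key = (ext.inode, ext.start)
        i = bisect_right(self._keys, key)
        self._keys.insert(i, key)
        self._extents.insert(i, ext)

    def lookup(self, inode: int, offset: int) -> Optional[int]:
        """Map inode:offset -> device offset, or None if nothing is mapped."""
        # Find the last extent whose (inode, start) is <= (inode, offset).
        i = bisect_right(self._keys, (inode, offset)) - 1
        if i < 0:
            return None
        ext = self._extents[i]
        if ext.inode != inode or offset >= ext.start + ext.length:
            return None
        # Only a pointer comes back; the data itself lives on the device.
        return ext.dev_offset + (offset - ext.start)


# Hypothetical base constant, chosen only for this illustration.
BASE_MAGIC = 0x123456789ABCDEF0


def node_magic(fs_uuid: uuid.UUID) -> int:
    """Per-filesystem magic: base constant xor'd with part of the UUID."""
    return BASE_MAGIC ^ int.from_bytes(fs_uuid.bytes[:8], "little")


def belongs_to(fs_uuid: uuid.UUID, magic_on_disk: int) -> bool:
    """Answer: is this actually a btree node for this filesystem?"""
    return magic_on_disk == node_magic(fs_uuid)
```

A node written by one filesystem carries that filesystem's magic, so a nested filesystem image (a loop device, a VM disk) with a different UUID fails the `belongs_to` check even if a scan stumbles over its nodes.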

Kent Overstreet

2018-08-11 22:12:49 +0000 UTC

I'd be happy with any topic. For a specific proposal: How much does fragmentation actually matter for the design in the age of SSDs? Does bcache/fs do anything special to merge/rearrange fragments of data or metadata? Is it worth it normally, or just in low-free-space situations?

2018-08-10 14:37:42 +0000 UTC

I would be interested in understanding the compression implementation and its roadmap: how it functions and what decisions have been implemented in the design so far. How will it function with erasure coding? And will it be implemented differently between tiers / disk groups (i.e. foreground-->uncompressed, promote-->uncompressed, background-->compressed)? Thanks for taking the time to dive deep :) regardless of the topic you pick.

2018-08-07 14:56:52 +0000 UTC

Can you explain how bcache/bcachefs detects sequential IO? And how does it deal with partially overwriting data that has not yet been written to the backing device? For example, I write a 2 MB chunk that goes to the cache device, and then I write a 1 MB chunk that overwrites 512K of the previous chunk and appends a further 512K.

2018-08-07 10:28:29 +0000 UTC

I'd love to read about how snapshotting will work with bcachefs, and the technical issues that brings.

2018-08-07 08:22:16 +0000 UTC

You can do that but it's pretty slow. https://gist.github.com/proudlygeek/5721498

2018-08-07 08:19:28 +0000 UTC

I'm curious how bcache(fs) works with databases. You mention that it's a btree at the core, and, say, Postgres indexes are also btrees. Does that mean it's a btree holding a btree? :) What are the overheads in such a layout? What needs alignment, and are there possible shortcuts? What if we run bcachefs on top of bcachefs, say, on a loop device or in a virtual machine? :) I think Reiserfs fsck would sometimes merge the main filesystem with the one in an image lying on the main filesystem. What would have to go wrong to see something like that in bcache(fs)? Thank you for the tech posts, I enjoy them very much :)

2018-08-07 06:15:58 +0000 UTC

You know, in hindsight that's another thing that really benefited from starting out with just a block layer cache and slowly growing it into a filesystem. For just a block layer cache, the operations you support are pretty simple - just read and write, perhaps trim. So you need a b-tree (or some other indexing mechanism), and that part is relatively complicated - but it's a relatively self contained chunk of machinery, and besides that there's not much else. So the total amount of code you need to make fully robust w.r.t. memory allocation failures isn't too big, and you really need to because block devices aren't supposed to return -ENOMEM. And the part that you need to handle in the simple (bcache) case is fairly tricky - it's making sure btree operations (which require multiple btree nodes to all be in memory at the same time) can run in a fixed amount of memory and without deadlocking. Bcache has some interesting machinery there (mca_cannibalize_lock) that always felt pretty hacky, but it seems to work well enough. Anyways, since bcache has grown into bcachefs there hasn't been anything else that was hard to deal with, w.r.t. handling memory allocation failures. Since then it's just been a matter of being disciplined about using mempools where necessary.
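For readers who haven't run into mempools: the idea is to preallocate a small reserve so that an operation can still make forward progress when the normal allocator fails. Below is a minimal Python sketch of that idea. It is loosely modeled on the general shape of a kernel mempool (try the normal allocator first, fall back to the reserve, refill the reserve on free) and is not bcache or bcachefs code. `MemPool` and `FlakyAllocator` are hypothetical names, and where a kernel mempool would typically sleep until an element is freed, this sketch just returns `None`.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class MemPool(Generic[T]):
    """Reserve of preallocated objects used when normal allocation fails."""

    def __init__(self, min_nr: int, alloc_fn: Callable[[], Optional[T]]) -> None:
        self._min_nr = min_nr
        self._alloc_fn = alloc_fn
        self._reserve: List[T] = []
        # Fill the reserve up front, while allocation is still expected to work.
        for _ in range(min_nr):
            obj = alloc_fn()
            if obj is None:
                raise MemoryError("could not preallocate the reserve")
            self._reserve.append(obj)

    @property
    def reserved(self) -> int:
        return len(self._reserve)

    def alloc(self) -> Optional[T]:
        # Try the normal allocator first so the reserve stays intact.
        obj = self._alloc_fn()
        if obj is not None:
            return obj
        # Allocation failed: fall back to the reserve.
        if self._reserve:
            return self._reserve.pop()
        # Simplification: a blocking pool would wait here for a free().
        return None

    def free(self, obj: T) -> None:
        # Refill the reserve before letting the object go.
        if len(self._reserve) < self._min_nr:
            self._reserve.append(obj)


class FlakyAllocator:
    """Hypothetical allocator that can be switched into a failing state."""

    def __init__(self) -> None:
        self.failing = False

    def __call__(self) -> Optional[bytearray]:
        return None if self.failing else bytearray(4096)
```

The guarantee only holds if the reserve is at least as large as the number of objects a single operation needs at once, and if holders of reserved objects always free them without waiting on another allocation from the same pool. That is the "fixed amount of memory and without deadlocking" constraint from the comment above.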

Kent Overstreet

2018-08-07 00:09:38 +0000 UTC

One of my favourite parts of bcache was how it dealt with low memory. All of the steps it takes to ensure forward progress even when allocations are failing.

Adam Berkan

2018-08-06 23:59:10 +0000 UTC

I'd like to be able to ssh with a simple port redirection and be able to mount filesystems on the remote machine. With real NFSv4.

2018-08-06 23:32:46 +0000 UTC

I would like to know why NFS is such an ungodly pain in my ass every time I have to set up a new connection.

2018-08-06 22:51:35 +0000 UTC
