Sheriff Meowdy: an excerpt from my upcoming mega-post on AI Alignment

(reading time: 7 min)

Hi all!

I had a bunch of moving & medical-related errands in January — don't worry, I'm fine — so I didn't finish my 90-minute-read intro to AI Alignment this month after all.

But, here's an excerpt! One thing I'm proud of: the art quality is much higher than my previous stuff. (To be fair, that's a low bar. Evolution of Trust was literally stick figures without arms.)  The full AI Alignment mega-post will be out early March.

Also: this Patreon post is public, so feel free to share this excerpt and/or forward this email!

(And after the excerpt: a) Your Jan 2023 Patreon rewards + b) I have a Mastodon now! + c) Complaining about Patreon-the-platform.)

— — — — —

🤖 Amplification in AI Alignment, Explained

(an excerpt from my upcoming mega-post, Dealing With Djinn: a friendly tour guide to AI Alignment.)

If a person or AI is just a bit smarter than you, sure, they'd be fairly easy to safely contain: just lock them up & put them under surveillance. But if they're much, much smarter than you, they could hack or manipulate their way out: think Hannibal Lecter, the fictional high-IQ serial killer.

(Or, a less sci-fi-death-cult example: let's say an engineer has an AI generate rocket designs. If the AI's only a bit more sophisticated than the engineer, they can double-check the AI's designs. But if the AI uses cutting-edge physics, they may no longer be able to check that the designs are actually safe.)

So: how can we safely oversee AIs that may be much, much smarter than us? 

One proposed idea is called: amplification.

To understand this, let's call up the Sheriff...


Sheriff Meowdy is the quickest draw in the... local area. His goal? To protect the townsfolk from the Varmin comin' into town:

But the Sheriff knows he ain't fast enough to stop 'em. He's man enough to admit it, so Sheriff Meowdy gets some hired help:

Meowdy 2 is twice as fast as Sheriff Meowdy. But the Sheriff weren't born yesterday. He lets Meowdy 2 fend off the Varmin, while the Sheriff keeps his trusty pistol trained on 2's back. In technical jargon, the Sheriff is the overseer.

This is a safe alignment strategy for now, coz while Meowdy 2 could turn around and shoot the Sheriff in 500ms, the Sheriff can notice & shoot first in 200ms. (ms = millisecond, 1/1000th of a second)


(For actual AI: "train a gun on its back" is a metaphor for inspecting the AI's "thoughts", watching for signs of misalignment or accidental failure, and pulling the plug before it gets dangerous. Like Hannibal or the rocket-designing-bot, this works as long as the AI isn't too far above you.)

It's even safe for the Sheriff to directly oversee Meowdy 3, who is twice as fast as Meowdy 2:


But the Sheriff is NOT fast enough to directly oversee Meowdy 4:


This strategy, "directly oversee the bot", has a capabilities ceiling. In this case, it fails for Meowdy 4 and above.
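
(For the number-inclined, here's a minimal Python sketch of that ceiling. It uses the two figures from the parable, 200 ms for the Sheriff and 500 ms for Meowdy 2, and assumes every later Meowdy halves the previous one's time. The function names are made up for illustration.)

```python
# Figures from the parable: the Sheriff needs 200 ms to notice & shoot,
# and Meowdy 2 needs 500 ms to turn around and shoot.
SHERIFF_REACTION_MS = 200
MEOWDY_2_TURN_MS = 500


def turn_time_ms(n: int) -> float:
    """Time for Meowdy n (n >= 2) to turn and shoot.

    Assumption: each Meowdy is exactly twice as fast as the one before it.
    """
    return MEOWDY_2_TURN_MS / 2 ** (n - 2)


def can_directly_oversee(n: int) -> bool:
    """Direct oversight is safe only while the Sheriff can shoot first."""
    return SHERIFF_REACTION_MS < turn_time_ms(n)


for n in range(2, 6):
    print(f"Meowdy {n}: {turn_time_ms(n)} ms, safe to oversee: {can_directly_oversee(n)}")
```

Under those assumptions, Meowdy 2 (500 ms) and Meowdy 3 (250 ms) are still slower than the Sheriff's 200 ms, while Meowdy 4 (125 ms) is not, which is where the ceiling lands in the story.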

The Sheriff is stumped. But one day, he goes to the ol' drinking hole for fine entertainment. He sees the line-dancing femboys on stage, and gets a brilliant idea:

Have bots help you align other bots.

Now, not only can the Sheriff indirectly oversee Meowdy 4, he can even oversee a God-level Meowdy 100, who's 2^99 = 633,825,300,114,114,700,748,351,602,688 times faster than the Sheriff!

Now, the Sheriff can make swiss cheese out of a million Varmin, easy-peasy.

This is oversight amplification: when you use bots to amplify your ability to oversee other bots.

But hey now sunshine, Sheriff Meowdy ain't no tumbleweed-for-brains numberista, he's read books on risk management & tussled with Taleb. He knows that, even if each Meowdy only has a 1% chance of failure, a chain only works if every link is unbroken, so with 100 Meowdy's, that's—

(Sheriff curses as he punches into them newfangled city-boy "calculators")

— a 63% chance of failure! The Sheriff ain't taking a risk on that, not with the townsfolks' lives at stake!
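
(A quick sketch of the Sheriff's calculator session, assuming each of the 100 links fails independently with a 1% chance. The function name is hypothetical.)

```python
def chain_failure_probability(links: int, p_fail: float) -> float:
    """Chance that at least one link in the chain breaks.

    Assumption: every link fails independently with the same probability.
    The chain survives only if all links survive, so we subtract that from 1.
    """
    p_all_survive = (1 - p_fail) ** links
    return 1 - p_all_survive


risk = chain_failure_probability(links=100, p_fail=0.01)
print(f"Chance the 100-Meowdy chain fails: {risk:.0%}")
```

The point of the arithmetic: a small per-link risk compounds quickly, because 0.99 multiplied by itself 100 times is only about 0.37.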

But the Sheriff is familiar with them basic techniques from risk management & robustness engineering — like how when NASA needs a computer program for a space probe, they get three different engineering teams to write the same program. Then, they put all those programs on the probe, and the probe takes a majority vote of what the programs tell it. This way, even if one program fails, the whole system remains robust. Also yes, NASA exists in this cat-person comic universe.
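
(As a toy illustration of majority voting, and not NASA's actual flight software, here's a sketch where three hypothetical programs each report an answer and the system goes with whatever most of them say. The commands are invented for the example.)

```python
from collections import Counter


def majority_vote(outputs):
    """Return the answer given by a strict majority, or None if there is none."""
    answer, count = Counter(outputs).most_common(1)[0]
    return answer if count > len(outputs) / 2 else None


# Hypothetical outputs from three independently written programs.
# One of them is buggy, but the other two outvote it.
print(majority_vote(["fire thruster", "fire thruster", "do nothing"]))

# If all three disagree, there is no majority to trust.
print(majority_vote(["fire thruster", "do nothing", "deploy panel"]))
```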

Anyhoo, the Sheriff gives each Meowdy a backup overseer.

(In actual practice: you'd also want to minimize the chance of several overseers failing at the same time. So, you could give each bot different training data, or a different "random seed", to make their failures as independent as possible.)

The Sheriff punches the numbers into them city-boy "calculators", and is amazed: with even just one side-chain of backup overseers, the 100-Meowdy line's chance of failure drops from 63% to 1%! And with a second side-chain, it drops to 0.01%! That's a mighty fine "alignment tax"!
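
(Here's a rough sketch of that second calculator session. It assumes a link only breaks if every overseer on it fails, and that overseers fail independently with a 1% chance each. Both are simplifying assumptions, and the function name is made up.)

```python
def chain_failure_with_backups(links: int, p_fail: float, overseers_per_link: int) -> float:
    """Chance the chain breaks when each link has several overseers.

    Assumptions: overseers fail independently, and a link breaks only
    if all of its overseers fail at once.
    """
    p_link_breaks = p_fail ** overseers_per_link
    return 1 - (1 - p_link_breaks) ** links


# 1 overseer per link = the original chain,
# 2 = one side-chain of backups, 3 = two side-chains.
for overseers in (1, 2, 3):
    risk = chain_failure_with_backups(100, 0.01, overseers)
    print(f"{overseers} overseer(s) per link: {risk:.2%} chance of failure")
```

Under those assumptions the results come out near 63%, 1% and 0.01%, matching the Sheriff's numbers. If failures are correlated, the real improvement would be smaller.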

This is robustness amplification: when you use bots to amplify the robust-to-failure-ness of your bots.

Though the Sheriff reckons he can't align Meowdy 100 by his lonesome, with the help of amplification, he can keep all of them robustly aligned to his true goal: to protect the townsfolk.

The Varmin curse their luck, and limp off into the golden sunset.

. . .

(Ugh finally I can type in a normal voice again.)

(The rest of the "amplification" section will explain two specific proposals: 1) Recursive reward modeling, and 2) Iterated Distillation & Amplification (IDA).)

(What the heck do those mean?... well, I guess you'll have to wait until next month to read my layperson-friendly explanations of those!  But the core idea in both is the same as the Sheriff Meowdy parable: bots align slightly stronger bots, ad infinitum.)

(Now, imagine 90 minutes of words+art like the above. Yeah. That's why this project is taking a while. Full post will be out early March!)

— — — — —

💖 Jan 2023 Patreon Rewards

— — — — —

🐘 I have a Mastodon!

👉👉 It's mas.to/@ncase ! 👈👈

If you've got Mastodon, follow me!

I have... not posted anything!  But I will start posting stuff there that will not be on Twitter.  Mastodon-exclusives, if you will!

— — — — —

💸 Complaining about Patreon-the-platform

Last time, I mentioned Patreon's new billing system makes it so that, if I need to pause my Patreon for a mental health break, it makes it impossible for new patrons to sign up while I'm paused. Even though their old billing system did allow this!

"Take a break" xor "no new patrons" is a crappy trade-off. I've talked with their Support teams about it — (to be fair, their Support is prompt & friendly) — but there are currently no plans in place to fix this. Combined with Patreon's internal mismanagement (hat tip @buster), the future of this platform is uncertain.

Also, I can't tell if this is Patreon's problem in particular, or because of the pandemic, or simply regression-to-the-mean, but... almost every educational-content-creator I know who uses Patreon has seen their revenue steadily drop for the last year, or longer:

Mine (Nicky Case), 3Blue1Brown, Veritasium, Vi Hart, Minute Physics, Minute Earth, Kurzgesagt, Mathologer, Primer, Crash Course, Numberphile, SciShow. All steadily dropping for 1+ years.

(CGP Grey & Smarter Every Day set their stats to private years ago.  VSauce, Mark Rober & Tom Scott aren't on Patreon. The only two counterexamples I could find were: Rational Animations [growing], Up And Atom [not growing nor falling].)

You can find all these stats for yourself on Graphtreon.

(Evidence against the "it's a problem with Patreon in particular" hypothesis: half of the nsfw furry Patreons I know are continuing to boom in growth.)

Like, I can still pay for rent & food (and I'm grateful for that!), but seriously, seeing that number steadily fall, for years, while being hit with 7%–10% inflation (compared to pre-pandemic ~1% inflation), is... well, let's just say I am not above making nsfw furry art.

I'm just mentioning all this, because later in 2023 I may decide to finally leave Patreon & make my own self-hosted, open-source, just-for-one-creator alternative. (like 2012's Selfstarter, now defunct) I have made my own crowdfunding site with the PayPal & Stripe APIs before; a basic open-source, host-your-own-Patreon-for-only-one-creator would take me at most 3 months to make.

Not only could I make it let me get new patrons while paused, I could also let it do one-time donations (Patreon has committed to never doing this), allow for supporter testimonials (like Ko-fi), not require creating an account (like Humble Bundle), get around Patreon's arbitrary content rules (a problem for many nsfw accounts), and avoid Patreon's platform fees!

(Though, more realistically, I'd just switch to an existing alternative like GitHub Sponsors.)

What do you think? Am I over-reacting, under-reacting? What other options should I explore? Should I try combining science education with nsfw furry art? Schrödinger's Catgirl? Knot theory? Let me know in the comments!

— — — — —

Anyway: full AI Alignment megapost will be out early March!  I may be miffed at Patreon-the-platform, but, I am not miffed at you, the patrons.  I am grateful!  As always, thank you for helping me pay rent n' stuff. 💖

Cheers,
~ Nicky Case

Comments

It feels weird replying to my own Patreon comment from a year ago, but the fact that I'm still thinking about it probably means it's worth revisiting. I would like to retract both of my previous points. I've since learned a lot more about how Patreon operates, the cuts it takes, how it monetizes certain features, and I've heard a lot more creators complain about it. I wouldn't blame anyone who'd choose to move away from it, and I would gladly follow. In the end, the whole issue of a one-stop shop for content was solved long ago by RSS and mailing lists. And as far as managing subscriptions goes... everything is a subscription nowadays, so keeping a spreadsheet or having an app to track them is becoming a necessity anyway. I'm still not gonna tell you to go ahead and set up an alternative, because I don't know how successful you'd be in converting the rest of your audience. The only creator I know who made the switch remains the one who left for Substack a year ago, and I don't get the feeling that particular platform is doing much better. But I can tell you I would now support the move enthusiastically, especially knowing that more of my money is going to the creator rather than a middleman.

Viniter

You do not need to go full artificial intelligence to see how alignment can go wrong. Just ask any parent how natural intelligence will do anything to barely comply with the rules, with the strangest side effects.

Chris K

Meowdy!

Stadtfuchs

Thank you Detective Chiyo ! ^_^

Nicky Case

I follow ... you, exactly you, on Patreon. So if you were to move I'd not be choked up about it. So I selfishly say dooooe eeeeeet

Sean Riley

Nice excerpt & lovely art, thanks for taking the time to make such high-quality, simple explainers ♥ One thing: the very premise "if they're much, much smarter than you, they could hack or manipulate their way out" doesn't feel obviously true to me (a complete outsider to the field). I would love to understand more whether there is good evidence that this is a real danger, or a convincing argument (that doesn't rely on media tropes), or if it's just something we're not sure how to assign a probability to, but seems worth mitigating against, just in case. On the Patreon / no-Patreon question, I do not mind being billed even when you take mental health breaks: medical leave should be a universal right and I don't see a reason why this should be different for independent creators. A one-off service would feel not so great from a security perspective nor (if many people start doing the same) from an admin overhead standpoint. But I hear the frustrations with Patreon…

Ted

I'm not so sure about this self-hosted crowd funding. I mean, on one hand sure, Patreon doesn't offer much in terms of discoverability, so it doesn't matter much where you redirect your potential supporters. But on the other hand... I had one of the creators I follow leave Patreon for Substack recently. And while I followed them there, I found it quite frustrating. I don't know if I'd be willing to go along with everyone leaving for a different platform. There's 2 things going for Patreon: 1) I have all the paywalled content available in one place, in one app (mediocre as it is), with one set of push notifications and a complete history. 2) I can keep track of all my subscriptions. Everything operates on a monthly subscription these days and I'm starting to struggle managing all of them. And while supporting creators is among my favourite things I get charged monthly, if I had to keep a spreadsheet to track all of it, it would sour the experience.

Viniter

damn I don't know how you do it but every time I read one of your posts I am shocked at how well you explain the ideas clearly!!

Detective Chiyo

Only thing is it might be hard to regather the base you already had here :/

Rev Storm

Good point! Patreon *does* have its network effect, and at least a trustworthy(ish) brand (for now) that it's at least not gonna take the credit card number and run

Nicky Case

> I'd be less inclined to join a one-off service because of security/complexity Good point!

Nicky Case

I *have* used Ghost in the past, but migrated away because their hosted platform kept getting worse & worse. Specifically, theme-development/modification became a nightmare, and they were constantly pushing users to become a Substack-like paid newsletter. If I'm recalling correctly, you couldn't even *disable* that "feature", I had to hide the button manually with my own CSS. I'm currently using 11ty, a static blog generator, hosted on GitHub pages!

Nicky Case

As convenient as it is for me to have Patreon as a sort of one-stop shop, I love the idea of you setting up an alternative just for you. I'm not convinced it's good advice but I do think it could be really neat!

Jacques Frechet

Have you thought about using Ghost? https://ghost.org

jgoodhcg

I'd suggest more public posts to attract users, I remember your original games/explainers posted to the public. I've always viewed Patreon as a good way to donate regularly to people who are doing and posting interesting things. I'd be less inclined to join a one-off service because of security/complexity/having to add a new service to my list of things to check on.

Conrad Wong

I really want to see your NSFW "knot theory" post now... Looking forward to the AI alignment post :D

Sam Cook

