ncase

Sheriff Meowdy: an excerpt from my upcoming mega-post on AI Alignment

Added 2023-01-31 17:53:50 +0000 UTC

⏱ (reading time: 7 min)

Hi all!

I had a bunch of moving & medical-related errands in January — don't worry, I'm fine — so I didn't finish my 90-minute-read intro to AI Alignment this month after all.

But, here's an excerpt! One thing I'm proud of: the art quality is much higher than my previous stuff. (To be fair, that's a low bar. Evolution of Trust was literally stick figures without arms.) The full AI Alignment mega-post will be out early March.

Also: this Patreon post is public, so feel free to share this excerpt and/or forward this email!

(And after the excerpt: a) Your Jan 2023 Patreon rewards + b) I have a Mastodon now! + c) Complaining about Patreon-the-platform.)

— — — — —

🤖 Amplification in AI Alignment, Explained

(an excerpt from my upcoming mega-post, Dealing With Djinn: a friendly tour guide to AI Alignment.)

If a person or AI is just a bit smarter than you, sure, they'd be fairly easy to safely contain: just lock them up & put them under surveillance. But if they're much, much smarter than you, they could hack or manipulate their way out: think Hannibal Lecter, the fictional high-IQ serial killer.

(Or, a less sci-fi-death-cult example: let's say an engineer has an AI generate rocket designs. If the AI's only a bit more sophisticated than the engineer, they can double-check the AI's designs. But if the AI uses cutting-edge physics, they may no longer be able to check that the designs are actually safe.)

So: how can we safely oversee AIs that may be much, much smarter than us?

One proposed idea is called: amplification.

To understand this, let's call up the Sheriff...

Sheriff Meowdy is the quickest draw in the... local area. His goal? To protect the townsfolk from the Varmin comin' into town:

But the Sheriff knows he ain't fast enough to stop 'em. He's man enough to admit it, so Sheriff Meowdy gets some hired help:

Meowdy 2 is twice as fast as Sheriff Meowdy. But the Sheriff weren't born yesterday. He lets Meowdy 2 fend off the Varmin, while the Sheriff keeps his trusty pistol trained on 2's back. In technical jargon, the Sheriff is the overseer.

This is a safe alignment strategy for now, coz while Meowdy 2 could turn around and shoot the Sheriff in 500ms, the Sheriff can notice & shoot first in 200ms. (ms = millisecond, 1/1000th of a second)

(For actual AI: "train a gun on its back" is a metaphor for inspecting the AI's "thoughts", watching for signs of misalignment or accidental failure, and pulling the plug before it gets dangerous. As with Hannibal or the rocket-designing bot, this only works as long as the AI isn't too far above you.)

It's even safe for the Sheriff to directly oversee Meowdy 3, who is twice as fast as Meowdy 2:

But the Sheriff is NOT fast enough to directly oversee Meowdy 4:

This strategy, "directly oversee the bot", has a capabilities ceiling. In this case, it fails for Meowdy 4 and above.
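The ceiling falls straight out of the numbers above. Here's a minimal Python sketch, assuming a Meowdy's time-to-shoot halves with every level (starting from Meowdy 2's 500ms) while the Sheriff's reaction time stays at 200ms. The function names are made up for illustration.

```python
SHERIFF_REACTION_MS = 200.0  # from the text: the Sheriff can notice & shoot in 200ms
MEOWDY_2_DRAW_MS = 500.0     # from the text: Meowdy 2 needs 500ms to turn and shoot


def draw_time_ms(level: int) -> float:
    """Time for Meowdy `level` to turn and shoot, assuming speed doubles per level."""
    if level < 2:
        raise ValueError("levels start at Meowdy 2")
    return MEOWDY_2_DRAW_MS / 2 ** (level - 2)


def sheriff_can_oversee(level: int) -> bool:
    """Direct oversight is safe only if the Sheriff reacts before the bot can shoot."""
    return SHERIFF_REACTION_MS < draw_time_ms(level)


for level in (2, 3, 4):
    print(f"Meowdy {level}: {draw_time_ms(level)}ms, safe = {sheriff_can_oversee(level)}")
```

Meowdy 3 needs 250ms, which is still slower than the Sheriff's 200ms. Meowdy 4 needs only 125ms, so the Sheriff loses that draw.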

The Sheriff is stumped. But one day, he goes to the ol' drinking hole for fine entertainment. He sees the line-dancing femboys on stage, and gets a brilliant idea:

Have bots help you align other bots.

Now, not only can the Sheriff indirectly oversee Meowdy 4, he can even oversee a God-level Meowdy 100, who's 2^100 = 1,267,650,600,228,229,401,496,703,205,376 times faster than the Sheriff!

Now, the Sheriff can make swiss cheese out of a million Varmin, easy-peasy.

This is oversight amplification: when you use bots to amplify your ability to oversee other bots.
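Why the chain holds up can be sketched in a few lines of Python. One loud assumption that is not stated in the text: each Meowdy's reaction time speeds up 2x per level, just like its time-to-shoot does. The 200ms and 500ms figures are from the parable, everything else (including the function names) is hypothetical.

```python
def reaction_ms(level: int) -> float:
    """Reaction time of overseer `level`; level 1 is the Sheriff at 200ms.
    ASSUMPTION: reactions get 2x faster per level, like draw speed."""
    return 200.0 / 2 ** (level - 1)


def draw_ms(level: int) -> float:
    """Time for Meowdy `level` to turn and shoot; Meowdy 2 takes 500ms."""
    return 500.0 / 2 ** (level - 2)


def can_oversee(overseer: int, bot: int) -> bool:
    """Safe if the overseer can react before the bot can shoot."""
    return reaction_ms(overseer) < draw_ms(bot)


def chain_is_safe(top_level: int) -> bool:
    """True if every Meowdy from 2 up to top_level is watched by the one just below it."""
    return all(can_oversee(n - 1, n) for n in range(2, top_level + 1))


print(can_oversee(1, 4))   # the Sheriff watching Meowdy 4 directly
print(chain_is_safe(100))  # each Meowdy watching the next one up
```

Under that assumption, every overseer is only ever one doubling behind the bot it watches, so the 200ms-vs-500ms safety margin repeats at every link.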

But hey now sunshine, Sheriff Meowdy ain't no tumbleweed-for-brains numberista, he's read books on risk management & tussled with Taleb. He knows that, even if each Meowdy only has a 1% chance of failure, a chain only works if every link is unbroken, so with 100 Meowdys, that's—

(Sheriff curses as he punches into them newfangled city-boy "calculators")

— a 63% chance of failure! The Sheriff ain't taking a risk on that, not with the townsfolk's lives at stake!
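The Sheriff's calculator math, as a minimal Python sketch. It assumes each link fails independently of the others; the function name is made up for illustration.

```python
def chain_failure_probability(links: int, p_fail: float) -> float:
    """Chance that at least one of `links` independent links fails.

    The chain survives only if every link survives: (1 - p_fail) ** links.
    """
    return 1.0 - (1.0 - p_fail) ** links


# 100 Meowdys, each with a 1% chance of failure
print(f"{chain_failure_probability(100, 0.01):.1%}")  # 63.4%
```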

But the Sheriff is familiar with them basic techniques from risk management & robustness engineering — like how when NASA needs a computer program for a space probe, they get three different engineering teams to write the same program. Then, they put all those programs on the probe, and the probe takes a majority vote of what the programs tell it. This way, even if one program fails, the whole system remains robust. Also yes, NASA exists in this cat-person comic universe.
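The majority-vote idea can be sketched in Python like this. It's a toy illustration of the voting step only, not actual flight software, and `majority_vote` is a hypothetical name.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the answer that more than half of the programs agree on."""
    if not outputs:
        raise ValueError("need at least one output")
    answer, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No strict majority: better to flag it than to guess.
        raise RuntimeError("no majority among the programs")
    return answer


# Three independently written programs; one of them has a bug.
print(majority_vote([42, 42, 7]))  # 42
```

With three programs, any single faulty one gets outvoted. If all three disagree, the sketch raises an error instead of picking one arbitrarily.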

Anyhoo, the Sheriff gives each Meowdy a backup overseer.

(In actual practice: you'd also want to minimize the chance of several overseers failing at the same time. So, you could give each bot different training data, or a different "random seed", to make their failures as independent as possible.)

The Sheriff punches the numbers into them city-boy "calculators", and is amazed: with even just one side-chain of backup overseers, the 100-Meowdy line's chance of failure drops from 63% to 1%! And with a second side-chain, it drops to 0.01%! That's a mighty fine "alignment tax"!
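Those numbers check out under two assumptions: overseers fail independently, and a link only breaks when all of its overseers fail at once. A minimal Python sketch (the function and parameter names are made up for illustration):

```python
def redundant_chain_failure(links: int, p_fail: float, overseers_per_link: int) -> float:
    """Chance the chain breaks when every link has several independent overseers.

    A link breaks only if ALL of its overseers fail: p_fail ** overseers_per_link.
    The chain breaks if any one of its links breaks.
    """
    p_link_breaks = p_fail ** overseers_per_link
    return 1.0 - (1.0 - p_link_breaks) ** links


# 1 = no backups, 2 = one side-chain of backups, 3 = two side-chains
for overseers in (1, 2, 3):
    print(overseers, f"{redundant_chain_failure(100, 0.01, overseers):.2%}")
```

This prints roughly 63%, 1%, and 0.01%. The independence assumption is doing a lot of work here: if the overseers tend to fail together, the real improvement is smaller.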

This is robustness amplification: when you use bots to amplify the robust-to-failure-ness of your bots.

Though the Sheriff reckons he can't align Meowdy 100 by his lonesome, with the help of amplification, he can keep all of them robustly aligned to his true goal: to protect the townsfolk.

The Varmin curse their luck, and limp off into the golden sunset.

. . .

(Ugh finally I can type in a normal voice again.)

(The rest of the "amplification" section will explain two specific proposals: 1) Recursive reward modeling, and 2) Iterated Distillation & Amplification (IDA).)

(What the heck do those mean?... well, I guess you'll have to wait until next month to read my layperson-friendly explanations of those! But the core idea in both is the same as the Sheriff Meowdy parable: bots align slightly stronger bots, ad infinitum.)

(Now, imagine 90 minutes of words+art like the above. Yeah. That's why this project is taking a while. Full post will be out early March!)

— — — — —

💖 Jan 2023 Patreon Rewards

Wall of Thanks ($2+/month) — these names will also be included in the credits of my AI Alignment mega-post!
Polygon Avatars ($5+/month)
And this month's Drawing of a Cat reward, "CatGPT passes the Purring Test", was emailed out to those who picked that tier!

— — — — —

🐘 I have a Mastodon!

👉👉 It's mas.to/@ncase ! 👈👈

If you've got Mastodon, follow me!

I have... not posted anything! But I will start posting stuff there that will not be on Twitter. Mastodon-exclusives, if you will!

— — — — —

💸 Complaining about Patreon-the-platform

Last time, I mentioned that under Patreon's new billing system, if I need to pause my Patreon for a mental health break, new patrons can't sign up while I'm paused. Even though their old billing system did allow this!

"Take a break" xor "no new patrons" is a crappy trade-off. I've talked with their Support teams about it — (to be fair, their Support is prompt & friendly) — but there are currently no plans in place to fix this. Combined with Patreon's internal mismanagement (hat tip @buster), the future of this platform is uncertain.

Also, I can't tell if this is Patreon's problem in particular, or because of the pandemic, or simply regression-to-the-mean, but... almost every educational-content-creator I know who uses Patreon has seen their revenue steadily drop for the last year, or longer:

Mine (Nicky Case), 3Blue1Brown, Veritasium, Vi Hart, Minute Physics, Minute Earth, Kurzgesagt, Mathologer, Primer, Crash Course, Numberphile, SciShow. All steadily dropping for 1+ years.

(CGP Grey & Smarter Every Day set their stats to private years ago. VSauce, Mark Rober & Tom Scott aren't on Patreon. The only two counterexamples I could find were: Rational Animations [growing], Up And Atom [not growing nor falling].)

You can find all these stats for yourself on Graphtreon.

(Evidence against the "it's a problem with Patreon in particular" hypothesis: half of the nsfw furry Patreons I know are continuing to boom in growth.)

Like, I can still pay for rent & food — and I'm grateful for that! — but seriously, seeing that number steadily fall, for years, while being hit with 7%–10% inflation (compared to pre-pandemic ~1% inflation), is... well, let's just say I am not above making nsfw furry art.

I'm just mentioning all this, because later in 2023 I may decide to finally leave Patreon & make my own self-hosted, open-source, just-for-one-creator alternative (like 2012's Selfstarter, now defunct). I have made my own crowdfunding site with the PayPal & Stripe APIs before; a basic open-source, host-your-own-Patreon-for-only-one-creator would take me at most 3 months to make.

Not only could I let new patrons sign up while I'm paused, I could also offer one-time donations (Patreon has committed to never doing this), allow for supporter testimonials (like Ko-fi), not require creating an account (like Humble Bundle), get around Patreon's arbitrary content rules (a problem for many nsfw accounts), and avoid Patreon's platform fees!

(Though, more realistically, I'd just switch to an existing alternative like GitHub Sponsors.)

What do you think? Am I over-reacting, under-reacting? What other options should I explore? Should I try combining science education with nsfw furry art? Schrödinger's Catgirl? Knot theory? Let me know in the comments!

— — — — —

Anyway: full AI Alignment mega-post will be out early March! I may be miffed at Patreon-the-platform, but, I am not miffed at you, the patrons. I am grateful! As always, thank you for helping me pay rent n' stuff. 💖

Cheers,
~ Nicky Case