AIExplained



Last 24 Hours: Signs of Introspection in LLMs

Before this paper from Anthropic, out on the 29th, I was a lot more skeptical about LLMs self-reporting their internal state. The paper does not prove that they can, but it is partial evidence of circuits with genuine introspective capability and, more than that, of the ability to map a question about an internal state onto those circuits.

Plus a big update to lmcouncil.ai.

https://www.anthropic.com/research/introspection

Full Paper: https://transformer-circuits.pub/2025/introspection/index.html#mechanisms

Earlier Work: https://www.anthropic.com/research/mapping-mind-language-model

https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html

Release Post: https://x.com/AnthropicAI/status/1983584136972677319


lmcouncil.ai


Comments

That's so great to hear! Can you just confirm whether the scrolling hassle comes from using it on mobile, or on what type of screen? If mobile, after how many models does this expansion become a chore: the normal 3-4, or more? I can try to implement a solution in the next 24 hours.

Philip

Really loving lmcouncil.ai! Feature request: when I have a large number of council members, scrolling left and right to review responses is quite cumbersome. Is it possible to add some scroll arrows, perhaps something like what you see in the headline image from this blog: https://www.experienceux.co.uk/ux-blog/a-ux-perspective-on-horizontal-scrolling/ ?

Bret Brizzee

Done! Just for you!

Philip

Thanks Joshua! And for being a Max sub...

Philip

Damn, not too much I can do about that!

Philip

In LM Council, please add an option to "delete all" past conversations. Thank you!

Riley Thomson

Great content, Philip!

Joshua Davis

The vocal fry in your voice is a bit disturbing.... 😅

Kishore Kumar

This makes me wonder whether human introspection is a result of the physical capabilities of the brain or whether it comes about as a result of lived human experiences. If it is a result of the physical brain, then that would suggest the possibility that an artificial intelligence could stumble across introspection.

Barnaby Golden

I'm wondering whether this is just a case of writing "bread" or "treasure" or "dust" with a vector instead of with words, followed by the same normal model inference as always. My hypothesis is that, for example, because the model was not trained on "Quick fox jumped over a lazy dog" with "bread"-skewed activations, the context-filling circuits would kick in. After all, their job is to extract information that is not written directly. Combine that with a suggestive question, et voilà. So to me the testing was clever and shows an interesting way of doing inference, but it was not rigorous enough to support the conclusion they tentatively drew. Oh, and congrats on the lmcouncil expansion and development! I will most probably give it a try!

Paweł Pieniacki

Again, very fascinating research by Anthropic, although I am not sure what to make of it. When they inject activity patterns for specific concepts, we would expect (as seen in their prior work on monosemanticity) the outputs to be steered towards that concept. I feel like the result of combining the given prompt, where the researchers reveal that the model is being tested, with the actual activation injection is something we would expect even if a model had no capability to introspect. It would be more impressive if the model detected these injections without getting any hints. Also, it's not clear how introspection could work in a feed-forward network. One could imagine that parts of the state represented by earlier layers are analyzed by later layers, but this would be the bare-bones version of introspection. Anyway, very interesting research, even though I don't find it super convincing for now.

Phillip Yao-Lakaschus

Thanks Philip, I was hoping you’d cover the details of that Anthropic research after running out of time to read more than the summary myself! I hope you can keep doing these videos and don’t get too distracted by building. It’s very addictive!

Erik

