AIExplained

Can ChatGPT Do Task X? It’s Surprisingly Hard to Answer

Added 2024-07-04 18:17:18 +0000 UTC

‘Can any model do [insert task]?’ is a much harder question than it seems. I’m going to give you five vivid categories, with unambiguous examples, drawing on 6 new papers, of the kind of detail that is so often lost in 2024 debates on AI.

Link for Off-line Watching and Download: https://drive.google.com/file/d/1ep3Asw6_1LZRJoCKU1VGSaUcYMP6djBq/view?usp=sharing

Anthropic What LLMs Can’t Do: https://www.anthropic.com/news/a-new-initiative-for-developing-third-party-model-evaluations?s=09

Fool Your (Vision and) Language Model with Embarrassingly Simple Permutations https://arxiv.org/pdf/2310.01651

LLMs are not robust selectors: https://openreview.net/pdf?id=shr9PXz7T0

Eliminating Position Bias of Language Models: A Mechanistic Approach - https://arxiv.org/pdf/2407.01100

DROP Benchmark: https://arxiv.org/pdf/1903.00161v2

Investigating the Robustness of LLMs on Math Word Problems: https://arxiv.org/pdf/2406.15444

AI Explained Hat-tip: https://x.com/vipul_1011/status/1808243451852644794

Changing Answer Order Can Decrease MMLU Accuracy: https://arxiv.org/pdf/2406.19470

AlphaCode 2: https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf

AI Agents That Matter: https://arxiv.org/pdf/2407.01502

Functional benchmarks: https://arxiv.org/pdf/2402.19450v1

LLMs Cannot Plan: https://arxiv.org/pdf/2402.01817

MysteryWorld: https://arxiv.org/pdf/2302.06706

Comments

Wonder if models liking the first answer is because of triangular masking: the vector of the first answer doesn't have information about the other answers. Would be interesting to see if a BERT-style model fixes this.

Bryson Tang

2024-07-31 17:19:21 +0000 UTC
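
For readers unfamiliar with the term, the triangular (causal) mask mentioned in the comment above can be sketched as follows. This is an illustrative sketch only: the function name `causal_mask` is hypothetical, and it works on individual token positions rather than whole answer options. It shows what the mask permits and does not test whether masking actually causes the first-answer preference.

```python
def causal_mask(n):
    """Return an n x n lower-triangular mask as nested lists.

    mask[i][j] == 1 means position i may attend to position j.
    Hypothetical helper for illustration; real models apply this
    inside their attention layers.
    """
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


mask = causal_mask(4)
for row in mask:
    print(row)

# The first position can only see itself, while the last sees everything before it.
print("position 0 sees", sum(mask[0]), "position(s)")
print("position 3 sees", sum(mask[3]), "position(s)")
```

A bidirectional (BERT-style) encoder would correspond to a mask of all ones, which is the contrast the comment is drawing.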

Much more to come, have been buried in creating a new benchmark for just this!

Philip

2024-07-15 15:37:29 +0000 UTC

I am very optimistic, just unclear when. In a sense, logic should be an easier problem than language.

Philip

2024-07-15 15:37:06 +0000 UTC

If you like, more to come, even a formal benchmark. Am working hard on it.

Philip

2024-07-15 15:36:38 +0000 UTC

It could well be yes, seems really fundamental to the architecture

Philip

2024-07-15 15:36:18 +0000 UTC

Will try this prompt on a new benchmark I am creating, seems to work just occasionally. Fundamentally it often does not grok what is irrelevant.

Philip

2024-07-15 15:35:53 +0000 UTC

Thanks Carlos!

Philip

2024-07-15 15:34:37 +0000 UTC

Thank you Steven, will have to dwell on that one, hard to put a response in words!

Philip

2024-07-15 15:34:31 +0000 UTC

Yeah this video actually inspired me to finally, and formally, create a benchmark. This is why I have been quiet the last week or so, and lots more work to go. Really could reveal weaknesses/unlock gains.

Philip

2024-07-15 15:34:06 +0000 UTC

If models aren't only memorising patterns/answers but also thinking to some extent, what you've described here is analogous to human cognitive biases.

Machiel Reyneke

2024-07-07 05:16:12 +0000 UTC

Great video as always! It seems that reasoning failures are a challenge for LLMs on the scale of hallucination. How optimistic are you that these things can be solved without some sort of new breakthrough, on the level of transformers?

Mark Levine

2024-07-07 00:46:29 +0000 UTC

Sometimes I feel like this channel might lead to a skeptical revival (philosophical) - not cynical - where it's almost like a frontier AI position gets pummeled/stripped down by a battery of tests from logical skepticism and then limps back to a drawing board. I can see how textbook old schools of thought were founded; a societal/economic shift occurred and then some people crowded around an interrogative position. Asking what if / what then / why is it / enough times until it becomes degraded or untenable.

Oldsnakenewtrix

2024-07-06 02:47:11 +0000 UTC

This guy did a great job explaining the mechanistic interpretability work of Anthropic. It is in Spanish, but really really good: https://youtu.be/VjZ51nIcy1I?si=dg8tNKmBHN55RNqU

Pablo Rodríguez

2024-07-05 19:04:38 +0000 UTC

Very interesting! Throughout all my learning experiences, I've always felt frustrated with tests that required memorization since I prefer to understand things. Human tests rely heavily on memorization. LLMs are basically trying to compress information, and an effective way to do this is to extract axioms and understanding from all the information we receive. I wonder if, with enough training and scale, these patterns will emerge. However, I also think that a change in architecture is needed. Transformers are too granular, and we humans have the ability to understand and change some of these hyper structures at will (just realizing you are wrong and changing your logic flow). I think the work around sparse autoencoders for mechanistic interpretability is a step in the right direction, and feeding this to the model so it can be aware of that meta-cognition layer is crucial.

Pablo Rodríguez

2024-07-05 16:23:24 +0000 UTC

Hello Philip, if you prime GPT-4o with a prompt like: "Please review each question closely and take special care to ignore any irrelevant information." GPT-4o tends to do better. Using your example with GPT-4o and this additional prompt produces the correct result every time (for me).
----GPT-4o Response----
To solve the problem of determining the least possible number of marbles that must be removed to guarantee that at least 10 marbles of any one color have been removed, we need to consider the worst-case scenario. In the worst-case scenario, we want to maximize the number of marbles removed without reaching 10 of any single color until it becomes inevitable.
The marble counts are:
8 green marbles
18 red marbles
17 blue marbles
21 purple marbles
5 silver marbles
The worst-case scenario would involve removing as many marbles as possible from each color without removing 10 of any color. Here’s the calculation:
Remove 8 green marbles (since there are only 8).
Remove 9 red marbles (one short of 10).
Remove 9 blue marbles (one short of 10).
Remove 9 purple marbles (one short of 10).
Remove 5 silver marbles (since there are only 5).
The total number of marbles removed in this scenario is: 8 + 9 + 9 + 9 + 5 = 40
To guarantee that at least 10 marbles of any one color have been removed, one more marble needs to be removed, making it inevitable to reach 10 of a particular color. Therefore, the least possible number of marbles that must be removed to guarantee that at least 10 marbles of any one color have been removed is 40 + 1 = 41
So, the correct answer is: 41
----End GPT-4o Response----
You can then ask the model for the irrelevant information:
----GPT-4o Response----
The irrelevant information in this problem is the weight of the marbles and the weight of the bag. These details do not impact the calculation of the number of marbles that need to be removed to guarantee that at least 10 marbles of any one color have been removed.
Therefore, you can ignore the following information:
"3g weight" of green marbles
"2g weight" of red marbles
"4g weight" of blue marbles
"3g weight" of purple marbles
"2.5g weight" of silver marbles
"The bag has a weight of 100g when empty."
The relevant information is the number of marbles of each color.
----End GPT-4o Response----

Tony Coffman

2024-07-05 15:02:43 +0000 UTC
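
The worst-case count quoted in the comment above can be checked with a few lines of Python. This is a minimal sketch, assuming the marble counts exactly as listed in the quoted response; the function name `min_draws_to_guarantee` is made up for illustration and is not from the video or the papers.

```python
def min_draws_to_guarantee(counts, k):
    """Smallest number of draws that forces at least k marbles of one color.

    Worst case: take as many marbles as possible from every color while
    staying below k (that is, min(count, k - 1) from each), then draw one more.
    """
    if not any(c >= k for c in counts.values()):
        raise ValueError("no color has at least k marbles, so no guarantee exists")
    worst_case = sum(min(c, k - 1) for c in counts.values())
    return worst_case + 1


# Counts as given in the quoted response; the weights are deliberately ignored.
marbles = {"green": 8, "red": 18, "blue": 17, "purple": 21, "silver": 5}
print(min_draws_to_guarantee(marbles, 10))  # 41
```

The weights never enter the calculation, which is the point of the distractor in the original question.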

Great video! Reminds me of last week's Bill Gates comment: "The various actions to change the underlying reasoning algorithm from the trivial that we have today to more human-like meta-cognition; that's the big frontier." However, he also said, "I've seen that we will make progress on [meta-cognition] next year but we won't completely solve it for some time after that." This mirrors what Sam, Mira, Brad, John, and Kevin Scott have said about expecting better reasoning in the next generation. I tend to think they're not simply speculating but, instead, have seen notable reasoning progress behind closed doors. And since some of the flaws you point out intuitively feel like a little reasoning would go a long way (as opposed to, say, ARC), I tend to think the next generation will make significant progress on these -- and not at the expense of speed which users have come to expect and, more importantly, the voice-to-voice paradigm they're pushing relies on. Thoughts? It sounds like you think these are stubborn flaws that will materially hold back AI capabilities for several years.

Brian Crabtree

2024-07-05 08:31:47 +0000 UTC

Excellent video, as always!

Carlos Galarza

2024-07-04 20:57:54 +0000 UTC

Great video mate. I love the criticism of LLMs. What I appreciate the most is the level-headed position you often have. I did have some specific questions I would like to ask you directly. Do you respond to these comments for questions? Have any of these AI companies tried running them similar to an organic brain? Like having one AI talk to itself and forming “hemispheres” of thought? One for pure output, one for evaluating proper and quality outputs, one to evaluate the past (memory), in order to actually create a somewhat conscious machine? Also, the frequency of our brain is directly related to our thoughts and feelings, can we somehow get LLMs to think on a certain frequency? Thanks!

Steven

2024-07-04 20:23:26 +0000 UTC

Philip: this is a really important video! I suppose the implication is that AI models would benefit from more randomness in their training. To help models truly “grok” the underlying abstract concepts and generalise better, it’s important to randomise various aspects of the training data. This could include shuffling the order of options in tests, varying the location and position of objects in images, and dressing up the concepts in different guises during learning. By “fuzzing up” these elements, models can learn to focus on the essence of the concepts rather than relying on superficial details. Incorporating such techniques should enable models to improve their ability to generalise to new contexts.

Jason Tangen

2024-07-04 19:36:37 +0000 UTC
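
The option-shuffling idea in the comment above can be sketched as follows. This is a hedged illustration, not the procedure used in any of the linked papers: the helper `shuffle_options` and the sample question are hypothetical, and the sketch only shows how to permute answer options while keeping track of which one is correct.

```python
import random


def shuffle_options(options, answer_index, rng):
    """Return the options in a random order plus the new index of the correct one.

    `options` is a list of answer strings, `answer_index` points at the
    correct option, and `rng` is a random.Random instance for repeatability.
    """
    order = list(range(len(options)))
    rng.shuffle(order)  # order[new_position] = old_position
    shuffled = [options[old] for old in order]
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


rng = random.Random(0)
options = ["Paris", "Rome", "Madrid", "Berlin"]  # made-up example question
for _ in range(3):
    shuffled, idx = shuffle_options(options, 0, rng)
    letters = "ABCD"
    print([f"{letters[i]}. {text}" for i, text in enumerate(shuffled)],
          "correct:", letters[idx])
```

Scoring a model on several such permutations of the same question is one simple way to see whether its answer depends on option position rather than content.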

Great coverage as always! It's a breath of fresh air to hear about the flaws of these models and how much more work needs to be done rather than having hype and fear of replacement being firehosed at you. I think that the AI hype bubble needs to pop, so that the labs that are doing the real work can take their time to develop it without the pressure from higher-ups to release before it's due.

Jonathan Kirk

2024-07-04 19:03:24 +0000 UTC
