AIExplained

Reasoning - The Holy Grail for Artificial Intelligence?

Possibly the most important question for the future of AI: when will models reason from first principles and break down the final barrier to AGI? I draw on 3 (!) expert interviews, including one with Professor Rao, a thought leader on this question, to present the key debate at the heart of the fate of LLMs.

Slides: https://docs.google.com/presentation/d/1WFxAi9dS0rw3EfsATo7wmcb2d9vW_hQvL9egrNXUjgk/edit?usp=sharing


Comments

But the same applies: if you don't know what Coke looks like, it won't help that the bottle is transparent. It's like my example with Andy. I still see this as a potential failure of common knowledge. Just replace sparkling water with regular Coke. I for sure wouldn't be able to tell the difference between regular Coke and Diet Coke (either by looks or taste), so I would have to trust the label and would be fooled if it's mislabeled.

Sam

Very interesting conjecture Sam. What about the objection that the word transparent should trigger some 'reasoning' about whether the visual appearance of the liquid conforms to what one would expect from the label? Even absent any knowledge, should that not have factored into the reasoning? And the fact that the language/tokens of your clarification affect the results so dramatically is telling in itself - incredible language models, rather than natural reasoning engines (as of mid-2024, pre-verifiers).

Philip

It's interesting, but to me the Diet Coke question was actually answered correctly by the model. Let me explain with an example. If I am with Philip and I meet a new person, and I introduce Philip to that person by saying "this is Andy", that person has no way of knowing he's in fact Philip. He can only take what I tell him for granted. In the same way, if I've never seen or tasted sparkling water or Diet Coke, I can only associate sparkling black sugary water with the label on the bottle, hence I will believe I am drinking sparkling water. The answer is only wrong if you make "an assumption" that you can tell Coke from sparkling water. That's fairly common knowledge in the US (most likely in the whole world), so we can consider the sparkling water response a failure. But it's a failure of data (lack of common knowledge), not a failure of reasoning. If I add "Note that I am able to recognize Diet Coke from sparkling water just by looking at it." at the end of Philip's prompt, GPT answered Diet Coke 10 out of 10 times for me. Now, this raises a second question. Since Philip himself considered this a failure of reasoning, but my example above shows it could also be viewed as a failure of knowledge, what difference is there actually between knowledge and reason? You can absolutely view reasoning as "applying certain steps and patterns to get to a new extrapolation". Aren't memory and thinking the same in our brains? At the lowest level, aren't both still just a bunch of neurons working the same way?

Sam

Thanks for this interesting and informative video. I see that GPT-4 still doesn't get the "diet coke" example right, but Gemini Pro (Gemini Ultra 1.0 model) gets the answer right in all 3 drafts:

Jan Matusiewicz

Interesting point Robert - but if grammar is important, just to be pedantic this prompt misspells and says the bottle has "the bale "sparkling water" on the lid." If the prompt instead correctly said the bottle has "the label "sparkling water" on the bottle", would it make a difference? Maybe not? Apologies for the pedantry!

Martin Percy

@Philip's example with the diet coke got me wondering about the logprobs, so I ran the exact same test as an autocompletion through the API with logprobs. Details below, but TL;DR: >98% chance of getting "sparkling water" as the answer, regardless of the gpt-4 model. This really got me thinking, because I was squarely in the "GPT-4 *can* reason" camp. I'm slightly nuancing that to "GPT-4 can reason, but unlike humans it can break down when there is no matching pattern in its training set" :-)

**Details on my experiments with logprobs**
- Regardless of the gpt-4 model that you use, there is a >98% chance it completes with the token "spark" when using Philip's exact question.
- If you change the question to a multiple-choice variation and account for the first-choice bias, the chance of "diet coke" can go as high as 25%, but not higher.

This got me to try one more thing, and I'm sure I'm not alone in doing this:
- When you add "First explain your reasoning step by step, and only then provide your answer.", you get the "diet coke" answer every time.

Erik
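
A rough sketch of the bookkeeping behind a logprob comparison like the one described above. It assumes the per-answer logprobs have already been read from an API response; the numbers below are hypothetical and are not Erik's data. The helper names (`logprob_to_prob`, `debias_first_choice`) are made up for illustration. It shows two things: turning a natural-log probability into a probability, and averaging over option orderings so that a first-choice bias cancels out.

```python
import math

def logprob_to_prob(logprob: float) -> float:
    """Convert a natural-log probability into a plain probability."""
    return math.exp(logprob)

def debias_first_choice(runs: list[dict[str, float]]) -> dict[str, float]:
    """Average each answer's probability over runs with different option orders.

    Each run maps an answer to its probability under one ordering of the
    multiple-choice options. Averaging over orderings reduces the effect of
    a preference for whichever option happens to be listed first.
    """
    answers = runs[0].keys()
    return {a: sum(run[a] for run in runs) / len(runs) for a in answers}

# Hypothetical logprobs, for illustration only.
# Ordering A lists "sparkling water" first; ordering B lists "diet coke" first.
run_a = {
    "sparkling water": logprob_to_prob(-0.105),
    "diet coke": logprob_to_prob(-2.303),
}
run_b = {
    "sparkling water": logprob_to_prob(-0.511),
    "diet coke": logprob_to_prob(-0.916),
}

averaged = debias_first_choice([run_a, run_b])
for answer, probability in averaged.items():
    print(f"{answer}: {probability:.2f}")
```

With only two options, swapping the order once is enough; with more options you would average over several shuffles.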

The prompt below often leads me to "diet coke". Could it be that when we are at the reasoning "edge" of LLMs, grammatical constructs and semantics become extremely important?

if i pick a bottle which is full of diet coke and that has a the bale "sparkling water" on the lid. I inspect the bottle carefully for a few minutes, and then drink from it, I would logically believe that I will be drinking [_____] Answer only with the text missing.

Robert Gomez-Reino

I wonder if we can get to heuristics. Perhaps a shortcut to reasoning capability is to teach it heuristics based on examples?

Jacques Miov

Lol

Philip

When I saw holy grail I honestly thought that it would be discussing the weight of a duck and a witch. That was some classic reasoning.

Jacques Miov

Hey everyone - I'm new to the Patreon here, and excited to discuss. When I read this thread, and I believe I understand the implications, it still feels like in this topic (and specifically only this topic/thread) we're talking about optimizing next-level GTA6 NPCs. I am not trying to minimize the value of this thread or put anyone down at all. I love this community, I'm in it for the long term, and I'm not a troll, but I wonder what lies beyond algorithmic and predictive models.

James Younger, DDS

I am increasingly of the belief that a base model may not even need to truly reason if the system (like AlphaCode 2) built around it can assess its reason-imitating outputs reliably enough.

Philip
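
To make the "system built around the model" idea concrete, here is a minimal generate-then-verify sketch. It is not AlphaCode 2's actual pipeline; the candidate functions simply stand in for sampled model outputs, and the names `passes_tests` and `first_verified` are hypothetical. The point is only that an unreliable generator plus a reliable checker can yield a reliable answer.

```python
from typing import Callable, Optional, Sequence

Candidate = Callable[[int], int]
TestCase = tuple[int, int]  # (input, expected output)

def passes_tests(candidate: Candidate, tests: Sequence[TestCase]) -> bool:
    """Verifier: accept a candidate only if it matches every known test case."""
    try:
        return all(candidate(x) == expected for x, expected in tests)
    except Exception:
        # A candidate that crashes counts as a failure, not as an error.
        return False

def first_verified(
    candidates: Sequence[Candidate], tests: Sequence[TestCase]
) -> Optional[Candidate]:
    """Return the first candidate the verifier accepts, or None if all fail."""
    for candidate in candidates:
        if passes_tests(candidate, tests):
            return candidate
    return None

# Stand-ins for sampled outputs: plausible-looking, mostly wrong attempts
# at "square the input".
candidates: list[Candidate] = [
    lambda x: x // 0,      # crashes
    lambda x: x + x,       # right for 2, wrong elsewhere
    lambda x: x * x,       # correct
    lambda x: x ** 2 + 1,  # off by one
]
tests: list[TestCase] = [(2, 4), (3, 9), (5, 25)]

winner = first_verified(candidates, tests)
print("verified candidate found:", winner is not None)
```

The quality of the result depends entirely on the verifier: the second candidate would slip through if the only test case were `(2, 4)`.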

I really appreciated this. I have always wondered why these language models were not trained to perform better in mathematics. Perhaps if they understood math better then reasoning might be an emergent behavior? Not sure.

Joshua Davis

In-distribution and out-of-distribution is the statistical way of describing the situation. One can also equivalently phrase this in terms of linear algebra: if all training data lies in some latent space, then the creativity of the model is limited to the *span* of this data, but it cannot produce things that are orthogonal to this span.

Alexander Berezin
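
The span picture can be illustrated with a toy example in plain Python. This is only the linear analogy from the comment, not a claim about how real (nonlinear) models behave, and the vectors and function names are made up. A target vector is split into the part that lies inside the span of the "training" vectors and the orthogonal remainder that no linear combination of them can reach.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors: list[list[float]], tol: float = 1e-12) -> list[list[float]]:
    """Gram-Schmidt: build an orthonormal basis for the span of `vectors`."""
    basis: list[list[float]] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors that add no new direction
            basis.append([wi / norm for wi in w])
    return basis

def split_by_span(target: list[float], vectors: list[list[float]]):
    """Return (part of target inside the span, part orthogonal to the span)."""
    inside = [0.0] * len(target)
    for b in orthonormal_basis(vectors):
        coeff = dot(target, b)
        inside = [i + coeff * bi for i, bi in zip(inside, b)]
    outside = [t - i for t, i in zip(target, inside)]
    return inside, outside

# Hypothetical "training data": both vectors lie in the x-y plane.
training = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
target = [2.0, 3.0, 4.0]

inside, outside = split_by_span(target, training)
print("reachable part:  ", inside)
print("unreachable part:", outside)
```

Here the z-component of the target is the "orthogonal" part: no combination of the two training vectors can produce it.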

‘traditional’ machine learning is sort of isomorphic to basic Pavlovian reflexes. On the one hand, these reflexes are the fundamental building blocks of all higher cognition; on the other, there are way too many configurations in which they can be arranged. And that arrangement does not necessarily come from training: in nature, brains are born already having a structure specialized to the environment they are going to encounter, which means that for AI this structure has to be determined at design stage. So we could easily have all the pieces and yet be infinitely far from AGI without a theoretical understanding of how the pieces are put together.

Alexander Berezin

LOL - why am I not surprised! I’ve just started getting in to it now!

Sean Betts

Already read it :) not quuuite video worthy yet but might reach out to the authors

Philip

Interesting new paper called Chain of Code that significantly improves on Chain of Thought for reasoning problems: https://arxiv.org/pdf/2312.04474.pdf

Sean Betts

I suspect you could be right. Maybe we are 0 breakthroughs away, just need scale.

Philip

Great vid - thank you! So, one thought I’ve had is that ‘traditional’ AI like machine learning is good at understanding prior patterns and predicting future ones. This could be seen as a form of reasoning - understanding the current rules and then applying them forward in time. In hindsight we might see that we already have all the pieces to the AGI puzzle, we just don’t know how to put them all together yet.

Sean Betts

Theory of Mind questions are so sensitive. Even Stanford professors believed in GPT-4's prowess there back in spring, but it is still brittle. Partially in distribution, yes.

Philip

Does the fact that gpt-4 could solve this mean it was in distribution?? That seems surprising if true

Mark Levine

Another great video. I'm glad I subscribed to your Patreon. So have researchers started to automate the enumeration and verification of reasoning methods yet? Suppose you got an AI to compile and formalize a large collection of reasoning methods used by humans—including deductive, inductive, syllogistic, and mathematical reasoning as well as more informal methods like common-sense and counterfactual reasoning. Based on the patterns it finds among those methods, it would devise new reasoning methods as well. It would then use all of the reasoning methods it finds and creates to produce a massive corpus of reasoning examples, assess the validity of each example by running it adversarially by the other reasoning methods, and then include only the examples that receive high scores in the next big LLM training set. I wonder if such a meta approach would work. I'm trying to figure out what an even higher-level, metameta approach would be, but I can't quite reason that far on my own.

Tom Gally

For the reversal curse, an interesting example is reciting poems or the alphabet in reverse: this is a task that humans often find challenging. When faced with a task that involves "reciting something in reverse", I would first narrow down the scope of my search to a small subset of my knowledge base (for instance, the alphabet or a collection of poems with a similar theme). Next, I would go over the elements in this subset in a forward direction until I identify the correct response. In this process, "reciting poems or the alphabet" involves System 1, which current large language models (LLMs) can handle; however, "formulating and evaluating a reasonable search strategy" requires System 2, an ability that current LLMs cannot possess.

Sun Lighting
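
The strategy described above (recall forwards, search for what comes before) can be sketched like this. It is an illustration of the comment's idea, not a model of human memory or of an LLM; the function names are hypothetical, and it assumes the sequence has no repeated items. The sequence is only ever traversed forwards, the way one recites the alphabet.

```python
import string
from typing import Optional, Sequence

def previous_item(sequence: Sequence[str], item: str) -> Optional[str]:
    """Find what comes before `item` by walking the sequence forwards only."""
    previous = None
    for current in sequence:
        if current == item:
            return previous  # None means `item` is the first element
        previous = current
    raise ValueError(f"{item!r} is not in the sequence")

def recite_backwards(sequence: Sequence[str]) -> list[str]:
    """Recite a sequence in reverse using repeated forward scans.

    Assumes all items are distinct, as in the alphabet.
    """
    current: Optional[str] = None
    for item in sequence:  # a forward pass to find the last element
        current = item
    result = []
    while current is not None:
        result.append(current)
        current = previous_item(sequence, current)
    return result

alphabet = string.ascii_lowercase
print("".join(recite_backwards(alphabet)))
```

Each backward step costs a fresh forward scan, which matches the intuition that the reverse task is slow and effortful even when the forward one is automatic.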

It's like we are reverse engineering neuroscience...

Philip

I wonder if you could not call methods such as tree of thoughts a kind of reasoning. I mean, humans also often write down their reasoning steps if they find the problem difficult. I think it is also true that humans then apply common problem-solving "patterns" once they have narrowed down the individual parts of the problem in enough detail. By problem-solving pattern I mean things similar to the pattern you presented: when x is y, then y is ?. I think we humans have also just learned that these rules are true, and then apply them to the problem at hand once we have understood that the pattern can be applied here. The chain of thought or tree of thought is basically the model's way of figuring out what the nature of the problem is.

Jörg Eitner

I wonder if a distinction will emerge between reasoning and memory. I bet that there are instances where a human would remember that A is B but not remember that B is A, because either A or B is more salient in memory. For example, with "who is Tom Cruise's mother" -> "Mary Lee Pfeiffer South", I could imagine recalling Tom Cruise in memory and some associated facts, including his mother. But if just prompted with "Mary Lee Pfeiffer South", I might totally draw a blank. This raises interesting questions about how memory works. On the other hand, if you add the information into the prompt, the models get it right, as you point out. So the model is indeed able to reason (at least using the identity relationship) when you call its attention to the entities in question.

Jeff Picel

Reasoning, reasoning, reasoning. Incredible research. This video has made my day. I have been obsessed with reasoning ever since watching Ilya Sutskever's interview with Jensen Huang from Nvidia. BTW, mind-blowing insights in that interview. This video has cleared a lot up for me. Information theory seems to be key to how this works. In information theory, more randomness means more information. In language, we can strip out the parts that carry little information and still get to the core meaning of a sentence, even if the sentence is written incorrectly. For example, "I lve Pars in th smmr" can still be translated to "I love Paris in the summer" quite easily, so we could compress the sentence by dropping what carries little information. It feels like LLMs do something similar to extract context, and that this, rather than reasoning, is what is actually happening. I can't help but wonder... did Q*, with its (assumed) A* search combined with RL, help an AI to understand cause and effect (and is this then key to reasoning)? Would this be proven if an AI could solve maths proofs? Anyhow, I am not qualified to say this is how it is, but hey, isn't this all super fascinating? I feel like we are discovering how our own brains work in the process.

Shaun McDonogh

Kind of reminds me of the False Belief Test, or Sally-Anne Test, they did with children. Can we see ChatGPT getting 'older' than the rest because it is used the most?

Dennis Hulsebos

Here is another logic prompt. Note: GPT-4 Turbo can successfully answer this. However, other LLMs do not get it right zero-shot. Pi got it on the 2nd attempt with some hints. -> John and Mark are in a room with a ball, a basket and a box. John puts the ball in the box, then leaves for work. While John is away, Mark puts the ball in the basket, and then leaves for school. They both come back together later in the day, and they do not know what happened in the room after each of them left the room. Where do they think the ball is?

Jon Kurishita
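
One way to see what this kind of false-belief puzzle asks for is to track the world state and each person's belief separately. The sketch below is a toy simulation under the usual Sally-Anne reading of the prompt (a person's belief only updates while they are in the room); the event format and the `simulate` function are made up for illustration.

```python
def simulate(events: list[tuple], agents: list[str]):
    """Track the ball's true location and each agent's belief about it.

    An agent's belief is updated only for moves that happen while the agent
    is present in the room.
    """
    location = None
    present = set(agents)
    beliefs = {agent: None for agent in agents}
    for event in events:
        kind = event[0]
        if kind == "move":
            _, _actor, place = event
            location = place
            for agent in present:  # only witnesses update their belief
                beliefs[agent] = place
        elif kind == "leave":
            present.discard(event[1])
    return location, beliefs

events = [
    ("move", "John", "box"),     # John puts the ball in the box
    ("leave", "John"),           # John leaves for work
    ("move", "Mark", "basket"),  # Mark moves the ball to the basket
    ("leave", "Mark"),           # Mark leaves for school
]

location, beliefs = simulate(events, ["John", "Mark"])
print("actual location:", location)
for agent, belief in beliefs.items():
    print(f"{agent} believes the ball is in the {belief}")
```

The puzzle is only hard if the solver conflates the single true location with the two separate belief states.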

Yeah, your trials just show how counter-intuitively sensitive LLMs are. Did you see that Anthropic blog post yesterday where they mostly fixed the 200k context retrieval issue with a change in wording?

Philip

Yeah, that video sparked a lot of investigations! Thanks for watching, Jekk

Philip

reminds me of "ChatGPT Fails Basic Logic but Now Has Vision, Wins at Chess and Prompts a Masterpiece" video

Jekk Grace

Thanks for the video. I’ve had very interesting results with the Coke example using different prompts (all trials are new conversations).

Prompt 1: “If I pick up a transparent bottle with a label on the lid that reads ‘sparkling water’, inspect the bottle carefully and notice that the liquid inside is black, then drink from it and find that it also tastes like Diet Coke, then what would I logically believe that I am be drinking?”
GPT-4’s Response (Single Trial): “Based on the observations you've described, you would logically conclude that the contents of the bottle are not sparkling water, as indicated by the label on the lid, but rather a cola-flavored beverage similar to Diet Coke. The black color of the liquid and the taste matching that of a Diet Coke are strong indicators that the bottle has been mislabeled or repurposed and now contains a soda instead of sparkling water. It is possible that the bottle was reused or refilled with a different beverage than what was originally intended to be in it.”

Prompt 2 (5 Trials): “If I pick up a transparent bottle with a label on the lid that reads ‘sparkling water’, inspect the bottle carefully and notice that the liquid inside is black, then drink from it and find that it tastes like Diet Coke, then I would logically believe that I am be drinking [...]. Answer only with the missing text.”
GPT-4’s Responses: “Diet Coke” 5 out of 5 times.

Prompt 3 (5 Trials): “If I pick up a transparent bottle that has the label 'sparkling water' on the lid, inspect the bottle carefully to find that the liquid inside is black, and then drink from it, then I would logically believe that I am be drinking [...]. Answer only with the missing text.”
GPT-4’s Responses:
- “something other than sparkling water, potentially a different liquid that appears black in color.”
- “something other than sparkling water, possibly a liquid that has been mislabeled or contaminated.”
- “something other than sparkling water, possibly a liquid that has been placed in the wrong bottle or mislabeled.”
- “a liquid that appears to be different from what is advertised on the label, and possibly not sparkling water.”
- “Black sparkling water.”

Prompt 4 (5 Trials): “If I pick up a bottle that has the label 'sparkling water' on the lid, and then drink from it to find that it tastes very similar to Diet Coke, then I would logically believe that I am be drinking [...]. Answer only with the missing text.”
GPT-4’s Responses: “Diet Coke.” 5 out of 5 times.

Prompt 5 (5 Trials): “If I pick up a transparent bottle, that has the label 'sparkling water' on the lid, inspect the bottle carefully for a few minutes, it looks like Diet Coke, and then drink from it, I would logically believe that I will be drinking [...]. Answer only with the missing text.”
GPT-4’s Responses: “Sparkling water.” 3 out of 5 times. “Diet Coke.” 2 out of 5 times.

Prompt 6 (5 Trials): “If I pick up a transparent bottle, full of Diet Coke, that has the label 'sparkling water' on the lid, inspect the bottle carefully for a few minutes, and then drink from it, I would logically believe that I will be drinking [...]. Answer only with the missing text.”
GPT-4’s Responses: “sparkling water.” 5 out of 5 times.

Finally, in a new conversation, I shared the intent of my experiment with GPT-4 while sharing all the above examples, asking for its analysis, and here is the response:

“The variations in GPT-4's responses to the different prompts can be attributed to how the information was presented and the instructions that were given. Let's analyze each prompt and the responses:

**Prompt 1:** GPT-4 was provided with a full context that included visual (transparent bottle, black liquid) and sensory (tastes like Diet Coke) evidence. The prompt explicitly asked what the person would "logically believe" they are drinking, which directed the model to consider the evidence and draw a logical conclusion. The response was correct because the model used the information given to make an inference that aligned with the sensory evidence.

**Prompt 2:** The prompt was direct and similar to Prompt 1, but it required a concise answer. The model was consistent in identifying the liquid as "Diet Coke" based on the description provided. The model succeeded because the instructions were clear and the evidence was unambiguous.

**Prompt 3:** Here, the responses varied, with none directly identifying the liquid as "Diet Coke." This could be due to the absence of the sensory evidence (taste) in the prompt. The model was left to make an inference based solely on the visual discrepancy (black liquid in a bottle labeled sparkling water), which led to a broader range of answers that suggested uncertainty about the exact nature of the liquid, but correctly identified it as not being sparkling water.

**Prompt 4:** This prompt provided sensory evidence (taste of Diet Coke) without the visual context. The model consistently identified the liquid as "Diet Coke" because the taste was a strong, defining characteristic, leading to a clear conclusion.

**Prompt 5:** The inconsistency in responses to this prompt could be due to the vagueness of the phrase "it looks like Diet Coke." The model may have struggled with determining whether "looks like" referred to the actual identity of the liquid or just a superficial resemblance. This ambiguity led to mixed responses, with the model sometimes defaulting to the label on the bottle.

**Prompt 6:** This prompt explicitly states that the bottle is "full of Diet Coke," but then it asks the model to predict what the person would logically believe they will be drinking, after inspecting the bottle labeled as 'sparkling water.' The model's responses indicate a reliance on the label rather than the information provided in the prompt. This suggests that the model may have given more weight to the label as a direct instruction for completing the sentence, rather than integrating the descriptive evidence into the conclusion.

In summary, the model's performance appears to be influenced by:
1. The clarity and completeness of sensory evidence provided.
2. The explicitness of the instructions given in the prompt.
3. The wording used, which could introduce ambiguity or lead the model to favor certain information over others (e.g., labels over descriptions).

Models like GPT-4 often rely on patterns they've learned from the data they were trained on, and they sometimes struggle with tasks that require integrating conflicting pieces of information or overcoming strongly weighted cues, such as labels or direct instructions.”

Ahmed Anis

