AIExplained



Pod 7: The Story Behind SIMPLE Bench, More Results, and Next Steps - Let's Think Sip by Sip

Mistral Large flops hard, but what exactly is this benchmark, what are some more of its questions, why is it different, and what is next? All, or at least some, of these questions will be answered.

Comments

Around 30:50, Philip talks about how models might be able to predict human reactions to experiments, and how impactful that would be. I came across a paper, https://arxiv.org/abs/2305.09620 “AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction”, that might be relevant for updating one’s evaluation of current models’ performance on this metric. (Disclaimer: I haven’t read the paper, only heard it mentioned in some YT video, which I cannot find again.)

Pavol Vaskovic

This should do it Pablo! https://support.patreon.com/hc/en-us/articles/212052266-Getting-Discord-access

Philip

Can you share the discord link with me please?

Pablo Rodríguez

I think we are always coaching ourselves and other humans into reasoning. Prompt engineering or something very close *is* part of human rationality. Psych experiments from Kahneman et al. show that humans do a lot better when you prompt them correctly. But when we compare humans with LLMs, we never consider how they're being prompted or their direct experience with being tested and evaluated. Or the context awareness that e.g. they might be getting trick questions. So I have the opposite concern, that we are understating LLM capabilities by being unfair to them.

Vlad Gheorghe

haha maybe one day if I am tipsy

Philip

Tokenisation based weaknesses aren't super-interesting to me any more for the reason you gave!

Philip

I'm mysterious but not 3-phd mysterious.

Philip

I am optimistic these things will be solved, but I am thinking more of a hybrid neuro-symbolic system that can call, among other things, a multi-faceted simulator. Like imagine IsaacGym on steroids.

Philip

Sub 10% so far, but aiming for completing the full self-consistency run by end of week.

Philip

It is a very important point, about how much prompting is 'coaching the model through reasoning'. So you are using your human capabilities to identify it as a question of time, and what the salient tasks are. You are also framing, in language, what kind of tokens to recall, or weights to activate, to answer the question. Part of what I am testing is whether the model can do these things. I am open-minded therefore, about a 'prompt-augmented' set of results but I feel they may somewhat overstate LLM capability. You are right though, that in the spatial reasoning questions they fail even when prompted in almost any way, so even prompt-augmented it would fall well below human performance.

Philip

Thank you for this. I disagree when you say that the model lacks relevant knowledge because basic things aren’t written down. Try the following prompts: 1. How long does it take on average to read a tweet? 2. Imagine there's a guy running a race and while they're running they turn to wave at a fan. How long would that take? 3. How long does it take to ride an elevator for 10 floors and then back? Imagine that a man gets some news. Rate the following news in terms of gravity. Give each a gravity score between 0 and 1,000 where 0 is trivial and 1,000 is maximum gravity. - his ex has been seeing other people after they broke up - global thermonuclear war has erupted - his kid has failed a grade - his uncle has been diagnosed with terminal illness Claude and ChatGPT give very reasonable answers. Clearly the model has the building blocks for solving the problem and it’s not about lack of knowledge*. I think it’s more like the system 1 trap of blurting out an answer too quickly. Or maybe the model *knows* they're giving out the wrong answer but they can't *help* it, can't resist the pull of the statistical pattern (LLMs are weird!). Therefore I think the models would do a lot better with some basic prompting. And I don't mean cheating by applying a tailored prompt to each question, but a general prompt that tells the model to think and model the situation. So my questions now are: - Have you tried any type of prompting when evaluating models? - What prompt did you give to humans when you gave them the test? You mention something like “think about this for 1-2 minutes before answering.” - Would you consider opening an ARC style competition where you release the training data and people can write a prompt and be evaluated on the hidden test data? (*Caveat: for spatial reasoning and physics intuition I am closer to your views because that is hard to model through language and requires a simulator.)

Vlad Gheorghe

Can we get some idea of how the new Gemini 1.5 Pro is scoring on the SIMPLE benchmark? At this point, for me this might decide whether I test it or not, tbh.

Robert Gomez-Reino

Hey John, thanks for your answer, do you have a link?

Pablo Rodríguez

I suspect his discord channel is probably the best avenue if you're looking for networking and just more open discussion

John Dorian

I finally finished the podcast and have to say, great work, thanks for sharing. I've mentioned this before, but I’ll elaborate further. Throughout my life, I enjoyed understanding things deeply but didn’t focus on repetitive practice. This meant that during tests, I often faced problems for the first time, unlike my friends who practiced similar problems multiple times and were quicker at solving them. In college, my university's tests involved long, math-heavy problems where speed was rewarded. In contrast, the Polytechnic University of Madrid had a different approach with "problemas de idea feliz" or "happy idea problems." These were unique, original problems that required a smart idea to solve, often simpler and less mathematically intricate, yet typically considered harder. The questions you share remind me of these "happy idea problems" because they require creative thinking rather than rote knowledge. Sidenote: have you thought about opening a forum? I would love to network and talk with people in the space. Maybe you have it and I don’t know about it.

Pablo Rodríguez

You have done such a brilliant job of convincing me that real-world reasoning is hard to build that now I am skeptical that even some of the emerging architectures (mamba, Q* etc) can tackle it. But you seem more optimistic, correct?

Mark Levine

I don't care how you pronounce it, but at 7:05 it does sound a bit like "mistrial".

SwayStar123

I’m pretty sure he was referring to “PhDs” as people with a PhD, and not PhDs that he has himself! Although, with mysterious Philip, you never know!

Erik

Thanks for the podcast. I'm looking forward to your website with the SIMPLE ratings of the main LLMs. Here’s what I’ve been thinking about lately: As LLM performance gets better and better, and as people use them for more and more purposes, there need to be good ways of evaluating and comparing their performance on fuzzy tasks that are difficult to rate on a right-or-wrong basis, such as discussion, brainstorming, role-playing, consulting, etc. I, at least, don’t use LLMs much for reasoning problems with definite solutions. Rather, I use them interactively for open-ended tasks. While I might come to prefer a particular LLM for my limited purposes (currently, my preference is for Claude 3.5 Sonnet), it would be nice to know which LLM is strongest overall on a wide range of tasks. Can an automatable benchmark be developed for that purpose?

Tom Gally

Ah, yeah, that's true.

Adrian Schmidt

Did he just say " ... this is why I got multiple PhDs in different domains by the way ... " :P This triggers me, as I am currently finishing my single PhD. I would be curious what subjects you studied? Btw: I am such a good prompt engineer now that I have really stopped writing. I am prompting the whole book I have to write. Quite complex prompting. But for me it has turned into a new paradigm of academic work.

Christopher Pollin

I think in theory, when we are using the API, we can opt out of that. We have to trust that they are sticking to their contract...

Robert Gomez-Reino

Yes, but there you are using your knowledge of the task at hand to give specific instructions. Notice that my approach is generic and therefore scalable, and can be automated. It just helps to solve many riddles.

Robert Gomez-Reino

An interesting (not necessarily better) prompting-only solution to this is to ask for each word on a new numbered line. The right task is easy enough for the model.

Sim Davis

I would say that when the answer requires any maths, algebra, etc., it is a very good idea to ask it to use code interpreter (or, if via the API, to provide your own tool to do so). But when the question is relatively simple, asking it to use code interpreter sometimes helps just because the model is indirectly "forced" to decompose the task to create the code, and during that decomposition it "reasons" its way to the right solution. Example of the famous counting of r in strawberry here, without using open interpreter: https://chatgpt.com/share/e/580994f1-a92f-4c6d-9b20-024b1bddb0d5

Robert Gomez-Reino
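
For readers unfamiliar with the code-interpreter approach discussed in the comment above, this is roughly the kind of code a model might write when asked to count letters or words. It is only a minimal sketch: the function names `count_letter` and `count_words` are hypothetical, and it is not taken from the linked chat.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def count_words(text: str) -> int:
    """Count whitespace-separated words in a text."""
    return len(text.split())


if __name__ == "__main__":
    print(count_letter("strawberry", "r"))     # prints 3
    print(count_words("the quick brown fox"))  # prints 4
```

The point made in the thread is that writing even this much code forces the task to be decomposed into explicit steps, which may be where the improvement comes from.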

Unrelated to the previous: I have a strong feeling ARC AGI will be solved before the end of the month. To be honest, I don't know if it will be considered brute force. I now feel that solving ARC AGI doesn't necessarily bring us closer to AGI... perhaps only if it is solved in a specific way. I feel that is what you expressed too.

Robert Gomez-Reino

Awesome. I listened to half on the way in to see Deadpool with my wild girls, and half on the way back from the cinema. So many questions that I cannot go to sleep with them :D First, I just want to share that I was screaming for a benchmark like this. TBH, the results are consistent with our experience. But before, we were hesitant (could it be some suboptimal prompting? some wrong CoT prompt flow, etc.). Now we know there is a benchmark, and we believe it is very useful. 1) You say in the video that some AI vendors/labs might want to test against the benchmark. Sounds great. What about wrappers? That could be an enormous tool; I would love to be able to first optimize my CoT approach for general reasoning, and then focus on the specific task. Would that make sense in your opinion? 2) The boat one is magic. No way they can get it right. It's an awkward question though. If I were asked it myself, I would wonder whether the person asking the question had made a mistake, or whether to assume an infinite boat. But LLMs don't say that they assume that. 3) The race one, though: I do manage to get LLMs (GPT-4o and Sonnet) to answer it correctly and consistently. Maybe I didn't understand the answer correctly, or you missed some details? 4) Breadcrumbs: I have found many times that simply telling the model to be methodical, and asking it to solve the task from first principles instead of just using knowledge, solves some questions that it otherwise consistently failed. Is it worth trying the benchmark with such a simple and generic (so not related to the specific question) system prompt? I would be happy to discuss this one. Indeed, knowledge helped LLMs to somehow store "concepts", but I also feel that for a while it has contaminated their reasoning. I guess LLMs will evolve into something that can decide when not to use that reasoning (as humans do) and go down a level of abstraction, all the way to base principles if needed. Amazing work! Looking forward to hearing more. And I will be spreading the word in the wild: this benchmark is the right approach.

Robert Gomez-Reino

You stay being the man. Thank you for your hard work. Also, can you give any indication of whether you plan to release anything on AlphaProof or Nvidia's Gr00t? I'm going crazy constantly refreshing my feed lol

r

In the middle of the podcast. With ChatGPT, in those situations I explicitly ask it to use code interpreter, and it gets it right most of the time. For example, counting words in a text and stuff like that. The typical things that humans would have to sit and think about more carefully: for those I ask for code interpreter.

Pablo Rodríguez

Is there some way you can protect your benchmark inputs from being logged and used to train future models? For locally run open-weights models, I guess that's not a problem, but for ChatGPT, Claude, etc.?

Adrian Schmidt

This is huge and so exciting!! 👏 Can't wait to see the website for SIMPLE. It's so important. Def run 5 samples for stability, and consider versioning it (SIMPLE-1) to hint at the future. Any benchmark with high scores (say >80%) has no room for growth, so it's quite irrelevant or saturated. I love that SIMPLE starts with scores such as 8% because it leaves room for growth, and leaves room for being excited by tech and advancements once again. Very well done, and looking forward.

Enrico Ros
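
One way the "5 samples for stability" suggestion above could be aggregated is a simple majority vote per question. The sketch below assumes multiple-choice letter answers and a hypothetical helper named `majority_answer`; it is an illustration only, not Philip's actual evaluation code, and the sample answers are made up.

```python
from collections import Counter


def majority_answer(samples: list[str]) -> tuple[str, float]:
    """Return the most common answer and the fraction of samples agreeing with it.

    Answers are normalised (stripped, upper-cased) before voting.
    Ties are broken in favour of the answer that appeared first.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(s.strip().upper() for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


if __name__ == "__main__":
    # Hypothetical answers from five runs of the same question.
    samples = ["B", "b", "C", "B ", "A"]
    answer, agreement = majority_answer(samples)
    print(answer, agreement)  # prints: B 0.6
```

The agreement fraction gives a rough sense of how stable a model's answer is across runs, separate from whether the majority answer is correct.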

I love this, can you make 8-hour podcasts about LLM benchmarks?

Alfredo Maria Fomitchenko

Thanks for releasing this early. I also wanted to note how impressive your speaking is. No script, just an exceptional verbal fluency.

Jonathan Kirk

Didn't want you guys to wait, so here it is early!

Philip

