Simple Bench Exclusive Tour: I couldn’t find a good reasoning benchmark, so I made one.
Added 2024-08-19 14:44:44 +0000 UTC
Full results from the first Simple Bench run (including latest model updates), the new website, more insight into the questions and what the gaping hole in basic reasoning means, plus my plans going forward.
Thanks for the video.
Really liking your benchmark. I think it complements ARC well.
Besides, I joined Patreon with the price decrease. I'm glad you chose this; no doubt it will prove beneficial!
Alexandre Fruchaud
2024-09-22 19:53:44 +0000 UTC
It would be amazing if you could make a ComplexBench that relies on similar concepts but even smart humans find it difficult. It would be really interesting to see how language models fare against humans there
Gilad
2024-09-20 09:23:04 +0000 UTC
My actual preamble prompt is a bit longer, but you get the idea. It is possible to get the correct answer if you tell the AI model that you are about to ask it tricky questions.
Hugh Jackman
2024-09-20 03:15:43 +0000 UTC
Philip, this comment is basically asking you to implement your idea of letting people run Simple Bench themselves. I'd also love that :)
Sam
2024-09-19 14:09:25 +0000 UTC
Philip,
Have you tried prefacing your Simple Bench questions with the instructions below?
When approaching problems that include real-world scenarios, especially those asking for the 'most realistic' answer: Carefully read the entire question, paying special attention to any details that describe the situation, environment, or conditions. Consider how these real-world conditions might affect the elements of the problem. Look for details that could override straightforward interpretations or calculations. Apply basic principles of logic, science, and common sense to understand how the described scenario would actually play out in reality. If the scenario involves changes over time or sequential events, track how each step affects the overall situation.
I don't have your dataset, but I believe that the preamble will help the LLM in reaching the correct answer for your simple-bench questions.
Hugh Jackman
2024-09-17 05:03:33 +0000 UTC
I just checked this first question with ChatGPT o1-preview and it got it perfectly.
Will you be able to get API access to run the whole batch of tests on o1?
Adin Softic
2024-09-12 21:39:22 +0000 UTC
Google does not sleep and has released another experimental version in August: gemini-1.5-pro-exp-0827. This seems like another argument for versioning the LLMs strictly by the respective platforms' keys, preferably the dated version.
Pavol Vaskovic
2024-09-06 14:45:43 +0000 UTC
This is a wonderful job, Philip, well done!! Hope to see this advance further in the near future!!
Juanjo do Olmo
2024-09-01 22:55:10 +0000 UTC
I would like to see Pi tested with this benchmark. It doesn't seem to get much love, but it is my favorite LLM to chat with.
Jason Dowd
2024-08-30 05:42:47 +0000 UTC
Instant bookmark. This looks like not only a more realistic comparison between human ability and LLMs but also a more accurate LLM leaderboard. Thanks, Philip.
Kol Tregaskes
2024-08-26 20:02:32 +0000 UTC
I think this would be really valuable for public understanding and confidence in the SIMPLE benchmark. Having a couple hundred public questions similar to the secret ones would allow us to experiment with prompts, understand differences between LLMs better, critique the questions and responses, and give better-informed feedback on the whole project.
Alexis Olson
2024-08-25 18:33:54 +0000 UTC
Thanks! Most of the questions were actually generated using RAG based on industry training materials. I then quality-checked them to ensure they all made sense and that I had high confidence in them. I need to do a new testing run at some point to update the leaderboard; you can see the last one in the repository's readme file.
Sean Betts
2024-08-25 17:07:33 +0000 UTC
I'd love to participate. I will ping you in Discord PM
Eugene Vyborov
2024-08-25 15:42:40 +0000 UTC
Happy for you to share it!
Philip
2024-08-25 14:06:09 +0000 UTC
Great shout!
Philip
2024-08-25 14:06:00 +0000 UTC
If you think of a way, let me know! Currently hand-crafting it, soon though likely alongside some top figures at the leading labs.
Philip
2024-08-25 14:05:51 +0000 UTC
Yes, came out just before this run! Hence the improved performance compared to my tweet!
Philip
2024-08-25 14:05:08 +0000 UTC
Hold the line Cyril! These fragilities, as you mention, are still critical, and somewhat resistant to the ubiquitous hype. We need more people like you!
Philip
2024-08-25 14:04:40 +0000 UTC
That's an amazing repository! Would love to know how you sourced them and what the current leaderboard is.
Philip
2024-08-25 14:03:45 +0000 UTC
This is a key point, and will be addressed in more detail in the near future. After SmartGPT, I am highly cognisant of latent performance increases!
Philip
2024-08-25 14:01:33 +0000 UTC
Coming at some point Steven!
Philip
2024-08-25 14:00:42 +0000 UTC
I agree Kavin, but I kept in some questions that work with CoT as it will be interesting to many what models can do by default. Remember though that many examples are impenetrable even to any amount of 'Cotting'
Philip
2024-08-25 14:00:21 +0000 UTC
It is a great point that one 'tier' of questions on SIMPLE can yield noticeably increased performance through CoT and other techniques. They were not discarded, however, because (as Robert mentions) the core, unaided 'intelligence' of models is still interesting. However, keep in mind that many tiers of questions on SIMPLE are not answered correctly with any CoT or self-consistency sampling of the dozens I have tried, including SmartGPT 2.0. Less 'tricks', more total blindspots.
Philip
2024-08-25 13:59:12 +0000 UTC
Absolutely, great points. While it is targeted at core 'human' intelligence, it is certain that sub-par models on SIMPLE, like GPT-mini, could still be incredibly cost-effective for certain use-cases.
Philip
2024-08-25 13:56:00 +0000 UTC
Great question. Truth is, I initially wanted 0.7 x 5 self-consistency samples, but the immense delays involved made this unrealistic for now, so I went for the most straightforward implementation. First obvious iteration would be to test (SmartGPT-style) prompting strategies on Simple.
Philip
2024-08-25 13:55:08 +0000 UTC
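For readers unfamiliar with the "0.7 x 5 self-consistency" setup mentioned in the comment above: one reading is five samples drawn at temperature 0.7, with the most frequent answer taken as final. The following is a minimal sketch under that assumption, not the actual Simple Bench harness. The `ask_model` callable and the `make_stub_model` helper are hypothetical stand-ins for a real API client.

```python
from collections import Counter
from typing import Callable, Iterable, List


def self_consistency_answer(
    ask_model: Callable[[str, float], str],  # hypothetical: (question, temperature) -> answer
    question: str,
    n_samples: int = 5,
    temperature: float = 0.7,
) -> str:
    """Sample the model n_samples times and return the most common answer."""
    answers: List[str] = [
        ask_model(question, temperature).strip().upper() for _ in range(n_samples)
    ]
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def make_stub_model(canned: Iterable[str]) -> Callable[[str, float], str]:
    """Hypothetical stand-in for an API client: replays canned answers in order."""
    replies = iter(canned)

    def ask(question: str, temperature: float) -> str:
        return next(replies)

    return ask


# Usage with canned replies instead of a real model call.
stub = make_stub_model(["A", "b", "B", "C", "B"])
final = self_consistency_answer(stub, "Which option is most realistic?")
```

Each question costs `n_samples` API calls under this scheme, which is consistent with the delays the comment describes.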
Great idea Andrew. I hadn't thought about that but it's a great point and yes, all data will be stored and eventually released.
Philip
2024-08-25 13:53:46 +0000 UTC
Thanks Enrico, I will always be transparent with you guys about what is going on, rest assured.
Philip
2024-08-25 13:53:10 +0000 UTC
One of many good options. Not as focused on citations at the moment, but looking good now to scale up to 1000+ with some help from some elite folks at the top labs.
Philip
2024-08-25 13:52:38 +0000 UTC
Haha, appreciate the rigor Pavol! I called the new Gemini 1.5 Pro as 08-24, as in Aug 2024, and therefore felt OCD about making the GPT-4 version used follow the same format, month then year. I also noticed that this contravenes how it is normally showcased - 0613, but then what would I call the new Gemini 1.5 Experimental?
Philip
2024-08-25 13:51:17 +0000 UTC
Happy for you to mention it Enrico. I should have clarified that the website itself is not private, but all the exclusive info and updates are just for you guys!
Philip
2024-08-25 13:49:53 +0000 UTC
Agree on all points Blixt. Needs more info on the website (working on it) and also a 'prompt-optimised leaderboard'. I still stick by my point though that in the extreme any prompt can solve any arbitrary benchmark (e.g. a permutated prompt that eventually became a set of answers to all questions)
Philip
2024-08-25 13:49:06 +0000 UTC
Yes, so many people have mentioned a separate tiering/leaderboard based on prompting, it is inevitable that this will be included in future.
Philip
2024-08-25 13:47:12 +0000 UTC
haha, glad to hear Dylan. Might one day usurp your channel's coverage of Dr. Thompson's AGI countdown, you never know!
Philip
2024-08-25 13:46:17 +0000 UTC
Thanks Lee, I think it was about Grok 2 having an API but Ideogram is epic, so any insider news you have there is great too
Philip
2024-08-25 13:45:41 +0000 UTC
Alas not, the privacy is part of the point! But I may invite selected individuals (inc. via an NDA) to sample and review.
Philip
2024-08-25 13:45:07 +0000 UTC
Philip, is there anywhere I can download the questions/answers for the benchmark to test it out on my agent?
Eugene Vyborov
2024-08-25 07:54:13 +0000 UTC
Yes, Philip, the website leaderboard might benefit from more clarity on the precise model tested. Btw, with many people reporting Sonnet 3.5 getting "dumber", wouldn't it be great to know for real whether this is happening? Would that show on SIMPLE-bench? Is it real? Or is it some user effect, induced as we advance to better models and keep trying to push the bar?
Robert Gomez-Reino
2024-08-23 07:42:05 +0000 UTC
- I agree with the required prompt/flow/concept engineering
- However, rather than "a weak reasoner with undeveloped concepts", I say it's capable of interpolating between patterns that it's trained on/familiar with. And I would also say that many humans are similarly limited in their reasoning by interpolating between patterns that they know. It takes a certain level of intelligence (dare I say IQ) to reason beyond that.
I guess what I'm saying is that I absolutely agree with the limitations that we discuss here, but I also don't want to lose the sense of awe that I have of working with and on these really powerful AI solutions that simply weren't possible 5 years ago.
Erik
2024-08-23 06:07:42 +0000 UTC
Messaging here, Philip, as it'll get lost on your YouTube channel, and I'll pop a message in Discord too. I was on the Ideogram 2.0 announcement today and heard about them now having an API beta key. You mentioned you would be interested in them having an API.
Lee FRASER
2024-08-23 00:20:14 +0000 UTC
I have discovered a love of benchmarks because of your work!
Dylan Curious (YouTube)
2024-08-22 21:30:20 +0000 UTC
"Humans don't need any prompting", I think this is slightly wrong. The way that we think about a question is dependent on the context that we find it. If the question is part of your test set, we know that it is some kind of trick reasoning question. I think there is a level of context that we can provide the model that wouldn't be considered over-prompting. I.e. letting the model know that the question is part of an advanced reasoning benchmark test, perhaps providing another example question without the answer?
At least I'd be interested to know if this caused a statistically significant increase in model capabilities.
Ivo Denisovich
2024-08-22 13:12:45 +0000 UTC
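One way to check whether added context produces a statistically significant increase, as asked above, is a two-proportion z-test on the number of correct answers with and without the extra context. The sketch below assumes two independent runs; the counts in the usage lines are made-up illustrations, not Simple Bench results. If both runs use the same questions, a paired test such as McNemar's would be the more appropriate choice.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def two_proportion_z_test(
    correct_a: int, total_a: int, correct_b: int, total_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in accuracy, B minus A."""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    # Pooled accuracy under the null hypothesis that both prompts perform equally.
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0  # all answers right or all wrong: no evidence either way
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 20/100 correct without context, 40/100 with context.
z, p = two_proportion_z_test(20, 100, 40, 100)
significant = p < 0.05
```

With a benchmark of around a hundred questions, small gains are hard to separate from noise, so the sample size matters as much as the observed difference.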
I assumed it was officially released :D as it is likely indexed and shared already by many. So, I already posted it a few days ago. I hope that's OK with Philip :D
Robert Gomez-Reino
2024-08-21 11:40:32 +0000 UTC
Great work, the curation of this set of questions is tremendously valuable!
I would love to see a full description of how tests are run on the About page, such as system prompt, user message formatting (if it's identical to what was shown, then stating that explicitly would help reduce doubt).
One more thing that is maybe a bit contentious, but I do believe it would be nice to keep the current results as "raw" and add "guided" results where a prompt is designed to encourage the LLM to delay its answer until after it has performed basic reasoning. As you said in your video, once LLMs have output an answer they're pretty much locked in, and this is fundamentally part of their token stream and will only change once LLMs are able to change their previously output tokens, or at least disregard them (negative attention?). Some LLMs have been fine-tuned to emulate this by way of CoT and similar, and some chat UIs have this prompting built in, and I think trying to level the playing field across all APIs with an explicit prompt would add some fairness to measuring the potential of each LLM.
(Edited to add: I completely agree that no specific prompts for subsets of the benchmark should be made, since that would be the writer doing the reasoning. Just arguing for trying to glimpse the optimal reasoning capability of any LLM that has not been fine-tuned to work around the "lock-in" behavior since they were most likely fine-tuned for non-benchmark related tasks where someone does have a prompt indicating its reasoning behavior.)
Let me know if you need any help on expanding the website or adding any functionality, I've got plenty of full-stack experience and happy to help.
Blixt
2024-08-21 09:52:36 +0000 UTC
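The "raw" versus "guided" split proposed above could be implemented by sending each question twice: once unchanged, and once with a single generic instruction to reason before answering. This is a minimal sketch of that idea, not the Simple Bench setup. The wording of `GUIDED_SUFFIX`, the `Final answer:` convention, and the A to F option range are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical generic instruction, applied identically to every question so
# that the prompt author does no question-specific reasoning for the model.
GUIDED_SUFFIX = (
    "Reason through the scenario step by step before committing to an answer. "
    "End with a line of the form 'Final answer: <letter>'."
)


def build_prompt(question: str, guided: bool = False) -> str:
    """Return the question unchanged ("raw") or with the generic suffix ("guided")."""
    return f"{question}\n\n{GUIDED_SUFFIX}" if guided else question


def extract_final_answer(response: str) -> Optional[str]:
    """Take the last 'Final answer: X' in the response, ignoring earlier guesses."""
    matches = re.findall(
        r"final answer:\s*\(?([A-F])\)?", response, flags=re.IGNORECASE
    )
    return matches[-1].upper() if matches else None


raw_prompt = build_prompt("How many ice cubes are left in the pan?")
guided_prompt = build_prompt("How many ice cubes are left in the pan?", guided=True)
picked = extract_final_answer("At first glance B. On reflection... Final answer: (d)")
```

Scoring only the last stated answer is one way to give a model room to revise an early guess, which is the "lock-in" behavior the comment is concerned with.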
When's the official release? I want to retweet it from my ai account.
Enrico Ros
2024-08-21 06:54:17 +0000 UTC
What’s Gemini 1.5 Pro (08-24)? Given today is 21st of August 2024, seems like a model from the future. But to ground my better guess in reality I will assume you are talking about gemini-1.5-pro-exp-0801 that’s available in Google’s AI Studio as experimental preview? Rather than riffing on OpenAI’s dubious model labeling convention, I suggest you use the specific key used to select the model in their API or use the exact names as specified by the authors. I find the renaming of models extremely confusing. (Is there really a “Turbo” version of Llama 3 405B?! AFAIK the revision is 3.1 and not just 3… Sorry to harp on this, but my OCD simply can’t tolerate such imprecisions.)
Pavol Vaskovic
2024-08-21 06:53:31 +0000 UTC
If you make more questions, release half of them as open source, you could potentially get tons of citations.
Michaeel Kazi
2024-08-21 06:26:36 +0000 UTC
This is absolutely incredible information. SO precious, and finally we have a benchmark where my subjective impression matches the results 1:1.
4o-mini: what a shame: it's great to have a cheap model, it's criminal to make it overfit benchmarks (but not Simple Bench!)
Please keep Simple Bench secret, even though large companies will throw Very Large amounts of money your way for "partnerships" or "early access". You are great because you have your unique way.
Enrico Ros
2024-08-21 06:18:39 +0000 UTC
Philip, love that you've shipped this exciting new benchmark! Congrats!!!
You mentioned that you're going to try and keep Simple Bench updated as often as possible. Is your intention to keep all prior data publicly available so someone can potentially create something like this benchmark battle animation over time? https://x.com/amebagpt/status/1825643117124743259
I think that would be a really nice way to see how various models ebb and flow over time when graded by Simple Bench.
Andrew Thompson
2024-08-20 22:00:34 +0000 UTC
Why the temperature: 0.2 choice? Isn't this worse for reasoning? more likely decoherence?
Robert Gomez-Reino
2024-08-20 18:42:03 +0000 UTC
But I think the whole point of this benchmark is to test the model as they are. No system prompt. Of course, there is for any task an EXACT prompt that will solve that task. And you can use generic ones that help in multiple types of tasks too. But the best way to test LLMs to know their current native power is "naked". Like this, I know that Sonnet has much better-grounded general ideas or concepts and is better at anchoring my request to those, so it will likely be MUCH better when I use some specific prompting approach. I don't want Philip & team to test models with what they might believe is the best prompting approach. Like this looks perfect for now to me :)
Robert Gomez-Reino
2024-08-20 18:37:04 +0000 UTC
I think both are compatible. They are indeed still weak reasoners (much dumber than humans), and that's why for our real use case solutions we need quite some prompt/flow/concept engineering. We are basically grounding and developing otherwise undeveloped concepts in the model with the prompting. Just the way I see it (and I believe it will be proven soon).
Robert Gomez-Reino
2024-08-20 18:27:42 +0000 UTC
One more comment from my side: this benchmark, like any other, should be used in context. What is the best model for "my own" use case? For sure, I am inclined towards the one with the best reasoning performance. Our specific tasks a) relate to cyberphysical control systems (so good reasoning on real-world scenarios is very much needed) and b) training material for models on OT was (and is) likely scarce. However, we still need to take into account:
- adherence to structured output?
- function calling? Does this GPT4-preview support that?
- context size? Same question: is this 32k only?
- speed
- cost... well, for our use case it won't be the top criterion (what is the cost of doing human-equivalent work for some million tokens...) but still something to consider
So still, some work to do. But SIMPLE Bench surely is a big help.
Robert Gomez-Reino
2024-08-20 14:19:17 +0000 UTC
Philip, GPT4 0623 I assume is a typo: 0613 should be the right one?
Robert Gomez-Reino
2024-08-20 13:48:10 +0000 UTC
Wow... Turbo Preview shows up in the playground but not on the OpenAI pricing page? I don't find it in the Azure OpenAI models either.
At first I thought it was gpt-4-turbo-2024-04-09 which is in the pricing page: https://openai.com/api/pricing/ btw, 4 times the price of the latest omni.
But then, if you go to the platform playground, you see you can choose gpt-4-turbo-2024-04-09 and also gpt-4-turbo-preview...
Maybe someone from OpenAI here can clarify?
Also, maybe someone from Azure OpenAI can reach out to clarify whether this model is available at all in Azure?
Robert Gomez-Reino
2024-08-20 13:32:20 +0000 UTC
Interesting work Philip, but as others said, prompting is a major issue. It shouldn't be an "afterthought" in assessing the full reasoning capabilities of the models. It's common for models to solve "viral puzzles that show how dumb models are" with a bit of prompting, even just plain chain of thought.
In your "bathtub man" problem, you can see that GPT-4o, GPT-4T and Gemini 1.5 Pro pick an option right away and then stick with it. And Gemini gets the essence of the matter afterwards, during explaining, but the bias to stick with the initial mistake is too heavy.
On the other hand, Claude Sonnet 3.5 first does some rudimentary CoT before responding, even with the same prompt. Its success could be due to fine-tuning that makes it think before responding (it also has an invisible thinking function now). You can see that when you add CoT to GPT-4o it also gets it: https://chatgpt.com/share/aae86a25-e4a3-47bb-9583-3e5133c7fc09.
And there could still be some subtle but important issues with the system prompt in certain questions. When a human takes the test, they know to expect trick questions. But when the model takes it, its default is to try and be a helpful assistant in everyday situations, which biases it to try to reinterpret the question in a more likely manner. In a previous video you presented an example where the model was "fooled" in a medical diagnosis despite being told about a gunshot wound. Simple CoT showed GPT-4o detecting the trick detail but discarding it as implausible https://chatgpt.com/share/9c434151-5354-4aea-a943-6f2788f4f0fa. This needs a careful system prompt to overcome.
Andronikos Koutroumpelis
2024-08-20 09:40:59 +0000 UTC
I think most of the time the LLMs fail to answer correctly because they don't even understand the question correctly, rather than because they fail at reasoning. Also, these LLMs answer better when prompted to "Think step by step". I believe every system prompt should by default tell the LLM to think step by step because, after all, an LLM's nature is to predict the next immediate token, which corresponds to step-by-step thinking. I am sure the LLMs would have scored better with a step-by-step thinking prompt. The only LLM I believe does step-by-step thinking by default is Claude 3.5 Sonnet (maybe due to it using its default internal tags), and maybe that's why it gives better answers than the rest. What do you think, Philip?
For now, one thing I keep in mind before prompting an LLM is to think of it as a drugged genius who is always in a half-awake state. I also think of them as lazy machines. They won't do something unless you explicitly ask them to, even if they know how to. After all, sometimes humans behave the same way too!
kavin vikram
2024-08-20 08:02:07 +0000 UTC
Thanks for the new site! It sounds like you already have a bunch of new features planned (so this might already be on your list), but it would be great to have plots showing time series data on how performance has changed over time.
Steven Fazzio
2024-08-20 04:34:16 +0000 UTC
Great work Philip! One thought: Specifically tweaked prompts that increase performance on that specific task may not lead us to AGI level models, but they DO have a lot of business value.
Some people will see your videos and say “you see, these models are dumber than humans and can’t solve my business challenges”.
That’s far from the truth though: I make my living by doing just that: solving business challenges with AI.
Erik
2024-08-19 23:49:31 +0000 UTC
Here is an ice cube in a pan "recipe" for perfect reheating of leftover slices of deep pan pizza. All needs love 🍕 right! So to bring that dry cold crust and that congealed cheese back to life, follow these simple steps for crispy and delicious melty next day scran. Put a non-stick frying pan on and warm to a medium heat. Add a little oil (a 1/2 teaspoon drizzle) and add the cold pizza. Cook on medium heat for 2 minutes, then (with the oil absorbed into the pizza base to crisp it) add a few ice cubes to the pan and immediately cover to create steam so that the cheese cooks and melts.
The ice worked better than water at this point in testing and the crust didn't get soggy! After a minute, remove the lid, slide your slice onto a plate. And hey presto, perfect reheated pizza.
Lee FRASER
2024-08-19 22:34:26 +0000 UTC
Congrats on Simple Bench - it’s a great project and I’m really looking forward to seeing if progress is made against it with the next generation of models.
I have my own marketing benchmarks (2,800+ questions) that I’m keeping private for now for similar reasons. If you ever want to swap notes on testing scripts, scoring etc please let me know!
Sean Betts
2024-08-19 20:22:09 +0000 UTC
Thank you Philip for all these insider videos!
In the company that I work for, there are currently abnormal and unrealistic expectations of what GenAI can do - including thoughts like letting go of hundreds of software engineers.
Your rigorous reasoning and proofs help me to at least somewhat stand my ground and hold a defensive position. Although I am afraid that when all is said and done and the dust settles, leadership will come to its senses and realize what mistakes were made because of unrealistic expectations.
It's like 2022 again but in reverse => Instead of expecting huge growth and hiring unreasonable amounts of people, huge companies now want that sweet piece of magical cake of infinite productivity without actual people doing the work.
Even still, this technology is a miracle. However, leadership often doesn't realize that to reap benefits from GenAI, real hard engineering work has to be invested over months and years.
Everyone is looking for quick wins. But I think that those who are stable and understand the strengths and weaknesses of GenAI will prevail and become the winners of the Gartner Hype Cycle.
That's why I see so much value in your work - stable and hard work.
Cyril Sadovsky
2024-08-19 20:16:48 +0000 UTC
LLM content moderation is a big focus for me, and I feel like your comparison of "interpolation" vs "simulation" unlocked something in my head, and I'm particularly intrigued by how to apply this nuance to Conversational AI. The back-and-forth of a conversation with an AI feels like a kind of language "interpolation", one that can be easily guided to undesired content. So, what if there was an added layer of "simulation", and could it enhance the AI's responses? I'm playing with the idea of creating a system that would assess the dynamics between the user and the AI, almost like it's forming an opinion about the user's intentions. The results could be used as a framing device, potentially improving the AI's ability to generate appropriate responses. It's just a concept that came to mind and that I kind of want to explore. I'd love to hear what others think about this approach and learn if there are similar ideas in the field.
Blake Chambers
2024-08-19 18:36:57 +0000 UTC
Is GPT4o here the 2024-08-08 version?
Alexis Olson
2024-08-19 17:16:11 +0000 UTC
Is there a way to crowd-source a benchmark like this without massive contamination risk? It would be cool to have a public or semi-private version that held up as long as it wasn't specifically Goodharted.
Alexis Olson
2024-08-19 17:15:22 +0000 UTC
I love it. I wonder if instead of giving a score you could categorize the score across different fields, assumptions or something like that (radar chart). I do have some connections to VCs in USA, in case that is handy I'm happy to help
Pablo Rodríguez
2024-08-19 17:02:00 +0000 UTC
This is beautiful! When do you plan to share the website link publicly? I really want to ask the Humane Intelligence Discord what they think about the SIMPLE benchmark.
Blake Chambers
2024-08-19 16:47:52 +0000 UTC
Yes like with AlphaProof. Prompting the system millions of times over to produce reality-grounded hypotheses then running an evolutionary algorithm over it seems to be producing incredibly promising results. Perhaps our hyper-fixation on one-shot benchmarks is leading us astray from the ultimate goal of producing a better reasoning machine.
r
2024-08-19 16:46:52 +0000 UTC
100%
Robert Gomez-Reino
2024-08-19 15:52:21 +0000 UTC
let's goo!!! yeah!!
Robert Gomez-Reino
2024-08-19 15:51:34 +0000 UTC
Okay so what if you give it several passes to reason? Great questions BTW and solid reasoning (on your part) about the nature of the questions.
David Shapiro
2024-08-19 15:14:53 +0000 UTC
"I'm addicted to papers please help" - nope this is too entertaining for the rest of us!