AIExplained - Simple Bench Exclusive Tour: I couldn’t find a good reasoning benchmark, so I made one.

Thanks for the video. Really liking your benchmark. I think it completes ARC well. Besides : joined Patreon with the price decrease. I'm glad you choose this, no doubt this will prove beneficial !

Alexandre Fruchaud

2024-09-22 19:53:44 +0000 UTC

It would be amazing if you could make a ComplexBench that relies on similar concepts but even smart humans find it difficult. It would be really interesting to see how language models fare against humans there

Gilad

2024-09-20 09:23:04 +0000 UTC

my actual preamble prompt is a bit longer but you get the idea. It is possible to get the correct answer if you tell the AI model that you are about to ask it tricky questions.

Hugh Jackman

2024-09-20 03:15:43 +0000 UTC

Philip, this comment is basically asking for implementing your idea about letting us people run the simple bench, I'd also love that :)

Sam

2024-09-19 14:09:25 +0000 UTC

Philip, Have you tried preamble your simple bench questions with the below instructions? When approaching problems that include real-world scenarios, especially those asking for the 'most realistic' answer: Carefully read the entire question, paying special attention to any details that describe the situation, environment, or conditions. Consider how these real-world conditions might affect the elements of the problem. Look for details that could override straightforward interpretations or calculations. Apply basic principles of logic, science, and common sense to understand how the described scenario would actually play out in reality. If the scenario involves changes over time or sequential events, track how each step affects the overall situation. I don't have your dataset, but I believe that the preamble will help the LLM in reaching the correct answer for your simple-bench questions.

Hugh Jackman

2024-09-17 05:03:33 +0000 UTC

I checked now this first question with chatgpt o1-preview and it got it perfectly. Will you be able to get the API access for the whole batch of testing of the o1?

Adin Softic

2024-09-12 21:39:22 +0000 UTC

Google does not sleep and releases another experimental version in August: gemini-1.5-pro-exp-0827. Seems like another argument for sticking to versioning the LLMs by strictly using the respective platforms keys, preferably the dated version.

Pavol Vaskovic

2024-09-06 14:45:43 +0000 UTC

this is wonderful job Philip, well done!! hope to see this advance further in the near future!!

Juanjo do Olmo

2024-09-01 22:55:10 +0000 UTC

I would like to see Pi tested with this benchmark. It doesn't seem to get much love, but it is my favorite LLM to chat with.

Jason Dowd

2024-08-30 05:42:47 +0000 UTC

Instant bookmark, this looks not only a more realistic comparison between human ability and LLMs but also a more accurate LLM leadership. Thanks, Philip.

Kol Tregaskes

2024-08-26 20:02:32 +0000 UTC

I think this would be really valuable for public understanding and confidence in the SIMPLE benchmark. Having a couple hundred public questions similar to the secret ones would allow us to experiment with prompts, understand differences between LLMs better, critique the questions and responses, and give better-informed feedback on the whole project.

Alexis Olson

2024-08-25 18:33:54 +0000 UTC

Thanks! Most of the questions were actually generated using RAG based on industry training materials. I then quality checked them to ensure they all made sense and had a high confidence in them. I need to do a new testing run at some point to update the leaderboard, you can see the last one in the repository's readme file.

Sean Betts

2024-08-25 17:07:33 +0000 UTC

I'd love to participate. I will ping you in Discord PM

Eugene Vyborov

2024-08-25 15:42:40 +0000 UTC

Happy for you to share it!

Philip

2024-08-25 14:06:09 +0000 UTC

Great shout!

Philip

2024-08-25 14:06:00 +0000 UTC

If you think of a way, let me know! Currently hand-crafting it, soon though likely alongside some top figures at the leading labs.

Philip

2024-08-25 14:05:51 +0000 UTC

Yes, came out just before this run! Hence the improved performance compared to my tweet!

Philip

2024-08-25 14:05:08 +0000 UTC

Hold the line Cyril! These fragilities, as you mention, are still critical, and somewhat resistant to the ubiquitous hype. We need more people like you!

Philip

2024-08-25 14:04:40 +0000 UTC

That's an amazing repository! Would love to know how you sourced them and what the current leaderboard is.

Philip

2024-08-25 14:03:45 +0000 UTC

This is a key point, and will be addressed in more detail in the near future. After SmartGPT, I am highly cognisant of latent performance increases!

Philip

2024-08-25 14:01:33 +0000 UTC

Coming at some point Steven!

Philip

2024-08-25 14:00:42 +0000 UTC

I agree Kavin, but I kept in some questions that work with CoT as it will be interesting to many what models can do by default. Remember though that many examples are impenetrable even to any amount of 'Cotting'

Philip

2024-08-25 14:00:21 +0000 UTC

It is a great point that one 'tier' of questions on SIMPLE can yield noticeably increased performance through CoT and other techniques. They were not discarded, however, because (as Robert mentions) the core, unaided 'intelligence' of models is still interesting. However, keep in mind that many tiers of questions on SIMPLE are not answered correctly with any CoT or self-consistency sampling of the dozens I have tried, including SmartGPT 2.0. Less 'tricks', more total blindspots.

Philip

2024-08-25 13:59:12 +0000 UTC

Absolutely, great points. While it is targeted at core 'human' intelligence, it is certain that sub-par models on SIMPLE, like GPT-mini, could still be incredibly cost-effective for certain use-cases.

Philip

2024-08-25 13:56:00 +0000 UTC

Great question. Truth is, I initially wanted 0.7 x 5 self-consistency samples, but the immense delays involved made this unrealistic for now, so went for the most straigtforward implementation. First obvious iteration would be to test (SmartGPT-style) prompting strategies on Simple.

Philip

2024-08-25 13:55:08 +0000 UTC

Great idea Andrew. I hadn't thought about that but it's a great point and yes, all data will be stored and eventually released.

Philip

2024-08-25 13:53:46 +0000 UTC

Thanks Enrico, I will always be transparent with your guys about what is going on, rest assured.

Philip

2024-08-25 13:53:10 +0000 UTC

One of many good options. Not as focused on citations at the moment, but looking good now to scale up to 1000+ with some help with some elite folks at the top labs.

Philip

2024-08-25 13:52:38 +0000 UTC

Haha, appreciate the rigor Pavol! I called the new Gemini 1.5 Pro as 08-24, as in Aug 2024, and therefore felt OCD about making the GPT-4 version used follow the same format, month then year. I also noticed that this contravenes how it is normally showcased - 0613, but then what would I call the new Gemini 1.5 Experimental?

Philip

2024-08-25 13:51:17 +0000 UTC

Happy for you to mention it Enrico. I should have clarified that the website itself is not private, but all the exclusive info and updates are just for you guys!

Philip

2024-08-25 13:49:53 +0000 UTC

Agree on all points Blixt. Needs more info on the website (working on it) and also a 'prompt-optimised leaderboard'. I still stick by my point though that in the extreme any prompt can solve any arbitrary benchmark (e.g. a permutated prompt that eventually became a set of answers to all questions)

Philip

2024-08-25 13:49:06 +0000 UTC

Yes, so many people have mentioned a separate tiering/leaderboard based on prompting, it is inevitable that this will be included in future.

Philip

2024-08-25 13:47:12 +0000 UTC

haha, glad to hear Dylan. Might one day usurp your channel's coverage of Dr. Thompson's AGI countdown, you never know!

Philip

2024-08-25 13:46:17 +0000 UTC

Thanks Lee, I think it was about Grok 2 having an API but Ideogram is epic, so any insider news you have there is great too

Philip

2024-08-25 13:45:41 +0000 UTC

Alas not, the privacy is part of the point! But I may invite selected individuals (inc. via an NDA) to sample and review.

Philip

2024-08-25 13:45:07 +0000 UTC

Philip, is there anywhere I can download the questions/answers for the benchmark to test it out on my agent?

Eugene Vyborov

2024-08-25 07:54:13 +0000 UTC

Yes, Phillip, the website leaderboard might benefit from more clarity on the precise model tested. Btw, with many people reporting Sonnet 3.5 getting "dumber", wouldn't it be great to know for real that this is happening? Would that show on SIMPLE-bench? Is it real? Or some user effect induced as we advanced to better models and keep trying to push the bar?

Robert Gomez-Reino

2024-08-23 07:42:05 +0000 UTC

- I agree with the required prompt/flow/concept engineering - However, rather than "a weak reasoner with undeveloped concepts", I say it's capable of interpolating between patterns that its trained on/familiar with. And I would also say that many humans are similarly limited in their reasoning by interpolating between patterns that they know. It takes a certain level of intelligence (dare I say IQ) to reason beyond that. I guess what I'm saying is that I absolutely agree with the limitations that we discuss here, but I also don't want to lose the sense of awe that I have of working with and on these really powerful AI solutions that simply weren't possible 5 years ago.

Erik

2024-08-23 06:07:42 +0000 UTC

Messaging here Philip as it'll get lost on your youtube channel and I'll pop a message in discord too, I was on the ideogram 2.0 announcement today and heard about them now having an API beta key. You mentioned you would be interested in them having an API.

Lee FRASER

2024-08-23 00:20:14 +0000 UTC

I have discovered a love to benchmarks because of your work!

Dylan Curious (YouTube)

2024-08-22 21:30:20 +0000 UTC

"Humans don't need any prompting", I think this is slightly wrong. The way that we think about a question is dependent on the context that we find it. If the question is part of your test set, we know that it is some kind of trick reasoning question. I think there is a level of context that we can provide the model that wouldn't be considered over-prompting. I.e. letting the model know that the question is part of an advanced reasoning benchmark test, perhaps providing another example question without the answer? At least I'd be interested to know if this caused a statistically significant increase in model capabilities.

Ivo Denisovich

2024-08-22 13:12:45 +0000 UTC

I assumed it is officially released :D as is likely indexed and shared already by many. So, I already posted it a few days ago. I hope ok for Philip :D

Robert Gomez-Reino

2024-08-21 11:40:32 +0000 UTC

Great work, the curation of this set of questions is tremendously valuable! I would love to see a full description of how tests are run on the About page, such as system prompt, user message formatting (if it's identical to what was shown, then stating that explicitly would help reduce doubt). One more thing that is maybe a bit contentious, but I do believe it would be nice to keep the current results as "raw" and add "guided" results where a prompt is designed to encourage the LLM to delay its answer until after it has performed basic reasoning. As you said in your video, once LLMs have output an answer they're pretty much locked in, and this is fundamentally part of their token stream and will only change once LLMs are able to change their previously output tokens, or at least disregard them (negative attention?). Some LLMs have been fine-tuned to emulate this by way of CoT and similar, and some chat UIs have this prompting built in, and I think trying to level the playing field across all APIs with an explicit prompt would add some fairness to measuring the potential of each LLM. (Edited to add: I completely agree that no specific prompts for subsets of the benchmark should be made, since that would be the writer doing the reasoning. Just arguing for trying to glimpse the optimal reasoning capability of any LLM that has not been fine-tuned to work around the "lock-in" behavior since they were most likely fine-tuned for non-benchmark related tasks where someone does have a prompt indicating its reasoning behavior.) Let me know if you need any help on expanding the website or adding any functionality, I've got plenty of full-stack experience and happy to help.

Blixt

2024-08-21 09:52:36 +0000 UTC

When's the official release? I want to retweet it from my ai account.

Enrico Ros

2024-08-21 06:54:17 +0000 UTC

What’s Gemini 1.5 Pro (08-24)? Given today is 21st of August 2024, seems like a model from the future. But to ground my better guess in reality I will assume you are talking about gemini-1.5-pro-exp-0801 that’s available in Google’s AI Studio as experimental preview? Rather than riffing on OpenAI’s dubious model labeling convention, I suggest you use the specific key used to select the model in their API or use the exact names as specified by the authors. I find the renaming of models extremely confusing. (Is there really a “Turbo” version of Llama 3 405B?! AFAIK the revision is 3.1 and not just 3… Sorry to harp on this, but my OCD simply can’t tolerate such imprecisions.)

Pavol Vaskovic

2024-08-21 06:53:31 +0000 UTC

If you make more questions, release half of them as open source, you could potentially get tons of citations.

Michaeel Kazi

2024-08-21 06:26:36 +0000 UTC

This is Absolutely incredible information. SO precious and finally we have a benchmark where my subjective matches 1:1 the results. 4o-mini: what a shame: it's great to have a cheap model, it's criminal to make it overfit benchmarks (but not Simple Bench!) Please keep Simple Bench secret, despite large companies will throw Very Large amounts of money you way for "partnerships" or "early access". You are great because you have your unique way.

Enrico Ros

2024-08-21 06:18:39 +0000 UTC

Philip, love that you've shipped this exciting new benchmark! Congrats!!! You mentioned that you're going to try and keep Simple Bench updated as often as possible. Is your intention to keep all prior data publicly available so someone can potentially create like this benchmark battle animation over time? https://x.com/amebagpt/status/1825643117124743259 I think that would be a really nice way to see how various models ebb and flow over time when graded by Simple Bench.

Andrew Thompson

2024-08-20 22:00:34 +0000 UTC

Why the temperature: 0.2 choice? Isn't this worse for reasoning? more likely decoherence?

Robert Gomez-Reino

2024-08-20 18:42:03 +0000 UTC

But I think the whole point of this benchmark is to test the model as they are. No system prompt. Of course, there is for any task an EXACT prompt that will solve that task. And you can use generic ones that help in multiple types of tasks too. But the best way to test LLMs to know their current native power is "naked". Like this, I know that Sonnet has much better-grounded general ideas or concepts and is better at anchoring my request to those, so it will likely be MUCH better when I use some specific prompting approach. I don't want Philip & team to test models with what they might believe is the best prompting approach. Like this looks perfect for now to me :)

Robert Gomez-Reino

2024-08-20 18:37:04 +0000 UTC

I think both are compatible. They are indeed yet weak reasoners (much dumber than humans), and that's why for our real use case solutions we need quite some prompt/flow/concept engineering. We are basically grounding and developing otherwise undeveloped concepts in the model with the prompting indeed. just the way i see it (and I believe it will be proven soon).

Robert Gomez-Reino

2024-08-20 18:27:42 +0000 UTC

One more comment from my side, this benchmark, like any other, should be used in context. What is the best model for "my own" use case? For sure I am inclined for the most reasoning performance one. Our specific tasks a) relate to cyberphysical control systems (so good reasoning on real world scenarios is very much needed) and b) training material for models on OT was (and is) likely scarce. However, we still need to take into account: - adherence to structure output? - function calling? is this GPT4-preview supporting that? - context size? same, is this 32k only? - speed - cost... well, for our use case wont be the top criteria (what is the cost of doing human equivalent work for some million tokens...) but still to consider So still, some work to do. But SIMPLE Bench surely is a big help.

Robert Gomez-Reino

2024-08-20 14:19:17 +0000 UTC

Philipp, GPT4 0623 I assume is a typo: 0613 should be the right one?

Robert Gomez-Reino

2024-08-20 13:48:10 +0000 UTC

Wow... Turbo Preview shows up in the playground but not in the pricing page of OpenAI? I dont find it either in Azure OpenAI models. At first I thought it was gpt-4-turbo-2024-04-09 which is in the pricing page: https://openai.com/api/pricing/ btw, 4 times the price of the latest omni. But then, if you go to the platform playground, you see you can choose gpt-4-turbo-2024-04-09 and also gpt-4-turbo-preview... Maybe someone from OpenAI here can clarify? Also maybe someone from Azure OpenAI can reach to clarify if this model is available at all or not in Azure?

Robert Gomez-Reino

2024-08-20 13:32:20 +0000 UTC

Interesting work Philip, but as others said, prompting is a major issue. It shouldn't be an "afterthought" in assessing the full reasoning capabilities of the models. It's common for models to solve "viral puzzles that show how dumb models are" with a bit of prompting, even just plain chain of thought. In your "bathtub man" problem, you can see that GPT-4o, GPT-4T and Gemini 1.5 Pro pick an option right away and then stick with it. And Gemini gets the essence of the matter afterwards, during explaining, but the bias to stick with the initial mistake is too heavy. On the other hand, Claude Sonnet 3.5 first does some rudimentary CoT before responding, even with the same prompt. Its success could be due to fine-tuning making it think before respond (it also has an invisible thinking function now). You can see that when you add CoT to GPT-4o it also gets it https://chatgpt.com/share/aae86a25-e4a3-47bb-9583-3e5133c7fc09. And there could still be some subtle but important issues with the system prompt in certain questions. When a human takes the test, they know to expect trick questions. But when the model takes it, its default is to try and be a helpful assistant in everyday situations, which biases it to try to reinterpret the question in a more likely manner. In a previous video you presented an example where the model was "fooled" in a medical diagnosis despite being told about a gunshot wound. Simple CoT showed GPT-4o detecting the trick detail but discarding it as implausible https://chatgpt.com/share/9c434151-5354-4aea-a943-6f2788f4f0fa. This needs a careful system prompt to overcome.

Andronikos Koutroumpelis

2024-08-20 09:40:59 +0000 UTC

I think most of the times the LLMs fails at answering correctly is because they don't even understand the question correctly, rather than failing at reasoning correctly. Also these LLMs answer better when prompted to "Think step by step". I believe every system prompt must default tell the LLMs to think step by step because after all LLMs nature is to predict the next immediate token which corresponds to step by step thinking. I am sure the LLMs would have scored better with step by step thinking prompt. The only LLM i believe that does step by step thinking by default is Claude 3.5 sonnect (may be due to it using its default internal tags) & may be that's why it gives better answers than the rest. What do you think Philip ? For now, one thing i keep in mind before prompting an LLM is to think of it as a drugged genius who is always in a half awakened state. I also think of them as lazy machines. They won't do something unless you explicitly ask them to even if they know how to. After all, sometimes humans behave the same way too !

kavin vikram

2024-08-20 08:02:07 +0000 UTC

Thanks for the new site! It sounds like you already have a bunch of new features planned (so this might already be on your list), but it would be great to have plots showing time series data on how performance has changed over time.

Steven Fazzio

2024-08-20 04:34:16 +0000 UTC

Great work Phillip! One thought: Specifically tweaked prompts that increase performance on that specific task may not lead us to AGI level models, but they DO have a lot of business value. Some people will see your videos and say “you see, these models are dumber than humans and can’t solve my business challenges”. That’s far from the truth though: I make my living by doing just that: solving business challenges with AI.

Erik

2024-08-19 23:49:31 +0000 UTC

Here is an ice cube in a pan "recipe" for perfect reheating of leftover slices of deep pan pizza. All needs love 🍕 right! So to bring that dry cold crust, and that congealed cheese back to life, follow these simple steps for crispy and delicious melty next day scran. Put a non stick frying pan on, and warm to a medium hear. Add a little oil, 1/2 a teaspoon drizzle and add the cold pizza, cook on medium heat for 2 minutes -then with the oil absorbed into the pizza base to crisp it-add a few ice cubes to the pan and immediately cover to create steam so that the cheese cooks and melts. The ice worked better than water at this point in testing and the crust didn't get soggy! After a minute, remove the lid, slide your slice onto a plate. And hey presto, perfect reheated pizza.

Lee FRASER

2024-08-19 22:34:26 +0000 UTC

Congrats on Simple Bench - it’s a great project and I’m really looking forward to seeing if progress is made against it with the next generation of models. I have my own marketing benchmarks (2,800+ questions) that I’m keeping private for now for similar reasons. If you ever want to swap notes on testing scripts, scoring etc please let me know!

Sean Betts

2024-08-19 20:22:09 +0000 UTC

Thank you Philip for all these insider videos! In the company that I work for, there are currently abnormal and unrealistic expectations of what GenAI can do - including thoughts like letting go of hundreds of software engineers. Your rigorous reasoning and proofs help me to at least somewhat stand ground and hold a defensive position. Although I am afraid when all is said and done and the dust settles, leadership will come to senses and realize what mistakes over unrealistic expectations were done. It's like 2022 again but in reverse => Instead of expecting huge growth and hiring unreasonable amounts of people, huge companies now want that sweet piece of magical cake of infinite productivity without actual people doing the work. Even still, this technology is a miracle. However, leadership often doesn't realize, that to reap benefits from GenAI, real hard engineering work has to be invested during months and years. Everyone is looking for quick wins. But I think, that those who are stable and understand strengths and weaknesses of GenAI will prevail and become the winners of the Gartner Hype Cycle. That's why I see so much value in your work - stable and hard work.

Cyril Sadovsky

2024-08-19 20:16:48 +0000 UTC

LLM content moderation is a big focus for me, and I feel like your comparison of "interpolation" vs "simulation" unlocked something in my head, and I'm particularly intrigued by how to apply this nuance to Conversational AI. The back-and-forth of a conversation with an AI feels like a kind of language "interpolation", one that cane be easily guided to undesired content. So, what if there was an added layer of "simulation", and could it enhance the AI's responses? I'm playing with the idea of creating a system that would assess the dynamics between the user and the AI, almost like it's forming an opinion about the user's intentions. The results could be used as a framing device, potentially improving the AI's ability to generate appropriate responses. It's just a concept that came to mind and that I kind of want to explore. I'd love to hear what others think about this approach and learn if there are similar ideas in the field.

Blake Chambers

2024-08-19 18:36:57 +0000 UTC

Is GPT4o here the 2024-08-08 version?

Alexis Olson

2024-08-19 17:16:11 +0000 UTC

Is there were a way to crowd-source a benchmark like this without massive contamination risk? It would be cool to have a public or semi-private version that help up as long as it wasn't specifically Goodharted.

Alexis Olson

2024-08-19 17:15:22 +0000 UTC

I love it. I wonder if instead of giving a score you could categorize the score across different fields, assumptions or something like that (radar chart). I do have some connections to VCs in USA, in case that is handy I'm happy to help

Pablo Rodríguez

2024-08-19 17:02:00 +0000 UTC

This is beautiful! When do you plan to share the website link publicly? I really want to ask the Humane Intelligence Discord what they think about the SIMPLE benchmark.

Blake Chambers

2024-08-19 16:47:52 +0000 UTC

Yes like with AlphaProof. Prompting the system millions of times over to produce reality-grounded hypotheses then running an evolutionary algorithm over it seems to be producing incredibly promising results. Perhaps our hyper-fixation on one-shot benchmarks is leading us astray from the ultimate goal of producing a better reasoning machine.

r

2024-08-19 16:46:52 +0000 UTC

100%

Robert Gomez-Reino

2024-08-19 15:52:21 +0000 UTC

let's goo!!! yeah!!

Robert Gomez-Reino

2024-08-19 15:51:34 +0000 UTC

Okay so what if you give it several passes to reason? Great questions BTW and solid reasoning (on your part) about the nature of the questions.

David Shapiro

2024-08-19 15:14:53 +0000 UTC

"I'm addicted to papers please help" - nope this is too entertaining for the rest of us!

David Shapiro

2024-08-19 14:52:55 +0000 UTC

Looking forward to this one, and it is so needed.

Daniel Henderson

2024-08-19 14:47:02 +0000 UTC

Simple Bench Exclusive Tour: I couldn’t find a good reasoning benchmark, so I made one.

Comments

Thanks for the video. Really liking your benchmark. I think it completes ARC well. Besides : joined Patreon with the price decrease. I'm glad you choose this, no doubt this will prove beneficial !

It would be amazing if you could make a ComplexBench that relies on similar concepts but even smart humans find it difficult. It would be really interesting to see how language models fare against humans there

my actual preamble prompt is a bit longer but you get the idea. It is possible to get the correct answer if you tell the AI model that you are about to ask it tricky questions.

Philip, this comment is basically asking for implementing your idea about letting us people run the simple bench, I'd also love that :)

I checked now this first question with chatgpt o1-preview and it got it perfectly. Will you be able to get the API access for the whole batch of testing of the o1?

Google does not sleep and releases another experimental version in August: gemini-1.5-pro-exp-0827. Seems like another argument for sticking to versioning the LLMs by strictly using the respective platforms keys, preferably the dated version.

this is wonderful job Philip, well done!! hope to see this advance further in the near future!!

I would like to see Pi tested with this benchmark. It doesn't seem to get much love, but it is my favorite LLM to chat with.

Instant bookmark, this looks not only a more realistic comparison between human ability and LLMs but also a more accurate LLM leadership. Thanks, Philip.

I'd love to participate. I will ping you in Discord PM

Happy for you to share it!

Great shout!

If you think of a way, let me know! Currently hand-crafting it, soon though likely alongside some top figures at the leading labs.

Yes, came out just before this run! Hence the improved performance compared to my tweet!

Hold the line Cyril! These fragilities, as you mention, are still critical, and somewhat resistant to the ubiquitous hype. We need more people like you!

That's an amazing repository! Would love to know how you sourced them and what the current leaderboard is.

This is a key point, and will be addressed in more detail in the near future. After SmartGPT, I am highly cognisant of latent performance increases!

Coming at some point Steven!

I agree Kavin, but I kept in some questions that work with CoT as it will be interesting to many what models can do by default. Remember though that many examples are impenetrable even to any amount of 'Cotting'

Absolutely, great points. While it is targeted at core 'human' intelligence, it is certain that sub-par models on SIMPLE, like GPT-mini, could still be incredibly cost-effective for certain use-cases.

Great question. Truth is, I initially wanted 0.7 x 5 self-consistency samples, but the immense delays involved made this unrealistic for now, so went for the most straigtforward implementation. First obvious iteration would be to test (SmartGPT-style) prompting strategies on Simple.

Great idea Andrew. I hadn't thought about that but it's a great point and yes, all data will be stored and eventually released.

Thanks Enrico, I will always be transparent with your guys about what is going on, rest assured.

One of many good options. Not as focused on citations at the moment, but looking good now to scale up to 1000+ with some help with some elite folks at the top labs.

Happy for you to mention it Enrico. I should have clarified that the website itself is not private, but all the exclusive info and updates are just for you guys!

Agree on all points Blixt. Needs more info on the website (working on it) and also a 'prompt-optimised leaderboard'. I still stick by my point though that in the extreme any prompt can solve any arbitrary benchmark (e.g. a permutated prompt that eventually became a set of answers to all questions)

Yes, so many people have mentioned a separate tiering/leaderboard based on prompting, it is inevitable that this will be included in future.

haha, glad to hear Dylan. Might one day usurp your channel's coverage of Dr. Thompson's AGI countdown, you never know!

Thanks Lee, I think it was about Grok 2 having an API but Ideogram is epic, so any insider news you have there is great too

Alas not, the privacy is part of the point! But I may invite selected individuals (inc. via an NDA) to sample and review.

Philip, is there anywhere I can download the questions/answers for the benchmark to test it out on my agent?

Messaging here Philip as it'll get lost on your youtube channel and I'll pop a message in discord too, I was on the ideogram 2.0 announcement today and heard about them now having an API beta key. You mentioned you would be interested in them having an API.

I have discovered a love to benchmarks because of your work!

I assumed it is officially released :D as is likely indexed and shared already by many. So, I already posted it a few days ago. I hope ok for Philip :D

When's the official release? I want to retweet it from my ai account.

If you make more questions, release half of them as open source, you could potentially get tons of citations.

Why the temperature: 0.2 choice? Isn't this worse for reasoning? more likely decoherence?

Philipp, GPT4 0623 I assume is a typo: 0613 should be the right one?

Thanks for the new site! It sounds like you already have a bunch of new features planned (so this might already be on your list), but it would be great to have plots showing time series data on how performance has changed over time.

Is GPT4o here the 2024-08-08 version?

Is there were a way to crowd-source a benchmark like this without massive contamination risk? It would be cool to have a public or semi-private version that help up as long as it wasn't specifically Goodharted.

I love it. I wonder if instead of giving a score you could categorize the score across different fields, assumptions or something like that (radar chart). I do have some connections to VCs in USA, in case that is handy I'm happy to help

This is beautiful! When do you plan to share the website link publicly? I really want to ask the Humane Intelligence Discord what they think about the SIMPLE benchmark.

100%

let's goo!!! yeah!!

Okay so what if you give it several passes to reason? Great questions BTW and solid reasoning (on your part) about the nature of the questions.

"I'm addicted to papers please help" - nope this is too entertaining for the rest of us!

Looking forward to this one, and it is so needed.

More Creators