AIExplained



New Benchmark Madness, But Hope on the Horizon

This video won’t just show you the problem with a range of the most popular benchmarks (though it will do that, from MMLU-Pro to GPQA, GSM8K, LMSYS and more). It will show you a useable path forward, so that we might finally get benchmarks we can trust, that really get to the underlying capacities of our models. Guest-starring an Andrej Karpathy comment!


Karpathy Comment: https://www.youtube.com/watch?v=hVade_8H8mE

Functional Benchmarks – the best solution? https://arxiv.org/pdf/2402.19450

https://twitter.com/_saurabh/status/1769852207925805080

LMSys Leaderboard Updated: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard

Original Claims: https://twitter.com/LiamFedus/status/1790064963966370209

GSM8K Problems: https://arxiv.org/pdf/2405.00332

MMLU-Pro Problems: https://www.reddit.com/r/OpenAI/comments/1ctm8sw/i_ran_the_mmlupro_benchmark_on_gpt4o_some_notes/

And Conversation with Author: https://twitter.com/AIExplainedYT/status/1790711185404199142

MMMU: https://mmmu-benchmark.github.io/

https://mmmu-benchmark.github.io/#leaderboard

https://arxiv.org/pdf/2311.16502

New Gemini Paper: https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf

GPT-4o Benchmarks: https://openai.com/index/hello-gpt-4o/

GPQA Refutation?: https://wp.nyu.edu/arg/can-good-benchmarks-contain-mistakes/

GPQA: https://arxiv.org/pdf/2311.12022


Comments

Philip, what are your thoughts on LiveBench.ai?

Kol Tregaskes

Thank you so much! Much much appreciated.

Armin

https://www.anthropic.com/news/evaluating-ai-systems , pretty sure!

Philip

I remember Phil said in one of his YouTube videos that Claude, in a blog post, referred to errors in the benchmark choices, e.g. wrong answers. Does anyone have a link to that? I spent 2 hours and can't find it :|

Armin

Phil: excellent, incisive coverage of this vital area. Considering what's at stake, reliable, meaningful benchmarks are crucial. The problems you've found may explain why I constantly see the frontier LLMs struggling on basic real-world tasks. And I mean basic -- I'm not asking them to solve the Riemann Hypothesis. E.g, It often takes ChatGPT-4o fifteen attempts to write a working ten-line Python script. And that is with me helping it. They often cannot answer a basic question about what menu to use on DaVinci Resolve, or Premiere Pro. They often cannot answer questions that require knowledge from academic journals. This may be related to the Open Access issue. They often cannot answer questions that require knowledge about patents. There are also significant knowledge domains which a human expert knows but is not well captured by LLM training data. E.g, if you ask the LLMs "Does a RISC CPU use a hard-wired instruction set or a microcoded instruction set?", all of them emphatically say "hard wired." That is incorrect -- almost all modern RISC CPUs use microcoded instructions to some degree. Why are the LLMs wrong? Because decades ago during RISC ascendency, there was lots of trade press about hard-wired instruction sets. Gradually over the decades, RISC microarchitectural approaches evolved, but it was not well documented. Today if you ask a human expert like Jim Keller, he'd accurately explain RISC CPUs use microcode. But our most advanced frontier LLMs don't know that -- their knowledge is biased by what is captured in easy-to-acquire data. This is not a single exotic case. Multiply this by many similar cases where the current data access, data prep, ingest and training system fails to capture this type of knowledge. If these cases are not resolved, it will limit the potential benefit of LLMs to all humankind. 
If the Open Access issue is not fully resolved, you could have a future AGI that cannot answer a question until its human or robot assistant physically runs down to the university library to retrieve a paper journal. Other LLM struggles may involve lack of metadata or labels. E.g., there are many versions of a given piece of software, each version having different menus, parameters, etc. In a recent interview, Dwarkesh Patel asked OpenAI cofounder John Schulman a question which I may have first articulated. It was how the unsupervised, unlabeled learning process can enable LLMs to answer version-specific questions about utilities like FFmpeg, which have many versions with differing syntax. Schulman sort of blew it off, saying they would improve, they know the "man" pages, etc. Except they have not so far. I posted that dilemma in several places because I constantly see problems with LLMs' ability to answer basic things requiring version-specific metadata. It doesn't matter if they get 100% on MMLU, etc. If they can't answer practical questions, their potential benefit will be constrained. Worse, (as you implied) you have decision-makers and potentially regulators trying to make decisions based on benchmarks that are essentially presenting "The Emperor's New Clothes." Lots of people have apparently talked themselves into believing the popular benchmarks are representative, but they are not. If the LLMs aren't being tested in a way that measures real-world performance, you have a trillion-dollar development program just wandering in the dark. Maybe they need a Super Benchmarking team more than a Super Alignment team.

Joe Marler

Sam> I'd be happy to talk. I'm afraid I don't use Discord much, though. If you don't mind, could you e-mail me at tomgally@gmail.com? We can discuss ideas by e-mail or set up a videoconference to chat.

Tom Gally

This is what I've been thinking about as well. I'm a sw engineer and my first AI pet project was book / article summarization. I don't have answers but if you want, I'd like to discuss this with you anyway, I have some ideas I'd like to run by someone. Also would love to hear your thoughts. Reach out to me on Discord (saamcek) if you're interested.

Sam

We can have different working theories of what is going on (or at least that is what I can do). One is that we've seen the shape of LLMs with GPT-4 and from now on all we get is incremental progress, not a breakthrough into capabilities like real reasoning. I don't think that is true, but I'd definitely welcome new benchmarks that would allow us to measure what is actually happening. I'd also REALLY like capability tests done by people who do not make fortunes and careers based on their success. Same goes for safety testing.

Jörg Weiß

First - I am sorry for the typo in the first sentence of my original comment :-) English is not my first language and I don't use any LLM to help me - yet. I am absolutely very close to the "free-will" discussion. I have chosen (I think) to believe I have free will. At the same time - if we are biological machines, our idea of self must be "metaphysical". As a consequence, LLMs could develop a self as well. Final reflection: when LLMs are trained they are, up until now at least, trained on human intellectual production. I find it fascinating that the LLMs can find patterns in that training material that we, as humans, are not aware of. But we must have produced the patterns or else they would not be there. We are just not aware. I will stop there!

Kai Lohre

That's a very philosophical question. If you follow that line of thought far enough you can even bring into question the very concept of free will if you generalise humans to being mere pattern recognisers that generate some output based on the context window of their entire life. But setting the philosophical debate aside, my feeling on this topic is that because the way humans think and the way that LLMs "think" is so different, I am doubtful that the LLM approach can ever truly accomplish something like what we call reasoning. They may get better at imitating it and making us think they can, especially with multimodality, but underneath it all they are doing something that is fundamentally very different. That's not to say that LLMs don't do something useful, because it's clearly been demonstrated that LLMs can be very helpful tools. For example there was this video looking into how LLMs can boost productivity: https://www.patreon.com/posts/do-llms-boost-5-96740243 And in general they're great at quickly pulling out key facts from large amounts of data, much faster than a human could ever hope to do so. They can also generate new ideas incredibly quickly, but lack that basic human reasoning to recognise something that to us is obviously nonsense. This all makes me wonder if we're asking the wrong questions when we try to test LLM's abilities to reason like humans do. They're designed so differently from our brains that isn't the answer just obviously no? LLMs get things wrong which we as humans consider to be trivial, like basic maths problems, but they can also achieve things which we as humans can't, due to the limited capacity and speed of our own brains. Bringing this back therefore to the topic of benchmarking LLMs, I'm starting to become of the opinion that we may be testing for the wrong things by trying to measure their outputs against human strengths. Have you ever come across Goodhart's Law? (I forgot the name but ChatGPT reminded me. 
Fact retrieval, that's something LLMs are great at.) It's often stated as "when a measure becomes a target, it ceases to be a good measure." I feel like this is where we're heading with current benchmarks. LLMs are getting better and better at giving the appearance of human reasoning, while in fact doing something totally different. Instead I feel that we should be more focused on assessing LLMs as LLMs rather than as replacements for humans. So for example, measuring retrieval speed and accuracy over various size and types of context window. I know it's been done, but that's the kind of test I think deserves more focus. LLMs should also be great at translation tasks given that is all about language, though coming up with an objective measure of that may be tricky given there can be multiple valid interpretations. Also some languages, Japanese springs to mind, rely a lot on context and stating things implicitly, which can make translation to/from English difficult due to yet more scope for multiple interpretations. Okay, maybe translation isn't so easy after all. Anyway that's as far as I've gotten so far with that train of thought so I'll stop there. 😅

Lemon

Also sceptical, but do humans do anything different? How often do you know how a sentence you start will end when you start formulating it? Do you have a "hidden LLM" in the background producing the answer, based on the prompt - a question from another person? Are we more than pattern generators and repeaters? At least we hallucinate a lot - see patterns where patterns do not exist :-) We are of course more sophisticated - we have been trained on "multimodal" material since birth. So how do we get progress and new insights? Two important ingredients: different perspectives - i.e. persons with different training material - and a loss/utility function. Progress (new insights) is then created in the interaction between the actors, and it is measured by how many of the actors accept the suggestions (new insights) as relevant. In theory, this way one could create values, culture, scientific paradigms etc. And still have "random" suggestions now and then - that could be hallucinations or accepted by "all". In principle one could simulate evolution this way... just thinking aloud.

Kai Lohre

Covered that paper extensively on my Holy Grail video on Insiders, us my Chatgpt fails basic logic video on Explained. It is a brilliant paper!

Philip

This article looks at performance across a bunch of domains in counterfactuals. Similar to the overperformance in math when changing to functional forms, they made similar changes in other domains and found a similar drop in LLM performance (but still above random performance, of course). For example, in chess, if you swap the bishops and the knights, what is a legal move now? https://arxiv.org/pdf/2307.02477

Joel

Absolutely

Philip

The labs should be incentivised over the medium-long term to have more capable models, not obfuscating the reasoning gaps by using shoddy benchmarks. If they are overemphasizing the benchmarks, that points to short-termism - i.e. poor leadership or inadequate product thinking.

Machiel Reyneke

It does indeed go against the narrative. Could be a big revelation across the board if validated across benchmarks - newsworthy even.

Philip

Agreed Steve. Once a model 'knows' all of history, it just knows. Although, you could randomise the phrasing of the same tests of historical knowledge to get some of the same effect. Leverage LLMs to rephrase 'which ruler led the army of Z in conquering the land of X IN year Y' to be 'name the ruler who in year Y seized the territory of X, leading the forces of Z'

Philip

Me too Trenton! If we can shape, in a tiny way, the 'final' AGI benchmark, it would be history in the making

Philip

I feel there starts to be a real need for independent benchmarking. To me, the elephant in the room is how benchmarking is mixed up with marketing. Considering the insane amount of money pouring into the field, companies have a strong incentive to show a number going up. I wouldn't be surprised to learn that benchmark data is fed to models on purpose. Just think car manufacturers including cheat modes where engines detect being tested and behaving differently than under real conditions (https://en.wikipedia.org/wiki/Diesel_emissions_scandal). Also: The gap you are describing points at models being dramatically less capable at reasoning than suggested. That goes right against the hype narrative. Most leading labs have a strong incentive to push that discussion further down the road. (I'd love to hear more about how to understand this.)

Jörg Weiß

Another thought-provoking video. I am sure that benchmarks for reasoning ability and mathematics are important, but there are other uses of LLMs that may be even more difficult to assess. One thing I’ve been struggling with lately is how to evaluate the summarization performance of different models. For example, if I throw a 50-page text at two models and ask each to summarize it in five paragraphs, how can I judge which summary is better? There are no correct or incorrect summaries, just better or worse ones, and the only way I can think of to evaluate summaries systematically is to have multiple people read the entire text and write their own summaries, and then compare the LLM summaries with those human-written summaries. This has become an important issue for me because lately I have been using LLMs more and more to summarize various types of documents. And the reason I do so is that I don’t want to read the entire documents myself. How can I know to what extent I can trust the summaries LLMs give me?

Tom Gally

It's so crazy that it wasn't tried before, almost annoyed I didn't think of it. Definitely right to be skeptical.

Philip

Thanks so much Lee

Philip

What an amazing comment thank you so much and agreed.

Philip

Very interesting video Philip! Honestly, it felt quite eye-opening that the current AIs are still so very limited. So many people discuss AGI these days that it somehow shocked me we still have a long way to go. I took some time to contemplate the current MMMU leaderboard. Notice how the best AIs (average of the best 3 AI models) score against the Human Expert - Worst in various fields:

Art & Design: 80.8 (Human Expert - Worst) vs 72.8 (average of best 3: 75.9 | 72.5 | 70.0)
Business: 78.0 vs 61.6 (67.2 | 59.3 | 58.2)
Science: 78.0 vs 51.0 (54.7 | 49.3 | 48.9)
Health & Medicine: 73.3 vs 64.4 (67.3 | 64.7 | 61.1)
Humanities & Social Science: 74.2 vs 76.1 (78.3 | 75.0 | 75.0)
Tech & Engineering: 74.3 vs 48.6 (50.6 | 48.1 | 47.1)

In Humanities, AIs are already beating human experts. Health & Medicine is next, with AIs at 88% of the worst human's performance, then Business at 79%, Science at 65%, and Tech & Engineering at 65%.

It's the hard sciences that are the most behind, and it seems it's also the hard sciences where this new approach with functional benchmarks can help the most. I can easily think of turning math / chemistry problems into code, but I'm not sure how I would do the same for history or literature. One more thought - these functional benchmarks seem to me very related (in idea) to training on synthetic datasets.

Sam
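
The percentages in Sam's comment above can be re-derived from the scores it quotes. The snippet below is only a rough sketch: the numbers are copied from the comment rather than from the live MMMU leaderboard, and the names `scores` and `relative_to_worst_expert` are made up for illustration.

```python
# Scores as quoted in the comment: (worst human expert, top-3 model scores).
# These are copied from the comment, not fetched from the MMMU leaderboard.
scores = {
    "Art & Design": (80.8, [75.9, 72.5, 70.0]),
    "Business": (78.0, [67.2, 59.3, 58.2]),
    "Science": (78.0, [54.7, 49.3, 48.9]),
    "Health & Medicine": (73.3, [67.3, 64.7, 61.1]),
    "Humanities & Social Science": (74.2, [78.3, 75.0, 75.0]),
    "Tech & Engineering": (74.3, [50.6, 48.1, 47.1]),
}


def relative_to_worst_expert(human, models):
    """Return the mean model score as a percentage of the human score."""
    mean_model = sum(models) / len(models)
    return 100 * mean_model / human


ratios = {
    field: relative_to_worst_expert(human, models)
    for field, (human, models) in scores.items()
}

# Print fields from furthest behind the worst human expert to furthest ahead.
for field, pct in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{field}: {pct:.0f}%")
```

Run as written, this reproduces the 88%, 79%, 65% and 65% figures from the comment, and also puts Art & Design at about 90% and Humanities & Social Science at about 103% of the worst human expert.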

This has exposed how relatively dumb today's models are despite having memorised all of human knowledge. I look forward to new architectures or primitives that are more sophisticated, won't need extreme volumes of training data, and can reason their way to more correct answers (rather than remembering). Hopefully smaller models too.

Machiel Reyneke

I was wondering not too long ago what would happen if someone took the numbers in existing benchmarks and just tweaked them to mitigate against the benchmark questions ending up in the training data. Very interesting to see how performance falls off on these new functional benchmarks. I still have this one fundamental doubt about LLMs ever developing reasoning. Perhaps I just don't understand enough about how LLMs work, but my layman understanding is that they are essentially just next token predictors. Sure, the output can be improved with multi-shot, chain of thought, backtracking and various other techniques, but at the end of the day these models are just predicting the next word based on what they've seen before. I still struggle to conceptualise how this type of architecture could ever produce real reasoning. I feel like we've all been wowed by an illusion due to the human-like appearance of the output. I forget which video it was, but I seem to recall a video on your channel that at one point showed how an LLM had "reasoned" that 1+1 = 2, and the result was some incredibly complicated function full of sines and cosines that ultimately resulted in 1+1=2 in a very roundabout way. I'm still sceptical when it comes to LLMs achieving reasoning and being able to provide new insights in maths and science, but I do look forward to being proven wrong if that does happen someday.

Lemon

Another beautiful school day. There is still work to do, then, so that we don't get HAL 9000 reasoning and hallucinations when "AGI" is implemented. It seemed so close, but in truth it now seems some distance away from general implementation. If the reported range of guesses for reaching AGI is 2025-2029, there is still a lot for the major labs to solve. The note on Mistral's offerings highlighted in the text on the video was enlightening: very low true scores for a highly rated independent lab model. I really enjoyed today's Patreon video. Thank you for your continued dedication to educating your patrons, and have a wonderful day ;) 😉

Lee FRASER

I do really like this idea of functional benchmarks. Seems like it is useful not just for a more realistic benchmark, but for training purposes as well. If the functional aspect is sophisticated enough (not just swapping some numbers, but names, sentence order, various interchangeable nouns), instead of worrying about accidentally training on the benchmark, maybe you actually WANT to train on it. It'd encourage generalizing in the model, especially if the function scenarios are broad enough. A lot of "synthetic data" seems to be based around generation by other LLMs (or even the same LLM w/ iteration), but this kind of middle-ground between that and human-created examples seems quite useful as well.

Shawn Fumo
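
To make the "functional variant" idea from the comment above concrete, here is a toy sketch of a question template whose surface details (name, numbers) are re-sampled on every call while the answer is computed by code. It is not the method of the linked functional benchmarks paper; the question template, the name list, and the function `make_variant` are all invented for illustration.

```python
import random


def make_variant(seed):
    """Build one (question, answer) pair from a fixed word-problem template.

    The seed makes each variant reproducible, so a benchmark run can be
    regenerated exactly while still differing from any memorised instance.
    """
    rng = random.Random(seed)
    name = rng.choice(["Matt", "Priya", "Chen", "Amara"])  # hypothetical names
    price = rng.randint(2, 9)    # dollars per notebook
    count = rng.randint(3, 12)   # notebooks bought
    paid = price * count + rng.randint(1, 20)  # always enough to cover the cost
    question = (
        f"{name} buys {count} notebooks at ${price} each and pays with ${paid}. "
        f"How much change does {name} get?"
    )
    answer = paid - price * count  # ground truth computed, never hand-written
    return question, answer


# A fresh benchmark snapshot is just a list of seeds.
snapshot = [make_variant(seed) for seed in range(5)]
for question, answer in snapshot:
    print(question, "->", answer)
```

Because the answer is derived from the same sampled values as the question text, a model that has memorised one instance gains nothing on the next one. Richer variation (sentence order, interchangeable nouns) would need a more elaborate template than this one.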

I worry that even aside from straight-up errors, ambiguity in the wording of questions is an even larger problem (i.e., not logically clear what the answer is). Take your own hand-crafted combinatorics question that you screenshotted for us: "Matt has six different playing cards. How many ways are there for Matt to sort the cards into three piles of two cards each?" Well, that question is a bit ambiguous. What are you counting? The different 'ways' to sort the cards (order matters, which card goes into which pile first matters)? That'd be 720. Does the order of the piles and order of the cards within each pile not matter at all? That'd be 15. Does the order of the piles matter, but not the order of the cards within each pile? That'd be 90 (not an option). Order of the cards, but not the piles, matters? 120, I think? I think many people who know some combinatorics (as well as many models) will just *assume* the question means 'Don't count the number of ways Matt could sort the cards, count the number of unique combinations of 3 piles of 2 cards, where order of the piles and order of the cards doesn't matter', because they've seen similar questions like this before. But that's an assumption - the question really doesn't make it clear. In fact, the language itself 'number of *ways* to sort' kind of suggests the 720 answer (or 1, if you take 'sort' literally, heh), even though presumably the intended correct answer is 15.

Hillary Sanders
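
The four readings of the card question in the comment above can be checked by brute force. This is a small sketch rather than anything from the video; the card labels and the function `count_arrangements` are hypothetical.

```python
from itertools import permutations

CARDS = "ABCDEF"  # six distinct cards, labelled arbitrarily


def count_arrangements(pile_order_matters, card_order_matters):
    """Count distinct ways to split six cards into three piles of two."""
    seen = set()
    for perm in permutations(CARDS):
        # Deal the permutation into three piles of two cards each.
        piles = [perm[0:2], perm[2:4], perm[4:6]]
        if not card_order_matters:
            # Treat each pile as an unordered pair.
            piles = [tuple(sorted(pile)) for pile in piles]
        if not pile_order_matters:
            # Treat the three piles as an unordered collection.
            piles = sorted(piles)
        seen.add(tuple(piles))
    return len(seen)


for pile_order in (True, False):
    for card_order in (True, False):
        total = count_arrangements(pile_order, card_order)
        print(f"piles ordered={pile_order}, cards ordered={card_order}: {total}")
```

The enumeration gives 720, 90, 120 and 15 for the four interpretations, which matches the figures in the comment, including the tentative 120.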

Interesting about the GPQA. If there were only two expert validators and they got the 'answer' (defined by the question author) wrong 25% of the time, then we really shouldn't expect the 'diamond' set to have 100% accuracy. Like, if correct/incorrect answers were iid (which they aren't), that circumstance would be really rare. I.e. if experts tended to answer differently 25% of the time, I imagine the question authors were wrong (either in their answer or slightly incorrect in their problem setup / ambiguity) a similar amount of the time. 25%*(25%/3)*(25%/3) (all were wrong in the same way) is really small (like, two in a thousand). But in reality, as I think the video mentioned, there are common misunderstandings / misconceptions in these fields, and I'd imagine these are pretty plentiful in areas where experts disagree on answers so often. And in those circumstances, if one person misunderstands some concept and writes or answers B instead of A, the next expert is probably far, far more likely to carry the same or similar misunderstanding. So honestly 75-100% doesn't seem like a crazy range, it seems realistic (maybe except for the 100% part, ha!)

Hillary Sanders
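
The "two in a thousand" estimate in the comment above is easy to verify. The sketch below simply restates the comment's own assumptions: a 25% error rate for the author and for each of two validators, independence between them, and three wrong options per question (which is what the division by 3 suggests). The variable names are illustrative.

```python
# Assumptions taken from the comment, not measured values:
p_wrong = 0.25        # chance that any one person gets a question wrong
wrong_options = 3     # wrong choices per question, implied by the "/3"

# The author picks some wrong answer, then each of the two validators
# independently lands on that same specific wrong answer.
p_same_wrong_answer = p_wrong * (p_wrong / wrong_options) ** 2

print(f"{p_same_wrong_answer:.5f}")
print(f"about {p_same_wrong_answer * 1000:.1f} in a thousand")
```

This comes out to roughly 0.0017, i.e. a little under two in a thousand. As the comment goes on to say, the independence assumption is the weak point: shared misconceptions would make the real figure higher.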

Will updating benchmarks be a new way for science to progress? Not fully serious here, but I can imagine a cycle of "people set up a benchmark" -> "model answers some questions differently than the benchmark" -> "people evaluate the disagreement and find out the model was correct". Increasingly, you could see more expert and niche knowledge needed to set up benchmarks. And then a model's lower scores could be treated as maybe having some grain of knowledge behind them, as alternative hypotheses to be validated. Analogous to Move 37 by AlphaGo seeming like a mistake just after it was played, but later being revealed as a smart move.

Tomas Dulka

Philip, thank you for another excellent episode, this time shining a light on the critical issue of benchmarking in AI. If I recall correctly, you were one of the first people to raise this issue and emphasize its importance, so it's wonderful to see it now getting more widespread attention - and from luminaries like Andrej Karpathy no less! You make a brilliant point about the stark disparity between the billions of dollars poured into training ever-larger models and the relative pittance devoted to developing meaningful benchmarks to assess their true capabilities and limitations. Without robust benchmarks, it's hard to cut through the hype and determine what these systems can actually do, and what vulnerabilities or failure modes we need to watch out for. It's encouraging to learn about some of the new benchmarking initiatives and approaches that are now emerging to try to address this challenge. While there's clearly still a long way to go, these efforts give me hope that we might gradually develop more informative and reliable ways to evaluate AI systems. We desperately need an "honest broker" in this space. As always, I'm grateful to you for shedding light on what I consider to be some of the most meaningful issues and questions surrounding AI development – issues that too often get lost amidst the ceaseless hype cycle. Please keep up the insightful commentary and analysis! I am certain I speak for other members on this platform when I express heartfelt gratitude for your tremendous efforts. Thank you.

Arvind Mani

I stopped using GPT-4o, and its issues can't be recognized on the LMSys arena. It has a lot of issues with long conversations: it doesn't answer questions but summarizes them, and it gets stuck on the previous context and the replies it gave and keeps repeating them, even when told not to.

Adin Softic

I love the idea of functional variants such that you can generate a unique benchmark every time you want by pressing a button. I am not sure if it is possible for all fields besides math, physics, coding...

SteveHaupt

Benchmarks have always been this fickle thing. When I visited OpenAI, one of the staff members said that usually what happens is someone makes a small benchmark for themselves, and then people just end up adopting it and treating it as the standard. This gives me great hope though that there’s still significant room for improvement (aka GPT-4 level isn’t the upper limit of model intelligence!). I share your excitement around the MMMU; vision and reasoning are the two things I care about more than anything else. Perhaps we shall see a new version of SmartGPT topping the charts…

Trenton Dambrowitz

