AIExplained

10,000x Scaling Deep Dive, and a 5-year LLM Roadmap

A new 20,000-word report on AI scaling, and yes, I read it all to bring you the highlights. What are the biggest unanswered questions about whether we will scale models 10,000x, and is there a deeper question that underlies them all? Plus new clips from the Anthropic CEO, Simple update, Eric Schmidt and more…

Link for Download and Off-line Watching: https://drive.google.com/file/d/1EObJY3kVgXRxH4N0mWxd3fqYWOEPNckW/view?usp=sharing

Epoch Report: https://epochai.org/blog/can-ai-scaling-continue-through-2030

Stargate Report: https://www.theinformation.com/articles/microsoft-and-openai-plot-100-billion-stargate-ai-supercomputer?rc=sy0ihq

My StarGate Video: https://www.youtube.com/watch?v=KXG2f-So9oo

Eric Schmidt Leak: https://www.cnbc.com/2024/08/15/eric-schmidt-on-nvidia-you-know-what-to-do-in-the-stock-market.html

Amodei Interview: https://www.youtube.com/watch?v=7xij6SoCClI

Market Caps: https://companiesmarketcap.com/
Cerebras: https://inference.cerebras.ai/

Noam Brown Tweet: https://x.com/polynoamial/status/1803844406480638284

RT-2-X: https://deepmind.google/discover/blog/scaling-up-learning-across-many-different-robot-types/

Exponential Data Might Be Needed: https://arxiv.org/pdf/2404.04125

Comments

@Barnaby it only feels like that because you're living in linear time. On the exponential scale, they're a straight line.

Gregory Klopper

It feels like the time-scales are getting shorter.

Barnaby Golden

Yeah what happens if scaling does continue for 10,000x AND it continues with o1 -> o3 (another 10,000x). ASI?

Dane Holmberg

Has the release of o1-preview changed your conclusions from this video at all?

Barnaby Golden

No problem, and yeah “Beff Jezos” is interesting to say the least. Here is a video Extropic released that does go into a bit of detail (and didn’t get many views): https://youtu.be/mxLuoifgJdU?si=2rP-A-XFmwKgzMe7

James Patton

Thanks so much Juanjo! I see you retweeting me often, am grateful for your support on multiple platforms!

Philip

So kind James, I will be honest in saying I don't know much about Extropic, other than the controversies around its leadership, so will watch them more carefully after what you said.

Philip

I once did a video on here vaguely related to your Stack Overflow comments, it was about a patent OpenAI did. But you raise so many interesting points, makes me wonder what you do for work Blake!

Philip

Me too Vlad, me too

Philip

Thanks Mark, and yes, that term has become so confusing, and polarising, that I steer clear of it more than I used to. Even the demo-AGI term I tried to popularize does not fit the bill, so Andrew is right in that sense that it very much depends on our definitions.

Philip

Yeah I toyed with including it, and DiLoCo, but had to weigh up the heaviness of the video for a general audience!

Philip

That is so kind Joe, thank you

Philip

Philip: you have a truly penetrating insight. Amidst all the distracting noise in this area, your incisiveness cuts straight to the core issues. Your commentary and analysis are a breath of fresh air.

Joe Marler

Great video as always! Btw on the topic of power and distributed training, Nous Research lately talked about DisTrO to make this easier: https://github.com/NousResearch/DisTrO Right now just preliminary info, but a full paper is coming if their readme is to be believed.

Shawn Fumo

Yeah, it feels like the next 6 months or so may tell us a lot. And whether or not OpenAI themselves wow us, between Anthropic, DeepMind, xAI, Meta, now Magic coming out of stealth, etc., it feels like someone is going to have something impressive. If none of those manage anything other than incremental improvements, the “AI has slowed” people may have a point. Then again, all it takes is one breakthrough to possibly turn that all on its head. I keep coming back in my mind to that recent paper where researchers trained a good-quality image generator from scratch for less than $2000 and with 37M images, vs. SD 1.5 costing $300k and using almost 4B images. We may not see something quite that dramatic on the LLM side, but then again, who knows. It still feels so early in all of this.

Shawn Fumo

I’m not sure how much time awareness something like Gemini Pro has, but I know Covariant has stuff baked into its robot arm AI (mostly for picking operations) to do things like visualize what will happen given various physical actions. Interestingly, their co-founders were just hired by Amazon to incorporate their tech into Amazon’s warehouses, etc.

Shawn Fumo

Truly one of your most insightful and thought-provoking videos ever, Philip. Unless I missed it, I think you barely used the term AGI. Perhaps because it’s such a poorly defined concept that it’s kind of become a distraction. But I wonder if you heard Andrew Ng recently say that he believes AGI is at least decades away, and how you would square that with what seem to be more ambitious projections from the Epoch paper.

Mark Levine

Sometimes I myself feel like "a blind reinforcement learning model, just fumbling amid the infinities of the universe". Thank you!

Vlad Gheorghe

I agree strongly with the notion of a data labeling/synthetic data revolution. I think it extends past just labeling, because there is low-hanging fruit that hybrid-synthetic data generation would manifest exponentially faster. I'm making the assumption that any leaps in purely synthetic data generation for informing world modeling will only be possible after innovations are conceptually demonstrated in a supervised process. That cottage industry you mention in your video is real, and kind of scary. Please know, though, that I'm still very new to the field, and most of my knowledge is informed by an informal education and practical experience, so please correct me on any points. Right now, there is an upswell of data annotation of large language models that goes beyond A/B testing. I'd argue that some of the A/B testing going on is a facade for aggregating domain-expert reasoning. And because that sentence sounds almost conspiratorial, I'd appreciate any correction of my interpretation of the processes I've observed. If it's reasonable to use coding tasks as a microcosm of the data that could be scaled synthetically with a direct impact on performance (memorization or otherwise), then this process is kind of freaky:
- First, an A/B testing environment is set up for evaluating comparative performance on coding tasks.
- Second, human annotators are given rating criteria with three essential parts: 1) rating scales for qualitative aspects of the response, 2) a section for reviewers to explain the reasoning behind their rating, with specifications for how the reviewer should write and the details to include, 3) proof of code execution.
- Finally, the human annotators are given the option to edit a model response such that the solution can be executed.
The process is interesting because of the requirement to explain one's reasoning and provide proof of code execution, but it's freaky because of the detailed instructions and guidance on how an annotator must explain why the provided solution to the coding task is wrong and how it is corrected. From my naive understanding, this is more than high-quality, expert-informed data on a model's performance on coding tasks. If I think about Stack Overflow as a baseline, this process seems like a very simple way to take a step past A/B testing and systematically produce proper language modeling of the conditions for a broad range of coding tasks and their correct completion, much like Stack Overflow, but without tangential comments and competing explanations. I don't think this process alone will scale, but if LLMs can be trained to perform the same process in tandem, maybe synthetic data that mimics the Stack Overflow-style post would be something LLMs could extrapolate into its most generalized form. I can't tell if this kind of hybrid-synthetic data would contain the patterns and artifacts needed to model any of the real-world systems, like synchronicity of process execution or the memory garbage collection of virtual runtime environments, small world-models that I would argue are necessary for reasoning through coding tasks. However, I can see how this kind of approach could be used in other problem spaces to create large datasets of tasks and written-down reasoning about how they are performed correctly.

Blake Chambers

Current multimodal models are trained on relatively low-resolution image patches that have been tokenized, and I think this hinders an LLM from seeing the images as clearly as it would need to in order to reason about them the way it does with text. Additionally, as far as I know, no multimodal LLM yet truly has time awareness, so even if they train on frames of a video, I don't think we yet have visual cause and effect as a concept in LLMs (but possibly in diffusion models, especially if you look at the recent DOOM engine, which could repurpose a Stable Diffusion model to simulate the game). We will probably see a lot more development in multimodal networks still, because it seems like the current models are just barely scraping by. Heck, even the way tokens work is too low-resolution in some regards (hence the "how many 'r's in 'strawberry'" meme). And I think this could dramatically change the amount of intelligence you can squeeze out of even the current amount of compute. And it can be shown that architectural improvements do have this effect, e.g. compare GPT-3 with Llama 3.1 8B. Hopefully architectures like LeCun's JEPA, which can reason more in latent space, including with video, will eventually unlock a lot more trainability on multimodal data. If it works, I don't think data labeling will be too important (just as it isn't for humans), though maybe data consistency will be (so for the example of watching cartoons, maybe they should run through a 3D renderer pass that shows the cartoon on a screen "in the real world" so that all data the LLM receives is based in reality, just like it is for humans).

Blixt

One of your best videos. So many insights from it

Leo G

Amazing video as always Philip! Regarding the power constraints - I'd be curious to hear your views on "thermodynamic" computing (Extropic) and "liquid" neural networks (Liquid.ai) as a part of the solution - as they claim 100-1000x efficiency gains. However, there may simply not be enough information released from these guys to put together a video/analysis on it.

James Patton

Very interesting stuff. At the very least, the next generation of LLMs will help researchers digest the tons of papers coming out each year, even if they only end up being great at regurgitation.

Mike D

This is what I think every time I hear "diminishing returns" or that OpenAI is losing its lead. I don't think anyone can answer that until we've seen GPT-5 in action.

Mike D

As always, awesome video Philip!

Juanjo do Olmo

It seems like so many decisions hinge upon how much GPT-5 wows people. This is a "make or break" moment for the trajectory of AI.

David Shapiro

It's always an exciting email to get, when one of these videos is ready!

Simon Sturmer

