Pod 10: 4 Reasons Why Data is Now Even More Important: Scaling plateaus, judge rulings, test-time training paper and post-AGI jobs - Let's Think Sip-by-Sip
I have reviewed the lecture and the accompanying papers, and it is clear that large language models (LLMs) do encode chain-of-thought data in their weights. I am confident that breakthroughs in "reasoning" are imminent. While these models may not reach superhuman levels, their capability to tackle longer-term tasks, employ multiple tools and solve straightforward logical problems that don't need physical grounding will be unleashed very soon.
https://www.youtube.com/watch?v=QL-FS_Zcmyo&list=PLS01nW3RtgopsNLeM936V4TNSsvvVglLc&index=10
Rômulo Drumond
2024-11-19 20:37:30 +0000 UTC
As far as court cases go, I think you should look to the music industry and the impact that streaming had on royalties. Spotify pays artists as little in royalties as possible in order to provide listeners with access to nearly any song they want to listen to. This evolution happened over many years, but there is no way to go back to the old regime where big music labels were making vast profits from the work of artists.
Joshua Davis
2024-11-18 14:49:47 +0000 UTC
@AI Explained curious to hear your thoughts on this perspective.
Joshua Davis
2024-11-18 14:46:28 +0000 UTC
Also, you keep mentioning a brain scan as if this would be needed to prevent lying, but Alex Tabarrok has said that "A Bet is a Tax on Bullshit." This is why prediction markets will ALWAYS be better than polling. Indeed we found this to be the case, because the $3.6 billion prediction market on Polymarket for who would be the next president was correctly predicting Trump would win by a large margin as early as mid October. This was based on the insight of one single Frenchman who bet 30 million dollars on Trump.
Prediction markets are superior at revealing this kind of non-public information. They have always been shown to work provided that there is very little correlation between the different participants in the market, i.e. no single participant has undue influence on the market to change the outcome.
Joshua Davis
2024-11-18 14:44:53 +0000 UTC
As far as your question about survey data goes you can always have models create new markets on Polymarket that read like this:
There will be a new Buc-ees built in Dripping Springs, TX by September of 2026.
The model then bets $50,000 on "No" and then it advertises the market. Now people have an incentive to bet on "Yes" and convince the model that Dripping Springs is where the next Buc-ees should be built. The model still controls the outcome of the market. It only resolves the market to "Yes" when it has collected enough valuable data from the users who have been incentivized to vote "Yes." It can then place a $50,000 bet on "Yes" and resolve the market by announcing that a new Buc-ees will be built in Dripping Springs, TX and recoup some of its costs.
People who bet on "Yes" get some compensation for their efforts to provide valuable data to the model that allowed the model to make this decision but the value of that data will always outweigh the cost. This is because the model can always resolve the market to "No" since it has the power to decide that there will not be a new Buc-ees built in Dripping Springs, TX.
Either way the model is paying the bare minimum for valuable information using prediction markets such as Polymarket. It worked for the election, it can work for anything.
Joshua Davis
2024-11-18 14:35:44 +0000 UTC
If we are at the beginning of a plateau in scaling then what other data points should we expect in the coming weeks and months?
Joshua Davis
2024-11-18 14:19:19 +0000 UTC
Random thought on o1 after listening to the Chris Olah section of the Lex Fridman interview:
Might a denser (vs. sparser) model be more powerful for generating diverse samples, as it captures more inefficient but potentially creative “noise” around constructs?
Christoph Kenntemich
2024-11-16 15:34:20 +0000 UTC
One interesting thing about that test time training and the thought of humans being needed to provide data. I wonder if that and o1 and even things like Entropix start moving the models more in a human direction.
Like I remember someone giving an example of writing a paragraph in a paper where you talk about the relative population sizes of two cities. If a human is doing it, there’s a lot of steps that aren’t reflected in the resulting text. You realize you haven’t memorized the population numbers and even if you think you may have you still know it is good to double check. So you look them up. You may then realize “wait, Athens should be much larger than that.. oh I grabbed Athens Georgia instead of Greece oops”. Etc.
In the ARC example, “getting the most juice” in that way isn’t necessarily a different kind of thing to do. If there is a visual puzzle, you very well could turn it sideways to see if that sparks an idea. You could think through some variations on a theme in your head. There’s that story of Terence Tao where, to figure out a complex problem, he ended up rolling around on the ground to help visualize the transforms.
And I wonder how tool use fits into training for inference time compute (and/or IT Training). You can imagine combining this all together. An o1 style RL that is encouraged to experiment with different chains of thought, it maybe has access to its own logits to help understand its own confidence levels, it has access to a calculator, Lean, etc. And maybe the process can start earlier than just fine-tuning on an already giant pre-trained model.
Feels like there has to be some combo of all these different kinds of techniques that will be fundamentally different than just training on a ton of text from the internet and hoping for the best.
Shawn Fumo
2024-11-15 14:18:39 +0000 UTC
Very good podcast. I do think the copyright one has a bit of leeway on their end. I think for LLMs it is harder, but consider image models. Recent research has shown there are ways to get decent results with less training data, and there is a fair amount of public domain photos and paintings. I wouldn’t be surprised if there were an effort to do most of the initial training on that data to bake in the broad concepts and then fine-tune on copyrighted data. That way perhaps they bring down the percentage that the copyrighted data contributes to the final image. And perhaps they pay people specifically for images to fill some of the gaps, but as a one-time cost for their creation. Those two approaches could potentially push way down the amounts they have to pay in royalties.
The same goes for high-quality data in, say, math and coding. If they can construct a model or even a traditional software program from scratch that can generate a variety of examples, that may be able to pad out some of the data to again lower their royalty costs.
Not to say that royalties shouldn’t be pursued, but I’m not optimistic that it’d end up being a big stopper for the larger companies, especially with the time it takes for laws to catch up.
Shawn Fumo
2024-11-15 13:59:10 +0000 UTC
For sure, but the bare minimum would be more like 1-3 rather than 10
Philip
2024-11-15 10:46:21 +0000 UTC
I guess the answer is to generate much more value than the harm you cause, and to pass that value on to those harmed.
Robert Gomez-Reino
2024-11-14 04:53:43 +0000 UTC
But there could be a flip of the economic upside, even with subscriptions, once there is human-like intelligence. It's no longer 20-30 USD/month subscriptions at that point. It could be workers at 2k-3k USD/month. So I believe the incentive will still be there to chase AGI by all means.
Robert Gomez-Reino
2024-11-14 04:48:40 +0000 UTC
I don't understand the point of the disclaimer that for test time training a few correct examples are needed. I mean they are, but the arc challenge would be literally impossible without these examples. Like in IQ tests, examples are not there to help you out, they are not kind hints. They are a necessary part of the puzzle to have an unambiguous pattern that one can find.
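To make the point concrete, here is a minimal sketch of how a handful of demonstrations can become the whole training set at test time. It assumes ARC-style tasks given as (input, output) grid pairs; the function names and the toy grids are hypothetical, and this is an illustration of the general idea (hold one example out, optionally "turn the puzzle sideways"), not the paper's actual code.

```python
from typing import List, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]


def leave_one_out_tasks(demos: List[Pair]) -> List[Tuple[List[Pair], Pair]]:
    """For each demonstration, hold it out as the target and keep the rest as context."""
    tasks = []
    for i, held_out in enumerate(demos):
        context = demos[:i] + demos[i + 1:]
        tasks.append((context, held_out))
    return tasks


def rotate90(grid: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def augment(demos: List[Pair]) -> List[Pair]:
    """Apply the same rotation to inputs and outputs to get one extra variant per pair."""
    return [(rotate90(x), rotate90(y)) for x, y in demos]


# Toy task (hypothetical): the output is the input mirrored left-to-right.
demos: List[Pair] = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[0, 0], [2, 0]], [[0, 0], [0, 2]]),
    ([[3, 0], [3, 0]], [[0, 3], [0, 3]]),
]

tasks = leave_one_out_tasks(demos)
extra = augment(demos)
```

With zero demonstrations both lists are empty, which is the same thing as saying the puzzle has no defined answer without its examples.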
Phillip Yao-Lakaschus
2024-11-13 23:47:55 +0000 UTC
I can't help but think the term "data" is too vague. Like would x.ai have an advantage in making a more intelligent LLM because they have access to vast amounts of X social network data? Maybe in some ways, but I don't think that data has a structure that allows for better reasoning than any other dataset of social media posts that were used for today's models.
You said "high quality data" but what's the definition? My understanding is that o1 was trained with generated chains of thought which had their steps verified with another model. Now the root of these layers of training ends up being a dataset like REVEAL. So maybe very specific reasoning datasets like REVEAL that can be used to bootstrap training are "high quality data".
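As a rough sketch of what "generated chains of thought with verified steps" could look like as a data pipeline: sample candidate chains, score every step with a verifier, and keep only the chains where all steps pass. Everything here is an assumption for illustration. The toy verifier just checks arithmetic, whereas a real setup would presumably use a learned model, and nothing below reflects how o1 was actually trained.

```python
from typing import Callable, List

Chain = List[str]


def arithmetic_step_ok(step: str) -> bool:
    """Toy stand-in for a step verifier: 'a + b = c' passes only if the sum is right."""
    left, right = step.split("=")
    a, b = left.split("+")
    return int(a) + int(b) == int(right)


def filter_chains(chains: List[Chain], verify_step: Callable[[str], bool]) -> List[Chain]:
    """Keep only the chains in which every single step is accepted by the verifier."""
    return [chain for chain in chains if all(verify_step(step) for step in chain)]


# Hypothetical candidates, as if sampled from a generator model.
candidates: List[Chain] = [
    ["2 + 3 = 5", "5 + 4 = 9"],   # every step is correct
    ["2 + 3 = 6", "6 + 4 = 10"],  # first step is wrong, so the whole chain is dropped
]

kept = filter_chains(candidates, arithmetic_step_ok)
```

The surviving chains would be the "high quality data" in this picture: small in volume, but each step has been checked.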
I would argue what seems important now isn't just high quality data, but really clever frameworks on top of the LLM architecture. This includes how you train it, which can be inference/test-time training (and I'd argue "data" here is no more "data" than the prompts you feed to today's models), or the (presumably) RL-powered training of o1's thought model.
I would go so far as to say maybe we've saturated the need for data in a way. Yes, models will need to be retrained with information as it changes every day, and they'll also need to become better at awareness of very niche information that may not be in all datasets today. But that doesn't seem to make them much more capable. Even when it comes to code, presumably all models are trained on practically all the code that's out there by now. So to make them better at, say, code, you need a clever training framework for them to intrinsically grok why some code is bad and some code is good.
I think Francois Chollet eloquently talks about the limitations in today's LLMs. Paraphrasing, he says they're really good at finding the right functions they need to predict the output in memory, but those functions are static. To solve ARC problems you need meta-functions that construct the right functions. Maybe this is what test-time training does. Furthermore, maybe this is also what the thought model of o1 does indirectly. All this to say, maybe we don't need so much more data, but we need a pipeline that trains the models differently (maybe by changing the data, maybe by generating data, or maybe by RLing on meticulously handcrafted training data).
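One way to picture the "meta-function" idea is a tiny program search: instead of picking a single stored function, compose primitives until the composition fits the demonstrations. This is only a toy sketch under my own assumptions. The three grid primitives and the `synthesize` helper are hypothetical, and this is not a claim about what test-time training or o1 does internally.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]


def flip_h(g: Grid) -> Grid:
    """Mirror each row left-to-right."""
    return [row[::-1] for row in g]


def flip_v(g: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [list(row) for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(row) for row in zip(*g)]


PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}


def run(names: Tuple[str, ...], g: Grid) -> Grid:
    """Apply the named primitives in order."""
    for name in names:
        g = PRIMITIVES[name](g)
    return g


def synthesize(demos: List[Tuple[Grid, Grid]], max_depth: int = 2) -> Optional[Tuple[str, ...]]:
    """Return the first composition of primitives that reproduces every demonstration."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            if all(run(names, x) == y for x, y in demos):
                return names
    return None


# Hypothetical task: rotate the grid 90 degrees clockwise (not itself a primitive).
demos = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[1, 2, 3]], [[1], [2], [3]]),
]
program = synthesize(demos)
```

No single primitive solves this task, but a two-step composition does, which is the sense in which the function gets constructed per task rather than looked up.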