Let's kickstart 2025 by comparing Veo 2 to Sora, then asking a bigger question. Can the o3 method be repurposed for different modalities? I piece together the evidence, and argue that we could see step-changes in more than just text-based reasoning.
I might have missed something, but in the o3 method talk they stated that video verification is about as hard as video generation. Will this make it a bottleneck for applying the techniques to video generation? That is, assuming o3's premise holds that verifying reasoning is much easier than generating it.
Sion Boguszewicz
2025-01-10 01:34:47 +0000 UTC
Agreed. My point is that this is a poor test or benchmark. The insight with o3 is that anything with a deterministic, easy-to-measure benchmark will be easy to saturate.
Austin King
2025-01-06 17:57:23 +0000 UTC
I would have assumed that "how close are the predicted frames to the next N frames of the reference video" is how they are already training video models? Can't see how else you'd do it.
Simon Booth
2025-01-06 16:59:17 +0000 UTC
Thanks, Philip! Very interesting. It doesn’t show the guy getting out of the carriage, but still – it does a good job of creating a Victorian street and a horse-drawn carriage.
By contrast, here’s the opposition trying the same job: https://youtu.be/vHoRDzKabMQ
Videos from Runway Gen-3 and Hailuo Minimax, plus a still image from MidJourney.
Martin Percy
2025-01-05 11:12:51 +0000 UTC
Thank you Martin! Tried these, but it loses coherence very quickly. For example, the man never gets out of the carriage in any of the four videos: https://drive.google.com/file/d/1HLmjrAPyxCXoodwYZWKBurj8xUjg8vIO/view?usp=sharing
Philip
2025-01-05 10:31:03 +0000 UTC
Yes, in a way this is the ultimate question. Or, reframed, are there any circumstances where we don't have that signal, given enough work?
Philip
2025-01-05 10:30:11 +0000 UTC
Reiterating what you've already said, the path towards 'solving' generative AI for tasks with clear evaluation criteria seems very clear. Obviously there are plenty of engineering challenges, but we had AI working out RTS games all the way back in 2018, so it hardly seems insurmountable.
Now my 2 cents (and I've said this before, but it always bears repeating): I think the big open question is what we do where we don't have a readily available evaluation signal. We do have technical solutions here, and I am pretty confident we can bootstrap off of various things to get pretty good criteria for most stuff, but I think this is pretty far from being solved, or even from starting to be taken seriously as an issue. We've seen from LMSYS how our intuitive sense can sometimes be (apparently) misaligned with our stated preferences, and I'm sure we've all had frustrating moments where we wanted something from our models but didn't know how to convey it exactly. This has big implications for relying on RLHF.
This may seem fairly trivial in the grand scheme of things, and maybe crude objectives are all that's needed given enough compute/data, but eventually I think this could be the differentiator between sub- and superhuman capabilities, at least in terms of utility to the end user. It's a bit like an employee who isn't really able to work independently but can complete grunt work as instructed, versus someone who is able to work autonomously without oversight because they understand the brief. Intelligence is surely a big part of that, but so is being able to understand the objective. And you can't train something to be good at a task if you don't have a clear objective in the first place.
William Woof
2025-01-04 21:12:13 +0000 UTC
Hi Philip,
Great video as always. Below is my wish list of a few scenes to generate in Veo. As you'll see, I’m interested in seeing if it can do fairly straightforward scenes, the kind of thing that occurs many times in regular film and TV.
But you're a busy guy - no problem of course if you don’t have time to do any of these. Either way, keep up the great work.
Thanks!
Martin :)
Shot Prompts for Video Generation:
Victorian London Street Scene:
– Scene from a feature film showing a city street in Victorian London at night. A horse-drawn carriage travels along the street. The carriage stops, and a man in period-appropriate attire steps out.
Sporting Event Broadcast:
– Scene from a television broadcast of a gymnastics competition. A female gymnast performs a brilliant, gold medal-winning routine, showcasing grace and technical skill. The audience cheers as she lands her final move.
Feature Film Argument Scene:
– A tense emotional scene from a feature film. A man and a woman are arguing in a living room. The man is pleading for forgiveness, gesturing emotionally. The woman shouts back in anger, then turns away and leaves the room, visibly upset.
War Scene:
– Scene from a feature film showing a battlefield during World War I. British soldiers in muddy trenches, explosions in the background. A commanding officer gives orders under fire.
Romantic Comedy Meet-Cute:
– Scene from a feature film showing a coffee shop in modern-day New York. A man spills his drink, bumping into a woman. They exchange awkward smiles and laughter as they clean up the mess.
Martin Percy
2025-01-04 17:31:21 +0000 UTC
Tangential, but I wish more effort was put into models that score highly on agentic behaviors. I wish Claude computer use would get an upgrade. I even wonder if reasoning models would be better at computer use, since which button to click has an objectively correct answer.
Grant Singleton
2025-01-04 15:56:25 +0000 UTC
Ah. Of course. My mistake
Matthias Blank
2025-01-04 09:12:22 +0000 UTC
Thanks. Looks a bit better than early Star Trek. Haha. 1 year and it will be great
Matthias Blank
2025-01-04 09:11:39 +0000 UTC
Can't specifically do SpaceX rocket due to copyright it seems
I don't think it's coming this year. The compute costs are multiplicative, and I'd guess that stacking them would make it far too expensive to train this year. This year, I'm expecting algorithmic improvements to make them both more efficient, and maybe next year we'll see some combination of the two. I do expect reasoning + image generation soon though.
Chad Wick
2025-01-04 05:01:27 +0000 UTC
Looking forward to this documentary!
Anouar Mansour
2025-01-04 04:16:57 +0000 UTC
Sora looks like a dream/nightmare; Veo 2 looks more stable.
Very interesting about the inference-time verifier. Would love to know what is really happening in the background.
Looking forward to your documentary. Happy New Year.
Daniel Henderson
2025-01-03 23:53:21 +0000 UTC
I also don't fully see the analogy. What you are describing sounds very much like the next token prediction from the pretraining paradigm. Though I could imagine a class of "visual puzzles" (e.g., in the style of ARC), where you can clearly define right and wrong answers and where reasoning with pictures instead of words could make sense.
Lukas Bentkamp
2025-01-03 22:49:29 +0000 UTC
Seems they are, but before that the RL made the model more likely to produce logically coherent answers. Otherwise it'd just be CoT.
Anders
2025-01-03 22:44:37 +0000 UTC
Are you suggesting that the verifier model will include a physics model? Or that the verifier model will learn a physics model from video training?
Barnaby Golden
2025-01-03 22:43:17 +0000 UTC
Thank you again for your timely explanations that are continuing to keep me engaged, although I'm not at the technical level needed to fully appreciate the complexities. It's all rather fantastic, and on that note have a fantastic 2025. I'm on the edge of my seat with anticipation of what this year will bring!
Judy Hitchcock
2025-01-03 21:43:19 +0000 UTC
Thank you for trying. I had seen some similar results. People go flying... I had one where the table actually exploded. You'll notice I had added one point, that the couple remains seated calmly. That was done with Veo, correct? I had tried with Sora until I ran out of credits... I was trying the storyboard option, thinking that might appropriately place the various actions, but it actually was giving me worse results than the other models I had tried.
Daniel A Barbatti
2025-01-03 20:51:02 +0000 UTC
Possibly, though if the data is clean enough, even a simple signal like 'did it replicate' could be enough to brute-force tremendous progress, even if many of the errors are realistic alternatives. I'm sure many of o3's failed chains of thought contain flashes of brilliance.
Philip
2025-01-03 20:48:52 +0000 UTC
this was the closest one: https://drive.google.com/file/d/1ZbygOOvaJr4ODQCGfqA_f0uxP0IkP2GL/view?usp=sharing
Philip
2025-01-03 20:46:09 +0000 UTC
A possible Prompt:
----
Create a visually realistic 10-15 second simulation of the SpaceX Starship approaching the Moon for a lunar orbit insertion. Begin with a shot of the spacecraft gliding silently through the dark void of space, with the Moon looming large in the frame. The surface of the Moon should be highly detailed, showcasing craters, ridges, and shadows.
As the Starship gets closer, show its engines briefly firing for an orbital adjustment, with blue exhaust plumes dissipating into the vacuum. Add subtle details, such as navigation lights blinking on the spacecraft, and faint reflections of the Moon’s surface on the metallic hull of the Starship.
The background should include the blackness of space dotted with distant stars, with Earth appearing as a small, glowing blue sphere in the distance. Use cinematic camera angles, starting with a wide view of the Starship and Moon, then transitioning to a closer view of the spacecraft performing its final approach.
Incorporate a soft hum or low-pitched engine sound, alongside subtle mission control communication audio in the background. The sequence should feel awe-inspiring, precise, and futuristic, capturing the silent grandeur of the lunar approach.
Matthias Blank
2025-01-03 20:42:56 +0000 UTC
Yes to the first two, using high-quality data. And maybe insights from eclectic domains of data
Philip
2025-01-03 20:36:03 +0000 UTC
Doesn't the observation in the second part of the video - that scoring a solution is hard for image generation - negate the idea that it will be easy to score video generation against real-world footage? If the AI generates the car crash in a slightly different configuration, how much do you penalize that video? "Are the physics correct" is hard to judge without the AI being able to spin up a simulation, digitize the video elements into 3D, and then measure the differences. Otherwise it is a subjective judgement whether there should be an explosion and, if so, how big an explosion, etc.
Austin King
2025-01-03 19:38:33 +0000 UTC
In your opinion, is the best way to train a model to better answer open-ended, non-verifiable questions like philosophical ones going to be scaling up parameter counts and pretraining data, rather than this new paradigm of test-time compute (TTC)? Or do you see a third way?
John Merkowsky
2025-01-03 19:02:45 +0000 UTC
That first video looks very CGI to me. There is a disconnect in lighting and art style between the hippo and the Renaissance background.
Austin King
2025-01-03 18:55:02 +0000 UTC
I believe you asked for prompts to try. I have tried some version of this prompt on most of the video models out there. About the only one that produced anything that wasn't total garbage was Pika (and it was still not "there"). Here is the prompt: "Create a simple animated video where a couple is seated at a single table in a restaurant. The couple remains seated calmly. A single waiter, distinct from the seated man, walks up to the couple’s table carrying one plate of food. The waiter gently places the plate on the table. Immediately after the plate touches the table, the entire table (including the tabletop and base) tips over and falls to the floor."
Daniel A Barbatti
2025-01-03 18:52:30 +0000 UTC
So the o-series models are brute-forcing solutions. Still impressive results though.