I want to see something where we have a big piece of code, and a big standards document it purports to implement, and the system can answer questions like "Is this part of the spec implemented? Where is it implemented? What does this piece of code mean (w.r.t. the spec)? If I implemented this part of the spec, where would the changes go?"
I'm pretty excited about the increased context length (e.g. in my other comment here[0]), but I'm kind of disappointed by the examples here.
The codebase is 100k lines, but the tasks they gave it seemed to focus on just hundreds of those lines. The examples are probably largely independent, so it doesn't seem like this is really flexing anything a relatively simple RAG approach with a much smaller context window couldn't handle. The prompts said "the demo that ...", so it's a matter of identifying the demo in question and looking at just that code, which is a much smaller necessary context. There was the "use the GUI approach from other examples" task, which gets closer, but that's still another distinct little bit of code.
In other words, while the codebase has lots of lines, the actual inference across them seemed to use relatively few of them, and identifying the relevant lines didn't seem that hard to me based on the tasks given. That means it could be done with some retrieval and a much smaller context window.
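Something like this rough sketch is all I mean by retrieval (the directory, the demo, and the keyword heuristic are all invented for illustration): score each example file against the question, take the best match, and hand the model only that file instead of the whole repository.

    // Naive retrieval sketch: pick the one three.js example a question is about.
    // Everything here is hypothetical; it just illustrates how small the needed
    // context becomes once the right demo has been identified.
    import * as fs from "fs";
    import * as path from "path";

    function keywordScore(question: string, text: string): number {
      const words = question.toLowerCase().match(/[a-z_]{4,}/g) ?? [];
      const haystack = text.toLowerCase();
      // Count distinct question words that show up in the file (or its name).
      return new Set(words.filter((w) => haystack.includes(w))).size;
    }

    function pickRelevantExample(question: string, examplesDir: string): string {
      const files = fs
        .readdirSync(examplesDir)
        .filter((f) => f.endsWith(".html") || f.endsWith(".js"));
      let best = files[0];
      let bestScore = -1;
      for (const f of files) {
        const src = fs.readFileSync(path.join(examplesDir, f), "utf8");
        const score = keywordScore(question, f + " " + src);
        if (score > bestScore) {
          bestScore = score;
          best = f;
        }
      }
      return path.join(examplesDir, best);
    }

    // The prompt then only carries one demo's source (a few hundred lines)
    // rather than all 100k.
    const question = "In the demo with the animated flamingos, make them fly faster";
    const file = pickRelevantExample(question, "three.js/examples");
    const prompt = `${question}\n\n---\n${fs.readFileSync(file, "utf8")}`;

A crude keyword match like this already gets most of the way there for prompts phrased as "the demo that ..."; a real RAG setup would swap in embeddings, but the point stands.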
From the title, I thought it would be loading the 100k lines into the context and then asking deeper questions like "find the bug" where the bug spans several function calls, or something like that. Something that wouldn't be trivial to accomplish with current techniques.
Exactly. Give us access to the model and let independent researchers test it. OpenAI did this with GPT-4, opening access publicly and giving deeper access to researchers within and outside of Microsoft.
I simply don't believe the model is that good. Otherwise, maybe try to compete with OpenAI directly?
Wonder why they're not just giving us access, if it's indeed so good? Seems it's just to generate some noise and hype around Gemini. Hardly believable after the previous faked demo, as someone already said.
Google faces a different calculus than Microsoft/OpenAI when throwing these things out. It's just like Google Cloud. They have huge, valuable first-party workloads that compete for the hardware resources that would be used by generally-available free AI toys.
For Microsoft it doesn't make a difference. They are taking their own cash, investing it in OpenAI, and then turning right around and booking it as revenue. As a bonus it makes Google look wrong-footed. But fundamentally Microsoft doesn't care how much money they torch doing this.
Even the demo is now careful to show curated but possible things. They learned their lesson.
The code changes are the most common tutorials you can find on the web: adding a speed slider, for example, and the terrain tutorials are literally called "height maps" and focus on making the terrain taller or flatter.
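For scale, this is roughly what the "speed slider" change amounts to in the stock three.js + lil-gui style the examples use (the spinning box and the parameter name are placeholders, not the demo's actual code):

    // Minimal sketch of a tutorial-grade "add a speed slider" change.
    import * as THREE from "three";
    import GUI from "lil-gui";

    const params = { speed: 1.0 };

    const scene = new THREE.Scene();
    const camera = new THREE.PerspectiveCamera(60, innerWidth / innerHeight, 0.1, 100);
    camera.position.z = 3;

    const mesh = new THREE.Mesh(new THREE.BoxGeometry(), new THREE.MeshNormalMaterial());
    scene.add(mesh);

    const renderer = new THREE.WebGLRenderer();
    renderer.setSize(innerWidth, innerHeight);
    document.body.appendChild(renderer.domElement);

    // The entire "feature": one slider that scales the animation rate.
    const gui = new GUI();
    gui.add(params, "speed", 0, 5, 0.1).name("rotation speed");

    const clock = new THREE.Clock();
    renderer.setAnimationLoop(() => {
      mesh.rotation.y += params.speed * clock.getDelta();
      renderer.render(scene, camera);
    });

It's the sort of change that appears nearly verbatim in countless tutorials, which is the point.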
To be fair, they mostly faked the near instantaneous, real-time flow of the conversations. The answers were, as far as I know, legit. But I still agree that we should be skeptical.
The prompts they used were also different from the ones shown: "is this the right order" was actually "is this the right order, consider the distance from the sun". They put this in their post on the Google dev blog.
This one seems to be super straightforward about timeliness and capabilities, but the examples might be a bit simpler than people think. This is pretty amazing, but like someone else said, you could achieve similar results with RAG, given the lack of novelty in these questions and the fact that each dealt with pretty independent examples as opposed to using custom code developed elsewhere in the codebase.
It's really interesting to see where all of this is going. I guess a large part of the best practices behind clearly naming things for human readers has also enabled training and evaluating ML models on code.
Meta, but the speaker sounds eerily close to Mark Zuckerberg.
Na we’ll all be moved to perpetual on-call, every day an endless fire drill as hundreds of services are launched on top of the crumbling crash looping burning landscape of millions of services launched last quarter, a Mad Max world of endless adrenaline and New Relic AI-enhanced alerts.
Is this where unit tests will be very useful, where you ask it to fix all the bugs it finds and make sure the code still passes all the unit tests? This is where all of GitHub's public repos will get really interesting forks.
What they did in this demo is collect a bunch of small demos, small enough that earlier models could have answered questions about them, or tinkered with them, individually, and mostly demonstrated that the model could figure out which demo was pertinent to the question being asked, and focus only on that.
But the input was still divisible into self-contained little bits -- so this is still somewhat different from dumping the full source code for a database engine into it, and having it answer questions about, say, where foreign key constraints are implemented -- or, more dramatically, how several different parts of the codebase work together to implement, say, transaction isolation levels.
I like the fact that this is a Three.js example because it implies some concept of 3D understanding of worldspace as x-y-z (not exactly in the examples given), and I have had a hard time just getting x-y plots right with GPT-4. This would be a nice bonus in addition to the context increase.