
The capabilities shown yesterday look shockingly advanced.



A lot of the demo is very impressive, but some of it is stuff that already exists, just slightly more polished. Not really a huge leap for at least 60% of the demos.


The super impressive stuff is more subtle:

1. The price is halved for a better-performing model. A 1000x1000 image costs $0.003.

2. Cognitive ability on visuals went up sharply. https://github.com/kagisearch/llm-chess-puzzles

It solves twice as many puzzles despite being a minor update. It could just be better trained on chess, but it would be amazing if this carried over to the medical field as well. I might use it as a budget art director too - it's better at noticing subtle differences in color and dealing with highlights.
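For anyone curious what a single-task eval like that looks like in practice, here's a minimal, hypothetical sketch (not the actual kagisearch/llm-chess-puzzles harness): it sends a puzzle position to the chat completions API and checks whether the model's move matches the known solution. The puzzle data, prompt wording, and model name are illustrative assumptions.

    # Hypothetical sketch of a puzzle eval loop; not the repo's actual code.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Example puzzle: back-rank mate in one (FEN, best move in UCI notation)
    puzzles = [
        ("6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1", "d1d8"),
    ]

    def best_move(model, fen):
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": "Reply with the single best move in UCI notation and nothing else."},
                {"role": "user", "content": f"Position (FEN): {fen}"},
            ],
        )
        return resp.choices[0].message.content.strip().lower()

    def score(model):
        return sum(best_move(model, fen) == move for fen, move in puzzles) / len(puzzles)

    print("gpt-4o:", score("gpt-4o"))

A loop like this measures exactly one narrow skill, which is also why the caveat above about it possibly just being better chess training is fair.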


I'm not sure it's a better-performing model for text. I was just testing GPT-4o on a use case (generating AP multiple-choice questions), and 4o repeatedly generates questions with multiple correct answers and won't fix them when prompted.

(Providing the history to GPT-4 Turbo results in it fixing the MCQ just fine.)
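For context, "providing the history" just means resending the earlier exchange as prior messages so the model can repair its own output. A rough, hypothetical sketch of that pattern (model names and prompt text are illustrative, not the exact setup described above):

    # Rough sketch of re-sending the conversation so the model fixes the MCQ.
    from openai import OpenAI

    client = OpenAI()

    history = [
        {"role": "user", "content": "Write one AP Biology multiple-choice question with exactly one correct answer."},
        {"role": "assistant", "content": "<the flawed question GPT-4o produced>"},
        {"role": "user", "content": "Choices A and C are both defensible. Rewrite the question so only one choice is correct."},
    ]

    fixed = client.chat.completions.create(
        model="gpt-4-turbo",  # the model reported above to handle the fix
        messages=history,
    )
    print(fixed.choices[0].message.content)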


After some testing, I find it's not as good at code either. It is better at some things, but benchmarks apparently don't tell the whole story.


Yes, I've been using Phind extensively as a Google replacement, and after switching to GPT-4o the responses have gotten so much dumber.

I guess it's time to build something that lets you select which model to use after a Google search.
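A minimal sketch of what that could look like, assuming a simple per-query dispatcher over the chat completions API (the model names and tier labels are made up for illustration):

    # Hypothetical per-query model switcher; models and tiers are examples only.
    from openai import OpenAI

    client = OpenAI()

    MODELS = {"fast": "gpt-4o", "careful": "gpt-4-turbo"}

    def answer(query, tier="careful"):
        resp = client.chat.completions.create(
            model=MODELS[tier],
            messages=[{"role": "user", "content": query}],
        )
        return resp.choices[0].message.content

    # e.g. answer("why is my SQL query slow?", tier="fast")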


The benchmark you're linking in point 2 is genuinely meaningless because it's one specific task. I could easily make a benchmark for another task (one I'm personally working on) where, e.g., Gemini is much better than GPT-4 Vision and any Claude model (not sure about GPT-4o yet) and then post that as a benchmark. Does that mean Gemini is better at image reasoning? No.

These benchmarks are really missing the mark and I hope people here are smart enough to do their own testing or rely on tests with a much bigger variety of tasks if they want to measure overall performance. Because currently we're at a point where the big 3 (GPT, Claude, Gemini) each have tasks that they beat the other two at.


It's a test used for humans. I'm personally not a big fan of the popular benchmarks because they are, ironically, the narrow tasks these models are trained on. In fact, GPT-4o's performance on key benchmarks has been higher, but it has flopped on every real-world task we'd been using other models for.

They're best tested on the kinds of tasks you would give humans. GPT-4 is still the best contender on AP Biology, which is a legitimately difficult benchmark.

GPT tends to work with whatever you throw at it, while Gemini just hides behind arbitrary benchmarks. If there are tasks that some models are better than others at, then by all means let's highlight them, rather than acting defensive when another model does much better at a certain task.


I'm reminded of people talking about the original iPhone demo and saying 'yeah, but this has all been done before...'. Sure, but this is the first time it's in a package that's convenient.


How so? It's obviously convenient for it all to be there in ChatGPT, but I'm commenting more on the "this is so Earth-shattering" takes that are prevalent on platforms like Twitter (usually from grifters), when in reality, while it will change the world, much of this tool set already existed. So the effect won't be as dramatic. OpenAI has already seen user numbers slip; I think making this free is essentially an admission of that. In terms of the industry, it would be far more "Earth-shattering" if OpenAI became the de facto assistant on iOS, which looks increasingly likely.


This is earth shattering because _it's all in the same place_. You don't need to fuck around with four different models to get it working for 15 minutes once on a Saturday at 3am.

It just works.

Just like how the iPhone had nothing new in it; all the tech had been demoed years before.


Not just all in the same place, but it can cross reference multiple modalities simultaneously and in real-time. How are people not blown away by this?

Watching the demos there were quite a few times where I thought “no way, that’s incredible.”


Yes, it is very cool, but I think you're missing the point that many of these features, because they were already available, aren't world changing. They've been in the ether for a while. Will they make things more convenient? Yes. Is it fundamentally going to change how we work/live? At the moment, probably not.


First of all, no, these features weren't available. We have not seen a model that can seamlessly blend modalities on the fly, in real time. We also haven't seen a model that can interpret and respond with proper inflection and tone. We've been able to mimic voices, but not like this. And certainly not singing.

Secondly, it must be sad living with such a lack of wonder. Is that how you judge everything?

We discovered the higgs boson. Eh, it won’t change how we live.

We just launched a new rocket. Eh, it won’t change how we live.


Many of these products and features were quite literally already available. Proper voice inflection isn't that new either; it's cool, but it hasn't changed my life and isn't really going to.

Lack of wonder? No, I think it's very cool. But you have to differentiate between what is going to fundamentally change our lives and the world and what isn't. GPT/LLMs/AI will fundamentally change my life over time; of the features shown today, 70% won't. They will replace existing products and make things more streamlined, but they aren't really going to shift the world.


Not this level of quality in terms of voice inflection, and the other features. And the integration between them too. This is beyond anything I've seen.

It seems like you're overgeneralizing to the point of missing what is innovative here. And I do think making AI real-time, and making it work well, is innovative and will change our lives.



