
Are you going to spam this same link in every single thread about LLMs on HN? People have provided good arguments refuting whatever you're trying to say here, but you just keep posting the same thing while not engaging with anyone.


[flagged]


Yes, but please try to avoid repetition on HN. The GP's response was rude and broke the site guidelines, but https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... does look excessive to me.


No, the answers aren't just "plausible"; they are correct the vast majority of the time. You can try this for yourself, look at any benchmark or leaderboard, or just listen to the millions of people using them every day. I fact-check constantly when I use any LLM, and I can attest that I don't merely believe the answers I'm getting are correct; they actually are.

But they apparently don't get better, even though every metric tells us they do, because they can't? How about making an actual argument? Why is correctness "not a property of LLMs"? Do you have a point here that I'm missing? Whether or not Kahneman thinks there are two different systems of thinking in the human mind has absolutely no relevance here. Factuality isn't some magical circuit in the brain.

> No such thing can exist.

In the same way that no piece of clothing, piece of tech, piece of furniture, book, toothpick or paperclip can be environmentally friendly; yes. In any common usage, "environmentally friendly" simply means reduced impact, which is absolutely possible with LLMs, as demonstrated by bigger models being distilled into smaller, more efficient ones.
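For anyone unfamiliar with the distillation step mentioned above, here is a minimal sketch of a standard knowledge-distillation loss in PyTorch. The temperature, batch size and vocabulary size are illustrative assumptions, not anyone's actual training setup:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions, then pull the student toward the teacher via KL divergence.
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_student = F.log_softmax(student_logits / temperature, dim=-1)
        return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Toy usage: random logits standing in for a large teacher and a small student.
    teacher_logits = torch.randn(8, 32000)
    student_logits = torch.randn(8, 32000)
    print(distillation_loss(student_logits, teacher_logits))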

Discussing the environmental impact of LLMs has always been silly, given that we regularly blow more CO2 into the atmosphere to produce and render the newest Avengers movie or to spend one week in some marginally more comfortable climate.


No, they are not correct -- the answer it gives might accidentally be correct, but it cannot be trusted. You still need to do research to verify everything it says, so the only usable standpoint is to treat it as a bullshit generator, which it is very good at.


What's your definition of "correct" then? If a system is "accidentally correct" the majority of the time, when does it stop being "accidental"? You cannot trust any system in the way you want to define trust. No human, no computer, no thing in the universe is always correct. There is always a threshold.

I do research with LLMs all the time and I trust them, to a degree. Just like I trust any source and any human, to a degree. Just like I trust the output of any computer, to a degree. I don't need to verify everything they say, at all, in any way.

Genuine question, how do you think an LLM can generate "bullshit", exactly? How can it be that the system, when it doesn't know something, can output something that seems plausible? Can you explain to me how any system could do such a thing without a conception of reality and truth? Why wouldn't it just make something up that's completely removed from reality, and very obviously so, if it didn't have that?


Never. As long as it is a probabilistic token generator, it cannot be correct; it's that simple.

And it creates plausible text because it is trained on what humans have produced, so it looks plausible. As someone put it, they found a zero-day in the OS of the human brain.

https://www.theguardian.com/commentisfree/2023/may/08/ai-mac...

https://undark.org/2023/04/06/chatgpt-isnt-hallucinating-its...


At this point, I strongly urge you to think about what could possibly change your mind. If you can't think of anything, then this opinion is not founded on reasoning.

The text LLMs produce is not just plausible in a "looks like human text" sense, as you'd very well know if you actually thought about it. When ChatGPT generates a fake library that looks correct, that library must seem sensible enough to fool people. This can't just be a language trick anymore; it has to resemble the underlying structure of the problem space to look reasonable.


It indeed rests on very solid reasoning: a probabilistic predictor doesn't deal in facts. You'd need Cyc for that.


The fact that you refuse to engage with my points tells me otherwise.

You're drawing meaningless distinctions; anyone who has ever used Cyc will tell you that it makes massive mistakes and spits out incorrect information all the time.

But that is even true of humans, and every other system you can imagine. Facts aren't these magical things living in your brain, they're information with a high probability of accurately modeling reality.

When someone tells you x happened in y at time z, that only becomes a fact to you if the probability of the source being correct is high enough; that's it. 99% of your knowledge is only a fact to you because you extracted it from a source that your heuristics told you is trustworthy enough. There is never absolute certainty; it's all just probability.


> Facts aren't these magical things living in your brain, they're information with a high probability of accurately modeling reality.

Truly people have completely lost it because of the AI hype.

There are facts. They are not probabilistic; they are just that: facts. Despite Mencken's 1917 essay "A Neglected Anniversary", which became really popular, the bathtub didn't arrive in the United States in 1842, and it didn't become popular because President Fillmore installed one. A Kia ad in 2008 still referred to this without realizing it's a made-up story written to distract from World War I. https://chatgpt.com/c/6b1869a7-c0d7-46e9-bcb5-7a7c78dc3d53 https://sniggle.net/bathtub.php

Notably, in 1829 the Tremont Hotel in Boston had indoor plumbing and baths (copper and tin bathtubs), and in 1833 President Andrew Jackson had iron pipes installed in the Ground Floor Corridor and a bathing room in the East Wing. Well before 1842.

There's nothing probabilistic about this.


A system of checks and balances also costs orders of magnitude more money.


This is my fear regarding AI - it doesn't have to be as good as humans, it just has to be cheaper, and it will get implemented in business processes. Overall quality of service will degrade while profit margins increase.


You probably need that for the AI as well, though.


The point was that for many tasks, AI has failure rates similar to humans' while being significantly cheaper. The fact that human error rates can be reduced by spending even more money just isn't all that relevant.

Even if you had to implement checks and balances for AI systems, you'd still come away having spent way less money.


Exactly: we need a much more granular approach to evaluating intelligence and generality. Our current conception of intelligence largely works because humans share evolutionary history and partake in the same 10+ years of standardized training. As such, many dimensions of our intelligence correlate quite a bit, and you can likely infer a person's "general" proficiency or education by checking only a subset of those dimensions. If someone can't do arithmetic, then it's very unlikely that they'll be able to compute integrals.

LLMs don't share that property, though. Their distribution of proficiency over various dimensions and subfields is highly variable and only slightly correlated. Therefore, it makes no sense to infer the ability or inability to perform some magically global type of reasoning or generalization from just a subset of tasks, the way we do for humans.


Agreed on the first part, but as for LLMs not having correlated capabilities, I think we've seen they do. As the GPTs progress, mainly by model size, their scores across a battery of tests go up, e.g. OpenAI's paper for GPT-4, showing a leap in performance across a couple dozen tests.

Also found this: a Mensa test given to the top dozen frontier models.

https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-10...

That does seem to me to be demonstrating a global type of reasoning or generalization.

Also see the author's note that, at least with Claude, they seem to release a new version roughly every 20 IQ points.


There is a difference between poor reasoning and no reasoning. SOTA LLMs answer a significant number of these questions correctly. The likelihood of doing so without reasoning is astronomically small.

Reasoning in general is not a binary or global property. You aren't surprised when high-schoolers don't, after having learned how to draw 2D shapes, immediately go on to draw 200D hypercubes.
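To put "astronomically small" into rough numbers, here is a back-of-the-envelope sketch; the question count, number of answer options and score are hypothetical, purely for illustration:

    from scipy.stats import binom

    # Hypothetical numbers: 100 four-option questions, so a system with
    # no reasoning at all (pure guessing) gets each right with p = 0.25.
    n_questions, p_guess, n_correct = 100, 0.25, 60

    # Probability of at least n_correct right by chance; sf(k) gives P(X > k), hence n_correct - 1.
    p_by_chance = binom.sf(n_correct - 1, n_questions, p_guess)
    print(f"Chance of {n_correct}+ correct by guessing alone: {p_by_chance:.2e}")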


Granting that, the original point was that they're not excited about this particular paper unless (for example) it improves the networks' general reasoning abilities.

The problem was never "my LLM can't do addition" - it can write Python code!

The problem is "my LLM can't solve hard problems that require reasoning".


They don't, which you can easily check with any of the dozen web apps currently implementing the GPT-4o tokenizer.
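If you'd rather check locally than through a web app, here is a minimal sketch using OpenAI's tiktoken library (assuming the o200k_base encoding, which as far as I know is the one GPT-4o uses):

    import tiktoken

    # Assumption: o200k_base is the encoding used by GPT-4o; other models use different vocabularies.
    enc = tiktoken.get_encoding("o200k_base")

    word = "strawberry"
    token_ids = enc.encode(word)
    print(token_ids)                             # integer token IDs, usually fewer than the letter count
    print([enc.decode([t]) for t in token_ids])  # the sub-word pieces the model actually "sees"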


Here's hoping that the average HN commenter will actually read the paper and realize that the study was performed using GPT-3.5.


That's not the purpose though, clearly. If anything, you could make the argument that they're trading in on the association with the movie "Her"; that's it. Neither Sky nor the new voice model sounds particularly like ScarJo, unless you want to imply that her identity rights extend over 40% of all female voice types. People made the association because her voice was used in a movie that features a highly emotive voice assistant reminiscent of GPT-4o, which sama and others joked about.

I mean, why not actually compare the voices before forming an opinion?

https://www.youtube.com/watch?v=SamGnUqaOfU

https://www.youtube.com/watch?v=vgYi3Wr7v_g

-----

https://www.youtube.com/watch?v=iF9mrI9yoBU

https://www.youtube.com/watch?v=GV01B5kVsC0


> People made the association because her voice was used in a movie that features a highly emotive voice assistant reminiscent of GPT-4o, which sama and others joked about.

Whether you think it sounds like her or not is a matter of opinion, I guess. I can see the resemblance, and I can also see the resemblance to Jennifer Lawrence and others.

What Johansson is alleging goes beyond this, though. She alleges that Altman (or his team) reached out to her (or her team) asking her to lend her voice, she was not interested, she was asked again just two days before GPT-4o's announcement, and she declined again. Now there's a voice that, in her opinion, sounds a lot like her.

Luckily, the legal system is far more nuanced than just listening to a few voices and mentally comparing them to other voices individuals have heard over the years. They'll be able to figure out, as part of discovery, what led to the Sky voice sounding the way it does (intentionally using Johansson's likeness? coincidence? directly trained off her interviews/movies?), whether OpenAI were willing to slap Johansson's name onto the existing Sky during the presentation, whether the combination of the "her" tweet and the Sky voice was supposed to draw the subtle connection... This allegation is just the beginning.


I honestly don't think it is a matter of opinion, though. Her voice has a few very distinct characteristics, the most significant being the vocal fry / huskiness, that aren't present at all in either of the Sky models.

Asking for her vocal likeness is completely in line with just wanting the association with "Her" and the big PR hit that would come along with it. They developed voice models on two different occasions and hoped twice that Johansson would allow them to make that connection. Neither time did she accept, and neither time did they release a model that sounded like her. The two-day run-up isn't suspicious either, because we're talking about a general audio2audio transformer here. They could likely fine-tune it (if even that is necessary) on her voice in hours.

I don't think we're going to see this go to court. OpenAI simply has nothing to gain by fighting it. It would likely sour their relations with a bunch of media bigwigs and cause them bad press for years to come. Why bother, when they can simply disable Sky until the new voice mode releases, allowing them to generate a million variations of highly expressive female voices?


I hadn't heard the GPT-4o voice before. Comparing the video to the video of Johansson's voice in "Her", it sounds pretty similar. Johansson's performance there sounds pretty different from her normal speaking voice in the interview - more intentional emotional inflection, bubbliness, generally higher pitch. The GPT-4o voice sounds a lot like it.

From elsewhere in the thread, likeness rights apparently do extend to intentionally using lookalikes / soundalikes to create the appearance of endorsement or association.


Please try to actually understand what og_kalu is saying instead of being obtuse about something any grade-schooler intuitively grasps.

Imagine a legally blind person: they can barely see anything, just general shapes flowing into one another. In front of them is a table onto which you place a number of objects. The objects are close together and small enough that they merge into one blurred shape for our test person.

Now, when you ask the person how many objects are on the table, they won't be able to tell you! But why would that be? After all, all the information is available to them! The photons emitted from the objects hit the person's retina; the person has a visual interface, and they were given all the visual information they need!

Information lies within differentiation, and if the granularity you require is higher than the granularity of your interface, then it won't matter whether or not the information is technically present; you won't be able to access it.


I think we agree. ChatGPT can't count, as the granularity that requires is higher than the granularity ChatGPT provides.

Also, the blind person wouldn't answer confidently. A simple "the objects blur together" would be a good answer. I had ChatGPT telling me 5 different answers back to back above.


No, think about it. The granularity of the interface (the tokenizer) is the problem; the actual model could count just fine.

If the legally blind person had never had good vision or corrective instruments, had never been told that their vision is compromised, and had no other avenue (like touch) to disambiguate and learn, then they would tell you the same thing ChatGPT told you. Saying "the objects blur together" already implies an understanding that the objects are separate.

You can even see this in yourself. If you did not get an education in physics and were asked how many things a steel cube is made up of, you wouldn't answer that you can't tell. You would just say one, because you don't even know that atoms are a thing.


I agree, but I don't think that changes anything, right?

ChatGPT can't count; the problem is the tokenizer.

I do find it funny that we're trying to chat with an AI that is "equivalent to a legally blind person with no correction".

> You would just say one, because you don't even know that atoms are a thing.

My point also. I wouldn't start guessing "10", then "11", then "12" when asked to double-check, only to capitulate when told the correct answer.


You consistently refuse to take the necessary reasoning steps yourself. If your next reply also requires me to lead you every single millimeter to the conclusion you should have reached on your own, then I won't reply again.

First of all, it obviously changes everything. A shortsighted person requires prescription glasses; someone who is fundamentally unable to count is, from our perspective, incurable. LLMs could do all of these things if we either solve tokenization or simply adapt the tokenizer to the relevant tasks. This is already being done for program code; it's just that, aside from gotcha arguments, nobody really cares about letter counting that much.

Secondly, the analogy was meant to convey that the intelligence of a system is not at all related to the problems at its interface. No one would say that legally blind people are less insightful or intelligent; they just require you to transform input into representations that account for their interface problems.

Thirdly, as I thought was obvious, the tokenizer is not a uniform blur. For example, a word like "count" could be tokenized as "c|ount" or " coun|t" (note the space) or ". count" depending on the surrounding context. Each of these versions will consist of tokens of different lengths, with correspondingly different letter counts. If various people had told you that the cube had 10, 11 or 12 trillion constituent parts, depending on the random circumstances you talked to them in, then you would absolutely start guessing through the common answers you've been given.
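You can see the context dependence for yourself with a minimal sketch using the tiktoken library (assuming the o200k_base encoding; the exact splits will differ between tokenizers):

    import tiktoken

    # Assumption: the o200k_base encoding (used by GPT-4o); other tokenizers split differently.
    enc = tiktoken.get_encoding("o200k_base")

    # The same word in different surrounding contexts can map to different token pieces.
    for text in ["count", " count", ". count", "recount"]:
        pieces = [enc.decode([t]) for t in enc.encode(text)]
        print(f"{text!r:>12} -> {pieces}")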


I do agree I've been obtuse; apologies. I think I was just being too literal or something, as I do agree with you.


Apologies from me as well. I've been unnecessarily aggressive in my comments. Seeing very uninformed but smug takes on AI here over the last year has made me very wary of interactions like this, but you've been very calm in your replies and I should have been so as well.


Your English is absolutely fine and your answers in this thread clearly addressed the points brought up by other commenters. I have no idea what that guy is on about.


I mean sure, I could solve the issue with a better strategy, but what bugs me is that this is a problem in the first place. Should accurate time really still be an issue in 2024?


In theory it shouldn't be. In practice, for the vast majority of people, it isn't. If one needs precise time, there are ways to get it; most things don't need it that precisely. If the sync function requires precise time, then it should schedule syncs, or even use network time rather than system time (it needs a network to sync anyway).

Basically, if it's not an issue, people have no incentive to fix it. For the vast majority of people, this is not an issue.
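To illustrate the network-time option, here is a minimal sketch in Python using the third-party ntplib package (the pool.ntp.org host is just an example; this is not a claim about how any particular sync function actually works):

    import ntplib  # third-party: pip install ntplib
    from datetime import datetime, timezone

    # Query a public NTP pool and compare it against the local system clock.
    client = ntplib.NTPClient()
    response = client.request("pool.ntp.org", version=3)

    network_time = datetime.fromtimestamp(response.tx_time, tz=timezone.utc)
    print("NTP time:    ", network_time.isoformat())
    print("System time: ", datetime.now(timezone.utc).isoformat())
    print(f"Clock offset: {response.offset:+.3f} s")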

