Why wordfreq will not be updated

voytec · 2024-09-18T12:33:12 1726662792

I agree in general but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, multiple keyword repetitions and focus on "indexability" instead of readability, made the web a less than ideal source for such analysis long before LLMs.

It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.

doe_eyes · 2024-09-18T14:35:12 1726670112

> I agree in general but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, multiple keyword repetitions and focus on "indexability" instead of readability, made the web a less than ideal source for such analysis long before LLMs.

Blog spam was generally written by humans. While it sucked for other reasons, it seemed fine for measuring basic word frequencies in human-written text. The frequencies are probably biased in some ways, but this is true for most text. A textbook on carburetor maintenance is going to have the word "carburetor" at way above the baseline. As long as you have a healthy mix of varied books, news articles, and blogs, you're fine.

In contrast, LLM content is just a serpent eating its own tail - you're trying to build a statistical model of word distribution off the output of a (more sophisticated) model of word distribution.
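The carburetor point above can be sketched with toy numbers. This is a minimal illustration, not how wordfreq computes anything: the three texts and the `relative_freq` helper are made up for the example, and the tokenizer is a deliberately crude regex.

```python
import re
from collections import Counter

def relative_freq(texts, word):
    """Share of all tokens in `texts` that equal `word` (crude lowercase tokenizer)."""
    tokens = [t for text in texts for t in re.findall(r"[a-z']+", text.lower())]
    return Counter(tokens)[word] / len(tokens)

# Hypothetical mini-corpus: one topical manual plus two unrelated texts.
manual = "Clean the carburetor, then refit the carburetor and test the carburetor."
news = "The council voted on the budget and the mayor signed it."
blog = "We hiked the ridge trail and made soup at the cabin."

manual_only = relative_freq([manual], "carburetor")
mixed = relative_freq([manual, news, blog], "carburetor")

print(f"manual only: {manual_only:.3f}")
print(f"mixed corpus: {mixed:.3f}")
```

The topical text alone puts the word far above what the mixed corpus reports, which is the sense in which a varied mix washes out per-document bias.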

weinzierl · 2024-09-18T18:21:51 1726683711

Isn't it the other way around?

SEO text carefully tuned to tf-idf metrics and keyword-stuffed to the empirically determined threshold that Google just allows should have unnatural word frequencies.

LLM content should just enhance and cement the status quo word frequencies.

Outliers like the word "delve" could just be sentinels, carefully placed like trap streets on a map.
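For readers who haven't met tf-idf, here is a rough sketch of the textbook variant (term frequency times log of inverse document frequency). It is an assumption for illustration only: search engines do not publish their ranking formulas, and the three example documents and the `tf_idf` helper are invented here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical documents: the first one is keyword-stuffed.
docs = [
    "cheap flights cheap flights book cheap flights today",
    "we compared flights and trains for the trip",
    "the trains were late again today",
]
scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
print(top_term, round(scores[0][top_term], 3))
```

Repeating a term inflates its term frequency in that one page, which is exactly the kind of distortion that would show up in a word-frequency count over SEO text.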

mlsu · 2024-09-18T22:32:54 1726698774

But you can already see it with Delve. Mistral uses "delve" more than baseline, because it was trained on GPT.

So it's classic positive feedback. LLM uses delve more, delve appears in training data more, LLM uses delve more...

Who knows what other semantic quirks are being amplified like this. It could be something much more subtle, like cadence or sentence structure. I already notice that GPT has a "tone" and Claude has a "tone" and they're all sort of "GPT-like." I've read comments online that stop and make me question whether they're coming from a bot, just because their word choice and structure echoes GPT. It will sink into human writing too, since everyone is learning in high school and college that the way you write is by asking GPT for a first draft and then tweaking it (or not).

Unfortunately, I think human and machine generated text are entirely miscible. There is no "baseline" outside the machines, other than from pre-2022 text. Like pre-atomic steel.
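The feedback loop described above can be sketched as a toy recurrence. Everything here is a made-up assumption for illustration: the `overuse` factor (how much a model over-represents a word relative to its training data), the `synthetic_share` of the next training set that is model output, and the baseline rate are not measured values.

```python
def simulate_feedback(baseline, overuse, synthetic_share, generations):
    """Word rate in model output per generation, under a toy mixing model.

    Each generation trains on a blend of human text (at `baseline` rate)
    and the previous model's output, then overuses the word by `overuse`.
    """
    rates = []
    rate = baseline  # before any model exists, all text is at the human baseline
    for _ in range(generations):
        training_rate = (1 - synthetic_share) * baseline + synthetic_share * rate
        rate = overuse * training_rate
        rates.append(rate)
    return rates

# Hypothetical parameters: 0.1% baseline, 1.5x overuse, 30% synthetic data.
rates = simulate_feedback(baseline=0.001, overuse=1.5, synthetic_share=0.3, generations=20)
for generation, rate in enumerate(rates[:5], start=1):
    print(generation, f"{rate:.6f}")
```

In this toy model the rate climbs every generation but levels off as long as `overuse * synthetic_share` stays below 1; once that product reaches 1, the rate grows without bound. Which regime real training pipelines are in is an open question that this sketch does not answer.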

taneq · 2024-09-18T22:37:28 1726699048

> LLM uses delve more, delve appears in training data more, LLM uses delve more...

Some day we may view this as the beginnings of machine culture.

mlsu · 2024-09-18T22:40:16 1726699216

Oh no, it's been here for quite a while. Our culture is already heavily glued to the machine. The way we express ourselves, the language we use, even our very self-conception originates increasingly in online spaces.

Have you ever seen someone use their smartphone? They're not "here," they are "there." Forming themselves in cyberspace -- or being formed, by the machine.

Matumio · 2024-09-23T14:59:50 1727103590

I think they meant culture in the sense of knowledge that gets passed down from one generation to the next. Not a human culture of using machines, but a machine culture of using human languages.

mlsu · 2024-09-23T16:52:52 1727110372

Consider that the algorithm cannot evolve without human interaction. That's what I'm saying, it's a symbiote to us. If you consider "weights in the Instagram recommendation algorithm" to be "the machine", what we are talking about here has been happening for a long time now and has seen many generations, with each entity influencing the other.

I don't think we'll have true machine culture until we have fully autonomous agents in the wild that are interacting with the world independently on its own terms. Right now the substrate is text which comes from a human mind -- it does not arise naturally from nothing. So the machine is a symbiote for now until we solve some difficult robotics problems.

Matumio · 2024-09-27T07:25:48 1727421948

Hm, it's probably true that recommendation algorithms do something similar already, training on "human likes" that were influenced by the previous generation. But "human language" is a richer medium to carry information.

I don't think you need to be independent or autonomous to develop a culture. And a lot of human culture was passed down over generations without understanding why it worked. We just imitate the behaviour and rituals from our most successful ancestors or role models.

If new LLMs can access the past generation's knowledge of how to please human evaluators, they will use it. It's not a deliberate decision by an "agent", it's just the best text source to copy from. This is a new feedback loop between generations of assistants, and it bypasses whatever the human designer had in mind. Phrases like "it is always best to ask an expert" will pop up just because you tuned the LLM to sound like a helpful assistant, and that's what helpful assistants sound like in the training data. You'd have to actively steer the new generation away from using their ancestral knowledge.

I guess it comes down to what your definition of "culture" is. There is no targeted teaching of the next generation, for example - but is this a requirement? I agree that talking about "machine culture" right now sounds like a stretch, but now I wonder what pieces are actually missing.

taneq · 2024-09-24T04:12:20 1727151140

Yep I was going for more "the machines have their own culture increasingly independent from ours."

taneq · 2024-09-19T12:24:59 1726748699

chat is this real?

bryanrasmussen · 2024-09-19T05:12:19 1726722739

is the use of miscible here a clue? Or just some workplace vocabulary you've adapted analogically?

mlsu · 2024-09-19T06:18:39 1726726719

Human me just thought it was a good word for this. It implies some irreversible process of mixing, I think that characterizes this process really well.

noduerme · 2024-09-19T08:06:42 1726733202

There were dozens of 20th Century ideological movements which developed their own forms of "Newspeak" in their own native languages. Largely, natural human dialog between native speakers and between those opposed to the prevailing regime recoils violently at stilted, official, or just "uncool" usages in daily vernacular. So I wouldn't be too surprised to see a sharp downtick in the popular use of any word that becomes subject to an LLM's positive-feedback loop.

Far from saying the pool of language is now polluted, I think we now have a great data set to begin to discern authentic from inauthentic human language. Although sure, people on the fringes could get caught in a false positive for being bots, like you or me.

The biggest LLM of them all is the daily driver of all new linguistic innovation: Human society, in all its daily interactions. The quintillions of daily phrases exchanged and forever mutating around the globe - each mutation of phrase interacting with its interlocutor, and each drawing from not the last 500,000 tokens but the entire multi-modal, if you will, experience of each human to date in their entire lives - vastly eclipses anything any hardware could ever emulate given the current energy constraints. Software LLMs are just a state machine stuck in a moment in time. At best they will always lag, the way Stalinist language lagged years behind the patois of average Russians, who invented daily linguistic dodges to subvert and mock the regime. The same process takes place anywhere there is a dominant official or uncool accent or phrasing. The ghetto invents new words, new rhythm, and then it becomes cool in the middle class. The authorities never catch up, precisely because the use of subversive language is humanity's immune system against authority.

If there is one distinctly human trait, it's sniffing out anyone who sounds suspiciously inauthentic. (Sadly, it's also the trait that leads to every kind of conspiracy theorizing imaginable; but this too probably confers in some cases an evolutionary advantage). Sniffing out the sound of a few LLMs is already happening, and will accelerate geometrically, much faster than new models can be trained.

mlsu · 2024-09-23T18:56:22 1727117782

Really insightful.

I'm a little more cautious though. I think GPT will be way more integrated, simply because it's useful. Stalinist language was artificial, in the sense that it was basically imposed on you from outside for no good reason. When you wanted to get real stuff done (either talking to close friends, being productive with colleagues, etc) you wouldn't use socialist newspeak because it got in the way. GPT will be imposed by the outside world, but it's actually a useful thing to be able to converse with a language model; you'll do it every day at work, when buying things, when using your phone/PC.

And also, unlike in USSR times, so much of our communication is online and visible. It would not surprise me if we develop a model that can train continuously on the firehose. Text is small. Data rate of every person on earth speaking simultaneously:

- 150 words per minute spoken

- 150 words × (5 characters/word + 1 space) = 150 × 6 = 900 characters per minute

- 1 byte per char = 900 bytes/min = 15 bytes/sec

- 15 bytes / sec * 8,000,000,000 people speaking continuously = 120 gigabytes/second

That's a lot but it's not even the bandwidth of a single consumer GPU.
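The back-of-envelope estimate above, written out so the assumptions are easy to change. The inputs are the ones given in the list (150 words per minute, 5 characters plus a space per word, 1 byte per character, 8 billion speakers); the function name is just for this sketch.

```python
def speech_data_rate_gb_per_s(words_per_minute, chars_per_word, bytes_per_char, people):
    """Aggregate text data rate in gigabytes per second (1 GB = 1e9 bytes)."""
    bytes_per_second_per_person = words_per_minute * chars_per_word * bytes_per_char / 60
    return bytes_per_second_per_person * people / 1e9

per_person = 150 * (5 + 1) * 1 / 60  # bytes per second for one speaker
total = speech_data_rate_gb_per_s(
    words_per_minute=150,
    chars_per_word=5 + 1,  # 5 letters plus one space
    bytes_per_char=1,
    people=8_000_000_000,
)
print(per_person, "bytes/s per person")
print(total, "GB/s for everyone at once")
```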

bryanrasmussen · 2024-09-19T08:17:09 1726733829

humans also lag humans, the future may already be spoken, but the slang is not evenly memed out yet.

jazzyjackson · 2024-09-19T07:50:53 1726732253

If you think that's niche wait til you hear about man-machine miscegenation

derefr · 2024-09-18T18:50:13 1726685413

1. People don't generally use the (big, whole-web-corpus-trained) general-purpose LLM base-models to generate bot slop for the web. Paying per API call to generate that kind of stuff would be far too expensive; it'd be like paying for eStamps to send spam email. Spambot developers use smaller open-source models, trained on much smaller corpuses, sized and quantized to generate text that's "just good enough" to pass muster. This creates a sampling bias in the word-associational "knowledge" the model is working from when generating.

2. Given how LLMs work, a prompt is a bias — they're one-and-the-same. You can't ask an LLM to write you a mystery novel without it somewhat adopting the writing quirks common to the particular mystery novels it has "read." Even the writing style you use in your prompt influences this bias. (It's common advice among "AI character" chatbot authors, to write the "character card" describing a character, in the style that you want the character speaking in, for exactly this reason.) Whatever prompt the developer uses, is going to bias the bot away from the statistical norm, toward the writing-style elements that exist within whatever hypersphere of association-space contains plausible completions of the prompt.

3. Bot authors do SEO too! They take the tf-idf metrics and keyword stuffing, and turn it into training data to fine-tune models, in effect creating "automated SEO experts" that write in the SEO-compatible style by default. (And in so doing, they introduce unintentional further bias, given that the SEO-optimized training dataset likely is not an otherwise-perfect representative sampling of writing style for the target language.)

travisjungroth · 2024-09-19T04:58:19 1726721899

On point 1, that’s surprising to me. A 2,000 word blog post would be 10 cents with GPT-4o. So you put out 1,000 of them, which is a lot, for $100.
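The arithmetic here, as a small sketch. The 10 cents per post is the figure from the comment, not a quoted price. The per-token helper is there only to show how such a figure could be derived; its tokens-per-word ratio and price are placeholders, not actual GPT-4o pricing.

```python
def batch_cost_usd(posts, usd_per_post):
    """Total cost of generating `posts` articles at a flat per-post cost."""
    return posts * usd_per_post

def cost_per_post_usd(words, tokens_per_word, usd_per_million_tokens):
    """Per-post cost from a per-token price (all three inputs are assumptions)."""
    return words * tokens_per_word * usd_per_million_tokens / 1_000_000

# Figures from the comment: 1,000 posts at roughly $0.10 each.
total = batch_cost_usd(posts=1_000, usd_per_post=0.10)
print(f"${total:.2f}")

# Placeholder inputs; substitute the provider's current price list.
example = cost_per_post_usd(words=2_000, tokens_per_word=1.3, usd_per_million_tokens=10.0)
print(f"${example:.4f} per post under the placeholder inputs")
```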

derefr · 2024-09-19T16:31:35 1726763495

There are two costs associated with using a hosted inference platform: the OpEx of API calls, and the CapEx of setting up an account in the first place. This second cost is usually trivial, as it just requires things any regular person already has: an SSO account, a phone number for KYC, etc.

But, insofar as your use-case is against the TOUs of the big proprietary inference platforms, this second cost quickly swamps the first cost. They keep banning you, and you keep having to buy new dark-web credentials to come back.

Given this, it’s a lot cheaper and more reliable (you might summarize these as “more predictable costs”) to design a system around a substrate whose “immune system” won’t constantly be trying to kill the system. Which means either your own hardware, or a “bring your own model” inference platform like RunPod/Vast/etc.

(Now consider that there are a bunch of fly-by-night BYO-model hosted inference platforms, that are charging unsustainable flat-rate subscription prices for use of their hardware. Why do these exist? Should be obvious now, given the facts already laid out: these are people doing TOU-violating things who decided to build their own cluster for doing them… and then realized that they had spare capacity on that cluster that they could sell.)

travisjungroth · 2024-09-19T19:07:32 1726772852

This makes sense. But now I’m wondering if people here are speaking from experience or reasoning their way into it. Like are there direct reports of which models people are using for blogspam, or is it just what seems rational?

brazzy · 2024-09-19T05:58:19 1726725499

But then you'll be competing for clicks with others who put out 1,000,000 posts for less cost because they used a small, self-hosted model.

baq · 2024-09-19T06:59:07 1726729147

if you are a sales & marketing intern, have a potato laptop and $100 budget to spend on seo, you aren't going to be self hosting anything even if you know what that means.

nerdponx · 2024-09-19T08:44:59 1726735499

This is about high-volume blog/news-spam created specifically to serve ads and affiliate links, not about occasional content marketing for legitimate companies.

lbhdc · 2024-09-18T18:28:03 1726684083

> LLM content should just enhance and cement the status quo word frequencies.

TFA mentions this hasn't been the case.

flakiness · 2024-09-18T22:49:35 1726699775

Would you mind dropping the link talking about this point? (context: I'm a total outsider and have no idea what TFA is.)

girvo · 2024-09-18T22:58:49 1726700329

TFA means "the featured article", so in this case the "Why wordfreq will not be updated" link we're talking about.

adastra22 · 2024-09-18T23:15:31 1726701331

To be pedantic, the F in TFA has the same meaning as the F in RTFM.

It’s the same origin. On Slashdot (the HN of the early 00’s) people would admonish others to RTFA. Then they started using it as a referent: TFA was the thing you were supposed to have read.

girvo · 2024-09-19T02:59:18 1726714758

Oh that I'm aware of, but it's softened over time too haha

I miss the old Atomic MPC forums in the ~00s.

jnordwick · 2024-09-19T12:52:10 1726750330

The Fucking Article, from RTFA - Read the Fucking Article - and RTFM - Read the Fucking Manual/Manpage

tigerlily · 2024-09-19T09:27:15 1726738035

  Too deep we delved, and awoke the ancient delves.

brudgers · 2024-09-19T04:42:41 1726720961

> serpent eating its own tail

GOGI.

romwell · 2024-09-19T05:06:02 1726722362

The Inhuman Centipede

bondarchuk · 2024-09-18T13:26:48 1726666008

At some point though you have to acknowledge that a specific use of language belongs to the medium through which you're counting word frequencies. There are also specific writing styles (including sentence/paragraph sizes, unnecessary repetitions, focusing on other metrics than readability) associated with newspapers, novels, e-mails to your boss, anything really. As long as text was written by a human who was counting on at least some remote possibility that another human might read it, this is way more legitimate use of language than just generating it with a machine.

ToucanLoucan · 2024-09-18T13:24:44 1726665884

This feels like a second, magnitudes larger Eternal September. I wonder how much more of this the Internet can take before everyone just abandons it entirely. My usage is notably lower than it was in even 2018, it's so goddamn hard to find anything worth reading anymore (which is why I spend so much damn time here, tbh).

wpietri · 2024-09-18T13:31:04 1726666264

I think it's an arms race, but it's an open question who wins.

For a while I thought email as a medium was doomed, but spammers mostly lost that arms race. One interesting difference is that with spam, the large tech companies were basically all fighting against it. But here, many of the large tech companies are either providing tools to spammers (LLMs) or actively encouraging spammy behaviors (by integrating LLMs in ways that encourage people to send out text that they didn't write).

jsheard · 2024-09-18T15:04:12 1726671852

The fight against spam email also led to mass consolidation of what was supposed to be a decentralised system though. Monoliths like Google and Microsoft now act as de-facto gatekeepers who decide whether or not you're allowed to send emails, and there's little to no transparency or recourse to their decisions.

There's probably an analogy to be made about the open decentralised internet in the age of AI here, if it gets to the point that search engines have to assume all sites are spam by default until proven otherwise, much like how an email server is assumed guilty until proven innocent.

jerf · 2024-09-18T14:23:17 1726669397

Another problem with this arms race is that spam emails actually are largely separable from ham emails for most people... or at least they were, for most of their run. The thousandth email that claims the UN has set aside money for me due to my non-existent African noble ancestry that they can't find anyone to give it to and I just need to send the Thailand embassy some money to start processing my multi-million yuan payout and send it to my choice of proxy in Colombia to pick it up is quite different from technical conversation about some GitHub issue I'm subscribed to, on all sorts of metrics.

However, the frontline of the email war has shifted lately. Now the most important part of the war is being fought over emails that look just like ham, but aren't. Business frauds where someone convinces you that they are the CEO or CFO or some VP and they need you to urgently buy this or that for them right now no time to talk is big business right now, and before you get too high-and-mighty about how immune you are to that, they are now extremely good at looking official. This war has not been won yet, and to a large degree, isn't something you necessarily win by AI either.

I think there's an analogy here to the war on content slop. Since what the content slop wants is just for you to see it so they can serve you ads, it doesn't need anything else that our algorithms could trip on, like links to malware or calls to action to be defrauded, or anything else. It looks just like the real stuff, and telling that it isn't could require a human rather vast amounts of input just to be mostly sure. Except we don't have the ability to authenticate where it came from. (There is no content authentication solution that will work at scale. No matter how you try to get humans to "sign their work" people will always work out how to automate it and then it's done.) So the one good and solid signal that helps in email is gone for general web content.

I don't judge this as a winning scenario for the defenders here. It's not a total victory for the attackers either, but I'd hesitate to even call an advantage for one side or the other. Fighting AI slop is not going to be easy.

ToucanLoucan · 2024-09-18T13:52:20 1726667540

> but spammers mostly lost that arms race

I'm not saying this is impossible but that's going to be an uphill sell for me as a concept. According to some quick stats I checked I'm getting roughly 600 emails per day, about 550 of which go directly to spam filtering, and of the remaining 50, I'd say about 6 are actually emails I want to be receiving. That's an impressive amount overall for whoever built this particular filter, but it's also still a ton of chaff to sort wheat from and as a result I don't use email much for anything apart from when I have to.

Like, I guess that's technically usable, I'm much happier filtering 44 emails than 594 emails? But that's like saying I solved the problem of a flat tire by installing a wooden cart wheel.

It's also worth noting there that if I do have an email that's flagged as spam that shouldn't be, I then have to wade through a much deeper pond of shit to go find it as well. So again, better, but IMO not even remotely solved.

dhosek · 2024-09-18T14:29:51 1726669791

I’m not sure what you’ve done to get that level of spam, but I get about 10 spam emails a day at most and that’s across multiple accounts including one that I’ve used for almost 30 years and had used on Usenet which was the uber-spam magnet. A couple newer (10–15 year old) addresses which I’ve published on webpages with mailto links attract maybe one message a week and one that I keep for a specialized purpose (fiction and poetry submissions) gets maybe one to two messages per year, mostly because it’s of the form example@example.com so easily guessed by enterprising spammers.

Looking at the last day’s spam¹ I have three 419-style scams (widows wanting to give away their dead husbands’ grand piano or multi-million euro estate) and three phishing attempts. There are duplicate messages in each category.

About fifteen years ago, I did a purge of mailing list subscriptions and there’s very little that comes in that I don’t want, most notably a writer who’s a nice guy, but who interpreted my question about a comment he made on a podcast as an invitation to be added to his manually managed email list and given that it’s only four or five messages a year, I guess I can live with that.

⸻

1. I cleaned out spam yesterday while checking for a confirmation message from a purchase.

wpietri · 2024-09-18T14:38:20 1726670300

I'm having a hard time finding reliably sourced statistics here, but I suspect you're an outlier. My personal numbers are way better, both on Gmail and Fastmail, despite using the same email addresses for decades.

pyrale · 2024-09-18T14:30:49 1726669849

> but spammers mostly lost that arms race.

Advertising in your mails isn't Google's.

BeFlatXIII · 2024-09-18T16:56:36 1726678596

I hope this trend accelerates to force us all into grass-touching and book-reading. The sooner, the better.

MrLeap · 2024-09-18T22:43:59 1726699439

Books printed before 2018, right?

I already find myself mentally filtering out audible releases after a certain date unless they're from an author I recognize.

kevindamm · 2024-09-18T12:48:23 1726663703

Yes but not quite as far as you imply. The training data is weighted by a quality metric: articles written by journalists and Wikipedia contributors are given more weight than Aunt May's brownie recipe and corpoblogspam.

jsheard · 2024-09-18T13:41:10 1726666870

> The training data is weighted by a quality metric

At least in Google's case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight. They're not even filtering the comically low-hanging fruit like those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet, and is of course always a glowing recommendation since the point is to get the viewer to click an affiliate link.

Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?

acdha · 2024-09-18T15:10:22 1726672222

> Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?

Google has been _monetizing_ the SEO game forever. They chose not to act against many notorious actors because the metric they optimize for is ad revenue and those sites were loaded with ads. As long as advertisers didn’t stop buying, they didn’t feel much pressure to make big changes.

A smaller company without that inherent conflict of interest in its business model can do better because they work on a fundamentally different problem.

derefr · 2024-09-18T19:00:20 1726686020

> those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet

The problem is that, of the signals you mention,

• the highly-informative ones (posting a new review every 10 minutes, having affiliate links in the description) are contextual — i.e. they're heuristics that only work on a site-specific basis. If the point is to create a training pipeline that consumes "every video on the Internet" while automatically rejecting the videos that are botspam, then contextual heuristics of this sort won't scale. (And Google "doesn't do things that don't scale.")

• and, conversely, the context-free signals you mention (thumbnail looks AI-generated, voice is synthesized) aren't actually highly correlated with the script being LLM-barf rather than something a human wrote.

Why? One of the primary causes is TikTok (because TikTok content gets cross-posted to YouTube a lot.) TikTok has a built-in voiceover tool; and many people don't like their voice, or don't have a good microphone, or can't speak fluent/unaccented English, or whatever else — so they choose to sit there typing out a script on their phone, and then have the AI read the script, rather than reading the script themselves.

And then, when these videos get cross-posted, usually they're being cross-posted in some kind of compilation, through some tool that picks an AI-generated thumbnail for the compilation.

Yet, all the content in these is real stuff that humans wrote, and so not something Google would want to throw away! (And in fact, such content is frequently a uniquely-good example of the "gen-alpha vernacular writing style", which otherwise doesn't often appear in the corpus due to people of that age not doing much writing in public-web-scrapeable places. So Google really wants to sample it.)

nneonneo · 2024-09-19T05:03:39 1726722219

Reminds me of a Google search I did yesterday: “Hezbollah” yields a little info box with headings “Overview”, “History”, “Apps” and “Return policy”.

I’m guessing that the association between “pagers” and “Hezbollah” ended up creating the latter two tabs, but who knows. Maybe some AI video out there did a product review of Hezbollah.

selestify · 2024-09-21T05:57:05 1726898225

Wow, you’re not kidding. The “return policy” info box officially links to https://www.reuters.com/world/middle-east/dozens-hezbollah-m...

Suppafly · 2024-09-18T16:39:01 1726677541

> At least in Google's case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight.

I've noticed that lately. It used to be the top google result was almost always what you needed. Now at the top is an AI summary that is pretty consistently wrong, often in ways that aren't immediately obvious if you aren't familiar with the topic.

epgui · 2024-09-18T14:47:21 1726670841

I don’t think they were talking about the quality of Google search results. I believe they were talking about how the data was processed by the wordfreq project.

kevindamm · 2024-09-18T15:01:12 1726671672

I was actually referring to the data ingestion for training LLMs, I don't know what filtering or weighting might be done with wordfreq.

noirscape · 2024-09-18T15:27:10 1726673230

Google has those problems because the company's revenue source (Ads) and the thing that puts it on the map (Search) are fundamentally at odds with one another.

A useful Search would ideally send a user to the site with the most signal and the least noise. Meanwhile, ads are inherently noise; they're extra pieces of information inserted into a webpage that at best tangentially correlate to the subject of a page.

Up until ~5 years ago, Google was able to strike a balance on keeping these two stable; you'd get results with some Ads but the signal generally outweighed the noise. Unfortunately, from what I can tell from anecdotes and courtroom documents, somewhere in 2018-2019 the Ad team at Google essentially hijacked every other aspect of the company by threatening that yearly bonuses won't be given out if the other teams don't kowtow to the Ad team's wishes to optimize ad revenue, and it shows no sign of stopping since there's no effective competition to Google. (There's like, Bing and Kagi? Nobody uses Bing though and Kagi is only used by tech enthusiasts. The problem with Google is that to copy it, you need a ton of computing resources upfront and are going up against a company with infinitely more money and ability to ensure users don't leave their ecosystem; go ahead and abandon Search, but good luck convincing others to give up say, their Gmail account, which keeps them locked to Google and Search will be there, enticing the average user.)

Google has absolutely zero incentive to filter out generative AI junk from their search results outside the amount of it that's damaging their PR since most of the SEO spam is also running Google Ads (since unless you're hosting adult content, Google's ad network is practically the only option). Their solution therefore isn't to remove the AI junk, but to instead reduce it enough to the degree where a user will not get the same type of AI junk twice.

PaulHoule · 2024-09-18T19:09:40 1726686580

My understanding is that Google Ads are what makes Google Search unassailable.

A search engine isn't a two-sided market in itself but the ad network that supports it is. A better search engine is a technological problem, but a decently paying ad network is a technological problem and a hard marketing problem.

Freak_NL · 2024-09-18T13:02:13 1726664533

It certainly feels like the amount of regurgitated, nonsensical, generated content (nontent?) has risen spectacularly specifically in the past few years. 2021 sounds about right based on just my own experience, even though I can't point to any objective source backing that up.

eszed · 2024-09-18T15:18:31 1726672711

Upvoted for "nontent" alone: it'll be my go-to term from now on, and I hope it catches on.

Is it of your own coinage? When the AI sifts through the digital wreckage of the brief human empire, they may give you the credit.

Freak_NL · 2024-09-18T15:41:09 1726674069

I do hope it catches on! I did come up with this myself, but I really doubt I'm the only one — and indeed: Wiktionary lists it already with a 2023 vintage:

https://en.wiktionary.org/wiki/nontent

zharknado · 2024-09-18T14:08:22 1726668502

Ooh I like “nontent.” Nothing like a spicy portmanteau!

eptcyka · 2024-09-18T14:40:39 1726670439

I personally am yet to see this beyond some slop on youtube. And I am here for the AI meme videos. I recognize the dangers of this, all I am saying is that I don't feel the effect, yet.

Freak_NL · 2024-09-18T15:49:52 1726674592

I'm seeing it a lot when searching for some advice in a well-defined subject, like, say, leatherworking or sewing (or recipes, obviously). Instead of finding forums with hobbyists, in-depth blog posts, or manufacturers' advice pages, increasingly I find articles which seem like natural language at first, but are composed of paragraphs and headers repeating platitudes and basic tips. It takes a few seconds to realize the site is just pushing generated articles.

Increasingly I find that for in-depth explanations or tutorials Youtube is the only place to go, but even there the search results can lead to loads of videos which just seem… off. But at least those are still made by humans.

ghaff · 2024-09-18T15:33:45 1726673625

There's been a ton of low-rent listicle writing out there for ages. Certainly not new in the past few years. I admit I don't go on YouTube much and don't even have a tiktok account so it's possible there's a lot of newer lousy content I'm not really exposed to.

It seems to me that the fact it's so cheap and relatively easy for people with dreams of becoming wealthy influencers to put stuff out there has more to do with the flood of often mediocre content than AI does.

Of course the vast majority don't have much real success and get on with life and the crank turns and a new generation perpetuates the cycle.

LLMs etc. may make things marginally easier but there's no shortage of twenty somethings with lots of time imagining riches while making pennies.

sharpshadow · 2024-09-19T08:24:40 1726734280

Looking forward to watching perfectly generated videos. We need so much more power and chips but it’s completely worth it. After that? Maybe generated videogames. But the video stuff will be awesome and will change video-dominated social media content forever. Virtual headsets will finally become useful, generating anything you want to see and letting you jump through space and time.

jsheard · 2024-09-18T13:18:06 1726665486

SEO grifters have fully integrated AI at this point, there are dozens of turn-key "solutions" for mass-producing "content" with the absolute minimum effort possible. It's been refined to the point that scraping material from other sites, running it through the LLM blender to make it look original, and publishing it on a platform like Wordpress is fully automated end-to-end.

sahmeepee · 2024-09-18T20:12:38 1726690358

Or check out "money printer" on github: a tongue in cheek mashup of various tools to take a keyword as input and produce a youtube video with subtitles and narration as output.

darby_nine · 2024-09-18T13:07:36 1726664856

Aunt May's brownie recipe (or at least her thoughts on it) is likely something you'd want if you want to reflect how humans use language. Both news-style and encyclopedia-style writing represent a pretty narrow slice.

creshal · 2024-09-18T13:15:17 1726665317

That's why search engines rated them highly, and why a million spam sites cropped up that paid writers $1/essay to pretend to be Aunt May, and why today every recipe website has a gigantic useless fake essay in front of their copypasted made up recipes.

Freak_NL · 2024-09-18T13:22:33 1726665753

I hate how looking for recipes has become so… disheartening. Online recipes are fine for reputable sources like newspapers where professional recipe writers are paid for their contributions, but searching for some Aunt May's recipe for 'X' in the big ocean of the internet is pointless — too much raw sewage dumped in.

It sucks, because sharing recipes seemed like one of those things the internet could be really good at.

smallerfish · 2024-09-18T13:30:51 1726666251

There seem to be quite a few recipe sharing sites around - e.g. allrecipes.com.

creshal · 2024-09-18T13:33:50 1726666430

And they're all flooded with low effort trash and useless.

The only remaining reliable source - now that many newspapers are axing the remaining staff in favour of LLMs - is pre-2020 print cookbooks. Anything online or printed later must be assumed to be tainted, full of untested sewage and potentially dangerous suggestions.

jerf · 2024-09-18T14:08:38 1726668518

The wife and I use the internet for recipe ideas... but we hardly ever follow them directly anymore. We're no formally-trained chefs but we've been home cooks for over 20 years now, and so many of them are self-evidently bad, or distinctly suboptimal. The internet chef's aversion to flavor is a meme with us now; "add one-sixty-fourth of a teaspoon of garlic powder to your gallon of soup, and mix in two crystals of table salt". Either that or they're all getting some seriously potent spices all the time and I'd like to know where they shop because my spices are nowhere near as powerful as theirs.

halostatue · 2024-09-18T16:28:21 1726676901

It's not just online recipes, but cookbooks written for the Better Homes & Gardens crowd. The ones who write "curry powder" (and mean the yellow McCormick stuff which is so bland as to have almost no flavour) or call for one clove of garlic in their recipe.

I joke with folks that my assumption with "one clove of garlic" is that they really mean "one head of garlic" if you want any flavour. (And if the recipe title has "garlic" in it and you are using one clove, you’re lying.)

nick3443 · 2024-09-18T17:07:51 1726679271

If the recipe has "garlic" in the title, I'm budgeting 1/2 head per serving.

formerly_proven · 2024-09-18T13:41:19 1726666879

Well there's https://www.allrecipes.com/author/chef-john/ on that particular site.

davejohnclark · 2024-09-18T15:19:45 1726672785

I absolutely love Chef John. Great recipes and the cadence of his speech on YouTube (foodwishes) is very soothing, while he cooks up something amazing. If you're a home cook I highly recommend his recipes and his channel.

JohnFen · 2024-09-18T15:04:10 1726671850

Chef John is the best.

c6400sc · 2024-09-19T03:22:00 1726716120

It's interesting to search for recipes in other languages and not find junk as we do in English.

I read Spanish and Italian fluently and stumble my way through Japanese (with translation). It's easier to find a good recipe in these languages, provided you can find the ingredients or substitutes.

shagie · 2024-09-18T13:42:47 1726666967

I wish more people presented recipes like cooking for engineers. For example - Meat Lasagna https://www.cookingforengineers.com/recipe/36/Meat-Lasagna

bhasi · 2024-09-18T18:56:01 1726685761

I love the table-diagrams at the end. I've never seen anything like that until now and it really seems useful for visualization of the recipe and the sequence of steps.

tirant · 2024-09-19T07:37:54 1726731474

Interestingly my wife has been writing recipes on post-it notes for years in that same style, with arrows instead of tables. And she's the opposite to an Engineer, a psychologist (interest in people vs objects).

When I saw them, they blew my mind. Short to store and easy to understand.

shagie · 2024-09-18T19:04:11 1726686251

Combined with pictures for what each step should look like. I had a few of these pages printed out back in the '00s for some recipes that I did.

grues-dinner · 2024-09-18T14:39:05 1726670345

And here I thought my defacement of printed recipes by bracketing everything that goes together at each stage was just me. There are, well, maybe not dozens but at least two of us! Saves a lot of bowls when you know without further checking that you can, say, just dump the flour and sugar, butter and eggs into the big bowl without having to prepare separately because they're in the "1: big bowl" bracket.

halostatue · 2024-09-18T16:32:17 1726677137

Depends on what you’re doing. For best cookies, you want to cream the butter with the sugar, then add the eggs, and finally add the flour. If you’re interested and can find one, it’s worth taking a vegan baking class. You learn a lot about ingredient substitutions for baking, about what the different non-vegan ingredients are doing that you have to compensate for…and it does something that I’ve only recently started seeing happen in non-vegan baking recipes: it separates the wet ingredients from the dry ingredients.

That is, when baking, you can usually (again, exceptions for creaming the sugar in butter, etc.) take all of your dry ingredients and mix/sift them together, and then you pour your wet ingredients in a well you’ve made in the dry ingredients (these can also usually be mixed together).

grues-dinner · 2024-09-18T16:44:39 1726677879

No need to cakesplain; that was an example with three ingredients off the top of my head. Very, very obviously the exact ingredients and bracket assignments vary depending on what you are making.

But for shortbread or fork biscuits those three could indeed all go in the bowl in one go (but that one admittedly doesn't really need a bracket because the recipe is "put in bowl, mix with hands, bake").

darby_nine · 2024-09-18T13:19:41 1726665581

Ok, but what I said is true regardless of SEO, and that SEO has also fed back into English before LLMs were a thing. If you only train on those subsets you'll also end up with a chatbot that doesn't speak in a way we'll identify as natural English.

actionfromafar · 2024-09-18T14:37:52 1726670272

Yet. Give it time. The LLMs will train our future children.

darby_nine · 2024-09-18T22:30:22 1726698622

I'm sure they already are.

Lalabadie · 2024-09-18T14:38:23 1726670303

The current state of things leads me to believe that Google's current ranking system has been somehow too transparent for the last 2-3 years.

The top of search results is consistently crowded by pages that obviously game ranking metrics instead of offering any value to humans.

sahmeepee · 2024-09-18T20:07:20 1726690040

Prior to Google we had Altavista and in those days it was incredibly common to find keywords spammed hundreds of times in white text on a white background in the footer of a page. SEO spam is not new, it's just different.

rockskon · 2024-09-19T03:31:17 1726716677

Don't forget Google's adsense rules which penalized useful straightforward websites and mandated websites be full of "content". Doesn't matter if the "content" is garbage nonsense rambling and excessive word use - it's content and much more likely to be okayed by adsense!

pphysch · 2024-09-18T18:20:43 1726683643

It's crazy to attribute the downfall of the web/search to Google. What does Google have to do with all the genuine open web content, Google's source of wealth, getting starved by (increasingly) walled gardens like Facebook, Reddit, Discord?

I don't see how Google's SEO rules being written or unwritten has any bearing. Spammers will always find a way.

redbell · 2024-09-18T21:25:48 1726694748

> ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.

Based on the process above, naturally, the third iteration then is LLMs writing for corporate bots, neither for humans nor for other LLMs.

rgrieselhuber · 2024-09-18T12:48:37 1726663717

Indexability is orthogonal to readability.

hk__2 · 2024-09-18T14:24:26 1726669466

It should be, but sadly it’s not.

krelian · 2024-09-18T13:21:33 1726665693

>And yet LLMs were still fed articles written for Googlebot, not humans.

How do we know what content LLMs were fed? Isn't that a highly guarded secret?

Won't the quality of the content be paramount to the quality of the generated output or does it not work that way?

GTP · 2024-09-18T13:29:07 1726666147

We do know that the open web constitutes the bulk of the training data, although we don't get to know the specific webpages that got used. Plus some more selected sources, like books, of which again we only know that those are books but not which books were used. So it's just a matter of probability that there was a good amount of SEO spam as well.

jgrahamc · 2024-09-18T12:23:34 1726662214

I created https://lowbackgroundsteel.ai/ in 2023 as a place to gather references to unpolluted datasets. I'll add wordfreq. Please submit stuff to the Tumblr.

LeoPanthera · 2024-09-18T18:13:46 1726683226

Congratulations on "shipping", I've had a background task to create pretty much exactly this site for a while. What is your cutoff date? I made this handy list, in research for mine:

  2017: Invention of transformer architecture
  June 2018: GPT-1
  February 2019: GPT-2
  June 2020: GPT-3
  March 2022: GPT-3.5
  November 2022: ChatGPT

You may want to add kiwix archives from before whatever date you choose. You can find them on the Internet Archive, and they're available for Wikipedia, Stack Overflow, Wikisource, Wikibooks, and various other wikis.

jgrahamc · 2024-09-19T08:28:16 1726734496

I was taking "Release of ChatGPT" as the Trinity date.
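
A minimal sketch of how such a cutoff could be applied when collecting archives, assuming the public release of ChatGPT (30 November 2022) as the "Trinity date". The archive names and snapshot dates below are hypothetical placeholders, not real listings:

```python
from datetime import date

# Assumption: the cutoff is the public release of ChatGPT.
CUTOFF = date(2022, 11, 30)


def is_pre_cutoff(snapshot_date: date, cutoff: date = CUTOFF) -> bool:
    """Return True if a snapshot was taken strictly before the cutoff."""
    return snapshot_date < cutoff


# Hypothetical (name, snapshot date) pairs for illustration only.
archives = [
    ("example_wiki_dump_a", date(2021, 12, 1)),
    ("example_wiki_dump_b", date(2022, 11, 30)),
    ("example_wiki_dump_c", date(2023, 6, 15)),
]

# Keep only the snapshots that predate the cutoff.
kept = [name for name, snapshot in archives if is_pre_cutoff(snapshot)]
```

The strict comparison excludes anything dated on the release day itself; a more cautious collector might pick an earlier date from the list above instead.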

VyseofArcadia · 2024-09-18T12:48:46 1726663726

Clever name. I like the analogy.

freilanzer · 2024-09-18T14:57:38 1726671458

I don't seem to get it.

ziddoap · 2024-09-18T15:01:42 1726671702

Steel without nuclear contamination is sought after, and only available from pre-war / pre-atomic sources.

The analogy is that data is now contaminated with AI like steel is now contaminated with nuclear fallout.

https://en.wikipedia.org/wiki/Low-background_steel

>Low-background steel, also known as pre-war steel[1] and pre-atomic steel,[2] is any steel produced prior to the detonation of the first nuclear bombs in the 1940s and 1950s. Typically sourced from ships (either as part of regular scrapping or shipwrecks) and other steel artifacts of this era, it is often used for modern particle detectors because more modern steel is contaminated with traces of nuclear fallout.[3][4]

umvi · 2024-09-18T16:04:15 1726675455

> and only available from pre-war / pre-atomic sources.

From the same wiki you linked:

"Since the end of atmospheric nuclear testing, background radiation has decreased to very near natural levels, making special low-background steel no longer necessary for most radiation-sensitive uses, as brand-new steel now has a low enough radioactive signature"

and

"For the most demanding items even low-background steel can be too radioactive and other materials like high-purity copper may be used"

sergiotapia · 2024-09-18T16:50:14 1726678214

reading stuff like this makes me so happy. no matter how fucked up something may be there is always a way to clean right up.

shreddit · 2024-09-19T08:35:20 1726734920

I wouldn't be so optimistic. The thing you call "way" is actually just time. Yes, anything humanity does (good or bad) will fade with time. But do we have the amount of time to clean up X (and I don't refer to X as in "formerly Twitter")?

shmageggy · 2024-09-19T17:52:29 1726768349

This is one of the many reasons why I care primarily about biodiversity and preventing as many human-caused extinctions as we can. Those are a loss to the beauty and complexity of the universe built up over millions of years, and they are permanent and irreversible.

valval · 2024-09-22T14:11:36 1727014296

Not everything that’s been permanently lost is bad, that’s just the nature of our reality. This too shall pass.

New things arise from the ashes.

felbane · 2024-09-18T16:58:01 1726678681

glances nervously at atmospheric CO2

genewitch · 2024-09-18T23:34:07 1726702447

the easiest solution is growing dense vegetation including trees, then using that for things[0] or burying it until we have a better mitigation strategy for atmospheric carbon.

Another solution, and one that I would pursue if I weren't so lazy, is ocean-based carbon binding. You can run electricity directly through ocean water and precipitate the carbon out as calcium carbonate, which is both useful to humans, as is and after processing, and useful to the coral reefs and crustaceans/mollusks or whatever in the oceans.

If anyone wants to kick me about a million US dollars, i can make a POC on a used barge with solar panels and as much recycled material as possible, and have that just run off the coast of florida or something. I figure the total cost to get a barge is around a quarter million, all-in[1], the electronics and seawater stuff is about another $150-200 thousand, and the rest is mine for the idea and the lawyers' to get this approved and left alone to do the research.

[0] burning it for heat is fine, as the net CO2 levels will remain constant, but i mean things like houses and boardwalks and boats, furniture, and so on.

[1] could be more, now, the last time i was researching seaworthy barge costs it was between $100,000 and $200,000. I'm hoping there's someone that can donate the barge so i can make the rest more fit for purpose - redundancy, better solar, better mppt, better batteries, better materials for the electrodes (it takes platinum and titanium iirc, i haven't looked at my documents for a long while.)

heavensteeth · 2024-09-19T06:05:21 1726725921

The earth will recover. We may not, but earth will.

more-coffee · 2024-09-19T08:16:37 1726733797

And in a few million years, the next intelligent life form will examine remains of human texts, and wonder: with all the tools and knowledge they possessed, how could they not have prevented their demise?

(Sorry for pessimism and offtopicism)

CAP_NET_ADMIN · 2024-09-19T12:43:04 1726749784

We are but puny agents of entropy.

swyx · 2024-09-18T17:16:10 1726679770

and I applied to LLMs here: https://www.latent.space/p/nov-2023

AlphaAndOmega0 · 2024-09-18T15:02:26 1726671746

It's a reference to the practise of scavenging steel from sources that were produced before nuclear testing began, as any steel produced afterwards is contaminated with nuclear isotopes from the fallout. Mostly ship wrecks, and WW2 means there are plenty of those. The pun in question is that his project tries to source text that hasn't been contaminated with AI generated material.

https://en.m.wikipedia.org/wiki/Low-background_steel

ms512 · 2024-09-18T15:03:13 1726671793

After the detonation of the first nuclear weapons, any newly produced steel has a low dose of nuclear fallout.

For applications that need to avoid the background radiation (like physics research), pre atomic age steel is extracted, like from old shipwrecks.

https://en.m.wikipedia.org/wiki/Low-background_steel

GreenWatermelon · 2024-09-18T15:05:07 1726671907

From the blog

> Low Background Steel (and lead) is a type of metal uncontaminated by radioactive isotopes from nuclear testing. That steel and lead is usually recovered from ships that sunk before the Trinity Test in 1945.

voytec · 2024-09-18T15:36:40 1726673800

To whomever downvoted parent: please don't act against people brave enough to state that they don't know something.

This is a desired quality, increasingly less present in IT work environments. People afraid of being shamed for stating knowledge gaps are not the folks you want to work with.

umvi · 2024-09-18T16:02:25 1726675345

I feel like there's a minimum "due diligence" bar to meet though before asking, otherwise it comes across as "I'm too lazy to google the reference and connect the dots myself, but can someone just go ahead and distill a nice summary for me"

voytec · 2024-09-18T16:20:38 1726676438

In this particular case, I was out of the loop regarding the clever analogy myself. I'm now a tad smarter because someone else expressed lack of understanding, and I learned from responses to this (grayed due to downvotes) comment.

PhunkyPhil · 2024-09-18T17:34:06 1726680846

The problem is that the answer was a really easy google. I didn't know what low background steel was and I just googled it.

cwillu · 2024-09-18T22:44:50 1726699490

A person asking the question here means there are now several good succinct explanations of it here.

input_sh · 2024-09-18T22:47:08 1726699628

But it's right there in the header, you could just click the link and find out on the top of the webpage.

waveBidder · 2024-09-19T05:15:49 1726722949

modern polite way of saying rtfm

KeplerBoy · 2024-09-18T15:00:10 1726671610

Steel made before atmospheric tests of nuclear bombs were a thing is referred to as low background steel and is invaluable for some applications.

LLMs pollute the internet like atomic bombs polluted the environment.

cdman · 2024-09-18T15:00:21 1726671621

https://en.wikipedia.org/wiki/Low-background_steel

astennumero · 2024-09-18T15:56:10 1726674970

That's exactly the opposite of what the author wanted IMO. The author no longer wants to be a part of this mess. Aggregating these sources would just make it so much easier for the tech giants to scrape more data.

rovr138 · 2024-09-18T16:39:13 1726677553

The sources are just aggregated. The source doesn't change.

The new stuff generated does (and this is honestly already captured).

This author doesn't generate content. They analyze data from humans. That "from humans" is the part that can't be discerned enough and thus the project can't continue.

Their research and projects are great.

iak8god · 2024-09-18T20:29:43 1726691383

The main concerns expressed in Robyn's note, as I read them, seem to be 1) generative AI has polluted the web with text that was not written by humans, and so it is no longer feasible to produce reliable word frequency data that reflects how humans use natural language; and 2) simultaneously, sources of natural language text that were previously accessible to researchers are now less accessible because the owners of that content don't want it used by others to create AI models without their permission. A third concern seems to be that support for and practice of any other NLP approaches is vanishing.

Making resources like wordfreq more visible won't exacerbate any of these concerns.
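
For readers unfamiliar with what "word frequency data" means in practice, here is a minimal sketch. It is not how wordfreq itself is implemented; the tokenizer regex and the function name `word_frequencies` are assumptions made for illustration:

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Return each word's share of all tokens across the given texts."""
    counts = Counter()
    for text in texts:
        # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Tiny illustrative corpus; a real one would be a large, varied mix of sources.
freqs = word_frequencies(["Delve into the data", "the data"])
```

If a large share of the input texts is machine-generated, the resulting numbers describe the generating model's habits rather than human usage, which is the first concern above.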

Der_Einzige · 2024-09-19T05:26:38 1726723598

FYI: My two datasets, DebateSum and OpenDebateEvidence/OpenCaseList in their current forms qualify for this, as they end at latest in 2022.

jgrahamc · 2024-09-19T08:29:22 1726734562

You can either add them to the site yourself via Tumblr or send them to me via email (jgc@cloudflare).

imhoguy · 2024-09-18T15:28:49 1726673329

I am not sure we should trust a site contaminated by AI graphics. /s

gorkish · 2024-09-18T15:39:03 1726673943

The buildings and shipping containers that store low background steel aren't built out of the stuff either.

whywhywhywhy · 2024-09-18T15:32:00 1726673520

Yeah pay an illustrator if this is important to you.

I see a lot of people who are upset about AI still using AI image generation, because it's not in their field, so they feel less strongly about it and can't create art themselves anyway. That's hypocritical: either use it or don't, but don't fuss over it and then use it for something that's convenient for you.

imhoguy · 2024-09-18T15:44:19 1726674259

I have updated my comment with "/s" as that is closer to what I meant. However, seriously, from an ethical point of view it is unlikely illustrators were asked or compensated for their work being used for training AI to produce the image.

heckelson · 2024-09-18T16:01:23 1726675283

I thought the header image was a symbol of AI slop contamination because it looked really off-putting

ClassyJacket · 2024-09-18T22:22:14 1726698134

:'( I thought I was clever for realising this parallel myself! Guess it's more obvious than I thought.

Another example is how data on humans after 2020 or so can't be separated by sex because gender activists fought to stop recording sex in statistics on crime, medicine, etc.

thebruce87m · 2024-09-19T06:26:26 1726727186

I too realised this parallel and frequently tell people about it.

Edit: just the first one

jll29 · 2024-09-18T15:56:36 1726674996

I regret that the situation led the OP to feel discouraged about the NLP community, to which I belong, and I just want to say "we're not all like that", even though it is a trend and we're close to peak hype (slightly past even?).

The complaint about pollution of the Web with artificial content is timely, and it's not even the first time due to spam farms intended to game PageRank, among other nonsense. This may just mean there is new value in hand-curated lists of high-quality Web sites (some people use the term "small Web").

Each generation of the Web needs techniques to overcome its particular generation of adversarial mechanisms, and the current Web stage is no exception.

When Eric Arthur Blair wrote 1984 (under his pen name "George Orwell"), he anticipated people consuming auto-generated content to keep the masses away from critical thinking. This is now happening (he even anticipated auto-generated porn in the novel), but the technologies criticized can also be used for good, and that is what I try to do in my NLP research team. Good will prevail in the end.

solardev · 2024-09-18T16:06:44 1726675604

Have "good" small webs EVER prevailed?

Every content system seems to get polluted by noise once it hits mainstream usage: IRC, Usenet, reddit, Facebook, geocities, Yahoo, webrings, etc. Once-small curated selections eventually grow big enough to become victims of their own successes and taken over by spam.

It's always an arms race of quality vs quantity, and eventually the curators can't keep up with the sheer volume anymore.

squigz · 2024-09-18T16:14:17 1726676057

> Have "good" small webs EVER prevailed?

You ask on HN, one of the highest quality sites I've ever visited in any age of the Internet.

IRC is still alive and well among pretty much the same audience as always. I'm not sure it's fair to compare that with the others.

solardev · 2024-09-18T16:20:53 1726676453

Well, niche forums are kinda different when they manage to stay small and niche. Not just HN but car forums, LED forums, etc.

But if they ever include other topics, they risk becoming more mainstream and noisy. Even within adjacent fields (like the various Stacks) it gets pretty bad.

Maybe the trick is to stay within a single small sphere then and not become a general purpose discussion site? And to have a low enough volume of submissions where good moderation is still possible? (Thank you dang and HN staff)

squigz · 2024-09-18T16:36:58 1726677418

I'm not entirely sure it's about content (while HN is certainly tech-focused, politics, health, philosophy all come up with regularity) or even content moderation, although they both certainly play a part (particularly the moderation around here. Thanks, staff!)

I wonder if it is more to do with the community itself. HN users tend to have very intelligent discussions on pretty much anything, and discourage shitty, unnuanced, one-line takes. This, coupled with a healthy moderation system, makes it hard for the lower quality discussion to break in and override the good stuff.

nick3443 · 2024-09-18T17:11:19 1726679479

The car headlight forums seem to expose the weakness of small web though, in that a lot of the forums that show up in search are "sponsored" by one or two major brands and any open discussion or validation of off-brand solutions, AliExpress parts, etc are quickly shunned or banned.

rovr138 · 2024-09-18T16:35:15 1726677315

Yes. That's the small web.

A good example of the generalization problem you discuss is reddit.

You have to unsubscribe from all the defaults and find the small, niche, communities about specific topics. If not, it's the same stuff, reposted, over and over, across different subs and/or social sites.

bongodongobob · 2024-09-18T16:31:15 1726677075

It's high quality when the content is within HN's bubble. Anything related to health, politics, or Microsoft is full of misinformation, ignorance, and garbage like any other site. The Microsoft discussions in particular are extremely low quality.

tdb7893 · 2024-09-19T01:18:48 1726708728

When economics has come up I've been curious and asked my brother about some of the stuff in the more upvoted comments (he has his PhD in economics with a focus on labor specifically), and his reaction has always been something like "that doesn't match my understanding of that" or "I think their analysis is a bit oversimplified".

My experience here is that it's pretty good for things outside of tech (at least better than the average internet) but definitely not great.

nerdponx · 2024-09-19T08:50:44 1726735844

I don't have a PhD but I do have some background in economics, and economics is consistently one of the worst areas on HN. I think it's representative of society in general. There's something about economics that makes it feel like you can just reason through it with common sense, whereas that's rarely true in reality.

Retric · 2024-09-18T17:08:26 1726679306

IMO HN actually scores quite highly in terms of health/politics and so forth content because both mainstream and fringe ideas get shown and get pushback.

In a vaping discussion, someone brought up that the glycerin used was safe and the same thing used in smoke machines, and someone else brought up a study showing that smoke machines are an occasional safety issue. Nowhere near every discussion goes that well but stick around and you’ll see in-depth discussion.

Go to a public health website by comparison and you’ll see warnings without context and a possibly positive spin compared to smoking. https://www.cdc.gov/tobacco/e-cigarettes/index.html I suspect most people get basically nothing from looking at it.

chimeracoder · 2024-09-18T17:32:45 1726680765

> IMO HN actually scores quite highly in terms of health/politics and so forth content because both mainstream and fringe ideas get shown and get pushback.

As someone with domain expertise here, I wholeheartedly disagree. HN is very bad at percolating accurate information about topics outside its wheelhouse, like clinical medicine, public health, or the natural sciences. It is also, simultaneously, extremely prone to overestimating its own collective competency at understanding technical knowledge outside its domain. In tandem, those two make for a rather dangerous combination.

Anytime I see a post about a topic within my area of specialty, I know to expect articulate, lengthy, and completely misguided or inaccurate comments dominating the discussion. It's enough of a problem that trying to wade in and correct them is a losing battle; I rarely even bother these days.

It's kind of funny that XKCD #793[0] is written about physicists, because the effect is way worse with software engineers.

[0] https://xkcd.com/793/

Retric · 2024-09-18T17:51:15 1726681875

Obviously on an objective scale HN isn’t good, but nobody is doing a good job here.

I’ve worked on the government side of this stuff and find it disheartening.

matrix87 · 2024-09-19T04:56:38 1726721798

people don't normally talk about healthcare on here so I'm not really sure what you're referring to or what your specialty is

mandevil · 2024-09-18T17:41:03 1726681263

As a software engineer married to a healthcare professional, I disagree strongly about the quality of the healthcare discussions here. A whole lot of the conversation is software engineers who think that they can reason from first principles in two minutes about this thing that professionals dedicate their whole lives to mastering, and who therefore don't understand the most basic concepts of the field.

Sometimes I try and engage, but honestly, mostly I think it's not worth it. Otherwise you end up doing this with your life: https://xkcd.com/386/

vladms · 2024-09-18T19:27:33 1726687653

> about this thing that professionals dedicate their whole lives to mastering

After doing some healthcare work I ended up understanding that some topics are not well known even by the professionals dedicating their whole lives to that because there are big gaps in the human knowledge on the topics.

I agree that people that think they can reason in two minutes about anything are a problem, but it's not a healthcare only issue (same happens for politics, economics, environment, etc.)

Engineers have the luck to work in a field where many things have a clear, known explanation (although, try to make an estimation about how long a team will take to implement a feature, and everybody will come up with something else).

mandevil · 2024-09-18T20:16:06 1726690566

As to the uncertainty and mysteries, you are 100% correct. One of the big failure modes for engineers in dealing with human health is the assumption that things are as simple and logical as the stuff we build, when it's simply not at all like that. There are big arguments over basic things like "why do SSRIs work?" (1) Outside of LLMs I can't think of a thing in software where we are still arguing about why things work in production. We never say "Why does Postgres work?" in the same way. (2)

And yes, this is true for many other areas of discussion at HN. It's just that it is most obvious to me in the area that my wife specializes in, because I pick up enough via osmosis from her to know when other people don't even have my limited level of understanding.

1: Or at least there were 15 years ago when my wife told me about it; the argument might have been largely concluded and she just never updated me, since I don't keep up with the medical literature the way she does.

2: Two decades ago there was a huge push for the "human genome project" on the basis that this would be "reading the blueprints for human life" and that would give us all of these medical breakthroughs. Basically none of those breakthroughs happened because we've spent the past 20 years learning all of the different ways that it is NOT a blueprint and that cells do things very differently from human engineers.

vladms · 2024-09-19T09:33:27 1726738407

Regarding the human genome project specifically: it was research, and no matter what was claimed ("give us all of these medical breakthroughs"), we (as the public) should understand there is no guarantee. It's similar to how most tech startups propose plans that lead to huge scale and ROI, but nobody is amazed when 3-4 years later they have modest revenue (the lucky ones).

The benefits of understanding more about genomes are growing (e.g., a list of adverse effects based on genotype: https://go.drugbank.com/pharmaco/genomics), but the field is (or was) so chaotic (just one example: there was no single standard for how to count: https://tidyomics.com/blog/2018/12/09/2018-12-09-the-devil-0...) and so lacking in data that it will take many years to reap the benefits (e.g., one of the largest studies, UK Biobank, gave researchers access only in 2017: https://en.wikipedia.org/wiki/UK_Biobank)

Retric · 2024-09-18T18:07:35 1726682855

Spend time with medical researchers and they start disparaging Doctors. Everyone wants that one authoritative source free from bias, but IMO even having a few voices in the crowd worth listening to beats most other options.

squigz · 2024-09-18T16:34:05 1726677245

I disagree. Even politics spurs intelligent, nuanced discussion here on HN.

And to hold up discussions about MS as an example of 'extremely' low quality discussion is, ah, interesting. Do you have any recent examples of such discussions?

matrix87 · 2024-09-19T04:53:34 1726721614

> spurs intelligent, nuanced discussion here on HN

relative to what? reddit?

also there's a trade off between entropy and "quality". too much "quality" and everyone gets bored and goes somewhere more entertaining

squigz · 2024-09-19T07:21:08 1726730468

Relative to... unintelligent discussions?

I also don't care if people leave because HN isn't 'entertaining' enough. I don't come here for that, and I don't expect the community members that make this place what it is to either.

matrix87 · 2024-09-22T20:12:45 1727035965

Idk, compare it to a forum like lobste.rs that has a much more strict filter for what's allowed. Personally I find lobste.rs a lot less entertaining because it's too on-topic and has a stuffy feeling

vundercind · 2024-09-18T17:06:50 1726679210

Politics and philosophy discussions here are intelligent in that most of the commenters aren’t dumb. They tend to be entirely uneducated and resistant to the educated.

bongodongobob · 2024-09-18T16:43:57 1726677837

I hide every single article about MS because it's filled with all the neckbeardy tropes about their products being garbage spyware, switch to Linux, they're stealing your data, the OS is trash etc. It's comments from people who have never managed large scale MS based environments comparing their Windows Home to the other 90% of the business ecosystem that has nothing to do with home users or MS's main cash cow, businesses, Azure/Entra and M365. I'm done wasting my breath on MS here.

squigz · 2024-09-18T16:45:15 1726677915

This is a funny comment in a thread about low quality discussion.

bongodongobob · 2024-09-18T16:47:47 1726678067

I'm describing why I no longer engage with MS related posts.

skissane · 2024-09-18T17:10:15 1726679415

I’ve posted four comments here on Microsoft in the last 30 days:

https://news.ycombinator.com/item?id=41499957

https://news.ycombinator.com/item?id=41408124

https://news.ycombinator.com/item?id=41335757

https://news.ycombinator.com/item?id=41327379

None of which fit your description of “neckbeardy tropes about their products being garbage spyware, switch to Linux, they're stealing your data, the OS is trash”.

And it isn’t just me, because if you look at those comments, I was talking to other people who weren’t invoking those “neckbeardy tropes” either

htrp · 2024-09-18T17:52:01 1726681921

Any curation mechanism that depends on passion and/or the goodwill of volunteers is unsustainable.

38 · 2024-09-18T16:37:41 1726677461

it's so easy to solve this problem, not sure why nobody has done it yet.

1. build a userbase, free product

2. once the userbase gets big enough, any new account requires a monthly fee, maybe $1

3. keep raising the fee higher and higher, until you get to the point that the userbase is manageable.

no ads, simple.

jachee · 2024-09-18T17:40:04 1726681204

Until N ad views are worth more than $X account creation fee. Then the spammers will just sell ad posts for $X*1.5.

I can’t find it, but there’s someone selling sock puppet posts on HN even.

abridges6523 · 2024-09-18T17:53:43 1726682023

This sounds like a good idea. I do wonder if enough people would sign up for it to be a worthy venture, because I think the main issue is that adding any price at all dramatically reduces participation. Even if it's not about cost, some people just see the payment and immediately disengage.

squigz · 2024-09-18T16:15:31 1726676131

> people consuming auto-generated content to keep the masses from away from critical thinking. This is now happening

The people who stay away from critical thinking were doing that already and will continue to do so, 'AI' content or not.

psychoslave · 2024-09-18T17:31:16 1726680676

I don't know; individually fine-tuned addictive content served as real-time interactive feedback loops is another level of propaganda and attention-capture tool compared with the largest common denominator of the general crowd served as static, passive content.

squigz · 2024-09-18T18:04:35 1726682675

Perhaps, but the solution is the same either way, and it isn't trying to ban technology or halt progress or just sit and cry about how society is broken. It's educating each other and our children on the way these things work, how to break out of them, and how we might more responsibly use the technology.

trehalose · 2024-09-18T16:43:00 1726677780

How did they get started?

squigz · 2024-09-18T16:44:11 1726677851

They likely never started critically thinking, so they never had to get started on not doing so.

(If children are never taught to think critically, then...)

sweeter · 2024-09-18T17:25:55 1726680355

It's almost like it's a systemic failure that is artificially created so that people won't think critically... hmmm

vladms · 2024-09-18T19:32:50 1726687970

> is artificially created

You imply that thousands of years ago everybody was thinking critically?

Thinking critically is hard, stressful and might take some joy from your life.

sweeter · 2024-09-18T22:27:44 1726698464

I'm not sure how that would imply anything about the past. We as a society have spent decades defanging the public school system by making school test-score driven, tying a school's funding to the local property value, making schools less effective and less safe, choking them out financially, etc... it should be no surprise that children are not equipped to navigate modern life. I've been through these systems, they are deeply flawed.

squigz · 2024-09-18T17:27:16 1726680436

Yeah, it's almost like it has nothing to do with AI

Llamamoe · 2024-09-18T17:22:50 1726680170

> Good will prevail in the end.

Even if so, this is a dangerous thought: it discourages the decisive action that is likely to be necessary for this to happen.

sweeter · 2024-09-18T17:22:15 1726680135

tangentially related, but in 1894 Marx also predicted that crypto and NFTs would exist [1], and I only bring it up because it's kind of wild how we keep crossing these "red lines" without even blinking. It's like that meme:

Sci-fi author:

I created the Torment Nexus to serve as a cautionary tale...

Tech Company:

Alas, we have created the Torment Nexus from the classic Sci-fi novel "Don't Create the Torment Nexus"

1. https://www.marxists.org/archive/marx/works/1894-c3/ch25.htm

Intralexical · 2024-09-18T22:15:12 1726697712

What if the way for good to prevail is to reject technologies and beliefs that have become destructive?

0xbadcafebee · 2024-09-18T17:25:52 1726680352

I'm going to call it: The Web is dead. Thanks to "AI" I spend more time now digging through searches trying to find something useful than I did back in 2005. And the sites you do find are largely garbage.

As a random example: just trying to find a particular popular set of wireless earbuds takes me at least 10 minutes, when I already know the company, the company's website, other vendors that sell the company's goods, etc. It's just buried under tons of dreck. And my laptop is "old" (an 8-core i7 processor with 16GB of RAM) so it struggles to push through graphics-intense "modern" websites like the vendor's. Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.

Fuck the web, fuck web browsers, web design, SEO, searching, advertising, and all the schlock that comes with it. I'm done. If I can in any way purchase something without the web, I'mma do that. I don't hate technology (entirely...) but the web is just a rotten egg now.

Vegenoid · 2024-09-18T22:35:33 1726698933

On Amazon, you used to be able to search the reviews and Q&A section via a search box. This was immensely useful. Now, that search box first routes your search to an LLM, which makes you wait 10-15 seconds while it searches for you. Then it presents its unhelpful summary, saying "some reviews said such and such", and I can finally click the button to show me the actual reviews and questions with the term I searched.

This is going to be the thing that makes me quit Amazon. If I'm missing something and there's still a way to do a direct search, please tell me.

cosmotron · 2024-09-19T04:32:15 1726720335

You can still get to product reviews directly and search them. Here's an example:

Product page (copy the identifier at the end): https://www.amazon.com/Long-Thanks-Hitchhikers-Guide-Galaxy-...

Review page (paste the identifier at the end): https://www.amazon.com/product-reviews/B001OF5F1E/

This seems to bypass all of the LLM stuff for now.
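If you do this often, the copy-and-paste step above could be scripted. This is only a rough sketch: `review_url` is a made-up helper name, it assumes the identifier really is the last path segment of the product URL (as in the example above), and the product URL used in the usage note is a hypothetical stand-in, not the truncated link.

```python
from urllib.parse import urlsplit


def review_url(product_url: str) -> str:
    """Build a direct reviews URL from a product-page URL.

    Assumption: the product identifier is the last non-empty path
    segment of the product URL, as described in the comment above.
    """
    # urlsplit separates the path from any query string or fragment,
    # so tracking parameters after "?" are ignored.
    path = urlsplit(product_url).path
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        raise ValueError("product URL has no path segments")
    identifier = segments[-1]
    return f"https://www.amazon.com/product-reviews/{identifier}/"
```

For example, `review_url("https://www.amazon.com/Some-Product-Title/dp/B001OF5F1E")` (a hypothetical product URL) gives the review page URL shown above. If the product URL has extra path segments after the identifier, this sketch will pick the wrong one.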

Vegenoid · 2024-09-19T05:05:29 1726722329

Pretty good! Unfortunately it does not include the Q&As, which are often just as useful as the reviews.

graeme · 2024-09-19T13:01:35 1726750895

Ran into this the other day. Amazon.ca still has the old version for now

bbarn · 2024-09-18T18:27:40 1726684060

No disagreement for the most part.

I used to be able to say search for Trek bike derailleur hanger and the first result would be what I wanted. Now I have to scroll past 5 ads to buy a new bike, one that's a broken link to a third party, and if I'm really lucky, at the bottom of page 1 will be the link to that part's page.

The shitification of the web is real.

klyrs · 2024-09-18T22:56:09 1726700169

R.I.P. Sheldon Brown T_T

(The Agner Fog of cycling?)

bbarn · 2024-09-19T15:26:35 1726759595

He was a legend.

Gethsemane · 2024-09-18T19:20:10 1726687210

Sounds like your laptop is wholly out of date, you need to buy the next generation of laptops on Amazon that can handle the modern SEO load. I recommend the:

LEEZWOO 15.6" Laptop - 16GB RAM 512GB SSD PC Laptop, Quad-Core N95 Processor Up to 3.1GHz, Laptop Computers with Touch ID, WiFi, BT4.2, for Students/Business

Name rolls off the tongue doesn’t it

tim333 · 2024-09-19T11:30:24 1726745424

Or a Macbook.

cedric_h · 2024-09-18T23:58:08 1726703888

There is a startup whose product is better search. The killer feature is that you pay for it, so you aren't the product. https://kagi.com/welcome

codezero · 2024-09-19T00:29:32 1726705772

Can vouch for this. It’s the first non-Google search alternative I’ve used that has 100% replaced Google. I don’t need Google as a fallback like I did with others.

akkartik · 2024-09-19T05:28:32 1726723712

I've been slowly detaching myself from the web for the past 10 years. These days I mostly build offline apps using native technologies. Those capabilities are still around. They just receded for a while because they'd gotten so polluted with toolbars and malware. But now the malware is on the other side, and native apps are cool again. If you know where to look. Here's my shingle: https://akkartik.name/freewheeling-apps

On the other hand, what you call "The Web" seems to be just what you can get at through search engines. There's still the old web, the thing that's mediated by relationships and reputation rather than aggregation services with billions of users. Like the link I shared above. Or this heroically moderated site we're using right now.

w10-1 · 2024-09-18T18:09:17 1726682957

> If I can in any way purchase something without the web, I'mma do that

To get to the milk you'll have to walk by 3 rows of chips and soda.

odo1242 · 2024-09-18T18:14:17 1726683257

Yeah, this is why I still use the web to order things in a nutshell lol

0xbadcafebee · 2024-09-18T20:50:33 1726692633

Where do you order things online that you aren't inundated by ads?

freddie_mercury · 2024-09-19T01:30:01 1726709401

It's a lot LOT easier for me as an adult to ignore ads online than it is for my kids in brick and mortar stores to ignore the candy and toys placed at their eye level.

cedric_h · 2024-09-18T23:57:07 1726703827

Ad blocker. Even just putting https://12ft.io/ in front of your link gets you pretty far.

0xbadcafebee · 2024-09-19T00:30:39 1726705839

Ah, you mean the web version of https://en.wikipedia.org/wiki/Blinkers_(horse_tack) . I don't think that helps when you're stopped in your tracks by an upsell. Dominos won't let you order a pizza online until you've declined garlic bread, cinnamon rolls and a liter of pepsi three times. And you can't just click "pepperoni pizza near me", you have to build your pepperoni pizza, after putting in your zip code, selecting the store, carry out, then click build again, sure you don't want buffalo wings too?, ....