kakakiki's comments

I am just making a wild suggestion here. Maybe ask an AI to generate one with the above prompt?


If you want your image processing algorithm to be optimized for the real world, you should test against real photos. It's not like we have a shortage of photos, so synthetic data is absolutely unnecessary.


Give me ten k and I'll make the AI take the prompt and return the best stock image, guaranteed

The optimal Lenna replacer, for boomers who think they can reasonably belabor this point

Just pay me up front, papi needs a quick cash fix *scratches their neck and tries to swivel their head 360 degrees*


No need, we have an ethically sourced Lena image here https://mortenhannemose.github.io/lena/


This is clearly a sly reference to an exclusionary in-joke, which makes it an in-joke itself, and thus also exclusionary.

Needless to say, I'm still offended. And I think we can all agree that no one should ever be offended.


I'm offended that you're offended. And I think we can all agree that no one should ever be offended.

Hint: this path of yours? It doesn't work. Feeling offended and asking others not to offend you has to be reasonable within society's moral values. They've changed, obviously. But there is a large group of conservatives (as in: the type who wants traditional values) who loathe that change. The group is significant enough that they can push back changes like these. Case in point: Trump got elected and ensured SCOTUS is now 6-3 conservative-progressive.


I am offended you're offended they're offended. And I believe the best defense against trolling is direct offense

Hint: this little old glory jig of yours? You know why you're offended? Because you have no principles or values worth fighting for. But you want to fight for something. So you fight for an inglorious past. And you can't think of a better way than to skirmish with sjws at random.

We see you, and you are transparent to us. If you were above the struggle, you would say nothing. Yet you want to fight. You NEED to fight. We all do. And yet you have nothing worthwhile to fight for. So you fight us.

The problem with your approach is that your way of thinking will die, since it produces nothing new or worthwhile. To the degree that your views get precedence, there will be misery and gnashing of teeth, stagnation, and eventual failure of the empire built on your non-principles. But the cause of true justice will remain living, until victory.


LLM gibberish.


False. The best LLMs speak with more empathy than I do

If I was using an LLM, I wouldn't be beefing with you. I'd find some way to engage you while respecting your point of view.

Nope. I'm definitely a full 100% grade A human a*hole


That's not ethically sourced Lena. That's mansplained Lena. And it doesn't add anything new or better.

Srsly, just hand over the 10k. I'll get the job done! Then we'll have thousands of Lena alternatives! And they'll be objectively better!


As they say, hope is not a strategy. I have to remind myself of this periodically to avoid gathering moss.


From Elon's twitter (https://twitter.com/elonmusk/status/1674942336583757825?s=20)

"This will be unlocked shortly. Per my earlier post, drastic & immediate action was necessary due to EXTREME levels of data scraping.

Almost every company doing AI, from startups to some of the biggest corporations on Earth, was scraping vast amounts of data.

It is rather galling to have to bring large numbers of servers online on an emergency basis just to facilitate some AI startup’s outrageous valuation."


I have a bridge to sell anyone who believes this is true. AI companies have been a boon to businesses that want to lock down user data and now have an excuse. It may be true to the extent that Musk is legitimately angry that Twitter isn't getting a piece of that AI VC money (I'm sure he is). But:

A) Twitter would probably move in this direction even if AI companies didn't exist, this is an excuse. Nothing about Musk's Twitter has indicated that he cares about Open data access or anonymous access to the site, and this follows a general trend of closing down the platform to non-monetizable users. Musk has abundantly shown in the past that he would prefer everyone browsing Twitter be logged into an account.

B) "it's temporary" -- how? You don't have a way to stop this other than forcing login. That situation is not going to change next week. To call this "temporary emergency measures" is so funny; there is no engineering solution for this and you're not going to be able to successfully sue companies for scraping Twitter. Put a captcha in front of it? Sure, let me know how that goes.

You going to wait and see if the AI market collapses in the next month?

If this does turn out to be temporary, it'll only be because of migrations off of Twitter and because of user criticism, because Musk is impulsive and bends easily under pressure. But nothing about the situation Musk is complaining about is going to change next week.


FWIW: I've been scraping the shit out of social media for my AI training. I also do Amazon, AliExpress, etc.

Libraries like Puppeteer are so good these days that it's impossible to tell real users from fake traffic. Most of the blocks are just IP blocks.


Right. And the IP blocks only add a small cost to the scraping because it forces people to use residential IPs which can't sanely be blocked.
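For what it's worth, that rotation trick is trivial to sketch. A minimal illustration below; the proxy endpoints are hypothetical placeholders, and the fetch itself (via the third-party `requests` library) is shown only as a comment:

```python
import random

# Hypothetical pool of residential proxy endpoints (placeholders, not real hosts).
PROXY_POOL = [
    "http://user:pass@res-proxy-1.example.net:8080",
    "http://user:pass@res-proxy-2.example.net:8080",
    "http://user:pass@res-proxy-3.example.net:8080",
]

def pick_proxy(pool: list[str]) -> dict[str, str]:
    """Pick a proxy at random so each request appears to come from a different household."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}

# With the `requests` library, each fetch would then exit through a rotated IP:
# requests.get("https://example.com/page", proxies=pick_proxy(PROXY_POOL))
```

Blocking any single address in a pool like this punishes a real household, which is why residential ranges can't sanely be blocked wholesale.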


Not to mention Elon is great mates with the VC douchebags that are busy hyping and profiting off the AI hype train.


> there is no engineering solution for this and you're not going to be able to successfully sue companies for scraping Twitter

There absolutely is, if you try instead of whining on the internet. People at Vercel have already developed new anti-bot + fingerprinting + rate limiting techniques which look quite promising. I dare say within a year, new tools will be powerful enough to do this easily.


> I dare say within a year, new tools will be powerful enough to do this easily.

I see where you're coming from, but if Twitter is in a position where it can't roll out those protections right now, given its current head counts, etc... it's not going to be in a position where it can roll out those protections next week. Probably not next month.

So it's less that no one could block companies from scraping Twitter (although anti-scraping mechanisms are probably always going to be a cat-and-mouse game, so I'm not sure that there is ever going to be a perfect easy solution). It's more that if Twitter can't do it right now, nothing is going to magically change any time soon about the situation it has found itself in. And waiting a year (even waiting 6 months) for tools to become available before rolling back this rate limiting would be incredibly self-destructive for Twitter.

The way I see it, they're basically guaranteeing that they will need to roll back these changes before they have a solution to whatever specific problem or irritation Musk is fixated on. They're not going to gain additional engineering capabilities in the next week. And how long does Musk plan to leave rate-limiting in place? A social media site where people can't look at content is just broken.


> Nothing about Musk's Twitter has indicated that he cares about Open data access or anonymous access to the site

Not so, he tasked George Hotz with getting rid of that horrible popup which prevented you from scrolling down much if you weren't logged in, which was added soon before he bought Twitter. When that was removed I rejoiced. But now Twitter's gone 100x in the opposite direction.


I don't know; was that Musk's idea, or was that Hotz's idea? I vaguely think this was a change that Hotz wanted that Musk went along with.

To be fair, Musk will regularly pay lip-service to the idea of Open communication. I guess that's not literally nothing, but most large site policies have been in the direction of locking down content.

If there ever was a version of Musk that cared about Open access, it's been a while since that version of him saw the light of day. It's very consistent with his overall behavior to believe that he views Twitter content as being primarily his property rather than a community resource, and that he thinks that scrapers/AI companies/researchers are literally stealing from him if they derive any value at all from data that Twitter hosts.


[flagged]


Let me guess. He got bored and moved on before making any breakthrough.


It's been a week now, and your prediction held. Still can't access without an account.


Pretty sure they'd get into trouble over the implicit promises made to blue check mark buyers if it's not temporary.


Elon has a good point there. Much of the current AI hotness is predicated on stealing people's content and exploiting the infrastructure that other people have built. I don't think it's acceptable.

The licenses, compensation models, law, technical solutions, attribution, security and privacy all need time to catch up. Regulation has a role to play, as it's a bit of a free-for-all right now.

The irony of Elon mentioning “outrageous valuations” though!


Why would an AI company start scraping twitter html, instead of using an already existing archive? Something similar to archive.org could earn money from that. If all you want is the content, there's no reason to suck it through a straw.

I'd expect those that require real time data, such as stock market bots or sentiment data providers, to scrape twitter (if they don't provide the data by other means, for example the "firehose", which is another great way to earn money).

None of this makes much sense.

Also, it's much more complicated than it seems. The web works because the data is public. You cannot think of it as "my data". (Especially not twitter, since it is really their users'!) Twitter is not higher quality data than any other web page.

If we accept that thinking, every home page would require a login to see that specific company's phone number or opening hours. Those pieces of data are also valuable, in the right circumstances! And then the web would either not exist, or the required account system would be so widespread that accounts would carry no value and the system would become useless.


> Why would an AI company start scraping twitter html, instead of using an already existing archive?

I can think of a few possible reasons. They might want more up-to-date info, or they might have no real developers and the scraper was created by a business guru who prompted ChatGPT and didn't understand the code that came out.

Given what else Musk has asserted about Twitter, and how often former or current Twitter devs have contradicted him, it may not even be what Musk said.

> Twitter is not higher quality data than any other web page

Eh, depends how much you can infer from retweet, favourites, etc.

Won't be the only such site, but it's probably better training data than blog posts are these days.

But yeah, I absolutely agree that Twitter doing this caused a lot of damage to any orgs, corporate or government, which wanted to be public, anything from restaurants announcing special offers to governments issuing hurricane warnings. Twitter isn't big enough to assume everyone has an account, like Facebook is.


If there were more value in requiring login than in having this public and easily accessible, it would be behind a login form. 99% of the time, the current internet has nothing to do with the values someone imagined in the '80s.


Value to whom? Twitter is more valuable to its users, to journalists who embed tweets in stories, and to web users at large who follow links and search results if it does not require a login to view posts.

Of course, none of those people own Twitter, and it may well be more valuable to its owners if it does require a login.


What you describe is the Facebook business model. Which seems to be a valid model, but twitter was not built around it and such a pivot would break all business moats around the company.

There was no web in the '80s, so I'm not sure what values you refer to or how they are relevant to today's businesses.


What do you think the web's value is today?

What do you think the web's value was imagined to be in the “80s”?


Because they want an advantage over competitors who are using the archives already…


If every AI company pointed their scrapers to archive.org, that site would go down immediately as well.

This is just kicking the can down the road.

We have a major structural problem now. We want data to be free and machine readable, but no startup (and even a giant like Twitter) can afford the server cost to withstand all those machines.


> Elon has a good point there. Much of the current AI hotness is predicated on stealing peoples content and exploiting the infrastructure that other people have built. I don’t think it’s acceptable.

But then so is Twitter. They don’t produce any content whatsoever. The data they are having a fit about is not theirs, it’s been volunteered by the users. It’s the same line Reddit is pushing, and it’s bullshit. AI companies scraping the web is no more unethical than Google doing it.


Well, one thing is people going there and putting their data on a platform. That’s their choice.

Taking/scraping/stealing that data out of said platform for the benefit of your over-hyped “disruptive” startup, and implying that others should give it all to you for free, is the issue.


That’s not the point. Twitter has a non-exclusive license to distribute the content; it’s not the owner of the data regardless of the high horse Musk feels like riding today.


Please don't throw around the word "stealing" so loosely.

Scraping data from a public website is not "stealing". It might be a violation of the terms of service, but then you have the whole issue of click-through (formerly shrink-wrap) licenses and contracts of adhesion.

If someone isn't vetting you and signing you to a more meaningful contract before giving you free access to data, then using that data for any purpose whatsoever is so far from "stealing" that using that word is wrong, and I suspect intentionally inflammatory. (The exception is republishing it or derived works, which might, depending on the nature of the data or the derived works, violate someone's copyright.)

That's why Elon limited access, rather than going to the police to file charges for theft, or suing over copyright violation or breach of contract. Not to say he absolutely couldn't do the latter, but it's hardly a clear win.


There are much bigger issues.

1. Users, or one might say content creators, don't own their data. Not only do platform owners make a lot of money with the content (which they have a license to, as per site ToS), but you now have third parties scraping it for commercial products. Using the data to train models that are then sold back to some of the same social media users who produced the content for free in the first place wasn't a thing until very recently; it used to be a select few doing machine learning research. The laws are lagging behind the tech development, and regular internet users are being exploited because of it.

2. It absolutely is stealing in some cases, and even worse. For example when they scrape it for content which they then use to train their bots to impersonate humans. Or on Twitter, there's a very common type of bot that steals content from young attractive female social media users in China, auto-translated to English, to pose as them. If you're in finance and crypto circles they're swarming with these accounts (guess the scammers know their targets).

3. In general this is only going to get worse from here on. LLMs are getting better and better. On sites like Twitter you already have no idea if you're interacting with a human or not. But these "AI" cannot actually think for themselves; they can only emulate, they can copy other humans. At least so far. So for the sake of making progress and ensuring we can still have intelligent discussions and find novel ideas online, it's imperative to have a way to keep the machines out. Social media must become Sybil-resistant or it dies in a vicious circle of self-referencing bots ever parroting the same old talking points, or variations thereof. We urgently need human ID!


Theft requires the victim to be deprived of their property.

Nobody is deprived of their property, intellectual or otherwise, when this stuff happens, thus is it not right to call it stealing.


You may wish AI didn't exist, but it does. There's no putting the genie back in the bottle. We can still go after people who commit crimes using AI. Perhaps one day AGI will be possible and we will want to have discussions and share ideas with it just as do now with each other.

Governments, researchers, and all kinds of third parties have already been scraping every publicly available bit of data possible. There may be an increase now, but it's nothing new. It won't be the end of the society or the end of the internet anymore than AI will.

Also: https://www.youtube.com/watch?v=IeTybKL1pM4


We may be using different definitions of 'intelligence'. To me there is no AI that currently exists but I'm aware the companies market it as such of course.

>have already been scraping every publicly available bit of data possible

Data scraping is limited by economics just like anything else in the world. Storage costs money, someone has to pay for it. Researchers do not have unlimited funds. Some select few governments like the US may have most of the publicly accessible web archived. Keep in mind it's dynamic and requires massive data infrastructure to pull this off, there's tons of new data coming in daily. Private startups getting in on the action in a big way is a relatively new phenomenon, this used to be limited to enterprises with a specific purpose. Now everyone and their 4chan cousin are experimenting with their own deep learning models.


Here's the thing about the ToC and licenses:

Bots aren't people and can't read nor consent. They just consume.

Any page that can be served without first displaying a ToC or other terms explicitly prohibiting access is not protected from scraping by that ToC or license, since each page can be considered a point of first contact: the bot selects each link from a simple aggregation of all links it encounters, so each interaction is "new" in essence.

Now it could be argued that ignoring robots.txt is an explicit contravention of norms and standards which could be viewed as a violation of an implicit licence, but there is no law requiring adherence to robots.txt and thus no mandate that a program even look for it iiuc.
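For illustration, this is how easy it is for a bot to honor robots.txt when it chooses to; Python's standard library does the parsing. The rules below are made up:

```python
from urllib.robotparser import RobotFileParser

# Honoring robots.txt is voluntary: nothing forces a crawler to run this check.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyBot", "https://example.com/private/page"))  # False
```

A well-behaved crawler calls `can_fetch` before every request; an ill-behaved one simply never fetches the file, and no law says it must.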


Bots aren’t people and can’t consent, sure - but they are tools that are wielded or deployed by people who absolutely can consent (setting aside whether click-wrap terms are enforceable or not). If I throw a brick through a window, it’s me in the shit, not the brick.


If I have an open door to my business and someone's automated robot walks in the door to see what's available, how is that different?

Even more applicable, this is like saying that a person walking down the street can't have a camera and take a picture of the front of the building....

Because the page you land on when entering a url is in fact little different than a store front, with the associated signage and access points defining how a person or automated device may interact with that business.

If you want to have it different then you have to actually put everything behind a locked door with no window, right?


This could easily be solved by making unauthenticated access hard for machines to consume: introducing delays, some kind of captcha, or even just proof of work (e.g., finding a partial hash preimage), while the authenticated get all the snappiness they want.

I'm strictly anti account, so he just lost me as audience. The next walled garden after Facebook and Instagram that won't ever see me again.
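For what it's worth, the proof-of-work idea can be sketched in a few lines. This is a minimal illustration assuming SHA-256 and a leading-zero-bits difficulty, not anything any site actually deploys:

```python
import hashlib
import itertools

DIFFICULTY_BITS = 12  # tiny for the demo; a real deployment would tune this

def solve(challenge: str) -> int:
    """Client side: grind nonces until the hash has DIFFICULTY_BITS leading zero bits."""
    target = 1 << (256 - DIFFICULTY_BITS)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash, so checking stays far cheaper than solving."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY_BITS))

nonce = solve("tweet-request-12345")
print(verify("tweet-request-12345", nonce))  # True
```

The asymmetry is the whole point: the server pays one hash per check, the client pays thousands per page, and a scraper pays that thousands of times over.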


It already was semi-hard to machine-read, that is the reason I use Nitter for doing my small-scale continuous scraping of twitter which is now temporarily broken. Nitter is tons easier to parse as it's not reliant on JS, etc, and simpler to create screenshots of with headless chrome.

However if you mean implementing some even worse obfuscation (kind of like FB putting parts of words in different divs etc) that is not really compatible with the situation that this needed to be done as more of a temporary emergency measure. And PoW doesn't sound reasonable because it sets mobile devices against the scraper's servers. If all of this was just so easy, scraping would be dead. Good that it isn't.


> And PoW doesn't sound reasonable because it sets mobile devices against the scraper's servers.

Scraper servers and mobile devices have different access patterns though. If I'm reading tweets, then I'm fine waiting 1 second for a tweet to load. Page load times for this kind of bloated stuff are super slow anyway; meanwhile my mobile could spend a second or two on some PoW. But if you want to scrape at large scale, you suddenly have to pay for a billion CPU-seconds. And this PoW could even keep continuously increasing per IP, 0.1% with every tweet. Not noticeable for the casual surfer sitting on the toilet, back-breaking for scrapers.
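Back-of-envelope for that escalation, with made-up base cost and growth rate: the per-IP total is a geometric series, negligible for a short session but astronomical at scraper volumes.

```python
# Illustrative numbers only: 0.5 s of PoW per request, growing 0.1% each time.
# Total CPU time over N requests is the geometric series sum of base * growth**k.

def total_seconds(requests: int, base: float = 0.5, growth: float = 1.001) -> float:
    return base * (growth**requests - 1) / (growth - 1)

print(f"casual (50 tweets):    {total_seconds(50):.1f} CPU-seconds")
print(f"scraper (100k tweets): {total_seconds(100_000):.3e} CPU-seconds")
```

Fifty tweets cost the casual reader well under a minute in total, while a hundred thousand per IP runs into numbers no datacenter can pay.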

> If all of this was just so easy, scraping would be dead. Good that it isn't.

Small-scale scraping could still be provided through API access or just a login.

The reason they are not doing the "easy" thing is that they don't see a need (yet, perhaps). Just get an account, they'd say, and they are right. It works for Instagram too, except for some weirdos who nobody really cares about.


Of course the scraper would have to pay too. But it makes for a race between how much they are willing to pay and how much worse the experience gets for real users. And for successful mobile apps, reducing average load even during active use is important (example: idle games that don't want to make your phone run hot, where companies invest in custom engines and make all kinds of compromises to avoid this). And burst-allowing rate limiting is something I'm quite sure was already in place, especially with prejudice towards datacenter/VPN IPs. But similarly to how it is with search engine scraping, professional scrapers already have costly workarounds for these.

>The reason they are not doing the "easy" thing is that they don't see a need (yet, perhaps).

This argument just doesn't make any sense. Twitter notes that this is hurting them. Previews in chat apps and just clicking links in non-logged-in contexts are broken. I feel like you just predict that this will turn out to be more accepted in the near future and become a more permanent decision, which you don't like.


I'm not fine waiting 1 second.

Most baffling is mobile Reddit, where it takes like 6 seconds to load. Do they want us to use their crappy app, or do they just not care?


They're acting like they're desperate for you to use their crappy app.


They’re pulling every underhanded trick in the book to try and force mobile users onto the app. Yeah, I think they want you to use the app.


You can still get a login and have no delay.

For non-auth use, I'd rather wait 1 second than not have any access at all. Which is the current state of affairs.


Maybe that is already the PoW anti-scraping measures haha.


HTTP status code 429 exists for this very purpose. While I sympathise with the idea that services need to protect their content from scraping to power AIs, I can't help but feel it's a convenient excuse for these companies to re-implement archaic philosophies about online services, i.e. killing off 3rd-party apps and walling their garden higher. Both feel very boomer in their retreat from the openness of the internet that seemed to be en vogue prior to smartphones. Perhaps this is just the transition from engineers building services to business, legal and finance trying to force the profit.

Correct me if I'm wrong, but surely throttling scrapers (at least ones that are not nefarious in their habits) is a problem that can be mitigated server-side, so I find it somewhat galling that it's the excuse.
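A minimal sketch of that kind of server-side throttling: a per-client token bucket that answers 429 once a client exceeds its burst allowance. The rates here are illustrative, not anything Twitter publishes:

```python
import time

class TokenBucket:
    """Minimal per-client token bucket; numbers are made up for illustration."""

    def __init__(self, rate: float = 5.0, capacity: float = 10.0):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # burst allowance
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

def handle(bucket: TokenBucket) -> int:
    """Answer 200 while within budget, 429 Too Many Requests once it's exhausted."""
    return 200 if bucket.allow() else 429

bucket = TokenBucket()
statuses = [handle(bucket) for _ in range(20)]  # a rapid burst of 20 requests
print(statuses)  # the early requests get 200, the tail 429 until tokens refill
```

A real user's occasional clicks never drain the bucket; a scraper hammering the endpoint spends most of its time seeing 429.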


> is a problem that can be mitigated server-side

No matter what you do, this will cost server infra. That's Musk's argument for disabling access altogether.

Therefore it would make sense to have a solution which burdens the client disproportionately in relation to the server. A burden so low for the casual user that it's negligible but in aggregate, at scale, would break things. Which is what he wants.


> Which is what he wants.

Looks to me like both Reddit and Twitter are using this as a wedge to raise the walls of their gardens and kill 3rd-party development, as opposed to genuinely trying to license bulk users appropriately.

You're gonna need to license API keys, so you're already identifying consumers, and there's your infra, which you need anyway. At which point you can throttle anyone obviously abusing whatever free/open-source tier offering you give out as standard.


Unless the captcha is annoying to a significant degree, I doubt that it would work. With all the money in the bucket, scrapers can just hire a captcha farm to get past the captchas with help from real humans.

Also a side note: distributed web crawlers are not unheard of these days, as well as residential IP proxies, meaning the effectiveness of the proof-of-work model may also be limited.


How do residential proxies help? Scraping would effectively be bitcoin mining, which costs resources without shortcut.


Many online services (including Twitter) do employ some kind of IP address scoring system as part of their anti-scraping effort.

These systems tend to treat residential proxies as normal users and place fewer restrictions on them. On the other hand, if the IP address belongs to some (untrusted) IDC, then the system will enable more annoying restrictions against it (say, rate limits), making scraping less efficient.
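A toy sketch of that kind of tiering, with a made-up "known datacenter" range and made-up limits:

```python
from ipaddress import ip_address, ip_network

# Made-up "known IDC" range and rate limits, purely for illustration.
DATACENTER_RANGES = [ip_network("203.0.113.0/24")]

def requests_per_minute(ip: str) -> int:
    """Pick a rate-limit tier based on the IP's provenance."""
    addr = ip_address(ip)
    if any(addr in net for net in DATACENTER_RANGES):
        return 10   # untrusted datacenter address: throttle hard
    return 600      # looks residential: treat as a normal user

print(requests_per_minute("203.0.113.42"))  # 10
print(requests_per_minute("198.51.100.7"))  # 600
```

Which is exactly why residential proxies are valuable to scrapers: they land in the generous tier by default.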


Sounds like outright banning is a stopgap measure; maybe they will implement one of these solutions.


The other option would be to front caches through ISPs and the like.

This works far better when the items requested are small in number but large in volume (that is: a large number of requests against a small set of origin resources). When dealing with widespread and deep scraping, other strategies might be necessary, but these aren't impossible to envision.

Specifically permitted scraping interfaces or APIs for large-volume data access would be another option.

Of course, there's the associated issue that data aggregation itself conveys insights and power, and there might be concerns when those who think they're providing incidental, low-volume access to records discover that there's a wholesale trade occurring in the background (whether that's remunerated or free of charge).


Elon is making a point, and a reminder for everyone, that what you share on social nets like Twitter is basically not owned by you but by the service.

Actually I’m surprised this took so long to do, and in the light of doing so shows that perhaps Twitter was sold for its existing content rather than existing or active user base.


> [...] perhaps Twitter was sold for its existing content rather than existing or active user base.

Not a very significant distinction: if active users stopped posting, scrapers wouldn't have much of a reason to keep scraping.


Most AI startups don’t particularly care if content is 3 minutes old or 3 years.


If you're adding new/marginal data you should want it to be as current as possible so it'll have things like trending slang terms.


They absolutely do. What are you even saying?


AI startups' training data covers content going back years. DALL-E, for example, was trained on paintings hundreds of years old alongside more modern works.

Age may be included as part of the training but they generally want to suck up as much data as possible.


Yet.


> basically not owned by you, but the service.

It's shared ownership. You own it, but give Twitter non-exclusive permission to also use it.

This is why news agencies request permission from a twitter account before sharing a picture they took.


Twitter can delete the post without your consent, so you don't really own it.


You consented to their being able to delete it when you agreed to their terms of service. It's like if you hire someone to clean your home. Mostly they're tidying up and dealing with dirt and dust, but if they see what looks like a used napkin lying somewhere, they will probably throw it out without first asking if you still want it, without that being stealing and without their ever owning it themselves.

It may seem weird to compare useful content to a used napkin, but hey, successful business founder stereotypes do quite often involve having an idea written on a napkin…


I didn't consent to anything, I don't have a Twitter account. I'm talking about people who do. And they often mistakenly think their content will stay on Twitter forever, so they don't need to back it up.


Fair enough. By “you consented … when you agreed”, I really meant “one consents … when one agrees”, as is common in informal English.

Yes, it’s a mistake to rely on social media content remaining up forever, agreed. That’s separate from ownership. Backups are important even for data on a hard drive you physically own, since hard drives can fail or be damaged or lost.


Sorry, I don't think I've ever seen “you” used as a generic pronoun in a past-tense sentence, which is why I took it personally.


Also fair - it’s possible that my choice of tense made that the only literal reading, but my point was intended as general and not accusatory.


You can't park there mate.


That's not past tense.


Ownership is not the same as the right to display.

I can own a picture, but I can't place it on the NY Times website.


It's not though. They have a very permissive license but they don't have any actual IP ownership.


Was already the case for ICQ. (yes, it was in their ToS)


How do you define stealing? Is the AI data obtained from accessing private data? Data that users did not make publicly available but kept to themselves on their own devices?


I can't really agree. We've already had rulings about data scraping and I don't see the difference here. Just that a lot of people do it now?

Also, Twitter is a public platform. Twitter didn't generate comments, and people posting on a public account are indirectly subject to public viewing. Not much different from being indirectly recorded in a public park


> stealing peoples content

The Twitters and Reddits need to be careful here when complaining because without users generating free content, they also have no business.


We know that quality data is king, with all due respect to people tweeting, that data is most likely garbage.


If I visit Twitter to work out how to sort some JavaScript issue, and that makes my company $X, am I stealing content, or am I just using the platform?

There's one major player making money off of other people's content here, and that's Twitter. Why are they ok doing that, but not anyone else?


Scraping has been a thing since the web started. Happens to every public site.

I recall that email at one point was 90% spam.


But is Twitter a good source for AIs?


Used responsibly, of course it is. A developer is able to ingest current language used in exchanges about current topics, as well as cite prominent sources that are still using the platform.


> exploiting the infrastructure that other people have built

Including you and me. WE built part of the infrastructure.


It's not like Twitter is compensating tweet authors either. For using art, the debate is still open in my opinion (even if I'm personally not in favor of it), but I don't see how platforms built on user-made content (an even more clear-cut case than AI) can have a say on this.


People are happy to put their content on social networks. Maybe they get some value in return such as sales, exposure, signalling or simple enjoyment.

Many people who aren’t that privacy conscious would however object to lots of companies, big and small, sucking their content into their databases for their own uses, then republishing after it’s passed through a few AI models.


>People are happy to put their content on social networks.

Do they have a choice? A handful of corporations have captured all the network effects. If you need to reach an audience to do your job or find your "friends", what other choice do you have but to give your data to them?


I fail to understand this argument. If your friends are close enough do you need a big corporate network to share content/thoughts?

If they aren't close why do you care?

Alternatively, what would make more sense is to participate in communities gated by these networks but then it's your choice to be there.


>If your friends are close enough do you need a big corporate network to share content/thoughts? If they aren't close why do you care?

I don't personally have this problem, but my observation is that most social relationships are somewhere in between closest friends and don't care.

My own concern is more about participating in professional, neighbourhood, civil society or political communities. Choosing not to be where they have decided to congregate means not being able to do my job and not making my voice heard where many decisions affecting me are taken.


Yes. AI is becoming the content launderer. I mean what's the difference? You could ask an AI to make not star wars. And what's the difference between that and all the not star wars movies made in the 80s? It's that it was automated this time around?

I think this points out that AIs clearly do not work like human brains. Human brains do not need all of the content of humanity to produce a replica of ArtStation mediocrity.


It's not like there are many alternatives; network effects are very powerful. Even with Musk running the company into the ground, not many people are really quitting, which says a lot about how strong that effect can be.


They have announced a plan to compensate creators based on ads shown and also have implemented a subscribers feature (people paying users for special access to some tweets)


I had no idea this existed, thanks for the insights.


Well they killed the API, what did they think would happen? It's easier to control access and rate limit with a proper API.


The actual problem seems to be that a large number of entities now want a full copy of the entire site.

But why not just... provide it? Charge however much for a box of hard drives containing every publicly-available tweet, mailed to the address of buyer's choosing. Then the startups get their stupid tweets and you don't have any load problems on your servers.


What do you even charge for that? We might never make a repository of human-made content with no AI postings in it ever again. Seems like selling the golden goose to me


It's already public information. The point isn't to extract rents, it's to remove the incentive for server-melting mass scraping.


Substantially higher loads than Twitter gets today were not "melting the servers" until Musk summarily fired most of the engineers, stopped paying data center (etc.) bills, and then started demanding miscellaneous code changes on tight deadlines with few if any people left who understood the consequences or how to debug resulting problems.

In other words, the root problem is incompetent management, not any technical issue.

Don't worry though, the legal system is still coming for Musk, and he will be forced to cough up the additional billions (?) he has unlawfully cheated out of a wide assortment of counterparties in violation of his various contracts. And as employee attrition continues, whatever technical problems Twitter has today will only get worse, with or without "scraping".


Scraping has a different load pattern than ordinary use because of caching. Frequently accessed data gets served out of caches and CDNs. Infrequently accessed data results in cache misses that generate (expensive) database queries. Most data is infrequently accessed but scraping accesses everything, so it's disproportionately resource intensive. Then the infrequently accessed data displaces frequently accessed data in the cache, making it even worse.


In theory, wouldn’t continuous scraping by AI farms et al. put a lot of this infrequent data into cache though?


Caches are only so large. Expanding them doesn't buy you much, and increases costs greatly.

The key benefit to a cache is that a small set of content accounts for a large set of traffic. This can be staggeringly effective with even a very limited amount of caching.

Your options are:

1. Maintain the same cache size. This means your origin servers get far more requests, and that you perform far more cache evictions. Both run "hotter" and are less efficient.

2. Increase the cache size. Problem here is that you're moving a lot of low-yield data to the cache. On average it's ... only requested once, so you're paying for far more storage, you're not reducing traffic by much (everything still has to be served from origin), and your costs just went up a lot.

3. Throttle traffic. The sensible place to do this IMO would be for traffic from the caching layer to the origin servers, and preferably for requesting clients which are making an abnormally large set of non-cached object requests. Serve the legitimate traffic reasonably quickly, but trickle out cold results to high-demand clients slowly. I don't know to what extent caching systems already incorporate this, though I suspect at least some of this is implemented.

4. Provide an alternate archival interface. This is its own separately maintained and networked store, might have regulated or metered access (perhaps through an API), might also serve out specific content on a schedule (e.g., X blocks or Y timespan of data are available at specific times, perhaps over multipath protocols), to help manage caching. Alternatively, partner with a specific datacentre provider to serve the data within given facilities, reducing backbone-transit costs and limitations.

5. Drop-ship data on request. The "stationwagon full of data tapes" solution.

6. Provide access to representative samples of data. LLM AI apparently likes to eat everything it can get its hands on, but for many purposes, selectively-sampled data may be sufficient for statistical analysis, trendspotting, and even much security analysis. Random sampling is, through another lens, an unbiased method for discarding data to avoid information overload.
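For option 3, one common shape is a per-client token bucket that only charges for cache misses, so clients hitting hot cached content never slow down while cold-sweeping clients get trickled. A rough sketch, with made-up rates:

```python
import time

class MissBudget:
    """Token bucket charged only on cache misses: hot (cached) traffic
    is never throttled, cold sweeps quickly exhaust the budget."""
    def __init__(self, rate_per_s=2.0, burst=20):
        self.rate = rate_per_s
        self.tokens = self.burst = burst
        self.last = time.monotonic()

    def allow_miss(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should delay this request or return 429

budget = MissBudget()
served = throttled = 0
for _ in range(100):  # simulate 100 back-to-back cache misses
    if budget.allow_miss():
        served += 1
    else:
        throttled += 1
print(served, throttled)  # roughly: burst served, the rest throttled
```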


Twitter feels more stable today, with less spam, than one year ago. There's of course parts that have been deliberately shut down, but that's not an argument about the core product.


Pandemic lockdowns are 99% over. People are getting back outside and returning to the office. These effects have little to do with Twitter's specific actions.


I see more spam these days, particularly coming from accounts that paid for the blue check mark. IIRC, Musk said that paid verification would make things better since scammers wouldn't dare pay for it (I would find where he said this but I hit the 600 tweet limit), but given how lax their verification standards are, it seems to be a boon to scammers, much the same way that Let's Encrypt let anyone get a free TLS cert at the cost of destroying the perceived legitimacy that came with having HTTPS in front of your domain.

(And IMO, that perceived legitimacy was unfounded for both HTTPS and the blue check before both were easy to get, it's just that the bar had to drop to the floor for most people to realize how little it meant.)


The "massive layoffs" was just twitter returning to the same staffing level they had in 2019, after they massively overhired in 2020-2021. This information is public, but this hasn't stopped people from building a fable around doomsday prophecies.


I mean, it’s clear that Musk overcorrected. The fact that managers were asked to name their best employees only to then be fired and replaced by them, that Musk purposefully avoided legal obligations to pay out severance/health insurance payments (I forget the exact name)/other severance, and that the site has had multiple technical issues that make it feel like there’s no review/QA process, all show that he doesn’t know what he’s doing.

He got laughed out of a Twitter call thing with lead engineers in the industry for saying he wanted to “rewrite the entire stack” and not having a definition for what he meant.

Doomed or not, Musk is terrible at just about everything he does and Twitter is no exception


I think this action from Twitter showed that it isn't public information. It is pretty much twitter's to do whatever they want with it.


I think that’s always been known, but the tacit agreement between users and Twitter has always been “I’ll post my content and anyone can see it, if they want to engage they make an account”. From a business perspective this feels like a big negative to me for Twitter. I’ve followed several links the last few days and been prompted to login, and nothing about those links felt valuable enough to do so.


Just because it is published doesn't mean authors don't retain rights on it. None of that content is public-domain.


It's about $1 per thousand tweets and access to 0.3% of the total volume. I think the subscription is 50M "new" tweets each month? There are other providers who continually scrape Twitter and sell their back catalogue.

https://www.wired.com/story/twitter-data-api-prices-out-near...

Researchers are complaining that it's far too high for academic grants. Probably true, but that's no different from other obscenely priced subscriptions like access to satellite imagery (can easily be $1k for a single image which you have no right to distribute). I'm less convinced that it's impossible for them to do research with 50 million tweets a month, or with what data there is available. Most researchers can't afford any of the AI SAAS company subscriptions anyway. Data labelling platforms - without the workers - can cost 10-20k a year. I spoke to one company that wouldn't get out of bed for a contract less than 100k. Most offer a free tier a la Matlab in the hope that students will spin out companies and then sign up. I don't have an opinion on what archival tweets should cost, but I do think it's an opportunity to explore more efficient analyses.


> We might never make a repository of human-made content with no AI postings in it ever again.

Wow, never thought of it that way before. Kinda hit me hard for some reason.


Honestly I think that's why Reddit is closing itself up too. Everyone sitting on a website like this might be sitting on an AI training goldmine that can never be replicated.


Too little too late. Anything pre-ChatGPT is already scraped, packaged and mirrored around the Internet; anything post-ChatGPT launch is increasingly mixed up with LLM-generated output. And it's not that the most recent data has any extra value. You don't need the most recent knowledge to train LLMs. They're not good at reproducing facts anyway. Training up their "cognitive abilities" doesn't need fresh data, it just needs human-generated data.


Precisely, which brings us back around to the question: why are social media companies really doing this?

I think "AI is takin' ooor contents!" is a convenient excuse to tighten the screws further. Having a Boogeyman in the form of technology that's already under worried discussion by press and politicians is a great way to convince users how super-super-serious the problem must be, and to blow a dog whistle at other companies to indicate they should do the same.

It's no coincidence that the first two companies to do this so actively and recently are both overvalued, not profitable, and don't actually directly produce any of the content on their platforms.


One that's slowly ageing away though.


Synthetic data fed into training isn't necessarily a bad thing. It can produce great results in many cases.


I've seen that work with self-driving cars. Simulated driving data is actually better, since you can introduce black swan events that might not happen often in the real world.


It doesn't matter all that much. Smaller but better data is better for training than a large, but garbage dataset.


Do you think twitter has no AI postings?


Are you really sure it's legal? In theory it's not different from providing the same information from API or website... but do people working in law think so?


Twitter purchased Gnip years ago, and it's a reseller of social media data. Companies that want all the public tweets, nicely formatted and with proper licensing, can just buy the data from Twitter directly.


I'm assuming their terms give them permission to redistribute everybody's tweets, since that's kind of the whole site. I don't know why they'd restrict themselves to doing it over the internet and not the mail, but do you have any reason to think that to be the case?


We're talking about Elon Musk. I'd be surprised if he gave a shit.


So, I'd just made that suggestion myself a few moments ago.

That said, there are concerns with data aggregation, as patterns and trends become visible which aren't clear in small-sample or live-stream (that is, available in near-time to its creation) data. And the creators of corpora such as Twitter, Facebook, YouTube, TikTok, etc., might well have reason to be concerned.

This isn't idle or uninformed. I've done data analysis in the past on what were for the time considered to be large datasets. I've been analyzing HN front-page activity for the past month or so, which is interesting. I've found it somewhat concerning when looking at individual user data, though, here being the submitter of front-page items. It's possible to look at patterns over time (who does and does not make submissions on specific days of the week?) or across sites (what accounts heavily contribute to specific website submissions?). In the latter case, I'd been told by someone (in the context of discussing my project) of an alt identity they have on HN, and could see that the alternate was also strongly represented among submitters of a specific site.

Yes, the information is public. Yes, anyone with a couple of days to burn downloading the front-page archive could do similar analysis. And yes, there's far more intrusive data analytics being done as we speak at vastly greater scale, precision, and insights. That doesn't make me any more comfortable taking a deep dive into that space.

It's one thing to be in public amongst throngs or a crowd, with incidental encounters leaving little trace. It's another to be followed, tracked, and recorded in minute detail, and more, for that to occur for large populations. Not a hypothetical, mind, but present-day reality.

The fact that incidental conversations and sharings of experiences are now centralised, recorded, analyzed, identified, and shared amongst myriad groups with a wide range of interests is a growing concern. The notion of "publishing" used to involve a very deliberate process of crafting and memoising a message, then distributing it through specific channels. Today, we publish our lives through incidental data smog, utterly without our awareness or involvement for the most part. And often in jurisdictions and societies with few or no protections, or regard for human and civil rights, let alone a strong personal privacy tradition.

As I've said many times in many variants of this discussion, scale matters, and present scale is utterly unprecedented.


This is a legitimate concern, but whether the people doing the analysis get the data via scraping vs. a box of hard drives is pretty irrelevant to it. To actually solve it you would need the data to not be public.

One of the things you could do is reduce the granularity. So instead of showing that someone posted at 1:23:45 PM on Saturday, July 1, 2023, you show that they posted the week of June 25, 2023. Then you're not going to be doing much time of day or day of week analysis because you don't have that anymore.
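In Python terms, that coarsening is a one-liner; a small sketch (Sunday-start weeks assumed, matching the "week of June 25" example):

```python
from datetime import date, timedelta

def to_week_bucket(d: date) -> date:
    """Coarsen a date to the Sunday starting its week, discarding the
    day-of-week signal entirely (time of day is already gone)."""
    days_since_sunday = (d.weekday() + 1) % 7  # weekday(): Monday=0 .. Sunday=6
    return d - timedelta(days=days_since_sunday)

# Saturday, July 1, 2023 -> week of June 25, 2023
print(to_week_bucket(date(2023, 7, 1)))  # 2023-06-25
```

Any timestamp in that week maps to the same bucket, so posting-time analysis stops resolving below one week.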


Yes, once the data are out there ... it's difficult to do much.

Though I've thought for quite some time that making the trade and transaction of such data illegal might help a lot.

Otherwise ... what I see many people falling into the trap of is thinking of their discussions amongst friends online as equivalent, say, to a discussion in a public space such as a park or cafe --- possibly overheard by bystanders, but not broadcast to the world.

In fact there is both a recording and distribution modality attached to online discussions that's utterly different to such spoken conversations, and those also give rise to the capability to aggregate and correlate information from many sources.

Socially, legally, psychologically, legislatively, and even technically, we're ill-equipped to deal with this.

Fuzzing and randomising data can help, but has been shown to be stubbornly prone to de-fuzzing and de-randomising, especially where it can be correlated to other signals, either unfuzzed or differently-fuzzed.


I despise Musk as much as anyone else and charging for API access has hurt a lot of valuable use cases like improving accessibility but … how about not massive scraping a site that doesn’t want you to?


Scraping isn’t illegal, and to be honest, I’m not even sure it’s unethical. I’m assuming you think it so — if so, why? I’m not disagreeing, but haven’t given it much thought.


It’s ethical when your average Joe does it on a small scale, to scrape their favorite YouTuber or to buy something when it becomes available.

When you have a financial incentive to build your business on someone’s data and you scrape literally millions if not billions of pages, it’s unethical.


The thing with social media platforms is that this data is user-generated, so you've got the company "owning" user content.

This data is often of great public value. I track conversations around a social issue as part of my work for a non-profit.

I'd counter it's unethical to prevent people from accessing this data.


I’m not disagreeing with your comment but

> great public value

Having been to twitter mostly through the most recent prominent war, man the signal to noise ratio is really low even when being careful about who to follow and who to block. There is so much disinformation, bad takes, uninformed opinions presented as facts, pure evil, etc.

So I guess it could be used for training very specific things or cataloging the underbelly of humanity but for general human knowledge it’s a frigging cesspool.


OK, not gonna argue with that. There is, I guess, a perception that it matters because policy-makers, and the wonks and hacks that influence them are hooked. The value for me (and ergo the public, some classic NGO thinking there for you) lies in understanding those dynamics.

I do not use the Twitters myself, and actively discourage others from doing so. Sends people bonkers.


I mean, we have found election manipulations, like large-scale inauthentic activity by out-of-staters explicitly targeting African Americans, and projects here have even led to the perpetrators getting indicted. Other projects were tracking vaccine side-effect self-reports faster than the CDC, and doing other disaster intelligence.

We were actually gearing up to switch to paid accounts as we found use cases that could subsidize these efforts... And then the starting price for reasonably small volumes shot up to like $500k/yr.


So, are we saying it's unethical for Google and other search engines who make money off of ad revenue to scrape sites like Twitter? Or are they paying a large sum to Twitter to do this?


If Google doesn't provide a way to say "please don't scrape my site", then it's 100% unethical.

We have robots.txt. If Google doesn't respect that, it's unethical. Don't you think so?


Does twitter's robots.txt forbid scraping? Judging by the fact it shows up in Google I'd assume not.
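For what it's worth, Python's stdlib can answer that kind of question mechanically; here's a sketch against a made-up robots.txt (not Twitter's actual one):

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules directly from text; the same parser can also
# fetch a live file via rp.set_url(...) followed by rp.read().
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /search
Crawl-delay: 1
""".splitlines())

print(rp.can_fetch("MyScraper/1.0", "https://example.com/status/123"))   # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/search?q=x"))   # False
print(rp.crawl_delay("MyScraper/1.0"))                                   # 1
```

Under these example rules, anything beyond one request per second would already be a robots.txt violation, whatever the paths allow.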


Maybe it's time for an llm.txt

Not that the people you want to respect that would


The tricky part is that it's much harder to prove that they didn't respect it.


When there is a value exchange between the two entities that are relatively similar then I think it is ethical. People trade Google making money on ads for their site being found when people search. It is also possible to opt-out.


They benefit mutually from their symbiosis. Financially, AI bro model #1321 doesn’t bring anyone value except their owners.


If done against the wishes of the owner of the site, yes, I would consider that unethical. Thankfully, Google respects robots.txt and noindex.


But is it ethical for the site owner to block access by random people and companies on the internet to _my_ data? I posted that tweet with the expectation that it's going to be publicly available. Now the owner of the site is breaking that expectation. I would say that this part is also unethical.

Especially since they're not moderating things or anything.


> I would say that this part is also unethical.

Agreed. However, it's probably covered by their terms of service.

Same thing with the recent reddit kerfuffle. I'd have much preferred a Usenet 2.0 instead of centralizing global communications in the hands of a handful of private companies with associated user-hostile incentive structures.


Being indexed by Google is optional. Twitter could stop it at any time if they thought it was a bad deal for them. That's not comparable to a startup company trying to scrape the entire site to train their AI, using sophisticated techniques to bypass protections Twitter has put in place.


Translation: it’s ethical when I do it.


You wouldn’t download a car, would you?


Except with modern software, some wannabe genius programmer will think they can get a bunch of money or cred or whatever by infantilizing the process down to something your grandma could use. Then, suddenly, everyone is scraping. The net effect is largely the same -- server operators see an overwhelming proportion of requests from bots. Still ethical?


> I’m not even sure it’s unethical.

If it doesn't respect robots.txt, it is unethical.


Is it ethical for the "public square" to have a robots.txt?

Musk is trying to have his cake and eat it...

(Clearly it's not a public square, but his position is incoherent).


Yes, it is ethical. In many countries it is legal for humans to walk around the public square and overhear all conversations.

It is NOT legal to install cameras that record everyone's conversations, much less sell the laundered results.

Pre-2023 people went on Twitter with the expectation that their output would be read by humans.

A traditional search engine is different: It redirects to the original. A bastardized search engine that shows snippets is more questionable, but still miles away from the AI steal.


Many countries have freedom of panorama, which means it is legal to video record the public square. I'm not aware if anywhere has specific laws on mounting the camera on a robot.


>Pre-2023 people went on Twitter with the expectation that their output would be read by humans

Expectations =/= reality. And the reality is that bots have been reading comments for over a decade.


a) It looks to be permitted according to Twitter's robots.txt

b) Given Twitter is public, user generated content which they don't own but simply have a license I wouldn't call it unethical in the slightest.


If the background of the issue is as Musk described, then it certainly is not allowed by twitter’s robots.txt, which allows a maximum of one request per second.

I do a lot of data scraping, so I’m sympathetic to the people who want to do it, but violating the robots.txt (or other published policies) is absolutely unethical, regardless of the license of the content the service is hosting. Another way of describing an unauthorised use case taking a service offline is a denial-of-service attack, which (again, if Musk’s description of the problem is accurate) seems to be the issue Twitter was facing, with a choice between restricting services or scaling forever to meet the scrapers’ requirements.

Personally I would have probably tried to start with a captcha, but all this dogpiling just looks like low effort Musk hate. The prevailing sentiment on HN has become so passionately anti-Musk that it’s hard to view any criticism of him or Twitter here with any credibility.


"You wouldn't download a car."

The only reason these websites and platforms aggregate any content at all is because they're effectively giant public squares.


No means no ! :)


Moreover, isn't making scraping impossible illegal per a couple-of-years-old bill?


This isn't going to make them stop either. Musk is about to see a spike in account creations using the method of lowest resistance. I expect "sign in with apple" will disappear as an option soon, given its requirement of supporting "hide my email" that makes it trivial to create multiple twitter profiles from one apple ID.


This works in his favor though, more accounts means higher ad rev and MAU for valuation.


Higher ad rev only until the advertisers realise your users don’t buy anything and ads are wasted and your ARPU drops through the floor.


It’s funny how being a locust is cool when you’re doing it, and a problem when others do it in a way which affects you.


You might want to think twice about taking him at face value then. He says it’s about scraping but who knows anymore


And yet people do. Predicting how various people will react, including scammers, bots, scrapers and whatnot, is, like, the job of management in a company like this.


> I despise Musk as much as anyone else

It's interesting how you assume that most people despise Musk.


Oh shit you solved the problem


> I despise Musk as much as anyone else

I don't despise Musk at all. Don't agree with him on everything, but he is a genuine and interesting person.


He holds views that were the progressive norm 15 years ago but are now considered bigoted, and that is deemed unacceptable today. There's a lot I don't agree with him on, like Ukraine, but "despise" is a word I reserve for the likes of Putin.


I don’t think Putin is the epitome of evil that the west portrays him to be either. War is hell, and he surely started the larger-scale war, but just remember that you’ve probably been introduced to less than 1% of his side of things as a western citizen. The western world has gone to war many, many times in history for lesser reasons.


I'm from a country that's arguably part of the West today (Romania).

Your nonsense straight out of a Soviet propaganda book doesn't work on me.

Go ask people from Eastern Europe how they felt about 45/50+ years of Soviet imposed governments and regimes.

The West has done awful things but in what way do they excuse the attempted ethnic cleansing of Ukraine?


What do you know about Putin’s motives? What propaganda do you think you’re under?

You’re probably smart enough to understand that out of spite and regret of your country’s history with the Russians your countrymen have more motivation than many others to judge the Russian efforts without any further investigation into the matter.

The same applies for myself, since I’m Finnish. It’s almost sad to see how people abandon all reason and critical thinking skills because of some ingrained belief that “Russia bad”. All of my knowledge of the human nature leads me to believe that they’re no more bad than the next people, and that they probably have some motives to go to a taxing war that we don’t really understand here in the west - seeing as the first casualty in war is the truth.


>Yeah he started one of the deadliest wars of the 21st century, threatens to destroy the entire planet with nuclear weapons, but he is not that evil because there were other wars started by the west

That's an extremely dumb take


I’ll rephrase your argument for you: “Why don’t you listen to the rapist’s opinion? The victim is surely not blameless. Besides, your cousin is a shoplifter”.


Is this really the best response you could put forward to people trying to make a nuanced point?


Nuanced point? That person just said 'sure, putin is committing genocide and destroying an entire country, but we haven't heard his side of the story'

Then they followed it up with the old hacker news chestnut of 'whatabout the west'


Would his side of the story matter to you? I don’t think it’s a particularly nuanced point to you since you’ve already made up your mind, however ignorant it might be.


Putin already gave his side of the story. He declared Ukraine an invalid country, said there were nazis there and then went into full out war to destroy the country while committing countless atrocities.

> I don’t think it’s a particularly nuanced point to you

What point and why do you keep saying 'nuance' over and over while giving zero actual information? What are you trying to say and what evidence is there?

Let's think about this super hard. What is the justification for an unprovoked genocidal war? Why are you defending putin?

> however ignorant it might be.

Show me where you get your information, lets see the source of this nonsense.


It wasn't "nuanced" at all. Just muddled and obsequious.


1. As already mentioned, it's hardly a nuanced point.

2. If you actually want to hear my opinion, then the realm of geopolitics + good old-fashioned hate of the US government does a number on people's logic, so we get what I can only charitably describe as a parade of non-sequiturs, whataboutisms and other fallacies. And so it can be useful to frame it in the simpler terms, for example you could hardly find anyone even on this site who would condone the forced takeover of parts of people's homes. Literally the same is happening on a scale of the countries.


Totally agreed. Make it about individuals where the analogy holds.

It cuts right through the bullshit and frequently exposes a lot of hypocrisy, hate, self loathing, racism, etc.


Does Twitter have that same approach with user data?


You think companies massively scraping right now would respect API rate limit?


API rate limits are more easily enforceable. If they keep scraping, there are methods to detect and thwart that behaviour. I don't think Twitter has the appropriate talent and work environment to allow proper solutions to be implemented. It's all knee-jerk reaction to whatever Elon decides.


It's more easily enforced, except when you don't give them enough they just go back to scraping. Or create a million fake developer accounts and pool the free quota if that's possible. These are not hypotheticals, loads of companies have done both against all kinds of APIs over the years, Twitter included.


"control access and rate limit"

Isn't that basically what they did with the API changes?


But they were too stingy with the tiers and too greedy with their prices. Even for minor use cases where you need to make, say, 100 API calls a day, you’ll need to pay $100/month.

Which just leads people to scrape.


Why is it twitter that is “stingy” and not the people scraping so they don’t have to pay?


I'm not going to pay $100 just to fetch 3000 records for a hobby project. I'll either skip the project, or I'll just abuse my scraping tool.

If they'd made some more reasonable pricing tiers, I would have been happy to pay.

Fetching something as simple as the total follower count from an API shouldn't be (exorbitantly) more expensive than fetching data from, say, GPT-4. No reasonable person can make an argument for 10c/call pricing.


Did you actually read that comment? I think the point is very clear: given a reasonable price, people would want to use the API instead of scraping the data themselves. If you instead ask for an exorbitant amount of money, it only forces people to scrape, because there is no business model that would make it possible to pay.


Isn't the firehose API still available?


Indeed, spot on.


Sorry, I don’t buy it. Hundreds of millions of people use Twitter, and we are to understand that there are enough people scraping, to the extent that they had to suddenly take drastic action by shuttering unauthenticated access? Any dev would have told him that those supposedly scraping could simply set up Selenium or some other headless browser to log in before scraping.

This smells of another failed Musk experiment at twiddling with the knobs to increase engagement, to me.


Not only that but unauthenticated access is the easiest thing to cache. There is no need to "bring large numbers of servers online". He's lying.


A bot scraping content will tend to go deep into the archives and hit all content systematically. Caching isn't as effective if you hit everything, whereas real users tend to hit the same content over and over again.

It can add nontrivial load.
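To make that concrete, here's a toy simulation (all numbers are made up for illustration): real users mostly re-request a small hot set of items, while a scraper sweeps the archive systematically, so a fixed-size cache helps the former far more than the latter.

```python
import random

random.seed(0)
CACHE_SIZE, N_REQS = 1_000, 50_000

def hit_rate(requests):
    # Crude cache model: keep the first CACHE_SIZE distinct items seen.
    cache, hits = set(), 0
    for item in requests:
        if item in cache:
            hits += 1
        elif len(cache) < CACHE_SIZE:
            cache.add(item)
    return hits / len(requests)

users   = [random.randint(0, 99) for _ in range(N_REQS)]  # 100-item hot set
scraper = list(range(N_REQS))                             # never repeats an item

print(f"user-like hit rate:    {hit_rate(users):.1%}")
print(f"scraper-like hit rate: {hit_rate(scraper):.1%}")
```

The sweep never revisits anything, so its cache hit rate is zero no matter how big the cache is.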


They could, but signing in, even via Selenium, means agreeing to Twitter's TOS. See the LinkedIn scraping case.


The same way these AI code completion tools respected GPL-licensed code?


You don't generally need to accept licenses in order to scrape something, only if you want to distribute it.

The legal ambiguity comes from the question of whether LLM outputs are a derivative work of the training data. I expect that they aren't, but anything can happen.


> Hundreds of millions of people use Twitter, and we are to understand that there are enough people scraping to the extent that they had to suddenly take drastic action by shuttering unauthenticated access

Suppose 1 million people are accessing Twitter at any given time. An actual person might only be making 1 request / second. That's 1 million requests / second.

Suppose there are 100 AI companies scraping Twitter. A bot like this can make thousands to tens of thousands of requests per second. That's an additional million requests / second.

There are probably more than 100 "AI" companies now, trying to train their own bespoke LLMs. They're popping up like weeds so I can totally see Twitter's load doubling or tripling recently. So sorry, I just don't get the skepticism. Sure it could be a cover for something else, but his actual stated reason seems totally possible.
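As a back-of-envelope check on the numbers in the comment above (all of them illustrative assumptions, not measured figures):

```python
# Hypothetical load estimate: scrapers matching organic traffic.
human_users       = 1_000_000  # concurrent real users
human_rps_each    = 1          # requests/second per person
scraper_companies = 100        # AI companies scraping
scraper_rps_each  = 10_000     # requests/second per scraper

human_load   = human_users * human_rps_each          # 1,000,000 req/s
scraper_load = scraper_companies * scraper_rps_each  # 1,000,000 req/s

print(f"human: {human_load:,} req/s, scrapers: {scraper_load:,} req/s, "
      f"total: {human_load + scraper_load:,} req/s")
```

Under those assumptions, a hundred aggressive scrapers really would double the load.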


> A bot like this can make thousands to tens of thousands of requests per second.

You don't need to use a bot to do this, Twitter literally did this to themselves through their own buggy code https://sfba.social/@sysop408/110639474671754723

If Silicon Valley was still being produced, this would make for a great episode.


Yeah, no, you can't just 'use Selenium'. To keep the same scraping volume you might need thousands of accounts and 10x the compute.


It’s not a little “use selenium” switch you can click, but it absolutely is an option (and there are others) if the barrier is simply to have an authenticated account and be logged in.

If these data scraping operations are as sophisticated and determined as he claims this measure is insufficient and actually it really hurts Twitter far more than it helps. Case in point: we stopped sharing Twitter links because when you click them in most iOS apps it opens up an unauthenticated web view and presents you with a login screen. So we just collectively decided “ah ok no sharing Twitter” and moved on.

I’m sure there are companies scraping Twitter. I just don’t buy that it’s as big of an issue as he claims it is, and that preventing people from viewing tweets without logging in is a way to mitigate against that (I’d first look at banning problematic IP addresses first, personally).

To me it’s either:

1) a very poor and very temporary mitigation against scraping, that could be bypassed with a bit of effort

2) an experiment in optimising metrics - Musk sees lots of unauthenticated users consuming Twitter, tries to steer them into signing up

3) it’s all just a big mistake

Option #2 makes the most sense to me, but frankly none of them are good


A decade ago I worked on building AI systems that made (legitimate, paid) use of the Twitter “firehose”. At that time more than 99% of the data was garbage. It’s worse now. The value then was largely in two areas: historical trends (something like Google trends), and breaking news; and only the latter was really that interesting. I doubt it’s a high value data source scraped in bulk; it could have value in a much more targeted approach. Seems unlikely to require the addition of “large numbers of servers … on an emergency basis”.


Is twitter worth scraping though for AI? I mean, Reddit I get, but twitter content has suuuuch a low signal to noise ratio.


Seems to entertain many so has value in that sense I guess. Plus perhaps post vs replies make some sort of challenge & response pair that can be leveraged?


Metadata is probably more interesting than the raw content.


> Reddit I get

Reddit is a cesspool of misinformation and phallic references.

I guess there may be certain subreddits/tweeters that someone might want to train on, but I dont understand why.


I only get

> Something went wrong. Try reloading.

on that link. Which, frankly, is hilarious.


This is a direct consequence of Elon gutting Twitter's infrastructure.


This makes no logical sense. Why would the scraping not restart after it's unlocked? More realistically, he got a lot of backlash from users and website owners when embedded tweets suddenly stopped showing up.


Presumably they will put some protections in place.


Possibly, but they wrote those protections in ~5 hours? Seems dubious.


Where did you get the 5 hours from?


I used to work for a very large financial institution. Scraping from finance apps was a material source of load even with substantial countermeasures in place. I can’t imagine what it does to sites like Twitter and Reddit (and HN).


HN volume is absolutely tiny (the ids are sequential so you can easily check how many items there are in a given day) and there’s an API. It’s no comparison.
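Since the ids are sequential, daily volume falls out of two snapshots of the official API's `/v0/maxitem.json` endpoint (the snapshot values below are made-up examples, not real readings):

```python
# Estimating HN's daily item volume from two hypothetical maxitem snapshots
# taken 24 hours apart. Every story and comment consumes one sequential id.
maxitem_yesterday = 36_540_000
maxitem_today     = 36_555_000

items_per_day = maxitem_today - maxitem_yesterday
print(f"~{items_per_day:,} items/day")
```

A few tens of thousands of items a day is nothing next to a platform serving hundreds of millions of users.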


HN has an open API look at the footer


Isn't Musk a major investor in OpenAI? If he said "my data feeds chatgpt, not its competitors", that would make sense, right?


He's been somewhat critical of OpenAI. Specifically the part about it pivoting to a for-profit business.

> I’m still confused as to how a non-profit to which I donated ~$100M somehow became a $30B market cap for-profit. If this is legal, why doesn’t everyone do it?

https://twitter.com/elonmusk/status/1636047019893481474

> I donated the first $100M to OpenAI when it was a non-profit, but have no ownership or control

https://twitter.com/elonmusk/status/1639142924985278466


> He's been somewhat critical of OpenAI. Specifically the part about it pivoting to a for-profit business.

Because it was thereby starting to compete with his own for-profit.

He was even asking for a moratorium for 6 months so his company could catch up.


You’re sharing lies. He donated $10 mil and talks it up 10x.


There's a lot of lies in tweets. The point was: he's been disparaging OpenAI. It doesn't matter how much he donated or if he donated.


Isn't the more important point that his consistent lying makes his disparagings unreliable as a signal?


Can you point out any of his lies?


I assume you are joking. In case you aren't, start with the subject of this very thread ($100M claimed Vs $10M reality). Then work backwards through every claim he ever made about anything. Here's a collection of his top hits https://www.elonmusk.today/


He tried to pressure them in 2020 to make him a CEO. They refused, so he pulled promised funding when they were on the brink of bankruptcy. They made a deal with Microsoft instead.

Then in 2022, they blew up and Elon's been spitting venom at them ever since as he missed his chance.


No, he sold all (I think) of his shares a long time ago, apparently because of a conflict of interest with Tesla.


Wasn’t it a non-profit back then? Do they actually have shares? I thought part of the point is that they don’t turn a profit to pay out to investors.


He had committed to providing funding to them and was on the board. Being on the board is indeed the only form of control in a non-profit.

He tried to pressure them to make him a CEO, they refused, so he said "no money then, go bankrupt" and quit the board. They made a deal with Microsoft and survived.

Now he's pissed.


Disables API, gets scraped, needs more servers.

Congrats you played yourself.


Disables API, gets scraped, needs more servers, disables access without logins, gets millions of fake accounts, has to deal with the fake accounts, in the process deletes tons of real accounts, users pissed, scraping continues, server bills keep rising...


There was an old woman who swallowed a fly …

Twitter is maybe at the dog stage. Perhaps it’ll die.


Missed opportunity to poison the scrapers' well by showing them AI-generated tweets.

Bad data is devastating in a way a HTTP 401 is not.


> immediate action was necessary due to EXTREME levels of data scraping

It sounds like Elon doesn't "get" the open web


I think it will be sort of interesting to see what AI scraping does to the open internet.

I think that we are already putting too much content into social media platforms (HN included), stuff that we sort of ought to self-host because then we would actually own it. But will you even want to run your own sites publicly if they are getting scraped? I guess it isn't really a new issue as such, but I imagine it'll only get worse as the LLM craze continues to rise.


But you can buy verified Twitter accounts starting at $0.035 apiece. I really don't understand how this can pose any serious roadblock for scrapers.


You haven't been able to look at anything but the /explore endpoint for weeks without an account, and the "content" on there has been total garbage.

I was relieved when they started asking for an account this week because now I'll finally be able to break my habit of navigating to Twitter to "see what's happening" only to find a bunch of sports memes, pop music drama, or right-wing trolls pretending like Hunter Biden is the most nefarious person on the planet.

Imo, Elon is lying and he locked everything down for PR so he would make a headline and frame it like his site's content is _so valuable_ that he just had to take drastic measures to stop AI from training on it.


He's not wrong on this occasion, there are multiple companies out there, some even with a multi-billion dollar valuation that "farm" tweets for many reasons.


The whole thing is down now, returning “rate limit exceeded” in the UI. A very hamfisted affair imho


Planet Earth in 2023: 8 billion inhabitants -> 4.8 billion social media users -> 368 million Twitter users who engage at least once a month. If those AI models are being trained on a reduced set covering under 5% of human beings, they will miss a lot.


They could just provide or even sell well curated datasets instead.


That's just a justification to force users to create an account ... he killed the API that already had a rate limit ... it's so obvious that it hurts


I just get a 'something went wrong' message from this link. Is that because I'm not logged in, if so why don't they say that?


> Something went wrong. Try reloading.


I am trying…


Doesn't really make sense… What prevents a logged in bot from continuing to scrape vast amounts of data?


Then we just add 2fa to the scraper and continue scraping the shit out of Twitter. Checkmate.
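And automating 2FA really is that easy: TOTP (the code your authenticator app shows) is a published algorithm, RFC 6238, implementable with nothing but the Python standard library. The function and secret below are illustrative, not anything Twitter-specific:

```python
import hashlib
import hmac
import struct

def totp(secret: bytes, timestamp: int, digits: int = 6, period: int = 30) -> str:
    """RFC 6238 TOTP: HMAC-SHA1 over the time-step counter, dynamic truncation."""
    counter = struct.pack(">Q", timestamp // period)
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 Appendix B test vector: secret "12345678901234567890",
# T=59s, 8 digits -> "94287082"
print(totp(b"12345678901234567890", 59, digits=8))
```

A scraper just calls this with `time.time()` whenever the login flow asks for a code; 2FA only stops attackers who don't hold the secret.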


Updating from the future (20230703T113430Z) and this has not been unlocked.


I was about to say that, I read the same post, I also agree somewhat.


Curious how Elon is going to train his OpenAI competitor. Scraping?


Leave it to Musk to create a hair-on-fire emergency out of a regular cup of coffee. "EXTREME"! Please...

But hey, I am eternally grateful. As a super-compulsive Twitter user, I read accounts and my own lists after I log out. This fixes it.

Have your "emergency", Elon.


That's Musk-speak. Translation: "we broke shit again and need to shed load to keep the site up."

I wish it would stay that way, though, so that I wouldn't land on the site by accident.


Less likely the reason is technical / cost of service (which is very cheap) and more likely he is trying to exercise leverage in pursuit of monetizing engagement (which had already happened [1] ). It’s not that he broke the website’s tech but rather he broke the website’s business.

[1] https://www.datacenterdynamics.com/en/news/twitter-pays-goog...


Oh how the disruptors hate disruptors


Yeah but he's full of crap; why would we believe anything he says about anything?


The above comment doesn't deserve to be downvoted. If Musk wants people to believe his statements, he should have refrained from being a serial liar. Now his reputation is in the trash, and he has only himself to blame.


In fairness, the first part of my comment is a personal attack that could have been left out. But the second part is what you're agreeing with, that he doesn't have credibility because of making stuff up a bunch recently. And I think that's right.


His tweets are as trustworthy as Trump's.


And as full of random capitals.


I wonder if it's people seeking to move away from Twitter and working around crippled APIs, or if it's ClosedAI in which Musk himself invested before...


Amusing that I called precisely the cause well before Elon tweeted (https://news.ycombinator.com/item?id=36542697) and was downvoted for it.


HN has an audience which is full of hypocrisy


I came here to say exactly this. Is there a reason why they didn’t do it? I couldn’t figure it out from the article.


I have been a Windows/Linux user for about as long as I can remember (because cheap!). For the past 8 months I have been using a MBP and absolutely love it. It is a pleasure to look at and use, except for the keyboard! The trackpad experience is unparalleled IMHO.


Excellent tool. Can this analyse the result from kustomize files rather than actual k8s YAML?


Yes, kustomize is not supported natively, but you can achieve the same effect by piping the kustomize output to kube-score:

    kustomize build | kube-score score -


Thanks. One more question. For Visual Studio Code, Microsoft has a plugin called Kubernetes - which I currently use. Have you done a comparison against that?


Couldn't resist looking at it after your repeated warnings. It seems the shellcheck site has some suggestions for improving the code.


How does this compare against chaos monkey?


PingCAP engineer here.

Chaos Monkey was built by Netflix and focuses on testing microservice systems by terminating virtual machine instances and containers.

Chaos Monkey is the pioneering project of Chaos Engineering, from which we draw a lot of inspiration. Compared to Chaos Monkey, Chaos Mesh currently focuses on the Kubernetes platform and provides richer fault-injection methods for complex systems, such as injecting faults into the network and file system. Kernel-level injection will also be supported in the future.


Okay. It would be nice if this detail were added to the README of the repo.


Thanks for your advice, we will update the README ASAP.


Self-driving cars are a no-no on Indian roads! The chaos is, and will remain, beyond any AI we will get to see in the near future.

Reference: me (an Indian)


India needs to get its act together with traffic. I'm heartened that there are anti-honking groups already --- that's a big start. The first step in fixing driving in India is changing the honking culture. Seriously.

Efficient transit is just table stakes for being a serious economic power. Indians are smart. They'll figure out how to fix things.


I envy them for their minimalism. I haven't been able to achieve it despite years of trying. Items just pile up without my noticing.


I spent six months travelling and working remotely recently, in Japan, Taiwan and Australia. I only had a carry on suitcase and I never missed anything.

I'm now back in London and after two weeks I have bought at least three small suitcases worth of new things. It's amazing how quickly things pile up. Once I had the space I started accumulating possessions. I go into shops and come out with things I didn't know I needed.


this post has some more images of the minimalist approach - http://www.thisisinsider.com/inside-japans-extremely-minimal...

