More

kakakiki · 2024-03-30T08:31:37.000000Z

I am just making a wild suggestion here. May be ask AI to generate one with the above prompt?

Shawnj2 · 2024-03-30T08:54:30.000000Z

If you want your image processing algorithm to be optimized for the real world you should test against real photos. It’s not like we have a shortage of photos so synthetic data is absolutely necessary

barfbagginus · 2024-03-30T09:12:34.000000Z

Give me ten k and I'll make the AI take the prompt and return the best stock image, guaranteed

The optimal lenna replacer, for boomers who think they can reasonably belabor this point

Just pay me up front, papi needs a quick cash fix scratches their neck and tries to swivel their head 360 degrees

Shawnj2 · 2024-03-30T09:41:10.000000Z

No need, we have an ethically sourced Lena image here https://mortenhannemose.github.io/lena/

akoboldfrying · 2024-03-30T09:54:45.000000Z

This is clearly a sly reference to an exclusionary in-joke, which makes it an in-joke itself, and thus also exclusionary.

Needless to say, I'm still offended. And I think we can all agree that no one should ever be offended.

Fnoord · 2024-03-30T11:41:23.000000Z

I'm offended that you're offended. And I think we can all agree that no one should ever be offended.

Hint: this path or yours? It doesn't work. Feeling offended and asking others not to offend you has to be reasonable within society's moral values. They've changed, obviously. But there is a large group of conservatives (as in: the type who wants traditional values) who loathe that change. The group is significant enough that they can push back changes like these. Case in point: Trump got elected and ensured SCOTUS is now 6-3 conservative-progressive.

barfbagginus · 2024-03-31T21:54:21.000000Z

I am offended you're offended they're offended. And I believe the best defense against trolling is direct offense

Hint: this little old glory jig of yours? You know why you're offended? Because you have no principles or values worth fighting for. But you want to fight for something. So you fight for an inglorious past. And you can't think of a better way than to skirmish with sjws at random.

We see you, and you are transparent to us. If you were above the struggle, you would say nothing. Yet you want to fight. You NEED to fight. We all do. And yet you have nothing worthwhile to fight for. So you fight us.

The problem with your approach is that your way of thinking will die, since it produces nothing new or worthwhile. To the degree that your views get presidence, there will be misery and gnashing of teeth, stagnation, and eventual failure of the empire built on your non-principles. But the cause of true justice will remain living, until victory.

Fnoord · 2024-04-01T11:47:34.000000Z

LLM gibberish.

barfbagginus · 2024-04-02T03:35:08.000000Z

False. The best LLMs speak with more empathy than I do

If I was using an LLM, I wouldn't be beefing with you. I'd find some way to engage you while respecting your point of view.

Nope. I'm definitely a full 100% grade A human a*hole

barfbagginus · 2024-03-31T21:25:36.000000Z

That's not ethically sourced Lena. That's mansplained Lena. And it doesn't add anything new or better.

Srsly, just hand over the 10k. I'll get the job done! Then we'll have thousands of Lena alternatives! And they'll be objectively better!

kakakiki · 2024-03-21T04:10:08.000000Z

As they say - Hope is not a strategy. I have to remind this periodically to avoid gathering moss.

kakakiki · on July 1, 2023

From Elon's twitter (https://twitter.com/elonmusk/status/1674942336583757825?s=20)

"This will be unlocked shortly. Per my earlier post, drastic & immediate action was necessary due to EXTREME levels of data scraping.

Almost every company doing AI, from startups to some of the biggest corporations on Earth, was scraping vast amounts of data.

It is rather galling to have to bring large numbers of servers online on an emergency basis just to facilitate some AI startup’s outrageous valuation."

danShumway · on July 1, 2023

I have a bridge to sell anyone who believes this is true. AI companies have been a boon to businesses that want to lock down user data and now have an excuse. It may be true to the extent that Musk is legitimately angry that Twitter isn't getting a piece of that AI VC money (I'm sure he is). But:

A) Twitter would probably move in this direction even if AI companies didn't exist, this is an excuse. Nothing about Musk's Twitter has indicated that he cares about Open data access or anonymous access to the site, and this follows a general trend of closing down the platform to non-monetizable users. Musk has abundantly shown in the past that he would prefer everyone browsing Twitter be logged into an account.

B) "it's temporary" -- how? You don't have a way to stop this other than forcing login. That situation is not going to change next week. To call this "temporary emergency measures" is so funny; there is no engineering solution for this and you're not going to be able to successfully sue companies for scraping Twitter. Put a captcha in front of it? Sure, let me know how that goes.

You going to wait and see if the AI market collapses in the next month?

If this does turn out to be temporary, it'll only be because of migrations off of Twitter and because of user criticism, because Musk is impulsive and bends easily under pressure. But nothing about the situation Musk is complaining about is going to change next week.

m00x · on July 1, 2023

FWIW: I've been scraping the shit out of social media for my AI training. I also do Amazon, AliExpress, etc.

Libs like puppeteer is so good these days that it's impossible to tell real users from fake traffic. Most of the blocks are just IP blocks.

qingcharles · on July 1, 2023

Right. And the IP blocks only add a small cost to the scraping because it forces people to use residential IPs which can't sanely be blocked.

agentgumshoe · on July 1, 2023

Not to mention Elon is great mates with the VC douchebags that are busy hyping and profiting off the AI hype train.

random_cynic · on July 2, 2023

> there is no engineering solution for this and you're not going to be able to successfully sue companies for scraping Twitter

There absolutely is, if you try instead of whining on internet. People at Vercel have already developed new anti-bot + fingerprinting + rate limiting techniques which look quite promising. I dare say within a year, new tools will be powerful enough to do this easily.

danShumway · on July 2, 2023

> I dare say within a year, new tools will be powerful enough to do this easily.

I see where you're coming from, but if Twitter is in a position where it can't roll out those protections right now, given its current head counts, etc... it's not going to be in a position where it can roll out those protections next week. Probably not next month.

So it's less that no one could block companies from scraping Twitter (although anti-scraping mechanisms are probably always going to be a cat-and-mouse game, so I'm not sure that there is ever going to be a perfect easy solution). It's more that if Twitter can't do it right now, nothing is going to magically change any time soon about the situation it has found itself in. And waiting a year (even waiting 6 months) for tools to become available before rolling back this rate limiting would be incredibly self-destructive for Twitter.

The way I see it, they're basically guaranteeing that they will need to roll back these changes before they have a solution to whatever specific problem or irritation Musk is fixated on. They're not going to gain additional engineering capabilities in the next week. And how long does Musk plan to leave rate-limiting in place? A social media site where people can't look at content is just broken.

versteegen · on July 1, 2023

> Nothing about Musk's Twitter has indicated that he cares about Open data access or anonymous access to the site

Not so, he tasked George Hotz with getting rid of that horrible popup which prevented you from scrolling down much if you weren't logged in, which was added soon before he bought Twitter. When that was removed I rejoiced. But now Twitter's gone 100x in the opposite direction.

danShumway · on July 2, 2023

I don't know; was that Musk's idea, or was that Hotz's idea? I vaguely think this was a change that Hotz wanted that Musk went along with.

To be fair, Musk will regularly pay lip-service to the idea of Open communication. I guess that's not literally nothing, but most large site policies have been in the direction of locking down content.

If there ever was a version of Musk that cared about Open access, it's been a while since that version of him saw the light of day. It's very consistent with his overall behavior to believe that he views Twitter content as being primarily his property rather than a community resource, and that he thinks that scrapers/AI companies/researchers are literally stealing from him if they derive any value at all from data that Twitter hosts.

thissitesucks__ · on July 1, 2023

[flagged]

postalrat · on July 1, 2023

Let me guess. He got bored and moved on before making any breakthrough.

cassepipe · on July 9, 2023

It's been a week now, you prediction held. Still can't access without an account.

utbabya · on July 2, 2023

Pretty sure they'd get into troubles with blue check mark buyers implicit promises if it's not temporary.

benjaminwootton · on July 1, 2023

Elon has a good point there. Much of the current AI hotness is predicated on stealing peoples content and exploiting the infrastructure that other people have built. I don’t think it’s acceptable.

The licenses, compensation models, law, technical solutions, attribution, security and privacy all need time to catch up. Regulation has a role to play as its a bit of a free for all right now.

The irony of Elon mentioning “outrageous valuations” though!

xorcist · on July 1, 2023

Why would an AI company start scraping twitter html, instead of using an already existing archive? Something similar to archive.org could earn money from that. If all you want is the content, there's no reason to suck it through a straw.

I'd expect those that require real time data, such as stock market bots or sentiment data providers, to scrape twitter (if they don't provide the data by other means, for example the "firehose", which is another great way to earn money).

None of this makes much sense.

Also, it's much more complicated than it seems. The web works because the data is public. You cannot think of it as "my data". (Especially not twitter, since it is really their users'!) Twitter is not higher quality data than any other web page.

If we accept that thinking, every home page would require a login to see that specific company's phone number of opening hours. Those pieces of data are also valuable, in the right circumstances! And then the web would either not exist or the account system required would be so wide spread that accounts would carry no value and the system would become useless.

ben_w · on July 1, 2023

> Why would an AI company start scraping twitter html, instead of using an already existing archive?

I can think of a few possible reasons. They might want more up-to-date info, or they might have no real developers and the scraper was created by a business guru who prompted ChatGPT and didn't understand the code that came out.

Given what else Musk has asserted about Twitter, and how often former or current Twitter devs have contradicted him, it may not even be what Musk said.

> Twitter is not higher quality data than any other web page

Eh, depends how much you can infer from retweet, favourites, etc.

Won't be the only such site, but it's probably better training data than blog posts are these days.

But yeah, I absolutely agree that Twitter doing this caused a lot of damage to any orgs, corporate or government, which wanted to be public, anything from restaurants announcing special offers to governments issuing hurricane warnings. Twitter isn't big enough to assume everyone has an account, like Facebook is.

kungito · on July 1, 2023

If there was more value in requiring login than there is in having this public and easily accessible, it would be behind a login form. The current internet has 99% of the time nothing to do with values someone imagined in the 80s

Zak · on July 1, 2023

Value to whom? Twitter is more valuable to its users, to journalists who embed tweets in stories, and to web users at large who follow links and search results if it does not require a login to view posts.

Of course, none of those people own Twitter, and it may well be more valuable to its owners if it does require a login.

xorcist · on July 1, 2023

What you describe is the Facebook business model. Which seems to be a valid model, but twitter was not built around it and such a pivot would break all business moats around the company.

There was no web in the 80s so not sure what values you refer to, or how they are relevant to today's businesses.

fieldcny · on July 1, 2023

What do you think the webs value is today?

What do you think the webs value was imagined to be in the “80s”?

MagicMoonlight · on July 1, 2023

Because they want an advantage over competitors who are using the archives already…

mbrochh · on July 2, 2023

If every AI company pointed their scrapers to archive.org, that site would go down immediately as well.

This is just kicking the can down the road.

We have a major structural problem now. We want data to be free and machine readable, but no startup (and even a giant like Twitter) can afford the server cost to withstand all those machines.

kergonath · on July 1, 2023

> Elon has a good point there. Much of the current AI hotness is predicated on stealing peoples content and exploiting the infrastructure that other people have built. I don’t think it’s acceptable.

But then so is Twitter. They don’t produce any content whatsoever. The data they are having a fit about is not theirs, it’s been volunteered by the users. It’s the same line Reddit is pushing, and it’s bullshit. AI companies scraping the web is no more unethical than Google doing it.

hashtag-til · on July 1, 2023

Well, one thing is people going there and putting their data on a platform. That’s their choice.

Taking/scrapping/stealing that data out of said platform for the benefit of your over-hyped “disruptive” startup - and implying that others should give you all for free - is the issue.

kergonath · on July 1, 2023

That’s not the point. Twitter has a non-exclusive license to distribute the content; it’s not the owner of the data regardless of the high horse Musk feels like riding today.

harshreality · on July 1, 2023

Please don't throw around the word "stealing" so loosely.

Scraping data from a public website is not "stealing". It might be a violation of the terms of service, but then you have the whole issue of click-through (formerly shrink-wrap) licenses and contracts of adhesion.

If someone isn't vetting you and potentially signing you to a more meaningful contract before giving you access, for free, to data, then using that data for any purpose whatsoever (except republishing it or derived works, which might, depending on the nature of the data or the derived works, be a violation of someone's copyright) is so far from "stealing" that using that word is wrong, and I suspect intentionally inflammatory.

That's why Elon limited access, rather than going to the police to file charges for theft, or suing over copyright violation or breach of contract. Not to say he absolutely couldn't do the latter, but it's hardly a clear win.

trompetenaccoun · on July 1, 2023

There are much bigger issues.

1. Users or one might say content creators don't own their data. Not just do those platform owners make a lot of money with the content (which they have a license to, as per site ToS) but then you have third parties scraping it now for commercial products. Using the data to train models that are then sold back to some of the same social media users who produced the content for free in the first place wasn't a thing until very recently, it used to be a select few doing machine learning research in the past. The laws are lacking behind the tech development and regular internet users are being exploited because of it.

2. It absolutely is stealing in some cases, and even worse. For example when they scrape it for content which they then use to train their bots to impersonate humans. Or on Twitter, there's a very common type of bot that steals content from young attractive female social media users in China, auto-translated to English, to pose as them. If you're in finance and crypto circles they're swarming with these accounts (guess the scammers know their targets).

3. In general this is only going to get worse from here on. LLM are getting better and better. On sites like Twitter you already have no idea if you're interacting with a human or not. But these "AI" can not actually think for themselves, they can only emulate, they can copy other humans. At least so far. So for the sake of making progress and ensuring we can still have intelligent discussions and find novel ideas online, it's imperative to have a way to keep the machines out. Social media must become sybil resistant or it dies in a vicious circle of self-referencing bots ever parroting the same old talking points, or variations thereof. We urgently need human ID!

M2Ys4U · on July 1, 2023

Theft requires the victim to be deprived of their property.

Nobody is deprived of their property, intellectual or otherwise, when this stuff happens, thus is it not right to call it stealing.

autoexec · on July 2, 2023

You may wish AI didn't exist, but it does. There's no putting the genie back in the bottle. We can still go after people who commit crimes using AI. Perhaps one day AGI will be possible and we will want to have discussions and share ideas with it just as do now with each other.

Governments, researchers, and all kinds of third parties have already been scraping every publicly available bit of data possible. There may be an increase now, but it's nothing new. It won't be the end of the society or the end of the internet anymore than AI will.

Also: https://www.youtube.com/watch?v=IeTybKL1pM4

trompetenaccoun · on July 2, 2023

We may be using different definitions of 'intelligence'. To me there is no AI that currently exists but I'm aware the companies market it as such of course.

>have already been scraping every publicly available bit of data possible

Data scraping is limited by economics just like anything else in the world. Storage costs money, someone has to pay for it. Researchers do not have unlimited funds. Some select few governments like the US may have most of the publicly accessible web archived. Keep in mind it's dynamic and requires massive data infrastructure to pull this off, there's tons of new data coming in daily. Private startups getting in on the action in a big way is a relatively new phenomenon, this used to be limited to enterprises with a specific purpose. Now everyone and their 4chan cousin are experimenting with their own deep learning models.

DropInIn · on July 2, 2023

Heres the thing about the ToC and licenses:

Bots aren't people and can't read nor consent. They just consume.

Any page which can be served without first displaying a ToC or other terms which explicitly prohibit access is not protected by a ToC or other license from scraping, as they can be considered a Point of First Contact in each case, as the bot has selected each link from a simple aggregation of all links it encounters (each interaction being "new" in essence).

Now it could be argued that ignoring robots.txt is an explicit contravention of norms and standards which could be viewed as a violation of an implicit licence, but there is no law requiring adherence to robots.txt and thus no mandate that a program even look for it iiuc.

sourdoughness · on July 4, 2023

Bots aren’t people and can’t consent, sure - but they are tools that are wielded or deployed by people who absolutely can consent (setting aside whether click-wrap terms are enforceable or not). If I throw a brick through a window, it’s me in the shit, not the brick.

DropInIn · on July 4, 2023

If I have an open door to my business and someone's automated robot walks in the door to see what's available, how is that different?

Even more applicable, this is like saying that a person walking down the street can't have a camera and take a picture of the front of the building....

Because the page you land on when entering a url is in fact little different than a store front, with the associated signage and access points defining how a person or automated device may interact with that business.

If you want to have it different then you have to actually put everything behind a locked door with no window, right?

yuuuuuuuu · on July 1, 2023

This could easily be solved by making the unauthenticated access hard for machines to consume, like introducing delays or some kind of captcha or even just proof of work (reverse some hash). While the authenticated get all the snappiness they want.

I'm strictly anti account, so he just lost me as audience. The next walled garden after Facebook and Instagram that won't ever see me again.

maxlin · on July 1, 2023

It already was semi-hard to machine-read, that is the reason I use Nitter for doing my small-scale continuous scraping of twitter which is now temporarily broken. Nitter is tons easier to parse as it's not reliant on JS, etc, and simpler to create screenshots of with headless chrome.

However if you mean implementing some even worse obfuscation (kind of like FB putting parts of words in different divs etc) that is not really compatible with the situation that this needed to be done as more of a temporary emergency measure. And PoW doesn't sound reasonable because it sets mobile devices against the scraper's servers. If all of this was just so easy, scraping would be dead. Good that it isn't.

yuuuuuuuu · on July 1, 2023

> And PoW doesn't sound reasonable because it sets mobile devices against the scraper's servers.

Scraper servers and mobile devices have different access patterns though. I I'm reading tweets then I'm fine waiting 1 second for a tweet to load. Page load times for this kind of bloated stuff are super slow anyway, meanwhile my mobile could spend a second or two on some PoW. But if you want to large-scale scrape, you suddenly have to pay for 1bn CPU seconds. And this PoW could even keep continuously increasing per IP. 0.1% with every tweet. Not noticeablr for the casual surfer sitting on the toilet, neck-breaking for scrapers.

> If all of this was just so easy, scraping would be dead. Good that it isn't.

Small-scale scraping could still be provided through API access or just a login.

The reason they are not doing the "easy" thing is that they don't see a need (yet, perhaps). Just get an account, they'd say, and they are right. It works for Instagram too, except for some weirdos who nobody really cares about.

maxlin · on July 1, 2023

Of course the scraper would have to pay too. But it makes for a race between how much they are willing to pay, versus how much worse the experience gets for real users. And for successful mobile apps, reducing average load even during active use is important (example: idle games that don't want to make your phone a drying iron, companies invest in custom engines and make all kinds of compromises to avoid this). And burst-allowing rate limiting is something I'm quite sure was already in place, especially with prejudice towards datacenter/VPN IP's. But similarly to how it is with search engine scraping, professional scrapers already have costly workarounds for these.

>The reason they are not doing the "easy" thing is that they don't see a need (yet, perhaps).

This argument just doesn't make any sense. Twitter notes that this is hurting them. Previews in chat apps, just clicking links in non-loggedin contexts is are broken. I feel like you just predict that this will turn out to be more accepted in the near future and become more a more permanent decision, which you don't like.

rvba · on July 1, 2023

Im not fine waiting 1 second.

Most baffling is mobile reddit, where it takes like 6 seconds to load. Do they want us to use their crappy app, or they just dont care?

depereo · on July 1, 2023

They're acting like they're desperate for you to use their crappy app.

taneq · on July 1, 2023

They’re pulling every underhanded trick in the book to try and force mobile users onto the app. Yeah, I think they want you to use the app.

yuuuuuuuu · on July 1, 2023

You can still get a login and have no delay.

For non-auth use, I rather wait for 1 second than not have any access at all. Which is the current state of affairs.

WJW · on July 1, 2023

Maybe that is already the PoW anti-scraping measures haha.

Quarrelsome · on July 1, 2023

HTTP Status Code 429 exists for this very purpose. While I sympathise with the idea that services need to protect their content from scraping to power AIs, I can't help but feel its a convenient excuse for these companies to re-implement archaic philosophies about online services. i.e. Killing off 3rd party apps and walling their garden higher, both feel very boomer in their retreat from the openness of the internet that seemed to be en vogue prior to smartphones. Perhaps this is just the transition from engineers building services to business, legal and finance trying to force the profit.

Correct me if I'm wrong, but surely throttling scrapers (at least ones that are not nefarious in their habits) is a problem that can be mitigated server-side, so I find it somewhat galling that its the excuse.

yuuuuuuuu · on July 1, 2023

> is a problem that can be mitigated server-side

No matter what you do, this will cost server infra. That's Musk's argument for disabling access altogether.

Therefore it would make sense to have a solution which burdens the client disproportionately in relation to the server. A burden so low for the casual user that it's negligible but in aggregate, at scale, would break things. Which is what he wants.

Quarrelsome · on July 1, 2023

> Which is what he wants.

Looks to me like both reddit and twitter are using the wedge to rather increase the height of the wall of their gardens and kill 3rd party development as opposed to genuinely trying to license bulk-users appropriately.

You're gonna need to license api keys so you're already identifying consumers and there's your infra which you need anyway. At which point you can throttle anyone obviously abusing whatever free/open-source tier offering you give out as standard.

nirui · on July 1, 2023

Unless the captcha is annoying enough to a significant degrees, I doubt that it would work. With all the money in the bucket, scrapers can just hire a captcha farm to get pass the captcha with help from a real human.

Also a side note: distributed Web crawler is not unheard of these days, as well as residential IP proxies. Meaning the effectiveness of Proof of Work model maybe also limited.

yuuuuuuuu · on July 1, 2023

How do residential proxies help? Scraping would effectively be bitcoin mining, which costs resources without shortcut.

nirui · on July 2, 2023

Many online services (including Twitter) do employ some kind of IP address scoring system as part of their anti-scraping effort.

These systems tend to treat residential proxies as normal users, and puts less restrictions on them. On the other hand, if the IP address belongs to some (untrusted) IDCs, then the system will enable more annoying restrictions (say rate limits etc) against it, making scraping less efficient.

dragon111 · on July 1, 2023

Sounds like outright banning is a stopgap measure, maybe they will implement one of these solutions

dredmorbius · on July 1, 2023

The other option would be to front caches through ISPs and the like.

This works far better when the items requested are small in number but large in volume (that is: a large number of requests against a small set of origin resources). When dealing with widespread and deep scraping, other strategies might be necessary, but these aren't impossible to envision.

Specifically permitted scraping interfaces or APIs for large-volume data access would be another option.

Of course, there's the associated issue that data aggregation itself conveys insights and power, and there might be concerns amongst those who think they're providing incidental and low-volume access to records discovering that there's a wholesale trade occurring in the background (whether that's remunerated or free of charge).

larodi · on July 1, 2023

Elon is making a point and a reminder for everyone that what you share on social nets like Twitter is basically not owned by you, but the service.

Actually I’m surprised this took so long to do, and in the light of doing so shows that perhaps Twitter was sold for its existing content rather than existing or active user base.

epidemian · on July 1, 2023

> [...] perhaps Twitter was sold for its existing content rather than existing or active user base.

Not a very significant distinction: if active users stopped posting, scrappers wouldn't have much of a reason to keep scrapping.

Retric · on July 1, 2023

Most AI startups don’t particularly care if content is 3 minutes old or 3 years.

astrange · on July 2, 2023

If you're adding new/marginal data you should want it to be as current as possible so it'll have things like trending slang terms.

valval · on July 1, 2023

They absolutely do. What are you even saying?

Retric · on July 1, 2023

AI startups training data covers content going back years. DALLE for example was trained on hundreds of year old paintings alongside more modern works.

Age may be included as part of the training but they generally want to suck up as much data as possible.

theteapot · on July 1, 2023

dist-epoch · on July 1, 2023

> basically not owned by you, but the service.

It's shared ownership. You own it, but give Twitter non-exclusive permission to also use it.

This is why news agencies request permission from a twitter account before sharing a picture they took.

xigoi · on July 1, 2023

Twitter can delete the post without your consent, so you don't really own it.

jkaplowitz · on July 1, 2023

You consented to their being able to delete it when you agreed to their terms of service. It’s like if you hire someone to clean your home. Mostly they’re tidying up and dealing with dirt and dust, but if they see what looks like a used napkin lying somewhere, they will probably throw it out without first asking if you still want it - without that being stealing and without ever owning it themselves.

It may seem weird weird to compare useful content to a used napkin, but hey, successful business founder stereotypes do quite often involve have an idea written on a napkin…

xigoi · on July 1, 2023

I didn't consent to anything, I don't have a Twitter account. I'm talking about people who do. And they often mistakenly think their content will stay on Twitter forever, so they don't need to back it up.

jkaplowitz · on July 1, 2023

Fair enough. By “you consented … when you agreed”, I really meant “one consents … when one agrees”, as is common in informal English.

Yes, it’s a mistake to rely on social media content remaining up forever, agreed. That’s separate from ownership. Backups are important even for data on a hard drive you physically own, since hard drives can fail or be damaged or lost.

xigoi · on July 1, 2023

Sorry, I don't think I've ever seen “you” used as a generic pronoun in a past-tense sentence, which is why I took it personally.

jkaplowitz · on July 1, 2023

Also fair - it’s possible that my choice of tense made that the only literal reading, but my point was intended as general and not accusatory.

catoc · on July 1, 2023

You can't park there mate.

xigoi · on July 2, 2023

That's not past tense.

dist-epoch · on July 1, 2023

Ownership is not the same with right to display.

I can own a picture, but I can't place it on the NY Times website.

tracerbulletx · on July 1, 2023

It's not though. They have a very permissive license but they don't have any actual IP ownership.

x3874 · on July 1, 2023

Was already the case for ICQ. (yes, it was in their ToS)

quest88 · on July 1, 2023

How do you define stealing? Is the AI data obtained from accessing private data? Data that users did not make publicly available but kept to themselves on their own devices?

johnnyanmac · on July 1, 2023

I can't really agree. We've already had rulings about data scraping and I don't see the difference here. Just that a lot of people do it now?

Also, Twitter is a public platform. Twitter didn't generate comments, and people posting on a public account are indirectly subject to public viewing. Not much different from being indirectly recorded in a public park

matwood · on July 1, 2023

> stealing peoples content

The Twitters and Reddits need to be careful here when complaining because without users generating free content, they also have no business.

PartiallyTyped · on July 1, 2023

We know that quality data is king, with all due respect to people tweeting, that data is most likely garbage.

benj111 · on July 1, 2023

If I visit Twitter to work out how to sort some JavaScript issue, and that makes my company $X, am I stealing content, or am I just using the platform?

There's one major player making money off of other peoples content here, and that's Twitter. Why are they ok doing that, but not anyone else?

WeylandYutani · on July 1, 2023

Scraping has been a thing since the web started. Happens to every public site.

I recall that email at one point was 90% spam.

croes · on July 1, 2023

But is Twitter a good source for AIs?

bob-09 · on July 1, 2023

Used responsibly, of course it is. A developer is able to ingest current language used in exchanges about current topics, as well as cite prominent sources that are still using the platform.

wslh · on July 1, 2023

> exploiting the infrastructure that other people have built

Including you and me. WE built part of the infrastructure.

realusername · on July 1, 2023

It's not like Twitter is compensating tweet authors either. For using art the debate is still opened in my opinion even if I'm personally not in favor of it but I don't see how those platforms built on user made content (even more of a clear cut than AI) can have a say on this

benjaminwootton · on July 1, 2023

People are happy to put their content on social networks. Maybe they get some value in return such as sales, exposure, signalling or simple enjoyment.

Many people who aren’t that privacy conscious would however object to lots of companies, big and small, sucking their content into their databases for their own uses, then republishing after it’s passed through a few AI models.

fauigerzigerk · on July 1, 2023

>People are happy to put their content on social networks.

Do they have a choice? A handful of corporations have captured all the network effects. If you need to reach an audience to do your job or find your "friends", what other choice do you have but to give your data to them?

badosu · on July 1, 2023

I fail to understand this argument. If your friends are close enough do you need a big corporate network to share content/thoughts?

If they aren't close why do you care?

Alternatively, what would make more sense is to participate in communities gated by these networks but then it's your choice to be there.

fauigerzigerk · on July 1, 2023

>If your friends are close enough do you need a big corporate network to share content/thoughts? If they aren't close why do you care?

I don't personally have this problem, but my observation is that most social relationships are somewhere in between closest friends and don't care.

My own concern is more about participating in professional, neighbourhood, civil society or political communities. Choosing not to be where they have decided to congregate means not being able to do my job and not making my voice heard where many decisions affecting me are taken.

mnky9800n · on July 1, 2023

Yes. AI is becoming the content launderer. I mean what's the difference? You could ask an AI to make not star wars. And what's the difference between that and all the not star wars movies made in the 80s? It's that it was automated this time around?

I think this points out that AIs clearly do not work like human brains. Human brains do not need all of the content of humanity to produce a replica of art station mediocre.

realusername · on July 1, 2023

It's not like there's many alternatives, network effects are very powerful and even with Musk running the company into the ground, there's not many people really quitting which tells a lot on how hard that effect can be.

mgiannopoulos · on July 1, 2023

They have announced a plan to compensate creators based on ads shown and also have implemented a subscribers feature (people paying users for special access to some tweets)

realusername · on July 1, 2023

I had no idea this existed, thanks for the insights.

zerd · on July 1, 2023

Well they killed the API, what did they think would happen? It's easier to control access and rate limit with a proper API.

AnthonyMouse · on July 1, 2023

The actual problem seems to be that a large number entities now want a full copy of the entire site.

But why not just... provide it? Charge however much for a box of hard drives containing every publicly-available tweet, mailed to the address of buyer's choosing. Then the startups get their stupid tweets and you don't have any load problems on your servers.

thatguy0900 · on July 1, 2023

What do you even charge for that? We might never make a repository of human made content with no Ai postings in it ever again. Seems like selling the golden goose to me

AnthonyMouse · on July 1, 2023

It's already public information. The point isn't to extract rents, it's to remove the incentive for server-melting mass scraping.

jacobolus · on July 1, 2023

Substantially higher loads than Twitter gets today were not "melting the servers" until Musk summarily fired most of the engineers, stopped paying data center (etc.) bills, and then started demanding miscellaneous code changes on tight deadlines with few if any people left who understood the consequences or how to debug resulting problems.

In other words, the root problem is incompetent management, not any technical issue.

Don't worry though, the legal system is still coming for Musk, and he will be forced to cough up the additional billions (?) he has unlawfully cheated out of a wide assortment of counterparties in violation of his various contracts. And as employee attrition continues, whatever technical problems Twitter has today will only get worse, with or without "scraping".

AnthonyMouse · on July 1, 2023

Scraping has a different load pattern than ordinary use because of caching. Frequently accessed data gets served out of caches and CDNs. Infrequently accessed data results in cache misses that generate (expensive) database queries. Most data is infrequently accessed but scraping accesses everything, so it's disproportionately resource intensive. Then the infrequently accessed data displaces frequently accessed data in the cache, making it even worse.

phatskat · on July 1, 2023

In theory, wouldn’t continuous scraping by AI farms et al put a log of this infrequent data into cache though?

dredmorbius · on July 1, 2023

Caches are only so large. Expanding them doesn't buy you much, and increases costs greatly.

The key benefit to a cache is that a small set of content accounts for a large set of traffic. This can be staggeringly effective with even a very limited amount of caching.

Your options are:

1. Maintain the same cache size. This means your origin servers get far more requests, and that you perform far more cache evictions. Both run "hotter" and are less efficient.

2. Increase the cache size. Problem here is that you're moving a lot of low-yield data to the cache. On average it's ... only requested once, so you're paying for far more storage, you're not reducing traffic by much (everything still has to be served from origin), and your costs just went up a lot.

3. Throttle traffic. The sensible place to do this IMO would be for traffic from the caching layer to the origin servers, and preferably for requesting clients which are making an abnormally large set of non-cached object requests. Serve the legitimate traffic reasonably quickly, but trickle out cold results to high-demand clients slowly. I don't know to what extent caching systems already incorporate this, though I suspect at least some of this is implemented.

4. Provide an alternate archival interface. This is its own separately maintained and networked store, might have regulated or metered access (perhaps through an API), might also serve out specific content on a schedule (e.g., X blocks or Y timespan of data are available at specific times, perhaps over multipath protocols), to help manage caching. Alternatively, partner with a specific datacentre provider to serve the data within given facilities, reducing backbone-transit costs and limitations.

5. Drop-ship data on request. The "stationwagon full of data tapes" solution.

6. Provide access to representative samples of data. LLM AI apparently likes to eat everything it can get its hands on, but for many purposes, selectively-sampled data may be sufficient for statistical analysis, trendspotting, and even much security analysis. Random sampling is, through another lens, an unbiased method for discarding data to avoid information overload.

marvin · on July 1, 2023

Twitter feels more stable today, with less spam, than one year ago. There's of course parts that have been deliberately shut down, but that's not an argument about the core product.

johnnyanmac · on July 1, 2023

Pandemic lock downs are 99% over. People are getting back outside and returning to office. These effects have little to do with Twitter's specific actions.

saltminer · on July 1, 2023

I see more spam these days, particularly coming from accounts that paid for the blue check mark. IIRC, Musk said that paid verification would make things better since scammers wouldn't dare pay for it (I would find where he said this but I hit the 600 tweet limit), but given how lax their verification standards are, it seems to be a boon to scammers, much the same way that Let's Encrypt let anyone get a free TLS cert at the cost of destroying the perceived legitimacy that came with having HTTPS in front of your domain.

(And IMO, that perceived legitimacy was unfounded for both HTTPS and the blue check before both were easy to get, it's just that the bar had to drop to the floor for most people to realize how little it meant.)

unconed · on July 1, 2023

The "massive layoffs" was just twitter returning to the same staffing level they had in 2019, after they massively overhired in 2020-2021. This information is public, but this hasn't stopped people from building a fable around doomsday prophecies.

phatskat · on July 1, 2023

I mean, it’s clear the Musk overcorrected. The fact that managers were asked to name their best employees, only to then be fired and replaced by them, or that musk purposefully avoided legal obligations to pay out severance/health insurance payments (I forget the exact name)/other severance, and that the site has had multiple technical issues that make it feel like there’s no review/QA process all show that he doesn’t know what he’s doing.

He got laughed out of a Twitter call thing with lead engineers in the industry for saying he wanted to “rewrite the entire stack” and not having a definition for what he meant.

Doomed or not, Musk is terrible at just about everything he does and Twitter is no exception

eusto · on July 1, 2023

I think this action from Twitter showed that it isn't public information. It is pretty much twitter's to do whatever they want with it.

phatskat · on July 1, 2023

I think that’s always been known, but the tacit agreement between users and Twitter has always been “I’ll post my content and anyone can see it, if they want to engage they make an account”. From a business perspective this feels like a big negative to me for Twitter. I’ve followed several links the last few days and been prompted to login, and nothing about those links felt valuable enough to do so.

pyrale · on July 1, 2023

Just because it is published doesn't mean authors don't retain rights on it. None of that content is public-domain.

joshvm · on July 1, 2023

It's about $1 per thousand tweets and access to 0.3% of the total volume. I think the subscription is 50M "new" tweets each month? There are other providers who continually scrape Twitter and sell their back catalogue.

https://www.wired.com/story/twitter-data-api-prices-out-near...

Researchers are complaining that it's far too high for academic grants. Probably true, but that's no different from other obscenely priced subscriptions like access to satellite imagery (can easily be $1k for a single image which you have no right to distribute). I'm less convinced that it's impossible for them to do research with 50 million tweets a month, or with what data there is available. Most researchers can't afford any of the AI SAAS company subscriptions anyway. Data labelling platforms - without the workers - can cost 10-20k a year. I spoke to one company that wouldn't get out of bed for a contract less than 100k. Most offer a free tier a la Matlab in the hope that students will spin out companies and then sign up. I don't have an opinion on what archival tweets should cost, but I do think it's an opportunity to explore more efficient analyses.

zapdrive · on July 1, 2023

> We might never make a repository of human made content with no Ai postings in it ever again.

Wow, never thought of it that way before. Kinda hit me hard for some reason.

thatguy0900 · on July 1, 2023

Honestly I think that's why reddit is closing itself up too. Everyone sitting on a website like this might be sitting on a Ai training goldmine that can never be replicated.

TeMPOraL · on July 1, 2023

Too little too late. Anything pre-ChatGPT is already scrapped, packaged and mirrored around the Internet; anything post ChatGPT launch is increasingly mixed up with LLM-generated output. And it's not that the most recent data has any extra value. You don't need most recent knowledge to train LLMs. They're not good for reproducing facts anyway. Training up their "cognitive abilities" doesn't need fresh data, it needs just human-generated data.

KnobbleMcKnees · on July 1, 2023

Precisely, which brings us back around to the question: why are social media companies really doing this?

I think "AI is takin' ooor contents!" is a convenient excuse to tighten the screws further. Having a Boogeyman in the form of technology that's already under worried discussion by press and politicians is a great way to convince users how super-super-serious the problem must be, and to blow a dog whistle at other companies to indicate they should so the same.

It's no coincidence that the first two companies to do this so actively and recently are both overvalued, not profitable, and don't actually directly produce any of the content on their platforms.

yuuuuuuuu · on July 1, 2023

One that's slowly ageing away though.

echelon · on July 1, 2023

Synthetic data fed into training isn't necessarily a bad thing. It can produce great results in many cases.

ricardobayes · on July 1, 2023

I've seen that work with self-driving cars. Simulating driving data is actually better since you can introduce black swan events that might not happen often in real world.

ricardobayes · on July 1, 2023

It doesn't matter all that much. Smaller but better data is better for training than a large, but garbage dataset.

lozenge · on July 1, 2023

Do you think twitter has no AI postings?

raincole · on July 1, 2023

Are you really sure it's legal? In theory it's not different from providing the same information from API or website... but do people working in law think so?

itronitron · on July 1, 2023

Twitter purchased Gnip years ago, and it's a reseller of social media data. Companies that want all the public tweets, nicely formatted and with proper licensing, can just buy the data from Twitter directly.

AnthonyMouse · on July 1, 2023

I'm assuming their terms give them permission to redistribute everybody's tweets, since that's kind of the whole site. I don't know why they'd restrict themselves to doing it over the internet and not the mail, but do you have any reason to think that to be the case?

nameless_prole · on July 1, 2023

We're talking about Elon Musk. I'd be surprised if he gave a shit.

dredmorbius · on July 1, 2023

So, I'd just made that suggestion myself a few moments ago.

That said, there are concerns with data aggregation, as patterns and trends become visible which aren't clear in small-sample or live-stream (that is, available in near-time to its creation) data. And the creators of corpora such as Twitter, Facebook YouTube, TikTok, etc., might well have reason to be concerned.

This isn't idle or uninformed. I've done data analysis in the past on what were for the time considered to be large datasets. I've been analyzing HN front-page activity for the past month or so, which is interesting. I've found it somewhat concerning when looking at individual user data, though, here being the submitter of front-page items. It's possible to look at patterns over time (who does and does not make submissions on specific days of the week?) or across sites (what accounts heavily contribute to specific website submissions?). In the latter case, I'd been told by someone (in the context of discussing my project) of an alt identity they have on HN, and could see that the alternate was also strongly represented among submitters of a specific site.

Yes, the information is public. Yes, anyone with a couple of days to burn downloading the front-page archive could do similar analysis. And yes, there's far more intrusive data analytics being done as we speak at vastly greater scale, precision, and insights. That doesn't make me any more comfortable taking a deep dive into that space.

It's one thing to be in public amongst throngs or a crowd, with incidental encounters leaving little trace. It's another to be followed, tracked, and recorded in minute detail, and more, for that to occur for large populations. Not a hypothetical, mind, but present-day reality.

The fact that incidental conversations and sharings of experiences are now centralised, recorded, analyzed, identified, and shared amongst myriad groups with a wide range of interests is a growing concern. The notion of "publishing" used to involve a very deliberate process of crafting and memoising a message, then distributing it through specific channels. Today, we publish our lives through incidental data smog, utterly without our awareness or involvement for the most part. And often in jurisdictions and societies with few or no protections, or regard for human and civil rights, let alone a strong personal privacy tradition.

As I've said many times in many variants of this discussion, scale matters, and present scale is utterly unprecedented.

AnthonyMouse · on July 1, 2023

This is a legitimate concern, but whether the people doing the analysis get the data via scraping vs. a box of hard drives is pretty irrelevant to it. To actually solve it you would need the data to not be public.

One of the things you could do is reduce the granularity. So instead of showing that someone posted at 1:23:45 PM on Saturday, July 1, 2023, you show that they posted the week of June 25, 2023. Then you're not going to be doing much time of day or day of week analysis because you don't have that anymore.

dredmorbius · on July 1, 2023

Yes, once the data are out there ... it's difficult to do much.

Though I've thought for quite some time that making the trade and transaction of such data illegal might help a lot.

Otherwise ... what I see many people falling into the trap of is thinking of their discussions amongst friends online as equivalent, say, to a discussion in a public space such as a park or cafe --- possibly overheard by bystanders, but not broadcast to the world.

In fact there is both a recording and distribution modality attached to online discussions that's utterly different to such spoken conversations, and those also give rise to the capability to aggregate and correlate information from many sources.

Socially, legally, psychologically, legislatively, and even technically, we're ill-equipped to deal with this.

Fuzzing and randomising data can help, but has been shown to be stubbornly prone to de-fuzzing and de-randomising, especially where it can be correlated to other signals, either unfuzzed or differently-fuzzed.

worrycue · on July 1, 2023

I despise Musk as much as anyone else and charging for API access has hurt a lot of valuable use cases like improving accessibility but … how about not massive scraping a site that doesn’t want you to?

9935c101ab17a66 · on July 1, 2023

Scraping isn’t illegal, and to be honest, I’m not even sure it’s unethical. I’m assuming you think it so — if so, why? I’m not disagreeing, but haven’t given it much thought.

wiseowise · on July 1, 2023

It’s ethical when your average Joe does it on a small scale to scrap their favorite favorite YouTuber or to buy something when it becomes available.

When you have financial incentive to build your business on someone’s data and you scrap literally millions if not billions of pages - it’s unethical.

specproc · on July 1, 2023

The thing with social media platforms is that this data is user-generated, so you've got the company "owning" user content.

This data is often of great public value. I track conversations around a social issue as part of my work for a non-profit.

I'd counter it's unethical to prevent people from accessing this data.

jnsaff2 · on July 1, 2023

I’m not disagreeing with your comment but

> great public value

Having been to twitter mostly through the most recent prominent war, man the signal to noise ratio is really low even when being careful about who to follow and who to block. There is so much disinformation, bad takes, uninformed opinions presented as facts, pure evil, etc.

So I guess it could be used for training very specific things or cataloging the underbelly of humanity but for general human knowledge it’s a frigging cesspool.

specproc · on July 1, 2023

OK, not gonna argue with that. There is, I guess, a perception that it matters because policy-makers, and the wonks and hacks that influence them are hooked. The value for me (and ergo the public, some classic NGO thinking there for you) lies in understanding those dynamics.

I do not use the Twitters myself, and actively discourage others from doing so. Sends people bonkers.

lmeyerov · on July 1, 2023

I mean, we have found election manipulations like large-scale inauthentic activity of out-of-staters explicitly targeting African Americans, and projects here even to the extent of the perpetrators getting indicted. Other projects were tracking vaccine side-effect self-reports faster than the CDC and other disaster intelligence.

We were actually gearing up to switch to paid accounts as we found use cases that could subsidize these efforts... And then the starting price for reasonably small volumes shot up to like $500k/yr.

bojo · on July 1, 2023

So, are we saying it's unethical for Google and other search engines who make money off of ad revenue to scrape sites like Twitter? Or are they paying a large sum to Twitter to do this?

raincole · on July 1, 2023

If Google doesn't provide a way to say "please don't scrape my site", then it 100% unethical.

We have robots.txt. If Google doesn't respect that, it's unethical. Don't you think so?

RobotToaster · on July 1, 2023

Does twitter's robots.txt forbid scraping? Judging by the fact it shows up in Google I'd assume not.

pcthrowaway · on July 1, 2023

Maybe it's time for an llm.txt

Not that the people you want to respect that would

raincole · on July 1, 2023

The tricky part is it's much more harder to prove that they didn't respect that.

spullara · on July 1, 2023

When there is a value exchange between the two entities that are relatively similar then I think it is ethical. People trade Google making money on ads for their site being found when people search. It is also possible to opt-out.

wiseowise · on July 1, 2023

They benefit mutually from their symbiosis. Financially, AI bro model #1321 doesn’t bring anyone value except their owners.

cygx · on July 1, 2023

If done against the wishes of the owner of the site, yes, I would consider that unethical. Thankfully, Google respects robots.txt and noindex.

zladuric · on July 1, 2023

But it's it ethical for the site owner to block access to random people and companies in the internet to _my_ data? I posted that tweet with the expectation that it's gonna be publicly available. Now the owner of the site is breaking that expectation. I would say that this part is also unethical.

Especially since they're not moderating things or anything.

cygx · on July 1, 2023

I would say that this part is also unethical.

Agreed. However, it's probably covered by their terms of service.

Same thing with the recent reddit kerfuffle. I'd have much preferred a Usenet 2.0 instead of centralizing global communications in the hands of a handful of private companies with associated user-hostile incentive structures.

max51 · on July 1, 2023

Being indexed by google is optional. Twitter could stop it a any time if they thought it was a bad deal for them. That not comparable to a startup company trying to scrape the entire site to train their AI and using sophisticated techniques to bypass protections Twitter has put in place

kergonath · on July 1, 2023

Translation: it’s ethical when I do it.

wiseowise · on July 1, 2023

You wouldn’t download a car, would you?

pdntspa · on July 1, 2023

Except with modern software, some wannabe genius programmer will think they can get a bunch of money or cred or whatever by infantilizing the process down to something your grandma could use. Then, suddenly, everyone is scraping. The net effect is largely the same -- server operators see an overwhelming proportion of requests from bots. Still ethical?

pmlnr · on July 1, 2023

> I’m not even sure it’s unethical.

If it doesn't respect robots.txt, it is unethical.

glimmung · on July 1, 2023

Is it ethical for the "public square" to have a robots.txt?

Musk is trying to have his cake and eat it...

(Clearly it's not a public square, but his position is incoherent).

rnbaxter · on July 1, 2023

Yes, it is ethical. In many countries it is legal for humans to walk around the public square and overhear all conversations.

It is NOT legal to install cameras that record everyone's conversations, much less sell the laundered results.

Pre-2023 people went on Twitter with the expectation that their output would be read by humans.

A traditional search engine is different: It redirects to the original. A bastardized search engine that shows snippets is more questionable, but still miles away from the AI steal.

RobotToaster · on July 1, 2023

Many countries have freedom of panorama, which means it is legal to video record the public square. I'm not aware if anywhere has specific laws on mounting the camera on a robot.

johnnyanmac · on July 1, 2023

>Pre-2023 people went on Twitter with the expectation that their output would be read by huma ns

Expectations =/= reality. And the reality is that bits have been reading comments for over a decade.

threeseed · on July 1, 2023

a) It looks to be permitted according to Twitter's robots.txt

b) Given Twitter is public, user generated content which they don't own but simply have a license I wouldn't call it unethical in the slightest.

AmericanChopper · on July 1, 2023

If the background of the issue is as Musk described, then it certainly is not allowed by twitter’s robots.txt, which allows a maximum of one request per second.

I do a lot of data scraping, so I’m sympathetic to the people who want to do it, but violating the robots.txt (or other published policies) is absolutely unethical, regardless of the license of the content the service is hosting. Another way of describing an unauthorised usecase taking a service offline is a denial of service attack, which (again, if Musk’s description of the problem is accurate) seems to be the issue Twitter was facing, with a choice between restricting services or scaling forever to meet the scrapers requirements.

Personally I would have probably tried to start with a captcha, but all this dogpiling just looks like low effort Musk hate. The prevailing sentiment on HN has become so passionately anti-Musk that it’s hard to view any criticism of him or Twitter here with any credibility.

echelon · on July 1, 2023

"You wouldn't download a car."

The only reason these websites and platforms aggregate any content at all is because they're effectively giant public squares.

dizhn · on July 1, 2023

No means no ! :)

develop7 · on July 1, 2023

Moreover, isn't making scraping impossible illegal per a couple-of-years-old bill?

TechBro8615 · on July 1, 2023

This isn't going to make them stop either. Musk is about to see a spike in account creations using the method of lowest resistance. I expect "sign in with apple" will disappear as an option soon, given its requirement of supporting "hide my email" that makes it trivial to create multiple twitter profiles from one apple ID.

nrb · on July 1, 2023

This works in his favor though, more accounts means higher ad rev and MAU for valuation.

masklinn · on July 1, 2023

Higher ad rev only until the advertisers realise your users don’t buy anything and ads are wasted and your ARPU drops through the floor.

masklinn · on July 1, 2023

It’s funny how being a locust is cool when you’re doing it, and a problem when others do it in a way which affects you.

smcl · on July 1, 2023

You might want to think twice about taking him at face value then. He says it’s about scraping but who knows anymore

watwut · on July 1, 2023

And yet people do. Kind of predicting what various people react including scammers, bots, scrapper and what not is, like, job of a management in a company like this.

naasking · on July 1, 2023

> I despise Musk as much as anyone else

It's interesting how you assume that most people despise Musk.

ukFxqnLa2sBSBf6 · on July 1, 2023

Oh shit you solved the problem

okeuro49 · on July 1, 2023

> I despise Musk as much as anyone else

I don't despise Musk at all. Don't agree with him on everything, but he is a genuine and interesting person.

martindbp · on July 1, 2023

He holds views that were the progressive norm 15 years ago, which are now considered bigoted, this is considered unacceptable today. There's a lot I don't agree with him on, like Ukraine, but "despise" is a word I reserve for the likes of Putin.

valval · on July 1, 2023

I don’t think Putin is the epitome of evil that the west portray him to be either. War is hell and he surely started the larger scale war, but just remember that you’ve probably been introduced to less than 1% of his side of things as a western citizen. The western world has gone to war many, many times in history for lesser reasons.

oblio · on July 1, 2023

I'm from a country that's arguably part of the West today (Romania).

Your nonsense straight out of the a Soviet propaganda book doesn't work on me.

Go ask people from Eastern Europe how they felt about 45/50+ years of Soviet imposed governments and regimes.

The West has done awful things but in what way do they excuse the attempted ethnic cleansing of Ukraine?

valval · on July 2, 2023

What do you know about Putin’s motives? What propaganda do you think you’re under?

You’re probably smart enough to understand that out of spite and regret of your country’s history with the Russians your countrymen have more motivation than many others to judge the Russian efforts without any further investigation into the matter.

The same applies for myself, since I’m Finnish. It’s almost sad to see how people abandon all reason and critical thinking skills because of some ingrained belief that “Russia bad”. All of my knowledge of the human nature leads me to believe that they’re no more bad than the next people, and that they probably have some motives to go to a taxing war that we don’t really understand here in the west - seeing as the first casualty in war is the truth.

itwillnotbeasy · on July 1, 2023

>Yeah he started the one of the deadliest wars in the 21 century, threatens to destroy the entire planet with nuclear weapons, but he is not that evil because there were other wars started by the west

That's an extremely dumb take

yks · on July 1, 2023

I’ll rephrase your argument for you: “Why don’t you listen to the rapist’s opinion? The victim is surely not blameless. Besides, your cousin is a shoplifter”.

benjaminwootton · on July 1, 2023

Is this really the best response you could put forward in to people trying to make a nuanced point?

CyberDildonics · on July 1, 2023

Nuanced point? That person just said 'sure, putin is committing genocide and destroying an entire country, but we haven't heard his side of the story'

Then they followed it up with the old hacker news chestnut of 'whatabout the west'

valval · on July 2, 2023

Would his side of the story matter to you? I don’t think it’s a particularly nuanced point to you since you’ve already made up your mind, however ignorant it might be.

CyberDildonics · on July 2, 2023

Putin already gave his side of the story. He declared Ukraine an invalid country, said there were nazis there and then went into full out war to destroy the country while committing countless atrocities.

I don’t think it’s a particularly nuanced point to you

What point and why do you keep saying 'nuance' over and over while giving zero actual information? What are you trying to say and what evidence is there?

Let's think about this super hard. What is the justification for an unprovoked genocidal war? Why are you defending putin?

however ignorant it might be.

Show me where you get your information, lets see the source of this nonsense.

lisasays · on July 1, 2023

It wasn't "nuanced" at all. Just muddled and obsequious.

yks · on July 1, 2023

1. As already mentioned, it's hardly a nuanced point.

2. If you actually want to hear my opinion, then the realm of geopolitics + good old-fashioned hate of the US government does a number on people's logic, so we get what I can only charitably describe as a parade of non-sequiturs, whataboutisms and other fallacies. And so it can be useful to frame it in the simpler terms, for example you could hardly find anyone even on this site who would condone the forced takeover of parts of people's homes. Literally the same is happening on a scale of the countries.

oblio · on July 1, 2023

Totally agreed. Make it about individuals where the analogy holds.

It cuts right through the bullshit and frequently exposes a lot of hypocrisy, hate, self loathing, racism, etc.

bmitc · on July 1, 2023

Does Twitter have that same approach with user data?

oefrha · on July 1, 2023

You think companies massively scraping right now would respect API rate limit?

guax · on July 1, 2023

API rate limits are more easily enforceable. If they keep scraping there are methods to detect and thwart behaviour. I don't think twitter has the appropriate talent and work environment to allow proper solutions to be implemented. It's all knee jerk reaction to whatever Elon decides.

oefrha · on July 1, 2023

It's more easily enforced, except when you don't give them enough they just go back to scraping. Or create a million fake developer accounts and pool the free quota if that's possible. These are not hypotheticals, loads of companies have done both against all kinds of APIs over the years, Twitter included.

UberFly · on July 1, 2023

"control access and rate limit"

Isn't that basically what they did with the API changes?

spaceman_2020 · on July 1, 2023

But they were too stingy with the tiers and too greedy with their prices. Even for minor use cases where you need to make, say, 100 API calls a day, you’ll need to pay $100/month.

Which just leads people to scrape.

hcks · on July 1, 2023

Why is it twitter that is “stingy” and not the people scrapping so they don’t have to pay?

spaceman_2020 · on July 1, 2023

I'm not going to pay $100 just to fetch 3000 records for a hobby project. I'll either skip the project, or I'll just abuse my scraping tool.

If they'd made some more reasonable pricing tiers, I would have been happy to pay.

Fetching something as simple as the total follower count from an API shouldn't be more (exorbitantly) more expensive than fetching data from, say, GPT-4. No reasonable person can make an argument for $10c/call pricing.

o1y32 · on July 1, 2023

Did you actually read that comment? I think the point is very clear -- given a reasonable price, people may would want to use the API instead of scaping the data themselves. If you instead ask for exorbitant amount of money, it only forces people to scrape, because there is no business model that would make it possible to pay.

mrweasel · on July 1, 2023

Isn't the firehose API still available?

smetj · on July 1, 2023

Indeed, spot on.

smcl · on July 1, 2023

Sorry I don’t buy it. Hundreds of millions of people use Twitter, and we are to understand that there are an enough people scraping to the extent that they had to suddenly take drastic action by shuttering unauthenticated access? Any dev would have told him that those supposedly scraping could simply setup Selenium or some other headless browser to login before scraping.

This smells of another failed Musk experiment at twiddling with the knobs to increase engagement, to me.

dbbk · on July 1, 2023

Not only that but unauthenticated access is the easiest thing to cache. There is no need to "bring large numbers of servers online". He's lying.

gibrown · on July 1, 2023

A bot scraping content will tend to go deep into the archives and hit all content systematically. Caching isn't as effective if you hit everything whereas real users will tend to hit the same content over and over again.

It can add nontrivial load.

sebzim4500 · on July 1, 2023

They could but signing in, even in selenium, means agreeing to twitter's TOS. See the LinkedIn scraping case.

smcl · on July 1, 2023

The same way these AI code completion tools respected GPL-licensed code?