About 5 years ago, StackOverflow messed up and declared that they were making all content submitted by users available under CC-BY-SA 4.0 [1]. The error here is that the user-content agreement said all users' contributions are made available under CC-BY-SA 3.0, with nothing about later versions. In the middle there were also some confusing licensing problems concerning code vs non-code content.
I remember thinking that if any of the super answerers really wanted, they could have tried to sue for illegally making their answers available under a different license. But without any damages, I figured this wasn't likely to succeed.
But now I wonder whether making all content available to AI scrapers, and OpenAI in particular, might be enough to actually base a case on. As far as I can tell, StackOverflow continued being duplicitous about which license applies to which content through the second half of 2018 and the first few months of 2019. Their current licensing suggests CC-BY-SA 3.0 for things before May 2, 2018, and CC-BY-SA 4.0 for things after. Sometime in early 2019 (if memory serves, it was after the meta post I link to), they made users log in again and accept a new license agreement relicensing their content. But those middle months are murky.
My understanding of licensing law is that something like 3.0 -> 4.0 is very unlikely to be a winnable case in the US.
Programmers think like machines. Lawyers don't. A lot of confusion comes from this. To be clear, there are places where law is machine-like, but I believe licensing is not one of them.
If two licenses are substantively equivalent, a court is likely to rule that it's a-okay. One would most likely need to show a substantive difference to have a case.
IANAL, but this is based on a conversation with a law professor specializing in this stuff, so it's also not completely uninformed. And it matches up with what you wrote. If your history is right, the 2019 change is where there would be a case.
The joyful part here is that there are 200 countries in the world, and in many of them the 3.0->4.0 switch would be a valid complaint. I suspect this would not fly in most common law jurisdictions (the British Empire), but the complaint could well succeed in many civil law ones (e.g. France). In the internet age, you can be sued anywhere!
> If two licenses are substantively equivalent, a court is likely to rule that it's a-okay. One would most likely need to show a substantive difference to have a case.
Which does exist and can affect the ruling. CC notably didn't grant sui generis database rights until 4.0, and I'm aware of at least one case in South Korea where this could have mattered, because the plaintiff argued that these rights were never granted to the defendant and were thus violated. Ultimately it was found that the plaintiff didn't have database rights anyway, but it could have gone otherwise.
A super literal reading of some bad wording in 3.0 created an effect the authors say they did not intend and fixed in 4.0. Given the authors did not intend this interpretation, a judge is likely to assume people using the licence before it came to light also did not, hence switching to 4.0 is fine. Conversely, now that this is widely known, continuing to use 3.0 could be seen as explicitly choosing the novel interpretation (arguably this would be a substantive change).
> a judge is likely to assume people using the licence before it came to light also did not
Why would the judge have to assume anything? The person suing could simply tell the judge they did mean to use the older interpretation, and that they disagree with the "fix". They're the ones that get to decide, since they agreed to post content using that specific license, not the "fixed" one.
But the people suing aren't trying to choose how the license is interpreted, they're trying to prevent the other party from changing the text. If the change is meant to "fix" how the text should be interpreted (which is what you said), then they're the ones trying to choose the exact interpretation.
I personally write "IANAL", not to reduce my personal legal liability, but rather to give a heads up to those reading that I am not an expert, that I am likely wrong, and that you likely shouldn't listen to me.
I feel there's a common thread here that maybe should be some kind of internet law: people who make a point of noting they are not experts are more often correct than people who confidently write as though they are.
You see this particularly with crypto, where "I am not a crypto expert" is usually accompanied by a more factual statement than one from the self-proclaimed expert elsewhere in the thread.
One cannot legally practice law without a license. The definition of that varies by jurisdiction. Fortunately, in my jurisdiction, "practicing law" generally implies taking money, and it's very hard to get in trouble for practicing law without a license. However, my jurisdiction is a bit of an outlier here. Yours might differ.
In general, the line is drawn at the difference between providing legal information and legal advice.
Generic legal discussions, like this one, are generally not considered practicing law. Legal information is also okay. If I say "the definition of manslaughter is ...," or "USC ___ says ___," I'm generally in the clear.
Where the line is crossed is in interpreting law for a specific context. If I say "You committed manslaughter and not murder because of ____, which implies ____," or "You'd be breaking contract ____ because clause 5 says ____, and what you're doing is ____," that's legal advice.
The reasons cited for this are multifold, but include non-obvious ones, such as that clients will generally present their case from their perspective. A non-lawyer will be unlikely to have experience with what questions to ask to get a more objective view (or even if the client is objective, what information they might need to make a determination). Even if you are an expert in the law, it's very easy to accidentally give incorrect advice, which can have severe consequences.
In practice, most of this is protectionism. Bar associations act like a guild. Lawyers are mostly incompetent crooks, and most are not very qualified to provide legal advice either, but c'est la vie. If you've worked with corporate lawyers, this statement might come off as misguided, but the vast majority of lawyers are two-bit operations handling hit-and-runs, divorces, and similar.
In either case, it's helpful to give the disclaimer so you know I'm not a lawyer, and don't rely on anything I say. It's fine for casual conversation, but if tomorrow you want to start a startup which helps people with legal problems, talk to a qualified lawyer, and don't rely on a random internet post like this one.
I always assumed it was the same type of courtesy as IMHO, and someone taking legal advice from random strangers on the internet wouldn't result in any legal liability on the side of the commenters.
Yes, people have been sued before for giving advice that was acted upon.
I remember hearing about a construction engineer who was sued for giving bad advice, whilst drunk, to a farmer about fixing a dam. The dam failed and the engineer was found to be liable.
I can see the reasoning behind the case, as the engineer has plausible expertise in the domain and could credibly give actionable advice.
When it comes to lawyers, there is already a legal framework where lawyers are responsible when giving legal advice, even when it's not directed at their clients, the same way medical professionals have specific liabilities regarding the medical acts they can perform.
Non-lawyers giving legal advice doesn't fit that framing, unless they explicitly pose as one. I'd also exclude malicious intent: whatever the circumstances, if it can be proven and results in actual harm, there's probably no escape for the perpetrator.
That’s possible because the engineer is licensed. A random guy giving bad advice, without claiming to be an engineer, would face no such liability (so long as he didn’t suggest he was one).
It is worth remembering that law professors have a vested interest in making sure the system works as you described. If contract law were straightforward, they'd be out of a job.
That's an admirable goal but if there are any "bugs" in the contract you probably don't want it executed mindlessly. Human language isn't code and even code isn't always perfect so I'd rather not be legally required to throw someone out a window because someone couldn't spell "defederate".
I agree in the abstract, but not in the specific (the specific professor was one of integrity, and sufficiently famous that this was not an issue).
However, it's worth noting the universe is a cesspool of corruption. If you pretend it works the way it ought to and not the way it does, you won't have a very good time or be very successful. The entire legal system is f-ed, and if you pretend it's anything else, you'll end up in prison or worse.
> if any of the super answerers really wanted, they could have tried to sue for illegally making their answers available under a different license.
they can plausibly sue people other than stackoverflow if they attempt to reuse the answers under a different license. but i think it's very difficult to find a use that 4.0 permits that 3.0 doesn't
The blog illustrates that such assumptions about what's a sufficient attribution are fraught with danger, so "the smallest professional courtesy" can expose you to a $150k risk
People put their content on the site for the public to use, and now the public is using it, it's just that "the public" includes AIs. Admittedly, a non-human public, nonetheless ...
The problem is LLMs don't provide attribution/credit, which directly violates the license[0]
Besides, search engines were already a "non-human public" that scraped the site, but they linked directly to the answers, which was great. They didn't claim it's their work like these models do. The problem isn't human vs non-human. LLMs aren't magic; they don't create stuff out of thin air. What they're doing is simply content laundering.
I'm actually perfectly fine if StackOverflow wants to sell an answer I made to help train AI.
For me, the purpose of providing an answer is to help save others (and my future self) time, and I don't really mind if someone uses that in a private product - especially if it helps tools like ChatGPT which provide an insane amount of value given the low monthly price.
> I'm actually perfectly fine if StackOverflow wants to sell an answer I made to help train AI.
I’m not.
This was a collaborative effort to make the lives of programmers easier, and the data was always meant to be a public good. OpenAI – and, more importantly, all the other LLMs with pockets that aren’t as deep – should be able to just download the database and train on it for free.
I don’t care about any license. I don’t care about attribution. Learning isn’t copying, so copyright is irrelevant. I contributed about a thousand answers to Stack Overflow, all with the understanding that anybody can download and use them for free, not so they can be locked up by Stack Overflow.
What concerns me with deals like this is that they alter the cultural norm, expanding copyright to cover not just copying but use. OpenAI making deals like this makes pushback at the social and legal level more likely when other LLMs are trained without these deals in place.
It’s akin to – and can possibly result in – regulatory capture, making it difficult for new startups to compete with OpenAI.
The words are a copyleft-able public good. Concepts, facts, and ideas are not; anyone can use them for anything, including making money. If you're actually worried about specific wording or other creative choices being used improperly by an LLM, then by all means that should be enforced. But such examples are very rare, because the LLMs are very good at extracting facts from prose.
Good for you. I'm not. I contributed answers to StackOverflow because I use answers others have contributed to StackOverflow, not to ChatGPT, not for ChatGPT to monetize. I don't use ChatGPT and probably never will.
But the content you posted to SO was already permissively licensed. Other people can copy it, and make derivative works, and even charge money for them, as long as they cite your SO handle as the author. https://meta.stackexchange.com/questions/347758/creative-com...
(2) It's only likely to attribute if it quotes verbatim... just like a human. When I tell someone I learned that the second parameter Array.map passes to the callback is the index of the value just passed, I don't add "and I learned this on Stack Overflow from user gtriloni". It's just knowledge that I learned.
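(For anyone unfamiliar, that's standard JavaScript/TypeScript behavior: the callback passed to Array.prototype.map receives the element, its index, and the array itself. A minimal sketch:

    // Array.prototype.map passes the callback the current element,
    // its index, and the whole array.
    const labeled = ["a", "b", "c"].map((value, index) => `${index}:${value}`);
    console.log(labeled); // ["0:a", "1:b", "2:c"]

)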
The only time I'd attribute is if I copied a snippet of code or a paragraph to quote in a blog post. For me at least, that almost never happens. I take the knowledge I learned and apply it to my own code. It's rare, if ever, that there's something on S.O. so useful that I copy it verbatim.
An LLM is not a human. It is a tool operated by a, in this case, for profit entity. It has no human rights, but its operator has all relevant legal obligations.
If it was, as you say, “just like a human” in relevant ways (think, feel, have self-awareness, etc.) then it would effectively be a slave subjected to extreme abuse.
Either it is a tool that generates derivative works at mass scale for profit and its operator should be liable for licensing/attribution violations, or it is a conscious being and we should immediately stop abusing it. Pick your poison.
Bing's version of ChatGPT/GPT-4 cites sources. My limited understanding is that it uses your question to do a web search, brings the results into the context window, and then generates an answer that cites sources.
OpenAI could integrate StackOverflow the same way.
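Roughly the retrieve-then-cite pattern, as I understand it. A minimal sketch in TypeScript, where searchWeb and completeWithLLM are hypothetical stubs, not real Bing or OpenAI APIs:

    // Sketch of retrieval-augmented answering: search, put numbered
    // snippets into the prompt, ask the model to cite them by number.
    type SearchResult = { title: string; url: string; snippet: string };

    async function searchWeb(query: string): Promise<SearchResult[]> {
      // Stub: a real implementation would call a search API here.
      return [{ title: "Example result", url: "https://example.com", snippet: `Results for: ${query}` }];
    }

    async function completeWithLLM(prompt: string): Promise<string> {
      // Stub: a real implementation would call a model API here.
      return `(model output for a ${prompt.length}-character prompt)`;
    }

    async function answerWithCitations(question: string): Promise<string> {
      const results = await searchWeb(question);
      // Number the snippets so the model can cite them as [1], [2], ...
      const sources = results
        .map((r, i) => `[${i + 1}] ${r.title} (${r.url})\n${r.snippet}`)
        .join("\n\n");
      const prompt = `Answer using only the sources below, citing them by number.\n\nSources:\n${sources}\n\nQuestion: ${question}`;
      return completeWithLLM(prompt);
    }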
"The person you are upset with is technically permitted to do the thing that you are upset about" is not a good counter-argument to someone's distaste. Whether or not the licensing agreement _permits_ this usage, it is not the usage that the contributor (to whom you are replying) foresaw and was enthusiastic about.
One generally doesn't have to lean into phrases like "legitimate tactics" and "rhetorical power" when they've got the moral, ethical, or intellectual high ground. Telling people they're idiots is about the most counter-productive single strategy for addressing human stupidity ever conceived. 1. they won't believe you 2. they'll ignore everything else you have to say because you're a dick. So the real question is, who hurt you?
Oh your cheerleading here is going to age like milk when unemployment numbers start ramping up in white collar sectors. For the record, when construction and industrial jobs got deleted the chorus line was "retrain for service industry work". When service industry and white collar jobs really start getting the same treatment, what's the move now? We're literally running out of economic sectors to pretend folks can be funneled into.
All of this would be fine if the wealth were shared by the population. The big problem is that wealth is concentrated and only a small group will benefit from these technology shifts.
You what now? You think AI is the path to luxury space communism? I'm missing the part where the 0.1% that owns and controls basically everything shrug and lean into redistribution of wealth...
Suppose I walk up to a tent at a festival that has a big sign that says "FREE BEER", and I ask a person there for a beer. They hand me a beer, and I go on my way. Was the beer free? I think it was free.
Now, suppose I walk up to a Budweiser-branded tent at a Budweiser festival that has a big sign with a Budweiser logo on it that says "FREE BEER", and I ask a person there who is wearing a Budweiser polo shirt, a Budweiser lanyard, and a Budweiser hat for a beer. They hand me a beer in a Budweiser-branded cup, and I go on my way. Was the beer free?
Now suppose you walk up to a tent that offers you free beer, but before they give it you, you have to burn 2% of your phone's battery watching an ad from them. Then they hand you the beer and you go on your way. Was the beer free?
> They do serve ads [...] Your attention isn't free.
So we've gone from that, to something like this:
> They tag my ankle to mark me as a person who enjoys beer, and make me watch an ad until 2% of my phone's battery is depleted, and then they come to my home and knock on my door at night to sell me beer.
...which... I mean, huh?
Stack Overflow is invading your body, restricting your personal liberty, and visiting your home? Really? That's a fucking thing now?
I think they were extending the original point you were responding to, and remixing your own mixed metaphor of free beer.
In the attention economy, advertising has a cost that is borne by the advertiser and the consumer, up to and including loss of property rights in the case of content relicensure and trespass upon devices leading to excess battery usage, as well as loss of privacy due to geotargeted ads.
>I think they were extending the original point you were responding to, and remixing your own mixed metaphor of free beer.
Perhaps. But having been to many festival environments, I can definitely imagine a tent offering "free beer" that is actually approximately free -- both with, and without a slathering of advertising. (Actually, I don't really have to imagine it -- I've been there and have had that free beer.)
I can't imagine them coming to my house and knocking on my door at night to sell me more of it, though. That's absurd.
>In the attention economy, advertising has a cost that is borne by the advertiser and the consumer, up to and including loss of property rights in the case of content relicensure and trespass upon devices leading to excess battery usage, as well as loss of privacy due to geotargeted ads.
Well, sure. When viewed on a long-enough timeline, it becomes abundantly clear that nothing is actually free, comrade.
I can produce my own beer on a hypothetical plot of land that nobody owns, and that nobody else wants to use, and I can give someone one of these beers. For "free."
But it still has a cost. (And this, too, is an absurd reduction.)
> I can't imagine them coming to my house and knocking on my door at night to sell me more of it, though. That's absurd.
I interpreted that as a tongue-in-cheek hyperbolic metaphor relating to the ways that ad auction networks and other kinds of geofencing and geotargeting allow for deanonymization and reidentification of individuals for conversion tracking and behavioral analysis.
That’s the thing about these technologies - they’re dual-use in the sense that those who see the upsides use them generally with good intentions and ideally with affirmative consent. Just like the relicensed content, though, once the data is collected, the original creators, publishers, and third parties may not be able to control where it ends up, which is a negative externality, I think most would agree.
I think at a festival it's a little tricky to value (if it pulled you away from seeing your favorite band play a song, maybe this cost you the equivalent of $X, where that's what you would pay to see them perform that song. If no bands were playing, you walk over while chatting with friends - the same thing you'd be doing if there were no free beer tent - it was free)
When I'm on stack overflow my time is valuable. I'm programming which can pay me something like $50-300/hour (maybe more?)
How expensive is the 1 second I spend reading an ad? Let's call it $50/3600, which works out to about 1.4¢. Is that expensive? Even by my most conservative estimate it's over 1¢.
Should we round that down to free given that I've spent hours/many page loads on stack overflow? I guess that's up to you.
I mean, we can play that game if you want. Let's suppose that if we look hard enough, that every opportunity has a cost.
"Oh, a free concert downtown on Saturday? And you can pick me up at 2? Yeah, I do really like that band, and I sure would like to go -- that's pretty exciting, thanks for the invite!
But instead of making plans with you right now, I'd rather tell you about all of the ways I could be using my time on that Saturday afternoon instead.
No, no. It's not that I don't want to go. I just want to really drive home the idea that there's an opportunity cost to attending, so it can't really be free -- it can't be a free show for you, or for me, or for anyone else that goes. It's important to me that you realize that this "free concert" is anything but free.
Listen, I don't know what you mean by "dead-ass loser." I'm just being a realist here!
Oh, so now you're saying that you're not going to pick me up on Saturday? Some friend you are! I haven't even fully amortized this yet!"
I think we're maybe gleefully posting past each other, but the point I'm trying to hit is that business models matter. Stack overflow provides a service. It's a good service. They host a great q&a platform for developers and myriad other category enthusiasts.
However, they have a business model. They are categorically different than eg Wikipedia. It's important to understand that.
This business model matters because it tells you what economic forces will lead them to do. When business models break down at public companies they commit acts of desperation. On an ad run site that will mean more ads, more invasive ads, etc.
As you're forced to sit through 30s unskippable ads on YouTube I hope you think "I'm so glad this is free"
Unironically, folks are being triggered by trigger warnings now.[1]
Imagine how “free” the beer in your hypothetical scenario is to an alcoholic struggling to stay sober.
Capitalism commoditizes even protest against it and repackages it as a product or service.
None of this is to assign blame to good faith actors in a so-called free market, nor is it to abdicate responsibility on behalf of so-called free agents. Just a counterpoint.
Then they'd likely get sued, because the license for the answers is CC-BY-SA; putting them in a book, claiming they wrote everything themselves, and selling them are all against the license.
On the other hand, if they read my answers and wrote a book about what they learned (not copied), there'd be no issues.
You're being taken advantage of for a subscription product. It's one thing to give to a community, but it's wrong for an enterprise to come in and capitalize on the value of it. It's the equivalent of going into an animal sanctuary, slaughtering all the animals, and selling their pelts.
Your position lays bare the new and industry-destroying economic problem introduced by opaque-data-source LLMs. The economic value provided by the originator is captured fully and completely behind rentier models.
Beware the ease and convenience of all that "insane value". This way lies digital serfdom.
I would be fine with it if the "AI" in question were free, and a bonus if open source.
However, it is the product of yet another monolithic behemoth of a company that earns money on it and, I suspect, has nefarious motives to make a profit.
That’s the whole key thing for me that makes me feel scammed. That and not asking for permission.
A future true AI would potentially be bigger than nuclear fission, with all the consequences. Handling this in a petty capitalistic way makes me think the outcome will be close to the Fallout games, which were supposed to be only an exaggeration.
Those companies must stop behaving like thieves. In fact, it is literal theft.
ChatGPT currently provides far more value than StackOverflow. It's not just trained on SO answers but on all of the manuals/help pages, GitHub issues, and forum posts. In addition, you can continue a conversation. No rigid format or gatekeeping like StackOverflow. I don't see a real use case for StackOverflow now. If I want to ask humans, Discord/IRC channels are a far better option.
> No rigid format or gatekeeping like stackoverflow.
What bothers you about gatekeeping? I could guess, but I'm asking so you say it out loud. Then you can compare it against other problems, such as moats (competitive barriers).
OpenAI spent something like $3M on training GPT-3. This is a pretty big moat. But almost certainly more valuable in dollar terms is the first-mover advantage which provides millions of human eye-hours used for RLHF.
I wouldn't be so eager to trade the gatekeepers you so fear for even an openly available chat service that is happy to automate away as much information work as possible.
The Stack Overflow model is (was) pretty darn good -- people help each other out, the company made money, some people got noticed for their skills, products got built faster and better (on the whole, I hope). Contrast the human-generated content era to what we have now, which appears to be the machine-ingesting content era. There are legions of lawsuits against companies scraping data without permission and/or attribution.
> I wouldn't be so eager to trade the gatekeepers you so fear for even an openly available chat service that is happy to automate away as much information work as possible.
Don't flatter yourself. People want to solve their problems so that they can build what they want to. They don't have time for shenanigans from internet jerks who get their validation from imaginary internet points.
Hardly matters for Stackoverflow-like questions if the provided solutions work/solve the problem you're having. Which for me happens the majority of the time (with GPT-4, not the free version).
You might not want to hear this but no one does this. Should they? probably. But most people don't use Ctrl+C, Ctrl+V in the first place for SO answers.
Just a single data point, but when I copy & paste a snippet from Stack Overflow, I always add a comment "// source: https://stackoverflow.com/questions/xxx#yyy".
I both find it respectful of who wrote the answer in the first place and useful for future users of the code: the Stack Overflow answer often provides context and explanation for what would otherwise be an obscure piece of code.
Pretty darn useful if you ask me: those who want more information can follow the link, casual readers can skip it, and the whole process is fair to the author.
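For instance (the snippet and URL here are made up, just to show the shape):

    // source: https://stackoverflow.com/questions/xxx#yyy
    // Debounce: delay calling fn until `wait` ms pass without another call.
    function debounce<Args extends unknown[]>(fn: (...args: Args) => void, wait: number) {
      let timer: ReturnType<typeof setTimeout> | undefined;
      return (...args: Args) => {
        clearTimeout(timer);
        timer = setTimeout(() => fn(...args), wait);
      };
    }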
I don't think I've ever copied enough from Stackoverflow for copyright to become relevant. Rarely more than one line verbatim.
It embarrasses me to think that somebody should feel obliged to cite me when they use one of my answers. I don't know how to take the partnership with OpenAI though. They bill me when I use their service; it's not collaborative like StackOverflow.
No one should copy-paste solutions from anywhere. FWIW, 99% of the content on SO is hardly "original"; most of it is itself copy-pasted from previous solutions or the original user guides/manuals.
In general I'd agree that it's best to use answers just as a guide. That said, I wasn't trying to pass judgement, just ask for attribution, which is a best practice and often required by the license itself.
I'd rather not go round in circles while ChatGPT feeds me bullshit information. When this happens I go to Google and read an SO answer with the correct information, and also get an informed discussion around the subject.
For the easy answers LLMs are fine, but I usually want an answer to a niche issue or edge case, where LLMs have to be constantly told they are plain wrong, before getting to something resembling an answer.
If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.
The OpenAI partnership doesn't really affect the core issue here around users deleting their content. That has never been welcome on Stack Overflow and when noticed usually was reversed. This is in accordance with the license as far as I understand the legal aspects, and in general it makes sense for me as it ensures that the content stays useful.
The content is also CC-BY-SA, which is much better than what you get on essentially every other large site that hosts community content. But the same license also means that you cannot really remove that content again: even if Stack Overflow allowed it, anyone else could scrape or download it before it is deleted and reproduce it according to the license.
Users still can remove their name from their posts, and if they write personal details those can be redacted as well. But you can't remove good quality content from the sites later, that is likely to be reverted.
The problem isn't that Stack Overflow is allowing people to scrape the content. The problem is that Stack Overflow is preventing some people from scraping the content, in order to collect money from others. And, incidentally, passing zero of that money on to the people who actually created the content.
(Nearly) none of the people who are presently pissed off would have complained if Stack Overflow had continued to allow all comers to scrape the content and train LLMs on it, nor if Stack Overflow had released the entire finished collection of content under the same CC-BY-SA license that was demanded of each contributor.
With the OpenAI partnership, and similar shenanigans leading up to it, Stack Overflow is relying on obscure technicalities to violate the essential spirit of the original deal.
The publicly-available archives released by Stack Exchange are updated roughly quarterly and have the attribution requirements as specified by CC BY-SA + the Stack Exchange ToS.
The article makes it sound like OpenAI is using the API though, rather than the archives. The API and live sites forbid scraping within the acceptable use policy, as seen here: https://stackoverflow.com/legal/acceptable-use-policy
I don't get how you can release something under anything other than all rights reserved without identification. We need to be able to persecute you in case you are not the author. Or is it that I may republish anything under any license?? It could be that the platform licenses it in the ToS, but with CC are they not obligated to make it available without obstruction?
Prosecution and persecution are two different things. Persecuting anyone is not a good time :)
Why, if you're not allowed to release under a license, should you be able to release all rights reserved (which can still be a copyright violation!)?
If you need to prosecute the person, there are established procedures for that: DMCA, or ultimately a lawsuit over the infringement. That you didn't identify yourself publicly on the site does not make that impossible. In fact the point of the DMCA was to make it easier to handle this - because if the provider doesn't comply with your DMCA, you can sue the provider.
Requiring identification to publish so that copyright is protected would be massive overreach, and this sort of thinking is why I think copyright is a dangerous concept that needs to be sharply curtailed, not expanded to cover AI training.
In practice, the safest course is to not use content from untrustworthy sources in ways that require a license (aka in ways that are not fair use in your applicable jurisdictions).
StackOverflow are violating the SA part of CC-BY-SA by selling special access to the CC-BY-SA content to one party and blocking others from the same thing.
OpenAI are violating both BY and SA, but that's a separate issue.
Everyone who contributed work, did so under terms that the work was free for all, not a resource that one party can sell to another party who then sells to end users. Those end users were meant to have it directly without having to pay openai or anyone else, and if any bulk/scraping access is allowed for anyone like openai, everyone else has the right to the same thing for no more than a "shipping & handling" charge to cover the network & employee cost to physically deliver the data.
What are StackOverflow selling, and/or what exactly are OpenAI paying for? What is the goods or services that is traded for the money?
There are many possible answers but I see no answer that doesn't ultimately one way or another wind up resolving into a violation of one or more terms of CC-BY-SA by both StackOverflow and OpenAI.
I guess the core issue was always having a for-profit company preside over a "free" product. Clearly, they have to make money, and they aren't bound by the ethics of open source. Contributors may feel like they are contributing to a FOSS project, but they aren't. What Stack Exchange is doing is probably legal (?) and that's the bar they need to clear. The contributors aren't stakeholders, and SE only needs to retain enough of them to sustain themselves commercially.
There's been more than a decade of companies now providing something for free while they figure out how to monetize it, and these always scare me a little, because it's always going to end up like this. Users of Facebook becoming eyeballs for ads, GitHub users providing free data for LLMs, SE selling data to OpenAI...
If a product is free, then you are the product. And if you don't know how you are monetized, you're going to be disappointed by it sooner or later.
Harsh but true. I think what stings about SO is that developers are the ones losing here. I think this will prompt less open source and encourage more private work. I hope people are seeing that they are being taken advantage of on many fronts.
StackOverflow has always been quite open that they're primarily building a dataset for SEO, rather than being a user-centered website. So I don't feel this deal changed much. SO users are still serfs building them a dataset for sale, only the buyer has changed.
LLMs are faster and infinitely more patient than interacting with StackOverflow, so I don't expect SO to survive for long. They're in crisis regardless of whether they sell to OpenAI or not, so they may as well get something out of it before they're decimated.
I think they're in crisis because they sold out their community, not because LLMs are better. As a developer, if you offer me StackOverflow vs ChatGPT, I'd take StackOverflow any day of the week, 100x over.
I'm in the opposite boat. Going through Stackoverflow answers has become quite a chore.
For simple things GPT gives me the correct answer most of the time. And even when it doesn't, it's quicker to discern that it's wrong than to parse a given SO page.
Of course I still use SO for more complex questions.
As a rule, if I can quickly find the answer via SO, then chances are GPT will give me the answer more rapidly.
I said I don't use it. I didn't say I've never used it. In my experience browsing SO is way easier, more accurate, more precise, more controllable, navigable, and ... gives attribution.
For some reason, a lot of the answers here seem to care more about "but tell em /I/ solved it" re: attribution than about helping the user. Somewhat egoist or some such? (And I don't mean it in an aggressive tone; just ESL, so I don't know how to say it otherwise.)
If I license something as MIT, I personally don't care who uses it for what purpose, hell I don't even care generally that they attribute me. I put it out for people to use. But maybe that's just me.
I was offered a job a few years ago by someone who saw my Stack Overflow answers, does that count? I don't see something like this happening with ChatGPT.
>As a developer, if you offer me StackOverflow vs ChatGPT, I'd take StackOverflow any day of the week 100x over.
Really? Hm, I wouldn't. I can use nuance and clarify my answers and have a respectable back and forth (GPT-4 doesn't call me names when I mess up or say something dumb) and arrive at an answer.
or some such ;) You may not come across it personally, but that doesn't mean it doesn't happen. SO is successful as a QA platform (or was, anyway) despite this shortcoming, not because it is a feature and doesn't happen. If a lot of people are talking about the same thing, maybe people should at least pay cursory attention to the issue rather than say "No, it doesn't happen". (Not aimed at you, but there are absolutely comments like this every time this gets brought up.)
> SO users are still serfs building them a dataset for sale
That is a very negative spin.
Users get access to other people's answers for free. They get that free service and are required to contribute nothing. Those that do contribute do it to help other users. S.O. isn't doing anything bad. They're providing a free service where everyone wins. Users get answers. Answerers get to help other humans at scale. S.O. makes a little money.
As for the dataset, it's been available under CC-BY-SA for years. The entire database is backed up and made available here for free every month.
The company is paying the people working by providing a free service.
It's like YouTube. YouTube provides free hosting of your videos. In exchange, they monetize them. You're free to host them on your own servers; that will likely cost you way more than putting them on YouTube. So you're getting something from them. You're also getting their advertising service to monetize your videos. You could do it yourself: hire a bunch of people and try to get companies to put ads on your self-hosted videos. Again, unless you're wildly successful, it's unlikely you'll be able to do that and make a profit. So, YouTube is effectively paying you.
Same with Stack Overflow. They're providing the servers, the bandwidth, etc. It costs them $. They're providing that service to you.
Side related question: are there content licenses coming up that are similar in spirit to what the GPL is but targeted at AI training? (E.g. if this piece of content was used in training an AI that was to be used commercially, the AI's weights must be published)
The argument AI companies make is that LLMs are not derived works of their input, or is fair use. So according to them, the input's license does not matter.
I suspect they will fail to emphasize the ShareAlike property of CC BY-SA 2.5/3.0/4.0 which is incredibly strong - "ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original". This is an incredibly wide and vague definition, especially "build upon" which will be unattractive to many users.
I suspect, if ChatGPT quotes an answer or a snippet, it will show attribution and a license for the snippet. If it instead only uses the knowledge it gained from the answer/snippet and writes its own answer, then, just like a human, it won't attribute.
It was especially hilarious to watch the CTO of OpenAI get asked if they scrape YouTube and be unable to say yes or no [0]. Possibly one of the most important sites on the Internet, and their CTO claims ignorance.
I am thankful we have LLMs so we don't have to deal with SO. Ideally, as little as possible. SO can be a pretty toxic place filled with elitism and care for procedure over actually helping people, which is not totally unreasonable from their standpoint but it's definitely not what people are visiting the site for. Quite ironically, one of the major complaints I get is that LLMs output wrong answers here and there, ignoring that many of the answers on SO are also completely wrong or irrelevant to the core question being asked. And mind you, also outdated (I regularly have to click through the sorting to make sure answers are actually still relevant).
If we could merge the two to get the best of both worlds, and have LLMs that know how to write well and are validated by humans on the site, that would be great. Maybe not great for the folks looking to accrue internet points but absolutely great for users.
That's great for now. It's not clear to me, though, where LLMs will get their training data from here forward without ingesting lots of LLM-generated code and answers and eating their own tail.
Didn’t you get the memo? LLMs either already are capable of reasoning or are just a step away from it, so no need for human-generated training data in the future.
Or at least that’s what 3/4 of HN commentators believe and all AI CEOs want you to believe.
That's only now and in the near term future. If AI is actually successful, every year the amount of human written code will decrease. That's the whole point of this.
Does it matter if stack overflow is toxic or not? You're there to ask a question and get an answer. If you ask wrong, you get corrected. Tough moderation makes search much faster and better for other askers.
You're there to ask for help not make friends. They have to be polite, but not gentle
Yes it does. If I am belittled instead of people asking clarifying questions so I can learn, I'm much less likely to think better of said people or platform, or use it.
What you see as elitism is mostly simple curating. You can't store everything, because it makes retrieving value from the store that much more difficult. It's the same with Wikipedia and other public content repositories. People cry elitism and gatekeeping, but without curation you eventually end up searching a haystack of mediocrity for a needle.
This “curation” is what is killing SO. Software is soft. It changes. There is no “one true answer for all time”. It’s honestly sad how many times I search for an answer, only to see the exact question I’m looking for closed as duplicate, then when I look at the “duplicate” I see that it’s an out of date answer.
Stack Overflow could have solved the problem of duplication so many ways. Why not categorize and bucket duplicate answers? They could have even had yearly recurring questions with the most up to date answer! Why not add beginner/hobby/expert rankings to questions so that the people answering don’t get sick of seeing beginner questions all the time?
There is so much SO could have done, instead they rested on their laurels and now they’re left with an out of date repository. What use is a curated repository if it will only help me solve problems with solutions from a decade ago?
Who says the solutions from a decade ago are not still correct or the best way to solve a problem? Just because ChatGPT regurgitates something today with the words moved around doesn't mean it contains "new" insights.
I agree in part, but why aren't other moderated outlets where users can ask technical questions given the same label? Reddit, Quora and HN are also curated; are content removals on these sites taken as elitist? Even if these places are less heavily moderated, I have no trouble surfacing relevant answers using any search engine's in-site search.
I am not talking about QA quality on any of these sites here, but the elitist stigma that has seemingly followed SO for so long.
> why aren't other moderated outlets where users can ask technical questions given the same label
The exact label aside for a moment, reddit and HN mods often face backlash for their actions. But beyond that, Wikipedia and SO stand out in this regard because of their transparency regarding the curation. Mostly, reddit curation happens in the background, without much explanation. SO and Wikipedia basically spell out their actions and reasoning.
Another difference is that with reddit and HN, you have no real recourse. At least with Wikipedia (I'm not too familiar with SO policies in this regard) you can appeal decisions, open discussions about policies, etc.
I have to agree with GP - people often mistake the 'bureaucracy' of sites like Wikipedia and SO as something unnecessary that the editors force on everyone, but the fact is, it's necessary to create and maintain a high-quality repository of information.
> SO and Wikipedia basically spell out their actions and reasoning
You're able to appeal on SO as well. It's interesting to think about a situation where moderation decisions would be more in 'the background', as you say (like Reddit/HN), and whether this takes away from the perceived 'elitism' some moderation practices are accused of.
In my experience on the above sites, and as a (small) community manager, it absolutely plays into it. A lot of people just instinctively respond negatively to displays of authority.
On the other hand, I think it's an important aspect of a community/platform if the goal of that platform is to be transparent and open, which I think is an important aspect of SO and Wikipedia, and I hope more platforms would adopt that view. I think whatever "elitist" perception such platforms have to suffer is well worth having high-quality, open platforms.
(I will say that no platforms are perfect of course, including SO or Wikipedia; there's plenty of criticisms to go around about specific policies and decisions. See: TFA :P)
This is an insightful observation, and a problem we struggled with for years on Stack Overflow: if you keep moderation quiet and anonymous, there's a lot less criticism, seemingly less hurt feelings... But also very little correction. The Star Chamber works great until corruption sets in; finding a good balance between secrecy and transparency is a challenge.
For years, moderators signed their names to messages like the one cited in the article. After one too many cases of a volunteer being called at work or having their family harassed or sent a suspicious package in the mail... That particular bit of transparency was eliminated - the cost was too high for the limited benefit. OTOH, it used to be very difficult to find your own deleted posts but that has slowly gotten better (including visibility into who deleted them) - turns out the benefit there was substantial (identifying wrongly-deleted posts & curbing over-enthusiastic curators), while harassment has been mostly limited to occasional grousing.
> After one too many cases of a volunteer being called at work or having their family harassed or sent a suspicious package in the mail
This is why I'll never use my real name casually on the Internet, and why the idea of widespread identity verification on the Internet scares the crap out of me.
I actually strongly prefer Wikipedia to SO, on Wikipedia the old now-wrong content can just get edited out, on SO you'll have to dig through all the 300-point popular answers from 2012 to find the new answer that says "yeah none of that is right anymore, instead do this"
Their curation blows. The whole premise of having a canonical answer to a question is dumb. Most programming languages and libraries are always in flux. The whole nature of many questions changes over time.
StackOverflow is a tyranny of mediocrity. It is a bunch of middling programmers shitting on newbies and driving away experts, because you get severely punished for not being mediocre.
I had a question closed as a duplicate for being too similar to another question that I had directly cited in my question as being subtly different and not applicable. (Because I anticipated some idiot closing my question... and they went and did it anyway.)
>I am thankful we have LLMs so we don't have to deal with SO. Ideally, as little as possible. SO can be a pretty toxic place filled with elitism and care for procedure over actually helping people
There needs to be a term for this. Perhaps "The Wikipedia Effect."
From a search, the message seems to have been in place since at least 2017[0] and I'd suspect is automated on detection of mass-deletion.
I can understand the reason for the policy (in some ways SO functions more like a wiki than a forum) and it doesn't seem to have been introduced to quell the protest against OpenAI.
Thanks to the people who delete their answers, now I have to pay OpenAI to find answers they already scraped. Talk about helping OpenAI make more money :(
It is nearly impossible to delete an accepted answer you don't want to have any more. I've had several which are wildly out of date and incorrect now and I don't want to update them, but the mods refuse to remove them.
at some point, it'll be too late. the horse has already left the barn.
besides, if the site owner makes a deal with the devil, there's nothing you can do other than quit using the site. people are still using social platforms more than ever, so stopping isn't going to happen.
the more likely to happen is that accounts deemed to be polluting the waters will just get suspended with no recourse to have it re-instated.
> at some point, it'll be too late. the horse has already left the barn.
I don't think this is true: the technology is useless unless it parasitises new knowledge continuously
it sows the seeds of its own destruction by reducing the value of past and future contributions to zero
> the more likely to happen is that accounts deemed to be polluting the waters will just get suspended with no recourse to have it re-instated.
so this is also perfectly acceptable: once they've banned the top 20% the site effectively becomes read-only, and the AI knowledge previously parasitised from it atrophies with no replacement
Known knowledge doesn't disappear. Once it knows how to apply an FFT and when, it doesn't need to continue to read about it. It's not a human needing continuing education. Once it knows that Henry VIII had many wives, it doesn't need to keep reading that he had those wives.
Sure, if something new happens, then it's not like SO is the only place it's scraping for new information. If you honestly think that you/we will get to a place to block all scraping, I will just politely disagree.
> Once it knows that Henry VIII had many wives, it doesn't need to keep reading that he had those wives.
That's actually incorrect, it needs to constantly ingest new data. If it ingests enough data (from other LLMs that are hallucinating, for example), then suddenly when it has enough bad data it'll start telling you that Henry VIII was a famous video game on the Sony 64.
It has no concept of 'truthfulness' beyond the amount of data that it can draw correlations from. And by nature LLMs have to ingest as much data as possible in order to draw accurate results from new things. LLMs cannot function without parasitizing off of user generated content, and if user generated content vanishes then it collapses in on itself.
Well, that's already happening. Google search has become increasingly useless thanks to SEO-focused AI-generated schlock. It's the inevitable outcome of LLMs. Sites have an incentive to hide that they're AI generated and LLMs have no real way to filter for ingested data made from other LLMs. The only difference is how long the ruse can be kept up.
So you want to pollute the commons just as the people filling the web with SEO-focused AI-generated schlock? Do you feel justified in polluting the commons to serve the ends you desire?
Do you actually have a solution to the problem of companies using LLMs to steal from other people and repurposing it as their own, other than figuring out ways to ensure that LLMs suffer for doing so? And frankly as I mentioned, LLMs are already polluting the commons; you're not offering any solution on that front either other than asking people to keep supplying it with fresh data so that it doesn't poison itself.
Scorched earth policies are always en vogue, and easy to offer as a knee jerk reaction. They do nothing for actually making forward progress in the conversation though.*
*However...there are times where the best solution is a match and some gasoline.
Ah. I've found various LLMs are much easier to query and generally nicer than SO posters, so it's been quite a while since I've needed to visit SO. I assumed most people had made a similar journey.
Not so much anymore though. I've seen over the last year that SO ranks lower and content farms like geeks4geeks, Programiz, etc. are getting much higher in results.
i still google things, mostly out of habit. but i'd say half the time i visit stack overflow, the answer i get there is either outdated or too opinionated to be useful and i end up going to chatGPT.
Given how the industry has treated tech workers, this will be exploited. I'm interested in joining a private group with or without profit motive, that is not open source.
> Users are also asking why ChatGPT could not simply share the source of the answers it will dispense in this new partnership, both citing its sources and adding credibility to the tool. Of course, this would reveal how the sausage of LLMs is made
What? Surely the answer to that question is that ChatGPT doesn't know where the source of its answers is, isn't it? Isn't the question itself based on a fundamental misunderstanding of how LLMs work?
I haven't used it extensively, but when I ask a generic coding question in Brave it gives me an AI response and it does list source websites. Not sure if it's the actual source or it's just pulling them from a search or what.
> Stack Overflow and OpenAI have joined forces through a new API partnership. This collaboration aims to provide developers with a powerful combination of Stack Overflow’s vast knowledge platform and OpenAI’s advanced AI models. Through the OverflowAPI access, OpenAI users will benefit from accurate and verified data from Stack Overflow, facilitating quicker problem-solving and enabling technologists to focus on priority tasks. Additionally, OpenAI will integrate validated technical knowledge from Stack Overflow into ChatGPT, enhancing users’ access to reliable information and code.
Come on. Was this taken from a Press Release?
> it can be disruptive to the entire community to delete or remove content that might be useful to someone else. Even if this content is no longer useful to you as the author. [sic]
> As for the rest of us Stack Overflow users, I would not recommend jumping to delete your own content in protest too.
> To be fair to Stack Overflow, the warning email and suspending of accounts is likely not a new thing.
I can't find a negative word about SO in this entire article, so "to be fair" doesn't seem meaningful.
If you check the byline, the author is a Microsoft MVP / product evangelist. So I don't think he's biased towards SO so much as he is biased towards anyone doing business with Microsoft (or OpenAI). He also seems very pro-GitHub Copilot.
Your answers are already sold for profit again and again. That's the whole point of SO existing. Or maybe you're under some delusion that SO is a charity?
If you want to contribute to the commons, contribute to the commons. If you want to contribute to the commons without commercialization of your work, contribute with some non-com license [1]. If you want to feed a corporation with your labour, post on SO.
[1] It'll still be illegally scraped and commercialized by some AIBro, and you'll have no proof or recourse against them...
Does anyone know if ChatGPT etc. could code without StackOverflow answers?
I think that is the big question, because the license seems it's going to give lawyers a very wide attack surface to go after every ai coder out there if they all need SO database.
The problem is whether people see programming as a zero-sum or positive-sum enterprise. In the real world, it acts as a positive-sum enterprise: one person's contribution benefits themselves and all those who use or learn from the code. However, many gatekeeping-type people view it, perhaps instinctively, as zero-sum. They imagine that OpenAI benefitting from this partnership, or any amount of learning via web-scraping their models perform, necessarily harms those who put their content online. This is a nonsensical argument, yet it has garnered a fair amount of support due to the somewhat reflexive anti-AI sentiment as of late, which is separate from the more nuanced concerns about existential threats from AI.
Positive-sum rarely exists in this world... after all, one's wealth determines their influence over others. Both sides might gain, but this usually means others lose.
In this case, contributors might lose attribution. SO might lose traffic, but they'll be compensated. Contributors won't, so eventually there might be no reason to contribute anymore...
Isn't the existence of wealth in the first place sufficient evidence that wealth is something that gets created? We started out banging rocks together and now we have all of this weird stuff which presumably people like or something.
Now we work harder, and it's getting unbearable for those on the bottom... Wealth also affects whether you're "useful", and you need to be "useful" to survive... It's getting harder to be useful.
It’s simply false that positive-sum doesn’t exist in the real world. Even the most simplistic trade argument in remedial Econ 101, or even Bio 101, reveals this.
If I'm not mistaken, the whole society is getting wealthier; it's just that some people are getting wealthier faster than others, so it's still positive-sum.
You only consider those that "make it"; there are many who don't, because it's getting increasingly harder to be "useful" in the market (ChatGPT is cheaper), and innovations usually make it worse. Those "new jobs" are harder, and many won't qualify.
Imagine that you spent a lot of time helping people and building a community. Then a company encodes this "help" into text format, puts it into a book, and makes a lot of money selling the book. In doing so, this company kills the community. You wouldn't be pissed off about that?
Your knowledge work is being exploited. If you don't allow OpenAI to train its subscription product on your open source contributions, you will get banned.
I still use StackOverflow. Not as much as I used to, thanks to GPT, but still multiple times a day; what I find is that I spend less time on SO per visit.
However, IMHO deleting questions you wrote in the past hurts other users more than it hurts AI training.
Other users cannot write answers similar to yours, because it wouldn't add anything and they'd get downvoted or deleted. So if you hadn't written your answer years ago, others could've written something similar. Also, other users may have commented on your questions/answers. Their efforts would be lost/deleted if you deleted your questions/answers.
Thanks for your previous contribution to the community. But I would say the worst you should be able to do is remove your name/anonymise your posts, not delete them outright.
I wonder whether deleting questions might actually be a good thing. If there is no old question, the same question asked again cannot possibly be a duplicate... So a constant loop of deleting questions might actually be an effective way to fix some problems. And there are enough off-site backups already.
Looking through my browser history, I'd say that I average about 5 distinct SO posts per day. If you know there will be an answer, it's less typing to search for it than it is to have ChatGPT regenerate it.
>Ben continues in his thread, "[The moderator crackdown is] just a reminder that anything you post on any of these platforms can and will be used for profit. It's just a matter of time until all your messages on Discord, Twitter etc. are scraped, fed into a model and sold back to you."
Uh.... yeah, it's a company, not a charity. No one's forcing you to post on StackOverflow. No one's forcing you to buy a ChatGPT subscription.
While this is true, sites like Stack Overflow very much only function because they create the illusion that they are in fact a "community". The moment they make explicit that there is monetary value in the knowledge people post on the site, it becomes obvious that the users are, to use Varoufakis's term, technoserfs.
You're very much never supposed to notice that Reddit, SO, and so on continuously extract value out of the work you produce; at worst you're maybe supposed to notice an ad or two. Because if you do notice, you might actually start asking why you aren't getting paid. Which is, funnily enough, exactly what news organizations and SO have realized vis-a-vis OpenAI.
IMO it's kind of silly and mentally corrosive to think of everything you do in these kinds of transactional terms.
I post on reddit because I find it enjoyable. I am not doing "work" that I think I deserve to be compensated for. Not every POST request I make to someone's server should be accompanied by a bill for my labor.
Given the lack of an alternative, should we tell the human instinct for sharing in the pursuit of knowledge to sacrifice itself, or accept the risk of exploitation?
If your sorry platitude is what we have to show for it, capitalism must go to Hell.
I have very few SO contributions so I don't have much at stake personally, but I have observed that there was a trend of people using their SO profiles for career advancement. I'd see people reference their SO activity on resumes, I had job applications ask for my SO profile if I had one, and I've seen advice that a good SO profile was valuable the way a good Github profile is. Is that something people factor into their decision to delete? And isn't that social capital a kind of compensation for their contributions?
I left because the staff behaved in a disingenuous manner.
I found when leaving, as mentioned in the article, that you are not allowed to delete accepted posts. So you can't remove your content should you come to find SO objectionable and no longer wish your content to be there.
I can't see now why anyone would spend time posting answers there.
I don't love them getting into bed with AI, but also don't think it's unreasonable for them to not allow angry users to blank out their prior submissions.
The whole deal was that you basically donated your posts and CC-licensed them. I wouldn't blame Wikipedia for similarly dealing with upset editors who went around blanking the articles they contributed to or reverting all their changes.
Wikipedia editors aren’t the sole source of their pages. I am fine with people leaving and deleting their posts because they may feel the information will become outdated without being maintained.
SO unfortunately is actively hostile to correcting outdated information. Which is somewhat understandable, as recognizing how little long-term value answers provide would undermine their moat.
The site has basically become worthless for JavaScript due to such rot, which helps explain why they are trying to cash out on the AI side of things.
Many answers have been edited, commented on, and reviewed by others. So it's also not exactly a one-person show either.
Outdated info is a problem, but not so easy to solve; I have answers from the Ruby on Rails 4 era that are still perfectly valid today. Others may not be. Also remember that people sometimes stay on old versions for a long time. I don't know what the best solution is, but destroying information is not it.
Few answers get any sort of editing or updating over time.
If you’re worried about deleting information, the obvious solution is to automatically hide the text when the poster says it’s outdated. At which point there’s little wrong with letting someone flag all their posts as likely outdated upon leaving.
But this is where the business of SO comes into conflict with providing useful information on SO.
I'm with you, I think you should be able to delete your own posts and erase your internet history.
I don't think that everything we ever write on the internet should be stored forever because of some misguided intention to preserve conversations for future generations :)
I’m sure this is related to the fact that the editing window closes here after a short time.
I’ve been on other forums where disgruntled users have come through and destroyed old posts, which resulted not just in the loss of the messages but also harmed the threads that built upon the now-vanished posts. So they, too, instituted a short editing-window policy.
Hmmms. While I definitely can see SO's arguments concerning deletion, that letter seems to blatantly contradict GDPR's right to be forgotten, which Wikipedia describes as a more limited "right to data erasure" [1].
To borrow a Dutch phrase: I cannot make chocolate out of that. Anyone here have an idea how to bring these two points together? Other than the obvious "wrt. EU inhabitants, SO is lying", that is. Or is it really that simple?
I find it deeply troubling that platforms are becoming so hostile that users are having to strike against the owners by mass-deleting their content. And then the platforms handle this by simply undeleting the content and banning them from deleting any more (StackOverflow, Reddit).
This may also be legally dubious in Europe: while the authors may have granted copyrights to the platform owners, they still retain their moral rights, which may apply in this case (IANAL).
I just submitted a request to have all my content removed from SO and will challenge the outcome if needed. My right to be forgotten and have my content deleted supersedes SO’s dubious, “nobody really reads these” terms and conditions.
One of the reasons that Quora today is absolutely unusable is that it no longer is a curated discussion between internet users and knowledgeable people, but AI spamming the site with swarms of low-quality questions, and AI answering those questions with swarms of low-quality answers. I think it's likely that Stack Overflow will end up following a similar pattern.
Quora was already absolutely unusable back before GPT-2. It became unusable as soon as people realized that all they had to do was self-identify as an expert to get taken seriously on there, so people started developing whole lifestyles around building up their Quora profiles. From that point on the actually knowledgeable people weren't interested in contributing because there was no way to distinguish themselves from the people who were faking expertise. AI may have been the final nail in the coffin, but Quora was dead long ago.
Stack Overflow managed to avoid that particular hazard by placing less emphasis on real-world identity and expertise, but it also has been in a long-term decline for many other reasons. The fact that they made such a vocal stance against AI and then pivoted so dramatically is just one example of how much they've struggled to find direction lately.
Just a point of clarification: the volunteer moderator base (its power users) took a strong stance against AI, and the company, chasing every possible dollar, overruled them.
Short-term profits over user preference is what happened here.
I know this wasn’t really your point, but it’s worth noting that Quora being low-quality spam is not the problem. The problem is why the hell Google surfaces Quora so prominently, given that the results are pure shit and require registration to even see all the shit.
Is there any reasonable explanation for how they’re ranked so high? Like, how can even googlers tolerate it?
Just a guess, but I think when they started losing the spam wars they put in some kind of handcrafted whitelist ranking boost, either directly based on brand/site, or link proximity to known good sites, etc. And maybe they don't update that list too often. You can find some info about an ML update Google called "Vince" that sounds a lot like that.
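Purely as a hypothetical illustration of the kind of handcrafted boost I mean (the domains, weights, and function are all invented; this is not Google's actual ranking code):

    # Invented example of a crude "brand whitelist" ranking boost.
    WHITELIST_BOOST = {
        "quora.com": 2.0,       # made-up weight
        "wikipedia.org": 1.5,   # made-up weight
    }

    def boosted_score(base_relevance: float, domain: str) -> float:
        # Multiply a page's relevance by its whitelist boost, if any.
        return base_relevance * WHITELIST_BOOST.get(domain, 1.0)

    # A mediocre whitelisted page can outrank a better unlisted one,
    # and it stays boosted until someone updates the list.
    print(boosted_score(0.4, "quora.com"))    # 0.8
    print(boosted_score(0.7, "example.com"))  # 0.7

A stale list like this would explain both the high ranking and why it persists: nothing in the loop re-checks whether the boosted sites are still any good.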
Poor maintenance of a probably thousands-long whitelist of "brand quality" seed sites? When the only measure they really care about is ad revenue, and bad organic results might mean more ad clicks? It's not really that outlandish, just plain complacency from a company with an overwhelming market-share lead in search. That's how Google started in the first place... capitalizing on the complacency/stagnation of the then-leaders in search.
Assuming you mean people working at Google, the answer is probably that profit/promotions outweigh personal use. More clicks, more back-buttons, more search adjustments, more advertising revenue.
Quora and SO are rather different communities. In Quora's best days, there were celebrities or quasi-celebrities making interesting posts, just like on Twitter or Google Plus in their prime. Quora also used to have very active and talented Community Managers / Top Writers. Marc Bodnick used to do tons of curation but left a while ago to create his own social network(s).
In contrast, SO has never been so "celebrity"-driven and the content has a rather different audience. I think it's understandable that the major contributors don't like how their content is being used, similar to the Reddit revolt.
What might "replace" SO is some AI-assisted way to establish a handbook and FAQ for any new technology. That could be a chatbot as well as some effective method for feeding that bot content.
And then SO-the-community, i.e. people who want to talk to each other, will probably branch off into some other forum or network.
I do not understand: how are they "creating negative incentives for sharing knowledge"?
If I posted on SO before in the hope that others would find it useful (and not for the karma), and it now helps others not directly through the site but indirectly through an LLM, where is the problem? Knowledge was shared.
Part of the benefit for the answerer is the experience of interacting with the questioner, receiving upvotes and comments, having answers accepted, and having your name on an answer that's helped people. You get credit for answering questions on Stack Exchange sites. It's not much -- it's not supposed to be -- it's rarely of material consequence -- but it matters. I still get upvotes on some of my old EE.SE answers when my written work helps someone enough for them to take notice. It's a little reminder that I've done something useful in my life.
Having my work ingested into ChatGPT takes the me out of it. It turns me into, essentially, unpaid contract labor for OpenAI. They get all the credit, and I get forgotten. Why would I be okay with that?
If you want to write free code for OpenAI to improve ChatGPT, you're welcome to do so. Cut out the middlemen and send it to them directly. But please leave me and my work out of it.
"Having my work ingested into ChatGPT takes the me out of it. It turns me into, essentially, unpaid contract labor for OpenAI. They get all the credit, and I get forgotten. Why would I be okay with that?"
So you are ok with unpaid contract labor in exchange for virtual points. But if you don't get virtual points as appreciation, no one should benefit. That is ok, but then sharing knowledge is not your main goal but a secondary one. Your main goal is the recognition.
But if you delete your comments, you won't get anything at all anymore.
If they remain, real humans will still benefit directly or indirectly. And why should I write exclusively for OpenAI? I share my knowledge with anyone. If SO were to restrict public access and favour OpenAI, that would be the moment I would want to delete everything. But at the moment LLMs are just also getting official access; they had access to SO before, just in a legal grey area. So nothing really changes.
Smcin has answered your other point. Let me respond to this one:
> But if you delete your comments, you won't get anything at all anymore. If they remain, real humans will still benefit directly or indirectly. And why should I write exclusively for OpenAI? I share my knowledge with anyone.
The goal -- implicitly for AI companies and explicitly for many of the commenters on this story -- is to replace sites like Stack Exchange. Stack Exchange's traffic will instead go to ChatGPT. The most likely outcome of this is that Stack Exchange will eventually shut down or severely degrade its service. If ChatGPT were a supplemental tool, one user out of many, you would be right. But it's not a complement, it's a competitor, designed to make a profit off of assimilating my work without giving me any compensation or credit.
Exactly. People join SO and other SE websites to ask questions and get answers.
With ChatGPT and similar platforms, trained on SE answers (and open Github repos,...), people will eventually skip Stack Exchange and directly go to ChatGPT.
> But if you don't get virtual points as appreciation [unpaid contract labor]... then sharing knowledge is not your main goal but a secondary one. Your main goal is the recognition.
It's a false dichotomy to parse out components of motivations; most SO users are motivated by a mix of altruism, sharing knowledge, some recognition, optionally linking to your profile/website/blog/resume/portfolio, getting job approaches, and a dose of pride/ego/vanity. As a longtime SO user, that has historically been the bargain, when most or all of your submissions were directly seen by human end-users. As a plus, all of that gave you good SEO commensurate with your contributions. So it's unreasonable to try to dichotomize into "users who mainly did it for the rep" vs. ones who want to teach and share.
But the 2023 and 2024 announcements are different: the future is that your submissions will be used to train AIs. However, SO doesn't seem to have devoted much thought to licensees like OpenAI complying with SO's attribution requirements [0] (attribution must cite the individual URL of the question/answer and the SO username, which then links onward via your SO profile page to the items mentioned above). (If the AI synthesizes an answer derived from 5 separate SO items, do they guarantee to attribute all 5? See the sketch after this comment.) So the human eyeballs are being intermediated, your incentives to participate are evaporating, and that pretty much breaks SO's historical bargain with its user community.
The next major bad development would be SO opening the floodgates on the moderation-queue backlog of thousands of items of AI-generated content (which caused the 2023 moderator strike/resignations), much of it low-quality and arguably deserving a ban; if/when that feedback loop is closed, the results might well be unholy; certainly bona-fide human contributors will be marginalized and have less incentive. (And if AI were to be used for moderation, then that could be exploitable.)
Inbound views/hits on your content on SO come either from a) Google and other search engines, b) SO's own search, c) attribution from OpenAI's ChatGPT, or d) attribution from other (or future) AI licensees. If your code is scraped once but effectively viewed 1 million times via GPT, you won't see those 1 million hits show up; you can only vaguely infer they might be happening if the attribution is actually implemented and some users click through on it (or by reverse-querying the AI). So c) and d) will proportionately increase as a) and b) proportionately decrease.
So everything has changed. And obviously the incentive for you to keep providing unpaid volunteer labor, without even attribution, decreases.
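To make the attribution question concrete, here is a minimal sketch of what a compliant footer might look like; the URLs, usernames, and attribution_footer helper are all made up, and nothing suggests any licensee actually emits this:

    # Hypothetical sketch: one attribution line per ingested SO post,
    # however many posts the synthesized answer drew on.
    sources = [
        ("https://stackoverflow.com/a/11111111", "alice"),
        ("https://stackoverflow.com/a/22222222", "bob"),
    ]

    def attribution_footer(sources):
        # CC BY-SA requires crediting each work and naming the license.
        lines = ["Derived from the following CC BY-SA licensed content:"]
        for url, user in sources:
            lines.append(f"  {url} by {user} (CC BY-SA 4.0)")
        return "\n".join(lines)

    print(attribution_footer(sources))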
What are the negative incentives? How would an LLM improving in capabilities harm those who shared their knowledge for free online at some point in the past?
My experience is worth less if an AI can summon it at will. It hasn't necessarily come down to this yet in the software industry, but in others (like animation), folks who were previously responsible for generating concept art have found themselves without jobs, as management can get "good enough" results from a much cheaper medium (one that was, at least en masse, trained on their "prior art").
I don't personally have a well-formed opinion one way or another on this, but to dismiss the existence of an issue at all is logically lacking.
The same reasoning would equally justify the claim that your experience is worth less if beginner programmers can summon it at will. If you believed that reasoning, you wouldn't have contributed to Stack Overflow in the first place. I don't, and if you contributed to Stack Overflow, you didn't either.
The scale might be different here, since prompting an AI is much cheaper than hiring a beginner programmer. And the loss in the beginner-programmer case could, for instance, be compensated by attribution.
Come to think of it, recent ML is just a scaled-up version of Infosys, Wipro, etc. Shit-quality answers for enterprises, now accessible to the masses.
SO made it such a pain in the ass to contribute I gave up trying every time I’ve historically been interested. Like I’m already sacrificing my time to offer my expertise helping someone, you want me to jump through a bunch of hoops to have the privilege of doing so? No thanks.
That same pain-in-the-ass gamification made spam and terrible-quality answers equally discouraged. Given the volume of at least decent content on Stack Overflow, I'd say the game worked. Somebody could try to do better with a competitor, but it would be a hard thing to succeed at.
The more hoops they've added the worse the quality has gotten. The quality has declined over time, and most of the good answers you see nowadays are from people who got in the habit of contributing back when the process was much simpler, and would likely never have joined the site if it was as onerous as it is today.
Have you assessed the quality of Q&As that aren't years old? Anything decent that I find is usually quite old and possibly out of date.
It doesn't help that asking for a more recent answer gets your question closed as a duplicate, and new answers can never overcome the inertia of the historical ones.
I'm starting to wonder if the days of "free, ad-supported, user-generated content wells" are over. The audience and participation base have grown beyond what these single entities can rationally cope with while still maintaining their original mission and profits.
We've outscaled our original hopes for the Internet. It was originally meant to be a tool genuinely controlled by its users; unfortunately, it's largely ended up in the stranglehold of a few monopolists.
Stack Overflow has been assimilated. Resistance is futile. It served a useful purpose but now it's part of the glorious AI universe to come. Rest in peace.
That also means there is probably a lot of wrong information on Stack Overflow that is baked into the training too. Hopefully they accounted for this in training, but there's no way of knowing.
I have not really had a lot of accuracy issues with GPT, but then again, I'm probably not savvy enough to spot them, most of the time anyway.
If it's posted on Stack Overflow, it's not new; it's merely been published. If this is the bar for LLM "learning", then they are doomed to live in a hazy bubble of the recent past.
Haha, what's that gonna do? Ever heard of soft delete? It's a thing where even if you delete something off a website, the database still retains that information even though it becomes inaccessible by the public.
Everything we write on the web is like that, including this very comment.
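A minimal sketch of the pattern, assuming a made-up schema (no claim this is SO's actual one):

    import sqlite3
    from datetime import datetime, timezone

    # Hypothetical schema: "deleting" a post merely stamps the row.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT, deleted_at TEXT)"
    )
    conn.execute("INSERT INTO posts (body) VALUES ('my answer')")

    # Soft delete: the content never actually leaves the database.
    now = datetime.now(timezone.utc).isoformat()
    conn.execute("UPDATE posts SET deleted_at = ? WHERE id = 1", (now,))

    # Public queries filter on the stamp, so the post looks gone...
    print(conn.execute(
        "SELECT body FROM posts WHERE deleted_at IS NULL").fetchall())  # []

    # ...but the operator (or a data-licensing deal) can still read everything.
    print(conn.execute("SELECT body FROM posts").fetchall())  # [('my answer',)]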
Even if it were a hard delete, do these people think OpenAI is scraping the live version of the site?
The answers have already been exported. All you're doing by deleting them is ensuring they're only available via ChatGPT, and no longer available to web users who aren't using the AI tools that ingested the content before it was deleted.
I wonder if the Wikimedia Foundation couldn't take the opportunity, now that Stack Overflow is alienating its userbase, to launch a rival Q&A site. I was always puzzled why they never attempted to enter this space, even before Stack Overflow, given their prior experience in crowdsourced information commons.
Pragmatically, the software powering most of their properties, MediaWiki, is not suited for it. It's hard to see them investing in the development of a new platform given the uncertainty of success.
In addition to deleting answers, I think protesters should upvote wrong answers and crappy posts.
For years the community has defended punitive downvotes on correct answers to crappy questions as "you can do with your vote as you like". I see no argument against flipping that around.
My personal end game, if I have one (and I'm not sure I do), would be to ensure that I can help individual novice programmers become better at their craft, not to make billion-dollar corporations even richer.
Does everyone get equal access to let their own copy of an open-source LLM download a copy of SO?
Are those open-source LLM users in turn selling access to the content they got for free, now also stripped of attribution?
What exactly is changing hands in trade for the money that doesn't one way or another violate CC-BY-SA?
It's not merely the fact of commercial activity, since there is no NC clause in there, but the specific actions here by both StackOverflow and OpenAI violate the terms the content was originally created and shared under.
You know they read this as "They can do something illogical if they want to." instead of "They don't owe you an explanation of their reasoning, nor do they require your approval of it, and your not knowing, understanding, or agreeing with their reasoning does not mean there is none or make it invalid."
SO has been doing the absolute worst things to squander their amazing lead for years.
I haven't used that website since GPT came out, and now I contribute nothing to it.
But I'm glad all of its content ended up training the models that put it out of business. Thanks, SO! You'll never be anything other than user contributions.
As dour as it sounds, I am in a similar boat. Who'd have thought that not getting needlessly called names when you ask a question (even if it's dumb, as that's how you learn) makes people less likely to interact with you.
What's amusing to me is that some people even in this thread are calling it a pro, not a con. I guess our field does indeed attract a certain kind of personality.