I would be willing to bet that the driving force behind the decision was to make it less trivial for LLM companies to say "the data was already there under an open license, so we legally undercut Stack Overflow".
The fact that everyone is hoarding data because they think a gold rush is afoot is obvious. Everyone with loads of data is clamping down, hoping to get a cut of those AI VC dollars. Except Wikipedia, at least.
But let's be real about the morality here: Stack Overflow is a badge-powered Mechanical Turk. It uses 100% unpaid labor to search Google for answers and post them on SO, providing a "service"[1]. For it to moralize about the ownership or sanctity of data is ironic.
[1] - There are exceptions, obviously. There are true experts who wander the virtual halls of Stack Overflow and dole out wisdom. But it is clear that the overwhelming majority of answers come from people who rush to Google and then copy/paste from blogs and tech papers. And while Stack Overflow's dumps are CC-licensed because that's the agreement it made with contributors, a lot of the content on the site was ripped without attribution and in defiance of IP. So...maybe not too many tears for SO.
> There are exceptions, obviously. There are true experts who wander the virtual halls of Stack Overflow and dole out wisdom.
This is what makes SO valuable for me. Every year it seems like "trust but verify" requires a little more emphasis on the verify part, so it's extremely valuable when I run into a true expert whose input I can take at face value and rely on.
I think the winners of the AI gold rush will be the ones who figure out how to help users assess how much trust to put in the information being surfaced. The problem I see with current systems like ChatGPT is that they seem to treat visibility and popularity as credibility and expertise, when that's not the case. In my experience, the real experts don't say anything unless they can add to the conversation, while there's never a lack of low-quality information from uninformed participants.
For example, via ChatGPT...
Q: How do I protect against the scrub of death with ZFS?
A: The term "scrub of death" is commonly used to refer to a situation in ZFS where a data scrub operation can lead to catastrophic data loss due to undetected errors or issues. While ZFS is designed to provide data integrity and protection, it is still important to take certain precautions to minimize the risk of encountering such a situation. Here are some steps you can take to protect against the scrub of death with ZFS:
Except it's not a real thing [1]. For bonus fun, read the comment right after Ahrens'.
I don't understand why SO doesn't lean into that. Anyone can train an LLM on the raw data, but SO has the information needed to do a better job of ranking the quality of the inputs, so wouldn't they be able to build an LLM that's significantly better than anyone else could with the same raw data? Understanding the quality and reliability of an answer is far more important to me than just getting an answer.
What's more frustrating than getting an answer to a programming question and then spending hours figuring out that it was complete BS and doesn't work as described?
I don't know much about LLMs, but if I were SO, I'd be figuring out how to lock down the ranking information as quickly as possible, because that's where the value is. The ranking and acceptance of answers, alongside tags, overall user rank, participation frequency, etc., should mean that SO has a significant advantage when it comes to ranking and weighting the input data, right?
I want input from subject-matter experts to count the most, and SO has the best data set to provide that. I don't see the point of locking down the content when the real value is in the ranking. It's odd that SO doesn't see that, considering the entire network is modelled on that idea. Maybe they do and there are bigger changes coming down the pipe.
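To make the idea concrete, here's a toy sketch of what weighting training examples by those signals could look like. Everything in it (the field names, the formula, the numbers) is hypothetical, invented for illustration; it's not an actual SO schema or API:

    # Hypothetical sketch: fold SO-style ranking signals into a per-example
    # training weight. Field names and the formula are invented for
    # illustration, not taken from any real Stack Overflow data model.
    from dataclasses import dataclass
    import math

    @dataclass
    class Answer:
        body: str
        score: int              # net votes on the answer
        is_accepted: bool       # asker marked it as accepted
        author_reputation: int  # overall user rank on the site

    def sample_weight(a: Answer) -> float:
        """Combine the quality signals into one training weight."""
        vote_term = math.log1p(max(a.score, 0))            # diminishing returns on votes
        rep_term = math.log1p(a.author_reputation) / 10.0  # expert authors count more
        accepted_term = 1.0 if a.is_accepted else 0.0      # acceptance is a strong signal
        return 1.0 + vote_term + rep_term + accepted_term

    # An accepted, heavily upvoted answer from a high-rep user weighs far
    # more than a drive-by guess, even if the text itself looks similar.
    answers = [
        Answer("use epoll here", score=240, is_accepted=True, author_reputation=85_000),
        Answer("just add a sleep()", score=-2, is_accepted=False, author_reputation=15),
    ]
    for a in answers:
        print(f"{a.body!r} -> weight {sample_weight(a):.2f}")

The exact formula matters less than the idea: the ranking metadata turns an undifferentiated pile of text into weighted training examples, which is the advantage described above.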
I think the real debates are going to come in the future if SO releases a paid LLM product that's trained on community contributed content and rankings.