All web "content" is freeware (rubenerd.com)
159 points by imadj 3 days ago | 98 comments





This statement from Microsoft is just asking for a copyright infringement lawsuit because the courts have been very clear that web "content" is copyrighted unless it is explicitly placed in the public domain or old enough to no longer be under copyright.

Authors of open source code should consider adding explicit restrictions to their license barring the use of their code to train AI. This would make it easier to file lawsuits against Microsoft and others of their ilk who think they can train their AI with other people's work without fair compensation.


> Authors of open source code should consider adding explicit restrictions to their license barring the use of their code to train AI. This would make it easier to file lawsuits against Microsoft and others of their ilk who think they can train their AI with other people's work without fair compensation.

I see no reason to expect that this would alter or achieve anything. The wide-scale machine learning that’s been happening is entirely dependent on fair use exemptions from copyright. They’re not using it under your license—in fact can’t, current machine learning techniques and open source licenses already make it fundamentally impossible for them to comply—so what you put in it should be completely irrelevant.

No, if the fair use exemption is ever struck down, the entire field is dead in the water until (a) a change in the legal system, or (b) services like GitHub start demanding an additional license as part of their terms of service for the purpose.


No one would let AI get shut down in the US, there’s just too much at stake. Even if we don’t like what’s going on, we’ll take a measured approach in regulating, because otherwise it will just go overseas.

Doesn't the GPL do this already? Doesn't it already say that code derived from GPL code should be GPLed? So does that include any code produced by an LLM based on GPL code?

That would seem to be a logical implication assuming courts reject claims that "everything on the internet is public domain" or that training an LLM on copyrighted material constitutes "fair use" of the copyrighted material.

I suspect it would technically be infringement even for MIT licensed code because the original author's copyright notice would presumably be missing.


Any such lawsuit would be settled out of court, with no admission of guilt, and no damaging information coming out via introduction into public evidence.

"Authors of open source code should consider adding explicit restrictions to their license barring the use of their code to train AI."

> Anyone can copy it, recreate with it, reproduce with it

He seems to be confusing "freeware", which is basically a license for copyrighted work, with "public domain", which is the absence of a copyright.


> the absence of a copyright

Ain't no such thing.

Copyright exists, immediately upon creation (not publication) of a work.

It's different from trademark, in that practical application, enforcement, registration, etc., do not invalidate the copyright.

Copyright can expire, at which point the work becomes, effectively, "public domain."

Registering a copyright doesn't create the copyright. It simply makes it easier to go after those that disrespect it.

I'm pretty sure that the only way to truly transfer the ownership of copyright of a work, is to have agreements in place, before it is created (like "work for hire" contracts).


As a creator you can also explicitly dedicate a piece of work to the public domain, thus relinquishing any copyright to it. That’s what licenses like CC0, WTFPL, and The Unlicense do.

However, even being in the public domain does not in itself mean you can do everything. For example, in France you still have to respect the “moral rights” of the author, meaning you have to include their name and original title.

https://en.wikipedia.org/wiki/Copyright_law_of_France#The_pu...


The "moral rights" in France and Germany, or the "Urheberrecht" in Germany, Austria, and other European countries, prohibit even the creator from putting things in the "public domain" to the full extent. There are pro and con debates about this, of course.

Even photos of works in the "public domain" might be protected again, e.g. read about the (in)famous Wikimedia lawsuit brought by the German Reiss Engelhorn Museum: https://en.wikipedia.org/wiki/Reiss_Engelhorn_Museum


There is such a thing. There are three main ways for a work to be in the public domain.

- Expiry of copyright.

- Explicit dedication to the public domain by the copyright holder.

- Non-copyrightable work (such as computer or animal generated work).


In the case of the first two, the copyright actually exists, but is unenforceable.

In the last one, copyright doesn’t exist, because it can’t, so the point is moot.

> animal generated work

Actually, didn’t that monkey get copyright of the image? I can’t remember, for sure.

We can’t actually transfer the copyright, itself; only the rights to adapt and/or reproduce.


UPDATE:

In the "monkey selfie" case, the monkey lost, and lost hard. Probably because PETA behaved like ... PETA ... They footgun themselves constantly, by acting way too extreme.

https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...


Ontologically, copyright doesn't exist. Copyright is an epistemology.

If copyright could exist, then a copyright for the copyright must be able to exist, and it'd be turtles all the way down.

This is not nitpicking. Copyright, as intellectual property, is entirely made up as all other intellectual property is.

Saying copyright exists is as laughable as saying intellectual property is as rivalrous as the chair you sit in.


I am not sure what you are getting at, all property rights are made up agreements, as is what is defined as property, what can be privatized and what rights that affords you.

Take tangible land, your exclusive use of it has boundaries, for example airspace rights or mineral rights. It is all made up.

The difference is tangible v intangible, but in either case the rights are made up.


What is it with this new ontological wave on the Interwebs? For a mathematical axiom, do you need another axiom that tells us that the first one exists? And so forth?

How would you prove the existence of the universe? Do we not need a bigger universe that contains ours? And so forth? (Don't mention the big bang, which is a bunch of non-falsifiable formulas.)


narrator: the courts were not kind to the sophomore philosophy student whose defence was the non-existence of laws

Man, this is some weapons-grade hair-splitting. I tip my hat to you, sir.

Still. People have gone to jail for copyright infringement, so I doubt at least they would feel like laughing at the idea that copyright exists.


Who has gotten jail or prison time for copyright violations in recent times?

I’m aware of recent cases in Canada where defendants chose to ignore a court ruling and attempt to republish very similar material as what the court had originally found them to be in copyright violation for. They were then found to be in contempt of the court which is a criminal offence and then ordered to complete jail time and pay substantial fines.

Copyright violations are not criminal offences in countries I’m aware of. Please tell me of any cases where a copyright violator faced jail time for the copyright violation and not for related criminal offences.


> Swedish prosecutors filed charges on 31 January 2008 against Fredrik Neij, Gottfrid Svartholm, and Peter Sunde, who ran the site; and Carl Lundström, a Swedish businessman who through his businesses sold services to the site. The prosecutor claimed the four worked together to administer, host, and develop the site and thereby facilitated other people's breach of copyright law.

https://en.wikipedia.org/wiki/The_Pirate_Bay_trial

Swedish law has only gotten more strict since 2008 with regard to copyright.


>Ontologically, copyright doesn't exist. Copyright is an epistemology.

You keep using these words, ontology and epistemology. I don't think they mean what you think they mean.

>If copyright could exist, then a copyright for the copyright must be able to exist, and it'd be turtles all the way down

This doesn't make any sense.

First, not all things that exist are covered by copyright or have a copyright about them existing (air exists, but doesn't have a copyright. Neither do slugs, pebbles, Uranus, and other existing things).

Copyright is just sets of laws dictating ability to copy, distribute, and so on. It doesn't need a copyright for itself, and even if it did, the regular terms for reproducing any other legal code would suffice.

>Copyright, as intellectual property, is entirely made up as all other intellectual property is.

All human laws and conventions are made up. That doesn't mean anything: copyright is still enforceable with very real prison buildings, cells, and bars, and if you resist arrest for it, very tangible police batons, tasers, and bullets are not out of the question either.


Good to see Terrance Howard is back after his brief hiatus

He said "fair use", and only then added, quite unnecessarily, "or freeware, if you want". He primarily meant fair use.

> Perhaps that’s why he bookended his claims with “since the 90s”

No, it's because the web has existed since 1991. (Though for the purists, the paper was written in 1989 and the first browser was developed in 1990.)

https://www.npr.org/2021/08/06/1025554426/a-look-back-at-the...


I bet if any other company did it instead of MS they would sue the hell out of them for using their data.

I bet if Microsoft were not extracting value from someone else's content, but instead had their content being used to power someone else's business, they'd be singing a very different tune.

This already happened and MS sent a cease and desist.

https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn


Without trying to take a stance on this, I do have to say I like the FastGPT feature that comes with Kagi. It basically does a search and uses those results to answer questions.

Now I'd just want it to have a better UI with history and some sort of notebook mode instead of chat. I'm not sure how, but I don't want to chat with AI, I want a different way to 'instruct' it.


I intend to use Mustafa Suleyman's likeness and name for my next project. It's part comic book/part novel and tells the story of a socially awkward tech CEO getting way out of his comfort zone by moonlighting as a male porn star. It ends with an OJ Simpson style police chase when it's discovered that Mustafa has been embezzling funds to support a drug habit and addiction to plastic surgery.

> But that means torrents of Windows are freeware!

For many, many years now, if you need Windows you can just download it from Microsoft and run a simple, non-intrusive activation procedure (not from Microsoft) after installation. No cracks needed. About as much security as a hip-high front porch gate.

So even for MS the understanding was that these things are de facto freeware for anyone that wants them at all.


Feel free to start a business selling computers with pirated copies of Windows and Office pre-installed, or build out a corporate network or cloud service with them, and find out first-hand how much Microsoft really considers their products to be "de-facto freeware".

Why would I be selling stuff I got for free? It's unethical. Installing software on a computer is not "transformative". Training AI is very much.

OpenAI/Microsoft are building a business around stuff they got for free. That is the whole point of the above comment.

Not quite. They are trying to build a business around AI that they spent heaps of money to build and train. The free stuff serves the same purpose it does for all people on the planet: as examples of things that exist.

> run simple, non-intrusive activation procedure (not from Microsoft)

> No cracks needed.

These are contradictory statements, and I'm not sure why you'd think otherwise.


Simple. I just define a crack as a potentially dangerous program that you have to give full access to your machine so it can do its job.

Instead it's sufficient to use the normal MS activation procedure with a server that always says, sure, you can activate. Because why not.


Conveniently ignoring that you may be sued into oblivion if you have enough money to make it worth it for them. Come on. Windows is only free for people not making significant amounts of money with it. If you do make money... surprise: https://www.bsa.org/

Why would I care about rich people problems that try to get ahead without paying others they use to get ahead?

Laws are never the same for the rich and the poor. And if they are to differ then it's the better direction for them to differ.


Your assertion that Microsoft allows everyone to use Windows for free is false. What you care or not care about is irrelevant in this context. I have no clue why you brought it up.

Now if you wish to assert that Microsoft allows peons to use Windows for free, as long as it is convenient for them, I can agree with that. They're still a bunch of hypocrites.

Allowing Microsoft to selectively apply the law as it benefits them is not a good thing; you're confused.


If you do not allow people to do something, and yet hundreds of millions of people on Earth do it, and you do only as much as I described to prevent it, then you are de facto allowing it. Same thing the MS guy said. Whatever's published on a website is de facto freeware. "no copyright infringement intended". That's how it works outside of lawyers' offices.

Commercial policy is not the law so MS can be as hypocritical as they want. I'm happy that their hypocrisy is going in the right direction this time.


Excellent trolling! Funny thread.

Has everyone forgotten the furore that was Cook's Source Magazine stealing a recipe that was published online?

https://yro.slashdot.org/story/10/11/04/1940257/cooks-magazi...


I agree, so please, Microsoft, shut your mouth if I grab your maps, wrap your services and so on, because they are web-based, so I am free to do whatever I like with them; the relevant licenses do not count.

Microsoft wouldn't care unless you started a competing maps product.

Why not? If they want my data to train their LLMs, why not do the inverse with theirs, for business, as they do on their side? If for them all public stuff is free for commercial use...

More discussion on similar article: https://news.ycombinator.com/item?id=40828438

> search engines link to their sources! Chatbots don’t.

Actually Copilot does provide links to its sources, which adds credibility and promotes further exploration.


How did they train auto-completers or classifiers if they didn’t train on the open web? How did Pandora train if not on copyrighted music?

Pandora can train on anything, but they can only stream music that they have paid to license. Microsoft isn't paying anything to anybody.

> Don’t blame us, the Torment Nexus is established practice!

Well, it is. And I for one, am absolutely delighted that some people with money finally have an incentive to accept that after three decades of copyright death throes.


It's true. People don't like it, but it's true.

If you provide content you created online for free, that content is now freeware.

If someone provides content that they didn't create that still has copyright restrictions in real life, that isn't freeware.

It's like all the photos uploaded to Facebook and Instagram are now free to use however the downloader wants (and Meta as well of course). It's true. But people don't like it.


Copilot links to its sources. The author should reconsider having this blatantly false and easily verifiable article up on their website

I think saying it links its sources is a bit of a stretch. It links related articles which may or may not be the source for what it just said (also, may or may not be related :P)

What tools are you thinking of? I think saying it links its sources is absolutely not a stretch. My experience is with Kagi and with Perplexity, both of which have even returned messages saying something along the lines of the source documents not being able to answer the question.

Copilot doesn't link to the sources. It doesn't really know what its sources are. It links to articles that may be related to what it just said. Many times the sources even contradict what was said. So definitely not a source in that case.

They were thinking of the tool the 1st comment named probably.

Now that we have established that Microsoft information wants to be free, my next project is wget.ai:

wget.ai is a sophisticated real time LLM that trains itself while downloading "content". Like any LLM, it predicts the next output token (byte in this case) based on the statistical training. wget.ai is run at temperature zero. In this revolutionary setting it has arrived at the conclusion that the most likely output byte equals the input byte!

Armed with this theorem, wget.ai can transform and replicate a Windows 11 download in real time. No copying is involved, the advanced algorithms happen to arrive at input == output.

Users of Windows 11 can download activation keys (freeware) from the Internet.


> Armed with this theorem, wget.ai can transform and replicate a Windows 11 download in real time.

That’s a far bigger crime than IP infringement.


To legally run Windows you need a licence, not an activation key.

The installation can already be downloaded for free.


But you're not running Microsoft's Windows. You're running the output of an AI! And as everybody knows, there is no copyright on the output of an AI.

The (real-time) training of the AI was also completely legal, as an AI may train on anything found on the web, as that's freeware anyway.

The AI never stored or stores any copyrighted material. It just learns from it. Now in revolutionary real-time!

So how could wget.ai, or anything produced by it, be considered illegal? Using data found on the web to train AI models is fair use after all!


Is anyone paying for Windows 11 in 2024?

If you buy a new laptop with Windows on it, you are [indirectly] paying for Windows.

Is anyone able to _not_ pay for Windows 11 in 2024? It's called the "Microsoft Tax" for a reason.

GNU/Linux, ChromeOS (Google GNU/Linux), Android (Google Linux), MacOS, iOS (and iPadOS is a different thing, right?) are almost certainly collectively more popular than Windows. Even as a primary / exclusive computer. I think a lot of people are able to not pay for Windows 11 in $CURRENT_YEAR, probably most.

Yes, laptops without a windows license are pretty popular in at least some poorer countries. Most buyers install windows anyway and activate it via massgrave and friends, which lets you save 40 to 100 USD, which is a pretty big deal.

Lenovo offers its laptops (at the least the customisable models) with your choice of No OS, Windows Home, or Windows Pro.

Each Windows version has regressed from Windows 7 onwards. To the point that Windows 11 can almost be construed as malware. I'll be using Ubuntu henceforth.

Actually XP was probably the peak of Windows IMHO

Windows "Teletubby Edition"? :-) No, Win2k was "peak Windows", imho.

Frankly, MS later ditched the quite ambitious Windows NT 5.0 project, which was the planned Win2k successor, for a Frankenstein monster made out of the super buggy WinME and Win2k. That became Windows XP.

Coming from Win98, Win98SE, or WinMe, WinXP was for sure quite good. But compared to the super stable, fast, and well-structured Win2k it was quite a disappointment. It had almost none of the advanced features planned for WinNT 5, it was much more unstable and buggy than Win2k, and it was quite chaotic, with "old Win95" parts, stuff coming from Win2k, and some things of its own placed randomly.


Can generally confirm, the only Windows I regularly use is Virtualized XP for some old music making programs I like.

I guess if you want 64bit support windows 7 is generally better supported than 64bit XP

Business/site licenses, probably.

ipfs.ai

I like the fact that I can now reproduce any Microsoft content without paying for it. Cheers!

Incidentally, some AI chatbots do link to their sources. And it is a good idea to make that an explicit prompt if you're using one that doesn't. It's also worth prompting for how recent their information is.


I would argue that if I ask ChatGPT something, it doesn't "reproduce" what was written on a certain website (or at least it shouldn't, without attribution). It takes what it scraped before and retells it in its own words. That isn't reproducing; it looks like a grey area not yet addressed in copyright laws.

I would partially agree with the guy that yes, that was the social contract since the '90s, but before the AI era. Back then this use case wasn't anticipated.


> in its own words

LLM's have no words of their own.

Imagine training an LLM vs a group of people from birth on wrong information. The LLM will unquestioningly just repeat in "its own words" the wrong information, whereas the group of people will of course believe some of the wrong stuff, but they will also doubt a lot of it as well.

You could say that an LLM is just not good enough yet so the comparison isn't fair. In other words that people are just even more LLM'ing than the LLM, but there simply is no mechanism for an LLM to go from wrong information to right information.

People on the other hand will always doubt, hypothesize, and compare and contrast whatever information they have to at least attempt to form correct answers from correct information. This in a sense is because they actually have their own words.

There has, as of today, never been a smart or creative thing an LLM has said that doesn't literally come from other people's words. If LLMs are smart, it's because people are smart.


There’s nothing ambiguous from a copyright perspective: it’s a derivative work. People seem to confuse plagiarism in an academic environment with copyright. Simply using your own words doesn’t mean you’re free from copyright.

However even when something infringes copyright that doesn’t mean anything necessarily happens. Just look at YouTube’s early history or the mountains of fan fiction out there.


> Just look at YouTube’s early history

But something did happen. Viacom and others sued them, and then YouTube introduced their Content ID system so that they could pay copyright holders for content that others uploaded, as well as to take down videos belonging to copyright holders that did not agree to other people uploading their content.


> something did happen

Yes, it took 2 years after creation and truly massive amounts of copyright infringement before the lawsuits by copyright owners showed up. OpenAI is getting sued, but don’t expect your requesting a website be rewritten to provoke anything unless you publish such rewritten posts at scale or something.


>then YouTube introduced their Content ID system

That's for content that's reproduced in part or fully, but verbatim (like a song, movie clip, etc, where Content ID can apply).

But the parent's point is you can have trouble even for content where you "retell" something "in your own words".


The part I was responding to is this:

> However even when something infringes copyright that doesn’t mean anything necessarily happens. Just look at YouTube’s early history or the mountains of fan fiction out there.

This part is talking about uploading a copy of something verbatim, the way I read it.


Last time I used Copilot, the "sources" often didn't support what it said and it seemed like they were obtained by adding search results from feeding the answer into Bing after it had already been generated.

And there were of course tons of SEO slop links among them.


I asked ChatGPT for sources and it was impossible to determine whether they were real or not. It'd cite things like "Sky and Telescope magazine": no edition, no page numbers, no year, just a vague, unverifiable citation.

>I like the fact that I can now reproduce any Microsoft content without paying for it

Only if you have the same quality lawyers and financial backup to support them to get you off like MS has. Else what applies to MS doesn't apply to you :)


You are probably joking, but that is literally what MS said, they don't even hide it. A quote from the register: "Suleyman (Head of MS AI) did allow that there's another category of content, the stuff published by companies with lawyers." (https://www.theregister.com/2024/06/28/microsoft_ceo_ai/)

Has this become any better? Every time I ask ChatGPT for sources, it makes up papers, with fragments of real paper titles and topically related authors. The supposed paper itself, though, can't be found anywhere.

Good when they do, but depending on what we are discussing, linking to all their sources might be completely impractical.

DRM and paywalls for thee, industrial-scale scraping for me. /s

It's time for us to build our own miniature versions of Internet Archive with the content that is personally important to us . The powers that be will take it down under the guise of defending copyright, while the bigcos continue to suck up every letter of every page that has a publicly available URL.


I find it good that the concept of IP is collapsing, but this clearly shows the corporate dishonesty around it. For decades corporate sites and APIs have pushed all sorts of illegal EULAs and ToSs in an attempt to e.g. ban scraping. Now suddenly all of this is scrapped, with of course no explanation given as to why.

IP isn't collapsing for anyone with the means and connections to enforce the law. Microsoft is essentially pickpocketing the peasantry, while steering clear of the feudal lords like Netflix and Google, who can hit back.

In a world where physical media is no longer relevant and everything is on the internet, what the hell is "web content"?

> In a world where physical media is no longer relevant

Debatable. IMO print is the best medium for long-form written and graphical narrative work.

Streaming media can’t be lent and is locked up by excessive and arbitrary (from the consumer’s PoV) rights leveraging.

Given physical media is declining, yet more functional for its archive potential, I'd say it is now more relevant than ever.


I think you may be interpreting the word “relevant” in a different light than the person you’re replying to.

It reads to me as if you’re saying physical media is important for humanity as a whole and the preservation of knowledge, while your parent comment is saying physical media is no longer significant to individual consumers because it’s not their preferred method of consumption.

Both connotations can be true at the same time.


Literally everything isn't, and will never be, but I agree with your gist.

Also note how 'content' is corporate-speak (they especially like owning the platforms hosting it) :

https://craphound.com/content/Cory_Doctorow_-_Content.html#1


Internet != Web. E.g., an ebook downloaded from Apple Books is not web content, even if it comes from the Internet.

Until that eBook inevitably gets uploaded to a piracy site. The implication is that if a web crawler can find it anywhere then it's fair game, regardless of provenance.

So basically I can create a whinedows website with microsoft windows logo on it right?


