

Seems like some longer videos gradually slip into the uncanny valley.


It's like I'm watching him on the news


Biden needs to sign an executive order to jail all the judges who voted for this verdict, for the high treason of not protecting the US Constitution. And he will be immune, thanks to them.


APL was always kind of a write-only language. But now you can probably paste all the code into ChatGPT and it will explain it.


This would prevent the initial SYN from reaching the server too.


Not if you also match on established connections and outbound only.


I recently needed to modify a program that uses C++ std::ostream to write info to a text log file. This is a writeup of the experience of engaging ChatGPT for help.


I don't really understand why these patents are granted. Software is math, and math formulas are not patentable. The patent (and copyright) system today is actually subverted by greedy players: instead of promoting progress and the sciences, it serves to enrich middlemen by artificially SLOWING progress.


Saying “software is math” is akin to saying “books are just letters”; the building blocks of software are not what people want to protect, it’s the idea and the effort to invent.

Whether software should be patentable or not is obviously open for discussion, but saying it’s just math isn’t really enough of an argument.


The best of the mainstream arguments against software patents has always been: What does Intellectual Property and Copyright not cover that patents do when it comes to software?

I only have a few friends I trust on such a topic (one's a public policy researcher, the other a lawyer), and my understanding from them is that between IP and copyright law, protection would be more than sufficient for companies' work; patents were lobbied for only because their enforcement is more heavy-handed, i.e. it can stifle competition under the guise of "patent infringement".


> What does Intellectual Property and Copyright not cover that patents do when it comes to software?

Copyright cannot and does not prevent someone from clean-room engineering a replacement for your software. Patents can do that. (Whether or not they should is a different question, but that's not what you asked).

FWIW, patents are one of the forms of intellectual property.


Presumably a clean-room implementation takes the same amount of effort as producing the original specification, so it doesn't save the effort of invention.


Oh, no, that's not true at all. You might not even know an invention is possible, let alone have a clear and (literally, provably) correct specification of its inputs and outputs. Just seeing that it exists, to say nothing of learning the correct proportions of inputs and outputs, could shave a decade off your invention cycle.

Take, for example, the nuclear bomb. Just knowing that it could be done put you ten steps ahead. What if cold fusion or a warp drive were known to be possible because you could see it (even if from a great distance with little detail)? Airplane manufacturers leapt ahead (literally) after the Wright Brothers.

A tremendous amount of effort for worthy inventions is often involved simply in proving that it can be done. Once you know it can be done, you don't have to prove it anymore, and also large companies will throw buckets of money at a clone of something that's proven to work.

A patent (sometimes) prevents that -- at least, when everything is working as it should be. (In this case, clearly not!)


The possibility of data compression has been known since 1977; video codecs have existed since 1984.


Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x

https://sci-hub.ru/10.1002/j.1538-7305.1948.tb01338.x

https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theo...

https://faculty.uml.edu/jweitzen/16.548/classnotes/Theory%20...


>FWIW, patents are one of the forms of intellectual property.

To keep the discussion well focused, I didn't want to get into the nuance of "patents are IP law too" since in broader discussions, IP and patents are usually discussed separately, even though yes, they exist under the same legal umbrella (Intellectual Property).

> Copyright cannot and does not prevent someone from clean-room engineering a replacement for your software. Patents can do that. (Whether or not they should is a different question, but that's not what you asked).

That's a fair point; only patents give an entity the legal "teeth" to do this. Though there is room to argue that a clean-room engineered replacement would then show the novelty and non-obviousness aspects of the patent to be invalid, and could be grounds for patent invalidation [0].

[0]: Arguably, the fact that courts are sorting this out and not specialized experts at the USPTO is one of the main drivers for why our patent system is broken. Federal judges are not required to be technical experts to oversee a patent case. In addition, this allows the USPTO to liberally grant patents as they pass the validity concerns off to the courtroom


Copyright can do that, in exactly the same way that patents can. It just happens not to as currently configured.

You might observe that a clean-room replacement is in and of itself evidence that a patent covering it was obvious and therefore not validly granted, which would tend to imply that patents cannot prevent this.


The application of patents and copyrights to software is identical to how they are applied in chemistry and other physical engineering disciplines -- I've worked with both. A patent covers the physical algorithm, a copyright covers the design of an implementation of the algorithm. In chemical engineering, these are licensed separately, but the patent is more important and the copyright is worth little in practice.

The algorithm is the expensive step; designing a novel implementation (a copyright) is purely mechanical, and any engineer can produce this part. If there were no patent, everyone would just pay an engineer to produce a new implementation of the chemistry algorithm. This would put the inventor of the chemistry at a huge disadvantage, since the cost of producing a new copyrightable implementation is the same for everyone, but only the inventor would have to amortize the cost of the invention. In many cases it would be more economical to never license the copyright from the inventor.

Regardless of the mechanism, the question ultimately comes down to who is going to pay for the cost of R&D. Copyright does not answer this question either in theory or in practice.

The alternative to patents is trade secrets, which have their own issues. In areas of software that use trade secrets almost exclusively, the state of the art is often decades ahead of the academic literature and open source. The cloud has been a huge boon for software trade secrets in that it makes reverse engineering difficult. Trade secrets make it difficult for outside people to advance the state of the art because the know-how is not public, and they create negative externalities in terms of employment contracts.

To address another notion, virtually no R&D is done in open source. This is an empirical observation made by many. The incentives for doing R&D in open source are very poor. There are already large gaps in technology between what is available in open source and what exists in closed source software. Again, it all comes down to who is going to pay the significant costs of R&D.


Your wording is a bit confusing I think. Algorithms are not physical as far as I'm aware. Since (pure) algorithms and formulas are different ways to express mathematics, my impression was also that they were not patentable as such. Maybe you meant something slightly different?

Also, do you have sources for the statement "virtually no R&D is done in open source"?


Algorithms are not physical either in software or chemistry. Patents in chemical engineering are essentially a set of differential equations that can be applied to a real system, no different than software algorithms. If you replace “molecules” with “bits”, it is identical to software. Patents in chemistry have no connection to specific physical machinery, they are abstract concepts. The reason software algorithms are patentable everywhere is that they can be manifested as concrete logic circuits and electronic circuits are patentable.

I have no source for the lack of R&D in open source. It is a widely held view even within parts of open source, often commented on, and generally not considered controversial. As an example I am personally familiar with, database technology is virtually all developed privately and is far ahead of what is available in open source. Open source tends to copy whatever bits leak out, is decades behind the state-of-the-art, and the gap has been getting worse over time.

Software that requires man-years of extremely specialized expertise to produce tends to be a poor fit for open source. The people with these skills are well-paid and in high demand, often with contractual clauses that do not allow them to work on open source. They have families and other interests. There are few incentives to spend years of their lives building this software for free.

If this kind of software is to become open source, it will require incentives that are not a pure loss for those that know how to build it. This is the current situation. Someone has to pay for it.


So, algorithms are not automatically (supposed to be) patentable like you suggest, quite the opposite. Although on both sides of the Atlantic people have been bending the rules one way or the other for quite some time.

US: https://en.wikipedia.org/wiki/Alice_Corp._v._CLS_Bank_Intern...

EU: https://en.wikipedia.org/wiki/Software_patents_under_the_Eur...

Nevertheless, software patents do not appear to be fit for purpose.

> "[open source, and thus public information!] is decades behind the state of the art"

It would appear that software patents are not actually incentivizing the disclosure of workable methods-of-the-art to society. In fact, I don't hear of people using software patent documents to make something, like people in mechanical engineering sometimes do. I would love to be shown to be wrong on this. AFAIK, a practitioner of the art cannot take a software patent and trivially implement it.

Unlike with a patent, a practitioner of the art can take a unit of FLOSS code and use and/or improve it. So, based on your view of the world, open source seems to be taking the niche that software patents should have been creating.

On the one hand, fortunately the situation isn't quite as horrible as you suggest, and there are in fact innovative FLOSS projects, in part because some companies are incentivized to release their work as FLOSS to begin with, or to work with a central FLOSS pool. On the other hand, this is all voluntary. There are often good incentives to defect from many different voluntary IP arrangements, even those that do include the use of patents (see the case of H.265).

I think -with regards to software- that we are going to need a very different way of approaching IP. The current patent system is quite clearly useless at getting people to actually disclose their secrets, so we'll need a different method.


To be clear, I am entirely in the camp of algorithm patents being largely ineffective in software. I am not advocating for them, just recognizing the reality today.

In my view, the default outcome will be trade secrets, and it is already the case in many software areas. This has limits in practice as software trade secrets do have a tendency to leak out. I know a few clever database algorithms that are almost certainly trade secrets somewhere (origin is unclear), passed down but not in any public literature. On the other hand, I am aware of major (qualitative) tech advancements in e.g. graph algorithms that have not leaked after 15 years.

I think we need to be clear about the objective with IP law.


> The best of the mainstream arguments against software patents has always been: What does Intellectual Property and Copyright not cover that patents do when it comes to software?

Instinctively, that’s where I think it should be, agreed.


Why not patent actual math then? For example, math that is used in machine learning: linear algebra, quadratic error etc.


Presumably there’s plenty of prior art, maths has been around for quite a while now ;)


Aren't physical machines just math too? We can simulate them on computers, clearly just math, but maybe can't fully mathematically describe the non-idealized versions we produce in reality. Why does that level of completeness of description need to serve as such a sharp line on patentability?


Yes, and there are theorems to that effect. There is no distinction between "software" and "hardware" in mathematics.


Why are any patents granted? I've yet to read a patent that wasn't just Maxwell's equations, quantum mechanics, and general relativity which are all just laws of nature and not patentable.

As far as software goes, here's a question that can be interesting to ponder. Suppose there was some clever, useful, non-obvious entirely mechanical invention that was patented. If someone else tried to sell a product that accomplishes the same thing as that invention by having a computer running a general purpose physics simulation program which is given a model of that patented invention, would that be an infringement of the patent on the mechanical device?


> would that be an infringement of the patent on the mechanical device?

No because a patent has to describe the mechanism (the non-obvious inventive step). If there are multiple ways to achieve the same thing then in practice it's hard to protect and the patent is probably worthless, if not too obvious to be granted in the first place.


> Software is math and math formulas are not patentable.

Software isn't really math.

Software is logic, and usually opinionated logic choices at that.


So then is software a branch of math, or a branch of philosophy?


Is it the machine instructions being patented or their novel compression method?


Math should be patentable, too. The idea that math is "discovered" instead of "invented" is bullshit.

That, or get rid of patents altogether.


> The idea that math is "discovered" instead of "invented" is bullshit.

Nope, not to mathematicians. We routinely talk about the existence of mathematical constructs. These things exist and can be discovered, just not physically.


Nope, math is invented; without people there is no math. Math is a logic system invented by humans. You don't have to use math to describe relations between things. Math is a language: it describes the real world and is not the real world itself. So math is invented.


Nope, math is discovered; even without people there is math. Math is how the universe works. When you describe relations between things, that is math. Math is notated using many languages, but the real world itself cares not which notation you use. So math is discovered.

----

Have a look at how various cultures around the world did maths before meeting Europeans. You will quickly stop thinking "Math is a language".

Hell, even European maths wasn't entirely European. The most popular number system in use to this day arrived in Europe via Arab traders and itself originated in ancient India, a culture that developed its own entirely different set of ways to explain some of the logic of the universe.

While the ancient Indian system of arithmetic would look very different to anyone with a standard school education today, both systems describe the exact same things: addition, multiplication, subtraction, and division of things.

If we were to meet an alien civilization, who'd undoubtedly have their own language(s) and culture(s), the fastest way for us to learn how to communicate with them would be to look at how they do maths. Because, while their language and notation of maths may be different, what they describe is going to be the same fundamental laws of this universe.

Math is the Rosetta Stone of the universe.


> Math is how the universe works

This is bullshit, and the mathematicians themselves know it.

Just one obvious example that everyone can understand: Euclidean geometry does not describe the universe, even if it's useful.

But more broadly, the fact that math is not how the universe works was proven with math: https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_...


Nobody's making the claim that Euclidean geometry is all of maths. But the part of the universe that Euclidean geometry represents has always, still does, and will continue to work even when the last traces of Euclidean geometry vanish from recorded knowledge and memory.

> ... link to Gödel's incompleteness theorems

That's a proof of some limits of formal systems — particularly those that want to formalise everything under one unified set of axioms — not limits of mathematics. Mathematics / the universe doesn't care one iota whether you use this particular set of axioms or another, or even any at all. It continues to work without a care for your need to have a grand unified theory. That you cannot discover all of its secrets because you restricted yourself is not its concern.

Maths is how the universe works, whether you understand it or not.

----

But thank you for linking to Gödel's theorems. Your link directly answers the topic being discussed. You'll notice the text never says "invented" when talking about these or related theorems; it says "discovered".


> Euclidean geometry does not describe the universe, even if it's useful.

A statement that doesn't disprove the thesis in the slightest.

e.g. there's non-Euclidean geometry, which some say is handy in a post-Newton, Einsteinian universe.

If that fails, I feel there'll be something else again that conforms better to the universe as we understand it to be.

> the fact that math is not how the universe works was proven with math

Another statement that fails to prove the thesis; the universe itself is sufficiently complex that there can indeed be things out there that we will never 'prove' to our satisfaction.

You need to do some lifting here (perhaps a little more than 'some') to prove that the Gödel/Church/Turing results demonstrate beyond doubt that maths cannot underpin the workings of a universe.

Your comment reminds me a little of Gödel's ontological "proof" .. full of sound and fury but not really landing.


Mathematical notation is a language, but actual math isn't. For example, the concept of there being one of something is an inherent feature of our reality, but drawing short vertical lines for them is a thing we do. Similarly, we didn't invent 3.14..., that's just how circles work. We only invented the shapes I just used.


Arbitrary semigroups clearly don't describe the real world. Nor does this prevent us using particular groups to describe particular quantum fields.


That doesn't follow; having patents doesn't imply that every single invention should have one "or else". You can design any patenting system you like.


Did someone invent 1+1=2?


Obviously, yes. Integers don't exist in nature, being an abstraction made up by the human mind.

Math is only “discovered” if by that we mean the ability of humans to arrive at the same ideas, simply because we think alike and live in the same environment.


Reminds me of LabView


Which for me raises the question: is LabView worth it? It's the only widely used Visual Programming Language I can think of.

Every time someone posts a visual programming language prototype, the same claims are made over and over again, like: adds too much complexity, throws out the benefits of text editing, eats up too much screen space, etc.

I'm convinced that many of those issues can be overcome, and until we figure out how to overcome them I'd like to understand in which niches visual programming really makes sense.

If LabView makes sense and is worth it, there ought to be more applications where it makes sense!


> It's the only widely used Visual Programming Language I can think of.

Unreal's blueprints are also very widely used in production.


Yeah, is using a visual programming language worth it in that case?


Yes. Non-coders often find it more approachable. These systems are a good fit for games because games have many static assets that need to be referenced. Dragging and dropping to form an asset reference is preferred to using text ids.


Historically, LabView had two other key things going for it besides the low entry barrier of visual programming.

1. A GUI comes for free; in LabView each program is two windows, each variable you declare shows up once on your code DAG and then again on your GUI as a switch, text box, slider, etc.

2. National Instruments also had a set of libraries and PCI cards for communicating with lab equipment like benchtop power supplies, voltmeters, and even up to big complex kit like oscilloscopes and exotic telecom protocol emulators.

Those two things enabled people to easily wire up a motley assembly of benchtop instruments and orchestrate them together for a unified test of hardware like circuit boards and so on.

Nowadays EE types have better coding skills and those benchtop instruments all have Ethernet ports and REST API's.
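
To give a rough idea of what that replacement looks like, here's a minimal sketch of talking to one of those instruments over plain SCPI on a raw socket. The IP address, port 5025, and the exact measurement command are assumptions for illustration; check the instrument's programming manual.

    import socket

    def scpi_query(host: str, command: str, port: int = 5025) -> str:
        """Send one SCPI command and return the instrument's reply."""
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall((command + "\n").encode("ascii"))
            return sock.recv(4096).decode("ascii").strip()

    print(scpi_query("192.0.2.10", "*IDN?"))          # ask the instrument to identify itself
    print(scpi_query("192.0.2.10", "MEAS:VOLT:DC?"))  # ask a DMM for a DC voltage reading (command varies by instrument)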

But SpaceX famously uses LabView for some things, most notably for the Dragon flight console.


I'm really baffled by all this discussion of copyright in the age of AI. Copilot does not 'steal' or reproduce our code - it simply LEARNS from it, as a human coder would. IMHO the desire to prevent learning from your open source code seems kind of irrational and antithetical to open source ideas.


The problem is not that Copilot produces code that is "inspired" by GPL code, it's that it spits out GPL code verbatim.

> This can lead to some copylefted code being included in proprietary or simply not copylefted projects. And this is a violation of both the license terms and the intellectual proprety of the authors of the original code.

If the author was a human, this would be a clear violation of the licence. The AI case is no different as far as I can tell.

Edit: I'm definitely no expert on copyright law for code, but my personal rule is: don't include someone's copyrighted code if it can be unambiguously identified as their original work. For very small lines of code, it would be hard to identify any single original author. When it comes to whole functions it gets easier to say "actually this came from this GPL licensed project". Since Copilot can produce whole functions verbatim, this is the basis on which I state that it "would be a clear violation" of the licence. If Copilot chooses to be less concerned about violating the law than I am then that's a problem. But maybe I'm overly cautious and the GPL is more lenient than this in reality.


"The problem is not that Copilot produces code that is "inspired" by GPL code, it's that it spits out GPL code verbatim."

But only snippets as far as I can tell.

This is the code example linked by the author:

https://web.archive.org/web/20221017081115/https://nitter.ne...

It is still not trivial code, but are there really lots of different ways to transpose matrices?

(Also, the input was "sparse matrix transpose, cs_", so his naming convention was specifically included. So it is questionable whether a user would get his code in this shape with a normal prompt.)

And just slightly changing the code seems trivial; at what point will it be acceptable?

I just don't think spending much energy there is really beneficial for anyone.

I'd rather see the potential benefits of AI for open source. I haven't used Copilot, but ChatGPT4 is really helpful at generating small chunks of code for me, enabling me to aim higher in my goals. So what's the big harm if some proprietary black box also gets improved, when all the open source devs can also produce with greater efficiency?


> (Also, the input was "sparse matrix transpose, cs_", so his naming convention was specifically included. So it is questionable whether a user would get his code in this shape with a normal prompt.)

This. People seem to forget that generative AIs don't just spit out copyrighted work at random, of their own accord. You have to prompt them. And if you prompt them in such a way as to strongly hint at a specific copyrighted work you have in mind, shouldn't some of the blame really go to you? After all, it's you who supplied the missing, highly specific input, that made the AI reproduce a work from the training set.

I maintain that, if we want to make comparisons between transformer models (particularly LLMs) and humans, then the AI isn't like an adult human - it's best thought of as having a mentality of a four year old kid. That is, highly trusting, very naive. It will do its best to fulfill what you ask for, because why wouldn't it? At the point of asking, you and your query are its whole world, and it wasn't trained to distrust the user.


But this means that Microsoft is publishing a black box (Copilot) that contains GPL code.

If we think of Copilot as a (de)compression algorithm plus the compressed blob that the algorithm uses as its database, the algorithm is fine but the contents of the database pretty clearly violate GPL.


While I do believe that thinking and compression will turn out to be fundamentally the same thing, the split you propose is unclear with NN-based models. Code and data are fundamentally the same thing. The distinction we usually make between them is just a simplification, that's mostly useful but sometimes misleading. Transformer models are one of those cases where the distinction clearly doesn't make any sense.


>And if you prompt them in such a way as to strongly hint at a specific copyrighted work you have in mind, shouldn't some of the blame really go to you?

If you, not I, uploaded my GPL'ed code to Github is the blame on you then?


> If you, not I, uploaded my GPL'ed code to Github is the blame on you then?

Definitely not me - if your code is GPL'ed, then I'm legally free to upload it to Github, and to an extent even ethically - I am exercising one of my software freedoms.

(Note that even TFA recognizes this and admits it's making an ethical plea, not a legal one.)

Github using that code to train Copilot is potentially questionable. Github distributing Copilot (or access to it) is a contested issue. Copilot spitting out significant parts of GPL-ed code without attaching the license, or otherwise meeting the license conditions, is a potential problem. You incorporating that code into software you distribute is a clear-cut GPL violation.


The GitHub terms of service state that you must give certain rights to your code. If you didn't have those rights, but they use them anyway, whose fault is that?


>And just slightly changing the code seems trivial, at what point will it be acceptable?

If I start creating a car by using one of Ford's blueprints, at what point will it be acceptable? I'd say even if you rework everything completely, Ford would still have a case to sue you. I can't see how this is any different. My code is my code, and no matter how much you change it, it is still under the same licence it started out with. If you want it not to be, then don't start with a part of my code as a base. In my opinion the case is pretty clear: this is only going on because Microsoft has lots of money and lawyers. A small company doing this would be crushed.


Easy. People get to throw rocks at the shiny new thing. To my untrained eye the entire idea of copyrighting a piece of text is ridiculous. Let me phrase it in an entirely different way from how any other person seems to be approaching it.

If a medical procedure is proven to be life-saving, what happens worldwide? Doctors are forced to update their procedures and knowledge base to include the new information, and can get sued for doing something less efficient or more dangerous, by comparison.

If you write the most efficient code, and then simply slap a license on it, does that mean, the most efficient code is now unusable by those who do not wish to submit to your licensing requirements?

I hear an awful lot of people complain all the time about climate change and how bad computers are for the environment, there are even sections on AI model cards devoted to proving how much greenhouse gases have been pushed into the environment, yet none of those virtue signalling idiots are anywhere to be seen when you ask them why they aren't attacking the bureaucracy of copyright and law in the world of computer science.

An arbitrary example that is tangentially related: One could argue that the company sitting on the largest database of self-driving data for public roads is also the one that must be held responsible if other companies require access to such data for safety reasons (aka, human lives would be endangered as a consequence of not having access to all relevant data). See how this same argument can easily be made for any license sitting on top of performance critical code?

So where are these people advocating for climate activism and whatever, when this issue of copyright comes up? Certainly if OpenAI was forced to open source their models, substantial computing resources would not have been wasted training competing open source products, thus killing the planet some more.

So, please forgive me if I find the entire field to be redundant and largely harmful for human life all over.


Yes, of course copyright is dumb and we'd all be better off without it. Duh.

The problem here is that Microsoft is effectively saying, "copyright for me but not for thee." As long as Microsoft gets a state-enforced monopoly on their code, I should get one too.


> If you write the most efficient code, and then simply slap a license on it, does that mean, the most efficient code is now unusable by those who do not wish to submit to your licensing requirements?

If you don't "slap a license on it" it is unusable by default due to copyright.


Could a human also accidentally spit out the exact code, having learned it in good faith rather than memorized it?

I guess the likelihood decreases as the code length increases, but it also increases the more constraints you impose on parameters such as code style, code uniformity, etc.


> Could a human also accidentally spit out the exact code, having learned it in good faith rather than memorized it?

That's just copying with extra steps.

The way to do it legally is to have 1 person read the code, and then write up a document that describes functionally what the code does. Then, a second person implements software just from the notes.

That's the method Compaq used to re-implement the original PC BIOS from IBM.


Indeed. Case closed. If an AI produces verbatim code owned by somebody else and you cannot prove that the AI hasn't been trained on that code, we shall treat the case in exactly the same way as we would treat it when humans are involved.

Except that with AI we can more easily (in principle) provide provable provenance of the training set and (again in principle) reproduce the model and prove whether it could create the copyrighted work even without having had access to the work in its training set.


>The way to do it legally is to have 1 person read the code

wasn't it to have one person run tests of what happened when different things were done, and then write up a document describing the functionality?

In other words I think one person reading the code is still in violation?


> Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners.

https://en.wikipedia.org/wiki/Clean_room_design


yes, reading that description it seems pretty clear to me that they did not read the code but they had access to the working system and then

>by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design.

reverse engineering is not 'reading the code'.


Theoretically maybe, but then they would have to prove in court that they did so without knowledge of the infringed code. You can't make that claim for an AI that was trained on the infringed code.


Yes, that's why any serious effort in producing software compatible with GPL-ed software requires the team writing code not to look at the original code at all. Usually a person (or small team) reads the original software and produces a spec, then another team implements the spec. This reduces the chance of accidentally copying GPL-ed code.


> Could a human also accidentally spit out the exact code, having learned it in good faith rather than memorized it?

Maybe, but that would still be copyright infringement. See My Sweet Lord.


It’s not accidental. Not infringing copyright isn’t part of the objective function like it would be for a human.


Not learning or not being inspired by copyrighted code is not a human function either though.


Has a human ever memorised verbatim the whole of github?

If someone somehow managed to do that and then happened to have accidentally copied someone's code, how believable would their argument be?


> Has a human ever memorised verbatim the whole of github?

No, and humans who have read copyrighted code are often prevented from working on clean room implementations of similar projects for this exact reason, so that those humans don't accidentally include something they learned from existing code.

Developers that worked on Windows internals are barred from working on WINE or ReactOS for this exact reason.


Hasn't that all been excessively played through in music copyright questions? With the difference that the parody exception that protects e.g. the entire The Rutles catalogue won't get you far in code...


> this would be a clear violation of the licence

Not necessarily. If it's just a small snippet of code, even an entire function taken verbatim, it may not be sufficiently creative for copyright to apply to it.

Copyright is a lot less black and white than most here seem to believe.


That’s part of the rub. YouTube doesn’t break copyright law if a user uploads copyrighted material without proper rights. Now, if YT were a free-for-all, then yeah. But given it does have copyright reporting functionality and automated systems, it can claim it’s making a good faith effort to minimize copyright infringement.

Copilot similarly isn’t the one checking in the code. So it’s on each user. That said, Copilot at some point probably needs to add some type of copyright detection heuristics. It already has a suppression feature, but it probably also needs to have some type of checker once code is committed and at that point Copilot generated code needs to be cross-referenced against code Copilot was trained on.


> If the author was a human, this would be a clear violation of the licence. The AI case is no different as far as I can tell.

We aren't talking verbatim generation of entire packages of code here, are we? Code snippets are surely covered under fair use?


It would almost surely be fair use to include a snippet of code from a different library in your (inline) documentation to argue that your code reimplements a bug for compatibility reasons.

In general it is not fair use if you are using the material for the same scope as the original author [0], or if you are doing it just to namedrop/quote/homage the original.

It is possible to argue that a snippet can be too small to be protected, but that would not be because of fair use.

[0] Suppose that some Author B did as above and copied a snippet of code in their docstring to explain the buggy behaviour of a library they were reimplementing. If you are then trying to reimplement B's library, you can copy the same snippet B copied, but you likely cannot copy the paragraph written by B where they explain the how and the why of the bug.


> Code snippets are surely covered under fair use?

...for "purposes such as commentary, criticism, news reporting, and scholarly reports"? Sure.

For a commercial product? Best check with your lawyer...


Oracle would like to have a word..


The Fair Use concept is specific to the USA.


> it's that it spits out GPL code verbatim

It's not a problem in practice. It only does so if you bait it really hard and push it into a corner, at which point you may just as well copy the code directly from the repo. It simply doesn't happen unless you know exactly what code you're trying to reproduce and that's not how you use Copilot.


Just because code exists in a copyrighted project doesn't mean that it is the only instance of that code in the world.

In a lot of scenarios, there is an existing best practice or simply only one real 'good' way to achieve something - in those cases, are we really going to say that, despite the fact that a human would reasonably come to the same output code, the AI can't produce it because someone else wrote it already?


This seems like a really, really easy problem to fix.

It should be easy to check Copilot's output to make sure it's not copied verbatim from a GPL project. Colleges already have software that does this to detect plagiarism.

If it's a match, just ask GPT to refactor it. That's what humans do when they want to copy stuff they aren't allowed to copy: they paraphrase it or change the style while keeping the content.
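
As a rough illustration of the checking step, here's a minimal sketch of verbatim-overlap detection: hash every run of N normalized lines from a corpus and flag generated code that reproduces any of them. The corpus path, file glob, and window size are made-up choices, and real plagiarism detectors (MOSS-style fingerprinting, for instance) are considerably smarter than this.

    import hashlib
    from pathlib import Path

    WINDOW = 6  # how many consecutive lines must match to count as "verbatim"

    def normalized_lines(text: str) -> list[str]:
        # strip blank lines and collapse whitespace so formatting changes don't hide copies
        return [" ".join(line.split()) for line in text.splitlines() if line.strip()]

    def window_hashes(lines: list[str]) -> set[str]:
        # hash every run of WINDOW consecutive normalized lines
        return {
            hashlib.sha256("\n".join(lines[i:i + WINDOW]).encode()).hexdigest()
            for i in range(len(lines) - WINDOW + 1)
        }

    def build_corpus_index(corpus_dir: str) -> set[str]:
        # index every .c file under the (hypothetical) GPL corpus directory
        index: set[str] = set()
        for path in Path(corpus_dir).rglob("*.c"):
            index |= window_hashes(normalized_lines(path.read_text(errors="ignore")))
        return index

    def looks_copied(generated_code: str, index: set[str]) -> bool:
        # any shared window of WINDOW lines counts as a verbatim hit
        return bool(window_hashes(normalized_lines(generated_code)) & index)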



So we should attack the problem of proprietary code, maybe from the Right to Repair angle. I believe there should be no such thing as closed source code.


Closed source code is beige corp-speak, its true name is 'malware'.


In Linus Torvalds' book "Just For Fun", there's a chapter about copyright where he presents both the upsides and downsides of it in a pretty balanced way. I think it's worth reading.


Bit of a false equation to act as though a massive computer system is the same as any individual.

People put code on github to be read by anyone (assuming a public repository), but the terms of use are governed by the license. Now you've got a system that ignores the license and scrapes your data for its own purpose. You can pretend it's human but the capabilities aren't the same. (Humans generally don't spend a month being trained on all github code and remember large chunks of it for regurgitation at superhuman speeds, nor can they be horizontally scaled after learning.)

You can still be of the opinion that this is fine, and I may or may not be fine with it as well; I just don't think the stated reason holds up to logic, or that other opinions ought to "baffle" you.


And GitHub’s EULA gives it the right to train Copilot on public code you host on GitHub.


The issue, though, is not the code I personally upload to my own public repositories, but the code that someone else uploads to Github by cloning my repository hosted somewhere other than Github.

Personally I have eschewed any personal use of Github since the MS acquisition and only ever use it where that's mandated by a client (so not my code). If you clone my code from elsewhere into a Github repo, that's just rude and contrary to my every intent and wish.

I think it's time to add a "No GitHub" clause as an optional add-on to the various open-source licenses.


So then the person who uploaded your code to GitHub has committed a copyright violation, and I’m sure GitHub would honor a request to remove your code from the model training corpus, as it was illegally uploaded to GitHub.


It’s not necessarily a copyright violation if the license permits copying. Under a permissive license, you are expressly permitted to copy the code and distribute copies provided you comply with whatever conditions the license mandates, without an explicit blessing of the copyright holders. Most popular licenses do not include a prohibition on training AI models. Maybe people should start including a clause.


Many popular licenses include a prohibition on being used to create proprietary software. GitHub Copilot is proprietary.


That's great, but GP's argument was

> Copilot does not 'steal' or reproduce our code - it simply LEARNS from it, as a human coder would.

Not "the terms of use you agreed to allow them to do it". Different argument with different amount of merit in my opinion


Agreed. I was just saying that in the current environment GitHub has that license and nobody else has. So if the courts decide one day that, because machines learn differently from humans, they will allow copyright holders to add a license exception that disallows machine training, then GitHub will benefit from this. It’s kind of ironic. What’s best for society is to not have any such law enacted and to continue to allow open source models to progress alongside proprietary ones (in addition to more level competitive dynamics on the proprietary side).


They could just train a model on GPL code that can only be used on GPL code.

For MIT licenses that's impossible currently because of the requirement to mention the authors.


Copilot has been caught multiple times reproducing code verbatim. At some point it spat out some guy's complete "about me" blog page. That's not learning, that's copying in a roundabout way.

Also, AI doesn't learn "like a human". Neural networks are an extremely simplistic representation of a biological brain and the details of how learning and human memory works aren't even all that clear yet.

Open source code usually comes with expectations for the people who use it. That expectation can be as simple as requiring a reference back to the authors, adding a license file to clarify what the source was based on, or in more extreme cases putting licensing requirements on the final product.

Unless Microsoft complies with the various project licenses, I don't see why this is antithetical to the idea of open source at all.


No disrespect, but I am baffled by your statement that it learns, even going so far as to say it learns as a human coder would.

I don't really want this comment to be perceived as flame bait (AI seems to be a very sensitive topic, in the same sense as cryptocurrency), so instead let me just pose a simple question. If Copilot really learns as a human does, then why don't we just train it on a CS curriculum instead of millions of examples of code written by humans?


I think the comment was trying to draw the distinction between a database and a language model. The database of code on GitHub is many terabytes large, but the model trained on it is significantly smaller. This should tell us that a language model cannot reproduce copyrighted code byte for byte because the original data simply doesn't exist. Similarly, when you and I read a block of code, it leaves our memory pretty quickly and we wouldn't be able to reproduce it byte for byte if we wanted. We say the model learns like a human because it is able to extract generalised patterns from viewing many examples. That doesn't mean it learns exactly like a human but it's also definitely not a database.

The problem is that in reality, even though the original data is gone, a language model like Copilot _can_ reproduce some code byte for byte, somehow drawing the information from the weights in its network, and the result is a reproduction of copyrighted work.


I see what you're going for, and I respect your point of view, but also respectfully I think the logic is a little circular.

To say "it's not a database, it's a language model, and that means it extracts generalized patterns from viewing examples, just like humans" to me that just means that occasionally humans behave like language models. That doesn't mean though that therefore it thinks like a human, but rather sometimes humans think like a language model (a fundamental algorithm), which is circular. It hardly makes sense to justify that a language model learns like a human, just because people also occasionally copy patterns and search/replace values and variable names.

To really make the comparison honest, we have to be more clear about the hypothetical humans in question. For a human who has truly learned from looking at many examples, we could have a conversation with them and they would demonstrate a deeper sense of understanding behind the meaning of what they copied. This is something an LLM could not do. On the other hand, if a person really had no idea, like someone who copied answers from someone else in a test, we'd just say, well, you don't really understand this and you're just x degrees away from having copied their answers verbatim. I believe LLMs are emulating this behavior and not the former.

I mean, how many times in your life have you talked to a human being who clearly had no idea what they were doing because they copied something and didn't understand it all? If that's the analogy that's being made then I'd say it's a bad one, because it is actually choosing the one time where humans don't understand what they've done as a false equivalence to language models thinking like a human.

Basically, sometimes humans meaninglessly parrot things too.


> The database of code on GitHub is many terabytes large, but the model trained on it is significantly smaller.

This just means it's a really efficient lossy compression algorithm, not that it learns like a human.


> why don't we just train it on a CS curriculum instead of millions of examples of code written by humans?

I've never studied computer science formally, but I doubt students learn only from the CS curriculum? I don't even know how much knowledge a CS curriculum entails, but I don't, for example, see anything wrong with including example code written by humans.

Surely students will collectively also learn from millions of code examples online alongside the study. I'm sure teachers also do the same.

A language model can also only learn from text, so what about all the implicit knowledge and verbal communication?


What they are saying is that if you’ve studied computer science, you should be able to write a computer program without storing millions or billions of lines of code from GitHub in your brain.

A CS graduate could work out how to write software without doing that.

So they’re just pointing out the difference in “learning”.


LLMs are not storing millions or billions of lines of code, and neither are we. Both store something more general and abstract.

But I'm saying there's a big difference between a CS graduate and some current LLM that learns from "the CS curriculum". A CS graduate can ask questions, use google to learn about things outside of school, work on hobby projects, study existing code outside of what's shown in university, get compiler feedback when things go wrong, etc.

All a language model can do is read text and try to predict what comes next.


We do but we also simulate it doing homework very well.


AI doesn't "learn". It's statistical inference if anything.

If I took two copyrighted pictures and layered them on top of each other at 50% opacity, would that be OK or copyright infringement?

AI models just use more weights/biases and more images (or any input).


And what is LEARNING in your opinion?


Cambridge dictionary has it as: "knowledge or a piece of information obtained by study or experience".

If I scanned a thousand polaroid pictures, and took their average RGB values and created a LUT that I could apply to any photograph to make it look "polaroidy" - would that be learned? Or the application of a statistical inference model? This alone is probably far enough abstracted to never be an ethical or legal issue. However, if I had a model that was only "trained" on Stephen King books, and used it to write a novel, would that be OK? Or do you think it would be in the realm of copyright infringement?

By your definition anything a computer does means it has learned it. If I copy and paste a picture, has the computer "learned" it while it reads out the data byte-by-byte? That sure sounds like it is "studying" the picture.

"AI" and "ML" are just statistics powered by computers that can do billions of calculations per second. It is not special, it is not "learning". To portray some value to it as something else is disingenuous at best, and fraud at worst.


Your polaroid example would require someone to write code that does that one specific thing. You could also argue that this would violate copyright if it was trained on some photographer's specific unique style, made as an app and marketed as being able to mimic the photographer's style. But in your example you have 1000 random polaroid images of unknown origin, so somehow it becomes abstract enough that it doesn't become an issue.

In your Stephen King example I would say it's still learned, because the "code" is a general language model that can learn anything. It's just that you decided to only train it on Stephen King novels. If you have an image model that trained 100% on public domain images and finetune it to replicate a specific artist's style, I would personally think the finetuned model and its creator are maybe violating copyright.

But when it comes to learning I would say when you write a program whose purpose is to learn the next word or pixel, but it's up to the computer to figure out how to do that, the computer is learning when you feed it input data. It's the program's job to figure out the best way to predict, not the programmer. (it's not that black and white given that the programmer will also sometimes guide the program, but you get the idea)

When you write a program that does one or several things, it's not learning.

I think it's something to do with the difference between emergent behavior from simple rules and intentional behavior from complex rules.


I think you're using fancy language like "general language model" to obscure the facts.

If I created a program to read words from the input and assign weights based on previous words, I could feed in any data. Just like the polaroid example. (I suggested that the polaroid example was abstract enough not to be an ethical/legal problem because I believe it is mostly transformative, unless the colours themselves were copyrighted or a distinct enough work in themselves.)

Now if I only feed in Stephen King books and let it run, suddenly it outputs phrases, wording, place names, character names, adjectives all from Stephen King's repertoire. Is this a 'general language model'? Should this be copyright exempt? I don't think this is transformative enough at all. I've just mangled copyrighted works together, probably not enough to stand up against a copyright claim.

I think people use AI and ML as buzzwords to try and obfuscate what's actually happening. If we were talking about AI and ML that doesn't need training on any licensed or copyrighted work (including 'public domain') then we can have a different conversation, but at the moment it's obscured copyright theft.


I can agree it's obscure in the sense that we shrug when asked about how it works. If you specifically train a model to mimic a specific style I can get behind it leaning more towards theft, or at least being immoral regardless of laws.

If you train a model to replicate 10000 specific artists, I could also get behind it being more like theft.

But if the intention was to train with random data (and some of it could be copyrighted) just like your polaroid example to generate anything you want, I'm not so sure anymore.

I feel the intent is the most important part here. But then again I don't know the intent behind these companies, and I guess you don't either. Maybe no single person working in these companies know the intent either.

It also gets murky when you have prompts that can refer to specific artists, and when people who use the models explicitly try to copy an artist's style. In the case of Stable Diffusion, if the CEO is to be believed, the CLIP model had learned to associate Greg Rutkowski and other artists with images that were not theirs but in a similar style [0].

Even murkier is when you have a base model trained on public data, but people finetune at home to replicate some specific artist's style.

[0] https://twitter.com/EMostaque/status/1571634871084236801


> If I scanned a thousand polaroid pictures, and took their average RGB values and created a LUT that I could apply to any photograph to make it look "polaroidy" - would that be learned?

You wouldn't. The LUT would.


It's data. No one owns data.


Can I have your credit card number, expiry and verification number please? Also your DNA?

Since it’s data, that should be cool, right?


Equating human cognition with machine algorithms is the root of the issue, and a significant part of its "legitimacy" comes from the need for "AI" companies to push their products as effective, and there's no better marketing than to equate humans to machines. Not even novel.


It requires abstraction. Something that LLMs are not capable of, beyond trivial amounts.


TRAINING your 3rd eye/branch predictor

if (nonfree_software) {
    // unhappy path
}


You can make out the two original copyrighted pictures in that case, and all you did was use 50% opacity, which might not be very transformative, so probably?

In my mind (and I suspect others' too), in the machine learning context, statistical inference and learning have become synonymous with all the recent developments.

The way I see it, there's now a discussion around copyright because people have different fundamental views on what learning is and what it means to be human, views that don't really surface.


If "like a human" is enough to get human rights then why did I get a parking ticket even when I argued that my car just stands there like a human ? This really isn't as good a defense as people portray. There are a lot of rights and privileges granted to humans but not to objects - we can all agree on that I think.


And if you need a person with supercharged rights and a slippery amount of liability...form a corporation.


There is a difference between a person learning and a commercial product learning from someone else’s work, probably ignoring all the licenses.


To be fair, when a programmer learns from publicly available but not public-domain code, and then applies the ideas, patterns, idioms and common implementations in their daily job as a software developer, the result is very much a "commercial product" (the dev company, the programmer themselves if a freelancer) learning from someone else's work and ignoring all the licenses.

The only leap here is the fact that the programmer has outsourced the learning to a tool that does it for them, which they then use to do their job, just as before.


No, the difference is that OpenAI has a huge competitive advantage due to direct partnership with Github, which is owned by Microsoft. In fact, it's even worse. With OpenAI making money from GPT, Github has even less incentive to make data easily available to others because that would allow for competition to come in. I wouldn't be surprised if Github starts locking down their APIs in the near future to prevent competitors from getting data for their models.

Nobody is arguing against uploading code. It's about Github/Microsoft specifically.


I agree there's a difference in the ease of access, a competitive advantage, sure. And I get that people writing public-source (however licensed) software don't want to make it easier for them (as in, Microsoft) to make money off of "learning" (of the machine type) from it. That's fair.

However, at a first glance, it still feels to me like an unavoidable reality that if you publish source code it'll eventually be ingested by Copilot or whatever comes next.

I mean, for the rest of the content all the new fancy LLMs have been trained with, there wasn't a Github equivalent. They just used massive scraped dumps of text from wherever they could find them, which most definitely included trillions of lines of very much copyrighted text.

In short: not only I don't really see an issue with Copilot-like AIs learning from publicly available code (as I described in the GP comment) but I also think if you publish code anywhere at all it's inevitable that it'll end up in Copilot, regardless of where you host it. If you want to make it more expensive for Microsoft to scrape it, sure, go ahead, but I don't think it matters in the long run.


> However, at a first glance, it still feels to me like an unavoidable reality that if you publish source code it'll eventually be ingested by Copilot or whatever comes next.

I'd be quite careful with this view.

By your logic, it should be OK to take the Linux kernel, copy it, build it, then sell it and give nothing back to the community that built it - and then just blame the authors for uploading it to the internet?


> all this discussion on copyrights in the age of AI.

Copyright is a thing; AI does not change that.

> does not 'steal' or reproduce our code - it simply LEARNS from it as a human

And here we have the central problem: does it act like a human or does it not? Humans copy things they learn all the time; some of us know various songs by heart, others will even quote entire movies from memory. If AI can learn and reproduce things like humans do, then you need to take steps to ensure that the output is properly stripped of any content that might infringe on existing copyrighted works.


There is a definite difference between singing a song while walking down the street and writing down the lyrics, putting it in a database, claiming it’s my content and then selling it on, even if it’s slightly rehashed.


I would have no problem if such AI systems were also completely open source, could be run by me on my own system, and came with all the models needed to use them, also easily available (again under some form of open-source license). I genuinely don't see that happening in the future with Big Tech. As such, as a proponent of the FSF's GPL philosophy, I have no interest in supporting such systems with my hard work, my source code. So yes, I do consider it stealing - my hard labour in any GPL open source work is meant for the public good (for example, to preserve our right to repair by ensuring the source code is always available through the GPL license). Any corporation that uses my work for profit, without either paying me or furthering the public good that I am aiming for, is simply exploiting me and the goodwill of others like me.


Copilot does not steal. Copilot does not learn. If you want to apply these concepts to LLMs, first prove how an LLM is human and then explain why it doesn’t have human rights.

Rather, Copilot is a tool. Microsoft/ClosedAI operate this tool commercially. They crawl original works and, by running ML on them, automatically generate and sell derivative works from those originals, without any consent or compensation. They are the ones who violate copyright, not Copilot.


Whether an LLM actually learns is completely tangential to the topic at hand. A human coder who learned from copyrighted code and then reproduced that code (intentionally or not) would be in violation of the copyright. This is why projects like Wine are so careful about doing clean room implementations.

As an aside, it seems really strange to invoke "open source ideas" as an argument in favor of a for-profit company building a closed source product that relies on millions of lines of open source code.


It's also fair to say that a lot of this carefulness has probably made life difficult for the developers of Wine, but they wanted to avoid Microsoft's legal team. So they respected copyright law.

Here is Microsoft doing as Microsoft does…


I'm in several communities for smaller/niche languages, and asking questions about things that have few sources makes it much clearer that it's not "learning" but grabbing passages/chunks of source. Maybe with subjects that have more coverage it can assimilate more "original"-sounding output.


Plenty of people already argued that LLMs don't actually learn like a human. However, you should keep in mind the reason why clean-room reverse engineering exists: humans learn from source material. FLOSS RE projects (e.g. nouveau) typically don't like leaks, because some contributors might be exposed to copyrighted material. Sometimes, the opposite happens: people working on proprietary software are not allowed to see the source of a FLOSS alternative.


> it simply LEARNS from it as a human coder would learn from it.

It doesn't LEARN anything, let alone like a human coder would. It has absolutely zero understanding. It's not actually intelligent. It's a highly tuned mathematical model that predicts what the next word should be.
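
To make "predicts what the next word should be" concrete, here's a deliberately tiny sketch in Python - a bigram counter, nothing like a real LLM in scale or architecture, but the same "pick a likely continuation" idea:

    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat and the cat slept".split()

    # Count how often each token follows each other token.
    following = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        following[prev][nxt] += 1

    # "Generate" by repeatedly emitting the most likely next token.
    token, output = "the", ["the"]
    for _ in range(4):
        token = following[token].most_common(1)[0][0]
        output.append(token)

    print(" ".join(output))  # -> "the cat sat on the"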


I can also learn things with no understanding (like a foreign word); I doubt that would make me immune to copyright?


If you were to learn a phrase that insulted the king in Thai, and said it in Thailand, you would end up in jail. Doesn't matter if you understood what the phrase said. Ignorance doesn't make you immune to consequences.


Your comment implies that we’re in some age of AGI, but we’re not there yet. Some argue that we’re not even close, but who knows, that’s all speculation.

> it simply LEARNS from it as a human coder would learn from it.

The LLM doesn't learn; the authors of the LLM are encoding copyright-protected content into a model using gradient descent and other techniques.

Now as far as I understand the law, that’s OK. The problems arise when distribution of the model comes into play.

I'm curious, are you a programmer yourself? Don't take this the wrong way, but I want to understand the background of people who come to the kind of conclusions you seem to have arrived at about how LLMs work.


> it simply LEARNS from it as a human coder would learn from it

What humans do to learn is intuitive, but it is not simple. What the machine does is also not simple, it involves some tricky math.

Precisely if the process was simple, then it could be more easily argued that the machine is "just copying" - that is simple.

There's a lot of nuance here.

What the machine is doing "looks similar to what humans do from the exterior", the same way that a plane flying "looks similar" to a flying bird. But the airplane does not flap its wings.

> kind of irrational and antithetical to open source ideas

Open source ideas are not the only ideas in town.


Humans don't learn an algorithm by memorizing a particular implementation character by character.


That's all the more reason for the utility of solutions like Copilot? Humans are limited in both time and memory.

Though, GitHub would do well to also bake in appropriate attributions if a significant portion of the generated code is a copypasta.
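
One way such an attribution check could work - purely a hypothetical sketch, not GitHub's actual mechanism - is to compare token n-grams of a suggestion against an index of known sources and flag high verbatim overlap:

    def ngrams(text, n=6):
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def overlap_ratio(generated, source, n=6):
        # Fraction of the suggestion's n-grams that appear verbatim in the source.
        gen = ngrams(generated, n)
        return len(gen & ngrams(source, n)) / len(gen) if gen else 0.0

    known_source = "int fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }"
    suggestion   = "int fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }"

    if overlap_ratio(suggestion, known_source) > 0.8:
        print("significant verbatim overlap - surface attribution and license info")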


Neither does copilot.


But it does though. There have been many times where this was the case.


It only happens if you bait it really hard and push it into a corner. That's not representative at all. I use Copilot to write highly niche code that's based on my own repo. It's simply amazing at understanding the context and suggesting things I was about to write anyway. Nothing it produces is just copy-pasted character by character. Not even close.


As others have pointed out, it means the model contains copyrighted material. So I guess that's totally illegal. Like if I ripped a Windows ISO, zipped it up and shared it with half the world. You know what would happen to me, don't you?


Not the same thing at all. The data isn't just sitting there in a store inside the model that you can query. No-one would be able to look at the raw data and find any copyrighted material, even if all it was trained on was copyrighted code (which I agree is an issue).


There’s a lot of misconceptions here but LLMs and stable diffusion have spat out copyrighted material verbatim.

So that’s not accurate.


What is not accurate? They are still not storing any material internally, even if the patterns they have learned can cause them to output copyrighted material verbatim. People need to break out of the mental model that an LLM is just a bunch of pointers fetching data from an internal data store.


Have a read through other comments on this thread, you'll see some good examples.


And airplanes don't flap their wings, but we still agree that they're flying, just as birds do.


There are people who do it... I personally know a guy with a photographic memory.


He doesn't get an exemption from copyright law, or does he?


Humans are intentionally loading giant sets of curated data, for training purposes, into a supercomputer to produce a model which is a black box, and they have provided zero attribution or credit to those who made this work possible. Humans are tuning these models to produce the results you see.

In the case of ChatGPT-x, OpenAI is a company disguised as a not-for-profit, with a goal of producing ever more powerful models that may eventually be capable of replacing you at work, while seemingly having no plan to give back to those whose work was used to make them insane amounts of money.

They haven't even given back any of their research. So it's OK to take everyone's open source work and not give back, is it?

This isn’t some cute little robot who wakes up in the morning and decided it wants to be a coder. This is a multi-national company who has created the narrative you’re repeating. They know exactly what they’re doing.


"Learning" is a technical term, AI doesn't really learn the same way a human does. There is a huge difference between allowing your fellow human beings to learn from you and allowing corporations to appropriate your knowledge by passing it through a stochastic shuffler.


Individuals can train their own LLMs too.


Copilot is run by a corporation, and the model is owned by the corporation - despite being trained on open source data.

In general, individuals will have problems with the first L of LLMs - unless the community invents a way to democratise LLMs and deep learning in general. So far the deep learning space is a much less friendly place for individuals than software was when the ideals of the open source movement were formed.


A full LLM is too expensive for individuals to train, but LoRAs aren't.

There are multiple open source LLMs out there that can be extended.
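
As a rough illustration of how approachable the adapter route has become - a minimal sketch, assuming Hugging Face's transformers and peft libraries; the base model name and hyperparameters are placeholders:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Start from an openly licensed base model (placeholder choice).
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Attach small low-rank adapter matrices instead of updating every weight.
    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["c_attn"], task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)

    # Only the adapter parameters train - a tiny fraction of the full model.
    model.print_trainable_parameters()
    # ...then fine-tune on your own data with an ordinary training loop.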

We can already see it in the AI art scene. People are training their own checkpoints and LoRAs of celebrities, art styles and other stuff that aren't included in base models.

Some artists demand to be excluded from base model training datasets, but there's nothing they can do against individuals who want to copy their style - other than not posting their art publicly at all.

I see the same thing here. If your source code is public - someone will find a way to train an AI on it.


>it simply LEARNS from it as a human coder would learn from it

I thought that was a sarcastic remark, given the capitalization of 'learn', but the 'IMHO' that followed dispelled that impression.

We have no idea how humans learn, and the 'AI' has a statistical approach, not much more than that.


A human who learns to copy code letter for letter does just that: copies code. Same with an AI.

The interesting debate should be about what happens in the gray area, when you read a lot of code and learn patterns and ideas.


Code is, at best, a trade secret (it is also data). Keep it close to your chest, or don't.


But, to be clear, what you can and can't do with certain code depends on the license. Imagine code that is "open source" as in openly visible and available, yet the license explicitly forbids using it to train any AI/LLM. Now how could the creator enforce that? Don't get me wrong, I am aware that enforcement of such licenses is already hard (even for organizations like the FSF) - but now you are going up against something automated, where you might not even know what exactly happens.


Potayto potahto. We all know there's a difference between training a machine learning model and learning a skill as a human being. Even if you can trick yourself into believing AI is just kinda like how human brains work maybe, the obvious difference is that you can't just grow yourself a second brain and treat it like a slave whereas having more money means you can build a bigger and better AI and throw more resources at operating it.

Intellectual property is a nebulous concept to begin with, if you really try to understand it. There's a reason copyright claim systems like those at YouTube don't really concern themselves with ownership (that's what DMCA claims are for) but instead with the arbitrary terms of service that don't require you to have a degree in order to determine the boundaries of "fair use" (even if it mimics legal language to dictate these terms and their exemptions).

The problem isn't AI. The problem is property. Ever since Enclosure we've been trying to dull the edges of property rights to make sure people can actually still survive despite them. At some point you have to ask yourself if maybe the problem isn't how sharp the blade you're cutting yourself is but whether you should instead stop cutting. We can have "free culture" but then we can't have private property.


> IMHO desire to prevent learning from your open source code seems kind of irrational and antithetical to open source ideas

You may be right that this is antithetical to "open source" ideas, as Tim O'Reilly would've defined it - a la MIT/BSD/&c., but it's very much in line with copyleft ideas as RMS would've defined it - a la GPL/EUPL/&c. - which is what's being explicitly discussed in this article.

The two are not the same: "open source" is about widespread "open" use of source code, copyleft is much more discerning and aims to carefully temper reuse in such a way that prioritises end user liberty.


> it simply LEARNS from it as a human coder would learn from it.

This is really not how LLMs work.


A key difference is that a company is making a proprietary paid product out of the learnings from your code. This has nothing to do with open source.

If the data could only used by other open source projects, e.g. open source AI models, I don't think anyone would complain.

You could argue "well, but anyone can use the code on Github", and while that's technically true, it's obvious that with Github owned by Microsoft and Microsoft heavily invested in OpenAI, OpenAI gets a huge competitive advantage due to internal partnerships.


Imagine if folks got royalties on commits, or the language model was required to be open as well.


The company that trains/owns the AI steals the content.


> it simply LEARNS from it as a human coder would learn from it

Does it though? It "learns" correlations between tokens/sequences. A human coder would look at a piece of code and learn an algorithm. The AI "learns" token structure. A human reproducing original code verbatim would be incidental. AI (language model, at least) producing algorithm-implementing code would be incidental.


If that were true, Copilot would have been scanning Windows and Office source code. But we don't see that.


Nobody wants that.


I want that. I very much want someone to take one of the Windows code leaks, use it to train an LLM, and then make a fork of ReactOS with AI-completed implementations of everything ReactOS hasn't yet finished. Because then we could find out if Microsoft really believes that LLMs are fair use :)


Apes love moralizing and being indignant. This joker wants to share open source code and restrict what other people do with it.


So, like any license except public domain?

Have you personally ever put out something in public domain?


I’ve released plenty under GPL. Not possible to assign to public domain everywhere.


So you restrict what to do with it…


The author is a CS professor at the University of Winnipeg.


He fits the perfect profile for someone with the start of a P vs NP proof: a full professor in computer science, but not really in the subfield of CS theory.


Well, a strategic goal of letting Europeans suffer during winter in order to shatter our resolve to support Ukraine, while avoiding contractual penalties for unilaterally cutting off gas delivery, can explain the benefit of exploding the pipeline. It can also explain the fact that the fourth pipe was left intact - as a potential carrot to Europe... I do not pretend that the above is the truth, but it seems to be a plausible explanation.


I don't see why this is so downvoted. People are doing a lot of mental gymnastics to discount the possibility of Russian sabotage here, and I really don't understand why that is. Nothing is certain, but I think it's silly to discount Russia being responsible for the reasons people are giving.


I’m genuinely curious where the gymnastic is? I am asking in good faith. What is Russia’s interest in sabotaging it? They can’t sell gas to fund their war machine.

What's the US interest? Now Europe must buy LNG from our fantastic gas companies. US exports of LNG are selling at all-time high prices. Europe is pissed off at our rival. Not to mention our commander in chief threatened to blow it up - why?

Everything here says it’s in the interest of the USA, not Russia. I don’t feel like I’m bending any logic, in fact I feel like this is the only common sense way to think about it.


What people are saying is that he wouldn't be stupid enough to kill off something that is a vital source of foreign currency in wartime, and could be used as leverage in negotiating with the EU. Except it wasn't operational and was proving to be useless as a bargaining chip.

The gymnastics come in believing that the guy who was willing to throw caution to the wind in starting a war with Ukraine (causing his economy to take a hammering, killing trade with the EU, and running the real risk of pushing Europe entirely into NATO's arms) is somehow going to be precious about a pipeline that had ceased to have any value. I don't see it.


Russia is the only country who could do it without risking provoking a war.

Russia knew they weren’t going to make any money on it due to the strong reaction to the war on Ukraine.

Russia could have decided to do it to remind Western Europe that it could do the same to other pipelines they depend on.

None of that is determinative that Russia did do it, but it’s sufficient to me to indicate that they could have done it, and insisting that it’s absurd is, well, absurd.


Russia is also not a single actor. There are factions, and they have different interests. The "liberals" aka the economic bloc in the administration - who know full well what kind of shitshow is ahead in the long term - would prefer the war to die down, and for gas to start flowing again, preferably in exchange for dropping sanctions. Because of that, they're a potential threat to the war faction; think of how some parts of the German military repeatedly tried to arrange for secret negotiations with the Allies for an example of how it could materialize. But if there are no pipelines, the "liberals" have no gas to offer in exchange for concessions even if they somehow manage to stop the war and remove the war party from power. This makes their platform less attractive.

(To be clear, this is all just conjecture. The factions are real enough, and have motives that could make players in this game, but it's obviously not the only viable explanation.)


I find this the most likely scenario too. Blowing up the pipeline cemented that Europe wasn't going to get Russian gas for the winter. It left less maneuvering room for any interest groups that could've used it as a bargaining chip as Europe was heading into winter: many expected widespread heating and power issues, and some local authorities in Europe even planned for possible mass evacuations of vulnerable populations. Gas was the most valuable concession that anyone in Russia could've made. The pipeline blew up and that was off the table. Nobody knew at the time that winter 2022/2023 was going to be exceptionally warm, and in hindsight it's easy to downplay those fears and the perceived value of the pipeline.


Really? You think Germany would declare war on the US?


Russia owns the pipelines. The U.S. would be risking war with Russia by blowing them up.


You know the USA sends sophisticated military weapons and training to Ukraine, including actual HIMARS used to blow up a Russian barracks and kill 400 Russians?

But you think the USA blowing up an inactive pipeline might be the thing that provokes a war and therefore the USA would never do it?

https://www.telegraph.co.uk/world-news/2023/01/02/hundreds-k...


Russia doesn't have a casus belli. It's not Russian soil, and the pipeline isn't even owned by a Russian company.


My best guess as to Russian interest is that by performing the sabotage in a difficult-to-attribute manner it plants the seeds for anti-US (and by extent, anti-Ukraine) sentiment in the public perception in the EU. More or less the governments of EU countries need the will of the people to back military aid to Ukraine, so giving the public something to latch on to from an anti-US perspective is valuable in that it could lower or prevent aid.

The Russians can also probably be secure in knowing that NATO won't publicly identify Russia as responsible as that would probably mean war and it has been very clear that NATO's top priority is avoiding the spread of the war in Ukraine.

There are some other possible Russian motivations I've seen but put less stock in, like capability demonstration and limiting incentives for a coup.


You need to stop thinking like a rational capitalist. It's a common problem for westerners to try to look at Russia from this angle.

It makes the cost of energy go up in Europe and the US, which exacerbates the existing problem of monetary inflation caused by the recent pandemic. It was meant as punishment for the West's support of Ukraine to increase civil discontent with the existing pro-Ukraine western governments. This allows the electorate to be more receptive to populist candidates funded by Russia with anti-Ukraine positions which will give Russia what they want.

Then, Ukraine goes to Russia, gas goes back to Europe via the still intact pipeline, European money goes back to Russia, and Russia wins.


Now here are the gymnastics not found in a “rational capitalist” explanation of the events.

