Saying Goodbye to GitHub (ersei.net)
504 points by donutshop on April 3, 2023 | 437 comments



I'm gonna get hated on for this, but I don't think "give back" is an open source concept.

I'm not aware of any Open Source license, or Free license for that matter, that has a give-back clause. Source code is available to -users-, not prior authors.

Some Open Source licenses (MIT, BSD, etc.) can be used in proprietary code with little more than simple attribution.

Those developers chose that license for a reason, and I've got no problem with commercial entities using that code.

There is a valid argument to be made about the training of models on GPL code. (Argument in the sense that there are two sides to the coin.) On the one hand, we happily train humans on GPL code. Those humans can then write their own functions, but for trivial functions they're gonna look a lot like the GPL source.

If the AI is regurgitating GPL code as-is, then that's a problem- not dissimilar to a student or employee regurgitating the same code.

But this argument is about Free software licenses, not really (most?) Open Source licenses.

Either way, OSS/Free is not about "giving back", it's about giving forward.

In the specific case here of Copilot making money, I'd say A) you're allowed to make money from Free/OSS code, and B) no one is forcing you to use this feature.


> I'm not aware of any Open Source license, or Free license for that matter, that has a give-back clause. Source code is available to -users-, not prior authors.

In essence, copyleft licenses are exactly that. They oblige the author of a derived work to publish the changes to all users under the same terms. The original authors tend to be users. So, a license which would grant this directly to the original authors would end up providing the same end result since the original authors would be both allowed to and reasonably expected to distribute the derived work to their users as well.

This aligns with the reason why some people publish their work under copyleft licenses: You get my work, for free, and the deal is that if you find and fix bugs then I get to benefit from those fixes by them flowing back to me. Obviously, as long as you only use them privately you are not obliged to do anything; the copyleft author gives you that option. But once you publish any of this, we all share the results.

That's the spirit here and trying to argue around that with technicalities is disingenuous. That's what Copilot does since it ignores this deal.


> That's the spirit here

not really.

This whole Free Software movement started with something really similar to "right to repair": a firmware bug in a printer whose software was proprietary. Free Software is about being in control of the software you use. The spirit was never "contribute back to GNU"; the spirit was always "if you take GNU software, you can't make it non-free". Those GNU devs at the time just wanted a good and actually free/libre OS, one that would remain free no matter who distributed it.

You are projecting the expectations of modern-day devs, formed in a world of heavily social development thanks to GitHub.

You might claim that GP was using the technicalities of the licences, but you can actually check the whole FSF philosophy and you'll note that it aligns perfectly with "giving forward", not "giving back".

Free Software is about users' freedom. Not dev rights, or politeness, etc. Now obviously, some devs picked up copyleft licenses with the purpose of improving their own software from downstream changes (Linus states that is the reason he picked the GPL), but that's a nice side effect, not the purpose. Which, of course, with popular social sharing platforms like GitHub, is how those things get confused.


> This whole Free Software movement started with something really similar to "right to repair": a firmware bug in a printer whose software was proprietary. Free Software is about being in control of the software you use. The spirit was never "contribute back to GNU"; the spirit was always "if you take GNU software, you can't make it non-free". Those GNU devs at the time just wanted a good and actually free/libre OS, one that would remain free no matter who distributed it.

Distinction without a difference. The end result is the same.


A very distinct difference.

GPL software is a box that must be kept open, so that everybody would be able to take from it.

If you pick the box and build an altered version of it, you must keep it open, you are legally prohibited from attaching a lid to it.

There's nothing about any expectations, let alone obligations, to put anything back into the original box. Usually that's not very easy (you must follow strict contribution standards) or even impossible (see e.g. SQLite, which does not accept outside patches).


If you leave the box open, the original author is free to take your modification and put it into their box.


Good for them! As long as all the resulting boxes remain open, it's fine and in accordance with the (spirit of the) license.


It is a pretty big distinction with different end results in practice. Look at Android, you can use the source in a "right to repair" manner but Google doesn't take patches so you can't give back even if you wanted to.

The same goes for Apple and Google's OSS browser. The source is there, but there is more or less no way to give back, and they certainly don't.


> Look at Android, you can use the source in a "right to repair" manner but Google doesn't take patches so you can't give back even if you wanted to.

Well, a license obliging the original author to take patches back would be a weird one.

But Google could pull any such change back into their own version and make it better.

> The same goes for Apple and Google's OSS browser. The source is there, but there is more or less no way to give back, and they certainly don't.

That's a different problem that's a bit orthogonal to licensing and has more to do with project leadership. Like, you don't even need to have an OSS license to allow users to contribute to a project.


The four freedoms[1] of free software specifically state that:

> The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.

Emphasis on "give the whole community a chance to benefit from your changes".

1. https://www.gnu.org/philosophy/free-sw.en.html#four-freedoms


"giving whole community a chance to benefit from your changes" is very different from "upstreaming your changes"


I'm not sure I agree with this as a general point of view.

Speaking generally, I'm not sure that one can claim

>> The original authors tend to be the users

There are endless forks of, say, Emacs, and I expect RMS is not a user of any of them.

Of course RMS is free to inspect the code for all of them, separate out bug fixes from features, and retro-apply them to his build. But I'm not seeing anything in any license that requires a fork to "push" bug fixes to him.

>> This aligns with the reason why some people publish their work under copyleft licenses: You get my work, for free, and the deal is that if you find and fix bugs then I get to benefit from those fixes by them flowing back to me.

I think you are reading terms into the license that simply don't exist. I agree a lot of programmers -believe- this is how Free Software works, and many do push bug fixes upstream, but that's orthogonal to Free Software principles, and outside the terms of the license.

>> That's the spirit here and trying to argue around that with technicalities is disingenuous.

Licenses are matters of law, not spirit. The original post is about this "spirit". My thesis is that he, and you, are inferring responsibilities that are simply not in the license. This isn't a technicality; it goes to the very heart of Free Software.


> Licenses are matters of law, not spirit.

But also, I think even the spirit of the original copyleft movement is being misunderstood. As the GP said, the spirit was about centering users, requiring developers to be responsible to _users_, in order to create the kind of society where our _use_ of technology would be unconstrained in certain ways.

It was not about anything owed to the "original" developers, and it was not about developers' responsibility to other developers -- in the original spirit, even. It was definitely not about creating a system where people could make an adequate income from writing software. That was not even the spirit in which the licenses were devised.

(To be fair, it also imagined/hoped that a large portion of (but not all) users could also be "developers" in the sense they could tweak software to meet their needs -- for their own and others' use, though, not for money. Even if users would be coding, the "spirit" still centered them as users, and centered the conditions of use, their needs and desires for how that software would work, not conditions of profit or income from charging people for software use).


Most people don't actually want to maintain a fork. They would prefer that their patches are mainlined.

Consider Linux. It's huge, and most vendors really don't want to maintain a fully independent fork. One reason they might do it anyway is if they could keep their patches private. But the GPL means they can't, so most just choose to upstream patches.


> Most people don't actually want to maintain a fork. They would prefer that their patches are mainlined.

But that's a downstream decision regarding efficiency in their development process. That's not what free software is about; there is nothing about that in its principles or licences.

That's just a development-process and maintenance decision, offloading patch integration to upstream, which they might or might not accept depending on your changes. None of that is about Free Software. You can see similar decisions/trade-offs taking place in any org with multiple software dev teams with ownership over libs etc., regardless of whether it is free software or not.


It's like everyone here is a lawyer nitpicking the license as it exists today. But before the licenses even existed, Open Source was about those principles: to share code, to share bug fixes, to publish any improvements, so everyone would get better. To say the "license" doesn't make this explicit is really missing the point.


It’s more like everyone here is very definitely IANAL and _still_ nitpicking (and also missing the point).


You'd think so, but there's also a good chunk of copyleft code that's just "here's our source code, go figure out how to deploy lol".

You can try to fork it into something workable, but that can sometimes literally mean trying to figure out what the actual deployment process is and what weird tweaks were done to the deploying device beforehand. In addition, forking those projects is also unworkable if the original has pretty much enterprise-speed development. At best you get a fork that's years out of date where the maintainer is nitpicking every PR and is burnt out enough to not make it worthwhile to merge upstream patches. At worst, you get something like Iceweasel[0] where someone just releases patches rather than a full fork (and having done that a few times, it's a pain in the neck to maintain those patches).

FOSS isn't at all inherently community-minded; it can be and can facilitate it, but it can also be used as a way to get cheap cred from people who are naïve enough to believe the former is the only place it applies.

[0]: "Fork" of Firefox LTS by the GNU Foundation to strip out trademarked names and logo's. It's probably one of their silliest projects in term of relevancy.


People do release software as open source for the "street cred" (perhaps unsurprisingly, given the school of thought that you don't deserve a developer job if you can't show off a GitHub repo). A lot of people also create something and figure "why not?" They may just not be interested in doing any serious community development and maintenance. There are even significant open source projects that run a pretty much closed development process.


> They oblige the author of a derived work to publish the changes to all users under the same terms. The original authors tend to be users. So, a license which would grant this directly to the original authors would end up providing the same end result since the original authors would be both allowed to and reasonably expected to distribute the derived work to their users as well.

I might be in the wrong, but this is not how I understand GPL [0]. Care to correct me if I'm wrong.

What I get from the license is that you have to share the code with the users of your program, not anyone else.

AFAIK you could do an Emacs fork and charge money for it. Not only that, but the source code only needs to be available to the recipients of the software, not anyone else.

A company could have an upgraded version of a GPL tool and not share it with anyone outside the company. Theoretically employees might share the code outside, but I doubt they'd dare.

[0] https://www.gnu.org/software/emacs/manual/html_node/emacs/Co...


> What I get from the license is that you have to share the code with the users of your program, not anyone else.

You're correct, but it's sort of a meaningless distinction because those users are entirely within their rights under the GPL to share that code on with anyone they want, which is why we don't really see the model of "secret GPL" you describe in the wild.


I'd argue that this is exactly what you see in the wild. And it's why the AGPL license was created.

GPL code has to be shared with users who receive binaries. SaaS happily didn't ship binaries, so quite legally didn't ship source code.

AGPL redefines this in terms of "user", not "binary". That refinement exists entirely to cater for unexpected use cases. No doubt new licenses (AIGPL?) will be needed to address this issue.

The whole need for Open Source protection played out with the (Apache-licensed) Elasticsearch. Switching to the ELv2 and SSPL licenses was controversial and in some ways "not open source", and certainly not "free", because it limits what a user can do with the software.

So the distinction is far from meaningless and in some ways rendered GPL obsolete.


> That's the spirit here and trying to argue around that with technicalities is disingenuous.

First, I am not a lawyer, but don't licenses exist precisely for their technicalities? This is not like a law on the books, where we can consider the "letter and spirit of the law" because we know the context in which and for whom it was written. With a written license, however, someone chooses to adopt a license and accepts those terms from an authorship point of view.


Exactly. We all benefit from sharing contributions to the same code base. I use your library, you use mine, we fix each others bugs, add features, etc... The code gets better.


No need to hate on you for a valid response.

I think we're in a new enough situation that we can look beyond what's legal in a license. When many of us started working on open source projects, AI was a far-off concept. Speaking for myself, I thought we'd see steady improvement in code-completion tools, but I didn't think I'd see anything like GPT-4 in my lifetime.

Licenses were written for humans working with code. We can talk about corporations as well, but when I've thought about corporations in the past, I thought about people working on code at corporations. The idea of an AI using my open source project to generate working code for someone or some corporation feels...different.

Yes, I'm talking explicitly about feelings. I know my feelings don't impact the legalities of a license. But feelings are worth talking about, especially as we're all finding the boundaries of new tools and new ways of working.

I don't agree with everything in the post, but I think this is a great conversation to be having.


> Yes, I'm talking explicitly about feelings. I know my feelings don't impact the legalities of a license.

They don't impact the current legality of a licence, but they will affect future ones.

GPL/BSD/Apache/proprietary, they are all picked for ideological concerns which all stem from feelings. It is good to discuss these things, and it is good to recognise that these are emotionally driven.


> They don't impact the current legality of a license, but they will affect future ones.

Don't they? Even the most liberal licenses require that you at least keep the license and attributions. Which are exactly the parts that AI systems remove. I would have no problem with an AI system trained on GPL code if the output was still covered by the GPL.


Whether or not copyright applies at all to model training is an entirely open question, and where rulings have come down, it's likely closer to these situations being fair use (e.g. the Google Books case, which was ruled transformative and not a direct replacement for the works in question).

The reality is, these models don't copy or distribute anything directly, which makes applying copyright a bit of a stretch. Many people feel like it is a use that should have some sort of IP law applying to it, which is why I think there's some chance that courts or legislators will decide to screw the letter of existing law and just wedge new interpretations in, but it's not super simple: they'd have to thread the needle and not make things like search illegal, and that's tricky. Besides that, these models are out there, they're useful, and if they're ruled infringing they'll just be distributed illegally anyways.

I don't envy the people who will have to decide these cases, I suspect what's better for the world overall is to leave the law as-is and clarify that fair use holds (nobody will stop publishing content or code just because AI is slurping it up, a few weirdos like the article author excepted), but there are going to be a lot of pissed off people either way...


Would I be able to train an AI only on Microsoft's leaked Windows code to write a Windows clone with no copyright (since it comes from an AI) and be safe from legal repercussions because it was trained, under fair use, on code I just happened to find online?

If they rule that it's ok to do that, I might be ok with AI being ruled as fair use.


Exactly. I'm betting that if you asked GPT to create a Windows clone, for sure MS would not let you distribute it. This will go like every other law/license: big corp can sue the little guy into the ground. When big corp uses your code it will be "that's just model-generated code, not yours". But in the other direction, if the little guy creates a Windows clone, "sorry, it's off to jail for you, matey".


Even if it's the opposite direction, big guy losing and small guy coming out ahead, it's still drama.

Just like Covid introduced epidemiological terms to the general public, this issue can introduce design choices around licensing, copyright and watermarking to more people.

I assume there is a group of researchers building tools to provide fine-grained historical views into AI output. And yes, for billions of parameters trained on billions of documents, linking every letter to a source document is a UX nightmare.

But what a cool problem. That's the interesting part. Yeah, something like TileBars[1] or Seesoft[1] seems like the right tool. But maybe keeping it all text with some graphical marker of authenticity is the better choice.

So many cool problems. But, that authenticity marker is the hard sell. Can reasoned discussions with others be enough to introduce that, or is drama required?

https://people.ischool.berkeley.edu/~hearst/irbook/10/node7....


Is it even fair use under the law to consume that Windows source code?


People who train AI models claim that anything they can find on the internet is fair use.

It's the whole point of the discussion… is it really?

And if it's not fair use to train on windows source code because of copyright… doesn't that same copyright law cover everything else as well?


I think there's a reasonable distinction to make between "you can train AI models on any code that you are legally allowed to have and read" and "you can train AI models on any code that you are able to feed into it, regardless of whether you have permission to possess/read it".


You're not legally allowed to have and read it if you knowingly violate its license terms.


You misunderstood. Their emotions do not impact current licences, but their choice of licence is an emotional act.

There are plenty of arguments for choosing a licence, and as the world changes they will evolve. See how the GPL itself evolved to handle TiVo. These things aren't static.

I personally agree that GPL-trained code should produce GPL code. I see a distinction between teaching people and teaching computers, but that isn't my call.


The MIT No Attribution licence https://choosealicense.com/licenses/mit-0/ doesn't require an attribution.

It's the licence I choose for new projects.


That's very generous of you. Don't get angry though if I go ahead and re-publish all your code with your name replaced by mine. That's exactly what you are allowing me to do with such a license.


That would be a copyright violation, not a licensing issue. GP is giving you a _license_ to use his code without having to say "GP wrote this"; he is NOT giving up his rights to the code.

edit: to clarify, he's allowing you to _use_ it however you like, including making a derivative work, including it wholesale, etc. However, claiming authorship of the original code would still run afoul of the original copyright.

edit2: oh - if you mean relicensing the code - that is allowed.


They are emotionally driven, but that is not all there is to it. They are also driven by an insight: that most people/organisations do not give back to the community unless obligated to do so. It is about the advancement of society and goods everyone can use, if only people give back to society when it is their turn to share.

This is exactly what code laundering ANNs circumvent and that might open up a dystopian future for all of us, not only us code monkeys, but society in general.


>they are all picked for ideological concerns which all stem from feelings

Or, perhaps more commonly, they were picked for business model reasons or (related) because building a community that includes commercial interests tends to favor more permissive licensing.


Yes. People here seem to be forgetting that Open Source was a community-driven ideal first. The license came later as "protection". Corporations were stealing code and there was no recourse. The variety of open source licenses was created to provide a framework for the community, to fight off stealing, to keep it open. So GPT is very much "laundering" the code just like criminals "launder" money.


I agree that AI usage of code is somewhat murky with current licenses, which obviously don't mention it either way.

Free software has a principle of "freedom to run, to do whatever you wish" (freedom 0), so arguably training an AI is OK. (We could quibble over the word "run", but gnu.org and RMS clearly say freedom 0 does not restrict how you use the program.)

GPL code can be used by the military to develop nuclear weapons. Given that this is a guiding principle of the FSF, it's hard to argue that the current usage is not OK.


I have no problem with Copilot being trained on AGPL code and then getting released under an AGPL-compatible license. Free to do whatever they want with it.

The problem is Copilot training on source code and then discarding any restrictions of the licenses. Maybe it is legal right now, but I'm sure this case will find its way into open source licenses pretty soon.


Even if usage is legal right now, the other obligations of the license need to be adhered to as well. You can't just pick and choose one tiny aspect of FSF philosophy and run with that. AGPL is clearly about sharing and spreading free/libre software as well.


Do we know if Copilot X was trained on AGPL code, not just GPL?

Additionally I'm not sure if AGPL does anything.

I suspect the ethics and such of licensing when large fractions of work are training AI and using AI need to be worked out rather than getting mad at any individual.

What does copy left look like for AI?


There is indeed a problem of transparency right now. Companies, AFAIK, did not release the complete training data set. That might even be intentional, because they do not want to risk finding that they trained it on stuff they should not have, without building license and attribution into the output of their models. Or it might be that they already know that to be a fact.

I can only hope that lawmakers hurry to catch up with reality and impose transparency obligations for AI models.


I largely agree with you, but I think there is one question that hasn't been addressed yet: Are the weights learned by an LLM a derivative work?

When a person learns from GPL code this question doesn't arise. The state of a person's brain is outside of copyright. But is the state of an LLM also outside of copyright or outside of the terms covered by the GPL? I'm not sure.

An LLM can not only emit source code derived from code published under the GPL, it can also potentially execute a program and could therefore be considered object code.

This isn't necessarily a problem as long as the model isn't distributed and does not include any AGPL code.


> the state of a person's brain is outside of copyright

It clearly isn't. Which is why clean-room reverse engineering always requires at least two people. Or why a musician who accidentally recreates a chord progression they heard years ago but doesn't remember the source might still get sued.


No, you're missing the very distinction I'm trying to highlight.

When I read and remember some text, possibly also learning from it, I'm not making a copy and I'm not creating a derivative work. The state of my brain is outside of copyright. Only at the point where I create a new representation based on what I have read may I be violating someone's copyrights.

But is it the same for an AI? Is the act of reading, remembering and learning (i.e. adjusting model weights) not in itself tantamount to creating a derivative work?


> The state of my brain is outside of copyright.

Is it actually? If we could fully pull out the state of your brain, and understand that you stored a copy of a copyrighted work, I think you could be on the hook for licensing it, paying fees every time you remember the work as a performance of it.


The state of your brain is moot wrt copyright as you cannot distribute your brain.

Copyright is not just the exclusive right to make copies; it is the exclusive right to distribute them.

As a simple example, reading a book aloud in your home or singing in the shower is not copyright infringement; not even if you record it.

If you sell tickets to these performances or stream them on twitch it becomes copyright infringing.

Similarly it cannot be in violation of copyright for GitHub to train copilot on any random code they can legally access. It can be in violation to sell access to the model trained in this way.


The problems are attribution and licensing. Not use.


Fully agree.

This may seem a bit nitpicky and philosophical, but anyway: these feelings you mention are about things, and the things the feelings come from are what is most important. Feelings are never standalone; if they are, they are just moods, which are so personal it's hard to have a conversation about them.

Let's call 'the things' values. I'd say feelings are perceptions of values, and as such they invariably have a conceptual element to them. And exactly that conceptual aspect makes them suitable for conversation and sometimes even debate, insofar as they can be incorrect. We can acknowledge the subjective, emotive aspect of feelings as highly and inalienably personal, respect the individual opinion behind them and contest the implicit truth-claims all at the same time.


> I'm not aware of any Open Source license, or Free license for that matter, that has a give-back clause.

§5.c of the GPL Version 3 states explicitly:

> You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.

As in, all modifications must be made available. Is that not meeting your definition of giving back? The GPL (all variants) is one of the most widely distributed of the free software licenses and has an explicit "give back" clause as far as I can see -- which is part of why some people referred to the GPL as a "cancer".

FWIW, the issue I've come to have with Copilot is that you're not explicitly permitted to use the suggestions for anything other than inspiration (as per their terms); there is no license given to use the code that is generated. You do so at your own risk.


>> As in, all modifications must be made available. Is that not meeting your definition of giving back?

Available to all users. Not previous authors. There may be overlap, or there may not be overlap.

Plus, I would say it's giving forward, not back. If there are public users then the original authors can become users and get the code. But there will be bug fixes and features smooshed together.

Which is why I posit that there's no "give back" concept in the license. Only "give forward".


Man, that is really splitting some fine hairs


Wow! Just wow... Apparently, some people don't get the idea of common good... as if it didn't exist...


Modifications you distribute have to be given back. You don't have to distribute the modifications though.


Not given back to the original authors, but given to the users.


Only if you give those users your modified version.

If I host GPL software on a webserver and my users use that webserver, I don’t have to give them the source code for modified GPL programs. This is fairly common.


Depends on which GPL you’re using.


The issue is that many large corporations use FLOSS internally with heavy extensions/modifications and never give back to the community. They don't have to, since all users are in-house, and those tend to have access to the source code.

But that's ok. If upstream is somewhat active then it's just a pain to keep maintaining your in-house patches, compared to sending them upstream. So that's automatically a motivator. If upstream is not active, then it does not matter anyway.


I would say that open source as a movement sprung up from the principles of early netiquette[0]. Which themselves were built on the foundations of sharing your knowledge with your peers.

Whether you were trawling Usenet or just a presence in your local BBS scene, "teach it forward" was always a core concept. Still is. It's difficult to pay back to the person who taught you something valuable, so you can instead pay it forward by teaching the lessons - along with your own additions - to the later newcomers.

Of course the Eternal September changed the landscape. And now we can't have nice things.

0: https://en.wikipedia.org/wiki/Etiquette_in_technology#Netiqu...


I think the big push toward the formalization of GNU was that no free C compiler existed. RMS rightly saw this as a problem and did what he thought was needed to get a universal free C compiler.


Co-pilot spits back protected expressions, not novel expressions based on ideas harvested from code. It is therefore violating the licenses of numerous free and open source projects. The author is right to be pissed.


That's not the case; there's a probability it may "spit back" the protected expression. There's also a probability that I, as a human, "spit back" protected expressions. This could either be by pure chance or from past learnings: reading the protected code and internalizing it as a solution, my subconscious forgetting I actually saw it elsewhere.

In uni, students run their theses through plagiarism checkers, even if it's novel research, because such overlap naturally occurs.

As the thought experiment goes, given infinity, a monkey with a typewriter will inevitably write Shakespeare's works.


...except you don't need an infinite number of monkeys. It has been trained to produce protected expressions by virtue of being trained on protected expressions. The probability of it producing a protected expression at some point is 1.


The same truth holds for you or me writing up that code.


No it doesn't. My mind contains information derived from expressions I've read which I can rearrange into novel expressions. I don't regurgitate protected expressions verbatim. Co-Pilot does.


That's exactly what Copilot does; ask it to rearrange it if what it comes up with is the same. That's what code plagiarism checkers are for.


You are correct. The problem is that the GitHub Terms of Service probably (guessing) have a clause which invalidates your license if you upload your code there. And that's exactly why you shouldn't use GitHub.


The terms of service explicitly say that GitHub is not allowed to use your code for commercial purposes.


This seems to be what people imagine about it, not what it actually does, although I don’t doubt you could cherry-pick some snippet after a lot of trial and error to try to claim that it had regurgitated something verbatim. But certainly let’s see the examples.


You never know if a snippet it created came from another project verbatim or not… unless you claim you know all of the code that exists?


That's a bit extreme. In theory, an LLM's proclivity for plagiarism could be studied by testing it with various prompts and searching its training data for its responses (maybe with some edit distance tolerance).
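To make that concrete, here is a minimal sketch (my own illustration, not anything Copilot or OpenAI actually runs) of what such a check could look like, assuming you have the generated snippet and a local mirror of candidate training files; the 0.9 similarity threshold and the *.py glob are arbitrary illustrative choices:

    # Illustrative sketch only: scan a local corpus for near-verbatim matches
    # of a generated snippet, with a crude "edit distance tolerance" expressed
    # as a difflib similarity ratio.
    import difflib
    from pathlib import Path

    def near_matches(generated, corpus_dir, threshold=0.9):
        gen = " ".join(generated.split())               # normalise whitespace
        size = len(gen)
        for path in Path(corpus_dir).rglob("*.py"):     # file glob is an arbitrary example
            text = " ".join(path.read_text(errors="ignore").split())
            # slide a window of the snippet's length across each file
            for i in range(0, max(1, len(text) - size + 1), max(1, size // 2)):
                ratio = difflib.SequenceMatcher(None, gen, text[i:i + size]).ratio()
                if ratio >= threshold:
                    yield path, ratio
                    break

Of course, without access to the real training set, this can only ever show matches against whatever public subset you happen to mirror.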


Except the training data is secret…


You can search github and other open sources to find at least a likely subset of the training data though.


You suggest doing this, by hand, for every suggestion?


Just try it out for some code you have on GitHub where you know yours is the only solution out there. You'll be pleasantly surprised to see that it does not suggest a verbatim copy/paste of your code or anything close to it, unless you try this with a one-liner like how to do an fopen(), which would not be a good test and would not be the only solution out there. And then, seeing the result, you can adjust your theory. So, in short, I suggest simply testing your theory, not anything absurd like what you are coming up with.


What would that prove? I still have no access to all the proprietary code generated from copilot, and no idea if it did copy paste or not in all those cases.

You suggest I try it twice and since it will probably not copy paste in those 2 tries, assume it never copy pastes (despite existing evidence that it does copy paste in some other cases).

What problem would this exercise solve? I can't see it.


Search tools are a thing. Grep, Google, Github, etc.

But yes, exactly, you never know if a snippet came from another project or not, so let's not assume it did without some convincing evidence.


Why assume it didn't?


I’ve done tests and it passed with flying colors so it’s not an assumption. So the premise of your question is flawed.


It has been shown that github copilot does copy paste.

The fact that you tried it a couple of times (or 10 or 20) means absolutely nothing.

1 copyright infringement is enough for a lawsuit.


You're allowed to make money from Free/OSS code, and plenty of companies have (Google, Amazon etc.), but they have always also at least given back something to the community to earn some good will. The situation with AI is new because it not only doesn't give anything back, it actually takes something away by threatening developers' jobs etc.


> they have always also at least given back something to the community to earn some good will

The busybox authors disagree: https://busybox.net/license.html


One possible problem is if Copilot gets good enough that you can rather easily sidestep the GPL (or any other license) by having Copilot implement functionality X for you instead of using a license-bound library providing X. Not only may this be questionable regarding the license, but it would also tend to reduce contributions to the library which otherwise would have been used.


ChatGPT already can rewrite functions to use different algorithms.


Shouldn't be different from a human ripping or regurgitating parts of the code.

If it's not a hard derivation, then it's difficult to prove or even notice.


It shouldn't matter. If your model touched X during training, it should be seen as producing derivative work. This is the reason humans use clean room implementation techniques.


> If your model touched X during training, it should be seen as producing derivative work.

I as a programmer touched X during training (learning how to code). Is all my work now derivative because of that?


If you examined the source code of a library for a specific purpose Y, then shortly afterwards went implementing another library for purpose Y, there's a high probability of your code being infringing. That's the entire premise and purpose of clean room design (https://en.wikipedia.org/wiki/Clean_room_design).

Now factor in that machine models don't have a fallible or degrading memory and I think the answer is quite clear.


The opposite of clean room is probably much more common.


My point was more about the possible negative effects than about the legality.

An important difference is that AIs are much cheaper than a human. Having a library reimplemented by a human usually isn’t cost-effective, but having it done by an AI may become viable. That could cause a major change in the open-source dynamics, while possibly also reducing average software quality (because less code is publicly scrutinized).


It would be interesting to have a Free Software license that requires that anything which ingests the source code must be Free Software running on Free Hardware. If you train a model on such inputs, your model would need to be Free Software and all the hardware the model runs on would need to be Free Hardware. This would create a massive incentive to either not use such software in your model or to use Free Software and Free Hardware.

Taken to its logical conclusion, you could add the notion of Free Humans who are legally bound to only produce Free Ideas. One could imagine this functioning like a sort of monastic vow of charity or chastity: "a vow of silence on producing anything which is not Free (as in freedom)."

Would you take such a vow if offered 100,000 USD/year for the rest of your life (adjusted for inflation)? I would.


This idea ("make a stronger license") has come up in previous discussions of Copilot as well[0].

The problem is that the Copilot project doesn't claim to be abiding by the license(s) of the ingested code. The reply to licensing concerns was that licensing doesn't apply to their use. So unfortunately they would just claim they could ignore your hypothetical Free³ license as well.

[0]: https://news.ycombinator.com/item?id=34277352


> The problem is that the Copilot project doesn't claim to be abiding by the license(s) of the ingested code. The reply to licensing concerns was that licensing doesn't apply to their use.

I think GitHub is largely correct in their view on licenses. However, I would argue that you could create a stronger legal binding than, say, a GPLv3 license. For instance, you could require and enforce that anyone who wishes to read the repo must sign a legal contract or EULA: "By decrypting this git repo you are agreeing to the following licenses, restrictions, contractual obligations, ..."


From the MIT license:

> The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

Does GPT spit out the copyright notice when it regurgitates my code?


Is GPT spitting out “substantial portions” of your code?


It might. The ethical dilemma here seems to be:

1) We don't know when/if GPT will. GPT in its current form can't seem to guarantee safety (either from "substantial" verbatim snippets or from complex "hallucinations" of random pachinko output).

2) GPT doesn't know when/if it has. GPT in its current form likely cannot know this. (In part because it doesn't really "know" anything; that's too anthropomorphic a word for what is still mostly just a casino full of pachinko machines.)

3) Define "substantial portions" in a way that a jury of your peers can understand it in a court of law.

4) Can you define "substantial portions" again, only this time in code as guide rails for something like GPT? "Substantial portions" is barely a human term designed for human lawyers and courts. There's a fascinating challenge here on quantifying it.


Trivial functions aren't copyrightable, so it doesn't matter whether they are GPL or not.


> If the AI is regurgitating GPL code as-is, then that's a problem- not dissimilar to a student or employee regurgitating the same code.

Not "if". We know it does.

And since it doesn't show citations, it might be the case that you use it and mistakenly end up making your entire software GPL, because of including copy-pasted GPL code.


Yeah on the one hand, isn't opening your source all about not really minding what happens to it after that? It's intended to be copied and used. On the other hand, something about the term "laundering" kind of resonated for me. It's kind of like automated plagiarism where you spread your copying out over millions of people. But plagiarism only has meaning as an offense when the thing being copied isn't intended to be copied. But for copyright purposes, is there a difference between copying exactly and the type of blending an LLM does? I'm too confused. That feeling when you hit on something society has never thought about before.


If we go by how many people explain open source, you would be right; but if we go by how people who actually know what their licenses are supposed to do explain it, then no. You give a license for a specific reason. One might be to allow others to copy, but there is usually a condition, and that is to leave the license information intact. If we go further towards free/libre software licenses like the GPL or AGPL, then we have more conditions. For example, if you distribute software using that code, you need to distribute the source of your software as well (a bit imprecise).

If you want to get a better picture of the situation, read up on the licenses and what they do, specifically the term "copyleft".


That's why OSS is not about the contents of more restrictive licences (as you say, copyleft is a good example), but about the broadest definition that applies to all of them. More specialised sub-types can have more conditions, but then we should speak about those sub-types directly.

I make this point because it causes confusion. Saying "my OSS project is being run by others for money and I get none of it" is confusing, as it uses the supertype. Using the subtype is clearer, and even self-explanatory: "my project I licenced so others can make money from it while giving me none is being run by others to make money from it and I get none".

That may be a bad situation, but at least the reasons are now clear.


> Yeah on the one hand, isn't opening your source all about not really minding what happens to it after that?

No! That's a gross misrepresentation of what open sourcing is. It's the offer of a deal. You publish the source code, and in return for looking at it and using it for something, I have obligations, like attribution and licensing requirements regarding derived works.


No, open source isn't about practically giving up your rights; it's about restricting use of your code and software in exactly such a way that it gives every user as much freedom as possible.


This actually has been thought about before, in the context of remixes, collages, etc. The essential question is how much of the originality of the original work(s) constitutes the originality of the new/derived work. If it is little enough, then it’s okay. The issue with AI models is that they have no way of assessing originality and tracking the transfer of originality.


The term is being used here to imply that the generated code is somehow bypassing the licensing requirements, which isn’t necessarily true, and certainly isn’t a substantiated claim.

You can read licensed code, learn from it, and then write your own code derived from that learning, without having committed a copyright violation.

You can also read licensed code, directly copy paste it into your codebase, and still not have committed a copyright violation, as long as you did so in a way that constituted fair use (which copy-pasting snippets certainly would).

There’s no copyright issue here at all, and rationally speaking there aren’t any legitimate misuse of open source concerns either. If these people were honest they’d just admit to feeling threatened by AI, but nobody would care about that, so they just try to manufacture some fake moral panic.


I agree that copyleft is more about "giving forward", and I think it's a confusion a lot of people make. Reading through the thread, I get the impression that some think as soon as one "distributes" the licensed material, original authors should get a copy. I'm extrapolating of course, but even then I feel some people would agree with that statement.

GPL, for instance, merely states that distributed sources or patches "based on" the program should be "conveyed" under the same terms. In other words, anyone who gets their hands on it will do so under the same license.

If anything, I would be worried that GitHub trained itself on publicly-available but not clearly licensed code, because then it would have no license to "use" it in any way[0]. GPL provides such a right, so there is no problem there. It would be even more worrying if the not clearly licensed code was in a private repository but I think I remember reading that private repositories were not included in the training data.

However, would you consider a black-box program whose output can consistently produce verbatim, or at the very least slightly modified, copies of GPL code to be transformative? The problem does not lie in how the code is distributed but in how transformative the distributed code is. Not only does the same apply to any program besides AI-powered software, it applies to humans[1].

Given how unpredictable the output of an AI is, one should not be able to train it on GPL code if one cannot reliably guarantee it will not produce infringing code.

[0]: https://docs.github.com/en/site-policy/github-terms/github-t... (https://archive.ph/susi0#4-license-grant-to-us)

[1]: One such example would be how Microsoft employees allegedly prevented themselves from reading refterm source code, cf. https://github.com/microsoft/terminal/issues/10462#issuecomm...


Perhaps I'm out of the loop on this, but I always thought the concept of open source was primarily about the opportunity for personal professional development. The ability for someone not connected with a corporation to stay relevant and continuously update his skills in a way that was not dependent on proprietary systems. That is a huge asset, not only for oneself but also for the world.


That may be a benefit, but the primary concept of open source is that it's open source.


Time will tell, and the trend is destined to be more devs closed-sourcing their code, no matter what curious angle you take to justify large firms exploiting AI for money, and I doubt you are working for one of them.


IMHO, just like there was a robots.txt file made for the web, there needs to be a NOAI.txt for git repos. Sorry, this repo does not permit you to ingest the code for a learning model. Seems completely reasonable.
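Purely to illustrate the idea: no such standard exists today, and like robots.txt it would rely entirely on crawlers choosing to honour it, but a hypothetical NOAI.txt at the repository root might look something like this, with every directive name invented here for the sake of the example:

    # Hypothetical NOAI.txt -- not a real standard, directives invented for illustration
    User-Agent: *
    Disallow-Training: /          # nothing in this repo may be ingested for model training
    Allow-Training: /docs/        # except, say, the documentation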


What would be the point of that?

If we were somehow able to prevent AI models from ingesting a codebase, that would mean everyone else who wants to produce similar code would have to re-invent the wheel, wasting their time repeating work that has already been done.

All because... the person who did it first wants attribution? They want their name to be included in some credits.txt file that nobody will ever read? That's ridiculous.


> All because... the person who did it first wants attribution? They want their name to be included in some credits.txt file that nobody will ever read?

Yes, and yes. Those would be the terms that person publishes their code under. If you can't agree to those terms - maybe because including a single name in a credits.txt file that no-one reads is somehow too onerous for your process - then you are always free to re-implement that code on your own.


That wouldn't work. robots.txt is not enforceable.


I wonder if the GPLv4 will be coming out soon with an anti-AI-training clause (unless your output model and its outputs are also GPLv4, of course).


People keep bringing this up. It's not as straightforward as a clause that says "you can't use this to train AI" (which is what I suspect many people think).

Licenses operate on a continuum of permissiveness. They can only relax the restrictions that you as a creator are given by default. You can't write a copyright license that adds them. You could write a legal instrument that compels and prohibits certain behavior(s), but at that point you're talking about a contract. (And there's no way to coerce anyone to agree with the contract.)

Harry Potter has even more restrictions than the GPL or any other open source license. It's "All Rights Reserved"; it enjoys the maximum protections that a work can. And yet it would still be possible to feed it into an AI model, even if Rowling, Bloomsbury, and Scholastic all didn't want you to. They don't get a say in that. Nor do open source software developers in their works, which selectively abandon some of the protections that Rowling reserves for herself and her business partners.

The only real viable path to achieve this using an IP license alone would be a React PATENTS-like termination clause: if your company engages in any kind of AI training that uses this project as an input, then your license to make copies (including distributing modified copies) under the ordinary terms is revoked, along with revoking permission for a huge swathe of other free/open source software owned by a bunch of other signatories, too. This is, of course, contingent upon the ability to freely copy and modify a given set of works being appealing enough to get people to abstain from the lure of building AI models and offering services based on them.


I amend my wondering to prepend the GPL thing with "I wonder how long it will take Disney et al. to make it illegal to train AI on their stuff", which then opens that door for the GPL.


> I'm gonna get hated on for this, but I don't think "give back" is an open source concept.

You're right. It's a politeness law some people have invented.

It's also a value people have, but that's for themselves. I like contributing to OSS projects. But, as soon as it's imposed on others, and there are punishments for disobeying, it's a politeness law.


Politeness keeps society civil and sane.


Politeness laws don't. They calcify and oppress.


Could you link to a politeness law on the books that you feel is oppressive?


On the books? I mean laws like "you shouldn't make money from OSS made by someone else". The context of this chat.


oh, so those things are what people call "opinions", which they are completely allowed to have, just as you have yours. They aren't oppressing you, they don't expose you to penalties, you can't get thrown in jail.


No, not just opinions. I don't understand why you'd redefine what I'm saying, as though that invalidates the original statement. It's not exactly difficult to spot.


I’m not redefining what you’re saying. I’m pointing out that you are exaggerating the magnitude and impact of Internet opinions by characterizing them as oppressive laws.


    then that's a problem- not dissimilar to  
                          ^
Discontinuity here.


> On the one hand, we happily train humans on GPL code. Those humans can then write their own functions, but for trivial functions they're gonna look a lot like the GPL source.

Exactly. People are getting mad that Microsoft is making good money while the people who made all that free software available mostly did it for free (as in no money and no recognition). It can sound unfair, but that's the deal. If you didn't want people or AI to learn from your code, open source was not the right option.


> If you didn't want people or AI to learn from your code, open source was not the right option.

There's nothing wrong with other people using - learning and creating derivative works of - one's open-source code, provided they respect the terms of the license. It seems to me that the real issue is the fact that these licenses don't have enough teeth.


Most people I know who contribute to or host open source projects, me included, do this for references. And the most successful ones find a way to generate revenue. "Giving back" is a nice additional thing, but I don't know anybody who does that _primarily_ to "help the world".


If we are being honest as a community, open source developers are pretty far down the list of groups with valid grievances against this current wave of AI for how it is trained. There is at least a debatable case that these systems are operating in the spirit, if not the exact letter, of general open source licenses. It is a much harder argument to make for the AI trained on writing and art that is clearly copyrighted. If you have ethical questions about Copilot, you really should be against this entire crop of AI systems.


So you're suggesting that developers shut up and let the artists talk first? I'm not sure what the "you're suffering less than these other people" thing is actually intended to translate into? What do we do with that?

All software licences are based on copyright, same as writing, art, music, etc. Some software licences are permissive. Some writing is permissive (e.g. Cory Doctorow). Some music is permissive (e.g. Amanda Palmer). It entirely depends on what the author wants. The fact that more software is permissive is a good thing, right?

I entirely agree that there are ethical problems with training AI on copyrighted training data. But please let's not start gatekeeping this. We need to have a serious discussion as a culture about it, and saying "you're way down the list of victims" isn't helping.


> So you're suggesting that developers shut up and let the artists talk first? I'm not sure what the "you're suffering less than these other people" thing is actually intended to translate into? What do we do with that?

The tech community has a tendency to not care about issues like this until it affects us. I'm not telling people to shut up about this. I'm saying don't be a hypocrite. If this is the wrong approach for GitHub, it is a problem with the way all these AI are trained.


This is a pattern of argument I see all the time in minority communities, and it really rubs me the wrong way.

You are allowed to care about the things you care about and not have a concrete, well-informed opinion on something related that might be more pressing. It only becomes hypocrisy as soon as you actively dismiss the other thing as if you were well-informed. And I don't see anyone here doing that.


There is indeed a problem with the way all these models are trained, and too many people want to pretend that colour can be laundered out <https://ansuz.sooke.bc.ca/entry/23>


> If this is the wrong approach for GitHub, it is a problem with the way all these AI are trained.

Broken logic.


What I agree with is that the typical open source dev, who goes "I MIT license all my things, because I have seen it elsewhere and I don't want to think about licenses a lot," is pretty far down the list of groups of people with cause to complain.

What I disagree with is the idea that they should therefore not complain, or that there could not be an AI system that does not do code laundering, but keeps licenses in place and does this ethically and in an honest way. Adding "ethically" and "honest way" because I am sure that companies will try to find a way around being honest, if they are ever forced to add back the licenses.

In fact, artists might not be the group that grasps the impact of training on that corpus as quickly as the dev communities do. Perhaps it is exactly the devs who need to complain loudest and first, to have a signal effect.


>I'm gonna get hated on for this, but I don't think "give back" is an open source concept.

Well I guess you already know why you may be hated for this. Anyone who has surfed HN since ~2010 would know, or should have noticed, that the definition of open source has changed over the past 10-15 years. Giving back and community are the two predominant open source ideals now, with making lots of money on top of OSS code being a somewhat contentious issue, to say the least.

But I want to sidestep the idealistic issue; I think this is more of an economic one, something that could be attributed to a zero-interest-rate phenomenon. You now have developers (especially those from the US) who, for most if not all of their professional life, lived in a world where money and investment were easy, comparatively speaking, and where giving back was expected because money (or should I say cash flow) wasn't an issue. A world where $200K total comp was supposed to be the norm for a fresh grad joining Google, where management thought $500K was barely enough and they needed to work their way up to $1M, where senior developers believed that if juniors were worth $200K then asking for $1M total comp was perfectly sane, or, at some other extreme, where everyone in the company should earn exactly the same.

If Twitter or social media are any indication, a lot of those ideals are now completely gone from the conversation. Although this somehow started before the layoffs.

It is somewhat interesting to see sociological and ideological changes follow economic changes. But then again, economics is perhaps the largest field study in psychology.


> The code that was regurgitated by the model is marketed as "AI generated" and available for use for any project you want. Including proprietary ones. It's laundering open-source code. All of the decades of knowledge and uncountable hours of work is being, well, stolen. There is nothing being given back.

Leaving GitHub won't change that; OpenAI is training its models on every bit of code they can get: sourcehut, codeberg, etc. If it's public, they will train on it.

Also from my experience of trying to leave GitHub, you just end up having a couple of projects on your alternative platform, and everything else on GitHub. You are still active on GitHub, probably even more than your new alternative.

And if you want to build a community, you will quickly find out that the majority want to stick to GitHub, and leaving it can kill your project's chances of getting contributions.

Personally, if the courts decide it's fair use, that's it, I'm going back; it's the best git platform out there, gitlab doesn't even compare in free features. However, I have been eyeing Gitea and Gitea Actions; with them, Codeberg could become a realistic choice for me.

To end it, here is a hot take: I really hate Sourcehut.

It's hard to use, the UI is... not great, and trying to browse issues or the latest commits is a nightmare.

Every time a project uses it, it's a pain to deal with.


> Also from my experience of trying to leave GitHub, you just end up having a couple of projects on your alternative platform, and everything else on GitHub.

> And if you want to build a community, you will quickly find out that the majority want to stick to GitHub, and leaving it can kill your project's chances of getting contributions.

That's a defeatist attitude and a self-fulfilling prophecy at the same time. As more and more people leave GitHub (hopefully not to go to the same alternative), it becomes less and less of a must-have. The reason these things are somewhat true today is because of the network effect, and it's precisely that effect which we must actively attempt to squash by leaving.


Parent is talking about a fundamental feature of networks. A denser and larger network has much more useful network-related features, and if one company has a significant majority of the total addressable market for a network, it's a massive ask for people to extricate themselves and rebuild a network somewhere else.

It's why Facebook is still on top even though everyone hated it for a while; YouTube is the *only* video platform, etc.


But we are developers, not my grandma. We ‘know better’ but haven’t been doing enough about it.


You are overestimating how many developers care about this really.


> Leaving GitHub won't change that; OpenAI is training its models on every bit of code they can get: sourcehut, codeberg, etc. If it's public, they will train on it.

Not every bit of code, they are respecting proprietary licenses.

When MS puts the code for Windows, Office, Azure and everything else in front of ChatGPT, Copilot, whatever other AI learning model they have, then perhaps they have a leg to stand on.

Otherwise, they're just being hypocritical to claim that no injury is being done by using code for training, because they are refusing to train on any of their code.

Right now it just looks like they are ripping off open source licenses without meeting the terms of the license.

https://www.lelanthran.com/chap7/content.html


AFAIK that has nothing to do with the license, it has to do with whether the code is public. You don't want the AI accidentally revealing proprietary non-public information (e.g. imagine someone had a secret API key in a private repo and copilot leaked it; that'd be a huge incident), so you don't train it on that information, regardless of what it's licensed under.

You could make a similar argument for not training on GPL code, but it's a lot easier to programmatically determine whether or not code is public than it is to programmatically determine what it's licensed under, particularly when you're training on massive amounts of unlabeled data. Not to mention it's way easier to delete an accidentally-added snippet of GPL code from a codebase than it is to "unleak" company secrets after they've been publicly revealed.


> Not to mention it's way easier to delete an accidentally-added snippet of GPL code from a codebase than it is to "unleak" company secrets after they've been publicly revealed.

How often do you think anyone will notice that some part of a proprietary codebase is copied substantially from GPL code? I think it's going to be very rare and a lot of this code will fly under the radar. The GPL was always a kind of legal jiu-jitsu, turning copyright against itself and allowing non-commercial entities to protect themselves from uncompensated exploitation. Models like copilot, if they're legal, upend the status quo tremendously. Even though your code isn't (always) used directly, a commercial entity like Microsoft will slurp it up and sell the resulting model back to you for $9.99/mo.


> Every time a project uses it, it's a pain to deal with.

Sorry, but I consider that a plus.

One of the primary problems with GitHub right now is the "drive by" nature. Everybody is on Github because a bunch of idiotic big corporations made "community contribution" part of their annual review processes so we now have a bunch of people who shouldn't be on GitHub throwing things around on there.

Putting just a touch of friction into the comment/contribute cycle is a good thing. The people who contribute then have to want to contribute.


I like sourcehut, I'm just not a fan of the email-oriented collaboration workflow, so I don't use it. And the rest of the world isn't either, if the success of github is anything to go by. I get that Drew likes it, the greybeards are used to it, it works, it's adequate, and it keeps things simple, but I just never could do it. I don't like git either tbh, I grumble while I use it. IMO the perfect collaboration suite would be something like fossil with RSS feeds for every action.


I believe the goal is to build a minimal UI for those who don't prefer the email workflow, which is fine, but email and pull requests aren't the only models here. Look how much tooling is created to try to fit stack-based diffs atop Git+GitHub instead of using a different platform.


I'm mostly familiar with gitlab, what does github provide for free above and beyond that? I like that I can run my gitlab pipeline on my machines and sync to a free gitlab instance. I like that I don't read about security vulnerabilities in gitlab pipelines nearly as often as github actions. I like gitlab issues as they are fairly minimal.


GitHub registry, GitHub actions and GitHub Codespaces are unlimited for public repos, in addition to all enterprise features.

That's without talking about nice-to-have features like GitHub Sponsors, the For You tab, and the (arguably) more popular UI layout. It's simply a better platform for open source projects.


> gitlab doesn't even compare in free features.

What features is GitLab missing? I don't know, I'm curious.


Unlimited package registry, unlimited Actions run time, premium features unlocked, and more. Also, the free tier on GitHub gives more for private repos too: unlimited orgs, 2000 CI minutes, etc. It's just plain better, and it's because Microsoft can afford to play the long game; GitLab can't anymore.


I believe he just wants to do his bit by removing his activity from github towards lowering their dominance numbers in the space. I don't think he intends to stop those LLM code models.


This whole open source thing is the biggest farce on planet Earth. Someone with good knowledge of geeks and their behaviour concocted this open source bullshit. So now talented people give their skill to the "whole" and they have to beg for contributions and donations to get by. And other geeks (not suits with ties) finance the ones they sympathise with. It's ridiculous.

And faceless entities use their hard work for who knows what, but mostly to fatten up their already oversized corp and give back NOTHING.

And people, seemingly without common sense suck up to companies that rob them, and even disseminate their shiny new "free" tools.

This would be a Hugo/Nebula-award-winning novel if it weren't reality.


This is such a misrepresentation of the open-source landscape. Yes, there are people working on open-source projects who beg for donations; but there also are open-source projects maintained by full-time employees (Eleventy, paid by Netlify; React, paid by Facebook; Angular, paid by Google; Next.js, paid by Vercel; Linux, paid by various companies; etc.). If a person thinks that his efforts will be better compensated elsewhere, he can always start looking for a paid job.


People should not have to beg for their work, for a start. If you use it, you should pay for it. A small fee, or a fee that is set by the dev.


I use a lot of open source software and directly pay for approximately none of it.

Do you behave differently and pay for it today? What do you use to accomplish (track and manage) that?


That's the problem there are no systems in place for it. So if there is no UI for payments on github, people will never stop for a moment to think about what they take for granted.


You just described proprietary software. It's a philosophy incompatible with free software.

You aren't required to pay for a gift, which is what free software is.


Nah, nope. You usually don't get the source for proprietary software. Just like you get medical care when you are in an accident, the same should apply to work done on open source software. Somehow devs should be compensated for their work on OSS.


> So now talented people give their skill to the "whole" and they have to beg for contributions and donations to get by. And other geeks (not suits with ties) finance the ones they sympathise with. It's ridiculous.

Is it? I can't think of a single professional dev making money right now who isn't making that money, at least in part, because they did not have to reinvent the entire tech stack that they are skilled in.

If there was no open source, we'd all be making a lot less, and the state of tech would be far far smaller than it is right now.


If there were no open source, people would pay for libraries. Now we have open source, and a lot of devs are not compensated. End of story. No proper solution. That's all.

Roughly the same applies to newspapers: oh, please do not turn off advertisements so we can keep the lights on.

Digital beggars everywhere.


> If there were no open source, people would pay for libraries.

Nonsense. The cost of creating non-trivial software (say, 20+ dependencies, all needing payment) would put software out of the reach of ordinary people, meaning that there would only be a small niche of developer jobs.

Which means that most people making a non-zero income from writing software today would have been making a zero income from writing software in your hypothetical alternate universe.

There are a lot of butterfly-effect-type results as well: due to how capitalism works, the majority of people who are capable of writing software would never be able to compete. Whoever the big players are, they could simply buy them out, shut them down or even product-dump.

FOSS levels the field somewhat: FOSS is a force multiplier, in that whatever FOSS creates can be used to create more software (even non-FOSS), reducing the dependency on one or two incumbents who were lucky enough to get there first and cornered the market.

Without FOSS, we'd all be running IE6 on Windows 98, because there'd be no competition.


I think you have issues with interpreting the idea as a whole, so you cling to one sentence and base some totally out of touch assumption on that very sentence.


While this is a pretty harsh take, I can't help but to feel that articles like "Software below the poverty line" support at least aspects of it: https://staltz.com/software-below-the-poverty-line.html


I don't think open source per se was a mistake, but permissive licenses like Apache certainly were. They've just allowed businesses either to get free things to make a profit while contributing nothing back, or to literally create a business by selling the Apache-licensed programs in the cloud.


Yikes. You sound very bitter. Is there a story behind that bitterness?

There's a wide variety of people in the open source community at large. And a wide variety of motivations for contributing. I for one am happy that open source software is a thing. It's been a net good for mankind. Sure, there are abuses, and I'm sure many things could be improved. But I'm glad it's there all the same.


Nah, not at all, I just don't like power structures that feed on benevolent naivety. Of course it's good, but compensating those people should be the norm. Where is the payments side of github for example? So it's open source, but in order to clone the repo you should spend a buck on it, or if you want a bug/feature addressed you could name a price or the dev can set the price on it?

Because the holy sacred cow must not be agitated... suuuuuuuuuuuuuuuuure.

And people rationalizing the all devouring machine, hell, it is just bonkers.


> compensating those people should be the norm. Where is the payments side of github for example? [..] if you want a bug/feature addressed you could name a price or the dev can set the price on it?

Github does have monetization. It has "sponsors", and you can create a "sponsor" level that is basically "I will consult with you and prioritize bugs you choose".

It's totally normal for a developer who wants to monetize a popular open source project to offer consulting or "pay for me to work on your bug". That's already there.

... However, I would like to provide an alternative view. I am personally very happy that monetary compensation is not the norm in free software. I find joy in coding, but I find far more joy in coding when there's no money involved. When I am able to work as much or as little as I want, without feeling any form of financial obligation to others (which inescapably comes from being paid), I am happier.

If the norm was to pay or be paid in free software, I would not find joy in it. I would likely not participate.

By analogy, let's say that me and some friends get together to eat food, and each bring a meal. You might say "oh, that is a waste, the person who made a meatloaf could have sold that for money. Everyone at this meal should be paying each other for their cooking, and the person who cooked the most ends up making some money". Do you not see how that would ruin the feeling of cooking for your friends and enjoying time together?

To me, the free software community has a similar thing. Because the norm is assuming people are just trying to build stuff, not make money, it makes it a far more pleasant activity.

> And people rationalizing the all devouring machine, hell, it is just bonkers.

To me, the truly bonkers thing is people letting capitalism eat them. "You have to have your grindset, optimize your time to make money", it seems bonkers to me. People trying to rationalize their existence not by finding communities and trying to help others, but by trying to make their wealth as large as possible, often at the expense of happiness.


It should be the default. Sponsor links are nowhere to be found when you click on Code to get the clone URL, for example. It's still up to you whether you pay or slack; the code will be there, it's free, it would just weigh on your conscience. It's a safety net and a way to say thanks to the devs.

There are a lot of people who do it because they like what they do (especially in the beginning, while it's not a maintenance nightmare), but who would also like to have some side income from it; they are just too timid or shy to ask for it. So the burden should be on the service provider to provide these services and not on the developer. The dev can even opt out of it (like you) if he wants to, but I think that would be a small minority.

You pour your heart into a project, others use your project like it's a free service, but in the end nobody gives you anything for it. All you get is stars and forks, and some stats. Wow, thank you for the exploitation of your naivety.

You can buy the favorite beer, coffee, hamburger from the money flowing in, and that's your tangible reward for your efforts.


I disagree that compensating FOSS contributors should be the norm. That would open people up to a huge can of worms (labour laws, liabilities, etc).

> Because the holy sacred cow must not be agitated... suuuuuuuuuuuuuuuuure.

I have no idea what you're talking about.


Yup.

FWIW, I keep thinking about some kind of dual licensing, FOSS and something-something-royalties. (Sorry, IANAL, so haven't gotten any further.)


I tend to disregard articles that default to the "Stochastic Parrot" argument. These tools are useful now, I don't personally care about achieving actual intelligence. I want additional utility for myself and other humans, which these provide now, at scale.

By a lot of measures many humans perform at just about the same level, including confidently making up bullshit.

This post reads like one of the "Goodbye X online video game" posts. I'll cut them some slack because this is their own blog they're venting on, and it was likely posted here by someone else rather than them doing some attention seeking, but meh.


Being useful and being a stochastic parrot are not mutually exclusive. And I in fact think the opposite: it's necessary to remind people what it really is, especially in this phase of "enthusiasm", because I see too many people attributing some meaning, some hidden insight, and especially some innate infallibility to AI nowadays, maybe confused by the name "AI".


Right, but most arguments, including the one here, go something like "AI is a Stochastic Parrot so it's a lie and now I think it's bad and we shouldn't do it."

Which is a pretty dumb position imo. Not that I personally think these newer LLMs are a stochastic parrot, or at least not to the degree proponents of the Stochastic Parrot argument would have you believe.


The argument of the Stochastic Parrot is not that we shouldn't do it. It's just "Do not attribute any meaning to it". The one in this article is in the same vein: the author thought the AI was learning to program, when in reality it was just repeating the most statistically probable combination of the code it had seen. That is, "correctness" is not part of the "considerations" that the model makes. If the majority of the code that it has scanned contains a particular form of logic bug, it will suggest the same logic bug. The trap is in the fact that the AI will write perfect syntax, because that is its bread and butter, and people seeing this perfect syntax attribute perfect logic to it as well. As long as people are aware of this kind of problem, LLMs are a very useful tool that will save a lot of time. But if applied blindly "because AI knows best", they will create more problems down the road.


That's not what I said the Stochastic Parrot is arguing, I said people use the Stochastic Parrot argument to justify not doing it.


It’s pretty ironic that the argument that LLMs are stochastic parrots actually sounds like the actual parrot.

I think we're way past that now, with LLMs quickly taking on the role of a general reasoning engine.


> quickly taking on the role of a general reasoning engine

And this right here is why it's important to emphasize the "stochastic parrot" fact. Because people think this is true and are making decisions based on this misunderstanding.


Or maybe they just disagree with you?


Who? See https://dl.acm.org/doi/pdf/10.1145/3442188.3445922, a 2021 research paper warning (among other things) precisely about this confusion.

> The ersatz fluency and coherence of LMs raises several risks, precisely because humans are prepared to interpret strings belonging to languages they speak as meaningful and corresponding to the communicative intent of some individual or group of individuals who have accountability for what is said

Is there any Researcher who maintains that LLM models contain Reasoning and intent?

Those who are working on these models are not confused; they know what they are. The public is confused.


How is that different from "This evidence for X raises the risk that people falsely believe that X"? That's an argument for X, not against. And nothing in that paper, even if I discard the dross (ie. everything except one section on page 7), seems to actually make an argument against X of any strength beyond "it is wrong because it is wrong".

My point is this: I disagree with you. This is not because I have "misunderstood" something; it is because I understand the stochastic-parrot argument and think it is erroneous. And the more you talk about "the risk that people will come to falsely believe" rather than actual arguments, the less convincing you sound. This paternalistic tendency is a curse on science and debate in general.


> it is because I understand the stochastic-parrot argument and think it is erroneous.

Okay then, what exactly about it is erroneous? Because stochastically sorting the set M of known tokens by likelihood of being the next one is literally what LLMs do.
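
To make that concrete, here is a minimal, purely illustrative sketch of that sampling step (the vocabulary, logits and probabilities are invented for the example; a real model works over tens of thousands of tokens and billions of parameters):

    import math, random

    def softmax(scores):
        # Turn raw scores (logits) into a probability distribution.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    # Toy vocabulary and invented logits for some context like "Mike leaves the elevator ..."
    vocab = ["first", "last", "Moon", "garglsnarp"]
    probs = softmax([4.0, 2.5, -1.0, -6.0])

    # "Sorting known tokens by likelihood": every token gets *some* probability,
    # none of them is marked true or false.
    print(sorted(zip(vocab, probs), key=lambda p: -p[1]))

    # Sampling picks the next token in proportion to probability, not by checking facts.
    print(random.choices(vocab, weights=probs, k=1)[0])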


There's a class of statements that can be either interpreted precisely, at which point the claim they make is clearly true but trivial, or interpreted expansively, at which point the claim is significant but no longer clearly true.

This is one of those: yes, technically LLMs are token predictors, but technically any nondeterministic Turing machine is a token predictor. The human brain could be viewed as a token predictor [1]. The interesting question is how it comes up with its predictions, and on this the phrase offers no insight at all.

[1] https://en.wikipedia.org/wiki/Predictive_coding


> The human brain could be viewed as a token predictor

No, it really couldn't, because "generating and updating a 'mental model' of the environment" is as different from predicting the next token in a sequence as a bee's dance is from a structured human language.

The mental model we build and update is not just based on a linear stream, but many parallel and even contradictory sensory inputs that we make sense of not as abstract data points, but as experiences in a world of which we are part of. We also have a pre-existing model summarizing our experience in the world, including their degradation, our agency in that world, and our intentionality in that world.

The simple fact that we don't just complete streams, but do so with goals, both immediate and long term, and fit our actions into these goals, in itself already shows how far a human's mental modeling is from the linear action of a language model.


But the human mental model is purely internal. For that matter, there is strong evidence that LLMs generate mental models internally. [1] Our interface to motor actions is not dissimilar to a token predictor.

> The mental model we build and update is not just based on a linear stream, but many parallel and even contradictory sensory inputs

So just like multimodal language models, for instance GPT-4?

> as experiences in a world of which we are part of.

> The simple fact that we don't just complete streams, but do so with goals, both immediate and long term, and fit our actions into these goals

Unfalsifiable! GPT-4 can talk about its experiences all day long. What's more, GPT-4 can act agentic if prompted correctly. [2] How do you qualify a "real goal"?

[1] https://www.neelnanda.io/mechanistic-interpretability/othell...

[2] https://github.com/hwchase17/langchain


> For that matter, there is strong evidence that LLMs generate mental models internally.

Limited models, such as those representing the state of a game that it was trained to do: Yes. This is how we hope deep learning systems work in general.

But I am not talking about limited models. I am talking about ad-hoc models, built from ingesting the context and semantic meaning of a string of tokens, that can simulate reality and allows drawing logical conclusions from it.

In regard to my example given elsewhere in this HN thread: I know that Mike exits the elevator first because I build a mental model of what the tokens in the question represent. I can draw conclusions from that model, including new conclusions whose token-representation would be unlikely in the LLM's model, which doesn't explain anything about reality, but explains how tokens are usually ordered in the training set.


The relevant keyword you want is "zero-shot learning". (EDIT: Correction; "in-context learning". Sorry for that.) LLMs can pick up patterns from the context window purely at evaluation time using dynamic reinforcement learning. (This is one of those capabilities models seem to just pick up naturally at sufficient scale.) Those patterns are ephemeral and not persisted to memory, which I agree makes LLMs less general than humans, but that seems a weak objection to hang a fundamental difference in kind on.

edit: Correction: I can't find a source for my claim that the model specifically picks up reinforcement learning across its context as the algo that it uses to do ICL. I could have sworn I read that somewhere. Will edit a source in if I find it.

edit: Though I did find this very cool paper https://arxiv.org/abs/2210.05675 that shows that it's specifically training on language that makes LLMs try to work out abstract rules for in-context learning.

edit: https://arxiv.org/abs/2303.07971 isn't the paper I meant, since it only came out recently, but it has a good index of related literature and does a very clear analysis of ICL, demonstrating that models don't just learn rules at runtime but learn "extract structure from context and complete the pattern" as a composable meta-rule.

edit: I think I was thinking of https://arxiv.org/abs/2212.10559 , which asserts that ICL acts equivalent to gradient descent.

> In regard to my example given elsewhere in this HN thread: I know that Mike exits the elevator first because I build a mental model of what the tokens in the question represent. I can draw conclusions from that model, including new conclusions whose token-representation would be unlikely in the LLM's model, which doesn't explain anything about reality, but explains how tokens are usually ordered in the training set.

I mean. Nobody has unmediated access to reality. The LLM doesn't, but neither do you.

In the hypothetical, the token in your brain that represents "Mike" is ultimately built from photons hitting your retina, which is not a fundamentally different thing from text tokens. Text tokens are "more abstracted", sure, but every model a general intelligence builds is abstraction based on circumstantial evidence. Doesn't matter if it's human or LLM, we spend our lives in Plato's cave all the same.


> In the hypothetical, the token in your brain that represents "Mike"

Mike isn't represented by a token. "Mike" is a word I interpret into an abstract meaning in an ad-hoc created, and later updated or discarded model of a situation in which exist only the elevator, some abstract structure around it, and the laws of physics as I know them from knowledge and experience.

> built from photons hitting your retina, which is not a fundamentally different thing from text tokens.

The difference is not in how sensory input is gathered. The difference is in what that input represents. For the LLM the token represents...the token. That's it. There is nothing else. The token exists for its own sake, and has no information other than itself. It isn't something from which an abstract concept is built, it IS the concept.

As a consequence, a language model doesn't understand whether statements are false or nonsensical. It can say that a sequence is statistically less likely than another one, but that's it.

"Jenny leaves first" is less likely than "Mike leaves first".

But "Jenny leaves first" is probably more likely than "Mario stands on the Moon", which is more likely than "catfood dog parachute chimney cloud" which is more likely than "blob garglsnarp foobar tchoo tchoo", which in turn is probably more likely than "fdsba254hj m562534%($&)5623%$ 6zn 5)&/(6z3m z6%3w zhbu2563n z56".

To someone reaching the conclusion that Mike left the elevator first by drawing that conclusion from an abstract representation of the world, all these statements are equally wrong. To a language model, they are just points along a statistical gradient. So in a language model's world a wrong statement can still somehow be "less wrong" than another wrong statement.

---

Bear in mind when I say all this, I don't mean to say (and I think I made that clear elsewhere in the thread) that this mimicry of reasoning isn't useful. It is, tremendously so. But I think it's valuable to research and understand the difference between mimicking reason by learning how tokens form reasonable sequences, and actual reasoning by abstracting the world into models that we can draw conclusions from.

Not in the least because I believe that this will be a key element in developing things closer to AGIs than the tools we have now.


> an ad-hoc created, and later updated or discarded model of a situation in which exist only the elevator, some abstract structure around it, and the laws of physics as I know them from knowledge and experience.

LLMs can do all of this. In fact, multimodality specifically can be shown to improve their physical intuition.

> The difference is not in how sensory input is gathered. The difference is in what that input represents. For the LLM the token represents...the token. That's it. There is nothing else. The token exists for its own sake, and has no information other than itself. It isn't something from which an abstract concept is built, it IS the concept.

The token has structure. The photons have structure. We conjecture that the photons represent real objects. The LLM conjectures (via reinforcement learning) that the tokens represent real objects. It's the exact same concept.

> As a consequence, a language model doesn't understand whether statements are false or nonsensical.

Neither do humans, we just error out at higher complexities. No human has access to the platonic truth of statements.

> So in a language model's world a wrong statement can still somehow be "less wrong" than another wrong statement.

Of course, but so with humans? I have no idea what you're trying to say here. As with humans, in a LLM token improbability can derive from lots of different reasons, including world model violation, in-context rule violation, prior improbability and grammatical nonsense. In fact, their probability calibration is famously perfect, until RLHF ruins it. :)

> Bear in mind when I say all this, I don't mean to say (and I think I made that clear elsewhere in the thread) that this mimicry of reasoning isn't useful.

I fundamentally do not believe there is such a thing as "mimicry of reason". There is only reason, done more or less well. To me, it's like saying that a pocket calculator merely "mimics math" or, as the quote goes, whether a submarine "mimics swimming". Reason is a system of rules. Rules cannot be "applied fake"; they can only be computed. If the computation is correct, the medium or mechanism are irrelevant.

To quote gwern, if you'll allow me the snark:

> We should pause to note that a Clippy² still doesn’t really think or plan. It’s not really conscious. It is just an unfathomably vast pile of numbers produced by mindless optimization starting from a small seed program that could be written on a few pages. It has no qualia, no intentionality, no true self-awareness, no grounding in a rich multimodal real-world process of cognitive development yielding detailed representations and powerful causal models of reality; it cannot ‘want’ anything beyond maximizing a mechanical reward score, which does not come close to capturing the rich flexibility of human desires, or historical Eurocentric contingency of such conceptualizations, which are, at root, problematically Cartesian. When it ‘plans’, it would be more accurate to say it fake-plans; when it ‘learns’, it fake-learns; when it ‘thinks’, it is just interpolating between memorized data points in a high-dimensional space, and any interpretation of such fake-thoughts as real thoughts is highly misleading; when it takes ‘actions’, they are fake-actions optimizing a fake-learned fake-world, and are not real actions, any more than the people in a simulated rainstorm really get wet, rather than fake-wet. (The deaths, however, are real.)


> I fundamentally do not believe there is such a thing as "mimicry of reason". There is only reason, done more or less well.

    if transaction.amount > MAX_TRANSACTION_VOLUME:
        transaction.reject()
    else:
        transaction.allow()
Is this code reasoning? It does, after all, take input and make a decision that is dependent on some context, the transaction's amount. It even has a model of the world, albeit a very primitive one.

No, of course it isn't. But it mimics the ability to do the very simple reasoning about whether or not to allow a transaction, to the point where it could be useful in real applications.

So yes, there is mimicry of reasoning, and it comes in all scales and levels of competence, from simple decision making algorithms, purely mechanical contraptions such as overpressure-valves, all the way up to highly sophisticated ones that use stochastic analysis of sequence probabilities to show the astonishing skills we see in LLMs.


I feel this is mostly going to come down to how we define the word. I suspect we agree that there's no point in differentiating "reasoning" from "mimicked reasoning" if the performed actions are identical in every situation.

So let's ask differently: what concrete problem do you think LLMs cannot solve?


> what concrete problem do you think LLMs cannot solve?

From the top of my head:

Drawing novel solutions from existing scientific data, for one. Extracting information from incomplete data that is only apparent through reasoning (such as my code-bug example given elsewhere in this thread), a.k.a. assuming hidden factors. Complex math is still beyond them, and predictive analysis requiring inference is an issue.

They also still face the problem of, as has been anthropomorphized so well, "fantasizing", especially during longer conversations; which is cute when they pretend that footballs fit in coffee cups, but not so cute when things like this happen:

https://eu.usatoday.com/story/opinion/columnist/2023/04/03/c...

--

These certainly don't matter for the things I am using them for, of course, and so far, they turn out to be tremendously useful tools.

The trouble, however, is not with the problems I know they cannot, or cannot reliably, solve. The problem is with as-of-yet unknown problems where humans, me included, might assume they can solve them, and suddenly it turns out they can't. What these problems are, time will tell. So far we have barely scratched the surface of introducing LLMs into our tech products. So I think it's valuable to keep in mind that there is, in fact, a difference between actually reasoning and mimicking it, even if the mimicry is to a high standard. If for nothing else, then only to remind us to be careful in how, and for what, we use them.


I mean, do you think a LLM cannot draw a novel solution from existing data, fundamentally, because its reasoning is "of the wrong kind"? That seems potentially disprovable. - Or do you just think current products can't do it? I'd agree with that.

What's the easiest novel scientific solution that AI couldn't find if it wasn't in its training set?


> because its reasoning is "of the wrong kind"?

No, because it doesn't reason, period. Stochastic analysis of sequence probabilities != reasoning. I explained my thoughts on the matter in this thread to quite some extent.

> That seems potentially disprovable.

You're welcome to try and disprove it. As for prior research on the matter:

https://www.cnet.com/science/meta-trained-an-ai-on-48-millio...

And afaik, Galactica wasn't even intended to do novel research; it was only intended for the time-consuming but comparably easier tasks of helping to summarize existing scientific data, ask questions about it in natural language, and write "scientific code".


Alright, I'll keep an eye open for instances of networks doing scientific reasoning.

(My own belief is that reasoning is 95% habit and 5% randomness, and that networks don't do it because it hasn't been reflected in their training sets, and they can't acquire the skills because they can't acquire any skills not in the training set.)


Well what you are saying is that the argument of the creators of the model is erroneous and you know better. More power to you.


Bad argument is bad argument, doesn't matter who says it.


> Is there any Researcher who maintains that LLM models contain Reasoning and intent?

Researchers with jobs that depend on promulgating this belief. E.g., "Open"AI employees.


>Is there any Researcher who maintains that LLM models contain Reasoning and intent?

That's the funny thing about the (in)famous OpenAI letter; the first sentence kind of does this:

>AI systems with human-competitive intelligence can pose profound risks to society and humanity, as shown by extensive research[1]

'human-competitive intelligence' sounds like reasoning to me. What's even funnier is that [1] is the stochastic parrot paper, which argues exactly the opposite!


> 'human-competitive intelligence' sounds like reasoning to me.

Yes, and when AIs reach that level of intelligence, we can revisit the question.

However, as long as LLMs will confidently try to explain to me why several footballs fit in an average coffee mug, I'd say we are still quite some way away from "human-competitive intelligence".


Yes, I fully agree with you! There's no reasoning or intelligence in modern LLMs, but the OpenAI open letter and recent comms strongly imply there is.


LLMs cannot be considered general reasoning engines, because they do not perform any kind of reasoning.


Since ChatGPT I've become much more aware of my own thoughts and written text. I'm now often wondering whether I'm just regurgitating the most frequently used next word or phrase, or whether it could actually be described as original. Especially for things like reacting with short answers to chat messages, I am confident that these are only reflexive answers without alternatives, which could have come from ChatGPT trained on my chat log. I feel like knowing and seeing how ChatGPT works can elevate our own thinking process. Or maybe it is only similar to awareness meditation.


In this vein, ChatGPT is a nice way to start thinking about any topic; ask it about its opinion on anything and it will give you the most run-of-the-mill, middle-of-the-road text that is possible, standing for nothing. If you find yourself agreeing with ChatGPT it may be time to reconsider your own thinking!


LLMs may be an approximation of our knowledge, but understand that there’s more to reasoning than a language model. GPT understands how tokens relate to each other, but humans understand how objects, places, and abstract concepts relate to each other as well. We have a little further to go for AGI.

I do believe that GPT4 is a really good approximation of our language though, and feel similarly to you when I respond off the cuff.


Now that is a thought that could not possibly have come from GPT. Thanks, this has given me something to think about.


Logical reasoning is the one thing LLMs are really bad at.


LLMs suck at deduction and factual reasoning. How are they reasoning engines?


> I think we're way past that now, with LLMs quickly taking on the role of a general reasoning engine.

No we're not, and no they are not.

An LLM doesn't reason, period. It mimics reasoning ability by stochastically choosing a sequence of tokens. A lot of the time these make sense. At other times, they don't make any sense. I recently asked an LLM:

    "Mike leaves the elevator at the 2nd floor. Jenny leaves at the 9th floor. Who left the elevator first?"
It answered correctly that Mike leaves first. Then I asked:

    "If the elevator started at the 10th floor, who would have left first?"
And the answer was that Mike still leaves first, because he leaves at the 2nd floor, and that's the first floor the elevator reaches. Another time I asked an LLM how many footballs fit in a coffee mug, and the conversation reached a point where the AI tried to convince me that coffee mugs are only slightly smaller than the trunk of a car.

Yes, they can also produce the correct answers to both these questions, but the fact that they can also spew such complete illogical nonsense shows that they are not "reasoning" about things. They complete sequences, that's it, period, that's literally the only thing a language model can do.

Their apparent emergent abilities look like reasoning, in the same way that Jen from "The IT Crowd" can sound like she's speaking Italian, when in fact she has no idea what she is even saying.


> but the fact that they can also spew such complete illogical nonsense shows that they are not "reasoning" about things

Have you ever seen the proof that 2=1 ? It looks convincing, but it's illogical because it has a subtle flaw. Are the people who can't spot the flaw just "looking like they are reasoning", but really they just lack the ability to reason? Are witnesses who unintentionally make up memories in court cases lacking reasoning? Are children lacking reasoning when you ask them why they drew all over the walls and they make up BS?

You can't just spout that an LLM lacks reasoning without first strictly defining what it means to reason. Everybody keeps going on and on about how an LLM can't possibly be intelligent/reasoning/thinking/sentient etc. All of these are extremely vague and fuzzy words that have no unambiguous definition. Until we can come up with hard metrics that define these terms, nobody is correct when they spout their own nonsense that somehow proves the LLM doesn't fit into their specific definition of fill in the blank.


> Are the people who can't spot the flaw just "looking like they are reasoning", but really they just lack the ability to reason?

Lacking relevant information or insight into a topic, isn't the same as lacking the ability to reason.

> You can't just spout that an LLM lacks reasoning without first strictly defining what it means to reason.

Perfectly worded definition available on Wikipedia:

    Reason is the capacity of consciously applying logic by drawing conclusions from new or existing information, with the aim of seeking the truth.
"Consciously", "logic", and "seeking the truth" are the operative terms here. A sequence predictor does none of that. Looking at my above example: The sequence "Mike leaves the elevator first" isn't based on logical thought, or a conscious abstraction of the world built from ingesting the question. It's based on the fact that this sequence has statistically a higher chance to appear after the sequence representing the question.

How does our reasoning work? How do humans answer such a question? By building an abstract representation of the world based on the meaning of the words in the question. We can imagine Mike and Jenny in that elevator, we can imagine the elevator moving, floor numbers have meaning in that environment, and we understand what "something is higher up" means. From all this we build a model and draw conclusions.

How does the "reasoning" in the LLM work? It checks which tokens are likely to appear after another sequence of tokens. It does so by having learned how we like to build sequences of tokens in our language. That's it. There is no modeling of the situation going on, just stochastic analysis of a sequence.

Consequently, an LLM cannot "seek truth" either. If a sequence has a high chance of appearing in a position, it doesn't matter if it is factually true or not, or even logically sound. The model isn't trained on "true or false". It will, more often than not, say things that are true, not because it understands truth, but because the training data contain a lot of token sequences that, when interpreted by a human mind, state true things.

Lastly, imagine trying to apply a language model to an area that depends completely on the above definition of reasoning as a consequence of modeling the world based on observations and drawing new conclusions from that modeling.

https://www.spiceworks.com/tech/artificial-intelligence/news...


You must have missed the part where I said:

> Until we can come up with hard metrics that define these terms, nobody is correct when they spout their own nonsense that somehow proves the LLM doesn't fit into their specific definition of fill in the blank.

"Consciously", "logic", and "seeking the truth" are not objectively verifiable metrics of any kind.

I'll repeat what I said: Until we come up with hard metrics that define these terms, nobody can be correct. I'll take investopedia's definition for what a metric means, as that embodies the idea I was getting at the most succinctly:

> Metrics are measures of quantitative assessment commonly used for assessing, comparing, and tracking performance or production.[0]

So, until we can quantitatively assess how an LLM performs compared to a human in "consciousness", "logic", and "seeking the truth", whatever ambiguous definition you throw out there will not confirm or deny whether an LLM embodies these traits as opposed to a human embodying these traits.

[0]: https://www.investopedia.com/terms/m/metrics.asp


To elaborate a bit on my own post here:

The sequence "Mike leaves the elevator first" has a high statistical probability. The sequence "Jenny leaves the elevator first" has a lower probability than that. But it still probably has a much higher probability than "Michael is standing on the Moon", which in turn may be more likely than "Car dogfood sunshine Javascript", which is still probably more likely than "snglub dugzuvutz gummmbr ha tcha ding dong".

Note that none of these sequences are wrong in the world of a language model. They are just increasingly unlikely to occur in that position. To us, with our ability to reason by logically drawing conclusions from an abstract internal model of the world, all these other sequences either represent false statements or nonsensical word salad.
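
As a purely illustrative sketch of that gradient (the per-token probabilities here are invented stand-ins for whatever a real model would assign), scoring candidate continuations looks roughly like this; the score only orders the sequences from likely to unlikely, it never marks any of them true or false:

    import math

    # Invented per-token probabilities for each candidate continuation.
    candidates = {
        "Mike leaves the elevator first":  [0.9, 0.8, 0.9, 0.95, 0.7],
        "Jenny leaves the elevator first": [0.05, 0.8, 0.9, 0.95, 0.7],
        "Michael is standing on the Moon": [0.02, 0.1, 0.2, 0.3, 0.01],
        "catfood dog parachute chimney":   [0.001, 0.002, 0.001, 0.001],
        "snglub dugzuvutz gummmbr":        [0.0001, 0.0001, 0.0001],
    }

    def sequence_log_prob(token_probs):
        # Sequence score = sum of log token probabilities (higher = more likely).
        return sum(math.log(p) for p in token_probs)

    for text, probs in sorted(candidates.items(), key=lambda kv: -sequence_log_prob(kv[1])):
        print(f"{sequence_log_prob(probs):9.2f}  {text}")
    # The output is only an ordering from likely to unlikely; "wrong" and
    # "nonsense" are not categories the model has.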


GPT-4 reasons about a lot of gotcha logic puzzles correctly; any pre-GPT-4 opinions should be reconsidered. Those are effectively two different epochs in the history of AI effectiveness and reasoning.

> Me: Mike leaves the elevator at the 2nd floor. Jenny leaves at the 9th floor. Who left the elevator first?

> GPT-4: Mike left the elevator first, as he got off at the 2nd floor, while Jenny left at the 9th floor.

> Me: If the elevator started at the 10th floor, who would have left first?

> GPT-4: If the elevator started at the 10th floor and went downward, then Jenny would have left first, as she got off at the 9th floor, while Mike left at the 2nd floor.

> Me: How many footballs fit in a coffe-mug?

> GPT-4: A standard football (soccer ball) has a diameter of around 22 centimeters (8.65 inches), while a coffee mug is typically much smaller, with a diameter of around 8-10 centimeters (3-4 inches). Therefore, it is not possible to fit a standard football inside a coffee mug. If you were to use a mini football or a much larger mug, the number of footballs that could fit would depend on the specific sizes of the footballs and the mug.

It easily answered all of your questions and produced explanations I would expect most reasonable people to make.


That changes exactly nothing about the validity of my statement.

Yes, GPT-4 is better at this mimicry than GPT-3 or GPT-3.5. And GPT-3 was better at it than GPT-2. And all of them were better than my just-for-fun home-built language model projects that I trained on small <10GiB datasets, which in turn were better at it than my PoC models trained on just a few thousand words.

But being better at mimicking reason is still not reasoning. The model doesn't know what a coffee mug is, and it doesn't know what a football is. It also has no idea how elevators work. It can form sequences that make it look to us like it does and knows all these things, but in reality, it only knows that "then Jenny would have left first" is a more likely sequence of tokens at that point, given that the sequence before included "started at the 10th floor".

Bear in mind, this doesn't mean that this mimicry isn't useful. It is, tremendously so. I don't care how I get correct answers, I only care that I do.


> But being better at mimicking reason is still not reasoning

How do I know people are not using a similar process when they perform "reasoning" but with a way more elaborate model?

Can you prove to me that the two are inherently different in the type of output they produce, regardless of how large an ML model is or can be?

Because if you can't, and they produce the same type of output, the processing could be similar enough to be considered reasoning.


> but with a way more elaborate model?

Simple: I know that humans have intentionality and agency. They want things, they have goals both immediate and long term. Their replies are based not just on the context of their experiences and the conversation but their emotional and physical state, and the applicability of their reply to their goals.

And they are capable of coming up with reasoning about topics for which they have no prior information, by applying reasonable similarities. Example: even if someone has never heard the phrase "walking a mile in someone else's shoes", most humans (provided they speak English) have no difficulty figuring out what it means. They also have no trouble figuring out that it is a figure of speech, and not a literal action.


>Simple: I know that humans have intentionality and agency. They want things, they have goals both immediate and long term. Their replies are based not just on the context of their experiences and the conversation but their emotional and physical state, and the applicability of their reply to their goals.

This all seems orthogonal to reasoning, but also who is to say that somewhere in those billions of parameters there isn't something like a model of goals and emotional state? I mean, I seriously doubt it, but I also don't think I could evidence that.


> but also who is to say that somewhere in those billions of parameters there isn't something like a model of goals and emotional state?

No one, but as is well established, absence of proof of nonexistence isn't an argument for existence. https://en.wikipedia.org/wiki/Russell's_teapot


Correct, but the problem is that the way you prove that for humans is by using the output and inferring it. You can apply the same criteria to ML models. If you don't, you need some other criteria to rule out that assumption for ML models.


For humans I can simply refer to my own internal state and look at how I arrive at conclusions.

I am of course aware that this is essentially a form of ipse dixit, but I will do it anyway in this case, because I am saying it as a human, about humans, and to other humans, and so the audience can just try it for themselves.


> I know that humans have intentionality and agency.

You assume that. You can only maybe know that about yourself. But my question was a bit different. How do you know that the ML model doesn't?

> about topics for which they have no prior information, by applying reasonable similarities.

This is a contradiction. If you have no prior information about a topic you can't know even what topic is similar.

> Even if someone has never heard the phrase "walking a mile in someone else's shoes".

Same for ML models. They don't have a representation of every possible prompt.


> You assume that. You can only maybe know that about yourself.

I can also only say with certainty that planetary gravity is an attracting force on the very spot I am standing on. I haven't visited every spot on every planet in the universe after all.

That doesn't make it any more likely that my extrapolation of how gravity works here is wrong somewhere else. Russell's Teapot works both ways.

> How do you know that the ML model doesn't?

For the same reason why I know that a hammer or an operating system don't. I know how they work. Not in the most minute details, and of course the actual model is essentially a black box, but its architecture and MO are not.

It completes sequences. That is all it does. It has no semantic understanding of the things these sequences represent. It has no understanding of true or false. It doesn't know math, it doesn't know who person xyz is, it doesn't know that 1993 already happened and 2221 did not. It cannot have abstract concepts of the things represented by the sequences, because the sequences are the things in its world.

It knows that a sequence is more or less likely to follow another sequence. That's it.

From that limited knowledge, however, it can very successfully mimic things like math, logic, and even reasoning to an extent. And it can mimic them well enough to be useful in a lot of areas.

But that mimicry, however useful, is still mimicry. It's still the Chinese Room thought experiment.


I agree about the utility part.

However, I don't really accept the idea that this isn't reasoning, but I'm not entirely sold either way.

I'd say if it mimics something well enough then eventually it's just doing the thing, which is the same side of the argument I fall on with Searle's Chinese Room Argument. If you can't discern a difference, is there a difference?

So far GPT-4 can produce better work than maybe 50% of humans, and better responses to brain-teaser questions than most of them too. I'm at least in a bubble, so I don't run into people that stupid very often, which makes it easier for me to see the gaps.


> I'd say if it mimics something well enough then eventually it's just doing the thing

Right up to the point where it actually needs to reason, and the mimicry doesn't suffice.

My above example about the football and the coffee mug is an easy one; the objects are well represented in its training data. What if I need a reason why the service ping spikes every 60 seconds: here is the code, please LLM, look it up. I am sure I will get a great and well-written answer.

I am also sure it won't be the correct one, which is that some dumb script I wrote, which has nothing to do with the code shown, blocks the server for about 700ms every minute.

Figuring out that something cannot be explained with the data represented, and thus may come from a source unseen, is one example of actual reasoning. And this "giving up on the data shown" is something I have yet to see any AI do.


I could say the same about most second-rate software engineers. That's why I'm not moved by your arguments. There are plenty of people who are just as stupid and who will give you confidently wrong answers.


I like that idea a lot. "General reasoning engine" sounds much more sensible than "AGI", especially given that the "I" in AGI is so poorly defined.


> These tools are useful now

Guns are also useful tools because you can take them into a store and get things for free as a result. But that doesn't make it okay to do.


Robbing people does not benefit anyone but yourself. This is a terrible analogy.


I can take the things I stole and sell them to other people, which is exactly what Github is doing here.


I think AI is here to stay (obviously) but we do need a much better permission model regarding content, whether this is the writing on your blog, your digital art, your open source code, video, audio...all of it.

The current model basically says that as soon as you publish something, others can pretty much do with it as they please under the guise of "fair use", an aggressive ToS, and the like.

I stand by the author that the current model is parasitic. You take the sum of human-produced labor, knowledge and intelligence without permission or compensation, centralize it with tech that only about two companies have or can afford, and then monetize it. Worse, in a way that never even attributes or refers to the original content.

Half-quitting Github will not do anything, instead we need legal reform in this age of AI.

We need training permission control as none of today's licenses were designed with AI in mind. The default should be no permission where authors can opt-in per account and/or per piece of content. No content platform's ToS should be able to override this permission with a catch-all clause, it should be truly free consent.

Ideally, we'd include monetization options where conditional consent is given based on revenue sharing. I realize that this is a less practical idea as there's still no simple internet payment infrastructure, AI companies likely will have enough non-paid content to train, plus it doesn't solve the problem of them having deep pockets to afford such content, thus they keep their centralization benefits. The more likely outcome is that content producers increasingly withdraw into closed paid platforms as the open web is just too damn hostile.

I find none of this to be anti-AI, it's pro-human and pro-creator.


An important legislative step for this is that anyone creating and publishing an AI learning model needs to be able to cite their sources - in this case, a list of all the github repositories and files therein, along with their licenses.

If that is made mandatory, only then can these lists actually be checked against licenses.

There will also need to be a trial license, to establish whether an AI learning model can be considered derived from a licensed open source project - and therefore whether it falls under the license.

And finally, we'll likely get updated versions of the various OSS licenses that include a specific statement on e.g. usage within AI / machine learning.


In the age of reposts and generative AI, "attribution" is irrelevant. Nobody cares who originally made some content, and it truly doesn't matter.

>The more likely outcome is that content producers increasingly withdraw into closed paid platforms

Nah. You didn't get paid to write that post, did you? You did it for free. People nowadays are perfectly willing to create free content, and often high quality content, sometimes anonymously, even before generative AI.

There's no need for financial incentives anymore. As content creation becomes easier, people will start creating out of intrinsic motivation - to express themselves, to influence others and to inform. It's better that way.

Restricting content so that others can't benefit from it is not pro-human or pro-creator, it's selfish and wasteful. We should get rid of licenses altogether and feed everything humanity creates into a common AI model that is available for use by everyone.


I maintain a popular OSS project whose code is hosted on GitHub [1].

The entire "Github doesn't give back" argument is wrong. For "free", Github lets me host our code, run thousands and thousands of hours of free CI (and we are aggressively using it), host releases and docker images, and lets us manage thousands of issues. Also, Copilot is free when you are eligible to it, so we are fortunate enough to not have to pay for it as well.

Yes, they monetize our attention and train Copilot with the code, but the only argument which can't be used against this company is that they don't give back.

[1]: https://github.com/monicahq/monica


Why hasn't someone just changed the GPL license already:

"If you train an AI on this code, you must release the source code and generated neural net of that AI as open source" or something to that effect.

It won't stop it, but it will slow it down, and it seems like the right T&Cs to put on training against GPL code because it gives an advantage to open source AIs, however minor.


Implying the "AI" community respects intellectual property rights.


Also implying software companies do respect open source licenses


I'm not sure why you're being downvoted. Many companies ignore open source licenses because it's difficult to prove they aren't following them.


Aren't they claiming that it's fair use? IANAL, but wouldn't that make the licence irrelevant if training AI/ML models was found to be fair use? And if not, it's a licence violation anyway?


It will be difficult to claim fair use if training an AI model is explicitly mentioned in the license, I think.

Currently GPL says:

> To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work.

> A "covered work" means either the unmodified Program or a work based

If in addition it said something like "Generative AI models trained on the program source code, as well as the text produced with such models, are also a work "based on" the Program", then there would be little room for a fair use claim, I think.


> It will be difficult to claim fair use if training an AI model is explicitly mentioned in the license, I think

Fair use is a statutory right (codifying what courts had previously found to be an aspect of Constitutional free expression rights limiting the copyright power) that limits the exclusive rights of copyright owners. It can't be reduced in scope by license terms, because it deals with what the owner has no right to control in the first place.

(You may be confusing “fair use” with “implied license”, and, yes, explicit license terms would be a powerful argument against an implied license argument.)


I don't mean "implied license".

My impression is that they claim fair use only because there is no other ground for ML training on GPL code. The license only describes other uses of the code, so the first question is on what grounds OpenAI uses the code at all. And the only thing OpenAI comes up with is fair use.

ML training does not fall under free expression or similar basic rights, I think. It's only because it seems too restrictive to forbid scanning code that is publicly available for people to read, and because the license itself does not mention such use, that fair use may be considered as some kind of justification.

I, of course, may be mistaken, especially since I haven't studied this subject deeply.

Also, for authors who are against ML training on their GPL code, it's better to prove it violates even the current GPL version, rather than just introduce a modification to the license, because a new license will only cover new code, while old versions are already available under the old license.

On the other hand, such an amendment to the license will not be harmful, even if later proven to be useless. And it will clearly state the copyright holder's position regarding the ML training, instead of some subtle reasoning we should apply currently.

Some people may want an opposite amendment to their license - ML training does not produce a "work based on the Program" and can be freely performed by anyone.


To me, that's like saying there would be little room for a fair use claim for news reporting/parody/[insert legitimate fair use here] if the licence expressly forbids it. IANAL though.


IANAL either, but the license still applies to the end-user (the person who trained the AI) so it would seem like it would add at least 1 non-trivial license violation for that user?

Edit: I googled "fair use copyright US" and have now decided that US copyright law is stupid.


> I googled "fair use copyright US" and have now decided that US copyright law is stupid.

Fair use helps artists, journalists, and (indirectly) the general public. It prevents censorship of critical or opposing views, among a lot of other uses that are beneficial to society (see sibling comment). US copyright law is backwards in a lot of ways but this isn't one of them.


> Edit: I googled "fair use copyright US" and have now decided that US copyright law is stupid.

Don't be like that. Fair use is what allowed VCRs to continue existing, what allows Google Images and Books to exist, what allows the development of emulators... I could go on.


"I googled "fair use copyright US" and have now decided that US copyright law is stupid"

How long exactly did you study the issue? It is very complex.


Yes, if it's fair use then the license is irrelevant (at least in the US).


When GH devised Copilot, they could have (internally, at GH) decided to make a two-tier model, one tier trained only on unrestrictively licensed code, the other bringing in more restrictively licensed code too. And then offer them to the GH-using public as two different functionalities. An intelligently differentiated product line for intelligent people.

But: NOOOO.

In order to close off this possibility, which would restrict Copilot revenue, they instead rolled out a single undifferentiated product with lots of "gee whiz!" and associated hoopla, and made sure to offer it for free for a while to suck everyone in and head off criticism.


They are really not going to care about what you put into your license file, they are just going to claim that the use of GitHub binds you to their terms of service and that this supersedes your own license. Good luck fighting that.


ToS doesn't supersede copyright though, does it?

The real rub will be the first court precedent on whether GPTs infringe on source data IP.

Could see it going either way: fundamentally transformative or not.


You agree to Microsoft's ToS before you can put any code on Github, regardless of the license.

You can't opt out of those terms, regardless of your license, just as you can't opt out of Facebook's terms, which give them the right to use your content for their business or marketing purposes, even if the various chain posts that have spread there might claim otherwise.


The vast majority of the code I've written that is on GitHub isn't on GitHub because I would ever have stooped to putting it there, but because it is open source under a license that lets other people redistribute and edit the code (most often GPL or AGPL; maybe some older code under BSD) and they have chosen to use GitHub (which makes me sad, but as far as I'd have been concerned is totally within their rights). Are you claiming that people should not be allowed to clone other peoples' open source projects and put copies up on GitHub?


> Are you claiming that people should not be allowed to clone other peoples' open source projects and put copies up on GitHub?

If the GitHub ToS supersedes the author's own licence, then I guess the uploader is effectively relicensing the code without the author's permission. That would mean the author has a cause of action against the uploader, but not against GitHub.

I personally dislike git; I find it too complicated, because of features I don't need. Microsoft has always disliked FOSS and the GPL, and I suspect that Copilot is a deliberate effort to undermine it.


Well upthread the discussion changed to using a non-open license that prevents people from training AI on it. If you released software under such a license, someone re-uploading to Github would probably be violating their terms or yours. Regardless, Microsoft would probably remove the repo if you contacted them to let them know you're the copyright holder, and the software license is incompatible with their terms.

It remains to be seen if they have a way to then clean their training data of the influence.

It would be the same situation if someone uploaded any other proprietary code.


You shouldn't have to, though; they have a responsibility of their own to check that they have the rights to someone else's copyrighted work before they do what they want to do, rather than doing it anyway and waiting for the rights holder to come to them.

Copyright isn't 'opt in'.


I mean, if someone uploads a repo that contains proprietary code that also contains CI actions from the proprietary codebase, formatted the same as github actions, they're going to run those actions under the assumption that they are allowed to (even though they aren't, because it means they're running proprietary code). It's all automated. The person uploading the proprietary code would be the one infringing in that case.


Yes, but that's a different discussion. In this case the person does have the rights under the GPL to do what they do, but GitHub does not have an automatic right to assume that that gives them the right to enforce their ToS on the original copyright holder, which they effectively do.


> Copyright isn't 'opt in'.

Before 1989, copyright protection was opt-in in the US.

https://en.wikipedia.org/wiki/Copyright_notice


It's 2023.


If code with a license, say GPL, goes out somewhere else, by someone else (and therefore I don't have the right to change the license) and then I fork it, as per the license I keep the license and put it on github, and Github violates that license, aren't they violating the law? Don't they then have to remove that code?

If that's the case then in that scenario the license supersedes the ToS right?

Now imagine this: I write some code, license it, but don't publish it yet. It's licensed. Then I upload it to github. Does the license supersede the ToS? Doesn't github have to remove the code as a ToS violation? What if I show my roommate the licensed code first, does that count as publishing?

The whole thing is absurd on its face. All code is licensed before it ever goes on github. The license always supersedes the ToS. All licenses violate the ToS. All code on github should be removed by Microsoft for ToS violations or because Microsoft cannot abide their licenses. Their ToS is fucking illegal.


> If code with a license, say GPL, goes out somewhere else, by someone else (and therefore I don't have the right to change the license) and then I fork it, as per the license I keep the license and put it on github, and Github violates that license, aren't they violating the law? Don't they then have to remove that code?

You are violating the law, probably. The ToS would say something like "I hereby declare that I have the right to submit this software under the ToS".


I'm not violating the law, I'm violating the ToS. They should then remove my account and the offending code, lest they go on to violate the law, no?


Yes but I guess it won't happen until someone complains. Similar to other content, e.g. YouTube, but in reality nobody requests takedowns of forks/copies.


Alright, now suppose someone does. Doesn't that mean Microsoft has to rework all work they made with these codebases using this legal argument and not a fair use one? Doesn't even doing this set them up for a potentially very expensive compliance action?


I think what you say is true: either they trained on any open sourced code under fair use, no matter whether it was published on github or anywhere else, ignoring the license, OR they trained on data that potentially does not comply with their ToS (e.g. uploaded by someone who is not the author; regardless of license, they couldn't legally agree to a ToS that gives away additional rights to the work).

However, the reality is that this is all extremely muddy, far from proving that software A has copied some code from software B where you can just compare the source code. There are too many muddy steps, and you can bet that Microsoft will just get away with it.


> If code with a license, say GPL, goes out somewhere else, by someone else

If the code is GPL licensed, you can't relicense it under a non-FOSS license like you're talking about

> Don't they then have to remove that code?

Yes, if code gets uploaded whose license is incompatible with Microsoft's terms, they probably do have to remove it.

> Now imagine this: I write some code, license it, but don't publish it yet. It's licensed. Then I upload it to github. Does the license supersede the ToS? Doesn't github have to remove the code as a ToS violation

Again, yes, they probably do, and they also probably have an obligation to clean their training data of it. However, if you're the copyright holder of the code and you agree to their ToS before uploading, they might make the case that you agreeing to their ToS does grant them the license to use it in training data.


Not re-license. I upload with the same license I got it with, as per the license.

So then Github is breaking the law a significant portion of the time, at the very least?


Microsoft/Github makes a reasonable attempt to remove infringing code. If you obtain a copy of proprietary source code owned by Apple on the dark net, and upload to github, they'll definitely remove that. If companies were responsible for user-uploaded content that the company takes reasonable steps to remove, no one would be able to accept user-uploaded content in the first place.

Facebook wouldn't be able to allow users to upload photos.

Hacker news wouldn't allow me to post this comment (someone else could own the copyright, right?)


If Github's ToS says that they have carte blanche to do whatever they want with FLOSS-licensed code, including relicensing it, then either every single codebase on github violates their ToS or their ToS violates every single license and therefore the law. A reasonable attempt under these circumstances would be to remove every single FLOSS-licensed repository on github, so I'd argue no, they do not make a reasonable attempt to remove infringing code.


You are conflating many different things.


This is true, but a ToS should not be able to override an important law such as copyright, which provides you with several inalienable rights that you can only contract out of with your explicit consent. Doubly so if the ToS are changed after you post your code there without your explicit approval. I could put text in the ToS of my website that you owe me your firstborn, but that wouldn't make it legal or enforceable.


Microsoft cannot opt out of common law either


It shouldn't, but good luck fighting the 500 lb gorilla on its home turf. Most people will avoid the fight, even if they are in the right.


The ToS is necessary for GitHub to provide their services; the wording is pretty carefully constructed so that GitHub are safe to change their service and develop new things. Loosely speaking: "by uploading code you grant GitHub permission to blah blah with that code".

I'm not sure how well it will go down in court fighting this... since we agreed to it. But the more interesting question will be whether the result is a complete "you lose" and GitHub walks away, or whether they are forced to take actions in order to defend the copyright of users producing content... a "Code ID" type system that warns you if the code you're uploading is too similar to someone else's, in order to allow you to use the fun new AI tools to make code and pay GitHub, but also simultaneously defend users' legal intellectual property rights.


I just want to make sure you appreciate that if you really believe this argument then GitHub can only be used by the people who actually directly own the copyright on projects; and if you, for example, want to clone and edit my software (the vast majority of which I explicitly never uploaded to GitHub) then you wouldn't be allowed to (which doesn't seem like either the intention or the way it is commonly used)... and like, it would essentially be impossible to use GitHub to work on an open source project that has some long storied history with many hundreds of contributors without going back and getting all of them to agree.


Ah yes, good point: plenty of the people that fork projects do not actually have the copyright to that code to begin with; they just use github while they themselves are in compliance with the license. That definitely does not give GitHub rights that they would otherwise have to negotiate with the original copyright holders for. 'Open source' does not equate to 'public domain', and GitHub effectively seems to be making that claim.


I'm pretty sure that's a narrower interpretation than GitHub are aiming for. I'm just paraphrasing the parts of GitHub's ToS that I can remember, since current debate on the topic has led to me remembering a few important parts reasonably well, but I've certainly not memorised them. So this is a good opportunity for me to go re-read them and quote them directly... (also, in case anyone is about to mention it... I am aware that I'm linking to the current incarnation of the ToS and it may have changed... but there have been equivalent sections in the ToS for years, and this is pretty standard stuff for User Generated Content licenses, and digging up Internet Archive links to specific historical versions is a bit further than I feel necessary for the purposes of this specific reply)

The relevant section is this:

GitHub Terms of Service: Section D, Sub-Section 3 - https://docs.github.com/en/site-policy/github-terms/github-t...

The phrase relevant to your point is "If you're posting anything you did not create yourself or do not own the rights to, you agree that you are responsible for any Content you post; that you will only submit Content that you have the right to post; and that you will fully comply with any third party licenses relating to Content you post."

It's fair to interpret that as GitHub not going to be the copyright police. The bit at the end where I suggest a "code ID" is more of a thought experiment as to how they could continue to offer the service while complying with a potential adverse ruling that doesn't ascribe blame to them or the service. There's another section of the ToS that I, with my "knows slightly more about law than average but absolutely not a lawyer" hat firmly on, feel will be how GitHub's legal team at least try to make short work of the lawsuit; their success with this tactic is a matter for the Courts, and I'd love better legal scholars to weigh in.

GitHub Terms of Service: Section D, Sub-Section 4 - https://docs.github.com/en/site-policy/github-terms/github-t...

Which reads thus:

"We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program."

For me the key quote is "including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users;". Now there's some legal arguing to be done about whether charging for the AI constitutes an infringement of the second paragraph, which opens with "This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service", but that's a very different argument to what I see a lot of people making. People are arguing (generally speaking) from the standpoint of "it's not right, this violates my rights as the author, having selected this license for my code/commits and published it under that license for others to share" ... not "I didn't agree to GitHub selling my content and this constitutes a violation of GitHub Terms of Service: Section D, Sub-Section 4, where they told me that they would not sell my content".

But broadly speaking, unless the argument shifts to GitHub Terms of Service: Section D, Sub-Section 4 and classifying this as an unapproved sale of the user's content, I don't see how GitHub are not well within their rights to have trained the AI model and offered it as a service. By agreeing to the ToS we agreed to Section D, Sub-Section 3, where we promise to only post code we have the rights to post, that we the users will comply with the legal complexities of third-party licenses, and that we are basically responsible for not posting stuff to GitHub for which we can't grant GitHub the requested legal rights. Combined, that means we gave them permission to use our code and commits, regardless of any license files we may have put in the repos, to train the AI model. We can definitely argue derivation and what justifies a sale, and I'd be inclined to say they may actually have breached that term, but no one I've read is talking about that; it's all about copyright infringement for AI-generated code and moral rights with respect to using the code to train the model, not a clear-cut contractual breach of the Terms of Service that GitHub may or may not have perpetrated on us as the other party agreeing to be bound by the contract.


The key distinction I'm interested in is providing the GitHub (or any similar product) "Service" vs selling a separate, derived product (Copilot / ChatGPT).

A: Common ToS to say that a product's owner obtains a license to user content for purposes of providing that user the product service.

B: Somewhat common ToS to extend that to providing the product service to third party users (i.e. use your content for other users of the service), but depends on business model (e.g. most social* businesses).

C: A lot less common ToS to obtain a right to distribute user content in derived products.

A number of sites have gotten into hot water with their userbase over trying to update their ToS from B to C. From memory... Adobe Cloud, DeviantArt, maybe some others?

Typically this gets flak in creative communities, given that it is many people's business, and they're more concerned about distribution rights than your average coder.

At its base, OpenAI/Microsoft/etc. will eventually run into the exact same issues that bedeviled the Linux kernel in the 1990s, except with a much thornier IP ownership question (given the greater number of parties).


But... we're all aware - as is GitHub - that plenty of the content there is not posted by the original copyright holders, who are the only parties that are able to enter into such a contract. That was the reason for GitHub coming into existence in the first place. You can't turn around a couple of years later and start arguing that the use of GitHub allows for a blanket exemption on copyright law, which is effectively what this amounts to.

The GitHub ToS is written by GitHub; it's not a contract in the sense that no consideration has been given to the other party, and as such it isn't legally binding on that other party. But regular law, such as copyright law, still applies to GitHub.


It's the same as other user generated content sites... The ToS is to legally shift blame from GitHub to the users... and that's what made me think of "code ID" actually, since GitHub have a firm defence in the form of "users doing illegal things isn't our fault, we asked them not to and tried to kick people off when we found out they were violating the terms". But they might still get slapped around a bit by the Court and need to implement some form of safeguards, the way YouTube was forced to, because your point about how binding the terms of service are when the consideration is "use of this service in exchange for agreement" is true: there is not a super strong contract here. It's nominally more binding than the average clickwrap pre-install EULA, since the consideration in exchange is use of the service itself, but as case law around things like scraping and other internet activity has shown, it's definitely not as binding as a physically signed sale contract would be...


It shouldn't matter if the copyright holder agreed to it directly, if they've published the original code under an open source license. Since open source licenses all allow people to use the code for "whatever"

Even GPL doesn't (yet) include a clause saying the code can't be used to train AI unless the AI itself is open source


> Since open source licenses all allow people to use the code for "whatever"

That's not what they allow. Copyright being a 'right', it allows you to pass those rights on to others and to retain some for yourself. If not explicitly passed on, the right still rests with the original author; there's plenty of precedent for that.


To take an example: someone who used MIT licensed code but doesn't reproduce the license.

Therefore isn't following the terms of the copyright grant, ergo doesn't have a license for use, ergo is violating copyright.

Now what does that look like when I take 100 different open source licenses, including MIT, put them in a GPT blender, and then productize my output without following any of the licenses?

... makes you think there might be a legal component to why OpenAI switched to a SaaS model. Although I believe they'd still be in hot water over any AGPL et al. code.


I can't wait for this stuff to be legislated to establish once and for all what the legal status is.


This is why I only put code on GitHub if I want it to be seen by everyone (including Big Data).


It is a heavy gorilla.

1. Huge company.

2. Impossible to prove that a weight of -0.7 in a neural net means they used your code.

3. The code spat out by the bot isn't your code.


Apple, Google, Meta, Amazon, Nvidia use Github. They together are a bigger gorilla.


They are not your gorilla


And all of them would have jumped at that chance if they had seen it.


> ToS doesn't supersede copyright though, does it?

It does unless you can afford to sue Microsoft.


> ToS doesn't supersede copyright though, does it?

But in what way does reading a copyrighted work and then producing a mass of numbers as a result infringe copyright?


The copyright infringement comes about later, when that mass of numbers is used to produce a topically related work. The same rules apply for humans -- see the concept of "clean room implementation".


It really doesn't. Prose isn't source code. Learning something, then later writing something else isn't copying.


My limited understanding of case law is that transformative use is still judged very human-centrically.

E.g. the courts take a dim view of any attempt to create a machine (in the abstract sense) that takes in copyrighted works and churns out similar-but-uncopyrighted works


> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.


By the same logic: if I read your code and learnt something from it, do I need to open source all my thoughts?


"And the code generated by the model also falls under GPL"


I abandoned github the day that they (and others, including people here) started arguing that their ToS trumps your code's license. That's absurd. It's authoritarian, it's hostile, it's an act of enmity. Fuck all that bullshit. I do business with no entity, period, that treats me with that level of disdain.


Yeah, and it's, like, nonsensical for the service: if you are working in open source, the reality is that a lot of the time you are working with software you don't own. I develop a lot of software, and the vast majority of it was open source... but I've only ever put two projects of mine on GitHub (and one only because I was working with some other people and I essentially got outvoted ;P). And yet, if you search for my code, I'm sure you can find almost all of it on GitHub, because it was open source and other people wanted to be able to edit it or even merely redistribute it... which I'd have said is their right, but I guess not if supposedly that overrides the license on the software? Or, if this were the case, how would one expect some large/old open source project with a ton of prior contributions--which is normally fine, as everyone has the same rights under the license and so you just all mix your code together and are happy: you don't actually need some central organization with ownership until you want to change the license (which is something many people explicitly don't want to ever happen)--to be hosted on GitHub? Even simpler: most of Google's code is open source--such as the Android Open Source Project, or Chromium--but they don't host it officially on GitHub... I guess it isn't OK for anyone to work on this stuff on GitHub either, right?


Yup, and this contrived conundrum is proof positive of a truth: all code is licensed before the ToS is agreed to, whether published or not. Licenses override ToS, which means that Microsoft needs to remove all code on github either due to ToS violations, since their ToS directly contradicts all licenses, or because the licenses contradict their ToS.

What it comes down to is that the Github ToS is illegal.


Absurd is communities like Elm where all identity and package management must be published on Microsoft GitHub or you and your code can't be a part of the community repository.


their ToS trumps your code's license

wait, what?

do you have more details on this? what rights does the ToS claim that would violate an existing license?


Is this their official stance? I'd love to read about it but web searches lead me back to this comment section.


Never heard of this, that's quite shocking. Do you have any source?


If someone were to break Microsoft licensing terms and give me a copy of windows code

Then if I read that and built my own understanding

Then if I used that knowledge to implement my own version of windows that was compatible with Microsoft’s and distributed it under my own license

Would that be legal?

WINE etc are built in clean room environments for good reasons.


This. It shouldn't matter whether copyright violation was generated by a human brain or via an instrument (AI). ReactOS developers also rely on clean room approach: https://reactos.org/wiki/ReactOS_FAQ#Is_ReactOS_based_on_Mic...


You can find the Windows XP source code publicly on github right now. So I'm not sure how serious MS actually is about this...


"When I was young I found X. It was wonderful and full of stars! As I grew up, things changed, but I stayed with X because I was used to it. Finally, Y happened and I have to leave X because I really dislike Y. Goodbye! (ps. I'm not actually leaving because I want to be involved with things that are still going on there, but I want it to be known I don't really like X anymore)"

Replace X with GitHub, Facebook, Twitter, Instagram, Usenet, IRC, ... this is an archetypal growing up and facing change in the world story.


> Replace X with GitHub, Facebook, Twitter, Instagram, Usenet, IRC, ... this is an archetypal growing up and facing change in the world story.

It might have the same structure, but that is merely a grammatical structure, or maybe a structure of reasons. It says nothing about the actual reasons themselves. We should not dismiss an argument just for following a well-known structure. For all of the things you listed there can be valid reasons to leave them, and those could be put in that structure or another typical one. I think it becomes even more natural to have that structure if we consider the "flow" of time. Of course things are going to change. But that does not invalidate the argument at all.

Maybe you only wanted to highlight the structure. I don't know. I merely want to mention that arguments should not be dismissed purely based on their structure.


I have seen this type of advertisement for product Y a lot lately. You don't have to bash one product just to advertise another one. AI is not going away; you either embrace it now or after a few years. With usage, it will evolve and get better over time, both as a product and in the solutions it provides.


While I agree with the general observation of the article, the tone and wording feels needlessly harsh and subjective. Sentences like „I could go to GitLab, but they have their own fair share of issues and a disgusting UI/UX.“ really diminish the article and by association, also the author.


It's a personal blog. The author has no obligation to keep it to the standard of a trade magazine. Blogs are supposed to be subjective and it's okay for a blog post to be harsh if the author feels so.


Sure. Just saying that for me personally, it’s a turn-off and makes me take the article less seriously.


For me it's not just the lack of professionalism; the writing also makes the author sound pretty junior and inexperienced with how the world works. Not that there's anything wrong with that (we've all been there). But it can cheapen the discussion, and it feels like what really needs to be untangled gets overlooked by stuff like "I was upset because they changed the UI and then I got used to it." Not very interesting details and not a very useful thing to discuss.


The author is a college freshman, and that seems pretty par for the course for someone in their late teens. It's not really an "article" targeted for HN critical review, just a blog post on a personal site.


This is what I noticed too.

We have a kid with no experience in the real world telling us what they think. Why should we care? I actually don't


Nice display of arrogance there.


I appreciated the specific word choice there; 'disgusting' implies a strong, visceral reaction, as if to say GitLab's work is so far from the author's taste that it blocks them from considering the product, and didn't seem all that harsh if that's how the author actually feels.

In a counterintuitive way, it was nice to read -- it says that UI is important; that it shouldn't be an afterthought; that it has the power to actively invite people in or greatly repulse them instead.


What is so bad about the UI/UX? I'm not a frequent GitLab user but it doesn't strike me as anything out of the ordinary.


A few things I personally don't like:

- Use of space: GitLab has more empty space across the site. Elements have a bit too much padding for my tastes and are spaced a bit too far apart. Valuable screen space (top center) contains a lot of needless information on GitLab (table headers, project ID, language breakdown, a prompt to add a CONTRIBUTING file) which is either not present or located off to the side on GitHub.

- Color and typography: GitHub is one of the best sites at this, using pops of color for important buttons and really fun typography. I love how GitHub has your avatar "speak" a commit description, it's a great little touch. GitLab's design feels like it has less personality. Quick comparison: https://imgur.com/a/Ejd1Q1X

- Attention to detail: A bit less of it on GitLab -- icons and text don't vertically align in some places; several of the dropdown menus could use a bit of visual improvement, or have an ill-fitting item or two; in my screenshot above, GitLab's "Commit message" and "Target Branch" have inconsistent capitalization.


The main page has a huge prominent “Get Free Trial” button, while the “Sign In” button is small and hidden in the menu. A common dark pattern.

The performance is sluggish.

There's always a lot of buttons and stuff on the screen that most users will probably never use, making the UI cluttered.


I actually agree with the author. Using GitLab is unpleasant. I've also tried cgit, Rhodecode, Gitea, Bitbucket, &c. GitLab just feels kludgey, personally.

I find comments on “professionalism” about a personal blog to be rather snooty and gatekeepy in tone. Reminds me of the unpleasant atmosphere of Lobste.rs.


I don't understand why GitLab (as in gitlab.com) doesn't focus more on the discovery part in order to make people feel more welcome.

I was using github.com/explore almost daily while it lasted. The trending repositories by language were an amazing tool to discover new tools and ideas.

I really wish GitLab had something equivalent to this!


GitLab team member here.

In our 15.10 release (March 22), we added a new section called Explore that helps with content discovery and includes a tab for Trending projects that can be filtered by language. You can read about it in the release notes [0] or just go check it out [1].

0 - https://about.gitlab.com/releases/2023/03/22/gitlab-15-10-re...

1 - https://gitlab.com/explore/projects/trending


Gitlab is a bit weird in that regard. Their standalone self-hosted use of it is great, but overall discovery is extremely poor on both self hosted and cloud editions.

They seem to have based the whole thing on the way Bitbucket works and completely ignored that being able to find a repo you've not been directly added to would be slightly useful.

If someone from gitlab is reading this and thinking "But you can find things" - you're blinded to the issue by being an internal user. There might be ways of doing it, but it's a convoluted mess and doesn't come close to github in that regard.


I would probably agree that GitHub is better for discovery than GitLab, but the OP's comment is funny to me, because having worked extensively with both, I'm a much bigger fan of GitLab's UI and UX for builders. Groups and nesting are massive when it comes to being able to work with teams, and the fact that you can set options across groups at any level eliminates whole classes of things that are frustrating problems on github, or anything else that uses flat hierarchies.


I wrote a (public) uBlock filter list to hide this social media-like stuff. The “you might also like” genre exists to keep you on the platform and I have enough distractions.


I think gitlab just gave up on trying to compete with github. They don't care about anything other than enterprise use cases, which are mostly on private instances.


Yeah, I agree that some words could have been chosen better, but overall it changed my mind about the use of GPL-licensed code. Copilot and other ML tools should not use GPL-trained models to create proprietary-licensed code.


If someone misuses code people put a lot of time into, they may eventually get a cease & desist and then be sued, especially for open source code under licenses like the GPL.

I don't care for AI ethics; it's simply a breach of copyright to take my code and reuse it somewhere (there's a lot of my code on GH that isn't licensed at all). Since OpenAI can't guarantee it won't breach licenses or copyright, it should simply not allow code questions, the same way it doesn't allow sexual or illegal content "just in case".

But wait, no, that one's too much of a cash cow.

I'm sure it would happily reproduce 1:1 a piece of leaked source code some company owns, and OpenAI would tell you their hands are tied and there's nothing they can do.

Of course, when it comes to making sure it doesn't say "fuck" or any variation of it, that instead seems to be a high priority.

And people go on defending it, as if there's nothing OpenAI could do. Yes, people are also trained on code, but people generally don't have the capacity to remember a 50+ line function 1:1, and if they do, they can likely remember the license, too. It's a non-argument.

And then, to top it all off, tell me with a straight face that people have not gotten in trouble for writing a similar enough work, e.g. in academia.


> That there was nothing being given back.

Out of curiosity: How much does keeping github afloat cost per annum?

And not just afloat, but fast, reliable, secure and convenient.

And now we factor in how many users github has, and how many of them use it for free. Hell, I'm sure more than a few companies use it for free.

Microsoft is a corporation. Corporations want to make money. This is hardly a secret, and they can hardly be blamed for it. If people want to use a free offering of a company, well, then they have to be aware that the company in all likelihood benefits in some way from that free offering as well. Because all that compute, all that bandwidth, all that storage, and all the people developing & maintaining it, don't come for free either.

If I don't like it, I either have to use an alternative run by a different company, or setup my own git server.
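
For what it's worth, the "set up my own git server" option is less work than it sounds. A rough sketch, using Python only to drive the same git commands you would type by hand; the server, user, and repository names are placeholders, and it assumes a bare repo (e.g. created with `git init --bare repos/myproject.git`) already exists on the server:

    import subprocess

    # Hypothetical self-hosted remote; replace with your own server and path.
    REMOTE = "user@myserver:repos/myproject.git"

    def git(*args):
        # Run a git command in the current repository, failing loudly on errors.
        subprocess.run(["git", *args], check=True)

    # Point the existing clone at the self-hosted remote and push everything.
    git("remote", "set-url", "origin", REMOTE)
    git("push", "-u", "origin", "--all")
    git("push", "origin", "--tags")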


I think the "laundering" charge works both ways. While it is true that it will help some leverage work that went into GPL projects for new non-GPL projects, it is more important to know that it provides cover from accidentally importing copyrighted material from Microsoft, et al, into GPL projects.

While the value of the trade might not be fair, the costs that legions of corporate "Intellectual Property" lawyers can force upon the lone open source developer are a far larger drag on development.


I am someone that also left GitHub. [1]

To the author, yes, it's worth it.

I didn't go to another service, though; I set up Gitea on my own server. Because of drama there, I'm going to have to replace Gitea as well.

But GitLab itself is going down the drain, I'm not welcome on SourceHut, and I haven't heard the best things about Codeberg. Also, git needs a replacement.

[1]: https://gavinhoward.com/2020/04/i-am-moving-away-from-github...


I'm out of the loop. What drama has there been with Gitea?

(I'm not the least bit willing to put my code on the likes of SourceHut or Codeberg. They are just new iterations of the Sourceforge/GitHub pattern that haven't gone bad yet. Besides, I like to keep the software I don't publish on my own computers.)



Oh, interesting.

The silver lining seems to be that it forced the creation of a fork that promises to stay free, has actual documentation, and doesn't force the Docker bullshit upon installation.

Looks like I'll finally upgrade from Gogs.


Has HN become an airport now that we are broadcasting departures? This article is interesting on a personal level if you know the author, but his reasons are not something most users care about and him moving on is not something I will think about at all


I vaguely get what the author is saying, but I think "give back" is the end result, not the issue. The vast majority of people don't give back. It's exposure.

When GPT uses code from stack overflow, you don't see the community, so there isn't any possibility to engage with it; essentially starving the community.


> I feel like GitHub has lost that magic.

Am I the only person that used Github for what it was? A place to push and pull my repos from my CLI? That hasn't changed. I rarely use the website.


People are so melodramatic. Yes, this is the future; yes, AI is scary; but this is it. Whether you like it or not, technology and life can continue without you. We could very well be living in what intelligence from another era led to, and it is happening all over again. Life is short, enjoy the ride.


"We could very well be living in what Intelligence from another era lead to and it is happening all over again."

Very unclear statement...


The timing of this post is a bit uncanny.

I left Github last week (removed all of my repositories) and made contributions private. I went further and removed all public mirrors on my personal site (along with all writing and projects on my site).

I have been growing tired of the open source community in general and with a particular distaste for the Github community (and company as a whole).

I won't get into all of my grievances and experiences simply because it doesn't matter and nothing will change.

I decided to focus on creating things for myself. I don't need your stars, green check marks, crappy pull requests and endless issues. Maybe some day I will find the spark to open my personal site back up, but it feels unlikely.


Unfortunately moving to Gitlab or Sourcehut doesn't really help, because the underlying model (GPT-x) is trained on the entire internet, so that includes all scrape-able websites. The only way for your data not to be used in GPT (and therefore Copilot) is to not to put it on any website or make it very difficult to access, like encrypting it.
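
For anyone who actually wants the "make it very difficult to access" route, one hedged sketch is to publish only an encrypted bundle of the repository. The filenames here are illustrative, and Python is just wrapping ordinary git and GPG commands:

    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    # Pack the whole repository (all refs and history) into a single file,
    # then encrypt it symmetrically; only people with the passphrase can
    # clone from the decrypted bundle.
    run("git", "bundle", "create", "myproject.bundle", "--all")
    run("gpg", "--symmetric", "--cipher-algo", "AES256", "myproject.bundle")
    # Publish or back up the resulting myproject.bundle.gpg instead of the repo.

Of course that also locks out the humans you might have wanted to share with, which is the trade-off.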


> Unfortunately moving to Gitlab or Sourcehut doesn't really help, because the underlying model (GPT-x) is trained on the entire internet, so that includes all scrape-able websites. The only way for your data not to be used in GPT (and therefore Copilot) is to not to put it on any website or make it very difficult to access, like encrypting it.

Having the entire git history decorates specific chunks (at least entire commits) with context from the commit message. So you can process not only the entire repo at one specific state in time, but its entire history up to this point in time. There is valuable knowledge in making sense of it, but this is not accessible to us. It resides in the knowledge base of one company (or two).


Not sure about this, but training a model on the website that displays the code is not quite the same as training it specifically on just the code. Moreover, (raw) repo content files might not even be included in crawled datasets (e.g., look at https://gitlab.com/robots.txt). I think there is something specific to GitHub being part of Microsoft that makes processing that data much easier.


Is that such a bad thing? I write code to get it out there, not for it to be exclusive or something. Any code that I write that helps somebody else in some way is a huge win for me and what keeps me going. If I could, _all_ of the code that I write would be open source or public but as-is the best way to make a living for me is to write closed-source code.

I see my code getting scraped by these AI tools as me having contributed to something greater than the sum of its parts. And I use it! My code helps you, your code helps me.


Have to say, I agree with the author. It should be super easy for us to select "don't use my data" on anything we own - Github accounts, websites, anything that has our real identity attached to it. With 2FA our identities are attached to Github accounts. Further it should be equally as easy to check training data across all models to ensure our data is not included.

Glad to learn about SourceHut. I'll check it out.


I have also been using GitHub since the early octocat days (2009?). But I never saw it as this magical place the blog describes. Consider that we concurrently had access to really magical-feeling collaborative open source tools like RubyGems and npm. By comparison, SourceForge felt way more magical during its generation.

What WAS and IS magical is git: a fully distributed, has-everything-you-need-but-not-too-much, beautiful repository with easy-to-use history editing to cultivate sensible code trees. The only magic from GitHub, imo, was that it made git the de facto standard when we could easily be stuck using svn today.

As for his points about AI being against the philosophy of open source, I imagine he will have the ear of RMS and a few other absolutists, but in my personal experience AI assistance has lit a fire of individual empowerment for the little guy wanting to take on big projects so bright that I think it is the single best thing we have left to save open source. It is a new wave of open source, and we can collaborate with 100% of the population instead of 5%.


I stopped a while back. And now I'm unsure what I think given the AI revelations.

On the one hand MS gave a great product to programmers for free; on the other they've used the code therein to do a lot of junior programmers out of their future jobs; and then again, this will increase the productivity of many programmers, for a profit.

It's hard to be too critical yet I'm no fan either.


Not trying to Ad Hominem, but why should anyone care that university freshman is moving away from Github? This is essentially blogspam.


Have to read it yourself to decide if the reasoning appeals to you.


> I feel like GitHub has lost that magic.

I'm willing to bet that GitHub itself hasn't lost "that magic". What's happened is:

- internet and development communities have matured. The types of 'fun' that's happening is different, but it's no less magic

- author got older


I recognize your journey. Most of my GitHub work has gone stale and although I didn't say goodbye, most of my work is now on GitLab where it receives more love and care and where I am certain I could always move to a self hosted instance if GitLab decided to go in a direction that doesn't match my philosophy. GitHub has been an interesting ride, a place where I landed after sourceforge and codeplex went upside down. I felt at home until I didn't. Thank you GitHub for the great things we achieved.


> For just $10 a month or $100 a year, you too can get programming help with no obligation to give back to anyone!

I would say the use of trained data is "giving back" to all users of Copilot?

The only thing that has to be considered is whether the pricing is fair, and with all the competition in the space it will soon be, or may already be, below cost.

Also in my opinion, not using open-source for programming aids is less freedom (libre)--without getting into specifics of compatibility of doing so with individual licenses.


My instincts are giving me similar feelings. It’s time to move on.

This has probably been the play for longer than we’d like to admit.

Ingest everyone’s code and sell it back to us at a price.

Fools will soon be voluntarily uploading whole code bases and any novel solutions they have left. These solutions will then be Alt+Tab'ed into the competition's editors for a small fee… wow and lol.


Q: What about the social features of GitHub? I enjoy the discovery and star features. There just isn't any community like it. Where can I get that?

It would be great if there was something that had the discovery & social aspects but was agnostic to the actual site hosting the code. Does something like that exist?


Forgejo [1] is working on integrating activitypub [2] which will let instances federate with each other.

[1] https://forgejo.org/ [2] https://codeberg.org/forgejo/forgejo/issues/59


+1 for Forgejo


Maybe some day Software Heritage (https://www.softwareheritage.org/) will expose enough metadata between repositories so that people will be able to train their own recommenders for better discovery.


You don't really give up GitHub as long as you're employed at one of the many companies that have accepted it and you're looking for information on pretty much most open source software these days. The only thing is to just host your own projects elsewhere. I did this a long time ago because I was unhappy with GitHub's interface. I am realizing further that all I need is a web page and a link on my personal web site. If I really wanted to publish a project, I'd probably find other avenues. I think SourceHut does look nice, if I cared enough.

In some cases, "GitHub" has become a required skill in some jobs now (not git, mind you, but _GitHub_). I personally fumbled a little bit when I was asked that - what is it? The CI/CD stuff specifically, or the "using git" part which is the skill? :) I was a bit befuddled.


I have heard from some people that many companies specifically ask for your GitHub profile - in many places only the user tag, not a link that could point to GitLab or something else.

So it will affect your job chances, even if the job has nothing to do with managing GitHub.


Thankfully that hasn't happened for me, yet. Or maybe that could be my own selection criteria - an active GitHub account should not be a prerequisite to get any type of computer job. If the hiring manager is serious about me, they can find the URL in my GitHub profile.


I've done a little bit of helping with hiring and I always interject that it should be links to code forges.


My thoughts on the co-pilot thing remain as vague as ever but I made something of a connection reading this:

> GPL violations by soulless corporations stealing the hard work of independent programmers and assimilating it into their own proprietary software

"stealing"? Where's it gone? It vanished!? It's been erased from the internet?

Why do we play these semantic artificial scarcity games with code that has deliberately been made publicly accessible? If I read your code and copy it to a file on my computer it hasn't been destroyed, it hasn't gone anywhere. If my computer reads the code it hasn't vanished into the ether.

I don't pirate films/games/TV/books but I think open source code is an entirely different beast to the former.


I'm not sure people need to announce their departure from GitHub. You may not like their use of AI, but legally speaking it's entirely within their rights to update their TOS for this. Is it sketchy? Absolutely. But given that over the years they've provided billions of dollars of hosting and compute services to the open-source community for free, I'm not sure there can be much of an argument that they don't have a right to create a sustainable income source.

They undoubtedly should have handled it better, perhaps even offering an option for individual repos to opt out entirely. But if they hadn't done it, someone else would have.


>I'm not sure people need to announce their departure from GitHub.

I think people should if they are leaving for the reasons described in the article, or other reasons due to Microsoft. If enough people leave there is a very remote hope that Microsoft changes its ways. FWIW, I have been waiting for that to happen for well over 30 years :)


To be honest I never really believed in the original business model:

- subscriptions for companies and proprietary software

- deals with 3rd party integrations

It's cheaper and not more difficult to self-host most parts nowadays, and the tooling keeps improving (k8s, Helm charts, things like Flux and GitLab). The hard part is actually writing those integrations (GitHub Actions). You also have more flexibility, as you can run whatever code you want on your infra.

So it was only a matter of time before things changed (fewer features in free tiers, ads, or another way to make money off the user base, like with Copilot).


I think the author hasn't considered this: I believe Copilot is trained on not only OSS on GitHub but also proprietary software. In terms of reproduction by Copilot, Copilot is not licensed to distribute OSS without license just as it is not licensed to distribute proprietary software. To any extent that OSS authors have a case against Copilot, so too do proprietary software authors, and likely moreso. If there is indeed a case against Copilot reproducing code, I'd expect it to be championed by well-financed corporations foremost.


Recently I noticed if you try to search a project’s “code” on Microsoft GitHub while unauthenticated, you’ll get `?` results and be prompted to log in to see anything.


I've replaced most of my repositories on github with notices like this one (for the same reason): https://github.com/tpapastylianou/process_optargs

I wonder if there's a licence out there already that enables use for humans, but restricts use by robots. (GPLv4 perhaps?)


> Well, there just aren't many places I can go. I could go to GitLab, but they have their own fair share of issues and a disgusting UI/UX.

Not sure when the author last checked it out; GitLab used to have a bad UI, but they've made some improvements recently. IMO it's pretty OK now (still not as good as GitHub's, though I don't even think GitHub has a good UI).

But SourceHut is a nice pick anyway :)


GitLab team member here. Thanks for the feedback.

Beautifying our UI [0] is an ongoing effort and usability improvements [1] are specifically mentioned in the product investment themes for this fiscal year.

0 - https://gitlab.com/groups/gitlab-org/-/epics/7781

1 - https://about.gitlab.com/direction/#world-class-devsecops-ex...


well, GitHub isn't as "raw" as it used to be, that's for sure. It's much more user friendly now, much easier to make a single click and do what you need, like "copy to clipboard" or "download (a single file from a repo)" and so on.

as things get easier to use, the coolness factor definitely goes down, and the pride you get from learning the tool goes down, because the easier something is to learn, the less special any given person is for having learned it. There are software packages which intentionally remain complex to preserve this feeling, and that's considered bad.

From personal experience, if you want GitHub to feel like magic again, host an enterprise server instance on AWS and have over 10k active users. You will feel the complexity in your soul as you work out ways to monitor that thing effectively and react to problems before the symptoms get noticed by your extremely demanding users.

I realize that very few people can do this, as GitHub Enterprise licenses are expensive, but hoo boy GitHub has not lost its magic for me.


There's a GitHub for the decentralized era: https://radicle.xyz/


Github was always private and their founders had already sold other companies to Microsoft, so it's hard to understand how anyone could be shocked and disappointed when they sold Github as well.

Copilot is great IMO, so good it feels almost like cheating. But if you are so broke that you cannot come up with $10 a month, then don't use it - you can still use GitHub.


Is there really anything ‘unique’ about the code we write on a line or even function level? It all feels like plumbing to me. It’s how you put it all together that makes what you create with it unique and special.

Copilot so far has generally just figured out what I was going to type next anyways without it, nothing mind blowing, but super helpful.


While I don't think MS has the right to just ignore copyright, I think this is closer to the idea of code reuse than previous approaches, where someone had to choose a specific level of granularity and abstraction and hope the result could be effectively used without any change.


See I've been doing Open Source for a long time and I love that GitHub/OpenAI/Microsoft is training on my code.

I want it to be helpful.

I'd prefer to run my own version, but we are I guess between 3 and 6 months from having competitive open offerings.

When I write open source code I'm serious about the open, non-discriminatory nature of it.


FWIW, that's great for your intentions with "open source code", but some people--like myself--in fact explicitly placed licenses on our code (and additionally did not even upload it to GitHub ourselves: our code is open source and of course anyone can edit it in public and redistribute it) that add "discriminatory" clauses as we aren't merely being purely altruistic: we are attempting to help provide code as ammunition in a war against developers who would attempt to lock users out of control over their own technology.

When you put your code under GPL (as I explicitly and actively chose to; or to put your content under something like the Creative Commons ShareAlike license, which I've explicitly used for many photos I've distributed over the years) you are purposefully choosing to help only those people who are willing to agree to the same level of open-ness in their products. And, as a reader, you know that if you read my code and "learn" from it, you are tainting yourself in a way that might make it difficult to later defend yourself from claims I might make.

To the extent to which it is legal for Copilot to train a model off my code as it might be fair use as some kind of transformative work, I want to be clear that I do not at all believe it is legal for someone to USE Copilot to write software that might be similar to my software... at least without having their lawyers carefully vet the resulting code it generates to determine that none of the expressive intent of my code has managed to leak through Copilot's attempt to launder my code as the user attempts to "autocomplete" something similar.


The main problem is that GitHub is becoming unreliable like the rest of MS's software (led by Azure).

Everything else is just business.

At the same time I don't think it's smart to abandon GitHub for networking / CV reasons.

Once I'll be rich enough, I'll delete my account and use a self hosted git server.


The US Copyright Office seems to say anything written by AI cannot be copyrighted. Also, a work that does not disclose which parts are AI-generated cannot be copyrighted. All open source licences and Creative Commons licences are based on copyright. If you do not own the copyright you cannot release code under MIT.

As far as I can tell no one is labeling which parts of code used AI. Most devs I know are using ChatGPT at least for inspiration in solving problems. Pretty sure that soon almost every file will have snippets of AI code. Companies have little control to stop this. So to me it looks like every company is unknowingly making its codebase public domain.

OpenAI caches its responses, so it can scan private codebases to know which use AI. Copilot also knows which codebases use AI. If they find a codebase that uses AI but does not disclose where it is used, then that codebase can be used for further AI training. Major companies will need to use Copilot X, VS Code and GitHub to remain competitive. So Microsoft could end up sucking up all the proprietary knowledge of every industry.


Competitive at what? Templating?


OP: "it's a stochastic parrot" GPT-5: "I know you are, so what am I"


I kept my github account to report issues on the projects still on github.

Even though GitHub is owned by the untrustworthy MSFT, it (still) does work with noscript/basic (x)html browsers (well, I created my account probably a decade ago).

gitlab: I cannot even create an account.


Stop reporting issues to github.


Tell valve to move to another noscript/basic (x)html friendly source code and issue reporting site then.

Yep, I have a sin: I play native elf/linux games.


Just had to report a bug in alsa-lib... and it has to be done on GitHub.

Ok, we have a situation here. Critical/significant open source components going on sites which should be avoided. There are noscript/basic (x)html alternatives.

Why are those not chosen?


Why do people believe that giving a gift (releasing free software) obligates the recipients of those gifts?

This idea that open source code can be "stolen" is insane. You can't steal a gift. Once you give it away, it is no longer yours.


Maybe for the same reason people believe in reciprocity with actual gifts and in other areas of human interaction?


If someone says "stochastic parrot" as an argument I know they have no idea what they are talking about.

That's calling a human a bunch of cells. It's true, but you are losing track of the (astronomical amount of) emergent properties.


Github has too many open source resources for people to get up and quit. Code repositories on github are akin to google services; the logo may not be there but the foundations seep through to our existing products.


Code on github is not open source, end of story.


Why don't the alternative projects just copy the good old GitHub interface instead of starting from scratch with a worse one, scaring off the devs who are transitioning?


I'm sure there are others, but Codeberg is exactly this: https://codeberg.org/


Yes, that's a good example (and the author also mentioned this), but then there is sr.ht


Copyright could be an issue. Also, they may not agree that the GitHub interface is “good”.


I'm not sure copyright blocks that - see Codeberg, which is pretty close. Have you seen an example of an interface that is better?


I haven’t really looked around, but the GitHub interface is rather bad as far as I’m concerned.

I’m not sure how Codeberg would fare if GitHub sued them for copying their design. I guess GitHub doesn’t really care as long as they’re the king of the hill.


Microsoft eventually shits on everything, so this was predictable. The spirit of open source is orthogonal to that of corporate interests.


My only concern is whether GitHub is training on private repos or not. That would be the main problem, but I guess we will never know.


> It's a stochastic parrot

Stopped reading right there. The author clearly has no clue of how neural networks work or what makes them tick.


"stochastic parrot" is an interesting shibboleth for a certain school of thought on the capabilities of LLMs, but i think it is uncharitable to make the leap straight to "has no clue of how neural networks work". on some level, all of us who spend our free time enough to know what "stochastic parrot" refers to have some idea how NNs work, and on another, none of us know how NNs really work.

we could all do with a bit more humility dealing with this topic and each other.


That's fair, but it is a negative take that disregards all emergent properties. If you strip all emergent properties from it, there is nothing left. The same thing is true of all biological systems.

Why bother being "human". We are all just a bunch of cells exchanging chemicals and electrical signals. That's all there is to it. There is no reasoning, just a bunch of signals going back and forth.


How can you honestly say that "emergent properties" are real if you haven't really seen the training data or don't actually know how the thing works?

It stands to reason that the bigger the model, the more likely you'll get an answer to the question you're looking for. Even the apparently "tricky questions".

Even things like translating code from one coding language to another...

Anyway, maybe we are ALL stochastic parrots (including ChatGPT-10) and that's all we'll ever be...bravo.


Emergent properties are never "real". They just are and you can see them happening, but "underneath" it's nothing.

Edit: I meant to say I don't need access to the training data. By experimenting with inputs/outputs you can get a basic picture. I don't need to see biological scans to say something about your personality either.


I think an important distinction here is to say that currently, you perceive them to be real. They aren't factually real things, at least not yet.

Judging someone's personality is a subjective process, not an objective one.


I do not. What I say is that I perceive them. Their realness is a non-issue (to me). "Factual", you mean by "authorities"? I do get your point, but I think you overthink the issue. If you see something, it is there. It can be illusory, sure, but think about why that matters.


Do you have any resources you'd recommend to form a better understanding of how NNs tick? I'd like to get a better intuitive grasp on what's going on - I've mostly just been responding to that with "Well, if stochastic parrotism can do all this..."


In case you haven't seen it yet, the term "stochastic parrot" was introduced by this paper [1], titled "On the Dangers of Stochastic Parrots". A related paper [2], titled "Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data", was awarded Best Paper by the Association for Computational Linguistics, and it's also easier to read.

Those two papers are critical of LLMs and discuss what the researchers believe they can and cannot do. I'm not saying you need to agree with them, but I think reading them should give you a good primer on why some researchers are not as excited as HN users are.

[1] https://dl.acm.org/doi/10.1145/3442188.3445922

[2] https://aclanthology.org/2020.acl-main.463/


Thank you!


The recent post from Stephen Wolfram [1] is pretty good as an introduction, but I haven't seen any super comprehensive material that tries to dissect all the interesting behaviour we see in the really big LLMs. For that, just reading the relevant papers themselves has been pretty fruitful for me. Some of them are actually very well written, even if you aren't used to reading scientific papers. I can recommend the Sparks of AGI paper [2] and the Toolformer paper [3].

Obviously there's much more out there, but those three things are a pretty good read.

[1]: https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-...

[2]: https://arxiv.org/abs/2303.12712

[3]: https://arxiv.org/abs/2302.04761


At this point, anyone using the term "stochastic parrot" is just giving credence to my personal belief that humans are also stochastic parrots.


A more apt term would be delusional parrot, applies both to the LLM and to everyone else who thinks GPT is the second coming of Jesus.


I mean, it's probably a component (or an approximation of a component) of what we do, at some level. Christ knows I've felt like a stochastic parrot when I'm zoning out 3 hours into a meeting and someone asks me a question out of the blue. I probably have a smaller context window at those points than GPT-3 does...


The author is a college freshman; I stopped reading after the about-me.


I am totally against GitHub using public code that isn't under a permissive licence. I too want to move off of it.


nudge


I recently rolled out my first full software product. Some of my code is OK to be open-sourced, and some isn't. I considered GitHub, but it's just too Microsoft-y now. I ended up rolling out a Gitea instance. It doesn't have feature parity, but at least my code is on my server and there aren't any AIs / MS employees reading it (I hope!)
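
For anyone wondering what the actual move looks like, here's a rough Python sketch of pulling an existing GitHub repo into a Gitea instance through Gitea's repo migration API. The host, owner and token are placeholders, and field names can vary slightly between Gitea versions, so check the API docs (/api/swagger) on your own instance:

    import requests

    # Placeholder self-hosted Gitea instance and API token.
    GITEA = "https://git.example.com"
    TOKEN = "your-gitea-api-token"

    # Ask Gitea to migrate (and optionally keep mirroring) a repo
    # from GitHub into the given owner's account.
    resp = requests.post(
        f"{GITEA}/api/v1/repos/migrate",
        headers={"Authorization": f"token {TOKEN}"},
        json={
            "clone_addr": "https://github.com/example/some-project.git",
            "repo_name": "some-project",
            "repo_owner": "me",   # target user or org on the Gitea side
            "mirror": True,       # keep pulling updates from the source
            "private": True,
        },
        timeout=60,
    )
    resp.raise_for_status()
    print("Migrated:", resp.json()["full_name"])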


No sense getting to the front page…


Hacker News is like a book club: the articles are catalysts for discussion—and clearly a lot of folks have things to say


Feeling betrayed by Microsoft? Well, you should have learned your lesson the first time.


You have your points, I have to admit. Made me think about it too.


They took down Male Poon for no actual legal reason.


Anyone who refuses to embrace these AI tools will simply not be able to compete. As new tools and tech appear in any trade you more or less have no choice but to embrace them.


Does Copilot train on private projects too?


It's been proven that yes, it does. I suspect they've been retraining Copilot in the past months to undo this initial training mistake.


My git client can talk to any remote origin; who cares? Good on this kid for the clout chase, though. Being upset with AI is just comedy at this point...


AI tools like GitHub copilot are able to help “give back” to open source because open source developers can use them to be more productive.


The greater good here is what tools like GPT in its various flavors and interfaces are offering. Learning is good, not bad. Get a grip.


A bit pompous without real content.


Perhaps a variant of Eternal September...



