> If it is, as you claim, permissible to train the model (and allow users to generate code based on that model) on any code whatsoever and not be bound by any licensing terms, why did you choose to only train Copilot's model on FOSS? For example, why are your Microsoft Windows and Office codebases not in your training set?
GitHub's position doesn't appear to offer any advantage with regard to Copilot's creation.
OpenAI Codex (which Copilot grew out of, IIRC) exists, as do Amazon and Salesforce versions of Copilot. Hugging Face BLOOM was trained on a sizeable amount of public code. Tabnine, now behind, was one of the earliest to combine public code repositories with deep learning for smarter autocomplete. The data requirements for Transformer scaling mean any and all public-facing repositories will be assimilated, whether GitHub, GitLab, Stack Overflow or so on.
Wish more energy was spent on how to fund pretrained models that will also run efficiently on CPUs and are fine-tuneable to one's language and local environment, removing reliance on cloud services.
Curious about people's opinions on DALL-E 2 or Google's Imagen, which parallel pretty much the same thing with renders, illustrations and paintings, or upcoming models doing the same for voice acting and music. Coders seem more excited about the potential of those tools.
Is anyone using DALL-E, Imagen, or any other generative model for art to create commercial products? If so, they're probably also concerned about copyright issues.
Copilot is being offered for widespread commercial use, so it's held to a higher standard. Respecting copyright is much more important when you're building a business and not just sharing fun AI art on social media.
OpenAI currently offers a GPT-3 API and an invite-only DALL-E 2 API. Both of these are commercial products trained on web datasets and can output collisions with the training set. They have effectively zero concerns about copyright due to it being covered under fair use and OpenAI having copyright on all outputs (in the case of DALL-E 2).
The boring answer is probably something along the lines of “Copilot was trained by employees of OpenAI who aren’t technically MS employees”. When I worked at MS you had to jump through all sorts of hoops to get access to code from other orgs. I can’t imagine what BS you’d need to do to give access to a vendor.
At least a year ago in Azure that wasn't true; everyone had access to nearly every internal service's code (plus the Windows kernel). Though there were some exceptions (the Teams team didn't want to share their source at all for whatever reason).
> you had to jump through all sorts of hoops to get access to code from other orgs
This may be the dumbest move from M$ that I have read on this thread! Sure, companies need to protect their private IP, but this really feels like creating unnecessary friction for no good reason...
I want to know what stuff you guys are putting in public GitHub FOSS repos that you don't want replicated in any way...
I also want to know why people think their code is so special that no one else could have ever come up with it independently. Each and every opponent of Copilot is the best developer ever, I guess?
That said, I don't understand the choice to use GPL for any reason, so maybe I'm not equipped to understand the arguments against Copilot. Forcing your code to be open forever isn't freedom, it's the omission of freedom. Someone using your (for example) MIT-licensed code in a closed-source commercial software project doesn't "un-free" the code you released; your code is still exactly as open and as available as it was before, and zero freedoms were lost by anyone.
> I want to know what stuff you guys are putting in public GitHub FOSS repos that you don't want replicated in any way...
Please feel free to use my code in any way that its license permits: attribution for the permissive licenses, share-and-share-alike for the copyleft licenses. Those license terms are the price of the code, no different from a proprietary product's "this costs $x" or "this costs $x/month". I'm happy to give away most of what I work on every day, and I ask that people 1) give credit, 2) in some cases, share under the same terms, and 3) in many cases, don't sue me or other users of code I've written over software patents (which shouldn't exist).
If the day comes that copyright goes away, and we can freely copy and share the code of any currently proprietary software and other works, I'd celebrate that. Until then, I don't want an asymmetric situation in which proprietary licenses must be adhered to but Open Source licenses are ignored.
If copyright goes away, that won't magically make the source code of all proprietary software public. The only thing that will be liberated is existing shared-source software.
> I want to know what stuff you guys are putting in public GitHub FOSS repos that you don't want replicated in any way...
Nobody has claimed that they want this. People just want derived work to adhere to the license they chose for their project.
> I also want to know why people think their code is so special that no one else could have ever come up with it independently. Each and every opponent of Copilot is the best developer ever, I guess?
Would you feel the same way about ripping off game assets, or music?
I think you just have an axe to grind with free software in general based on your messages and the general tone. Just because you don't understand it doesn't mean that the ideas are invalid.
I am also curious why copyright laws should protect proprietary software, music, games, writing, etc but not apply to my software, even if it isn't the highest quality work?
At what point does AI recreating patterns it has seen from reading source code count as a derived work? What if a human learns to code by reading only GPLed code? Does all the code they write fall under the GPL as a derived work now?
> “If you look at the GitHub Terms of Service, no matter what license you use, you give GitHub the right to host your code and to use your code to improve their products and features,” [Kate] Downing, [an IP lawyer specializing in FOSS compliance] says. “So with respect to code that’s already on GitHub, I think the answer to the question of copyright infringement is fairly straightforward.”
This has some interesting implications – for example, it means I can't mirror somebody else's (open source) code on GitHub without their explicit agreement.
> > “If you look at the GitHub Terms of Service, no matter what license you use, you give GitHub the right to host your code and to use your code to improve their products and features,” [Kate] Downing, [an IP lawyer specializing in FOSS compliance] says. “So with respect to code that’s already on GitHub, I think the answer to the question of copyright infringement is fairly straightforward.”
So any code uploaded by someone other than the copyright holder renders someone liable to be sued for copyright infringement, AFAICS. The only question is whom it makes liable -- the uploader, GitHub (=Microsoft!), or both?
I can see arguments either way: The uploader is clearly infringing by giving away a right that isn't theirs to give. But so is GitHub / Microsoft, for using a "right" they haven't been properly given. So I'm provisionally leaning towards "both".
> I can't mirror somebody else's (open source) code on GitHub without their explicit agreement.
Who is doing the "mirroring" -- you, in uploading the code, or GitHub / Microsoft in actually hosting it, keeping it available for download from their "mirror"[1] site?
___
[1]: Is that even the correct terminology nowadays, when AIUI for lots of projects GitHub is their primary code repository?
So GitHub should immediately take down (and remove from their Copilot learning model!) all *GPL code uploaded by anyone but the ("primary"?) copyright holder.
There's one thing I'm missing from all these discussions and posts: is the generated code even copyrightable? IANAL, but code snippets often fall under the "scènes à faire" doctrine (everybody would do it in a similar way), in which case it's not. https://en.m.wikipedia.org/wiki/Sc%C3%A8nes_%C3%A0_faire
GitHub seems to think it is copyrightable; personally I doubt it is, simply because a human didn't create it and the process that created it was automatic, with no creativity.
Well, if the entire thing was generated, then no (according to the first link I posted above), since it was not produced by a human. However, no useful program is going to be entirely written by an AI, so any real program would have quite a lot of user input (I regularly will take what copilot suggests and then tweak it to what I specifically want). And then, yeah, it's copyrightable.
Also, there's no way for anyone to know what portion of code that I commit was hand written vs. generated, so you kind of have to treat it all as written by the committer anyway.
Though this does bring up interesting questions about what happens with things like automated PRs that fix bugs / update dependencies... are those then non-copyrightable? ¯\_(ツ)_/¯
Here's the kicker: your modified code snippet may still not be copyrightable if it's generic enough that everyone would do it in a similar manner.
Just as much as a hero riding off into the sunset is not copyrightable in a movie script. However, a hero riding off into the sunset with bananas in the pistol holsters would be.
This is what I would want to hear more about when discussing if Copilot violates copyright.
No, it's a good analogy, because it's not about the similarity between people and code. The cases are similar because in both you restrict freedom to enable freedom.
Making source code available and not requiring the same of those who use it is a temporary, fleeting freedom that soon turns into a lack of freedom.
Like thinking you're ending slavery by freeing all the current slaves but not making it illegal to own, buy, and sell slaves, or capture previously free people into slavery. Guess if you'd have slavery again very soon?
The analogy is about freedom vs lack thereof, not manual labour vs software. And as you see, it works very well.
> their code is so special that no one else could have ever come up with it independently
I'm worried about exactly the opposite: having Copilot help me write code that seems quite generic to me, but which in fact makes my code subject to a license I don't even know about, and/or simply violates copyright.
For an open-source project this could be embarrassing but probably fixable. It gets more complicated if FAANG is doing due diligence on your company. I can see Copilot being both an accelerant and, later, a liability for startups.
There's a setting on GitHub that blocks any suggestions that exactly match code in the training set. I doubt you'd ever get in trouble for code that was similar in structure but different variables etc from existing licensed code (especially since most small snippets of code are not terribly unique to begin with).
I mean, it's nice that they have a setting for the bare minimum a lazy undergrad would do to avoid getting caught for plagiarism: replace some of the words in the copied paragraph with replacements from a thesaurus. It's not something I'd personally expect to hold up under real scrutiny though.
AFAIK that's not enough, for instance see the long-standing industry practice that people working on the Important Stuff are not allowed to ever look at the source code of the Direct Competitor; or clean-room reverse engineering, etc.
I guess time will tell how much acquiring companies (my worry) care about Copilot. Given the difficulty hiring good devs, and the productivity level of body-shop devs, I see it getting a whole lot of use very soon, acknowledged or not.
There's a big difference between reverse engineering (i.e. intentionally writing software that behaves identically to another piece of software) and writing your own code to solve your own problem that may superficially contain small portions of similar logic to some other project. Copyrighted code has to be sufficiently creative and unique to qualify; otherwise, after the first person wrote code to parse JSON from a web request, no one else would be able to do the same thing.
Kind of interesting.. I would like to point out this seems to be specific for the US.
But also.. In that case, when I commission an artist to paint my portrait, surely I can't claim to be the artist.. But I'm no lawyer.
I'm not sure there is a contractual agreement in GitHub's Copilot that says: "Any code you write here is commissioned work". But honestly I didn't read the T&Cs.
So I think you MAY have debunked my analogy, but not the main reason for the analogy.
Copy and paste doesn't really write code, just copies it from one place to another. Copilot on the other hand does generate new potentially novel code.
I'm sure that's what people said when they went from punch cards to assembly, and from assembly to C, and from C to Java.... and yet, here we are. Tools that let us write higher level code faster, just allow us to create more complicated software in a reasonable amount of time.
That's still 100% true of the examples I mentioned. There's always a higher level to consider. When we moved to C, we could stop worrying about what registers we were using. When we moved to Python/Java we could stop worrying about managing memory. When we moved to web frameworks we stopped writing the guts of our servers. And if anything, programmers have become even better paid, despite so many more people in the industry.
I agree with you. However, programmers have not become even better paid because society values programmers. They have become better paid because software is a relatively new artefact in human society which has taken human life by storm, which has made software companies immensely profitable, which meant more companies wanted to create software and attract the people that could help them do it.
As software takes a back seat (or at least a "normal" seat) in society, would we see a normalization of income? Could this be hastened by the development and introduction of tools such as copilot?
Potentially, unless there are new / better things that humans can claim they can provide compared to AI tools. This is the point where I think you and I agree, and I think it's your primary argument in any case (unless I'm mistaken).
AI can code low-level stuff. This one function. This small piece of logic. What it can't do is conceive of how to take a bunch of different functions and put them together to produce an actual product. It can't tell you if you should use Postgres or Mongo. Programmers will always be needed; we'll just move up the stack, and we'll produce more value per hour of our work, justifying our high salaries.
Compare the visible output of someone writing in assembly vs someone writing on top of a modern web framework. Is assembly harder? Yeah. But the web framework is going to give you a usable product in a fraction of the time with way more features. And that's worth more money to the company you work for.
It's always going to be a knowledge worker's job. It's always going to reward experience and creativity and attention to detail. A lot of programming is looking at the world, seeing a gap in what exists, and figuring out what best fits that gap. An AI can't do that. Programming is making 1000 tiny decisions that can't possibly be specified completely by a product manager and need a human to weigh the tradeoffs.
> AI can code low level stuff. This one function. This small piece of logic. What it can't do is conceive of how to take a bunch of different functions and put them together to produce an actual product.
That's what everybody in the chess world said: "AI can decide low level stuff. This one move. This small attack on a rook. What it can't do is conceive of how to take a bunch of different tactics and put them together to produce a game of chess."
...Until Deep Blue beat Garry Kasparov.
> It can't tell you if you should use Postgres or Mongo.
Yeah, and then came: "It may be able to play chess, but it can't tell you how to play Go."
The hard part about writing code isn't "how to write a for loop" and similar trivial things. Copilot makes this process faster, but the hard part is still organizing your code so that it doesn't become a steaming pile of cowdung a few iterations down the line. That Copilot does not do for you.
So, unless you are a code monkey punching code into autogenerated scaffolding all day, your job is safe.
Forcing your code to be open forever is guaranteeing freedom of all users of my code, both direct and indirect. Developers don't need to have any more freedoms than other users.
> Forcing all of your code to be GPL is like saying “I am on a diet, so now I will force everyone else be on the same diet. Freedom!”
Nobody is forcing anyone to use the code.
If they chose to use it they have to abide by the licensing terms because that’s how it works. If the people laboring for free to produce this code don’t want it to be used in a proprietary application then tough luck, write the code yourself.
Every time the GPL comes up someone drags out this same old dead horse to beat on a little bit more.
Until the time comes when a tax department gets the funny idea to use it and forces you to use it, or people with guns come to your door and haul you away in the morning.
It's not about whether it's a problem in real life; it's about whether the end user might be forced to use a product, which IS a thing, and that is the ONLY point I made.
> Forcing all of your code to be GPL is like saying “I am on a diet, so now I will force everyone else be on the same diet. Freedom!”
This is a terrible analogy. Here’s a better one: I’m holding a potluck. If you decide to come, you can eat all you want. If you take food from my event, you can’t hoard it, you must share it, even if you’ve “made it better” by changing it somehow after you left.
Don’t like my rules? OK, don’t come to my potluck.
By analogy, there is a law against me putting handcuffs on another, and in fact the police would stop me from doing so. Did the police protect freedom? Aren't they restricting me from handcuffing others?
In a similar manner, under the MIT license I can restrict my users from modifying and compiling my source code. Is a license that means I have to let my users modify code restricting freedom? Isn't it ensuring the freedom of others, in the same way that a law saying "you shall not handcuff others for no reason" ensures the freedom of others?
Suppose that there's a law that states that water and access to it is always supposed to remain public, because water is a public good.
Suppose that someone comes tomorrow and starts claiming ownership of all the water springs in your country, he becomes the only entry point to get water, and you have to pay him a fee every time you open a tap.
Is he still free to do so? In other words, is the freedom of someone who restricts the freedoms of everyone else still a form of freedom that is worth even considering, let alone respecting?
Because the foundation of your ideas is exactly the reason why capitalism fucked things up and just let a bunch of jerks get rich without merit.
> I want to know what stuff you guys are putting in public GitHub FOSS repos that you don't want replicated in any way...
What a disingenuous reply. FOSS licenses do not grant the ability to replicate "in any way" that you wish. You still have to comply with the license terms. What the hell is wrong with you?
> I don't understand the choice to use GPL for any reason ...
Also note: Copilot violates the attribution requirements of permissive licenses like MIT as well. Even if you put your code on GitHub with the intent of it being freely used in proprietary software, attribution is still a fair demand.
Just to clarify: you seem to believe that most of our code isn't good enough, so copying it is not a big deal.
Do you feel the same about other creative processes as well? Can I rip off a Justin Bieber song and say that it's mine just because it's a shitty song anyway, so who cares? Or does this only apply to software because software is somehow an "inferior" art? Do licenses even have any legal value to you?
The D language uses the Boost license because it is the least restrictive. Anyone is free to use it in closed-source non-free commercial apps if they like, or Open Source if they like.
I don't know what 0-clause BSD is. The Boost license is:
Boost Software License - Version 1.0 - August 17th, 2003
Permission is hereby granted, free of charge, to any person or organization
obtaining a copy of the software and accompanying documentation covered by
this license (the "Software") to use, reproduce, display, distribute,
execute, and transmit the Software, and to prepare derivative works of the
Software, and to permit third-parties to whom the Software is furnished to
do so, all subject to the following:
The copyright notices in the Software and this entire statement, including
the above license grant, this restriction and the following disclaimer,
must be included in all copies of the Software, in whole or in part, and
all derivative works of the Software, unless such copies or derivative
works are solely in the form of machine-executable object code generated by
a source language processor.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
0-clause BSD goes even further, and completely omits the attribution requirement:
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.
This is a very "american" definition of freedom, which is basically, just let me do what I want.
GPL uses a different definition of freedom, which I prefer. They look at consequences of restrictions / permissions, and their implication on freedom (not just for me, but for everyone). So some restrictions can lead to actually more freedom, while some permissions can actually decrease freedom.
This is similar to gun control. While it reduces freedom for gun owners, it leaves everyone freer to hang out anywhere they want without being afraid of being shot. Similar arguments can be made for vaccine mandates.
So the GPL restricts usage of software because in the long term it gives power back to users, who will be freer as a result.
> This is a very "american" definition of freedom, which is basically, just let me do what I want.
Eh. I see what you’re saying about gun control, but the idea that “some restrictions can lead to actually more freedom, while some permissions can actually decrease freedom” is actually very American.
The free software movement says that everyone deserves software freedom. The Declaration of Independence similarly says “We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.” While I haven’t found a source confirming it, I think that the founders believed that the freedom of speech was one of these unalienable rights.
The GPL puts restrictions in place to make sure that downstream projects give users software freedom. The Constitution put restrictions in place to ensure that the federal (and nowadays the entire) government doesn’t interfere with our unalienable rights.
Take a look at how the first amendment is worded:
“Congress shall make no law respecting an establishment of religion, or prohibiting the free exercise thereof; or abridging the freedom of speech, or of the press; or the right of the people peaceably to assemble, and to petition the Government for a redress of grievances.”
The first amendment does not grant the freedom of speech because it doesn’t need to be granted. From the founders’ perspective, god already grants the freedom of speech to everyone forever. The key phrase here is “Congress shall make no law”. The first amendment is restricting Congress to ensure freedom.
The idea that “some permissions can actually decrease freedom” is also present in the Constitution. For example, take a look at Article I sections 8 and 9. The framers of the Constitution could have given Congress the power to pass any law. Instead, they chose to specifically enumerate what Congress can and cannot do.
Perhaps, though, most Americans don’t know much about our founding and think that freedom=just let me do what I want. I don’t know.
I think the GPL is a good idea (it ensures that everyone keeps their freedoms when they receive a modified version of the code, among the other things it protects users from), but it has problems too; for one, I think it can be complicated to deal with.
For this reason, I had the idea of making up a new license (although I will not write most of my ideas here but will do so elsewhere). Its main mechanism would be: mostly you can do whatever you want (including omitting attribution and copyright notices) without worrying about the license, but you cannot use legal processes (such as lawsuits, DMCA, etc.) to deny these freedoms to any downstream recipients (regardless of how many). The license would also ensure patents can be used freely, include a disclaimer of warranty (if the license is included in the copy and the recipient has not paid for the copy), and include some other things to ensure freedom (although there can be some restrictions on the use of trademarks (e.g. to avoid false advertising), and some things to avoid working around the freedoms in certain ways). You can be forgiven any number of times, though; the license will not be terminated. Furthermore, for the practical reason of license compatibility, relicensing under GPL3 and AGPL3 (and possibly also CC-BY-SA 4.0, for works other than computer programs) would also be allowed, as long as you have a copy of the source code and can satisfy the terms of those licenses.
"I also want to know why people think their code is so special that no one else could have ever come up with it independently. "
Really? What exactly does this Copilot thing actually spit out? I can't help but think that it spits out near-verbatim code, which in the UK is probably dodgy on copyright.
You then go on to decide that the GPL isn't for you. That's fine. You even explain that you are ill-equipped for something. That too is fine.
You are not a fan of free or "libre" stuff. That comes across loud and clear. Thank you.
Forcing the code to be open is the kind of freedom where restricting something locally enables freedom globally. Granting the freedom to do whatever with the code will make the code end up used in closed ways, empowering those who close the code.
A similar line of thought is the "paradox of tolerance", which posits that if a society tolerates the intolerant, the tolerance of that society will lessen.
Freedom is not, and cannot be, an absolute. If I am 100% free, that by definition restricts the freedom of others (for instance, if I am free to punch you in the face, you are not free to not be punched in the face; if I am free to own you as a slave, you thus lose a lot of freedoms).
Determining what freedom should mean is not, and has never been, a simple matter of "well, if you make any restrictions on it, then it's not real freedom, so everyone just gets to be free!" It's all about finding balance, and dealing with nuance, and all that frustrating hard stuff.
Note that while Copilot is a major motivator for this effort, it isn't the only one; there's a pile of other reasons listed at https://sfconservancy.org/GiveUpGitHub/ . GitHub and its lock-in has been a problem for a long time, and this is just the most recent problem.
I mean the answer to that question is obvious: they're not under any obligation to include their own code in the training data. Why would they?
A better question would be whether they would take legal action against a competitor that creates a Copilot equivalent and publicly states that they trained it on leaked, proprietary M$ source code. That would actually be an example of hypocrisy.
> They're not under any obligation to include their own code in the training data. Why would they?
Because these models work better with more data and presumably this is a lot of high-quality data that they already have lying around anyway? Because there is no downside according to their own reasoning? Because it would shut up a lot of these criticisms right away? Because marketing would be so much easier with that kind of dogfooding?
In short: because according to their own story there would be only upsides, no downsides.
According to their logic, if I train a model using stolen Windows source code, it's fair use.
Just because the code is under FLOSS licenses does not allow them to evade things like the Affero GPL3. To that end, if they are using Affero-licensed code, I want the source to the whole Copilot infrastructure, or proof they used no AGPL3 code anywhere.
Perhaps because there is a (small) risk of leaking confidential information through its output.
But that's not as damning as it sounds.
First, we know Copilot, if given the right prompt and told to autocomplete repeatedly without any manual input, can regurgitate bits of code seen many times in many different repositories, like the famous Quake fast inverse square root function and the text of licenses. That doesn't mean it does so under normal prompts and normal use. Perhaps it does sometimes, and that would be a real concern. But any regurgitation that isn't under normal use, which only happens if the user is trying to make Copilot regurgitate, is not a problem when it comes to copyright violations of open source code (since anyone trying to violate an open source license can do so much more easily without using Copilot), yet it may still be a problem when it comes to leaking confidential information.
Second, whether something is a copyright violation and whether it risks leaking confidential information are somewhat orthogonal. A copyright violation usually requires at least several lines of code, and more if the copying is not verbatim, or if the code is just a series of function calls which must be written near-verbatim in order to use an API. On the other hand, `const char PRIVATE_KEY[] = ` could hypothetically complete to something dangerous in just one line of code. That said, it almost certainly wouldn't, since even if a private key was stored in source code in the first place (obviously it shouldn't be), it probably wouldn't be repeated enough to be memorized by the model. Yet…
…third, the risk tolerances are different. If, to use completely made-up numbers, 0.1% of Copilot users commit minor copyright violations and 0.001% commit major ones, that's probably not a big deal considering how many copyright violations are committed by hand – sometimes intentionally, mostly unintentionally. (When it comes to unintentional ones, consider: Did you know that if you copy snippets from Stack Overflow, you're supposed to include attribution even in any binary packages you distribute, and also the resulting code is incompatible with several versions of the GPL? Did you know that if you distribute binaries of code written in Rust, you need to include a copy of the standard library's license?) But when it comes to leaking confidential information, even one user getting it would be somewhat bad (though admittedly Microsoft does distribute much of their source code privately to some parties), and taking even a small risk would be a questionable decision when there is a ready alternative.
> Perhaps because there is a (small) risk of leaking confidential information through its output.
If Microsoft/Github ever made that argument, that also means that when Copilot is using GPL software as input, the output can only be released under the GPL.
Copyright licenses don't apply to small snippets, no matter if you think they do, and learning and applying other people's code isn't prohibited by the license, and thank god, can't be prohibited.
FWIW, there are some (admittedly fairly naive) checks to prevent PII and other sensitive info from being suggested to users. Copilot looks for things like ssh keys, social security numbers, email addresses, etc, and removes them from the suggestions that get sent down to the client.
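To make the kind of "fairly naive" check described above concrete, here is a minimal sketch in Python. The patterns and the `filter_suggestions` name are assumptions made up for illustration; the actual checks Copilot runs are not public and are likely different.

```python
import re

# Hypothetical patterns for illustration only; not Copilot's real rules.
SENSITIVE_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM/OpenSSH private key header
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),               # shape of a US social security number
    re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),    # email address
]


def filter_suggestions(suggestions):
    """Return only the suggestions that match none of the sensitive patterns."""
    return [
        s for s in suggestions
        if not any(p.search(s) for p in SENSITIVE_PATTERNS)
    ]
```

A filter like this only catches secrets that have a recognizable shape, which is why it is best seen as one safety net among several rather than a guarantee.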
There's also a setting at https://github.com/settings/copilot (link only works if you've signed up for copilot) that will check any suggestion on the server against hashes of the training set, and block anything that exactly duplicates code in the training set (with a minimum length, so very common code doesn't get completely blocked). Users must choose the value for this setting when they sign up for copilot.
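As a rough sketch of how a hash-based duplicate check with a minimum length could work (this is an assumption about the general technique, not GitHub's actual implementation): hash every window of N whitespace-separated tokens in the training set, then block a suggestion if any of its windows has a known hash. `MIN_TOKENS`, `build_index` and `is_blocked` are hypothetical names.

```python
import hashlib

MIN_TOKENS = 8  # assumed minimum length; shorter overlaps count as common code


def _tokens(code):
    # Split on whitespace so formatting differences don't hide a duplicate.
    return code.split()


def _window_hashes(tokens, n=MIN_TOKENS):
    # Yield a hash for every run of n consecutive tokens.
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        yield hashlib.sha256(window.encode("utf-8")).hexdigest()


def build_index(training_files):
    """Collect the hashes of all token windows seen in the training set."""
    index = set()
    for source in training_files:
        index.update(_window_hashes(_tokens(source)))
    return index


def is_blocked(suggestion, index):
    """True if any window of the suggestion exactly matches the training set."""
    return any(h in index for h in _window_hashes(_tokens(suggestion)))
```

Anything shorter than `MIN_TOKENS` produces no windows at all, so very common short snippets are never blocked, which matches the behaviour described above.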
I tried using copilot and it literally attributed the function I was writing to someone else before I could even start writing a line. It's been updated since and these errors are rare now, but they still exist.
> why are your Microsoft Windows and Office codebases not in your training set?
This is my favorite question about Copilot ever.
While GitHub might have a license to use that code to train the model, it’s debatable what license applies to the output of the model, and what users of the model can do with it.
It’s possible for an AI to reproduce something so close to the original that it would be considered an infringement on the original work.
The reasons that Windows is awful have nothing to do with code quality. Windows is awful because of intentional choices Microsoft made (e.g., bloatware that gets reinstalled with every update, mandatory Microsoft accounts, and mandatory telemetry).
Whenever I have to start Windows 10, I still see the same kinds of bugs that were present in XP. One example: they seem to be simply unable to fix the icons "near the clock", which are still shown after an app has been killed, until you hover over them. Things like that, but of course also lots of stuff that affects people more, in the form of annoyances that make every action take at least twice as long as on the GNU/Linux distros I run. It only takes minutes before I am frustrated with the system, because everything takes so long to do.
One similarly ignored bug that springs to mind is the performance of the "Send To" context menu item in File Explorer. I always dreaded dragging my mouse over it by accident.
They could also cache that computed menu and proactively update the cache whenever the relevant keys are changed. Either way, pretty far from "cannot be fixed without breaking the API".
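The caching idea can be sketched in a few lines of Python. This is only an illustration of the pattern (compute once, recompute when a change notification arrives); `SendToMenuCache` and `on_keys_changed` are hypothetical names, and nothing here reflects how Explorer is actually implemented.

```python
class SendToMenuCache:
    """Keeps a precomputed menu and rebuilds it only when told the keys changed."""

    def __init__(self, compute_menu):
        self._compute_menu = compute_menu  # the slow enumeration step
        self._entries = compute_menu()     # build once, up front
        self.rebuilds = 1

    def on_keys_changed(self):
        # Hook for whatever watches the relevant keys: refresh proactively,
        # so the next hover over the menu never pays the cost.
        self._entries = self._compute_menu()
        self.rebuilds += 1

    def get(self):
        # Opening the menu is now just a lookup.
        return self._entries
```

The trade-off is that correctness depends on every relevant change actually triggering `on_keys_changed`.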
Windows 11 is an example of poor code quality. Bugs everywhere, while the same things work on Ubuntu/Pop!_OS.
Past MS engineers have been commenting for a decade on how MS has grown too big, can't manage, and has become a monolith "too big to fail". By nature when engineers are small pieces of a giant machine, they don't do their best work. And those with the experience move on to better things.
My experience has also been that Windows 11 is buggy (haven't been using it for a while because it can't even reliably connect to the internet). But also in my limited experience (just one install on a single machine in ~2020, used for a few months): Ubuntu is just as bad or even worse.
Your experience is quite limited, and you probably need to know how to properly update Ubuntu. Most of the issues I've found with it (since I started using it ~12 years ago) are caused by a lack of drivers, which gets solved in 15 minutes once you know where to click. Once those are solved it is sturdy, and you can keep it running for several months without having to restart it or it becoming unusably slow, as happens with Windows systems after about 4 days of uptime.
This comment is funny to me because it was up to date and the particular issue wasn’t driver related: it was specifically that, after not touching it at all for a couple of months, each subsequent time I logged in it would randomly lock up, and it took about 15 min to boot.
It would lock up as in just take an extremely long time to do certain things in the UI. That sounds like a pretty odd way for a driver issue to manifest, but maybe I'm missing something.
The biggest issue with Windows isn't poor code or shitty engineering, it's the support for legacy software. MS engineers are some of the smartest in the world. The devs could fix the code and make a much better OS, but that would break boomer software used by big banks that haven't updated since the 80s. When Microsoft writes code, it has to promise support for decades, which means maintaining the same old outdated APIs for many years.
Outdated APIs don't have to affect the shell and built-in programs or anything else that is kept up to date. My Linux programs are no more buggy due to having Wine installed for similar compat with legacy Windows executables.
Was it? I recall the kuro5hin analysis of the leaked Windows 2000 source code[0] that said:
>there is nothing really surprising in this leak. Microsoft does not steal open-source code. Their older code is flaky, their modern code excellent. Their programmers are skilled and enthusiastic. Problems are generally due to a trade-off of current quality against vast hardware, software and backward compatibility.
They explicitly listed the reasons they think it's awful. My personal grievances with Windows align more or less with theirs and while I wouldn't go as far as to say it's awful, I'd use something certainly stronger than "annoyance".
Specifically, clear anti-user choices that exceed by far being "annoying":
* Making it exceedingly difficult or impossible to use the OS without logging in with a Microsoft account.
* Forcing the user in various ways to surrender data to Microsoft. Some of them can be disabled if you really go out of your way, others can't.
* Prompting me again and again to switch to Edge and other MS defaults. I've had the same install for a few years now and NO, I don't want to change to "Microsoft recommended defaults", no matter how many times you ask me.
* Showing the same "OS setup" screen after some updates, requiring me to pay very close attention to what I'm clicking, lest I select something MS is trying to lead me to. The amount of attention required from the user on those screens corresponds quite well with anti-user behavior.
>Making it exceedingly difficult or impossible to use the OS without logging in with a Microsoft account
This is hilarious. I recently got a new laptop that has window$ 11. After setting it up with a non-Microsoft email (which required a good fight), I tried to install some random app from the Microsoft Store, but got a "something went wrong please try again" on the first screen.
It's pathetic. I haven't used Windows since Win 7, which I basically installed for gaming. Seeing the latest version of the OS makes me feel sorry for them. That's why Apple, with all their assholery, is eating their lunch. (On the flip side, my wife just got an M1 MBP and I was pleasantly surprised that it has an HDMI port, MagSafe, and several USB-C ports. Apple seems to be going in the right direction.)
You haven't had root admin on Windows since Windows 7.
The telemetry makes this clear. Reboots and updates even more so.
The UI lag and stealing of focus ("oh, you're typing a document... too bad, I want to launch a new Explorer window that will immediately steal focus") make it clear that the computer is in charge and will probably listen to your requests, but on the timeline it chooses.
Windows by default already does compatibility in some crazy ways. The compatibility mode lies to the program about the system version, or even fakes old bugs so that programs relying on those bugs will run. And I'd imagine that, to make this work, MS would need tons of the shittiest code you can imagine in the source to fake those behaviors.
I called it Microsoft DNA. The way they do stuff is inhuman alien logic without any compassion or remorse (like all of their 10k+ Windows APIs, or dotnet, or the way they add features and handle support).
It is however plausible that the code is only "good" given internal considerations. Microsoft has specific internal coding styles designed to work with internal tools.
As a leader of a FOSS project that is on github, and migrated off of sourceforge because SVN and email patches were not scaling - I'm a bit confused by this article.
Co-pilot has issues, ergo github is going the way of sourceforge, and so now we must abandon github? Do I have that reasoning correct?
We need to:
- migrate the bug queue
- have all links in commit history break
- remove the application integration with github for bug reporting
- update documentation
- find a new website host (and no longer github.io)
- find a new CI/CD (we were already burned by travis, github workflows are nice)
- teach our user contributors to actually use git! There has been a lot of heartache from them that they have to use the pencil icon on a web UI to edit config files; now we have to take them back to using a git GUI client! We were on github before that blessed pencil icon feature came out, and there was no end to the wailing about how unapproachable the process was (super frustrating for users when they see they have to do something, and frustrating for us because our users wanted to just email us stuff so we could then do the uploading-to-git work)
- lose all PR history
- migrate project tracking
- find a new place to host release artifacts
- update our website to use a new distribution URL (the website scrapes github api to get latest version for download link; it's nice never updating website as we do releases on every merge)
- figure out and migrate repository permissions. (We have a hundred repositories of user generated plugin content, everything about migrating that would be a lot of work and missing important features)
- lose our search ranking and rebuild our SEO
What else to add to this pile.. and all because co-pilot smells?? Meanwhile all of that work is busy work, and we would not even end up at feature parity. That kind of migration would take a long time, and a pivot like that without good reason seems like the worst kind of churn. Convince me this article is not a temper tantrum about copilot...
That list is a perfect example why GitHub is so problematic.
Forget Copilot. Even without that you’ve put all of those services in one centralized basket, fully controlled by a for-profit company (with proven track record of unethical behavior).
And not only that, but this is true for the vast majority of FOSS projects!
I see your point. The disagreement I have is I would not at all consider copilot as putting MSFT into the evil empire category yet. While it was very unnerving for the ownership to change, MSFT has really had a pretty hands-off approach on Github.
I touched on it, but I still see this for-profit model as being compatible with FOSS. Specifically, make it free to FOSS so that those professional software developers then move to adopt the same platform at their private companies. There are lots of examples of for-profit companies that provide free-to-foss platforms.
For example, just considering code scanners, Codacy, CodeClimate, Snyk, LGTM, all used by this project and all have the same for-profit model (but free to FOSS). We also use 'install4j', which has the same free-for-foss model and is a for-profit company.
I think the heart of it is the argument that MSFT is going to expand that for-profit model and leverage data in a way that violates FOSS licenses? Is copilot an example of that? Should this all be taken to the extreme and declare that MSFT is the evil empire for this and by extension so is Github?
> MSFT has really had a pretty hands-off approach on Github
(a) since it is proprietary, you can't see where their hands are
(b) the fact that they haven't done anything also means you have no signal of their intent once they do. Not doing anything would be the logical play for a period of time whether you had evil intent or not.
> I would not at all consider copilot as putting MSFT into the evil empire category yet
The question is not whether Microsoft should be put into the evil empire category, but if they really left it or are just pretending like they did every time before.
> As a leader of a FOSS project that is on github...Do I have that reasoning correct?
The way I read the article was that by being on GitHub, you are implicitly agreeing to no longer be a FOSS project as regards licensing. GitHub customers can use Copilot to generate proprietary code that's identical to your project's code (several articles I have read call this overall idea "laundering through Copilot", which sounds incendiary but accurate to me) without needing to respect your license.
The other stuff you said is...kind of irrelevant. Sure, you get a lot of convenience from GitHub. If you don't care about software freedoms in the libre/copyleft sense and regard a "do whatever you want" license as the best, then it's probably fine to keep using GitHub.
Is copilot really a violation of GPLv3 (our license)? How is co-pilot different from someone lifting sections of code? Lifting a few sections of the code is a world different compared to re-distributing the entirety of the source code, or forking the project and replacing 2 or 3 letters from our brand name & then redistributing that.
I think the article needed to go into more detail about how that really is a violation of a license. This seems similar to the argument that was made in court over whether the Java APIs themselves could be copyrighted. Can an algorithm be copyrighted or licensed? If someone uses the same algorithm as found in FOSS, have they violated the license of that FOSS?
Then the reaction, instead of pursuing litigation and/or communicating and working with github, is that we should 'cancel' github and move to.. gitlab? Is that even an answer? If we think algorithms are copyright-able, wouldn't any kind of code search be a violation? Would allowing for any kind of transcription of code be a violation? If so, then seemingly having the source be open would invite this.
I think this gets to the heart of FOSS in some ways. It's closed for privatization, open to the community, and what matters is the software provided. If someone cribs the project to configure a Feign client, or set up unit tests with DbRider - it's okay! It's the same thing as viewing HTML source code: learning a cool JavaScript trick by looking at how some website did that trick is part of the openness.
I wonder then, is the point of FOSS openness only to allow others strictly to view and edit the code for the purpose of contributing back to that exact software product? Or is the openness more than that, and that others are going to use the software in creative and novel ways, and use it for learning and who-knows how else (all pursuant to GPLv3).
> Is copilot really a violation of GPLv3 (our license)? How is co-pilot different from someone lifting sections of code?
If it's reproducing your code verbatim, yes, it's a violation. And this is no different from others copying sections of your code.
The license that you chose says that people are allowed to do many things (e.g. copy and modify the code), provided they also fulfil some obligations. If they don't do that, they are violating the license.
Please note that this is not specific to GPLv3 (your chosen license). Other licenses, BSD-3-Clause, for example, also give rights (e.g. to copy and modify) and have their own set of obligations to be fulfilled (e.g. "Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer"). If someone redistributes source code (not the "complete" source code, even a "section of code" to use your words), they have to fulfill the obligations, or they violate the license.
It sounds like Github needs to index the license of where it obtained code, and check the license of a project using co-pilot so that it only suggests GPLv3 code for compatible projects. I'm not sure that it is really enough, enforcement is tough, it could be like one of those "are you 18" checkboxes.
I get the impression that this article and its reaction are largely a lot of people wanting to cry big brother and mostly just shit on Github & Git. IMO it would be more appropriate to look for solutions and post a public letter to github and/or just pursue actual litigation before advocating that FOSS projects spend an inordinate amount of effort across the board to vacate Github.
I think the article starts from the position "everyone should respect the software license" and is obviously promoting the SFC's views.
You are correct that the problem is hard, because adding licensing info to training data and subsequently using it is something not yet accomplished by anyone (or at least anyone who has spoken publicly about it). It might be that the only way out would be to have different training sets according to license, but then you lose the advantage of scale.
On the other hand, I believe it's a little much to ask SFC to do the cutting-edge research and propose solutions to Microsoft. Let's not forget that, when faced with the problem, their chosen path was "completely disregard licenses (for now), ask legal later". Many legal opinions are that there would not be any issue if copies of the input code were not reproduced verbatim; unfortunately this is not the case.
Microsoft could afford to train several models for every significant class of licenses. So if you don't want the model which included e.g. GPL code in its training set, you would opt for the model that was trained on the dataset which didn't include any. It's a bit of a hassle for MS, but that would partially resolve this issue.
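A minimal sketch of the first step of that idea, splitting a training set into per-license buckets, might look like this in Python. The grouping of SPDX identifiers into classes is a simplified assumption for illustration, and `partition_by_license` is a hypothetical helper; a real pipeline would also have to deal with multi-licensed, unlicensed and mislabeled code.

```python
from collections import defaultdict

# Assumed, simplified mapping from SPDX identifiers to license classes.
LICENSE_CLASS = {
    "MIT": "permissive",
    "BSD-3-Clause": "permissive",
    "Apache-2.0": "permissive",
    "GPL-3.0-only": "copyleft",
    "AGPL-3.0-only": "copyleft",
}


def partition_by_license(repos):
    """Group (repo_name, spdx_id) pairs into one training bucket per license class."""
    buckets = defaultdict(list)
    for name, spdx_id in repos:
        # Anything not in the mapping is kept apart instead of being guessed at.
        buckets[LICENSE_CLASS.get(spdx_id, "unknown")].append(name)
    return dict(buckets)
```

Each bucket would then feed its own model, at the cost of the scale advantage mentioned earlier in the thread.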
> Is copilot really a violation of GPLv3 (our license)? How is co-pilot different from someone lifting sections of code? Lifting a few sections of the code is a world different compared to re-distributing the entirety of the source code, or forking the project and replacing 2 or 3 letters from our brand name & then redistributing that.
Yes. Many projects will see your license and never copy your code if their license is incompatible with yours. Copilot disregards your license and offers your code or derivations of it to anyone who asks the correct questions, breaching your license on both an ethical and a legal level.
You can't lift a function from a GPL licensed codebase and plant it to a MIT licensed codebase. It's that simple.
I appreciate the response. The ethical level, in my opinion, is reproducing the functioning software and the "business" specific logic. At a specific function level, and particularly for boilerplate code, I don't view that as specific to this (GPLv3 licensed FOSS) project. There are also dozens more examples of the same code being used, and I don't think FOSS necessarily gets a monopoly on technology.
Which I think creates an interesting debate, where is the line of general technology vs the intent of not allowing private companies to re-package FOSS for their own benefit?
> You can't lift a function from a GPL licensed codebase and plant it to a MIT licensed codebase. It's that simple.
If that function is a 'pad-left' type of function, is there an ethical violation? Or is this similar to learning Javascript by viewing the source code of webpages? At some point, functions are hardly unique in what they do, and there are only so many different ways to write a 'pad-left' function. I mention this question not to refute what you've said (I think I agree there is a likely license violation), but to explore where the line is for the ethics. I mean, an inferred implication could be that software developers stop looking at HTML source in order to learn. If you learn how to write a standard algorithm or function, and then you reproduce that later in private software, that is "lifting code". Committing that to memory is not much different than doing an outright copy-paste of a 2 or 3 liner. This also makes me think of the variety of shell scripts and standard bash'isms: if a single FOSS project uses a 'sort | uniq -c | sort -n', does that mean an MIT codebase has to invent a new way to do that?
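To illustrate how little room for originality a 'pad-left' function leaves, here is a minimal sketch in Python (the name `pad_left` and its signature are made up for this example). Most independent implementations would end up looking very much like it.

```python
def pad_left(text, width, fill=" "):
    """Left-pad text with fill until it is at least width characters long."""
    if len(fill) != 1:
        raise ValueError("fill must be a single character")
    missing = max(0, width - len(text))
    return fill * missing + text
```

Python even ships the same behaviour as `str.rjust`, which underlines the point about generic functionality.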
You're welcome. First of all thanks for your kind and welcoming response.
It's an interesting debate for sure. When a program is divided into many functions, and when there are many utility functions, the debate indeed gets blurry. Moreover, many of the open projects in today's world are web applications and "simple" in nature.
In this context, simple means that the application can be composed via many simple, relatively general functions with a unique asset set and unique way of connecting them, so the magic (or sauce if you pardon the term) gets more abstract.
On the other hand, we have another group of programs, which can be similarly called "complex" programs. The biggest difference is these programs use complex, one of a kind functions.
Consider Blender, KiCAD, many open source and GPL licensed scientific software, libraries like Eigen, video/audio encoders, CD/DVD tools. Even a three liner in these programs and libraries can be a game changer.
I have research-oriented code, and a ~25 line function in it is worthy of its own paper. I published a paper, and didn't obfuscate the language and algorithm, so you can implement it if you want.
However, if I open the code of this algorithm with AGPL3, can you say it's too small to be licensed? I don't think so. A more general algorithm can be ruled out as "too simple", but "fast_inverse_square_root" or my function or any other math heavy secret sauce can't be excluded because it's 2-3 lines.
As a result, while this issue needs serious discussion, we need to understand that even a small piece of code can carry a lot of research, knowledge and advantage in itself, regardless of its size.
So, while lifting a left pad from a GPL code can be understandable up to a certain point, getting the secret sauce, improving it and keeping it closed or merging into a similar, but incompatibly licensed software is inexcusable.
At the end of the day, this means we need to defend our GPL codebases, because we can't protect our secret sauce functions without defending the simpler ones.
If source is not subject to copyright then, regardless of it having been illegally leaked, I could unbrand XP's source code (which, as I recall, was leaked not long ago), use it, and redistribute it even for profit... I would like to see M$'s answer to that.
I think the issue with Java's API was actually terribly explained even by Oracle's lawyers. Google stole that API, first because it already had a large number of developers familiar with it, so they could benefit from cheap code monkeys developing for their platform. They added modifications (memory management) which, under Java's open source license (Oracle owns Java, yet it is still open source), they should have made public for everyone's benefit (in a true open source spirit). They also modified the packaging mechanism (slightly, you know, .apk), so instead of having a package (jar) that can run on a desktop, or rather on any JVM, it could only run on devices with their* OS (or rather the Dalvik implementation). And guess what: now that there are some options to run those in other environments, they "casually" decided to change it again instead of letting people benefit from the rich APK ecosystem somewhere other than Android.
> I think the issue with Java's API was actually terribly explained even by Oracle's lawyers…
Of which none of your examples are in any way applicable to the case of google violating the Java copyright.
The Java Specification gives terms and conditions for usage, which is what the lawsuit was about, not some laundry list of things which annoy a random internet person. In fact, not annoying your users isn’t even mentioned once in the conditions to call an implementation “Java”, strangely enough.
Or I’m completely wrong and oracle doesn’t spend enough on their legal team.
> The way I read the article was that by being on GitHub, you are implicitly agreeing to no longer be a FOSS project as regards licensing.
In this case, they have violated our license. The license states how it can be changed, and implicit changes by the hosting company are not among those ways. Isn't there a legal case here? If no, then the license is not violated. This makes me wonder why not take Github to court vs a temper tantrum (we're taking our marbles and going home!)
> the other stuff you said is...kind of irrelevant.
It goes to show that Github is providing a really excellent service to FOSS. We could migrate to GitLab, but why?
I know that sounds like 'fan-boy', but the list is large, useful, and all highly available and free to FOSS. By those measures, it's a good service.
> Sure, you get a lot of convenience from GitHub.
The items are more than convenience, they are core to our application and project. Uprooting them is not a small task.
The automatic integration with issues is excellent, for example: before we did that, we had no idea how many users were seeing errors. We run a thick client that is downloaded, and we added an integration to upload error reports to github issues. That has been pretty invaluable. So, we have to move all that, to another host: who is to say that other host will always be a better FOSS steward? who is to say that other host has anywhere near the level of features? An integrated CI/CD and hosting of release artifacts is huge.
So, on the premise that our license has been violated, instead of suing, we should take our FOSS somewhere else? Again, the list is large, we need to migrate all of that. It took a long time to get out of sourceforge, and it's now even more work to get out of github because we automated so much (to allow our team to scale better). It's not just a matter of 'convenience' to go somewhere that is feature-sub-par and spend the better part of a year to do that and nothing else.
> This makes me wonder why not take Github to court vs a temper tantrum (we're taking our marbles and going home!)
Mainly because it might be more in line with traditional FOSS ideals to take our marbles and go home? It's hardly a temper tantrum. After the Linux kernel got into one too many arguments with Bitkeeper, there was an inflection point where Torvalds got fed up and just wrote git instead. I think this is an inflection point like that. And "taking Github to court" is much easier said than done for a FOSS project, although I assume that someone will be doing that at some point.
> So, we have to move all that, to another host: who is to say that other host will always be a better FOSS steward? who is to say that other host has anywhere near the level of features? An integrated CI/CD and hosting of release artifacts is huge.
In addition to GitLab, I think Sourcehut offers similar features and is also FOSS (AGPLv3).
> So, on the premise that our license has been violated, instead of suing, we should take our FOSS somewhere else? Again, the list is large, we need to migrate all of that.
I don't think anyone is telling you what to do, just offering suggestions. My suggestion is that switching to a forge that respects FOSS is a reasonable alternative to legal action against GitHub or just ignoring the potential license violations via Copilot. You can of course choose any of the alternatives, or the status quo; which is what you seem to have largely convinced yourself is fine. Good luck, and thanks for working on FOSS anyway!
> generate proprietary code that's identical to your project's code (several articles I have read call this overall idea "laundering through Copilot", which sounds incendiary but accurate to me)
Can you point me to those articles? I have seen the Quake thing, but "your project's code" is not like that one function in Quake (namely, it's not duplicated hundreds to thousands of times through countless license violations already).
Anyone can already generate proprietary code that is identical to your GitHub project's code just using copy and paste.
For people that care to respect copyright, there's a copilot setting to block exact copies of code in the training set (which only happens a tiny percentage of the time, unless you're actually trying to make it happen).
For people that don't care to respect copyright, git clone is a way more efficient way to violate your license.
Forget Copilot for a second. Do you think it is healthy for the industry to be so dependent on one single service provider, and a closed one on top of that?
There is gitlab, a decent alternative. Bitbucket is a (poor) option too. I think I would quibble with that dependency statement. Github is providing a lot of services, high availability, project tracking, a web api, etc...
The money model, as for many free-to-FOSS tools, is that by getting devs to use those tools, they'll carry them forward into their professional lives and recommend adoption by their companies. That does happen in practice, so it seems like the money model will not necessarily flip like it did for sourceforge (which kinda was garbage and the only game in town).
I would disagree about the industry dependency compared to FOSS. Many companies are not on github.
So, that is to say the dependency aspect is a concern. So far Microsoft has overall been a steward for FOSS and copilot is not at all nearly enough to lose that trust. It is always a bit nerve wracking to place your balls in someone else's hands... and it was concerning when MSFT bought github.. but they have not been evil, not even close yet (in the grand picture)
You say BitBucket and Gitlab are alternatives, but at the same time you can not fathom the idea of migrating away from GitHub. So it's safe to say that GitHub has a de facto control of the market, much like Windows controlled the desktop market until early 2000's.
I can fathom a migration. It's just not pretty & is expensive. The experience coming out of SourceForge was not pleasant, and that was before the project even had a CI/CD. The early days of Github were game changers for FOSS, no more consolidating email patches together, etc.. So, this goes to whether Github still has a reputation and a brand for being a good home to FOSS. Copilot, which automates what is otherwise an available and manual process, and lifts just lines of code and small sections, is not at all "reproducing software". It's like someone used the "pad left" functionality from someone's Javascript on their web page. Being able to do that is part of the point, it is a feature of openness, it's not a corporate back-door, market monopoly enabling flaw.
I'm curious if anyone can find references, but when I researched market share of code hosting companies a few years ago (for a private company that was moving off of BitBucket), it turned out that there were more private companies on Gitlab than Github. Github though had a big advantage for hosting FOSS. We wound up moving to Microsoft Azure because the scrum boards and Microsoft integration were appealing and familiar to the company. I don't see it as analogous to Windows desktop control in the early 2000's.
As an alternative to github... one of these is not like the others.
In any case... why github? What is so unique about it that you can't even consider other possibilities? I guess soon people will be non-ironically saying "no one ever got fired for hosting their code at Github" and turn a blind eye to perfectly usable, open alternatives that do not lock us in.
For what? Fear of taking responsibility for maintaining the basic tools for their job? If that is the case, you can always pay other smaller, independent companies who can host at competitive rates.
Anyway, you do you. I'm tired of playing Cassandra, and I'm tired of seeing people giving in to convenience and general conformity.
I think it's like any social network. It grows in value with how many people use it. Yes, you can host your own public git repo or even your own gitlab, but then there's a barrier of entry to contribute to your code, and it's a lot harder for others to discover it.
The network grows in value with the size of the protocol, not just with a platform. Social networks can and should operate like email, not like siloed platforms. All the value ends up being captured and controlled by one single entity.
(I will not go off on a tangent about web3, but the one thing that web3 skeptics always fail to acknowledge is how the current web is broken in that regard. We were promised open protocols, and we ended up with a handful of companies building their own walled gardens.)
The only way that Github would get any modicum of credibility would be if they joined the effort from codeberg/forgefed and integrated with activitypub. As it is now, github will be nothing but a mirror for my repositories that I will be hosting on gitlab and/or my own gitea.
Familiarity with the UX and conventions on that platform. Almost everyone knows how to make a PR against a GitHub repo. But some random other code hosting site? It would be a lot less familiar, and people would have to spend time making an account and figuring out how to contribute.
Even low barriers of entry can cause a big drop in user engagement.
If GitHub didn't provide value, then it wouldn't be where it's at. Considering that collaborating on FOSS before GitHub was a mess if you weren't technical, I'd say GH has earned their spot.
Also, there's no way to avoid centralization. At some point somewhere, you're relying on a mega-corp for critical services, and if not you, someone else.
There's no getting away from it. Unless you're RMS, you're dependent on Microsoft or Apple. Just like your phones are dependent on Google or Apple. Is it a good thing? Nope, but it's real convenient and so most people will compromise.
> Unless you're RMS, you're dependent on Microsoft or Apple
No. Stop believing this shit. It has never been easier to not depend on any of them.
There are good, perfectly usable phones with de-googled Android. You can buy laptops and desktops that run Linux without issues out of the box for years. You can even game on Linux today better than you can on an Apple box. You have legions of people working on different open source projects that make hosting your own server a matter of point-and-clicking.
It's not a Sisyphean job to keep yourself away from bad tech products. It takes some discipline, but so does anything worth doing.
And if you really just want to pay to get rid of any "headache", why not then pay for an open source alternative so you can be safe knowing you won't get locked in?
> Co-pilot has issues, ergo github is going the way of sourceforge, and so now we must abandon github? Do I have that reasoning correct?
No, you don't. If you had actually read the whole article[1], you would have noticed that they also listed several other issues which far predate Copilot.
___
[1]: You need to correct this part of the guidelines, dang. Sometimes -- like here -- the injunction against saying this is just fucking wrong.
TL;DR: The rest of the article is vacuous, and the list of other reasons is a link to a list of bothersome but not compelling reasons. In sum, it's not a compelling argument to give up github. I disagree that the correct course of action based on the arguments presented would be to boycott Github rather than take any course of action that would lobby them to change.
If this is just fucking wrong, I wonder about the implication. Would you say, I am just fucking right? I would be careful whenever having any such conviction. If I were already a hater of github, this article would have resonated with me a lot more. Perhaps there is some kool-aid drinking happening here? Maybe not, and everything is totally a reason to abandon Github as a hosting platform.
> you would have noticed that they also listed several other issues which far predate Copilot.
I don't quite see that list. I do see this article as quite focused on co-pilot. Though, I do see this:
> There are so many good reasons to give up on GitHub, and we list the major ones on our Give Up On GitHub site. We were already considering this action ourselves for some time, but last week's event showed that this action is overdue.
Which links to this https://sfconservancy.org/GiveUpGitHub/ (I'll point out here, linking to a list is different from listing the other points, so.. 'fucking right?')
Re-capping that list from sfconservancy:
(1) co-pilot
(2) contracts with ice
(3) github's hosting code itself is not FOSS
(4) no self-hosting with github code options (again, github hosting is itself not FOSS)
(5) work to discredit copyleft
(6) wholly owned by MSFT
In this article, it essentially says that co-pilot was the last straw, and a decisively large one at that. So my response is still, this is the last straw where we need to abandon github?
If you are fully vested in the other reasons, then I think that this article would be preaching to the choir.
For me, points 1-6 are bothersome, but still just 2s and 4s on the 1-10 scale of fire alarms. My personal take on this list:
(1) co-pilot: seems problematic, perhaps github can fix it. Maybe a better solution is to lobby github first before doing a cancel campaign
(2) ice contracts: this is bothersome; but I can see how it could be a bit complicated given ownership by MSFT and the complexity of government contracts. It is bothersome though.
(3) closed source: I don't put a lot of weight on this criticism. Just because I can't run Github's code for myself... I mostly shrug. Yes, I'd prefer for it to be open source too, but I respect there are various for-profit models out there (and holy-hell I wish I was paid market-rate for FOSS work).
(4) no self-hosting option: Seems like the same point as (3)
(5) CEO leadership discrediting copyleft: bothersome, but without concrete examples, for me it is not fully substantive and just bothersome (but not major, like, wow, they are shutting down FOSS projects, or aiding in getting them taken down by ginning up BS charges, etc.). So, yes, the CEOs of Github were at times discouraging to copyleft. Are they evil incarnate, such bad actors that we need to spurn everything Github? Did the CEOs of Github personally oversee any FOSS projects being sued, or made into non-FOSS? Did they personally increase the cost on FOSS?
(6) owned by MSFT: big companies are big companies and they are really hard to avoid completely... MSFT has had a big culture shift in the last 5, 10 and 15 years. It's not the same company it once was. That is not to say this is not a concern. Though, until I have specifics around how/why I think MSFT has become actively evil, this remains just a notable concern.
> If you had actually read the whole article[1]
Apologies for seeming like I only commented on the first paragraph. A lot of the article seems vacuous to me and generally trying to gin up a mountain out of what might just be a gnarly hill.
Git is confusing for just about everyone. It is really easy to shoot yourself in the foot once you step away from git add/git commit/git push. Hell, you can foot-gun yourself even with the usual workflow.
GUIs help with bridging the gap, but because the GUIs make everything that's actually happening opaque, troubleshooting gets really complicated when something goes wrong (GitHub also sells professional services).
Github and Gitlab have done a lot to make Git easier.
>What else to add to this pile.. and all because co-pilot smells??
You are absolutely right that this all is a huge pain in the ass. GitHub, and later Microsoft, played their cards well. The product both works well, and also creates such a moat of vendor lock-in that it won't make sense to leave.
SourceForge went bad ... and everyone left. That doesn't seem like a bad thing and there's no reason for me to think any given site / service will or won't go bad too. I expect that for any number of reasons I might need to move from one site to the next.
The rest too is kinda hollow to me. The fact that they're for profit doesn't upset me. I figured they wanted to make a profit when I signed up even ... not sure how that would surprise anyone now.
Co-pilot, I personally don't feel there is a compelling reason to leave github due to that either.
Maybe I'm not versed enough in some of this but as a rando dev I'm just not having any problems on github these days ...
I'm not saying the author is wrong or right or that I'm right or wrong, just that I'm not finding that article very convincing.
There actually _is_ a good argument in there, but the article is really poorly written and all the preamble about SourceForge ends up being a distraction from what you really need to stop and think about:
FOSS projects like the Linux kernel use the GPL license because the developers want their code to be free not just for themselves, but for everyone everywhere for all time. It's not acceptable terms for you to take their work and use it to build an alternate operating system that you aren't going to share. If this wasn't important to them they could have just published their code under MIT/BSD licenses.
If you were to build an AI that used the Linux source code to generate a "new" closed-source operating system, in a very real sense all you've done is invent a new way to plagiarize the Linux community's work so that you can weasel your way out of their license terms. Even if you got away with this in the courts, it's obviously very unethical.
What Copilot does is enable the mass plagiarizing of open source code from everyone all at once, mixed up together so that it's hard to know who the original authors were, and then pretend that somehow this makes it ethical.
It's been years, but as I remember it SourceForge's primary downfall was the bundling of malware with binaries. That's why people I know stopped using it completely rather than because it was run on a proprietary platform.
Sourceforge, for all of its name, wasn't really a place to get source - I'd say some immensely large percentage of users used it to download binaries for Windows.
I find it very strange that the same HN crowd that loves open source and wouldn't touch anything proprietary with a 10 foot pole has so many reservations against moving off of GitHub. I am not generalizing here, just surprised. According to what I have seen in my short experience, HN should have ditched GitHub long since.
Anyway...
I have always been curious as to why the largest hosting for OSS isn't open source itself. Maybe I am not intelligent enough to realize the reasoning behind this. Imagine if git wasn't open source! That's like an OS that only runs open source software but isn't open source itself. It just doesn't make sense for people to trust such an obviously flawed service...and yet they do.
And if that wasn't enough when GitHub got acquired by Microsoft very few people thought it amiss. Indeed, even now a lot of people are happy that Microsoft is running their digital homes. I think if GitHub was measured on the FOSS scale it would fall short on every measure.
Co-pilot isn't even that big of a deal. That's just the icing on the top. GitHub is untrustworthy top-to-bottom even before there was Co-pilot.
But I suppose convenience always trumps openness and freedom. It's especially sad because the whole point behind FOSS was this. Is the whole FOSS idea getting old?
The worst thing is that GitHub has monopolized the open source world. We can't even think of moving off of GitHub because of what we'll lose. But how about we do this:
We create a dummy repo on GitHub for our project that has all the fancy README, releases, issues, actions etc. but we keep the actual code out of GitHub on an open source service. Would that work? Is that feasible?
Basically, we use GitHub's wide adoption for what it's meant to be used: to market/share your project but keep the source code on a separate platform. This would create a new host of problems but I think it can actually work.
As a strong Free Software supporter on HN that uses Github I'll take a shot.
I use Github to publicly host my projects. I use github as I want them as public as possible and that is where all the people are.. I.E. it has the lowest friction for others. The tradeoff is a bit of extra work on yourself to make sure you keep the 'git' part of your repo the central thing. Use the hosting as extras appropriately, but don't rely on them. All relevant context, reasoning, etc. needs to be in the git commit messages. The git repo should stand alone and tell the complete story.
IMO following this simple rule you can keep your project git repo wherever is easiest for the users without lock-in. Github's added features over basic git hosting are decent but none of them are irreplaceable if you keep your repo up properly.
Hosting your projects on Github is not an issue. The issue is hosting your projects only on Github, or depending on it for anything other than visibility.
I have my own gitea for private projects, but the open source one I host primarily on Gitlab. It's where I set up the CI, it's where I have pages, docs, etc. I do have a mirror on Github, but on the "Contributing" section from the README I make it clear where I prefer to receive PRs.
If you are not expecting other people to collaborate with you and if you do the same local hosting for your CI and issue tracking, etc... Fine, I guess?
I only worry about collaboration via github. I don't need an issue tracker or CI on my local repo. Github issues and PRs are discussions, not records. That's the point of the self contained repo commit messages.
You seem to be under the impression that OP was criticizing your approach. Your usage of Github is not representative of the majority of cases. You are only in Github for visibility, and the commit history/repo hosting are already distributed, which is fine. You can move away at any time.
What OP was criticizing was these larger FOSS projects who don't seem to mind that they are doing all their work on a closed platform, and that they have a lot to lose if Github decides to pull the rug from under them.
I went off a bit into more personal cases in the thread. But my original post was aimed at other/larger projects. If they maintained a proper git repo and used the platform tools as secondary tools (eg. discussion oriented instead of record oriented) then they wouldn't be locked into Github. I agree that a bunch of ones don't do it correctly and put the context that should be in the repo/messages into the PRs or issues. That is a mistake. But switching from Github won't help that... they'll just do the same thing elsewhere and lock themselves into that site or tool (ie. lockin doesn't require a service, just tools that do more than manage the repo).
> But I suppose convenience always trumps openness and freedom. It's especially sad because the whole point behind FOSS was this. Is the whole FOSS idea getting old?
It seems so---I read the post above saying essentially "I don't care about any of this criticism, I'm going to keep using GitHub because switching would cause too much churn" and did a double-take.
I get the impression that people no longer understand what it is to boycott something.
The whole point is to punish the boycottee which may cause inconvenience. Much easier to dogpile on twitter I suppose.
I can’t really think of a universally sustained boycott since South African apartheid to be honest. Even the current Russian stuff is too inconvenient for the majority of the world so they keep buying their oil and gas.
The problem with a lot of other proprietary platforms is that you don't own your data, so it's not easy to migrate to something else. With git, it's trivial to `git push` to a different platform if you decide to move off of GitHub. And you can at least get your issues out via the API. I think this, along with network effects, is why many people here are more accepting of GitHub than of other proprietary platforms.
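To make the `git push` point concrete, here is a minimal sketch, assuming two local bare repositories stand in for the old and new hosts (`old-host.git` and `new-host.git` are made-up names, not real services). It only covers branches and tags; issues, PRs and CI configuration would still have to be exported separately.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox: "old-host.git" and "new-host.git" are local bare repos
# standing in for the old and new hosting platforms (hypothetical names).
work=$(mktemp -d)
git init -q --bare "$work/old-host.git"
git init -q --bare "$work/new-host.git"

# A working clone with one commit and one tag, published to the old host.
git clone -q "$work/old-host.git" "$work/project" 2>/dev/null
cd "$work/project"
echo "hello" > README
git add README
git -c user.name=Example -c user.email=example@example.invalid commit -q -m "Initial commit"
git tag v0.1
git push -q origin HEAD --tags

# The migration itself: add the new host as a remote, then push every
# local branch and every tag to it.
git remote add newhost "$work/new-host.git"
git push -q newhost --all
git push -q newhost --tags

# Both hosts should now resolve the tag to the same commit.
git -C "$work/old-host.git" rev-parse v0.1
git -C "$work/new-host.git" rev-parse v0.1
```

Pushing with `--all` and `--tags` from a working clone is used here rather than `--mirror`, since `--mirror` would also copy the clone's remote-tracking refs to the new host.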
I publish my public FOSS work on a self-hosted Gitea. I don't allow account creation, and people can send me pull requests by email. That said, I think one thing (other than interface and brand loyalty) that keeps FOSS projects on GitHub is network effects. You can reasonably expect to search it and find the projects you're looking for, and your account lets you use the issue tracker and pull requests on other projects. I think forges are unnecessary in general, but to wean people off of GitHub, FOSS forges like Gitea need to federate, so that you can search the whole space of public federated forges, and an account on one lets you open issues and pull requests on another.
They seem to be making slow but steady progress on this at Gitea, maybe at other FOSS forges, too.
Gitea (and a few others) are working on federation for pull requests, which would allow someone to fork your project to their own server, and send a pull request offering you to merge from their server into yours.
It also builds on top of ActivityPub, which is supposed to allow federation with the greater ActivityPub ecosystem ("fediverse"). I guess this would allow people to like or comment on your issue or pull request from Mastodon and those other platforms.
I am not that excited about that last part, but federating pull requests sounds like a killer feature and a necessary step for a chance to topple GitHub. If hosting my own forge means I have to either get patches over email or let people register so they can create their own fork here, it's a non-starter for many.
> I am not that excited about that last part, but federating pull requests sounds like a killer feature
Git already has a pull request feature [1] that's as federated as it can get. The 'request-pull' command can be used to request a pull on upstream repositories hosted anywhere (or not at all). The only requirement is that the downstream clone must be online. I know that HN isn't particularly fond of email-based workflows for git. But requesting a pull is as simple as copying the output and mailing it (or sending it via any text messaging service) to the maintainer. And doing a pull as a maintainer is actually easier than doing local PR merges using Github.
A pull request on a Git hosting platform is much more though. It allows commenting, tracking versions of the branch, and showing the diff. Even if the pull request is denied, those comments and the diff will stay available for all to see.
You can replicate that with email with a patch-based workflow, if you have a mailing-list server with public archives. That is not that much less software, and you have to deal with email deliverability etc.
Simply sending someone a git-request-pull doesn't carry anything for posterity. It contains a link to some place that hopefully contains the changes at one point (if you typed it right, git does no validation), but probably won't contain them for long.
> But requesting a pull is as simple as copying the output and mailing it (or via any text messaging service) to the maintainer.
I love email based workflows as much as anyone but for many people this is not "simple". For one, you need to be able to send plain text email or at least not have your client mangle it too much.
> I know that HN isn't particularly fond of email-based workflows for git.
I would think it's not just HN. Do you think people are using github PR feature because they just don't know about the email-based workflow available, but would prefer it if they did? Most people don't want an email-based workflow here.
Gitea and Gogs weren't originally meant to be replacements for Github. They were meant for self-hosted internal repositories. That's why they are still on Github, even though hosts like Codeberg use them for public hosting these days.
IMO it kinda speaks loudly that Gitea's development happens on github!
There's a reason that's the case, and it's likely one of the reasons I should just use github as well; despite being morally opposed to what they're doing WRT copilot.
That's because when Gitea was early in development, it would have been not usable enough to develop itself (the Gitea people have a post somewhere explaining about it; how too-early dogfooding can actually make things worse as you try to implement 'urgent' features in a rushed manner rather than taking the time to do it right).
Sibling has already provided the tracking issue for getting off it; as far as I remember it's close now :-)
For early development that makes sense. But if Gitea is not ready to self-host now then it is also not ready to host most other projects so that doesn't really invalidate the criticism. However from the sibling comments it seems that the move to self-hosting is in progress.
It looks like the only feature blocking their move to self-host is importing data (issues and PRs) exported from GitHub. Which is to say, it's perfectly ready for hosting new projects.
I love email & personally prefer it for communication, but there is an entire cohort of developers headed your way who rarely use it outside of "job-related requirements".
A possible middle ground here is to allow GitHub OAuth2 logins to your Gitea instance - then at least GitHub users can join in your project without too much friction.
Yes! Federation is so important. Federation for issues or issue triage on something other than Gitea would probably be important.
Better start self hosting gitea right now. And you can do it for free. I think the best option is oracle cloud, or does someone know how well it works on fly.io? https://paul.totterman.name/posts/free-clouds/
Actually, I don't understand why I should cater to all needs of all developers. I'm opening the code, accepting PRs, and these are my terms. You can agree or disagree.
If you really want to contribute, but don't want an e-mail based flow, send me a mail, and we can discuss.
Generally the ones I don't mind losing. If anyone can't figure out how to send a git pull request or patchset by email, I'm happy for them to email me with questions on how to, which I will answer as best I can.
I can figure out how to send you a patch set via email (see my Linux kernel contributions), but if I can avoid doing that, sure as heck I will. Your project must be really important to me, or I have to get paid.
Based on your first example of running git send-email without providing it any patch files or revision list, you appear to be making the assumption that someone doesn't bother reading the documentation before using the tool.
This would be like someone trying out make for the first time and not understanding why it isn't working because they didn't realize they need literal tab characters in the makefile for the rules to work. But if they don't read the documentation, there's no way they would know that.
The real problem is people trying to figure out how tools work by experimentation as opposed to reading documentation. If someone reads the documentation of git send-email and the project's contrib document contains the preferred settings for that utility, then submitting patches should not be an issue.
That's a valid observation of one of the reasons I won't use git-send-email.
I have limited mental resources, and given the choice between a tool where I have to spend half an hour before I can begin using it, and a tool which will guide me, I'll always choose the latter. After I'm done, I can even forget I ever used the latter tool! It's a boon for one-offs.
Keep reading, there's more criticism on other aspects of the tool.
After reading through the rest of the post, I do see your point. When I last tried it, I thought that most of the email formatting should be done with git format-patch and then git send-email should be used to actually send the email without having to answer any questions.
That would address one of your concerns about saving the email on disk and also ensuring that the headers have the correct contents.
If the project's contrib document contained information about what settings to use for format-patch and send-email, then the process would be much more seamless. I haven't looked at the kernel (or subsystems) documentation on that. The git project itself doesn't seem to contain that information though.
Regarding your other point about using your email client to handle sending the emails, git does have a utility called imap-send that would allow you to upload the patches to an IMAP folder, which, I believe, would allow you to then send the messages using your MUA of choice instead of git send-email.
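A rough sketch of that split, assuming local directories stand in for the project and the contributor's clone (`upstream`, `contrib`, `outgoing` and the `base` tag are made-up names). Actually sending the mail needs SMTP or IMAP settings, so that step is only mentioned in a comment and the patch files are applied directly with `git am` instead:

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)

# "upstream" stands in for the project's repository (hypothetical name).
git init -q "$work/upstream"
cd "$work/upstream"
echo "one" > file.txt
git add file.txt
git -c user.name=Maintainer -c user.email=maintainer@example.invalid commit -q -m "Initial commit"
git tag base

# The contributor's clone, with one new commit on top of "base".
git clone -q "$work/upstream" "$work/contrib"
cd "$work/contrib"
echo "two" >> file.txt
git -c user.name=Contributor -c user.email=contributor@example.invalid commit -q -am "Add a second line"

# Step 1: write one mbox-formatted patch file per commit since "base".
git format-patch -q -o "$work/outgoing" base
ls "$work/outgoing"

# Step 2 would be sending those files, for example with
# `git send-email "$work/outgoing"/*.patch`; that needs mail settings,
# so this sketch skips it.

# Maintainer side: apply the received patch files on top of upstream.
cd "$work/upstream"
git -c user.name=Maintainer -c user.email=maintainer@example.invalid am -q "$work/outgoing"/*.patch
git log --format=%s -1
```

Because the patch files exist on disk before anything is sent, the headers and formatting can be checked first, which is the concern raised above.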
You could mail patches as plain text attachments, if there are concerns about the clients mangling them. You could also try some easier plain-text clients instead of mutt. Claws mail is a simple GUI based one.
Honestly, if you can't be arsed to format an email then why should you expect anyone to spend the effort to review your patches and maintain your additions going forward.
Drew DeVault created this[1] to help people start using git send-email. I'm not sure if it works for the purpose of contributing to your project, but it may save you from having to explain the same thing repeatedly :)
I want to add the guide to submitting patches for the Linux kernel [1]. It covers a few finer points, in addition to what's on the site you linked.
Honestly it's not that bad. If you don't insist on getting correctly-formatted email for git-am, most people manage fine. They'll send you the output of git-diff, a git-bundle, or attach the files they changed.
What Microsoft is doing with GitHub is gross. Like VSCode, it’s an example of exploiting our natural affinity towards convenience to lock ever-growing parts of the developer community to their services.
I had thought that JetBrains products require a license, unlike VSCode, which is free in terms of monetary cost.
However - the main gripe this community has with VSCode is that it is only partially open source. Microsoft adds extra bits to the code that are not open - mainly the items around telemetry. You can get around this with VSCodium or VSCode-OSS, but those are arguably forks and not an MS product.
I still don't understand why that's an issue. VSCode is a proprietary product. There are tons of FOSS alternatives if it doesn't work for someone or the telemetry is a problem.
My criticism would only be that the telemetry is not obvious enough for a casual downloader. I don't use VSCode but I'm sure a lot of people do without knowing about it.
The main selling point of VSCode is their extension ecosystem. Most extensions are open source / created by the community. In essence, Microsoft channels FOSS work by third parties in order to create a pipeline for upselling their products and services.
The writing is a problem. It's a call to action, it's a history lesson, it's an opinion piece. And, it's a mess. This is not The New Yorker: I want to know the gist of what you have to say in the first paragraph, and I want intro and outro to be good summaries. Bonus points for a structure that lets me survey the finer points easily.
Agreed. They use the word "problematic". Any time I see someone using that word, I close the tab instantly to avoid the brain damage that would result from me reading the rest of the piece. The word is meaningless and cowardly. Say what you think. Quit tap dancing around the conversation.
And by not at least naming one of the problems, the author is being cowardly and/or lazy. It's like describing a reviewed item as "good". It conveys no information other than "I like it". The word "problematic" does the same thing here. It signals that the author doesn't like something but doesn't tell us anything about why. In other words, it's fucking useless.
This article doesn't compel me to give up github. It seems like it basically talks a lot about how proprietary software is evil and then complains about Copilot.
Then it drops this:
> GitHub's business model has always been “proprietary vendor lock-in”.
How is this Github's business model? Unless you're using Github specific features like Github actions and workflows, it's fairly easy to switch to another Git based host.
Then the article provides "alternatives" that are all lacking important features.
> If you're ready to take on the challenge now and give up GitHub today, we note that CodeBerg and SourceHut are excellent options right now.
The article immediately talks about drawbacks with all of these alternatives and then mentions a guide on how to self host using git lab. Why would I go through all the trouble of swapping to a different version control host if I don't gain any value? In addition to not gaining value, I'll also lose features that are very nice to have.
This article doesn't convince me at all. Yes Copilot is questionable and we should pursue the ethics behind what it does, but if you want to convince people to give up Github you should at least be prepared to give an alternative that offers a great deal of feature parity.
> it's fairly easy to switch to another Git based host.
> if you want to convince people to give up Github you should at least be prepared to give an alternative that offers a great deal of feature parity.
These can't both be true. In fact Github is hard to switch away from, because of all the Github features. This is the lock-in. Github then monetizes this by charging for large files (https://docs.github.com/en/repositories/working-with-files/m...), private repos, etc. So the SFC argument is that you should switch away from Github now and get the alternatives to feature parity, to avoid Github getting a monopoly.
GitHub is a business and currently provides free storage and a pretty nice interface to it.
It’s easy to say “our rights are being stripped away” but the view that businesses should operate like non profits or government services with the common good in mind is ludicrous!
These are the immediate "products" GitHub provides, however, I would argue it provides a lot more.
GitHub provides a place for people to easily collaborate on FOSS software. It has helped millions of people getting into software development or into their first FOSS project by lowering the barrier of entry significantly.
Can you imagine how many people would start contributing to FOSS early in their career if they had to deal with mailing lists, patch sending, multiple git remotes, rebasing, etc. all at once just to start providing a small contribution? - Probably not as many.
I don't support everything GitHub does and I do see Copilot as problematic, but the article opens with "Those who forget history often inadvertently repeat it." - You know what has screwed us over a lot in recent history? Cancelling something/someone without thinking it through first. Oh the irony.
You are right about the benefits GitHub provides to the FOSS community but I don't think that was ever the stated goal of this company. Their goal is to make profit, not foster FOSS, so we shouldn't be surprised when they make decisions that benefit their bottom line.
It is objectively not ludicrous. It might go against the commonly taught idea that businesses should focus solely on generating profits, but it is not unreasonable to create a system where businesses have to keep the common good in mind.
There is a difference between 'how things are now' and 'how things could be'. Imagining and wanting a different status quo is not by itself ludicrous (especially since we all stand to benefit from such businesses), it's a first step towards change.
It's not just a "commonly taught idea that businesses should focus solely on generating profit", it's the fundamental principle upon which economies are built today almost anywhere in the world. Sure, there are other ways of organizing economic systems, but to suggest that we are simply or easily going to switch to one is unrealistic. Imagining and wanting a different status quo will not lead to a different status quo, especially if all we are doing is making demands on others to change their behavior and use their property in ways that we want. In other words, if we want a different status quo, we won't get it by bitching about GitHub but by building a competitor company that does things the way we want it.
> It's not just a "commonly taught idea that businesses should focus solely on generating profit", it's the fundamental principle upon which economies are built today almost anywhere in the world.
It's hardly the fundamental principle. The fundamental principle is that people need things to survive and it's more efficient if people specialize and trade than if everyone creates everything they need.
The pervasive idea that businesses should focus solely on generating profit is also directly responsible for lots of problems almost anywhere in the world, from driving out less vicious competitors to rent seeking to externalizing costs to everyone else, e.g. via pollution.
I think you're actually both right, in different ways.
Fairly self-evidently, the sane fundamental principle for a business is "make a good/provide a service, and if you do so well, you make a good profit".
Unfortunately, for the past few decades, businesses in the Western world (and particularly the US) have increasingly been operating based on a fundamental principle of "make as much money as you possibly can, and if you have to make a good/provide a service to do so, that's a necessary evil".
Realistically I'm not leaving GitHub anytime soon and I do agree that businesses need a way of making money. I'm generally fine with a free service that also restricts certain features to paying customers and I think that GitHub has worked well with that formula so far. But I don't like the double standard highlighted in this article about Copilot: they are training their AI on open source repos and using the result without taking into account possible license incompatibilities, arguing that this is comparable to a compiler's output, but at the same time they are not using their proprietary codebase to train it, in order to protect their own intellectual property.
I'm not saying that businesses have to provide stuff for free, I'm just saying that there are if not more legal at least more ethical ways of making money, because as it stands now it seems to me that Copilot is in a legal gray area.
Where does this end? You write a license that your GPL code can only be re-hosted on non-GitHub hosts? git still exists, if I'm unhappy with GitHub I can just add a new origin (sourcehut, gitea, gitlab, self-hosting, etc) and push there.
But I'm perfectly happy with GitHub and I'm fine if their ML thingy makes money off my code, I get free actions runners, a nice UI, pull-requests, etc, into the bargain, not bad.
Like knock yourself out working out if the "monkey selfie" case law applies to Copilot or not, what jurisdictions it covers, etc. But I don't care, sorry, I'm not interested.
> But I'm perfectly happy with GitHub and I'm fine if their ML thingy makes money off my code, I get free actions runners, a nice UI, pull-requests, etc, into the bargain, not bad.
As an individual you can certainly think so. But as a community we must weigh how much advantage we get from GitHub against how much advantage it gets from us.
Considering that GitHub will probably make millions with Copilot, it is fair to say that a big part of that success comes from the quality of the code we collectively put on it. Therefore, the money should be shared. And the first thing to do is to ask Microsoft: how much money do you make on us? And the only possible way to get an honest answer is to make sure the management of GitHub is done jointly by MSFT and the community.
I don't think this is going to happen anytime soon. So I'm seriously considering getting out.
Isn't this sort of the expectation of these free services? They provide a service to you for free, and in exchange they are able to collect data from you, store it in a database, and do things with it.
I fully expected GitHub to do something like this (minus Enterprise repos that pay big money), and it's why I stopped using them. I instead use gitbucket, which is a free and open source self hosted thing.
I also expect GitLab to do something similar. It is as they say in the crypto world.
Not your wallet, not your crypto.
Not your source control server? Not your code. Legally? Might be, but that doesn't matter. Do you have lawyers?
But I have written code which is hosted on GitHub which other people have uploaded and I have signed no agreement to let them do so. Does this mean all open source projects older than GitHub need to stop using it?
> If the code had a license that forbid reupload you could ask GitHub to remove it.
Licenses don't generally need to forbid re-upload to be incompatible with GitHub. Uploading to GitHub, according to someone quoted on page 1 of this discussion, grants GitHub the right to use that code to "improve their service" (whatever that means -- maybe Copilot?). So a license that does not explicitly grant them that right, or grant someone else the right to grant it to them, is enough to rule out uploading the code to GitHub.
So that should be more like: "If the code had a license that didn't specifically allow Microsoft to use it to 'improve the GitHub service' you could order GitHub to remove it."
Sure they're going to make more money from me (as a generic user) than value I will derive from them (on average over the population of users). Otherwise they go out of business and we use other available alternatives.
But you know what? Even after copilot my code is still there, for people to make money from (as both GH and random people already do), to learn from, to cut up and rehash, to reference, to write (much) better versions of, to generally advance humanity. I know this is the root philosophical conflict between free software and open source but I wanted to state my view since this is a call to action and we'll be seen as betraying some ideal for failing to comply.
It was the same with SourceForge. It worked great (for its time). You get stuff for free. It's convenient.
And then slowly they started changing policy, closing the source, adding ads to the platform, .. then one day you realized you've locked yourself into their ecosystem. It's a bunch of work to move away. It will be worse with Github since they bolted so many things on top of Git.
> Where does this end? You write a license that your GPL code can only be re-hosted on non-GitHub hosts?
Copilot’s contention is that it’s exempt from copyright restrictions under fair use doctrine, which means that your license that says they can’t use it is irrelevant and legally void.
> Where does this end? You write a license that your GPL code can only be re-hosted on non-GitHub hosts?
AFAICS you don't even have to: GitHub reserves the right to use uploaded code to "improve their services" (in some unspecified way). The GPL doesn't grant that right -- i.e. the right to grant Microsoft that right -- to any license holder, so anyone else but the copyright holder will infringe on their copyright by uploading it.
> But I don't care, sorry, I'm not interested.
If you've ever uploaded, or are planning to ever upload, anything containing code written by anyone other than yourself under a FOSS license to GitHub, you probably ought to be interested.
> * They’re a closed-source for-profit company? Great! That’s why their product is high quality.
That's a huge stretch for a few reasons. First, the reliability of GitHub.com is very poor and it has had tens of incidents in the past few months. Second, their hosted solution, GitHub Enterprise, is notoriously poor, difficult to maintain, always late with features. Third, their main competitor, GitLab, which is open core, was kicking their ass for years with features and quality until GitHub got the unlimited funds of Microsoft to be able to even come close. More to the point, there are tons of good quality open source software, and tons of really poor quality closed source. Openness of code matters little for quality (besides the fact that with open source at least you have the option to see why and fix).
Regarding Copilot, they're ignoring licenses, training models on everybody's code, and selling the result as a service. Sounds very sketchy.
OK let me take your projects and make money from them and not give you anything in return, not even credit.
> They sell software to ICE? Good. Why wouldn’t they? I’m not interested in anti-immigration-enforcement politics.
ICE policies aside, you should be interested in immigration politics because they're important to people and businesses. Apathy is a bad thing here.
> They’re a closed-source for-profit company? Great! That’s why their product is high quality.
hahahaha, there are plenty of open-source not-for-profit companies with high quality products and many many many more closed-source for-profit companies with terrible products. See also Microsoft Windows and Microsoft Office.
> OK let me take your projects and make money from them and not give you anything in return, not even credit.
This is one hundred percent okay with me and many other open source contributors. It's a no-strings-attached donation to mankind, and if someone else finds value in it, you don't complain, you cheer. Who needs attribution when people are actually _using_ something you made; you saved someone a good deal of time trying to write a solution themselves, perhaps.
And if that's your goal, then you release your code under one of the more permissive licenses (MIT, CC0, etc). As the developer and copyright holder, that's totally your right and privilege, and that's how you exercise it. When your license explicitly states that there are terms of use attached, then copyright law ensures that violating those terms invalidates the violator's right to use the code. If we're gonna say that copyright law doesn't apply to open source licenses then we have to go ahead and in all fairness state that copyright law no longer applies to proprietary software licenses either, because it's the same laws that protect both. Making something open source doesn't mean you're just giving people the right to do anything they please with the code unless you explicitly state that you're doing so by choosing a properly permissive license for your code.
> This is one hundred percent okay with me and many other open source contributors. It's a no-strings-attached donation to mankind, and if someone else finds value in it, you don't complain, you cheer.
That's fine if your project says that. But there are a great many projects which specifically say otherwise.
Right I'm just saying your argument only holds for licenses like GPL, not in general as you suggested.
edit:
Not implying you're wrong, just moreso that there's a large chunk of people who aren't going to be motivated to fight on your behalf. I understand there are some useful areas such as hardware drivers for GPL, but it's simply not an IP constraint that sounds at all fun to work on as a volunteer.
That's true. I use MIT myself (although I don't care if people cite me).
I guess this just all seems rather activist to me but people aren't seeing the big picture; our jobs as programmers are about to change in a very big way. It won't be long before more competition enters the space (e.g. Salesforce and Amazon) and ultimately it won't matter if this model saw some GPL/MIT code because the next one will work twice as well without having seen any of it.
People are seeing the big picture. That's all the more reason to make sure that, as new tools get developed, they treat it as a requirement to actually respect Open Source (licensing, credit, provenance, copyleft, patent non-aggression, and all the other reasons people use such licenses), rather than just abusing it as an input.
Perhaps I should have said idealistic. I mean I've seen several people make demands like they're somehow already in a courtroom with Microsoft. No mention made of the fact that this is likely covered under fair use. If it isn't, then basically every deep neural net trained on a web dataset is in violation. This means open works like from University at Heidelberg (Latent Diffusion, VQGAN) or EleutherAI (GPT-J) would be impossible.
Like, maybe you could all get together and learn how machine learning works and train your own clean model? That feels far more positive and in the spirit of progress to me. I guess from my point of view it's clear: no laws will save you from the next wave of large language models. The weights are trivially distributed, meaning once a model is trained it's just a fact of life we all have to deal with. So you're making demands when you have basically zero leverage, rather than admitting defeat and working within the new constraints of progress.
I just feel like it's an inherently philosophical position and people are acting like code theft hasn't been common practice for the history of all software.
You mean two of the most successful pieces of commercial software in the history of modern computing? Because as easy as it is to pretend they're terrible: to millions of users and orgs, they simply aren't.
Eh, first mover advantage counts for a lot. It doesn't necessarily mean the software is good.
Burning coal for generating energy was very "successful" and had great adoption. It's still a terrible method for the environment, and we can feel free to regard it as negative.
In this analogy, Microsoft products would be like generating energy by burning garbage, aka a dumpster fire.
You'll want to pick your analogies carefully, because burning garbage for energy when the alternative is a US style landfill is literally better in every way. But let's not strawman: first mover advantage is very real, but it's been literal decades of competition, and plenty of folks moved off of MS products. Hundreds of millions, however, have not. And still voluntarily install them. Because for them, they work the vast majority of the time, and don't cause any more gripes than Apple or Linux do for others.
That's the thing though, burning coal works fine as well and doesn't cause any more gripes than hydroelectric or solar. Without voluntary moves to other products, or regulatory changes, it would have taken likely much longer before we stopped doing it. It's not an exact analogy, but there are parallels – the linked article is suggesting voluntarily stepping away from Microsoft code hosting, and others in the thread have suggested regulatory controls.
> because burning garbage for energy when the alternative is a US style landfill is literally better in every way.
Now that you say it, I do remember reading that Sweden burns garbage for energy. I would have thought that the main problem would be arbitrary emissions from the plant, but from the plan of a typical one, those are trapped and/or filtered [1]. I still think that in the US this would be harder, since people are probably prone to throwing more things in the garbage than they should; and don't recycle as much as the Swedes do.
Indeed. A number of other places in Europe looked at the economics of landfills and went "yeah: no. Even with the cost of gas filters, literally burning garbage for energy is cheaper and more environmentally friendly than making a time bomb by burying it".
Being successful as a business has as much or more to do with market power than with code quality.
I am lucky that I can avoid their products. But millions of users and orgs can't because that's what others around them use. So they're stuck with it no matter how much worse it is than open-source alternatives.
In 2022, it really isn't "you're stuck". Even if your org uses MS products, there are interoperable open source suites these days. The problem is that they're shit. Yes, they work for small documents, but star/sun/open/libre/EtcOffice all fall over when you do something as simple as trying to sort a 30,000 row spreadsheet on several column criteria.
(and yes, that is trivially simple. If your software can't do that in a performant way, you don't understand what product you're trying to out-compete)
It's not that there's no alternative, it's that the alternatives are just not good enough when it comes to "I need this stupidly complicated thing, thanks to 20 years of spreadsheet formula history at my organization that I have zero power to even remotely change, done in seconds. Not minutes".
You can try to sway people purely philosophically, as the Software Freedom Conservancy is trying to do here, but I think that in today's world you usually need more. You need to show that:
- What you oppose negatively affects your target audience in real ways that matter: their career, their livelihood, or something similar
- Continuing to be a part of the old model will be damaging in the long term
- Also importantly, the thing you are asking people to switch to needs to be seamless. For instance, they mention SourceHut, and I sure hope SourceHut has all the core features and ease of use of GitHub, because if not, you are likely already going to lose most people in this conversation
Without factoring in these things, it's great to point out issues, and rightfully they should, but it's not going to mean much in terms of action
GitHub has a formidable monopoly on source code hosting, thanks to network effects alone. I don't think there's a realistic case you can make today along the lines of your first bullet point that could convince a career-minded developer to switch away.
Direct competition against GitHub is barely worth contemplating; the practical path I see for replacing GitHub must be more indirect:
1. Some nonprofit foundation spins up a GitHub alternative focused on transparency and strong data privacy protections. This alternative has only a small fraction of GitHub's most crucial features.
2. A major FOSS project---which values the principles of the new alternative over the practical benefits of GitHub---switches away from GitHub.
3. Satellite projects reexamine their use of GitHub and slowly start switching over as well. The new hosting service incrementally adds features in response to demands from the growing userbase.
Steps 1 and especially 2 will require motivation by philosophical arguments, even if I agree that the linked article's execution wasn't perfect.
Not so much that I am saying philosophical arguments are wrong, they need to be had and presented, in this case I even find myself inclined to agree with them in some respects.
Simply, I'm highlighting what I believe to be crucial in giving the philosophical argument some teeth in purpose and next steps.
If the friction cost is low to do the right thing, then doing the right thing becomes extremely palatable
yeah these same types of purely philosophical arguments have led me to move away from services like DropBox and Google Drive at my own expense. Lost too many hours trying to find the “we do things the right way” alternative—when it comes down to it, if it works well and keeps the friction low, mf’s can have their bag
I had never put anything in GitHub, but I have done and still do read stuff on GitHub.
My reasons though are not because of Copilot. It is because I do not use git for my own projects (I self-host Fossil and mirror on Chisel). But if someone else makes mirrors of my code on GitHub (or CodeBerg or SourceHut) then I do not have an objection to that (making more mirrors on different services may be better, anyways; unfortunately if you are using git then a header will be prepended to the file before computing the hash, which makes the integrity more difficult (although it is still possible, since the header is predictable (as far as I know))).
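To make the header point concrete, here is a small Python sketch (my own illustration, not taken from either tool's source) comparing a plain SHA-1 of a file's bytes with the SHA-1 git computes for the same bytes stored as a blob. Git hashes the bytes `blob <size>\0` followed by the content, so the two digests differ, but the header is predictable from the length alone. The function names are made up for the example.

```python
import hashlib


def plain_sha1(data: bytes) -> str:
    """SHA-1 of the raw bytes, with nothing prepended."""
    return hashlib.sha1(data).hexdigest()


def git_blob_sha1(data: bytes) -> str:
    """SHA-1 as git computes it for a blob: header + content."""
    # The header is the object type, a space, the decimal length, and a NUL.
    header = b"blob " + str(len(data)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    content = b"hello\n"
    print("plain:", plain_sha1(content))
    print("git:  ", git_blob_sha1(content))
```

So verifying a git mirror against a plain checksum is still possible; you just have to rebuild the header first.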
Seeing a few examples of output from Copilot (although I have not used it myself and do not intend to do so), they do not seem to be a very good quality. So, I think that it is not worth it, even regardless of licensing issues.
If you do move your project to another service, please use one which does not require JavaScript to be enabled in order to view the code (even if other functions do not work). Ideally it should also work without CSS. (For these reasons, GitLab is not acceptable.)
The impetus for the post is apparently "co-pilot" being a commercial service. I have no dog in this fight, just giving an overview of their main argument.
It seems like a legal grey area because for the most part what CoPilot is doing seems like it should be a transformative use [1] and you don't need a license for that. But apparently sometimes CoPilot will spit out other people's code verbatim, so I wouldn't use it. Not worth the risk.
The problem is that you can't tell whether you're looking at verbatim code or not. While you can tell when you're looking at "obviously verbatim" code, everything else could still be verbatim and license encumbered, so there is literally nothing you can use copilot for if your intent is to write new software.
But it gets worse: because you can't tell whether copilot is giving you illegal-to-use code, even just looking at its output can make you liable for future transgressions because seeing copilot code may expose you to license-encumbered code that you might now use as inspiration for new, non-copilot code. Even if you in good faith believe your new code is your own, the law does not agree with that assessment: if your code is similar to something license encumbered that you read in the past, even if you forget about it, you are now potentially in legal trouble.
Simply by using copilot at all, you're taking on a risk that is great enough to go "actually, I am not even going to try this thing".
If I read code published on GitHub, I do not infringe on any license.
If I'm later hired to implement some function, chances are that I'll produce code which is similar to what I've learned from GitHub.
Or if somebody asks me how to implement some function and I respond, "Hey, such-and-such project on GitHub is already doing it, just use it", I'm not infringing on any license.
Now, replace "I" in the above paragraph by "Copilot".
It seems that Copilot is not infringing on anything...
But as others highlighted, sometimes it is not just similar code, it is the exact same code. And if you as a human use code from another repo, you should respect that repo's license. This is the point: as it works now, Copilot isn't taking licenses into account.
Another critical reason is that GitHub is proprietary and has a pile of services with lock-in that (unlike git) aren't easy to move elsewhere. Some people depend on GitHub for their livelihood (through GitHub Sponsors). Some people's software integrates tightly with GitHub bots, or Actions, or issues, or project management. Everything other than the code itself is incredibly difficult to port over to another service.
When GitHub announced its impending acquisition by Microsoft, unlike the many others who just flooded GitLab, I took it as a sign to go ahead and host my personal git repo myself. Sure, it takes some more effort and it "hurts my pocket", yet hosting a git repository on a Raspberry Pi costs not much more than leaving a lightbulb on permanently, and a cron job that backs it up and saves it somewhere once a month is not too much maintenance. My code is as free and/or private as I want it to be now, and I actually think I gain much more control over what I get to do with it, instead of being held hostage to making my code open source in order to get the full feature set.
And no, I have not switched (yet) to a non-proprietary git system. I am using Bitbucket (yes, on a Raspberry Pi) since it has plenty of built-in integrations; sadly, they ended their self-hosted offer for small teams. It used to cost 10 bucks to get a license to host your server for 10 users... lord, most OSS is a lone effort or surely under 10 people; I am sure they would have continued that offer had they had more small customers. Anyhow, there are plenty of good alternatives such as Gitea that someone might use, and that would be the true intent of git as a decentralized platform... Nonetheless, while software is one of the best-paid professions nowadays, it is full of cheap (greedy) people.
Look, I agree that copilot feels slimy. However, honest question: if you're maintaining an open source project, in most cases I'd say your source is publicly available somewhere. What realistically prevents github from doing exactly what it did with code that isn't even hosted on it? What, with regards to copilot, is changed by moving one's code off of github to some other public site?
Rate limiting Microsoft's ability to exploit FOSS, and moving to platforms that won't/can't implement abusive practices that cause these kinds of mass exoduses in the first place.
I think if Microsoft was figuratively and literally rate limited from accessing such a huge swath of open source code, they might not have been the ones to build something like Copilot and we may have been better off for it. A different, more ethical team might have made something better. Maybe it would even be an open source project.
> That's my point-- how? Presumably you're just going to host your FOSS somewhere else that Microsoft can still get it.
The problem isn't that Microsoft can access your code. It's section D.4 of GitHub's terms of service [1]. If you are hosting your code on GitHub, you are essentially giving Microsoft the legal right to do certain things with your code. That section is broad enough to allow Microsoft to train Copilot on your code. It gives them the freedom to disregard the license. On the other hand, it would be a clear violation of the license if they read your code off another host and reproduced it verbatim. I don't know whether any legal challenge would stand up in court, but at least you are not giving away the right to bring one.
IANAL, but as long as they block copilot producing verbatim code (which I believe they do now, after some blowback), I can't reason my way past "so you're not okay with a computer learning from you, but you are okay with people learning from you?". Because I am. One of the reasons I open source stuff is because I've learned so much from OTHER open source projects, and want to pay it forward. Limiting my "pay it forward" attitude to humans feels hard to enforce.
Also, nothing prevents Microsoft from completely ignoring licenses and terms of use elsewhere. This product isn't open source: they can keep the list of projects they crawl to themselves, and I don't see any way to stop it.
The new platform itself could literally rate limit requests to download source code. For example: 100 repos per day.
Or, the platform could make it part of their terms of use that licenses must be respected even for machine learning. It could be enough of a barrier to dissuade them.
Would something like that stop Microsoft? Maybe, corporations tend to look for low hanging fruit for stuff like this.
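For what the "100 repos per day" idea could look like in practice, here is a minimal Python sketch of a per-client daily quota. Everything in it is an assumption made for illustration: the class name, the notion of a `client_id`, and the in-memory counter are hypothetical, and a real host would also need persistent storage and some way of identifying crawlers that rotate identities.

```python
from collections import defaultdict
from datetime import date


class DailyDownloadQuota:
    """Hypothetical per-client cap on repository downloads per calendar day."""

    def __init__(self, limit_per_day=100):
        self.limit_per_day = limit_per_day
        # Maps (client_id, day) -> number of downloads already allowed.
        self._counts = defaultdict(int)

    def allow(self, client_id, today):
        """Return True and count the download if the client is under quota."""
        key = (client_id, today)
        if self._counts[key] >= self.limit_per_day:
            return False
        self._counts[key] += 1
        return True


if __name__ == "__main__":
    quota = DailyDownloadQuota(limit_per_day=100)
    day = date(2022, 7, 1)
    allowed = sum(quota.allow("crawler", day) for _ in range(150))
    print(allowed)  # number of requests let through out of 150
```

The counter resets implicitly because each day gets its own key; old keys are never evicted here, which a real service would have to handle.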
This GitHub Copilot thing is pretty polarizing. According to one of the expert papers they solicited, it's unlikely MSFT/GitHub is breaking any copyright laws.
What they're doing does feel icky, and they could have mitigated a few concerns by a) making the inventory of the full training set public and b) at least attempting to attribute when there is a direct copy (which by their own admission happens about 0.1% of the time). These seem like very simple steps they could take, and they would take away the "shady behavior" argument.
More specifically it's an option, the VSCode Copilot extension asks you to configure it and takes you to a GitHub page where you can choose what happens if a suggestion is a direct copy.
Single 6mb executable with a version control system, web server, bug database, forums, import/export/sync with git, repo browser, much saner CLI than git, etc. Been using it for 2+ years for all my projects and love it.
1. Hashes are computed without adding any additional headers, so it will be the same as computing the hash normally.
2. The /raw capability is a good thing to have.
3. The deck format is not bad.
4. The command-line interface is less confusing than git.
5. It is written in C.
It isn't perfect, but it is more than good enough. (If I do have to change it, I would do "Generalized Fossil" (I already wrote the specification, although no implementation exists yet as far as I know), which will have the same five advantages listed above, and is compatible with the same /raw capability and deck format of existing Fossil repository, too (except old technote edits, but fortunately I do not have any).)
Fossil does not call it the "deck format", although libfossil and Generalized Fossil both do so. The /raw interface allows accessing a raw file (including decks) by its hash or branch name, over HTTP(S).
See [0] for the Fossil deck format. (Generalized Fossil uses the same format, but any combination of cards is allowed, as long as there is exactly one Z card, and not more than one W card; they must still be in the correct order with no duplicates. There are many other details too (e.g. "subrepositories", which can allow you to optionally make decks unparseable in some subrepositories), most of which will not be mentioned in this comment, but I will say that it is mostly a superset of the ordinary Fossil format, except that ordinary Fossil allows cards to be in the wrong order in some circumstances (specifically, technote edits in some versions) that Generalized Fossil does not allow.)
See [1] for an example of the /raw interface (in this case, a mirror of one of my own projects; however, this will work on any publicly accessible Fossil repository). The name "trunk" at the end of the URL is the branch name; you can compute the SHA-1 hash of the returned file and substitute that in place of the word "trunk", and you will get a permanent link to that version. The lines starting with F are the files in that version; each line has the file name, and then the hash, and sometimes another field specifying file mode ("x" means executable). Substituting the hash of the file in the end of the URL will access the contents of that file. The line starting with P has the hash of the previous version; you can put that in the URL to access the previous version.
(In at least one case, I have used this /raw capability to download a single version of a Fossil repository. It is a simpler interface than using /xfer, if you do not have Fossil installed on your computer.)
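As a rough sketch of how the F and P lines described above could be read by a script, here is a small Python parser. It only handles what the comment describes (file name, hash, optional "x" mode on F lines; parent hashes on the P line) and ignores every other card. The function names, the sample text in the usage part, and the `/raw/<name-or-hash>` URL shape are assumptions for illustration; check the Fossil file format documentation for details such as how special characters in file names are encoded.

```python
def parse_manifest(text):
    """Extract file entries and parent hashes from manifest text."""
    files = {}
    parents = []
    for line in text.splitlines():
        fields = line.split(" ")
        if fields[0] == "F" and len(fields) >= 3:
            # F <name> <hash> [<mode>]; mode "x" marks an executable file.
            mode = fields[3] if len(fields) > 3 else ""
            files[fields[1]] = {"hash": fields[2], "executable": mode == "x"}
        elif fields[0] == "P":
            # P <hash> ...; the hash(es) of the previous version(s).
            parents = fields[1:]
    return {"files": files, "parents": parents}


def raw_url(base_url, name_or_hash):
    """Build a /raw URL ending in a branch name or hash (assumed shape)."""
    return base_url.rstrip("/") + "/raw/" + name_or_hash


if __name__ == "__main__":
    # Made-up manifest text with placeholder hashes, not from a real repository.
    sample = "F README.md aaaa1111\nF build.sh bbbb2222 x\nP cccc3333\n"
    info = parse_manifest(sample)
    for name, entry in info["files"].items():
        print(name, raw_url("https://example.org/repo", entry["hash"]))
```

Walking the history is then a loop: fetch a version, parse it, and follow the first hash on the P line until there is none.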
They wrote the whole thing in C?! Wow, I really admire people who can turn C into useful, complex projects... I was trying to write C the other day and holy shit, it's incredibly hard to write anything that involves more than a couple of files... it just has almost no abstractions, and one small slip and you've got a segfault that terminates your program without a trace :D I gave up very quickly and have been learning Zig now... just as low level, but much saner to write stuff in.
Anyway, why would I use Fossil rather than Git exactly? Are there code hosts that support it? It does look interesting, but I don't see the motivation to do it when so many hosts support git (people who think GitHub is the "only" host need to look around: there are literally dozens of options), and I'm quite OK using it, especially with IntelliJ and magit (emacs) support.
I'm well aware... but still, when I see new software written in C when there are so many languages available (which was not the case when Linus started writing Linux) it's amazing to me, both because they can actually do it (I definitely don't have the discipline to pull it off, spoiled too much by high level languages) and because they didn't just choose to use something easier like Go or even Java.
C isn't that hard; you certainly have to approach things in a pretty different way, and it does have a slower curve to productivity, but once you get past the hurdle of having less abstraction (i.e. mostly by just not making pointless abstractions, and having general utilities for the cases you do need some) and learn "gdb -ex r ./program <enter> [wait for crash] bt <enter>" to get a stacktrace, it's, give or take, usable.
> For its part, Git was designed specifically to make software development distributed without a centralized site.
Just yesterday I had to explain the basic premise/history of Git to a young intern. I had asked him if he was using Git to manage his little pet project the company gave him to play with. “No”, he replied, he didn’t know what the company’s policy was on posting code in public on GitHub. As I explained to him that “git init” was all he needed, with no GitHub and not even a repository on our local GitLab necessary, his eyes grew wide: “But how does that work??”
I’ve had to explain this same thing to multiple novice devs of various ages. It baffles me. I consider it one of the greatest ironies of software development today.
It’s like explaining to people that they could just talk to each other using a thousand means, instead of having to communicate by netcasting at each other through some shared social media platform.
I don't think it's fair to frame this as specific to GitHub, or even as a thing to wring hands over.
New devs - especially those coming from bootcamps (I say this without judgement) - mostly start with practical skills. Industry-standard ways to just get things done. That's how you get a job, that's how you get off the ground. This goes beyond source-control; languages/frameworks, tooling, etc. You enter the territory - with your finite bandwidth for learning - where it's most immediately useful. And then over the years you move out from there, incorporating more and more nuance and detail and auxiliary knowledge.
There's no need for moral panic. "Where it's most useful to start" has shifted, sure. But that's natural; I don't think it's a new phenomenon or in a fundamentally worse place than before. GitHub is a higher-level tool that makes you dramatically more productive than raw git on its own. The details will be filled in as they work their first job.
There is absolutely a need for moral panic if, like me, you believe that code and industry quality of product is more important than onboarding more people into the profession.
I can't think of a worse idea than subconsciously letting the idea "By default, the way we are supposed to store and write code is by putting it in the hands of a deeply centralized third party that will exploit you and owes you nothing" just sort of be the default deal.
> you believe that code and industry quality of product is more important than onboarding more people into the profession
Perhaps a bit of a stretch?
Does the fact that new devs don't know Git fully necessarily mean the quality of code will decrease? I mean, compared with the new devs knowing Git fully...
I mean something broader than that here; I mean to say that "a generation of new devs who grow up in a world that sees git and Github as equivalent is perhaps likely to be something like a not-useful generation of mindless code drones doing little more than working for the proverbial 'man'" -- which, especially for CODING (in which hours don't mean accomplishment, unlike, say, dentistry) simply isn't good for the world.
There's some conflating of correlation and causation here, of course.
There are lots of worse ideas. Things get abstracted from us over time. 99% of the code in active development today probably lives in a source control repo that's in the cloud.
At one point, decades ago, I looked with amusement on devs that couldn't do C/C++. But the reality was that it wasn't really needed anymore for most tasks.
This is different from what you're saying though. This isn't about "what basic knowledge is needed generally." This is about us frog-boiling ourselves into a world where we end up accepting the mentality that code is owned and controlled by the big company/entity in the center. For now it's "convenience," but if it continues it absolutely will turn into a center of power to be policed. "Dangerous code" will be censored, etc.
Like how we (at least people my age) are by default expected to have Instagram, Snapchat, WhatsApp, YouTube, Discord, Reddit, Spotify and Twitter accounts. When I left some of these, my friends actually questioned me about why I'd do that, and some people even told me it was weird and that I'll end up giving up material possessions and becoming a "sanyasi". Fascinating that becoming a sanyasi is considered bad; it actually sounds rather nice to me.
The ins and outs of version control solve a problem that has not yet existed for an intern. It’s not that big a deal to train them on the job; industry practices probably don’t map well to academic settings anyhow.
Not to mention that git is plagued with, as a general rule, utterly confusing defaults for a newbie (and for me, when setting up a machine without my config files!)
Depends on the uni. Mine focused on fundamentals, almost to a fault; we learned algorithmic complexity and relational normal-forms but never touched eg. JavaScript, Python, or - ironically - git. Practical skills were very much not the entrypoint; I had to learn most of that myself.
Yeah I had to learn Assembly in college. We also had to physically print out our C++ code and turn it in to the professor. Did it make me a better programmer? Maybe but to this day I still haven't had to use it. Most of my practical skills I picked up on the job.
P.S. I've seen the professor grading the printed out programs and he'd do it by flipping to the last page which was supposed to have the result output, and then fold the corner of the papers so it looked like he read the whole thing and then put a grade on the top. It was pretty funny.
If they also made you do all the calculations the program would, by hand, and record the output, then penalized you if you made any errors on that... it'd be math class.
Same here, only if you wrote in pencil, you were not allowed to appeal grading. So you had to write C from memory _in pen only_ while learning the language. There were students who pointed out the grading was wrong on their code, the TA and professor agreed, and the mark was not changed. Had to take it to the dean, don't think I learned if it was ever fixed. It wasn't even a CS degree...
Did we go to the same school? Halfway through the semester he gave up on even that pretense and started throwing them out in front of us and giving everyone a C-
Never saw him do that but given his personality I wouldn't put it past him lol. He also cemented in me that there would be virtually no collaboration at a company and that we'd be given problems to solve/features to build and if we had trouble we'd get the can. I think most of his work experience might have been in a pretty hostile work environment. Or at least a good while before pair programming became the hot ticket item.
Not sure I'd agree, every CS student at my school had to take a class where one of the projects was implementing a mini version of git from scratch. And even before that class there was at least one lecture dedicated to git in the intro class.
In my experience, bootcamps and uni produce polar opposites. You either end up with devs who can do several practical things, but have no clue about fundamentals, or devs who know all the academic terms but have no idea how to apply them practically.
the only significant difference I've seen in brand-new devs from bootcamps vs CS programs is the devs from bootcamps are somewhat aware that they have a lot of learning to do
2 years ago, while interviewing internally, I asked the team what VC they use. It was Team Foundation's proprietary VC (and not a DVCS). I mentioned to them that even MS recommends Git for Team Foundation's usage.
"Well of course, MS invented Git!"
Wrong on so many levels:
1. Conflating Github with Git.
2. They bought it, not invented/founded it.
3. Most importantly, MS's recommendation to use Git for TF existed long before they bought Github.
Needless to say, I didn't join that team. Unfortunately, misconceptions like these, and a refusal to use Git[1] due to its complexity are quite common at the company - and it is one of the larger SW companies[2] in the country.
[1] I'm OK with any DVCS. I myself prefer Mercurial to Git. But most of the company prefers SVN or TF's version control. In 2015, many teams had to be dragged kicking and screaming by IT from CVS to SVN. In 2015, when Git already dominated the world.
[2] By number of employees with "SW <something>" in their title. Not by revenue, etc.
I kind of like this friction. If a software developer finds git too complex, that’s an important signal that I should minimize my exposure to them. I’d never work in a team that dumb, and I wouldn’t want any dependencies on their code.
Of course there are other reasons not to use git, but complexity is not one.
My take is completely opposite from yours. I think it's great when someone "gets" that git is an extremely powerful and capable tool with a horribly unintuitive user interface. Whether you call that "complexity" or just "terrible UX", to me that's a sign that someone has good instincts regarding the risks associated with this kind of "complexity". Whether I agree that a decision to shun git for that reason is justified depends on the specific situation.
> ... that I should minimize my exposure to them. I’d never work in a team that dumb ...
That sort of arrogance is an important signal to me that I should limit my exposure to the people displaying it. I personally see this sort of attitude as a sign that there may be a dangerous lack of empathy on their side, and I've seen that go south too often.
^ this. The comment's arrogance and narrow-mindedness is preposterous, and would be a strong signal to me to stay away.
I find Git's CLI and workflow model complex because I had the "misfortune" of using other DVCS products (Mercurial, Bitkeeper, Bazaar NG, Darcs, etc) before it. All did a far superior job of presenting roughly the same conceptual model in their command-line tools. Git's is a step backwards.
Git didn't invent distributed version control, it didn't perfect it, it wasn't particularly superior to the others [it does have some benefits in terms of performance tho, yes], it was merely at the right place (the Linux kernel when Linus got sick of Bitkeeper's business model) at the right time (when people finally got sick of CVS and SVN garbage.)
And the rise of GitHub was certainly part of the rise of Git's prominence.
Slightly different circumstances and it could have been any of the other open source distributed revision control systems instead.
> That sort of arrogance is an important signal to me that I should limit my exposure to the people displaying it.
Different strokes for different folks and not everyone is capable of working together. Comically, people like you would find your comment also arrogant and stay away from both of us.
For what it’s worth, I work on empathy quite a bit and my filter is based on the idea that it saves both me and the other party pain. I don’t want to work in a team where developers are tolerated giving up on super basic technologies like git (what else isn’t tolerated “oh, tcp/ip is just too complex for developers, I give up”) and they probably don’t want me to work with them.
I think one of the best things about technology is that people have great capability to solve problems. Giving up is a bad characteristic. Asking for help is important. Having teams of various skill sets is important. But giving up on basic things instead of getting help and figuring it out is bad as a permanent state for a team.
I say this as a senior dev with 8+ years of experience working with git: git is nightmarishly complex. Highly stateful, insanely large CLI surface, huge amounts of terminology/concepts relative to the complexity of what you're actually trying to do with it. I've learned to navigate its waters over the years, fully appreciate the difference between git and GitHub, etc etc, and I don't blame anybody for being scared away by its complexity.
Git is super complex, yet it can be used successfully with just a few commands.
I blame developers and even most users for being too scared to use it. I think it’s perfectly normal to be scared of it while using it.
I think all users of technology should have some basic competence. Like every human should be able to be a basic Unix user, every human is capable of using git. And every developer should be capable.
As a whole, you are absolutely correct. But 99% of your daily interactions with Git come down to a handful of commands. Hell, print out a cheat sheet and tape it to your desk for the first few months of using it and you will be more than fine.
Until you make the slightest mistake and your work area gets into a state not covered by the cheat sheet. But of course, the standard recommendation seems to be just to remove the work area and check it out again :p
I’ve used git for 15 years or so. I use the same 5 commands. I haven’t run into a situation like you describe where I lose everything.
Worst case scenario is that I lose a check in, but even that is rare as I can just copy over, start with a fresh clone, and reapply. I don’t think there’s a significant danger of messing up a project. And again, this is just basic developer capabilities and developers should be familiar with a vcs enough.
Well, that proves the point that just having a cheat sheet isn't a complete solution. And unfortunately, I am the co-worker more experienced with Git :p. Well not quite, I have a co-worker who is roughly on my level and we keep helping each other and of course trying to google ourselves out of every corner we got ourselves into.
The practical approach was to write a simple GUI which supports the operations "Pull" "Stage" "Commit" "Push", which makes life quite easy. And otherwise, try to keep out of anything more complex until really needed. Works quite well, but I cannot say I am entirely happy. There are some benefits though, due to our company using a GitHub Enterprise installation for all version control. Ironically, our group recommended that some years ago, but because GitHub comes with a lot of neat features, less so because of our love of git :)
And yes, over time, I am picking up more of it, but I try to do that at a slow pace, because I consider VC a tool which is important for work, but which shouldn't take time away from work. And Git definitely takes a lot of time to learn; there are far too many mechanics exposed to the end user.
Well, be fair, grandparent argued that a cheat sheet would help "for the first few months", not that it is a "complete solution", so I don't see how my suggestion "proves" anything.
Also, without knowing which issues you're having with git it's impossible to know if it's lack of baseline knowledge or if you're running into real complicated problems.
If you have a GUI with "Pull" "Stage" "Commit" "Push" buttons, I strongly suspect you're not in the complicated end.
Considering the popularity of git itself and the fact it is successfully taught at bootcamps and university at a satisfactory level, I would disagree that being a “deep tool” is part of the reason it’s not 100% ubiquitous. The easy parts are still very accessible to newcomers and the existence of deep part doesn’t negate that.
And I also take issue with calling its “non-100% ubiquity” a problem. There is plenty of space for other tools. If anything, there’s almost a “git monopoly”.
Sure it could improve in parts, but the general discussion here doesn’t seem to be focused on improvements at all…
The more relevant point is: do you refuse to work on some code or use a VCS on your code because of this complexity?
That was the GP's point. He acknowledged it's complex. But it's a very clear and very large amount of value left on the table, so people who get scared away by it will probably practice other kinds of harmful behavior.
(Of course, the option of just using a simpler VCS doesn't tell anything bad about people. Why did we standardize on git again?)
It might be hugely complex, but at the same time, I'd posit that 90 percent of devs use only 1 percent of its functionality, 95 percent of the time they use it, and get work done.
To be fair, most people who find it complex do so because Git has a very crappy UI. You don't need to understand the internals to use Mercurial, and I would imagine that's true for most other DVCS's. In that sense, I sympathize with them.
I'm pretty sure half of the teams that find Git too complex wouldn't find Mercurial to be complex. But they haven't heard of Mercurial.
I'm the guy people at $work call to do git submodule reorganization, troubleshoot lfs or recover disasters. I'm certainly comfortable with git. It gets the job done. The same can be said of svn.
However my own projects always start with hg init. In addition to the UI being less insane it also feels consistent and nice to use.
At one point I was on a team using Darcs, which was another level of beautiful but clunky in many ways. It is why I'm rooting for the Pijul team. [1]
Same here. I use Git at work, but anything I control I use hg. The only factor making me more likely to use Git is Magit, and no good equivalent for hg in Emacs.
You could miss out on some good junior developers if you pick one thing they are ignorant about and reject them because of it. There are plenty of bright junior developers who don't know everything, have opinions on things that they think they know, but will work hard and pick things up once they get some time at it. Of course, if they outright refuse to use something then that's another story.
Not knowing git doesn’t have to be a permanent state. I am grateful for being taught git and tons of other stuff. I’ve taught many people many things and still have a lot of debt.
To clarify, I’m not criticizing not knowing git. Lots of people don’t know git. I’m criticizing the decision not to learn git and being unwilling to learn it due to its “complexity.”
I would say "friction" is the right word and "dumb" is not the right word here. Centuries ago (in Internet time) there was a culture among server devs that anyone who could not, or would not, learn and correctly execute 100 things in the shell that today are obscure and obsolete was "too dumb" and "not professional". Evolution and intelligence have moved along since then, and others crusted over and embedded the minimal practices needed.
Does a typical enterprise need a distributed source control system? It makes a lot of sense for open source.
Doesn't the uptake in Github which centralizes this distributed system kind of invalidate its main tenet?
I haven't been a hundred percent sure the overhead was ever worth it at most other types of paid gigs over the years. Adding complexity without value is a mistake imo. Maybe it's my own fault I haven't seen the value...
The (side?) benefit of a DVCS is that you commit locally. So I can do all my fun experiments on my team's codebase on my laptop - it doesn't pollute the number of branches on the server, etc.
Indeed, even when forced to use SVN, I would simply check out the SVN trunk, create a Mercurial repository in that "working" directory, and then clone from there whenever I did any development (one clone per feature). I would then push back to the main Mercurial repository, and push that to the SVN server.
Other than that - true. No real DVCS benefit compared to SVN. I would imagine most of the nicer Git features that people use have, or could have, analogs in SVN.
> Indeed, even when forced to use SVN, I would simply check out the SVN trunk, create a Mercurial repository in that "working" directory, and then clone from there whenever I did any development (one clone per feature). I would then push back to the main Mercurial repository, and push that to the SVN server.
Once did something similar with Git, the one time I was at a place that used SVN. I didn't trust any of the git-to-SVN tools so I just did all my work in a local git repo, then copied my working directory to an SVN-controlled directory and committed maybe once or twice a day, when I had something worth preserving.
The ~week before I switched to this workflow was nerve-wracking. Not being able to make all the branches I want for any purpose at all without it showing up for anyone else, or to make shitty commit messages for my own local junk-commits before it was ready for consumption by anyone else, was awful. Having to worry that every little vcs operation might mess up someone else's stuff was the worst.
Luckily I was working on an isolated part of a larger system on my own, so this worked OK. No incoming code to worry about, so SVN was write-only from my perspective.
> Does a typical enterprise need a distributed source control system?
Yes, unless you are working on a tiny number of source files. The value is that you can work on entire copies of a code base at a time instead of single files. Git does a much better job than making lots of local copies of source code, passing around big diff files and is a lot less hassle than older centralized source control systems.
> Doesn't the uptake in Github which centralizes this distributed system kind of invalidate its main tenet?
No. Not at all. Github is actually a peer with a few integrated extras for managing tickets and requesting your code be merged into github's repo branches. Most of the centralization is around access control and automation (i.e. continuous delivery, unit tests, etc...).
> Adding complexity without value is a mistake imo.
The main value of GitHub is reducing some operational complexity. For a small 1-4 person team, it may be of little value, but for larger teams, access control, issues, and automation can have a lot of value. For really small teams, fossil is actually quite nice. That said, there are a lot of great alternatives to GitHub that give you similar features.
> Maybe it's my own fault I haven't seen the value...
Probably not. I did the first 20 years of my career without source control and was able to build some pretty big applications. I do think dvcs was a big improvement, and I'm glad it exists now.
I find distributed version control super useful for the enterprise. I have way more private repos than public.
It’s useful because it makes it easy to have multiple copies of a repo and merges and branches are trivial. Just for my own project I might have clones on multiple machines. When I mess up and make a change, it’s not a big problem and merging is easy.
For team collaboration, I find it better than centralized because it makes it easier to work on multiple branches. It also encourages merge requests from people outside the team.
Git is a very malleable tool, you can use branches for many purposes, do all kinds of complex merges (e.g. subtree, etc). I think it can be made to fit most workflows.
The main issue is repository size, which is hard to get from bad workflows alone, with the exception of insistence on tracking large binary files in Git (rather than LFS, DVC, etc).
I used to teach a coding bootcamp. I had over 100 students over about 2 years, and ran into this constantly despite my best effort to explain that git != github from day one. We even did an exercise where we just used git locally first and then later (on a different day) showed how you can push to github. It didn't seem to matter. People just decided that git = github and couldn't let go of that.
You should explain with a git host other than GitHub first and let them get used to something else. Then, when they meet GitHub, they will know that they used something different before (hopefully; otherwise they are hopeless cases).
Most didn't. That's my point exactly. Even though I introduced them to git, and then later github, AND I was aware of the problem and tried to ensure they wouldn't conflate the two. It still happened.
That scenario plays itself out time and time again.
1. Base technology arrives and is adopted en masse
2. Some entity wraps it in an easier-to-use interface
3. After some time, that interface becomes the de facto standard
4. Developers who started developing after #2 and especially #3 don't understand how the underlying technology works or even that the wrapper is just a wrapper
I'm reminded of how many Juniors I trained that didn't know jQuery was itself just a Javascript library and not a language in and of itself. None of them knew much of any of the underlying Javascript it was wrapping. I'm seeing the same scenario play out with React and Vue right now as well.
I find this so weird, I always feel like I have to have at least a bird's-eye view of how bits and bytes are flowing through the system. Not that there aren't a lot of boxes I just label as "magic", but at least I need to know the box exists.
I'm having a difficult time picturing how these hypothetical juniors will fare without that underlying curiosity driving them forward. Likely they'll wash out of the industry or see enough success to be promoted to a management position.
That entity is usually Microsoft. I've met web developers who relied on Microsoft drag-and-drop tools to build websites, they were unable to tell me what a cookie or HTTP request was. Microsoft could be injecting telemetry into their websites and they wouldn't know.
Since then I've been obsessed with knowing how things work under the hood and doing as much as possible by myself until I understand what's being abstracted away from me.
Is the takeaway here, then, to beat third parties to the punch when creating a new technology and include a first-party easy-to-use interface right out of the gate?
But no, you can't demand it as some kind of mandatory component for a project to "count as a real FOSS project".
And for git, especially... It's not as if an easy-to-use interface had ever been a priority there. ;-) What one could ask for from them, though, IMO, would be a website of their own where they at least explain what it is. Because if such a thing exists, I must have missed it. I mean, git-scm.whatever isn't that, is it? That feels more like... Idunno, it's actually Atlassian in a thin disguise, isn't it?
I actually (almost) got in trouble once for this because people thought I was posting code publicly on Github when I was merely creating a local repository in my home directory. I had to explain what Git was and how it was not Github.
We had an intern push code that shall never be released publicly to a public repository after we asked him to "push it to our gitlab server". He just assumed there was only one gitlab server and it was the public one, I guess.
That was a bit shocking to me. I started with local VCS only all the way up through school and work.
You should assume that developers, especially interns, know nothing about any technology until they have had a thorough introduction. And you should anticipate any confusion they may have in the process.
Once, I had to introduce most of the employees of a company to git and git hosting. They were trained from scratch to use git reasonably on the first day. Use of code hosts was taught only on the second day. They were taught hands-on on our internal host and on a public server. None of them had any confusion over the differences between git and the different git hosts. While they did make some mistakes afterwards, none of those were based on total misconceptions.
How many of us have parents (or, OK, for all you youngsters: grandparents) that assume not only that the Web is the Internet, but that Facebook is the Web?
Going through security certification procedure, I was asked to list all the third party SAAS things we use.
I didn't include github, gitlab, or anything else, because we don't use it. The auditor was going off on a tirade about how lack of version control is not okay at all, so convinced they were that 'no github or gitlab' must therefore mean 'no version control'.
The mind boggles. He barely believed me when I showed how git just syncs with other git repos and that's really the start and end of it.
This has actually gotten me into thinking about a few things. What a web site 'backed' by your git repo seems to get you is:
* Some insights to those who don't have a full git dump. Mostly irrelevant.
* CI stuff and hook processing, but this does not need to be done by the system that hosts git, or even a dedicated system in the first place.
* An issue tracker that nicely links together and that auto-updates when you commit with messages like 'fixes #1234'.
* Code signoff/review coordination.
And all of that should be possible __with git__, no?
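To make the 'fixes #1234' bullet concrete: the auto-update mostly comes down to scanning commit messages for a closing keyword followed by an issue number. A rough sketch in Python, nothing authoritative. The keyword set (fix/close/resolve variants) and the function name `issues_closed_by` are my own assumptions for illustration, not something git itself defines:

```python
import re

# Assumed keyword set, modeled on the 'fixes #1234' convention; not defined by git.
CLOSING_REF = re.compile(
    r"\b(?:fix(?:e[sd])?|close[sd]?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def issues_closed_by(commit_message: str) -> list[int]:
    """Return the issue numbers a commit message claims to close, in order, without duplicates."""
    seen: list[int] = []
    for match in CLOSING_REF.finditer(commit_message):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen
```

Whatever renders the tracker (a web frontend, a script, a hook) could run something like this over each new commit message and mark the matching tickets closed.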
If you have a policy that all code must be signed off otherwise it isn't allowed to be in the commit tree of your `main`, `deploy` or whatever you prefer to call it branch, then why not just say that a reviewer makes a commit that has no changes (git allows this with the right switches), _JUST_ a commit message that includes 'I vouch for this', signed by the reviewer? And that _IS_ the review?
What if issue tickets are text files that show up in git, and to close a ticket you make a commit that deletes it? Or even: Not text files at all, but branches where the commit messages form the conversation about the issue, and the changes in the commits are what you're doing to address it (write a test case that reproduces the issue, then fix it, for example), and you close a ticket by removing the branch from that git repo that everybody uses as origin?
Then all you really need is some lightweight read only web frontend so that the non-technically-savvy folks can observe progress on tickets in a nice web thingie perhaps, if that. But it's just a stateless web frontend that reads git commit trees and turns them into pretty HTML, really.
Commit hooks to ensure policies such as 'at least 2 sign-off reviews needed before the CI server is supposed to deploy it to production'.
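The policy check itself is tiny. Here is a sketch of just the counting part, assuming reviewers record approval as `Reviewed-by: Name <email>` lines in the commit message. That trailer convention, the function names and the default threshold of 2 are all assumptions for the example; wiring it into an actual hook is left out:

```python
def reviewers(commit_message: str) -> set[str]:
    """Collect distinct reviewer identities from 'Reviewed-by:' lines (assumed convention)."""
    found = set()
    for line in commit_message.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "reviewed-by" and value.strip():
            found.add(value.strip().lower())
    return found


def has_enough_reviews(commit_message: str, author: str, required: int = 2) -> bool:
    """True if at least `required` distinct people other than the author signed off."""
    independent = reviewers(commit_message) - {author.strip().lower()}
    return len(independent) >= required
```

A hook script would call this on the incoming commit and exit non-zero to reject it when the check fails. Note that a plain text trailer proves nothing on its own; it only becomes meaningful combined with signed commits.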
> If you have a policy that all code must be signed off otherwise it isn't allowed to be in the commit tree of your `main`, `deploy` or whatever you prefer to call it branch, then why not just say that a reviewer makes a commit that has no changes (git allows this with the right switches), _JUST_ a commit message that includes 'I vouch for this', signed by the reviewer? And that _IS_ the review?
Git has two levels of built-in/native commit signing support. There's "Signed-off-by", which adds a line to the bottom of a commit message. Some projects use that for CLA verification. There's also GPG signing, which signs a commit hash.
If you want to use that as a level of "merge request"/"pull request" reviews there's a natural commit to sign that says you reviewed an entire branch: the merge commit itself. You can make a policy of --no-ff merges in your main and important branches. You can make a policy that they are signed (using one or both of the sign off types). You can make a policy that they are signed by someone who wasn't the author of most of the branch's commits.
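That last policy (the merge is signed by someone who didn't write most of the branch) is easy to check once you have the author of each commit on the branch, e.g. from something like `git log --format=%ae main..feature` (branch names made up here). A sketch, reading "most" as "more than half"; the function names are hypothetical:

```python
from collections import Counter
from typing import Optional


def majority_author(commit_authors: list[str]) -> Optional[str]:
    """Return the author with more than half of the branch's commits, if there is one."""
    if not commit_authors:
        return None
    author, count = Counter(commit_authors).most_common(1)[0]
    return author if count * 2 > len(commit_authors) else None


def merge_signer_is_independent(commit_authors: list[str], signer: str) -> bool:
    """True if the person signing the merge commit did not author most of the branch."""
    return majority_author(commit_authors) != signer
```

With authors `["ann", "ann", "bob"]`, bob may sign the merge and ann may not. When nobody has a majority, this sketch lets anyone sign, which you may or may not want.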
> What if issue tickets are text files that show up in git, and to close a ticket you make a commit that deletes it? Or even: Not text files at all, but branches where the commit messages form the conversation about the issue, and the changes in the commits are what you're doing to address it (write a test case that reproduces the issue, then fix it, for example), and you close a ticket by removing the branch from that git repo that everybody uses as origin?
There's multiple cool approaches to this that people have tried. Search for "git distributed issue tracker" and you should find some of them. Some have okay web views. There's multiple options for storing the issues. Some use YAML files inside of the branch. The neat thing about files in the branch is that you can find things about where fixes happened using basic branch diffs. Some use git "Notes" which are indeed like git commits as a first class top-level object in git's object tree. Those do have the benefit that they form their own branches outside of your code branches.
It's neat to explore what people have already tried in that area.
What is so baffling about explaining git to a novice? I'm more baffled that you are baffled about this.
Git wasn't even invented yet when I was in school and I did not learn about SVN until my first programming job in 2001 where a senior explained it to me.
I used to be confused by this. It's because "git" is in the name and GitHub is the first thing you encounter, so I was like, "oh, is git some command line extension of GitHub?" rather than the reverse. :D
Similarly, as a youngling, I often confused GNU and Gnome, in particular because several popular (often GTK-based) applications from either started with `g`, eg GIMP, gedit, gparted.
git isn't easy to understand. I think the real irony is I spent the first decade of my career teaching senior engineers how to use distributed version control. So many older guys decided to just use the SVN shim instead of learning something new and useful.
The finer parts of git and other day-to-day tools seem to almost always be picked up on the job. I've seen this confusion of git/github before, but the one I always notice is when I'm interviewing someone and they say they have node on their resume but don't know that node isn't just a webserver, they really only used it as an express server and many didn't even know that it could touch the filesystem.
It's usually not a problem since they tend to already know javascript and can get things working by referencing the node api docs, but it's still really funny to me every time it happens. There's lots of stuff people don't know until they know.
IMO the reason to suspect that git is independent from the SaaS products, is that you can create and commit (and merge and most other things) to a local repo without ever pushing it to GitHub or any other site.
It's baffling because practically all learning resources on git emphasize its decentralized nature. Even the online free book explains that several times. They never left me with any confusion regarding the difference between git and github, even though my first version control was the centralized subversion. I don't understand how any developer can learn git without this idea being drilled in constantly.
I have had to have the same conversation with co-workers who have been developing for many years, but using tools like SVN, or similar Microsoft code repos that require a centralized repo.
Are you saying they use GitHub (via the web editor?) but have never heard of Git? or don't know how Git works? or just don't know the Git CLI? or something else?
I have seen the same phenomenon where I work. However, I disagree with the implication that this is limited to young people. I know of a few people that are 50 and 60 years old, that have been using source control for 30+ years, that didn't have any idea that git and GitHub were different things. Now, they were easy to set right; the concepts are all familiar to them.
My experience with younger folks that have never been outside the Windows world has been that it was a lot harder to make them understand. Young folks with a unix-y background were much easier.
I think many a junior just never initialized a repository, and 'git clone' is what they've used to "get a repository". So I would forgive them for thinking that a remote is needed for having a repository on their drive.
IMO the issue here is just that people think of git as if it were something like SVN, where you need to have some kind of external server to host a project. Which isn't surprising, as most of the time (when you cooperate with others) you do exactly that: push changes to one server. The idea of decentralization and independent copies of repositories may not be easy to grasp for a newbie, because it's a really technical concept.
How many people know that you can just run `git daemon`[1] to share your local repo (very insecurely, mind you) to anyone that can reach your machine? No central server needed.
IMO the "issue" is that the concept of a free remote storage/backup makes sense, while doing random things on your own computer doesn't have much value for a lot of people.
".. you don't have to keep the previous versions, so it saves space? OK... No, I was talking about the real git."
well, they were novices, they gotta learn somewhere right? while I admit git being different from Github is pretty basic knowledge and it might be expected that they know that, but this seems rather harsh considering that using git isn't a skill cs degrees cover.
The copilot saga hasn’t even played out yet. If it turns out when the legal fog clears that anyone using copilot is always personally responsible for making sure the code was theirs to use (which I see as the likely outcome), then what difference does copilot make? It basically lets people copy paste FOSS code automatically. We could always do that and we were always responsible for the consequences.
The idea that copilot can somehow “AI-wash” the copyright of large/nontrivial pieces of code seems completely crazy.
how does a clean-room implementation of anything work anyway? people personally testify that they haven't seen the thing they want to cleanroom implement, right? and what if they did see it, how would anyone know/catch them? something being too similar is not proof. what if the implementer read a random comment on it on HN that laid out the general architecture of the thing, does that make the result a derivative work?
I think the fear of copyright infringement from developers being "tainted" by seeing or hearing about some software/implementation is exaggerated.
Copyright only applies to expression (code) not ideas. It would be extremely hard for anyone to re-implement anything in a manner exact enough to be plagiarism.
For a clean room impl the fear would be patents, not copyright.
Any software lacking patents can (my armchair lawyer guess) safely be re-implemented. If people reimplementing have had past access to the source code of the original, that's probably not even a great danger, so long as nothing is copied verbatim. Ideas/designs/architecture/functionality is not protected by copyright.
GitLab is mentioned on the advocacy page https://giveupgithub.org/. GitLab is an open-core business, which means there is a permissively licensed FOSS version and a proprietary version. The gitlab.com website uses the proprietary version. In addition many features of GitLab don't work at all or work much less well if you turn off JavaScript. They have swayed a lot of ostensibly FOSS, copyleft-leaning and privacy/security-centric projects to use it, despite the business model and license.
> In addition many features of GitLab don't work at all or work much less well if you turn off JavaScript.
I feel like the venn diagram of people who complain about JS being required and of people who have never had to code up a web /app/ that users expect rich interactions without a page-reload is just a circle.
I dislike JavaScript being gratuitously required, because it just generally makes things worse—slower to load, and less reliable. I occasionally complain about it, mostly in the places where there’s just no conceivable reason why it should have been done that way, because I do understand pragmatism.
I have also made and worked on multiple web apps where rich reload-free interaction is expected. In some cases, it has not been practical to support JavaScript-free operation at all, but in almost all cases where JavaScript-free operation has been feasible, I have provided at the very least partially-degraded operation—certainly on all green-field development.
A lot of the places where GitLab requires JavaScript are quite unnecessary, and should probably not have been done client-side at all in the first place, though I’d settle for server-side rendering with rehydration.
JS being required for the interactive features would be fine. My personal problem is that I end up on some random gitlab instance to just take a look at the source or issues for some library, and get a blank white page. For the read-only public view there should be no need for any JS.
I understand that thinking but it ignores reality. Doing what you want means maintaining 2 codebases (even if just for sub-parts of a site). It's really easy to say "This specific page could be static" and you are right, it could, but it would mean having fallbacks for every JS interaction on the page (or removing them if the user has JS disabled). There simply aren't enough people who die on the no-JS hill to care about, especially since it means ongoing development maintenance, testing, design/UI work, and the list goes on.
GitLab is built on JS and renders a white screen without JS. Enabling JS at all taxes my Core 2 Duo machine, and opening GitLab to a few thousand line file (or worse yet, opening the pull request diff view) taxes my top-of-the-line Ryzen 5 5600X machine running Firefox. GitLab is just badly written.
Or you could server render the pages and hydrate them as needed, which is something easy to do with NextJS, NuxtJS, Remix, Fresh, among other modern frameworks for developing with JavaScript libs.
This is my opinion, too. The JavaScripts should not be required just to read the documents, files, list of files, etc; even if some of the other features do use it.
What about websites that won't even render static text when javascript is disabled. For example, any website that references sinclairstoryline for javascript will not render anything at all until you give permission to run scripts from both the website and that external host.
If I'm just reading static text on a web page, there's absolutely no reason why I should need javascript to just read it.
> If I'm just reading static text on a web page, there's absolutely no reason why I should need javascript to just read it.
100% true if the site is privately funded. In most other cases JS is required for ad integration and analytics.
I don't like it, but I understand that funding is required and ads are the simplest way to get there without getting into the whole micro-payment and paid subscription mess.
Forcing people to run JavaScript does not guarantee that Analytics or Ads will run as these might get blocked by DNS, Extensions or even the browser itself. I understand the need for them but I don't think it excuses the need to run JavaScript to see text. By the way Hacker News has ads on their main page and the website works perfectly without JavaScript, and even better with it enabled! IMO the job of JavaScript is to enhance the UX, not render the webpage. With modern tooling such as NextJS and Svelte Kit it's possible to code everything in JS (without duplicating logic between backend and frontend) and still have some stuff work without JavaScript, even better when using something like Remix.
> Forcing people to run JavaScript does not guarantee that Analytics or Ads will run as these might get blocked by DNS, Extensions or even the browser itself.
True for power users on PCs, but keep in mind that many users use smartphones [0] and tablets nowadays to access websites. The possibilities to block analytics and ads are severely limited on these devices.
JS is also sometimes used to "protect" content from scraping by bots (I cannot comment on how effective this is, but I've seen it a lot). Again, I agree that JS shouldn't be used like this, but sadly it is.
You make it sound like it's impossible to build a Web app that supports rich interactions without a page-reload without requiring JavaScript, but this isn't true. You can use progressive enhancement/graceful degradation to build one such that users with JavaScript still get the experience they get now, while users without it will have an experience that's slightly clunkier but still usable.
And as I said in reply to a sibling comment: Then you are effectively maintaining 2 codebases. Also, I'm not aware of any SPA framework (at least the big 3) that even offer an escape hatch to do something like that. Maybe with SSR and some special logic you could but it would be painful and ultimately not worth it.
Remix provides something close to what you are thinking, the application works perfectly without JavaScript but gets enhanced if it's enabled which is pretty nice.
Remix, along with the other frameworks/libraries you've mentioned, are very interesting. I've considered trying out SSR with Quasar (the Vue framework I work most in), though selling it to "business" is hard and I understand why, I can't bring myself to eat the cost (both time and real dollar cost) on my own projects. I do hope SSR continues to advance, though I have some trouble imagining a "free"/"seamless" fallback for no-js users and so other than initial paint I'm not sure how functional some sites will be.
I specifically called out "web apps" in my first comment as I do understand the value SSR brings to things like blogs, news, or other simple sites where JS is not needed, or where it can have a clean fallback. On the other hand, I write "apps" (sometimes deployed on phones via Quasar/Capacitor as well as on the web) and those get much more complicated. I'm not quite sure how modals, WYSIWYG, rich date pickers, etc translate for a no-js user. Simple navigation is easy enough to grasp but my understanding is that things like NextJS/NuxtJS are really just for first render/paint and then React/Vue take it from there. I could be behind the times on what's possible without JS and using SSR though. I just know the PHP codebase I also work in uses plenty of JS to be functional (not above and beyond, literally "table stakes" stuff).
Yeah as of now I think the SSR capabilities of NextJS and NuxtJS will serve mostly for the first paint, it will also allow a user to navigate between pages without running JavaScript (which a SPA wouldn't). I do have to agree though that at a certain point thinking further than this about non-js users becomes too cumbersome and not really worth it if your application is truly a web _app_ meaning very interactive and to the point it could be bundled as a desktop application.
I'd like to note that Remix does handle everything being tied to a single logic as far as my testing went, I love it. The idea is that basically all interaction is done with html forms (like in the old days) and Remix loads a React bundle that makes that run client side after the page has loaded. It's a very simple model that should work for most use cases, although I don't think it's suitable if you're truly developing a web _app_.
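For anyone who hasn't seen the form-first model, here is a framework-agnostic sketch of the server side. This is not Remix's API; the `handle` function, the `/comments` route and the in-memory `comments` list are all invented for illustration. The page works with plain HTML forms and a redirect, and client-side JS could later intercept the same form to avoid the reload.

```python
from html import escape
from urllib.parse import parse_qs

comments = []  # hypothetical in-memory store


def render_page() -> str:
    """Server-rendered HTML: readable and usable with JS disabled."""
    items = "".join(f"<li>{escape(c)}</li>" for c in comments)
    return (
        "<ul>" + items + "</ul>"
        '<form method="post" action="/comments">'
        '<input name="text"><button>Add</button></form>'
    )


def handle(method: str, path: str, body: str = ""):
    """Return (status, headers, body) for a request."""
    if method == "POST" and path == "/comments":
        text = parse_qs(body).get("text", [""])[0].strip()
        if text:
            comments.append(text)
        # Post/redirect/get: the browser reloads the page without any JS.
        return 303, {"Location": "/"}, ""
    if method == "GET" and path == "/":
        return 200, {"Content-Type": "text/html"}, render_page()
    return 404, {}, "not found"


print(handle("POST", "/comments", "text=hello+%3Cb%3E")[0])
print(handle("GET", "/")[2])
```

The enhancement layer then becomes optional: the same endpoint serves both the no-JS form submit and a fetch-based submit.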
As with everything, balance is key. JavaScript is useful and more appropriate in some situations, and it's not in others. I do hope to see more progress with seamless SSR for SPAs though, I think it would make the internet a much better place.
I was a big fan of GitLab until the recent changes they've made with regard to their free tier and the hoops required to qualify for their Open Source plans.
I'm part of a small FOSS project with about 20 contributors but the changes mean we can now only have 5 max in the Project without paying for licenses or moving the Project to its own namespace and going through an application process for GitLab Ultimate for Open Source (or whatever it's called) which needs to be resubmitted yearly.
I fully understand they are not required to provide services for free, but this follows the fairly recent CI runner allowance reductions (which are understandable, compute costs money), and it doesn't give me much confidence in hosting small to medium FOSS projects there without having to jump through further hoops, while they shout about how open source they are and whittle their offerings down to the bone.
But they manage to change and sometimes break the UI in every update and it just gets more and more bloated every day. - At least that's what it feels like.
> “a broader conversation [about the ethics of AI-assisted software] seemed unlikely to alter your [SFC's] stance, which is why we [GitHub] have not responded to your [SFC's] detailed questions”
To be fair, there is a valid point here. If even one party has already made their conclusions and enters into a discussion with no willingness to even entertain ideas, instead just to fight their corner, then why would other parties willingly take part? We've all had those engineering discussions where no matter what is said, there are still engineers who refuse to entertain a concept. They're difficult and draining. I can see why the request would be refused if this was the case.
It seems far more likely that GitHub isn't answering the questions because they don't have answers that support what they're doing.
The ask here isn't "don't ever use AI code assistance tools", the ask here is "don't ship something as an AI code assistance product that fails to provide any means of tracking provenance and handling license compliance".
Quoting the post:
> Meanwhile, the work of our committee continues to carefully study the general question of AI-assisted software development tools. One recent preliminary finding was that AI-assisted software development tools can be constructed in a way that by-default respects FOSS licenses. We will continue to support the committee as they explore that idea further, and, with their help, we are actively monitoring this novel area of research. While Microsoft's GitHub was the first mover in this area, by way of comparison, early reports suggest that Amazon's new CodeWhisperer system (also launched last week) seeks to provide proper attribution and licensing information for code suggestions.
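One naive way to picture "provenance tracking" for suggestions, purely as an illustration: this says nothing about how CodeWhisperer or Copilot actually work, and the `register`/`attribute` helpers, the repo name and the exact-match fingerprint are all assumptions. It only catches verbatim (whitespace-insensitive) matches against an index of known snippets.

```python
import hashlib
from typing import Dict, Optional, Tuple

# fingerprint -> (repository, license identifier); hypothetical index
index: Dict[str, Tuple[str, str]] = {}


def fingerprint(code: str) -> str:
    """Hash the snippet after collapsing all whitespace runs."""
    normalized = " ".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def register(code: str, repo: str, license_id: str) -> None:
    """Record where a training snippet came from."""
    index[fingerprint(code)] = (repo, license_id)


def attribute(suggestion: str) -> Optional[Tuple[str, str]]:
    """Look up the source and license of a suggestion, if it is known."""
    return index.get(fingerprint(suggestion))


register("def add(a, b):\n    return a + b", "example/repo", "MIT")
print(attribute("def add(a, b):   return a + b"))
print(attribute("def sub(a, b): return a - b"))
```

A real system would need fuzzy matching to catch renamed variables and partial copies, which is the hard part.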
I think it's the other way around - if you only talk to those whose stance you think you can change then you are implicitly admitting that you are unwilling to change your stance. Otherwise there would still be value in hearing the other party out to see if they have anything to say that would make you reconsider your position.
> twenty-one years ago, the most popular code hosting site, a fully Free and Open Source (FOSS) site called SourceForge, proprietarized all their code — never to make it FOSS again.
I should not have to opt out. GitHub should have to respect my license. I already said they can use my code, as long as they keep an attribution intact (via a BSD license, for example)
GitHub is taking my code and ignoring the license. I don’t understand why anyone would think that is ok.
No, you're ignoring what you agreed to when you accepted the terms of service. GitHub can display your code, and YOU granted them that license by accepting their terms.
I find the only people making these OSS claims haven't used copilot and tend to lack any real contributions to OSS. What you're describing is just simply not the case for 99.9 percent of the code snippets being produced/generated based on data from GitHub.
I actually care more about putting code into peoples hands versus someone copying a license file, that's probably why I use the unlicense... "Because you have more important things to do than enriching lawyers or imposing petty restrictions on users"
I have used Copilot in its free phase, and have made substantial OSS contributions to various projects as well as shepherding my own projects (a few of which have attained a degree of success). Building an AI model off the community's code and selling the model (and code generated by rearranging statements and patterns found in the training data) back to the community is odious.
Not everyone who has code at GitHub uploaded it personally. Plenty of code was written before GitHub even existed and that code is still uploaded there.
That’s what fair use doctrine is about. Copyright doesn’t say “you can’t do anything without permission”, it says “you can’t do anything without permission, except for a few categories of things which cannot be forbidden”, and Copilot claims that what they’re doing fits in one of those categories.
Given Git is distributed (or most source control as a matter of fact today is) - even if you stopped pushing your code to Github, does it stop Copilot from pulling code from sources like gitlab.gnome.org, kernel.org, gitlab.kde.org etc?
I think underlying discussion should be about licensing, not about website to which you are pushing open source code to. Because that can be easily worked around.
There is a hidden problem with licensing here. Developers are giving Github the permission to use the code with a different license [1]. The clause sounds broad enough for them to justify training copilot with it. This allows them to disregard the license with which the project is published. The developers don't have the protection of a FOSS license anymore when you host there.
Yeah, because that was the substantive point about your company, its business model, and its relationship to the FOSS community that was made in the article.
I have no considered opinion on the GitHub thing in general, but Copilot depresses me. Any tool that makes you more productive at the cost of taking away the pleasure of figuring it out on your own bores and depresses me.
Do you feel the same way about high-level languages instead of writing your own assembly? How about using a bulldozer when a simple shovel could do? Copilot doesn't "do all the thinking for you", it's glorified autocomplete.
> Any tool that makes you more productive at the cost of taking away the pleasure of figuring it out on your own bores and depresses me
I mean, you don't have to use these tools yourself, particularly since you have to pay for it and therefore there is no "it is free" allure.
While I understand what you are saying about the pleasure of figuring things out yourself (and do enjoy that feeling myself), I don't feel that the mere existence of the tool affects me that much.
You need to pay more attention to GitHub Sponsors. Not sure how you could write this article and then not even mention it. You're asking people to walk away from a giant pile of money without even addressing it?
I'll continue to use GitHub pro for the next year, at least, simply because GitHub just sent me $720CAD through GitHub sponsors. The self-host movement is important, and I self host a lot of things (including gitea), but it still costs you time, especially when you have to deal with external users (like reputable email servers for gitea invites). If staying on GitHub is painless and _makes me money_, why would I switch?
> What case law, if any, did you rely on in Microsoft & GitHub's public claim, stated by GitHub's (then) CEO, that: “(1) training ML systems on public data is fair use, (2) the output belongs to the operator, just like with a compiler”? In the interest of transparency and respect to the FOSS community, please also provide the community with your full legal analysis on why you believe that these statements are true.
> If it is, as you claim, permissible to train the model (and allow users to generate code based on that model) on any code whatsoever and not be bound by any licensing terms, why did you choose to only train Copilot's model on FOSS? For example, why are your Microsoft Windows and Office codebases not in your training set?
I'm not sure I buy this argument. In the first point, the authors state that Copilot was trained on public data. In the very next point, they slightly tweak it by saying training was done on "any" code which loses the distinction between public and private code. Obviously Windows and Office are not public code.
I also interpreted "public data" to mean they trained on codebases that explicitly specified, say, MIT licenses or other permissive licenses. That seems like fair use to me. Those licenses don't explicitly restrict training AI models on their codebases, do they? It would be ironic if these licenses started banning AI training now though. That would effectively mean Copilot would be the sole trained AI model.
I'm happy to be proven wrong though. In general I have a distrust of Copilot. I fear it would make individuals worse programmers in the end at the cost of productivity
So even if it is legal to create a commercial product which outputs GPL code as its main value-add, it still seems like it could put the user in an awkward position of auto-completing big chunks of GPL licensed code into their project.
Good point, but also with FOSS licenses you can have incompatibilities if you take code distributed under a certain license and you use it in a project distributed under a different license. So I still think there is a double standard here.
Nobody mentions User Experience. SourceForge was terrible, Google Code also, Gitlab has an acceptable UX. Github kills it on most fronts.
What prevents Github Copilot from expanding to FOSS that is not hosted by them in the future? Just how Google indexes and caches the whole internet, what prevents this thing from going to public gitea of FOSS projects and scraping to train their model!?
When you join GitHub you accept their ToS which among many things gives them the rights to use your code for stuff like Copilot. Theoretically (although I'm not sure if this is the case, looks like no one is right now) Microsoft would have to change the way they handle code licenses in their training set if they were to use code hosted elsewhere, giving attribution to code from MIT licenses and not using AGPL code at all.
> When you join GitHub you accept their ToS which among many things gives them the rights to use your code for stuff like Copilot.
I get so many "our terms of service changed" emails and they link to a 30+ page document with not even a diff of what changed. I vaguely remember GitHub sending one out maybe in December 2019 but it linked directly to https://docs.github.com/en/site-policy/github-terms/github-t... and didn't even hint what was different so the only way you'd be able to know what changed is by re-reading every single word.
This is one of those things where you technically agree by continuing to use their service but no one can realistically be expected to read a 30+ page document for the 20 services they do every time a provider updates their terms without a diff. You'd be reading one of these at least once a week.
The email I got from GitHub also didn't include "pilot" anywhere in the body of the email and neither do their current terms of service, so now you need to be able to decipher whatever wording they use to translate back to "co-pilot". After all that I also have lots of emails from noreply@github.com and searching my inbox for emails from that with "terms" in the subject doesn't show anything related to co-pilot.
I'm not a lawyer but I can't imagine if you agree to the terms today but 3 months from now new terms have been added -- you don't passively start accepting those terms without an explicit action to say you do after you've been notified of the changes.
This is why I prefer to use a permissive license for my own software. I just don't need the drama of worrying about what people are doing with my code after I write it. Living in a world where you think a company is evil for training ML models on code that is published for anyone to read on the internet anyway because it violates the license you used just sounds exhausting.
My main concern was exactly Copilot. The concern is mostly one of principle.
I don't hate Microsoft like I did when I was younger. I think VSCode is a great editor. I still think GitHub is the best social network I know. But I've quit every other social network.
But when your values don't align with an organisation's, you will eventually run into conflicts of interest.
I still push changes to open-source projects that I host on GitHub, but I don't create new ones. The latest project I started runs in a local git directory that will get pushed to a self-hosted Gitea.
When necessary, I'm seriously considering contributing to Gitea to make this transition easier for others.
Gitea is an ~100MB self-hosted binary that mimics GitHub, runs its own SSH server, and it looks GREAT!
maybe I'm being stupid, but i'm not seeing the issue here. having your code public means that anyone can copy it and use it. forget about respecting licensing and what not. if you don't want anyone copying your code, you have to keep it closed and not publish it to a public repo. i don't think this is necessarily a github issue. I can copy code from any other service (gitlab, bitbucket, self hosted repo) as long as the repo is public.
FWIW, you can help send OpenAI and Microsoft a strong message by filing an official complaint with the FTC, BBB, CA State Attorney General, San Francisco DA office & FBI IP Theft division.
There is currently an open investigation case at the Better Business Bureau (filed by myself), but OpenAI refuses to participate. I.e. OpenAI refuses to defend its business practices in front of a government entity.
if you're worried about copilot: isn't the problem that copyright law (apparently) allows them to do it?
if you move everything to gitlab/self-hosted there's nothing stopping them spending 5 minutes querying the bing index and feeding it every repo they find
Well, they're asserting copyright law allows them to do it. It's not clear how well that would hold up in court against a well-funded lawsuit -- but that would require someone whose code got regurgitated substantially intact by Copilot, who also had the resources to manage the suit.
Copying off Github and copying off another host is not quite the same. I explained it twice on this thread already. So I'm leaving a link here: https://news.ycombinator.com/item?id=31937705
I think an easy and efficient way to prevent FOSS code from being used to train the model is to update the licenses or create a new license that forbids such training.
It seems superficially reasonable that a model trained on your work would be a derivative work of that work? But I am super far from any kind of domain expert.
Legit question: why do people feel locked in to GitHub? It’s very easy to push a git repo to a new place, and GitLab has way more features out of the box.
You're only better off using GitHub today in the same sense that a turkey being fattened up for Thanksgiving is better off today than one living in the wild.
What's "nebulous" about having your chosen license for your project ignored – in practice, rescinded – when it's uploaded to GitHub by someone who may or may not be you?
The most common and basic thing is almost always a disappointment for newcomers: there are no installation steps for the project, i.e. how to install the project/repo and how to run it. Very few maintainers mention the steps, and only a handful of those run without error. So yes, Goodbye Github!
While I appreciate the earnest defense of FOSS and the, in all fairness, totally warranted suspicion of Microsoft, given its history, I found the attitude of this article to be very sour and, actually, in bad faith. Let me address the three questions which they posed to MS:
> 1. What case law, if any, did you rely on in Microsoft & GitHub's public claim, stated by GitHub's (then) CEO, that: “(1) training ML systems on public data is fair use, (2) the output belongs to the operator, just like with a compiler”? In the interest of transparency and respect to the FOSS community, please also provide the community with your full legal analysis on why you believe that these statements are true.
I mean, I'd be floored if any corporate lawyer let anyone at [large company] answer this kind of question outside of an actual lawsuit. They are essentially asking the opposing team's lawyers to do all this work for them, for free. This is followed by an "obvious[ly]" correct (I'm being ironic) interpretation of the refusal to answer: that MS is wrong but just won't admit it. But go back and re-read the question. The question was architected to produce this impression if it wasn't answered. That's a sign of a bad faith question, rather than a question with intent to learn the answer.
> 2. If it is, as you claim, permissible to train the model (and allow users to generate code based on that model) on any code whatsoever and not be bound by any licensing terms, why did you choose to only train Copilot's model on FOSS? For example, why are your Microsoft Windows and Office codebases not in your training set?
Other commenters have discussed this one already. There is a perfectly reasonable and legitimate explanation here: They do not want to do anything that remotely risks exposing trade secrets, and that's a separate concern from potentially accidentally violating a license. Suppose the model was trained on all these public repos + MS's private repos. Someone else can come along and train their own model on the public code; now they have two code generators whose outputs can be compared to reveal secret information about MS's training set. This time, the article guesses well at the answer: MS cares more about itself than others. Sure. Why would it be expected not to?
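A toy version of that comparison, to show the idea only: the "models" here are just sets of token n-grams, which is a drastic simplification of a real code generator, and the corpora and names are made up. With real models the comparison would be statistical rather than a clean set difference.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(corpus: Iterable[str], n: int = 3) -> Set[NGram]:
    """Stand-in for 'training a model': collect every token n-gram."""
    grams: Set[NGram] = set()
    for doc in corpus:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


public = ["for i in range ( n )", "return x + y"]
private = ["secret_key = derive ( master , salt )"]  # hypothetical

model_public = ngrams(public)            # what an outsider can train
model_both = ngrams(public + private)    # public + private training set

# Anything the second model knows that the first does not must have
# come from the private data.
leaked = model_both - model_public
print(sorted(leaked))
```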
> 3. Can you provide a list of licenses, including names of copyright holders and/or names of Git repositories, that were in the training set used for Copilot? If not, why are you withholding this information from the community?
I think this question is bad faith too. It starts by asking "can you". Then, if the answer is "no, we can't", reinterprets the answer as "no, we won't" ("withholding" is an intentional act). It is disingenuous to imply that someone who cannot do something is, therefore, intentionally refusing to do so. In the analysis of the lack of response, the article (finally) admits that it is speculating wildly, backpedals on the implied claim that MS is refusing to provide this information, and instead takes a different approach: MS scientists can't answer because they are not good scientists. But wait, here's the kicker:
> ... so they don't actually know the answer to whose copyrights they infringed and when and how.
Busted! The authors have essentially demonstrated the question is in bad faith by suggesting that the answer to the question, "Whose data did you use?", is the same as the answer to the question, "Whose copyright did you violate?", which is a logical connection made possible only by the underlying presupposition that MS is totally incorrect in its assertion about fair use in question 1. The framing of all these questions suggests to me that the authors were already firmly convinced of their guesses as to the answers/non-answers _at the time of posing the questions_.
If they actually waited for a whole year expecting a response, that's on them. I'm with MS on the decision not to engage here, even if I share all these qualms about Copilot.
1: What other kind of faith in Microsoft would be even remotely warranted?
2) "perfectly reasonable and legitimate explanation ... risks exposing trade secrets ... a separate concern from potentially accidentally violating a license"
2 a: Sure, they may be separate concerns, but Microsoft is acting -- and you, by arguing for them, are at least implicitly, but AFAICS explicitly, endorsing their viewpoint -- as if their interest obviously overrides everyone else's. Why should it? They're the ones who want to do this, so why shouldn't their code be the one exposed to any risk? If you want to test if some newfangled house-building material is really as fire resistant as its manufacturers claim, you can set fire to your own house, not your neighbour's. Also, there's only one Microsoft whose interests would be put at risk if they use their own code; but they chose to expose how many others?
2 b: For someone complaining about "bad faith" on the part of others, "potentially accidentally" is some mighty fine weasel wording. What's "accidental" about intentionally building a product and intentionally training it on a bunch of code written by others? (They didn't just randomly press some keys and say "Oops, let's see if it gets trained on our code now, or everybody else's", did they?)
3) "starts by asking 'can you'. Then, if the answer is 'no, we can't', reinterprets the answer as 'no, we won't' ('withholding' is an intentional act)"
3 a: The word "can" has several valid usages in English. If I say "Can you pass me the salt, please?" and you don't, then you are (assuming you have no severe physical handicap that's stopping you) intentionally withholding the salt from me.
3 b: Even if Microsoft is actually unable to provide the asked-for data, the question arises: How come they are unable? They've built this product. Not building that traceability into it was their choice. Why did they choose not to?
If anyone is "busted" here, it seems to me that's you: Busted as a Microsoft shill.
I had my suspicions, but reading this comment thread I am now 100% sure that FAANGs and the like have active PR firms working on HN threads. Naive of me not to have realized that before, I know.
What is it with Americans and their dismissal of other people's concerns as always being "politics"?
It totally does break licensing laws. Just because Americans think they can change the terms of an agreement and relicense your code because you didn't remove it doesn't make that legal.
I haven't looked into this extensively or tried Copilot myself, so I might be completely wrong about this. But from what I understand, the code Copilot generates is generally different enough from the source data that it shouldn't be an issue. In a sense, Copilot reading lots of code on GitHub to train a code-writing model is analogous to a human reading lots of code on GitHub and learning from some of the design patterns they see—as long as the output is not too similar to the input, it should be fine.
> But from what I understand, the code Copilot generates is generally different enough from the source data that it shouldn't be an issue.
It's a completely closed system and they refuse to let you know what they used as source, so you will never know, which is the problem raised by the fine article, including Microsoft's refusal to engage.
The premise is completely unfounded. If I read the Windows source code and then went to recreate Windows functionality in Wine, Microsoft would completely sue the crap out of everyone even if I didn't copy/paste Windows code. Why should we give Microsoft leeway?
Even Amazon lets you understand what licences went into the code you are copy/pasting.
the issue is if GitHub hosts your code with an open source license that does not allow for-profit reuse without concomitant sharing, like the GPL, they will incorporate that code into their product in violation of the license, and claim the license doesn’t apply because the code isn’t code but merely text.
they built their business on precise and detailed articulations of consent and its boundaries, but disregarded all of that, post-acquisition, because Microsoft has enough lawyers that they think they can get away with it.
so, for the subcommunities and creators who put their work on GitHub in the context of these very specific and fine-grained articulations of consent, this may be theft and is certainly betrayal.
But we never had humans who could read so much code and "learn" so much. :p Our old system's rules and logical conclusions, taken up a level, now make us uncomfortable.
I agree, except the last sentence about "leftist activists" can be omitted.
I'm grateful for experiments like Copilot and replit's[0] "generate and explain." Maybe there are issues with software licenses and "ai assistants," but the way to surface the issues is with working software. I don't think this is a case of "move fast and break things." It's always been possible to abuse open source licenses without "ai."
There are plenty of alternatives to GitHub if you don't like it. There's no need for the hypercritical tone in this piece.
> What case law, if any, did you rely on in Microsoft & GitHub's public claim, stated by GitHub's (then) CEO?
Doesn't Google crawl the entire web? How is crawling code different? Copilot is essentially a more intelligent search engine. The only difference is that people want their websites to be searchable and rise to the top of Google, because they benefit from the increased traffic. Legally I don't see the difference. As Copilot gets more developed, it may become desirable to have your API at the top of Copilot just like search, and this can help drive traffic to FOSS projects. After all, why would a creator not want their FOSS project to be easier to find/use?
> Why did you choose to only train Copilot's model on FOSS and not on Windows/Office?
Code from Azure and tons of other Microsoft projects is in the training set. Windows and Office are not FOSS and not on GitHub. Obviously it would be a huge security risk to train on OS code.
> Can you provide a list of licenses, including names of copyright holders and/or names of Git repositories, that were in the training set used for Copilot?
Again unfair; this is a gotcha type question. There's all kinds of code on GitHub with so many different types of licenses. There's bound to be some gray areas and code that inadvertently made its way into the model. No lawyer would ever expose themselves to that kind of liability.
The bigger issue I see here is that Copilot is not free but uses free software, and that understandably makes the FOSS community uncomfortable. However, the Copilot models are incredibly expensive to run, and Microsoft has to cover the bill. Would it be unethical if Google charged you a monthly fee? Arguably not, because it does already in the form of ads. What Microsoft needs to do is have an opt-out standard like the robots.txt file or noindex meta tag. The problem with GitHub is that, unlike websites on the web, not everyone uses GitHub public repos with the express purpose of having them be easily accessible to the public. Another issue is that attribution of snippets is a nightmare, but one could argue that devs do the same with Stack Overflow all the time.
This is my favorite question about Copilot ever.