This is all just computational statistics. Why in the world would you invoke ill-defined anthropocentric terminology like "intelligence"? Of course a statistics program isn't "using intelligence".
But it's also not exactly just a database. It contains contextual relationships as seen with things like GPT that are beyond what a typical database implementation would be capable of.
> But it's also not exactly just a database. It contains contextual relationships as seen with things like GPT that are beyond what a typical database implementation would be capable of.
You mean in the same way that google.com isn't "just a database"?
If Copilot isn't intelligent, then what makes it more special than a search engine? How is Copilot not just LimeWire, but for code?
I could understand the argument that, if Copilot really is intelligent or sentient or something like that, then what it produces is as original as what a human can produce (although humans still have to respect copyright law). However, I haven't seen anyone even attempt to make a serious argument like that.
It can produce code snippets that were never seen by generating fragments from various sources and combining them in a new way. This makes it different from a search engine, which only returns existing items.
Is it producing code (by which I mean creating/inventing new code by itself), or is it just combining existing code? Because to me it seems like the latter is a more appropriate description.
* AI searches for code in its neural-net-encoded database using your search terms (ex: "fast inverse square root")
* AI parses and generates AST from the snippet it found
* AI parses and generates AST from your existing codebase
* AI merges the ASTs in a way that compiles (it inserts snippet at your cursor, renames variables/function/class names to match existing ones in your program, etc)
* AI converts AST back into source code
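To make the hypothetical concrete, here is a toy, runnable sketch of that retrieve-and-splice pipeline in Python. The snippet store, the query, and the rename map are invented purely for illustration; this is not a claim about how Copilot actually works.

    # Toy version of the hypothetical "retrieve-and-splice" AI described above.
    # The snippet store and rename map are made up for illustration only.
    import ast

    SNIPPET_DB = {
        "fast inverse square root": (
            "def q_rsqrt(number):\n"
            "    return number ** -0.5\n"
        ),
    }

    def retrieve(query):
        """Step 1: look up a stored snippet by search terms."""
        return SNIPPET_DB[query]

    def adapt_names(tree, rename):
        """Steps 2-4: walk the snippet's AST and rename identifiers so they fit
        the surrounding codebase."""
        class Renamer(ast.NodeTransformer):
            def visit_FunctionDef(self, node):
                node.name = rename.get(node.name, node.name)
                self.generic_visit(node)
                return node
            def visit_arg(self, node):
                node.arg = rename.get(node.arg, node.arg)
                return node
            def visit_Name(self, node):
                node.id = rename.get(node.id, node.id)
                return node
        return ast.fix_missing_locations(Renamer().visit(tree))

    # Step 5: convert the adapted AST back into source code.
    tree = ast.parse(retrieve("fast inverse square root"))
    tree = adapt_names(tree, {"q_rsqrt": "inv_sqrt", "number": "x"})
    print(ast.unparse(tree))  # prints the renamed snippet, ready to splice in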
Is AI intelligently producing new code in that example? Because I don't think it is.
An interesting test of whether it can actually generate code would be to task it with implementing a new algorithm that isn't in the training set at all and could not possibly be implemented by simply merging existing code snippets together. Maybe by describing a detailed imaginary protocol that does nothing useful, but requires some complicated logic, abstract concepts, and math.
A person can implement an algorithm they've never seen before by applying critical thinking and creativity (and maybe domain knowledge). If an AI can't do that, then you cannot credibly say that it's writing original code, because the only thing it has ever read, and the only thing it will ever write, is other people's code.
But you seem to have a fundamental misunderstanding of what is going on inside the NN. There is no "search for code": it is generating new code each time, but sometimes that code will be the same as something it has seen, because there is little or no variation in the training data for that snippet.
The NN generates code token by token, conditioned on the code leading up to it (and perhaps the code ahead, similar to BERT).
If you see tokens like this you probably generate the same next token too:
for i in range(1,10)
You have conditioned your input on the code you have seen and the most likely token you produce is ":".
That's what the NN does, but for much longer range conditioning.
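Here is a toy illustration of that loop using raw token counts instead of a neural network. The training lines are made up; the only point is that the next token is chosen by conditioning on what came before.

    # Toy "condition on the tokens so far, emit the most likely next token" loop.
    # A real NN conditions on a much longer context and uses learned weights
    # instead of raw counts, but the generation step has the same shape.
    from collections import Counter, defaultdict

    training_code = [
        "for i in range ( 1 , 10 ) :",
        "for j in range ( 0 , n ) :",
        "for x in items :",
    ]

    # "Training": count which token follows each token.
    follows = defaultdict(Counter)
    for line in training_code:
        tokens = line.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev][nxt] += 1

    def next_token(context):
        # Pick the most likely continuation given the last token seen.
        return follows[context[-1]].most_common(1)[0][0]

    print(next_token("for i in range ( 1 , 10 )".split()))  # prints ":"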
I am not educated on this matter, but I have to ask for your clarification. Would that not be just a pre-emptive lookup? Akin to keeping a cache of "results" per input token that are essentially memorized and regurgitated?
Sounds like there is still a db lookup, just not at runtime but rather at build time of the NN. Can you clarify this please?
GPT-3's raw output is "logits": scores over an encoding space of tokens, from which one index is picked. The encoding space contains individual tokens; for generation, these are words or even word pieces, as small as "for" or "if". Constructing code from an embedding space, even a more specialized one, is like constructing sentences by using a dictionary -- there is a lookup table involved, but it's not a database. Generation works by looking at the existing document (or a portion of it) and, based on what is already present, generating a token, then repeating until some condition is met (a length limit, end of sentence, something else).
The issue here is that certain sentences (code segments) are memorized, and reproduced -- much like a language learner who completes every sentence which begins with "Mi nombre" with the phrase "Mi nombre es Mark". The regurgitation is based on high probability built into the priors, not an explicit lookup. A different logit sampling method (instead of taking the likeliest) reduces regurgitation, without changing anything else about the network. (It also makes nonsense happen more often, since nonsense items are inherently less likely!)
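A minimal sketch of that difference, with an invented vocabulary and invented logit values, just to show the decoding step:

    # Greedy decoding always takes the likeliest token, which encourages verbatim
    # regurgitation; sampling with a temperature sometimes picks less likely
    # tokens, which reduces regurgitation but admits more nonsense.
    import numpy as np

    vocab = ["es", "is", "era", "Mark", "banana"]
    logits = np.array([4.0, 1.5, 0.5, 0.2, -2.0])  # model's scores for the next token

    def greedy(logits):
        return vocab[int(np.argmax(logits))]

    def sample(logits, temperature=1.0):
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        return vocab[np.random.choice(len(vocab), p=probs)]

    print(greedy(logits))                   # always "es"
    print(sample(logits, temperature=0.8))  # usually "es", occasionally something else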
That isn't even necessary. I've been exploring GPT-3 for a while and it is completely incapable of any reasoning. If you enter a short, unique logic question like "Bob had 5 apples, gave 2 to Mary, then ate the same amount. How many apples does Bob have left?", it gets it wrong no matter how many previous examples you give it (to be sure it understands the question). It is simply incapable of reasoning about what is going on.
> How do you differentiate between these two things?
That's a contrived example because none of those lines could be protected by copyright, patents, etc. A better example might be if you started selling a 30 minute movie that was just the first 15 minutes of Toy Story spliced together with the last 15 minutes of Shrek. I'm not a lawyer, but I'm pretty sure that would qualify as a derivative work, meaning you're potentially infringing on someone's rights (unless they've given you permission/a license).
And to be clear, none of these problems are new. People have been fighting over copyright and its philosophy in court for a very long time. The only thing that's different here is that it seems some people think it's ok to ignore copyright if you use Copilot as a proxy for the infringement.
> As an aside, your understanding of how the model works here is completely wrong. Like, just absolutely fundamentally completely wrong.
Of course I don't, it's a neural network. You don't know either. That example I posted could be exactly what it's doing, or not even close.
(although for the record, I wasn't trying to explain how copilot works in that comment. It was a hypothetical "AI" for the sake of discussion, not that it matters. My point about it being copyright infringement is the same even if that hypothetical implementation is wrong)
> Of course I don't, it's a neural network. You don't know either. That example I posted could be exactly what it's doing, or not even close.
What is this supposed to mean? We know how neural networks generate things like this very very well.
I personally have built a system that feeds pictures of hand-drawn mobile app layouts into a NN, which then generates a JSON-based description file that I compile into a React Native and/or HTML5 layout file.
This was trivially easy in 2018 when I did it. It took me maybe 2 weeks engineering time, and I'm no genius. Our understanding of how transformer-based NNs work has come a long way since then, but even back then it was easy to show how conditioning on different parts of the image would generate different code.
> That's a contrived example because none of those lines could be protected by copyright, patents, etc.
Well no. The question I'm asking about, the philosophical distinction between "producing" or "combining" is a valid question no matter the copyrightability of anything. It's an interesting philosophical question even if we presume that copyright is bubkis.
> It was a hypothetical "AI" for the sake of discussion, not that it matters.
Ah, my mistake. I see that now.
> Of course I don't, it's a neural network. You don't know either.
I may not know how to make a lightbulb, but I do know hundreds of ways not to make one ;)
> A person can implement an algorithm they've never seen before by applying critical thinking and creativity (and maybe domain knowledge). If an AI can't do that, then you cannot credibly say that it's writing original code, because the only thing it has ever read, and the only thing it will ever write, is other people's code.
This doesn't hold at all. Not many people can come up with an original sorting algorithm for example, but people write code all the time.
The fact that it reproduces code verbatim, including comments and even swear words, means it is definitely copying some of the time.
Does it copy all the time? Doesn't matter. Plagiarism is plagiarism regardless of whether it is done by a student in school, an author, a monkey, or an "AI".
You wouldn't accept this from a student, you shouldn't accept it from a coworker (unless you are releasing under a compatible license), and of course you shouldn't accept it from Microsoft.
Copilot produces original code (as in code that has never been written before). It's not just combining snippets.
This should surprise no one who has seen the evolution of language models. Take a look at Karpathy's great write-up from way back in 2015[1]. It generates Wikipedia syntax from a character-based RNN. It's operating on a per-character basis, and it doesn't have sufficient capacity to memorise large snippets. (The Paul Graham example spells this out: 1M characters in the dataset = 8M bits, and the network has 3.5M parameters.)
Semantic arguments about "is this intelligence?" I'll let others fight.
That it can combine snippets doesn't mean the OP's understanding of how the system works is correct.
They seem to believe it is a database system. That's really not how this works, and the fact that it sometimes behaves like one disguises what it is actually doing.
If I say "write a "for" loop from 0 to 10 in Python" probably 50% of implementations by Python programmers will look exactly the same. Some will be retrieving that from memory, but many will be using a generative process that generates the same code, because they've seen and done similar things thousands of times before.
A neural network is doing a similar thing. "Write quicksort" makes it start generating tokens, and the loss function has optimised it to generate them in an order it has seen before.
It's probably seen a decent number of variations of quicksort, so you might get a mix of what it has seen before. For other pieces of code it has only seen one implementation, so it will generate something very similar. There could be local variations (eg, it sees lots of loops, so it might use a different variation) but in general it will be very similar.
But this isn't a database lookup function - it's generative against a loss function.
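A sketch of what "generative against a loss function" means during training, with invented probabilities: the only signal the model ever gets is how much probability it assigned to the token that actually came next.

    # The model is never rewarded for "finding" a stored snippet, only for putting
    # high probability on whichever token actually followed in the training text.
    import math

    def next_token_loss(predicted_probs, actual_next_token):
        # Cross-entropy for one step: small when the right token got high probability.
        return -math.log(predicted_probs[actual_next_token])

    # Model's predicted distribution after seeing, say, "def quicksort(arr"
    predicted = {")": 0.90, ",": 0.08, ":": 0.02}

    print(next_token_loss(predicted, ")"))  # ~0.11 (good prediction, low loss)
    print(next_token_loss(predicted, ":"))  # ~3.91 (bad prediction, high loss)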
This is a subtle distinction, but it is reasonable to expect that people on HN understand it.
> Some will be retrieving that from memory, but many will be using a generative process that generates the same code, because they've seen and done similar things thousands of times before.
How are these not both the exact same process of memory recollection? Can you elaborate on the difference between memory recall vs a generative process based on conditioning? I understand how these two are different in application, but I don't understand why one would say they are fundamentally different processes.
Analogies start to break down once we are talking at this detailed level.
The best I can come up with is this:
Imagine you are implementing a system to give the correct answer to the addition of any two numbers between 1 and 100.
One way to implement it would be to build a large database, loaded with "x" and "y" and their sum. Then when you want to find out what 1 + 2 is you do a lookup.
The other method is to implement a "sum" function.
Both give the same results. The first process is a database lookup; the second is akin to a generative process, because it does a calculation to come up with the correct result.
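In code, the two implementations from the analogy might look like this:

    # 1. "Database" approach: precompute every answer and look it up.
    lookup = {(x, y): x + y for x in range(1, 101) for y in range(1, 101)}

    def add_by_lookup(x, y):
        return lookup[(x, y)]

    # 2. "Generative" approach: compute the answer on demand from a rule.
    def add_by_rule(x, y):
        return x + y

    # Same observable behaviour, very different mechanism.
    assert add_by_lookup(1, 2) == add_by_rule(1, 2) == 3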
This analogy breaks down because a NN does have a token lookup as well. But the probabilistic computation is the major part of how a NN works, not the lookup part.
Perhaps it’s not so different from a search engine like Google. The article cites Google’s successful defence, under US copyright law, of its practice of displaying ‘snippets’ from copyrighted books in search results. There is a clear difference between this and the distribution of complete copies on LimeWire.
If you look at it this way, your brain is also "just" computational statistics. (Or, to be precise, it might be, since we don't yet know in full detail how it works.)
Hint: it has been said hundreds of times since the advent of computer science that the brain is "just" [some simple thing that we already understand]. That notion has never once helped us in any way.
A common tech-bro fallacy. We understand exactly what is happening at the base level of a statistics package. We can point to the specific instructions it is undertaking. We haven't the slightest understanding of what "intelligence" is in the human sense, because it's wrapped up with totally mysterious and unsolved problems about the nature of thought and experience more generally.
The fallacy is the god-of-the-gaps "logic" of assuming there's some hand-wavey phenomenon that's qualitatively different from anything we currently understand, just because reality has so much complexity that we are far from reproducing it. You're assuming there's a soul and looking for it, even though you don't call it that.
Intelligence is mysterious in the same way chemical biology is mysterious (though perhaps to another degree of complexity)... It's not mysterious in the way people getting sick was mysterious before germ theory. There's no reason to think there's some crucial missing phenomenon without which we can't even reason about intelligence.