You're assuming that Copilot Chat and the regular completion are the same model. Do you have a source that says so? I'd assumed that they were two different models, since they're quite different tasks.
Footnote 1 on page 2 explicitly mentions the 3.5 model, and the research in this paper is only about autocompletion: https://arxiv.org/pdf/2306.15033.pdf
Lastly, OpenAI states on the original Codex page: “OpenAI Codex is a descendant of GPT-3; its training data contains both natural language and billions of lines of source code from publicly available sources, including code in public GitHub repositories.” It included GitHub repos, but it was never only GitHub repos. https://openai.com/blog/openai-codex
As I said to the other commenter, I specifically avoided saying "only"; I said "primarily", and I should have clarified that I meant "primarily fine-tuned". My point is simply that it is far more likely to spit out results that are patterned after GitHub than results that are patterned after any inter-programmer communications.
Also, I wasn't contesting that autocomplete uses GPT-3.5 as the base model; I was contesting the idea that it uses the same derivative model as chat.
You don't need to use the chat to test whether it was exclusively trained on GitHub.
I started typing this in my IDE in a Rust project I already had open:
// the following function prints out the best soccer players of all time, which teams they played for, and why they were great as a hashmap
fn soccer_greats() -> HashMap<String, (Vec<String>, String)> {
    let mut hm = HashMap::new();
It completed with the following:
    hm.insert("Pele", (vec!["Santos", "New York Cosmos"], "He scored over 1000 goals in his career".to_string()));
    hm.insert("Maradona", (vec!["Boca Juniors", "Barcelona", "Napoli"], "He won the world cup with Argentina".to_string()));
    hm.insert("Messi", (vec!["Barcelona"], "He has won the Ballon d'Or 6 times".to_string()));
    hm.insert("Ronaldo", (vec!["Barcelona", "Inter Milan", "Real Madrid", "AC Milan", "Corinthians"], "He won the world cup with Brazil".to_string()));
    hm.insert("Zidane", (vec!["Juventus", "Real Madrid"], "He won the world cup with France".to_string()));
}
I don't believe that information is going to be on GitHub anywhere, but I could be mistaken.
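Side note for anyone pasting this into a project: the completion as quoted would not compile against the declared return type, since it inserts string literals (`&str`) where `String` is expected and never returns `hm`. Below is a minimal sketch of a version that should compile, keeping the same data. The `entries` array and the conversion loop are my own restructuring for illustration, not what the tool produced.

```rust
use std::collections::HashMap;

// Maps each player's name to (teams played for, why they were great).
// The data is copied from the completion quoted above; the `entries`
// array and the loop are an illustrative restructuring (an assumption,
// not the original output).
fn soccer_greats() -> HashMap<String, (Vec<String>, String)> {
    let entries = [
        ("Pele", vec!["Santos", "New York Cosmos"], "He scored over 1000 goals in his career"),
        ("Maradona", vec!["Boca Juniors", "Barcelona", "Napoli"], "He won the world cup with Argentina"),
        ("Messi", vec!["Barcelona"], "He has won the Ballon d'Or 6 times"),
        ("Ronaldo", vec!["Barcelona", "Inter Milan", "Real Madrid", "AC Milan", "Corinthians"], "He won the world cup with Brazil"),
        ("Zidane", vec!["Juventus", "Real Madrid"], "He won the world cup with France"),
    ];

    let mut hm: HashMap<String, (Vec<String>, String)> = HashMap::new();
    for (name, teams, why) in entries {
        // Convert the borrowed string literals into owned Strings so the
        // values match the declared return type.
        let teams: Vec<String> = teams.into_iter().map(String::from).collect();
        hm.insert(name.to_string(), (teams, why.to_string()));
    }
    hm // return the map; the quoted completion omitted this
}
```

Calling `soccer_greats()` and looking up a key such as `"Messi"` gives back the tuple of teams and description. None of this changes the point of the experiment, which is about where the facts came from, not whether the code type-checks.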
You're addressing a straw man; I never claimed it was "exclusively" trained on GitHub. I said "primarily", though I should have been specific and said "primarily fine-tuned".
In the context of the person I replied to, the point is that it isn't made up primarily of a bunch of communications between programmers.