Yeah, the dataset was 40GB of text from pages linked from Reddit, so I imagine it was quite hard to clean it to just English text. They also noted in their paper that it "accidentally" learned to translate English into French, even though they removed non-English web pages, because of examples like
"I’m not the cleverest man in the world, but like they say in
French: Je ne suis pas un imbecile [I’m not a fool]."
What if the server is GitHub? Or some random blog about PHP development? There are lots of situations where it's very intentional that PHP is contained in HTML.
It’s a text generation algorithm. It’s not meant to be real code, just look like it. This is the infamous “too dangerous to release” GPT-2 making this code.
I have been working with a group that is trying to clone this dataset and make it publicly available (https://github.com/jcpeterson/openwebtext), and I have noticed quite a bit of code in the scraped dataset. Future releases of our dataset will be pre-filtered with another LSTM language model that will filter sentences by their probability under more conversational / literary datasets.
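For anyone curious what that kind of filter might look like, here is a rough sketch. To be clear, this is not the openwebtext code: it swaps the LSTM for an add-one-smoothed unigram model so it fits in a few lines, and the names (UnigramModel, filter_sentences), the idea of averaging log-probability per token, and the threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Words (with apostrophes) or single non-space symbols such as $, {, ;
TOKEN_RE = re.compile(r"[A-Za-z']+|[^\sA-Za-z']")


def tokenize(text):
    """Lowercase the text and split it into word and symbol tokens."""
    return TOKEN_RE.findall(text.lower())


class UnigramModel:
    """Add-one smoothed unigram model (a stand-in for the LSTM, assumed)."""

    def __init__(self, reference_sentences):
        self.counts = Counter()
        for sentence in reference_sentences:
            self.counts.update(tokenize(sentence))
        self.total = sum(self.counts.values())
        # One extra slot so unseen tokens get non-zero probability.
        self.vocab_size = len(self.counts) + 1

    def avg_log_prob(self, sentence):
        """Average per-token log-probability, so long sentences are not penalized."""
        tokens = tokenize(sentence)
        if not tokens:
            return float("-inf")
        denom = self.total + self.vocab_size
        log_prob = sum(math.log((self.counts[t] + 1) / denom) for t in tokens)
        return log_prob / len(tokens)


def filter_sentences(sentences, model, threshold):
    """Keep only sentences the reference model finds likely enough."""
    return [s for s in sentences if model.avg_log_prob(s) >= threshold]
```

You would fit the model on the conversational / literary text, score every scraped sentence, and pick the threshold by looking at what gets dropped. Code tends to score low because most of its tokens never show up in the reference text.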
Until GPT-2 can participate in a formatting holy war, all our jobs are secure. It's time to get worried when it starts posting opinionated comments on the internet about "how spaces make my code look the same on everyone's machine". That's when it'd be a good idea to invest in a bunker.
I am literally living in the streets, freezing my ass off and hungry, looking for any kind of programming work for the past month, and now I have to see some AI bot generating more inexpensive shit code that I am sure some manager will convince themselves might get them that final career promotion by lowering their labor costs to near zero.
WTG, geniuses, for developing AI that before you know it will have all of us living in the streets and hungry...
People are downvoting your crassness, but I sympathize with your situation.
I don't know what Las Vegas is like, but there is a lot of LAMP/WordPress work here in Toronto. (Frankly, I want nothing to do with it, but there's plenty of it and they pay alright. Some also offer PT or FT remote.)
I can sympathize with your situation but I have to ask: where are your friends/family in all this? You don’t have any support system at all? I realize it’s possible but I want to understand how you ended up in your present situation.
My next comment is you are in the wrong industry if something like this scares you. You have the wrong attitude. Instead of lamenting about a new tech replacing your current skills, you should be asking yourself, how can I learn this new technology and put it to work for me?
Some people may say someone in your position has more important things to worry about and I would agree. Get yourself the first job you can find (tech related or not) and get your basic needs in order. Then invest your time in learning a tech with some staying power.
Jumping between short-lived, volatile coding jobs isn’t a long-term solution.
Welcome to the life of all the non-engineers out there either up to their eyeballs in debt or otherwise unable to earn a living wage. How many secretaries or data entry workers or webmasters were made obsolete because you were paid to destroy their job?
However, you don't have to worry: free open-source tools and off-the-shelf B2B software will make your job obsolete long before AI is actually a time saver when writing code.
This is very fishy. You can get code like this by substituting words in identifier names for other words, but how can an algorithm trained on an English dataset "learn" that keywords like 'function' and 'class' are exempt from substitution? I know most people here have unwavering faith in the magic of deep neural networks, but you'd need _a lot_ of examples to deduce this with any certainty, regardless of how you do it.
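To make the substitution idea concrete, here is a minimal sketch of it. It illustrates the hypothesis only and says nothing about how GPT-2 actually works. PHP_KEYWORDS is a hand-written, deliberately partial list and substitute_identifiers is a hypothetical helper, which is rather the point: someone has to supply the exemption list.

```python
import re

# Partial, hand-written list of PHP keywords (assumed for illustration).
PHP_KEYWORDS = {
    "function", "class", "return", "if", "else", "public",
    "private", "new", "echo", "foreach", "as",
}

IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def substitute_identifiers(code, replacements):
    """Swap identifier words using `replacements`, leaving keywords untouched."""

    def swap(match):
        word = match.group(0)
        if word.lower() in PHP_KEYWORDS:
            return word  # keywords are exempt from substitution
        return replacements.get(word, word)

    return IDENTIFIER_RE.sub(swap, code)


template = "class User { public function getName() { return $this->name; } }"
swapped = substitute_identifiers(
    template, {"User": "Order", "getName": "getTotal", "name": "total"}
)
```

The output keeps the template's structure exactly, so it is only ever as syntactically correct as the memorized snippet it started from.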
So you're buying the idea that it looked at a bunch of code snippets embedded in various pages, managed to build a sub-model for PHP (separate from all other languages it should have encountered) and managed to generate a long, nearly syntactically correct program uninterrupted by English text?
And while it makes tons of obvious mistakes in English (which is a much more flexible and forgiving language), its PHP is somehow nearly syntactically perfect?
To me, this doesn't seem like an argument in favor of this model "understanding" English (or C, or PHP). It seems more like an indication that it memorizes way more information than the paper implies and then does clever word substitution.
Yes, I do think that it learned a model of PHP and JavaScript syntax. 40GB of text data is a lot, and PHP syntax is a lot simpler than English grammar, which it learns quite well.
See also the example in the paper of accidentally learning to translate into French even though they tried to remove French pages from the corpus.
I'm not sure what point you're trying to make. Do you think a neural net is not capable of generating the code in the gist? Because it's pretty easy to do that. The harder part that we're still trying to figure out is getting that code to do something meaningful.
Did you read the GPT-2 paper? Frankly, the English examples therein are much more impressive than this, and this certainly seems within the realm of possibility for GPT-2 based on some of the other emergent behavior of the model (e.g. inadvertent French translation skills).
Can someone shed more light on this? What is the true source? Was it generated via the unreleased full model by an OpenAI employee? Or did someone generate it with the released "smaller model"? Can we, the curious public, see the model and replicate the results?
This makes me think that something like Stack Overflow could be used to train a model that generates code to answer a question—and that software specifications that are decomposed into a series of requirements or "questions" could be fed into this model to produce code that's equivalent to a team of remote contractors.
Your model would be based on NLP/votes of the questions, NLP/votes of the answers, and separating the text from the code in both.
The fact that many markdown/code formatting tools have you select the language for syntax highlighting is useful for classifying code as well.
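A minimal sketch of the text/code separation step, assuming posts arrive as Markdown with fenced code blocks and an optional language tag after the opening fence. That input format is an assumption (real Stack Overflow data may be stored differently), and split_post is a made-up helper name.

```python
import re

# Build the fence marker programmatically so this snippet can itself
# sit inside a fenced block without ending it early.
FENCE = "`" * 3

# Opening fence, optional language tag, newline, body (lazy), closing fence.
FENCE_RE = re.compile(
    FENCE + r"([A-Za-z0-9_+-]*)[ \t]*\n(.*?)" + FENCE, re.DOTALL
)


def split_post(markdown):
    """Return (prose, code_blocks) for one post.

    code_blocks is a list of (language, source) pairs; language is None
    when the author did not tag the block.
    """
    code_blocks = [
        (match.group(1).lower() or None, match.group(2))
        for match in FENCE_RE.finditer(markdown)
    ]
    prose = FENCE_RE.sub("", markdown)
    # Drop the blank lines left behind where code blocks were removed.
    prose = "\n".join(
        line.strip() for line in prose.splitlines() if line.strip()
    )
    return prose, code_blocks
```

The language tag then works as a free label for classifying the code, and the prose and code halves can be paired with the post's vote count when building training examples.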