Predicting Stack Overflow Tags with Google’s Cloud AI (stackoverflow.blog)
163 points by amrrs on May 12, 2019 | 43 comments



Slightly off topic but I’ve been playing with this dataset for a while now to learn ML. It’s remarkably humbling.

No NLP approach I’ve tried has been able to predict question score based on content better than my baseline of “choose the mean”. (I’ve tried random forest on bag of words, AWD-LSTM, and Google AutoML so far).
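
For anyone who wants to try the same comparison, a minimal scikit-learn sketch (the file and column names are placeholders, not the actual pipeline described above):

    # Sketch: predict-the-mean baseline vs. random forest on bag-of-words features.
    # "questions.csv", "body" and "score" are hypothetical, for illustration only.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.dummy import DummyRegressor
    from sklearn.metrics import mean_absolute_error

    df = pd.read_csv("questions.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        df["body"], df["score"], test_size=0.2, random_state=42)

    vec = CountVectorizer(max_features=10_000, stop_words="english")
    Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

    baseline = DummyRegressor(strategy="mean").fit(Xtr, y_train)
    forest = RandomForestRegressor(n_estimators=100, n_jobs=-1).fit(Xtr, y_train)

    print("mean baseline MAE:", mean_absolute_error(y_test, baseline.predict(Xte)))
    print("random forest MAE:", mean_absolute_error(y_test, forest.predict(Xte)))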

The author of this post tried score prediction as well and pivoted to tag prediction after she couldn’t find anything that worked well: https://twitter.com/srobtweets/status/1125860523377979398?s=...

It's so crazy to me that which posts get popular might just be random. It makes me wonder about the correlation between post content and popularity on other social sites like HN and reddit.


Quick question: Did you try/find any correlation with post time/seasonality? I recall that in previous examinations of HN/Reddit top/viral posts, there was some amount of signal along that dimension. (I notice someone mentions this in the twitter thread too but I didn't see a response)

Additionally, you found average (all up) to be a better predictor than average per user or per category?

Apologies for grilling you here, I should frankly dig in myself, but if you happen to feel like indulging me it's much appreciated :)


I didn’t try adding in any metadata to the models yet (including time, date, and author). I was just trying to work with the text content of the posts.


I have some unproven factors for predicting certain popular low-scoring questions: https://meta.stackoverflow.com/questions/373412/require-conf...

A lot of questions that should have a very low score get fixed by people other than the original author. This may make your modeling more difficult because not only are there multiple authors, but the expected score of a question is not time-invariant. In fact, the final score of a question is a path-dependent composite score of potentially several intermediate formulations.


It's not that it's random, it's that the signal is too abstract for such simple methods to grasp it. Machine Learning which can predict Stack Overflow question scores is going to have to understand what's being asked, and how useful the responses are going to be for people interested in the question.


I’m not convinced. I tried manually predicting high score vs low score as well and couldn’t do so reliably.

The language model (viewable at https://stackroboflow.com) scores much higher accuracy predicting the next word on Stack Overflow than on other datasets like IMDb.

But on the IMDb dataset this approach led to state-of-the-art sentiment analysis results by pulling out the encoder and adding a custom head, whereas on Stack Overflow it didn't grok anything about the score.
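
For context, the "pull out the encoder and add a custom head" recipe looks roughly like this in fastai (a sketch assuming fastai v2's text API and a placeholder DataFrame; not the exact code behind stackroboflow):

    # Sketch of ULMFiT-style transfer: fine-tune an AWD-LSTM language model,
    # then reuse its encoder under a classification head (fastai v2 API assumed).
    import pandas as pd
    from fastai.text.all import *

    df = pd.read_csv("questions.csv")  # hypothetical "text" and "label" columns

    dls_lm = TextDataLoaders.from_df(df, text_col="text", is_lm=True)
    lm_learn = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
    lm_learn.fine_tune(1)
    lm_learn.save_encoder("ft_encoder")

    dls_clf = TextDataLoaders.from_df(df, text_col="text", label_col="label",
                                      text_vocab=dls_lm.vocab)
    clf_learn = text_classifier_learner(dls_clf, AWD_LSTM, metrics=accuracy)
    clf_learn.load_encoder("ft_encoder")  # transfer the fine-tuned encoder
    clf_learn.fine_tune(4)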


It may seem obvious with the benefit of hindsight, but I would have guessed that the better answers might be different in terms of word choice or style.


I'm not saying it was a silly thing to try, just that it's wrong to conclude from the result that the scores are random. I.e., I was mostly responding to "It’s so crazy to me that which posts get popular might just be random."


I’m not convinced they are randomly distributed but before I started playing with it I was fairly certain they weren’t.

Now all I know is I haven’t seen evidence that they aren’t.

I could see how there could be some element of luck in whether a post gets attention before getting buried. It depends on who's online when it's posted and whether it piques their interest, plus many other random factors that have nothing to do with the content of the post, like how quickly other posts come in afterwards to bury yours and how any automated algorithms decide to surface it.


Just curious - did you include only the title and text of the question, or also the tags and author history?


Only title and text. My original goal was to make a Chrome extension that would give you tips to help you craft a better question.

(I added tags to the language model later.)


Date also plays a huge part. Most Stack Overflow posts get very few votes in the month they were posted, but the count slowly grows as people search for a problem and find an answer. The oldest posts have the highest scores, so something like "How do I change branch in git" will get 4000 points, but posting the same or a similar thing today will get you a negative score.

Generally very short questions score badly because they will be stuff like "How do I make a video sharing website", but a very similar question like "How do I copy the current line in vim" will score well as long as it's not a duplicate.

Both of those questions look similar when you just look at the words and sentence structure. To tell the difference you have to understand what a video sharing website is and what kind of task copying a line in vim is, so you can judge which one is a reasonable question and which one is likely to help more future readers.
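
A quick way to see the age effect yourself, assuming you've exported questions with their creation dates and scores (the file and column names below are placeholders):

    # Sketch: average question score by creation year.
    # Assumes a hypothetical export with "creation_date" and "score" columns.
    import pandas as pd

    df = pd.read_csv("questions.csv", parse_dates=["creation_date"])
    print(df.groupby(df["creation_date"].dt.year)["score"].mean())
    # older cohorts typically show higher average scores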


I did try to rule out question age as a major source of error in the model by training on a sample of ~1 million questions all from ~2012. But it didn’t do any better on those.

And (theoretically) duplicates are deleted by the moderators.


I've used simple TF-IDF with a custom stemmer as a similarity scoring tool against the XDK documentation. It worked pretty well; I'd be curious to know how that'd fare vs. the neural model (i.e. I suspect it'd be better at identifying the rare, high-signal words that were excluded by the author's 400-most-common-word limit).

For my side project: as we received emails from developers asking for clarification/help with the APIs, the system would provide relevant documentation URLs so that anyone could pick up an inquiry and brush up on the API (and have a handy link if the docs could be leveraged in the response).
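
Roughly, that TF-IDF similarity lookup can be sketched like this with scikit-learn (the toy documents and the crude stemmer below are stand-ins, not the actual custom stemmer or XDK corpus):

    # Sketch: score an incoming query against a documentation corpus with TF-IDF.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["how to flash the firmware", "wifi api reference", "gpio pin mapping"]

    def crude_stem(text):
        # stand-in for the custom stemmer: strip a couple of common suffixes
        return " ".join(w.rstrip("s").removesuffix("ing") for w in text.lower().split())

    vec = TfidfVectorizer(preprocessor=crude_stem)
    doc_matrix = vec.fit_transform(docs)

    query = "which pins map to gpio?"
    scores = cosine_similarity(vec.transform([query]), doc_matrix).ravel()
    print(docs[scores.argmax()], scores.max())  # closest doc and its similarity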


Between this and the Pi calculation "achievement" it appears Google's all-in on the grassroots marketing these days.


Disclosure: I work on Google Cloud.

Both Emma and Sara are in our Developer Relations (aka Dev Rel) organization, like Kelsey Hightower and Felipe Hoffa.

They’re explicitly not in Sales, and their job is focused on explaining and demonstrating stuff to Developers. They go to meetups, give invited talks, write blog posts, and so on.

Sorry if that doesn't come across clearly. They work for Google and are paid by Google for that work, but they aren't measured by revenue or anything.


The most effective sales is disguised as “not sales”.


That’s called marketing. Also sometimes pre-sales with solutions engineering and onboarding.


No, we have those functions, too :). It really comes down to how people are measured for their work, and how it's directed. Nobody told Emma or Sara to do the things they did. Nor is anyone going to go see "how much pipeline got generated".

DevRel is measured on how many people they reach, but even that only vaguely. Marketing events, by contrast, are more often measured by pipeline. And you can be certain that onboarding (Professional Services) and Sales/Solution engineers are measured on revenue and time to revenue.


I'm not trying to be snarky (I swear every second sentence I say starts that way these days, sad spirit of the times), but to me, that sounds like sales...


By grassroots marketing you mean paying people to use their services?


Out of curiosity, does anyone ever use tags on Stackoverflow? Does anybody use search on Stackoverflow? Does anybody ever use in-site search on any website, instead of just using Google?


I'm a "heavy" Stack Overflow user https://stackoverflow.com/users/1348195/benjamin-gruenbaum and I use tags all the time.

Otherwise you're just playing "fastest gunslinger in the west", trying to answer generic stuff before anyone else rather than sniping questions you genuinely find interesting and that can teach you something.


I think there are two kinds of SO users - those that ask questions when they have problems, and answer questions when they solve them; and those that find questions to answer (for internet points, kudos, product support etc). Tags mainly help the latter.


Same here (also a heavy user), monitoring a tag is how I find questions to attempt to answer.

I guess tags and searching (when dupe-closing) are more essential if you primarily use SO to post answers.


I had a daily practice of looking at the tags for my languages of choice, `R` and `Python`, to see the kind of incoming questions and the variety of answers arriving at different points in time. It's like a break from my full-time work, say after I finish a task or something. Somehow this has helped me improve my coding.


Stack Overflow may be the only site where I ever use tags. Tags are useful there for finding questions you can answer. You can watch tags for topics you can probably reply to and browse the front page with a collection of interesting questions, or look up the new-questions page for a specific tag.

The same could probably be done with search terms saved to your profile, but tags are a much more organized alternative.


Yes, tags are mandatory when posting a question and useful when looking for questions to answer.

And yes, Google is worse than useless: it doesn't even search for what I type in, and when I force it to, it returns SEO spam websites, bot-farmed content, clickbait and advertising ahead of useful content. Or instead of useful content.


That's weird, Stackoverflow is almost always my first result. Perhaps by accident you clicked on those spam websites too often and now Google thinks you love them.

As for asking, I suppose if my question has never been asked before by someone else, I'd rather not know the answer.


You are expected to put a few tags on a question if you ask one, and normally some editor will mangle your English, removing the nuance your question was really about, and possibly also update the tags so they no longer quite fit your question. This is remarkably unappreciative; however, it is interesting how any website that gets established garners an army of helpers who do these things, Wikipedia being the classic example of this expertise-ism.

So even if tags are not your thing, whatever question you see on Stack Overflow will be fully tagged up. If the software doesn't suggest some when you ask, some keen person will add them in.

If you are interested in a particularly obscure software package that does not have its own Stack Overflow site, then you will find the square-bracket search option (for the tags) is a good way of finding out what is new in that niche and what cool features or tips you can borrow for your own project.

At a guess, the unrelated questions - 'top network questions' - are more likely to get clicks than the SO search box. So, to answer your question: no. Aside from the one DDG user on HN who maintains a gopher site and the really computer-phobic uncle who uses Bing! with Windows XP, the whole English-speaking world is using Google.

China is different.


I used to have an RSS subscription to an obscure SO tag I was interested in, years ago. The tag had maybe a couple of posts a week, and I would read the questions in the RSS reader and click through on whatever interested me.

So tags probably do not work well on a popular topic, but outside of that, they can be useful.


"Does anybody ever use in-site search on any website, instead of just using Google?"

It's certainly common for e-commerce. Especially when you can search by specific product attributes, shipping options, etc.


You don't use tags to search for answers. You use tags to browse questions on topics you know a lot about and have a high chance of being able to answer. I know nothing about most of the questions on Stack Overflow, so I browse the ruby tag, where a lot of the questions are in the range of what I could answer.

That's why they have the requirement for a tag: "Could someone be an expert in this tag?"


I used tags on MathOverflow because questions outside my area of expertise often may as well have been written in Latin.


If you want to experiment with question tag prediction on your laptop, you can also play with fastText : https://fasttext.cc/docs/en/supervised-tutorial.html
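
Following that tutorial, tag prediction boils down to a few lines once each question is written out on one line prefixed with its tags as __label__ tokens (the training file name below is a placeholder):

    # Sketch per the fastText supervised tutorial. Training lines look like:
    #   __label__python __label__pandas how do i drop a column from a dataframe
    import fasttext

    model = fasttext.train_supervised(input="so_tags.train", epoch=25,
                                      wordNgrams=2, loss="ova")
    print(model.predict("how do i merge two dicts in python", k=3))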


Kaggle ran a Facebook recruiting challenge a few years ago trying to do something similar, for those interested (I am, because I participated in it):

https://www.kaggle.com/c/facebook-recruiting-iii-keyword-ext...

I'm in a waiting room right now, but is anyone interested in summarising or commenting on the differences in performance and approach between the two? :)


For those interested, I’ve done similar work in the past. I’ve also written a guide along with explanations of how it works (on sentence classification):

https://github.com/lettergram/sentence-classification

It uses Keras and goes through everything from encodings, to the way various networks function, to hyperparameter tuning.
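
For a flavour of the approach, a stripped-down Keras sentence classifier looks something like this (a toy sketch, not the repo's actual architecture or data):

    # Toy Keras sentence classifier: tokenize, pad, embed, classify.
    import numpy as np
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras import layers, models

    texts = ["how do I sort a list", "why does my pointer segfault"]  # toy data
    labels = np.array([0, 1])

    tok = Tokenizer(num_words=20000)
    tok.fit_on_texts(texts)
    X = pad_sequences(tok.texts_to_sequences(texts), maxlen=50)

    model = models.Sequential([
        layers.Embedding(20000, 64),
        layers.Bidirectional(layers.LSTM(32)),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, labels, epochs=2)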


Similarly, predicting issue tags on Github would be interesting.


The links to the dataset do not work.


Not sure why you're being downvoted. I also cannot get to at least one dataset - https://storage.googleapis.com/cloudml-demo-lcm/SO_ml_tags_a... - which is linked to from the fourth paragraph directly above the "What Is The Bag Of Words Model?" section heading.


I pinged Sara. Probably just forgot to make it publicly accessible.


Thanks!


Article author here.

To access the dataset you need to be logged in with a Google account. Details here: https://github.com/GoogleCloudPlatform/ai-platform-text-clas...



