How Kaggle Is Changing How We Work (theatlantic.com)
51 points by antgoldbloom on April 12, 2013 | 28 comments



Kaggle is a great idea; I actually entered the industry partly through a data mining contest (pre-Kaggle). But I think the article overstates Kaggle's influence on the industry, at least at this point in time.

>the Kaggle ranking has become an essential metric in the world of data science. Employers like American Express and the New York Times have begun listing a Kaggle rank as an essential qualification in their help wanted ads for data scientists

No, it hasn't, and the nytimes job posting they link to doesn't list it as an "essential" qualification (they didn't link to the amex posting). I know many people with experience in the data science space, and very few of us have taken part in a Kaggle competition. It's not that there is anything wrong with Kaggle, but the pay is low for the effort required to differentiate yourself. Many modeling competitions I've seen demand an inefficient use of time in the "diminishing returns" part of the process, which means winning requires a lot of free time. I worked with a guy who won a couple of prominent data modeling competitions, and frankly I thought he was a mediocre data scientist (but a very hard worker).

I sometimes wonder who takes part in the competitions; and then I remember myself from six years ago, applying for jobs and looking for a way to make my resume stand out.


Agreed. In conversations with data scientists looking to hire, we've never discussed Kaggle aside from "wow they reeeeeally focus on that evaluation metric."

In fact, our conversations quickly turn to the Netflix Prize, where the first-place team won with an algorithm that could not be ported to production, and we discuss how poorly these competitions map to reality.

None of the data scientists I know hire based on Kaggle score. Several don't even think positively of Kaggle.


Why couldn't it be ported to production?


If I recall, BellKor made many, many models based on Gradient Boosted Decision Trees, Restricted Boltzmann Machines, and kNN. They tried many different feature subsets, added temporal weighting, and tried many reduced-dimensionality representations (SVD, NMF). They then stacked them all together into one final ensemble whose RMSE beat everyone else's on a hidden validation set.

In a production environment, this is probably an insane amount of transformation, feature extraction, and classification for marginal gains in precision (as defined here). But I'm only a year or two into building production-environment classifiers, and nothing at Netflix's scale (though not tiny either; it is a problem if I can't do feature extraction and high-precision/-recall* classification within a few milliseconds).

* - mid-90s, for a hard NLP/social-graph problem.
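
For anyone curious what that looks like mechanically, here is a minimal sketch of two-level stacking, assuming scikit-learn and toy synthetic data (the real BellKor ensemble blended hundreds of models, not two):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    # Toy stand-in for the ratings data.
    X, y = make_regression(n_samples=3000, n_features=20, noise=10.0, random_state=0)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
    X_blend, X_test, y_blend, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    # Level-0 models (in the Prize: GBDTs, RBMs, kNN, SVD variants, ...).
    base_models = [GradientBoostingRegressor(random_state=0),
                   KNeighborsRegressor(n_neighbors=10)]
    for m in base_models:
        m.fit(X_train, y_train)

    # Level-1 "blender": regress the held-out target on the base predictions.
    blend_features = np.column_stack([m.predict(X_blend) for m in base_models])
    blender = Ridge().fit(blend_features, y_blend)

    # Score the blend on a third, untouched set with RMSE, the Prize metric.
    test_features = np.column_stack([m.predict(X_test) for m in base_models])
    rmse = mean_squared_error(y_test, blender.predict(test_features)) ** 0.5
    print(f"stacked RMSE: {rmse:.3f}")

Every extra base model is another thing to fit, store, and serve at request time, which is where the production cost piles up.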


It's funny to me that people are so quick to pooh-pooh the complicated modeling done for the Netflix Prize. When did production-worthiness become the only important thing? It's like saying Watson was useless because it can't play a concurrent game of Jeopardy with thousands of people on the web.

Just like in research, it turns out that relaxing real-world constraints on a problem is often a great way to make progress. I would not have to search long or far to provide much worse uses of million-dollar grants/projects/big-data-software.


Don't mean to pooh-pooh it. I love scrambling for models and methods. On the latest problem I worked on, I did the same thing: implementing papers, trying to avail myself of semi-supervised or reduced-dimension representations, tweaking models and features every which way... it's illuminating work, and good things come out of it.

But, in the end, companies like Netflix, or the one where I work, are ultimately looking for ways to make X happen easier, better, and cheaper. Hopefully the smart papers then go on a shelf or are easily Googleable, and the rest of us get to learn from their efforts.

I can fit long-term and short-term goals in my brain, too, mister.


I am a 'Kaggler'; I have done well (prize winner) in some past competitions, and I could not agree more.

In the last week of a competition, you go into kitchen-sink mode, trying anything and everything to squeeze the last few points out of your models. The objective of the competition, from the competitor's standpoint, is to win, not to create a solution that could be deployed into the sponsor's production environment. Ongoing maintenance of the model is not a consideration.

As far as the Netflix competition goes, though, the final solutions did help publicize the potential of ensembling and RBMs; I am grateful for that. My personal approach to the competitions is to use them as an opportunity to try out new modeling techniques (e.g. RBMs, deep belief nets, etc.) on real-world data. It has the added bonus of potentially paying off with prize money (as long as you don't get screwed[1]).

I would like to hear from past Kaggle sponsors and see what they have done with the winning models.

[1] http://www.gequest.com/c/flight/forums/t/4284/acknowledging-...


Largely what textminer said. In fact, they didn't use anything from the grand-prize winner, because it would have required too much engineering effort and because their focus on the problem had shifted. They did, however, use the two best methods from the "progress prize", so I wouldn't call the contest a failure.

>If you followed the Prize competition, you might be wondering what happened with the final Grand Prize ensemble that won the $1M two years later. This is a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line. We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, our focus on improving Netflix personalization had shifted to the next level by then.

From http://techblog.netflix.com/2012/04/netflix-recommendations-...


I think Kaggle is great, but I don't really see "work full time for months competing with 80,000 other people in the hope that your data model turns out 0.001% more accurate than everybody else's [which at some point becomes largely a function of the initial seed on your RNG when you fit the model] in order to get a single paycheck" as a workable model for employment.
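
To put a number on the seed effect, here is a toy illustration (synthetic data, scikit-learn; not any particular competition): the exact same model, refit with different seeds, moves held-out accuracy by a spread that can dwarf the gap at the top of a leaderboard.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # Same model, same data; only the RNG seed changes.
    scores = [RandomForestClassifier(n_estimators=100, random_state=s)
                  .fit(X_tr, y_tr).score(X_te, y_te)
              for s in range(10)]
    print(f"accuracy spread across 10 seeds: {max(scores) - min(scores):.4f}")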


Yes, you succinctly made the same argument I was trying to make. I think Kaggle is pushing the industry to be more efficient, but in an ironically inefficient way.


Don't overlook those who do it for reasons other than earning money.

Look at how many people solve complex crossword and sudoku puzzles every day for free...


Just another exploitation of creative labor, if you ask me. The people competing for prizes may be doing it for fun, but as you can see from the industries using it, these are serious, big-money applications. Of course, nobody is blind to this, and a majority of applicants are going to go for the cheapest and easiest solution for a shot at the gold.


A lot of commenters here seem to have missed this: "The really disruptive thing about Kaggle, though, comes through the company's new service, Kaggle Connect. Here, Kaggle acts as a match-maker, where customers with a specific problem can hire a specific data scientist well-suited to their problem; candidates are drawn from the top tier of Kaggle participants: the top 1/2 of 1 percent, or about 500 data scientists."

The competitions are a good way to learn, practice, and get feedback on your methods. Kaggle Connect is where you can make a good living while doing a range of interesting work.

(I work for, and compete on, Kaggle.)


The article is sorely mistaken: a job posting that says "Kaggle participation will give you an advantage" does not mean Kaggle provides "essential metric[s]." That's just one example of the article overstating Kaggle's case.

Also love the irony of explaining the importance of a data science competition host by citing Tom Friedman, the King of Generalizing Anecdotes.


It's interesting how differently the data science world has embraced this model vs. designers, a vocal set of whom seem to scoff at any mention of 99designs. (I am currently competing in a Kaggle competition, and the competitive factor is what makes it fun.)

That said, I doubt more than a handful of people could make a living off competition winnings alone.


Kaggle is pivoting, and they don't intend "making a living on Kaggle" to mean contest winnings. Instead, they are offering a platform that prospective employers can use to contract with highly ranked Kaggle members.

And those gigs pay hourly.


Good designers scoff at 99Designs because the expected payout is absurdly low for the amount of skill and effort usually required to win a "competition". The price structure undermines the pricing of struggling freelancers trying to break out in the low- to mid-tier end of the market. Bean-counting execs who don't care about design see no reason to pay $X000 or $XX000 to a pro designer with legitimate expenses when they can get several usable designs (I won't say "good") on spec for only $Y00. Never mind that many of those crowdsourced designs come from kids and people overseas using pirated versions of Photoshop.

Besides the economic damage wrought by 99Designs, a lot of designers have been left with a sour taste in their mouths from the blatant copying/IP theft that often wins. If I come up with great idea A, and you do a nearly identical rendering of my idea but in what happens to be the buyer's favorite color... who do you think will be selected as the winner? Unfair outcomes are pretty common, especially because the people doing the choosing often wouldn't know a good design if it bit them in the face.


The cynic in me says, "How Kaggle wastes the time of many talented individuals while enabling corporations to give the finger to their staff."

I should start offering prizes for the best start-up. I'll give you a $20k prize and then turn around and sell it for $1M or more. You can put it on your resume. Win-win.

EDIT: Sure, it's good for various things, but it is so detached from reality that it's a bit out in the weeds. The cynic in me just can't get over the value handed over by competitors to the sponsors.


> The cynic in me just can't get over the value handed over by competitors to the sponsors

I would think you don't have to hand over your algorithm if you forgo the prize money. Another way of looking at it, as alluded to by other posters: the "winning" data model may not be the best, so there might be less value handed over than you think.


I can't see how Kaggle is good for the data scientist (though I can certainly see how it is good for the hiring companies). It seems like design spec work. I find it horrendous that people do work for a company and most of them never get paid for it. It is disgusting. The #1-ranked Kaggler, according to the article, has only been paid 6 times. Let's put it into perspective: imagine if companies that wanted web developers stopped hiring them and started using a similar service, in which everyone would build a version of the application (with only the frontend visible to the other contestants, to foster competition), the company would choose the best one, and only the winning team would get paid. This is capitalism at its worst, and in the long term it is severely harmful. In a modern, work-rights-aware society this should be illegal; spec work, that is... At the very least, the communities should criticize it, in my opinion, and not embrace it.


There was a similar website where customers posted a bounty for the cheapest travel itinerary meeting certain conditions, and anyone could submit a plan meeting those conditions to try to win the bounty. Does anyone remember what it was called?


Hi axusgrad, that's us, Flightfox. Like Kaggle and 99Designs, we're also Australian.

In reference to a few comments here about the inefficiency of crowdsourcing, we're moving more of the work to the back-end (i.e. after an "expert" is guaranteed the bounty). Instead of awarding an expert on Flightfox, you now "hire" an expert based on an initial "pitch".

So, instead of receiving work from, let's say, 99 experts, we're aiming to provide great results from only 3 experts. To do this, we're working on segmenting and profiling both customers and experts. Think Uber/Amazon rather than eBay.



Kaggle is pretty important. But concentrating only on accuracy tuning in machine learning (ignoring data collection, data curation, and interviewing stakeholders to find real business needs) is like celebrating only (premature) optimization in software engineering.

But don't get me wrong: there are lots of top-notch results in the contests. It's just that they test only one facet of what is needed in a data scientist.


I wish there were a similar site where participants collaborate rather than compete to solve interesting data problems.


Big numbers of data crunchers in a big game. I wonder how many of the top scientists are themselves spurious, i.e. random leaders, given that there are 85 thousand of them. Is Kaggle really a place for science, or a gamble? You know, we have seen things like this in every prediction market, e.g. finance.
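
A quick back-of-the-envelope simulation of that worry (pure toy numbers, nothing from Kaggle's actual data): with enough entrants guessing at random, the luckiest one looks skilled.

    import numpy as np

    rng = np.random.default_rng(0)
    n_entrants, n_test = 85_000, 1_000  # ~85k competitors, 1k hidden test labels

    # Every entrant predicts a balanced binary test set by coin flip.
    truth = rng.integers(0, 2, n_test, dtype=np.int8)
    guesses = rng.integers(0, 2, (n_entrants, n_test), dtype=np.int8)
    accuracy = (guesses == truth).mean(axis=1)

    # Each entrant's expected accuracy is 0.50, but the maximum of 85,000
    # draws sits several standard errors above chance (roughly 0.57).
    print(f"best random entrant: {accuracy.max():.3f}")

So a high leaderboard position alone doesn't separate skill from luck; you'd want repeated performance across competitions.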


I find irony in the cognitive dissonance betwixt these comments and these: https://news.ycombinator.com/item?id=5540841


I'm not anti-Kaggle, but this is not the future.

Credible people will do one or two competitions because they care about the problem, and because they want to establish themselves well enough to get better jobs, etc. If it works for them, great. If it doesn't, they'll get bored and quit. In that case, the best people leave and you have a ghetto.

Right now, Kaggle makes sense because "data science" is still an ill-defined field that a lot of people want to get into; no one knows what it means or what it takes to get in, so people will try things out to see what happens.

If Kaggle wants to stay in play for the long term, they'll need to get really good at connecting talented people with very high-quality jobs.

There is something that I don't think all of the hiring-related startups get yet: as things stand, there's a real shortage of quality jobs. That's a 5-year existential threat to the whole business model. What happens when people realize that these sexy startup jobs are just corporate jobs with better marketing? What happens when the dream dies? Right now, high-quality jobs are too rare for the hiring startups (unless they genuinely change the economy) to prevent people from getting just as disillusioned with these new services as they are with headhunters. Now, that's not because there's an intrinsic limit on interesting work (see: the Lump of Labor fallacy), but we'd have to overthrow the management of a whole industry to change that.

Now, data science. It is attractive right now because it carries the promise of what software engineering was supposed to deliver but, for most people, doesn't: interesting work, implicit respect, autonomy. I feel like "data scientist" at many companies means "software engineer who gets dibs on the most interesting projects". I'm afraid that title inflation in the data science field might dilute that, however.

What we really need is to fire 90+ percent of software managers and trust engineers to pick their own projects and call some shots. I don't know how to turn that into a specific startup idea, but it will solve a lot of problems.



