For recruiters who aren't sure where to find these mythical post-docs who do science (with "Data"):
1. Go to a scientific journal that involves serious computational work.
2. Look at the last author on an article. This is usually the lab boss.
3. Google "$LASTNAME lab webpage".
4. Look for graduate student and post-doc profiles.
5. Email them and offer more than the 38k they make for working 60+ hrs/wk.
6. Now you have a "Data Scientist"
Alternatively, just Google "Argonne National Labs".
There is no shortage of Data Scientists with years of experience (every scientist I've talked to has guffawed heartily and asked "what other kind of scientist is there?"). Yes, I get that the term is just a very unfortunate choice of words (on the order of General Linear Model vs Generalized Linear Model), but the point is that no one knows to brand themselves this way. Recruiters are probably just frustrated by the lack of linked-in-ness, and the fact that most people competent at "Data Science" don't know what that means.
Here are things virtually every PhD in a computational discipline will have done:
1. Written code. It might be just Matlab, Python, R, etc.
2. Written up and communicated the results with compelling visualizations, both orally and in writing.
3. Published a paper at some point demonstrating they can do this.
4. Dealt with failure and hunches that didn't work out after weeks of work.
On the hiring part, after you've found these mythical creatures:
As someone who does science (with Data, even!), here is the best question I can think of to ascertain my competence: "Diagram an approach to answer a particular question, with emphasis on ruling out competing explanations and demonstrating whether a result is true. Sketch some hypothetical visualizations you'd use that show what your result is, how large the effect is, and how sure you are that you're right."
I should be able to do that. On the flip side, if an employer can reason with a post-doc about that process in an intelligent fashion, they would be excited about leaving academia. You might need to teach them to use a cluster, EC2, whatever, but you will not have to teach them to ask and answer questions.
I absolutely agree. I work in bioinformatics, and everyone I work with fits the bill the blog post describes.
I would add that hunting people in the relevant communities should work, as well, for example biostars.org or seqanswers.com for bioinformatics, and then check out the posts of the top-contributing members.
I also see more bioinformaticians storing their work on GitHub! So far I can name about 5 but I think that's going to grow in the next few years. There's no central point storing these (not sure how to structure that) but at some point there will be.
Erm... isn't a 'Data Scientist' just a shitty name for a Statistician? Maybe I am biased, but it seems like somebody just came up with an ass-hat new name for my profession.
There are enough people who come from top notch statistical backgrounds who don't necessarily have a firm grasp of whatever fancy machine learning heuristic is in fashion today (which is not a deal killer), or don't know how to code (which probably is, depending on the company).
I think I'm in the minority, but I really hate the term "data scientist." It seems usually to mean "senior statistician, but with training and credentials expected of an RA" (to clarify, that isn't meant as a comment on the original article). I would be especially skeptical about hiring someone who self-identifies as a "data scientist": people are trained as Statisticians, Biostatisticians, computer scientists, various subspecialties that end in "-metrician" (e.g. Econometrician, Psychometrician, Cliometrician), etc; no one is trained as a "data scientist." Unless you're hiring someone really junior, you want the "data scientist" to have a specialty -- anyone good will have one.
But the best way to find a good "data scientist" is probably the best way to find a good programmer -- be one yourself; tap your professional network; and hire people as consultants/freelancers on non-critical projects before making a real commitment. Identifying someone with a deep skill that one doesn't possess oneself is pretty much impossible. And on the flip side, I have trouble imagining that someone who really knows what he or she is doing would want to work for some unknown.
If you want someone to scrape and clean data with Perl and generate some scatter plots and histograms, look for undergrads with good grades who worked as Research Assistants, or recent grads working as RAs at consulting firms, research centers, governmental agencies, or think tanks. They'll do great (by and large), they've had some informal training from a more senior researcher to help put everything in context, and faculty often steer their best students into those sorts of jobs, so there's a pretty strong quality screen. I'm sure there are other places to find people too.
I think most people do, but I've never heard a good term for the job. It's like "we want someone who can take large amounts of data and do something awesome with it". What do you call that?
>Unless you're hiring someone really junior, you want the "data scientist" to have a specialty -- anyone good will have one.
Not sure I agree with this, I want people who are well-rounded. I think it's great to find someone who specialized at something, but I'd want that person to be able to grow the rest of his abilities up to par. Example: let's say you specialized in machine learning. If you don't understand building scalable systems, you can't take a holistic view of a project; how will you know if your algorithms can scale to a production environment? Or, if you can't program well, you can't write code to actually get your algorithms into place. Or if you can't understand the business side of things, you won't be able to build trust with the rest of the company, and hence you won't be able to contribute.
That's close to what I used to be called. It's a very overused and vague term, I'd argue far worse than the already bad 'data scientist'. Search job postings for 'analyst' and see the wide variety of jobs that turn up.
>"we want someone who can take large amounts of data and do something awesome with it"
Yeah, that's kind of the problem; until "awesome" gets defined, it's awfully hard to be specific about what the company needs. But this is why it's going to be really hard for someone that couldn't do part of the job themselves to hire effectively.
As to the second point, I guess it's going to depend on the business and on whether the statistical component is crucial or just peripheral. It's nice if everyone is pretty broadly well trained. But if you're a hedge fund building algorithmic trading rules, you need different people than a marketing research firm or a litigation consulting agency.
You're not alone. I have grumbled about it for years. I know Hilary Mason and others have advocated for the term and I am very sensitive to the hardships endured by working in such an interdisciplinary manner, but it really is a goofy name.
It has always seemed to me just an excuse to run away from the "Artificial Intelligence 2.0" moniker and all the negative connotations that would imply. I dislike the label "Data Science" because there is not really much "Science" with a capital-S being done by people who adopt the moniker and with whom I have had chance to meet.
I have always thought that "Knowledge Engineer" was a more descriptive and useful term for what they actually do. The more abstract you get it seems to fall into the field properly known as Computational Mathematics.
All good, but I think statistics, as I understand it, is only a facet of the work involved. And the outcome or goal of their work seems generally to be the creation of a knowledge-based system for predictive analysis; to derive ontological meaning from numerical data (mostly about humans and human activity).
A not-bad article on Wikipedia, if a little short, kinda captures it for me [0]:
* Assessment of the problem
* Development of a knowledge-based system shell/structure
* Acquisition and structuring of the related information, knowledge and specific preferences (IPK model)
* Implementation of the structured knowledge into knowledge bases
* Testing and validation of the inserted knowledge
Right, but the question is whether statistics is the most important facet of the work involved. Everything you've listed except "maintenance of the system" is part of statistics (which I'm going to define broadly as: 'what statisticians do'), and system maintenance is only missing in a "keep servers online" sense, not a "make sure that the concept we implemented remains appropriate" sense.
"1. Tell me about some peer reviewed papers that you published as first author?"
It could be that the author's first criterion is an important predictor. But it seems to me that unless somebody is actually in academia, publishing papers is more akin to a hobby than a professional qualification, especially given the inherent bias against unaffiliated authors.
Edit: On re-reading, the author (hardtke) writes the above when talking specifically about weeding through post-doc applicants. So my quote is out of context and my criticism unfair.
I'd amend this to "tell me about an academic publication on which you've been a major contributing author". Depending on the field and circumstances, first author doesn't always mean the same thing.
I will also say that most people that will excel at data science may not ever be pushing the envelope of statistical methods enough to warrant writing a research paper. Being able to apply and understand the state of the art algorithms is inherently a great skill to have.
The world is a dirty place, and just like there are thousands of applications that just need a developer to implement a CRUD app to expose an API on the web, there are tens or hundreds of domain specific problems where a 'data scientist' can implement the state of the art data cleaning and machine learning algorithms.
Understanding and applying graduate level textbook statistical methods to a large dataset (bigger than an Excel file) might be boring to some research scientists, but it is cool as hell to see what the data is saying.
Yeah, I fail that criterion hard (I've had five rejections, does that count?). Nonetheless, while the author is probably showing his (or her) biases, there's definitely a nugget of truth there. For a more coding focused candidate, an equivalent question would be questions about software you have designed, built and promoted (as that's essentially what the question is asking).
Basically, I want to see if people can finish stuff. For software engineers, tasks tend to be easily defined and of shorter duration. At least at Bright, we have Data Scientists working on multiple simultaneous projects that take a few days to several weeks to complete. Knowing when you are done with a complicated, possibly open-ended, project is very difficult.
>For software engineers, tasks tend to be easily defined and of shorter duration.
For crappy ones doing commodity work, this is true. For good software engineers, not so much.
I'm a data scientist by pedigree (before it was called that) who's spent the past few years in "regular old" software engineering (and probably heading back in the DS direction). Trust me that software engineering done right is as subtle and talent-intensive as DS.
The problem is that SWE's are terrible at marketing themselves as a group and generally get too little respect and autonomy to have architectural successes.
Data Scientist simply means a developer who also knows statistics. Here's how to sniff them out:
1. What is your favorite programming language and why?
2. Which is your least favorite programming language and why?
3. Explain how Bayesian spam filtering works.
4. How do you determine if a given data sample is statistically significant?
5. Suppose we ran a brand advertising campaign on radio and on television. Neither ad campaign uses special tracking codes or custom landing pages; both ads simply mention the web site address. What tools and methodology would you use to measure the response rate from radio ads vs. TV ads, and predict the total response that will be generated from each ad?
These questions would get you: recent grads, who have not forgotten all the math; hackernews crowd, who reviewed pg research from years back; developers, who have taken some popular AI/ML class.
And while this is all good, this is certainly not enough.
In reality applied data science is not different from any other area. You get proficiency with experience, as usual.
>4. How do you determine if a given data sample is statistically significant?
That's a weird question - what am I to prove? Are there differences in the dataset? What does the data look like? Is there a second dataset against which to compare? What kind of data is it, ordinal or nominal? Is the dataset normally distributed?
I can't just say "that data sample is statistically significant" without a background against which to compare it!
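To make that objection concrete, here is a minimal sketch of one way to phrase the question so it has an answer: compare two samples with a permutation test on the difference of means. The function name, the sample values, and the choice of test are all assumptions for illustration; nothing in the thread says which test the question had in mind.

```python
import random
from statistics import mean


def permutation_test(sample_a, sample_b, permutations=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an estimated p-value: the share of random relabelings whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (permutations + 1)


# Made-up measurements for two groups (illustrative only).
control = [10.1, 9.8, 10.4, 10.0, 9.9, 10.3, 10.2, 9.7]
treated = [12.0, 11.8, 12.3, 11.9, 12.1, 12.4, 11.7, 12.2]
p_value = permutation_test(control, treated)
print(f"p-value: {p_value:.4f}")
```

Note that the function needs two samples (or one sample and a stated null hypothesis) before "significant" means anything, which is exactly the missing background.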
>3. Explain how Bayesian spam filtering works.
I would say the majority of non-web developer scientists don't know how this works, why should we?
If you've had any exposure to Bayesian statistics, it's very easy to explain the approach. You're basically tokenizing the email by word, estimating P(email is spam | token) for each token, and combining those estimates into an overall spam probability.
I would expect any data science candidate to have (at least) a basic understanding of Bayesian statistics.
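A rough sketch of that idea, assuming a plain naive Bayes classifier with whitespace tokenization and add-one (Laplace) smoothing. Real filters differ in how they tokenize and combine per-token probabilities; the function names and the tiny training set here are made up for illustration.

```python
import math
from collections import Counter


def train(messages):
    """Count tokens per class from (text, is_spam) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    message_counts = Counter()
    for text, is_spam in messages:
        message_counts[is_spam] += 1
        word_counts[is_spam].update(text.lower().split())
    return word_counts, message_counts


def spam_probability(text, word_counts, message_counts):
    """Return P(spam | tokens) under the naive independence assumption.

    Assumes the training data contained at least one message of each class.
    """
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    total_messages = sum(message_counts.values())
    log_score = {}
    for label in (True, False):
        # Start from the log prior P(class).
        score = math.log(message_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            # Add-one smoothing keeps unseen tokens from zeroing the product.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        log_score[label] = score
    # Normalize the two log scores into a probability.
    top = max(log_score.values())
    spam = math.exp(log_score[True] - top)
    ham = math.exp(log_score[False] - top)
    return spam / (spam + ham)


# Hypothetical training messages.
training = [
    ("win cash prize now", True),
    ("cheap pills win big", True),
    ("meeting notes attached", False),
    ("lunch tomorrow at noon", False),
]
word_counts, message_counts = train(training)
spammy = spam_probability("win cash now", word_counts, message_counts)
hammy = spam_probability("lunch meeting tomorrow", word_counts, message_counts)
print(f"{spammy:.2f} {hammy:.2f}")
```

Working in log space avoids underflow when a message has many tokens, which is the usual reason to sum logs rather than multiply probabilities.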
It's just that an e-mail filter is the last thing most scientists working in biology, chemistry, physics, medicine are actually working with, so it's (in my opinion) a rather unfitting example, you might be selecting against these people who fit the bill of "data scientist" probably better than the average "I once did something in Excel"-web guy.
"If you want to find a Data Scientist, find yourself a disgruntled postdoc toiling away on brilliant scientific research, but failing to land a professorship because … all the professor jobs are taken!"
That quote is more than just humorous, it points out one compelling answer to the question in the title of the post -- Perhaps all scientists are data scientists, you simply have to lure them into a new domain of study.
Possibly. I have certainly made that point before (and it remains my most upvoted post, so certainly other PhD students and postdocs agree with me).
I do think that people need to stop expecting to get a physicist, statistician, economist or applied mathematician (every graduate analysis job I've seen had that wonderful qualification) as most of those people already have really well paying jobs in finance (or satisfying careers in academia), and open their eyes to the fact that for many data science roles, social scientists are probably a better fit.
If you're dealing with numbers generated by people's interaction with a website, a handy background to have is in some form of quantitative social science. (I am of course horribly biased, by being a quantitatively trained social scientist).
In any case, I expect data science to go through a dot.com like boom and then a horrible crash, so it may be a good idea to get the skills (and possibly qualifications) now, while the sun still shines and people are still hypnotised by the promise of big data, rather than the tedium and slog of extracting value from it.
The thing to note is that data scientist interviewing culture at companies is a function of who their first data scientist is, or of what the founder (or whoever) thinks the first data scientist should look like. This is why you have a wide variance from companies that look at people only with applied math and a background in Clojure to companies that require data scientists to be well versed in Java to companies that require statisticians. It has reached the stage where I think it is easier to sell yourself as an awesome scaling/data engineer and quietly do whatever data analytics you want when you get in.
FWIW, I agree with you that quantitative social science is a great background to have. My data science team lead has a background in Neuroscience and he seems to be having fun applying that intuition to recommendations.
To the reply below (for some reason I can't reply directly to you, which is weird).
Yes, I agree that on average, social science people tend to like maths less. However, you're not looking for the average social scientist, you're looking for the ones who weren't satisfied with SPSS (like me). The other, more common kind, are not likely to end up in a data scientist position (though perhaps this may change).
Hilariously enough, after I learned R, SPSS made way more sense to me, while before that it just seemed way too easy to generate reports without a shred of insight.
My wife works in marketing in Research and Analytics. Hard science people do the math (including creating the particular profiles/techniques) much better, on average, than the social science people. While they may have used menus of SPSS a bit, they hit a wall when they hit the scripting language, or at least learn at a glacial pace from the smaller sample sets I've seen.
Yeah, another social science guy here. My employer has found my statistical and behavioral knowledge very valuable, but I'm not much of a coder yet so it doesn't seem like I'm yet able to market myself as a data scientist. I've been doing some serious self-directed training in Python and trying to get better at R (moving from SPSS and SAS)
>most of those people already have really well paying jobs in finance (or satisfying careers in academia)
I've always taken the data scientist distinction to be the startup world's answer to the "quant" designation. Quant jobs are a lot better than "just a" programmer jobs, even at hot startups, so I see the "data scientist" title as an answer to that. It's a startup quant.
Yeah, I've actually heard people use the terms interchangeably. Although 'quant' is almost exclusively used in the finance world. I interviewed for a couple of those jobs and got caught up on stuff I've never heard of before, like pricing bonds and combining interest rates.
Your article seems to be written from a phd-centric, low-industry-experience perspective. I'd reconsider this bias if it's real, finding good data scientists is hard and you may be eliminating good potential candidates.
I've been what you could call a data scientist for over five years now, and worked with dozens of people you could also call data scientists with different degrees and varying experience. From my sample, I don't think PhDs add much, if any, value over masters degrees after a year or two of experience (I'm biased here, I don't have a PhD). I think industry experience can add a tremendous amount of value you can't get from a degree, but it comes at a cost premium. Not related to your article: I've also found the best people have physics or computer science + applied math backgrounds.
I think there are too many ways to miss the mark when it comes to hiring data scientists. Looking for PhDs only is just as dumb as looking for people with 10 years of experience with Hadoop. There are some important things that they need to know, sure, but where they come from is next to meaningless.
One problem with hiring academics is that they can be far too focused on their subdiscipline, as this is necessary to break new ground. An astrophysicist might be amazing at identifying peaks in tremendous amounts of data but have no idea how to do basic analysis of a graph.
When we interview someone I like to start talking/asking about matrix decomposition (eigenvectors, SVD, etc.) and see how excited they get.
I consider knowing about things like MDS and pagerank a bare minimum and if someone can bring up a more recent or esoteric application (locally linear embedding, graph partitioning) they stand out.
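For anyone who hasn't seen how PageRank connects back to eigenvectors: the ranks are the dominant eigenvector of a damped link matrix, and power iteration is the simplest way to approximate it. The sketch below is a toy version under my own assumptions (a dict-of-lists graph, damping of 0.85, a fixed iteration count), not a production implementation.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    links maps each node to the list of nodes it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank sitting on nodes with no outlinks is spread over every node.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            targets = links.get(node, [])
            for target in targets:
                # Each node splits its rank evenly among its outlinks.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-page web where most pages point at "c".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Each pass is one multiplication by the link matrix, so the "at scale" follow-up question is mostly about doing that multiplication on a sparse, distributed graph.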
Asking about the nuances of estimating probability densities from data (bin/histogram vs kde etc) is another good one and something that stops a lot of the cooler theoretical statistics and information theory from getting used (or used well) in the real world.
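A small sketch of the two estimators being contrasted, written with the standard library only. The bandwidth default is the normal-reference rule of thumb (1.06 times the sample standard deviation times n to the power -1/5), which is my assumption for illustration; the bin count and the sample values are made up as well.

```python
import math
from statistics import stdev


def histogram_density(data, bins):
    """Return (left_edge, density) pairs for equal-width bins.

    Assumes the data are not all identical, so the bin width is positive.
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        # The maximum value falls on the right edge; put it in the last bin.
        index = min(int((x - lo) / width), bins - 1)
        counts[index] += 1
    return [(lo + i * width, c / (len(data) * width)) for i, c in enumerate(counts)]


def gaussian_kde(data, bandwidth=None):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(data)
    if bandwidth is None:
        # Normal-reference rule of thumb; a poor choice for multimodal data.
        bandwidth = 1.06 * stdev(data) * n ** (-1 / 5)

    def density(x):
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        return total / (n * bandwidth * math.sqrt(2 * math.pi))

    return density


# Made-up sample values.
sample = [1.2, 1.9, 2.1, 2.4, 2.5, 2.7, 3.0, 3.3, 3.8, 5.1]
hist = histogram_density(sample, bins=4)
kde = gaussian_kde(sample)
print(hist)
print(round(kde(2.5), 3))
```

The histogram's answer depends on the bin count and bin origin, while the KDE's depends on the bandwidth, so both have a tuning knob that can change the picture; that trade-off is the nuance worth probing.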
Both of these questions get more at "do you understand the basic building blocks that come up over and over again" than "is your research groundbreaking and new." Asking about techniques to do the above at scale also ups the difficulty of the interview.
"Data Scientist" is such a weird title for somebody who is basically a statistician with good IT skills. I think in some market verticals they're simply called "Quantitative Analysts" and make very good money.
In a previous job I did lots of analysis on very large (at the time) data volumes: millions of structured or unstructured records, homogeneous and heterogeneous datasets. Lots of aggregation, sifting, sorting, simplifying, deduping, summarizing, etc. All in support of similar kinds of things that "Data Scientist" positions seem to be intended to support. But the output was not a statistical model, a machine learning exercise, or anything similar. It was the distillation of gigabytes of data into a handful of slides and a report. Usually with a virtuous cycle of feedback directly into software development to improve and expand the next go-around.
But almost no statistics. Very very little, and what I did was very basic stuff.
What is that kind of job called? In my day we called it a "Data Analyst" but I don't see that around much.
I'm going to make a prediction: "Data Scientist" as "Senior Statistician" is going to be short-lived. I don't think they're going to provide the value companies think they will in most cases. "Data Analyst" is much more general purpose and useful across domains, except most Data Analysts don't have proper statistical training.
A Data Analyst with statistical training would be a much more useful tool to an organization seeking to make sense out of large volumes of data than a Senior Statistician as they'll have a much wider variety of tools at their disposal than just looking at the world through the statistics lens.
Bonus, jobs advertising "Data Analyst" can demand things like machine learning AND entity extraction AND automatic summarization AND data sanitation AND automatic correlation analysis AND automatic colocation analysis etc.
Most of the jobs I've seen looking for Data Scientists are for companies that are probably going to end up using them as high-priced Data Analysts, except the job reqs are all wrong and the candidates that get hired are way overqualified.
But this role is still evolving I suppose, IBM [1] views it as an evolution from the business/data analyst. So they definitely seem to be on the side of not so much statistics and more analysis.
I'm going to throw it out there that a Statistician with poor IT skills nowadays is like a carpenter with poor measuring, cutting and hammering skills.
I'm actually on the other side: currently writing my bachelor's thesis in statistical physics, having a lot to do with probability stuff, statistics and data. I'd like to take a year off and work in a company to get some real life experience before I start my Master's degree, because I don't want to stay in academia after that.
But I have no idea where I can find companies which could need my abilities and where I could work for 0.5-1 year. Any ideas? I live in Germany.
Most shops are willing to let you work remotely. Anyone that's down with stat. mech. has a solid base for quantitatively attacking most problems. Luckily the 'domain' knowledge required is general human consumer behavior, which you'll know quite a bit about already. And a surprising amount of that can be reasonably modeled by a microcanonical ensemble.
Are people hiring freelance/part time for data science work? I'm a physics professor at a small liberal arts college, and would like to move my career in the direction of data science. Picking up a client or two for smaller/short term projects would really help, I think.
As an aside, it isn't just the postdocs who are disgruntled. I was awarded tenure last year, and while there are aspects of the job I love, there are others that push me toward making a change.