
I'm pretty suspicious of the idea that most companies have data that is too big for a non-distributed environment.

I do think that most companies log lots of data in different places without much thought or structure, and as a result need a "data scientist" who spends most of their time simply aggregating and cleaning data that can then easily be processed on a single machine. In other words, they spend maybe 10% of their time doing statistics.

Maybe I'm wrong, but the above is how all the data scientists that I know describe their work.




In my experience it's a bit of both.

Most of a data scientist's time is definitely spent getting data from various places, possibly enriching it, cleaning it, and converting it into a dataframe that can then be modelled. (Surprisingly, a lot of those tasks are beyond pure actuaries / statisticians, who often get nervous writing SQL queries with joins, never mind Spark jobs.)
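
To make the "aggregate, clean, join" part concrete, here's a rough pandas sketch of that workflow; the connection string, table names, columns, and file are all hypothetical, just to illustrate the shape of the work:

    # Rough sketch only -- connection string, tables and columns are made up for illustration.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@warehouse/prod")  # assumed warehouse

    # Pull two sources that live in different places
    orders = pd.read_sql("SELECT customer_id, amount, created_at FROM orders", engine)
    customers = pd.read_csv("customers_export.csv")  # e.g. a CRM dump someone emailed around

    # Clean and enrich
    orders["created_at"] = pd.to_datetime(orders["created_at"], errors="coerce")
    orders = orders.dropna(subset=["customer_id", "amount"])
    customers["region"] = customers["region"].str.strip().str.upper()

    # The join that makes statisticians nervous when written in SQL
    df = orders.merge(customers, on="customer_id", how="left")

    # Only now does the "statistics" part start
    summary = df.groupby("region")["amount"].agg(["mean", "sum", "count"])
    print(summary)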

But even once you've got that dataframe, it's often too big to be processed on a single machine in the office, so you need to spin up a VM and move the data around, which requires a lot more computer science / hacking skill than you typically pick up in a stats degree. So I think the term "data scientist", for someone who can do the computer-sciency stuff in addition to the stats, has its place.
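
For the "too big for one machine" case, the same join roughly translates into a PySpark job run on a cluster; again a sketch, with the S3 paths and column names assumed rather than taken from anywhere real:

    # Minimal PySpark sketch of the same join when the data won't fit in memory.
    # Paths and columns are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("orders-join").getOrCreate()

    orders = spark.read.parquet("s3://bucket/orders/")                  # assumed location
    customers = spark.read.option("header", True).csv("s3://bucket/customers.csv")

    df = (
        orders
        .withColumn("created_at", F.to_timestamp("created_at"))
        .dropna(subset=["customer_id", "amount"])
        .join(customers, on="customer_id", how="left")
    )

    # The aggregation runs across the cluster instead of one office machine
    df.groupBy("region").agg(F.avg("amount"), F.sum("amount")).show()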



