99% of people looking for information about big data, and 99% of people looking to do data science, have nowhere near big data and don't need to be taught Hadoop. Those people are often lacking fundamental knowledge and are looking for a magic technological solution instead of reviewing their basics.
A "data science" track should hence be 90% about algorithms, data structures, linear algebra, and computer architecture. People need to know how to compute with matrices, how to use B-trees, what R-trees are, what SIMD is, what cache locality is, which packages to use for practical linear algebra, and what the GPU is good for and when and how to use it. Teach databases, but also include some of the theoretical database material, and teach how to use relational databases correctly in the first place, since that is what is most commonly useful and what people don't really understand. PostgreSQL is a real gem, and few people know how much capability it has: the geospatial indexing, the full-text search, how to do basic optimization and profiling.
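As a taste of the full-text search: a ranked text query from Python looks roughly like this. This is a minimal sketch; the docs table, body column, and connection string are made up for illustration, and on a real table you'd back it with a GIN index over the tsvector.

    import psycopg2

    conn = psycopg2.connect("dbname=example")
    cur = conn.cursor()

    # For speed on large tables, create the index once:
    #   CREATE INDEX docs_fts_idx ON docs USING gin (to_tsvector('english', body));
    cur.execute("""
        SELECT id, ts_rank(to_tsvector('english', body), query) AS rank
        FROM docs, plainto_tsquery('english', %s) AS query
        WHERE to_tsvector('english', body) @@ query
        ORDER BY rank DESC
        LIMIT 10
    """, ("cache locality",))
    print(cur.fetchall())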
Even for people who really do large-scale computations, learning Hadoop or MongoDB (is anyone legitimately doing big data really using MongoDB?) is just an afterthought, considering how much you have to learn first about mathematics, algorithms, and computing to do anything sensible at that scale at all. If you bubble sort, MongoDB won't save you. And if you already know the fundamentals, you likely don't need a separate course in Mongo or Hadoop.
For people looking to learn something more genuine, I would recommend, for example, this book: http://infolab.stanford.edu/~ullman/mmds.html
100% agree, except that I think courses like this are great for people who want to bluff their way through a job interview to land one of those $150K-$250K big data jobs that are all the rage right now (watching Andrew Ng's machine learning lectures beforehand would be the pro move, in my book). In my experience so far, most of these positions appear to be Java programming gigs where, sadly, issues like SIMD, cache locality, and the GPU are actively ignored or dismissed. The autoboxing required to put primitives into Java generics pretty much destroys any hope of efficient cache use on its own.
I've walked out of job interviews over this sort of thing, where the position was described as "performance programming" for big data and/or machine learning, except that everything had to be in Java. And sure, Java performance programming is a thing, but compared to what one can achieve with SIMD, attention to cache locality, and/or running the 20% of the code that eats 80-99% of the cycles on the GPU, it's laughably uninteresting to me, big bucks aside.
You can do extreme performance coding from Java, JavaScript, R, Python, Haskell, or the language of your choice, as long as the number-crunching is done by calling low-level, heavily optimized code. For example, PyCUDA:
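Something along these lines is the idea. A minimal sketch, assuming PyCUDA is installed and a CUDA device is available: the elementwise number-crunching happens on the GPU via pycuda.gpuarray while the orchestration stays in ordinary Python (the array size is arbitrary).

    import numpy as np
    import pycuda.autoinit              # importing this sets up a CUDA context
    import pycuda.gpuarray as gpuarray

    # ship the data to the GPU once, then do the elementwise math there
    x = np.random.randn(10_000_000).astype(np.float32)
    x_gpu = gpuarray.to_gpu(x)

    # each arithmetic op runs as a CUDA kernel; .get() copies the result back
    y = (2 * x_gpu * x_gpu + 3 * x_gpu + 1).get()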
And this is where I get told by data scientists that they don't wish to support such code. And IMO that's fine for piddling around and experimentation. But for production, on thousands to hundreds of thousands of servers, running 24/7, at companies with billions and billions of dollars in the bank, that's leaving way too many transistors and electrons on the table for me to stomach. In contrast, here's what you can achieve when you do pay attention to these things:
Aren't there already plenty of courses and books out there about algorithms, data structures, RDBMS, etc.?
I have a pretty good background in a lot of that (I can always learn more, of course), but I don't know anything about Hadoop and MapReduce (which is conveniently not mentioned in your critique, probably because it does fall under your list of acceptable topics), so I find this course interesting. I find the claim that "if you know the fundamentals, you don't need a course in that" dubious. Are you essentially saying that any learning material specifically targeting Hadoop is unnecessary?
Don't worry, though: I'm not looking for a quick fix for my business needs, and I'm not going to go out and spin up a Hadoop cluster on my 500GB of production data; I just want to learn. You're arguing more against your perceived motivations of the course-takers than against the validity of the course itself.
What really annoys me is this particular blog post, not just the existence of the course. For example, this bit:
“What is Big Data?” They will teach you fundamental principles of Hadoop, MapReduce, and how to make sense of big data. Developers will learn skills that provide fundamental building blocks towards deriving maximum value from the world's data. Technologists and business managers will gain the knowledge to build a big data strategy around Hadoop.
In my experience, to be successful in engineering in general, one has to learn whole design spaces instead of just individual technologies. This means taking a programming languages course instead of another C++ course, taking a distributed systems course instead of a Hadoop course, taking a databases course instead of a MySQL course, and so forth. You of course have to fiddle around with the various tools as well, but you don't need a course or an instructor for that; otherwise you often end up just following written or spoken instructions about which configuration file to edit or what command or query to type, which is actually much _worse_ than self-directed learning. Once you have this theoretical background and bits of varied practical experience, you can make mature decisions about which tool to pick for a particular job.
So, I would really, really like to avoid anyone I might have a chance of working with learning "how to build a big data strategy around XXX", whatever product XXX is. You don't build anything "around" up-front assumed technologies. This is just pumping the "big data" bubble, which is certainly good for Cloudera, which makes its living off of it, but it doesn't exactly sound like teaching people to make informed technical judgements. I'm also not too partial to the stance that learning Hadoop and MapReduce makes you a big data expert (they offer a certificate in big data after completing the course).
And no, there are nowhere near enough algorithms courses. The things taught in most undergraduate algorithms courses are often not the things you need for practical large-scale data processing. I posted the Jeff Ullman book precisely as an example of what good courses on handling data might look like. This material is taught very rarely.
A lot of universities are definitely trying to push more and more of these sorts of courses, though; check out, for example, CS229r at Harvard: http://people.seas.harvard.edu/~minilek/cs229r/index.html , along with some other course examples at the bottom of that page. Do you think things are changing for the better on this front?
If you have any suggestions for online courses or other self-learning resources for distributed systems, I would be much obliged. The Ullman book is already in my queue.
Any recommendations for learning more about PostgreSQL's features? I use Postgres for basic web-app data-store kinds of applications, but haven't gone much beyond that. I've skimmed the manual and it doesn't look that dissimilar to MySQL's, but I see a lot of comments heaping praise on Postgres and/or criticizing MySQL. Is there a good book on Postgres?
Man, I love that book. I loved it so much, as a matter of fact, that I went and bought it from Amazon. Even though I haven't directly used much of it (though the locality-sensitive hashing and the communication complexity material alone are easily worth the price), it has definitely made me think about large datasets in a much more productive fashion.
My only complaint is with the awful, awful cover. No link, but if you've ever seen it, you'll know exactly what I mean.
True. You're much more likely to apply data management fundamentals in a project than to optimize Impala queries on petabytes of user data. If you're a 10-50 person startup... maybe even a 200 person startup... what core/critical internal problems can you think of that would require such large-scale computing? Would you allocate precious resources to a 'big data' team to monitor your logs or user activity? You most likely wouldn't need to. For the most part, only the big companies deal with that much data, and only a handful of people would be in charge of managing it.
Edit: I also don't want to sound close-minded or rule out an era where every company, large or small, will have terabytes of data on their hands. I just haven't seen any indication that we're going in that direction.
Your list of stuff to learn is rather heavy on implementation. Wouldn't it make more sense to use a poor implementation of an algorithm in Hadoop across 10 machines than to squeeze every ounce of performance out of a single machine using GPUs, cache locality, etc., at great expense in programmer time?
I think that one nice thing about the idea of "big data" is being able to parallelize the problem and just throw more cores at it.
But on the other hand, I do think that when people think of "big data" they have in mind some magic solution that doesn't really exist. At the end of the day, big data is just statistics.
This is a fair counterpoint, but if your lack of fundamentals caused you to write an O(n^k), k > 1, algorithm, you're not going to be able to buy your way to a solution with more cores if you truly have "big data". Even the constant multipliers of a poor O(n) algorithm will cost you serious bucks if your default optimization strategy is "buy more computers" rather than a few afternoons of quiet thought.
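To put made-up but representative numbers on it (one core doing a billion simple operations per second, ten million records), the extra hardware doesn't come close to closing the gap:

    import math

    n = 10_000_000           # records (hypothetical)
    ops_per_sec = 1e9        # one core, very roughly

    quadratic = n**2 / ops_per_sec           # an O(n^2) pass over the data
    nlogn = n * math.log2(n) / ops_per_sec   # an O(n log n) pass

    print(f"O(n^2) on 1 machine:     {quadratic / 3600:.1f} hours")
    print(f"O(n^2) on 10 machines:   {quadratic / 10 / 3600:.1f} hours")
    print(f"O(n log n) on 1 machine: {nlogn:.2f} seconds")

Ten machines turn roughly 28 hours into roughly 3, while the better algorithm finishes in well under a second on one box.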
It seems that the majority of Big Data / Data Science applications are designed to give advertisers insight into things I don't really want them to have insight into. That really sucks, because the technology is cool, but I don't want to help build that kind of future. It's analogous to how I feel about Computer Vision: there are a handful of legitimate purposes for it, but most applications of the technology fall somewhere between "I don't like that idea" and "that's totally unethical".
There are plenty of Big Data applications that aren't unethical. Back in the 90s (before the term Big Data was coined), two of the biggest users of Teradata were P&G and Wal*mart. It was more about supply chain and retail store efficiency than anything nefarious. Big data helped make sure that store shelves had what people wanted.
Today there are mass spamvertising campaigns built on Big Data, but there are also applications in financial services (making sure our pension funds take the right risks), engineering, telecom, and elsewhere that help improve our lives.
Reducing retail waste is better for everyone. Though some of the stuff they do is questionable, like how carefully items in the store are placed to maximize the amount of unnecessary crap they sell to impulsive people.
Why not grab some public domain data sets and build something on those? There are some really nice sources to be discovered at data.gov and other places. Much of it could be used to inform and educate the public. You could probably build a small business or consultancy around it if you choose well.
This is a cool idea, but I wish everything weren't so 'big data' oriented. Most people will never work with big data. Instead of teaching me map/reduce, how about teaching me how to model with a mixture distribution? Teach me how to master small data and then scale that up to big data if and when the need arises.
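For what it's worth, the "small data" version of that is often only a few lines with off-the-shelf tools. A sketch using scikit-learn's GaussianMixture on toy data (the numbers here are invented purely for illustration):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # toy data: two overlapping 1-D clusters
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(0, 1.0, 500),
                           rng.normal(4, 1.5, 500)]).reshape(-1, 1)

    # fit a two-component Gaussian mixture and inspect what it learned
    gm = GaussianMixture(n_components=2, random_state=0).fit(data)
    print("means:  ", gm.means_.ravel())
    print("weights:", gm.weights_)
    print("component for x = 2.0:", gm.predict([[2.0]]))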
Cloudera seems to be sponsoring this class and both instructors are from there. I think that explains why the class has a "Big Data in Hadoop" theme, since that's what Cloudera does.
(As a comic aside to the buzzwordiness of "Big Data": I saw a tweet yesterday that lamented the rising use of the term "Hyper data", which was followed by a reply about approaching "Ludicrous data", with a link to Spaceballs' ludicrous speed scene.)
I'm working through this course (and took the earlier sister course Computation for Data Analysis with R). It's quite good so far. They stay away from the "big" part and focus on the core of data analysis: how to find data, clean it up, explore it, find relationships and present your findings. We are using R, which is suitable for most data sizes. It's offered by Johns Hopkins, and has more of an academic bent than an industry one. Great general purpose knowledge that I think you would want before you start messing around with Hadoop.
I would also like something like this. I deal with data in my job all the time, everything from surveys to web analytics to customer data, but I only have a foggy notion how to work with it. As a result, most of the output is fairly simple, obvious metrics, but it would be nice to get a few online courses on how to do basic stats, construct surveys, do UI and A/B testing, etc.
So looking through this 'track', I see one course which seems like it might be more central to the discipline, "Intro to Data Science"[0]. Has anybody had a chance to compare this one against Bill Howe's "Introduction to Data Science"[1] on Coursera?
The Introduction to Hadoop and MapReduce course seems to have the right amount of content. It can be completed in one sitting, and the content is polished, well presented, and easy to grasp. Respect to the Cloudera faculty. As an added bonus, it uses Python instead of Java for the examples.
(Course author here)
Thanks. We chose Python because it's a little more approachable for many people than Java, and it's the language used in Udacity's Comp Sci 101 course. Also, using Hadoop Streaming saved us from having to explain a bunch of concepts such as WritableComparables, InputFormats, etc. that would just have gotten in the way of the basic MapReduce principles.
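For anyone who hasn't seen Hadoop Streaming, that's the appeal: the mapper and reducer are plain scripts that read stdin and write tab-separated lines to stdout. A minimal word-count pair, roughly in the spirit of (but not taken from) the course:

    # mapper.py -- emit one "word<TAB>1" line per word
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- Streaming sorts by key before the reducer runs,
    # so a running total per word is enough
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

You can test the pair locally with cat input.txt | python mapper.py | sort | python reducer.py, then hand the same two scripts to the streaming jar via -mapper and -reducer.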
Most Udacity courses use Python, but it seems this data science series will use R. Python also has lots of data analysis tools. I just wonder why they are choosing R.
This seems like a great bundle of courses. The big data topics caught my attention but I'm actually looking forward to exploratory data analysis. Especially with the Tukey mention: https://www.udacity.com/course/ud651