CS246: Mining Massive Data Sets

moultano · on Feb 6, 2020

I wrote a guide to the Minhashing family of algorithms that goes into a little more depth than is covered in the LSH section of this course.

https://moultano.wordpress.com/2018/11/08/minhashing-3kbzhsx...

You might find it useful if you've ever thought about using MinHash and wondered how to incorporate weights rather than treating everything as a set.
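
For readers who have not met MinHash before, here is a minimal Python sketch of the basic idea. It is not the method from the linked guide. It estimates Jaccard similarity from per-seed minimum hashes, and it handles integer weights with the naive trick of replicating each item once per unit of weight. The function names (`minhash_signature`, `estimate_jaccard`, `expand_weights`) and the use of seeded BLAKE2b as a stand-in hash family are assumptions made for illustration.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """64-bit hash of `item` under `seed` (a stand-in for a random hash family)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=256):
    """One minimum hash value per seed. `items` must be non-empty."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def expand_weights(weights):
    """Naive weighting: {"cat": 3} becomes {"cat#0", "cat#1", "cat#2"}.

    Only works for non-negative integer weights (an assumption of this sketch).
    """
    return {f"{item}#{i}" for item, w in weights.items() for i in range(w)}


if __name__ == "__main__":
    a = {str(i) for i in range(0, 100)}
    b = {str(i) for i in range(50, 150)}
    # True Jaccard similarity is 50 / 150; the estimate should land nearby.
    print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))

    doc1 = expand_weights({"cat": 3, "dog": 1})
    doc2 = expand_weights({"cat": 1, "dog": 1})
    # Approximates weighted Jaccard: sum of mins / sum of maxes = 2 / 4.
    print(estimate_jaccard(minhash_signature(doc1), minhash_signature(doc2)))
```

The replication trick costs time proportional to the total weight and cannot express fractional weights, which is one reason dedicated weighted-MinHash schemes exist.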

moultano · on Feb 7, 2020

I periodically submit this to HN, but it never seems to get traction. https://news.ycombinator.com/item?id=22268810

senderista · on Feb 6, 2020

I've done the class and can't recommend it highly enough (except for Ullman's lectures). I recommend the advanced track if it's still offered.

fizwhiz · on Feb 6, 2020

What's wrong with Ullman's lectures?

deepGem · on Feb 7, 2020

Man, the first edition of this course on Coursera had Ullman's lectures. I am not sure if it's the online delivery mechanism or his tonality - I couldn't sit through a single lecture.

One of the toughest courses out there.

rlewkov · on Feb 6, 2020

Dry as dirt

DrNuke · on Feb 6, 2020

New telescopes with crowdworked astronomy will need good algos to process a gazillion images at yet-to-be-seen detail.

tasubotadas · on Feb 6, 2020

I've taken the MOOC version of this course and it was very poorly explained. I hope the content and lecturing have changed since 2013 because the content itself is really interesting.

I wasn't impressed with the quality of the book either. I did learn quite a few methods there (minhash) that I got to use later, so thanks for that, but compared to the MLPR, Learning from Data, or TESL books, its quality pales.

greymalik · on Feb 6, 2020

Can you clarify what the MLPR and TESL books refer to?

tasubotadas · on Feb 6, 2020

Machine Learning and Pattern Recognition

The Elements of Statistical Learning

0x31a · on Feb 7, 2020

The title of Bishop's book is Pattern Recognition and Machine Learning.

tasubotadas · on Feb 7, 2020

modwilliam · on Feb 6, 2020

I've taken a more recent version of this course and it seems that concepts were explained pretty well - particularly, the graphics in the slides were good at explaining the various algorithms step by step.

egl2019 · on Feb 7, 2020

MLPR and TESL, as much as I like them, won't help you much when you have massive amounts of data, i.e. too much to fit into memory.

tasubotadas · on Feb 7, 2020

That's correct. But I was comparing the way the content was explained to you.

commandlinefan · on Feb 6, 2020

I haven’t done the course, but +1 for the textbook. It’s freely downloadable and very readable. I learned a lot from it and the exercises are just the right level of difficulty to help you really internalize the material. I wish there were self-check answers in the “back” of the book, though.

rahimnathwani · on Feb 6, 2020

This book?

http://www.mmds.org/

datlife · on Feb 6, 2020

abhgh · on Feb 6, 2020

Yes this is a good book, esp. for an overview. At one time, I used to organize weekly discussion sessions at work based on sections in the book :-)

commandlinefan · on Feb 7, 2020

> for an overview

I’ve read this before - that MMDS is more of a survey of machine learning’s “greatest hits” than a place to learn AI concepts. Out of curiosity, what would you recommend as something more beginner friendly?

abhgh · on Feb 8, 2020

I agree that MMDS is probably not a good introduction for ML - pretty idiosyncratic choice of topics for that. If you are fine with lectures, I always recommend Abu-Mostafa's lectures [1]; I think they have an extremely good balance of intuition and math, and I love his teaching style. Pretty underrated imho. Since these are a bit dated, they don't cover Deep Learning. For this, I like Andrew Ng's lectures [2]. Hugo Larochelle has a pretty extensive course on NNs too - I have viewed segments from it, and liked them [3]. Have heard good recommendations for fast.ai [4] if you need hands-on experience.

  [1] http://work.caltech.edu/lectures.html
  [2] sample playlist - https://www.youtube.com/watch?v=UNmqTiOnRfg&list=PLFKog8qYYq0Fs6lQf0jOuQD4XUQWYANPy
  [3] https://www.youtube.com/watch?v=SGZ6BttHMPw&list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH
  [4] https://www.fast.ai/

datlife · on Feb 6, 2020

Love this course. Pretty rigorous, but the professor explains very well.

diehunde · on Feb 6, 2020

Looks very interesting. Too bad they don't have video lectures.

mrlatinos · on Feb 6, 2020

There's a link to lecture videos towards the top of the page - https://www.youtube.com/channel/UC_Oao2FYkLAUlUVkBfze4jg/vid...

Here's a different YouTube channel I found with the full course: https://www.youtube.com/playlist?list=PLLssT5z_DsK9JDLcT8T62...

radiowave · on Feb 6, 2020

The videos from a previous Coursera offering of (I think essentially) the same course are here:

https://www.youtube.com/playlist?list=PLLssT5z_DsK9JDLcT8T62...

koulvi · on Feb 7, 2020

They do have class video lectures available from 2019 (not from Coursera): http://snap.stanford.edu/class/cs246-videos-2019/

I really pray Stanford doesn't put them behind the paywall.

master_yoda_1 · on Feb 6, 2020

What, no deep learning? And nobody is claiming to solve all of your ranking problems in 5 lines of code ;) Do we really need this course if we have fast.ai?

jph00 · on Feb 6, 2020

You seem to have missed the entire point of the hundreds of hours of lessons provided by fast.ai, as a service to the community(+), which is this: to explain, and be able to re-implement from scratch, all the things that make those 5 lines of code possible.

Some of that includes coverage of the probabilistic data structures and algorithms that are at the heart of the MMDS course. Along with computational linear algebra, analysis of the details of floating point representation, discussion of C/C++ interop, matrix calculus, parallel processing, Python accelerators like cython and numba, functional and object oriented programming, notation, and a lot more.

Being able to solve challenging problems in 5 lines of code is much harder than solving the same problem in hundreds of lines of code.

(+) fast.ai makes no money from any course - there are no ads, and everything is free. Why are some people so keen to stamp on those who volunteer their time to help others? Open source software development suffers the same problem.

mcintyre1994 · on Feb 6, 2020

For whatever it’s worth, fast.ai is my favourite MOOC and the most impressive set of videos I’ve seen shared freely on the internet. Thank you for everything you do!

emmelaich · on Feb 7, 2020

I have to give a small wry moue of amusement at "fast.ai" -- since an `ai` is a type of sloth.

benrbray · on Feb 6, 2020

I think the ML hype is actually slowing down. I've been interviewing for ML positions as a new grad and a number of companies have told me they have an excess of data scientists who can train the models, but a dearth of engineers who can actually scale the models up to production. Friends at FAANGs have similar stories.

mrlatinos · on Feb 6, 2020

My personal experience is that most of the models that data scientists are creating will never be scalable.

streetcat1 · on Feb 6, 2020

The goal of a data scientist is not to train the model, but to find useful signals to feed the model (separate the signal from the noise).

And on the front end, tie the model prediction to the business outcome and back.

The rest can be, or soon will be, automated away.

tudelo · on Feb 6, 2020

my n=1 exp is that there are tons of openings for ML engineers but they are very wary of hiring them

NerdyDrone · on Feb 6, 2020

Thanks for sharing! As a new grad also looking for work, maybe I'll apply to more SE jobs, fewer data science-focused jobs lol.

TBH applying for jobs is scary asf.

pc86 · on Feb 6, 2020

Maybe it's just because I've been in industry for 10+ years now, but how are you qualified as a new grad to apply for both?

tudelo · on Feb 6, 2020

Double major statistics + cs? masters statistics? Seems those would potentially qualify you if you are a solid candidate?

benrbray · on Feb 6, 2020

Not sure why you've been downvoted--it's really tough to know what skills will be marketable in 4+ years when choosing what to concentrate on in college.

In my schooling, I really optimized for the math+stats background, since I enjoyed it and thought it would help me stand out. I even took a short detour into a machine learning PhD before deciding academia isn't for me and leaving with an MS. Now I'm on the job market, and although I have modest coding/engineering skills and a willingness to learn, it's tough to find a company willing to take the risk. Guess I min/maxed a little too much.

Best of luck in your search!

spaniard89277 · on Feb 6, 2020

One day someone has to make a dataset of how much talent we waste in the EU. Bsc gets paid by EU taxes, guy ends up teaching in US university or working for US company.

It has become the cycle of life.

lowdose · on Feb 6, 2020

I bet this is going to shift because the Russian brain drain is almost complete. Europe has not had to compete with the United States yet because Russia was a single source for the migration of several million highly educated people.

https://publications.atlanticcouncil.org/putin-exodus/

wait_a_minute · on Feb 6, 2020

If highly productive people can't keep the fruit of their labor, they'll leave for a place where they can! We have fewer social programs in the US but the majority of productive people will probably prefer working here since they can take home a significantly higher salary even after taxes.

Anon84 · on Feb 6, 2020

And after you add the expenses needed to make up for the social programs you don't have, you might even realize that what you're left with is not that significantly higher.

wait_a_minute · on Feb 6, 2020

I don't think that's going to apply to the majority of high-income earners, because the difference is greater than the cost of not having free college or socialized healthcare. A college education is a one-time cost. You might pay $30,000 over time but when you're earning $130,000 or more, you can crush that really quickly and not be burdened by the higher taxation and reduced opportunity you'd be facing each year otherwise.

The same is true with healthcare in America. It's actually not that bad if you have a decent insurance plan and some cash. For high-income earners, which is the people we are talking about leaving from the UK, it's most likely going to be a net gain over time to have the higher salaries and lower taxes.

There's a reason the best engineers are going to want to leave the UK, Canada, India, China, etc, and come to the USA. It's worth it. I personally could work from anywhere including the UK, but why would I subject myself to such lower pay for little or no real gain?

zozbot234 · on Feb 6, 2020

Healthcare is "not that bad" if you have some cash, because then you can afford to go abroad, e.g. India for your care. Other than that, it is terrible.

ashtonbaker · on Feb 6, 2020

...do you live in the US? I have lots of problems with our health system, even as someone with health insurance, but I very rarely find myself traveling to India for healthcare.

esaym · on Feb 7, 2020

My brother-in-law worked at a Walmart warehouse for $15 an hour. Had no car payment (owned a used junker). Paid $600 for rent (mobile home). His wife applied for Medicaid and was accepted and had their first child at a county hospital for little/no charge. After two years, they also used some day care thing for no charge as well. His wife would take the kid there and then stay home and post garbage all day on Facebook. By year three, they started putting their 3-year-old on a bus for some "pre-k 3" program at the local public elementary. The "pre-k 3" program cost $400 a month, or $0 if on Medicaid. Wife used this even larger amount of free time to post garbage on Facebook. Kid's always sick 'cause of public school, but with Medicaid they paid $10 per doc visit (which is two+ times a month).

As for me, I gave up on insurance once it got to $800 a month (which was the same as my rent). Denied coverage from then on, claimed religious exemption, and built a house with the money I saved. Paid $4,500 at a local birth center for each of my kids' births and neither has ever been to a hospital or doctor.

So complaining about "how bad things are" is actually "terrible" on your part.

SubuSS · on Feb 6, 2020

Depends on where you come from - it may not be a pure $$ angle alone. (for most places of origin, $$ also works out).

Standard of living is higher for many folks who move to the US in terms of ability to actually use the outdoors (less pollution, less population, a civic sense of cleanliness, better traffic and so on), less corruption, better police force, better government, better healthcare (not cheaper), better food (regulations) and so many more angles. Obviously not all of em apply to all countries, but there is a good mix. Not to mention the US dollar goes much further in many countries.

There is a reason why US actually rejects 100s of 1000s of H1B applications every year.

pc86 · on Feb 6, 2020

In aggregate you're 100% correct, but on an individual level if you're a senior technical employee at a large tech company, you're absolutely better off in the US.

cmcd · on Feb 6, 2020

Like what? If you are a high productivity engineer you will have excellent health care through your employer.

Anon84 · on Feb 6, 2020

What if you're a high productivity engineer working on your own startup?

oarabbus_ · on Feb 7, 2020

Then you should be in the USA and not the EU if you're hoping for VC funding.

9q9 · on Feb 6, 2020

Not just Europe, have a look at how many top Silicon Valley types went to IITs (Indian Institutes of Technology) [1]. Why would you stay in your home country when you can hop on a plane and 10x your salary (not to mention work at the cutting edge of science and engineering, rather than a decade late)?

[1] https://en.wikipedia.org/wiki/Indian_Institutes_of_Technolog...

oarabbus_ · on Feb 6, 2020

Perhaps once the EU pays more than 50% of the salary that is paid in the US to software devs/analysts/etc, we'll see this behavior change.

tasubotadas · on Feb 6, 2020

Ironically, the EU has only a few top-tier software development companies because of (my personal uninformed opinion) all the taxes that reach 50% of the revenue to support social policies that include access to the universities.

keenmaster · on Feb 6, 2020

The state serves a collection of individuals. If a particular individual is underserved, then it is consistent with the purpose of the state to:

1. Celebrate brain drain, because your people are improving their circumstances

2. Quickly move to one-up the U.S. and attract talent back to the EU

Almost no one wants to leave their homeland unless the opportunities elsewhere are significantly better.

ur-whale · on Feb 6, 2020

> Almost no one wants to leave their homeland unless the opportunities elsewhere are significantly better.

That highly depends on where your homeland actually is. I can think of many places on earth where people would like nothing better than to leave but simply can't (language barrier, degrees not good enough to relocate, etc.).

keenmaster · on Feb 6, 2020

Agreed. Interpret "opportunities" broadly. I don't just mean job opportunities, I mean all sorts of economic and social opportunities. If Bulgaria had top universities, great governance, a vibrant economy, and all the social changes that come along with that, I don't think young Bulgarians would be departing in droves.

geoalchimista · on Feb 7, 2020

> Bsc gets paid by EU taxes, guy ends up teaching in US university or working for US company.

Why would this be a concern? They should be free to go where they want to. Just because university education is publicly funded does not mean that they owe their career success to their national government.

tastroder · on Feb 7, 2020

While brain drain is a thing, why do you feel that this is a noteworthy example? The phenomenon clearly predates the EU and for this course in particular, you can get the same thing in universities all over Europe.

Eldt · on Feb 6, 2020

One problem I see here is considering it to be wasteful...

gota · on Feb 6, 2020

It is wasteful from the perspective of the taxpayers that funded that highly-specialized, highly-sought-after education only to see it move away.

(BTW the argument is the same even if that education was not strictly paid for by public money. The way societies and nations organize themselves, you're taking up a valuable resource just by occupying the "slot"; even if you are paying for it, the people around you are incurring all sorts of externalities to support your existence and studies.)

I'm honestly surprised that the position that Brain Drain is not a problem exists and would be curious to see your reasoning.

Also - Brain Drain negatively affects the destination country too, in the sense that it "eases" the societal pressure on providing top-notch/decent education to the general population. Why bother with educating your people when you can let the best people from elsewhere immigrate? Reverse that and see how fast FAANG-backed education reforms hit.

Note - I'm not all in for either side, and believe as in most things there's an ideal middle ground. Let some come. Send some away, too! There's a lot of value in the exchange.

But check the list of instructors in the post's page: a tremendously hot topic in one of the best educational institutions in the world - and how many instructors are stereotypically "immigrants"? I'm leaning towards 100%.

zozbot234 · on Feb 6, 2020

Most expats do not move away altogether from their country of origin. Labor mobility ("brain drain"), even of skilled professionals, is not a negative; on the contrary, it has huge positive externalities for both the origin and the destination country, even if these don't always show up in obvious monetary terms.

People are the ultimate 'resource' for building wealth, and ensuring that people are free to move around and seek the highest and best opportunities for themselves is absolutely the best policy. (At least if a few safeguards are included to minimize, e.g. social disruption due to large-scale movement of people affecting the local culture and society in unexpected ways. 'Open borders' should never be taken literally, but it's an ideal to move towards gradually.)

fnord77 · on Feb 6, 2020

Wish Stanford would put videos online. But I guess then a lot fewer people would spend the $6000 for the class.

tmpz22 · on Feb 6, 2020

Didn't Berkeley recently take down all their educational videos because they were sued for not providing them in a fully accessible format?

mrlatinos · on Feb 6, 2020

https://news.ycombinator.com/item?id=22258662

koulvi · on Feb 7, 2020

http://snap.stanford.edu/class/cs246-videos-2019/

streetcat1 · on Feb 6, 2020

The problem with such courses is that the meaning of "massive" tends to change exponentially from year to year.

For example, if I can have a single machine with 32 cores and 1TB of memory, what is massive in this context?

moultano · on Feb 6, 2020

The datasets grow to meet the computing available to them. The things gathering the data themselves become more powerful, and so more of that data makes it downstream.

I'd define "massive" data as anything where n^2 is too big, where "too big" is bigger than either my ram or my patience.
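
To put rough numbers on the "n^2 is too big" rule of thumb, here is a small back-of-the-envelope sketch. It assumes a dense n-by-n matrix of pairwise values stored as 8-byte floats and uses decimal units. The helper names `pairwise_matrix_bytes` and `human` are made up for this illustration.

```python
def pairwise_matrix_bytes(n: int, bytes_per_entry: int = 8) -> int:
    """Memory for a dense n x n matrix (assumes 8-byte floats by default)."""
    return n * n * bytes_per_entry


def human(num_bytes: float) -> str:
    """Format a byte count using decimal units (1 KB = 1000 B)."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1000:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:.1f} PB"


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        print(f"n = {n:>9,}: {human(pairwise_matrix_bytes(n))}")
    # n =    10,000: 800.0 MB
    # n =   100,000: 80.0 GB
    # n = 1,000,000: 8.0 TB
```

Under these assumptions, a million items already need about 8 TB for all pairs, which is past the 1TB machine mentioned above even though the items themselves may fit comfortably.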

JimmyAustin · on Feb 6, 2020

I've heard everything from "it doesn't fit in Excel" to "it doesn't fit in RAM on a standard dev's laptop" (~10GB) to "it doesn't fit in RAM on a decent-sized EC2 instance" (~250GB).

moultano · on Feb 6, 2020

I started worrying at one point that all the techniques I learned when I started my career for working with big data were becoming obsolete, but they aren't. What you needed to do before to make things possible is now needed to make it fast.

Qu3tzal · on Feb 7, 2020

Isn't it the same as before? If 4gb of data was too big because you had 2gb of RAM, then the methods used at that time are the same you would apply for a 500gb dataset that can't fit in a 250gb RAM machine, right?

New issues appear when you have to analyze 2TB with a 32GB RAM machine, but when the order of difference is the same, the issues and thus the answers are the same as before?

streetcat1 · on Feb 7, 2020

No. Because the number of use cases where you have 1TB or 2TB of data is smaller in comparison.

Also, the rest of the use cases (which now fit into a single machine's memory) can be handled much more efficiently with memory-based algorithms instead of I/O-based algorithms.

The goal of Hadoop, as well as most of the theory on disk-based indices (e.g. B-trees), was to overcome the I/O bottlenecks. But as memory is getting bigger and cheaper, there is a trend to drop Hadoop in favor of reading data directly from the cloud and into memory.
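
As a toy illustration of the memory-based versus I/O-based distinction, the sketch below computes the same mean two ways: once over a list that is fully loaded in memory, and once in a single streaming pass that keeps only a running total and count. The function names and the one-number-per-line input format are assumptions for this example, and it says nothing about how Hadoop itself is implemented.

```python
import io
from typing import Iterable, Sequence


def mean_in_memory(values: Sequence[float]) -> float:
    """Memory-based: assumes the whole dataset fits in RAM."""
    return sum(values) / len(values)


def mean_streaming(lines: Iterable[str]) -> float:
    """I/O-based: one pass over the input, constant memory.

    `lines` can be an open file object; each non-blank line holds one number.
    """
    count = 0
    total = 0.0
    for line in lines:
        line = line.strip()
        if line:
            total += float(line)
            count += 1
    if count == 0:
        raise ValueError("no values in input")
    return total / count


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    # StringIO stands in for a file that is too large to load at once.
    fake_file = io.StringIO("\n".join(str(x) for x in data) + "\n")
    print(mean_in_memory(data))       # 2.5
    print(mean_streaming(fake_file))  # 2.5
```

The streaming version works for any input size but can only answer questions that a single sequential pass can answer, while the in-memory version allows random access and repeated passes at no extra I/O cost.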