Mistakes Programmers Make when Starting in Machine Learning (machinelearningmastery.com)
128 points by robdoherty2 on Jan 28, 2014 | 63 comments



I have a love-hate relationship with the advice "Don’t reinvent solutions to common problems." In one sense, it's obviously a good idea because it's (generally) bad to repeat yourself. When you repeat yourself, the places where you can make a mistake increase, you violate single point of truth, yadda, yadda, yadda.

But in the same sense, for me, reinventing things has led to some of the most enriching experiences I've ever had as a programmer. For example, the first big project I ever tackled (more than 10 years ago) was an open source message board heavily inspired by vBulletin. I really didn't solve any problem that hadn't been solved before, and the last thing the world needed was another message board. But holy hell, I really learned a lot! And it was maddeningly fun. I would never want to deprive someone of that experience just because I think that DRY is a Good Thing.

On the flip side, just a short while ago, I tried writing a package that handles a cluster of remote peers[1]. Want to know what I learned? That I knew a lot less about networking than I thought I did, and I'd have to read a lot more literature before I could get to where I wanted to go.

And yes, this can't always be the case, because you'd never get anything done otherwise.

(It occurs to me that maybe I'm talking about personal enrichment while the OP is talking about solving ML problems. Toe-may-toe, toe-mah-toe.)

[1] - https://github.com/BurntSushi/cluster


Don't reinvent solutions to common problems if solving the problem is the goal.

Do reinvent solutions to common problems if learning more about the problem space is the goal.


> I have a love-hate relationship with the advice "Don’t reinvent solutions to common problems."

Yes: it's difficult for me to reconcile this with the next piece of advice: "Don't Ignore the Math".

If you haven't understood the common problems well enough to code your own solutions, it seems to me that you wouldn't have a secure enough understanding of the math to make any valuable improvements.


When experts give advice, they often forget how little most people know. Contradictory advice probably means they know which cases each piece applies in, and assume it will be obvious if you know what you are doing.

Ultimately, most of the advice will boil down to "be an expert". I guess that's what you're in for when you read a five point blog post on something which is really quite advanced.


Alan Kay said, "To a first approximation, you should never write your own software. To a second approximation, you should always write your own software."


Cite? Not finding it.

I did find: "People who are really serious about software should make their own hardware."


Alan Kay is a gold mine of quotable material.


I've had good experience with reinventing the most important part of the wheel. If I try to solve a common problem myself first, I will gain insight into the pitfalls, so that I can make a better decision which pre-made solution to use.


I completely agree but I think there's a nice explanation for this.

It is _absolutely_ fine to violate DRY and repeat work of others for _learning_ purposes.

However, if you're already past that stage and want to create code/library/binary/whatever for other people to actually (re)use, then DRY and be orthogonal.

Edit: lostcolony put this more eloquently.


One nasty aspect of the advice "Don't reinvent solutions to common problems" is it doesn't really provide a way to tell when your problem is really, truly a common problem, which can be a lot more subtle than you might think.

Sometimes the simplest single requirement change - say, works in low memory environment, or handles high latency, or utilizes a GPU, or has a good public API, or has low battery usage, or can save its entire state to disk, or is skinnable, or supports other random feature X - can instantly take all common solutions to a problem and invalidate them. But just as often, it doesn't. Or, just as often, you have to just give up on those requirements for pragmatic reasons.


The author does specifically point out that his advice is for people starting with machine learning. I think your bulletin board example wouldn't have worked out as well if you hadn't used any online bulletin boards before building your own.


It's true I had used bulletin boards before, but I hadn't done much programming before that outside of a high school classroom.

In any case, I could pose a litany of other examples where I started in a subject precisely by "re-solving" some common problems. For instance, when starting with machine learning, I implemented a simple naive Bayes text classifier myself. It was fun and illuminating.
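For the curious, the core of that kind of classifier is really just counting words and adding logs. A rough sketch of the idea in Python (not my original code, untuned, add-one smoothing only):

    import math
    from collections import Counter, defaultdict

    class NaiveBayes:
        def fit(self, docs, labels):
            # Per-class word counts, class frequencies, and the vocabulary.
            self.word_counts = defaultdict(Counter)
            self.class_counts = Counter(labels)
            self.vocab = set()
            for doc, label in zip(docs, labels):
                words = doc.lower().split()
                self.word_counts[label].update(words)
                self.vocab.update(words)
            return self

        def predict(self, doc):
            # Score = log P(class) + sum of log P(word | class),
            # with add-one (Laplace) smoothing for unseen words.
            total = sum(self.class_counts.values())
            best, best_score = None, float("-inf")
            for label, count in self.class_counts.items():
                score = math.log(count / total)
                denom = sum(self.word_counts[label].values()) + len(self.vocab)
                for word in doc.lower().split():
                    score += math.log((self.word_counts[label][word] + 1) / denom)
                if score > best_score:
                    best, best_score = label, score
            return best

    clf = NaiveBayes().fit(["free money now", "meeting at noon"], ["spam", "ham"])
    print(clf.predict("free money"))  # -> "spam"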

Yes, this is definitely a learning style. Yes, people's tastes differ. But, I think that's my point...


> On the flip side, just a short while ago, I tried writing a package that handles a cluster of remote peers[1]. Want to know what I learned?

I'm having flashbacks to the last level of the Stripe CTF :) Have you tried it? It's a little late now but it's still up I think. It has discussion of distributed consensus and distributed logs (related to your solution above) and links to a few interesting articles/implementations, including one in Go (go-raft). Quite educational if you're a beginner at that sort of thing (as I am) - you might find it interesting.


Sometimes there can be pragmatic reasons to re-implement existing software, especially if your needs are simpler than the problem the library had in mind.

It can be a choice between "Write 200 lines of code that I understand and know does exactly what I need and no more" or "import 5000 lines of unknown code that might be abandoned and left to bit-rot by the original author, use 5% of its functionality, and hope I can shoehorn it in without too many gotchas".


Only try this if you are sure you understand the problem domain and are very comfortable dealing with all its pitfalls. Many times (especially when dealing with numeric code) 4500 of those 5000 lines are for dealing with special corner cases and obscure gotchas that people have spent months agonizing over how to fix, and that have been battle-tested by experts over years. Stuff which won't affect you 99% of the time, but when it does will leave you having to redo all their hard debugging work.


Yes, that is very true and I have been bitten by exactly that in the past also.


Engineering is constant re-invention of already solved problems...


I'm surprised this is on HN; probably it's the lack of downvote buttons. It's just banal, generic advice; there's zero useful information for any mildly experienced programmer here.

1. Don't put machine learning on a pedestal. – Do programmers really make this mistake? And what exactly does that mean?

2. Don't write machine learning code. – A classic programming principle, not specific to machine learning at all.

3. Don't do things manually. – Oh really? Thanks, I didn't know that.

4. Don't reinvent solutions to common problems. – Obviously, the same principle as 2.

5. Don't ignore the math. – Ok, good point, but if you want to get serious with ML, it's difficult to avoid maths.


Agree. I found this post by John Langford more informative, in a similar vein:

http://hunch.net/?p=2562


Points 2 and 4 literally mean, "Stand on the shoulders of giants"!


Interesting,

Having done a small amount of machine learning, I can see how the advice here is "true". And by "true", I mean appropriate for the way that machine learning exists and operates in present-day practice. Algorithms are difficult, temperamental, and require expert "tuning".

The sequence seems to be:

- First you learn the formal theory, the math and statistics.

- Then you learn the "squinting", the ad-hoc rules for how to apply which algorithm.

- Then implement the thing

This works better than just firing up your editor and piecing together code. However, I would claim that this doesn't actually work well, in the sense that this is kind of where AI/ML have bogged down. I mean, there are only 5 main approaches, 20 main algorithms, and whatever subsidiaries and random stuff. They don't work great, and the only progress is incremental (though there is progress, and throwing more computing power at the problem at the same time enhances it while masking the low amount of conceptual progress).

What's lacking is any modularity in combining algorithms. The power of ordinary programming is, essentially, using function calls to put together what you want. ML doesn't do that, and for all the magic, that makes it weak and fragile: when one magic algorithm doesn't work well, rather than improving it, it really is better, at present, in the interest of getting stuff done, to start with a different magic algorithm. This is true; I'm a realist in the sense of accepting the present, but an idealist in the sense of saying "that kind of sucks, we should be able to fix problems, not surrender and regroup".

Yes, I'm happy to denigrate the good and proper in my quest for the best. But I'm an idealist; I suppose it's a matter of taste.


I would seriously look into deep learning. http://deeplearning.net/ I am doing everything with it now. This includes principal component analysis/compression, face detection, handwriting recognition, named entity recognition, clustering, topic modeling, semantic role labeling, among other things.

There is a very common structure to this. Despite neural nets having their own baggage, they're worth understanding.

The structure you're wanting is definitely in there. Edit: Yes, a bit of self-promotion here. Just making a point with patterns I've found as I've built this out.

See:

https://github.com/agibsonccc/java-deeplearning/blob/master/...

https://github.com/agibsonccc/java-deeplearning/blob/master/...

Half the battle is understanding the linear algebra going on here. Beyond that you can pretty much do everything with one set of algorithms and terminology.

For those going "WTF, Java, are you insane?": the core idea I'm linking to here is that deep nets are composed of single-layer neural networks with slight variations, sharing a very common structure between the individual layers and the deep nets themselves.
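If the Java is getting in the way of seeing that point, here's the shape of the claim in plain numpy, forward pass only (a sketch of the idea, not the linked library's API, and no training code):

    import numpy as np

    class Layer:
        # One "singular" network: weights, bias, nonlinearity.
        def __init__(self, n_in, n_out):
            self.W = 0.1 * np.random.randn(n_in, n_out)
            self.b = np.zeros(n_out)

        def forward(self, x):
            return np.tanh(x.dot(self.W) + self.b)

    class DeepNet:
        # A deep net is the same layer structure repeated and chained.
        def __init__(self, sizes):
            self.layers = [Layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

        def forward(self, x):
            for layer in self.layers:
                x = layer.forward(x)
            return x

    net = DeepNet([784, 256, 64, 10])              # e.g. MNIST-sized input down to 10 outputs
    print(net.forward(np.zeros((1, 784))).shape)   # (1, 10)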


I wouldn't claim that neural nets don't work. We've seen many use cases where they do work. (And I have looked at deep learning just a little; it may be superior in ways, but it doesn't seem in any way fundamentally different from the other stuff.)

I would add that support vector machines also work; they are similar and have much clearer math behind them [1]. But SVMs and neural nets are ultimately just linear matchers on a nonlinear pattern space, and they ultimately involve ad hoc choices that experts learn over time.

As I said above, once you learn the maths (linear algebra, statistics, functional analysis, or whatnot), it becomes less about basic understanding and more about "understanding how": a series of tweaks that experts "with a feel for this stuff" make. But this "feel"-level understanding seems to be exactly what stands in the way of serious, rational progress on the topic.

[1] http://en.wikipedia.org/wiki/Support_vector_machine
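To make "linear matcher on a nonlinear pattern space" concrete, a quick scikit-learn illustration (my tool choice, nothing to do with the libraries above): the same linear margin machinery fails on concentric circles in the raw 2-D space, but separates them easily once an RBF kernel supplies the feature space.

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Two concentric circles: not linearly separable in the raw 2-D space.
    X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

    linear = SVC(kernel="linear").fit(X, y)
    rbf = SVC(kernel="rbf").fit(X, y)   # same margin machinery, applied in kernel space

    print("linear kernel:", linear.score(X, y))   # roughly chance, ~0.5
    print("rbf kernel:   ", rbf.score(X, y))      # close to 1.0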


Just today, I finished a deep-learning example in C# with Accord.NET. It solves a simple XOR function, as well as classifying ASCII digits.

If you're looking for a basic example in .NET, I think the XOR one is as simple as it gets.

Deep-Learning XOR: https://github.com/primaryobjects/deep-learning/blob/XOR/Dee...

Deep-Learning Digits: https://github.com/primaryobjects/deep-learning/blob/master/...


Very cool stuff. A .NET implementation is a great project. If you want any advice, feel free to ask; I'm working with a friend on a Hadoop version as well.

https://github.com/jpatanooga/Metronome

We have a wiki and some resources put up.


Very interesting!

Is that Word2Vec implementation you have roughly equivalent to the Google version[1]?

Any examples of how to use deeplearning4j generally?

[1] https://code.google.com/p/word2vec/


Binary compatible, yes. Star the repo and watch it in the next few days. Example apps are on the way. I plan on implementing a full "easy to use" machine learning lib around this.

Edit: Poke around in the tests. Here's an example of it learning a compressed version of MNIST: https://github.com/agibsonccc/java-deeplearning/blob/master/...

I have a lot more example usage in each of the tests. Test coverage was a higher priority than the documentation, but example usage is there. I'm more than happy to answer emails about usage of the library as well. I also take feature requests.

I plan on implementing convolutional nets, recursive neural nets, and some other ones based around that same structure. That includes the scale-out versions with Akka for easy multi-threading or clustering (I have built-in service discovery with ZooKeeper, among other things, in there).


Advice question - how far into the linear algebra should I go? I'm currently working my way through a book - and it's not too bad, but I really don't want to invest more than I really need. Any suggestions?


The main thing to understand is the basics. For example, element-wise multiplication vs. full-blown matrix operations. Think something like this: http://en.wikipedia.org/wiki/Matrix_multiplication

Matlab/Octave is a great way to practice this due to the native data types. If Python is your thing, numpy's arrays are also pretty easy to digest.

Subtle little tricks like this: https://www.youtube.com/watch?v=evF-3ykjRU0

And understanding the dynamics of scalar operations vs. matrix-vector operations.
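In numpy, the distinction looks like this (trivial, but it's exactly the thing that bites people coming from Matlab, where `.*` is the element-wise product and `*` is the matrix product):

    import numpy as np

    A = np.array([[1, 2],
                  [3, 4]])
    B = np.array([[10, 20],
                  [30, 40]])
    v = np.array([1, 0])

    print(A * B)       # element-wise product:  [[10, 40], [90, 160]]
    print(A.dot(B))    # matrix product:        [[70, 100], [150, 220]]
    print(3 * A)       # scalar op broadcasts over every element
    print(A.dot(v))    # matrix-vector product: [1, 3]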

The machine learning class has some good fundamentals if you need a refresher on how something works.

There will be more complex things; for example, some optimization algorithms have different uses for eigenvalues: http://see.stanford.edu/materials/lsocoee364b/11-conj_grad_s...

See: http://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors

One other thing might be understanding the different ways you can manipulate data. In this case, the numerical representation is one example per row when I toss in one matrix for training. This is applicable to many machine learning problems.


Very much appreciated. That's really helpful and feels very doable. :)


You could look at the appendices in texts by Bishop, MacKay, Barber, and Rasmussen/Williams to see what they expect (and they expect a pretty thorough understanding). The last three are freely available:

http://www.inference.phy.cam.ac.uk/itila/book.html

http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/090310.pdf

http://www.gaussianprocess.org/gpml/chapters/RW.pdf

Also, an excellent list of ML resources: http://www.reddit.com/r/MachineLearning/comments/1jeawf/mach...


This is essentially the goal of probabilistic programming[1]: let the programmer specify and tweak the model, and have the analog of a compiler handle inference. Finding a good model is then analogous to debugging.

You are mostly stuck with Bayesian models, since that's what we have general-purpose inference algorithms for. And in practice an understanding of Bayesian statistics is a prerequisite for writing useful probabilistic programs.

[1] e.g. http://probabilistic-programming.org/
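As a taste of what that looks like in practice, here's a tiny coin-bias model written with PyMC3 (my library choice here, purely as an illustration): the programmer writes down the generative model, and the framework handles inference.

    import pymc3 as pm

    # Observed coin flips (1 = heads). The "program" is just the model spec;
    # inference (MCMC here) is handled by the framework.
    data = [1, 1, 0, 1, 1, 1, 0, 1]

    with pm.Model():
        p = pm.Beta("p", alpha=1, beta=1)          # prior on the coin's bias
        pm.Bernoulli("obs", p=p, observed=data)    # likelihood of the flips
        trace = pm.sample(2000, tune=1000)

    print(pm.summary(trace))   # posterior mean of p should land near 0.7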


Well, you can kind of use decision forests, and similar approaches, to combine algorithms. The lack of human-readability makes it pretty tough unless you're in a domain where you can effectively apply weighted heuristics (which I found to work pretty well in NLP, albeit with a lot of experimental tweaking to arrive at the right model for my domain).


And here I was expecting things like 'overfitting' and 'not having a holdout set or at least cross-validating'.
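Both of those are close to one-liners these days, which is part of why skipping them stings. A rough sketch, assuming scikit-learn and a toy dataset:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = load_iris(return_X_y=True)

    # Holdout set: the test portion never influences training decisions.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print("holdout accuracy:", clf.score(X_te, y_te))

    # Or cross-validate: k estimates of generalization error instead of one.
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
    print("5-fold CV accuracy:", scores.mean())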


You mean there are people whose machine learning homework in college/grad-school didn't threaten a zero for failing to cross-validate?

Well damn. What did I spend all those late nights in front of a bleak MATLAB environment for, some kind of best practices?! </tongue in cheek>


Yes, I didn't find this helpful at all. Also, it needs editing.


Hacker News is moving towards Yahoo News.



I feel like the first mistake someone makes is to try to practice "machine learning". If you're going in with that as your goal, you're likely to fail.

Instead, tell a story. Motivate what you're doing with real questions and real data and you'll be driven to do all 5 of these lessons (and many more smart things).

Every time someone comes in with a pet algorithm I cringe a bit. There's certainly an air of everything being "just marbles", with every algorithm applicable to every problem, but the real question is rarely about the algorithm; it's about the setup, the cleaning, the story. Even when it's about the algorithm, you're actually just trying to tell a better story.

So focus on that.

Figure out what you want to "do ML" for before you get too excited about what ML is. It's often really painful and annoying, with bug-fixing turnaround clocking in at hours or days. It's also some of the prettiest math around and a collection of neat hacks for getting great answers to nigh unanswerable questions.

But it's always about answering a question. Start there.


BTW, edx.org is offering what looks like a relatively rigorous intro to probability with calculus. [https://www.edx.org/course/mitx/mitx-6-041x-introduction-pro...]. They say it closely follows this course on OCW [http://ocw.mit.edu/courses/electrical-engineering-and-comput...]

It starts soon, Feb 4th.


Using HostGator?

Ads on a 500 error page are perhaps not the most confidence-inspiring sight.


I suppose there is a DB connection max...

Anyhow, the first thing that came to mind when I saw that was "oh, so the greatest mistake for a new machine learning student is encountering a 500 error..."



This is missing the universally true #1 mistake that probably (nearly) everyone commits when starting in ML: lacking a solid understanding of the problem/domain (unless you happen to be a domain expert for the problem you are working on, but that is a rarity).

If you do not know which features to choose and why, what the labels mean, which background data you should use, and, even more important, what the actual problem is that needs to be solved, you will be wasting lots of time - and not just yours...


Please go back and spell check. The simple errors all over the place are a little embarrassing.


The site looks interesting, but I keep getting a 500 error. I'd advise the owner to switch off HostGator soon; that web host has gone significantly downhill.


Author here. Time for new hosting... a good problem to have I guess.


Here are some things with more substance. (I don't think the blog author is doing any SEO machinations, but there's just not much to learn from his last few posts; essentially he's written data science advice similar to many other authors, but substituted in the words "machine learning".

I could dig up other ML gotchas/guidelines posts; I'd need to dig through my bookmarks.)

http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets04... (read Challenges in each dataset)

http://alpinenow.com/blog/machine-learning-is-not-black-box-...

http://research.microsoft.com/en-us/um/people/minka/papers/n...


Would anyone in SF be interested in a talk about Machine Learning for Hackers, or Machine Learning 101?


> 1. Put Machine Learning on a pedestal

Someone gave me this advice almost verbatim on HN 5 months ago: https://news.ycombinator.com/item?id=6335092


Appreciate being quoted >:)

That aside, you really have to take it in bits. Ignoring the math or fundamentals behind it is by far the worst mistake you can make.

Once you get decent at understanding it, the points I emphasized (feature vector building) become a lot less of a problem with deep learning (http://deeplearning.net/).

Auto-learned feature vectors are going to be among the best ways to do things in the coming years. More than happy to answer questions.


I got into machine learning through an article off of HN stating that random forests would get you 80% of the way (I think they were right!). For my purposes, rotation forests increased my accuracy considerably. I have a few questions:

1. I have found that data manipulation and feature creation from a SQL database is harder than actually using an algorithm, and knowing how to extract and aggregate data seemed to be more like "throw something at the wall and see what sticks". Do you have any suggestions or information on how to extract the best data?

2. After getting a random forest going, I had a hard time figuring out which algorithm to try next, or how to figure out what would work best for my dataset. Any suggestions on how to take the next step?


1. Use what correlates best with the outcomes. Look into feature selection and principal component analysis for this. This will cause less noise due to smaller feature vectors. It also allows more digestible outcomes. I would also highly recommend visualization. Weka is great if you want plug and play; otherwise there's the more traditional R/Matlab. It really depends on what you're comfortable with.

2. It depends what kind of learning you're doing. I would look into multinomial logistic regression for most supervised classification applications (more than one class). Then there's also k-means if you're looking to understand trends in your data. Keep in mind this is my off-the-shelf/simple recommendation.
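If it helps, here's roughly what both suggestions look like stitched together in scikit-learn (a sketch on a toy dataset; swap in your own feature matrix):

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    X, y = load_digits(return_X_y=True)

    # 1. Shrink the feature vectors first (less noise, more digestible),
    #    then fit a multinomial logistic regression on the reduced features.
    model = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))
    print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

    # 2. Unsupervised alternative: k-means to surface trends/structure.
    clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
    print(clusters[:20])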

I would love input on a plug-and-play machine learning CLI. I planned on building out my current project into a full-blown command line app. Since it can handle most features, including automatic visualization/debugging via matplotlib, I figure with some documentation it might be a neat tool for people who don't want to deal with feature selection but still want things simple. It's definitely a problem that there's really no clear way to build simple models. Domain knowledge is also an expensive problem.


Do you have it on a website or github? I would be interested in taking a look at it.


https://github.com/agibsonccc/java-deeplearning/

Keep in mind documentation is one of the things I need to work on the most now. I have it built and ready to go for the most part.


And just out of curiosity, how are you getting on with ML?


Mistakes programmers make when putting their blog on HN: not anticipating the traffic and sending 500s our way.


Website went down



Ditto, commenting to save page and come back at a later time.


Also: publishing a blog that doesn't let you zoom.


I don't think "reinventing solutions to common problems" is a bad thing. This is how we all learn how to do something new. And sometimes the new solution is better than any of the other solutions out there.


I agree that reinventing solutions to problems is definitely in the domain of the hacker.

When learning a new skill, however, you should understand what the common approach is before re-discovering the work of others. The article is about how to be more efficient in learning machine learning, not how to be a hacker.

It's akin to why (imho) better musicians learn to play other people's styles before developing their own.



