I have a love-hate relationship with the advice "Don’t reinvent solutions to common problems." In one sense, it's obviously a good idea because it's (generally) bad to repeat yourself. When you repeat yourself, you multiply the places where you can make a mistake, you violate single source of truth, yadda, yadda, yadda.
But in the same sense, for me, reinventing things has included some of the most enriching things I've ever done as a programmer. For example, the first big project I ever tackled (more than 10 years ago) was an open source message board heavily inspired by vBulletin. I really didn't solve any problem that hadn't been solved before, and the last thing the world needed was another message board. But holy hell, I really learned a lot! And it was maddeningly fun. I would never want to deprive someone of that experience just because I think that DRY is a Good Thing.
On the flip side, just a short while ago, I tried writing a package that handles a cluster of remote peers[1]. Want to know what I learned? That I knew a lot less about networking than I thought I did, and I'd have to read a lot more literature before I could get to where I wanted to go.
And yes, you can't do this every time, because you'd never get anything done otherwise.
(It occurs to me that maybe I'm talking about personal enrichment while the OP is talking about solving ML problems. Toe-may-toe, toe-mah-toe.)
> I have a love-hate relationship with the advice "Don’t reinvent solutions to common problems."
Yes: it's difficult for me to reconcile this with the next piece of advice: "Don't Ignore the Math".
If you haven't understood the common problems well enough to code your own solutions, it seems to me that you wouldn't have a secure enough understanding of the math to make any valuable improvements.
When experts give advice, they often forget how little most people know. When two pieces of advice contradict each other, it's probably because the experts know which cases each one applies to, and assume the distinction is obvious to anyone who knows what they're doing.
Ultimately, most of the advice will boil down to "be an expert". I guess that's what you're in for when you read a five point blog post on something which is really quite advanced.
Alan Kay said, "To a first approximation, you should never write your own software. To a second approximation, you should always write your own software."
I've had good experience with reinventing the most important part of the wheel. If I try to solve a common problem myself first, I will gain insight into the pitfalls, so that I can make a better decision about which pre-made solution to use.
I completely agree but I think there's a nice explanation for this.
It is _absolutely_ fine to violate DRY and repeat work of others for _learning_ purposes.
However, if you're already past that stage and want to create code/library/binary/whatever for other people to actually (re)use, then DRY and be orthogonal.
One nasty aspect of the advice "Don't reinvent solutions to common problems" is it doesn't really provide a way to tell when your problem is really, truly a common problem, which can be a lot more subtle than you might think.
Sometimes the simplest single requirement change - say, works in low memory environment, or handles high latency, or utilizes a GPU, or has a good public API, or has low battery usage, or can save its entire state to disk, or is skinnable, or supports other random feature X - can instantly take all common solutions to a problem and invalidate them. But just as often, it doesn't. Or, just as often, you have to just give up on those requirements for pragmatic reasons.
The author does specifically point out that his advice is for people starting with machine learning. I think your bulletin board example wouldn't have worked out as well if you hadn't used any online bulletin boards before building your own.
It's true I had used bulletin boards before, but I hadn't done much programming before that outside of a high school classroom.
In any case, I could pose a litany of other examples where I started in a subject precisely by "re-solving" some common problems. For instance, when starting with machine learning, I implemented a simple naive Bayes text classifier myself. It was fun and illuminating.
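For anyone curious, a naive Bayes text classifier really is small enough to hand-roll in an afternoon. Here's a minimal sketch with add-one (Laplace) smoothing; the training data is entirely made up for illustration:

```python
from collections import Counter, defaultdict
import math

# Toy training data: (text, label) pairs, invented for this example.
train = [
    ("free money win prize", "spam"),
    ("win free offer now", "spam"),
    ("meeting schedule for monday", "ham"),
    ("project notes from the meeting", "ham"),
]

# Per-class word frequencies and class priors.
word_counts = defaultdict(Counter)
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    """Pick the class maximizing log P(class) + sum log P(word|class),
    with add-one smoothing so unseen words don't zero out the product."""
    best, best_score = None, float("-inf")
    total_docs = sum(class_counts.values())
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for w in text.split():
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```

Working in log space avoids the floating-point underflow you'd hit multiplying many small probabilities, which is itself one of the lessons you only learn by writing it yourself.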
Yes, this is definitely a learning style. Yes, people's tastes differ. But, I think that's my point...
> On the flip side, just a short while ago, I tried writing a package that handles a cluster of remote peers[1]. Want to know what I learned?
I'm having flashbacks to the last level of the Stripe CTF :) Have you tried it? It's a little late now but it's still up I think. It has discussion of distributed consensus and distributed logs (related to your solution above) and links to a few interesting articles/implementations, including one in Go (go-raft). Quite educational if you're a beginner at that sort of thing (as I am) - you might find it interesting.
Sometimes there can be pragmatic reasons to reimplement existing software, especially if your needs are simpler than the problem the library had in mind.
It can be a choice between "Write 200 lines of code that I understand and know does exactly what I need and no more" or "import 5000 lines of unknown code that might be abandoned and left to bit-rot by the original author, use 5% of its functionality and hope I can shoehorn it in without too many gotchas"
Only try this if you are sure you understand the problem domain and are very comfortable dealing with all its pitfalls. Many times (especially when dealing with numeric code) 4500 of those 5000 lines are for dealing with special corner cases and obscure gotchas that people have spent months agonizing over how to fix, and that have been battle-tested by experts over years. Stuff which won't affect you 99% of the time, but when it does will leave you having to redo all their hard debugging work.
I'm surprised this is on HN, probably the lack of downvote buttons. It's just banal and generic advice, there's zero useful information for any mildly experienced programmer there.
1. Don't put machine learning on a pedestal. – Do programmers really make this mistake? And what exactly does that mean?
2. Don't write machine learning code. – A classic programming principle, not specific to machine learning at all.
3. Don't do things manually. – Oh really? Thanks, I didn't know that.
4. Don't reinvent solutions to common problems. – Obviously, the same principle as 2.
5. Don't ignore the math. – Ok, good point, but if you want to get serious with ML, it's difficult to avoid maths.
Having done a small amount of machine learning, I can see how the advice here is "true". And by "true", I mean appropriate for the way that machine learning exists and operates today. Algorithms are difficult, temperamental, and require expert "tuning".
The sequence seems to be:
- First you learn the formal theory, the math and statistics.
- Then you learn the "squinting", the ad-hoc rules for how to apply which algorithm.
- Then implement the thing
This works better than just firing up your editor and piecing together code. However, I would claim that this doesn't actually work well, in the sense that this is kind of where AI/ML has bogged down. I mean, there are only 5 main approaches, 20 main algorithms, and whatever subsidiaries and random stuff. They don't work great, and the only progress is incremental (though there is progress, and throwing more computing power at them enhances results while masking the low amount of conceptual progress).
What's lacking is any modularity in combining algorithms. The power of ordinary programming is, essentially, using function calls to put together what you want. ML doesn't do that, and for all the magic, that makes it weak and fragile: when one magic algorithm doesn't work well, rather than improving it, it really is better, at present, in the interest of getting stuff done, to start with a different magic algorithm. This is true; I'm a realist in the sense of accepting the present but an idealist in the sense of saying "that kind of sucks, we should be able to fix problems, not surrender and regroup".
Yes, I'm happy to denigrate the good and proper in my quest for the best. But I'm an idealist; I suppose it's a matter of taste.
I would seriously look into deep learning. http://deeplearning.net/ I am doing everything with it now.
This includes principal component analysis/compression, face detection, handwriting recognition, named entity recognition, clustering, topic modeling, semantic role labeling, among other things.
There is a very common structure to this. Despite neural nets having their own baggage, they're worth understanding.
The structure you're wanting is definitely in there. Edit: Yes, a bit of self-promotion here; just making a point with patterns I've found as I've built this out.
Half the battle is understanding the linear algebra going on here. Beyond that you can pretty much do everything with one set of algorithms and terminology.
For those who go "WTF, Java, are you insane?": the core idea I'm linking to here is that deep nets are composed of singular neural networks with slight variations, with a very common structure shared by both the singular layers and the deep nets themselves.
I wouldn't claim that neural nets don't work. We've seen many use cases where they do work. (And I have looked at deep learning just a little; it may be superior in ways, but it doesn't seem in any way fundamentally different from the other stuff.)
I would add that support vector machines also work; they are similar and have much clearer math behind them [1]. But SVMs and neural nets are ultimately just linear matchers on a nonlinear pattern space, and they ultimately involve ad-hoc choices that experts learn over time.
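That "linear matcher on a nonlinear pattern space" point is easy to show concretely. XOR is the classic example: not linearly separable in its raw inputs, but a single hand-picked nonlinear feature fixes it. Kernels and hidden layers do the same lifting automatically; here the feature map and weights are just illustrative choices:

```python
# XOR labels: -1 when inputs match, +1 when they differ.
# No line through the plane separates these four points.
points = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

def phi(x):
    # Hypothetical feature map: keep the inputs, append the product term.
    x1, x2 = x
    return (x1, x2, x1 * x2)

# A fixed linear separator w.phi(x) + b in the lifted 3-D space.
w, b = (1.0, 1.0, -2.0), -0.5

def classify(x):
    f = phi(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else -1
```

The "ad-hoc choices the experts learn" are exactly things like picking that feature map (or kernel, or layer width) for a given problem.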
As I said above, once you learn the math (linear algebra, statistics, functional analysis or whatnot), it becomes less about basic understanding and more about "knowing how": a series of tweaks that experts "with a feel for this stuff" make. But this "feel"-level understanding seems to be exactly what stands in the way of serious, rational progress on the topic.
Binary compatible yes. Star the repo and watch it in the next few days. Example apps are on the way. I plan on implementing a full "easy to use" machine learning lib around this.
I have a lot more example usage in each of the tests. Test coverage was a higher priority than the documentation, but example usage is there. I'm more than happy to answer emails around the usage of the library as well. I also take feature requests.
I plan on implementing convolutional nets, recursive neural nets, and some others based around that same structure; that includes the scale-out versions with Akka for easy multithreading or clustering. (I have built-in service discovery with ZooKeeper, among other things, in there.)
Advice question - how far into the linear algebra should I go? I'm currently working my way through a book - and it's not too bad, but I really don't want to invest more than I really need. Any suggestions?
One other thing might be understanding the different ways you can manipulate data. In this case, the numerical representation is one example per row when I toss in a matrix for training. This is applicable to many machine learning problems.
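To make the "one example per row" convention concrete, here's the layout in plain Python (the feature names are made up for illustration):

```python
# One example per row, one feature per column -- the layout most
# ML libraries expect when you hand them a training matrix.
# Hypothetical features: [height_cm, weight_kg, age_years]
X = [
    [170.0, 65.0, 30.0],   # example 0
    [180.0, 80.0, 45.0],   # example 1
    [160.0, 55.0, 25.0],   # example 2
]
y = [0, 1, 0]  # one label per row

n_examples, n_features = len(X), len(X[0])

# Per-feature transformations only make sense with this layout:
# e.g. centering each column (feature) to zero mean.
means = [sum(col) / n_examples for col in zip(*X)]
X_centered = [[v - m for v, m in zip(row, means)] for row in X]
```

Once you internalize rows-as-examples and columns-as-features, most library APIs stop feeling arbitrary.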
You could look at the appendices in texts by Bishop, MacKay, Barber, and Rasmussen/Williams to see what they expect (and they expect a pretty thorough understanding). The last 3 are freely available.
This is essentially the goal of probabilistic programming[1]: let the programmer specify and tweak the model, and have the analog of a compiler handle inference. Finding a good model is then analogous to debugging.
You are mostly stuck with Bayesian models, since that's what we have general-purpose inference algorithms for. And in practice an understanding of Bayesian statistics is a prerequisite for writing useful probabilistic programs.
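To give a flavor of what that "compiler for inference" does, here's the world's smallest stand-in: you specify the model (a prior over a coin's bias plus a likelihood for the flips), and generic machinery produces the posterior. Real systems use sophisticated inference algorithms; this sketch just brute-forces a grid approximation:

```python
# Model specification: candidate biases and a flat prior over them.
grid = [i / 100 for i in range(1, 100)]
prior = [1.0 for _ in grid]

# Observed data: 1 = heads, 0 = tails (6 heads, 2 tails).
flips = [1, 1, 1, 0, 1, 1, 0, 1]

def likelihood(p, data):
    out = 1.0
    for d in data:
        out *= p if d == 1 else (1 - p)
    return out

# Generic inference step: weight each candidate by prior * likelihood,
# then normalize. This is the part a real engine does cleverly.
unnorm = [pr * likelihood(p, flips) for p, pr in zip(grid, prior)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

# Posterior mean of the coin's bias.
post_mean = sum(p * w for p, w in zip(grid, posterior))
```

The "debugging the model" analogy holds here too: changing the prior or the likelihood changes the posterior, while the inference machinery stays untouched.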
Well, you can kind of use decision forests, and similar approaches, to combine algorithms. The lack of human-readability makes it pretty tough unless you're in a domain where you can effectively apply weighted heuristics (which I found to work pretty well in NLP, albeit with a lot of experimental tweaking to arrive at the right model for my domain).
I feel like the first mistake someone makes is trying to practice "machine learning". If you're going in with that as your goal, you're likely to fail.
Instead, tell a story. Motivate what you're doing with real questions and real data and you'll be driven to do all 5 of these lessons (and many more smart things).
Every time someone comes in with a pet algorithm I cringe a bit. There's certainly an air of everything being "just marbles", and thus of every algorithm applying to every problem, but the real question is rarely about the algorithm; it's about the setup, the cleaning, the story. Even when it is about the algorithm, you're actually just trying to tell a better story.
So focus on that.
Figure out what you want to "do ML" for before you get too excited about what ML is. It's often really painful and annoying, with bug-fixing turnaround clocking in at hours or days. It's also some of the prettiest math around, and a collection of neat hacks for getting great answers to nigh-unanswerable questions.
But it's always about answering a question. Start there.
This is missing the universally true #1 mistake that (nearly) anyone commits when starting in ML: lacking a solid understanding of the problem/domain (unless you happen to be a domain expert for the problem you are working on, but that is a rarity).
If you do not know which features to choose and why, what the labels mean, which background data you should use, and, even more important, what the actual problem is that needs to be solved, you will be wasting lots of time - and not just yours...
The site looks interesting, but I keep getting a 500 error. I'd advise the owner to switch off HostGator soon; that web host has gone significantly downhill.
Here are some things with more substance. (I don't think the blog author is doing any SEO machinations, but there's just not much to learn from his last few posts. Essentially he's written data science advice similar to many other authors, but substituted the words "Machine Learning". I could dig up other ML gotchas/guidelines posts; I'd need to dig through my bookmarks.)
That aside, you really have to take it in bits. Ignoring the math or fundamentals behind it is by far the worst mistake you can make.
Once you get decent at understanding it, the points I emphasized (feature vector building) become a lot less of a problem with deep learning (http://deeplearning.net/).
Auto learned feature vectors are going to be among the best ways to do things in the coming years. More than happy to answer questions.
I got into machine learning through an article off of HN stating that random forests would get you 80% of the way (I think they were right!). For my purposes, rotation forests increased my accuracy considerably.
I have a few questions:
1. I have found that data manipulation and feature creation from a SQL database is harder than actually using an algorithm, and knowing how to extract and aggregate data seemed to be more like "throw something at the wall and see what sticks". Do you have any suggestions or information on how to extract the best data?
2. After getting a random forest going, I had a hard time figuring out which algorithm to try next, or how to figure out what would work best for my dataset. Any suggestions on how to take the next step?
1. Use what correlates best with the outcomes. Look into feature selection and principal component analysis for this. This will cause less noise due to smaller feature vectors. It also allows more digestible outcomes. I would also highly recommend visualization. Weka is great if you want plug-and-play; otherwise there's the more traditional R/MATLAB. It really depends on what you're comfortable with.
2. It depends what kind of learning you're doing. I would look into multinomial logistic regression for most supervised classification applications (more than one class). Then there's also k-means if you're looking to understand trends in your data. Keep in mind this is my off-the-shelf/simple recommendation.
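To make the feature-selection suggestion in point 1 concrete: ranking features by absolute correlation with the outcome is a crude but serviceable first pass. A hand-rolled sketch with toy numbers (a real pipeline would use a library and guard against confounders):

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: feature 0 tracks the outcome, feature 1 is mostly noise.
X = [[1.0, 3.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [1.1, 2.0, 2.9, 4.2]

# Rank feature indices by |correlation| with the outcome, best first.
cols = list(zip(*X))
ranked = sorted(range(len(cols)),
                key=lambda j: abs(pearson(cols[j], y)),
                reverse=True)
```

Correlation only catches linear, single-feature relationships, which is why PCA and proper feature-selection methods exist, but it's a quick way to spot obviously dead features.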
I would love input on a plug-and-play machine learning CLI. I planned on building out my current project into a full-blown command line app. Since it can handle most features, including automatic visualization/debugging via matplotlib, I figure that with some documentation it might be a neat tool for people who don't want to deal with feature selection but still want things simple. It's definitely a problem that there's really no clear way to build simple models. Domain knowledge is also an expensive problem.
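And since k-means came up as the off-the-shelf suggestion for finding trends, here's a from-scratch sketch of Lloyd's algorithm on toy 2-D points (illustrative only; a real library handles initialization, convergence checks, and scale far more carefully):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on 2-D points; a sketch, not production code."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each center to its cluster's mean.
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers

# Two well-separated blobs of toy points.
pts = [(0.1, 0.0), (0.0, 0.2), (0.2, 0.1),
       (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers = kmeans(pts, 2)
```

Writing it out like this also makes the algorithm's weak spot obvious: everything hinges on the initial `rng.sample`, which is why library implementations restart from multiple seeds.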
I don't think "reinventing solutions to common problems" is a bad thing. This is how we all learn how to do something new. And sometimes the new solution is better than any of the other solutions out there.
I agree that reinventing solutions to problems is definitely in the domain of the hacker.
When learning a new skill, however, you should understand what the common approach is before re-discovering the work of others. The article is about how to be more efficient in learning machine learning, not how to be a hacker.
It's akin to why (imho) better musicians learn to play other people's styles before developing their own.
[1] - https://github.com/BurntSushi/cluster