Machine Learning for Developers (xyclade.github.io)
380 points by haifeng on Oct 8, 2015 | 93 comments



Is anyone else at least a bit worried about a bunch of developers running around doing "machine learning" without much understanding of mathematics and probability? E.g. consider the creation of fragile models that overfit data being used in finance, infrastructure, medicine, etc.
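
For a concrete illustration of the kind of fragility I mean, here is a minimal sketch (assuming scikit-learn and numpy are installed) of a model that looks perfect on its training data and is useless on anything new:

    # Overfitting in miniature: memorizing noise looks like success on the training set.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(400, 5)            # five features of pure noise
    y = rng.randint(0, 2, 400)      # labels unrelated to the features

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = DecisionTreeClassifier().fit(X_tr, y_tr)   # unlimited depth -> memorizes the noise

    print("train accuracy:", model.score(X_tr, y_tr))  # ~1.0, looks like it "works"
    print("test accuracy:", model.score(X_te, y_te))   # ~0.5, no better than a coin flip

Without the held-out test set, nothing about this model tells you it has learned nothing at all.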


I'm a little bit worried. At least at the same level as when I see a bunch of developers compiling programs without much understanding of what an LL(k) parser does, or how a pushdown automaton works, or what a Turing machine is. I usually feel the same every time I see an elevator without a liftman, compute a square root without doing it by hand, or hear about Google's self-driving cars.


The difference is that the software or the elevator will work, but the statistical model is wrong and doesn't work. It's like an elevator that only lifts people above 120 and below 90, and for the others it just doesn't work or takes you to the wrong floor.


> The difference is that the software... will work

Lots of software doesn't work. Is there a substantial difference between putting an overfitting model in production, and putting a poorly tested program in production?


I think there might be. When ML fails, the only individual capable of noticing is someone who understands the math. When code breaks, often the "lay" user notices. The result is obvious to a novice. When ML fails it looks like a duck, quacks like a duck, but after multiple years of study it's immediately recognizable as an antelope. Though to disagree with my own point, security vulnerabilities have a similar profile. In essence, to all but the highly trained the difference is imperceptible.


>"When code breaks often the "lay" user notices. The result is obvious to a novice."

That depends "how" it breaks. As a novice coder myself, I've had things go wrong that I don't notice or can't identify, and it looks like my program is running fine.

I think that's the parent's point: it might be stupid to put crappy machine learning models into production, but it isn't worrisome. It's expected.


I hear ya. I knew that assertion was going to draw some criticism, as it's a judgement call about where we draw the line. Who's a novice and what's obvious? However, I can't get away from my nagging impression that statistical validity is not inherently clear even to the absolute best practitioners. Causality is the goal, and it's notoriously difficult, even for world-class minds. In my experience the only similar effervescent specter for software development is in security. Such circumstances, it seems to me, require great humility and introspection about one's abilities, but I suppose a little of that would go a long way in general too!


That the program is likely to fail loudly and obviously, but the overfitted model will just sit there being subtly yet perniciously much wronger than you think it is, forever.


I would like to see some ML applied to stop lights with a fallback to the PLC with timers if that fails. It would save the nation a lot of gas.


How many elevators were failing today? How many were fixed today? And how many have been monitored in real time to get fixed as they fail? Likewise, statistical models can be monitored and fixed automatically, and, of course, even so they will fail from time to time, like elevators do.


Well someone clearly has a horse in this race.

Your comment is a bit obtuse; developers are not creating parsers and compilers. Elevators and calculators are very robust technologies that already work.

What I worry about is a new wave of engineers and developers thinking they understand statistical models and then proceeding to work at the big banks and have their models blow things up. If PhDs can make such disastrous non-robust models, how on earth is a random developer who took a summer course not going to do the same?

Now if the banks actually failed on their own, then by natural selection the less skilled would be out of jobs and people would stop trying to "short-cut" gaining this type of knowledge. But that's not what happens. Academics keep writing papers and hyping up specific techniques they can give conferences on, and the taxpayer bails out the idiots at the top.


I'm a developer with very little statistics knowledge who at one time had extensive math knowledge, but I haven't applied it in so long that I don't recall most of it these days.

I worked in the finance sector as a (lead) developer on a production use trading system for quite some years. None of the developers had formal math or statistics knowledge to the extent required to develop this system.

This didn't really bother anyone in the least, despite the fact that a mistake could cause a loss of millions of dollars in practically no time at all...

The reason for this wasn't because anyone was ignorant enough to think the programmers knew what was going on. It was because it was a finance company that also employed plenty of mathematicians, statisticians, physicists, and others with the proper math/stats background. The programmers wrote the code, but the math/stats people wrote the business rules and the formulas and extensively tested that the system worked correctly against a large enough variety of models with expected outcomes that they were able to have sufficient confidence that the system reward/risk measurements were appropriate.

So my answer would be no; I don't find this all that troublesome. We can't be experts at everything and smart companies realize this so they should be creating teams with the correct skillset to be successful.


It's different. In ML the model, the analysis, and the insights, are the product. In general software engineering, your compiler is not your product.


Don't think so. The data and how it's represented is the code. ML is the compiler.


This is overly pedantic. The code is doing machine learning. In order to write and understand the code you have to understand the machine learning algorithms. Before you even choose a model and tune the parameters you have to know how the parameters interact and how different models work.


Is that any different from a bunch of developers plugging in magic numbers into a formula that they made up, which (to a first approximation) is roughly what happens now?

Realistically, the outcome will be the same as it is now: those firms whose models don't reflect reality will blow up, those whose models do will get bigger, a few will get too big to fail off some very confidently expressed models and make a lot of people mad at them, and eventually the market will straighten out who's lying and who's not. Won't be painless, but then, capitalism never is.


> those firms whose models don't reflect reality will blow up,

I'm picturing one of those dystopian films/novels where the main character is deleted/fired/jailed as a result of an algorithm error. Yes, in real life the trends will overcome the bad models. But just think of the potential for harm on an individual basis!


That's sorta the way society has functioned for millennia. In the 20th century alone, hundreds of millions of people died miserable deaths because some guy in power had an incorrect model of reality.

It sucks, it's not fair or just, and everything would run more smoothly if we were omniscient beings living in a completely egalitarian society. Unfortunately, that's not the reality we live in. In the meantime, we accept it as simply fate or vulnerability, and muddle through as best as we can.


Isn't that just the plot of Minority Report?

Think of the possibilities of machine learning for detecting precrime!


It's already here. See http://www.predpol.com/


I think the OP meant the use of ML to predict an individual committing a crime, and jailing/arresting them based on that.


Yes, but I also worry about a bunch of theorists writing substandard code that is unreadable and unmaintainable.


And creating "fragile" models because they don't have the tools to reproduce their own experiments. How many authors of academic papers in ML could reproduce the exact same results a year later? I would guess around 10%.


This pisses me off so much. I'm not a mathematician, but I like to think I'm a pretty good programmer. I feel like I could pick up a mathematical concept described in a computer science paper more easily if I could actually see the damn code and run it myself. But most of the papers I've read haven't mentioned where to find the referenced source code or, if they do, it's either horribly written and only runs on the author's machine or it requires specialized software that only a university could afford.


From my interactions with researchers in ML, most of them are actually pretty good programmers. There just isn't an incentive to make your code clean:

1. There isn't much correlation between quantity or even quality of papers you publish and the quality of your code. Meaning, writing cleaner code is not going to help you get that postdoc or faculty position.

2. Doing research is full of stops and starts and branches that fail and approaches that get thrown out. It's a waste of time to write clean code since you know it'll most likely be thrown out. When you do get an approach that works, you publish your paper and move on.


> most of them are actually pretty good programmers.

What is 'good'? In 'software development' 'good' is usually connected to writing clear, maintainable, test-covered code. In most scientific research it means something completely different. I think on HN most adhere to the former definition of good, and in that sense most researchers (especially in physics, but also CS/ML) are not 'good' according to that definition (you usually need quite a few years of experience in a corporate setting for that), and arguably even bad. But the code works and implements the concepts in their papers, so they are 'good' in that respect. It's more rapid prototyping to make a POC to show it works, after which you properly rewrite it.


Good = they are capable of "writing clear, maintainable, test covered code" if they wanted to.


Maybe you can expect more citations if other researchers can examine your code.


A decent CS undergrad degree a decade ago included abstract math concepts. I took Engineering Math, Information Theory, Numerical Analysis, Probability, and Simulation in my sophomore and junior years. NLP and AI were electives in senior year.

As juniors, we were building toy programs that did operations-research-type work: solving linear equations via various matrix operations, designing optimal queueing processes based on Poisson processes.

Assuming a software engineer is a CS undergrad, he/she most likely has a good footing to learn more by themselves.


In all fairness, that's pretty atypical of a standard CS degree. In my anecdotal experience (knowing people that went to Stanford/Berkeley/MIT/CMU), most people take at most 1 probability class, 1 linear algebra class, and maybe 1 AI/ML class. Info theory, NLP, numerical analysis, optimization, etc. are not at all common.


Or just got a CS degree too long ago. I got lots of discrete math - formal methods, automata theory, and number theory. All that stuff that's in Knuth. But no number-crunching beyond matrix inversion and Fourier transforms.


Bad CS school student here, we don't take any Math aside from a very basic "discrete structures" class, which is simplified discrete structures :/ Wish we did more math...


That all seems like quasi-maths, however. What was hopefully being referred to earlier is algebra and analysis, at least up to the 2nd iteration, so one has a real understanding of methods of proof.

E.g - no probability class that doesn't require analysis 1 and 2 is truly a probability class.


>> doing "machine learning" without much understanding of mathematics and probability

In my understanding, "machine learning" is just a buzzword for good old-fashioned data mining, which is still a part of applied mathematics/statistics. Just because it involves computers doesn't mean it belongs to CS.

So what you have written sounds to me like "doing applied statistics without much understanding of mathematics and probability". And yes, I am worried about it.


Nah. The only topic I would be worried about is cryptography, when used in a non-learning context. That has a high potential to cause harm. Otherwise with machine learning, I don't see how it is necessarily more dangerous than any other software -- databases, network protocols and so on...


Databases, networking protocols and so forth are hardened, relatively speaking (the occasional Heartbleed or PoW-blockchain fork aside). If you have autonomous systems built on top of hardened infrastructure but behaving according to ML models, the impact of their wrongdoings is exponentially higher. It's really about top-level autonomy through ML models: from flash crashes to (future) autopilots. The same effect of severity vs. position in the control hierarchy goes for human organizations. A cashier can defraud a company of a couple of hundred dollars, the C-suite at Goldman Sachs / Enron for a multitude of that. The invention of the corporation, as much risk as it entails, was a milestone in human progress, though. So yeah, let the predicting, but in their entirety not quite predictable, models run the world. It's worth it.


On top of this, I would add that general trends in information workflow / technological advancement, which ML models like this running the world would certainly fall under, are as close to unstoppable forces as we've ever seen due to the complexity and power of the smaller trends that cause them.

Basically, if this is going to become a thing, then there is no stopping it.


Not really. What will have to happen is a readjustment for realizing that many models, especially made by people with little experience and training, will be wrong.

Right now it's the glory days of ML when nobody much has the ability to judge success. Unlike software engineering broadly, where these glory days just keep going, ML is all about measuring success. People will detect failures.

The real risk is when people systematically underestimate the risk, like the copula thing with the subprime market. That was anything but untrained people using models; the models would not have been as dangerous as they were if they weren't so damn good to begin with. This is a robustness failure, not a poorly trained workforce failure.


ML is the next commodity on the development stack. It is good to worry over the next few years, but after that, there should be a bunch of pretty solid tools out there for developers to work with. I believe I am among the people working on these tools.


The main goal of this article is to help developers understand some of the basics of machine learning, not to make everyone think they can be an expert data scientist after reading just this one article online. This is also stated clearly throughout the article: there are no golden rules for finding good features, getting the data right, etcetera. Given that, I don't think you should be worried. Additionally, lots of testing and validation mechanisms (and people) are involved in the complex systems you say you're concerned about.


No, because their businesses won't work if they don't understand that stuff. It took us a while to learn it in the anti-spam world, but not that long in the big scheme of things.


No. You won't get hired to do machine learning just because you did a Coursera course and read a few books. If you do, you know you're working at a dead end.


9 times out of 10 it will be a clueless manager who read in a Gartner report that ML is the next big thing, so they'll put some programmers on it; the programmers will click their heels together at learning something new; nothing will come of it, except that they funded the platform for the much smaller group of people who'll actually use it for something useful. Win-win.


There are plenty of people who don't know their algorithms and data structures writing production code for large companies.


I always assume that if a person wants to do much better in ML, they will try to learn the stuff they didn't know before, like statistics, math, etc. Everyone's knowledge is limited, but that doesn't limit what people can do; they just need to learn more, I guess.


"Developers" should gain expertise in their problem domain, including understanding the mathematics and probability behind it.


Not really. What will have to happen is a readjustment for recognizing that poorly designed models just don't work very well.


What I worry about is the hubris that is prevalent in the tech industry.


Please do not disable zooming. It makes life unnecessarily difficult for those on iOS devices who do not have perfect eyesight.


Is there a decent introduction to Scala as a language with real-world ML? I've come across various ML primers that go into detail on PCA, linear regression, etc., but not any that show real-world ML usage, i.e. if a person listens to music of type X they'll also like Y, face detection, etc.


You also don't see things about feature engineering, how to know what to do next to improve things, etc. ML is an art, and nobody covers this much :(


Exactly my thoughts. It's one thing to "know" about the curse of dimensionality, but one doesn't really get to know these things unless they are applied to solve problems.


Check out our project, KeystoneML (http://keystone-ml.org/) - it's geared toward large-scale machine learning in the realms of computer vision, NLP, and (soon) speech. The design is modular and engineering-friendly and is quite focused on real-world applications: e.g. start with pictures, do the feature extraction, PCA, linear regression, etc., and out comes a classifier.


I would have preferred it if the author had pointed to an edX course on Data Science and Machine Learning. Python is slowly becoming the gold standard for machine and deep learning. Since Python has been very strong in the scientific and artificial intelligence communities, there is a large corpus of knowledge. Given how easy it is to go from experiment to a live web service using Python, you don't need to fiddle with hundreds of XML configuration files and infrastructure just to get it to work. Also, with Anaconda and Jupyter you can share your knowledge so easily. Julia is catching up, which is good, but it's still very far from Python.


I would love to see something like this in Elixir :)


That was my first thought as well, but as I understand it Elixir seems to falter when it comes to computationally heavy stuff. Perhaps it could make up for it with its amazing concurrency and scalability?


Erlang has good C/OS process interop, so the best route would probably be to capitalize on something written in a faster language to do the raw processing and have Elixir there to coordinate resources and report/store results.


Wonder if there are similar articles about beginner image classification.


This is fantastic. For deep learning in Java or Scala to feed feature vectors into Xyclade, we built http://deeplearning4j.org


Java and Scala? Who uses that in ML? Python has long been the best language for ML, with some competition from Matlab.


> Java and Scala? Who uses that in ML? Python has long been the best language for ML

You're kidding, right? Java has been extremely popular for ML for a long time. Not to cast any shade on Python, but I'd say Java and Python are roughly equivalent in this regard. Both have good libraries for various ML tasks and both are very popular in the domain.

For reference, a quick search on mloss.org finds 84 projects identified as "Java" and 105 identified as "Python". So while Python has a small edge in sheer numbers, I think that supports my assertion that they are roughly equivalent in terms of their popularity for ML.


I feel that much like programming in the corporate world, Python is often used to teach ML while Java is more often used to implement it.


Two of the three most popular deep learning libraries have Python front ends (Theano and Caffe). The third (Torch) uses Lua.


There is a large amount of research, including some hedge fund models, that is done exclusively in Clojure, on the JVM, for machine learning. Just because public libraries are often in Python doesn't mean it is the language of choice for the big guys (it often isn't). You'd be surprised how much in-house ML stuff is done on the JVM.


I guess I should have clarified that by ML I really meant DL, as this is the most important area of ML currently. Is there anyone writing large CNNs or RNNs using Java or Scala or Clojure? Are there any widely used DL libraries based on those languages?


There's definitely a popular DL library in Java:

http://deeplearning4j.org/

And at least one seemingly fairly current NN library:

https://github.com/ivan-vasilev/neuralnetworks

And an older "pre deep learning" NN library called Neuroph:

http://neuroph.sourceforge.net/

and another older one called JOONE:

http://sourceforge.net/projects/joone/files/joone-engine/

So in general, the answer is "yes" as to whether or not people are doing Neural Network / DL work in Java. I can't tell you how much such work is happening, or really compare Java/Scala to Python, etc., at that level of granularity though.

And just for a little bit more perspective: IBM Watson is (or was) apparently largely Java based:

http://www.drdobbs.com/jvm/ibms-watson-written-mostly-in-jav...


Ok, I see. Though I'm not sure why anyone who wants to write DL code today would go with anything other than Python on top of CUDA, or just using one of three main DL libraries (Caffe, Torch, Theano).


This is just a strange thing to say. There are so many languages out there with very interesting features, I'm not sure why you insist python is the only obvious choice. In my experience, all of the advanced research for proprietary companies in this area is not being done on Python, at least not those who are willing to speak at conferences. There's a lot of GPU computing also available to Java, if you think that's the reason Python is the only option. Python is certainly more widely used as a teaching language, so I guess you might see more libraries that are widely known because of universities in the academic environment, but I'm not sure why you think that means that it's the only language that major institutions are using? Because the reality is that almost all the cutting edge stuff that I've read about is not being done in python at all.


Can you give an example where cutting-edge research is done using something other than Python or Matlab? The only exception is Facebook using Torch. In fact, Python's dominance in DL is not just my opinion. Even Java devs admit it while trying to justify using Java for DL [1]: "We’re often asked why we chose to implement an open-source deep-learning project in Java, when so much of the deep-learning community is focused on Python." [1] http://deeplearning4j.org/compare-dl4j-torch7-pylearn.html


Sure, how about Prismatic, a big San Fran ML shop catering to several industries. They are nearly 100% Clojure.


From looking at their website Prismatic appears to be a start up building APIs to access ML tools and providing services for enterprise customers. I looked at their job postings and they don't seem to be very research oriented. What makes you think they are doing cutting edge research in DL?


No one can say for sure what Prismatic is doing behind the scenes since it is a commercial entity but I've seen them speak at a couple conferences, which led me to believe they are doing some pretty novel ML studies. I could be wrong. But, your initial point was to suggest that a company shouldn't use anything other than Python when starting new ML projects, and Prismatic is an example of a company doing just that: using Clojure in this case. The article you mentioned earlier gives some good reasons why Python is often not the best choice for ML: speed, security, portability are often the reasons for leveraging the JVM (where Clojure lives). This is not a knock against Python, just pointing out that there is major stuff happening in ML that has nothing to do with Python. NYU's machine learning lab prefers C++ over other languages, and major projects like Siri and Watson rely a lot on C++ and Prolog. The point is that it makes no sense to claim that Python is the only language anyone should consider for these tasks.


That's a highly opinionated statement; lots of languages are used for machine learning, Python certainly has no monopoly on the field. There are significant entities using Clojure, Lua and other languages to do interesting things in this arena.


It seems like a no-brainer that one of the largest programming languages in the world should have strong machine learning tools. Maybe the reason ML isn't done in Java and Scala is simply because the tools haven't been there. Python's great, but if you want to do ML/DL with distributed systems working with Hadoop and Spark, it makes more sense to do that in Java and Scala.


Python isn't used for much ML in the field from my experience. It is heavily used for teaching and learning about ML - but for actual production ML, I've seen mostly compiled languages. The main reason is that ML is highly parallelizable and Python isn't terribly good at that. Plus you need to crunch large datasets and speed becomes important.

So, respectfully, lots of people use languages other than Python for ML, and I doubt if Python is even the largest deploy base of ML.


I've seen Python (and R) used all the time for exploratory ML. Do all of your feature extraction, feature selection, parameter tweaking, and backtesting in Python, and then once you have a model that works reasonably well, port the feature extraction for only the features that actually work well over to a compiled language like Java or C++, train your models on lots of data, and do your actual classification in the compiled language.

Most ML is an iterative process, and the final model that's used in production is just the tip of the iceberg of the development work that went on. Python works as well for exploratory programming there as it does for any other domain.
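
A rough sketch of what that exploratory loop might look like (assuming scikit-learn; the synthetic dataset is just a stand-in for whatever features you've actually extracted):

    # Exploratory phase: try a few feature-selection sizes under cross-validation
    # and see which configuration holds up before committing to anything.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    X, y = make_classification(n_samples=500, n_features=50,
                               n_informative=10, random_state=0)

    for k in (5, 10, 25, 50):
        pipe = make_pipeline(SelectKBest(f_classif, k=k),
                             LogisticRegression(max_iter=1000))
        scores = cross_val_score(pipe, X, y, cv=5)
        print(k, round(scores.mean(), 3))

Once a feature subset and model hold up under cross-validation, only that feature extraction gets reimplemented in the compiled production system.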


In research, Python is extremely popular since all the number crunching is done with numpy or Theano, which use BLAS and CUDA.
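
Which is also why the interpreter overhead rarely matters; a tiny sketch (assuming numpy is installed) of where the time actually goes:

    # The multiply below runs in the underlying BLAS, not in interpreted Python loops.
    import numpy as np

    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)
    c = a.dot(b)   # one call, billions of floating-point ops in optimized native code
    print(c.shape)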


I've seen Python used heavily in ML where much of the work is done on the GPU anyways. There are plenty of libraries for that, and Python's slow nature isn't as relevant (though it still becomes relevant when CPU processing is the bottleneck!).


I work with a team that is doing large scale ML and CV on millions of pages and videos. It's a production system tied to millions of dollars in revenue. Everything's done in Python.


The cool thing is that you can use many languages to do ML! Once you know the concepts, they can be implemented in dynamically-typed languages like Python, R, or JavaScript, or statically-typed languages like Scala, Java or C++. Functional or imperative, there are many ways to skin the ML cat. We're lucky to have such a diverse set of tools!


People who do ML for a job?

There's plenty of heavy duty machine learning libraries and implementations for big data platforms. Just spark alone has a fairly high quality one:

http://spark.apache.org/docs/latest/mllib-guide.html


Spark, and its machine-learning module MLlib, are Java/Scala, although they have a Python API.
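
For what it's worth, a minimal sketch of the Python API side (assuming pyspark is installed and a local master; the toy data is purely illustrative):

    # Train a classifier on an RDD of labeled points with Spark's MLlib via PySpark.
    from pyspark import SparkContext
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext("local[*]", "mllib-sketch")
    data = sc.parallelize([
        LabeledPoint(1.0, [1.0, 0.0]),
        LabeledPoint(1.0, [0.9, 0.1]),
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(0.0, [0.1, 0.9]),
    ])
    model = LogisticRegressionWithLBFGS.train(data)
    print(model.predict([1.0, 0.0]))   # expect 1
    sc.stop()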


[deleted]


>Quick job search shows that there are many companies out there that do ML in Scala, e.g. Sharethrough, Teralytics, LeadIQ.

What do you use to do a search that specific?


Sounds like he used LinkedIn to me.


I favor Python as well, but from what I see in industry, Python is best for exploration/initial experiments and Java (and recently, Scala) is for production.


Scala is actually getting quite popular for machine learning, as it translates well to parallelization.


Our data scientists are learning Scala and Spark (MLlib) as a replacement for Python and R. So sure, maybe Python has long been the "best language for ML", but at one time in the not-so-distant past "MySpace was the best social network" too.


What kind of tasks are your data scientists working on?


Mostly anomaly detection.


Interesting. What are the reasons for this transition from Python and R to Scala and Spark?


Java and Scala are the main languages of Spark, one of the most popular large scale machine learning tools out there.


What do you recommend is a good primer (something similar to the article OP posted) for Python folks?


I feel like I've read literally this exact subthread on HN twice before. Am I going insane?



