Machine Learning and AI seem to be in vogue, but they are tough to implement unless you have boatloads of data. We've personally had multiple frustrating experiences over the last ~7 years of trying to solve problems using ML. In almost all cases we failed to ship due to lack of data. Transfer Learning is a major breakthrough in ML that lets companies with little data also build state-of-the-art models. Unfortunately not enough people know about it. We are trying to do our part to make Transfer Learning easier to use as well as increase awareness about it.
Using Transfer Learning we can build a model to identify cats and dogs in images with a few (<100) images, compared to the few thousand it would take before.
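As a rough sketch of the general recipe (not our product, just plain Keras/TensorFlow, assuming a hypothetical "data/" folder with ~100 labelled cat/dog images): freeze a network pretrained on ImageNet and train only a tiny head on top.

    import tensorflow as tf

    # Reuse an ImageNet-pretrained backbone as-is; only the new head is trained.
    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False,
        weights="imagenet", pooling="avg")
    base.trainable = False

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # scale pixels to [-1, 1]
        base,
        tf.keras.layers.Dense(1, activation="sigmoid"),     # cat vs dog
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

    train = tf.keras.utils.image_dataset_from_directory(
        "data/", image_size=(224, 224), batch_size=16)
    model.fit(train, epochs=10)   # the ~100 labelled images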
To make Transfer Learning easy we are building https://nanonets.ai, which has multiple pretrained models that can be augmented with your data to create state-of-the-art models. We are currently in the process of building our first few models. Image Labeling and Object Detection (in images) already work, with a few text-based models coming up in the next few weeks.
Is transfer learning really not widely known by people doing AI? In my field, computer vision, it is used by most of the papers in the past three years in CVPR, etc. All of the students that take either my deep learning or my computer vision courses have to do assignments on transfer learning with deep neural networks.
Totally agree, everybody in the industry knows about it. However if you look at https://www.google.com/trends/explore?date=2014-01-01%202017... nobody outside seems to know. I might be wrong, but a lot of people outside the ML community seem to be hesitant to use ML because they don't have enough data; we're trying to remove that misconception if it exists.
More seriously though: others have pointed out that finetuning is pretty popular in some subfields, but it's just one hammer in a whole toolbox of techniques which are necessary to make neural nets train (even when you have a tonne of data). Standardisation, choice of initialisation, and choice of learning rate schedule all come to mind as other factors which seem simple, but which can have a huge impact in practice.
Of course, each tool has its limitations. The most obvious limitation of finetuning is that you need a network that's already been trained on vaguely similar data. Pretraining on ImageNet is probably not going to help you solve problems where the size of objects matters, for example, because most ImageNet performance tends to benefit from scale invariance.
I wish you luck with nanonets.ai, but I think it's irresponsible to market this as the "1 weird trick" to bring data efficiency to neural nets.
That graph might be more because it's not as exciting. Google searches are probably dominated by hobbyists and casually interested people. If they're not trying to achieve a specific goal, then they might prefer to work out the basics and make it themselves instead of just taking part of someone else's work and reusing it. If you were going to do that for fun, why not go the whole hog and reuse an entire pretrained network?
Personally, I'm a hobbyist and I don't want to know about these shortcuts until I start to need them - which is a stage I might never reach. People who've progressed far enough to need them are probably far fewer than those who are just curious what these words mean.
Another possibility is that the words "transfer learning" might be more generally meaningful outside the ML field than the other search terms on the graph, so most of the searches for it are really from schoolteachers or something else.
Spot on. If you take that argument a step further, it means an average developer who is not a data scientist or ML researcher might not know about it. Which implies a dead-simple technique used by most researchers is not available to the common developer, even though it is easy enough for them to use.
It is probably that transfer learning has become just another technique used for training neural network models (much like Adam being one of the most commonly used optimizers), whereas most of the excitement stems from new and/or complex neural network architectures rather than the techniques or tools that made training those networks possible.
Transfer learning in ML refers to the general idea of taking a model trained for one domain and applying it to another. However, this article seems to focus only on feature mapping, i.e. breaking down images into features using hidden layers of ImageNet models. In this case, the pretrained model is only acting as a feature extractor, because it is not trained to maximize the embedding distance between the classes you are trying to differentiate.
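For concreteness, a minimal sketch of that feature-extractor usage: the ImageNet model only embeds the images, and a separate, simple classifier is trained on top. Here `X_train`, `y_train`, and `X_test` stand in for your own small labelled dataset.

    import tensorflow as tf
    from sklearn.linear_model import LogisticRegression

    # Pretrained ImageNet model used purely as a fixed feature extractor.
    extractor = tf.keras.applications.VGG16(
        include_top=False, weights="imagenet", pooling="avg")

    def embed(images):  # images: float array of shape (n, 224, 224, 3)
        x = tf.keras.applications.vgg16.preprocess_input(images)
        return extractor.predict(x)  # (n, 512) feature vectors

    # The classifier, not the extractor, is what learns to separate your classes.
    clf = LogisticRegression(max_iter=1000).fit(embed(X_train), y_train)
    predictions = clf.predict(embed(X_test))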
You're right, the article doesn't do justice to all of the transfer learning techniques. I cut out large portions of the original draft for brevity. I will write a longer post referencing all possible techniques; hopefully it won't turn into a review paper.
You're right about the excitement being generated by new techniques in the research community. But if you look at Prisma, Google Photos, or any number of applications that have gained mass popularity, they are just implementations of existing networks and techniques. The paper on style transfer made Prisma possible; commercialization has usually been a few steps behind research.
True, most excitement in the field has been in finding new architectures and building general-purpose AI. However, I would like to point out that people often underestimate the potential of transfer learning. We have started solving real problems companies are facing right now because they lack enough data to train an accurate model themselves.
While I agree that you could build a demo in Deep Learning 101 that could work for some small set of examples, I disagree that this is 101 level material.
You could also call this Deep Learning 101. But it really isn't, because building a usable platform that works at scale, actually delivers performance, and solves problems is a lot tougher than what can be taught in an intro Deep Learning 101 course.
Just poked around a bit with your API, and the learning with just 25 samples is impressive! And getting training samples from the web is a great touch. But doesn't that 25-sample number seem too low for classes that are "closer" together? How do you quantify whether you have done a good job on training or whether you need more varied samples?
Thanks for trying us out! Internally we have validation metrics that produce a number saying how good a model is at class separation. One naive way to do this is entropy (-p log p). We are planning on exposing this to users soon, so once you create a model you'll receive feedback on how "good" we think it is. In case it's not working well we might ask you for more data (we hope we don't need to do this too frequently).
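For anyone curious, one way to compute that naive -p log p score (our actual internal metric may differ; this is just the obvious version) is the average prediction entropy over a held-out set; lower values suggest the model separates the classes more confidently.

    import numpy as np

    def mean_prediction_entropy(probs, eps=1e-12):
        """probs: (n_samples, n_classes) softmax outputs on held-out data."""
        p = np.clip(probs, eps, 1.0)
        # -sum(p * log p) per sample, averaged over the set
        return float(np.mean(-np.sum(p * np.log(p), axis=1)))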
Transfer learning is fairly well studied. However, we have seen a lot of companies facing problems that could be solved using this technique but aren't, because of lack of knowledge, availability of pretrained models, and the engineering challenges involved. We are just trying to make the process easier for them.
Another great example is the last post on the keras blog [1] "Using pre-trained word embeddings in a Keras model". You take advantage of a large pre-trained network for a text classification task.
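The pattern from that post, roughly (a sketch following the blog's approach; `vocab_size`, `num_classes`, and an `embedding_matrix` built from pretrained GloVe vectors for your vocabulary are assumed to already exist):

    from tensorflow import keras

    model = keras.Sequential([
        # Seed the Embedding layer with pretrained vectors and freeze it,
        # so only the small classifier on top is trained on your task.
        keras.layers.Embedding(vocab_size, 100,
                               weights=[embedding_matrix],
                               trainable=False),
        keras.layers.GlobalAveragePooling1D(),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])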
The OpenFace face recognition library also offers this technique. You take advantage of their large pre-trained network for face embedding: transforming a face into features distinct enough for classification. You then train another few layers for recognizing your own samples.
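Sketched out, that recipe looks like this (a hypothetical `embed_face` stands in for whatever call the library exposes for computing its 128-d embeddings; `my_face_images` and `my_labels` are your own samples, and a simple SVM plays the role of the extra trained layers):

    from sklearn.svm import SVC

    # The pretrained network does the hard part: mapping faces to embeddings.
    X = [embed_face(img) for img in my_face_images]   # list of 128-d vectors
    clf = SVC(kernel="linear", probability=True).fit(X, my_labels)

    def who_is(img):
        # Only this small classifier was trained on your own people.
        return clf.predict([embed_face(img)])[0]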
One way it can fail is that your model might overfit your data if the number of parameters you are training is far larger than the amount of training data and no regularization techniques are used. We don't arbitrarily cut off layers from the pretrained model. We check at which layer the output features are no longer specific to the problem the pretrained model was trained for. At that layer, you can stitch on a nanonet and build a model for your data.
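A sketch of that "stitch at a generic layer" idea in plain Keras (the cut-layer name and `num_classes` are just placeholders; the point is to freeze everything below the chosen layer and keep the new head small and regularized):

    import tensorflow as tf

    base = tf.keras.applications.VGG16(
        include_top=False, weights="imagenet", input_shape=(224, 224, 3))

    # Cut where the features are still generic rather than ImageNet-specific.
    cut = base.get_layer("block4_pool").output
    backbone = tf.keras.Model(base.input, cut)
    backbone.trainable = False

    x = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    x = tf.keras.layers.Dropout(0.5)(x)                      # regularization
    outputs = tf.keras.layers.Dense(
        num_classes, activation="softmax",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)

    model = tf.keras.Model(backbone.input, outputs)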
Transfer learning works pretty well for image classification related tasks.
Some other areas are much more challenging. For example, in natural language processing tasks you will sometimes see some benefit from using pretrained embeddings, but it is very task and model specific. There's some exciting work going on in this area though.
Yeah, correct. Even with text, there is some exciting work going on. We are in the process of building a text-based model that should work for multiple problems, and we should put it up this week.
Great feedback, I really appreciate you taking time to read the post. Though I would like to make a few notes regarding what we are trying to achieve and some improvements we've made:
1) Our target audience is someone who hasn't taken Deep Learning 101 but wants to solve a problem
2) We are focusing on users who don't want to set up their own deep learning machines, don't want to learn how to use TensorFlow/Keras/Caffe/Theano, and don't want to spend time maintaining their own boxes or engineering effort ensuring SLAs, uptime, and scalability
3) We have made improvements in both the model we use for our product and the way it is retrained. It's not the same as the TensorFlow example
4) The model is trained on a different dataset than ImageNet and provides additional value in being better suited to certain tasks.
Worst case, we have something nobody wants, and that's valuable insight in itself. Best case, we have taken something people learn in Deep Learning 101 and made it usable by anybody, so they can skip the setup and get straight to solving problems.
Another less known but really promising approach is program synthesis (also called "program generation"). One can build a fairly robust model using just 2-5 examples, and in just seconds. An implementation of this approach already ships in Excel, where you enter a few examples of formatting and "Flash Fill" learns what to do: http://research.microsoft.com/en-us/um/people/sumitg/flashfi...
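A toy illustration of programming-by-example (nothing like the real Flash Fill DSL, just to show the flavor): enumerate a small set of candidate string programs and keep the first one consistent with every example given.

    def candidate_programs():
        # A tiny, hand-written "DSL" of string transformations.
        yield ("upper", str.upper)
        yield ("lower", str.lower)
        yield ("first word", lambda s: s.split()[0])
        yield ("initials",
               lambda s: ".".join(w[0].upper() for w in s.split()) + ".")

    def synthesize(examples):
        # Return the first program consistent with all input/output pairs.
        for name, prog in candidate_programs():
            if all(prog(inp) == out for inp, out in examples):
                return name, prog
        return None

    # Two examples are enough to pin down "initials" here:
    name, prog = synthesize([("ada lovelace", "A.L."), ("alan turing", "A.T.")])
    print(name, prog("grace hopper"))   # -> initials G.H.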
This is cool. From what I understand from the paper, its DSL has a set of algorithms as building blocks that learn the input/output function. Deep learning algorithms are trying to do the same but with more generic blocks, where the assumption is that many of these blocks will be able to learn algorithms too. Deep learning takes a more generic approach, and transfer learning helps reduce the number of examples needed by reusing what has already been learned.
I think Deep Learning is very frustrating to work with at the moment. First, there is the problem of overfitting, which shows up typically after you've already been training for hours. So you have to tweak things (basically this is just guessing), and start from scratch. If your network has too many neurons, then overfitting may more easily occur, which is a weakness in the theory because how can more neurons cause more problems?

Then there is the problem that if your data is somewhat different from your training data in what humans would call an insignificant way, your network may easily start to fail. For example, when doing image classification, and your images contain e.g. a watermark in the lower-left corner, suddenly your recognition may start failing.

I've been able to use DL for some projects successfully, but for other projects it has been an outright failure with many invested hours of training and tweaking.
> which is a weakness in the theory because how can more neurons cause more problems?
In exactly the same way that adding more terms to a polynomial fit causes more problems. This is one of the most fundamental results in the theory of statistical learning in general; don't blame Deep Learning for it.
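A tiny illustration of the same effect with polynomial fits (toy data, so the exact numbers will vary): as the degree grows, training error keeps shrinking while test error blows up.

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(0, 1, 10)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=10)
    x_test = np.linspace(0, 1, 100)
    y_test = np.sin(2 * np.pi * x_test)

    for degree in (1, 3, 9):
        coeffs = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, round(train_mse, 4), round(test_mse, 4))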
Yes I know, it was a rhetorical question. Imho, if having more parameters causes problems, then the system should simply not use those extra parameters. But the theory is not there yet.
I think his point is no one can tell you from theory which regularization methods to apply to a particular problem to get the best results. You need expert knowledge, experience, and hyperparameter tuning.
Transfer learning helps with overfitting too. You generally get a more generalized model with transfer learning than by training a model from scratch on the same data (even with large datasets). You do need expertise in deep learning, but the good thing is that you don't need a lot of expertise in the domain of the problem.
RF is appallingly difficult to re-use for inference, though. At least with a DNN or CNN you can pop open the hood and see what the model is doing at various points.
Tradeoffs, tradeoffs everywhere. It's almost like traditional mathematical statistics has something to offer them fancy machine learners. (Breiman was a professor of statistics, after all... ahead of his time, but no less a statistician.)
There are a lot of solutions for the problems you mention. For overfitting you can do data augmentation, normalization, dropout, and early stopping on a held-out validation set (and probably improve your dataset).
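A sketch of what those knobs look like in Keras (assuming `model`, `x_train`/`y_train`, and a held-out `x_val`/`y_val` already exist; dropout would be a layer inside `model` itself):

    from tensorflow import keras

    # Data augmentation: random perturbations so the model can't memorize.
    augment = keras.preprocessing.image.ImageDataGenerator(
        rotation_range=15, width_shift_range=0.1,
        height_shift_range=0.1, horizontal_flip=True)

    # Early stopping: quit when the held-out loss stops improving.
    early_stop = keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=5, restore_best_weights=True)

    model.fit(augment.flow(x_train, y_train, batch_size=32),
              validation_data=(x_val, y_val),
              epochs=100, callbacks=[early_stop])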
More neurons means more parameters to adjust to your data, so overfitting is more likely to happen. It is like interpolating a function: the more parameters you use, the more overfitting you get.
> if your data is somewhat different from your training data in what humans would call an insignificant way, your network may easily start to fail
Humans call it insignificant because we have deep knowledge of a lot of domains, whereas the network has been trained for a specific domain. So if you train the network on one distribution and then test it on another, it is not going to work. That is fairly obvious, I think.
Deep learning works incredibly well. It works so well that it outperforms humans in some domains. So you may want to rethink what you are doing, because I think (though I may be wrong) the reason you are failing at applying deep learning is something related to your process and not to deep learning.
This does seem to be holding back a lot of people from trying out DL in production. There are some advantages to using transfer learning in the cases you mentioned (e.g. models not generalizing over watermarks or differences between training and testing). Although there are still quite a few cases where even the best pretraining and large datasets don't work. Two major areas of advancement (current research) are automatic model architecture selection and automatic parameter tuning, both of which help make DL more accessible.
Not to be a pedant, but I think the DeepMind paper is actually an example of one-shot generalization, but not learning. From the paper:
> Another important consideration is that, while our models can perform one-shot generalization, they do not perform one-shot learning. One-shot learning requires that a model is updated after the presentation of each new input, e.g., like the non-parametric models used by Lake et al. (2015) or Salakhutdinov et al. (2013). Parametric models such as ours require a gradient update of the parameters, which we do not do. Instead, our model performs a type of one-shot inference that during test time can perform inferential tasks on new data points, such as missing data completion, new exemplar generation, or analogical sampling, but does not learn from these points. This distinction between one-shot learning and inference is important and affects how such models can be used.
Absolutely. One-shot learning is the cutting-edge research direction towards building more human-like AI. However, it's still in its early phases. We are trying to make transfer learning, which is proven to work today, directly usable by people trying to solve problems. Hopefully we will be able to do the same with one-shot learning.
It should be noted that transfer learning is an umbrella term for many ideas that revolve around transferring what one model has learnt into another model. The method described here is a type of transfer learning called fine tuning.
Correct. There are multiple different ways to transfer knowledge between tasks. We are talking here about transfer learning with deep neural networks, where it is proven to work with several advantages over training a model end to end on your own. Even with transfer learning there are multiple decisions you need to make, like which layer to transfer from and how much fine-tuning to do based on how much data you have, which we are trying to automate.
Yes, transfer learning is a fairly broad umbrella term encompassing a lot of different approaches. We tried to give an example of the one most commonly used with NNs, almost exclusively with regard to feature extraction. Do you have a resource that lists a variety of transfer learning approaches? Happy to work with you on creating an aggregated list.
Well for starters, fine-tuning can be done in a variety of different ways. You can pretrain your model with a larger, different dataset, or you can train an autoencoder that learns some useful representation of that larger dataset and use the encoder as a base for fine tuning.
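A rough sketch of that autoencoder variant (toy sizes; `x_unlabelled` is the large dataset, `x_small`/`y_small` the labelled set you actually care about, and 784/10 are just placeholder dimensions):

    from tensorflow import keras

    # Pretrain: learn a compressed representation of the large unlabelled set.
    inputs = keras.Input(shape=(784,))
    encoded = keras.layers.Dense(64, activation="relu")(inputs)
    decoded = keras.layers.Dense(784, activation="sigmoid")(encoded)
    autoencoder = keras.Model(inputs, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(x_unlabelled, x_unlabelled, epochs=20)

    # Fine-tune: reuse the encoder as the base and train a small head on it.
    encoder = keras.Model(inputs, encoded)
    clf = keras.Sequential([encoder,
                            keras.layers.Dense(10, activation="softmax")])
    clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    clf.fit(x_small, y_small, epochs=10)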
Another approach I've seen that was really cool is Model Distillation [0], which is basically the training of a smaller NN with the inputs and outputs of a larger NN (where the output is slightly modified to increase gradients and make training faster).
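For reference, the usual way that output "softening" is done is with a temperature applied to the teacher's logits; a sketch of the distillation loss (not necessarily the paper's exact formulation):

    import tensorflow as tf

    T = 4.0  # temperature: higher values soften the teacher's distribution

    def distillation_loss(teacher_logits, student_logits):
        # Train the student to match the teacher's softened outputs.
        soft_targets = tf.nn.softmax(teacher_logits / T)
        log_soft_preds = tf.nn.log_softmax(student_logits / T)
        return -tf.reduce_mean(
            tf.reduce_sum(soft_targets * log_soft_preds, axis=-1)) * T * T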
Thanks for this write up. I'm not in the ML field, but do follow it a bit and didn't know anything about it.
I also really like your business model. I had argued with potential entrepreneurs and friends that AI is becoming a commodity. However, your business model has the potential for building a network effect of data on top of AI, presumably becoming more valuable with time.
I do think, though, that you'd probably best solve this for a specific vertical first as your go-to-market strategy.