Ideally, I would like to be good enough to get a job at an AI/robotics startup. I already have a CS degree, a decent math background, and am working as an embedded software developer for a large company.
I think the question is a little too unspecific for there to be a good answer. The field is vast and depending on which thing in computer vision you want to tackle the best learning paths may vary greatly. Just to give a bit of an overview:
Before the Deep Learning craze started in 2011, more classical Machine Learning techniques were used in CV: Support Vector Machines, Boosting, Decision Trees, etc.
These were (and still are!) used as a high level component in areas like recognition, retrieval, segmentation, object tracking.
But there's also a whole field of CV that doesn't require Machine Learning at all (although it can benefit from it in some cases). This is typically the area of geometric CV, like SLAM, 3D reconstruction, Structure from Motion and (Multi-View) Stereo; generally, anything where you can write a (differentiable) model of reality yourself using hand-coded formulas and heuristics, then use standard solvers to obtain the model parameters given the data.
Whenever it's too hard to do that (for example trying to recognize many different things in images) you need a data-driven / machine learning approach where the computer comes up with the model itself after seeing lots of training examples.
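The "hand-coded model plus standard solver" recipe mentioned above can be sketched in a few lines. Below is a minimal, hypothetical numpy example: fitting a 2D similarity transform (a toy stand-in for the geometric models used in SfM/SLAM) to point correspondences with ordinary least squares.

```python
import numpy as np

# Synthetic "reality": points moved by a known similarity transform.
rng = np.random.default_rng(0)
src = rng.uniform(-1, 1, size=(20, 2))
s, theta, t = 2.0, 0.3, np.array([0.5, -0.2])
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
dst = s * src @ R.T + t

# Hand-coded linear model: x' = a*x - b*y + tx,  y' = b*x + a*y + ty
n = len(src)
A = np.zeros((2 * n, 4))
A[0::2] = np.c_[src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)]
A[1::2] = np.c_[src[:, 1],  src[:, 0], np.zeros(n), np.ones(n)]
b_vec = dst.reshape(-1)

# Standard solver recovers the model parameters from the data
params, *_ = np.linalg.lstsq(A, b_vec, rcond=None)
a, b, tx, ty = params
scale = np.hypot(a, b)    # recovered scale
angle = np.arctan2(b, a)  # recovered rotation
```

Real geometric CV swaps this toy model for camera projection equations and the linear solver for a nonlinear one (bundle adjustment), but the shape of the problem is the same.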
As for resources the other answers are already giving a great overview. Use Karpathy's course for an intro to Deep Learning for CV but don't expect it to be comprehensive in terms of giving you an overview of CV.
Learn OpenCV for more low level, non-ML and generally more "old-school" Computer Vision.
A personal recommendation of mine is http://www.computervisionblog.com/ by Tomasz Malisiewicz. It's an excellent resource if you want to get an overview of what's happening in the field.
I would argue that kinematic or geometric Computer Vision problems (things like tracking, mapping, reconstruction, depth estimation) are best suited to classical approaches like VO, SfM/MVS, SIFT/SURF, HOG, etc., and form a separate category of CV problems from object recognition/detection/segmentation, which are much more amenable to ML because the dimensionality is reduced.
"But there's also a whole field of CV that doesn't require Machine Learning at all (although it can benefit from it in some cases)."
In fact, Machine Learning has made almost no progress on most of what you mention, specifically SLAM and Multi-View Stereo. It takes completely rethinking how those are done when they are approached from the Deep Learning perspective.
I did this course, but couldn't finish all the assignments. Loved it. Please note that this is a convolutional neural networks course, not computer vision as such. From what I know computer vision encompasses a variety of non machine learning based algorithms, which are not covered in this course.
As an engineer, it's difficult to know when to use deep learning and when to use more classical algorithms. Often, you have to try both and see which is better (twice the work, hooray!). The classical algorithms are often very understandable, and you can reason about what's going on and figure out what is breaking. Deep learning is so much harder, e.g., are my hyperparameters bad, or do I need another 30 GPUs running for a week?
Imho, deep learning has little to do with engineering, and more with guessing, hoping and praying. But it seems you can often get something to work if you do those three things hard enough.
This has been my experience as well :). Deep learning is a lot of random guesswork, trial and error. I am almost always in 'brute force' mode. However, in this course, you learn more about the fundamentals of convolutions and backprop. You have to implement your own backprop; I'm not sure of what use that is, given that it's one line of code in TensorFlow.
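For what it's worth, implementing backprop yourself buys one concrete skill: gradient checking, i.e. verifying an analytic gradient against a numerical one. A minimal numpy sketch for a hypothetical one-layer regression net:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))   # batch of inputs
y = rng.standard_normal((8, 1))   # regression targets
W = rng.standard_normal((3, 1))   # weights of a one-layer net

def loss(W):
    # Mean squared error of the linear layer
    return 0.5 * np.mean((X @ W - y) ** 2)

# Analytic backprop: dL/dW = X^T (XW - y) / N
grad = X.T @ (X @ W - y) / len(X)

# Numerical gradient via central differences, the classic sanity check
num = np.zeros_like(W)
eps = 1e-6
for i in range(W.size):
    Wp, Wm = W.copy(), W.copy()
    Wp.flat[i] += eps
    Wm.flat[i] -= eps
    num.flat[i] = (loss(Wp) - loss(Wm)) / (2 * eps)
```

When the framework's one-liner misbehaves (custom layers, custom losses), this is the debugging tool you fall back on.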
I watched a great video on Tensorflow (link below). It mostly introduces very basic deep learning concepts, but there are a few key moments in the 2 hour+ video, where he explains what to do if something goes wrong. It's definitely not a "science", but with enough experience in deep learning, you can intuit what's going on inside the black box, and there are best practices on what to try next.
For example, he goes through a few examples where a neural net has too many weights, or too little data or improperly connected nodes. All three result in problems, but the problems exhibit themselves in slightly different ways and with expertise you can start identifying them.
Studying how to write your own back propagation algorithm could be useful if you're a deep learning researcher. But for most people it would be like studying semiconductor physics if you only want to write software.
This is pretty old school, but I recommend Multiple View Geometry by Hartley and Zisserman (http://www.robots.ox.ac.uk/~vgg/hzbook/) to get through the fundamentals...it's really good to understand the geometric foundations for the past 4 decades. Along the same lines, you have Introductory Techniques for 3-D Computer Vision by Trucco and Verri (https://www.amazon.com/Introductory-Techniques-3-D-Computer-...), which also goes over the geometry and the fundamental problems that computer vision algorithms try to solve. It often does come down to just applying simple geometry; getting good enough data to run that model is challenging.
If you just throw everything into a neural network, then you won't really understand the breadth of the problems you're solving, and you'll be therefore ignorant of the limitations of your hammer. While NNs are incredibly useful, I think a deep understanding of the core problems is essential to know how to use NNs effectively in a particular domain.
A lot of real world computer vision is implemented on embedded devices with limited computational resources (ARMs, DSPs, etc.) so understanding how a lot of commonly used algorithms can be efficiently implemented in embedded systems is important. It is possibly a way for you to jump the gap from "embedded software developer" to "computer vision engineer". Also keep in mind that in many companies a "computer vision engineer" is fundamentally a different beast from a "software developer". A CV engineer creates software but the emphasis tends to be more on systems and is not 100% about software. This will vary a lot by company but if you're working with prototype hardware you will need to get at least a working knowledge of optics.
Fun and trendy though it may be, I would not focus on deep learning / convolutional neural networks to start off. Deep learning is a small subset of computer vision. I would focus more on understanding the basics of image processing, camera projection geometry, how to calibrate cameras, stereo vision, and machine learning in general (not just deep learning). Working with OpenCV is a good place to start for all of these topics. Set yourself a project with tangible goals and get to work.
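To make "camera projection geometry" concrete, here is a minimal pinhole-camera sketch in numpy; the intrinsic matrix values are made up for illustration:

```python
import numpy as np

# Hypothetical intrinsics: focal lengths fx, fy and principal point (cx, cy)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# A 3D point in camera coordinates (X right, Y down, Z forward, meters)
P = np.array([0.1, -0.05, 2.0])

# Pinhole projection: homogeneous image point = K @ P, then divide by depth
uvw = K @ P
u, v = uvw[:2] / uvw[2]
```

Camera calibration (e.g. with OpenCV's chessboard routines) is essentially the inverse problem: recovering K, plus lens distortion, from known 3D-to-2D correspondences.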
Surprised nobody has posted http://course.fast.ai/ yet. I've been following along with it so far for the first 4 lessons and it has been extremely helpful in understanding how deep learning works from the perspective of someone who did not have much of any related baseline knowledge except how to program. Jeremy is an excellent practical teacher.
I too had this URL referred to me by somebody, and I got excited by their extended intro about why and how their course is different and better than any other.
Though after 5 videos I know nothing more than I did before from any other ML/AI guide on the internet. 99% is only about image classification, and I'm simply seeing too many guides for that.
If anybody has some good links/videos on ML/AI for structured data, please comment and I'll be thankful and happy to click them :)
The author himself said that DL isn't the best option for structured data.
"Certainly I'd pick DL over more linear models for most problems. But I'd pick random forests over DL for most structured data problems."
"Deep learning is best for unstructured data, like natural language, images, audio, etc. it sounds like you may be dealing more with structured data, in which case the Coursera ML course would be a better option for you"
Adrian here, author of the PyImageSearch blog. Thank you for mentioning it, I appreciate it. If anyone has any questions about computer vision, deep learning, or OpenCV, please let me know.
Within the next month I'll be launching PyImageJobs which will connect PyImageSearch readers (especially the Gurus course graduates) with companies/startups that are looking to hire.
This is the most comprehensive book I know of on Computer Vision. The diagrams in the book (including captions) themselves do a great job of explaining things.
I started by getting a webcam or two and trying out various projects: marker tracking (made an optical IR pass filter and tracked an IR LED with two cameras), object segmentation (e.g. measuring the geometry of certain-colored objects).
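A project like the colored-object one above can start as small as this hypothetical numpy sketch: threshold a color channel and track the blob's centroid (OpenCV's inRange/moments do the same thing more robustly):

```python
import numpy as np

# Synthetic 100x100 RGB frame with a red 20x20 "object"
frame = np.zeros((100, 100, 3), dtype=np.uint8)
frame[10:30, 10:30] = [200, 20, 20]

# Segment "red" pixels: strong red channel, weak green/blue
r = frame[..., 0].astype(int)
g = frame[..., 1].astype(int)
b = frame[..., 2].astype(int)
mask = (r > 150) & (g < 80) & (b < 80)

# Centroid of the segmented blob (row, col), e.g. for tracking
ys, xs = np.nonzero(mask)
centroid = (ys.mean(), xs.mean())
```

From here you can graduate to HSV thresholds, morphological cleanup, and multi-object tracking, which is exactly the progression the OpenCV tutorials walk through.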
Measure the speed or count the number of cars passing by your street. Try to implement OCR for a utility meter. There are lots of applications you can train yourself on, and I guarantee that you will learn a ton from each and every one of them.
Does anyone know if tech like OpenCV is used at companies developing their own computer vision products, maybe at Tesla? Or do they build their own technology from scratch that isn't available in the public domain? Or do they, say, fork OpenCV, build upon it and heavily modify it, since OpenCV could be seen as 'outdated' technology?
Disclaimer: I've never worked with any technology related to Computer Vision, just a bloody beginner Python programmer.
This is a great resource. I give it to people who need to learn about convolutional neural networks.
However let's keep in mind that the field of computer vision is much vaster than that. Deep learning approaches have been very successful at solving problems in computer vision, but not all of them and not without drawbacks. I believe any course on classic computer vision will give him more insight as to what challenges computer vision aims to solve, how, and what approach might solve what problem.
You don't specialize in surgery before learning biology. Similarly, you don't specialize in CV before learning basic ML and DL. The fundamental concepts are the same no matter if the modality is text, image or video (for example: regularization, loss, cross validation, bias, variance, activation functions, KL divergence, embeddings, sparsity - all are non-trivial concepts that can't be grasped in a few minutes, and are not specific to CV alone).
Adrian here, author of PyImageSearch. Thanks for mentioning the blog. If anyone has any questions regarding learning computer vision, please see my reply to "sphix0r" below.
Hey, amazing blog. I'm currently working on degraded scanned documents. Are there algorithms that distinguish between documents and natural images?
I am using OpenCV to process the documents; I'm curious whether I am missing out on a chunk of CV algorithms specifically for scanned administrative documents (financial, personal documents)?
I'm not sure what you mean by "algorithms distinguishable for documents and natural images" -- can you elaborate? OpenCV itself doesn't have built-in functionality to take documents and fit them to a pre-defined template; that tends to be part of a specific use-case/niche of computer vision for document processing. The general idea is to take a document a user has filled out and "fit" it to a blank template, where you know exactly where each field is. That way you can extract the information from the document.
"The general idea is to take a document a user has filled out and "fit" it to a blank template" - I agree point to point. However, I am struggling with templatization due to poor quality of the document images. To process those documents (denoise, super resolution, HE - etc. etc.), the OpenCV algorithms are not working good enough, requires a lot of tuning varying with each document.
So, I was wondering if those algorithms work better for natural images (buildings, people, things etc) than document images (text, graphics) and if so, there must exist algorithms to process such documents I am unaware of.
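For what it's worth, the "fit to a blank template" step usually comes down to estimating a homography from a few correspondences (page corners, fiducials). Below is a minimal DLT sketch in numpy with made-up corner coordinates; in practice cv2.findHomography with RANSAC is the robust version:

```python
import numpy as np

def estimate_homography(src, dst):
    """DLT: solve A h = 0 for the 3x3 homography mapping src -> dst."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # Solution is the right singular vector of the smallest singular value
    _, _, Vt = np.linalg.svd(np.array(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def warp(H, p):
    """Apply a homography to a 2D point."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

# Hypothetical corners of a skewed scan (src) and of the blank template (dst)
src = [(12.0, 15.0), (410.0, 22.0), (400.0, 540.0), (8.0, 530.0)]
dst = [(0.0, 0.0), (400.0, 0.0), (400.0, 520.0), (0.0, 520.0)]
H = estimate_homography(src, dst)
```

Warping the whole scan with H (cv2.warpPerspective) then puts every form field at a known location, which is the part that degrades gracefully even on noisy scans as long as the corners can be found.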
It's a really broad field, so don't expect to get up to speed very quickly. A lot of people have recommended a lot of books already, and I could add to that list. One thing you might think about is Safari Books Online. You'll notice a lot of the recommended books are there, and even though it's a bit pricey, I think you'll find you'd save money by the time you get enough of the books that seem useful to you. You'll also lose nothing by jumping from book to book because they're too advanced or not advanced enough, until you find one that's at your level.
I would recommend starting with one of the many OpenCV tutorial books, and maybe work your way through a few of those. Then move into books that cover more of the algorithms behind the library like "Multiple View Geometry" by Hartley and "Machine Vision" by Davies, among many others.
I learned OpenCV using the O'Reilly book by Bradski and Kaehler (back when it was OpenCV 2). I found it well-structured and it worked for me. They have an updated version for OpenCV 3.
However, I can't tell you if OpenCV is still the framework of choice and/or widely used in the field you want to go into.
It does touch on Machine Learning, but it focuses much more on the fundamentals of computer vision, like feature detection, that allows things like SLAM to exist.
Is OpenCV (traditional CV technique) better to use or Deep Learning based approach? Has anyone done a comparison of the two approaches? The obvious flaw with deep learning is that it requires large labeled data sets - but assuming that is available, which one is more accurate at object detection (hotdog or not), detecting features on an image (faces, manufacturing defects)?
You'll notice that all of the top contenders use Neural Networks, but I would bet that many of them use at least some traditional CV techniques to transform the images at various steps. That said, many of the more modern deep learning approaches are ditching CV altogether, just feeding in raw pixels without any normalization or transformation, leaving fewer parameters to tweak.
I use OpenCV for reading and writing real-time video streams (like webcams or video frames) for my hobby computer vision projects, but tend to use other ML or image-specific libraries for actual processing. The cascade classifiers in OpenCV are okay, but it isn't too difficult to set up something comparable in scikit-learn that is more modern and robust, though a bit less performant if you do need real-time response.
Best example that I have is a pulse rate detector that I put together, that uses OpenCV for video frame extraction & display but bare numpy/scipy for the rest.
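A sketch of how such a pulse detector can work once the frames are in hand, assuming the pulse shows up as a small periodic brightness change (the per-frame signal below is synthetic):

```python
import numpy as np

fps = 30.0
t = np.arange(0, 10, 1 / fps)   # 10 seconds of frames
bpm_true = 72.0

# Mean channel brightness per frame: baseline + pulse + sensor noise
rng = np.random.default_rng(2)
signal = (0.5
          + 0.05 * np.sin(2 * np.pi * (bpm_true / 60.0) * t)
          + 0.01 * rng.standard_normal(t.size))

# Remove the DC component, then find the dominant frequency
x = signal - signal.mean()
freqs = np.fft.rfftfreq(x.size, d=1 / fps)
power = np.abs(np.fft.rfft(x)) ** 2

# Restrict to the physiologically plausible band, 40-180 bpm
band = (freqs >= 40 / 60) & (freqs <= 180 / 60)
bpm_est = 60.0 * freqs[band][np.argmax(power[band])]
```

In a real pipeline the `signal` array would come from averaging a face/skin region of each OpenCV frame; everything after that is plain numpy/scipy, just as described above.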
Try reading the samples that come with the predominantly used programming library/framework. This works with pretty much everything. I know a thing or two about statistics because I study R examples at night. I know a decent chunk of WinAPI and Windows IPC because of Delphi, I know some CV because I studied samples from OpenCV and tried to solve problems with it, etc.
I learned mostly from MATLAB documentation. Good if you want theory and implementation, and most capabilities can be done with open source equivalents if you don't have MATLAB.
Coursera has some; I typically keep the images-2012 one locally, but there are also things like dsp-001, which is a bit more advanced. Generally speaking, Coursera has good material in many related domains.
Here is a guide I have developed over the 6 years since I dove into computer vision around 2011. My path was self-taught until recently, when I took a graduate course.
It started from wanting to develop AR apps during my undergrad. Here are the best resources I have found to date:
Computer Vision is very theoretical and experimental, so the more hands-on, the better! My approach has been to go top-down: overview the landscape, then slowly progress deeper.
Begin with the best library for CV in my opinion: OpenCV. The tutorials are amazing!
I feel like then, you will have so much exposure that when you dive into formal classes and textbooks, you will really understand and be enlightened.
This was the general way I learned computer vision, and recently I completed a CV internship at nanit.com. I was not hired for my formal knowledge; rather, they were impressed by all the various projects I've done and the knowledge I had of many vision topics.
All the assignments have starter code in Python and OpenCV.
This was an amazing class as it dove deep into 3D computer vision, which is so relevant to augmented reality!