Have We Forgotten about Geometry in Computer Vision? (alexgkendall.com)
406 points by AndrewKemendo on April 28, 2017 | 121 comments



My impression was that the computer vision community has been suffering some serious cognitive dissonance lately: they spent all these years mapping problems onto feature spaces of manageable dimensionality, backed by theory saying that proper assumptions must be made to reduce the search space; and then along come these deep nets, hardly tailored to the problems at all, and outperform algorithms with decades-old histories of fine-tuning.

Despite this, I don't think anyone disputes the potential of a good set of assumptions. Instead, I think what deep learning has taught us is that we should reconsider what those assumptions should be. While geometry might well be the first kind of language a toddler learns to think in, this should probably not be confused with the rigorous geometry of Euclid. Quite possibly we have some spatial relationships, such as affine transformations, hard-coded in our brains at birth, but this does not mean, for instance, that one is therefore able to draw a house in correct perspective.


Yours is a common sentiment, but it is not very accurate. Deep learning is not separate from, or a more evolved form of, machine learning. It's a learning algorithm with its own structural biases, stemming from its architecture and optimization method. There is a duality between optimization/search and sampling; they're two sides of the same coin. Instead of assigning labels, I prefer thinking about the shared and unshared properties and assumptions of each algorithm. Deep learning is unique in its ability to use gradients and credit assignment through a hierarchy when possible.

Deep learning, like all learners, must make assumptions, and there are trade-offs. You trade time spent thinking up features (which reduce learning time and improve sample efficiency) for time spent thinking up architectures and the appropriate biases (graph convolutions, RNNs, convnets, etc.). Though still limited to a particular domain, you hope to gain an architecture with broader applicability than a feature-based method. You also take on more complex training and longer waits for experiments, and gain a level of indirection between the problem and the learning algorithm. This means that gains from new knowledge and understanding of a domain filter through more slowly for deep learning.

Deep learning is a very experimental science. Theory usually comes after the fact and is not very filling. People try this or that and then write a show-and-tell paper, but knowledge still helps. Whereas in shallow methods learning something about how images work can directly inform your algorithm, in deep learning you now have to think about what sort of transformation will best capture the invariant you are looking for.

Deep learning (especially when you can get it end to end) has lowered the expertise threshold for impressive results (the bigger your budget, the better). But if it were really as magical as some people make it out to be, it would already be obsolete as a field, because there would be minimal need for human expertise.


Excellent point -- our brains and deep networks seem to learn a kind of "fuzzy geometry" that is in some ways more robust than numerically correct geometry, and this also allows us to spend more cycles on higher-level abstractions.


My problem with our brains and NNs is exactly this "fuzzy geometry" you speak of. Using precise geometry, if we have a pair of cameras equivalent in layout to our eyes, we can reconstruct the world with millimeter accuracy. That is something that cannot be done with neural networks (artificial or biological); instead you get a sort of semantic tagging of "green sofa here, table here, bed there".

But if afterwards you want to use this model to know if the sofa will fit into the alcove (or the car into the garage), the NN systems will be wildly unreliable.
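
For reference, the precise-geometry version of that two-camera claim is just triangulation on a rectified stereo pair. A minimal Python sketch, with made-up calibration numbers standing in for a real rig:

    # Hypothetical calibration for a rectified stereo pair, very roughly eye-like.
    f_px = 700.0        # focal length in pixels
    baseline_m = 0.065  # distance between the two cameras in metres

    def depth_from_disparity(disparity_px):
        """Classic stereo triangulation: Z = f * B / d."""
        return f_px * baseline_m / disparity_px

    # A point seen 40 px apart in the two images is ~1.14 m away; at 10 px it is ~4.55 m.
    print(depth_from_disparity(40.0), depth_from_disparity(10.0))

The accuracy comes entirely from knowing f and B precisely and matching the point to sub-pixel precision, which is exactly the part a fuzzy semantic model doesn't give you.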


It seems to me that humans can predict 3D spaces pretty well.

I've done a lot of construction work. After some practice you can judge quite precisely what fits where and even how long something is. Our brains seem to be able to "compute" such things despite all the difficulties of constructing a coherent image of our surroundings in the first place.

Or think of moving your huge couch through the narrow stairway of your house - you can predict how you need to turn it so it fits. Or think of truck drivers who are able to maneuver their large vehicles within centimeters, even when they can't see parts of them (blind spots) and need to rely on their mental model.

Did I misunderstand your comment? Could you elaborate what I am missing?


Anecdotal evidence: I used to work at a construction materials depot when I was younger. I remember how lost I was when I initially started; I couldn't recognize any of the material sizes by looking at them, while the old-timers always knew what size they were just by glancing at them. Then, after only a couple of months, suddenly I could too. Years later I'm still pretty good at guessing lengths, while people around me seem to always be way off. I can hold out my hands and make a foot, three feet, 1 cm, 0.5 cm, or 10 cm between them pretty reliably, while other people seem to be way off.


You can learn to estimate various geometrical quantities with surprising accuracy, but it always depends on context (scale of the geometry, distance to it and angle from which you view it).

One interesting thing I recently noticed is that essentially everybody grossly overestimates the slope of a road, and, at least while driving, many people perceive curves of less than ~30° as essentially straight road.


I'm talking about optical illusions, such as how seeing a small person next to a car makes the car look bigger. Anything based on ML instead of geometry, or even using ML as a strong prior, will be susceptible to this kind of problem.


You would probably hate David Hockney's famous photo collages then.

https://photomuserh.wordpress.com/2012/03/04/david-hockney-p...


Wow, this is really fascinating. It's remarkable that the human brain (well, at least my brain) has almost no issue understanding the "scene". I wonder how a NN would do...


The composition of the scenes seems to neatly match the particular way human vision works: we look at details with only a narrow central part of our vision and assemble the total detailed picture by scanning over the scene with our eyes.

So, for this type of artwork, whenever we look for details, our eyes rest on a single photograph that covers those details fully, just as if we were looking at a normal real image; but this contrasts with the wider view captured by our peripheral vision, which is obviously artificial.


Hockney was fully aware of this mechanism when he made these works, which motivated him to make them in the first place:

> For his part, critics often got Hockney all wrong as well, misinterpreting the intensity of the ways he would presently be engaging photography—taking literally hundreds of thousands of photos, coming to feel that the Old Masters themselves had been in thrall to a similar optical aesthetic—as a celebration of the photographic over the painterly, and specifically the post-optical painterly, when in fact all along he’d been engaged in a rigorous critique of photography and the optical as “all right,” in his words, “if you don’t mind looking at the world from the point of view of a paralyzed cyclops, for a split second, but that’s not how the world really is.”

http://www.believermag.com/issues/200811/?read=article_wesch...


I don't see what this has to do with being able to make real-world accurate measurements and predictions based on CV algorithms.

And from first glance, I actually find his stuff pretty cool.


Ah, I misinterpreted this as you saying that you dislike the fuzzy geometry of the human mind in general - not specifically for measurement purposes:

> My problem with our brains (...) is exactly this "fuzzy geometry"


Like anything, the more tools in your toolbox, the more problems you can solve!

I love that humans seem to be innately wired to constantly create new and better tools.


But those conventional, mm accuracy detectors tend to work best when they know what to look for. Isn't this, preanalyzing the scene for tailored vision algorithms, where those fuzzy NN classifiers (will) have the biggest utility?

ML vision may get all the fame right now, but useful applications will need both.


I'm not talking about detection; I'm talking about 3D reconstruction and making judgments about whether X will be able to pass by Y without hitting it.

ML has the same weaknesses humans do, in that it is easily fooled and can't accurately extrapolate outside its trained parameter space. Without an accurate physics model and an integrated comprehension of the geometry involved, we are basically planning to put autistic pigeons in charge of driving our cars.


But the nice thing about computers is that you can feed the results of the high level ML classifier into the mm accurate geometry detector. You can only determine whether X will be able to pass by Y if you already know that there is an X and a Y.

Humans and pigeons will identify an X and a Y and will then try to use the same method to classify the scene as either "will pass" or "won't pass". Computers can do the fuzzy thing for identification, but don't have to use the same method for "will pass"/"won't pass". They can switch to geometric measuring and simulation for that.
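
A toy sketch of that split, in Python, with entirely hypothetical numbers: the fuzzy stage names the object and gives a rough size estimate with some uncertainty, and a deterministic geometric check sits on top of it.

    # Hypothetical output of a fuzzy detector: a label plus an estimated width with uncertainty.
    detection = {"label": "sofa", "width_m": 1.82, "sigma_m": 0.05}

    alcove_width_m = 1.90   # measured precisely, e.g. with a tape measure or a depth sensor
    margin_m = 0.03

    def will_fit(det, gap_m, margin_m, k_sigma=2.0):
        """Deterministic clearance check on top of the fuzzy estimate:
        require clearance even at k-sigma of the detector's uncertainty."""
        worst_case_width = det["width_m"] + k_sigma * det["sigma_m"]
        return worst_case_width + margin_m <= gap_m

    print(will_fit(detection, alcove_width_m, margin_m))  # False: 1.82 + 0.10 + 0.03 > 1.90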


In order to accurately ascertain whether thing X will fit into hole Y, you have to know the accurate geometry of both, which is not something that can be reliably determined from the results of the fuzzy classification pass. When humans want to know this, they either trust their fuzzy judgement and try it, or measure the X and Y in question with a tape measure.


That seems like a bad idea. But don't computers have the upper hand here, because they can do both simultaneously and weight the inputs appropriately?


+1 for "serious cognitive dissonance"

Decades of research effort being overshadowed by DL is hard to swallow for most researchers in the CV community.


Except that DL can't provide the same things as a lot of CV research. They are two separate areas of research for different problems.

DL doesn't provide solutions to the challenges in robotics or augmented reality that CV is very good at. For example, it can't place the camera at a specific position in the world using the image. DL can tell us what's in the image, which CV can't; but CV can place those objects relative to the viewer.


> For example it can't place the camera at a specific position in the world using the image.

I might be mistaken but isn't that exactly what this author's research was about? Camera pose estimation via Deep Learning ("PoseNet").


Yes, but if you read the paper you'll see that he's talking about closing the gap between the deep learning approaches and the current features-plus-geometry state of the art, which is an order of magnitude more precise. DL approaches are, however, considerably faster.


Camera Pose estimation is nearly solved with DL. In fact we're rolling out an application through a major retailer this summer that does exactly that.


It's not a fundamental limitation. People just haven't gotten to that yet, since there's so much low-hanging fruit with DL.


Does DL provide mechanisms for feeding back into theory? As in, does a successful deep convolutional neural network provide a means to extract enough from its structure and behavior to potentially not NEED the CNN for prediction in a future iteration? Gradient projections can be used to obtain a total-derivative quantity and to compare sensitivities across inputs. We can regularize to prevent overfitting with cross-validation, L-curves, etc. But what about hypothesis generation?

For many of us who have dipped our toes into ML tooling but don't have a great application for it in our work areas, this is the kind of thing we would like: a NN that predicts well AND has a well-understood methodology that gives us actual insight, not just a black box.

Maybe what I have in my head is a deterministic gradient-based analog of an evolutionary algorithm? I'm not sure.


Pretty much the same thing has happened in natural language processing. Previously, trained linguists would spend lots of time on carefully crafted features. Now you just throw a bidirectional LSTM and enough training examples at the problem and you are close to state of the art.


Where "enough training examples" has proven to be the real difficult problem.


I have a feeling people are going to reply without reading this through and assume the author poses a Deep Learning vs Classic CV sort of argument, in a deep-learning-is-overrated sort of way. Whereas it seems to me he is merely saying Deep Learning should be informed by Classic CV.

"I think we’re running out of low-hanging fruit, or problems we can solve with a simple high-level deep learning API. Specifically, I think many of the next advances in computer vision with deep learning will come from insights to geometry."

And it's true. A lot of the low-hanging fruit has been picked, and stuff like SLAM is not about to be done wholly by deep learning (probably). There are a lot of problems that require more insight and analysis than 'throw a deep network at a large dataset'. As he concludes:

"learning complicated representations with deep learning is easier and more effective if the architecture can be structured to leverage the geometric properties of the problem."

And thinking about what the architecture should be is basically the hard bit in deep learning.


"Classical" CV and deep-learning CV needn't be opposing one another.

There are several cases in which the classical approach is emulated by deep networks - implementing the same carefully thought-out pipelines but in a way that leverages representations learned from huge datasets (which are undeniably very powerful).

Some examples are:

* Bags of convolutional features for scalable instance search https://arxiv.org/pdf/1604.04653.pdf

This paper treats each 'pixel' of a CNN activation tensor as a local descriptor, clusters them, and describes an image as a bag-of-visual-words histogram (a rough sketch of the idea appears after this list).

* Learned Invariant Feature Transform https://arxiv.org/abs/1603.09114v2

This paper very explicitly emulates the entire SIFT pipeline for computing correspondences across pairs of images

* Inverse compositional spatial transformer networks https://arxiv.org/abs/1612.03897v1

This paper emulates Lucas-Kanade approach to computing the transform between 2 images with differentiable (trainable) components.

Also, don't forget that deformable part models are convolutional networks! https://arxiv.org/abs/1409.5403
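
A rough sketch of the idea behind the first paper, using random arrays as stand-ins for CNN activations (the real pipeline has more machinery; this only shows the descriptor-clustering-histogram core):

    import numpy as np
    from sklearn.cluster import KMeans

    # Stand-ins for conv-layer activations of a few images: (H, W, C) tensors.
    # In the paper these come from a pretrained CNN; random data keeps the sketch self-contained.
    rng = np.random.default_rng(0)
    feature_maps = [rng.normal(size=(14, 14, 256)) for _ in range(5)]

    # Each spatial position of the activation tensor is treated as a local descriptor.
    descriptors = np.concatenate([fm.reshape(-1, fm.shape[-1]) for fm in feature_maps])

    # Build a "visual vocabulary" by clustering the descriptors.
    vocab = KMeans(n_clusters=32, n_init=4, random_state=0).fit(descriptors)

    def bow_histogram(fm):
        """Describe one image as a bag-of-visual-words histogram of its activation 'pixels'."""
        words = vocab.predict(fm.reshape(-1, fm.shape[-1]))
        hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
        return hist / hist.sum()  # L1-normalise so images of different sizes are comparable

    query = bow_histogram(feature_maps[0])
    scores = [np.minimum(query, bow_histogram(fm)).sum() for fm in feature_maps]  # histogram intersection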


Thank you for these great links. I'll add another interesting paper that carries on this emulation program:

Conditional Random Fields as Recurrent Neural Networks https://arxiv.org/abs/1502.03240

I hope more fruit comes out of the fusion of deep learning and graphical models.


> I have a feeling people are going to reply without reading this through and assume the author poses a Deep Learning vs Classic CV sort of argument, in a deep-learning-is-overrated sort of way. Whereas it seems to me he is merely saying Deep Learning should be informed by Classic CV.

Thank you for saying this! You were indeed prescient - nearly all the comments so far are total non sequiturs given what the article actually shows.

Folks, please take a look at the actual GC-Net architecture, which is very nicely shown in the last figure in the post. As you'll see, it's still an end-to-end learnt deep learning model. The extremely neat trick is to come up with a way to represent the cost volume piece (that connects the 2d and 3d parts of the CNN) as a differentiable function, which allows it to be part of an end-to-end learnt neural network.

This approach, of finding differentiable forms (often approximations) of domain-specific loss functions, transformations, etc so they can be inserted into neural nets, is a very powerful and increasingly widely used technique. It's not about replacing domain knowledge with deep learning, and it's definitely not about replacing deep learning with domain knowledge - it's all about using both.

So in this case, the research is showing how to use geometry plus deep learning, not instead of deep learning.
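
For the curious, the differentiable piece at the heart of that cost-volume trick is, as I understand the GC-Net paper, essentially a "soft argmin": turn the costs into a probability over disparities and take the expected disparity, which is a weighted sum and therefore lets gradients flow back into the whole network. A minimal PyTorch sketch of the idea (an illustration, not the authors' code):

    import torch
    import torch.nn.functional as F

    def soft_argmin(cost_volume):
        """Differentiable 'argmin' over a stereo cost volume of shape (N, D, H, W)."""
        n, d, h, w = cost_volume.shape
        prob = F.softmax(-cost_volume, dim=1)                    # low cost -> high probability
        disparities = torch.arange(d, dtype=prob.dtype).view(1, d, 1, 1)
        return (prob * disparities).sum(dim=1)                   # (N, H, W) sub-pixel disparity map

    cost = torch.randn(2, 64, 32, 32, requires_grad=True)        # dummy cost volume from the 3D CNN part
    disp = soft_argmin(cost)
    disp.mean().backward()                                       # gradients reach `cost`, hence the whole net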


The sad parts, for me are:

1. Thinking about a deep learning architecture is really a different kind of research. It requires much guesswork, a lot of trial and error, and lots of waiting for training cycles to complete. For me this type of work is much less interesting than engineering a solution, even if the results are better.

2. Deep learning is quickly becoming a commodity. One can download libraries to do almost anything with neural networks. I fear that deep learning will become the new "web development" of the 90s. Everybody can do it (this is not really true just yet, but I suspect we are rapidly reaching an asymptote where it is true).


2) would be awesome. If, for example, our hydrologists could easily take their extensive domain knowledge and experiment with a layer of deep learning on top of it without having to go through the "high priests", that can only be a good thing in my mind.

I might be very much an oddity among programmers, but I genuinely believe that the more 'programming' that can be done by non-programmers, who actually understand the domain they are trying to model, the better. If nothing else it would free up more time for programmers to work on actual hard problems where they have more to contribute.


My main concern is whether it deserves my time.

Investing in deep learning now costs considerable time and effort, whereas in the near future when deep learning is a commodity, that investment gives no real advantage.


Back when I took signal processing, I remember our professor making a very insightful comment. I sadly don't remember his exact words, but it went something like this.

"20 years back, everyone wanted CD's for music, today it is flash drives, the upcoming thing is streaming music from the internet. Technologies come and go. What I can say is that the Fourier transform will still be around 200 years from now as a valuable way of understanding the world."

I am of course in no real position to assert that this is a good idea (as I am still in graduate school), but this is the reason why I gravitate towards math and physics classes. Conditional on the assumption that we still have a meaningful civilization decades from now (and that none of the dangers outlined, for instance, in the excellent book "Here Be Dragons: Science, Technology and the Future of Humanity" by Olle Haggstrom cause its utter collapse), I think it is a safe bet to assume "fundamental" topics like Maxwell's equations, Fourier transforms, etc. will still be around, alive, and fruitfully studied and applied across a variety of domains. Part of the reason is simply historical: the Fourier transform has had over 200 years to prove its worth time and again across a variety of disciplines.

Maybe deep learning is here to stay, maybe it isn't. I don't know, and I am not willing or interested in making bets either way. I am willing to bet on a time proven framework of understanding, such as the examples above.


Indeed.

ML seems, in great part, to be "Magic Function Machines". People make features that can be tracked, and more features, and more features. Then they splatter them against the wall and hope for the best.

The best are things like "tell what type of bird it is"... But in the end, we have no clue HOW it got those results, just that Mystery Magic Function + data = results.

I can certainly see a middle path: train a learning function, then interrogate said function for the underlying things that make it work. Once we understand the primitives, we can make ideal functions that do X, and do it well. Right now, learning functions do it rather well but with exceptional overhead, and they're hard to tune without re-crunching the whole dataset.


Once it becomes a commodity there will be a bigger market for your services and a greater appreciation for the value you add (assuming you're actually good enough to add value).

There was a time when knowing how HTML tables worked was enough to get you a job, and knowing how to post a form to a CGI Perl script made you an expert. It's getting easier and easier for more and more people to do more and more complicated things on the web every day, and still (and probably because of this) people who are really good at it are commanding higher and higher salaries.


But my point is that we're reaching an asymptote, where knowing much more about deep learning gives only a marginal advantage.

So it's actually worse than web development.


> knowing much more about deep learning gives only a marginal advantage.

Sure, but that's mostly because deep learning is so new and 'magical' that most people outside the field can't spot the difference between someone with a PhD in the subject and someone who has worked through a couple of TensorFlow tutorials. As the field matures and its practical applications to various fields and industries become clearer, so will the value of knowing more about deep learning than your peers.


> Investing in deep learning now costs considerable time and effort

Actually, I'd argue that deep learning now requires the least investment of time and effort relative to the potential payoff. Hell, if you spend just a couple of weekends grinding through TensorFlow tutorials, you'll be able to impress the crap out of a huge number of people, many of them with far more money than sense.


What is sad about 2?

Other than reminiscing about the halcyon days of the web (which is a forever-moving target of 'how it was when I grew up'), the present-day internet is, technically speaking, a marvel.

What's exciting about deep learning becoming a commodity isn't seeing what everybody can do with it, but what will be built standing on its shoulders.


> the present-day internet is, technically speaking, a marvel

The infrastructural parts are marvelous indeed, as they have been for decades. If you mean the web, it's a marvel... that it works at all. It's a "marvelous" pile of bad engineering bolted on as fast as humanly possible.

I guess GP's worry is that if a field gets dominated by people with no respect for the quality of their work, the whole field will be built out of bad tools - which is harmful in the long term for everyone, specialists and laymen alike.


I think there's still a lot of room for exploiting geometry. In particular, it leads to very simple models that require little training and, most importantly, are completely transparent in how they work. I did some work on a robotic wheelchair that exploited the prevalence of pole-like objects in the environment (trees, parking meters, street lamps) to localize. The model just looked for cylindrical objects and matched them against a map that we generated a priori.

Deep models are best suited to cases where you don't know what features are important. For us, the radius and orientation of the pole object were straightforward features, so a deep model would probably have been the wrong approach.


This is a good observation. My personal opinion is that using many smaller networks as components of a larger, manually engineered system could be a fruitful approach for these complex problems.

End-to-end deep learning is the holy grail -- and may ultimately happen -- but I don't think computation in silicon semiconductors is going to cut it, and we don't yet know what the next computational substrate will be.

Did you use a smallish convolutional model for extracting pole position, orientation, and radius from image data, or another approach?


If you're interested, here is a link to the initial paper: http://vader.cse.lehigh.edu/publications/fsr12_mon_per.pdf


I love the idea of using poles as your fiducials.


> However, these models are largely big black-boxes. There are a lot of things we don’t understand about them.

I'm getting kind of sick of this "deep learning is a black box" trope, because it's really not true anymore. Yes, it's a black box if you just use "some data and 20 lines of code using a basic deep learning API" as mentioned in the article. But if you spend some time understanding the architecture of networks and read the latest literature, it's pretty clear how they function, how they encode data, and how and why they learn what they do.

Because neural networks are so dramatically more effective than they used to be, in so many domains, it's true that we don't yet have a good understanding of optimal ways to build, train, and optimize networks. But that is exactly why there is so much excitement -- because there is a lot to discover, and a lot of progress that can be made quickly.

I agree with the author that fundamental physical and geometric approaches are still relevant and useful, and have been somewhat ignored recently, but the fact remains: If you and I as individuals want to maximize our personal impact, and capture as much value as we can while working on interesting problems, deep learning is an excellent field in which to do that.

It's kind of like we just discovered a nice vein of gold, and the silver miners are like "yeah, but we don't know much about that vein of gold and how long it will last." Which is true, but in the meantime, there's a lot of easy money to be made, and ultimately, both types of resources are important and synergistic.


>> I'm getting kind of sick of this "deep learning is a black box" trope, because it's really not true anymore.

I know, right? Like, take this model I trained this morning. Here's the parameters it learned:

[0.230948, 0.00000000014134, 0.1039402934, 0.000023001323, 0.00000000000005]

I mean, what's "black-box" about that, really? You can instantly:

(a) See exactly what the model is a representation of.

(b) Figure out what data was used to train it.

(c) Understand the connection between the training data and the learned model.

It's not like the model has reduced a bunch of unfathomably complex numbers to another, equally unfathomable set. You can tell exactly what it's doing - and, with some visualisation, it gets even better.

Because then it's a curve. Everyone groks curves, right?

Right, you guys?

/s obviously.


Not to mention that the bug which causes it to detect a cat as a panda is instantly visible. You really should change that 0.00000000014134 to a 0.000000000141339.


> I'm getting kind of sick of this "deep learning is a black box" trope, because it's really not true anymore.

That's fair/probably true.

I think there are two things that drive that--one, lack of a widely shared deep understanding of the field[0] (and not really needing a deep understanding to get good results--as both you and the author pointed out), and two, the fact that it feels like cheating, compared to the old ways of doing things. :P

[0] When the advice on getting a basic understanding is "read a textbook, then read the last 5 years of papers so that you aren't hopelessly behind", there just isn't going to be widespread understanding.


Fair. How about an excellent 4-minute YouTube video to get a basic understanding? :)

https://www.youtube.com/watch?v=AgkfIQ4IGaM


I'll have to watch this later, but I'd argue the issue, at least for me, isn't really surface level understanding. (At least, the kind I think could plausibly be imparted in 4 minutes. :))

The basic idea of deep learning has always seemed straightforward to me[0]. However, at least my perception is that it feels like there's a lot of deep magic going on in the details at the level that Google/Microsoft/Amazon/researchers are doing deep learning. That's honestly true of most active research areas[1], but since those results are also the results that keep getting a lot of attention, the "it's a black box" feeling makes sense to me. :)

[0] Having done both some moderately high level math and having a CS background, I feel like most ideas in CS fit this description, though. Our devil is the details.

[1] For instance: fairly recent results in weird applications of type theory are also super cool, and require some serious wizardry, but those get much less attention. (And are, I think, more taken for granted, since who doesn't understand a type system? /s)


Having until very recently worked in deep learning at Google, I can assure you that if you read and watch enough recent public papers and talks, you will be very, very close to the latest thinking of researchers at these companies.

You're right that it can take some time to do this edification work and develop the understanding for yourself -- the research is broader and more specialized than it appears at first glance -- and it does help to be surrounded by smart people puzzling over the same types of problems, but there's very little secret magic here. It is, however, of benefit to these companies to develop a public image of exclusivity and wizardry in their research; I fell into this trap too, before I saw how the sausage is made.

If you want to make your own fundamental innovations in deep learning, it can be very resource-intensive, both computationally and otherwise. However, it is easy to apply the current state-of-the-art to a broad spectrum of applications in novel ways.

One of the reasons I left is that I think there is a big opportunity in applying these powerful basic principles and approaches to more domains. The research companies are, IMO, focused on businesses that are or have the potential to become very, very large, and that can take advantage of their ability to leverage massive amounts of capital. This leaves many openings for new medium-sized businesses. Of course, as you grow, you can take stabs at progressively larger problems.


I'm with you 100%. RF has been around for ages, but it is still "black magic" to most EEs (after most people finish the standard 100-level courses describing op-amps, they tend to go into the digital domain and leave analog work to that small demographic). One EE will be able to design a fantastic seven-layers-of-poly multiprocessing chip in his garage using Cadence and the TSMC 65nm libs, while someone else will be able to design a flawless cavity filter at 16 GHz. People have specific domains of expertise, even when they hold the same "EE" or "CS" or "Math" degree from the same university, largely based on which courses they elected to take in their 3rd and 4th years.

Likewise, fields advance quickly. I can grok how a Z80 or 6502 works, from NAND to Tetris, but even a mediocre second-year grad student would wipe the floor with me. I, too, went pretty far down the road of mathematics, but watching MSRI lectures from the last few years leaves me struggling to keep up in the field (algebraic topology) where I once felt comfortable. If you don't keep up with your field, you're going to be lost.

The reason I think the 'black magic' trope keeps being bandied about is that most people reading the articles describing ImageNet et al. just don't have the background necessary to grok them[1]. If you had asked them a year ago what the convolution operator was, they'd have scratched their heads. When they try to go and read the ImageNet paper, they'll be left even more confused, because the last time they thought about linear algebra was their freshman year of uni. It'd be analogous to trying to write computational fluid dynamics modeling software after not having touched diff eqs for a decade.

[1] This isn't to disparage those who don't - everyone has their domain of expertise. I'm just trying to explain why the conception of 'black magic' exists. It's quite simple: when you have a tenuous grasp of the foundational knowledge upon which some theory is built, you will have difficulty learning the abstractions built upon those foundations.


Ah, this is interesting, because I've recently dabbled a bit in RF. My path went like this:

1) Interested in doing something with RF, don't know much about it, know that people say it's black magic.

2) Do some research... Ah, this is a pretty deep topic, and it might take a while to develop the necessary intuition.

3) Become competent enough to solve my immediate problem, recognize that it is an extensive field in which there is a lot of specialized practical knowledge that could be acquired.

4) Accept that I have higher life priorities than to go down the RF rabbit hole, but feel that I could learn it if I wanted to invest the time. No longer feels like black magic.

I think there is a distinction between fields like deep learning and RF, where most of the information is public if you know where to look, and say, cryptanalysis or nuclear weapon design or even stage magic, where the details and practical knowledge of the state-of-the-art are more locked behind closed doors. And for a field that you're not familiar with, it can be initially unclear which category it falls into. I think the existence of public conferences on the topic is a good indicator, though.


I would love to hear more about these "weird applications of type theory". Any references?


So it turns out you can basically use type theory to encode a surprisingly large number of desirable traits about your program. (Caveat being that as you get more restrictive, you reject more "good" programs at compile time--no free lunch with Rice's theorem.)

For example: In this paper, they basically use types (with an inference algorithm) to catch kernel/user pointer confusion in the Linux kernel. (https://www.usenix.org/legacy/event/sec04/tech/johnson.html)

It turns out you can encode a lot of other interesting properties in a type system (esp. if you're building on top of the existing type system), though--you can ensure that a Java program has no null-pointer dereferences (https://checkerframework.org/ has a system that does this), and Coq uses its type system to ensure that every program halts (as a consequence, though, it isn't actually Turing complete).

There's also cool things like Lackwit (http://ieeexplore.ieee.org/document/610284/) which basically (ab)used type inference algorithms to answer questions about a program ("does this pointer ever alias?", etc.).
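
To make the first example a bit more concrete: the same flavour of trick can be played in Python with type hints and a checker like mypy. This is only a toy analogue of the paper's C-level analysis, with made-up names:

    from typing import NewType

    # Two kinds of addresses that a static checker will refuse to mix up.
    UserPtr = NewType("UserPtr", int)
    KernelPtr = NewType("KernelPtr", int)

    def copy_from_user(dst: KernelPtr, src: UserPtr, n: int) -> None:
        ...  # the only sanctioned way to move data across the boundary

    def handle_syscall(buf: UserPtr) -> None:
        scratch = KernelPtr(0xFFFF8000)       # hypothetical kernel address
        copy_from_user(scratch, buf, 64)      # fine
        # copy_from_user(buf, scratch, 64)    # mypy error: arguments swapped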


Surely it's a black box in a much deeper sense than that. We're nowhere near being able to prove that a given net implements a particular transformation approximately correctly. So for example, if you were to train a net to reconstruct 3D geometry with apparent success, there would be no way to validate that except by testing it on lots of examples. Similarly, we would not know how to give a precise characterisation of which images the network could analyse correctly.


For the type of problems we commonly 'attack' with neural networks, the same criticism applies to all solutions - for the earlier solutions we can generally prove that they definitely don't/can't work in the arbitrary general case, and we can prove some bounds of correctness that apply to a highly simplified artificial case with assumptions that don't really represent the real-world problem we want to tackle. In many domains where neural nets are used successfully, "we would not know how to give a precise characterisation" is a fundamental limitation of the task that applies to all possible methods.

I'm working on other ML areas and not computer vision, but using your own example, what methods of reconstructing complex 3D geometry from real world photographs have some way of proving that their transformation is correct within certain bounds?

For this particular example, can it even be theoretically provable, given that we know how to make misleading objects that produce visual illusions and appear to have a much different 3D shape than they actually do? (e.g. https://www.youtube.com/watch?v=qJGT-aZKCYk, an example found by a quick search)

In many real-world domains, "lots of examples" is the only proper source of truth; any formal model would be a poor approximation of it, and you'd want a system that can and will diverge from that model when seeing further real-world data. It should be able to correct for the simplifications of the formal model instead of being provably bound to it.


I may have chosen a bad example in the form of reconstructing 3D geometry. But take sentence parsing instead. If I parse a sentence using classical techniques, I know exactly how that works and exactly which sentences would or wouldn't be accepted. That level of understanding isn't (currently) possible with deep learning techniques, even if in some cases they perform better.

It's true that some problems are so ill-defined that all you can judge is whether or not a particular technique succeeds over a sample of interesting cases. But not all problems are like that.

>what methods of reconstructing complex 3D geometry from real world photographs have some way of proving that their transformation is correct within certain bounds?

The issue I have with this is that it's essentially giving up on understanding how reconstruction of 3D geometry works. One might at least hope that the techniques that make it possible to do this from real-world photographs are, with some idealization, the same techniques that make it possible to do this (nondeterministically) from a 2D perspective rendering of a 3D scene made of polygons. And we certainly can prove results about the effectiveness of those techniques. I think it's far too early to give up on that possibility and just say "it's all a mess, and whatever methods happen to work happen to work".

>we know how to make misleading objects that produce visual illusions to appear to have a much different 3d shape that they do?

But that supports my point, I think. We can prove that those objects have the propensity to give rise to illusions given certain assumptions regarding the algorithms used to reconstruct the scene. We can't (yet) prove what kinds of objects would fool a deep learning model.


Okay, let's take sentence parsing instead; I've got much more background in that. If we're looking at classical techniques in the sense of 'classical' as in techniques popularized pre-neural-networks, some 10+ years ago, e.g. something like Charniak-Johnson or Nivre's MaltParser (generally augmented with all kinds of tricks, model ensembles, transfer learning, custom preprocessing for structured data such as dates, and a whole NLP pipeline before the syntax part starts - all of this was pretty much a must-have in any real usage), then all the same criticisms apply: the factors that the statistical model learns aren't human-understandable, and the concept of "accepted sentences" is meaningless (IMHO rightfully so); the parser accepts everything, and the question is about the proper interpretation/ranking of potentially ambiguous locations. Even simple methods such as lexicalized PCFGs would fall into this bucket; pretty much all "knowledge" is embedded in the learned probabilities, and there isn't much meaningful interpretability lost by switching from a PCFG to neural networks.

On the other hand, if we think of 'classical techniques' as something a textbook would describe as an example, e.g. a carefully hand-built non-lexicalized CFG, then these fall under the "highly simplified artificial case" I talked about earlier - they provide a clean, elegant, understandable solution to a small subset of the problem, a subset that no one really cares about. They simply aren't anywhere near competitive on any real-world data; they either "don't accept" a large portion of even very clean data, or provide a multitude of interpretations while lacking the power to provide good rankings comparable to state-of-the-art pipelines based on statistical learning or, recently, neural networks.

Furthermore, syntactic parsing of sentences has exactly these theoretical limits to provability - there is no good source of "truth", and no good source of truth is possible. If you follow a descriptive linguistic approach, then English (or any other language) can only be defined by lots of examples; and if you follow a prescriptive approach (which could be made into some finite formal model), then you get plenty of "ungrammatical" sentences even in literal literary English, e.g. reviewed and corrected works by respected authors, and even more so in any real-world text you're likely to encounter. Careful human annotators disagree on some 3% of primitive elements, i.e. every 2-3 sentences - how can you prove that any model will get the correct answer if, for almost half the sentences, we cannot agree on what exactly the correct answer is?


Of course it is not possible to prove that a particular algorithm "does the right thing", because "doing the right thing" is an inherently vague notion. The issue with neural networks is that we often don't understand how they work even under idealized conditions. In the case of the PCFG, we can characterize precisely which sentences will or won't be parsed for a given parsing algorithm. We are never going to have an explanation of how real world parsing works because the real world is too complicated. But we might hope to figure out how the techniques that work in the real world can be understood as extensions of techniques that work in idealized conditions. The PCFG is a good example of that. There's nothing to understand about the probabilities. As you say, they're amalgamations of an indefinite number of real-world factors. But there is a core to the parsing algorithm that we do understand.


People have used genetic algorithms to evolve sets of images that fool DL machine vision models. Your point stands: there is no deterministic method that I know of that can generate images optimized to fool DL models, and that would indicate a black-box model.


> I'm getting kind of sick of this "deep learning is a black box" trope

I'm not even a deep learning fan (see username), but this is a meme people need to get rid of. This meme is probably hurting people more than the actual impermeability of deep learning.


Check this out

"Learning 3D Models from a Single Still Image" https://www.youtube.com/watch?v=bWbEsDbfayc

"3-D Reconstruction from a Single Still Image" http://cs.stanford.edu/people/asaxena/reconstruction3d/

"Make3D: Convert your still image into 3D model" http://make3d.cs.cornell.edu/


Not that forgotten, right? Isn't SLAM almost pure geometry? I don't think stuff like LSD-SLAM or any structure from motion stuff has deep learning built in.


Correct about the current state of SLAM. However, the next wave of computer-vision-focused tracking and mapping is heavily based on ML, though there isn't a single combined system that gives robust SLAM/PTAM-level results yet.


Why would anyone downvote the parent comment (by andrewkemendo), given that they seem to speak with the authority of an expert in the field? Thanks for your comment.


There seem to have been some nice developments in this direction recently. Something like SfM-Net (https://arxiv.org/abs/1704.07804) or the Correlation Filter Network (http://www.robots.ox.ac.uk/~luca/cfnet.html) for tracking.


The SLAM algorithms I am familiar with are mostly statistics, abstracted above the camera component. They are mainly concerned with "where am I most likely to be, given that I am witnessing these previously seen landmarks and have this noise model?" Moreover, the chosen statistical model has to be one that can be updated rapidly as new landmarks and state estimates are added (which typically shows up as an increase in dimension until you hit some limit you impose).


Most (all?) SLAM approaches are naive and produce point clouds. A true geometry approach would result in purely 3D vector output. Kinect, Tango, and HoloLens all fail at recognizing flat surfaces and straight lines.


There exist research SLAM systems that use lines, planes[1] and other geometric objects as features, and thus provide non-point geometric reconstructions. Hell, there's even a demo using entire pieces of office furniture as the localization features [2].

[1] https://www.youtube.com/watch?v=KOG7yTz1iTA

[2] https://www.youtube.com/watch?v=tmrAh1CqCRo


This seems like a perfect thread to ask a question that has been itching me for some time. Is anyone aware of deep learning approaches, or some combination with geometry algorithms, for mesh generation? Something like quad, hex, or tetrahedral meshing aided by deep learning?


Interested in where you're going with this. What are the advantages you are seeking out? I've done some computational geometry with slice data from meshes, but have been operating off of STL files, not a feature-based format.


Just interested in simple mesh algorithms. I feel as if machine learning could help, but almost no one is going in this direction. Higher-order quad and hex meshing is not a solved problem, and solving it could help a bunch of numerical physics algorithms.


I'd be keen to do, or learn about, a mapping from a spatial tree to a 2D raster, with the inverse (dual) mapped back to a spatial tree via a deep learning model, though I'm not sure how you'd represent the shapes and transform matrices. Maybe some sort of matrix stream, like char2vec?

A nice property of such an experiment, you could generate your training data via permutations of the spatial trees.

You may have to accept multiple valid answers, since one could correctly say, the ball is to the left of the car, or the car is to the right of the ball.


What is the cost/benefit in this? Is it appropriate to divert resources to older techniques because we haven't yet figured out how to do it the new way? Yes, we can score a few points, but that is an advancement only in an academic setting.

We used to need PhDs to do simple computer vision applications, and object recognition in any meaningful manner was a faraway dream. Now a child can build an application on their Raspberry Pi, because the new way of doing things is generalizable. You don't need to spend huge amounts of time redoing things just to get the basics and reach the state of the art as it was 20 years ago.

Should we reintroduce the old ways for the quick wins, or should we divert our research resources to try to solve unsolved problems? GPUs will get cheaper; cloud GPUs will get cheaper.

So this is the state of things today. If somebody wants to advertise his paper's submission to a conference, good for him, but it should not be presented as an important advancement that should become the new way of doing things, because it isn't.

When we decide that it's feasible to send robots to other planets, or even build robots on them, building chessboards to do rectification of the cameras should not be one of their tasks.

Disclaimer: I had a horse in this race too. I was on the losing side of the deep learning argument, I was wrong, and I got over it.


> What is the cost/benefit in this? Is it appropriate to divert resources to older techniques because we haven't yet figured how to do it in the new way?

I too would like to know why cars have wheels when helicopters have shown that rotors work well. We should have abandoned wheels ages ago, since they are clearly old and therefore inferior.

> Now a child can make an application on its Raspberry Pi, because the new way of doing things is generalizable.

And that app will have issues deciding if it sees a couch or a leopard. Or do you expect a child to correctly train its neural net?

> The GPU will get cheaper the cloud GPUs will be cheaper.

Why again do we have multiple algorithms for sort when a few nested for loops would do? Maybe smart algorithms scale better than hardware ever could.

> building chessboards to do rectification of the cameras should not be one of their tasks.

And yet we have Mars probes with a built-in color table. Why do you hate geometry?


>I too would like to know why cars have wheels when helicopters have shown that rotors work well. We should have abandoned wheels ages ago since they are clearly old and with that inferior.

I said cost/benefit and your example is about an expensive way to do what a car can do. You get some points for speed but lose big points on affordable transportation. The world has settled on the car.

> And that app will have issues deciding if it sees a couch or a leopard. Or do you expect a child to correctly train its neural net?

It will train a neural net sooner than it will learn 3d computer vision.

> And yet we have mars probes with a build in color table. Why do you hate geometry?

I said that I was a skeptic of deep learning and in favor of the old ways, so why do you claim that I hate it? Its inadequacy regarding the current, and possibly future, state of computer vision technology is not an expression of my feelings. It's a mere observation.


> I said cost/benefit and your example is about an expensive way to do what a car can do.

Sometimes a neural net running on a server rack full of GPUs is also overkill, cost/benefit included. You get bonus points for buzzwords, though.

> It will train a neural net sooner than it will learn 3d computer vision.

And you would trust it to run as required? I admire your courage. Unless the person selecting the training data knew what they were doing I wouldn't. I certainly wouldn't trust a child to get it right without being trained itself.

> I said that I was a skeptic against deep learning an in favor of the old ways, so why do you claim that I hate it?

You argued against "building chessboards" for basic calibration as if that were in any way hard, or even necessary. Calibration of sensors on Mars is already a solved problem; I don't understand why you would think otherwise.


> I said cost/benefit and your example is about an expensive way to do what a car can do. You get some points for speed but lose big points on affordable transportation. The world has settled on the car.

No, the world uses the one that makes sense for the particular circumstance.

A few helicopters even have wheels!


This reminded me of a cool paper I came across recently, Spatial Transformer Networks [1], a good example of how knowledge of geometry helps frame the problem more effectively, allowing the network to learn how to, e.g., rotate objects into a canonical orientation before identifying them.

[1] https://arxiv.org/abs/1506.02025
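
For concreteness, the differentiable warping at the core of an STN boils down to building a sampling grid from a predicted affine matrix and bilinearly sampling the image through it. A minimal PyTorch sketch of just that sampling half (a real STN also has a small localisation network that predicts theta):

    import torch
    import torch.nn.functional as F

    def spatial_transform(images, theta):
        """Differentiably warp a batch of images by per-image 2x3 affine matrices `theta`.
        Both the grid construction and the bilinear sampling are differentiable, so gradients
        flow back into whatever small network predicted `theta`."""
        grid = F.affine_grid(theta, images.size(), align_corners=False)
        return F.grid_sample(images, grid, align_corners=False)

    images = torch.rand(8, 1, 28, 28)                # dummy batch, MNIST-sized
    theta = torch.eye(2, 3).repeat(8, 1, 1)          # start from the identity transform
    theta.requires_grad_()
    spatial_transform(images, theta).mean().backward()   # gradients reach `theta`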


> In computer vision, geometry describes the structure and shape of the world. Specifically, it concerns measures such as depth, volume, shape, pose, disparity, motion or optical flow.

Isn't optimization geometry? DL is just shifting the requisite geometric insights to a different level of abstraction. Take that knowledge of geometry, revolutionize non-convex optimization, and let all fields that build on the base abstraction benefit.


DL's optimization is pretty simple SGD. But I would suggest that the type of optimization in deep learning is actually a lot more like geometry than calculus. This was more of a hunch until recently, but some results such as the successful application of tropical geometry and less exotic Riemannian geometry to the analysis of DL loss surfaces are showing some evidence for that hypothesis.


I have a hunch that an understanding of classical control will also lead to much better results than the deep RL stuff people are doing for ML-based control.


Classical control often restricts itself to Linear systems, and non-linear control is next to useless for non drift-free systems (if you can compute the Lie algebra in the first place). SOS relaxations won't scale, and all bets are off when dealing with contacts.


So what's your suggestion when dealing with any non-trivial system? All those issues you mention are true but also active research areas.


This is making the odd assumption that our previous abstractions should play well with our current abstractions of pixels.

My feeling is that the "perfect" abstraction of reality to geometry is actually a very high-order function that we don't fully understand. An easy example is that parallel lines aren't always parallel, even though that is a common geometric affordance.

So, that our current toolset does not play well with our previous one should not be that surprising. Would it be nice if they did? Yes. But it would also be nice if Newtonian physics played well with quantum physics.

Why? Tough to answer. Not impossible, but above my ability to understand.


I think you're kind of missing the author's point.

> My feeling is that the "perfect" abstraction of reality to geometry is actually a very high order function that we don't fully understand.

You don't need a perfect abstraction of reality, though. All you need to do is get close enough to catch up with humans or outperform them, and you've solved most of the hard problems in computer vision. Fortunately, at human scales (which are the only ones you need to care about for computer vision), reality behaves closely enough to a pure Euclidean space that you're fine. :)

The author's primary argument seems slightly more nuanced, too, than just "It'd be nice if we could use old techniques".

Basically, they're claiming that if you can build geometric understanding into the ML model, you will get significantly better results than just naively plugging and chugging away with raw data. That's an empirical claim that can be validated by researchers--either it will give significantly improved performance on well-defined problems (stereo vision, etc.) or it won't. Vision is one of the research areas that have developed pretty good benchmarks over the years. :)


But that is exactly my point. Geometric models are the old tools we had. The new tools are ML Models. It would be nice if they both worked together. But there is nothing to say that they should. Nor is it obvious that it would be beneficial.

Your point that this is testable, though, is important. I fully agree with that and was not intending to dismiss the idea. Just because I am not as confident as the author, does not mean that I am right. :) (I'd accept that I am likely not right.)


>> For example, one of my favourite papers last year showed how to use geometry to learn depth with unsupervised training.

I've been saying that LIDAR is a hack for some time. People don't need it and neither should computers.


I'm sympathetic to the idea that robotics should try to do more with cameras, as opposed to LiDAR. As you say, humans can do without LiDAR (and they don't even need stereo!). But in practice, there's a good case for LiDAR in real-world applications, at least for the near future:

* The accuracy of depth from a good LiDAR sensor is hard to beat, especially at longer ranges. Looking at a stereo point cloud and a LiDAR point cloud side by side, you can really see the difference (see the quick calculation after this list).

* LiDAR works in bad lighting, even complete darkness (in fact, it might work better in such conditions). For self-driving applications, you probably want to be able to see that deer crossing the road at night.

* LiDAR works on textureless surfaces, which are traditionally an issue for stereo/SfM.
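
A quick back-of-the-envelope on that first point, with made-up but plausible numbers: for rectified stereo, Z = f*B/d, so a fixed disparity error turns into a depth error of roughly Z^2 * dd / (f*B), i.e. it grows quadratically with range, whereas a time-of-flight LiDAR's range error stays roughly constant.

    # Hypothetical stereo rig: focal length [px], baseline [m], disparity noise [px].
    f_px, baseline_m, disp_err_px = 1000.0, 0.30, 0.25
    for z_m in (5.0, 20.0, 80.0):
        # dZ ~ Z^2 * dd / (f * B): about 0.02 m at 5 m, 0.33 m at 20 m, 5.3 m at 80 m.
        print(z_m, (z_m ** 2) * disp_err_px / (f_px * baseline_m))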

That said, there are plenty of issues with LiDAR: relatively sparse, doesn't work on certain surfaces like water or shiny metal, heavyweight (bad for drones). So at least for now, it makes sense to integrate with other sensing modalities.

Disclaimer: I have published research on Deep Learning with LiDAR.


Thanks for a solid response. One thing I've wondered about LIDAR, though, is how well it works with tens or hundreds of systems operating in the same space. Do they interfere with each other? Think about one of those 6-8 lane highways packed full of cars, each with a LIDAR system. Will it work? I honestly don't know. I have the same question about the radars currently used for cruise control.


LIDAR is not a hack. It measures depth even when objects are not moving.

With vision we can estimate depth, but sometimes we need movement for that.


>> LIDAR is not a hack.

Depends on what one considers a "hack". Epistemologically, yes, it is a hack because it already assumes a 3D world identifiable via such means, and therefore may make the AI incapable of "reconstructing" in its "brain" more dimensions. Whereas if the AI has to learn the "hard way", it may be able to project that analogously to other ideas, and could "speak" of them in the same terms we do when speaking about dimensions.

Of course, if we're only looking for productivity from the AI, then philosophical "correctness" is a needless endeavor. ... But it might be more fun.


So, LIDAR is best solution for an autonomous vehicle when it is not moving?


I did not say that.

It might be the best solution when nothing is moving.

Our brains also cannot measure distance when the reference objects are unknown and everything is standing still.


That was a joke. Some people would laugh. Others down vote.


Why do dolphins need sonar? They hunt in the visible-light zone, and they do have eyes. Same deal here.


Cats have whiskers. We have the advantage of being in the space we are judging relative to ourselves. LIDAR is analogous.


People don't need wheels. Cars shouldn't need them either.


You could build legs for cars. It's harder, yes, but it can be done. // Your argument doesn't work against his point. The fact is that yes, a computer could, in theory, build a calculable 3D perspective of the world without needing a distance-scanning sensor. It could start with that assumption already... but it helps to use two eyes.
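To put rough numbers on the "two eyes" part: with a calibrated stereo pair, depth follows directly from how far a feature shifts between the two views. A minimal back-of-the-envelope sketch (all the numbers below are made up for illustration):

    # depth from disparity for a calibrated, rectified stereo pair
    focal_px = 700.0      # focal length in pixels (illustrative)
    baseline_m = 0.12     # distance between the two cameras in metres (illustrative)
    disparity_px = 14.0   # horizontal shift of a matched feature between views

    depth_m = focal_px * baseline_m / disparity_px
    print(depth_m)        # -> 6.0 metres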


It isn't just a case of deep learning vs. stereo geometry; you can have many cameras informing/improving depth analysis.

I've been out of the loop for a while, but I'd be mind-blown if people were happy to rely on deep learning for depth reconstruction in domains like...self-driving vehicles, versus provably accurate, fault-tolerant systems with multiple cameras + geometry.

And what are the hardware requirements typically like for a DNN processing high-resolution images in real time? Are we well past the stage where reconstruction is both real-time and accurate?


Who here visualizes, but does not have strong spatial skills?


I was just reading https://blog.openai.com/adversarial-example-research/ and it struck me that an understanding of geometry might help. For the examples and scenarios cited, the geometries of the two items would be different (sometimes markedly, sometimes less, like washer-vs-safe).


If you can afford the data and computing power, deep learning is the way to go... you don't need to think much to get the thing done quickly. But in my eyes, having a suitable mathematical description of a particular problem always gives you useful insights into the problem, its variables, and their relationships, and that is actually what most of us are looking for.


Apologies for asking a computer-vision-related but off-topic question: does anyone know where to start if I wanted to track moving objects (players) in a (sports) video? Is there a 'general purpose' open-source library out there anyone would recommend, or is this not 'trivial' to implement? (I presume none of this is trivial stuff.)


Have you looked at OpenCV?


I have only just started looking at the problem and (coincidentally) found this thread on the front page of HN so I couldn't resist asking the question!

I will look into OpenCV; is it the go-to library for these kinds of problems?


Pretty much, but it has some warts (the matrix library is notably bad imo).

How easy the task is depends on the nature of your game. If your players have a red outline and that's the only red in the scene, no big deal.
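For that easy case, a minimal OpenCV sketch would be colour thresholding plus bounding boxes; the HSV range, area cutoff and file name below are placeholders you'd tune for your own footage (and this assumes OpenCV 4.x, where findContours returns two values):

    import cv2
    import numpy as np

    cap = cv2.VideoCapture("match.mp4")  # hypothetical input video
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # "red" wraps around hue 0 in HSV; this range only covers the low end
        mask = cv2.inRange(hsv, np.array([0, 120, 70]), np.array([10, 255, 255]))
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            if cv2.contourArea(c) > 200:          # skip small noise blobs
                x, y, w, h = cv2.boundingRect(c)
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow("players", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()
    cv2.destroyAllWindows()

Real sports footage is messier (occlusions, similar kits, camera motion), which is where background subtraction or a learned detector plus a tracker comes in.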


Yes.


There are several nice references within the article. I think the gvnn, Neural Network Library for Geometric Computer Vision, (https://arxiv.org/abs/1607.07405) should also be mentioned.


I am wondering whether a properly structured deep net can capture these geometry features within a few layers, provided the features are relevant to the problem.

Is this just a reflection of our human arrogance, assuming we simply know better than neural networks?


Sure it can. The problem is to validate that, e.g., the net really knows geometry and isn't just working on problems similar to the ones in its training set. Unless the training can be validated, i.e. formally proven that after some specific training the net "knows" geometry in all cases, you'll have to complement neural nets with classic computer vision, no matter how good you think your model is.


"not many problems that aren't solved by deep learning"

The problems that deep learning doesn't solve aren't solved by anything else either. There's still lots of computer vision that we cannot do.


"However, these models are largely big black-boxes. There are a lot of things we don’t understand about them."

This describes geometry as well as it describes deep learning.


Err, no. A model is a "black box" if the only thing we have is the input and the output, with little intuition about how the model produces the output from the input. We have spent at least a couple of thousand years studying geometry; we know geometry quite well.

Let me demonstrate with a stupidly simple geometric model.

Suppose (for the sake of argument) that we have simple image input, consisting only of simple solid geometric structures. Say, solid 2D circles of one color on a background of a different color.

From high school geometry, we know everything there is to know about a circle once we know its location on the x-y plane and its radius. We could easily come up with a parametric model for fitting circles to the pixel data of circular objects. (For example, we could minimize the 2-norm of the difference between the data image and the image corresponding to a set of circles [x_i, y_i, r_i], i = 1..n.) This kind of descriptive parametric model would be particularly easy to understand: the model structure consists of nothing but representations of circles! (But of course, it wouldn't be a particularly interesting model; it would only apply to simple images consisting of circles.)
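Here's roughly what I mean, as a sketch (the soft-edged rendering and the optimizer choice are just illustrative details to keep the objective smooth; it also needs a reasonable initial guess):

    import numpy as np
    from scipy.optimize import minimize

    H, W = 64, 64
    yy, xx = np.mgrid[0:H, 0:W].astype(float)

    def render_circle(cx, cy, r, softness=1.0):
        # soft-edged disk: ~1 inside the circle, ~0 outside
        dist = np.sqrt((xx - cx) ** 2 + (yy - cy) ** 2)
        return 1.0 / (1.0 + np.exp((dist - r) / softness))

    # synthetic "observed" image: a single solid circle
    observed = render_circle(30.0, 25.0, 10.0, softness=0.5)

    def objective(params):
        cx, cy, r = params
        return np.sum((observed - render_circle(cx, cy, r)) ** 2)

    result = minimize(objective, x0=[32.0, 32.0, 8.0], method="Nelder-Mead")
    print(result.x)  # lands near (30, 25, 10)

The fitted parameters are the model; there is nothing else to interpret.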

Alternatively, we could work out the mathematics a bit more and come up with something like the Hough transform to find circular shapes. Still nothing mysterious about it: https://en.wikipedia.org/wiki/Circle_Hough_Transform
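For instance, OpenCV ships an implementation; the parameter values here are illustrative rather than tuned, and "circles.png" is a placeholder file name:

    import cv2
    import numpy as np

    img = cv2.imread("circles.png", cv2.IMREAD_GRAYSCALE)
    assert img is not None, "couldn't read the input image"
    img = cv2.medianBlur(img, 5)

    circles = cv2.HoughCircles(img, cv2.HOUGH_GRADIENT, 1, 20,
                               param1=100, param2=30, minRadius=5, maxRadius=60)
    if circles is not None:
        for cx, cy, r in np.round(circles[0]).astype(int):
            print("circle at ({}, {}), radius {}".format(cx, cy, r))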

However, my point is: we could also train a neural network to find circles in the images of our example. It might be good at it. But understanding how the circle representations are encoded in the final trained network would certainly not be as easy as in our nice parametric model.

Some realistic applications of "simple" geometric models would be active contours / snakes ( https://en.wikipedia.org/wiki/Snake_(computer_vision) ) or (stretching the meaning of the word 'geometry') the various traditional edge-detection algorithms that have been around for a long time.

Or read the post, in which the author describes how they used a projective geometry model to account for camera position and orientation, or for stereo images. We know how the geometry of stereo vision works: we don't need to waste resources training a network to learn an inscrutable model of it.
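For the stereo case, the classical geometric pipeline is only a few lines (file names and parameters below are placeholders; invalid disparities should be masked out in practice):

    import cv2

    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
    assert left is not None and right is not None

    stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    disparity = stereo.compute(left, right).astype("float32") / 16.0  # fixed-point output

    # with known calibration: depth = focal_length_px * baseline / disparity
    focal_px, baseline_m = 700.0, 0.12   # example calibration values
    depth_m = focal_px * baseline_m / (disparity + 1e-6)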

Deep learning is useful when we need models for things complicated enough that we don't know how to model them. (For example, a model that tells us "is there a dog in this image".)


In my opinion you are overconfident in the foundations of mathematics. Like deep learning models, math works. Why and how does it work? It's open to interpretation in both cases. In both cases, we don't have a complete understanding. It is that lack of complete understanding that makes it a black box.


The fact that convolutional neural nets are used in vision is significant. The convolutional structure encodes the geometry.
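More precisely, convolution is translation-equivariant: shift the image and the feature map shifts with it. A tiny numerical check using circular convolution (real convnets differ mainly in boundary handling; the FFT trick here is just for the demo):

    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.random((16, 16))
    kernel = np.zeros((16, 16))
    kernel[:3, :3] = rng.random((3, 3))   # a small 3x3 filter, zero-padded

    def circ_conv(x, k):
        # circular 2D convolution via the FFT
        return np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)))

    shifted = np.roll(img, 2, axis=1)
    print(np.allclose(circ_conv(shifted, kernel),
                      np.roll(circ_conv(img, kernel), 2, axis=1)))  # True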



