The premise of "novelty" ML projects like finding Waldo, playing Mario, etc. has been that the application could be of greater use elsewhere.
Has this been demonstrated? I've heard Watson got really good at Jeopardy, but in the enterprise world many were unsatisfied with the results.
Is it still considered more efficient to focus on small ML projects and then apply it to a bigger problem? It seems like we have enough algorithms that we can start making valuable ML products and focusing less on novelty applications.
Watson is probably a capability overreach thanks to some marketing gurus at IBM.
If you think about complex games like Chess, or Mario, or Find Waldo, and then think about the abstractions we use to reason about very complex topics, they aren't terribly different. Games like Cities: Skylines are being looked at seriously for modeling civic infrastructure.
If we can train a system to reason about things using the abstractions we also use, and they're as successful as a human is using those abstractions (or better), then we've "won" in a sense.
Find Waldo isn't too far removed from "find the right screw in the parts bin" at a factory.
I think an aspect of deep learning that is often overlooked is that it is still not clear how much of current algorithm performance is defined by local "obsession to detail" vs global "understanding" of the subject matter.
The "tiramisu" layers here are an interesting example of this: they are built on dilated convolutions and one of their main selling points is that they can do calculations on a pixel-by-pixel basis, as opposed to standard methods which use are forced to compress information along the way through pooling/strided convolutions (basically taking multiple pixels and summarizing them into fewer features).
Even Wavenet, which has had a few posts on HN, is in some sense a compromise: a few years ago people were obsessed with the idea of forcing RNNs/LSTMs to summarize the inputs they've seen to date and learning long-range dependencies through a hidden layer that would hopefully be interpretable. Mostly, though, models seem to be very happy staring at the last few inputs... a recent paper showed they basically seem to work as n-gram models with relatively small n [0, 1].
The compromise is Wavenet, which can only act on a context of roughly 300 ms at a time [2], which more or less precludes learning long-range structure, but it doubles down on this inferential bias and runs this tiny audio context through many layers and tens of millions of parameters to outdo state-of-the-art models that need to "lose information" as they process it.
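The ~300 ms figure falls out of simple arithmetic on the dilation pattern. A back-of-the-envelope version in Python (the block count here is my assumption, not the paper's exact configuration):

    kernel_size = 2
    # Dilations double from 1 to 512 within a block; assume 5 stacked blocks.
    dilations = [2 ** i for i in range(10)] * 5

    # Each dilated causal conv layer adds (kernel_size - 1) * dilation
    # samples of context on top of the single current sample.
    receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)
    print(receptive_field)                 # 5116 samples
    print(1000 * receptive_field / 16000)  # ~320 ms of audio at 16 kHz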
To your point, I would argue that most real-world applications are more interested in "global" interactions and an ability to "understand" signals rather than expending tremendous resources on every tiny detail that is observed. I'm not sure that this is the typical solution neural networks are going to converge to.
Partly I think this is motivated by hardware: GPUs are unbelievably powerful computing machines, and they make convolutions look unbelievably attractive. Some researchers have 8 or more of them, so obsession over detail costs them nothing. The other part of the tiramisu models, DenseNets, basically glue together layer after deep layer after deep layer... It's an architecture that is an obvious idea, but from my understanding of GPUs, layer concatenation is an expensive operation, and people wouldn't have bothered designing it a few years ago because they wouldn't have been able to run it on anything other than a Titan X from the future.
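For a sense of why the concatenation pattern is memory-hungry, here's a toy DenseNet-style block in PyTorch (channel counts are made up, and real DenseNets add batch norm, ReLU, and bottleneck layers):

    import torch
    import torch.nn as nn

    growth_rate, num_layers = 16, 4
    x = torch.randn(1, 3, 32, 32)

    features = [x]
    for _ in range(num_layers):
        in_ch = sum(f.shape[1] for f in features)  # 3, 19, 35, 51, ...
        conv = nn.Conv2d(in_ch, growth_rate, kernel_size=3, padding=1)
        # Every layer re-concatenates ALL earlier feature maps, so the
        # activations you have to keep around grow linearly with depth.
        features.append(conv(torch.cat(features, dim=1)))

    print(torch.cat(features, dim=1).shape)  # 3 + 4*16 = 67 channels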
Probabilistic models have been floated as a way to increase the global coherence of what models learn. In my experience, however, when trying to combine probabilistic and convolutional components in one model (like [3]), the neural network's first-order optimization pushes it toward obsessing over details rather than understanding the data well enough to handle uncertainty. To some degree I think this is also what we see in the deep learning community: why do second-order optimization and move toward new paradigms when you have 8 GPUs and can get a 0.1% improvement on the latest image benchmark?
[0] Blog post on "Frustratingly Short Attention": https://martiansideofthemoon.github.io/2017/06/28/short-atte...
[1] From https://arxiv.org/pdf/1703.08864.pdf : "It shows that the memory acquired by complex LSTM models on language tasks does correlate strongly with simple weighted bags-of-words. This demystifies the abilities of the LSTM model to a degree: while some authors have suggested that the LSTM understands the language and even the thoughts being expressed in sentences (Choudhury, 2015), it is arguable whether this could be said about a model that performs equally well and is based on representations that are essentially equivalent to a bag of words."
[2] https://arxiv.org/pdf/1609.03499.pdf
[3] "PixelVAE": https://arxiv.org/pdf/1611.05013.pdf
Thank you very much for the substantive thoughts. As an everyday working programmer, it's nice to hear higher-level thoughts that explain the wider context of "NNs play Mario". We're looking to include some ML-based features in our product, and I like hearing intuition from practitioners as to what actually works and wtf these things are actually paying attention to. I've followed the AlphaGo competition closely in hopes of gaining better insight, but it still ends up being a ton of woo.
they can do calculations on a pixel-by-pixel basis, as opposed to standard methods which are forced to compress information along the way through pooling/strided convolutions
I'm not sure I understand your point. Why do you think dilated filters produce a better global view than pooling?