Hacker News
Capsule Networks Explained (kndrck.co)
120 points by kendrick__ on Nov 11, 2017 | 20 comments



Here are some other posts explaining the nature of capsule networks, their goals and how they work:

- https://medium.com/@pechyonkin/understanding-hintons-capsule...

- https://hackernoon.com/what-is-a-capsnet-or-capsule-network-...


Here's a short, fluffy piece about Geoffrey Hinton + Capsule Networks: [1]

[1]: https://www.wired.com/story/googles-ai-wizard-unveils-a-new-...


When it comes to translation/rotation invariance, this is a similar idea to the "Harmonic Networks: Deep Translation and Rotation Equivariance" paper:

- https://arxiv.org/pdf/1612.04642.pdf

- https://www.youtube.com/watch?v=qoWAFBYOtoU

Maybe they can be combined?
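
For context, the equivariance in question is f(ρ·x) = ρ·f(x). Plain convolution already has it for 90° rotations, provided you rotate the filters along with the input; harmonic networks constrain the filters so it holds for continuous rotations, without keeping rotated filter copies. A quick numpy sanity check of the discrete case (a sketch, all names mine):

    import numpy as np

    def corr2_valid(x, k):
        # naive 2D "valid" cross-correlation, like a conv layer without padding
        m, n = x.shape[0] - k.shape[0] + 1, x.shape[1] - k.shape[1] + 1
        return np.array([[np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
                          for j in range(n)] for i in range(m)])

    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 8))
    k = rng.standard_normal((3, 3))

    # Rotating the input *and* the filter rotates the feature map:
    assert np.allclose(corr2_valid(np.rot90(x), np.rot90(k)),
                       np.rot90(corr2_valid(x, k)))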


Interesting links. That paper indicates that its primary benefit is with rotations rather than translations. Regular CNNs are perfectly capable of dealing with translations.


Looks like the part about translational invariance is wrong. Translational invariance is an invariance to translations, not rotations. If a model detects a rotated cat as a cat, then it is rotationally invariant.
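
To make the distinction concrete, here's a minimal 1D numpy sketch (all names and numbers mine): a convolution's feature map shifts along with a shifted input (translation equivariance), and global max-pooling on top of that is translation invariant. Neither property holds for rotations out of the box.

    import numpy as np

    def conv1d_valid(x, k):
        # naive "valid" cross-correlation, like a conv layer without padding
        n = len(x) - len(k) + 1
        return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

    x = np.zeros(16)
    x[4:7] = [1.0, 2.0, 1.0]        # a small "feature" in the signal
    k = np.array([1.0, 2.0, 1.0])   # a kernel tuned to that feature

    y  = conv1d_valid(x, k)
    ys = conv1d_valid(np.roll(x, 3), k)   # same input, translated by 3

    # Equivariance: the feature map shifts with the input (away from borders)...
    assert np.allclose(np.roll(y, 3)[3:-3], ys[3:-3])
    # ...and global max-pooling then gives translation *invariance*:
    assert np.isclose(y.max(), ys.max())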


I suspect transform invariance is what is meant, although we find some transforms much harder than others, which may hint at a more discrete process than a transform matrix in human visual systems.


I'd say translations are more important than rotations, as in a 3D world we'll almost never see an object from a perpendicular viewpoint, but most of the time we'll see objects that are the right way up.


> in a 3D world we'll almost never see an object from a perpendicular view point

True; however, "transforms" would be more useful in this context as an umbrella term for the subset of transforms that includes perspective + orientation of a fixed geometry. Visual systems only need to care about this subset in almost all cases...

In which case it's conceivable that we infer geometry through a set of discrete transforms somewhat like rotations, translations, and scalings, or perhaps there is a component that happened to converge on something more unified, resembling an arbitrary transform matrix. If only we could identify these pieces in biological systems.
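
A small numpy sketch of why the two hypotheses are hard to tell apart from behaviour alone: any chain of discrete rotations, translations, and scalings collapses into a single affine matrix (homogeneous coordinates; the numbers are arbitrary):

    import numpy as np

    def rot(theta):
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

    def trans(tx, ty):
        return np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]])

    def scale(k):
        return np.array([[k, 0, 0], [0, k, 0], [0, 0, 1]])

    # Three "discrete" steps compose into one arbitrary-looking matrix:
    composed = trans(2, -1) @ rot(np.pi / 6) @ scale(0.5)

    p = np.array([1.0, 1.0, 1.0])   # a point in homogeneous coordinates
    step_by_step = trans(2, -1) @ (rot(np.pi / 6) @ (scale(0.5) @ p))
    assert np.allclose(composed @ p, step_by_step)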


If the point is to easily reconstruct geometry, then mimicking humans should mean using 3D imagery (same object seen from two eyes) to get a better idea of its shape. Wonder if that might some day become part of best practice in computer vision too.
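
For reference, the standard way "two eyes" recover shape is triangulation on a rectified stereo pair. A toy calculation, with made-up numbers:

    # Pinhole stereo triangulation: depth from disparity on a rectified pair.
    # All values below are made-up, for illustration only.
    focal_px  = 700.0   # focal length, in pixels
    baseline  = 0.06    # distance between the two cameras, in metres
    disparity = 14.0    # horizontal pixel shift of the same point between views

    depth = focal_px * baseline / disparity   # = 3.0 metres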


In the same vein, I've always thought that operating on short (< 1s) video clips would help a lot with overfitting and object differentiation.


There's one thing in the paper that has me stumped:

> Each primary capsule output[sic] sees the outputs of all 256 × 81 Conv1 units whose receptive fields overlap with the location of the center of the capsule.

What does that mean? The capsules are bundles of convolutions, and the output of the "256 × 81 Conv1" is a 1D manifold. What does "overlap" mean, and what is the center of the capsule?

Note on [sic]: it seems like it should read "input".


I think it is a pretty unnecessary sentence. The 81 comes from the 9x9 kernel size. It is obvious that those will overlap despite the stride of 2. Maybe they mean the projective field.
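
For what it's worth, the 256 × 81 falls straight out of the shapes in the paper, assuming the MNIST architecture (28×28 input; Conv1: 256 kernels of 9×9, stride 1; PrimaryCaps: 9×9 kernels, stride 2). A back-of-the-envelope check:

    # Shape bookkeeping for the CapsNet MNIST architecture (Sabour et al., 2017).
    conv1_size = 28 - 9 + 1                 # Conv1 output: 20x20, 256 channels
    prim_size  = (conv1_size - 9) // 2 + 1  # PrimaryCaps grid: 6x6 (stride 2)

    # Each primary capsule at a grid position is computed from a 9x9 window
    # over all 256 Conv1 channels -- the "256 x 81" in the quoted sentence:
    inputs_per_capsule = 256 * 9 * 9        # = 20736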


Thanks. So maybe it is saying that the field overlap with capsules is implicit in the network, not a step in the calculation? That's my conclusion.


I feel like capsule networks are one step closer to a hybrid between standard deep learning tools and Hofstadter's conceptual-slippage networks.



I thought CNNs were also translationally invariant; why are they saying they're not?


These guys don't seem to understand capsule networks all that much; they had a translation image showing a rotated cat, which has now been modified to properly show it translated.


Aw... I thought this was a post explaining active networking using capsules.


The startup? Is that what you are referring to?




