Interesting links. That paper indicates that its primary benefit is with rotations rather than translations. Regular CNNs are perfectly capable of dealing with translations.
Looks like the part about translational invariance is wrong.
Translational invariance is an invariance to translations, not rotations. If a model detects a rotated cat as a cat, then it is rotationally invariant.
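To make the distinction concrete, here is a minimal NumPy sketch, not taken from the paper or the linked articles. The L-shaped `pattern`, the `place` helper and the `score` function are all made up for illustration. A single convolution filter followed by a global max gives the same score wherever the pattern sits, but a lower score once the pattern is rotated.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation, the operation a CNN conv layer applies."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical toy "detector": an L-shaped pattern with no rotational symmetry.
pattern = np.array([[1, 0, 0],
                    [1, 0, 0],
                    [1, 1, 1]], dtype=float)

def place(patch, top, left, size=12):
    """Paste a small patch onto an otherwise empty square canvas."""
    canvas = np.zeros((size, size))
    canvas[top:top + patch.shape[0], left:left + patch.shape[1]] = patch
    return canvas

def score(image):
    """Global max over the feature map: the position of the best match is discarded."""
    return correlate2d_valid(image, pattern).max()

original = place(pattern, 2, 2)
translated = place(pattern, 6, 5)          # same pattern, different position
rotated = place(np.rot90(pattern), 2, 2)   # same position, rotated by 90 degrees

# The first two scores are equal; the third is lower.
print(score(original), score(translated), score(rotated))
```

So the filter plus pooling is translationally invariant for free, while rotation has to be learned with extra filters or handled some other way.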
I suspect transform invariance is what is meant, although we find some transforms much harder than others, which may hint at a more discrete process than a transform matrix in human visual systems.
I'd say transformations are more important than rotations, as in a 3D world we'll almost never see an object from a perpendicular view point, but most of the time we'll see objects that are the right way up.
> in a 3D world we'll almost never see an object from a perpendicular view point
True, however transforms would be more useful as an umbrella term in this context for the subset of transforms that include perspective + orientation of a fixed geometry. Visual systems only need to care about this subset in almost all cases...
In which case it's conceivable that we infer geometry through a set of discrete transforms somewhat like rotations, translations and scaling, or perhaps there is a component that did happen to converge on something more unified resembling an arbitrary transform matrix. If only we could simply identify these pieces in biological systems.
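For what it's worth, the two views are mathematically interchangeable, which is part of why it is hard to tell them apart from the outside. A small sketch in 2D homogeneous coordinates (the helper names `translation`, `rotation` and `scaling` are my own, and the numbers are arbitrary): applying scaling, rotation and translation one at a time gives the same result as applying one combined matrix.

```python
import numpy as np

def translation(tx, ty):
    """3x3 homogeneous matrix that shifts a 2D point by (tx, ty)."""
    return np.array([[1.0, 0.0, tx],
                     [0.0, 1.0, ty],
                     [0.0, 0.0, 1.0]])

def rotation(theta):
    """3x3 homogeneous matrix that rotates counterclockwise by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def scaling(sx, sy):
    """3x3 homogeneous matrix that scales x by sx and y by sy."""
    return np.array([[sx, 0.0, 0.0],
                     [0.0, sy, 0.0],
                     [0.0, 0.0, 1.0]])

point = np.array([1.0, 0.0, 1.0])  # the 2D point (1, 0) in homogeneous form

# "Discrete" view: scale, then rotate, then translate, one step at a time.
step_by_step = translation(3, 1) @ (rotation(np.pi / 2) @ (scaling(2, 2) @ point))

# "Unified" view: collapse the three steps into a single transform matrix first.
combined = translation(3, 1) @ rotation(np.pi / 2) @ scaling(2, 2)
all_at_once = combined @ point

print(np.allclose(step_by_step, all_at_once))
```

This only shows that the end results agree; it says nothing about which representation a biological system actually uses.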
If the point is to easily reconstruct geometry, then mimicking humans should mean using 3D imagery (same object seen from two eyes) to get a better idea of its shape. Wonder if that might some day become part of best practice in computer vision too.
> Each primary capsule output[sic] sees the outputs of all 256 × 81 Conv1 units whose receptive fields overlap with the location of the center of the capsule.
What does that mean? The capsules are bundles of convolutions, and the output of the "256 * 81 conv1" is a 1D manifold. What does "overlap" mean here, and what is the center of the capsule?
I think it is a pretty unnecessary sentence. 81 comes from the 9x9 kernel size. It is obvious that those will overlap despite the stride of 2. Maybe they mean the projective field.
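The arithmetic behind those numbers can be sketched as follows. The 256 channels, 9x9 kernel and stride of 2 come from the quote and the comment above; the 28x28 MNIST input and the stride of 1 for Conv1 are what I recall from the paper's setup, so treat them as assumptions. `conv_output_size` is a helper I made up for a convolution without padding.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Side length of the output of an unpadded ('valid') convolution."""
    return (input_size - kernel_size) // stride + 1

image_size = 28        # assumed: MNIST input
conv1_channels = 256   # from the quoted sentence
kernel = 9             # 9x9 kernels in both layers
conv1_stride = 1       # assumed
primary_stride = 2     # stride of the primary capsule layer

conv1_size = conv_output_size(image_size, kernel, conv1_stride)
primary_size = conv_output_size(conv1_size, kernel, primary_stride)

# Each primary capsule output looks at a 9x9 window across all Conv1 channels.
inputs_per_capsule = conv1_channels * kernel * kernel

# Neighbouring windows start 2 columns apart, so they share 9 - 2 columns.
shared_columns = kernel - primary_stride

print(conv1_size, primary_size, inputs_per_capsule, shared_columns)
```

Under these assumptions Conv1 is a 20x20 map, the primary capsules sit on a 6x6 grid, and each one sees 256 × 81 = 20736 Conv1 units, with adjacent capsules sharing 7 of their 9 input columns.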
These guys don't seem to understand capsule networks all that much. Their image illustrating translation showed a rotated cat; it has now been modified to properly show a translated one.
- https://medium.com/@pechyonkin/understanding-hintons-capsule...
- https://hackernoon.com/what-is-a-capsnet-or-capsule-network-...