However, that still leaves an open problem. To the best of my knowledge there is no good way to measure "tractability," i.e., how much structure a problem has in comparison to other problems.
For example, I don't think there is a good way to measure how much more tractable -- i.e., how much more structure there is in -- the problem of classifying dog/cat photos versus the problem of classifying dog/cat/wolf photos versus the problem of classifying fruit/vegetable photos.
The various measures of "structure" that we have today (Shannon entropy, algorithmic information content, etc.) fall short for measuring the kind of hierarchical structure (multiple levels of function composition) that is prevalent in high-dimensional problems of interest.
Any research that might lead to a good way of measuring "hierarchical structure" (for lack of a better term) seems important to me.
> Any research that might lead to a good way of measuring "hierarchical structure" (for lack of a better term) seems important to me.
Which is an interesting question. Why is hierarchical structure so important for ML, but not for compression?
One view would be, our ML approach is "wrong" and could be improved. In almost all cases running compressed data through another compression scheme yields little to no improvement. Yet in ML we're focused on multiple transforms requiring a concept of hierarchy. Maybe it's because ML is the equivalent of a multi-layered modular compression schemes but in practice all compression is fairly monolithic. If instead we compressed using first a single layer doing key based look up of long identical strings, and then did run-length encoding we'd have items that look more like our ML approach. But, this doesn't intuitively feel right to me.
Another view is that ML works in the physical world and the physical world has inherent hierarchical structure. And we don't do this in the digital world of compression. But a counterpoint is that when we program, we exploit hierarchical structure in our code and data via functions, layers and abstraction.
Which leads me to think it's the third answer. Compression is lossless, and ML is lossy. Being lossy is what allows hierarchy to work. If you are lossless you can't ever leave working with the lowest level of abstraction or else you're by definition losing/changing information. The command equivalent "Send an email with content X to my sister" is absolutely a lossy summary of what is happening or will happen below the surface. Me giving my computer that command it will translate to different low level actions each time I do it. Similarly if someone asks what command I just ran and I give that description, it isn't enough information for them to recreate a CPU & network packet level exact reproduction of what I did. It is lossy. Machine Learning of "is this a cat?" is exactly like this. By contrast, in compression you have to have every bit recovered.
So I think the answer, back to your point, is we may not have an equivalent methodology of discussing information theory / content in a lossy domain. As a result of this we don't have a way of discussing information content in a hierarchical structure as they may be one and the same.
Yes, that makes sense to me: we don't have a formal, precise way of discussing (measuring) information content when faced with the kind of "hierarchical structure that is robust to loss" that is prevalent in natural high dimensional data.
(FWIW, I don't like the term "hierarchical structure" that much, but cannot think of a more appropriate term. It's as if there's no good terminology for discussing these issues, at least not yet.)
> The various measures of "structure" that we have today (Shannon entropy, algorithmic information content, etc.) fall short for measuring the kind of hierarchical structure (multiple levels of function composition) that is prevalent in high-dimensional problems of interest.
You don't think Kolmogorov complexity captures hierarchical structure? (Ignoring the fact that it isn't computable).
As you correctly point out, it's not computable, so we can't use it for measuring anything in practice. But let's ignore that.
The issue I have with it is that it theoretically measures the lossless compressibility of some sequence of bits, instead of the kind of "hierarchical decomposition that is robust to data loss" that I'm talking about here.
Think about this way: we can randomly zero out half the pixels in a cat photo, and a human being can still see it's the same cat, because it's composed of cat parts that "approximately look like" cat parts (pointy ears, whiskers, etc.), and those parts look like cat parts because they are in turn composed of things that "approximately look like" cat part parts, all the way down to pixels.
AFAIK, there's no way of measuring whether cat photos have more or less of this kind of robust-to-loss hierarchical structure than, say, dog photos, or fruit photos.
However, that still leaves an open problem. To the best of my knowledge there is no good way to measure "tractability," i.e., how much structure a problem has in comparison to other problems.
For example, I don't think there is a good way to measure how much more tractable -- i.e., how much more structure there is in -- the problem of classifying dog/cat photos versus the problem of classifying dog/cat/wolf photos versus the problem of classifying fruit/vegetable photos.
The various measures of "structure" that we have today (Shannon entropy, algorithmic information content, etc.) fall short for measuring the kind of hierarchical structure (multiple levels of function composition) that is prevalent in high-dimensional problems of interest.
Any research that might lead to a good way of measuring "hierarchical structure" (for lack of a better term) seems important to me.