
If what you're asking about is the math, the steps are (essentially) as follows:

1. A Riemannian manifold is constructed from the dataset.

2. The manifold is approximately mapped to an n-dimensional topological structure.

3. The reduced embedding is an (n - k)-dimensional projection equivalent to the initial topological structure, where k is the number of dimensions you'd like to reduce by.

I don't know how well that answers your question because it's difficult to simplify the math beyond that. But you can also check out the paper on arXiv. [1]
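To make step 1 a bit more concrete: in practice the "Riemannian manifold" is approximated from a k-nearest-neighbour graph whose edge distances are rescaled into fuzzy membership strengths. Here is a loose numpy sketch of that idea; the function name, the fixed k, and the simple `sigma` heuristic are mine (the actual algorithm solves for `sigma` per point via binary search), so treat this as an illustration, not the paper's implementation:

```python
import numpy as np

def knn_fuzzy_weights(X, k=5):
    """Rough sketch of UMAP's step 1: for each point, find its k nearest
    neighbours and turn their distances into fuzzy membership strengths
    (1 at the nearest neighbour, decaying with distance)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    idx = np.argsort(d, axis=1)[:, :k]          # indices of k nearest neighbours
    dk = np.take_along_axis(d, idx, axis=1)     # their distances, sorted ascending
    rho = dk[:, :1]                             # distance to the very nearest one
    sigma = dk.mean(axis=1, keepdims=True) - rho + 1e-12  # crude local scale
    w = np.exp(-np.maximum(dk - rho, 0) / sigma)  # fuzzy weights in (0, 1]
    return idx, w
```

The local rescaling by `rho` and `sigma` is what makes the metric "Riemannian" in spirit: each point measures distance on its own local scale, and the resulting weighted graph is the topological representation that step 2 then embeds.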

The underlying idea is to transform the data into a topological representation, analyze its structure, then find a much smaller (dimensionally speaking) topological structure which is either the same thing ("equivalent") or very close to it. You get most of the way there by thinking about how two things which look very different can be topologically the same based on their properties. A pretty accessible demonstration of that idea is the classical donut <-> coffee mug example on the Wikipedia page for homeomorphisms. [2]
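The donut/mug map is hard to write down compactly, but the same idea shows up in the simplest possible case: a circle and a stretched ellipse look different yet are homeomorphic, because there is a continuous bijection between them whose inverse is also continuous. A tiny sketch (shapes and map chosen by me for illustration):

```python
import numpy as np

# Sample the unit circle.
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)

def f(p):        # circle -> ellipse: stretch the x-axis by 3
    return p * np.array([3.0, 1.0])

def f_inv(p):    # ellipse -> circle: the continuous inverse
    return p / np.array([3.0, 1.0])

ellipse = f(circle)
round_trip = f_inv(ellipse)
assert np.allclose(round_trip, circle)  # same topology, different geometry
```

Everything that depends only on topology (connectedness, the single "hole") is identical for both shapes, even though their geometry differs, which is exactly the sense in which a mug and a donut are "the same".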

__________________

1. https://arxiv.org/pdf/1802.03426.pdf

2. https://en.wikipedia.org/wiki/Homeomorphism




This is a good summary. For a video that explains some of this (but which still hand waves), see the author's presentation at SciPy 2018: https://www.youtube.com/watch?v=nq6iPZVUxZU


Is this actually capturing any properties of the original set, or is this a set of operations that will make any input look similar? (i.e. is this just a pretty picture with no real connection to the math?)


You're asking about two somewhat different things.

In the strict sense two things which are equivalent share the same properties, yes. A homeomorphism is a continuous bijection with a continuous inverse; it's the topological analogue of an algebraic isomorphism. See the example about coffee mugs and donuts both being topological tori.

That being said, I can't really comment on the potential artifacting of this specific algorithm. In theory the overarching idea makes sense: if you find structure-preserving maps between sets of different dimensions, you should expect relations within the set to be preserved (i.e. the relational information in the smaller set is equivalent, there's just less of it). But practically speaking, not all datasets can be usefully abstracted to a manifold in this way, which means that (efficiently) finding an equivalent lower-dimensional projection for the embedding might involve a fair amount of approximation.
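The idealized case of "relations preserved with less of them" is easy to demonstrate: if the data genuinely lies on a low-dimensional flat subspace, projecting onto that subspace is a structure-preserving map that keeps every pairwise distance exactly. A small numpy sketch (the setup is mine, chosen to be the best-case scenario):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data that is secretly 2-dimensional, embedded in 5-D: every point is a
# combination of the same two orthonormal directions.
basis, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # two orthonormal 5-D directions
coords = rng.normal(size=(30, 2))                   # intrinsic 2-D coordinates
X = coords @ basis.T                                # 30 points living in 5-D

# Projecting back onto the basis is a structure-preserving map:
# all pairwise distances (the "relational information") survive intact.
Y = X @ basis                                       # the same 30 points in 2-D
d_hi = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
d_lo = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
assert np.allclose(d_hi, d_lo)
```

Real datasets are not exactly flat subspaces, of course, so a real algorithm only gets an approximate version of this.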

With enough approximation you'll introduce spurious artifacts. But that's precisely where all the innovation comes in: finding efficient ways to construct equivalent structures that represent the data in fewer dimensions. This isn't the first proposal for topological dimension reduction (see information geometry); the devil is really in the details here.


It captures more of the spatial (metric/topological) arrangement of the set. The example they give in the paper is the MNIST dataset, where distinct-looking digits like 1 and 0 get separated farther apart and similar ones clump together, whereas t-SNE, while correctly delimiting individual clusters, clumps them all into one blob.


Cool. Thanks.


Sidenote: I only recently realized the X in arXiv is a Greek chi, so it's pronounced "archive"...


Yep, just like LaTeX :)


"latechi?" ive always heard it "lah-tek"


Latech is how I have heard it.


I'm assuming you mean a hard "a", like "lay-tech". I've heard that too.



