I use it for image deduplication across a ~28M image dataset. It works quite well, though it does take a lot of RAM to hold the entire dataset in memory (~18 GB).
The 18 GB is for just the hashes. The dataset is several terabytes.
A Bloom filter is no use here, as it doesn't let you do searches within an edit distance. Basically, I use a BK-tree of perceptual hashes (https://en.wikipedia.org/wiki/Perceptual_hashing). Then, I can find similar images by searching within a certain edit distance.
The end result is a system that functions very similarly to how Google's reverse image search works.
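As a rough illustration of the kind of structure the comment describes (not the author's actual code, which is linked elsewhere in the thread), here is a minimal BK-tree sketch, assuming 64-bit perceptual hashes compared by Hamming distance:

```python
class BKTree:
    """Minimal BK-tree keyed by a metric distance function."""

    def __init__(self, distance_fn):
        self.distance = distance_fn
        self.root = None  # (item, children) where children maps edge distance -> subtree

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node, children = self.root
        while True:
            d = self.distance(item, node)
            if d in children:
                node, children = children[d]
            else:
                children[d] = (item, {})
                return

    def search(self, query, radius):
        """Return all stored items within `radius` of `query`."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= radius:
                results.append(item)
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - radius, d + radius] can contain a match.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return results

def hamming(a, b):
    # Number of differing bits between two integer hashes.
    return bin(a ^ b).count("1")
```

A search within edit distance `k` then looks like `BKTree(hamming)` plus `tree.search(query_hash, k)`; the pruning step is what avoids a full scan of all 28M hashes.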
I used the approach from a Google paper in a mosaic competition. I didn't end up submitting it, but I got it working.
It uses locality-sensitive hashing to hash vectors into buckets, where each bucket holds a small subset of the total set of items. It worked at finding similar images at a small scale when I used it, but I didn't have many images.
In my case, the vectors were just the rgb values of the down-sampled image.
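A minimal sketch of what that might look like, assuming random-hyperplane LSH over flattened RGB vectors of downsampled images (all names and parameters here are illustrative, not from the paper):

```python
import random
from collections import defaultdict

def make_hyperplanes(n_planes, dim, seed=0):
    """Random Gaussian hyperplanes; each contributes one bit of the hash."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def lsh_key(vector, hyperplanes):
    # One bit per hyperplane: which side of it the vector falls on.
    # Nearby vectors (small angle apart) tend to agree on most bits.
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

# Bucket a few toy "images" (flattened 2x2 RGB thumbnails, 12 values each).
planes = make_hyperplanes(n_planes=8, dim=12)
images = {
    "a.png":  [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
    "a2.png": [11, 21, 29, 41, 49, 61, 71, 79, 91, 99, 111, 119],  # near-duplicate
}
buckets = defaultdict(list)
for name, vec in images.items():
    buckets[lsh_key(vec, planes)].append(name)
```

To find candidates similar to a query image, you hash it the same way and only compare against the items in that bucket (and, in practice, in a few neighboring buckets or repeated hash tables), instead of against every item.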
Maybe they were thinking of fingerprinting, e.g., by downsampling. It's dual in some sense to hashing: a small change to some data should produce a large change in its hash, but only a small change in its fingerprint.
As mentioned in one reply to the parent comment, hashing here usually is not cryptographic.
I'm familiar with locality-sensitive hashing[1], which acts like a dimensionality reducer but, instead of the image itself, receives as input a feature vector extracted from it (or something analogous that allows for similarity assessment).
There are many metric-based data structures, such as the M-tree, the Vantage Point Tree, etc. They all suffer from the same major problem: they rely on the triangle inequality to speed up search by pruning impossible paths. But this doesn't ever yield an easily measurable complexity guarantee (like, say, the guaranteed O(log N) of a balanced binary search tree), so it's very common for certain datasets to cause severe performance degradation, particularly when the data is high-dimensional.
Metric-based data structures face the peculiar problem that each data point is not itself an actual "key", in the way that a data point in a hash table or binary tree is a key. Rather, each data point in a metric tree is only "indexed" in relation to its distance from other keys. This makes it hard to get easily measured performance guarantees.
This issue also gets mentioned in the sklearn documentation [0]. The "curse of dimensionality" describes how, in high dimensions, pairwise distances concentrate (points tend to be roughly equidistant from one another), making efficient indexing on a distance metric difficult.
I feel like the poster child for the frequency illusion. There was an article on 538 a few weeks ago that caught my interest, which used cosine similarity (a non-metric distance function). I dove in and started wondering if there were any ways to find nearest neighbors efficiently using cosine similarity. A week later, Facebook's Faiss implements something of a general-purpose toolkit for indexing metric and non-metric spaces while minimizing storage. And now this. It's like the world is trying to tell me something!
EDIT: Deleted a calculation about approximating arccos(z) <= arccos(x) + arccos(y) by cos(arccos(z)) <= cos(arccos(x) + arccos(y)) = xy - sqrt((1-x^2)(1-y^2)), completely disregarding that cos is a decreasing function on the relevant interval.
Bingo! We tested BK-trees and found them to break down on lots of data sets. The search speedup never overcame the additional cost of building the trees (although these were one-off sets; a search-heavy workload could definitely benefit from them).
An ordered binary tree can still degrade to O(N) if you allow duplicates.
(AFAICT, most of these trees have a "typical" case search time of some root of N, depending on dimensionality. The easiest way to visualize this is to consider a walk from the perimeter to a cell in a Voronoi diagram, and how many cells that walk will visit.)
Very interesting model, but it suffers from a couple of things that other data structures solve better, IMO. One thing that I don't understand (I guess I have to go back and read the article again) is how the distance from any node X to the arbitrarily selected root M tells me anything about the distance between X and the query term Q. If I live in London and the distance from me to Paris is 50,000 km, and you live in Rome, 70,000 km from Paris, that does not give me enough to calculate the distance between Rome and London.
IMO there is no data structure better suited for a search tree than the doubly-chained character trie [0], as it has the following features:
- small in memory (every word is stored at most once, and thanks to shared prefixes often effectively less than once)
- traversing while doing exact matching allows you to prune every non-matching branch at each step
- requires you to calculate the distance only at query time, not when constructing the tree
- apart from exact and fuzzy matches, it also supports prefix matches (traverse down the path of the prefix, return all descendants that are EndOfWords)
- can be represented on disk in a way that makes reading only insignificantly slower than reading from memory
- if nodes are lexically sorted when the tree is constructed, you have also pretty much solved range queries
The trie is the perfect tool for when you find yourself combating big data. What you eventually find out about all the cloud offerings from the big players, which claim to solve both storing and map/reducing over your data, is that you can do the same thing on your laptop, for cheap, as long as the data is represented in a compressed form that still allows querying. The trie is a zip file and a search engine in one.
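A minimal sketch of the doubly-chained (first-child/next-sibling) character trie described above, assuming plain insert and exact-match lookup (fuzzy, prefix, and range queries would be additional traversals over the same node layout):

```python
class TrieNode:
    __slots__ = ("char", "child", "sibling", "is_word")

    def __init__(self, char):
        self.char = char
        self.child = None    # first child ("doubly-chained": child + sibling links)
        self.sibling = None  # next sibling at the same depth
        self.is_word = False # EndOfWord marker

class Trie:
    def __init__(self):
        self.root = TrieNode("")

    def insert(self, word):
        node = self.root
        for ch in word:
            # Walk the sibling chain at this level looking for ch.
            prev, cur = None, node.child
            while cur is not None and cur.char != ch:
                prev, cur = cur, cur.sibling
            if cur is None:
                cur = TrieNode(ch)
                if prev is None:
                    node.child = cur
                else:
                    prev.sibling = cur
            node = cur
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            cur = node.child
            while cur is not None and cur.char != ch:
                cur = cur.sibling
            if cur is None:
                return False
            node = cur
        return node.is_word
```

Note how "car", "cart", and "cat" share the `c -> a` path: the shared-prefix compression is what keeps the structure small in memory.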
>If I live in London and the distance from me to Paris is 50,000 km, and you live in Rome, 70,000 km from Paris, that does not give me enough to calculate the distance between Rome and London.
Yes, but you do know that the distance from London to Rome is at most 120,000 km (and at least 20,000 km), because of the triangle inequality. So if even the nearest possible distance is too far, that search path is discarded.
Then we just start the process over again, by comparing each subtree back to the original location.
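The bound from the triangle inequality can be made concrete with the thread's (made-up) distances. A small sketch:

```python
def distance_bounds(d_a_to_c, d_b_to_c):
    """Bounds on d(a, b) given both points' distances to a common point c,
    by the triangle inequality: |d(a,c) - d(b,c)| <= d(a,b) <= d(a,c) + d(b,c)."""
    return abs(d_a_to_c - d_b_to_c), d_a_to_c + d_b_to_c

# London-Paris = 50,000 km, Rome-Paris = 70,000 km (the example's numbers)
lo, hi = distance_bounds(50_000, 70_000)
# London-Rome must lie in [20_000, 120_000]. In a metric tree, if the
# search radius is smaller than `lo`, the whole subtree is pruned.
```

This is exactly the pruning rule a BK-tree or VP-tree applies at every node: the bounds never tell you the distance, only which subtrees cannot possibly contain a match.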
I'm struggling to find a use case for that data structure, though. Why construct a BK-tree that only becomes powerful once it contains millions of words, at which point representing that amount of data in memory becomes a nuisance and it isn't so fast anymore, when you could represent the same data in a compressed form with the same (as well as an extended set of) querying capabilities?
Perhaps BK-trees are for big machines with powerful CPUs? I'm sure there is a setup that would make that tree in fact better than any other tree.
I don't think the best use case for BK-trees is spell checking and words. The area in which they are used most successfully is image deduplication. In that case the metric is typically a bit-wise distance between some form of perceptual hashes.
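For concreteness, here is a toy version of one common perceptual hash, the "average hash" (aHash). This is a simplified sketch: real implementations (e.g. the `imagehash` library) first resize the image to a small fixed grid and convert to grayscale; that preprocessing is omitted here so the example stays dependency-free, and the input is assumed to be an already-downsampled 2D grid of intensities.

```python
def average_hash(pixels):
    """Toy average hash over a small grayscale grid (2D list of ints).
    Each bit records whether a pixel is above the mean intensity, so
    visually similar images produce hashes with small Hamming distance."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(a, b):
    # The edit distance used when these hashes are stored in a BK-tree.
    return bin(a ^ b).count("1")
```

Deduplication then becomes: hash every image, put the hashes in a BK-tree, and report pairs within a small Hamming radius.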
I like the reference to a metric, which I always felt could be a useful approach to many problems, including in AI.
Re spell-checking, I wonder whether there is an algorithm that uses keyboard topography (i.e. the physical vicinity of keys) to assist with the correction...
There is ... a crude spell checker could be implemented by dumping a dictionary of words into a metric tree and using Levenshtein distance (or Damerau-Levenshtein distance) as the distance function. The spell check would then consist of doing a K-nearest-neighbor lookup.
But it doesn't work as well as you'd expect. A metric tree requires a true metric as the distance function, which means the distance function must satisfy both symmetry and the triangle inequality. Levenshtein distance satisfies both, but unfortunately this usually isn't enough for a robust spell checker that just "does the right thing". For that, you usually need something like a weighted Levenshtein distance (e.g., an incorrect first character should weigh more than an incorrect Nth character, or an incorrect consonant should weigh more than an incorrect vowel). But it is much harder to guarantee that such a distance function is a true metric.
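For reference, the plain (unweighted) Levenshtein distance mentioned above, the kind of true metric a metric tree requires, is a short dynamic program:

```python
def levenshtein(a, b):
    """Classic edit distance: minimum insertions, deletions, and
    substitutions to turn `a` into `b`. Unweighted, it is a true metric
    (symmetric, and it satisfies the triangle inequality)."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]
```

The weighted variants the comment describes amount to replacing the three `+ 1` costs with position- or character-dependent weights, and that is precisely where the metric axioms become hard to preserve.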
Well, for a spell checker it's not too unreasonable to require the triangle inequality: if there's some series of possible typos to get from word A to word B, then the distance between A and B shouldn't be more than the total 'edit distance' of the typos, however you measure it.
Making things symmetrical might not make as much sense though, omitting a letter might be far more likely than adding an additional letter, for instance. Assuming it satisfies all other axioms you should be able to turn this into a true metric, by just measuring the distance both ways and taking the average.
Then again it looks like the BK-tree might not need the symmetry axiom, at least not in an obvious way.
The idea of representing a large corpus in a compressed form and learning both concepts and semantics about that corpus is fascinating to me. A tree like the one described in the article measures semantic distance, much like a character trie does. What data structure supports storing or calculating conceptual distances? I've read about word2vec and the mechanics behind the training process, but never how that looks in memory or when serialized to disk.
Also, is there a link between the semantic and conceptual distance between two terms, does anyone know?
It would surprise me if there were such a correlation, because the notions of "near" are different in the semantic and conceptual worlds. Then again, it would be fascinating if there were.
Because today we wonder a whole lot about how to represent big data in memory across many machines (if the problem is divisible that way), or on one machine if possible. What we want is to draw conclusions from that data by querying it. If we find a way to store big data cheaply, in a form where it is still queryable, then we have solved a problem most businesses grapple with today: what do I do with all this data?
Businesses are getting crushed under the weight of their data. They "solve" it by stashing it onto Hadoop. Unfortunately, no one then knows how to map/reduce over that data, so they're none the wiser, even though they kept the data that could have made their analysts smarter than their competitors'. This is, of course, totally awesome if you're a Hadoop consultant.
Well, that's at least why I dwell on data structures.
I would also like to know what this means. I'm the author of the post, by the way. When I submitted the link I found out that it had already been submitted.
I hate to say this, but on the server I manage at my day job, I've never, ever seen legitimate email traffic from an .xyz domain, or a .club, .rocks, .work, or any other "gold rush" TLD. Nothing but snowshoe spam/phishing/malware, to the point that my policy is "block on sight."
The core is a single, header-only BK-tree implementation here: https://github.com/fake-name/IntraArchiveDeduplicator/blob/m....
There are also relatively comprehensive unit-tests for the whole thing: https://github.com/fake-name/IntraArchiveDeduplicator/tree/m...