Interesting data structures: the BK-tree (signal-to-noise.xyz)
345 points by rudi-rau on April 3, 2017 | 57 comments



I have a C++ + Cython implementation of a BK-tree that I use for large-scale image deduplication on github: https://github.com/fake-name/IntraArchiveDeduplicator/tree/m...

The core is a single, header-only BK-tree implementation here: https://github.com/fake-name/IntraArchiveDeduplicator/blob/m....

There are also relatively comprehensive unit-tests for the whole thing: https://github.com/fake-name/IntraArchiveDeduplicator/tree/m...

I use it for image deduplication across a ~28M image dataset. It works quite well, though it does take a lot of RAM to hold the entire dataset in memory (~18 GB).


Out of curiosity: why not store hashes of the images instead of the entire image?

Or even better: a bloom filter to find likely duplicates?


The 18 GB is for just the hashes. The dataset is several terabytes.

A Bloom filter is useless, as it doesn't let you do searches within an edit distance. Basically, I use a BK-tree of perceptual hashes (https://en.wikipedia.org/wiki/Perceptual_hashing), and then find similar images by searching within a certain edit distance.

The end result is a system that functions very similarly to how Google's reverse image search works.
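
For the curious, here's a toy sketch of the general idea in Python (a simple dHash-style hash using Pillow, not my actual implementation, which is linked above):

  # Toy dHash-style perceptual hash; near-duplicate images produce hashes
  # that differ in only a few bits, so "similarity" becomes a small Hamming
  # distance, which is what the BK-tree searches over.
  from PIL import Image

  def dhash(path, hash_size=8):
      img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
      bits = 0
      for y in range(hash_size):
          for x in range(hash_size):
              bits = (bits << 1) | (img.getpixel((x, y)) > img.getpixel((x + 1, y)))
      return bits

  def hamming(a, b):
      return bin(a ^ b).count("1")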


I used the approach from a Google paper in a mosaic competition. I didn't end up submitting it, but I got it working.

It uses locality-sensitive hashing to hash vectors into buckets, so each bucket holds only a small subset of the total set of items and lookups only compare against that subset. It worked on small, similar images when I used it, but I didn't have many images.

In my case, the vectors were just the RGB values of the down-sampled image.
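
For reference, a toy version of the random-projection flavor of LSH (illustrative only; not necessarily the paper's exact scheme):

  from collections import defaultdict
  import numpy as np

  rng = np.random.default_rng(0)
  n_bits, dim = 16, 3 * 8 * 8              # e.g. an 8x8 down-sampled RGB image
  planes = rng.normal(size=(n_bits, dim))  # random hyperplanes

  def bucket(vec):
      # The sign of each projection gives one bit of the bucket key;
      # similar vectors tend to agree on most bits.
      return tuple(bool(b) for b in (planes @ vec > 0))

  index = defaultdict(list)                # bucket key -> image ids
  # index[bucket(vec)].append(image_id); a query only scans its own bucket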


Presumably, the idea is to count sufficiently similar images as the same image when doing the deduplication, so traditional hashing doesn't help here.


Maybe they were thinking of fingerprinting, e.g. by downsampling. It's dual in some sense to hashing: a small change to some data should produce a large change in its hash, but only a small change in its fingerprint.


Hashing is the correct term here, but it's not a cryptographic hash; rather, it's a perceptual hash: https://en.wikipedia.org/wiki/Perceptual_hashing


Could you please elucidate your scheme? In my naïveté, it seems to me that hashing etc. would not help at all.


As mentioned in one reply to the parent comment, hashing here usually is not cryptographic.

I'm familiar with locality-sensitive hashing [1], which acts like a dimensionality reducer but, instead of receiving the image itself as input, receives a feature vector extracted from it (or something analogous that allows for similarity assessment).

[1] https://en.wikipedia.org/wiki/Locality-sensitive_hashing


There are many metric-based data structures, such as the M-tree, the vantage-point tree, etc. They all suffer from the same major problem: they rely on the triangle inequality to speed up search by pruning impossible paths, but this never yields an easily measurable complexity guarantee (like, say, the guaranteed log(N) of a balanced binary search tree), so it's very common for certain datasets to cause severe performance degradation, particularly when the data is high-dimensional.

Metric-based data structures face the peculiar problem that a data point is not itself an actual "key", in the way that a data point in a hash table or binary tree is a key. Rather, each data point in a metric tree is only "indexed" by its distance from the other points. This makes it hard to obtain easily measurable performance guarantees.


This issue also gets mentioned in the sklearn documentation [0]. The "curse of dimensionality" is the culprit: in high-dimensional data, pairwise distances tend to concentrate (everything looks roughly equidistant), which makes efficient indexing on a distance metric difficult.

I feel like the poster child for the frequency illusion. There was an article on 538 a few weeks ago that caught my interest, which used cosine similarity (a non-metric distance function). I dove in and started wondering if there were any ways to find nearest neighbors efficiently using cosine similarity. A week later, Facebook's Faiss appeared: something of a general-purpose toolkit for indexing metric and non-metric spaces while minimizing storage. And now this. It's like the world is trying to tell me something!

[0] http://scikit-learn.org/stable/modules/neighbors.html#k-d-tr...


If you want to use the triangle inequality with cosine similarity, just use angles instead: https://en.wikipedia.org/wiki/Cosine_similarity#Angular_dist...

EDIT: Deleted a calculation that tried to check arccos(z) <= arccos(x) + arccos(y) via cos(arccos(z)) <= cos(arccos(x) + arccos(y)) = xy - sqrt(1-x^2)*sqrt(1-y^2), completely disregarding that cos is a decreasing function on the relevant interval.
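
Concretely, something like this (a minimal sketch; the clamp just guards against floating-point drift outside [-1, 1]):

  import math

  def angular_distance(cos_sim):
      # arccos turns a cosine similarity into an angle; dividing by pi
      # normalizes it to [0, 1]. Unlike raw cosine similarity, this does
      # satisfy the triangle inequality, so it can back a metric tree.
      return math.acos(max(-1.0, min(1.0, cos_sim))) / math.pi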


Sounds like you're only two or three LSD trips away from a major breakthrough. Good luck, man!


Bingo! We tested BK-trees and found them to break down on lots of data sets. The search speedup never overcame the additional cost of building the trees (although these were one-off sets; a search-heavy workload could definitely benefit from them).


Yes, it is quite a niche and strange query technique, but it looked cute.

Interesting tidbit: one Mr. Sergey Brin published this in 1995,

"Near Neighbor Search in Large Metric Spaces"

http://www.vldb.org/conf/1995/P574.PDF

So I'm curious, because I had been planning on using a VP-tree for something I'm mulling over. I could see stormy waters, though, for large dimensions.

Seriously interested to hear any feedback from people who have tried it. Any recommendations for what to do instead?


An ordered binary tree can still degrade to O(N) if you allow duplicates.

(AFAICT, most of these trees have a "typical" case search time of some root of N, depending on dimensionality. The easiest way to visualize this is to consider a walk from the perimeter to a cell of a Voronoi diagram, and how many cells that walk will touch.)


Nice! I released a Python BK-tree module on PyPI recently, along with a write-up about how I'm using it for duplicate image detection:

* https://pypi.python.org/pypi/pybktree/1.0

* http://tech.jetsetter.com/2017/03/21/duplicate-image-detecti...
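
Usage looks roughly like this (see the PyPI page for the full details):

  import pybktree

  tree = pybktree.BKTree(pybktree.hamming_distance, [0, 4, 5, 14])
  tree.add(15)        # insert a single item
  tree.find(13, 1)    # items within Hamming distance 1 of 13
  # -> [(1, 5), (1, 15)]   (distance, item) pairs, closest first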


Very interesting model, but it suffers from a couple of things that other data structures solve better, IMO. One thing that I don't understand (I guess I have to go back and read the article again) is how the distance from any node X to the arbitrarily selected root M tells me anything about the distance between X and the query term Q. If I live in London and the distance from me to Paris is 50,000 km, and you live in Rome, 70,000 km from Paris, that does not give me enough to calculate the distance between Rome and London.

IMO there is no data structure better suited for a search tree than the doubly-chained character trie [0], as it has the following features:

- small in memory (every word is stored at most once, and shared prefixes mean it's often effectively less than once)

- traversing and doing exact matching allow you to prune half of the tree at every step

- requires you to calculate the distance only at querying time, not when constructing the tree

- apart from exact and fuzzy matches, it also supports prefix matches (traverse down the path of the prefix, then return all descendants that are EndOfWords; see the sketch at the end of this comment)

- can be represented on disk in a way that makes reading insignificantly slower than reading from memory

- if nodes are lexically sorted when the tree is constructed, you have also pretty much solved range queries

[0] https://github.com/kreeben/resin/blob/master/src/Resin/IO/Lc...

The trie is the perfect tool for when you find yourself combatting big data. What you eventually find out about all of the cloud offerings from the big players, which claim they will solve both storing and map/reducing over your data, is that you can do the same thing on your laptop, for cheap money, as long as the data is represented in a compressed form that still allows querying. The trie is a zip file and a search engine in one.
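
A bare-bones sketch of the prefix-match part in Python (a plain dict-of-children trie, just to show the shape of the lookup; the doubly-chained layout in the linked code is far more compact):

  class TrieNode:
      def __init__(self):
          self.children = {}        # char -> TrieNode
          self.end_of_word = False

  def insert(root, word):
      node = root
      for ch in word:
          node = node.children.setdefault(ch, TrieNode())
      node.end_of_word = True

  def with_prefix(root, prefix):
      # Walk down the prefix path, then yield every EndOfWord below it.
      node = root
      for ch in prefix:
          if ch not in node.children:
              return
          node = node.children[ch]
      stack = [(node, prefix)]
      while stack:
          node, word = stack.pop()
          if node.end_of_word:
              yield word
          for ch, child in node.children.items():
              stack.append((child, word + ch))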

Edit: formatting and clarification


>If I live in London and the distance from me to Paris is 50,000 km, and you live in Rome, 70,000 km from Paris, that does not give me enough to calculate the distance between Rome and London.

Yes, but you do know that the distance from London to Rome is less than 120,000 km, because of the triangle inequality. So if that's too far, that search path is discarded.

Then we just repeat the process recursively, comparing the query against the root of each remaining subtree.
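
In BK-tree terms, that pruning rule looks roughly like this (a sketch; `distance` and the node layout are placeholders, not the article's code):

  def search(node, query, tol, results):
      d = distance(query, node.term)           # e.g. Levenshtein distance
      if d <= tol:
          results.append(node.term)
      # Each child edge is labelled with its distance to this node. By the
      # triangle inequality, any term within tol of the query must hang off
      # an edge labelled between d - tol and d + tol; everything else is
      # discarded without ever being compared.
      for edge_dist, child in node.children.items():
          if d - tol <= edge_dist <= d + tol:
              search(child, query, tol, results)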


Thanks. The idea behind the BK-tree is ingenious.

I'm struggling to find a use case for that data structure, though. Why would you construct a BK-tree that only becomes powerful once it contains millions of words, which then makes representing that amount of data in memory a nuisance (and not so fast anymore), when you could represent the same data in a compressed form with the same (or even an extended set of) querying capabilities?

Perhaps BK-trees are for big machines with powerful CPUs? I'm sure there is a setup that would make that tree in fact better than any other tree.


I don't think the best use case for BK-trees is spell-checking and words. The area in which they are used most successfully is image deduplication. In that case the keys are perceptual hashes and the metric is the distance between them (typically Hamming distance).


I think you are missing important cases such as image data. Another one is floating-point vectors with a scaled-up and rounded distance.


The article explains why the search algorithm is valid. But it doesn't explain why the tree should be built up in the way described.

Is there some rationale? And can it be extended to non-integer metrics?


https://en.wikipedia.org/wiki/BK-tree

  A BK-tree is... specifically adapted to discrete metric spaces.
There are others that follow a similar principle for real-valued metrics:

• Ball Tree (also known as VP-Tree) (implemented in sklearn)

• GH-Tree (rough description in this large PDF https://github.com/searchivarius/nmslib/raw/master/docs/manu... )

• Multi-Vantage Point Tree

As for the rationale, I think grouping points by their distance to the parent node is an efficient way to exploit the triangle inequality for each subtree.



I like the reference to a metric, which I always felt could be a useful approach to many problems, including in AI. Re spell-checking, I wonder whether there is an algorithm that uses keyboard topography (i.e. the vicinity of keys) to assist with the correction...


There is ... a crude spell checker could be implemented by dumping a dictionary of words into a metric tree and using Levenshtein distance (or Damerau-Levenshtein distance) as the distance function. The spell check would then consist of doing a k-nearest-neighbor lookup.

But it doesn't work as well as you'd expect. A metric tree requires a true metric as the distance function, which means the distance function must satisfy both symmetry and the triangle inequality. Levenshtein distance satisfies this, but unfortunately that usually isn't enough for a robust spell checker that just "does the right thing". For that, you usually need something like a weighted Levenshtein distance (e.g., an incorrect first character should weigh more than an incorrect Nth character, or an incorrect consonant should weigh more than an incorrect vowel). But it is much harder to guarantee that such a distance function is a true metric.
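
For reference, the plain (unweighted) edit distance in question, which does satisfy the metric axioms (a standard DP sketch):

  def levenshtein(a, b):
      prev = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          cur = [i]
          for j, cb in enumerate(b, 1):
              cur.append(min(prev[j] + 1,                  # delete ca
                             cur[j - 1] + 1,               # insert cb
                             prev[j - 1] + (ca != cb)))    # substitute
          prev = cur
      return prev[-1]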


Well, for a spell checker it's not too unreasonable to require the triangle inequality: if there's some series of possible typos to get from word A to word B, then the distance between A and B shouldn't be more than the total "edit distance" of the typos, however you measure it.

Making things symmetrical might not make as much sense, though; omitting a letter might be far more likely than adding an additional letter, for instance. Assuming it satisfies all the other axioms, you should be able to turn this into a true metric by just measuring the distance both ways and taking the average.
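
i.e. something like this (a sketch, with `d` standing in for the hypothetical asymmetric typo cost):

  def sym_distance(a, b, d):
      # If d satisfies the other metric axioms and the triangle inequality
      # in each direction, the averaged version is a true metric.
      return (d(a, b) + d(b, a)) / 2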

Then again it looks like the BK-tree might not need the symmetry axiom, at least not in an obvious way.


Guess I'll ask here: any non-brute-force techniques to speed up searching with such a metric? (Specifically, one that violates the triangle inequality.)


> an incorrect first character should weigh more than an incorrect Nth character

It's possible that this is true for how most people type, but not for me.


The idea of representing a large corpus in a compressed form and learning both concepts and semantics about that corpus is fascinating to me. A tree like the one described in the article measures semantic distance, much like a character trie does. What data structures support storing or calculating conceptual distances? I've read about word2vec and the mechanics behind the training process, but never how that looks in memory or when serialized to disk.

Also, is there a link between the semantic and conceptual distance between two terms, does anyone know?

It would surprise me if there were such a correlation, because the notions of "near" are different in the semantic and conceptual worlds. Then again, it would be fascinating if there were.


Minor quibble: it says std::deque is a doubly linked list, but that's what std::list is.

FWIW, std::deque is actually more like a vector.


Hmm, maybe in C++ a deque is a vector of vectors (in libstdc++ it's just a double pointer to T, IIRC). In Python it's a doubly-linked list of blocks: https://github.com/python/cpython/blob/master/Modules/_colle...


Very nicely written post. Always great to know there are data structures for solving most problems efficiently.


Another interesting but less talked about tree is the Bkd tree.

https://medium.com/@nickgerleman/the-bkd-tree-da19cf9493fb


Yes, interesting stuff. Seems similar in nature to "Vantage Point" trees.

http://stevehanov.ca/blog/index.php?id=130


All the images under the "How searching a VP-Tree works" are broken :/


oh bummer, it was working six months ago...

Wayback machine to the rescue!

https://web.archive.org/web/20160629190654/http://stevehanov...


Data structures and algorithms were in general use, a textbook detail, a few years ago. Why the current-day renaissance and emphasis?


Because today we're wondering a whole lot about how to represent big data in memory across many machines (if your problem is divisible that way), or on one machine if possible, because what we want to do is draw conclusions from that data by querying it. If we find a way to store big data cheaply in a form where it is still queryable, then we have solved a problem most businesses today grapple with: what do I do with all this data?

Businesses are getting crushed under the weight of their data. They "solve" it by stashing it onto Hadoop. Unfortunately, no one then knows how to map/reduce over that data. So they're none the wiser, even though they kept the data that could have made their analysts smarter than their competitors'. This is of course totally awesome if you're a Hadoop consultant.

Well, that's at least why I dwell on data structures.

Edit: spelin


I suspected so, but couldn't have stated it nearly so well.


I had a similar idea last week in the car; cool to see it's actually a thing.


Why did this get [flagged]? Is that solely user reports? It seems like a good link.


I would also like to know what this means. I'm the author of the post, by the way. When I submitted the link I found out that it had already been submitted.


It is a good link! There's a problem, but it has nothing to do with the site or the article or its author.


It looks like there's some issue with this user's submissions: https://news.ycombinator.com/submitted?id=rudi-rau


what is "some issue"? looks fine to me


If you've got 'showdead' on you can see lots of previous submissions marked 'dead'.

None of them look obviously inappropriate but maybe the algorithm sees something I can't. There is a high-ish submission frequency (2-3 per day).


The posting account probably triggered some kind of spam classifier. No comments and a bunch of submissions in relatively quick succession.


Perhaps they're purchasing upvotes?


At work I got: Cyber Security - Web URL Blocked

Which might be related?


I'm the author of the post. Could you help me figure out what is triggering the warning?


I don't get a lot of info, but the URL (*.xyz) is kind of odd and might be the issue?


Weird, I don't see why a whole TLD would be blocked. Even Alphabet uses it!

I chose it because .io was way too expensive for my student budget and xyz looked cool.


I hate to say this, but on the server I manage at my day job, I've never, ever seen legitimate email traffic from an .xyz domain, or a .club, .rocks, .work, or any other "gold rush" TLD. Nothing but snowshoe spam/phishing/malware, to the point that my policy is "block on sight."


My work doesn't like most URLs with xyz TLDs.


I got the same from work VPN.



