Using a Hilbert Curve for indexing is a pretty standard approach for geospatial queries. Nothing new here...
However, using a Hilbert Curve for sharding doesn't seem like the best approach. You can partition by anything you like; it doesn't have to be arbitrary points along your index. Using one dimension to shard 2D data isn't optimal.
For example, construct a heatmap of your 'load score' and shard based on that, in two dimensions. Then use an S2 curve to index inside that shard.
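Roughly what I mean, as a minimal sketch (the grid resolution, load values, and shard count here are made up; a real load score would come from your own traffic data):

    # Sketch: split a 2D "load score" heatmap into N contiguous rectangles of
    # roughly equal load; each rectangle becomes a shard, and you index within
    # it however you like (e.g. a space-filling-curve ordering).
    import numpy as np

    def split_rect(load, rect):
        # Split one rectangle (r0, r1, c0, c1) into two halves of ~equal load.
        r0, r1, c0, c1 = rect
        sub = load[r0:r1, c0:c1]
        split_rows = (r1 - r0) >= (c1 - c0)          # split along the longer axis
        cum = np.cumsum(sub.sum(axis=1 if split_rows else 0))
        cut = int(np.searchsorted(cum, cum[-1] / 2)) + 1
        cut = max(1, min(cut, len(cum) - 1))         # avoid an empty half
        if split_rows:
            return (r0, r0 + cut, c0, c1), (r0 + cut, r1, c0, c1)
        return (r0, r1, c0, c0 + cut), (r0, r1, c0 + cut, c1)

    def build_shards(load, n_shards):
        # Greedily split the heaviest rectangle until we have n_shards regions.
        rects = [(0, load.shape[0], 0, load.shape[1])]
        while len(rects) < n_shards:
            heaviest = max(rects, key=lambda r: load[r[0]:r[1], r[2]:r[3]].sum())
            rects.remove(heaviest)
            rects.extend(split_rect(load, heaviest))
        return rects

    load = np.random.rand(64, 64)                    # stand-in for a real heatmap
    print(build_shards(load, 8))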
> However, using a Hilbert Curve for sharding doesn't seem like the best approach.
Yes, that's also what I thought. Searching for "same size k-means" yields a simple postprocessing step to even out the clusters produced by the usual k-means algorithm.
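The flavour of it is roughly this (a sketch of the balanced-assignment idea, not the exact algorithm that search turns up; the capacity rule and names are mine):

    # Sketch: run ordinary k-means, then cap each cluster at ceil(n/k) points and
    # assign points, most "opinionated" ones first, to the nearest cluster that
    # still has room, so cluster sizes come out roughly equal.
    import numpy as np
    from sklearn.cluster import KMeans

    def same_size_assign(points, k):
        centers = KMeans(n_clusters=k, n_init=10).fit(points).cluster_centers_
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        capacity = int(np.ceil(len(points) / k))
        # Points with the biggest gap between best and worst option go first.
        order = np.argsort(dists.min(axis=1) - dists.max(axis=1))
        sizes = np.zeros(k, dtype=int)
        labels = np.empty(len(points), dtype=int)
        for i in order:
            for c in np.argsort(dists[i]):           # nearest cluster with room
                if sizes[c] < capacity:
                    labels[i] = c
                    sizes[c] += 1
                    break
        return labels

    pts = np.random.rand(1000, 2)
    print(np.bincount(same_size_assign(pts, 8)))     # roughly equal sizes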
In my experience the linked algorithm behaves quite poorly when the dataset and the number of desired clusters become large. I think the issue is that it only allows for pairwise switching between clusters, and this ends up with a lot of points still being assigned to clearly the wrong cluster. Some day I would like to try more complex neighbourhood searches with it and see if it helps.
> using a Hilbert Curve for sharding doesn't seem like the best approach. You can partition by anything you like; it doesn't have to be arbitrary points along your index. Using one dimension to shard 2D data isn't optimal.
If you want to shard by proximity (items close in space are likely to be in the same shard), then the transform to 1D is the way to go. Why wouldn't it be? What is your definition of "optimal"?
Whether you shard by proximity or by something else depends on the relative frequency of location-based queries versus other queries. If you shard by location, then a query for one location (usually) goes to one shard. That should scale better. Otherwise, each location-based query goes to every shard.
Thanks for the suggestion; there is definitely room for optimization.
In the original design we did consider an approach similar to the "heatmap approach" you mentioned; it would reduce shard movement for people who commute within the city.
Later we figured that simply sharding along the Hilbert Curve already gives us what we need, and some shard movement is unavoidable. We will explain more details around shard moves in a future blog post.
I present to you yak shedding: pointless bickering about an irrelevant detail that everyone has an opinion of, and the result of this decision process has such fundamental implications that the whole project is re-engineered from scratch to include a yak farm.
How big of a problem is the fact that shards constructed this way don't like crossing the "big" boundaries in the S2 curve? Especially if a high-usage location happens to straddle one of those boundaries (like, IIRC, Toronto)?
Yeah it seems like the real bottleneck is that in these super dense areas, there might be a ton of shards in a mid-size radius. This is inevitable since in the middle of Manhattan, there are just a lot of people in a small area.
One solution would be to scale their recommendation radius with density; no need to look farther than 5 miles for true love in Manhattan, but in Wyoming you may have to drive a bit.
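Back-of-the-envelope, you would pick the radius so the expected candidate pool stays roughly constant. A sketch (the target pool size and clamps are made-up numbers):

    # Sketch: pi * radius^2 * density ~= target pool, so solve for the radius
    # and clamp it to something sane.
    import math

    def recommendation_radius_km(users_per_km2, target_pool=2000,
                                 min_km=2.0, max_km=160.0):
        if users_per_km2 <= 0:
            return max_km
        r = math.sqrt(target_pool / (math.pi * users_per_km2))
        return min(max(r, min_km), max_km)

    print(recommendation_radius_km(28000))   # dense city: clamps to the 2 km floor
    print(recommendation_radius_km(2))       # sparse area: ~18 km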
The flip side though is that you don't need to show everyone to everyone, you just need to make sure people have enough other people to swipe. This opens up all sorts of other optimizations, and makes the use of an actual search engine seem a bit weird to me.
Is this an advertisement for S2? There is no description of alternatives, of how the choice was made, no comparison to any other system or provider, ...
Redis has geo search, for example; why not use it?
Geo-indexing is different from geo-sharding. Search indexes normally have geo-indexing, but it is still one big index handling every single search. The alternative (geohash) is mentioned in the article; it has high distortion around the Earth's poles.
First of all, we normally do our sharding by the hash of the primary key. Typically it is the publisherId and hash of the name of a “stream”, which is our general dynamic data structure. What this does is essentially distribute the load evenly between hashes without having to worry about the “water” stuff mentioned above.
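In other words, something like this (a simplified sketch, not our actual code; names are illustrative):

    # Sketch: shard key = publisherId plus a hash of the stream name, then a
    # stable hash of that key modulo the shard count. md5 is just for illustration.
    import hashlib

    NUM_SHARDS = 64

    def shard_for(publisher_id: str, stream_name: str) -> int:
        name_hash = hashlib.md5(stream_name.encode()).hexdigest()
        key = f"{publisher_id}/{name_hash}"
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

    print(shard_for("pub_123", "chats/room42"))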
Ok now about location based search...
1. We use GeoHash for addressing the surface of the Earth. For lookups, we calculate the actual corners of the lat-lng rectangle at the Haversine distance and then translate them to geohashes (rough sketch at the end of this comment).
2. Technically we could have just used a database index at this point and be done with it. But we want to handle both PULL requests and PUSH notifications when something happens nearby.
3. So given a radius the subscriber is interested in, we set up 4 streams for SUBSCRIBERS that together (as rectangles) cover the circle surrounding the location and radius.
4. Then we also set up streams for PUBLISHERS which correspond to all the various drop-down values we have for the radius. This is complicated by the fact that we have one set for the metric system and another for the imperial one. But we just make one stream for each of them. The more radii available, the more streams you have to post to (like 10-20), but posting doesn’t happen that often and it’s better to precompute this stuff.
5. Relations between streams are one of the many features that streams do out of the box. So basically the user can just subscribe to a certain location and radius, interest, etc. and get either realtime updates or offline notifications. The item being posted is itself a stream and what’s being inserted is relations between it and the category stream the person is interested in.
Thus you can also see right away who is interested in what. In the future however we plan to encrypt the data end to end so the servers won’t know this info. Then the location search will be harder. (qbix.com/blog)
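A rough sketch of the lookup in step 1, if it helps (not our actual code; it uses the pygeohash package, but any geohash encoder works, and the precision is arbitrary):

    # Sketch: turn (center, radius) into a lat-lng rectangle, then into geohash
    # prefixes for its corners. Precision 5 is roughly 5 km cells.
    import math
    import pygeohash

    EARTH_RADIUS_KM = 6371.0

    def bounding_box(lat, lng, radius_km):
        dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
        dlng = math.degrees(radius_km /
                            (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
        return (lat - dlat, lng - dlng, lat + dlat, lng + dlng)

    def corner_geohashes(lat, lng, radius_km, precision=5):
        south, west, north, east = bounding_box(lat, lng, radius_km)
        return {pygeohash.encode(la, lo, precision=precision)
                for la in (south, north) for lo in (west, east)}

    print(corner_geohashes(40.7128, -74.0060, 10))   # ~10 km around NYC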
Elasticsearch has geo-indexing as well (based on geohash internally), and by default it does id hashing (murmur3) similar to what you said; we actually leverage that for location-based searches.
The challenge addressed in the blog is not how to search/address (as you said, Elasticsearch handles that already); it is how to distribute the load so calculation only happens on a limited set of nodes, and how to reduce the index size so it can be more performant.
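As a rough illustration of the idea (a simplified sketch, not our production setup; it uses the s2sphere package, and the index/field names are made up): derive a coarse geoshard key from lat/lng and use it as the Elasticsearch routing key, so that indexing and nearby searches only touch a limited set of shards instead of fanning out to the whole cluster.

    # Sketch: coarse S2 cell token as a custom routing key.
    import s2sphere
    from elasticsearch import Elasticsearch

    S2_LEVEL = 7   # coarse cells, very roughly metro-area sized

    def geoshard_key(lat: float, lng: float) -> str:
        cell = s2sphere.CellId.from_lat_lng(s2sphere.LatLng.from_degrees(lat, lng))
        return cell.parent(S2_LEVEL).to_token()

    es = Elasticsearch("http://localhost:9200")

    def index_user(user_id: str, lat: float, lng: float, doc: dict):
        es.index(index="users", id=user_id, routing=geoshard_key(lat, lng),
                 document={**doc, "location": {"lat": lat, "lon": lng}})

    def search_nearby(lat: float, lng: float, distance: str = "10km"):
        # Note: a circle near a cell boundary would also need the neighbouring
        # cells' routing keys; omitted here to keep the sketch short.
        return es.search(index="users", routing=geoshard_key(lat, lng),
                         query={"geo_distance": {"distance": distance,
                                                 "location": {"lat": lat,
                                                              "lon": lng}}})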
Ah, the goal makes sense. I would suggest that it’s not so bad to have a controller node fan-out and fan-in queries, as long as the database can handle many concurrent queries. Essentially you’re distributing the work evenly across nodes but you don’t have affinity for a particular node. Yes, there is more latency (it is as slow as the slowest connection) but it is endlessly scalable. But, I am sure I missed some benefits from localizing calculations to only a node or two.
In the scheme above, by the way, it DOES localize searches on one shard. Essentially all relations to a stream are on the same shard as the stream. And each center+radius has one associated stream and therefore the search takes place on one shard.
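To be clear, the fan-out/fan-in I was suggesting is just plain scatter-gather, roughly (a sketch):

    # Sketch: query every shard concurrently and merge the results; latency is
    # bounded by the slowest shard, but the work is spread across all nodes.
    import asyncio

    async def query_shard(shard_id: int, query: dict) -> list:
        await asyncio.sleep(0.01 * shard_id)         # stand-in for a network call
        return [f"shard{shard_id}-hit"]

    async def fan_out_fan_in(query: dict, num_shards: int = 16) -> list:
        results = await asyncio.gather(
            *(query_shard(s, query) for s in range(num_shards)))
        return [hit for shard_hits in results for hit in shard_hits]

    print(asyncio.run(fan_out_fan_in({"near": "40.7,-74.0"})))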
When implementing similar algorithms on a load-balancing reverse proxy, cache-friendliness and lockless reads and updates of the "load scores" can be way more important than optimizing for CPU cycles.
Another option would be to map coords into very small 2D tiles and then look up "tile -> shard number" in a simple, large array. A variable number of tiles would land in the same shard. Fast and lockless reads and writes. I wonder if it could be faster.
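Something like this sketch (tile size and shard count picked arbitrarily):

    # Sketch: map (lat, lng) to a fixed small tile, then look the shard up in a
    # flat byte array. Reads are a single array access; a background job can
    # rewrite entries to rebalance which tiles belong to which shard.
    import array

    TILE_DEG = 0.05                       # ~5 km tiles at the equator
    COLS = int(360 / TILE_DEG)
    ROWS = int(180 / TILE_DEG)
    NUM_SHARDS = 64

    tile_to_shard = array.array("B", (i % NUM_SHARDS for i in range(ROWS * COLS)))

    def tile_index(lat: float, lng: float) -> int:
        row = int((lat + 90.0) / TILE_DEG)
        col = int((lng + 180.0) / TILE_DEG)
        return min(row, ROWS - 1) * COLS + min(col, COLS - 1)

    def shard_for(lat: float, lng: float) -> int:
        return tile_to_shard[tile_index(lat, lng)]

    print(shard_for(40.7128, -74.0060))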
Very interesting. Used to do something similar for a project I work on. Used geohash-based sharding instead of complex curves like that. Worked great, but with recent advances in ES it's no longer needed. They don't seem to mention which version of ES they are on, and I would be curious how it has changed over time in their experience.
The same strategy can be applied regardless of ES version. ES6 does have index sorting and other optimizations, but they are not as good as pre-sharding for location-based services, cases like Tinder.
They mentioned the distortion near the poles when using geohashing (which uses z-order curves), but I doubt they have many users at the North or South Pole.
A bit off-topic, but it's funny that you use all this fancy technology behind the scenes while your iOS app is buggy, slow, and always has problems with login and notifications that do not update.
I came here to write the same thing, so have an upvote. Tinder’s iOS app is such a gigantic travesty of engineering that I’m honestly surprised to hear they have an actual engineering team.
It doesn't seem off topic to me: it is reasonable to mention the end to end experience and value of particular work. The OP seems to be saying "the company might have gotten more value from investing in a different area than what is in the linked article"
I’m sorry, but Tinder and similar swipe-on-photos cargo-cult hookup apps have de facto cannibalized the online dating-scene with its entire focus on appearance (and atrophying genuine communication), promoting a major captological phenomenon of manipulating women into rating men more negatively based on nothing other than appearances. It contributes to unrealistic expectations, disposable interactions, paradox of choice and a lack of authenticity. If you’re not perfect and super photogenic with 4 professional “candid” pics, you have zero chance on these anti-relationship apps. tldr: it’s far phonier than it needs to be, encourages casual wasting of time and leads to a society where people stop talking to/slam the door on the real world. If you move to a new area, the solid, normal people aren’t going to be on hookup apps, but psycho nymphos with herpes are most def on there.
I use an image of a strong interest I have as my pic, no personal info, not even my real name. It acts as a strong filter, so far my experience has been net positive.
I find it far more rewarding to try to hack the awful world around me instead of complaining how awful the world is/is becoming.
People already judge by looks. There’s nothing noble about wasting your time approaching someone who never found you attractive; that’s not their choice.
If you have the time and preference to meet people face to face, you can still do that. People still do that and Tinder is a nice supplement.
But many of us look at dating as a numbers game. And Tinder is a hell of an upgrade to sending messages to women on other dating platforms where you don’t even know if they like your skin color. That’s a waste of time.
You don’t need to be perfect to get Tinder dates. But being ugly in this world, with or without Tinder, already stacks the cards against you. Tinder is not so different from cold-approaching women at a bar, except on Tinder you know she is more likely to give you a shot beforehand. But you have to trade away the ability to charm her in person. Why does it bother you so much that many people like that trade-off?
Your post reeks of someone who’s mad that all those attractive people seem to be fucking everyone but you. Dating is hard. I’m willing to try many paradigms/sources at once to date at my desired level. If I was meeting new women every week through my social circle (the ideal imo) then I wouldn’t use Tinder or go out just to find single women, but until then...
If you’re not happy with your dating life, then maybe consider increasing your hustle and/or finding the approaches that work for you. Maybe Tinder just isn’t for you, but it seems a bit of a stretch to turn that into technosocial criticism.
My observation is that most upvotes are because of the title, and then a minority of people will read and critique the article in the top-voted comments.
Doesn't sound like a perfect system, but actually does a wonderful job at debunking very appealing-sounding articles.
The upvote isn't for the article per se; it's for the subject matter. I'm hoping to hear war stories and critiques of Hilbert Curves vs S2 vs H3 or KD-trees for spatial indexing.
The fun of Hacker News is hearing pros talk shop about stuff too distant from what you're working on.
The bigger problem is when people comment on and even discuss the title without reading the article. It's extremely common on Reddit but luckily less prevalent on Hacker News.
...if paywalled, then at least you know some people tried to read the article, because they comment about the paywall.
Everyone else a) paid their way (as if), b) got a free pass as they don't generally read articles, c) used that new trick for bypassing the article that only people who read articles know about or d) just blurted their opinion without reading the article.
I know there have been changes on the internet lately and there is only so much that one can 'stalk' in someone's browser, but is it possible to have a comment system where you can only comment if the article is in the browser history, with that page having been visited at least three minutes before the comment is submitted?
If I go shopping online I jump straight to the reviews and decide to buy on that basis; the 'shoddy' product description sometimes doesn't even get a glance. On HN I find some articles are the same: the real insights are in the comments, so a paywall does not deter me.
I don't mind when people only read the comments, honestly. As you say, sometimes the comments are more insightful than the article. But I do think that anyone making a comment related to the article should have read the article. Otherwise it's the blind leading the blind.