Design details of Audiogalaxy.com’s high performance MySQL search engine (spiteful.com)
29 points by slackerIII on Feb 29, 2008 | hide | past | favorite | 11 comments



> "we never had to deal with sharding the index across multiple servers or a hash table of words that wouldn’t fit in memory. Speaking of which, I’d love to read some papers on how internet-scale search engines actually do that. Does anyone have any recommendations?"

I can speak from experience on a very large search engine (not on a google scale in # of docs, but within an order of magnitude - and google scale in terms of qps [estimated - google doesn't publish such numbers])

Re: "sharding the index across multiple servers" - every document has an id; mod the id against some number (preferably much larger than the number of partitions/shards you have), split your index servers into N clusters, assign mods to a cluster (do so in a way that avoids hotspots), and have a "query aggregator" that sends an incoming query to one server in every partition. The aggregator then merges the result sets and re-sorts based on a sort key passed back by each search node.
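A minimal sketch of that routing-plus-merge scheme (the bucket counts, the static bucket-to-shard table, and the function names are my own illustrative choices, not anything from the comment):

```python
import heapq

NUM_BUCKETS = 1024   # far more buckets than shards, so buckets can be
NUM_SHARDS = 8       # reassigned to rebalance without re-hashing every doc

def bucket_for(doc_id: int) -> int:
    """Mod the document id against the (large) bucket count."""
    return doc_id % NUM_BUCKETS

# Hypothetical bucket -> shard assignment; round-robin spreads buckets
# evenly so no single shard becomes a hotspot.
BUCKET_TO_SHARD = {b: b % NUM_SHARDS for b in range(NUM_BUCKETS)}

def shard_for(doc_id: int) -> int:
    return BUCKET_TO_SHARD[bucket_for(doc_id)]

def aggregate(per_shard_results):
    """The aggregator's merge step: each shard returns its hits already
    sorted by sort key, so a k-way merge yields the global order."""
    return list(heapq.merge(*per_shard_results, key=lambda r: r[0]))
```

Since each shard pre-sorts its own results, the aggregator never has to do a full sort, only a linear merge of k sorted streams.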

Re: "hash table of words that wouldn’t fit in memory" - the vocabulary I had to work with included at least 7 (human) languages with _many_ artificial words. The # of hash entries tended to hover around 2.7M tokens. Now: do not include numbers in the index (there's an infinite number of them :)), ignore case, and tokenize carefully. Tokenization is relatively easy except for CJK languages; for those, either have fluent/native speakers define the tokenizing semantics or find/buy a library.
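Those two vocabulary-bounding rules (drop numbers, ignore case) can be sketched as a simple tokenizer; the regex and function name here are my own assumptions, and this deliberately ignores the hard CJK cases the comment mentions:

```python
import re

# Runs of letters only: \w minus digits and underscore, so purely
# numeric tokens never make it into the vocabulary.
TOKEN_RE = re.compile(r"[^\W\d_]+", re.UNICODE)

def tokenize(text: str):
    """Lowercase letter-run tokenizer that keeps the hash table bounded
    by skipping numbers entirely and folding case."""
    return [t.lower() for t in TOKEN_RE.findall(text)]
```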


The sharding thing is such an interesting problem. I'm thinking about the case where a user searches for "A B". Assume A has a few million matching IDs, B also has a few million, and (of course) the indexes for A & B are stored on different servers. Do you have to pull the full result sets for A & B back to a single aggregator to intersect them? I'd love to learn more about the optimizations for that problem.


Ah. Well, keep in mind that we're splitting the documents up in a way that's arbitrary WRT the data, so every cluster/partition has a complete copy of the index for the documents that happen to be bucketed there, and no cluster owns any particular "class" of documents.

So, if you want to search for, say, "microsoft windows", the aggregator just sends something like "PHRASE('microsoft', 'windows')". Each query node finds the document vector/set for "microsoft" and the document vector for "windows" and intersects the doc ids. Then the node scans that set, grabs the position array from each document, and filters out any document where "windows" doesn't occur at a "microsoft" position + 1.
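The two steps above - intersect the sorted doc-id lists, then check positions for adjacency - might look roughly like this (function names and data shapes are my own illustrative choices):

```python
def intersect_postings(a, b):
    """Intersect two sorted doc-id lists, one per term, in linear time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def phrase_match(positions_a, positions_b):
    """True if term B occurs at some position of term A + 1,
    i.e. the two terms appear as an adjacent phrase."""
    next_after_a = {p + 1 for p in positions_a}
    return any(p in next_after_a for p in positions_b)
```

For each doc id surviving the intersection, the node would call `phrase_match` on that document's per-term position arrays and drop non-matches before returning results to the aggregator.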

All of the conjunctions, disjunctions, wildcard expansions, near operations, phrase operations, etc. are executed on the query node. All of the complex sort evaluation also happens on the query node. The aggregator only merges result sets and performs any necessary global sorting.


Ah, of course. I misunderstood your previous comment. The system is partitioned by document, not by token - that makes a lot more sense. Oh well, I guess that isn't quite as hard of a problem as I thought :) Thanks for the follow up.


Man, Audiogalaxy was awesome.


It sure was. Working there was amazing, mainly because of all the cool tech I got to build, but also because it meant that part of my job was using Audiogalaxy. Good times...


You worked there? Thanks, I remember there was one song I was searching for forever and Audiogalaxy was the only engine that found it.


Cool, glad you liked it -- it is always fun to talk to people who used the site.

I was there from 1999 to 2002. It was exceptionally good at finding rare music, at least partially because we never partitioned our network. When you searched, it searched all 1 million+ users that were currently running the Satellite.


I have to echo what they said.

Audiogalaxy was simply wonderful. Even now, I would still say it's better than current P2P offerings.

You could find so much.. and the client was so very lightweight!


It was fantastic for finding rare music. Lots of music that I can't find nowadays anywhere - paid or otherwise.

And I loved being able to control my home client from work through a web interface.


waffles.fm?

email me for an invite



