Hacker News new | past | comments | ask | show | jobs | submit login

Ah. Well, keep in mind that we're splitting the documents up in an arbitrary way WRT the data, so every cluster/partition has a complete copy of the index for the documents that happen to be bucketed there. And no cluster owns a single "document class/cluster".

So, if you want to search for, say "microsoft windows", the aggregator just sends something like "PHRASE('microsoft', 'windows')", each query node finds the document vector/set for microsoft and the document vector for windows and does an intersection of the doc ids, then the node has to do scan that set, grab the document position array from each document and filters out any documents where windows doesn't occur at microsoft-Positions + 1.

All of the conjunctions, disjunctions, wildcard expansions, near operations, and phrase operations, etc are executed on the query node. All of the complex sort evaluation also happen on the query node. The aggregator only merges result sets and and performs any necessary global sorting.




Ah, of course. I misunderstood your previous comment. The system is partitioned by document, not by token - that makes a lot more sense. Oh well, I guess that isn't quite as hard of a problem as I thought :) Thanks for the follow up.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: