I don't get it. They were using Map/Reduce as a way to build the index, which they were able to query in mere milliseconds. This article claims that in order to facilitate Google Instant, they had to ditch the Map/Reduce-oriented updating of the index.
How are these mutually exclusive? If you look at quotes like these:
"Goal is to update the search index continuously, within seconds of content changing, without rebuilding the entire index from scratch using MapReduce."
To me, this seems as if this change has nothing to do with Google Instant. This has more to do with being able to respond instantly to new content, instead of being able to query the index quickly.
It sounds like they added support for distributed stored procedures on top of BigTable, which reminds me a bit of the way MongoDB implemented Map/Reduce. But I bet they haven't dumped Map/Reduce at all.
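If that guess is right, the mechanics might look loosely like the sketch below: instead of periodically rebuilding everything with a batch job, a per-document hook updates only the affected postings when content changes. This is purely illustrative Python; none of these names or structures come from Google.

```python
# Hypothetical contrast between a batch rebuild (the old MapReduce-style
# model) and trigger-style incremental updates. Illustrative only.
from collections import defaultdict

def batch_build_index(documents):
    """Rebuild the whole inverted index from scratch."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

class IncrementalIndex:
    """Update the index per document, like a trigger firing on change."""
    def __init__(self):
        self.index = defaultdict(set)
        self.doc_terms = {}  # doc_id -> terms currently indexed for it

    def on_document_changed(self, doc_id, text):
        # Drop stale postings for this document only, then re-add.
        for term in self.doc_terms.get(doc_id, set()):
            self.index[term].discard(doc_id)
        terms = set(text.lower().split())
        for term in terms:
            self.index[term].add(doc_id)
        self.doc_terms[doc_id] = terms

idx = IncrementalIndex()
idx.on_document_changed("doc1", "caffeine update rollout")
```

The point is the cost profile: the batch version scales with the whole corpus on every run, while the incremental version scales with the size of each change.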
> To me, this seems as if this change has nothing to do with Google Instant. This has more to do with being able to respond instantly to new content, instead of being able to query the index quickly.
Right, this was part of the Caffeine update, which happened months ago.
> But I bet they haven't dumped Map/Reduce at all.
The old way of doing calculations on the web graph was giant iterations over the adjacency matrix via MapReduce. In the new system, they are probably doing local walks in the instantiated graph. These local walks are simple iterations, not MapReduce.
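To make that contrast concrete, here's a toy Python sketch of both styles. The power iteration is the kind of global pass over the whole graph you'd express as repeated MapReduce rounds; the local walk only ever touches the neighborhood of one node. This is my speculation about the shape of the computation, not anything Google has published.

```python
# Global iteration vs. local walk on a tiny web graph. Speculative sketch.
import random

def pagerank_power_iteration(graph, damping=0.85, iterations=50):
    """Global pass: touches every node and edge in every round."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / n for node in graph}
        for node, out_links in graph.items():
            if out_links:
                share = damping * rank[node] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
        rank = new_rank
    return rank

def local_random_walk(graph, start, steps=1000):
    """Local pass: only touches nodes reachable near the start node."""
    visits = {}
    node = start
    for _ in range(steps):
        visits[node] = visits.get(node, 0) + 1
        out_links = graph.get(node)
        node = random.choice(out_links) if out_links else start
    return visits

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank_power_iteration(graph))
print(local_random_walk(graph, "a"))
```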
I just blogged about Caffeine. To address your questions: they have moved to a continuous-update indexing scheme built on something like BigTable, so their index now lives in a live data store. From an end user's perspective, computing results against this new data store with more caching is fast enough that search results can be displayed while I type. I can't wait for technical papers on Caffeine to be released.
But how is this so different from the index servers? No Map/Reduce was required to look up a query in the index servers; it was only used to /build/ the index servers. I can see that this new method allows for more rapid modifications of the index servers, but I don't see how it is related to Google Instant.
"Google distinguished engineer Ben Gomes told The Reg that Caffeine was not built with Instant in mind, but that it helped deal with the added load placed on the system by the "streaming" search service.
"Lipkovitz makes a point of saying that MapReduce is by no means dead. There are still cases where Caffeine uses batch processing, and MapReduce is still the basis for myriad other Google services. "
I can't wait for the technical papers on Caffeine either, but to me it does not seem related to Google Instant in many ways; it's just part of Google's desire to respond more rapidly to changing content on the internet. Which is totally OK with me. But then people shouldn't claim they dumped Map/Reduce in order to implement Google Instant.
Good points, thanks. Without more tech detail, it does look like the BigTable-type datastore is being updated while being used for search. In Lucene/Nutch/etc. apps, I think it's common to build one index offline, then swap it in for the live index (you don't have to, but it often makes sense). Google is likely getting rid of the index-swapping part of the process by having a large persistent datastore that is continually indexed.
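For what it's worth, the swap itself usually comes down to an atomic rename, something like this generic Python sketch (not tied to Lucene's actual API or to anything Google does; paths and formats here are made up):

```python
# Build a full index offline, then atomically swap it in for readers.
import json
import os
import tempfile

INDEX_PATH = "live_index.json"  # hypothetical on-disk index location

def build_and_swap(documents):
    index = {}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index.setdefault(term, []).append(doc_id)
    # Write to a temp file first, so readers never see a half-built index.
    fd, tmp_path = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as f:
        json.dump(index, f)
    os.replace(tmp_path, INDEX_PATH)  # atomic on POSIX filesystems

build_and_swap({"doc1": "hello world"})
```

A continuously indexed persistent store skips this whole build-then-swap cycle, which is presumably the win.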
Have you seen Lucandra? It's Cassandra-backed Lucene that gives you soft real-time updates. I'm sure Google's infrastructure is far more advanced, but the idea of real-time indexing for big data has been around for a while, and is available in OSS projects today.
To me, it looks like the blogger makes a few wrong assumptions: he confuses Google Instant with Google Realtime; he assumes that "something like database triggers" is actually very much like database triggers and could, e.g., be used to check the integrity of the data; and he goes wildly off on a tangent with "Internet DOM".
This article seems to confuse real-time search (producing results from web pages put online moments ago, such as news articles and Twitter feeds) and Google Instant (search-as-you-type results using Google Suggest).
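The search-as-you-type part is, at bottom, a prefix lookup against cached queries and results on every keystroke. Here's a toy trie-based suggest in Python, just to show the prefix-lookup idea (the real system obviously adds ranking, caching, and full result serving):

```python
# Minimal search-as-you-type suggest. Illustrative only.
class SuggestTrie:
    def __init__(self):
        self.root = {}

    def add(self, query):
        node = self.root
        for ch in query:
            node = node.setdefault(ch, {})
        node["$"] = query  # marks a complete query

    def suggest(self, prefix, limit=5):
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        # Collect complete queries below this prefix node.
        results, stack = [], [node]
        while stack and len(results) < limit:
            cur = stack.pop()
            for key, child in cur.items():
                if key == "$":
                    results.append(child)
                else:
                    stack.append(child)
        return results[:limit]

trie = SuggestTrie()
for q in ["mapreduce", "map projection", "maps"]:
    trie.add(q)
print(trie.suggest("map"))  # would fire on each keystroke in a real UI
```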
09/11/2020: Google has found that its BigTable database is very limited in terms of the types of queries it can perform and that the kind of data modeling it enforces leads to a number of nasty inconsistencies. Under lead engineer Matt Codd, Google researchers have been working on a secret project codenamed 'Join' in order to remedy this situation. At this year's I/O conference the search behemoth will present the successor to its dated BigTable system. The new database system is rumored to be named DB2.
As an engineer I'm blown away that they can pull this off at all. As a user one thing that's pissing me off about Instant is that it doesn't honor the 'Number of Results' search setting, it always returns 10 results.
I doubt they abandoned Map/Reduce, just changed where/how it's being used in their architecture. It's way too powerful of a technique, especially at their scale, to abandon it entirely.
Also, I didn't find this article to have a lot of substance. If anyone knows of any more detailed description of the under-the-hood changes, please point it out for us.