On Quora someone asked what the longest search query time was. I was able to craft a query that took multiple seconds to complete. It used wildcards and undocumented iteration, allowing one to stuff thousands of queries into a single query. It turns out it is someone's job to measure result response times, and he came into the thread to kindly ask us to stop messing up their statistics.
This is his answer: "I work on search at Google, and I have to say, very clever answers! Now, please stop. :-p"
I don't think he did it because it's his job to stop random people on the Internet from running slow queries. I think he was just surprised at how creative people are, and found it funny.
It's hard to believe, but years ago, back when Google had what were called "stop words" (common words like 'the' that it ordinarily ignored), I was able to make Google perform a search that took over 30 seconds.
The reason stop words take such a long time is that millions of sites contain words like "the", so doing a join on all those lists simply takes a long time.
My method for finding a long string consisting entirely of stop words was to download the Project Gutenberg edition of the complete works of Shakespeare, find the longest string consisting of just stop words in there, and then search for it as a literal quote.
The longest one I found was: "From what it is to a".
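Here's a minimal sketch of that hunt in Python, assuming the corpus is saved locally as shakespeare.txt; the stop word list is purely illustrative, since Google's real list was never published:

    # A minimal sketch: find the longest consecutive run of stop words.
    # STOP_WORDS is an assumption; Google's actual list is unknown.
    STOP_WORDS = {"the", "a", "an", "to", "of", "in", "it", "is",
                  "from", "what", "and", "as", "at", "by", "for"}

    def longest_stopword_run(text):
        best, current = [], []
        for raw in text.lower().split():
            word = raw.strip(".,;:!?'\"()-")
            if word in STOP_WORDS:
                current.append(word)
                if len(current) > len(best):
                    best = current[:]   # remember the longest run so far
            else:
                current = []            # run broken by a non-stop word
        return best

    with open("shakespeare.txt", encoding="utf-8") as f:
        print(" ".join(longest_stopword_run(f.read())))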
Also, long long ago... I was watching a video of a guy channeling aliens. Someone in the audience asked what would be the next big thing for humanity. I immediately typed the query into Google. It showed the page header... 20-30 seconds later it came up with a single published paper about "nuclear magnetic resonance identification".
Today it resolves instantly and finds at least 10 publications from before 2000.
There's nothing to index. How could it have found my Shakespeare quote via an index? It consisted entirely of the words 'from what it is to a', yet produced only the Shakespeare quote. I don't see how it could have indexed anything... it must have done a join. (Which makes sense given the 30+ seconds I had to sit and wait before it returned its answer, while also reporting the time it took to produce it. What else could it have been doing?)
By the way I believe I wanted to know whether it would return the Shakespeare quote at all.
If you mean that it might have cached the results of the query, I doubt anyone else queried that exact phrase, other than me.
Google does index pages, in the database sense.
An index in the database sense is nothing more than reorganizing data (or subsets of data) into structures optimized for searching and seeking, rather than full scans.
I'm guessing you're most familiar with B-tree indexes, as present and default in many SQL solutions, which are good for quickly answering exact and greater/less-than matches. There are dozens of data structures useful for indexing, some of which are built to index full-text documents. For an example, check out the GIN and GiST indexes in Postgres [1].
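To make that concrete, here's a toy inverted index in Python - the same basic structure a full-text index like GIN builds, minus all the real engineering. The documents are made up:

    from collections import defaultdict

    # Toy documents; the ids and text are made up for illustration.
    docs = {
        1: "the narwhal surfaced near the ship",
        2: "an obsequious clerk filed the forms",
    }

    # The inverted index: for each word, the set of document ids that
    # contain it. Full-text indexes are built on this same idea.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)

    print(index["the"])      # {1, 2}
    print(index["narwhal"])  # {1}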
It's my understanding that database indexing and index compression were a primary differentiator Google excelled at from the beginning. They could beat others at a fraction of the typical cost because they didn't need entire data centers' worth of hardware to store and query huge quantities of documents.
Seriously, there's no way even Google could intersect the sets of all crawled web documents containing those individual words in 30 seconds, much less two seconds.
>Seriously, there's no way even Google could intersect the sets of all crawled web documents containing those individual words in 30 seconds, much less two seconds.
I believe you're mistaken. What I've heard is that for every word, Google has a list of every web site that contains that word - they've flipped the database. So, I believe, if you search for (without quotes) neanderthal violet narwhal obsequious tandem - and I just did this query, which took 0.56 seconds - Google decides to drop some of the words so it can get me results. When I used plus signs, making my query +neanderthal +violet +narwhal +obsequious +tandem, it said it took 0.7 seconds to determine that in all of the entirety of the Internet, there is not a single document that has those 5 words on it.
How do you think it determines in 700 ms that not a single site it has indexed, on all of the Internet, contains those 5 words anywhere?
The answer is that it has a rather short list of sites that contain the word narwhal, which it then intersects with the somewhat larger list of sites that contain obsequious, and so on. 700 ms is plenty fast when you take that approach.
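A minimal sketch of that rarest-first intersection in Python, with made-up posting lists:

    # Made-up posting lists; in reality each would be a compressed list
    # of document ids, but the intersection strategy is the same.
    postings = {
        "narwhal":    {3, 17, 42},                # rare word: tiny list
        "obsequious": {3, 17, 99, 204},
        "violet":     {1, 3, 17, 42, 99, 204},
        "the":        set(range(1, 10_000)),      # stop word: huge list
    }

    def intersect(words):
        # Sort by posting-list size so we start with the rarest word;
        # every intermediate result is then at most that small.
        lists = sorted((postings[w] for w in words), key=len)
        result = lists[0]
        for other in lists[1:]:
            result = result & other
            if not result:
                break   # early exit: no document can possibly match
        return result

    print(sorted(intersect(["narwhal", "obsequious", "violet"])))  # [3, 17]
    print(len(intersect(["the", "narwhal"])))                      # 3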
So this explains why joining stop words (each of which appears on billions of pages) takes so very long.
Using stop words, it is easy to make queries that take one or two seconds each.
I'm confused. By "clever indexing" I thought they meant it in the database sense of the word.
The reason my search took 30 seconds is that it started by getting a list of every site with "from" on it, every site with "what" on it, and so on, intersecting them all. That's how it ended up finding my quote. How else do you think it did it?
-----
edit:
To find the string "from what it is to a", which occurs only hidden in the middle of Shakespeare's texts - what do you think they do?
In my opinion they combine the lists of sites that have every word, starting with the least common ones. It's easier if you search for something that has a few uncommon words: then you start with a small list and only have to combine it with other small lists.
When every word in the phrase appears on billions of pages (there are billions of pages that have the word "to" on them, and the same goes for "from", "what", "it", "is", and "a"), you have to combine them all. Then you have to do a string search within the resulting set, since I put the phrase in quotation marks (a sketch of that step follows below). There is no easy strategy. Hence the long search time.
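One way that final string-search step can avoid rescanning pages is a positional index, where each posting records the word's positions inside the document. Whether Google did exactly this is an assumption; the data below is made up:

    # Positional postings: word -> {doc_id: [positions]}. Made-up data.
    positions = {
        "from": {7: [0, 55]},
        "what": {7: [1, 90]},
        "it":   {7: [2]},
        "is":   {7: [3]},
        "to":   {7: [4, 20]},
        "a":    {7: [5, 21]},
    }

    def phrase_match(words, doc_id):
        # The phrase occurs iff some occurrence of the first word at
        # position p is followed by p+1, p+2, ... for the later words.
        for p in positions[words[0]].get(doc_id, []):
            if all(p + i in positions[w].get(doc_id, [])
                   for i, w in enumerate(words)):
                return True
        return False

    print(phrase_match(["from", "what", "it", "is", "to", "a"], 7))  # True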
I'm curious how else you think large-scale data is stored other than in an index in the database sense of the word as well. You think Google has some kind of massive heap-like, unstructured data-store that they run search queries against? That doesn't make sense to me, but I've also never worked in global scale web search, soooo idk.
You said "There's nothing to index," as if Google is making web requests to every domain in existence, parsing the document responses, and seeing which sites have these words on them, all at runtime when you type a search query. Google obviously indexes the web in the sense that they store their own cached versions of web pages "locally," on top of which they then build an insanely complicated, web-facing, search architecture.
I was not referring to database indexes. That is not pertinent here. I was thinking about the index that Google creates, its locally cached version, that it queries. If you have a locally cached version, you are not going to rifle through them one by one until you find matches, nor are you going to rifle through them and find partial matches and then intersect them all to see if any overlap in your final product. Among other weird assumptions, that final method assumes there is a solution for every query.
Google, no doubt, has a very sophisticated way of querying against their cache of the WWW, and it has probably evolved over time. However, it is inappropriate to say Google does a join over the entire Internet for one query. It is much more reasonable to say that Google checked your query string against their gigantic index of terms, and it took a while to dig that deep into the pile. The performance hit such a complex query takes is more like unzipping a large archive to get at a specific megabyte's worth of info, rather than smashing all the files together and then searching for the exact term like Notepad would.
Anyway, think about it for a while, it's clearly a cool issue in search, and programs and algorithms do not have to visually search things as humans must.
I recommend reading the Stanford paper[0] (page 12), which spells out in a lot of detail exactly what they were doing.
In short, your pathological query would have searched for every document which contained one of your words, discarded those which didn't match all of them, and then sorted by word proximity. I expect that for a literal phrase search, there would be a final pass to look for the exact phrase in order.
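A rough sketch of that proximity step on hypothetical data; the paper's actual scoring is far more elaborate:

    # Hypothetical candidates: docs already known to contain every query
    # word, mapped to the positions where those words occur.
    candidates = {
        7:  [0, 1, 2, 3, 4, 5],          # words tightly clustered
        12: [3, 40, 95, 200, 310, 400],  # words scattered across the page
    }

    def proximity_score(pos):
        # Smaller span between first and last occurrence = better.
        return max(pos) - min(pos)

    ranked = sorted(candidates, key=lambda d: proximity_score(candidates[d]))
    print(ranked)  # [7, 12]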
Yes, it must. Let's be exhaustive, because we can, because the problem is solvable, and because someone must do it.
Have you accidentally searched Google for a long URL and seen it come up? It is actively caching stuff all the time, and that cache just grows and grows, and it must be a pretty beautiful megastructure that you can run queries against.
Google never had stop words. The original lexicon only included the most popular 14 million words (for fast bucketing), and rare words were processed specially.
It did - I had to use plus signs to force it to include them.
Normally it ignored those words. I am fairly certain of this detail. I must have found a list of those words - how else would I have found the string "from what it is to a"? I had a list of its stop words.
As you can see, it is Google saying it is ignoring a word because it is too common. It has a list of every site that contains the word, but that list is huge and it doesn't usually use it.
The query was already published on Quora. I assume that Google did patch the issue. They were simply requesting that it not be posted in a public forum, since that creates the potential for denial-of-service attacks. It wouldn't have been a statistician making a fuss about their pretty graphs being ruined. The problem was the very real performance impact the queries were having on the service, as thousands of visitors copy/pasted from Quora to see for themselves.
This is why companies like Google have bounties for such things. "Please submit bugs and performance issues privately so we can patch them before you disclose the details publicly and hurt our services - we'll even pay you for your discretion!"
This particular issue was posted on Quora, where anyone could pick it up and participate in what is essentially a denial of service attack (whether or not performed intentionally). It wasn't submitted as a private bug report to Google so they could fix the issue. It was spread in a public forum. I think it's fair for Google to politely ask "a few of your own tests to validate an issue you will submit as a bug report is fine, but please don't disclose to the public until we patch it."
When you operate at the scale of Google, everything is expected to be airtight; outliers should not be possible. It wouldn't surprise me if their monitoring systems are built without the ability to "massage" (i.e., manipulate) statistics, as that is a terrible practice. I don't think a statistician who relies on ignoring outliers would last long working for Google. They're not doing their job if the only thing they care about is silencing warnings to make pretty graphs that falsely show everything is running smoothly. Their job is to work with the truth - not manufacture little white lies to appease management.
When that 0.1% - or even 0.001% - are 5-60 second requests, you have a bomb waiting to go off. There really is a massive difference when you are operating at the scale of Google. If the median is 100ms, the maximum acceptable time - 100th percentile - is likely below 200ms. A three nines percentile that is 10x the median isn't a good thing at large scale. Perfect consistency is more important than statistics. A small scale service deployed on my-little-unused-tool.com that receives a few requests/minute is an entirely different ballgame.