Hacker News new | past | comments | ask | show | jobs | submit login

Is it really technically correct to say that Google was performing web-wide joins on data? Isn't it all about clever indexing?



There's nothing to index. How could it have found my Shakespeare quote via an index? It consisted entirely of words 'from what it is to a' but produced only the Shakespeare quote. I don't see how it could have indexed anything.... it must have done a join. (Which makes sense given the 30+ seconds I had to sit and wait before it returned its answer, while also reporting the time it took to produce it. What else could it have been doing?)

By the way I believe I wanted to know whether it would return the Shakespeare quote at all.

If you mean that it might have cached the results of the query, I doubt anyone else queried that exact phrase, other than me.


Google does index pages, in the database sense. An index in the database sense is nothing more than reorganizing data (or subsets of data) into structures optimized for searching and seeking, rather than full scans.

I'm guessing you're most familiar with btree indexes as present and default in many SQL solutions, which are good for quickly answering exact, greater/less matches. There are dozens of data structures useful for indexing, some of which are built to index full text documents. For an example, check out the gin and gist indexes in Postgres [1].

It's my understanding that database indexing and index compression was a primary differentiator Google excelled at from the beginning. They could beat others at fractions of the typical cost because they didn't need data centers to store and query huge quantities of documents.

Seriously, there's no way even Google could intersect the sets of all crawled web documents containing those individual words in 30 seconds, much less two seconds.

[1] https://www.postgresql.org/docs/current/static/textsearch-in...


>Seriously, there's no way even Google could intersect the sets of all crawled web documents containing those individual words in 30 seconds, much less two seconds.

I believe you're mistaken. What I've heard is that for every word, Google has a list of every web site that contains that word - they've flipped the database. So, I believe, if you search for (without quotes) neanderthal violet narwhal obsequious tandem then -- and I just did this query, which took 0.56 seconds, but decided to remove some of the words, so it can get it me results. When I did plus signs, making my query +neanderthal +violet +narwhal +obsequious +tandem it said it worked 0.7 seconds to determine that in all of the entirety of the Internet, there is not a single document that has those 5 words on it.

How do you think it determines in 700 ms that all of the sites it has indexed on all of the Internet does not contain those 5 words anywhere on it?

The answer is that it has a rather short list of sites that contain the word narwhal, which it then intersects with the somewhat larger list of sites that contain obsequious and so on. 700 seconds is plenty fast when you take that approach.

so, this explains why joining stop words (which consist of billions of pages, each) takes so very long.

using stop words it is easy to make queries that take one or two seconds each.


>There's nothing to index

Huh? What do you mean? Google indexes HTML web page content from the entire public internet using web crawlers...


I'm confused. By "clever indexing" I thought they meant, in the database sense of the word.

The reason my search took 30 seconds is because it started by getting a list of every site with "from" on it, every site with "what" on it, and so on, intereseecting them all. That's how it ended up finding my quote. how else do you think it did it?

----- edit:

to find the string "from what it is to a" which occurs only hidden in the middle of shaespeare's texts -- what do you think they do?

In my opinion they combine the list of sites that have every word - starting with the least common ones. It's easier if you search for something that has a few uncommon words. Then you start with a small list, and have to combine it with other small lists.

When every word in the phrase has billions of sites (there are billions of pages that have the word "to" on them, same for "from", "what", "it", "is", "a"), you have to combine them all. Then you have to do a string search within the resulting set, since I put it in quotation marks. There is no easy strategy. Hence the long search time.

what else could they be doing?


I'm curious how else you think large-scale data is stored other than in an index in the database sense of the word as well. You think Google has some kind of massive heap-like, unstructured data-store that they run search queries against? That doesn't make sense to me, but I've also never worked in global scale web search, soooo idk.


You said "There's nothing to index," as if Google is making web requests to every domain in existence, parsing the document responses, and seeing which sites have these words on them, all at runtime when you type a search query. Google obviously indexes the web in the sense that they store their own cached versions of web pages "locally," on top of which they then build an insanely complicated, web-facing, search architecture.


we're talking past each other. sova referred to this meaning - https://en.wikipedia.org/wiki/Database_index when they said "clever indexing."

the sense you mean is a different sense of the word index - meaning, to crawl. Yes, of course it does that too.


I was not referring to database indexes. That is not pertinent here. I was thinking about the index that Google creates, its locally cached version, that it queries. If you have a locally cached version, you are not going to rifle through them one by one until you find matches, nor are you going to rifle through them and find partial matches and then intersect them all to see if any overlap in your final product. Among other weird assumptions, that final method assumes there is a solution for every query.

Google, no doubt, has a very sophisticated way of querying against their cache of the WWW and it has probably evolved over time. However, it is inappropriate to say Google does a join over the entire internet for one query. It is much more reasonable to say that Google checked your query string against their gigantic index of terms, and it took a while to dig that deep into the pile. The performance hit such a complex query takes is more like unzipping a large archive to get a specific megabyte's worth of info, rather than saying it smashed all the files together and then searched for the exact term like notepad.

Anyway, think about it for a while, it's clearly a cool issue in search, and programs and algorithms do not have to visually search things as humans must.


> what else could they be doing?

I recommend reading the Stanford paper[0] (page 12), which spells out in a lot of detail exactly what they were doing.

In short, your pathological query would have searched for every document which contained one of your words, discarded those which didn't match all, and then sorted by word proximity. I expect for a literal phase search, there would be a final pass to look for the exact phrase in order.

[0]: http://ilpubs.stanford.edu:8090/361/1/1998-8.pdf


I think it indexes the entire string, no? or would that be too many combinatorics, idk.


Yes, it must, let's be exhaustive because we can and because the problem is solvable and because someone must do it.

Have you accidentally searched Google for a long URL and seen it come up? It is actively caching stuff all the time, and that cache just grows and grows, and it must be a pretty beautiful megastructure that you can run queries against.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: