Greplin (YC W10) Search Engine Tackles Facebook, Twitter

vladgur · on Feb 16, 2011

This whole indexing business worries me. They actually store my data internally to provide these "instant search" capabilities. I'm pretty sure this goes against the API terms of use of many data providers. LinkedIn, for instance, says the following at http://developer.linkedin.com/docs/DOC-1013:

"3.4 Data Storage and Conversion. You may not store or cache any Content returned or received through the APIs, including data about users, longer than the current usage session of the user for which it was obtained, except for the alphanumeric user IDs we provide you for identifying users, unless and to the extent that such storage or caching is expressly allowed in the Platform Guidelines. You may store the alphanumeric user IDs we provide you indefinitely unless we terminate your use of the APIs for breach of these Terms. The restrictions of this Section do not apply to “Independent Data,” which means data that users provide directly to you, provided that you cannot convert data received from the APIs to Independent Data (e.g., by obtaining it from the APIs and asking the user for permission); Independent Data must have been separately entered, uploaded, or presented to you by the user of your Application."

Basically, LinkedIn disallows storing the data directly, or converting/hashing/indexing it and then storing the result. It only allows storage of user IDs.

Yet Greplin is getting away with it. I suppose they pay for special data licensing.

mayank · on Feb 16, 2011

I agree, this is quite worrying, and I'm not aware of "data licensing" for services like Facebook.

From the FAQ:

--Greplin updates your data approximately every 20 minutes, however, in high-load situations, updates can take up to a day.

So it definitely seems like all your data ends up in a giant DB somewhere. Unless they've worked out amazing data licensing contracts with Facebook, Google, LinkedIn and Twitter, I'd be very surprised if this was legal.

EDIT: Facebook appears to have changed their data retention policies from what they were. The 24-hour caching restriction seems to have been removed: http://developers.facebook.com/policy/#policies

ntoshev · on Feb 17, 2011

I don't know what Greplin is doing, but indexing alone doesn't imply they cache your data somewhere. Although the index contains pieces of your text, you can't reconstruct it from the inverted index alone.
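To make the distinction concrete, here is a minimal sketch of a non-positional inverted index. It is not Greplin's implementation, which nobody in this thread has seen. The function names and sample documents are made up for illustration. The index maps each term to the set of documents containing it, so it can answer queries, but it keeps no word order or counts. A positional index, which also stores word offsets, would keep considerably more of the original text.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the first term's postings and intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample data, standing in for a user's messages.
docs = {
    1: "lunch with alice on friday",
    2: "friday lunch moved to monday",
}
index = build_inverted_index(docs)
```

Here `search(index, "friday lunch")` matches both documents even though the words appear in a different order in each, and two texts made of the same words in a different order produce an identical index. That is the sense in which the original text is not recoverable from this kind of index alone.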

mychacho · on Feb 17, 2011

From a usability standpoint, they have this Google-Instant-esque thing going where they produce instant search results as soon as you start typing. So unless they are hitting the APIs of every provider you authorized and searching through them in near real time, they are storing the data.

zackola · on Feb 17, 2011

This is what Google should have been building instead of Buzz or Wave.

iterationx · on Feb 16, 2011

>>For Google, developing such a service could be a challenge, in part because it likely wouldn’t get the same access to users’ Facebook accounts that a non-competitor startup has

If Facebook would block Google from this, then they should stop Greplin, unless we are so naive as to think Google won't buy Greplin.

"The best thing that would happen is for Facebook to open up its data," Mr. Schmidt said. "Failing that, there are other ways to get that information."

justin · on Feb 16, 2011

Facebook could block Greplin upon an acquisition by Google.

Allowing Greplin to get Facebook data by using the Facebook API (which it does) is much different than allowing Google to get EVERYONE'S data by spidering. One involves consent by the user, and provides search only for that user. The other involves no consent and provides access to user data over a publicly available interface.