This is great, but I'm confused by the part about avoiding porn: > *Common C...

randomstring · on Dec 17, 2012

I work at Blekko and am the primary engineer working on our porn tagger. We include LGBT, reproductive/sexual health, breast cancer, bands like "Pussycat Riot," etc. in our training set to make sure these sites do not get hidden from our search results.

We do not have anything against porn. However, when people are not searching for porn, showing them porn results makes for a bad search experience. So identifying porn, and only showing it for queries where it is relevant, is vitally important to search quality.

graue · on Dec 17, 2012

Awesome, sounds like you're on it. And yeah, there are sites where it's far too easy to find porn when you don't mean to, e.g. Tumblr.

jacquesm · on Dec 17, 2012

So tag it but include it.

Your answer is a bit at odds with http://news.ycombinator.com/item?id=4933437

ChuckMcM · on Dec 17, 2012

Jacques, the porn is there, it's just identified as such. Whether or not it is included in results is a function of the query.

One of the funny things about language is that there is always a 'pun' or an innuendo which can trigger a hit on a porn site. However, if most of what you're looking for isn't porn, then the web site has to assume you are not looking for porn and keep NSFW links from surfacing in your search results. You could always explicitly ask for it with /porn, but then that is a clear signal of what you are looking for.

Part of the crawl data includes an indication as to whether or not the ranker thought the document was 'porn' or 'not porn', so if you're selecting things to return you can ignore that bit and mix porn with non-porn. Then when someone searches for 'beavers' you get a wider variety of results than you would if you assumed they meant the furry critters which chew on trees, or the sports teams, and limited results to those documents.
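To make the "tagged, then filtered per query" idea concrete, here is a minimal sketch. It is not Blekko's code: the `Doc` record, the `is_porn` field, and the rule that only an explicit `/porn` slashtag counts as asking for porn are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    score: float   # hypothetical relevance score from the ranker
    is_porn: bool  # hypothetical per-document tag bit stored with the crawl data


def wants_porn(query: str) -> bool:
    # Assumption: the only signal we look at is an explicit /porn slashtag.
    return "/porn" in query.split()


def select_results(query: str, candidates: list[Doc], limit: int = 10) -> list[Doc]:
    # The tag is always present; whether we act on it depends on the query.
    allow = wants_porn(query)
    kept = [d for d in candidates if allow or not d.is_porn]
    return sorted(kept, key=lambda d: d.score, reverse=True)[:limit]
```

With this shape, a consumer of the data who wants everything can simply skip the `is_porn` check, which is the "ignore that bit" case described above.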

jacquesm · on Dec 17, 2012

That's actually really useful.

Having it there but tagged is halfway towards being able to use it to filter them out. Not having it means that when you merge it with another set you're not going to be able to remove the porn.

And it also allows you to use it as a training set for classifiers.

ChuckMcM · on Dec 17, 2012

"And it also allows you to use it as a training set for classifiers."

One could imagine a project on Common Crawl which auto-generated a list of slang terms for porny things by creating a list of n-grams from the words used in documents tagged as porn.
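A rough sketch of what such a project might look like, assuming you already have `(text, is_porn)` pairs extracted from the crawl. The function names, the thresholds, and the add-one smoothing are illustrative choices, not anything Common Crawl or Blekko provides, and the raw-count ratio does not correct for the two groups having different sizes.

```python
import re
from collections import Counter


def ngrams(text: str, n: int) -> list[str]:
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def candidate_terms(docs, n=2, min_count=2, min_ratio=3.0):
    """docs: iterable of (text, is_porn) pairs. Returns n-grams that are
    much more common in porn-tagged documents than in the rest."""
    porn, other = Counter(), Counter()
    for text, is_porn in docs:
        (porn if is_porn else other).update(ngrams(text, n))

    scored = []
    for gram, count in porn.items():
        if count < min_count:
            continue
        ratio = count / (other[gram] + 1)  # add-one smoothing avoids dividing by zero
        if ratio >= min_ratio:
            scored.append((gram, ratio))
    # Highest ratio first, ties broken alphabetically.
    return [g for g, _ in sorted(scored, key=lambda p: (-p[1], p[0]))]
```

The output is only a list of candidates; common phrases that happen to show up on porn sites will still need a human or a second filter to weed out.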

LisaG · on Dec 17, 2012

Graue, I am from Common Crawl. We don't filter for porn. A corpus of web data needs to include porn or it wouldn't be a representative sample of the web ;) We do want to enrich our sample of the web with high-value sites and that is where the blekko data will be so incredibly valuable.

I really appreciate your mention of LGBT and sexual health sites being collateral damage - we need to draw more attention to that problem. I would love to see someone work with Common Crawl to improve methods of distinguishing. Lisa

graue · on Dec 17, 2012

Glad to have brought up the issue. Sounds like you should talk to 'randomstring, above.

jmillikin · on Dec 17, 2012

  > The next sentence goes on to suggest that porn is not
  > "useful to humans", which is obviously false.

It's worth noting that porn is, in fact, of use only to humans.

graue · on Dec 17, 2012

So are webspam and SEO abuse, but just to a small number of humans and not the ones performing the search. :)