Language detection with Google's Compact Language Detector (mikemccandless.com)
75 points by davidw on Oct 22, 2011 | hide | past | favorite | 12 comments



Nice. I've been using a call to the Google online translator to achieve the same result - http://ajax.googleapis.com/ajax/services/language/translate?...
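
For reference, a minimal Python sketch of that kind of call. The endpoint here is the detect sibling of the translate URL above, and both the v=1.0 parameter and the responseData layout are assumptions about the old AJAX Language API (which has since been shut down):

    import json
    import urllib.parse
    import urllib.request

    # Sketch only: the endpoint and response shape are assumptions about
    # Google's old AJAX Language API (v=1.0), now long retired.
    def detect_language(text):
        params = urllib.parse.urlencode({"v": "1.0", "q": text})
        url = "http://ajax.googleapis.com/ajax/services/language/detect?" + params
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        result = data.get("responseData") or {}
        return result.get("language"), result.get("isReliable")

    print(detect_language("Ich bin ein Berliner"))  # e.g. ('de', True)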


Are there published API limits on this service?


I'm not aware of any. I'm using this in a high-volume, non-critical capacity and it seems to be okay. I should keep better stats, though.


Nice! I wrote a .NET wrapper myself, though I never got around to a Python extension. One question: did you experience any memory leak issues with the CLD? My .NET wrapper DLL seems to leak, and I never checked whether it was the C++/CLI layer I added on top or the actual CLD native C++ code. I doubt the latter, since (according to my basic understanding) nothing is created in the original code that needs to be cleaned up manually. Before I start debugging mixed-mode .NET applications, I just wanted to be sure.


Great library! Here are bindings for PHP: https://github.com/lstrojny/php-ccld
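
For anyone trying the Python binding from the article, a rough usage sketch; the cld module name and the detect() return tuple follow my reading of the blog post, so treat the exact signature as an assumption:

    import cld  # the Python binding from the article

    # detect() wants clean UTF-8 bytes; the return tuple
    # (name, code, isReliable, textBytesFound, details) is an
    # assumption based on the post -- double-check against the source.
    text = u"Il est l'heure de partir".encode("utf-8")
    name, code, reliable, bytes_found, details = cld.detect(text)
    print(name, code, reliable)  # e.g. FRENCH fr True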


"You must provide it clean (interchange-valid) UTF-8, so any encoding issues must be sorted out before-hand."

In most cases you have to know the language in order to guess the encoding and convert to UTF-8 if necessary. Mutual recursion...


Encoding, though obviously not language, should be provided explicitly as metadata (e.g. the Content-Type HTTP header). Also, most of the content available on the web is already UTF-8 (65.9% according to a recent survey [1]).

[1] http://w3techs.com/technologies/details/en-utf8/all/all
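
To illustrate leaning on that metadata, a small Python sketch that pulls the charset out of a Content-Type header and falls back to UTF-8 (the fallback choice and helper name are mine):

    # Sketch: prefer the declared charset, fall back to UTF-8.
    def decode_body(body_bytes, content_type):
        charset = "utf-8"  # assumed default, per the survey above
        for part in content_type.split(";")[1:]:
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                charset = value.strip('"')
        return body_bytes.decode(charset, errors="replace")

    print(decode_body(b"caf\xc3\xa9", "text/html; charset=utf-8"))  # café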


Mark Pilgrim reversed (or ripped out, I can't remember which) the encoding detection that Firefox uses. It has done a fairly good job for my web crawling:

http://pypi.python.org/pypi/chardet
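
Its API is tiny; roughly this (the exact confidence value will vary):

    import chardet  # pip install chardet

    raw = u"El señor está en el jardín".encode("iso-8859-1")
    guess = chardet.detect(raw)  # {'encoding': ..., 'confidence': ...}
    print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")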


In my experience chardet very often misclassifies ISO-8859-1 as ISO-8859-2. I saw the misclassification even on small Spanish pages that used only the typical Spanish characters.


I'd say in most cases UTF-8 is already used. Of course this depends on the source of the text.


I thought detection was easier with a Unicode ordinal value map table.
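
Something like mapping code points to scripts, which tells scripts apart (Cyrillic vs. Greek vs. Han) but not languages sharing a script. A toy Python sketch of the idea; the ranges are a deliberately incomplete sample:

    # Toy script detection by Unicode code point ranges.
    # Illustrative and deliberately incomplete.
    SCRIPT_RANGES = [
        (0x0370, 0x03FF, "Greek"),
        (0x0400, 0x04FF, "Cyrillic"),
        (0x0590, 0x05FF, "Hebrew"),
        (0x0600, 0x06FF, "Arabic"),
        (0x4E00, 0x9FFF, "Han"),
    ]

    def guess_script(text):
        # Tally which block each character's ordinal falls into.
        counts = {}
        for ch in text:
            for lo, hi, name in SCRIPT_RANGES:
                if lo <= ord(ch) <= hi:
                    counts[name] = counts.get(name, 0) + 1
        return max(counts, key=counts.get) if counts else "Latin/other"

    print(guess_script(u"Привет мир"))  # Cyrillic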


I am assuming it will recognize languages even when they use the same character set. No?



