Language detection with Google's Compact Language Detector (mikemccandless.com)
75 points by davidw on Oct 22, 2011 | hide | past | favorite | 12 comments



Nice. I've been using a call to the Google online translator to achieve the same result - http://ajax.googleapis.com/ajax/services/language/translate?...
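
For reference, a minimal Python sketch of that kind of call. The endpoint here is the detect sibling of the translate URL above, and both the v=1.0 parameter and the responseData layout are assumptions about the old AJAX Language API (which has since been shut down):

    import json
    import urllib.parse
    import urllib.request

    # Sketch only: the endpoint and response shape are assumptions about
    # Google's old AJAX Language API (v=1.0), now long retired.
    def detect_language(text):
        params = urllib.parse.urlencode({"v": "1.0", "q": text})
        url = "http://ajax.googleapis.com/ajax/services/language/detect?" + params
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        result = data.get("responseData") or {}
        return result.get("language"), result.get("isReliable")

    print(detect_language("Ich bin ein Berliner"))  # e.g. ('de', True)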


Are there published API limits on this service?


I'm not aware of any. I'm using this in a high-volume, non-critical capacity and it seems to be okay. I should keep better stats, though.


Nice! I wrote a .NET wrapper myself, though I never got around to a Python extension. One question: did you experience any memory leak issues with the CLD? My .NET wrapper DLL seems to leak, and I never checked whether it was the C++/CLI layer I added on top or the actual CLD native C++ code. I doubt the latter, since (according to my basic understanding) nothing is created in the original code that needs to be cleaned up manually. Before I start debugging mixed-mode .NET applications, I just wanted to be sure.


Great library! Here are bindings for PHP: https://github.com/lstrojny/php-ccld
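
For anyone trying the Python binding from the article, a rough usage sketch; the cld module name and the detect() return tuple follow my reading of the blog post, so treat the exact signature as an assumption:

    import cld  # the Python binding from the article

    # detect() wants clean UTF-8 bytes; the return tuple
    # (name, code, isReliable, textBytesFound, details) is an
    # assumption based on the post -- double-check against the source.
    text = u"Il est l'heure de partir".encode("utf-8")
    name, code, reliable, bytes_found, details = cld.detect(text)
    print(name, code, reliable)  # e.g. FRENCH fr True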


"You must provide it clean (interchange-valid) UTF-8, so any encoding issues must be sorted out before-hand."

In most cases you have to know the language in order to guess the encoding and convert to UTF-8 if necessary. Mutual recursion...


Encoding, though obviously not language, should be provided explicitly as metadata (e.g. the Content-Type HTTP header). Also, most of the content available on the web is already UTF-8 (65.9% according to a recent survey [1]).

[1] http://w3techs.com/technologies/details/en-utf8/all/all
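
To illustrate leaning on that metadata, a small Python sketch that pulls the charset out of a Content-Type header and falls back to UTF-8 (the fallback choice and helper name are mine):

    # Sketch: prefer the declared charset, fall back to UTF-8.
    def decode_body(body_bytes, content_type):
        charset = "utf-8"  # assumed default, per the survey above
        for part in content_type.split(";")[1:]:
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                charset = value.strip('"')
        return body_bytes.decode(charset, errors="replace")

    print(decode_body(b"caf\xc3\xa9", "text/html; charset=utf-8"))  # café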


Mark Pilgrim reversed (or ripped out, I can't remember which) the encoding detection that Firefox uses. It has done a fairly good job for my web crawling:

http://pypi.python.org/pypi/chardet
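
Its API is tiny; roughly this (the exact confidence value will vary):

    import chardet  # pip install chardet

    raw = u"El señor está en el jardín".encode("iso-8859-1")
    guess = chardet.detect(raw)  # {'encoding': ..., 'confidence': ...}
    print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
    text = raw.decode(guess["encoding"] or "utf-8", errors="replace")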


In my experience chardet very often misclassifies ISO-8859-1 as ISO-8859-2. I saw the misclassification even on small Spanish pages that used only the typical Spanish characters.


I'd say in most cases UTF-8 is already used. Of course this depends on the source of the text.


I thought detection was easier with a Unicode ordinal value map table.
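
Something like mapping code points to scripts, which tells scripts apart (Cyrillic vs. Greek vs. Han) but not languages sharing a script. A toy Python sketch of the idea; the ranges are a deliberately incomplete sample:

    # Toy script detection by Unicode code point ranges.
    # Illustrative and deliberately incomplete.
    SCRIPT_RANGES = [
        (0x0370, 0x03FF, "Greek"),
        (0x0400, 0x04FF, "Cyrillic"),
        (0x0590, 0x05FF, "Hebrew"),
        (0x0600, 0x06FF, "Arabic"),
        (0x4E00, 0x9FFF, "Han"),
    ]

    def guess_script(text):
        # Tally which block each character's ordinal falls into.
        counts = {}
        for ch in text:
            for lo, hi, name in SCRIPT_RANGES:
                if lo <= ord(ch) <= hi:
                    counts[name] = counts.get(name, 0) + 1
        return max(counts, key=counts.get) if counts else "Latin/other"

    print(guess_script(u"Привет мир"))  # Cyrillic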


I am assuming it will recognize languages even when they use the same character set. No?



