
I took a quick look at English / Swedish, and noticed that the Swedish version of Three Men in a Boat is badly encoded: the (vital) Swedish letter å has been consistently replaced with a. On the other hand, ä and ö seem to be present as expected.

So (based on a sample size of one), I'm afraid it doesn't look like the translations are necessarily reliable.




I've previously struggled with the encoding of books on Project Gutenberg. In theory, each text file has a header that specifies the encoding along with other data about the book, like author and title. In practice, the header format varies unpredictably (it was probably written by hand and not meant for machine consumption). Even if you can parse the encoding value, the header may say ASCII while the file is actually in some Windows code page, or UTF-8, or something completely different. The header itself is written in that same encoding, so you can fail to parse the header because you don't yet know which encoding to use. One book actually used different encodings in different parts.

After collecting a messy tangle of special cases over several days, I threw in the towel and just used BeautifulSoup's UnicodeDammit to brute-force a working encoding.

Maybe the Swedish book didn't indicate a vs å in the first place, though.
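
For anyone who wants the gist of the fallback idea without pulling in BeautifulSoup, here is a minimal stdlib-only sketch. It is not what UnicodeDammit does internally; it just tries the declared encoding first and then a fixed list of guesses. The function name `decode_with_fallback` and the candidate list are my own assumptions, not anything from Gutenberg or bs4.

```python
from typing import Optional, Sequence, Tuple


def decode_with_fallback(
    raw: bytes,
    declared: Optional[str] = None,
    fallbacks: Sequence[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (text, encoding_used) for the first encoding that decodes cleanly.

    `declared` is whatever the file header claims (hypothetical input here);
    it is tried first, but not trusted.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: bytes are invalid in this encoding.
            # LookupError: the header named an encoding Python doesn't know.
            continue
    # Only reachable if the caller removed latin-1, which accepts any byte.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


if __name__ == "__main__":
    # Header says ASCII, file is really Windows-1252.
    text, used = decode_with_fallback("Tre män i en båt".encode("cp1252"), "ascii")
    print(used, text)
```

Note that "decodes without error" only means the bytes are valid in that encoding, not that the letters are right, so the order of the candidates matters a lot.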


Not sure this one came from Gutenberg; at least, I can't seem to find a Swedish version of it there. But anyhow, either the source is worthless, or some conversion process somewhere was bad. The a vs å distinction is not a minor or optional issue in Swedish; å, ä and ö are separate letters of the alphabet in their own right, and "omitting accents" isn't a thing. It's just wrong.

(E.g. see the Swedish alphabet at http://omniglot.com/writing/swedish.htm)

> I threw in the towel and just used BeautifulSoup's UnicodeDammit to brute-force a working encoding.

For a limited definition of "working", at times! "A working encoding" that shows the wrong letters isn't "working" in a very useful sense.
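
A cheap sanity check for this kind of damage, as a rough sketch: after decoding, count å, ä and ö, and flag the text if some of them occur while another never does. The function name `missing_swedish_letters` and the `min_other` threshold are arbitrary choices of mine, and a short text could trip it by chance.

```python
import unicodedata
from collections import Counter
from typing import List


def missing_swedish_letters(text: str, letters: str = "åäö", min_other: int = 5) -> List[str]:
    """Return the letters from `letters` that never occur in `text`,
    but only when the letters as a group occur at least `min_other` times.

    A non-empty result suggests a lossy conversion (e.g. å flattened to a)
    rather than a text that simply isn't Swedish.
    """
    # NFC so that 'a' + combining ring is counted as a single 'å'.
    normalized = unicodedata.normalize("NFC", text).lower()
    counts = Counter(ch for ch in normalized if ch in letters)
    if sum(counts.values()) < min_other:
        return []  # too little evidence either way
    return [ch for ch in letters if counts[ch] == 0]


if __name__ == "__main__":
    print(missing_swedish_letters("män för göra hör väl bat"))
```

It won't tell you what the right encoding was, only that the output shouldn't be called "working" yet.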


I had the same thought. Some text is hard to read as well. For example:

-- Harris sade, att han drabbades av sa utomordentligt starka yrselanfall emellanat, att han knappt visste vad han gjorde; och da sade George att han led av starka yrselanfall, och knappt visste vad han gjorde.

This made very little sense to me until I googled it and found another source that italicizes some words in the second half, so that it at least makes some sense. (https://sv.wikisource.org/wiki/Tre_m%C3%A4n_i_en_b%C3%A5t._K...)



