Identify Any Language Written in the Roman Alphabet at a Glance (theweek.com)
148 points by tokenadult on May 12, 2016 | hide | past | favorite | 92 comments



…both use accents on vowels, but only Scots uses grave (left-pointing) accents, like on à in Gàidhlig.

Just a quick note of caution here: "Scots" and "Scots Gaelic" are two completely different languages, the former being a Germanic language closely related to English and spoken in the Lowlands, and the latter being a Celtic tongue largely confined to the Highlands, the Western Isles, and Nova Scotia. If you can read English you can probably make some vague sense of written Scots, but unless you have training there's no way you'd understand a word of Scottish Gaelic. This article is referring to Gaelic, not Scots.


Some of my Chinese friends use の instead of the Chinese equivalent 的 just for fun. Personally I distinguish them by just going and learning the languages. It's easy to distinguish them by noticing Japanese has curvy characters mixed with blocky complicated Chinese ones, whereas Chinese is 100% complicated blocky characters.

Also I just want to put it out there randomly that if you want to learn a language but believe you can't, you are almost certainly wrong. If you are able to read this text, you have demonstrated possession of a wet sack of neurons capable of learning a second language. I've witnessed or read about all sorts of people learning a new language: old people, shy people, autistic people, even people dealing with brain cancer. It is a myth that children can learn faster than adults. The only time this happens is when the adult is hindered by their own reluctance.

Go get some beginner materials with audio, not just text, and dive in. Don't waste time torrenting 12312^23 TiB of learning materials. Glossika, Teach Yourself X, Xpod, Xclass101, whatever. Learn how to ask some trivial questions relevant to your life, write them down because you will forget, then go chat with a native speaker somehow. Read out your questions because you're nervous and forgot, and then fail to understand anything they say, but just pay attention and listen to the sounds of the language. Then go home and learn a bit more, but don't worry too much about memorising anything, just listen, and comprehend a little bit. Then meet up with a native speaker again with some more questions prepared. This time you might understand 0, 1, or a few words, still be nervous, but you'll be a little bit better than last time. Basically you just keep this up without giving up and you will pick up pace. For inspiration, look at blogs like Benny Lewis' and others. It may take 10000 hours to master, but it only takes hundreds of hours to find yourself understanding and contributing to group conversations comfortably. If you can enjoy the process, you'll be able to study for N hours for any N as the clock keeps ticking regardless. Just set N=1000.


For some languages a couple hundred hours might be enough to participate meaningfully in a normal conversation. But if you're for example an English native speaker and want to learn Chinese, you'll have a really hard time understanding anything but very simple sentences.

On the other hand if you know English and German well, it is very easy to learn for example Dutch, and a couple hundred hours will get you really far.


You're talking about language overlap being relevant to the amount of time and effort necessary to get results. That is true, of course, and it is often thanks to the (shared and) already possessed mental models necessary to master the new language. The most important bit of this mental model of a given language is the way speakers phrase their thoughts. This is exactly what you get right by following brendyn's advice of taking it slow. In time you'll sense and reproduce the natural expressions. (This advice is especially valuable for learning English, BTW! Trying to make sense out of the compound verbs is a poor strategy, therefore just let them sink in slowly in your mind, each within an appropriate context.)


> It's easy to distinguish them by noticing Japanese has curvy characters mixed with blocky complicated Chinese ones, whereas Chinese is 100% complicated blocky characters.

And Korean has lots of circles/ovals.


Additional bonus: Korean uses lots of basic angles, squares, and lines, and many Hangul have "three parts", e.g. 한국어의. It's a beautifully simple writing system.

You can go quite a long comment chain in Japanese without seeing の. I always tell people "look for lots of simple characters that can be written with only 2 or 3 lines mixed in between a bunch of really complex characters".

    来週は学校に行きません
For those who don't know Japanese, only following my rule, can you identify the Japanese characters in the sentence above?

Except with names, Japanese writing will have a bunch of Chinese characters with many "simple" Japanese characters sprinkled in between. To someone who doesn't read Japanese, both are "unintelligible", but I find people can distinguish the more complex Kanji from the simpler Kana quite easily and can point them out with pretty good accuracy, even if they can't read any of the kana.

The above example without Japanese characters if you'd like to see if you guessed correctly:

    来週学校行
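
For anyone who wants to check their guess mechanically, here is a minimal sketch of the same kana-versus-kanji split. It assumes the basic Unicode block ranges for Hiragana, Katakana, CJK Unified Ideographs, and Hangul Syllables; the function names `script_of` and `split_kana_kanji` are made up for this example, and rarer characters outside those blocks simply fall into "other".

```python
def script_of(ch: str) -> str:
    """Rough script label for one character, based on Unicode block ranges."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        # CJK Unified Ideographs: shared by Chinese hanzi and Japanese kanji
        return "kanji"
    if 0xAC00 <= cp <= 0xD7A3:
        return "hangul"
    return "other"


def split_kana_kanji(text: str) -> tuple[str, str]:
    """Return (kana, kanji) found in text, each in original order."""
    kana = "".join(c for c in text if script_of(c) in ("hiragana", "katakana"))
    kanji = "".join(c for c in text if script_of(c) == "kanji")
    return kana, kanji


kana, kanji = split_kana_kanji("来週は学校に行きません")
print(kana)   # the "simple" characters
print(kanji)  # the "complex" characters
```

A text that contains any kana at all is a strong hint for Japanese, while a text made up only of the "kanji" block is more likely Chinese. That is a heuristic, not a guarantee.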


Another quirk about the の thing is that some Chinese-speaking people, in Taiwan at least (not sure about the mainland) use の as a colloquial replacement for 的, presumably due to Japanese cultural influence. So it's not completely impossible to see の in a Chinese-language text.

(They still pronounce it "de", only the writing is different.)




Being in Japan is always a bizarre experience as a student of Chinese. I can perfectly understand a number of signs and get the gist of a surprising amount of other written language despite not understanding a word of it spoken aloud.


Twist: When you see a pair of complex-looking Chinese character strings and one of them looks somewhat "simpler", chances are that it's Chinese and the other is Japanese, because Mainland Chinese people use simplified characters.


来 and 学, from the original example, are both simplified characters (the traditionals being 來 and 學). I was a little bemused to see them in an example of Japanese text, but it turns out they are the common Japanese characters as well. Japanese has its own set of simplifications ( https://en.wikipedia.org/wiki/Shinjitai ), often overlapping with the Maoist simplifications.

In a different pattern, mainland China simplified 龍 to 龙 while Japan simplified it to 竜. Despite being a "simplified" form, 竜 is actually the oldest of those three characters.

>> For those who don't know Japanese, only following my rule, can you identify the Japanese characters in the sentence above?

I would technically meet that requirement, but knowledge of Chinese makes the question pretty easy regardless of knowing Japanese. ;)


>I would technically meet that requirement, but knowledge of Chinese makes the question pretty easy regardless of knowing Japanese. ;)

Found the loophole in my requirements! Haha :) Had a good laugh when you pointed that out, thank you. Showed a quirk in my reasoning.

You also did a wonderful job explaining the simplified/traditional characters and the overlap between them and I learned a bit of trivia!


Taiwan hasn't followed the simplification. Could you be looking at dual printed classical/simplified Chinese?


I would say the giveaway on Korean is that it has circles or ovals as part of the character.


It doesn't necessarily have any though.


    来週は学校に行きません
This sentence makes no sense.


What problem do you have with it?


It's nonsensical and an actual error for one. ;)

If you search for "来週は学校に行きません" on Google this thread is the only result.


The reason for this is contextual (internet posts), not grammatical. You'll get more results if you remove the polite ending and/or the topic particle (try "来週学校に行かない").


Actually, since not wanting to go to school is a common sentiment, it can end up simplified into something with no particles at all: https://twitter.com/k_y02240/status/613690896916201472

I hope you don't find it nonsensical, though, I understood it just fine.

And checking ghits for plain forms shows:

"学校に行く" - 3,680,000

"学校へ行く" - 444,000

so I think the other way is actually the variant!

BTW, I feel like "学校に行く" - where に means "for/into" - has a sense of "going to school to go to class", but "学校へ行く" - where へ means "towards" - has a sense of "going to the school building as a physical place".

Here's a forum thread about it:

http://oshiete.goo.ne.jp/qa/5458812.html


It is neither nonsensical nor an error.


    来週は学校に行きません

 roughly translates as "Next week, I won't go for school."
The correct usage is

    来週は学校*へ*行きません
"Next week, I won't go to school."


Hi, I'm a gainfully employed translator of Japanese and linguist.

Short answer: the sentence is fine as it stands, although your alternative is equally grammatical (if less idiomatic - "学校へ行かない" gets about 1/4 of the ghits of "学校に行かない" by my count, and personally I'd never use へ).

Long answer: Japanese draws a distinction between verbs of A→B movement (e.g. 行く to go; 引っ越す to move house; 移動する to move/change location) and verbs that describe the manner of motion (e.g. 歩く to walk; 走る to run). The A→B verbs can happily take an indirect object complement (a に phrase) as the destination, because there's really no other possible meaning. You can't "go for school", as you put it.

Verbs of motion, on the other hand, are less flexible. These can't take an indirect object complement, and need a more expressly destinatory case (think へ、へと、まで). The reason for this is that a change of location is not implied in the verb, however counterintuitive that might seem to an English speaker.

Interestingly, motion verbs can take a direct object (を), such as in 街を歩く 'to walk around town'. Another indication that they're not strictly destinatory. Also they can take adverbs that further qualify motion, such as ぽつぽつ歩く 'to dawdle/mosey/toddle'(? tough to translate), whereas 行く can't. A→B verbs can however take adverbs that qualify the speed, like ゆっくり行く 'to go unhurriedly' - but this is different from ゆっくり歩く 'to walk in a slow manner'.

This is my first ever comment on HN so I've tried to be as informative as possible... If you have any counterexamples I'd be happy to try to explain them.


Japanese N1 and JETRO biz japanese here.

Yeah sure, in vernacular construct, which explains more Google results.


Just stop. You're embarrassing yourself.


Entirely false. に is perfectly valid here.


There is nothing wrong with that sentence.


Wikipedia has a more elaborate and detailed guide to this

https://en.wikipedia.org/wiki/Wikipedia:Language_recognition...

but this article's approach is actually pretty handy and its tips are very practical.


Only tangentially related, but why can't spelling check software automatically figure out which language I'm typing in? I write mostly in English but sometimes I write in French or in a mix of both languages and I usually struggle with the spelling corrector which keeps bothering me.


For Firefox there is the Dictionary Switcher extension: https://addons.mozilla.org/en-US/firefox/addon/dictionary-sw...

As hirsin said, SwiftKey does this really well on Android. It's probably my main irritation on iOS: having to switch keyboard all the time when I could just type English on the Swedish keyboard.


SwiftKey (Android keyboard) does an amazing job of this. I routinely switch between French and English, and usually by the end of the first word in the other language my autocorrect and suggestions are both in the right language.


AFAIK it uses Markov chains, so I think it just lumps all dictionaries together, and as soon as you write a couple of words, the probabilities for the following words will be in the proper language. It doesn't even need specific rules per language, it's all automatic.
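
To illustrate the "lump everything together" idea, here is a toy word-level Markov (bigram) model trained on a mixed English/French corpus. This is only a sketch of the general technique, not SwiftKey's actual implementation, which is not described here; the corpus and the names `train_bigrams` and `suggest` are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for each word, which words follow it (a first-order Markov model)."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1
    return model


def suggest(model, prev_word, k=3):
    """Return up to k most frequent followers of prev_word."""
    followers = model.get(prev_word.lower(), Counter())
    return [word for word, _ in followers.most_common(k)]


# Hypothetical mixed-language training data: no language labels anywhere.
corpus = [
    "i am going to the store",
    "i am going to school",
    "je vais à la maison",
    "je vais à la plage",
]
model = train_bigrams(corpus)

print(suggest(model, "going"))  # followers seen after an English word
print(suggest(model, "vais"))   # followers seen after a French word
```

Because French words were only ever followed by French words in the training data, the suggestions switch language as soon as the previous word does, without any per-language rule.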


That is my guess as well but, whatever it is, it works marvelously. Many times, I write a significant portion of my message just through suggestions.


That's a funny game to play with your friends (if they also have personalized suggestions). Just see what you can write using only your suggestions.


There's now SwiftKey Neural, which probably uses neural nets instead of Markov chains.


I suspect it's less a case of they can't and more a case of they don't. Automatic language detection can be done with reasonable accuracy.
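
As a rough illustration of how little it takes, here is a sketch that guesses between English and French by counting very common function words. The word lists are tiny hand-picked assumptions, and `guess_language` is a hypothetical helper; real detectors typically use far larger statistics.

```python
# Small, hand-picked sets of very common words (an assumption for this sketch).
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "french": {"le", "la", "les", "et", "de", "est", "un", "une", "que"},
}


def guess_language(text):
    """Return the language whose common words appear most often, or None."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    scores = {
        lang: sum(word in common for word in words)
        for lang, common in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("The cat is in the garden"))
print(guess_language("Le chat est dans le jardin"))
```

A spell checker could run something like this per sentence and pick the dictionary accordingly. Very short inputs and mixed-language sentences would still be ambiguous.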



Ð/đ may also be Croatian, where it sounds like a "dj". Technically it could also be Serbian (which is pretty much the same spoken language, called Serbo-Croatian), but Serbian is usually written using the Cyrillic alphabet while Croats chose Roman letters.


This is all very helpful, but how do you spot English?


If the text contains only plain, boring letters and you see the word "the" being used a few times... you're looking at a text in English.


Though æ is used in English, too.


Quite rarely, however. Especially in American English (British English still uses it for some spellings, like "encyclopædia").


Yes, I have noted the American difference. Pedophile means something quite different!


And ë or ö, archaically ;)


English with a lot of ë's and ö's means you're reading The New Yorker.


And ï.


"our", "re", and "ise" (unless it's English from Oxford)


No diacritics, and generous use of apostrophes.


> No diacritics

An understandably naïve view of things whose errancy could easily be corrected through your coöperative perusal of The New Yorker.


To be fair, the original question was how to recognise English text, not how to recognise the archaic microcosm of English text that resides between the covers of The New Yorker.


Wouldn't it be cöoperative?


It wouldn't be. The goal is to separate the second o from the first, not to separate the o from the c:

The diaeresis indicates that a vowel should be pronounced apart from the letter that precedes it [1]

[1] https://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29#Diae...


Oh, true. Naive should have tipped me off.


That guideline would likely classify Xhosa as English. See https://xh.wikipedia.org/wiki/Iphepha_Elingundoqo , which has no diacritics and three apostrophes.


Throw in "Mid-word capitals are (almost completely) absent in English".


To all sibling comments: I'm sure he meant, how does someone who does not speak English identify it by the letters?


And now that I think about it, how would you differentiate Latin and English?


In my case it'll be the only one that I can understand, innit?


The use of "w" puts it in the germanic group, then the additional usage of "qu" puts the romance sign on it. Usually, that is enough for me.


Pardon?


>Dutch, German, and Afrikaans: Of these three close relations to English, only German uses Ä/ä, Ö/ö, and Ü/ü.

Incorrect: Dutch uses the trema on the i, e, o, u, a as well. Examples: * reünie * knieën * ruïne * Aäron * zoöloog


And anal English speakers (such as the copy editors of the New Yorker) also use the diaeresis (the double dot over a vowel). In English, it is used over the second of two vowels in a row which are voiced separately, such as naive (should have the double dot over the i) or cooperate (double dot over the second o). The distinction is between a word like coop (meaning a house for chickens) in which the two vowels make one sound and a word like cooperation, in which the two o's are separate sounds.


This reminds me of this fine map, "List of writing systems" https://upload.wikimedia.org/wikipedia/commons/a/aa/World_al...


Nitpick: In Turkish, "ğ" is silent by itself but it makes the pronunciation of the vowel before itself longer and sometimes makes the pronunciation end at the back of the mouth, especially after "e". "Erdoğan" is indeed just "Erdooan" though.


Correction to the article: There is no Ů in Czech (except for CAPS LOCKED words) - The longer "u" is written as "Ú/ú" as the first letter in a word, and "ů" in other positions (strange for sure, but because of historical reasons)


It's not just historical. "Ů" is phonologically a longer "u", but it is actually an alternation [1] of "o": for example see how nominative "dům" becomes "domu" in the genitive (likewise "stůl", "bůh", etc.). When a "u" becomes longer, instead, it becomes a "ú" as the first letter of the word, but otherwise it becomes "ou"; for example see how the feminine nominative of (some) nouns and adjectives is "a" and "á" respectively, while the accusative is "u" and "ou", or how the perfective companion of "kupovat" is "koupit".

(Also, see how I sneaked in a "Ů" in the second sentence :)).

[1] https://en.wikipedia.org/wiki/Alternation_%28linguistics%29


The article is correct, though: if you see Ů, it's Czech.


> Welsh is actually quite different from the other two. It uses lots of ll and ff and it uses w as a vowel (e.g., cwm).

Welsh also uses a circumflex accent to extend any of the vowels, and since both 'w' and 'y' are vowels in Welsh (leading to many jokes by English speakers about words with no vowels) they can have the circumflex accent too. I've had problems in the past finding the alt-codes to generate w or y with a circumflex accent - so those may be unique to Welsh.

From http://symbolcodes.tlt.psu.edu/bylanguage/welsh.html : > Because of the writing system, Welsh places accents on the letters w (phonetic /u/) and y (phonetic /ɨ/ or /i/), which is very unique in languages of the world. These symbols require Unicode support apart from that of other Western European languages.
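
For anyone hunting for these characters, a short sketch that prints their Unicode code points and official names using Python's standard `unicodedata` module. The helper name `describe` is made up for this example; the characters are written as escapes so the code does not depend on how your editor stores them.

```python
import unicodedata


def describe(ch):
    """Format a character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Precomposed W/w/Y/y with circumflex.
for ch in "\u0174\u0175\u0176\u0177":
    print(ch, describe(ch))

# The same letter can also be typed as a base letter plus a combining
# circumflex (U+0302); NFC normalization folds it into the precomposed form.
combined = unicodedata.normalize("NFC", "w\u0302")
print(combined == "\u0175")
```

If a search or comparison on Welsh text behaves oddly, normalizing both sides to NFC first is usually worth trying.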


Persian will have three dots in a triangle above a single upward stroke or below the line. Arabic only has the three dot combo above the script on a multiple upward stroke grouping (sometimes a flat line between upstrokes).


The same is true for Urdu as well, so if you want to distinguish Urdu from Persian: look for a backward moving (i.e. towards the right) horizontal stroke at the end of a word. This stroke will always run under the preceding letters of the word, except that some dots of the preceding letters may be moved beneath the stroke in order to avoid collision.


I try doing this a lot when listening to people on the street and think I'm pretty good at it... Of course, I never truly know unless I ask!


Nothing on Filipino/Indonesian languages? Those always confuse me, since the users also heavily mix them with English, so you might see a comment mostly in English but also have a bunch of native words or phrases mixed in.


> You can sometimes tell Danish from Norwegian because Danish sometimes uses aa (as in Kierkegaard) instead of å.

That goes both ways (e.g. Haakonsvern in Norway), so no, you really can't tell them apart that way.


I like Hacker News for that. The topic of this article is interesting. Thanks for bringing that up. However, when you read the comments here, you realise the article is quite wrong :)


Ħħ - Maltese


ß for German!


Yes, if you spot ß, that's a dead giveaway for German. You can't really rely on it alone for identification however because it's not that frequent (or rather, it's very inconsistent – German can run for paragraphs without a single ß only to make up for it with five of them in a single sentence). It's also not used at all in Swiss German.

Another near-certain giveaway is that all nouns in German are capitalised. The only other language that does that and uses the Latin alphabet is Luxembourgish, and you're probably not looking at that.
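
The single-character giveaways claimed in this thread (ħ, ß, ů, and the Welsh ŵ/ŷ) can be collected into a small lookup. This is a sketch under the assumption that those claims hold; the table `TELLTALE` and the function `guess_by_telltale` are invented for illustration, and an empty result only means none of these particular characters appeared.

```python
import unicodedata

# Characters that commenters here describe as strong hints for one language.
TELLTALE = {
    "ħ": "Maltese",
    "ß": "German",
    "ů": "Czech",
    "ŵ": "Welsh",
    "ŷ": "Welsh",
}


def guess_by_telltale(text):
    """Return the set of languages hinted at by telltale characters in text."""
    # NFC so that base letter + combining accent matches the precomposed key;
    # lower() so that capital forms such as Ů or Ħ match too.
    normalized = unicodedata.normalize("NFC", text).lower()
    return {TELLTALE[ch] for ch in normalized if ch in TELLTALE}


print(guess_by_telltale("Straße"))
print(guess_by_telltale("dům"))
print(guess_by_telltale("the house"))
```

As noted above for ß, such characters can be absent for paragraphs at a time, so a lookup like this works best as one signal among several.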


There is a character only used in Taiwanese, not Chinese: 互

By that, I mean the Taiwanese language, which is not the same as Mandarin Chinese. Both languages are used in Taiwan, although Mandarin is the official language of the (outgoing) KMT government. Taiwan number 1 ;-)


It's not true that only Taiwanese uses it. It means "mutual" or "each other" and is used quite a bit.

互相 mutual 互聯網 internet


What? Chinese and Japanese both use 互.


French:

Often used: à è é

Used: â ä ê ë î ï ô û ù œ ç

Very very rare: æ ü ÿ


Crap, I upvoted this before I noticed which publication it's from. How do I downvote it?


On a meta level I find that just a little troubling. It sounds to me like "Crap, I agreed with this until I noticed it was an opinion from a tribe I don't identify with - so I can't agree with it". Maybe theweek.com is some uniquely evil thing I haven't heard about?


"Upvote" means something different than "agree"; one of the things it means is "I endorse people visiting this".

I can imagine quite a few sources that I wouldn't want to direct traffic to even if they published something where I agree with the sentiment.


This is the first time I heard of the week, a quick glance around failed to raise any problems; what's wrong with them?


The idea of voting for a link for any reason other than its quality is completely antithetical to a karmic voting system like the one used here.


Yes, but "quality" is both vague and subjective. Not only will different people evaluate aspects of quality differently, different people will legitimately have different views on what components the "quality" of a link has. I don't think it's unreasonable to consider the source as one factor in overall quality (if nothing else, as a proxy for things the rater is unable to evaluate about the article in isolation).


But why should things other than the article directly linked to matter? Why should it be acceptable to downvote an otherwise interesting and correct article just because of the source?

That smacks of voting for ideological correctness over truth or interestingness, a problem that otherwise intelligent people should be able to look past. What makes this site meaningfully different from the front page of Reddit if people will crap on an article because it comes from a source that doesn't align with their politics?


> Why should it be acceptable to downvote an otherwise interesting and correct article just because of the source?

"Correct" is often a probabilistic assessment, not something a potential up/downvoter can determine absolutely.

The source is often an important input to that probabilistic assessment.

> That smacks of voting for ideological correctness over truth or interestingness

Different outlets of the same ideological bent (whether relatively neutral or not) can have wildly different editorial standards which produce wildly different reliability.


I wasn't implying that upvote means agree - only that upvote and agree are positive rather than negative sentiments (because I was proposing a broad pattern match not an exact semantic match). But your explanation does make sense to me, that is a plausible stance, thanks.


But what's wrong with theweek?



