Easy to test. "Koirissanikohan" should be a sentence (or a word) conveying exactly this meaning: "I wonder if also probably in my dogs there is ..".
That will not happen in a purely AI-based system without hard-coded parsing rules and without replacing the default dictionary-based, English-like language model.
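To make the example concrete, here is a minimal sketch of the kind of hard-coded rule I mean: a toy suffix stripper that peels "Koirissanikohan" apart from the right. The suffix table, the glosses, and the `segment` helper are my own assumptions for this one word, not a real Finnish analyzer, and the stem table has to list "koir" separately because the final vowel of "koira" disappears before the plural marker.

```python
# Toy right-to-left suffix stripper. The suffix table is hand-picked for
# this single example word (an assumption, not a real Finnish analyzer).
# Suffixes are listed outermost first, i.e. in the order they are removed.
SUFFIXES = [
    ("han", "clitic, roughly 'I wonder'"),
    ("ko", "question particle"),
    ("ni", "possessive, 'my'"),
    ("ssa", "inessive case, 'in'"),
    ("i", "plural marker"),
]


def segment(word, stems):
    """Strip known suffixes from the end of `word`, then look up the rest.

    Returns (base, parts): `base` is the dictionary entry for the remaining
    stem (or None if unknown), `parts` lists (suffix, gloss) left to right.
    """
    rest = word.lower()
    found = []
    for suffix, gloss in SUFFIXES:
        if rest.endswith(suffix):
            rest = rest[: -len(suffix)]
            found.append((suffix, gloss))
    found.reverse()  # report suffixes in reading order
    return stems.get(rest), found


# The stem table maps the altered stem back to the dictionary form.
STEMS = {"koir": "koira (dog)"}

base, parts = segment("Koirissanikohan", STEMS)
print(base, [s for s, _ in parts])
```

Even this tiny case needs a per-stem exception, which is exactly the sort of thing a word-level dictionary model has no slot for.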
I'll never consider machine translation a solved problem until there is good Finnish<->English translation. Living in Finland as a native English speaker with terrible Finnish skills, this is a daily source of frustration and sometimes amusement.
I've been in Finland for a month as a native English speaker and came to exactly the same conclusion, especially after witnessing all the mistakes that Google Translate makes.
OT, I hope you don't mind me asking, how does this work out for you? I have some reasons to consider moving to Finland but the language barrier seems very rough. Is it even possible for the average hacker to get by with just English?
It's been good so far but luckily my partner is Finnish so that helps quite a bit. I'm also taking classes and trying to use the language when I can. I've learned French and Dutch to conversational level in the past, but I've found Finnish to be much much harder.
In terms of daily life, in Helsinki and probably the other big cities you could get by just fine with English only. I know a few folks at work who've been here a long time and have survived OK! Written information is mostly in Finnish or Swedish, but occasionally (and from what I gather, increasingly) in English. I've found pretty much everyone in a customer/client-facing position speaks very good English too.
The one situation that comes to mind where it's genuinely quite difficult is grocery shopping and following cooking instructions. I like to cook and enjoy trying new things, but Google Translate is totally hopeless. Thankfully I've found that translating the Swedish to English seems to work OK.
> I've learned French and Dutch to conversational level in the past
Those are both languages of countries neighbouring the UK, with similar idioms. Finnish is in another league. Congrats BTW: many native English speakers these days try but, in my experience, end up using English only, at least in business settings.
There is no rigid structure, as the spelling is adjusted to make words easier to pronounce. I'd say a pure text-reading AI would never learn all the nuances and rules on its own, because so much is lost in translation.
>> Maybe some texts that describe a given language should also be used.
But in what language would those texts be? Language models used for translation are trained on sentence pairs. How would e.g. a book on the grammar of Finnish, written in Finnish and without a translation in English, help learn how to translate Finnish into English?
I'm genuinely asking. This sounds like an interesting idea, but how exactly would it work?
I don't know! My thinking is that if humans can read a Finnish book and learn facts about the Finnish grammar, there must be a way to exploit this in machine learning.
The problem is that we Finns can take a known base word, add inflections to get the meaning we want, and use the result in a sentence even though that exact word has never been said or written before. Native speakers will still understand it.
Basically, a full training set with all the words does not exist, let alone one with actual translations for them.
This also means that autocomplete etc. is pretty much useless for us. I just disable it on my iPhone, as it actively makes typing out a sentence harder.
(Also, to make things harder, as others have pointed out, the base word and the inflections get changed to make the word easier to pronounce, so there is no static form for them when you stack them.)
There are many languages that work similarly, and often it's more the orthography than the grammar that's problematic. E.g. English and German form compound nouns in the same way, but the constituent parts are usually separated by spaces in English orthography, while they're run together as one long string in German.
That doesn't mean it's impossible to work with other languages, just that "words separated by spaces" is the wrong abstraction for processing them. It just happens to be a heuristic that works well enough for English, so a lot of functionality (like autocomplete) assumes that it works the same for other languages. It would be perfectly feasible to offer partial completions of long words in languages like Finnish or German, if only the space key were treated as less special. (Just compare to Chinese and Japanese, where autocomplete works despite no spaces at all.)
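As a rough sketch of what "partial completions" could look like, here is a completer that works on word parts instead of space-delimited words. The parts list, the greedy longest-match stripping, and the `complete_part` name are all assumptions made for illustration; a real keyboard would need a proper morphological lexicon and some ranking.

```python
def complete_part(typed, parts, k=3):
    """Suggest completions for the unfinished last part of `typed`.

    Known parts are stripped greedily (longest match first) from the front;
    whatever is left is treated as a prefix of the part being typed.
    """
    rest = typed.lower()
    done = ""
    while rest:
        matches = [p for p in parts if rest.startswith(p)]
        if not matches:
            break
        longest = max(matches, key=len)
        done += longest
        rest = rest[len(longest):]
    if not rest:
        return []  # nothing unfinished to complete
    candidates = sorted(p for p in parts if p.startswith(rest))
    return [done + p for p in candidates[:k]]


# Hypothetical mini lexicon of German word parts.
PARTS = ["haus", "tür", "schlüssel", "schloss"]

print(complete_part("Haustürsch", PARTS))
# -> ['haustürschloss', 'haustürschlüssel']
```

The space key plays no role here: the completer offers the next part whenever the typed string stops matching a complete one.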
Not having a static form might create some redundancy in the lexicon, but that's not more of a problem than the vowel mutation in English "sing", "sang", "sung", "song". Treating different surface realizations of the same underlying base form as independent might actually be beneficial for getting accurate results that take into account how the base form is modified.
In Dutch, like in German, we write compound words without a space, so you can also invent completely "new" words. If I enter such words in Google Translate most of them are correctly translated to English and French. Granted, splitting a word into its compound parts (= several existing words with maybe an "s", "e", or "en" between them for easier pronunciation) is easier than having a base word + inflection + changes for pronunciation, but it shouldn't be impossible for an AI to do this. I do think you'll probably need custom rules or extra information per language, corresponding roughly to some grammatical rules or patterns.
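A minimal sketch of that splitting step, assuming a small known-word lexicon and only the three linking elements mentioned above. The lexicon and the `split_compound` helper are made up for the example; it returns the first split it finds and does no ranking of alternatives.

```python
# Linking elements allowed between two parts ("" means none).
LINKERS = ("", "s", "en", "e")


def split_compound(word, lexicon):
    """Return one split of `word` into lexicon entries, or None.

    Tries every known head, optionally followed by a linking element,
    and recurses on the remainder.
    """
    if word in lexicon:
        return [word]
    for i in range(1, len(word)):
        head = word[:i]
        if head not in lexicon:
            continue
        for linker in LINKERS:
            if not word.startswith(linker, i):
                continue
            tail = split_compound(word[i + len(linker):], lexicon)
            if tail is not None:
                return [head] + tail
    return None


# Hypothetical mini lexicon.
LEXICON = {"boek", "kast", "stad", "bibliotheek"}

print(split_compound("boekenkast", LEXICON))        # -> ['boek', 'kast']
print(split_compound("stadsbibliotheek", LEXICON))  # -> ['stad', 'bibliotheek']
```

Once the parts are known, each one can be translated on its own, which is presumably why invented compounds come out fine.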
I would bet that pure text-reading AI can absolutely deduce the rules. Finnish morphology (word structure) isn't that complicated. It's routinely used as an example in linguistics classes because it is simple and regular. If you want to see truly crazy word structure, try Georgian or Navajo.
Actually, in the 1970s Finnish morphology was considered a hard AI problem. I know this for a fact, because I saw Fred Karlsson at a Xerox Interlisp workstation demo in 1980. A few years later they obviously realized that only some cheap undergraduate labor was needed.
The problem is of course how to learn the meanings when they are omitted in translations. Even a native speaker is not always sure: is "koiran-ko-han-ko" a tautology, or does the second question marker refer to the "han"? "You wanted a dog, yes?" vs. "You wanted a dog, yes? You sure now?".
A purely context-free, meaning-based translator would just pick a random word from the list, like "kauppammekinkohan", and translate it right away: "I wonder if our shops also might be ..".
Tangentially related, this pisses me off more than it should regarding the "AI" in phone keyboards. Under the hood they're barely anything better than Markov chains.
I'm almost sure autocompletion in Finnish (or Estonian or Hungarian) is borderline useless, as the chances to write a word that wasn't ever written before are quite high.
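To show why a word-level model falls over here, this is a minimal bigram (first-order Markov chain) predictor. It is only a sketch of the general technique, not what any particular phone keyboard actually ships; the toy corpus and function names are made up.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words followed it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def suggest(counts, prev_word, k=3):
    """Return up to k most frequent followers of `prev_word`."""
    followers = counts.get(prev_word.lower(), Counter())
    return [word for word, _ in followers.most_common(k)]


# Toy corpus (an assumption for the example).
corpus = ["the dog barks", "the dog sleeps", "the cat sleeps"]
model = train_bigrams(corpus)

print(suggest(model, "the"))              # -> ['dog', 'cat']
print(suggest(model, "koirissanikohan"))  # -> []
```

A word that never appeared in the training text has no row in the table, so the model can neither predict it nor predict anything after it.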
But even with more "sane" grammars these models barely work. When I write in Spanish, often there are verb forms that are missing that I have to type fully.
For example, in Spanish all forms are on conjugation lists, but it can be tough to find a set of corpora that covers them all. I know that a 2016 Spanish Wikipedia dump I played with covered about 20% of all verb forms present in rae.es.
Then, on top of all those forms (I'd say about 16-18 tense/mood/aspect forms are in common use), you have to take into account enclitic/affix pronouns, and the whole thing goes awry quickly.
"Comámosnoslas" - If you search on Google, there are only 9 results[0], this post likely to become the 10th. But it's a completely normal word a Spanish speaker may use and will understand.
comamos ("let us eat") nos (emphasis "for ourselves") las ("them", feminine). "Let's eat them!" but with emphasis lost in translation.
Example usage: "Hay 3 pizzas, ¡comámosnoslas!". Funnily, Google Translate translates this sentence properly into English, but it tries to correct it to "comamosnos las", which is absolutely broken Spanish.
> I'm almost sure autocompletion in Finnish (or Estonian or Hungarian) is borderline useless, as the chances to write a word that wasn't ever written before are quite high.
I agree with you in that regard, speaking Turkish.