There is no rigid structure, as the spelling is adjusted to make words easier to pronounce. I'd say a pure text-reading AI would never learn all the nuances and rules on its own, because so much is lost in translation.
>> Maybe some texts that describe a given language should also be used.
But in what language would those texts be? Language models used for translation are trained on sentence pairs. How would e.g. a book on the grammar of Finnish, written in Finnish and without a translation in English, help learn how to translate Finnish into English?
I'm genuinely asking. This sounds like an interesting idea, but how exactly would it work?
I don't know! My thinking is that if humans can read a Finnish book and learn facts about Finnish grammar, there must be a way to exploit this in machine learning.
The problem is that we Finns can use a word (taking a known base word and adding inflections to get the wanted meaning) in a sentence that has never been said or written before, and native speakers will still understand it.
Basically, a full training set with all the words does not exist, much less a set with actual translations for them.
This also means that autocomplete etc. is pretty much useless for us. I just disable it on my iPhone, as it actively makes typing out a sentence harder.
(Also, to make things harder, as others have pointed out, the base word and inflections get changed to make the word easier to pronounce, so there is no static form for them when you stack them.)
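A rough illustration of why no full training set can exist: even a toy fragment of Finnish-style suffix stacking multiplies out fast. The suffixes below are real Finnish, but the combination logic is heavily simplified (no consonant gradation, only the a/ä, o/ö, u/y vowel-harmony swap), so treat it as a sketch, not a morphology engine:

```python
# Toy sketch of why a "list every word form" training set can't exist:
# even a tiny fragment of Finnish-style agglutination multiplies out fast.
# The suffixes are real Finnish, but the combination logic is simplified
# (no consonant gradation, only the a/ä, o/ö, u/y vowel-harmony swap).

CASES = ["ssa", "sta", "lla", "lta", "lle"]   # inessive, elative, adessive, ablative, allative
POSSESSIVES = ["", "ni", "si", "mme"]         # (none), my, your, our
CLITICS = ["", "kin", "ko", "han"]            # (none), "also", question, softener

FRONTING = str.maketrans("aou", "äöy")

def harmonize(suffix: str, stem: str) -> str:
    """Back-vowel stems (a/o/u) keep the suffix; front stems swap a->ä, o->ö, u->y."""
    if any(v in stem for v in "aou"):
        return suffix                  # talo -> talossa
    return suffix.translate(FRONTING)  # kesä -> kesässä

def forms(stem: str):
    """Generate every case + possessive + clitic combination for one stem."""
    for case in CASES:
        for poss in POSSESSIVES:
            for clitic in CLITICS:
                yield stem + harmonize(case + poss + clitic, stem)

talo_forms = list(forms("talo"))   # talo = "house"
print(len(talo_forms))             # 5 * 4 * 4 = 80 surface forms from one stem
print(talo_forms[:3])              # ['talossa', 'talossakin', 'talossako']
```

Eighty forms from one stem and a handful of endings; the real system has a dozen-plus cases, plural marking, and gradation on top, which is where "no full training set" comes from.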
There are many languages that work similarly, and often it's more the orthography than the grammar that's problematic. E.g. English and German form compound nouns in the same way, but the constituent parts are usually separated by spaces in English orthography, while they're run together as one long string in German.
That doesn't mean it's impossible to work with other languages, just that "words separated by spaces" is the wrong abstraction for processing them. It just happens to be a heuristic that works well enough for English, so a lot of functionality (like autocomplete) assumes that it works the same for other languages. It would be perfectly feasible to offer partial completions of long words in languages like Finnish or German, if only the space key were treated as less special. (Just compare to Chinese and Japanese, where autocomplete works despite no spaces at all.)
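A sketch of that "less special space key" idea: if the lexicon stores morpheme segmentations (the '|' boundaries below are invented for illustration, not real analyses), autocomplete can suggest a completion up to the next boundary instead of waiting for a whole word:

```python
# Sketch: autocomplete that treats the space key as less special.
# Lexicon entries carry hypothetical morpheme boundaries ('|' is invented
# markup for illustration); we complete only up to the next boundary.

LEXICON = [
    "talo|ssa|ni",           # "in my house"
    "talo|ssa|si",           # "in your house"
    "talo|lla|kin",          # "at the house, too"
    "lento|kone|mekaanikko", # "aircraft mechanic" (toy segmentation)
]

def partial_completions(typed: str):
    """Extend the typed prefix to the next morpheme boundary of any matching entry."""
    suggestions = set()
    for entry in LEXICON:
        plain = entry.replace("|", "")
        if plain.startswith(typed) and plain != typed:
            covered, out = 0, []
            for piece in entry.split("|"):
                out.append(piece)
                covered += len(piece)
                if covered > len(typed):
                    break  # stop at the first boundary past what was typed
            suggestions.add("".join(out))
    return sorted(suggestions)

print(partial_completions("talos"))   # ['talossa']
print(partial_completions("lentok"))  # ['lentokone']
```

The two entries "talo|ssa|ni" and "talo|ssa|si" collapse into a single suggestion "talossa", which is exactly the kind of partial completion a whole-word autocomplete can't offer.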
Not having a static form might create some redundancy in the lexicon, but that's no more of a problem than the vowel mutation in English "sing", "sang", "sung", "song". Treating different surface realizations of the same underlying base form as independent might actually be beneficial for getting accurate results that take into account how the base form is modified.
In Dutch, like in German, we write compound words without a space, so you can also invent completely "new" words. If I enter such words in Google Translate most of them are correctly translated to English and French. Granted, splitting a word into its compound parts (= several existing words with maybe an "s", "e", or "en" between them for easier pronunciation) is easier than having a base word + inflection + changes for pronunciation, but it shouldn't be impossible for an AI to do this. I do think you'll probably need custom rules or extra information per language, corresponding roughly to some grammatical rules or patterns.
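A minimal sketch of that splitting, assuming a toy Dutch lexicon and the "s"/"e"/"en" linkers mentioned above (a real system would also need frequency information to rank competing splits):

```python
# Minimal recursive compound splitter for Dutch/German-style compounds.
# The lexicon is a toy example; LINKERS are the "s"/"e"/"en" glue letters
# from the comment above. Real systems also rank splits by frequency.

LEXICON = {"hond", "huis", "deur", "fiets", "winkel"}  # dog, house, door, bike, shop
LINKERS = ["", "s", "e", "en"]

def split_compound(word: str):
    """Return one segmentation into lexicon words (greedy, first match), or None."""
    if word in LEXICON:
        return [word]
    for i in range(1, len(word)):
        head = word[:i]
        if head not in LEXICON:
            continue
        rest = word[i:]
        for link in LINKERS:
            if rest.startswith(link):
                tail = split_compound(rest[len(link):])
                if tail:
                    return [head] + tail
    return None

print(split_compound("hondenhuis"))       # ['hond', 'huis'] ("doghouse", linker "en")
print(split_compound("fietswinkeldeur"))  # ['fiets', 'winkel', 'deur']
```

Once split, each part can be translated or looked up independently, which is presumably what lets Google Translate handle invented compounds.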
I would bet that pure text-reading AI can absolutely deduce the rules. Finnish morphology (word structure) isn't that complicated. It's routinely used as an example in linguistics classes because it is simple and regular. If you want to see truly crazy word structure, try Georgian or Navajo.
Actually, in the 1970s Finnish morphology was considered a hard AI problem. I know this for a fact, because I saw Fred Karlsson give a demo on a Xerox Interlisp workstation in 1980. A few years later they obviously realized that only some cheap undergraduate labor was needed.
The problem is of course how to learn the meanings when they are omitted in translations. Even a native speaker is not always sure: is "koiran-ko-han-ko" a tautology, or does the second question marker refer to the "han"? "You wanted a dog, yes?" vs. "You wanted a dog, yes? You sure now?".
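Peeling the clitic chain off is the mechanical part; a toy sketch (the clitic list and the peeling logic are deliberately simplified assumptions) shows the ambiguity survives even a clean split, because the scope of the second "-ko" isn't in the string:

```python
# Segmenting "koiran-ko-han-ko" is mechanical; the hard part is scope.
# CLITICS here is a deliberately tiny subset of the real Finnish inventory.

CLITICS = ["ko", "han"]

def strip_clitics(word: str):
    """Strip clitics from the end of the word; return (stem, clitics in order)."""
    found = []
    stripped = True
    while stripped:
        stripped = False
        for c in CLITICS:
            if word.endswith(c):
                found.append(c)
                word = word[: -len(c)]
                stripped = True
                break
    return word, list(reversed(found))

stem, clitics = strip_clitics("koirankohanko")
print(stem, clitics)   # koiran ['ko', 'han', 'ko']
# The split is unambiguous, but the meaning is not: the string alone doesn't
# say whether the final -ko questions the whole phrase or just the -han.
```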