We haven't actually tried that yet. I imagine that if you customized your model with words from another language and then pronounced them with an English accent, the API might be able to recognize them OK. It would be a fun experiment to try, at least!
If I understand correctly, "customizing the model" essentially adds new words to the vocabulary and adjusts the language model to change the probability of some phrases, but does not require any information about pronunciation, let alone audio samples.
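To make that concrete, here's roughly the kind of request I have in mind; this is just a sketch with a placeholder endpoint and made-up field names ("word_boost", "boost_param"), not necessarily your actual API:

    # Rough sketch of a "custom vocabulary" request as I imagine it: the client
    # only supplies text (extra words/phrases), never audio or phoneme strings.
    # Endpoint and field names here are placeholders, not a real API.
    import requests

    resp = requests.post(
        "https://api.example-asr.com/v2/transcript",   # placeholder endpoint
        headers={"authorization": "YOUR_API_TOKEN"},
        json={
            "audio_url": "https://example.com/meeting.wav",
            # Plain English spellings only -- no pronunciations, no audio samples:
            "word_boost": ["SQL", "Kaldi", "AssemblyAI"],
            "boost_param": "high",
        },
    )
    print(resp.json())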
But isn't having just the English text really error-prone, especially when you are dealing with terms of art and proper names that might even have roots in foreign languages? E.g. some people pronounce SQL as "sequel", and the English pronunciation of French words varies between "French pronunciation with an English accent" and "French orthography interpreted as English orthography". (I'm guessing your model would tend towards the latter?)
So what I'm interested in is whether you have encountered examples of this during your testing, and whether you have some way to work around it (I would try phonemic transcriptions in addition to the English text); or whether this is not relevant for the use cases you are trying to cover, and the convenience of just using English text trumps the resulting accuracy loss.
Hey! Great question. Our system is actually able to handle transcribing "sequel" as "SQL" automatically if you were to
"customize the model" for phrases like "what was my latest SQL query". It can also get words like "colonel" pronounced "kernel". In both cases, without needing the explicit pronunciation of the word. We have some customers who've uploaded thousands of proper names, for example, and we're able to transcribe all of them without needing the explicit pronunciation. This is possible because our ASR implementation is pretty different than traditional setups like Kaldi. You're right that there are some edge cases, especially with foreign words, but we're working hard on smoothing those out.
Sounds amazing! Now I'm really interested in how your setup can do that. Will you publish anything about it, or is this the kind of secret sauce you'd rather keep secret?
You can create phonemic transcriptions as a back-off for unknown words (at least in WFST-based setups), but with things like "sequel" this won't help much.
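A toy sketch of what I mean (the mini-lexicon and the naive letter-by-letter fallback are made up for illustration; a real system would use a trained G2P model):

    # Toy "phonemic back-off": known words come from a lexicon, unknown words
    # get an automatically guessed pronunciation. Everything here is illustrative.
    LEXICON = {
        "query": ["k", "w", "ih", "r", "iy"],
        "sequel": ["s", "iy", "k", "w", "eh", "l"],
    }

    # Naive fallback for out-of-vocabulary tokens: spell the word out letter by
    # letter, which is roughly what you get for an acronym-looking word.
    LETTER_PHONES = {"s": ["eh", "s"], "q": ["k", "y", "uw"], "l": ["eh", "l"]}

    def pronounce(word):
        if word.lower() in LEXICON:
            return LEXICON[word.lower()]
        phones = []
        for ch in word.lower():
            phones.extend(LETTER_PHONES.get(ch, [ch]))
        return phones

    print(pronounce("SQL"))     # -> "es kyu el": the back-off never produces
    print(pronounce("sequel"))  #    "sequel" phones, so audio of someone saying
                                #    "sequel" still decodes to "sequel", not "SQL".

That's exactly why it doesn't help here: the guessed pronunciation of "SQL" is the spelled-out one, so the phones of someone actually saying "sequel" never map to the token "SQL".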
AssemblyAI is apparently using their own TensorFlow implementation, not weighted finite-state transducers as in e.g. Kaldi.
Speaking of WFSTs, why wouldn't it work for "sequel"? I have only done the "Kaldi for Dummies" tutorial (i.e. digit recognition), but from what I understand, you could add a lexicon entry "SQL" with the pronunciation "s iy k w eh l" and add phrases like "SQL query" to the corpus, and this would make it more likely than "sequel query".
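Concretely, something like these additions to the tutorial's data files (paths and phones are just how I remember that setup; the ARPAbet-ish phones are a guess):

    # data/local/dict/lexicon.txt -- one entry per line: WORD phone1 phone2 ...
    SQL s iy k w eh l
    sequel s iy k w eh l

    # data/local/corpus.txt -- sentences the language model is estimated from
    what was my latest SQL query
    run the SQL query again

Since both words would share the same phones, the decoder can only prefer "SQL query" over "sequel query" if the language model does, which is exactly what the extra corpus sentences are for.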