Nice idea, but a naive implementation, which leads to output that is unconvincing as hypothetical English words. I had a brief look, and it seems to be proportionally selecting and sticking together sequences of letters sampled from English words (lib/word-probability.ts). This doesn't take into account syllable boundaries, the way the English spelling system maps phones/phonemes to letters, or the phonotactic properties of English, which is why the output looks unconvincing.
A better approach would be to use a Markov chain built by sampling English text letter by letter... an even better approach would be to build your stats from some source of English words in IPA transcription with syllable boundaries etc. marked, then map from IPA to spelling via some kind of lookup table. We use a similar process in reverse in my research group for building datasets for Bayesian phylogenies of language families.
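To make the first suggestion concrete, here's a minimal sketch of a letter-by-letter Markov chain in TypeScript. All names here are mine, the training word list is assumed to be supplied by you, and none of this comes from the site's actual code:

    // Minimal letter-level Markov chain. "^" marks word start, "$" word end.
    type Transitions = Map<string, Map<string, number>>;

    function train(words: string[]): Transitions {
      const t: Transitions = new Map();
      for (const w of words) {
        const chars = ["^", ...w.toLowerCase(), "$"];
        for (let i = 0; i < chars.length - 1; i++) {
          const counts = t.get(chars[i]) ?? new Map<string, number>();
          counts.set(chars[i + 1], (counts.get(chars[i + 1]) ?? 0) + 1);
          t.set(chars[i], counts);
        }
      }
      return t;
    }

    // Draw one character in proportion to its observed frequency.
    function sample(counts: Map<string, number>): string {
      const total = [...counts.values()].reduce((a, b) => a + b, 0);
      let r = Math.random() * total;
      for (const [ch, n] of counts) {
        r -= n;
        if (r <= 0) return ch;
      }
      return "$"; // unreachable fallback
    }

    function generate(t: Transitions): string {
      let ch = "^";
      let out = "";
      while ((ch = sample(t.get(ch)!)) !== "$") out += ch;
      return out;
    }

    // const chain = train(wordList); // wordList: any English word list you have
    // console.log(generate(chain));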
Clearly you are far more of a linguist than I am, but even from my lay perspective I had a similar impression; I reloaded the page several times and none of the words struck me as remotely plausible English. These are worse than most Hollywood sci-fi words/names.
A significant improvement on letter-by-letter, but not that much harder, is to use n-grams: "two letters to predict the third" etc. Still not "industry grade", but the results start making more sense.
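A hedged sketch of that extension, reusing the Transitions type and sample() from the chain sketch above. The order parameter is my own generalization ("two letters to predict the third" is order 2):

    // Order-n variant: the last n characters predict the next one.
    function trainNgram(words: string[], order = 2): Transitions {
      const t: Transitions = new Map();
      for (const w of words) {
        const chars = [...Array(order).fill("^"), ...w.toLowerCase(), "$"];
        for (let i = order; i < chars.length; i++) {
          const key = chars.slice(i - order, i).join("");
          const counts = t.get(key) ?? new Map<string, number>();
          counts.set(chars[i], (counts.get(chars[i]) ?? 0) + 1);
          t.set(key, counts);
        }
      }
      return t;
    }

    function generateNgram(t: Transitions, order = 2): string {
      let ctx = "^".repeat(order);
      let out = "";
      while (true) {
        const ch = sample(t.get(ctx)!);
        if (ch === "$") break;
        out += ch;
        ctx = (ctx + ch).slice(-order); // slide the context window forward
      }
      return out;
    }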
A letter-by-letter Markov chain would lead to similarly unconvincing results. As you said, groups of sounds matter much more than single letters. If you know anything about Korean, its writing system actually groups letters into syllable blocks that way. If one could build such a Markov chain for English, I think it would be very convincing.
You should check out the VOLT paper ("Vocabulary Learning via Optimal Transport"), I think it would work well here. It's a recent technique for splitting up a vocabulary into subwords while minimizing entropy. These subwords could then be mixed and matched, maybe by a neural model, for better results.
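Not VOLT itself (that involves the optimal transport machinery from the paper), but a toy illustration of the mix-and-match step once you have a subword vocabulary from VOLT, BPE, or anything similar. The vocabulary and function here are invented for illustration:

    // Toy subword mix-and-match: glue 2-4 random subwords together.
    // A real vocabulary would come from VOLT/BPE/etc.; this one is made up.
    const subwords = ["sta", "ble", "tion", "cro", "mand", "er", "ing", "al"];

    function mixWord(): string {
      const n = 2 + Math.floor(Math.random() * 3); // 2 to 4 pieces
      let out = "";
      for (let i = 0; i < n; i++) {
        out += subwords[Math.floor(Math.random() * subwords.length)];
      }
      return out;
    }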
I'm sure some people don't hear it, like "the dress", but for some of us it sounds like an Uncanny Valley of English: close but not quite, just enough for our brains to trip over / struggle to comprehend b/c it is so close.
As well as the associations with [1], this also made me think of one of my favourite essays, "Horsehistory study and the automated discovery of new areas of thought"[2]
Sorry, after a few refreshes not a single word was anything that looked remotely like English. It all looked like complete gibberish or words in another language. Most of them weren’t even pronounceable.
I think ailml is the offending sequence here. It's pretty difficult to say and doesn't sound like something that you'd find in a native English word.
There's calmly, which is similar, to be fair, but there's something about the tongue positions for ailml that I find noticeably more difficult; it's too far forward.
As with flailmen, you've put a syllable break (and a morpheme break!) between the L and the M. This will make continuing the sequence into -ailml- impossible, since an English syllable can't start with ml-.
Interestingly, there's nothing wrong in general with starting a syllable with ml-; it's fundamentally the same mouth motion as starting with bl- or pl-, both of which are common in English. But ml- isn't allowed.
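That constraint is easy to check mechanically: a generator could filter candidate syllables with a legal-onset test along these lines. The onset list below is a small illustrative sample, not an exhaustive inventory of English onsets:

    // Reject a syllable whose initial consonant cluster (onset) is one
    // English doesn't allow. Tiny illustrative list, far from complete.
    const legalOnsets = new Set([
      "", "b", "bl", "br", "c", "ch", "cl", "cr", "d", "dr", "f", "fl",
      "fr", "g", "gl", "gr", "h", "j", "k", "l", "m", "n", "p", "pl",
      "pr", "r", "s", "sh", "sl", "st", "str", "t", "th", "tr", "v", "w", "y",
    ]);

    // "mlord" -> onset "ml" -> rejected; "plot" -> onset "pl" -> accepted.
    function hasLegalOnset(syllable: string): boolean {
      const onset = (syllable.match(/^[^aeiou]*/) ?? [""])[0];
      return legalOnsets.has(onset);
    }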
This plays into a pet observation of mine: an underappreciated constraint on the space of words that actually exist in a language -- as opposed to the space of words that could conceivably exist -- is that, by and large, they must descend from older words in an older form of the language. So even if a word like "plailm" obeys the rules for modern English syllables, it can't exist, because its precursor word would have violated the rules for older English sounds. (I don't know whether this is actually true of "plailm", but the phenomenon of possible sounds failing to exist because their precursors were impossible is real.)
Milord and milady do not involve syllables starting with ml-. They involve a reduced vowel coming between the /m/ and the /l/, making milord two syllables and milady three. They also aren't spelled "mlord" or "mlady"; your options are "milord", "milady", "m'lord", or "m'lady".
Yes. That is why I used the word “;)” at the end there. And, yes, I know ;) is not a word.
I’ve been splained that one of the reasons “humor” sometimes doesn’t play well on HN is that people here have such a wide diversity of English grok. I didn’t anticipate someone could have too much knowledge, but, huzzah, there 'tis, one small step f’r ’man, and so forth.
In that case, you might wish to know that putting a smile at the end of a comment like that is also a common way of calling the person you're talking to stupid.
Welp, it was a wink, not a smile. The intention was good-natured. Just havin' fun on the internet with my new pal, Thaumasiotes, who is plainly the only other person among the swarming billions who found this tiny quirk of language worth blethering about with me. I hear you, mostly people think this sort of mishegas is nutballs. Their loss.
Flailmen: Awkward males, made uncomfortable and rendered incoherent by the close proximity of a romantic interest. Also, medieval warriors wielding flails.
Can anyone tell me more about how this works? Most of these don't resemble English words at all to me, lol. Wondering what the generative procedure/parameters are in the first place.
a slender, membranous musclelike structure, believed to represent a cross between a cranium and the external spaces of fish and invertebrates, supporting the glans in most vertebrates
"a dynoderma is thought to have existed in all living organisms"
The following is the text of a recent HN comment (not my own) on the subject of non-drug highs. As it suggests starting with a nonsense word, the OP's fake-word generator ought to suit:
> It is really quite easy: have someone you know provide a nonsense word. It needs to have no logical sense or connections to anything - pure nonsense. Then, with that phrase held in your most present and loudest inner voice, you repeat that phrase in your head. Repeat it over and over, forcefully, to drive any other thoughts or thought fragments out of your mental conversation(s) (at all mental conversation levels, if you have more than one going at once). After a few minutes of forceful repeating, it echoes on its own, and a few realization moments later 20-30 minutes have passed and it feels like waking from a refreshing dream. When in the "state", it really can't be described, because it is whatever your imagination and recent experiences froth back and forth. It's relaxing and refreshing, and a great way to clear one's head when working on difficult, complex mental goals.
Aimlessly flying through Dasher can create some pretty plausible new words. It’s worth playing around with if you haven’t seen it. It’s in most Linux package managers.
I recently started playing the NYT Spelling Bee game. There you find yourself wishfully inventing a lot of plausibly English-sounding words, only to learn that, e.g., "vilicent" is indeed not part of the language. IMO the quality of these generated words is low compared to what a human being comes up with.
So many of the generated names sound like pharmaceutical brands...
Also, if anyone is playing the NYT's Spelling Bee game, you've probably become pretty familiar with common English three/four-letter combos, and then you iterate/manipulate them to find words. It's all about the patterns!
These read just as plausibly as "Transient companies selling low quality imported products on Amazon." If perhaps a bit too easily pronounced in English.
This or pronounceable password generators are great for making usernames for random sites. Sometimes you can even get the .com for them! (if you’re into that)