
Thank you for your generous offer of help! I look forward to taking it up (may take a while as I'm about to move countries and quarantine).

In particular, I love that one of the examples in your comment history is in Latin, as that language is not currently supported by Postgres FTS. Are Latin and Ancient Greek supported by Manticore? (Dare I hope for Anglo-Saxon...)




In terms of advanced NLP (stemming, lemmatization, stopwords, wordforms): no. In terms of general tokenization: I've never dealt with Latin or Ancient Greek characters (if those languages have specific characters at all), but even if they aren't supported by default, it's no problem to add them in the config (https://mnt.cr/charset_table).


For the character mappings, it might be useful to have a look at the config for https://tatoeba.org (or rather, the PHP script that generates the config): https://github.com/Tatoeba/tatoeba2/blob/dev/src/Shell/Sphin...

There's one big list of mappings for almost every script under the sun, including Greek. (With mappings like 'U+1F08..U+1F0F->U+1F00..U+1F07' turning U+1F08 Ἀ [CAPITAL ALPHA WITH PSILI] into U+1F00 ἀ [SMALL ALPHA WITH PSILI], and the same for seven other accented alphas. I've considered turning them all into unaccented alpha instead, but I don't know enough about Greek orthography to decide that.) https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...
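To make the range mapping above concrete, here is a rough Python sketch of what a rule like 'U+1F08..U+1F0F->U+1F00..U+1F07' expands to, next to the "unaccented alpha" alternative. The helper names (expand_range_rule, strip_accents) are made up for illustration and are not part of Manticore, Sphinx, or the Tatoeba script; the rule parser only handles this one range-to-range form.

```python
import re
import unicodedata

# Matches only the range-to-range form quoted above, e.g.
# 'U+1F08..U+1F0F->U+1F00..U+1F07'. Other rule forms are not handled here.
RULE = re.compile(
    r"U\+([0-9A-Fa-f]+)\.\.U\+([0-9A-Fa-f]+)->U\+([0-9A-Fa-f]+)\.\.U\+([0-9A-Fa-f]+)"
)


def expand_range_rule(rule):
    """Expand one range rule into an explicit {source_char: target_char} dict."""
    match = RULE.fullmatch(rule.strip())
    if match is None:
        raise ValueError("unsupported rule: %r" % rule)
    src_lo, src_hi, dst_lo, dst_hi = (int(g, 16) for g in match.groups())
    if src_hi - src_lo != dst_hi - dst_lo:
        raise ValueError("source and target ranges differ in length")
    return {chr(src_lo + i): chr(dst_lo + i) for i in range(src_hi - src_lo + 1)}


def strip_accents(ch):
    """Alternative folding: drop all combining marks, then lowercase."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.lower()


mapping = expand_range_rule("U+1F08..U+1F0F->U+1F00..U+1F07")
for src, dst in mapping.items():
    # Left: case folding that keeps the accents. Right: accents stripped too.
    print("U+%04X %s -> %s | unaccented: %s" % (ord(src), src, dst, strip_accents(src)))
```

The first column keeps the breathing and accent marks and only folds case, which is what the quoted rule does; the second shows what collapsing everything to a plain alpha would look like. Whether that second folding is appropriate for Greek search is exactly the open question mentioned above.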

For Latin, there are some special exceptions so that "GAIVS IVLIVS CAESAR" and "Gaius Julius Caesar" are treated the same: https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...
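As a rough illustration of the idea (not the actual Tatoeba rules, which live behind the link above), one way to make the two spellings compare equal is to lowercase and then fold j to i and v to u. The fold_latin helper and its two-letter table are assumptions made for this sketch.

```python
# Hypothetical folding table: classical Latin writes I/V where later
# orthography uses J/U, so fold both spellings to one form.
LATIN_FOLD = str.maketrans({"j": "i", "v": "u"})


def fold_latin(text):
    """Lowercase, then map j->i and v->u so both spellings index the same."""
    return text.lower().translate(LATIN_FOLD)


print(fold_latin("GAIVS IVLIVS CAESAR"))
print(fold_latin("Gaius Julius Caesar"))
```

Both calls produce the same folded string, so a query in either spelling would hit the same tokens. The catch is that this folds v to u everywhere, which would be wrong for any other language sharing the index, so a multilingual setup would need to scope it somehow.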

It's not beautiful, but it's used in production. People who don't need to support quite as many languages as Tatoeba will probably want a simpler config, but it might still be useful as a reference.



