> It'd also likely be pretty frustrating to end users if we were to highlight every single Unicode character that looks like the Latin alphabet.
Have you tried something similar to what browsers do, where highlighting is only enabled when multiple scripts are mixed within the same token? Source code seems like it would be harder, since you have many tokens rather than a single one as in a hostname. I'd also be curious how much legitimate code mixes scripts for technical reasons, for example because a language or framework convention requires certain names to start with a particular English-derived term.
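To make the idea concrete, here's a rough sketch of what a per-token mixed-script check might look like. It is not how any browser actually implements this: Python's standard library doesn't expose the Unicode Script property, so the hypothetical `script_of` helper below approximates it with the first word of each character's Unicode name, and `mixed_script_tokens` is just an illustrative name.

```python
import re
import unicodedata


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name.

    Assumption: this is a stand-in for the real Unicode Script property,
    which the standard library does not provide.
    """
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no name
        return "UNKNOWN"


def mixed_script_tokens(source):
    """Return (token, scripts) pairs for word tokens whose letters span more than one script."""
    flagged = []
    for token in re.findall(r"\w+", source):
        # Only letters count; digits and underscores are ignored.
        scripts = {script_of(ch) for ch in token if ch.isalpha()}
        if len(scripts) > 1:
            flagged.append((token, sorted(scripts)))
    return flagged


# The first two letters are Cyrillic look-alikes of Latin "p" and "a".
print(mixed_script_tokens("\u0440\u0430ypal = fetch(user_name)"))
```

Identifiers written entirely in one script pass untouched, which is the point of the per-token approach. The name-prefix heuristic is crude, though: it would flag ordinary Japanese identifiers that mix kana and kanji, so a real implementation would presumably want proper script data and an allowlist of combinations.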
So far we're just detecting individual bidi characters, but looking at characters in their wider context could be quite interesting. It also seems like a good use case for machine learning, if you wanted to get super into it.