

acdha on Nov 1, 2021 | on: ‘Trojan Source’ Bug Threatens the Security of All ...

> It'd also likely be pretty frustrating to end users if we were to highlight every single unicode character that looks like the latin alphabet.

Have you tried something similar to what browsers do, where highlighting is only enabled when multiple scripts are mixed within the same token? Source code seems harder, since you have many tokens rather than a single one as in a hostname. I'd also be curious how much legitimate code mixes scripts for technical reasons, for example because a language or framework convention requires certain names to start with a particular English-derived term.
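To make the per-token idea concrete, here is a rough sketch in Python. It is only an illustration, not what any browser or code host actually does: the function names `scripts_in` and `mixed_script_tokens` are made up, and since the standard library has no Unicode script property, it assumes the first word of each character's Unicode name (e.g. `LATIN`, `CYRILLIC`) is a good enough stand-in for its script.

```python
import re
import unicodedata


def scripts_in(token):
    """Return the set of rough script labels used by the letters in a token.

    Assumption: the first word of a character's Unicode name approximates
    its script. Digits and underscores are ignored.
    """
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def mixed_script_tokens(source):
    """Return the word-like tokens in source that mix more than one script."""
    tokens = re.findall(r"\w+", source)
    return [t for t in tokens if len(scripts_in(t)) > 1]


# "p\u0430yload" has a Cyrillic "а" (U+0430) hiding among Latin letters.
suspicious = mixed_script_tokens("user = p\u0430yload")
```

A name like `get` followed by a Cyrillic word would be flagged too, which is exactly the legitimate-mixing case raised above, and so would ordinary Japanese identifiers that combine kanji and kana. A real implementation would need the proper Unicode script data plus some kind of allowlist.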

robotmay on Nov 1, 2021

So far we're just detecting individual bidi characters, but looking at characters in their wider context could be quite interesting. It seems like a good use case for machine learning too, if you wanted to get really into it.
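For reference, detecting individual bidi characters can be sketched roughly as below. This is not the actual implementation being described here, just a minimal Python version under the assumption that the characters of interest are the nine Unicode bidirectional embedding, override, and isolate controls; `BIDI_CONTROLS` and `find_bidi_controls` are hypothetical names.

```python
# Unicode bidirectional formatting controls (embeddings, overrides, isolates).
BIDI_CONTROLS = {
    "\u202a": "LRE",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b": "RLE",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c": "PDF",  # POP DIRECTIONAL FORMATTING
    "\u202d": "LRO",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e": "RLO",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066": "LRI",  # LEFT-TO-RIGHT ISOLATE
    "\u2067": "RLI",  # RIGHT-TO-LEFT ISOLATE
    "\u2068": "FSI",  # FIRST STRONG ISOLATE
    "\u2069": "PDI",  # POP DIRECTIONAL ISOLATE
}


def find_bidi_controls(source):
    """Return (line, column, abbreviation) for each bidi control in source.

    Line and column numbers are 1-based; columns count code points.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, BIDI_CONTROLS[ch]))
    return hits


# A right-to-left override tucked into a comment on the second line.
hits = find_bidi_controls("a = 1\nif x: # \u202e admin")
```

A scan like this flags every occurrence regardless of context, so it cannot tell a legitimate right-to-left string literal from an attack; that is the gap a context-aware check would have to fill.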
