This was a pretty interesting thing to mitigate - we added some support around it to GitLab after it was reported to us, which shipped in the latest security release: https://gitlab.com/gitlab-org/gitlab/-/commit/3fb44197195b57... (you can actually see it in effect on that commit's examples, which is quite meta). These characters have valid use-cases in right-to-left languages like Arabic, Japanese etc, so it had to be configurable for project-owners if they have legitimate use-cases for it. Our focus was on making sure that repository maintainers could see these characters in code reviews.

The homoglyph attack is interesting but it really should be noticed as part of a code review process, as it requires adding the imitation function calls at some point too. It'd also likely be pretty frustrating to end users if we were to highlight every single unicode character that looks like the latin alphabet.

It's certainly a good lesson in not copy/pasting random snippets from the internet and pasting them into a root shell, however :D (we do always highlight the bidi characters on GitLab snippets, though)

Aside: this was a royal pain in the arse to figure out if I had live examples in the specs, because vim also just rendered them "correctly". I ended up checking the files in Windows Notepad on another machine to sanity check them.

Thanks to the authors for responsible disclosure.

> It'd also likely be pretty frustrating to end users if we were to highlight every single unicode character that looks like the latin alphabet.

That actually strikes me as very desirable. (Especially in light of the old maxim that "programs must be written for people to read, and only incidentally for machines to execute".)

Those Unicode characters aren't just there for show. They're part of real scripts that real people use; it would be annoying for people using those scripts.

I'm fairly sure this could be arranged for. As in, if there's too many of them belonging to the character set of a particular language, then it's very likely that it's simply a text in that language. But random characters in the middle of ASCII identifiers are probably not something that you want.

Yeah I'm not opposed to adding highlighting to them, and we are investigating how to do it, but it was less clear-cut than the bidi characters (which are totally invisible when rendered). I think we'll want to make it a bit more configurable and probably a separate option to the one which highlights the bidi characters.

Exactly. When we were adding support for non-ASCII identifiers to Rust, and thinking about homoglyphs and confusable characters, we needed to evaluate the tradeoffs between catching such characters and inconveniencing the speakers of various languages who want to write Rust in their language.

This type of attack isn't new. I can't recall the names, but AFAIR there are multiple C/C++ coding standards that restrict source code to ASCII to avoid precisely this attack, as well as related ones that rely on visually similar but nonequivalent names.

Yes, and they should be in well annotated/marked string/data sections, not in logic code.

Latin C and Cyrillic С aren't the same letter. The latter is actually an "s". It would be a pain in the ass to work with strings if those Cyrillic letters that look like their Latin counterparts reused their codepoints. Imagine having to convert "M" to lowercase. Would that return "m" or "м"? Same for "H", "h" or "н"?
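
To make the case-mapping point concrete, here is a small sketch using only Python's standard library. The helper name `lower_info` is made up for illustration; the point is just that the two look-alike capitals are distinct codepoints, so each one lowercases within its own script.

```python
import unicodedata


def lower_info(ch):
    """Return (codepoint of lowercase form, Unicode name of lowercase form)."""
    lower = ch.lower()
    return ord(lower), unicodedata.name(lower)


latin_m = "M"          # U+004D LATIN CAPITAL LETTER M
cyrillic_m = "\u041c"  # U+041C CYRILLIC CAPITAL LETTER EM (looks identical)

for ch in (latin_m, cyrillic_m):
    cp, name = lower_info(ch)
    print(f"U+{ord(ch):04X} -> U+{cp:04X} {name}")
```

If the two letters shared a codepoint, `lower()` would have no way to choose between the two results.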

And, actually, there was some really really cursed Soviet encoding that did this to save bits. The Russian railway company still uses it[1] to this day.

[1] https://habr.com/ru/post/547820/

> there was some really really cursed Soviet encoding

I know at least 10 stories that start like this

> Latin C and Cyrillic С aren't the same letter.

Well, as a moderately old Czech, I'm somewhat familiar with Cyrillic. They kind of used to force it on us in schools.

> this was a royal pain in the arse to figure out if I had live examples in the specs, because vim also just rendered them "correctly"

That's because vim has supported Farsi/Arabic natively from day one. Even if the OS does not support it, you can still write bidirectional and right-to-left text in vim. Never knew the reason, but thanks Bram Moolenaar.

I was impatient to find the example you were talking about; as far as I can tell, this is the line with the example: https://gitlab.com/gitlab-org/gitlab/-/commit/3fb44197195b57...

And here's what it looks like in various conditions/viewers:

With the fix, this is how it looks in the browser in the GitLab interface:

    if (accessLevel != "user�") {� // Check if admin ��
Without the fix, viewed raw (and thus viewed in a vulnerable way), it looks like this:

    if (accessLevel != "user") { // Check if admin
And in a hex viewer, it looks like this:

    000005b0: 2020 2020 2020 2069 6620 2861 6363 6573         if (acces
    000005c0: 734c 6576 656c 2021 3d20 2275 7365 72e2  sLevel != "user.
    000005d0: 80ae 20e2 81a6 2f2f 2043 6865 636b 2069  .. ...// Check i
    000005e0: 6620 6164 6d69 6ee2 81a9 20e2 81a6 2229  f admin... ...")
    000005f0: 207b 0a20 2020 2020 2020 2020 2020 2020   {.
    00000600: 2063 6f6e 736f 6c65 2e6c 6f67 2822 596f   console.log("Yo
    00000610: 7520 6172 6520 616e 2061 646d 696e 2e22  u are an admin."

That's a great example ^ that demonstrates exactly how this vulnerability can be easily abused
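
For anyone who wants to check a file without reading hex, here is a minimal sketch of a scanner. It is not GitLab's implementation; `find_bidi_controls` is a hypothetical helper, and the set of characters is assumed to be the explicit embedding, override and isolate controls (U+202A to U+202E and U+2066 to U+2069). The sample bytes are copied from the hex dump above.

```python
import unicodedata

# Assumed set: explicit bidi embeddings/overrides and isolates.
BIDI_CONTROLS = {
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}


def find_bidi_controls(text):
    """Return (line, column, 'U+XXXX', name) for each bidi control found."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append(
                    (line_no, col, f"U+{ord(ch):04X}", unicodedata.name(ch))
                )
    return hits


# The `if` line from the hex dump, as raw UTF-8 bytes.
raw = (
    b'if (accessLevel != "user\xe2\x80\xae \xe2\x81\xa6'
    b'// Check if admin\xe2\x81\xa9 \xe2\x81\xa6") {'
)
sample = raw.decode("utf-8")

for hit in find_bidi_controls(sample):
    print(hit)
```

The three byte sequences in the dump (`e2 80 ae`, `e2 81 a6`, `e2 81 a9`) decode to a right-to-left override, a left-to-right isolate and a pop directional isolate.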

I was intrigued by your meta example and I took a look. It took me 3-4 minutes to find the warning, and I was looking for it!

I was expecting a big fat warning on the merge request itself, or maybe on the lines containing the dangerous chars.

In the end, it is a small ? character inserted where the unicode control chars are, and a mouseover tooltip warning about a potential issue.

The warning is good, but why so subtle? Sorry for the criticism. The feature is still a huge positive.

Thanks for the feedback! Our primary use-case when deciding on it was to flag these up in a code-review situation, to prevent malicious content being submitted in merge requests to unsuspecting projects. We found this made it stand out enough to the reviewer when performing code reviews. I also try to not be too quick to add new alerts or sections to the GUI as we sometimes get criticised for having too much clutter D:

GitHub by comparison went down the alert banner route, from what I can see. I'm not opposed to adding something to that effect as well though - especially for inexperienced reviewers, it would be nice to include some more information about the potential exploit. That could be something we revisit when we add the homoglyph highlighting.

Thus, one sloppy review by that known tired-in-the-mornings dev, "sure thing, looks like Java..", and your little marking is missed?

I personally wish that in repos with the warning enabled, that the �s were displayed in lieu of the malicious characters instead of in addition to them. For example, I'd rather see this:

          var accessLevel = "user";
          if (accessLevel != "user� �// Check if admin� �") {
              console.log("You are an admin.");
than this:

          var accessLevel = "user";
          if (accessLevel != "user�") {� // Check if admin�� 
              console.log("You are an admin.");

Is that possible to do using CSS with our existing markup? Currently we prepend the � using ::before. I imagine we could probably hide the existing character and shuffle the � over where it should be, but it might need some testing across different text sizes I imagine. I'll make a note of it for our next revision :)

I don't think what I want is possible with a pure-CSS solution, but I'm not 100% sure.
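
Outside of CSS, the substitution itself is easy to do on the text before it is rendered. A rough sketch, assuming the same nine bidi controls as the Trojan Source write-up and using U+FFFD as the marker; `mask_bidi_controls` is a made-up name and this says nothing about how GitLab's markup is actually generated.

```python
import re

# Assumed set: U+202A..U+202E (embeddings/overrides), U+2066..U+2069 (isolates).
BIDI_RE = re.compile("[\u202a-\u202e\u2066-\u2069]")


def mask_bidi_controls(text, marker="\ufffd"):
    """Replace each bidi control with a visible marker, in logical order."""
    return BIDI_RE.sub(marker, text)


line = 'if (accessLevel != "user\u202e \u2066// Check if admin\u2069 \u2066") {'
print(mask_bidi_controls(line))
```

Because the controls are removed rather than kept alongside the marker, the line renders in its logical (byte) order, which is the first of the two renderings shown above.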

> It'd also likely be pretty frustrating to end users if we were to highlight every single unicode character that looks like the latin alphabet.

Have you tried something similar to what the browsers do where highlighting is only enabled when there are multiple scripts mixed within the same token? Source code seems like it would be harder since you have many tokens rather than just a single one as in a hostname, and I'd be curious how much legitimate usage mixes scripts for technical reasons because you have something like a language or framework convention that certain names start with a particular English-derived term.
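
A very rough sketch of that per-token idea, for illustration only. Python's standard library has no script property, so this uses the first word of each character's Unicode name ("LATIN", "CYRILLIC", "GREEK", ...) as a stand-in for the script; that is an approximation, and both function names are hypothetical.

```python
import re
import unicodedata


def scripts_in(token):
    """Approximate the set of scripts used by the letters in a token."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # First word of the name, e.g. "LATIN" or "CYRILLIC".
                scripts.add(name.split()[0])
    return scripts


def mixed_script_tokens(source):
    """Return word-like tokens whose letters come from more than one script."""
    return [t for t in re.findall(r"\w+", source) if len(scripts_in(t)) > 1]


# The third letter of the identifier is Cyrillic U+0441, not Latin "c".
source = 'ac\u0441essLevel = "\u043f\u0440\u0438\u0432\u0435\u0442"'
print(mixed_script_tokens(source))
```

The all-Cyrillic string literal is left alone, while the identifier with a single Cyrillic letter in the middle gets flagged.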

So far we're just detecting individual bidi characters, but looking at characters in their greater context could be quite interesting. This would seem like quite a good use-case for machine-learning too, if you wanted to get super into it.

> It's certainly a good lesson in not copy/pasting random snippets from the internet and pasting them into a root shell, however

I gotta say that I always make sure that I understand each piece of code that I copy paste but I do copy paste and never thought of this type of attack. Maybe that's something I should pay attention to in the future.

From the article, it's likely you'd not even notice, unless you pasted into an ASCII-only editor that doesn't allow anything other than plain old text.

> It's certainly a good lesson in not copy/pasting random snippets from the internet...

For someone with more gumption than me:

Future copy & paste will by default have intermediate screenshot and OCR steps. Voilà: charset scrubbing for free.

Why not? Already today misc UIs and renderings disallow text selection. Drives me nuts.

The future is now. Android has been doing this for years and it's awesome. There's no text you can't copy.

To clarify, by default copy and paste works the normal way, but you can open the app switcher to use the OCR copy/paste which works on non-selectable text too, even in images.

There's a way to prevent this - to my great annoyance, health apps (such as the ubiquitous MyHealth variants) and banking apps can prevent you from taking screenshots or copying text. This is presumably to prevent screen-scraping apps from stealing your private data, but it's really annoying when you're trying to screenshot a QR code for some kind of check-in process.

That's why you need a second phone to photograph the screen of the first phone.

If you root your phone, you can use an Xposed module like DisableFlagSecure to get around apps that do that.

This is too complicated for a personal supercomputer to be burdened with. Better to ship everything on the clipboard to a sanitizer service.

>These characters have valid use-cases in right-to-left languages like Arabic, Japanese etc,

I've never seen it used for Japanese. I don't think there is a valid use case for Japanese.

Ah yes you're right - looks like that can be handled with CSS: https://www.w3.org/International/articles/vertical-text/. Although from what I've seen most Japanese websites tend to be left-to-right instead anyway.

Hebrew would be a more valid second example I think. I'd be curious to know how many languages maintain their RTL preference online.

Japanese¹ isn't a right to left language, exactly. It can be written horizontally, in which case it's L-R, top to bottom, or, vertically, in which case it's top to bottom, with columns running R-L, but functionally, this is still like L-R typesetting, just with the characters rotated 90° CCW and the pages are then read in the same order as pages in a R-L book. This is typical of manga which is why there might have been confusion by the OP about the directionality of Japanese.


1. All of this also applies to Chinese and Korean. Interestingly, traditional Mongolian script is also written vertically, but in columns left to right rather than right to left.

This doesn't feel particularly new either? Isn't it pretty much a new variant of https://github.com/reinderien/mimic ?

Which, if one is suspicious of code, can be defeated in vim with: set encoding=latin1

Which breaks other things, such as every other string that's not written in English. But it's a great tip for a quick check, thanks! (Much more convenient than piping text through xxd)

Yeah, it's definitely just for quick checks if the text is in fact using unicode. But, hopefully just for stuff you're suspicious of where you could mandate no-unicode.
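
For the "mandate no-unicode" case, the same quick check can be scripted outside of vim. A minimal sketch, with `non_ascii_positions` as a hypothetical helper name; it only reports where the non-ASCII characters are and does not try to judge whether they are legitimate.

```python
def non_ascii_positions(text):
    """Return (line, column, 'U+XXXX') for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        if line.isascii():
            continue  # fast path for the common all-ASCII line
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits


snippet = 'ok = 1\nif level != "user\u202e":\n    pass\n'
print(non_ascii_positions(snippet))
```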

