Someone needs to make a script which runs through a git repo's commit history and looks for commits which add invisible Unicode characters. Maybe some existing exploits could be found in the wild.
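A rough sketch of what such a script might look like, in Python. It assumes the input is the text produced by `git log -p`, and it treats "invisible" as Unicode general category Cf (format characters), which is only one possible definition. The names `invisible_chars` and `scan_patch` are made up for illustration.

```python
import unicodedata

def invisible_chars(text):
    """Return (index, 'U+XXXX') pairs for format-category (Cf) characters."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text)
            if unicodedata.category(ch) == "Cf"]

def scan_patch(patch_text):
    """Yield (commit, added_line, hits) for added lines containing Cf characters.

    Assumes `patch_text` looks like the output of `git log -p`.
    """
    commit = None
    # Split on "\n" only, so Unicode line separators stay inside their line.
    for line in patch_text.split("\n"):
        if line.startswith("commit "):
            commit = line.split()[1]
        elif line.startswith("+") and not line.startswith("+++ "):
            # "+++ b/path" is a file header, not an added line.
            hits = invisible_chars(line[1:])
            if hits:
                yield commit, line[1:], hits
```

To run it against a real repo you could feed it `subprocess.run(["git", "log", "-p"], capture_output=True, text=True).stdout`. Expect some noise from legitimate RTL text and from files that start with a byte order mark.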
The following code uses this technique. It's valid Lua. If you highlight this and copy it to your system clipboard, running `pbpaste | luajit` will count to 5 three times:
=print
=function()
for =1,5 do
()
end
end
((()))
True, but in this case the invisible characters are simply names, and the rest is verifiably benign code. It's not like we're copy-pasting a binary or anything.
I've been thinking of ways to make it seem like verifiably benign code, while doing something "interesting."
For example:
print("")
This is benign, but that's not an empty string. The string contains a bunch of U+200E and U+200F characters, even though it appears empty. It proves that you can have strings with invisible characters in them.
Since we have two types of invisible characters, U+200E and U+200F, we can use those as binary digits -- 1 and 0. Thus, we can write a function that takes an invisible string as input, and returns a normal string as output.
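Here's a minimal sketch of that encode/decode pair, written in Python rather than Lua. The mapping (U+200E as 1, U+200F as 0), the 8-bits-per-UTF-8-byte framing, and the function names are all assumptions; any consistent scheme would do.

```python
ONE = "\u200e"   # LEFT-TO-RIGHT MARK, assumed here to mean bit 1
ZERO = "\u200f"  # RIGHT-TO-LEFT MARK, assumed here to mean bit 0

def encode(text):
    """Turn a normal string into a string of invisible marks."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("utf-8"))
    return "".join(ONE if bit == "1" else ZERO for bit in bits)

def decode(invisible):
    """Turn a string of invisible marks back into a normal string."""
    # Ignore anything that is not one of the two marks.
    bits = "".join("1" if ch == ONE else "0"
                   for ch in invisible if ch in (ONE, ZERO))
    usable = len(bits) - len(bits) % 8  # drop a trailing partial byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8")
```

Each input byte becomes eight invisible characters, so the "empty" string grows quickly, which is something a scanner could look for.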
So, what kind of string could we feed it? One possibility would be to convert something like "echo 'command-line injection'" into an invisible string. We'd pass that into our decoder function, and pass the result into os.execute. Since the conversion function mentioned above can be identified with an invisible variable name, it would look similar to this:
os.execute((""))
That looks very suspicious, but we can do better. In Lua, you can index into tables with strings. And we have a function which can take invisible strings and produce normal strings.
Making this work is left as an exercise for the reader. :)
Another interesting approach would be to iterate through the "os" table a fixed number of times, until reaching the "execute" key. The iteration order isn't guaranteed, but given a certain version of LuaJIT, I think it's stable. That means you'd be able to do the equivalent of "os.execute" while making it look like you're "counting to 5."
Ignorable (invisible) unicode characters have caused security vulnerabilities in the past, especially on HFS+ filesystems running on OS X (due to normalization):
The language is perfect for literate programming. All human-readable characters are comments by default, so you can write completely human-friendly prose, and make the machine-readable content invisible to humans.
It might be useful to have an option in programming editors to display all the "non-obvious" Unicode characters in the form of <U+XXXX>. One concern is that some legitimate text will break, but it would be worth it, considering most code is written in English anyway.
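As a sketch of what that option could do, here's a small Python function. "Non-obvious" is taken to mean general category Cf (format characters), which is an assumption; a real editor would probably want a broader, configurable list. The name `reveal` is made up.

```python
import unicodedata

def reveal(text):
    """Replace format-category (Cf) characters with a visible <U+XXXX> tag."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)
```

Ordinary accented letters pass through untouched, so most non-English text survives. Text that relies on directional marks or joiners is where the breakage would show up.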
It's trivial to convert between this and regular Brainfuck syntax: just replace characters. The article even gives an example using a Perl one-liner :)
Revision: I believe I misinterpreted the intention of your post; you wanted to expose tricks like these. I'd be fine with this.
It uses "U+FEFF ZERO WIDTH NO-BREAK SPACE", also known as "BYTE ORDER MARK" -- which means that an Anguish program that starts with that character might not survive translation to or from UTF-16.
At first glance, I'd consider it a (security) bug in Perl 6 that it permits tokens containing invisible characters, let alone consisting solely of invisible characters. Are there any other languages with this behavior?
As a random example:
titan:~ geofft$ python3 -c "$(printf "\u2063") = 1"
File "<string>", line 1
= 1
^
SyntaxError: invalid character in identifier
If you change it to e.g. U+00E9 ("é"), Python 3 permits the character, so it's not just a lower-ASCII thing.
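You can check the same thing without going through the parser, using `str.isidentifier()`. A quick sketch (the helper name `identifier_report` is made up):

```python
def identifier_report(chars):
    """Map each character's code point to whether it is a valid one-char identifier."""
    return {f"U+{ord(ch):04X}": ch.isidentifier() for ch in chars}

# U+2063 INVISIBLE SEPARATOR, U+00E9 "é", U+200E LEFT-TO-RIGHT MARK
report = identifier_report(["\u2063", "\u00e9", "\u200e"])
```

Here `report` marks U+00E9 as a valid identifier and the two invisible characters as invalid, which matches the SyntaxError above.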
Those aren't valid tokens. "<newline> = 10" won't assign to a variable named <newline>. If the language wants to parse non-ASCII invisible characters as whitespace, or permit them inside comments or strings, that's fine.
For instance, zero-width spaces and other word-break characters help with reasonable text layout, but should be invisible. RTL and LTR marks help render text of different directionality, but obviously need to be invisible.
Yeah, I was like 'well, that's kinda amusing' and then I saw the screenshot of the diff and went 'AHHHHH!!!' The worst part is, it's totally plausible. 'What's with the whitespace changes in your patch?' 'You had some trailing whitespace in there so my text editor automatically cleaned it up.'
Diff views probably should replace invisible characters with visible place-holder glyphs. Is that something that can be done on a font level, or does it require extra code? As in, can I assign a glyph to an RTL mark and have it automatically show up?
Since the article's title doesn't have that weakness and should have been used anyhow, we'll use it. (Submitted title was "Anguish: A language written in zero-width characters".)
Anyone who uses 'lede' correctly gets express treatment here...
I chose that title because "invisible" has a lot of meanings and "zero-width" is more precise here. Brainfuck is an invisible language -- and so is COBOL because so few people think about its widespread use today.
I think Unicode Emoji were a great step forward but we must redouble our efforts.