Anguish: Invisible Programming Language and Invisible Data Theft

lolc · on May 20, 2016

Unicode has stubbornly refused to become Turing complete. Will it ever be more than a bunch of character tables?

I think Unicode Emoji were a great step forward but we must redouble our efforts.

TazeTSchnitzel · on May 21, 2016

Unicode could be a stack machine. Implementing bi-directional text already requires a stack which can be pushed onto and popped from (https://en.wikipedia.org/wiki/Bi-directional_text#Unicode_bi...); we just need to add more operations!

lolc · on May 21, 2016

Fascinating. The emoji seem primitive in comparison.

This is actually related to something I'm working on at the moment and it cleared up a few misconceptions. So thanks for the link :-)

ghayes · on May 21, 2016

A Turing machine can be simulated with two stacks.
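To make the two-stack claim concrete, here is a minimal Python sketch. One stack holds the tape to the left of the head, the other holds the cell under the head plus everything to its right. The function name, the rule format, and the bit-flipping example machine are all made up for illustration.

```python
def run_two_stack_tm(rules, tape, state="start", halt="halt", blank="_", max_steps=10_000):
    """Simulate a single-tape Turing machine with two stacks.

    left  : cells left of the head (top of stack = nearest cell)
    right : cell under the head and everything to its right (top = under the head)
    rules : maps (state, symbol) -> (new_state, symbol_to_write, "L" or "R")
    """
    left = []
    right = list(reversed(tape))
    steps = 0
    while state != halt:
        if steps >= max_steps:
            raise RuntimeError("step limit reached without halting")
        steps += 1
        symbol = right.pop() if right else blank   # read the cell under the head
        state, write, move = rules[(state, symbol)]
        if move == "R":
            left.append(write)                     # written cell is now left of the head
        else:
            right.append(write)                    # written cell is now right of the head
            right.append(left.pop() if left else blank)  # head steps onto the left neighbour
    tape_out = "".join(left) + "".join(reversed(right))
    return state, tape_out.strip(blank)


# Hypothetical example machine: flip every bit, halt at the first blank.
FLIP = {
    ("start", "0"): ("start", "1", "R"),
    ("start", "1"): ("start", "0", "R"),
    ("start", "_"): ("halt", "_", "L"),
}

print(run_two_stack_tm(FLIP, "1011"))
```

Moving the head is just popping from one stack and pushing onto the other, which is why two stacks are enough to stand in for an unbounded tape.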

jhardy54 · on May 20, 2016

I genuinely can't decide whether or not this is sarcasm.

unimpressive · on May 20, 2016

It's sarcasm.

danra · on May 20, 2016

Someone needs to make a script which runs through a git repo's commit history and looks for commits which add invisible Unicode characters. Maybe some existing exploits could be found in the wild.

danra · on May 21, 2016

Here's a script that detects the Anguish characters (except for the Byte Order Mark \ufeff, which isn't that rare):

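  # bash 4.2+ ANSI-C quoting, so the pattern holds the actual characters rather than a literal "\u..."
  exp=$'\u2060|\u200b|\u2061|\u2062|\u2063|\u200c|\u200d'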
  git log -p --pickaxe-regex -S"$exp"
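If shell escape handling gets in the way, the same check can be sketched in Python. This is only an illustration: the function name `find_invisible` is made up, and the character set is copied from the shell snippet above (U+FEFF left out, as there).

```python
# Code points taken from the shell snippet above (U+FEFF deliberately omitted).
SUSPECT = frozenset("\u2060\u200b\u2061\u2062\u2063\u200c\u200d")


def find_invisible(text):
    """Return (line_number, column, 'U+XXXX') for every suspect character in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECT:
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits


sample = "a\u200bb\nok\n\u2063x"
for line_no, col, code in find_invisible(sample):
    print(f"line {line_no}, column {col}: {code}")
```

Feeding it the text of `git log -p` (for example read from standard input) would give roughly the same information as the pickaxe search, with positions included.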

lisivka · on May 21, 2016

Some popular invisible characters are space, tab, and newline.

anoother · on May 21, 2016

None of which are invisible in the same sense as those covered in the article...

lisivka · on May 21, 2016

I am responding to the parent comment, not to the article.

Spaces, tabs, and newlines can be used to alter code in an invisible way.

pairoffeet · on May 20, 2016

Lua, or at least LuaJIT, allows these characters to be used as identifiers, which has led to some pretty interesting looking obfuscated code: https://facepunch.com/showthread.php?t=1463260&p=47712658&vi...

sillysaurus3 · on May 21, 2016

The following code uses this technique. It's valid Lua. If you highlight this and copy it to your system clipboard, running `pbpaste | luajit` will count to 5 three times:

    ‏=print

    ‎=function()
      for ‎‏=1,5 do
        ‏(‎‏)
      end
    end

    ‎(‎(‎()))

gleenn · on May 21, 2016

Hmmm, running pasted code that is explicitly obfuscated from the internet probably isn't a great idea.

allthetime · on May 21, 2016

True, but in this case the invisible characters are simply names and the rest is verifiably benign code. It's not like we're copy-pasting a binary or anything.

sillysaurus3 · on May 21, 2016

I've been thinking of ways to make it seem like verifiably benign code, while doing something "interesting."

For example:

  print("‏‏‏‏‏‏‏‏‏‏‏‏‏‏‎‎‎‎‎‎‎‎‎‎‎‎‎‎‏‎")

This is benign, but that's not an empty string. The string contains a bunch of U+200e and U+200f characters, even though it appears empty. It proves that you can have strings with invisible characters in them.

Since we have two types of invisible characters, U+200e and U+200f, we can use those as binary digits -- 1 and 0. Thus, we can write a function that takes an invisible string as input, and returns a normal string as output.
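As a rough sketch of that encoder/decoder pair (in Python rather than Lua, just to show the idea): the function names are made up, and which of the two marks stands for 1 and which for 0 is an arbitrary assumption.

```python
ONE = "\u200e"   # LEFT-TO-RIGHT MARK, assumed here to mean binary 1
ZERO = "\u200f"  # RIGHT-TO-LEFT MARK, assumed here to mean binary 0


def to_invisible(text):
    """Encode text as a string made only of U+200E / U+200F (8 marks per UTF-8 byte)."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("utf-8"))
    return bits.replace("1", ONE).replace("0", ZERO)


def from_invisible(hidden):
    """Decode a string produced by to_invisible; other characters are ignored."""
    bits = "".join("1" if ch == ONE else "0" for ch in hidden if ch in (ONE, ZERO))
    usable = len(bits) - len(bits) % 8          # drop any incomplete trailing byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8")


hidden = to_invisible("hi")
print(len(hidden), from_invisible(hidden))
```

The encoded string looks empty in most renderers, but its length gives it away: eight characters per byte of the original text.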

So, what kind of string could we feed it? One possibility would be to convert something like "echo 'command-line injection'" into an invisible string. We'd pass that into our decoder function, and pass the result into os.execute. Since the conversion function mentioned above can be identified with an invisible variable name, it would look similar to this:

  os.execute(‏("‏‏‏‏‏‏‏‏‏‏‏‏‏‏‎‎‎‎‎‎‎‎‎‎‎‎‎‎‏‎"))

That looks very suspicious, but we can do better. In Lua, you can index into tables with strings. And we have a function which can take invisible strings and produce normal strings.

The final PoC could look similar to this:

  _G[‏("‏‏‏‏‏‏‏‏‏‏‏‏‏‏‎‎‎‎‎‎‎‎‎‎‎‎‎‎‏‎"))][‏("‏‏‏‏‏‏‏‏‏‏‏‏‏‏‎‎‎‎‎‎‎‎‎‎‎‎‎‎‏‎"))](‏("‏‏‏‏‏‏‏‏‏‏‏‏‏‏‎‎‎‎‎‎‎‎‎‎‎‎‎‎‏‎")))

That's non-working code, as I haven't put this together. But the idea is to convert the following (working) code into invisible strings:

  _G["os"]["execute"]("echo 'command-line injection'")

Making this work is left as an exercise for the reader. :)

Another interesting approach would be to iterate through the "os" table a fixed number of times, until reaching the "execute" key. The iteration order isn't guaranteed, but given a certain version of LuaJIT, I think it's stable. That means you'd be able to do the equivalent of "os.execute" while making it look like you're "counting to 5."

sillysaurus3 · on May 21, 2016

Indeed not, but it's pretty amazing that it's copy-pastable via HN.

tshtf · on May 20, 2016

Ignorable (invisible) Unicode characters have caused security vulnerabilities in the past, especially on HFS+ filesystems running on OS X (due to normalization):

* https://git-blame.blogspot.com.es/2014/12/git-1856-195-205-2...

* https://www.cvedetails.com/cve/CVE-2013-0966/

gohrt · on May 20, 2016

The language is perfect for literate programming. All human-readable characters are comments by default, so you can write completely human-friendly prose, and make the machine-readable content invisible to humans.

jegoodwin3 · on May 21, 2016

Or you could put the comments in the invisible fork and pass a Turing test.

'I passed the Turing test. No one believed me. Honest.'

passes Turing test by sounding like a petulant child

yokohummer7 · on May 20, 2016

It might be useful to have an option to turn all the "non-obvious" Unicode characters into the form of <U+XXXX> in programming editors. One concern is that some legitimate text will break, but it would be worth it considering most code is written in English anyways.
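A minimal Python sketch of such an option, assuming "non-obvious" means control, format, and other "C" category characters plus any space separator other than the plain ASCII space. That definition and the name `reveal` are assumptions, not an existing editor feature.

```python
import unicodedata

# Ordinary whitespace that should stay as-is.
KEEP = {"\n", "\r", "\t", " "}


def reveal(text):
    """Replace hard-to-see characters with a visible <U+XXXX> marker."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        hidden = ch not in KEEP and (
            category[0] == "C"                    # control, format, unassigned, ...
            or category in ("Zs", "Zl", "Zp")     # unusual spaces and separators
        )
        out.append(f"<U+{ord(ch):04X}>" if hidden else ch)
    return "".join(out)


print(reveal("total\u200b = 1"))
```

Accented letters and other visible non-ASCII text pass through unchanged, so only the characters that render as nothing (or as an odd space) get expanded.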

chungy · on May 21, 2016

It's trivial to convert between this and regular Brainfuck syntax: just replace characters. The article even gives an example using a Perl one-liner :)

Edit: I believe I misinterpreted the intention of your post; you wanted to expose tricks like these. I'd be fine with this.
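The character replacement can be sketched in a few lines of Python. Note that the mapping below is HYPOTHETICAL: the eight code points are the ones named elsewhere in this thread, but which one corresponds to which Brainfuck command is invented here and is not taken from the article.

```python
# HYPOTHETICAL assignment of Brainfuck commands to invisible code points;
# the real table is in the article, not in this thread.
BF_TO_INVISIBLE = {
    ">": "\u200b", "<": "\u200c", "+": "\u200d", "-": "\u2060",
    ".": "\u2061", ",": "\u2062", "[": "\u2063", "]": "\ufeff",
}
INVISIBLE_TO_BF = {inv: bf for bf, inv in BF_TO_INVISIBLE.items()}


def bf_to_invisible(source):
    """Translate Brainfuck to invisible characters, dropping comment characters."""
    return "".join(BF_TO_INVISIBLE[ch] for ch in source if ch in BF_TO_INVISIBLE)


def invisible_to_bf(source):
    """Translate back, ignoring everything that is not one of the eight code points."""
    return "".join(INVISIBLE_TO_BF[ch] for ch in source if ch in INVISIBLE_TO_BF)


program = "++[>+<-]."
assert invisible_to_bf(bf_to_invisible(program)) == program
```

Because everything outside the eight code points is ignored on the way back, visible prose can be mixed freely with the hidden program.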

LionessLover · on May 21, 2016

Jetbrains: "Zero Width Characters locator" plugin (https://plugins.jetbrains.com/plugin/7448)

zimbatm · on May 20, 2016

This is the next generation of shell code right there.

Why try to obfuscate programs in base64-encoded strings when you can have them lying around invisibly in plain sight?

Retr0spectrum · on May 20, 2016

Here's one I made earlier:

    alert(String.fromCharCode.apply(null,String.fromCharCode.apply(null,"‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌‌".split("").map(function(c){return(c.charCodeAt(0)>>2)^2098})).match(/.{8}/g).map(function(c){return parseInt(c,2)})))

yigitozkavci · on May 20, 2016

Hmm. Readability is high on this one.

Bromskloss · on May 20, 2016

Yes, very transparent!

jegoodwin3 · on May 21, 2016

It also has the interesting property that all the programs are quines except those that print themselves.

In other words, the programs are quines if and only if they aren't.

Androids dream of quined Anguish.

vvanders · on May 20, 2016

Wow, that's truly evil. Would other languages like Ruby that support overloads like that be susceptible (I'm no Ruby expert)?

Freaky · on May 20, 2016

You can name methods with invisible Unicode characters, so calling them is basically invisible. Quick example:

https://gist.github.com/Freaky/51086f3c97784bdd6dfbd31913cd1...

You don't need to use define_method, it just makes it more obvious what's going on.

_kst_ · on May 20, 2016

It uses "U+FEFF ZERO WIDTH NO-BREAK SPACE", also known as "BYTE ORDER MARK" -- which means that an Anguish program that starts with that character might not survive translation to or from UTF-16.
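A small Python illustration of how that can happen, assuming the bytes are written as UTF-16 without an extra byte order mark and then read back by a decoder that treats a leading U+FEFF as one. The three-character `program` string is just a stand-in, not real Anguish code.

```python
# Stand-in source text that begins with U+FEFF as a meaningful character.
program = "\ufeff\u200b\ufeff"

# Written as little-endian UTF-16 with no byte order mark added...
raw = program.encode("utf-16-le")

# ...then read by a decoder that interprets a leading U+FEFF as a BOM and drops it.
round_tripped = raw.decode("utf-16")

print(len(program), len(round_tripped))

# If the encoder adds its own BOM, only that one is stripped and the text survives.
assert program.encode("utf-16").decode("utf-16") == program
```

Only the leading U+FEFF is at risk; the later one in the string is left alone. Whether the character is lost therefore depends on which tool writes the file and which one reads it.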

geofft · on May 20, 2016

At first glance, I'd consider it a (security) bug in Perl 6 that it permits tokens containing invisible characters, let alone consisting solely of invisible characters. Are there any other languages with this behavior?

As a random example:

    titan:~ geofft$ python3 -c "$(printf "\u2063") = 1"
      File "<string>", line 1
         = 1
        ^
    SyntaxError: invalid character in identifier

If you change it to e.g. 00e9 ("é"), Python 3 permits the character, so it's not just a lower-ASCII thing.
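The same check can be done without going through the parser, using Python 3's `str.isidentifier()`. A short sketch:

```python
import unicodedata

candidates = ["\u2063", "\u00e9"]  # INVISIBLE SEPARATOR and "é"

# Map each code point to whether Python 3 accepts it as a complete identifier.
results = {f"U+{ord(ch):04X}": ch.isidentifier() for ch in candidates}

for ch in candidates:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isidentifier())
```

The invisible separator is in the "Cf" (format) category and is rejected, while the letter "é" is accepted, matching the interpreter error above.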

labster · on May 20, 2016

https://rt.perl.org/Public/Bug/Display.html?id=128159 was filed by the author of this piece. The take-home to me is that supporting Unicode to a great depth in a language is really hard.

xyience · on May 20, 2016

My company's large Java codebase has hundreds of zero-width spaces in the middle of method names. It's sad.

yokohummer7 · on May 20, 2016

What's the reason? Did the method names need to be aligned somehow?

ugexe · on May 20, 2016

"invisible" is entirely dependent on your text editor

kbenson · on May 20, 2016

What about carriage return, newline, form feed, etc.? Those are invisible characters, and they are just plain ASCII.

chias · on May 20, 2016

Those aren't invisible, merely transparent: they take up space, and alter the location of the cursor when inserted. Invisible characters don't.

logfromblammo · on May 20, 2016

Also, some text-displaying programs will insert glyphs for them, to make them visible, which would make them somewhat more detectable.

And the BEL character, while non-spacing and invisible, is sometimes audible.

geofft · on May 21, 2016

Those aren't valid tokens. "<newline> = 10" won't assign to a variable named <newline>. If the language wants to parse non-ASCII invisible characters as whitespace, or permit them inside comments or strings, that's fine.

solox3 · on May 20, 2016

git makes Anguish code look more readable than the rest of the file: http://imgur.com/AHavQor

vsviridov · on May 20, 2016

This looks like vi, which is probably the default $EDITOR that git defers to.

Twirrim · on May 20, 2016

This is delightfully evil

undershirt · on May 20, 2016

"Hush" immediately came to mind as a useful name for an invisible language.

krylon · on May 20, 2016

This is probably a stupid question, but could somebody explain to me why one would add invisible characters to Unicode?

nine_k · on May 20, 2016

For instance, zero-width spaces and other word-break characters help produce reasonable text layout, but should be invisible. RTL and LTR marks help render text of mixed directionality, but obviously need to be invisible.

krylon · on May 20, 2016

Thanks!

astrobe_ · on May 20, 2016

Because a character encoding standard is considered worthless if you can't make ANSI bombs with it...

jlg23 · on May 20, 2016

Zero-width Unicode chars have been used in exploit kits for a while now; just use hd (or something similar) when debugging.

hollander · on May 21, 2016

Where is this more dangerous, on the web or in GitHub and open source programming?

Null-Set · on May 20, 2016

Buries the lede. The interesting part is the abuse of invisible characters to sneak malicious code into pull requests.

The language is just a cute transliteration of brainfuck to use invisible zero width characters.

gwern · on May 20, 2016

Yeah, I was like 'well, that's kinda amusing' and then I saw the screenshot of the diff and went 'AHHHHH!!!' The worst part is, it's totally plausible. 'What's with the whitespace changes in your patch?' 'You had some trailing whitespace in there so my text editor automatically cleaned it up.'

detaro · on May 21, 2016

Diff views probably should replace invisible characters with visible place-holder glyphs. Is that something that can be done on a font level, or does it require extra code? As in, can I assign a glyph to an RTL mark and have it automatically show up?

dang · on May 20, 2016

Since the article's title doesn't have that weakness and should have been used anyhow, we'll use it. (Submitted title was "Anguish: A language written in zero-width characters".)

Anyone who uses 'lede' correctly gets express treatment here...

labster · on May 20, 2016

I chose that title because "invisible" has a lot of meanings and "zero-width" is more precise here. Brainfuck is an invisible language -- and so is COBOL because so few people think about its widespread use today.

function_seven · on May 20, 2016

True, but your title leaves out the important part, the data theft POC.

danharaj · on May 21, 2016

> Anyone who uses 'lede' correctly gets express treatment here...

is that a hint of weariness? :3

tlhunter · on May 20, 2016

This is nothing more than a string replace on top of Brainfuck.

ChrisClark · on May 20, 2016

Only in the first half of the article. The real reason for the post is the second half, which is the actually scary part that could affect developers.