A primer on some C obfuscation tricks (github.com/colinianking)
223 points by simonpure on April 23, 2020 | 88 comments



I publicly reverse-engineered a cute bit of C code many years ago that employs some of these tricks:

https://stackoverflow.com/questions/15393441/obfuscated-c-co...

It's a digital clock (which has to be compiled and run once per second to be accurate).

Of course, if you're interested in obfuscated C, you can't miss the International Obfuscated C Code Contest, which is where most of these evil tricks show up. Submissions for this year's IOCCC are still open: https://www.ioccc.org/

The IOCCC has been running since 1984, and there are some absolutely marvelous gems: https://www.ioccc.org/years.html. A great rabbit-hole to dive down if you're stuck at home ;)


  ({_:&&_;});

  ({});

  ({;});
All of these rely on a gcc extension ('statement expressions').

________________________________

  a = '-'-'-'
You can also do:

  a = '/'/'/'
To generate a 1 instead of a 0.
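
For anyone who wants to check both variants, here is a minimal sketch. The function names are made up for illustration; the only thing it relies on is that character constants have type int in C.

```c
/* Character constants have type int in C, so they can be used in
   ordinary arithmetic. Function names are hypothetical. */
int zero_from_chars(void)
{
    return '-' - '-';   /* a value minus itself: 0 */
}

int one_from_chars(void)
{
    return '/' / '/';   /* a nonzero value divided by itself: 1 */
}
```

Calling zero_from_chars() and one_from_chars() from main should give 0 and 1 regardless of the character set.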

________________________________

  printf("%d %d\n", 0 == sizeof(count = 2, count++), count);
This works because:

1. There are two forms of sizeof; sizeof(T) where T is a type, and sizeof x where x is an expression.

2. There are 'comma expressions'; if you have (x, y) where x and y are expressions, then x is executed and the expression evaluates to y.

3. Parameters to sizeof are not evaluated (which is important because otherwise the value of the 3rd argument would be undefined, since there's no sequence point between evaluation of function parameters).
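
A small sketch of points 2 and 3 together, assuming a non-VLA operand (the function name is mine, not from the article). The operand is a comma expression of type int, and none of its side effects run:

```c
#include <stddef.h>

/* Returns the value of count after sizeof has been applied to an
   expression that would modify it if it were evaluated. */
int count_after_sizeof(void)
{
    int count = 0;
    /* The operand is the comma expression (count = 2, count++).
       Its type is int, which is not a variable length array type,
       so neither the assignment nor the increment is executed. */
    size_t size = sizeof(count = 2, count++);
    return (size == sizeof(int)) ? count : -1;
}
```

Some compilers warn that the side effects in the operand have no effect, which is exactly the point of the trick.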


> Parameters to sizeof are not evaluated

Oh, but they are ;) Try running sizeof on a VLA sometime.


The argument to sizeof is evaluated if and only if it's of a variable length array type.

(If it's a parenthesized type name, it's unclear what it means to "evaluate" it.)
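
A hedged sketch of the VLA case, assuming a compiler with variable length array support (C99, optional since C11). The function name is hypothetical:

```c
#include <stddef.h>

/* Returns the value of n after sizeof is applied to a VLA type whose
   size expression has a side effect, or -1 if the size is unexpected. */
int n_after_vla_sizeof(void)
{
    int n = 3;
    /* int[n++] is a variable length array type, so the size
       expression is evaluated at run time and n is incremented. */
    size_t bytes = sizeof(int[n++]);
    if (bytes != 3 * sizeof(int))
        return -1;
    return n;
}
```

With GCC and Clang I would expect this to return 4, in contrast to the non-VLA case where the operand is left unevaluated.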


Number 24 on the list shows exactly this.


  $ cat sizeof.c

  #include <stdio.h>
   
  int main(int argc, char *argv[])
  {
      printf("%zu\n", sizeof(int[argc - 2]));
  }

  $ make -B sizeof
  cc     sizeof.c   -o sizeof

  $ ./sizeof foo
  0

  $ ./sizeof
  17179869180


This can yield surprising results even without undefined behavior. For example: https://godbolt.org/z/fwhLH4


This is what I see when I visit your link. I think this is easily explained. What might I be missing?

  ASM generation compiler returned: 0
  Execution build compiler returned: 0
  Program returned: 44
  ./output.s
PS: I did not examine the assembly code


It calls printf. Perhaps this is a better example: https://godbolt.org/z/fwhLH4. (I updated the link above, too.)


Is it bad that I've written some of these in my normal code? In particular, I use the concepts in these when they're relevant:

  *(c ? &x : &y) = v;

  return (char *[]){"No", "Yes"}[!!x];
(Note: for the second one, I use the "selecting from a temporary array" and !! separately; if I only had two options then I'd use x ? "Yes" : "No" of course.)


I don't find those obfuscatory at all either, and instantly understood them upon first reading. I think they are a form of "data-oriented" style of programming, where array indexing and selection via ?: are preferred over "code-oriented" control flow statements. In my experience, the former actually tends to create more concise and maintainable code, since it often means a lot of the logic becomes table-driven and easily modified. The latter tends to result in very long and "branchy" code with lots of if/else statements that ultimately contain tiny bodies that don't do much.

Combining multiple booleans into an integer and using that as a switch or index is another related technique.

To modify an old saying slightly, perhaps "one person's obfuscation is another person's simplicity and elegance."


> I don't find those obfuscatory at all either, and instantly understood them upon first reading

I understood the whole list of these obfuscation tricks upon first reading, but it's still obfuscation.

Creating an array instead of an if-else-switch like the second line is fine, but for just two elements and in combination with !!x (and !!x in general) it's just nonsense and doesn't help at all.

I'm all for code-density and avoiding repetition so I too use ?: whenever I can but not on the left side of an assignment. That's just an obfuscated if-else-statement. If normal code results in very long and "branchy" code you can rearrange your code in a better way.


FWIW, !! is a fairly idiomatic way to cast a value to bool.


Ints are automatically converted to bool; there is no need to cast. What this does is restrict the int to 1 or 0. The clean way to write this is x ? 1 : 0. This is understandable even if you don't know that the Boolean inversion of 0 is 1 (which isn't self-evident; for example, in Basic it's -1).


They are converted to bool in bool context. Outside of that you have to force it.

   int countNonZero = 0;
   for (int i = 0; i < values.size(); ++i) {
       countNonZero += !!values[i];
   }
Also, C++ has explicit operator bool();

edit: not a great example as values[i] > 0 would be better.


You don't have to enforce a conversion from int to bool if you just want a bool. In this example you don't want to increment by 'bool', you want to increment by 1 or 0 (which are ints). So you use !!x to convert to bool and back to int.

The reason I'm pointing this out is that often enough I had to work with code where this distinction was unknown to the author and you come across code like if (!!x) which is, again, just nonsense.


> edit: not a great example as values[i] > 0 would be better.

Only if that's an array of unsigned ints, as otherwise the code would count only positive nonzeros.


You can also typecast to bool since C99. (bool)x if you include <stdbool.h> or (_Bool) if not.


I try to write C/C++ so that the state and flow of control is fairly obvious in a debugger. Complicated and "electric" statements like the ones above are succinct and maybe even obvious, but they're difficult to step through, and they don't really offer any advantage in code quality.

Screen space is cheap. Say what you mean. Clever today is usually a headache tomorrow.


I can fit over 500 lines of code on my screen but I still prefer dense code. There's just something nice about not needing to jump around so much (whether with the eyes or the cursor or file), and being able to see more of the forest at once.


> Is it bad that I've written some of these in my normal code?

yes


Well, that depends on context. Always ensure your code is good in the context it appears.

    yourCode = bad;

    temp = bad;
    bad = good;
    good = temp;

    conclusion = (yourCode == bad)
As we can see from evaluating this code, in this context, it isn’t bad.

In general, if something is bad you can change that thing or change the context. Your job, your living arrangements, etc. People often are slow to change the context, even when a small change can make all the difference.

You might want to create an include file if you have a lot of code to check.

mycodeisalwaysgood.h seems self-explanatory.


I wonder if even replying is encouraging this behavior.


I can understand the code, and if all your function does is these 2 statements I guess it is ok. Myself, however, I would not write any code of meaningful size in this style.

I switch between many languages often enough that anything even remotely esoteric gets forgotten instantly and causes my brain to pause at the suspicious line, making me think too much about the trees instead of the forest.

If however I was only using C and nothing else maybe I'd be catching this bug myself, one never knows ;)


>*(c ? &x : &y) = v;

Do you know why you need to use pointer magic here?

Tried (c ? x : y) = v; and wondering why that doesn't work.


Because the result of the ternary operator is only an rvalue in C. In C++, and as a gcc extension, it can be used as an lvalue, but not in regular plain C.
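
To make the workaround concrete, here is a minimal sketch (the function name and parameters are made up). Selecting between two pointers yields an rvalue pointer, and dereferencing that pointer produces an lvalue that can be assigned to:

```c
/* Stores v into *x when c is nonzero, otherwise into *y.
   The conditional picks a pointer (an rvalue), and the unary *
   turns it back into an lvalue. Writing (c ? *x : *y) = v would
   not be accepted as plain C. */
void store_selected(int c, int *x, int *y, int v)
{
    *(c ? x : y) = v;
}
```

Called as store_selected(1, &a, &b, 7), this writes 7 into a and leaves b alone.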


>return (char *[]){"No", "Yes"}[!!x];

Is !!x equivalent to x here?


No, it "collapses" an integer to 0 or 1. (Essentially, it's x != 0.)


Personally I prefer (x ? 1 : 0) over (x != 0) or !!x for this sort of thing.

It fits semantically with "I'm using x as a boolean, but it may not be 1".


i'm on the fence about this... i don't love type coercions, but (!!x) is nice because you don't have to stop and check - was it (x ? 1 : 0) or (x ? 0 : 1), (x !=0) or (x==0)? i think arthur whitney said something in the vein of "less characters - less room for typos" :)


!!x is idiomatic. (bool) cast is modern.

I use !=0 in only one context, if(strcmp(a,b) != 0), to check if two strings differ. Normally in an if() expression I don't check explicitly for 0 or not 0, except for the strcmp class of functions: because of its ordered comparison (<0, 0, >0), its "boolean" value is inverted, so it is somewhat counterintuitive if used with implicit boolean values.


I don't think !! can be too common an idiom when the one time I recall using it in a corporate setting it was pointed out as absurd. And I agree it's not explicit and clear. But this seems a matter of style.

Cast to bool has me extremely nervous because ... It works in C99's _Bool type, but I have lots of bad memories of security bugs when using this approach with pre-standard custom bool types, which are still in common use including but not limited to Win32 and COM, or Objective-C.

The bug I am thinking of looks like this:

    long long ll = 1LL << 32;
    bool b = ll;

    puts(b ? "yes" : "no");
I just tested to confirm. This prints "yes" when using C99 stdbool, presumably because they standardized it to work. If you change bool to a common choice for pre-standard bool typedefs (int or char, say), it prints "no" because the large nonzero value doesn't fit in that type. "Just use the standard type" you might say. But if we were working somewhere that required legacy nonstandard bools, that is not our choice, and further, a later refactor may start using those other types some day even if we make the correct choice today. So there is an argument to be consistent about avoiding this type of bug.
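
A self-contained sketch of the same bug, for comparison. The legacy_bool typedef is hypothetical and stands in for any pre-standard boolean type; I picked unsigned char so that the narrowing conversion is well defined (a signed type would make it implementation-defined):

```c
#include <stdbool.h>

/* Hypothetical stand-in for a pre-C99 boolean typedef. */
typedef unsigned char legacy_bool;

/* Conversion to the standard bool type yields 1 for any nonzero value. */
int std_bool_result(void)
{
    long long ll = 1LL << 32;
    bool b = ll;
    return b;
}

/* Conversion to a narrow integer type keeps only the low bits,
   so bit 32 is lost and the "boolean" ends up as 0. */
int legacy_bool_result(void)
{
    long long ll = 1LL << 32;
    legacy_bool b = (legacy_bool)ll;
    return b;
}

/* Comparing against zero first avoids the truncation for any type. */
int legacy_bool_normalized(void)
{
    long long ll = 1LL << 32;
    legacy_bool b = (ll != 0);
    return b;
}
```

The third function is the kind of explicit normalization that stays correct even if the boolean type is changed in a later refactor.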


If x is any nonzero value [example: 2], !!x is 1.

Any nonzero expression is true, and 0 is false. But the result of !, or ==, ||, or any of those other boolean operators is 1 for true.
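
As a quick illustration (function names are mine), this is why the array lookup from upthread needs the !!: the index has to be exactly 0 or 1, not merely zero or nonzero.

```c
/* Collapses any int to 0 or 1. */
int collapse(int x)
{
    return !!x;
}

/* Indexes a two-element temporary array. Using x directly as the
   index would read out of bounds for values such as 2 or -5. */
const char *yes_no(int x)
{
    return (const char *[]){"No", "Yes"}[!!x];
}
```

So collapse(2) and collapse(-5) both give 1, and yes_no(2) gives "Yes".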

Recall relatedly that C didn't have a bool type until the 1999 standard, which introduced <stdbool.h>. C++ had it earlier, but for a long time the C way to store a boolean expression into a variable might have been int, or some nonstandard typedef. This is why so many libraries and frameworks in the C world define their own boolean type.


Hey, I've got one! Use the exponent operator.

  if (2^3 == 8)
      puts("two cubed is eight");
  if (5^2 == 25)
      puts("five squared is twenty-five");


That's Bitwise Xor not exponentiation.

2^3 => 1

5^2 => 7


It works though! And if you do

  if (2^3 != 1)
      puts("two cubed is not one");
  if (5^2 != 7)
      puts("five squared is not seven");
it prints those messages too.

The trick is operator precedence.
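
Spelled out as a sketch (function names are hypothetical): == binds tighter than ^, so the comparison happens first and the XOR is applied to its 0-or-1 result.

```c
/* Parsed as 2 ^ (3 == 8). The comparison is false (0), and 2 ^ 0
   is 2, which is nonzero, so an if on this expression is taken. */
int as_written(void)
{
    return 2 ^ 3 == 8;
}

/* With explicit grouping the XOR happens first: 2 ^ 3 is 1,
   and 1 == 8 is false. */
int as_grouped(void)
{
    return (2 ^ 3) == 8;
}
```

Compilers with warnings enabled will usually suggest parentheses for the first form.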


Cool - I thought you had the operator confused


That index[x] was used in one of my favorite IOCCC one-liners.

http://faehnri.ch/have-fun/


From bulletin #4: "-2147483648 is positive. This is because 2147483648 cannot fit in the type int, so (following the ISO C rules) its data type is unsigned long int. Negating this value yields 2147483648 again."

This is not true. If a decimal integer constant value cannot be represented in type "int", the next candidate type is "long int". If the value cannot fit in "long int" either, the next type to try is "long long int" in C99 and "unsigned long int" only in C89.
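
A small check of this, assuming a C99 or later compiler (the function names are made up). The second function additionally assumes a 32-bit int, which is typical but not guaranteed:

```c
/* In C99 and later, 2147483648 gets the first signed type that can
   hold it (int, long, then long long), so negating it gives a
   negative value. */
int literal_is_negative(void)
{
    return -2147483648 < 0;
}

/* Assumption: int is 32 bits wide. Then the constant cannot have
   type int, and the expression is wider than int. */
int literal_is_wider_than_int(void)
{
    return sizeof(-2147483648) > sizeof(int);
}
```

In C89 mode the result of the first function can differ, which is presumably where the bulletin's claim comes from.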


Today I learned... _magic_

    #include <cstdio>
    
    void sw(int s)
    {
        switch (s) while (0) {
            case 0:
                printf("zero\n"); continue;
            case 1:
                printf("one\n"); continue;
            case 2:
                printf("two\n"); continue;
        }
    }

[0] https://gcc.godbolt.org/z/Q26LWG


Hey wait, that's just Duff's... Oh.


Huh, what does it do?


Looking at the link, it generates the exact same assembly (and behavior of course) as

    void sw(int s) noexcept
    {
        switch (s) {
            case 0:
                printf("zero\n"); break;
            case 1:
                printf("one\n"); break;
            case 2:
                printf("two\n"); break;
        }
    }
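
A hedged sketch of why the two are equivalent, written as plain C with a return value instead of printf so it is easy to check (the function name is mine). The switch body is a single while (0) statement and the case labels sit inside its block: continue jumps to the loop condition, which is 0, so the loop ends and so does the switch.

```c
/* Returns 1, 10 or 100 for s = 0, 1, 2 and 0 for anything else.
   If continue did not leave the switch, the cases would fall
   through and the results would be sums such as 111. */
int classify(int s)
{
    int hits = 0;
    switch (s) while (0) {
        case 0:
            hits += 1; continue;
        case 1:
            hits += 10; continue;
        case 2:
            hits += 100; continue;
    }
    return hits;
}
```

With no matching case and no default label, the whole body is skipped, so classify(7) returns 0.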


Is noexcept here the consequence of while? Or noexcept simply does nothing at all?

Not a C expert, so just curious.


AFAIK noexcept is a C++ specifier that isn't valid C. Even so, it has nothing to do with the loop, but in C++ it would cause the program to terminate if an exception occurred in the sw function.


thatsthejoke.jpg


I mean, parent seemed an awful lot like someone asking what it did. Damn me for missing the invisible /s. I'm sure no other reader would ever be curious!


It's not the comment that is the joke, but the example in the context of the "C obfuscation tricks" article where it's just a convoluted way to write the same thing as a simple C construct.


Yup. Of all of the examples on that page, this is the only one new to me. I thought it pretty clever and after toying with it, it does seem pretty cool with functions as the conditionals.


    > return (char *[]){"No", "Yes"}[!!x];
 
I prefer

    return (!x<<2)+"Yes\0No";


clang will warn as it doesn't like int+string constructs.

&"Yes\0No"[!x<<2]; // same number of characters

to shut it up.
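
For reference, a sketch of the packed-string version as a function (the name is hypothetical). It relies on the layout of the literal: 'Y', 'e', 's', '\0', then "No" starting at offset 4.

```c
/* !x is 0 when x is nonzero and 1 when x is zero. Unary ! binds
   tighter than <<, so !x << 2 is either 0 (start of "Yes") or
   4 (start of "No", just past the embedded terminator). */
const char *yes_no_packed(int x)
{
    return &"Yes\0No"[!x << 2];
}
```

yes_no_packed(5) points at "Yes" and yes_no_packed(0) points at "No".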


Clever! Though I feel having "No" relate to x = 0 is slightly clearer


But surely the goal here is the opposite of clarity? :)


> x[index] is *(x+index)
> index[x] is legal C and equivalent too

Can't be unseen, I can't believe I never thought of that.


C could have permitted pointer+integer and banned integer+pointer, which would have made the indexing operator non-commutative. Nothing important (IMHO) would have been lost. But the equivalence goes back to the days when the distinction between integer and pointers was not always clear.


I find that after learning assembly language, things like this become very obvious when seen in different languages, especially C.


The main reason that works is historic, I think.

It could just as easily throw a compilation error to index a constant with an array rather than the other way around. I don't think this works in Rust even if the resulting machine code for array indexing is the same.


The main reason is that in C, "indexing" an array is purely syntactic sugar for pointer arithmetic, which itself is commutative; that is, ((A)[B]) is equivalent to (*((A)+(B))), which itself is equivalent to (*((B)+(A))) (assuming one of them has an integer type and the other a pointer to complete object type).

Now, of course an array type isn't a pointer type, but as "indexing" isn't one of the very few cases where an expression that has an array type isn't converted to an expression with a pointer type, you aren't really indexing an array, but a pointer to its first element.
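
A minimal sketch of the equivalence (function names are made up for illustration):

```c
/* All three spellings name the same element, because a[2] is
   defined as *(a + 2) and pointer addition is commutative. */
int subscript_is_commutative(void)
{
    int a[] = {10, 20, 30, 40};
    return a[2] == 30 && 2[a] == 30 && *(2 + a) == 30;
}

/* It works on string literals too, since the literal decays to a
   pointer to its first character. */
char second_char(void)
{
    return 1["abc"];
}
```

Here subscript_is_commutative() returns 1 and second_char() returns 'b'.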


Another way to look at it is that C syntax was designed to be extremely simple to parse, and C semantics to simplify code generation. Early C compilers immediately generated code as they parsed each expression, keeping minimal state. (No AST!) Also consider that in B the only data type was the machine word, so the type of the operands were irrelevant to the code you generated. In early C the biggest difference was structures, which required some minimal bookkeeping (very minimal when all members were in the same namespace), but a struct dereference is just syntactic sugar for an (address + offset) expression, so underneath the covers the compiler was still just chewing through identifiers, left to right, and emitting simple assembly for addition and multiplication, because each identifier was just a symbol for an integer.

So index[array] isn't an historical accident. It might not have been deliberate, but it follows naturally from the nature of the language.

Go very much follows the same discipline. Speed and simplicity of compilation constrain the syntax, most notably the lack of generics. Goroutines, channels, etc, only require minimal syntactic and compiler support. Contrast that with Rust--Rust front loads everything into the parsing phase--lifetimes, async, etc. Deep AST analysis and transformation is everything for Rust. Of course these days people abhor even the possibility of allowing something like index[array], so even a compiler like Go goes out of its way to disallow it.


I see no reason why this should be obvious; it's just a historical quirk of array indexing in C being a literal translation to pointer arithmetic. In C++, for example, there is no way to support this with the subscript operator.


> or since int is default type

Not since the 1999 ISO standard.

> int main(){ return linux > unix; }

A conforming compiler must diagnose "linux" and "unix" as undeclared identifiers. Many C compilers are not conforming by default.


I suspect the post is about GNU C rather than ISO C.


I wonder when there'll be the Obfuscated Rust Programming Contest.


No need for that, any Rust program is obfuscated.


We held an “Underhanded Rust” contest, which is similar but different, a few years back.


They might need an obfuscated compilation contest.


I like if (val && ~val) as an alternative to if(val) based on the fact that for non-zero ints ~val will also be non-zero.
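
A quick sketch of the trick (the function name is mine), assuming two's complement ints, which is what practically every current platform uses. It agrees with if (val) for most values, but the all-ones value -1 looks like an exception, since ~(-1) is 0:

```c
/* Returns 1 when both val and its bitwise complement are nonzero. */
int val_and_complement(int val)
{
    return val && ~val;
}
```

So val_and_complement(5) is 1 and val_and_complement(0) is 0 as expected, while val_and_complement(-1) is 0 even though -1 is a true value.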


That probably shows (once more) that the C language really needs an overhaul, with a stricter grammar disallowing that sort of trick.

Or maybe not.


This is a valid Rust program:

  fn main() {
      return return return return return return return
  }
You can do silly things in any language.


Indeed, anyone can also write code over-abusing lambdas in a functional language, but lambdas are also quite useful in the majority of cases. On the other hand, can you point out a single situation where swapping int and typedef in a type definition brings anything good?

Likewise, in your example, I doubt that this could bring anything. Is there a practical obfuscation method based on that quirk in Rust, or a reason for keeping it in the syntax? I cannot tell, but maybe you can?

Overall, the problem that I have with this is not that it is silly, it is that it makes it harder to understand and maintain. In some cases, it requires active engineering to fix the issue, but the language should be designed so that most of these problems are taken care of by design.

Also, many people seem to enjoy the fact that C can be bent that way. I don't mean to remove that from them, I just think that for system programming, it should be less permissive. Perhaps a 'strict mode' could be devised, not at the syntax level like in JavaScript (which I suppose couldn't be avoided), but as a compiler flag (like the C++ people did it).


One guy's useless tricks can be another's essential flexibility. Especially when you consider code generation.


> Especially when you consider code generation.

That's a very good point you are making here.

However, as I said elsewhere, I don't see the use of being able to either write

typedef int ...

or

int typedef ...

and I don't think that proper code generation would require it.


[flagged]


I don't know Forth, Perl, or even Rust enough to discuss them on that basis, but having other languages allowing it doesn't mean that C should be lenient about it either, or does it?

Now, don't get me wrong, I do not dislike C, I've practiced it a lot, and I come back to it from time to time when I need to. But I would rather use a language where many of these forms are forbidden for readability and maintainability.


I keep my code interesting, to make work more enjoyable for future maintainers, by keeping all my code ...

inline with_all_ColinIanKing_standards()


What does the provided example do?

5) Surprising math:

int x = 0xfffe+0x0001;


It doesn't compile.

It's trying to parse the e as scientific notation I think.

  surprising_math.c: In function ‘main’:
  surprising_math.c:4:10: error: invalid suffix "+0x0001" on integer constant
    int x = 0xfffe+0x0001;
            ^~~~~~~~~~~~~


Indeed, I think you're exactly right. Changing it to `0xffff+0x0001` lets it compile.


Doesn't that defeat the purpose then? It's no longer using scientific notation.


The initial intent was to add 2 hex values, but using 0xfffe tripped the parser.


It looks a bit like a binary exponent, but according to clang and gcc it's erroneous: 'invalid suffix '+0x0001' on integer constant'.


That is really stupid. How is that not a bug? I can't believe GCC won't compile it. What do other compilers do?

Edit: Clang also gives an error. MSVC seems to compile it. I wonder who follows the spec. I assume not MSVC ...


It's not a bug. 0xfffe+0x0001 satisfies the syntax of a "preprocessing number". In a later translation phase, preprocessing numbers are converted to numeric constants. Not all valid preprocessing numbers are valid numeric constants. The syntax of preprocessing numbers is deliberately permissive to avoid complicating the implementation of the preprocessor.

I'm sure it could have been made stricter, allowing 0xfffe+0x0001 to be treated as 0xfffe + 0x0001 -- but the solution is simply to write 0xfffe + 0x0001 in the first place. The language grammar has to be consistently defined even if it leads to surprises now and then.
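
A short sketch of the two forms that do compile (function names are hypothetical). In the preprocessing-number grammar a sign is only absorbed directly after e, E, p or P, so the problem is specific to hex constants that end in e:

```c
/* Whitespace ends the preprocessing number before the '+'. */
int hex_sum_spaced(void)
{
    return 0xfffe + 0x0001;
}

/* No whitespace needed here: the constant ends in 'd', and a sign
   is not part of a preprocessing number after that letter. */
int hex_sum_unspaced(void)
{
    return 0xfffd+0x0001;
}
```

The first returns 0xffff and the second 0xfffe.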


It produces an error message.

error: invalid suffix '+0x0001' on integer constant

[0] https://gcc.godbolt.org/z/pR5L6v


Thanks for sharing this!


how about this one?

  int main() {
    fork();
    printf("choo");
  }


What's the trick (besides calling functions after fork that you shouldn't be calling?)


I think parent wanted to write the following:

int main() { printf("choo"); fork(); }

Here, "choo" can be printed twice, even though we fork after printing. This results from line buffering: the flush happens after the process is forked. Essentially, the output buffer is copied when forking and therefore duplicated.


oh yes, that's what I meant! your example and mine print the same output.



