The C preprocessor is a horrendous way of doing metaprogramming that was implemented because it was relatively easy to do as a separate pass. There's a reason why very few other languages have done it this way.
A good knowledge of the preprocessor is essential for writing obfuscated and underhanded C. For example, the lucky7coin backdoor: https://github.com/alerj78/lucky7coin/issues/1
IMO, whether or not the C preprocessor is good depends on what you're trying to do and how you do it. I doubt there are any preprocessors or macro systems that can't be used to obfuscate code; that's basically the definition of what they do: modify your code before you compile it. Obviously, any strange or unexplained preprocessor usage should be examined and preferably removed.
The example you gave is not really fair, though, because it seems pretty obvious to me that nobody ever looked at that code; it hardly matters that they hid the backdoor in the C preprocessor. If you take a look at the repo, it only has three commits, with the first one (https://github.com/alerj78/lucky7coin/commit/07d7e5fc53e5673...) being a supposed import of the code from the repo it used to live in, and it's in this commit that the backdoor was inserted. The real issue is that people were running code from someone who appears to be a complete unknown, who has no history for his code, and they just assumed it was the same as the old code without checking.
Preprocessors are uniquely problematic in this regard, though, since they're just simple text-substitution engines. Things like templating (as in C++), in-language macros (as in Lisp variants), or language-level metaprogramming facilities (as in Ruby, Python, ...) all have access to actual entities in the language, which constrains their effects in a way that's safer and easier to reason about.
I'm not looking to deny that straight text substitution has its drawbacks; you're completely right. That said, I still don't see it as big a deal as it seems. Generally speaking, bad or malicious preprocessor abuse like this sticks out like a sore thumb when you're reviewing code, and if you don't have anybody reviewing the code then it doesn't really matter how the backdoor was disguised. At least with the C preprocessor, if there's something you're unsure about, you can run the preprocessor separately and look at the output, clearing up all doubt over what it does.
Also worth noting, one of the nicer things about the design of the C preprocessor is that it can be applied to a lot of different file types. In more complicated low-level C projects, you can run the preprocessor over your C code, assembly code, linker scripts, etc., which is a huge gain since you have access to all your constants and simple macros everywhere, simplifying work and reducing duplication. You can't get that with something tied to the language, which is unfortunate, because like you said it's better to avoid the preprocessor, since writing things at the language level makes them much easier to reason about.
It seems to me that you could create something at least this devious using C++ templates and operator overloading. I wouldn't bet against it in the other languages you mention, either.
This is so absurdly simple and yet devastating. Reading some of the comments on the Github issue you posted, this stood out (I don't know anything about lucky7coin):
> So disappointing such code was not reviewed by Vern and team before running it on the server where damage could result.
So this code was actually put into production somewhere at some point -- wow. And a cursory code review and compiling from source will do absolutely nothing here.
That often expands into hundreds of thousands of lines, though. It's more routine to go backdoor-hunting in binaries; you've given me the interesting idea of running `strings` on the binary and looking for anything that's not in the source.
Exactly. If the preprocessor respected file scope, or better yet namespace scope, it would be much, much better. The way it works now, there is no encapsulation: a preprocessor definition in one library header will inadvertently affect code in other libraries, depending on the order in which they happen to be included and compiled. It's such a mess it's embarrassing we still put up with it in the year 2015.
Oh, I have the sources. I haven't lost anything. Why I don't host that project anywhere is that I'm not all that proud of it.
I worked for one startup some years ago whose guys looked at that thing before they hired me and liked it.
MPP has namespaces, and it also tries to preserve whitespace (expansions occurring at some indentation level are indented). It could be used for Python, in theory.
Thinking along those lines, I posted to a Python newsgroup around then: reactions were mixed:
Interesting. I don't think I ever tried to use a macro in the conditional expression of a #if, except inside a defined() or undef(). From my time on the C committee, I recall that the preprocessor was a royal pain to get right. It has its own set of token rules that aren't the same as C itself, for example.
I am also reminded of the button I used to have that said, "Defining define is undefined."
This doesn't seem quite right. Did you maybe mean "#undef FOO" or "!defined(FOO)"? Whether BAR gets expanded or not, in your example it looks like it would always evaluate true. Or am I misunderstanding the ambiguity?
It might be telling that I also don't understand the Clang bug report as written. I think there are typos in the examples. Is the switch from "HAVE_FOO_BAR" to "HAVE_FOO" in the first example intentional? Is the construct "#defined" (with a final 'd') intentional in the second?
I'm not sure -- I'm going off what the Clang bug says. They have a spec ref in there. (Note that you can't trust your intuition on how compilers work, you can only trust the spec and experiments.)
If it helps any here is the real-world code where this problem came up:
#define FOO
#define BAR defined(FOO)
#if BAR
#error "true"
#else
#error "false"
#endif
Clang, GCC, and ICC evaluate the "true" branch, while MSVC evaluates the "false" branch.
For the same test code with first line changed to "undef":
#undef FOO
#define BAR defined(FOO)
#if BAR
#error "true"
#else
#error "false"
#endif
MSVC, Clang, GCC, and ICC all agree on "false".
Importantly, though, when used with "/Wall", MSVC gives this warning in both cases:
main.cpp(3): warning C4668: 'definedFOO' is not defined as
a preprocessor macro, replacing with '0' for '#if/#elif'
None of the other three compilers give any warnings, even with "-Wall -Wextra -pedantic". So there definitely is a difference in behavior, but I don't think it's actually the one that's presumed in that bug.
For further experimentation, Clang, GCC, and ICC can be tested online here: http://gcc.godbolt.org/
#2 is incorrect. Being sensitive to line breaks does not make a grammar context-sensitive. It just means you have to treat line breaks as tokens rather than ignorable whitespace (which is exactly what the context-free grammar given in the C11 standard does).
Same with the bit about concatenating tokens. Every single one of those examples has a static parse tree, which, for the C preprocessor, is a sequence of tokens and directives. The author seems to be confusing the preprocessor's parse tree with the effect it has on the underlying text.
(Yes, the output of the preprocessor is dependent on what you define, but that has nothing to do with the grammar. What the author claims is like saying a Lisp is context-sensitive because the factorial function produces different values for different inputs!)
Now, if you could do this:
#define foobar define
#foobar x 123
x
and get "123", that would be a context-sensitive grammar. But that is NOT a thing you can do!
I hate to say it, but I was rather unimpressed by this list, and nothing in it surprised me. While I certainly agree that the C preprocessor is a relic, and has not weathered the test of time well, I would suggest that a number of the supposed infelicities mentioned in this article stem from the misleading idea that the preprocessor is an integral part of the C language proper, when it is better thought of as its own language (and one that was traditionally done by a completely separate program). The preprocessor does things differently than the rest of C, because it's not C. It is a text-processing language of convenience, provided specifically for doing things that C itself cannot (or should not) do.
I've written a C preprocessor and I agree that the language standard documents are ambiguous and incomplete. The best I could do was hack on it until it matched GCC's preprocessor well enough to compile Linux.
I don't recall all the horrid details, but one case that I do remember driving me nuts was the use of #if/#endif in the argument to a function-like macro.
Has there been any notion of a replacement Meta/Macro language for C? Something open source. Of course pre-preprocessing one's files and the complexity that might add to the build system are unattractive but I'd still be interested if someone has attacked this problem.
Much of what makes 'C' annoying can be made less painful by referring to static/const struct tables/arrays. Those are a prime candidate for generation.
You don't have to keep the preprocessing of files as part of the mainline build, but there's something to be said for it - sort of "make GENERATE_ALL_THE_THINGS" might run the preprocessing { Python/Tcl/Perl/bash/even 'C' } scripts for you.
If the generators just emit .h files, that can be pretty good. You're still left with something #ifdef-ey to select them, based on #defines or -D options.
You might even go so far as to dynamically load these tables if that can make sense. The ld linker can directly link in blobs.
The module and template features (along with static if, if the committees figure out what to do in that area) in the newest versions of C++ together get pretty close to replacing the C preprocessor.
My school of thought would be to limit it to just #include, #if, #else, #endif, and non-recursive, single-word-only #define / #undef. Force everything to one directive per line, and call it a day.
Macros should always be the absolute last resort for doing anything. Stepping through code in gdb with some "creative" macro-based API is almost as bad as C++.
I've seen 'C' macros used to do template-ey things that probably improved the readability of the code (once you got used to the fact that the macros were there at all).
More modern compilers allow using non-static "const" constructs to do much of the same, which is a great improvement.
You'll forgive me if my knowledge is a bit dated. I largely work on tools integration these days and don't do much in the way of framework development, but aren't macros pretty much the de facto way of doing reflection/RTTI? Actually, almost every mature framework I've worked with in the past used macros to mark up class properties, register types, etc.
I'm 95% sure the last example in #3 is undefined behaviour. #(a b c) is not valid, so a compiler that evaluates it through multiple levels of indirection instead of erroring out probably has a bug.
And the last 3 or 4 are odd, but were needed for some of the hacks of the early days of C (and some are almost certainly still used in the Linux kernel source today).