Main is usually a function, so when is it not? (2015) (jroweboy.github.io)
224 points by ColinWright on June 14, 2021 | 63 comments



It can be both a string and a function:

         char main
         [/*x86*/]
       __attribute__
    ((section(".text"))
   )="WTYH)9Zj8_j7H)9]R"
  "H)9^\350\0\0\0\0H)1^R"
   "H))Z8<2u\366j<)9Xj9"
   ")9j9VY)<$[S_H\xbd^["
    "H$@\xcd\200-XP\xf"
    "\5XP_j<W\xeb\xe2]"
     "Hello World!\n";
      /*Linux_Only!*/


Now make it a quine too. ;-)


It's not hard to get the address of data in 32-bit addressing. You just interleave the data inside your assembly, something like the following (pseudocode; I haven't done this in a while):

    ...
      call continue
      .db "Hello World!\n\0"
    continue:
      pop eax
    ...
Since 'call' just turns into a 'push eip; jmp target' (simplified, sorry), the address of the string is now pushed onto the stack. After popping the top of the stack, eax contains the address of the string "Hello World!\n\0". Since the 32-bit ABI passes most parameters on the stack, you often don't even need to 'pop' the address off the stack; it will already be in place as one of the arguments to the function.

Old school malware used this a lot to 1) run regardless of the memory base address it was loaded at and 2) confuse some disassemblers (you can use silly conditionals that are always true or false to control whether you execute the 'call' instruction or not, forcing the disassembler to try to 'disassemble' the string into valid x86 opcodes).


More or less how Fortran works on PDP-11's.


Curious to hear more!


I'm guessing this is referring to the fact that early versions of Fortran stored return addresses in specific memory locations (at the end of the function definition IIRC) instead of on a call stack. This is why those versions of Fortran couldn't do recursion, because the new return address would overwrite the old one.


It seems weird now, but it was natural at the time. Point a register at the data following the call, and rely on the callee advancing over the argument and jumping back to the (presumed) code after the data. The stack isn't involved at all...

Edit: found this on the 'net:-

https://retrocomputing.stackexchange.com/questions/9328/does...


One of the winners of the 1st International Obfuscated C Code Contest (1984) used this technique.

https://www.ioccc.org/1984/mullender/mullender.c

https://www.ioccc.org/1984/mullender/hint.text

    short main[] = {
            277, 04735, -4129, 25, 0, 477, 1019, 0xbef, 0, 12800,
            -113, 21119, 0x52d7, -1006, -7151, 0, 0x4bc, 020004,
            14880, 10541, 2056, 04010, 4548, 3044, -6716, 0x9,
            4407, 6, 5568, 1, -30460, 0, 0x9, 5570, 512, -30419,
            0x7e82, 0760, 6, 0, 4, 02400, 15, 0, 4, 1280, 4, 0,
            4, 0, 0, 0, 0x8, 0, 4, 0, ',', 0, 12, 0, 4, 0, '#',
            0, 020, 0, 4, 0, 30, 0, 026, 0, 0x6176, 120, 25712,
            'p', 072163, 'r', 29303, 29801, 'e'
    };


That’s already mentioned in the third paragraph but it’s nice of you to also include the bytecode in the comment.

> Apparently in 1984, a strange program won the IOCCC where main was declared as a short main[] = {...} and somehow this did stuff and printed to the screen!


Barely mentioned, and the author makes us all die inside when they say "Too bad it was written for a whole different architecture and compiler so there is really no easy way for me to find out what it did."


As the hint explains, it's a combination of PDP-11 and VAX machine code, set up so that either system will run its own code and ignore the foreign code.

You can extract a few ASCII strings from the data. As the hint says: "Can you guess what is printed? We knew you couldn't! :-)"

The ASCII strings I found were "vax", "pdp", "str", "write", and " :-)".


So not only did they write something baffling, they also wrote it so that it was semi-portable.


I wonder what strange and interesting things in the computing world we’ve simply lost by forgetting and from moving on.


Some past threads:

Main is usually a function, so when is it not? (2015) - https://news.ycombinator.com/item?id=15206198 - Sept 2017 (65 comments)

Main is usually a function. So then when is it not? - https://news.ycombinator.com/item?id=12799637 - Oct 2016 (1 comment)

Main is usually a function – when is it not? - https://news.ycombinator.com/item?id=8951283 - Jan 2015 (60 comments)


Floats being more mysterious and intimidating than ints, I prefer:

    const float main[] = {
        -8.10373123e+22, 6.16571324e-43, 1.58918456e-40, -7.11823707e-31,
        5.81398733e-42, 1.26058568e-39, 6.72382769e-36, 2.17817833e-41,
        2.16139414e-29, 1.10873646e+27, 1.76400414e+14, 1.74467096e+22,
        -221.039566
    };


Better yet, try to find the corresponding ints (or maybe more realistically shorts or chars) from usual #include headers, and use the #define or const mnemonics for all numbers.

Bonus points for finding them all in the same header file, or with like names, so as to give the appearance of them actually meaning something in the context of the prank.


It doesn't translate the octal and hexadecimal constants into decimal, but you could get a first cut at that from

  cd /usr/include; egrep -r '#define.*[0-9]+$' . | sed 's/#define[\t ]//' | awk '{print $NF, $1}' | sort -n


Genuine question, can you be sure the conversion wouldn't introduce a wrong bit here or there? Maybe in a different architecture or something?

I'm not that good with CPUs past 16 bits, this is really out of my comfort zone heh


Unless I'm missing something, this code is already architecture dependent .. adding more architecture dependencies won't really hurt.


I think the format for single precision and double is defined by the standard. Beyond that may be implementation dependent.


The format for floating-point is specified by the IEEE floating-point standard (or whatever it's officially called these days). C permits but does not require IEEE format. Most implementations these days use it.


You only depend on the compiler to interpret these floats correctly and generate their binary representation that decodes into valid instructions. As far as the CPU executing this code is concerned, it's machine code either way.

> Maybe in a different architecture or something?

Of course this isn't portable across CPU architectures, neither is it portable across operating systems due to at least ABI differences.


Oh, I think you misunderstood me. Sorry.

I meant like, how can you be sure a compiler would interpret them correctly and give you the exact binary value you wanted.

And by different architectures, I meant more like, if you compiled it on Intel and AMD, could the results be different? Though I guess this part of the question makes no sense now that I think more on it.


Interestingly, I did this just the other day. I had a testing reason for main to consist of a single specific illegal instruction that I know the hex of anyway. It was less work for the system's Makefile to compile a .c rather than a .s file, and I knew everything I needed to make this trick work, but didn't know for sure how to disable function prologues for this arch.

It's the first time I had a legitimate excuse to whip out this technique since seeing it in an ancient obfuscated C contest entry for the PDP-11 probably a decade ago. mullender.c I think?


The C standard requires main to be defined as a function, but failure to do so is not a constraint violation, so no diagnostic is required. If you define it as something else, the behavior is undefined.

A conforming C compiler could reject a program that defines main as an array, but is not required to do so.

gcc doesn't complain by default, but with warnings enabled it says "warning: 'main' is usually a function".


Well, I guess technically they could expel him because he had someone do part of the assignment for him.


Is there any particular utility to this trick, or is it just a neat side-effect of the linker and compiler being very permissive and treating something that most languages would call a compilation error as merely a warning?


In C and C++, "main" is special. Too special. For historical reasons, its argument and return types are not checked.

I once argued on the C standard forum that a C compiler should not know about "main". "#include <unix.h>" should contain the usual Unix declaration for "main", and "#include <windows.h>" should contain the Windows declaration, which at the time was, roughly:

    int WINAPI wWinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, PWSTR pCmdLine, int nCmdShow);
It's then up to the user to define their startup function to match, with normal type checking.

This gets the compiler out of handling "main" as a special case.

This was generally considered to be the right answer, but would break too much existing code.


In Visual Studio, main() is not the entry point of the program.

The entry point is generated automatically by the compiler. It calls a few functions, depending on what the program does, and then calls main; I think this has to do with initializing the standard library. You can see the stub using a debugger or a disassembler.

It's possible to set the entry point to any function name. See the advanced project settings.

Now about the arguments and return type. With main, the caller is responsible for pushing arguments onto the stack before the call and popping them off the stack after the call. The return code is in the EAX register, if I remember correctly.

Because of that, the signature of main doesn't matter: the invocation will work regardless of the arguments.

People may ask what's the point of knowing any of this? One major use case is to write executable compressors like UPX. Another use case is to make a custom entry point written in assembler.


Not just in Visual Studio. main is usually (always?) not the entry point on unixoid systems either, that's much more likely to be _start, which calls main() down the line.

Nevertheless, main is treated specially by the compiler for the aforementioned historical reasons, for example to not warn/error out if it does not return a value despite the type clearly telling so.

Observe:

    % echo 'int main(void) { }' > foo.c; clang -c foo.c
    <no output>
    % echo 'int foo(void) { }' > foo.c; clang -c foo.c
    foo.c:1:17: warning: non-void function does not return a value [-Wreturn-type]
    int foo(void) { }
                    ^
    1 warning generated.

As you can see, clang is happy to ignore the missing return value for main(), but not for foo().


Also, just as execution doesn't start in main, it doesn't end with main either. In C, you can register `atexit` handlers. You can do that in C++, too. In C++, you can also have "user" code executed before & after main by virtue of static initialization & destruction.


There is no special treatment of main in the major C compilers, the only "magic" thing the compiler does is including the CRT startup object file in the link, which defines _start as a function ultimately calling main, and having the default linker script set the address of "_start" as the executable entry point.

You can pass -nostdlib to gcc to disable linking the CRT startup object (or use ld directly) and you can pass --default-script /dev/null to ld to disable the linker script.

There is no need to declare main or check arguments or return types since in C arguments are both pushed and popped by the caller and the language provides no typing guarantees and thus there is no problem in calling functions with mismatched argument or return type declarations.


Yes there is. I've demonstrated this in a sibling (or rather, cousin) comment, but in short, you can happily not return a value in main even if its type is "int main(void)". Try that with another function, and the compiler should at least warn. This might not be a special case of code generation, but it is a special case of error handling at least.


Not quite true: there’s the weird thing where gcc on i*86 will align the stack on entry to a function called main but not any other.

  $ gcc -m32 -O2 -fno-pie -fno-asynchronous-unwind-tables -fomit-frame-pointer -S -masm=intel -xc -o - -
  int foo(void); int main(void) { return foo(); }
  ^D
   .file "<stdin>"
   .intel_syntax noprefix
   .text
   .section .text.startup,"ax",@progbits
   .p2align 4
   .globl main
   .type main, @function
  main:
   push ebp
   mov ebp, esp
   and esp, -16
   call foo
   leave
   ret
   .size main, .-main
   .ident "GCC: (GNU) 11.1.0"
   .section .note.GNU-stack,"",@progbits
It doesn’t do that if you set the historical stack alignment, though (-mpreferred-stack-boundary=2), or if you name the function anything else but main (it even does a tail call). Presumably it’s trying to (somewhat) recover from the time when the GCC authors accidentally broke the SysV i386 ABI[1,2].

[1]: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838

[2]: https://stackoverflow.com/a/49397524


To nitpick: Windows also needs to provide a main function (why shouldn't it?).

But I agree that the fact that the standard allows for 2 different declarations of main in a language without polymorphism doesn't help.

Not to mention the whole "implementations are free to define extra entry points" part.


Ken Thompson wrote a regex engine which compiled (at runtime) regexes into data structures containing executable machine code, and invoked them (from C source) by jumping into the data i.e. treating its location as a function pointer. That's what's happening here except it's the start code inserted by the linker which is jumping into main.

So there's the utility, if you're hardcore enough to build machine code at runtime.

If you wanted to abuse main() particularly, I guess you've got argc and argv in registers, and your hand-compiled main 'function' could maybe have some self-modifying code?


I don't know that that would work, since if the code is generated at runtime it would live in .data and not .text. At least for the architecture being targeted, you aren't allowed to create executable code at runtime like that (note that the original poster had to declare his main array as const to be able to have it in the .text segment).


Coercing a data pointer into a function pointer is undefined behavior in standard C (they don't even need to be the same size), but at least on POSIX platforms the compiler must do the right thing because `dlsym` depends on it working. Generating and executing native code at runtime is not that special, mind; after all JIT compilers are ubiquitous these days!


Popularity of the no-execute NX bit significantly postdates Unix. As I recall, Microsoft only started flipping it on by default for the 64-bit Windows NT kernel, since so many preexisting 32-bit applications relied on self-modifying code.


Request: URL pointing to the code for this regex engine.


Neat side effect


Why does he add semicolons at the end of the assembly lines?


They're not necessary, so I suspect some combination of reflex, consistency for consistency's sake, or, possibly that they were added automatically by his editor/IDE at some point.


They allow you to put multiple instructions on the same line - in this case you must either have ';'s or '\n's in the string - having both doesn't break stuff, I guess it's more belt and braces


Not sure. I guess it shouldn't hurt anything though, since semicolon is the comment character.


Depends on the assembler; in some assemblers, ';' allows you to put multiple instructions on the same line.


Treating main as a function cost me marks on an exam back in 2013. I then asked the question on Stack Overflow and received some pretty interesting answers: https://stackoverflow.com/questions/10513633/why-can-a-java-...


Really cool post!

Can anyone explain why const makes the array executable? That was the most surprising bit for me.


The real effect he's trying to cause is to move the bytes for main from a R/W memory segment (like .data) to an eXecutable one (like .text).

The const keyword tells C that a certain variable should never be modified by code. Doing so would be undefined behaviour. GCC is free to implement whatever guarantees for memory meant for const variables. On some versions and under certain criteria, it would place const variables into .text so any access would cause a SIGSEGV. It can also achieve the same by putting it into other write protected memory, like .rodata (which is what the newest version of gcc prefers for the code in this article, making it no longer work). Why GCC chooses one over the other and why it would change over time are hard questions.

A more effective way would be to use __attribute__s (on gcc) or #pragma directives to specify that the bytes need to be in the .text segment. However, that ruins the magic a bit.


Time passed. How was the assignment graded?


If I was the TA in this class, I’d give it -5 (95 of 100) for “doesn’t compile on my 486, please write more portable code” just to screw with the student.


In my university, it would've probably received full points; reason being that people who pull such shenanigans usually don't see "hello world" as a challenge - assuming, of course, the author could explain it.


In Haskell, main is a value, not a function. This is sometimes overlooked because in Haskell "values" can be thought of as 0-parameter functions and functions can be used as values, so everything sort of blends together, but I find it gives some fascinating insight into how lazy programming changes things up.


I like Haskell as much as anyone, but this is pretty non-responsive. "main is usually a function" is a gcc warning, and the whole joke is that "usually" leaves some pretty broad questions. This post answers some of them.


In C (and C++), main is always a function with zero parameters, two parameters as described in detail, or some implementation-specific set of parameters. Anything else is undefined behaviour.

In other words, you can write `main` as anything you want, and it might do something when presented to a C compiler, and that might do something when linked through the system linker, but it's not C code.

Doing something that's by definition incorrect and having it maybe do something somewhere sometimes, or maybe not, is not really all that impressive when you think about it.


> Doing something that's by definition incorrect and having it maybe do something somewhere sometimes, or maybe not, is not really all that impressive when you think about it.

See I would have called this hacking, which here on Hacker News is its own special kind of impressive.


It's hacking, sure. Just like using a hex editor to enter machine code directly. It's not C code, though, despite what the author claims, and it's not showing some clever way to hack C code that makes the author look Klever. So go ahead and be impressed by something that isn't what it claims to be and doesn't do what it claims to do.


Of course it's C code. It parses, and compiles, giving undefined behavior.

If C wanted to insist that main is a function, it could. That could be in the standard, and then these clever hacks wouldn't compile.

So no, you're dead wrong about that claim. So I'll go ahead and not be impressed with your commentary.

Of course a standards compliant C compiler could give you nasal demons, or delete your root directory, so it's not the sort of thing one should push to prod on a Friday.

It's just a hack.


You could make main() a standard C function that merely calls this machine code hack function.


You could. But `main()` would still then be a function, no?


I took one CS class as an undergrad. On those occasions I showed up for class, I sat in the back row and wrote poetry. I think I only turned in one assignment as well. It was supposed to be a very simple infix calculator with 26 statically allocated variables. I thought that was boring so I ended up creating an algebraic solver that could use dynamically allocated variables of any length as long as they began with a letter and were all alphanumeric. The whole thing was done using cweb. The TA gave me partial credit saying that he couldn't understand any of what the code did (in his defense, the cweave output was 20+ pages long and I think my classmates' programs were all a couple pages of C) but that it appeared to work.


Well I can relate to that kind of story, although I got it mostly out of my system by college.

I took one CS class as well. I didn't read the syllabus carefully, and thus didn't realize that the lab was 10% of my grade, so I got a final score of 89%. I tried to argue that getting 99% of the rest of the course right should make up for the missing lab work, to no avail. This is how I ended up a chemistry major.

Anyway, I think you would get a lot out of this essay. I did at least.

http://www.marktarver.com/bipolar.html


Depends on the assignment, and can easily backfire.

When I was TA'ing the CS intro class, which in my University was actually using a functional programming language (ML), I received a bunch of "working" programs from students who already knew how to program, but not with functional programming languages. They would force an imperative style into ML, which was not what the assignment was, and kind of showed that they must not have paid attention at all.



