Size visualization of Go executables using D3 (poss.name)
125 points by knz42 on April 1, 2019 | 53 comments



> as discussed in my previous article [1], Go uses memory instead of registers to pass arguments and return values across function calls.

[1] https://science.raphael.poss.name/go-calling-convention-x86-...

This is very surprising for a language that targets somewhat high performance.

Looks like it's a 5-10% performance hit, but makes it easier to provide good backtrace information: https://github.com/golang/go/issues/18597
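
If you want to see it for yourself, the compiler will dump the generated assembly; here's a minimal sketch (the file and function names are mine, and this assumes an amd64 toolchain from before any register-based ABI work):

    // add.go -- dump the compiler output with:
    //
    //     go tool compile -S add.go
    //
    // In the output, the arguments and result of add show up as
    // stack-pointer offsets (e.g. a+8(SP)) rather than registers.
    package main

    //go:noinline
    func add(a, b int) int {
        return a + b
    }

    func main() {
        println(add(1, 2))
    }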


That is indeed very surprising. I think the fact that Go isn't dead-slow can be attributed mainly to the sheer speed of CPUs. Looking at that code reminds me of the early days of PC compilers, especially the output of the trivial x+y-z function. The complete lack of push/pop instructions also shows a massive defect in the understanding of how the x86 architecture is supposed to be used. The icache bloat of doing that is enormous.

So return values are passed via memory, on the stack, not in registers like in most standard x86-64 calling conventions for natively compiled languages.

Wow. That's "worse than cdecl" --- which, despite passing parameters on the stack, will at least use the accumulator (and high accumulator) for return values that fit.

but makes it easier to provide good backtrace information

This seems to be a common line of thought but it goes against my belief in how tools should create efficient code --- anything intended for debugging purposes only should have zero effect on the executable when not being used, and compilers should focus on generating the most efficient code. Debugging information goes in a separate file and there you can put as much detail as you want. Don't make code generation worse, improve the debugging tools instead. The code will spend far more time, across everyone who uses it, being run than debugged.


I wouldn't call Go dead slow, it really does depend on the use case. Go was originally designed for systems that don't require absolute max performance, it was focused on providing safety and a good standard library for building things.

You can see from the common use cases, original authors' backgrounds or issue discussions, that e.g. the target audience are not people who want maximum throughput in data processing systems (something I've been interested in). The serde libraries are dead slow (but correct), there is little to no assembly in this code (unlike in the crypto packages), you don't have any higher level access to intrinsics to build this yourself, there is no native (meaning -march) compilation (by design, for portability reasons) etc. etc.

If you try writing high performance Go, it often starts looking like C rather than Go (by avoiding channels, io.Reader, using unsafe etc). It's a shame, but oftentimes, it's your only option. Plus, you don't have clang/gcc developers to help speed up your code on a daily basis, you "only" have the Go team and contributors (yes, there is gccgo, but...).
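
To give a concrete (and slightly scary) sketch of what I mean by "looking like C": the classic zero-copy []byte-to-string conversion via unsafe, which skips the allocation and copy that string(b) would do, at the price of aliasing the underlying bytes:

    package main

    import (
        "fmt"
        "unsafe"
    )

    // bytesToString reinterprets the slice header as a string header.
    // The caller must never mutate b afterwards, or the "immutable"
    // string changes under your feet.
    func bytesToString(b []byte) string {
        return *(*string)(unsafe.Pointer(&b))
    }

    func main() {
        b := []byte("no copy here")
        fmt.Println(bytesToString(b))
    }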

All that considered, I like the language.


I wouldn't call Go dead slow

I'm not saying it is -- but rather, that they could get away with such sloppy code generation because CPUs are so fast now. I did mention that early in the history of the PC, pretty much all compilers were like that due to other constraints, and the difference between that and handwritten Asm was enormous.


Whoops, apologies, misread that one sentence.

Agreed - if one takes the performance on an absolute scale, it is usually sufficient (also the reason why people use Python or PHP, despite both being, relatively, quite slow).

It is only once people start comparing it to the next best thing and/or when they desire better performance that they realise that there is a lot of not-so-low-hanging fruit.


>Go was originally designed for systems that don't require absolute max performance

Originally it was supposed to be a "systems" language, and was expected to lure C++ programmers.


It is not as slow as it sounds.

Golang does heavy inlining and most of its performance comes from concurrency. Putting values on the stack actually makes concurrency easier.

That said, there is a proposal to do it in registers: https://github.com/golang/go/issues/18597 It is challenging and not a high priority. The current priority is to do more inlining.


> "It is not as slow as it sounds."

Yes, it is. 5%-10% in the very link you posted.

> "golang do heavy inline-ing"

No, it does not. See https://github.com/golang/go/issues/17566

> "most of its performance comes from concurrency"

That's just ridiculous. They didn't write all the crypto and math in assembly for nothing. Lumping a bunch of slow function calls onto several threads is not where performance magically comes from. And you are confusing concurrency and parallelism.


> Putting values on stack actually make concurrency easier.

Could you elaborate how?


> The complete lack of push/pop instructions also shows a massive defect in the understanding of how the x86 architecture is supposed to be used.

Push and pop aren't that great because they mutate the stack pointer for each value you push or pop instead of mutating it once. Usually you want to push or pop more than one argument at a time, or reuse outgoing stack space for multiple function calls, and in that case it's often better to do the stack pointer manipulation once and then use mov to move the values into place.

That said, passing all arguments and return values via the stack is totally silly, is a significant performance hit, and is a bug they really should fix.


Push and pop aren't that great because they mutate the stack pointer for each value you push or pop instead of mutating it once.

The push and pop instructions are specifically optimised; see the section titled "Stack Engine" here for a brief explanation:

https://en.wikipedia.org/wiki/Stack_register

Not only does using mov with a base+displacement addressing mode fail to take advantage of the specialised stack hardware and increase code size, it also means extra effective-address calculations for the ALU.

If you don't believe that, look at this:

https://lists.llvm.org/pipermail/llvm-dev/2014-December/0799...

When someone who works for Intel proposes changing LLVM to generate push instead of mov for calls, and even suggests push-pop pairs for memory-memory moves to save on register allocation (I've seen this trick before in handwritten Asm, but can't recall if ICC does it too --- it might), I think you had better listen to him, unless you believe it's some sort of a weird conspiracy to make other compilers worse so that Intel's looks better... but then again, ICC and MSVC (among others) use push too, and they're certainly not any slower for it.


> If you don't believe that, look at this:

The followups cite Agner's tables to show that push and pop are slower. Reg/mem move has reciprocal throughput 0.5 on Haswell, while pop has 1.

It's a code size win, but not really a speed win.


> That is indeed very surprising. I think the fact that Go isn't dead-slow can be attributed mainly to the sheer speed of CPUs.

Well, I suspect that in any reasonably complex C/C++ application (e.g. firefox) compilers do pass params via stack and probably they do so rather often. Even on x86_64.

/usr/lib/firefox/firefox objdump:

        a8b8:       48 83 ec 40             sub    $0x40,%rsp
        a8bc:       48 89 d3                mov    %rdx,%rbx
        a8bf:       48 89 f5                mov    %rsi,%rbp
        a8c2:       49 89 ff                mov    %rdi,%r15
        a8c5:       64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
        a8cc:       00 00 
        a8ce:       48 89 44 24 38          mov    %rax,0x38(%rsp)
        a8d3:       44 8b 76 10             mov    0x10(%rsi),%r14d
        a8d7:       44 8b 62 10             mov    0x10(%rdx),%r12d
        a8db:       48 89 74 24 20          mov    %rsi,0x20(%rsp)
        a8e0:       48 89 54 24 28          mov    %rdx,0x28(%rsp)
        a8e5:       c7 44 24 30 02 00 00    movl   $0x2,0x30(%rsp)
        a8ec:       00 
        a8ed:       48 8d 7c 24 20          lea    0x20(%rsp),%rdi
        a8f2:       e8 39 fd ff ff          callq  a630 <_ZN7mozilla11Compression3LZ417decompressPartialEPKcmPcmPm@@Base+0x2c0>
    --
      etc.
/usr/lib/firefox/libxul.so objdump:

      81cf5e:       48 89 44 24 30          mov    %rax,0x30(%rsp)
      81cf63:       4c 89 7c 24 58          mov    %r15,0x58(%rsp)
      81cf68:       4d 89 cf                mov    %r9,%r15
      81cf6b:       f2 44 0f 11 64 24 60    movsd  %xmm12,0x60(%rsp)
      81cf72:       e8 a9 00 00 00          callq  81d020 <mont_mulf_noconv@@xul66+0xa30>
    --
      etc.


Go is faster than most languages, so I'm not sure it's related to recent CPU speed, since every language is benchmarked against the same CPUs.

"The complete lack of push/pop instructions also shows a massive defect in the understanding of how the x86 architecture is supposed to be used."

I'm sure the Go team has very skilled people in that field, so I'm not sure what you mean.

You have to understand that the Go compiler does not look for the best optimized code, it's a balance between performance / compile speed and "debuggability".


> You have to understand that the Go compiler does not look for the best optimized code, it's a balance between performance / compile speed and "debuggability".

Passing arguments on the stack doesn't help "debuggability". There is a debugging advantage to not using registers at all (though there really shouldn't be; spilling all locals to stack for -g -O0 is only necessary due to design problems in LLVM). But there's no debugging advantage to having a calling convention that spills everything on the stack. If your debugging infrastructure can handle locals in registers at all, it can easily handle arguments in registers too.


>I'm sure the Go team has very skilled people in that field, so I'm not sure what you mean.

That's an argument from authority. The Go team has important industry contributors but hardly has experts who have kept up with 30+ years of PL research.


The team has world class compiler and GC experts. You should know better.


Who do you mean?

People who have written a successful language are not necessarily "world class compiler and GC experts", and it's even less correlated with "having kept up with the last 30+ years of PL research" as mentioned above.

Would you call e.g. Kernighan or Guido "world class compiler and GC experts", compared to people like Anders Hejlsberg, Lars Bak, Martin Odersky, Wirth, Simon Peyton Jones, Lattner, and co, who are experts devoted specifically to PL (and don't lack real-world accolades)?

Robert Griesemer might come close, but he doesn't appear to be the driving force.

Even more importantly, does Go strike you as the product of "PL expertise", or just a competent, pragmatic, if humble, compiler, the likes of which are many (and even more full featured, even from smaller teams)?

What I see Go having more is adoption and libs (probably due to Google's full-time dev support and branding). Not some PL upper hand over other languages like Rust, Crystal, Zig, Nim, etc.


What compiler from another small team has all of this:

- dead simple cross compiling

- multi platforms / arch ( there are many for Go: https://gist.github.com/asukakenji/f15ba7e588ac42795f421b48b... )

- no dependencies

- fast compilation

- good performance

- good debugging

- rock solid and production ready

I don't know any compiler from a smaller team that matches the Go one; it's one of the reasons why people use Go. Compiling things is fast and easy: from my Raspberry Pi I can compile a Win10 64-bit executable with 0 problems and 0 tools to install (ex: GOOS=windows GOARCH=amd64 go build .).


None of those are arguments for "PL experts" as none of those are specifically signs of superior PL expertise.

Multiple platforms for example are mainly about adoption (and not having much optimizations/assembly parts in the codebase, making it easier to port).

Fast compilation is something several languages manage. And being fast to compile because you don't do much (in the way of optimizations) is something most languages can manage.

"No dependencies" is also about adoption and resources (to replace popular dependencies). Nothing particular related to PL/compiler expertise about it.

Go never had a "good debugging" story, and it doesn't have the best performance either (e.g. compare with Rust, D, Crystal, etc). In certain areas like text processing it's even worse.


> none of those are specifically signs of superior PL expertise.

If none of the things on that list is a sign of superior PL expertise, one would have to ask whether PL expertise is actually worth very much.


Richard Hudson, Ian Lance Taylor, David Chase, Austin Clements. The team is incredibly skilled.

And of course Pike, Griesemer & Cox.

I think that they are well aware of the trade-offs they make. And I also think that it is quite arrogant to say otherwise just because your tastes are different.

Go ahead, please write the go haters' handbook. Or write rants saying 'go is obsolete'. And the future will say who has a viable legacy.


> anything intended for debugging purposes only should have zero effect on the executable when not being used

It's not always possible. And debugging is very important in production (be able to inspect memory usage, generate meaningful stack trace, etc.).


Do you have somewhere one could read up on why those really are such antipatterns?


The same author, Raphael Poss, has another post on it:

https://science.raphael.poss.name/go-calling-convention-x86-...

It's very extensive and informative.


>This is very surprising for a language that targets somewhat high performance.

In many cases Go has opted for ease of implementation over high performance.


Ease of compiler implementation specifically


> This is very surprising for a language that targets somewhat high performance.

> Looks like it's a 5-10% performance hit, but makes it easier to provide good backtrace information

Very interesting indeed.

Well, forcing your compiler to pass parameters via the stack is pretty simple and it's not so uncommon.

And probably the Go team just wanted to keep things simple. Otherwise they would need two parameter-passing implementations and sometimes even a mixed one - when the platform does not have enough registers (e.g. 6 CPU registers and a function which takes 6+ params), so some parameters would be passed via registers, some via the stack; or maybe all of them via the stack. And so on. So I am buying the "good backtrace" point.

For example, suppose I have a silly logging function my_func() [C code], which takes 8 params:

    #include <stdio.h>

    __attribute__ ((noinline))
    void my_func(const char *module, const char *func, const char *level,
                 int line, int B, int C,
                 const char *app, const char *session)
    {
        printf("%s:%s:%s:%d %d %s %s\n", module, func, level,
               line, B + C, app, session);
    }

    int main()
    {
        my_func("core", __func__, "error", 1, 2, 3, "a.out", "dummy");
        return 0;
    }
Let's compile it with gcc (-O2) for ARM and let's take a look at what main() does:

    000103d8 <main>:
       103d8: e52de004  push {lr}  ; (str lr, [sp, #-4]!)
       103dc: e3002630  movw r2, #1584 ; 0x630
       103e0: e3402001  movt r2, #1
       103e4: e24dd014  sub sp, sp, #20
       103e8: e3003638  movw r3, #1592 ; 0x638
       103ec: e3403001  movt r3, #1
       103f0: e3001600  movw r1, #1536 ; 0x600
       103f4: e3401001  movt r1, #1
       103f8: e58d200c  str r2, [sp, #12]
       103fc: e3000628  movw r0, #1576 ; 0x628
       10400: e3400001  movt r0, #1
       10404: e58d3008  str r3, [sp, #8]
       10408: e3a02003  mov r2, #3
       1040c: e3a03002  mov r3, #2
       10410: e58d2004  str r2, [sp, #4]
       10414: e58d3000  str r3, [sp]
       10418: e3002620  movw r2, #1568 ; 0x620
       1041c: e3402001  movt r2, #1
       10420: e3a03001  mov r3, #1
       10424: eb00004c  bl 1055c <my_func>
       10428: e3a00000  mov r0, #0
       1042c: e28dd014  add sp, sp, #20
       10430: e49df004  pop {pc}  ; (ldr pc, [sp], #4)

Looks like a bunch of stores to stack ptr: + 0 bytes; + 4 bytes; + 8 bytes; + 12 bytes.

[Edit: should have passed __LINE__ instead of hardcoded 1, but that doesn't change the assembly.]


Nobody says "uses memory", you call that the stack, and that was the dominant calling convention over the last decades. It's not dead-slow, it's just not as fast as the new fastcall conventions via registers. "memory" is usually the heap, which is really dead-slow compared to the stack.

There are many VMs which do the same; they have the advantage that the GC does not need to spill all registers to the heap just to find the roots. All roots are on the stack already. Basile Starynkevitch's GC from Qish e.g. does it like this and is one of the fastest GCs around. The stack is always in the cache, and the registers can be used for more locals or intermediate results.


I'm curious whether LLVM optimizes this away when using the Go front end. The last time I played with LLVM I was blown away by the types of things that it was capable of optimizing successfully.


> Go uses memory instead of registers to pass arguments and return values across function calls.

But I suppose not for inlined function calls (?)


Last I checked (which was a very long time ago, mind), Go's inliner was very conservative, and would only inline functions that were comprised of exactly one statement.


This is no longer the case, but I believe it is still quite conservative.
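
For anyone curious, you can ask the compiler to report its decisions with -gcflags=-m; a quick sketch (the function names here are just examples):

    // Build with: go build -gcflags=-m .
    // The compiler prints its inlining decisions, e.g. lines starting
    // with "can inline" or "cannot inline" plus the reason it gave up.
    package main

    func small(x int) int { return x*2 + 1 } // tiny leaf: typically inlinable

    func sum(n int) int { // contains a loop: typically reported as not inlinable
        s := 0
        for i := 0; i < n; i++ {
            s += small(i)
        }
        return s
    }

    func main() {
        println(sum(10))
    }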


Really interesting results and analysis, but small nit:

> there is about 70MB of source code currently in CockroachDB 19.1, and there was 50MB of source code in CockroachDB v1.0. The increase in source was just ~140%

That's a 40% increase, not 140%. This happens on all percentage calculations throughout the article.

That said, super interesting discovery.


corrected


Go minor releases make a surprising difference - you'll see a big change if compiling the same project with go-1.10.z vs go-1.12.z

https://github.com/golang/go/issues/27266

a very recent cause of pclntab getting huge is adding preemption safepoint info for every line/instruction range, and they're looking at alternatives:

https://github.com/golang/go/issues/24543


Another utility (cli based) for investigating go binary sizes is `goweight`:

https://github.com/jondot/goweight

Written about here:

https://medium.com/@jondot/a-story-of-a-fat-go-binary-20edc6...


The purpose of this data structure is to enable the Go runtime system to produce descriptive stack traces upon a crash or upon internal requests via the runtime.GetStack API.

In other words, the Go team decided to make executable files larger to save on initialization time.

Something about this whole thing just seems wrong. How often does (perhaps should) an application crash? How often does (again, perhaps should) it need to retrieve its own stack? ...and how much of the binary is being taken up just for that purpose?

Of size/performance trade-offs and use cases

Why is startup time even the question when the common-sense approach is to simply compress this rarely-used table and decompress it the first time it's used, not on every startup? That's assuming it's always absolutely necessary to have in the first place, since loading a huge executable isn't going to be fast anyway.
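
(Roughly the shape I have in mind, sketched as ordinary Go rather than anything the runtime actually does: keep the blob compressed in the binary and inflate it lazily on first use.)

    package pclnsketch // hypothetical package, not part of the runtime

    import (
        "bytes"
        "compress/gzip"
        "io/ioutil"
        "sync"
    )

    // compressedTable stands in for the rarely-used table; imagine the
    // linker writing it out gzip-compressed.
    var compressedTable []byte

    var (
        once  sync.Once
        table []byte
    )

    // lookup inflates the table the first time anything needs it, so
    // process startup pays nothing for it.
    func lookup() []byte {
        once.Do(func() {
            zr, err := gzip.NewReader(bytes.NewReader(compressedTable))
            if err != nil {
                panic(err)
            }
            defer zr.Close()
            table, err = ioutil.ReadAll(zr)
            if err != nil {
                panic(err)
            }
        })
        return table
    }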

I feel like this is a case of "the tail wagging the gopher".


Eh, it's a bit of a simplification I think. Certainly the pclntab is consulted in more situations than just application crash; for example, when logging, you can have a source line prepended to the beginning of the line, which certainly uses the pclntab. I would be pretty surprised if there weren't a lot of other cases where the pclntab is consulted.
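
(For reference: the standard log package's Lshortfile/Llongfile flags, and anything else built on runtime.Caller, resolve the caller's program counter to a file and line through that table. A minimal sketch:)

    package main

    import (
        "fmt"
        "runtime"
    )

    // whereAmI resolves its caller's program counter to file:line;
    // the lookup behind runtime.Caller goes through the pclntab.
    func whereAmI() string {
        _, file, line, ok := runtime.Caller(1)
        if !ok {
            return "unknown"
        }
        return fmt.Sprintf("%s:%d", file, line)
    }

    func main() {
        fmt.Println(whereAmI())
    }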


for example, when logging, you can have a source line prepended to the beginning of the line, which certainly uses the pclntab

It needs to go through the table to find which source line corresponds to the current instruction pointer? That's the only reason I can see for needing it, and a very roundabout way of getting information which is known at compile time and could simply be a constant wherever it's used, much like C has __FILE__ and __LINE__.


That's not enough for the stack traces. A crash in Go is way more informative than failing an assert in C. Whether the extra memory is worth it is a separate question.


> for example, when logging, you can have a source line prepended to the beginning of the line, which certainly uses the pclntab.

Naive question: Wouldn't that be trivial to hard code at compilation time within the logging code? Why would it need to go look that up from the pclntab?


In C you have the preprocessor that can be used to hardcode that. In Go plenty of things already rely on the pclntab and doing lookups is cheap so you just use the pclntab. Go is definitely a less kludgy language for being able to drop the need for a preprocessor, despite the consequences.


I don't get this argument, or rather, I don't get the apparent corollary that needing the pclntab is somehow connected to not using a preprocessor.

I'm probably missing your point, since one certainly doesn't need a preprocessor for line numbers in logging statements. Unless the language - like C - actually implements the include/import mechanism through a preprocessor. In which case the need for the dependency becomes rather self-evident.


>I'm probably missing your point, since one certainly doesn't need a preprocessor for line numbers in logging statements.

So, libc doesn't define anything like pclntab. About the closest you can really get is DWARF2 call frame information, which is generally treated as debug information (because, well, it _is_) and stripped in release builds. Further, not all platforms will use DWARF2, and the way it is embedded in binaries differs.

(Example: In Win32, when you are using MinGW, DWARF2 will be embedded as PE sections, which have 8 character long names; this is not long enough for the DWARF section names, so a special PE extension is used to specify the longer section names.)

What I'm trying to illustrate is, it's really, truly not possible to go from a PC value to a line number in a source file, in pure C, at runtime. However, of course, as you have no doubt noticed, everyone still manages to print out line numbers in C source code. And they do this using the preprocessor.

If you dig deep enough into pretty much any logging library that offers line numbers in C and C++, you will find a macro that passes through __FILE__ and __LINE__. And if you keep tracing, you will probably also find the 'function' you call to log is a macro that eventually calls this macro. As a quick example, here is one in glog, a logging library used at Google:

https://github.com/google/glog/blob/41f4bf9cbc3e8995d628b459...

Pantheios is another popular logging library, and it defines macros that can be used to control the prefix.

https://github.com/synesissoftware/Pantheios/blob/177dc5fcff...

If you look into your logging library, it is quite likely that you will find that at the end of the day, it boils down to the __FILE__ and __LINE__ macros. When all of the macros are computed for your log line, the preprocessor subs in a string literal and line number for the filename and the line number that the macro was originally invoked from.


An uncompressed table in the binary can be demand paged when needed instead of dirtying pages.

Yes, the binary will be bigger, meaning that more IO will be incurred when deploying the application e.g. in docker/k8s.

While that extra IO is surely an annoying thing, I'm not sure I'd trade it off for a huge increase in heap allocation right before getting a stacktrace. Since we're talking about containers, it's not uncommon for containers to be configured with tight memory limits, and having an OOM during that process could mask the real reason for a panic.

Would it be possible to have a more succinct data structure or even directly index the compressed table without having to decompress the whole thing?


This feels an awful lot like the stripping question: can someone explain why we need this? Logging is mentioned below but that can be solved by other means. I really have no clue how Go uses the pclntab, but it does sound like you could serve that separately, as is often done with debug symbols, especially if you offer a "one-true-binary" to customers.

It is not a practice I'm fond of while debugging, and it also sounds like that table could be swapped out most of the time.


Go actually walks its stack very, very frequently during calls to runtime.morestack, which happens whenever a routine needs more than the default stack size.


This does not need pclntab.


Would it be possible to pass a flag to go build to change runtime.pclntab to the pre-Go 1.2 implementation?

I don't actually have the author's use case though, I tend to build microservices in Go!


considering the number of flags for Go already, that seems the sensible option :)

even in docker-and-microservice-land, though, there's a cost to having an extra 10-50MB of executable to copy around the place... I'm nowhere near experienced enough in that to work out if that counteracts the gains on initialisation speed, though.


I put together a little demo looking at the impact of using fmt vs os for Hello, World.

https://github.com/allingeek/fmt-vs-os
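
(For the lazy, the comparison boils down to something like the program below versus the usual fmt.Println version; this is my own minimal reconstruction of the idea, not necessarily the repo's exact code. The os variant avoids importing fmt and its reflection-heavy formatting machinery.)

    // hello_os.go -- compare the binary size against the usual
    //   fmt.Println("Hello, World")
    // version, e.g.:
    //
    //     go build -o hello_os hello_os.go && ls -l hello_os
    //
    package main

    import "os"

    func main() {
        os.Stdout.WriteString("Hello, World\n")
    }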


Could we see a comparison with both `-ldflags '-s -w'` and `CGO_ENABLED=0`?

I feel like this would solve:

> the Go standard library is not well modularized; importing just one function (fmt.Println) pulls in about 300KB of code.



