That is indeed very surprising. I think the fact that Go isn't dead-slow can be attributed mainly to the sheer speed of CPUs. Looking at that code reminds me of the early days of PC compilers, especially the output of the trivial x+y-z function. The complete lack of push/pop instructions also shows a massive defect in the understanding of how the x86 architecture is supposed to be used. The icache bloat of doing that is enormous.
So return values are passed via memory, on the stack, not in registers like in most standard x86-64 calling conventions for natively compiled languages.
Wow. That's "worse than cdecl" --- which, despite passing parameters on the stack, will at least use the accumulator (and high accumulator) for return values that fit.
> but makes it easier to provide good backtrace information
This seems to be a common line of thought, but it goes against my belief about how tools should create efficient code: anything intended only for debugging should have zero effect on the executable when not in use, and compilers should focus on generating the most efficient code. Debugging information goes in a separate file, where you can put as much detail as you want. Don't make code generation worse; improve the debugging tools instead. The code will spend far more time, across everyone who uses it, being run than being debugged.
I wouldn't call Go dead slow; it really does depend on the use case. Go was originally designed for systems that don't require absolute maximum performance; it focused on providing safety and a good standard library for building things.
You can see from the common use cases, the original authors' backgrounds, and issue discussions that the target audience is not, e.g., people who want maximum throughput in data-processing systems (something I've been interested in). The serde libraries are dead slow (but correct), there is little to no assembly in this code (unlike in the crypto packages), you don't have any higher-level access to intrinsics to build this yourself, there is no native (meaning -march) compilation (by design, for portability reasons), etc.
If you try writing high-performance Go, it often starts looking like C rather than Go (avoiding channels and io.Reader, using unsafe, etc.). It's a shame, but oftentimes it's your only option. Plus, you don't have clang/gcc developers helping speed up your code on a daily basis; you "only" have the Go team and contributors (yes, there is gccgo, but...).
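As a sketch of what I mean (illustrative, not from any real codebase): a hot loop that parses integers straight out of a []byte, sidestepping io.Reader, bufio, strconv, and per-token allocations entirely:

    package main

    // sumLines accumulates newline-delimited unsigned integers directly
    // from a byte slice -- no io.Reader indirection, no allocations.
    // This is the "looks like C" style you end up with in hot paths.
    func sumLines(buf []byte) int64 {
        var total, cur int64
        for _, b := range buf {
            switch {
            case b >= '0' && b <= '9':
                cur = cur*10 + int64(b-'0')
            case b == '\n':
                total += cur
                cur = 0
            }
        }
        return total + cur
    }

    func main() {
        println(sumLines([]byte("10\n20\n30\n"))) // prints 60
    }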
I'm not saying it is -- but rather, that they could get away with such sloppy code generation because CPUs are so fast now. I did mention that early in the history of the PC, pretty much all compilers were like that due to other constraints, and the difference between that and handwritten Asm was enormous.
Agreed - if one takes the performance on an absolute scale, it is usually sufficient (also the reason why people use Python or PHP, despite both being relatively slow).
It is only once people start comparing it to the next best thing, or when they want better performance, that they realise there is a lot of not-so-low-hanging fruit.
Go does heavy inlining, and most of its performance comes from concurrency.
Putting values on the stack actually makes concurrency easier.
That said, there is a proposal to pass them in registers: https://github.com/golang/go/issues/18597
It is challenging and not a high priority. The current priority is doing more inlining.
> "most of its performance comes from concurrency"
That's just ridiculous. They didn't write all the crypto and math in assembly for nothing. Lumping a bunch of slow function calls onto several threads is not where performance magically comes from. And you are confusing concurrency with parallelism.
> The complete lack of push/pop instructions also shows a massive defect in the understanding of how the x86 architecture is supposed to be used.
Push and pop aren't that great because they mutate the stack pointer for each value you push or pop instead of mutating it once. Usually you want to push or pop more than one argument at a time, or reuse outgoing stack space for multiple function calls, and in that case it's often better to do the stack pointer manipulation once and then use mov to move the values into place.
That said, passing all arguments and return values via the stack is totally silly, is a significant performance hit, and is a bug they really should fix.
Using mov with a base+displacement addressing mode not only fails to take advantage of the specialised stack hardware and increases the code size, it also means extra effective-address calculations for the ALU.
When someone who works for Intel proposes changing LLVM to generate push instead of mov for calls, and even suggests push-pop pairs for memory-memory moves to save on register allocation (I've seen this trick before in handwritten Asm, but can't recall if ICC does it too; it might), I think you had better listen to him. Unless you believe it's some sort of weird conspiracy to make other compilers worse so that Intel's looks better... but then again, ICC and MSVC (among others) use push too, and they're certainly not any slower for it.
> That is indeed very surprising. I think the fact that Go isn't dead-slow can be attributed mainly to the sheer speed of CPUs.
Well, I suspect that in any reasonably complex C/C++ application (e.g. Firefox), compilers do pass parameters via the stack, and probably rather often, even on x86_64.
Go is faster than most languages, so I'm not sure it's related to recent CPU speeds, since every language is benchmarked on the same CPUs.
"The complete lack of push/pop instructions also shows a massive defect in the understanding of how the x86 architecture is supposed to be used."
I'm sure the Go team has very skilled people in that field, so I'm not sure what you mean.
You have to understand that the Go compiler does not look for the best optimized code, it's a balance between performance / compile speed and "debuggability".
> You have to understand that the Go compiler does not look for the best optimized code, it's a balance between performance / compile speed and "debuggability".
Passing arguments in registers doesn't hurt "debuggability". There is a debugging advantage to not using registers at all (though there really shouldn't be; spilling all locals to the stack for -g -O0 is only necessary due to design problems in LLVM). But there's no debugging advantage to a calling convention that spills everything to the stack: if your debugging infrastructure can handle locals in registers at all, it can easily handle arguments in registers too.
People who have written a successful language are not necessarily "world-class compiler and GC experts", and it's even less correlated with "having kept up with the last 30+ years of PL research", as mentioned above.
Would you call, e.g., Kernighan or Guido "world-class compiler and GC experts", compared to people like Anders Hejlsberg, Lars Bak, Martin Odersky, Wirth, Simon Peyton Jones, Lattner, and co, who are experts devoted to exactly PL (and don't lack real-world accolades)?
Robert Griesemer might come close, but he doesn't appear to be the driving force.
Even more importantly, does Go strike you as the product of "PL expertise", or just a competent, pragmatic, if humble, compiler, the likes of which are many (and even more full featured, even from smaller teams)?
What I see Go having more of is adoption and libs (probably due to Google's full-time dev support and branding), not some PL upper hand over other languages like Rust, Crystal, Zig, Nim, etc.
I don't know of any compiler from a smaller team that matches the Go one; it's one of the reasons people use Go. Compiling things is fast and easy: from my Raspberry Pi I can compile a 64-bit Windows 10 executable with zero problems and zero tools to install (e.g. GOOS=windows GOARCH=amd64 go build .).
None of those are arguments for "PL experts" as none of those are specifically signs of superior PL expertise.
Multiple platforms for example are mainly about adoption (and not having much optimizations/assembly parts in the codebase, making it easier to port).
Fast compilation is something several languages manage. And being fast to compile because you don't do much (in the way of optimizations) is something most languages can manage.
"No dependencies" is also about adoption and resources (to replace popular dependencies). Nothing particular related to PL/compiler expertise about it.
Go never had a "good debugging" story, and it doesn't have the best performance either (e.g. compare with Rust, D, Crystal, etc). In certain areas like text processing it's even worse.
Richard Hudson, Ian Lance Taylor, David Chase, Austin Clements. The team is incredibly skilled.
And of course Pike, Griesemer & Cox.
I think that they are well aware of the trade-offs they make. And I also think that it is quite arrogant to say otherwise just because your tastes are different.
Go ahead, please write the go haters' handbook. Or write rants saying 'go is obsolete'. And the future will say who has a viable legacy.
> This is very surprising for a language that targets somewhat high performance.
> Looks like it's a 5-10% performance hit, but makes it easier to provide good backtrace information
Very interesting indeed.
Well, forcing your compiler to pass parameters via the stack is pretty simple, and it's not so uncommon. And probably the Go team just wanted to keep things simple. Otherwise they would need two parameter-passing implementations, and sometimes even a mixed one for when the platform does not have enough registers (e.g. 6 CPU registers and a function which takes 6+ params), with some parameters passed via registers and some via the stack, or maybe all of them via the stack. And so on. So I am buying the "good backtrace" point.
For example, suppose I have a silly logging function my_func() [C code] which takes 8 params, along these lines:
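    #include <stdio.h>

    /* Illustrative sketch: 8 integer parameters exceed the 6 integer
     * argument registers of the SysV AMD64 ABI (rdi, rsi, rdx, rcx, r8, r9),
     * so a C compiler passes g and h on the stack even on x86-64. */
    void my_func(int a, int b, int c, int d, int e, int f, int g, int h)
    {
        printf("%d %d %d %d %d %d %d %d\n", a, b, c, d, e, f, g, h);
    }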
Nobody says "uses memory", you call that the stack, and that was the dominant calling convention over the last decades. It's not dead-slow, it's just not as fast as the new fastcall conventions via registers. "memory" is usually the heap, which is really dead-slow compared to the stack.
There are many VMs which do the same; they have the advantage that the GC does not need to spill all registers just to find the roots. All roots are on the stack already. Basile Starynkevitch's GC from Qish, e.g., does it like this and is one of the fastest GCs around. The stack is always in the cache, and the registers can be used for more locals or intermediate results.
I'm curious whether LLVM optimizes this away when using the Go front end. The last time I played with LLVM I was blown away by the types of things that it was capable of optimizing successfully.
Last I checked (which was a very long time ago, mind), Go's inliner was very conservative, and would only inline functions that were comprised of exactly one statement.
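If you want to check what the inliner does today, here's a quick sketch; build with -gcflags=-m and the compiler prints its inlining decisions:

    // inline.go -- see the inliner's decisions with:
    //   go build -gcflags=-m inline.go
    package main

    // A tiny leaf function: a classic candidate for inlining.
    func add(a, b int) int { return a + b }

    // Historically, containing a loop was enough to keep a function
    // out of the inliner's budget.
    func sum(xs []int) int {
        s := 0
        for _, x := range xs {
            s += x
        }
        return s
    }

    func main() {
        println(add(1, 2), sum([]int{1, 2, 3}))
    }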
Really interesting results and analysis, but small nit:
> there is about 70MB of source code currently in CockroachDB 19.1, and there was 50MB of source code in CockroachDB v1.0. The increase in source was just ~140%
That's a 40% increase, not 140%: (70 - 50) / 50 = 40%. (70 / 50 = 140% is the ratio of new to old, not the increase.) The same slip appears in all the percentage calculations throughout the article.
The purpose of this data structure is to enable the Go runtime system to produce descriptive stack traces upon a crash or upon internal requests via the runtime.GetStack API.
In other words, the Go team decided to make executable files larger to save on initialization time.
Something about this whole thing just seems wrong. How often does (perhaps should) an application crash? How often does (again, perhaps should) it need to retrieve its own stack? ...and how much of the binary is being taken up just for that purpose?
Of size/performance trade-offs and use cases
Why is startup time even the question when the common-sense approach is to simply compress this rarely-used table and decompress it upon the first time it's used, not upon every startup? That's assuming it is always absolutely necessary to have in the first place, since loading a huge executable isn't going to be fast anyway.
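A minimal sketch of that idea, with compressedTab as a hypothetical stand-in for the table as embedded in the binary, inflated at most once and only on first use:

    package pcln

    import (
        "bytes"
        "compress/gzip"
        "io"
        "sync"
    )

    // compressedTab is a hypothetical placeholder for the table as it
    // would be stored, compressed, in the binary.
    var compressedTab []byte

    var (
        once sync.Once
        tab  []byte
    )

    // lineTable inflates the table the first time anything asks for it,
    // so startup pays nothing and the common no-crash case pays nothing.
    func lineTable() []byte {
        once.Do(func() {
            zr, err := gzip.NewReader(bytes.NewReader(compressedTab))
            if err != nil {
                panic(err)
            }
            defer zr.Close()
            tab, err = io.ReadAll(zr)
            if err != nil {
                panic(err)
            }
        })
        return tab
    }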
I feel like this is a case of "the tail wagging the gopher".
Eh, it's a bit of a simplification I think. Certainly the pclntab is consulted in more situations than just application crash; for example, when logging, you can have a source line prepended to the beginning of the line, which certainly uses the pclntab. I would be pretty surprised if there weren't a lot of other cases where the pclntab is consulted.
> for example, when logging, you can have a source line prepended to the beginning of the line, which certainly uses the pclntab
It needs to go through the table to find which source line corresponds to the current instruction pointer? That's the only reason I can see for needing it, and it's a very roundabout way of getting information which is known at compile time and could simply be a constant wherever it's used, much like C's __FILE__ and __LINE__.
That's not enough for the stack traces. A crash in Go is way more informative than failing an assert in C. Whether the extra memory is worth it is a separate question.
> for example, when logging, you can have a source line prepended to the beginning of the line, which certainly uses the pclntab.
Naive question: Wouldn't that be trivial to hard code at compilation time within the logging code? Why would it need to go look that up from the pclntab?
In C you have the preprocessor that can be used to hardcode that. In Go plenty of things already rely on the pclntab and doing lookups is cheap so you just use the pclntab. Go is definitely a less kludgy language for being able to drop the need for a preprocessor, despite the consequences.
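For instance, the standard log package's Lshortfile flag ends up calling runtime.Caller, which is exactly a pclntab lookup from the current program counter. A minimal sketch:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // runtime.Caller maps a program counter back to a file:line
        // pair at runtime -- this is the pclntab lookup in question.
        _, file, line, ok := runtime.Caller(0)
        if ok {
            fmt.Printf("logged at %s:%d\n", file, line)
        }
    }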
I don't get this argument; or rather, I don't get the apparent corollary that needing the pclntab is somehow connected to not using a preprocessor.
I'm probably missing your point, since one certainly doesn't need a preprocessor for line numbers in logging statements. Unless the language, like C, actually implements the include/import mechanism through a preprocessor, in which case the need for the dependency becomes rather self-evident.
> I'm probably missing your point, since one certainly doesn't need a preprocessor for line numbers in logging statements.
So, libc doesn't define anything like pclntab. About the closest you can really get is DWARF2 call frame information, which is generally treated as debug information (because, well, it _is_) and stripped in release builds. Further, not all platforms will use DWARF2, and the way it is embedded in binaries differs.
(Example: In Win32, when you are using MinGW, DWARF2 will be embedded as PE sections, which have 8 character long names; this is not long enough for the DWARF section names, so a special PE extension is used to specify the longer section names.)
What I'm trying to illustrate is that it's really, truly not possible to go from a PC value to a line number in a source file, in pure C, at runtime. However, of course, as you have no doubt noticed, everyone still manages to print out line numbers in C source code. And they do this using the preprocessor.
If you dig deep enough into pretty much any logging library that offers line numbers in C and C++, you will find a macro that passes through __FILE__ and __LINE__. And if you keep tracing, you will probably also find the 'function' you call to log is a macro that eventually calls this macro. As a quick example, here is one in glog, a logging library used at Google:
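(From memory, simplified rather than verbatim glog source, the pattern boils down to this:)

    #include <stdio.h>

    /* Simplified sketch of the pattern (not the verbatim glog source):
     * because LOG_INFO is a macro, __FILE__ and __LINE__ expand at the
     * call site, baking the file name and line number in as compile-time
     * constants -- no runtime table lookup required. */
    #define LOG_INFO(msg) \
        fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, (msg))

    int main(void)
    {
        LOG_INFO("hello");  /* prints something like "log.c:13: hello" */
        return 0;
    }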
If you look into your logging library, it is quite likely that you will find that at the end of the day, it boils down to the __FILE__ and __LINE__ macros. When all of the macros are computed for your log line, the preprocessor subs in a string literal and line number for the filename and the line number that the macro was originally invoked from.
An uncompressed table in the binary can be demand paged when needed instead of dirtying pages.
Yes, the binary will be bigger, meaning more IO when deploying the application, e.g. in docker/k8s.
While that extra IO is surely annoying, I'm not sure I'd trade it for a huge heap allocation right before getting a stack trace. Since we're talking about containers, it's not uncommon for them to be configured with tight memory limits, and an OOM during that process could mask the real reason for a panic.
Would it be possible to have a more succinct data structure or even directly index the compressed table without having to decompress the whole thing?
This feels an awful lot like stripping; can someone explain why we need this? Logging is mentioned below, but that can be solved by other means. I really have no clue how Go uses the pclntab, but it does sound like you could serve it separately, as is often done with debug symbols, especially if you offer a "one true binary" to customers.
It is not a practice I'm fond of while debugging, and it also sounds like that table could be swapped out most of the time.
Go actually walks its stack very, very frequently during calls to runtime.morestack, which happens whenever a goroutine needs more stack than it currently has.
considering the number of flags Go already has, that seems the sensible option :)
even in docker-and-microservice-land, though, there's a cost to having an extra 10-50MB of executable to copy around the place... I'm nowhere near experienced enough in that area to work out whether that counteracts the gains in initialisation speed.
[1] https://science.raphael.poss.name/go-calling-convention-x86-...
This is very surprising for a language that targets somewhat high performance.
Looks like it's a 5-10% performance hit, but makes it easier to provide good backtrace information: https://github.com/golang/go/issues/18597