Increasing the D Compiler Speed by Over 75% (drdobbs.com)
174 points by andralex on July 25, 2013 | 68 comments



Time to stop guessing where the speed problems were, and start instrumenting. Time to trot out gprof, the GNU profiler. I fired it up on the slow example, and waited. And waited, and waited, and waited. I waited overnight. It pole-axed my Ubuntu box; I had to cold boot it. The test case was so big that it plus gprof was too much. gprof slows things down a lot, so I had to cut things way down to get it to work.

It's been a long time since I've used gprof. I switched to Valgrind and OProfile about 10 years ago, and more recently to 'perf' and 'likwid'. If the goal is finding hot-spots, these last might be more convenient since they run with minimal overhead --- a couple percent rather than 100x.

Are there benefits to gprof that I've forgotten?

Are there newer and better profiling tools I don't know about?


As I mentioned elsewhere, I like gprof because it reports on fan-in and fan-out. This gives a picture of how the call tree is executing, rather than just the individual functions.


Maybe I misunderstand what you need, but generating a call graph and showing what percentage of time is spent in a function by caller is possible with all of these. Example for perf: http://lwn.net/Articles/340010/


I didn't know it could do that. Thanks for the info.


VTune and Zoom need to be mentioned, although they are not free. VTune is even able to do power consumption analysis these days.


If appropriate for your use, VTune for Linux (and the rest of the Intel Compiler Suite) is available for free under a "Non-Commercial Software Development" license.

http://software.intel.com/en-us/non-commercial-software-deve...


> VTune is even able to do power consumption analysis these days.

...on AMD machines, too? (Not to mention mobile development on ARM.)


As far as I understand, having not used VTune, some basic features work on both, but the interesting stuff relies on Intel-specific counters in the CPU.

This doesn't mean it's worthless, though: Finding your performance bottlenecks on Intel will surely be at least a first-order approximation of the performance on AMD, too.

ARM is a little bit of a different story since the memory model is different, and you might be getting killed by something like unaligned accesses, but that doesn't mean the information is worthless; it just means you should probably use more than one tool. After all, not everything is a nail, but hammers are still good tools.


If your software runs on OS X, try Instruments! It's free (comes with Xcode) and has an Apple-quality UI.


This article chronicles some fun I had boosting DMD's speed by doing some simple changes.


The idea of never free()ing and then taking advantage of that with a dumb allocator to get better performance is pretty clever. I wish I could do that with my compiler; sadly I can't let it leak since I invoke it from unit/functional tests so the test runner would run out of memory and explode :( For DMD tests do you just eat the cost of a process setup and compiler startup for every test run?


I do the related small allocations from pools, and the pools are destroyed at the end of every compilation. If the pool block is for example 32K and the average element is e.g. 32 bytes, that's about 1,000 times fewer "real" mallocs. Having separate pools for the members of different "structures" also makes for better CPU cache usage.

Edit: as aaronblohowiak writes, this method is commonly known as region-based memory management: http://en.wikipedia.org/wiki/Region-based_memory_management
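
Roughly, a minimal sketch of that kind of bump-the-pointer region (illustrative only, not the actual dmd allocator):

  // Minimal sketch of a bump-the-pointer region allocator; illustrative
  // only, not dmd's actual allocator.
  #include <cstddef>
  #include <cstdlib>
  #include <vector>

  class Region {
      enum { CHUNK = 32 * 1024 };              // e.g. 32K per pool block
      std::vector<char*> chunks;               // every block we ever grabbed
      char *p = nullptr;                       // bump pointer
      std::size_t left = 0;                    // bytes left in current block
  public:
      void *alloc(std::size_t n) {
          n = (n + 15) & ~std::size_t(15);     // keep allocations aligned
          if (n > left) {                      // current block exhausted
              std::size_t sz = n > CHUNK ? n : CHUNK;
              p = static_cast<char*>(std::malloc(sz));   // error handling omitted
              chunks.push_back(p);
              left = sz;
          }
          void *r = p;
          p += n;
          left -= n;
          return r;                            // note: no per-object free()
      }
      ~Region() {                              // one cheap teardown per compilation
          for (char *c : chunks)
              std::free(c);
      }
  };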


The compiler is a batch tool, so restarting the process for every run is normal usage.


I've taken to not bothering to free anything in batch tools if it's not super-obvious what to do. Objects with a simple life time (including memory owned by a stack-local object such as a std::vector, etc.) get freed; everything else just leaks. You'd think this would cause masses of problems, but it doesn't. It's easy to imagine data that would be too large, but it seems to be rarer in practice than you might think.

If the alternative is something like a handle-based or smart pointer system, then you'll reap the benefits in terms of ease of debugging on a daily basis.


A simple region allocator could do that and then you just pay for one free() at the end of the unit test.


> The idea of never free()ing and then taking advantage of that with a dumb allocator to get better performance is pretty clever.

Ken Thompson's C compiler does this.

http://plan9.bell-labs.com/sources/plan9/sys/src/cmd/cc/comp...

http://plan9.bell-labs.com/sources/plan9/sys/src/cmd/cc/lex.... (near the end)

http://plan9.bell-labs.com/sources/plan9/sys/src/cmd/cc/macb... (near the end)


> The idea of never free()ing and then taking advantage of that with a dumb allocator to get better performance is pretty clever.

Doesn't Erlang do something like this with processes?

What about doing scratch heaps for data with limited lifetime? Especially if you have a way of somehow reasonably predicting how much heap you're likely going to need (perhaps a heuristic obtained with a bit of machine learning?). Allocate a bit more to give yourself some headroom, use a dumb allocator, and if you run out of space, allocate an emergency area. Obviously, in most cases you won't have to do that, though. (If you're obsessed about speed, you can allocate just blindly and use a memory fault handler to detect that you've run out of space. :-) CLISP does that and it seems to work. Look at GNU libsigsegv.)


In the per-test teardown you could reset the pointer to the start, to reuse the same memory repeatedly?


Why did you use gprof instead of perf? In my experience gprof gives very distorted information because of the instrumenting code it adds to your program. Perf is much more accurate.


gprof gives the "fan in" and "fan out". As important as knowing how much time each function takes is who is calling it. One may be able to prune away the need to call it.


Call graphs are pretty standard across most performance tools ...


You could keep a lookup table of the reciprocals ("one over" values) as well as the hash table sizes?

Also, it would be interesting to know whether a simple bit-shift (power-of-two) hash table would be faster for a compiler use case anyway.
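
By a bit-shift table I mean something like this (a hypothetical sketch, not dmd's hash table):

  // Hypothetical power-of-two table: index by masking instead of dividing.
  #include <cstddef>
  #include <cstdint>

  struct Pow2Table {
      std::size_t mask;               // table size minus one; size is a power of two

      std::size_t indexOf(std::uint32_t hash) const {
          return hash & mask;         // a single AND instead of a divide/modulo
      }
  };

  // The general case compiles to an integer divide unless the divisor
  // is a compile-time constant:
  inline std::size_t indexOfModulo(std::uint32_t hash, std::size_t dim) {
      return hash % dim;
  }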


On what version of Windows did you perform the measurements?


7


Have you considered or tested using either closed hashing or linear array lookups as a replacement for your linked-list open hashing implementation? Years ago I significantly improved the speed of a color quantization operation that several other engineers had already optimized, by replacing it with a simpler closed hashing algorithm straight out of Knuth. More recently I've had success for small collections using arrays and performing linear search. This technique is used in Redis (see http://redis.io/topics/memory-optimization).
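
For small collections the linear version can be as simple as this (an illustrative sketch, not the actual code I used):

  // Illustrative sketch: linear search over a small contiguous array, which
  // often beats a hash lookup for a handful of entries because the whole
  // scan stays in cache.
  #include <cstring>
  #include <vector>

  struct SmallMap {
      struct Entry { const char *key; int value; };
      std::vector<Entry> entries;                 // kept small, scanned linearly

      int *find(const char *key) {
          for (Entry &e : entries)
              if (std::strcmp(e.key, key) == 0)
                  return &e.value;
          return nullptr;
      }

      void insert(const char *key, int value) {
          if (int *v = find(key)) { *v = value; return; }
          entries.push_back({key, value});
      }
  };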


As soon as pools are used (see my other comment here), chaining is much faster than storing all the elements in the table behind the hash -- you can use a simpler hash function and have better performance even when the table is relatively full.


I haven't spent much time looking into cache effects in the compiler's internal data structures, that gold hasn't been mined yet.


Some ideas (at least on x86):

- alignment greater than 16 bytes, e.g. 128 bytes for isolated buffers (a sketch follows below).

- the hardware prefetcher likes to load cache lines around the memory actually accessed, just in case. So data chunks that will be accessed at the same time should be near each other, to save cache usage a bit.

- memory access which does not have a simple pattern is slower than access which is contiguous or has a simple stride.
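
For the first point, assuming a POSIX system (a sketch, not anything from dmd):

  // Sketch: put an isolated buffer on a 128-byte boundary so it does not
  // share cache lines (or the adjacent prefetched line) with unrelated data.
  // posix_memalign is POSIX; use _aligned_malloc on Windows.
  #include <cstddef>
  #include <cstdlib>

  void *alloc_isolated(std::size_t size) {
      void *p = nullptr;
      if (posix_memalign(&p, 128, size) != 0)   // error handling kept minimal
          return nullptr;
      return p;                                 // release with free() as usual
  }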


I just wanted to say that when allocations are very cheap and kept "near" each other, keeping lists of elements linked from the hash array (http://en.wikipedia.org/wiki/Hash_table#Separate_chaining), instead of keeping all the elements in the hash array itself (http://en.wikipedia.org/wiki/Hash_table#Open_addressing), makes the hash table faster when the hash functions are simple and fast, as the number of collisions inside the array drops. I don't know if that would be applicable for dmd.


A lot of people underestimate the performance impact of malloc(). It is dog slow. In addition, if you use a poor malloc() implementation heavily with varying-sized data, you can easily end up using more memory than if you had used a copying GC!


Not only is it slow, many malloc() implementations (like even in glibc, I think) take a global lock, so they are prone to contention when called concurrently. Other mallocs (like tcmalloc and Hoard) satisfy small allocations from a thread-local pool.


glibc's malloc has actually been decent with regards to concurrent memory allocation for a while now, in my experience. I googled around, and found someone talking about glibc's malloc not using a global lock: http://siddhesh.in/journal/2012/10/24/malloc-per-thread-aren...


Thanks for the info; good to know that glibc has improved in this regard. I'd correct my comment but it's no longer editable. :/


I've seen a number of compiler "benchmark" programs that inadvertently measured the performance of malloc, not the generated code.


Very interesting. Do you have any URLs or papers that go into more depth regarding your statement?


malloc() hasn't been a subject of interest to academics in many decades. This knowledge pre-dates the web in some cases. Do a web search for the replacement allocator used in Firefox. You'll probably find some good discussion there.

Plus, every OS, C runtime, and usage pattern is different. YMMV massively depending on everything.


Concurrent memory allocation had some interest in the past decade - I did work in it, as did some others.


I wondered why you don't store the reciprocal w/ the hash table object. Obviously it wastes some space, but it wouldn't be any slower than your specific checks for 4 and 31, I think. (If most of the tables have size 4 or 31, then I'd use your code).


It's not necessary since the set of possible divisors is known in advance.

Also, "multiplying by the reciprocal" is a bit simplistic - there's some other instructions added in based on the specific divisor value. Adding more tests and branches for these likely would not pay off.


Yeah, I looked up the method and agree that if most of your tables are small, it's worth those simple ifs.

I only meant to store it in addition because the alternative, storing an index into the list of divisors (and thus reciprocals), might be slower due to an extra indirection.


Changing the modulus to use known constants is an awesome trick! Great read


I reverted that part in dmd once. Same speed on my Intel i7 processor. Branching vs division is probably a tricky tradeoff.


It's worth checking the generated assembler to see if the optimization actually took place in your build.

Note that you may be using an older dmc which did not do the divide optimization.


Walter, could you explain why lexing was a bottleneck? That's very surprising to me. You don't re-lex template instantiations do you?


Lexing has been a bottleneck in every compiler I've built. The only answer I have is that ASTs are a lot smaller than source code.

Templates are stored as ASTs. They are not re-lexed.


Do you consider lexing to be only "reading the file, finding out whether a sequence of characters is a keyword, a literal, or a comment", or something more? I admit I lex source which is already in memory, and there lexing takes less CPU time than all the other processing. Also, in my case the source is always smaller in memory than any AST.


Lexing is converting the source text into a token stream.

If lexing takes relatively less time for you, perhaps you have bottlenecks in the later passes?


I'm just having a hard time understanding how you could be having complaints about the compile times if you have a fast lexer and that lexer is a significant percentage of the total time. Can't your lexer handle a million lines of code in a few seconds? How big are these code bases?


It does handle a million lines in a few seconds - it's just that the rest of the compiler's work goes even faster.

The top cycle sucker is Lexer::scan(). Here's the source:

https://github.com/D-Programming-Language/dmd/blob/master/sr...

See line 440. It's entirely possible I've missed something glaringly obvious - have a go at it and see if you spot anything.

Oh, and "complaints" is a relative term. DMD is incredibly fast at compiling compared to other compilers - it's just that I want it to go even faster. Anything less than instantaneous I regard as "needs improvement." When D gets a design win at a company, I'll often ask what put D out in front of the competition. "Compile speed" is usually mentioned. Compile speed has a huge effect on programmer productivity.


Note: I like the way you think!

You might eliminate a few per-token function calls and some cache thrashing by making it less of a 'pull' tokenizer interface. I.e., instead of the (elegant) approach of having Lexer::scan(&t) scan and produce just one token, scan a batch of them at once and fill a contiguous array.
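
A toy sketch of what I mean (the Token and Lexer here are stand-ins, not dmd's real types):

  #include <cctype>
  #include <cstddef>

  struct Token { const char *start; std::size_t len; };

  struct Lexer {
      const char *p;                               // current position in the source

      // Scan one whitespace-separated token (stand-in for the real scan()).
      bool scanOne(Token &t) {
          while (std::isspace((unsigned char)*p)) ++p;
          if (*p == '\0') return false;
          t.start = p;
          while (*p != '\0' && !std::isspace((unsigned char)*p)) ++p;
          t.len = std::size_t(p - t.start);
          return true;
      }

      // Fill a contiguous array with up to n tokens per call, so the parser
      // pulls tokens out of the array instead of calling scan() once per token.
      std::size_t scanBatch(Token *out, std::size_t n) {
          std::size_t i = 0;
          while (i < n && scanOne(out[i]))
              ++i;
          return i;
      }
  };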

Also, your Token class looks like it might be 7 or 8 pointers big. Seems like that could probably slim down a bit.


Looks like the same kind of problem that plagues interpreter loops: that top-level switch statement is causing a branch misprediction on almost every hit.

The solution I've seen was to have each opcode handler determine what the next handler would be and jump there directly.
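
E.g., with the GCC/Clang "labels as values" extension (a toy bytecode sketch, nothing dmd-specific):

  // Threaded dispatch: each handler does its own indirect jump to the next
  // handler, so the branch predictor gets one branch per handler instead of
  // one shared switch branch.
  #include <cstdio>

  int run(const unsigned char *code) {
      static void *handlers[] = { &&op_inc, &&op_dec, &&op_halt };
      int acc = 0;
      const unsigned char *ip = code;
      goto *handlers[*ip++];            // dispatch the first opcode

  op_inc:
      ++acc;
      goto *handlers[*ip++];            // each handler dispatches the next one
  op_dec:
      --acc;
      goto *handlers[*ip++];
  op_halt:
      return acc;
  }

  int main() {
      const unsigned char program[] = { 0, 0, 1, 2 };   // inc, inc, dec, halt
      std::printf("%d\n", run(program));                // prints 1
      return 0;
  }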


The problem is that most of the time the main switch of the lexer isn't entered in a loop; instead it is called from many different points in the parser (a token is considered, something is done, then there's a call for the next token). The only case I see at the moment where there's a loop over the switch is when processing whitespace; that can maybe be slightly improved.

The whole topic of lexing turning out to be slow is really something that should be measured. Really intriguing.


Glancing at the source, I think the whitespace parsing could be improved. Since whitespace will rarely if ever (?) be followed by more whitespace, and since most other things are likely to be followed by whitespace, just having both "parse_expecting_whitespace()" and "parse_probably_not_whitespace()" should improve the prediction.

Other tokens probably have similar but less strong "preferences". It's possible that a smart enough branch predictor will learn these on its own, but changing things so that each token ends with its own predicted branch would likely be a win.

Since the branch predictor uses the address of the branch as the 'key', the general goal is to have a separate branch instruction (if, switch) in each place where the probabilities are different enough to favor a different prediction. A wrong prediction costs about 15 cycles, but a correctly predicted branch costs less than 1, so you can put in quite a lot of these and still come out ahead.

Perhaps just ending every token handler with a best guess at what comes next?

  if (T->ptr[0] == most_likely_next) 
    handle_most_likely_next(T);
  else scan(T);
I don't have the syntax in my head, but I think you can use 'perf' to record mispredicted indirect branches and then display that sorted by caller.


At the beginning of every line there are usually several consecutive whitespace characters, unless the code isn't indented.


Great, that sounds like a fine opportunity for a "best guess": if newline, expect whitespace. Currently, I'd guess that 'whitespace except newline' is the default prediction for the switch() at line 451. I'd also guess that if not followed by a space or tab, a newline is frequently followed by another newline.

Maybe you could combine the case statements for space and newline, and do a branchless 'cmov' to increment loc.linenum if the match was a newline? This could be combined with a loop to grab all the whitespace/newlines in one go, if you think whitespace occurs in clumps.


FWIW I'd never put CPU-dependent asm code in the lexer.


I played with how to phrase that. You don't need to actually use a CMOV, just write code that allows your compiler to use one if supported.

  int tmp = real;      // unconditionally read the old value
  if (foo == bar) {
    tmp = newval;      // conditionally overwrite the temporary...
  }
  real = tmp;          // ...and unconditionally write it back
I've seen it referred to as a 'hammock'[1], and at least for GCC and ICC it usually is a strong enough hint.

[1] http://people.engr.ncsu.edu/ericro/publications/conference_M...


Indeed. At the very least, that scan() function could be broken down into smaller parts to be individually profiled.


I did some quick statistics on lexer.c and considered it an example of the input. It has some 3050 lines and 90KB; inside the file there are 44K space (chr(32)) characters, giving on average more than 10 spaces per line. To process the indentation spaces faster I'd try reading them in a loop right after the newline, something like:

    case '\n': {
        char* t = p + 1;       // skip the newline
        while (*t == ' ')      // skip the indent spaces
            t++;               //   while avoiding the big switch
        p = t;
        continue;
    }
I don't know if it's worth it, but it seems a good compromise.


This sounds like a good problem to have! Your compiler must be blazing fast. Does it remain so fast with full optimizations turned on? Thanks for the answers.


Please excuse me for even asking: did you propagate your bump-the-pointer greedy allocator to the lexer (that is, what's behind "mem.malloc" in the code you linked to)? And do you have some idea why lexing would otherwise take so much time for you to notice?


Yes, but the lexer doesn't do too much allocating. I try to avoid allocing in the lexer because of speed issues.


A massive advantage of your new linear allocator is that it keeps your memory accesses contiguous. This means that the processor is more likely to have the most recently used memory locations already in cache.

You might see further improvements if you split your allocations between two (or more) allocators: one for memory you expect to remain hot (core to the compiler) and one for stuff you think is one-off. That might improve access locality further.


Does your compiler perform any transformations at all? I imagine it can run out of memory pretty quickly if you're performing multiple transformations in succession on a large code base, unless you recycle some of that memory.

Granted, since you explicitly stated that your compiler focuses on compile speed, I guess optimized code generation isn't your main concern, since the two are more or less mutually exclusive.


Compile speed issues are for non-optimized builds. Optimized builds take significantly longer, as those are for maximum generated code speed rather than compile speed.


I guess I wasn't being clear. I was just curious how you handle your memory in the case of optimized builds?


The same.


The rule of thumb is that you get a 10% boost if you use the LLVM or the GCC backend compared to DMD.

I googled a little for a go vs gccgo comparison, but found no general numbers. The situation is similar.



