x86 Assembly Primer for C Programmers

_delirium · on March 26, 2012

This is just about the opening example, so not really the main point of the slides, but: Is it actually still faster on modern machines to use this repnz version? One analysis a few years ago found that the naive C implementation, when compiled with gcc optimizations, was actually faster than that inline-asm implementation; the inline-asm implementation has fewer instructions, but doesn't execute faster: http://canonical.org/~kragen/strlen-utf8.html

frozeneskimo · on March 26, 2012

This is a great question. You can also find some of glibc's even more optimized versions of strlen() here:

http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/i386...

It's counter-intuitive that so many instructions can be dedicated to something as seemingly simple as strlen(), but it does highlight the complexity of modern processors. On that note, I do not know, and am reluctant to comment on, which implementation of strlen() is fastest, as it seems very difficult to account for the non-determinism of caches, instruction scheduling, pipelines, etc. by only looking at the source, especially when other code is also involved. But there probably is a benchmark that could give some useful results -- to justify the code above too -- like in the link you posted.

Someone with a better understanding of instruction scheduling, pipelines, and other optimizations can probably give a better answer to this than me.
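For anyone who wants to try the benchmark idea, here is a minimal sketch of a timing harness in C. The names `naive_strlen` and `time_strlen` are made up for illustration and are not taken from the slides or the linked analysis. It only measures CPU time with `clock()`, so treat the numbers as rough.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical baseline: the obvious byte-at-a-time loop. */
size_t naive_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/*
 * Hypothetical helper: call fn(s) `reps` times and return the elapsed
 * CPU time in seconds. The sum of all returned lengths is stored in
 * *total so the caller can sanity-check the result.
 */
double time_strlen(size_t (*fn)(const char *), const char *s,
                   long reps, size_t *total)
{
    volatile size_t sink = 0;   /* volatile so the additions are not dropped */
    clock_t start = clock();
    for (long i = 0; i < reps; i++)
        sink += fn(s);
    clock_t end = clock();
    *total = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_strlen(naive_strlen, buf, reps, &total)` and `time_strlen(strlen, buf, reps, &total)` on the same buffer gives two numbers to compare. An optimizing compiler may still inline or hoist the call, so the generated assembly is worth checking before trusting the result.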

~vsergeev/frozeneskimo

tcas · on March 26, 2012

Wow, those examples took me a while to go through and understand; it's amazing the level you can optimize a piece of code to hardware. I'm curious to benchmark the implementations and see the performance difference.

Maybe you or someone else knows the answer to this though: it seems like they are processing 4 bytes of the string at a time. If they read over (i.e. the NULL byte is in byte position 1 2 or 3), isn't that technically undefined behavior? They are only reading in the memory, but I feel like valgrind or another tool would spit out an error if that happened. It's aligned, so it won't trigger a page fault, but it seems like an unsafe optimization.
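To make the 4-bytes-at-a-time idea concrete, here is a rough C sketch of the general technique. It is not the glibc code; `strlen_word` is a hypothetical name, and the zero-byte test is the well-known bit trick rather than anything taken from the linked source. Note that the word read can touch up to 3 bytes past the terminator, which is exactly the over-read being asked about.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical word-at-a-time strlen; a sketch, not glibc's version. */
size_t strlen_word(const char *s)
{
    const char *p = s;

    /* Go byte by byte until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    for (;;) {
        uint32_t w;
        /* Aligned 4-byte load. This may read bytes past the terminator,
         * but an aligned word never straddles a page boundary. */
        memcpy(&w, p, sizeof w);

        /* Nonzero exactly when at least one byte of w is zero. */
        if (((w - 0x01010101u) & ~w & 0x80808080u) != 0) {
            for (int i = 0; i < 4; i++) {
                if (p[i] == '\0')
                    return (size_t)(p - s) + (size_t)i;
            }
        }
        p += 4;
    }
}
```

In portable C the over-read is still outside the language's guarantees. A sketch like this should only be run on buffers known to be padded to a multiple of 4 bytes, unless you are relying on the platform's paging behavior the way a libc implementation can.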

frozeneskimo · on March 26, 2012

Yeah, I see your point. Like you said, because each 4-byte read is aligned and the page size is a multiple of 4, a read never crosses a page boundary, so it won't cause a page fault and is technically "ok". I can only assume that at this level the corner is safely cut for the sake of performance.

Another, more trivial example of something like this is in the repnz-based strlen() (slides 7-8), where %ecx is loaded with 0xffffffff, which technically limits the routine to scanning strings up to 4 gigabytes in length. It's a valid assumption that the string is under 4GB (especially on a strictly 32-bit system), but the point is that it's a semantically different routine from the C-based one.
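The difference is easier to see in a small C model of the counted scan. This is only a sketch: `strlen_repnz_model` and its `limit` parameter are hypothetical, and it assumes the common pattern of running the scan and then recovering the length from the remaining count (the `not`/`dec` idiom), which may not match the slides exactly. `limit` stands in for the initial %ecx value so the cap can be observed without a 4GB string.

```c
#include <stdint.h>

/*
 * Hypothetical C model of a repnz scasb style scan.
 * `limit` plays the role of the initial %ecx (0xffffffff in the slides).
 */
uint32_t strlen_repnz_model(const char *s, uint32_t limit)
{
    uint32_t ecx = limit;
    const unsigned char *edi = (const unsigned char *)s;

    /* One iteration per byte: decrement the count, compare, advance. */
    while (ecx != 0) {
        ecx--;
        if (*edi++ == 0)
            break;              /* terminator found, stop scanning */
    }

    /* With limit == 0xffffffff this equals ~ecx - 1. */
    return (limit - ecx) - 1;
}
```

With `limit` set to 0xffffffff this matches the plain C loop for any string shorter than the limit. With a small limit such as 3 on "hello", the scan stops early and reports 2, which is the kind of silent cap the plain C loop does not have.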

~vsergeev/frozeneskimo

VMG · on March 26, 2012

(signatures are frowned upon around here)

frozeneskimo · on March 26, 2012

(I see, my bad. Habit of mine. Thanks)

Someone · on March 26, 2012

"If they read over (i.e. the NULL byte is in byte position 1 2 or 3), isn't that technically undefined behavior?"

Only if they are using C, not implementing it. That is why the C standard has that 'undefined behavior' clause. It makes optimizations like this possible.

ben0x539 · on March 25, 2012

Would be nice if it was 64bit. Seems a bit late.

rmcclellan · on March 26, 2012

64 bit x86 is very similar to 32 bit. The differences are covered on slides 191-193 in this deck.

The biggest difference for me is the difference in the calling convention. In 32 bit, all arguments generally are placed on the stack for "standard" calls. In 64 bit, different OSes have different conventions:

http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_... (note that OS X and linux use the "System V" calling conventions)
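As a quick illustration of the difference, here is a tiny C function with a comment noting where its arguments are passed under each convention. `add3` is a made-up example, and the register assignments are the standard ones for the first three integer arguments.

```c
/*
 * Where the arguments of add3(1, 2, 3) are passed at the call site:
 *
 *   32-bit cdecl:    pushed on the stack right to left (3, then 2, then 1);
 *                    result returned in %eax
 *   System V AMD64:  a in %edi, b in %esi, c in %edx; result in %eax
 *   Microsoft x64:   a in %ecx, b in %edx, c in %r8d; result in %eax
 */
int add3(int a, int b, int c)
{
    return a + b + c;
}
```

Compiling this with `-S` for each target and reading the prologue is a simple way to see the convention your toolchain actually uses.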

tptacek · on March 26, 2012

It's easier, too; you get a bunch more registers and the register calling convention is a little like having normal positional arguments (or in any case is more convenient than arranging pushls).

anaisbetts · on March 26, 2012

I was about to call BS on that last sentence but it looks like I was the one full of it. I thought that the amd64 calling convention was standardized, bummer.

On Windows, amd64 is much better than x86 in this regard because of the bevy of x86 calling conventions that are still around. Only one amd64 calling convention.

brigade · on March 26, 2012

To be fair, the System V document is AMD's convention and it was Microsoft that decided to design an incompatible (and worse) ABI.

anaisbetts · on March 26, 2012

Sucks. I wonder if AMD wrote that later, after Microsoft had made up their own and were dependent on it. Dave Cutler was involved in amd64 really early in the process of Clawhammer (mainly because he hates Intel with a passion!)

burstlag · on March 28, 2012

The slides look like they have good information, but I'd really love to see that speech that (I suppose) went with it.

tene · on March 26, 2012

All I get is the speakerdeck main page, with "You are not authorized to access this."

Does anyone have a working link to the content?

frozeneskimo · on March 26, 2012

Sorry, it's back up. Apparently updating the PDF broke speakerdeck, permanently marking the presentation as "unpublished", even though it is public. I had to delete and re-upload. Probably shouldn't have updated the PDF in the first place, though.

Content is available here as well: https://github.com/vsergeev/apfcp

iab · on March 26, 2012

I wish I could upvote this indefinitely. What a great resource, thanks for your efforts!

Duckpaddle2 · on March 26, 2012

Great resource, thanks for it!