x86 Assembly Primer for C Programmers (speakerdeck.com)
98 points by frozeneskimo on March 25, 2012 | 18 comments



This is just about the opening example, so not really the main point of the slides, but: Is it actually still faster on modern machines to use this repnz version? One analysis a few years ago found that the naive C implementation, when compiled with gcc optimizations, was actually faster than that inline-asm implementation; the inline-asm implementation has fewer instructions, but doesn't execute faster: http://canonical.org/~kragen/strlen-utf8.html
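For reference, the "naive C implementation" in a comparison like this is usually something along the lines of the sketch below. This is an assumption about its general shape, not the code from the slides or from the linked analysis, and the name `naive_strlen` is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical name: a plain byte-at-a-time scan for the terminating NUL.
 * An optimizing compiler is free to unroll or otherwise transform this loop. */
size_t naive_strlen(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Which version wins would have to be measured on the target machine; instruction count alone does not settle it.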


This is a great question. You can also find some of glibc's even more optimized versions of strlen() here:

http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/i386...

http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/i386...

It's counter-intuitive that so many instructions can be dedicated to something as seemingly simple as strlen(), but it does highlight the complexity of modern processors. On that note, I do not know which implementation of strlen() is fastest and am reluctant to guess, as it seems very difficult to account for the non-determinism of caches, instruction scheduling, pipelines, etc. by only looking at the source, especially when other code is also involved. But there probably is a benchmark that could give some useful results (and justify the code above too), like the one in the link you posted.

Someone with a better understanding of instruction scheduling, pipelines, and other optimizations can probably give a better answer to this than me.

~vsergeev/frozeneskimo


Wow, those examples took me a while to go through and understand; it's amazing how far you can optimize a piece of code for the hardware. I'm curious to benchmark the implementations and see the performance difference.

Maybe you or someone else knows the answer to this though: it seems like they are processing 4 bytes of the string at a time. If they read over (i.e. the NULL byte is in byte position 1 2 or 3), isn't that technically undefined behavior? They are only reading in the memory, but I feel like valgrind or another tool would spit out an error if that happened. It's aligned, so it won't trigger a page fault, but it seems like an unsafe optimization.
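To make the question concrete, here is a minimal C sketch of the word-at-a-time idea. It is not glibc's code; the name `word_strlen` and the zero-byte test used here (a well-known bit trick) are assumptions for illustration. The aligned 4-byte load can touch up to 3 bytes past the NUL, which is exactly the over-read being asked about.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch, not glibc's implementation. Once the pointer is
 * 4-byte aligned, it examines 4 bytes per iteration. The aligned load may
 * read up to 3 bytes past the NUL, which ISO C does not permit for an
 * arbitrary array, so treat this as a model of what the assembly does. */
size_t word_strlen(const char *s)
{
    const char *p = s;

    /* Go byte by byte until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);   /* aligned 4-byte load */
        /* Nonzero if and only if at least one byte of w is zero. */
        if (((w - 0x01010101u) & ~w & 0x80808080u) != 0)
            break;
        p += 4;
    }

    /* The NUL is somewhere in this word; find its exact position. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Because every word load is aligned, it never straddles a page boundary, which is why the over-read cannot fault even though a memory checker may still flag it.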


Yeah, I see your point. Like you said, the reads are 4-byte aligned and the page size is a multiple of 4, so an aligned read never straddles a page boundary and won't cause a page fault; it's technically "ok". I can only assume that at this level the corner is safely cut for the sake of performance.

Another more trivial example of something like this is in the repnz-based strlen() (slides 7-8), where %ecx is loaded with 0xffffffff, which technically limits the routine to scanning strings up to 4 gigabytes in length. It's a valid assumption that the string is under 4GB (especially on a strictly 32-bit system), but the point is that it's a semantically different routine from the C-based one.
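A rough C model of that difference, assuming a scan that gives up when a 32-bit counter runs out. The name `counted_strlen` and the `limit` parameter are hypothetical, and this does not reproduce the exact instruction sequence or final arithmetic from the slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: `limit` stands in for the initial counter value
 * (0xffffffff in the case discussed above). The scan stops at the NUL or
 * when the counter reaches zero, whichever comes first. */
size_t counted_strlen(const char *s, uint32_t limit)
{
    uint32_t remaining = limit;
    const char *p = s;
    while (remaining != 0 && *p != '\0') {
        p++;
        remaining--;
    }
    return (size_t)(p - s);
}
```

With a small limit the cap is easy to see: `counted_strlen("hello", 3)` stops after 3 bytes, whereas a plain C loop would keep going until it found the NUL.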

~vsergeev/frozeneskimo


(signatures are frowned upon around here)


(I see, my bad. Habit of mine. Thanks)


"If they read over (i.e. the NULL byte is in byte position 1 2 or 3), isn't that technically undefined behavior?"

Only if they are using C, not implementing it. That is why the C standard has that 'undefined behavior' clause. It makes optimizations like this possible.


Would be nice if it was 64bit. Seems a bit late.


64 bit x86 is very similar to 32 bit. The differences are covered on slides 191-193 in this deck.

The biggest difference for me is the difference in the calling convention. In 32 bit, all arguments generally are placed on the stack for "standard" calls. In 64 bit, different OSes have different conventions:

http://en.wikipedia.org/wiki/X86_calling_conventions#x86-64_... (note that OS X and linux use the "System V" calling conventions)
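As a small illustration of the difference, here is where the arguments of a three-argument function travel under each convention. The function `add3` is made up for this example; the register assignments in the comment are the standard ones for integer arguments.

```c
/* Hypothetical function, used only to label where each argument travels.
 *
 * 32-bit cdecl:    a, b, c are all pushed on the stack by the caller.
 * System V AMD64:  a in %edi, b in %esi, c in %edx.
 * Microsoft x64:   a in %ecx, b in %edx, c in %r8d.
 *
 * In all three cases the int result comes back in %eax.
 */
int add3(int a, int b, int c)
{
    return a + b + c;
}
```

Compiling this with something like `gcc -S` for each target and reading the output is a quick way to see the convention in action.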


It's easier, too; you get a bunch more registers and the register calling convention is a little like having normal positional arguments (or in any case is more convenient than arranging pushls).


I was about to call BS on that last sentence but it looks like I was the one full of it. I thought that the amd64 calling convention was standardized, bummer.

On Windows, amd64 is much better than x86 in this regard because of the bevy of x86 calling conventions that are still around. Only one amd64 calling convention.


To be fair, the System V document is AMD's convention and it was Microsoft that decided to design an incompatible (and worse) ABI.


Sucks. I wonder if AMD wrote that later, after Microsoft had made up their own and were dependent on it. Dave Cutler was involved in amd64 really early in the process of Clawhammer (mainly because he hates Intel with a passion!)


The slides look like they have good information, but I'd really love to see that speech that (I suppose) went with it.


All I get is the speakerdeck main page, with "You are not authorized to access this."

Does anyone have a working link to the content?


Sorry, it's back up. Apparently updating the PDF broke speakerdeck, permanently marking the presentation as "unpublished", even though it is public. I had to delete and re-upload. Probably shouldn't have updated the PDF in the first place, though.

Content is available here as well: https://github.com/vsergeev/apfcp


I wish I could upvote this indefinitely. What a great resource, thanks for your efforts!


Great resource, thanks for it!



