
> My usual guess would be that you can often hope for a 50% speedup in a tight loop by dropping from C to assembly

The problem with inline assembler is that it is almost opaque to the optimizer. By adding some inline asm, you may inhibit a lot of optimizations that could give better performance overall.

For this kind of task it is often much better to use intrinsics (e.g. xmmintrin.h for SSE) or compiler extensions such as __attribute__((vector_size(16))). This way you can utilize the CPU features you have available while still allowing the optimizer to perform high-level optimizations.




While there is a lot to be said for the maintainability of intrinsics, I have found inline assembly to be significantly better for performance. And this is precisely because it prevents the compiler from blindly performing 'optimizations' on a section of code you've already optimized by hand. This thread offers an example and some numbers: http://software.intel.com/en-us/forums/topic/480004


I was under the impression that the parts of performance-oriented programs typically converted to assembly are, in essence, small profiled hotspots such as very tight loops. As such, I doubt there's much real performance to be gained from applying high-level optimizations across that code, as intrinsics/extensions make possible.

But I'm certainly no expert in this area, so take my opinion with a large grain of salt.



