How to achieve 4 flops per cycle? (stackoverflow.com)
89 points by glazskunrukitis on Feb 9, 2013 | 14 comments



Getting the theoretical peak performance requires either extensive manual tuning on each architecture or an exhaustive automated search. For a story about the former, see http://www.nytimes.com/2005/11/28/technology/28super.html?pa...; the latter is done, for example, in the ATLAS BLAS.

It is important to note that the SO discussion is focused on achieving peak flops when the data is already in the XMM registers. The real bottleneck is pushing the data quickly through the levels of the memory hierarchy.
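
A back-of-envelope comparison makes the point. Assuming one 3 GHz Sandy Bridge core with AVX (8 double-precision flops per cycle) and a single DDR3-1600 memory channel (these numbers are illustrative, not from the SO thread):

    \text{peak compute} = 3\,\text{GHz} \times 8\,\text{flops/cycle} = 24\ \text{GFLOP/s}
    \text{peak bandwidth} \approx 12.8\ \text{GB/s} \approx 1.6 \times 10^{9}\ \text{doubles/s}
    \Rightarrow\ 24 / 1.6 = 15\ \text{flops per double fetched from DRAM}

So unless a kernel performs on the order of 15 flops per double it reads, it is memory-bound no matter how well the register-level code is tuned.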


It's worth noting that ATLAS is actually pretty terrible compared to hand-written BLAS.

Figures 2 and 3 in this paper (http://cran.r-project.org/web/packages/gcbd/vignettes/gcbd.p...) show that ATLAS is only better than an unoptimized ("reference") BLAS implementation, and that it is many times slower than the winner, GotoBLAS (now OpenBLAS, http://xianyi.github.com/OpenBLAS/), which was written in assembly by Kazushige Goto while he worked at the Texas Advanced Computing Center.

Edit: Article in the NY Times about Mr. Goto http://www.nytimes.com/2005/11/28/technology/28super.html?sc...


>extensive manual tuning

I was surprised by how little low-level knowledge it took to get within a stone's throw of the theoretical peak.

The general rule seems to be (a rough sketch follows the list):

- keep data in packed registers (use YMM instead of XMM if you have them)

- interleave addition and multiplication

- unroll the loop sufficiently to avoid stalls
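
Here is a rough sketch of those three rules in C with AVX intrinsics (my own toy example, assuming a Sandy Bridge-class core; it is not the code from the SO answer):

    /* toy_peak.c -- interleaved, unrolled AVX add/mul chains.
       Compile with: gcc -O2 -mavx toy_peak.c */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        const long iters = 100000000;

        /* Independent accumulator chains: unrolling this way hides
           the add and mul latencies (3 and 5 cycles on Sandy Bridge). */
        __m256d s0 = _mm256_set1_pd(0.0), s1 = _mm256_set1_pd(0.0);
        __m256d p0 = _mm256_set1_pd(1.0), p1 = _mm256_set1_pd(1.0);
        const __m256d da = _mm256_set1_pd(1e-9);
        const __m256d dm = _mm256_set1_pd(1.0 + 1e-12);

        for (long i = 0; i < iters; i++) {
            /* Interleave adds and muls: they issue on different ports,
               so in the ideal case both 4-wide FP units retire one op
               per cycle -- 8 flops/cycle. */
            s0 = _mm256_add_pd(s0, da);
            p0 = _mm256_mul_pd(p0, dm);
            s1 = _mm256_add_pd(s1, da);
            p1 = _mm256_mul_pd(p1, dm);
        }

        /* Use the results so the compiler can't delete the loop. */
        double out[4];
        _mm256_storeu_pd(out, _mm256_add_pd(_mm256_add_pd(s0, s1),
                                            _mm256_add_pd(p0, p1)));
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }

(With only two chains per operation the 5-cycle mul latency isn't fully hidden; a serious attempt would unroll further.)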

I wonder how close to ATLAS you could get if you wrote a matrix multiply following these principles and additionally structured the row/column iterations into cache-sized tiles.
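
For the tiling part, a hypothetical sketch (BLOCK and the loop order are my own choices; a real BLAS additionally packs tiles into contiguous buffers and vectorizes the inner loop):

    /* Blocked matrix multiply: work on BLOCK x BLOCK tiles so the
       active pieces of A, B, and C stay resident in cache. */
    #define BLOCK 64  /* tile edge; tune so three tiles fit in cache */

    void matmul_tiled(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    /* C[ii..][jj..] += A[ii..][kk..] * B[kk..][jj..] */
                    for (int i = ii; i < ii + BLOCK && i < n; i++)
                        for (int k = kk; k < kk + BLOCK && k < n; k++) {
                            const double a = A[i * n + k];
                            for (int j = jj; j < jj + BLOCK && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }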


This is a very interesting exercise.

CPUs are very capable today; unfortunately, getting 'every last drop of performance' out of them is very difficult.

I'd say that modern CPUs are more complex than compilers can fully optimize for.

Of course, manually optimizing something like this is "easy" (if you have a somewhat deep knowledge of Intel's manuals).

For "everyday computing" this gets tougher, but compilers do a good job there (good, not great), and it's usually good enough.

So you end up going down into the deeper details only for time-sensitive things: games, video processing, etc.

There are some tools (from Intel and AMD, though I've only tested the AMD one, some time ago) that tell you 'everything' you need to know about how good your code is. For example, IIRC, if you load a register and then immediately store it (or store then load, I don't remember which), there's a stall, so you can do something else in between.
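
A generic illustration of that kind of advice (a toy example of my own, not output from either vendor's tool): a reduction whose iterations form one long dependency chain, versus the same work split into two independent chains the CPU can overlap.

    #include <stdio.h>

    int main(void)
    {
        static double a[1 << 20];
        for (int i = 0; i < (1 << 20); i++)
            a[i] = 1.0;

        /* Dependent: each add must wait for the previous sum. */
        double sum = 0.0;
        for (int i = 0; i < (1 << 20); i++)
            sum += a[i];

        /* Two independent chains: while one add is in flight, the
           other can issue, filling the otherwise-stalled cycles. */
        double s0 = 0.0, s1 = 0.0;
        for (int i = 0; i < (1 << 20); i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }

        printf("%f %f\n", sum, s0 + s1);
        return 0;
    }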


I always get a mild wave of depression after reading stuff like this, because of how absolutely much I do not know.


I'm a bit surprised by the top-voted answer, from a user with 80K rep on SO.

By this part:

"If you decide to compile and run this, pay attention to your CPU temperatures!!! Make sure you don't overheat it. And make sure CPU-throttling doesn't affect your results!"

Fair enough for the CPU throttling.

But did the nineties just call? Computers dying from CPU overheating was all scary and spooky back when we were running old AMDs and old Pentiums, but I haven't owned a CPU that could die from overheating in a very, very long time. They automagically throttle to: a) stay within their TDP specs, and b) make sure they don't melt.

I mean: honestly, should scientists now worry that their CPUs are going to melt?


>haven't owned a CPU that could die from overheating in a very, very long time

You probably have; CPUs are just smarter about it now and throttle themselves back:

http://www.intel.com/content/www/us/en/architecture-and-tech...

If you have a densely packed blade farm, you could easily exceed the cooling capacity within the rack. Another scenario is laptops.


Scenario: this code triggered a shutdown on my computer due to poor ventilation. If I had had programs open, I would have lost my documents.


I think the only thing in favor of the argument that it might overheat the CPU is that many more functional units will be active at once, with no idle time in between. Normally they would sit idle waiting on cache misses and memory accesses.

I don't think this is a problem with modern CPUs, but you would have to test to find out!


I'm guessing that since you're disabling throttling to get better results, you're also disabling the safeguards that prevent such failures from happening.


Outside of leakage current, CMOS/MOSFET circuitry draws current during state transitions. So, yes, more bits flipping per unit of time will cause additional heating, sometimes dramatically so. This is the reason the last generation of iPads gets so hot: four times the pixels means the potential for four times as many bits flipping in the display processing pipeline, hence more heat. Try running footage where every pixel changes on every frame and see what happens.
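
For reference, the standard first-order model of dynamic CMOS power, where \alpha is the activity factor (the fraction of nodes switching per cycle), C the switched capacitance, V_{dd} the supply voltage, and f the clock frequency:

    P_{\text{dynamic}} = \alpha \, C \, V_{dd}^{2} \, f

More bits flipping raises \alpha directly, and with it the heat to dissipate.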


>I mean: honestly, should scientists now worry that their CPUs are going to melt?

My officemate is working with some neural-networks researchers, trying to get near-peak throughput running convolutional neural networks on NVIDIA's new K20 GPUs. Several GPUs have already died, and they're looking for better cooling for the rest. So I don't know whether peak utilization of multi-core CPUs can still be hazardous, but it's definitely true for graphics cards.


I'm a bit surprised by your smug attitude.


So being correct, raising an interesting issue, and dispelling urban myths is smug now?



