
Great article. I wish the discussion of clocks had gone a bit further into how the tradeoff between pipelining and the longest single operation ends up impacting designs. That, and SRAM vs DRAM access latencies, were the things that really connected the dots for me on how performance optimization on the software side is rooted in physical hardware limitations.



Might you or anyone else have some links or references you could share on these two topics? Was there a specific book that helped connect the dots that you could recommend?


Here's a greatly simplified example. Let's say you're trying to calculate y = mx + b in your FPGA. You want this operation to run at 100 MHz. Great, you write the code, synthesize and implement. Uh oh, the tools report that your design has failed timing analysis. What now?

The tool output will say something like "x to y setup time: -2 ns slack". That means your desired operation can't meet the 10 ns clock period; it actually takes 12 ns for all the logic to ripple through. So now what?
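
To make the slack number concrete, here is a minimal Python sketch of the arithmetic. It assumes slack is simply the clock period minus the logic delay on the path, and ignores register setup time, clock skew and jitter, which a real timing analysis would include. The function name slack_ns is made up for illustration.

```python
def slack_ns(clock_mhz: float, path_delay_ns: float) -> float:
    """Simplified setup slack: clock period minus path delay.

    Assumption: ignores register setup time, clock skew and jitter,
    which a real static timing analysis would account for.
    """
    period_ns = 1000.0 / clock_mhz  # 100 MHz -> 10 ns
    return period_ns - path_delay_ns


# 100 MHz target with 12 ns of logic between x and y.
# A negative result means the path fails timing.
print(slack_ns(100.0, 12.0))  # -2.0
```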

You can break up the operation into two steps. Let's say the multiplication takes 8 ns, and the addition takes 4 ns. In timestep 1 you do z = mx, and also register c = b so that b stays lined up with its matching z. Then in timestep 2 you do y = z + c. Each step now fits inside the 10 ns period on its own. This way your operation takes two clock cycles = 20 ns total in terms of latency, but you can still accept a new x every cycle and maintain a rate of 100 MHz.
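
The two-step split can be modeled cycle by cycle in software. This is only a rough Python sketch of the idea, not HDL: the register names z_reg, c_reg and y_reg, the output register on y, and the function run_pipeline are assumptions made for illustration, and None stands in for "no valid data yet".

```python
def run_pipeline(samples, m, b):
    """Cycle-by-cycle model of a two-stage y = m*x + b pipeline.

    Returns the contents of the output register after each clock edge
    (None while the pipeline is still filling).
    """
    z_reg = None  # stage 1 register: holds m * x
    c_reg = None  # stage 1 register: holds b, delayed to stay aligned with z
    y_reg = None  # stage 2 register: holds z + c
    outputs = []

    # Feed one x per clock, then one extra idle cycle to flush the pipeline.
    for x in list(samples) + [None]:
        # All registers update on the same clock edge, so compute the next
        # values from the current register contents before assigning any.
        next_y = z_reg + c_reg if z_reg is not None else None
        next_z = m * x if x is not None else None
        next_c = b if x is not None else None
        z_reg, c_reg, y_reg = next_z, next_c, next_y
        outputs.append(y_reg)
    return outputs


# m = 2, b = 1, one new x on every clock cycle
print(run_pipeline([1, 2, 3], m=2, b=1))  # [None, 3, 5, 7]
```

The first result only shows up after the second clock edge (two cycles of latency), but from then on a new y comes out every cycle.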

Alternatively, you could choose a slower clock rate, say 75 MHz, and have a clock period of 13.333 ns. Then you would be able to meet the logic delay requirements in one cycle.
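
For a side-by-side view of the options, here is a small Python sketch using the 8 ns and 4 ns delays from this example. It assumes a stage meets timing whenever its logic delay fits within one clock period, which glosses over setup time and routing delay. The helper names period_ns and summarize are invented for this sketch.

```python
MULT_NS = 8.0  # multiplier delay from the example
ADD_NS = 4.0   # adder delay from the example


def period_ns(clock_mhz):
    return 1000.0 / clock_mhz


def summarize(name, clock_mhz, stages, worst_stage_ns):
    """Return (name, meets_timing, latency_ns, million_results_per_s)."""
    t = period_ns(clock_mhz)
    meets_timing = worst_stage_ns <= t  # simplified check, no setup/skew
    latency_ns = stages * t
    # One result per clock, so throughput in M results/s equals the clock in MHz.
    return name, meets_timing, latency_ns, clock_mhz


options = [
    summarize("single cycle @ 100 MHz", 100.0, 1, MULT_NS + ADD_NS),
    summarize("two stages   @ 100 MHz", 100.0, 2, max(MULT_NS, ADD_NS)),
    summarize("single cycle @  75 MHz", 75.0, 1, MULT_NS + ADD_NS),
]
for name, ok, latency, rate in options:
    print(f"{name}: meets timing={ok}, latency={latency:.3f} ns, {rate:.0f} M results/s")
```

Under these assumptions, pipelining keeps the 100 MHz rate at the cost of 20 ns latency, while the slower clock gives a shorter 13.333 ns latency but only 75 million results per second.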

Again this is greatly simplified but it's similar to what one ends up doing in real FPGA designs. At the beginning you're usually trying to achieve maximum performance. Then later on you add more features to the FPGA, only to find that in doing so, you've caused an existing portion of the design to fail timing, so you need to twiddle things around.


Sorry, no book reference. What is important to realize is that computing can be seen as a 4D problem: get the right data to the right processing unit at the right clock cycle. This applies to CPUs and GPUs as well, but it tends to get forgotten under a plethora of abstraction layers.



