
When you measure peak FLOPS, especially "my desktop computer has X FLOPS", you're generally computing N FMA units * 2 FLOPs per FMA * f frequency: a theoretical maximum. This number, as you note, has basically no relation to anything practical: we've long been at the point where our ability to stamp out ALUs greatly exceeds our ability to keep those units fed with useful data.
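As a back-of-envelope illustration (all hardware numbers below are made up, not any particular chip):

    /* Theoretical peak FLOPS = cores x FMA pipes x SIMD lanes
       x 2 FLOPs per FMA x clock. Purely hypothetical numbers. */
    #include <stdio.h>

    int main(void) {
        double cores      = 8;    /* hypothetical core count      */
        double fma_units  = 2;    /* FMA pipes per core           */
        double simd_width = 8;    /* doubles per 512-bit vector   */
        double freq_ghz   = 3.0;  /* sustained clock in GHz       */

        /* An FMA counts as 2 FLOPs: one multiply plus one add. */
        double peak = cores * fma_units * simd_width * 2 * freq_ghz;
        printf("theoretical peak: %.0f GFLOPS\n", peak);
        return 0;
    }

None of which says anything about whether the memory system can actually keep those pipes busy.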

Top500 measures FLOPS on a different basis. Essentially, time how long it takes to solve a dense N×N linear system Ax=b (where N is large enough to stress your entire system), and convert that run time into FLOPS using a fixed formula for the operation count. However, this kind of dense linear algebra is an unusually computation-heavy benchmark--you get to do about n^1.5 FLOPS per n words of data (the matrix is n = N^2 words, and the solve takes ~N^3 = n^1.5 operations). Most kernels do more like O(n), or maybe as high as O(n lg n), work for O(n) data, which requires much higher memory bandwidth than good LINPACK numbers would suggest.
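A minimal sketch of that ratio, assuming HPL's usual 2/3*N^3 + 2*N^2 operation count (N here is an arbitrary example size):

    /* Arithmetic intensity: HPL (dense LU) vs. a streaming kernel. */
    #include <stdio.h>

    int main(void) {
        double N     = 1e5;                      /* matrix dimension   */
        double words = N * N;                    /* n = N^2 data words */
        double flops = 2.0/3.0*N*N*N + 2.0*N*N;  /* HPL flop count     */

        /* Dense LU: ~n^1.5 FLOPs for n words of data, i.e. tens of
           thousands of FLOPs per word at this size.               */
        printf("HPL FLOPs/word:         %.0f\n", flops / words);

        /* An O(n) kernel like a dot product does ~1 FLOP per word
           loaded, so memory bandwidth is what limits it.          */
        printf("dot-product FLOPs/word: %.0f\n", 1.0);
        return 0;
    }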

Furthermore, graph or sparse algorithms tend to do really badly because the amount of work you're doing can't hide the memory latency (think one FMA per A[B[i]] access--you can stream the sequential B[i] reads at full memory bandwidth, but the A[x] accesses turn into a massive memory gather operation, which is extremely painful).
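Roughly this shape of loop, as a sketch (sizes and values are arbitrary):

    /* Latency-bound indirect access: one FMA's worth of work per
       gathered element. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t n = 1 << 20;
        double *A = malloc(n * sizeof *A);
        int    *B = malloc(n * sizeof *B);
        for (size_t i = 0; i < n; i++) {
            A[i] = 1.0;
            B[i] = rand() % (int)n;  /* data-dependent indices */
        }

        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* B[i] streams sequentially, so the prefetcher keeps it
               fed at full bandwidth; A[B[i]] jumps to an address
               known only after B[i] arrives -- a scattered gather
               that usually misses cache and stalls the FMA. */
            sum += A[B[i]] * 2.0;
        }
        printf("%f\n", sum);
        free(A); free(B);
        return 0;
    }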



