At least for x86, there's an incredible wealth of architectural details out there, both from the vendors themselves and from people who have worked tirelessly to characterize them.

Along the lines of another comment on this post, part of the problem is that the GPU compute model is a lot more abstract than what is presented for the CPU.

That abstraction is really helpful for making parallel code simple to write. But it also hides the tremendous differences in performance possible...