> If you're not within 10% for these kernels, chances are that memory is being used differently
More likely imperfect strictness analysis, etc. Haskell is a pure functional lazy language, after all. Getting within 65% of C's performance on a tight numeric kernel is heroic.
Different strictness analysis results in using memory differently. Note that some of Don's references embed a DSL that gives a high-level interface to very low-level code (e.g. CUDA) specific to this problem. Those should have control over strictness.
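To make the strictness point concrete, here is a minimal sketch of a running mean written two ways. The names `meanLazy` and `meanStrict` are made up for illustration and are not taken from any of the benchmarks discussed. `foldl'` only forces the accumulator pair to weak head normal form, so in the first version the fields of the pair can pile up as unevaluated thunks unless the optimizer manages to prove them strict. The bang patterns in the second version force both fields at every step regardless of what the analysis finds.

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.List (foldl')

-- Hypothetical example: the pair itself is forced by foldl', but its
-- fields are not, so (acc + x) and (len + 1) may be stored as thunks
-- when strictness analysis does not fire (e.g. without optimization).
meanLazy :: [Double] -> Double
meanLazy xs = s / fromIntegral n
  where
    (s, n) = foldl' step (0, 0 :: Int) xs
    step (acc, len) x = (acc + x, len + 1)

-- Same computation, but the bang patterns force both fields of the
-- previous accumulator each time step runs, keeping memory use constant.
meanStrict :: [Double] -> Double
meanStrict xs = s / fromIntegral n
  where
    (s, n) = foldl' step (0, 0 :: Int) xs
    step (!acc, !len) x = (acc + x, len + 1)

main :: IO ()
main = do
  print (meanLazy [1 .. 10])
  print (meanStrict [1 .. 10])
```

Both functions return the same value. The difference is only in how much memory is live while the fold runs, which is the kind of gap that shows up as "memory being used differently" when compared against a C loop.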