John and David will be giving a talk next Wednesday (October 10) in Stanford's Computer Systems Colloquium (EE380). The topic is Computer Architecture; the title has not yet been announced. Expect it to be up on YouTube at https://www.youtube.com/playlist?list=PLoROMvodv4rMWw6rRoeSp... on Thursday or Friday next week.
Great talk! My main takeaway is that they advocate hardware-software co-design to improve performance. Performance issues push software in the direction of domain-specific languages, and hardware in the direction of domain-specific architectures (e.g. TPUs for machine learning).
I think DSLs have a lot of advantages for software-engineering reasons, though not everyone agrees. What doesn't appear to be debatable is that they have advantages for performance!
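As a rough illustration of that performance angle (this is my own toy sketch, not anything from the talk): in a DSL-style framework such as JAX, the computation is written as whole-array operations rather than explicit loops, which is exactly what lets a compiler like XLA retarget the same source to a CPU, GPU, or TPU. The attention_scores function below is made up for the example.

    import jax
    import jax.numpy as jnp

    def attention_scores(q, k):
        # Whole-array ops, no explicit loops: the compiler is free to tile,
        # fuse, and map the matmul onto whatever matrix hardware is present.
        return jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]))

    compiled = jax.jit(attention_scores)   # XLA compiles for the current backend (CPU/GPU/TPU)

    q = jnp.ones((128, 64), dtype=jnp.float32)
    k = jnp.ones((128, 64), dtype=jnp.float32)
    print(compiled(q, k).shape)            # (128, 128)

The same few lines run unchanged on a laptop CPU or a TPU pod slice; only the compiler backend changes.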
Well, the TPU for ML is a great example (roughly a 1000x speed improvement for tensor computation over a CPU).
BUT ....
The TPU is just a specialized GPU, and GPUs have been around for 20 years. So: very slow evolution of a pre-existing HW component.
Deep learning has been around for 5 years and we already have 2 or 3 generations of frameworks. So: very rapid revolution of DSLs implemented on abstracted HW.
Maybe what I am trying to say is that a "sufficiently clever compiler" can probably make more progress on cache performance than waiting for a great new memory-CPU architecture to appear?
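To make that concrete, here is a toy sketch (again using JAX, purely as one example of a "sufficiently clever compiler") of a compiler-side memory win on unchanged hardware: without jit, each line below materializes a full intermediate array and makes a separate memory-bound pass; under jit, XLA fuses the chain into one kernel, so intermediates stay in registers/cache instead of round-tripping through DRAM. The pipeline function is hypothetical.

    import jax
    import jax.numpy as jnp

    def pipeline(x):
        y = x * 2.0        # unfused, each step is a separate memory-bound pass over x
        z = jnp.tanh(y)
        return z + 1.0     # fused by XLA under jit: one pass, intermediates stay on-chip

    fused = jax.jit(pipeline)

    x = jnp.arange(1_000_000, dtype=jnp.float32)
    print(fused(x)[:3])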