Author here; the arxiv version can be found at https://arxiv.org/abs/1810.08297. Not much different from OP's linked version, but it includes citations to other interesting Julia AD/TPU-related papers that utilize this technique.
Happy to answer any questions, at least until I turn in for the night :)
I'll forward this to some of my GPU-expert coauthors in the morning to see if they have a take on your questions. I think there are a few interesting facets here, though, so here's my take.
> What's the most notable way the GPU in particular comes into play?
Forward-mode and reverse-mode AD have different expressibility constraints on the kinds of programs that each can efficiently target in the face of dynamism, and GPUs also have fairly constrained programming models compared to the CPU. For me, a big part of the paper was the exploration of the intersection of these two sets of programmability constraints.
Section 2.2.4 and the experimental sections explain some of this in detail, but I think one of the more surprising results was that the benefits of fusing dynamic control flow into the broadcasted derivative kernel outweighed the potential detriments (e.g. warp divergence). It turns out newer GPU architectures give you more leeway in that regard than any of us on the team expected.
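To make that concrete, here's a minimal CPU-side sketch (not the paper's GPU implementation; `f` is just a made-up kernel) of what fusing data-dependent control flow into a broadcasted forward-mode derivative looks like with ForwardDiff's dual numbers:

```julia
using ForwardDiff: Dual, value, partials

# Made-up elementwise kernel with data-dependent control flow.
f(x) = x > 0 ? sin(x) : x^2

xs    = randn(Float32, 1024)
duals = Dual.(xs, one(Float32))   # seed each element with dx/dx = 1
ys    = f.(duals)                 # one fused broadcast computes primal + derivative

primal = value.(ys)               # f.(xs)
deriv  = partials.(ys, 1)         # f'.(xs), branch and all
```

On the GPU, the same broadcast compiles to a single kernel in which each thread evaluates the branch for its own element, which is exactly where the warp-divergence concern comes from.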
> How does caching come into play?
Depends on which kind of "caching" you're referring to.
If you mean tape-level partial derivative caching/memory usage:
Broadcasting a forward-mode derivative operator, as presented in this paper, can save on memory when it enables better fusion than reverse-mode on complicated kernels (resulting in fewer temporaries).
However, there is also a question of when this technique should actually be employed: during the forward pass, or during the reverse pass? If employed in the forward pass, then the primal and partial derivative calculations can be fused, reducing compute cost. However, doing so means that the memory required to store the partial derivatives is held captive until those derivatives can be backpropagated in the reverse pass. Conversely, employing the technique in the reverse pass allows you to free the partial derivative storage quickly, but features some redundant computation. Section 2.2.3 of the paper discusses this a bit.
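As a rough illustration of that tradeoff (a hand-rolled sketch, not the paper's implementation; `g`, `eager_rule`, `lazy_rule`, and `pullback` are all made-up names), here are the two placements side by side, using dual numbers to get the elementwise partials in both cases:

```julia
using ForwardDiff: Dual, value, partials

g(x, y) = x > y ? x * sin(y) : x + y   # made-up broadcasted kernel

# Option 1: compute partials during the forward pass. Primal and derivative
# work fuse into one broadcast, but dx/dy stay alive until backprop reaches
# this node.
function eager_rule(xs, ys)
    T  = eltype(xs)
    ds = g.(Dual.(xs, one(T), zero(T)), Dual.(ys, zero(T), one(T)))
    out, dx, dy = value.(ds), partials.(ds, 1), partials.(ds, 2)
    pullback(Δ) = (Δ .* dx, Δ .* dy)
    return out, pullback
end

# Option 2: recompute partials during the reverse pass. Nothing extra is
# stored (the closure only captures xs and ys), at the cost of redoing the
# elementwise work when the pullback runs.
function lazy_rule(xs, ys)
    out = g.(xs, ys)
    function pullback(Δ)
        T  = eltype(xs)
        ds = g.(Dual.(xs, one(T), zero(T)), Dual.(ys, zero(T), one(T)))
        return (Δ .* partials.(ds, 1), Δ .* partials.(ds, 2))
    end
    return out, pullback
end
```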
If you mean instruction-level caching, i.e. efficient pipelining of memory into registers:
On the CPU, it's quite easy to thrash cache for high-arity dual number calculations (i.e. calculations where dual number instances carry around a long stack-allocated array of partial derivatives). Our experiment in Section 3.4.1 tries to characterize the analogous GPU behavior by measuring how occupancy scales with target calculation arity.
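For context, a dual number's "arity" here is the number of partials it carries inline next to its primal value, so the per-element payload grows linearly with the number of inputs being differentiated:

```julia
using ForwardDiff: Dual

low  = Dual(1.0f0, 1.0f0)                            # arity 1: primal + 1 partial
high = Dual(1.0f0, ntuple(_ -> 1.0f0, Val(16))...)   # arity 16: primal + 16 partials
sizeof(low), sizeof(high)   # per-element footprint grows with arity
```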
Also, there was definitely a bit of implementation work to ensure that loads from our GPU-backed "dual number" arrays coalesced properly, that indexing calculations were compiled away when possible, etc. The cool part is that the dual numbers themselves were just the implementation provided by the ForwardDiff package (https://github.com/JuliaDiff/ForwardDiff.jl), which contains no GPU-specific specialization, and they're automagically JIT-compiled for the GPU by CUDAnative (https://github.com/JuliaGPU/CUDAnative.jl).
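For flavor, here's roughly what that usage looks like (a minimal sketch assuming the CuArrays + CUDAnative stack; treat it as illustrative rather than copy-paste-ready, since package versions matter, and `h` is just a made-up kernel):

```julia
using CuArrays, ForwardDiff
using ForwardDiff: Dual, value, partials

h(x) = x > 0 ? sin(x) * x : x^2   # made-up elementwise kernel

xs    = CuArrays.rand(Float32, 1024)
duals = Dual.(xs, 1f0)     # CuArray of plain ForwardDiff duals -- no GPU-specific code
ys    = h.(duals)          # the broadcast is JIT-compiled into one GPU kernel
primal = value.(ys)
deriv  = partials.(ys, 1)
```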
> What about intrinsic condensing functions?
Hmm...I'm not positive I know what "intrinsic condensing functions" are. Apologies!
Can you explain, at a very high level, what problem is being solved, or which part of a machine learning stack is improved?
I have basic knowledge of machine learning and can handle the maths, but, for instance, I have never heard the term 'broadcasting' in this context.
I'm not trying to scrutinize your work, just being genuinely curious and trying to learn.
So, essentially, most ML frameworks' expressivity is heavily constrained by what the framework knows how to "efficiently" differentiate, generally via AD. Our paper presents a technique for improving many ML frameworks' AD implementations for a really common class of operations ("broadcast" operations) in a way that not only benefits performance, but benefits programmability as well.
More specifically, right now, ML frameworks mainly support only one kind of AD (reverse-mode). There's another kind of AD (forward-mode) that is more efficient in certain cases (and less efficient in others). Furthermore, certain programmatic constructs (like broadcasting certain kinds of kernels) that are intractable to differentiate on a GPU in reverse-mode can become tractable if you utilize forward-mode instead.
It's generally a hard problem to combine the two modes in an optimal manner. However, we present a technique that enables the implementer to easily interleave the two modes specifically for differentiating broadcast operations. Our technique also removes some barriers to important compiler-level optimizations (like operator fusion), and thus can improve performance.
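To make "interleaving the two modes for broadcast" a bit more concrete, here's a tiny hand-written sketch (my own illustrative names, not an API from the paper) for an elementwise activation function, the kind of broadcast that shows up constantly in ML models:

```julia
using ForwardDiff: Dual, value, partials

σ(x) = one(x) / (one(x) + exp(-x))   # an elementwise (broadcasted) kernel

# Reverse-mode over the whole model, forward-mode (dual numbers) just for
# the elementwise partials of the broadcast: one fused pass gives both the
# outputs σ.(z) and the elementwise derivatives σ'.(z), and the pullback
# only has to multiply by the incoming gradient.
function broadcast_sigmoid(z)
    ds  = σ.(Dual.(z, one(eltype(z))))
    out = value.(ds)
    d   = partials.(ds, 1)
    pullback(Δ) = Δ .* d      # reverse-mode chain rule, elementwise
    return out, pullback
end

out, back = broadcast_sigmoid(randn(Float32, 8))
grad_z = back(ones(Float32, 8))
```

The whole model is still differentiated in reverse-mode; forward-mode (the dual numbers) is only used inside the broadcast, where it fuses the primal and derivative work into one pass.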
> I have never heard the term 'broadcasting' in this context
If you've used numpy, you've likely used this kind of "broadcasting"; AFAIK it was numpy that popularized the use of the term "broadcasting" for this operation in Python (though I'm no Python historian, so take that with a grain of salt).
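If it helps, here's the idea in a couple of lines of Julia (the language the paper targets); numpy behaves analogously:

```julia
x = [1.0, 2.0, 3.0]     # 3-element vector
y = [10.0 20.0]         # 1×2 row

sin.(x)      # elementwise sin of x
x .+ y       # 3×2 matrix: size-1 dimensions are "broadcast" to match
f(a, b) = a * b + 1
f.(x, y)     # any scalar function can be broadcast the same way -> 3×2
```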