True in this situation, but note that intermediate activations and gradients do take memory, and in other contexts that's the limiting factor. For example, purely convolutional image networks generally take fixed-size inputs, requiring cropping, downsampling, or sliding windows to reach those sizes. That's despite the convolution weights occupying constant memory for any input image size: it's the activations that scale with the image dimensions.
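A rough back-of-the-envelope sketch of the distinction, using made-up layer sizes (64-channel 3x3 conv, fp32): parameter memory is independent of resolution, while the memory for a single layer's output activations grows with H*W.

```python
def conv_memory(h, w, c_in=64, c_out=64, k=3, bytes_per=4):
    """Return (parameter bytes, activation bytes) for one conv layer.

    Parameters (weights + biases) don't depend on the input size;
    the output feature map does.
    """
    params = (k * k * c_in + 1) * c_out       # constant w.r.t. h, w
    activations = h * w * c_out               # scales with image area
    return params * bytes_per, activations * bytes_per

for side in (224, 1024):
    p, a = conv_memory(side, side)
    print(f"{side}x{side}: params {p/1e6:.2f} MB, activations {a/1e6:.2f} MB")
```

At 224x224 the activations for this one layer are ~13 MB; at 1024x1024 they're ~268 MB, while the weights stay at ~0.15 MB either way. Multiply by network depth (and keep activations around for the backward pass) and large images exhaust memory even though the model itself never grows.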