It is. A kernel is executed by a set of warps (or waves, on AMD), and each warp/wave is the SIMD group that executes an instruction together. Memory accesses that feed these SIMD instructions are coalesced, ideally into a single operation when the threads of a warp touch adjacent addresses, and therefore the optimal memory layout requires thinking in terms of warps.
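To make the coalescing point concrete, here is a rough back-of-the-envelope model, not real hardware behaviour. It counts how many memory segments one warp touches when lane i reads element i * stride. The function name `segments_touched`, the 32-lane warp, the 128-byte segment size, and the aligned base address are all assumptions chosen for illustration; actual transaction sizes depend on the architecture and cache configuration.

```python
def segments_touched(stride_elems, elem_bytes=4, warp_size=32, segment_bytes=128):
    """Count distinct memory segments one warp touches.

    Assumes lane i reads the element at index i * stride_elems, the base
    address is aligned to segment_bytes, and the hardware fetches whole
    segments (all illustrative assumptions).
    """
    addresses = [lane * stride_elems * elem_bytes for lane in range(warp_size)]
    return len({addr // segment_bytes for addr in addresses})


if __name__ == "__main__":
    # Stride 1 is the layout where neighbouring lanes read neighbouring elements.
    for stride in (1, 2, 32):
        print(f"stride {stride:2d}: {segments_touched(stride)} segment(s)")
```

Under these assumptions, unit stride lets the whole warp be served from one segment, while a stride of 32 floats forces a separate segment per lane. That is the reason a structure-of-arrays layout is often preferred over array-of-structures when each thread reads one field.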
Kernels are launched with some block size, and each block is executed by a single SM. Each SM partition has a limited set of resources, and therefore the number of warps (or waves, on AMD) that execute simultaneously should be tuned according to the resources available (registers, shared memory, etc.).
The keyword to search for here is occupancy, and related topics include register pressure/spilling, shared memory, and L1 cache size.
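As a sketch of how those resource limits interact, the following estimates how many warps can be resident on one SM for a given launch configuration. The function `resident_warps_per_sm` is hypothetical, and the default limits (64 warps, 32 blocks, 65536 registers, 96 KiB shared memory per SM) are placeholder values, not the figures for any particular GPU. It also ignores allocation granularity and the per-partition split, so treat it as an approximation only.

```python
def resident_warps_per_sm(threads_per_block, regs_per_thread, smem_per_block,
                          warp_size=32, max_warps=64, max_blocks=32,
                          regs_per_sm=65536, smem_per_sm=98304):
    """Estimate resident warps per SM; all limits are illustrative assumptions."""
    # Round the block size up to whole warps.
    warps_per_block = -(-threads_per_block // warp_size)

    # Each resource caps the number of blocks that fit on the SM.
    limits = [
        max_blocks,
        max_warps // warps_per_block,
        regs_per_sm // (regs_per_thread * warps_per_block * warp_size),
    ]
    if smem_per_block > 0:
        limits.append(smem_per_sm // smem_per_block)

    # The scarcest resource decides how many blocks are resident.
    return min(limits) * warps_per_block


if __name__ == "__main__":
    for regs, smem in ((32, 0), (128, 0), (32, 49152)):
        warps = resident_warps_per_sm(256, regs, smem)
        print(f"regs={regs:3d} smem={smem:5d}: {warps} warps, "
              f"occupancy {warps / 64:.0%}")
```

With these placeholder limits, going from 32 to 128 registers per thread drops the estimate from 64 to 16 resident warps, which is the register-pressure effect mentioned above. For real numbers, query the device or use the vendor's tooling; in CUDA, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` reports the block count for an actual kernel.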
All extremely relevant.