The design I used should have a delay of one 74'350 and one 74'257 per two bits of shift/rotate index, though admittedly the '257 is still one bit's worth of extra muxing for every two index bits.
The version I breadboarded has 8 '350s and 5 '257s for a 16-bit SRU[0]; I'm not sure how to compare that area-wise to a 32-bit circuit without '350s, but you'd at least avoid needing logic to do or not do the bit-reverse.
[0]: four-function: shr, sar, shl, rol
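
For anyone who wants to poke at the behaviour, here is a rough Python model of a 16-bit four-function unit that consumes the index two bits per stage. It is only a behavioural sketch: it does not model how the '350s and '257s are actually wired, and the names `sru` and `_stage` are made up for illustration.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def _stage(value, amount, op):
    """One stage: move `value` by `amount` bit positions for the given op."""
    if amount == 0:
        return value
    if op == "shr":  # logical shift right, zero fill
        return value >> amount
    if op == "sar":  # arithmetic shift right, sign fill
        sign = value >> (WIDTH - 1)
        fill = (MASK << (WIDTH - amount)) & MASK if sign else 0
        return (value >> amount) | fill
    if op == "shl":  # shift left, zero fill
        return (value << amount) & MASK
    if op == "rol":  # rotate left
        return ((value << amount) | (value >> (WIDTH - amount))) & MASK
    raise ValueError(f"unknown op: {op}")


def sru(value, index, op):
    """Apply `op` to a 16-bit value, taking the 4-bit index two bits at a time."""
    value &= MASK
    index &= WIDTH - 1
    # Assumed staging: stage 0 moves by 0..3, stage 1 moves by 0, 4, 8 or 12.
    for bit in range(0, 4, 2):
        digit = (index >> bit) & 0b11
        value = _stage(value, digit << bit, op)
    return value
```

Each pass through the loop stands in for one level of delay, so a 16-bit unit has two levels and a 32-bit unit with a 5-bit index would need a third.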