My code was two instructions: imul eax,ecx,67h; sar eax,0Ah. The compiler-generated code was: mov eax,66666667h; imul ecx; sar edx,2; mov eax,edx; shr eax,1Fh; add eax,edx.
This was exactly what I was about to mention. I can reproduce your results in Clang, but GCC doesn't seem to want to optimize this. Here's what Godbolt gives me: https://godbolt.org/z/aFTdur
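For anyone who wants to compare the two sequences quoted above, here is a rough C sketch of each. The function names (div10_short, div10_magic, first_mismatch) are made up for illustration, and I'm assuming both sequences are meant to compute a signed 32-bit x / 10, which is what the 66666667h multiplier normally indicates. It also assumes that >> on a negative signed value is an arithmetic shift, as it is on GCC, Clang and MSVC.

```c
#include <stdint.h>

/* Two-instruction version: imul eax,ecx,67h ; sar eax,0Ah
   0x67 = 103 and 103/1024 is slightly more than 1/10, so this is
   only an approximation of x / 10 that holds for small inputs. */
static int32_t div10_short(int32_t x) {
    return (x * 0x67) >> 10;
}

/* Compiler version: mov eax,66666667h ; imul ecx ; sar edx,2 ;
   mov eax,edx ; shr eax,1Fh ; add eax,edx */
static int32_t div10_magic(int32_t x) {
    int64_t prod = (int64_t)x * 0x66666667;    /* imul ecx: 64-bit product in edx:eax */
    int32_t hi = (int32_t)(prod >> 32);        /* high half, i.e. edx */
    hi >>= 2;                                  /* sar edx,2 */
    return hi + (int32_t)((uint32_t)hi >> 31); /* add the sign bit to round toward zero */
}

/* Smallest non-negative x below limit where the short version
   disagrees with x / 10, or -1 if there is none. */
static int32_t first_mismatch(int32_t limit) {
    for (int32_t x = 0; x < limit; ++x) {
        if (div10_short(x) != x / 10) {
            return x;
        }
    }
    return -1;
}
```

With these definitions, first_mismatch(1000) returns 179, so the short sequence only matches x / 10 for inputs from 0 to 178. It also floors negative inputs instead of truncating them toward zero. A compiler can't assume the input is that small unless it can prove the range, which is presumably why it falls back to the longer sequence.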