I still don't think so. Exposing these microarchitectural concerns to the architectural level limits flexibility. For a compiler to efficiently schedule multiple execution units, it needs to know the exact latency of every instruction. That may be doable for arithmetic, but even arithmetic latencies vary greatly from one processor generation to the next. And a compiler definitely cannot know the latency of a load: a few cycles on an L1 hit, a few hundred cycles if it goes out to DRAM, millions of cycles if it triggers a page fault that has to be serviced from disk. These latencies vary a lot, not just between processor generations but even within the same generation.
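To make the load-latency point concrete, here's a minimal sketch (my example, not from the discussion above) for x86-64 with GCC or Clang. It times the same load instruction twice: once while the line is warm in L1, and once after flushing it out to DRAM. The exact cycle counts are illustrative and machine-dependent, and the fences add some fixed overhead, so only the relative gap matters.

```c
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>  /* __rdtsc, _mm_clflush, _mm_mfence (GCC/Clang, x86-64) */

/* Time a single load of *p in cycles, bracketed by fences so the
 * load can't be reordered around the timestamp reads. */
static unsigned long long time_load(volatile long *p)
{
    _mm_mfence();
    unsigned long long start = __rdtsc();
    (void)*p;                        /* the load being measured */
    _mm_mfence();
    return __rdtsc() - start;
}

int main(void)
{
    volatile long *p = malloc(sizeof *p);
    if (!p) return 1;
    *p = 42;

    (void)*p;                        /* touch it: the line is now hot in L1 */
    printf("L1 hit:    ~%llu cycles\n", time_load(p));

    _mm_clflush((const void *)p);    /* evict the line from all cache levels */
    _mm_mfence();
    printf("DRAM miss: ~%llu cycles\n", time_load(p));

    free((void *)p);
    return 0;
}
```

Run it on two different machines and you'll typically get two different pairs of numbers. A compiler that baked either latency into a static schedule would be wrong on the other machine, and wrong on the same machine whenever the cache state differs.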