
>Do I really know how much cycles/stack it takes to do std::sort(a.begin(), a.end()); in that specific platform? No, so I cannot trust it.

I also don't know how many cycles my own quicksort implementation takes, short of checking the output of a specific compiler and counting instructions. C is not, was not, and will never be a portable assembler.




> I also don't know how many cycles it takes for my implementation of quicksort apart from checking the output of a specific compiler and counting instructions.

On any modern out-of-order CPU, that doesn't get you close to determining the dynamic performance. Even with full knowledge of private microarchitectural details, you'd still have a hard time due to branch prediction.


The embedded/micro space is not out-of-order, broadly. Memory access latency may be variable, or not, but I would charitably assume OP knows their subject material and is either using cycles as a metaphor, or actually works with tiny hardware that has predictable memory access latency.


On most (small; I'm not talking about mini Linux boxes) embedded systems, instruction counting will tell you how long something takes to run. In fact, the compilers available are often so primitive that operation counting in the source code can sometimes tell you how long something will take.


Depends a lot on the compiler and target arch. You'll miss a lot of stack accesses, or count too many. You don't get around looking at the final executable if you want good results. And for more complex targets, in the end you need to know what the pipeline does and how the caches behave if you want a good bound on the cycle count. Of course that's assuming you're on anything more complex than an ATmega, for which op counting might be enough.

I work in the domain; lots of people do measurements, which only give a ballpark and are risky because you might miss the worst case (which is important for safety-critical systems, where a latency spike at the wrong moment might be fatal). Pure op counting is bad too, since the results grossly overestimate (e.g. you always need to assume cache misses if you don't know the cache state, or pipeline stalls, or DRAM, or...). Look at the complexity of PowerPC; that should give you a rough idea of what we're usually dealing with (and yeah, I'm talking embedded here).

To me that "sometimes" feels like "I can wrestle some bears with my bare hands, e.g. a Teddy bear" ;-)


On most embedded ARM it won't (out-of-order execution, caches, alignment), and they don't have to run Linux; likewise on common MIPS implementations.

Simple predictability generally ends at 16-bit CPUs, and even those can be tricky if it's, say, an m68k.


"Most" embedded ARM? Cortex-A8 and smaller do not have OoO execution. Cortex-A9 is a 32-bit, up-to-quad-core CPU clocked at 800MHz-2GHz with 128kB-8MB of cache. That's pretty big. I guess a lot of this is subject to opinion, but I don't think of smartphones with GBs of RAM when I think of embedded systems.


Even e.g. the A53 is in-order, and it's a 64-bit, pretty quick CPU (as seen in Raspberry Pi 2/3).


When I hear "embedded ARM", I hear "Cortex M".

All of Cortex-M is in-order, and only the M7 (still somewhat exotic) has a real cache (silicon vendors often do some modest magic to conceal flash wait states, though).

Alignment requirements are modest and consequences are predictable. Etc.

About the most complicated performance-analysis problem you get is the CPU fighting over memory with your DMA engine. And even that you can ignore if you're using a tightly coupled memory...



