I read your experiment on inlining and it looks quite interesting. Did you benchmark how it affects (improves?) performance? In my line of work I've occasionally found places where inlining would have helped: functions that do masking with 64-bit masks, for example, are a mess to "inline by hand", and doing so kills a lot of readability and clarity. Even with the current limitations of your "toy" implementation, it seems like it would help avoid costly function calls.