Regarding 3), I suspect that the optimization is rather important for page fault scalability when the process has many threads. You would traditionally synchronize access to the VMA tree using some sort of reader-writer lock, but scalable read locks impose a higher cost on writers. It's easy to believe that splay trees wouldn't help and might hurt in this case: a splay tree lookup rotates the accessed node toward the root, so even a lookup modifies the tree structure and can require more synchronization than a read lock provides.
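
To make the splay tree point concrete, here is a minimal Python sketch. It is not kernel code, and the names `Node`, `bst_lookup`, `splay`, and `shape` are made up for illustration. It only shows that a plain binary search tree lookup leaves the tree untouched, while a splay lookup rewrites pointers on the way, which is why the latter cannot safely run under a shared (read) lock.

```python
class Node:
    """Hypothetical tree node; in a VMA tree the key would be an address range."""
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def shape(node):
    """Return the tree structure as nested tuples, for comparing before/after."""
    if node is None:
        return None
    return (node.key, shape(node.left), shape(node.right))


def bst_lookup(root, key):
    """Plain BST lookup: only reads pointers, so concurrent readers are safe."""
    node = root
    while node is not None and node.key != key:
        node = node.left if key < node.key else node.right
    return node


def rotate_right(x):
    y = x.left
    x.left = y.right
    y.right = x
    return y


def rotate_left(x):
    y = x.right
    x.right = y.left
    y.left = x
    return y


def splay(root, key):
    """Splay lookup: moves the node with `key` (or the last node visited)
    to the root and returns the new root. Every call may rewrite pointers."""
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:        # zig-zig
            root.left.left = splay(root.left.left, key)
            root = rotate_right(root)
        elif key > root.left.key:      # zig-zag
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return root if root.left is None else rotate_right(root)
    else:
        if root.right is None:
            return root
        if key > root.right.key:       # zig-zig
            root.right.right = splay(root.right.right, key)
            root = rotate_left(root)
        elif key < root.right.key:     # zig-zag
            root.right.left = splay(root.right.left, key)
            if root.right.left is not None:
                root.right = rotate_right(root.right)
        return root if root.right is None else rotate_left(root)


if __name__ == "__main__":
    root = Node(50, Node(30, Node(20), Node(40)), Node(70))
    before = shape(root)
    bst_lookup(root, 20)
    print("plain lookup changed tree:", shape(root) != before)
    root = splay(root, 20)
    print("splay lookup changed tree:", shape(root) != before)
```

With a plain lookup, any number of threads could search at once under a read lock. With the splay version, each lookup is effectively a write, so it would need exclusive access or some finer-grained scheme.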

Calling this a micro-optimization is thus misleading; rather, it probably helps quite a lot in some particular workloads and has a negligible impact on others.