That's very intriguing to me -- I've rarely, if ever, seen it be slower, and it's usually at least several times faster. Are you using benchmark.py? If so, can you send a link to a Gist with the output? It may be that the C module is not compiling, and it's falling back to the much slower pure Python module.
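A quick way to rule out the fallback theory is to check whether the compiled extension can be found at all. This is only a rough sketch: the module name `_scandir` is an assumption about how the package names its C extension, and the helper `c_extension_available` is a hypothetical name for illustration, so adjust both to match your install.

```python
import importlib.util


def c_extension_available(module_name="_scandir"):
    """Return True if a module with this name can be found on the import path.

    The default name "_scandir" is an assumption, not a confirmed detail
    of the package; pass a different name if your install differs.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for odd or half-initialised modules;
        # treat that the same as "not found".
        return False


if __name__ == "__main__":
    print("C extension found:", c_extension_available())
```

If this prints `False`, the benchmark numbers are probably coming from the pure Python code path rather than the C one.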
I wasn't running your specific benchmark test. I've been running some more controlled tests now, and they show it basically being on par with listdir(), which I still consider completely oddball.
Also, I'm not remotely hating on what you did here. I actually wrote and tested a very similar concept in Python several years ago for the exact same reason. My C skills (and, mostly, my time) weren't really up to par to try to get it included in core. I mostly really want this to work as well as I think it should :)
I'll run some tests with your benchmark.py (and compare that to my benchmark script) and post some results.
Results are very similar/identical to what my tool said.
Interestingly, if I look at strace call graphs, the scandir version spends basically an identical amount of time in getdents as the os version, but the os version spends a bunch of time in stat calls too. It just seems like on my workload, that time is considerably less than I expected.
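For anyone who wants to reproduce that comparison without strace, here is a minimal sketch of the two access patterns. It makes some assumptions: it uses the standard-library `os.scandir` (Python 3.6+ for the `with` form) as a stand-in for the package discussed here, and `make_tree`, `count_dirs_listdir`, `count_dirs_scandir`, and `time_call` are hypothetical helper names, not part of any benchmark script in this thread.

```python
import os
import tempfile
import time


def make_tree(root, n_files=2000, n_dirs=200):
    """Create a flat test directory with empty files and subdirectories."""
    for i in range(n_files):
        open(os.path.join(root, "file_%d.txt" % i), "w").close()
    for i in range(n_dirs):
        os.mkdir(os.path.join(root, "dir_%d" % i))


def count_dirs_listdir(path):
    """listdir pattern: one stat-style call per entry via os.path.isdir()."""
    return sum(os.path.isdir(os.path.join(path, name))
               for name in os.listdir(path))


def count_dirs_scandir(path):
    """scandir pattern: is_dir() can usually reuse the type information
    returned with the directory listing, avoiding a per-entry stat."""
    with os.scandir(path) as entries:
        return sum(entry.is_dir() for entry in entries)


def time_call(func, path, repeat=5):
    """Return the best wall-clock time (seconds) over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(path)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_tree(root)
        print("listdir + isdir:", time_call(count_dirs_listdir, root))
        print("scandir        :", time_call(count_dirs_scandir, root))
```

On a small tree that is already in the kernel's cache, the per-entry stat calls may be cheap enough that the two numbers end up close, which would be consistent with what the strace output suggests here.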