Contributing os.scandir() to Python (benhoyt.com)
194 points by benhoyt on Aug 26, 2016 | 17 comments



Thanks for the write-up. It does a good job of communicating:

* the joys of contributing to Python

* how time and labor intensive the process can be

* the complexities of dealing with multiple operating systems

* what it is like to be alternately helped or hindered by other developers


I think that sort of thing is an often-underappreciated benefit of Python as an ecosystem: it has a well-defined procedure for adding and changing things in the core language and library that strikes a pretty good balance between agility and prudence, and generally yields great results.


That was a great write-up. I liked how you not only covered the technical details, but the human details, too -- how to get your idea noticed by python-dev, how to gain early adoption.


I think the additional Windows attributes were initiated here:

https://github.com/benhoyt/scandir/issues/22

(Source: It was my issue :) )

I have been using the project since it was named betterwalk and was very glad to see PEP 471 approved. Great work, Ben.

(Now I just need to upgrade the project I use it in to 3.5)


Aha, I'd completely forgotten that had come via a scandir issue, sorry! Fixed here: https://github.com/benhoyt/benhoyt.github.com/commit/0c156b9... (let me know your name if you want to be called out by name).


Ha, nah... that wasn't the idea of mentioning it. Probably shouldn't have...

Thanks again for all your efforts.


Great work! To contribute to Python, would you say it's necessary to be comfortable with writing C code?


No, not necessarily at all -- there's a ton of pure Python code in the standard library, so if you're adding to or making improvements there, no need to know C. And then there's documentation and other non-code issues too.


Perfect, thanks. Been wanting to contribute to Python upstream for a while now, but it's overwhelming every time I take a look. :)


Remember that the status quo is the sane default. It's much more important to fix bugs and add docs than to add features.


Very interesting. I just benchmarked this (the Python 2.7 module only) with an internal application that walks over filesystems and found scandir.walk() to be, inexplicably, slightly slower than os.walk().

I think part of the issue (though I've not tested this yet) is that we're stat()-ing every single file anyway, so with the OS cache considered, it really ends up not mattering.

I thought the additional cost of the extra system calls (even if they were entirely cached in memory) would add up, but it seems like something the scandir module is doing is just less efficient in general.

I'm devising some much simpler and more controllable tests (still with our exact workload) and will test more, though.
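For illustration, the kind of workload described above (walk a tree and stat() every file) might look like the sketch below. It uses os.scandir() from the Python 3.6+ standard library rather than the 2.7 backport, and the function name total_size is made up for this example. On Linux, DirEntry.is_dir() can usually be answered from the directory entry itself, but DirEntry.stat() still costs one system call per file, so a workload that stats everything keeps most of its stat() cost.

```python
import os


def total_size(root):
    """Sum file sizes under root, calling stat() on every regular file.

    Hypothetical helper for illustration; not code from the scandir project.
    """
    total = 0
    stack = [root]
    while stack:
        path = stack.pop()
        with os.scandir(path) as entries:
            for entry in entries:
                # is_dir()/is_file() usually need no extra system call on Linux.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    # stat() still needs one system call per file on Linux.
                    total += entry.stat(follow_symlinks=False).st_size
    return total
```

Calling total_size(".") returns the combined size in bytes of all regular files below the current directory, without following symlinks.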


That's very intriguing to me -- I've rarely/never seen it be slower, and usually at least several times faster. Are you using benchmark.py? If so, can you send a link to a Gist with the output? It may be that the C module is not compiling, and it's falling back to the much slower pure Python module.
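One way to check the fallback theory above is to see whether a compiled extension can be found at all. This is only a minimal sketch: it assumes the backport's C extension is importable under the name `_scandir`, which is not confirmed anywhere in this thread, and the helper name is hypothetical.

```python
import importlib.util


def has_compiled_module(module_name="_scandir"):
    """Return True if a top-level module with this name can be located.

    The default name "_scandir" is an assumption about the backport's
    C extension; pass a different name if your install differs.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("compiled module found:", has_compiled_module())
```

If this prints False, the pure Python fallback is the more likely explanation for the slow results.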


I wasn't running your specific benchmark test. I've been running some more controlled tests now that show it basically being on par with listdir(), which I still consider completely oddball.

Also, I'm not remotely hating on what you did here. I actually wrote and tested a very similar concept in Python several years ago for the exact same reason. My C skills (and mostly my time) weren't really up to par to try to get it included in core. I mostly really want this to work as well as I think it should :)

I'll run some tests with your benchmark.py (and compare that to my benchmark script) and post some results.


https://gist.github.com/AdamJacobMuller/8ca79af8c0102285ff07...

Results are very similar/identical to what my tool said.

Interestingly, if I look at strace call graphs, the scandir version spends basically an identical amount of time in getdents as the os version, but the os version spends a bunch of time in stat calls too. It just seems like, on my workload, that time is considerably less than I expected.
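A small harness along these lines could reproduce the comparison for a single directory. It is a sketch under stated assumptions: it targets Python 3.6+ (os.scandir() in the standard library), the function names are invented for this example, and it measures wall-clock time only, so strace is still needed to see the individual getdents and stat calls.

```python
import os
import time


def sizes_listdir(path):
    """Map name -> size using os.listdir() plus one os.lstat() per entry."""
    return {name: os.lstat(os.path.join(path, name)).st_size
            for name in os.listdir(path)}


def sizes_scandir(path):
    """Map name -> size using os.scandir().

    On Linux, entry.stat() is still one system call per entry.
    """
    with os.scandir(path) as entries:
        return {entry.name: entry.stat(follow_symlinks=False).st_size
                for entry in entries}


def time_call(func, path, repeat=5):
    """Best wall-clock time over `repeat` runs; later runs hit a warm cache."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(path)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    for func in (sizes_listdir, sizes_scandir):
        print(func.__name__, time_call(func, "."))
```

When every entry is stat()-ed and the metadata is already cached, both versions make roughly the same system calls on Linux, so similar timings would not be surprising.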


This issue was looked into decades ago.

NFSv3 introduced READDIRPLUS in 1995: https://tools.ietf.org/html/rfc1813#section-3.3.17

FUSEv3 (the library) will have support for such a call. It is already in the git repository ( https://github.com/libfuse/libfuse/blob/master/include/fuse_... ) .

Sadly, there is still no syscall on Linux to do a "readdirplus" (usually called xgetdents()).


Nice to read this, man, thanks! Interesting to see your experience in this process and what it's like to contribute to such a project. I've stopped writing a lot of Python code, admittedly because I prefer to torture myself with unnecessarily tedious things... but I like how Python evolves, and these kinds of contributions are really important and are what make me feel Python is one of the more powerful scripting / programming tools if you're not writing in machine-native code. Thanks for the write-up, and of course your contribution to improving all of our codez =]


Nice work, I'll keep it in mind next time.



