I'm only surprised to not see any of Kenneth Reitz' work on this list, e.g. Requests. In fact in that vein is rauth, an OAuth client lib built on top of Requests (https://github.com/litl/rauth). Full disclosure: I'm the author of rauth. :)
Kenneth Reitz has fantastic publicity already. Can I please just hear about the other things going on in the Python world without bringing him into it, sometimes? It is like a cult of personality.
Hey. Lots of people suggested this, so I added a notice to the blog post to further explain why I chose these libraries:
> I specifically excluded awesome libs like requests, SQLAlchemy, Flask, fabric etc. because I think they're already pretty "main-stream". If you know what you're trying to do, it's almost guaranteed that you'll stumble over the aforementioned. This is a list of libraries that in my opinion should be better known, but aren't.
All it does is grab paragraphs from python.org's html a couple thousand times.
==== Total trials: 100000 =====
bs4 total time: 31.6
pq total time: 9.3
lxml (cssselect) total time: 5.4
lxml (xpath) total time: 4.3
regex total time: 8.9 (doesn't find all p)
What does it mean? Unless you're running thousands of queries for parsing, it doesn't matter which library you choose. My computer is old and slow. Pick whichever one is easiest, the one you'll fight with the least. Don't put energy into unnecessary optimization. Using a good library is like choking someone: they'll fight for a little while until they pass out. (You'll remember this analogy next time you want to switch libraries. Do you really need to choke someone to get your job done?) After they pass out, it's smooth sailing and you don't have to worry. Don't rock the boat unless you have to.
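For anyone who wants to poke at this themselves, here's a minimal sketch of the same kind of benchmark. It's not the original script: a tiny made-up snippet stands in for python.org's HTML, and the trial count is far lower.

```python
# Hypothetical mini-benchmark: time how long BS4 vs. raw lxml takes
# to pull every <p> out of the same HTML snippet.
import timeit

from bs4 import BeautifulSoup
from lxml import html as lxml_html

HTML = "<html><body>" + "<p>hello</p>" * 50 + "</body></html>"

def with_bs4():
    # BS4 here is configured to use lxml as its backend parser.
    return [p.get_text() for p in BeautifulSoup(HTML, "lxml").find_all("p")]

def with_lxml():
    return [p.text for p in lxml_html.fromstring(HTML).xpath("//p")]

# Both approaches should find the same 50 paragraphs.
assert with_bs4() == with_lxml()

for fn in (with_bs4, with_lxml):
    print(fn.__name__, timeit.timeit(fn, number=200))
```

On any given machine the absolute numbers will differ, which is exactly the commenter's point: the ratio matters less than how much you enjoy the API.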
Haha, thank you mercuryrising. That's quite the interesting comparison. A few thoughts:
- Python.org is quite a simple web app; it would be interesting to run this against something a little more complex, like a long Wikipedia article, http://www.nytimes.com/, or even the Alexa Top 500
- It would also be interesting to split the times between parsing and selecting as I feel that's where the difference between pq and lxml comes in.
All in all, it seems BS4 is quite a bit faster than I gave it credit for (especially factoring in parsing + selecting instead of just the former).
I specifically excluded awesome libs like requests, SQLAlchemy, Flask, fabric etc. because I thought them too "main-stream". If you know what you're trying to do, it's almost guaranteed that you'll stumble over the aforementioned. I tried to compile a little bit of a list of libraries that SHOULD be better known, but aren't.
That's what I thought you'd done, and it was a great idea. Except for dateutil and sh, I hadn't heard of any of these. In response to the 'Python for Humans' post, this was a perfect compendium.
Yeah. Also, pretty much anything by Kenneth Reitz is worth a good look (although I can't warm to the args lib for some reason).
Could probably do a top 7 just by Reitz:
From: https://github.com/kennethreitz
1. Requests
2. Clint - easy CLI tools, including cross-platform colour
3. Envoy - "subprocess for humans"
4. Tablib - CSV, Excel and plenty of other tabular data formats
5. python-guide - A work in progress book
6. dynamo - Amazon Dynamo as a Python dict
7. gistapi.py
When compiling lists of the best Python libs, one definitely has to check out Pocoo: http://www.pocoo.org/ They are a bunch of dudes who are incredibly skilled at putting together great APIs. All their libraries, from Pygments and Jinja2 to Sphinx, are well-documented and extremely simple to use.
And if you want to have both command line args and config files, check out my project: https://github.com/ipartola/groper. It even supports creating a sample config file out of the options you have defined.
It is a bit verbose, you are right. Let me explain why all of these are necessary. (BTW, argparse is almost as verbose [1]).
server - this is the section/module. I could grab this from the name of the current module, but that means you must use unique module names, and not change them. Otherwise, your config files would stop working.
daemon - required, obviously, since this is the name of the option.
type - I can't assume a type for you. By default it is a unicode instance, so you can omit this parameter if that's your use case.
cmd_name/cmd_short_name - I can try to assume that cmd_name is the same as the second positional argument, but once again, there could be a conflict. For example, if you want to have options.db.filename and options.log.filename, you can't use cmd_name = 'filename' for both. cmd_short_name is even worse, since here you may want a specific letter to be used (such as an upper case D instead of d). Note that these parameters are optional, since most of your values will likely go in the config file, not on the command line.
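For reference, the "[1]" point about argparse being almost as verbose checks out: defining the same kind of boolean daemon option in the stdlib takes roughly as many parameters. (A sketch; the option names are just illustrative, not taken from groper's actual API.)

```python
import argparse

# Equivalent --daemon flag in argparse: you still spell out the short
# name, long name, behaviour, and help text yourself.
parser = argparse.ArgumentParser()
parser.add_argument("-d", "--daemon", action="store_true",
                    help="run the server as a daemon")

args = parser.parse_args(["--daemon"])
print(args.daemon)  # → True
```

And argparse gives you no config-file story at all, which is the part groper is adding on top.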
Dang, wish I'd known about that a week ago when I was kludging it together for a couple of utilities (thinking "someone must have a library for this but I can't find it!"). Thank you.
The very first suggestion complains that BeautifulSoup is too slow, but as of version 4, it's actually just a navigation layer on top of your preferred parser. So it's as fast as lxml, and as easy to use as, well, BeautifulSoup.
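Concretely, picking the backend is just the second argument to the constructor (using "lxml" assumes lxml is installed alongside beautifulsoup4):

```python
from bs4 import BeautifulSoup

# BS4 delegates the actual parsing to whichever backend you name here;
# swap "lxml" for "html.parser" to fall back to the stdlib parser.
soup = BeautifulSoup("<p>fast <b>and</b> friendly</p>", "lxml")
print(soup.p.get_text())  # → fast and friendly
```

Same navigation API either way; only the parsing speed and lenience change.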
I used BeautifulSoup for a project once because of all the accolades it gets here, but found it to be less than robust. It might be sufficient in a scenario where you have a single site you need to scrape, but I found it totally unreliable when used across a wide range of sites, especially sites with foreign-language content.
I started using PyQuery yesterday, after using BeautifulSoup for a long time. It seems much easier to use.
pq_page = pquery(url=PAGE_URL)
Note that PyQuery has some encoding issues too (or rather, the sites I was scraping were just that bad, declaring two different encodings in their meta tags!). Here are two different things I have done to work around it:
Hey before BS 4 got released I looked forward to the very same thing you're talking about, using BS on top of lxml. But it seems the performance hasn't really improved:
Hey Permit. This is a really good point. Looking at PyQuery's source code [0], it really does nothing in terms of parsing other than calling lxml's fromstring function and then works with the result of that when evaluating queries. So it would probably be a tad slower than pure lxml since it also does other bits (like checking for URLs and then fetching them for you) but from looking at the source code, I'd think that the overhead is minimal.
The coolest part of dateutil isn't the parser, it's the recurrence rules and recurrence rule sets. Doing that on your own is extremely error-prone if you have a non-trivial recurrence.
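For anyone who hasn't seen them, the rrule API reads almost like the English description of the recurrence. A small sketch (the dates are arbitrary):

```python
from datetime import datetime
from dateutil.rrule import rrule, WEEKLY, FR

# "Every other Friday, four occurrences, starting 2013-01-04":
dates = list(rrule(WEEKLY, interval=2, byweekday=FR,
                   dtstart=datetime(2013, 1, 4), count=4))
print([d.date().isoformat() for d in dates])
# → ['2013-01-04', '2013-01-18', '2013-02-01', '2013-02-15']
```

Getting edge cases like month boundaries and leap years right by hand is exactly the error-prone part rrule takes off your plate.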
If you do any non-trivial work with decorators, the `decorator` module is a must: https://micheles.googlecode.com/hg/decorator/documentation.h.... Think of it as @functools.wraps on steroids (though this probably doesn't do it justice). FWIW I think it should be in the standard library.
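The headline trick, compared to plain @functools.wraps, is that it preserves the wrapped function's real signature. A small sketch:

```python
import inspect
from decorator import decorator

@decorator
def log_calls(func, *args, **kwargs):
    # Runs before every call to the decorated function.
    print("calling", func.__name__)
    return func(*args, **kwargs)

@log_calls
def add(x, y):
    return x + y

# A functools.wraps wrapper would report (*args, **kwargs) here;
# the decorator module keeps the original (x, y).
print(inspect.signature(add))  # → (x, y)
print(add(2, 3))
```

That matters for anything that introspects signatures: help(), IDEs, argument-validating frameworks, and so on.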
I'd never heard of pattern before, and while it looks like it's a nice bundle of features, I'm concerned by the fact it references pyWordNet by name even though it hasn't been an independent project since 2006 (http://osteele.com/projects/pywordnet/). Has anyone actually used it?
Python Imaging Library - Today's web is full of images, and PIL makes image manipulation easy. It isn't extremely performance-efficient at very large scale, though.
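The bread-and-butter operations really are one-liners. A sketch (a generated image stands in here for the usual Image.open("photo.jpg")):

```python
from PIL import Image

# Stand-in for Image.open("photo.jpg"): a blank 800x600 canvas.
img = Image.new("RGB", (800, 600), "white")

# thumbnail() scales in place and preserves aspect ratio,
# so an 800x600 image fit into a 128x128 box becomes 128x96.
img.thumbnail((128, 128))
print(img.size)  # → (128, 96)
```

For the "very large scale" caveat above, people usually shell out to ImageMagick or use a dedicated thumbnailing service instead.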
In a similar vein, I recently discovered Wand (http://dahlia.kr/wand/) which is a gorgeous API built on top of ImageMagick. It's a lot more limited in functionality than what PIL offers, but for common operations like scaling, cropping, or extracting EXIF data it's much nicer to work with.
path.py is a must-have; I'm amazed how often it gets rediscovered, despite the fact that I try to advertise it whenever it's relevant. But it's quite old at this point, and maybe we should be more excited about efforts like this one: http://www.python.org/dev/peps/pep-0428/
It seems to accept syntax similar to what the 'at' command does (and obviates the need for my Python C module that did that parsing, based on the scheduler parser for 'at'). Examples include "1 day ago", "ten hours from now" and the like. Very useful.
The docs got my hopes high, but they fail to mention sh isn't supported on Windows -- I saw that by looking at its source on GitHub. I'll take a look at pbs, but this kinda bummed me out.
Sure, I have an Ubuntu box. Thing is, while I would be doing myself a huge favor, the same wouldn't apply to the users of what I'm trying to create. Perhaps the world doesn't need yet another Git GUI, but if I leave out Windows support, the world will need it even less ;)
Not everybody develops web apps that they deploy on servers they control. If 97% of my target audience is running Windows then developing on Linux isn't going to do me any favours.
I'm considering that, but I still haven't managed to understand just why it isn't supported on Windows. Perhaps I haven't delved deep enough or perhaps it's my lack of solid experience with Python.
Of course, I could simply fork it, remove the "not supported on Windows" check and use it, but that feels lazy and dirty ;)
> I'm considering that, but I still haven't managed to understand just why it isn't supported on Windows.
From a quick look at the code, it's not supported on Windows because of its reliance on terminal utilities (pty, termios). Doesn't the banner state to use the previous version (pbs) for Windows support?
And if for some reason it still doesn't work out, you can always use subprocess (which this library is using anyway) or https://github.com/kennethreitz/envoy
> From a quick look at the code, it's not supported on Windows because of its reliance on terminal utilities (pty, termios).
Thanks! Like I said, I don't have lots of experience with Python, so this is very valuable info for me :)
> Doesn't the banner state to use the previous version (pbs) for Windows support?
Yes, but the PyPI page for pbs 0.110 states: "PBS will no longer be supported." I'm not comfortable basing my code on a library which won't be supported in the future.
> And if for some reason it still doesn't work out, you can always use subprocess (which this library is using anyway) or https://github.com/kennethreitz/envoy
I didn't know about envoy either. Thanks again! :)
EDIT: Upon closer examination, the whole OProc class depends on os.fork(), which is available only on Unix, according to Python docs.
If sh allows Pythonic idioms to call Posix functions, then its implementation on a Posix system such as Linux or Mac is probably very different than on non-Posix Windows.
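In the meantime, the portable fallback is the subprocess module that sh itself builds on. A trivial sketch:

```python
import subprocess
import sys

# Runs on Windows too: no pty, termios, or os.fork involved.
# sys.executable just gives us a command guaranteed to exist.
out = subprocess.check_output(
    [sys.executable, "-c", "print('hello from a child process')"],
    text=True,
)
print(out.strip())  # → hello from a child process
```

You lose sh's pretty command-as-function syntax, but the semantics are the same everywhere.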
I have reservations about dateutil. It's certainly true that the built-in time and date routines in Python are ugly, but dateutil's parsing doesn't try to give a single, consistent interpretation of an input date/time. Instead it parses in chunks, where one chunk can effectively overwrite another, so you can feed it illogical or inconsistent input and dateutil will happily give a single "right" answer instead of flagging an error.
In short: it's OK for non-critical cases where you just need "some date" from the input. But don't use it if you would rather have an error than an incorrect interpretation.
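A concrete example of the chunk-wise behaviour: whatever the string doesn't specify gets silently filled in from a default rather than raising.

```python
from datetime import datetime
from dateutil.parser import parse

# Only a time is given; the date portion comes from `default`
# with no warning that the input never mentioned a date.
d = parse("10:36", default=datetime(2003, 9, 25))
print(d)  # → 2003-09-25 10:36:00
```

Handy when you want it, dangerous when you'd rather know the input was incomplete.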
Alright, let me say that I do not like this part of path. Operator overloading is fine; I mean, + is overloaded all to hell. But seeing the division symbol in weird places in a project I took over was distracting at first, then irritating, and by the time I figured out what was going on I did not want it, despite seeing some benefit.
NB. Twisting operator overloading isn't all bad, though. E.g. IO::All nearly gets it all right, and being a little mad I will sometimes use it :) https://metacpan.org/module/IO::All
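For anyone who hasn't run into it, the overloaded `/` is just a path join. The PEP 428 effort mentioned upthread (which became pathlib in the stdlib) kept the same convention:

```python
from pathlib import PurePosixPath

# `/` here is os.path.join in disguise: each operand appends a segment.
p = PurePosixPath("/usr") / "local" / "bin"
print(p)  # → /usr/local/bin
```

So love it or hate it, the division-symbol idiom is now blessed by the standard library.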
Hey, can you talk more about that? I found it to work quite well on my machine and servers; it's also an integral part of another open-source library that I'm building. [0]