Python Libraries you should know about (doda.co)
483 points by trueduke on Nov 12, 2012 | 69 comments



I'm only surprised to not see any of Kenneth Reitz' work on this list, e.g. Requests. In fact in that vein is rauth, an OAuth client lib built on top of Requests (https://github.com/litl/rauth). Full disclosure: I'm the author of rauth. :)


Kenneth Reitz has fantastic publicity already. Can I please just hear about the other things going on in the Python world without bringing him into it, sometimes? It is like a cult of personality.


Hey. Lots of people suggested this so I added a notice to the blogpost to further explain why I chose these libraries:

> I specifically excluded awesome libs like requests, SQLAlchemy, Flask, fabric etc. because I think they're already pretty "main-stream". If you know what you're trying to do, it's almost guaranteed that you'll stumble over the aforementioned. This is a list of libraries that in my opinion should be better known, but aren't.


The author mentioned in another comment that he left off many things like Requests, since they are very well known or easily discovered.


Since everyone is asking (and no one is doing), I put together a simple benchmark for pyquery, bs4, and lxml (cssselect/xpath).

https://gist.github.com/4061368

All it does is grab paragraphs from python.org's html a couple thousand times.

    ==== Total trials: 100000 =====
    bs4 total time: 31.6
    pq total time: 9.3
    lxml (cssselect) total time: 5.4
    lxml (xpath) total time: 4.3
    regex total time: 8.9 (doesn't find all p)

What does it mean? Unless you're running thousands of queries for parsing, it doesn't matter which library you choose. My computer is old and slow. Pick whichever one is the easiest, the one you'll fight with the least. Don't put energy into unnecessary optimization. Using a good library is like choking someone: they'll fight for a little while until they pass out. (You'll remember this analogy next time you want to switch libraries. Do you really need to choke someone to get your job done?) After they pass out, it's smooth sailing and you don't have to worry. Don't rock the boat unless you have to.


Haha, thank you mercuryrising. That's quite the interesting comparison. A few thoughts:

- Python.org is quite a simple web app, it would be interesting to run this against something a little more complex like a long wikipedia article or http://www.nytimes.com/ or even the Alexa Top 500

- It would also be interesting to split the times between parsing and selecting as I feel that's where the difference between pq and lxml comes in.

All in all it seems BS4 is quite a bit faster than I gave it credit for (especially factoring in parsing+selecting instead of just the former)
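
For anyone who wants to try the parse/select split themselves, here is a minimal sketch of the timing idea. It is not the gist's code: it uses the standard library's `xml.etree.ElementTree` on a made-up, well-formed document as a stand-in for bs4/pyquery/lxml, and the names `DOC`, `parse`, `select` and `time_phases` are all hypothetical.

```python
import timeit
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in document; the real benchmark used live HTML.
DOC = "<html><body>" + "<div><p>hello</p></div>" * 200 + "</body></html>"


def parse():
    """Build the tree from the raw string."""
    return ET.fromstring(DOC)


def select(tree):
    """Find every <p> element in an already-parsed tree."""
    return tree.findall(".//p")


def time_phases(trials=100):
    """Return (parse_seconds, select_seconds), each measured over `trials` runs."""
    parse_time = timeit.timeit(parse, number=trials)
    tree = parse()  # parse once, outside the selection timer
    select_time = timeit.timeit(lambda: select(tree), number=trials)
    return parse_time, select_time


if __name__ == "__main__":
    parse_time, select_time = time_phases()
    print("parsing time:   %.3f" % parse_time)
    print("selecting time: %.3f" % select_time)
```

Swapping the bodies of `parse` and `select` for the library you care about should give numbers that are comparable per phase, assuming the same document is used for each.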


Python.org was pretty simple. bs4 really hits the brakes on the NYT page.

    ==== Total trials: 100000 =====
    bs4 parsing time: 0.6
    bs4 selecting time: 146.3
    pq parsing time: 0.0
    pq selecting: 15.7
    lxml parsing time: 0.0
    lxml (cssselect) selecting time: 12.4
    lxml parsing time: 0.0
    lxml (xpath) selecting: 11.5


That seems more in line with my experience :) This is the stuff lxml shines at. Thank you for testing this out.


I'm only upvoting for the choking analogy.

That one is going into my permanent memory file.


You should probably add requests http://docs.python-requests.org. Anyway great list! I'm already using most of them and they are awesome.


Hey. Author of the blog post here.

I specifically excluded awesome libs like requests, SQLAlchemy, Flask, fabric etc. because I thought them too "main-stream". If you know what you're trying to do, it's almost guaranteed that you'll stumble over the aforementioned. I tried to compile a little bit of a list of libraries that SHOULD be better known, but aren't.


That's what I thought you'd done, and it was a great idea. Except for dateutil and sh, I hadn't heard of any of these. In response to the 'Python for Humans' post, this was a perfect compendium.

Thanks a million.


I see your point but in my opinion, requests is quite different from SQLAlchemy, Flask and fabric. I wouldn't put them at the same level.


Yeah. Also pretty much anything by Kenneth Reitz (although i can't warm to the args lib for some reason) is worth a good look.

Could probably do a top 7 just by Reitz:

  From: https://github.com/kennethreitz

  1. Requests
  2. CLINT - Easy CLI tools inc cross platform colour
  3. Envoy - "subprocess for humans"
  4. Tablib - csv, excel and plenty others, tabular data
  5. python-guide - A work in progress book
  6. dynamo - Amazon Dynamo as a python dict
  7. gistapi.py


When compiling lists of the best Python libs, one definitely has to check out Pocoo: http://www.pocoo.org/ They are a bunch of dudes who are incredibly skilled at putting together great APIs. All their libraries, from pygments and jinja2 to sphinx, are well-documented and extremely simple to use.


One I'd like to add is Docopt: http://docopt.org/ and https://github.com/docopt/docopt

It makes it very simple and intuitive to build command line apps.


And if you want to have both command line args and config files, check out my project: https://github.com/ipartola/groper. It even supports creating a sample config file out of the options you have defined.


This might just be your examples, but it seems a bit verbose.

For example for:

define_opt('server', 'daemon', type=bool, cmd_name='daemon', cmd_short_name='d')

Can I just write define_opt('daemon') ?


It is a bit verbose, you are right. Let me explain why all of these are necessary. (BTW, argparse is almost as verbose [1]).

server - this is the section/module. I could grab this from the name of the current module, but that means you must use unique module names, and not change them. Otherwise, your config files would stop working.

daemon - required, obviously, since this is the name of the option.

type - I can't assume a type for you. By default it is a unicode instance, so you can omit this parameter if that's your use case.

cmd_name/cmd_short_name - I can try to assume that cmd_name is the same as the second positional argument, but once again, there could be a conflict. For example, if you want to have options.db.filename and options.log.filename, you can't use cmd_name = 'filename' for both. cmd_short_name is even worse, since here you may want a specific letter to be used (such as an upper case D instead of d). Note that these parameters are optional, since most of your values will likely go in the config file, not on the command line.

An alternative API might look like this:

  define_opt_int('server', 'shutdown_timeout')
  define_opt_bool('server', 'daemon')
  define_opt_cmd_unicode('server', 'pid_file', cmd_name='pid', cmd_short_name='p')

With a fallback to:

  define_opt('log', 'level', type=lambda x: x if x in LOG_LEVELS else 'DEBUG', cmd_name='log-level', cmd_short_name='L')

Any feedback is greatly appreciated.

[1] http://docs.python.org/dev/library/argparse.html
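
For comparison, here is roughly what the command-line half looks like in plain argparse. This is only a sketch: the option names mirror the `daemon` and `shutdown_timeout` examples above, while the default of 30 and the help strings are made up, and it does nothing about config files, which is the part groper adds.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical server launcher."""
    parser = argparse.ArgumentParser(description="Hypothetical server launcher")
    # Boolean switch: -d / --daemon, False unless given on the command line.
    parser.add_argument("-d", "--daemon", action="store_true",
                        help="run in the background")
    # Typed option with an (assumed) default of 30 seconds.
    parser.add_argument("--shutdown-timeout", type=int, default=30,
                        help="seconds to wait before forcing shutdown")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-d", "--shutdown-timeout", "10"])
    print(args.daemon, args.shutdown_timeout)
```

Each option still needs its long name, short name and type spelled out, which is the verbosity being compared.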


So maybe just define_opt('server.daemon'), with the rest given sane defaults and customizable as needed?

Interesting module nonetheless!


Dang, I wish I'd known about that a week ago when I was kludging it for a couple of utilities (thinking "someone must have a library for this but I can't find it!"). Thank you.


You are welcome. I hope you do end up using groper in other projects. Any feedback is greatly appreciated!


The very first suggestion complains that BeautifulSoup is too slow, but as of version 4, it's actually just a navigation layer on top of your preferred parser. So it's as fast as lxml, and as easy to use as, well, BeautifulSoup.


I used BeautifulSoup for a project once because of all the accolades it gets here but found it to be less than robust. It might be sufficient in a scenario where you have a single site you need to scrape, but I found it was totally unreliable when used across a wide range of sites, especially sites with foreign-language content.


I started using PyQuery yesterday, after using BeautifulSoup for a long time. It seems much easier to use.

  # assuming something like: from pyquery import PyQuery as pquery
  pq_page = pquery(url=PAGE_URL)

Note that PyQuery has some encoding issues too (or rather, the sites I was scraping were that bad, declaring two different encodings in their meta tags!). Here are two things I have done to work around it:

  page = requests.get(PAGE_URL)
  pq_page = pquery(page.text)

If that doesn't cut it (because requests detects it wrong too), try forcing the encoding in requests:

  page = requests.get(PAGE_URL)
  page.encoding = 'utf-8'
  pq_page = pquery(page.text)


Hey before BS 4 got released I looked forward to the very same thing you're talking about, using BS on top of lxml. But it seems the performance hasn't really improved:

http://www.crummy.com/2012/1/22/0


A DOM parser will always be slower than a stream parser ...
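
To make "stream parser" concrete, here is a small sketch using the standard library's `html.parser`. It reacts to each start tag as it goes by and never builds a tree, so it only keeps a counter in memory. The class and function names are made up, and this says nothing about how lxml or bs4 work internally.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Counts <p> start tags as they stream past; no tree is kept."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands tag names over already lower-cased.
        if tag == "p":
            self.count += 1


def count_paragraphs(html):
    """Feed the whole document through the streaming counter."""
    counter = ParagraphCounter()
    counter.feed(html)
    counter.close()
    return counter.count


if __name__ == "__main__":
    print(count_paragraphs("<html><body><p>one</p><p>two</p></body></html>"))
```

The trade-off is that anything needing context (parents, siblings, CSS selectors) has to be tracked by hand, which is what a DOM gives you for free.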


Is PyQuery fast? The whole premise of the first point was that BeautifulSoup was too slow, but then he didn't provide a comparison between them.


The comparison should be moot, because Beautiful Soup 4 uses the lxml parser when it's available: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#specif...


Hey Permit. This is a really good point. Looking at PyQuery's source code [0], it really does nothing in terms of parsing other than calling lxml's fromstring function and then works with the result of that when evaluating queries. So it would probably be a tad slower than pure lxml since it also does other bits (like checking for URLs and then fetching them for you) but from looking at the source code, I'd think that the overhead is minimal.

[0] https://github.com/dsc/pyquery/blob/master/pyquery/pyquery.p...


Do you have any more up-to-date data than a 2008 comparison?


Hey, I commented on this a bit further down in the topic: http://www.crummy.com/2012/1/22/0


It uses lxml, so yes it is fast.


So does BeautifulSoup


I <3 you OP. So much right now. I didn't even know i needed these until I read this post and now I know and I'm so happy.

I know matplotlib comes with Python(x,y) but that's a pretty awesome one too.


The coolest part of dateutil isn't the parser, it's the recurrence rules and recurrence rule sets. Doing that on your own is extremely error-prone if you have a non-trivial recurrence.


Cool, thank you for making me aware of that. I haven't had to use something like that but it's good to know. I'll put it in the blogpost.

EDIT: I added it below the `parse` example. Thanks again!


Great list. In particular, fuzzywuzzy and pattern caught my eye in a "how-have-I-not-heard-of-these" kind of way.


If you do any non-trivial work with decorators, the `decorator` module is a must: https://micheles.googlecode.com/hg/decorator/documentation.h.... Think of it as @functools.wraps on steroids (though this probably doesn't do it justice). FWIW I think it should be in the standard library.
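
For reference, this is the standard-library baseline being compared against: a sketch of a decorator that uses `functools.wraps` so the wrapped function keeps its name and docstring. The `logged` decorator and `add` function are made up for illustration; the `decorator` module is not used here.

```python
import functools


def logged(func):
    """Decorator that counts calls; functools.wraps copies metadata over."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@logged
def add(a, b):
    """Return a + b."""
    return a + b


if __name__ == "__main__":
    print(add(1, 2), add.calls, add.__name__, add.__doc__)
```

Without the `@functools.wraps(func)` line, `add.__name__` would be `'wrapper'` and the docstring would be lost.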


No mention of pandas or nltk?

I'd never heard of pattern before, and while it looks like it's a nice bundle of features, I'm concerned by the fact it references pyWordNet by name even though it hasn't been an independent project since 2006 (http://osteele.com/projects/pywordnet/). Has anyone actually used it?


Here are some Ruby libraries I’ve used that are similar to those Python libraries:

pyquery equivalent: Nokogiri (http://nokogiri.org/). Lets you select elements with jQuery-like selectors. Uses libxml2 as its parser.

watchdog equivalent: watchr (https://github.com/mynyml/watchr). Run code when the filesystem changes.

path.py equivalent: rush (http://rush.heroku.com/). Provides a far better API to the filesystem than the standard library.

I also found this equivalent to fuzzywuzzy, but I’ve never used it: amatch (http://flori.github.com/amatch/)


Excellent collection! I will add one more.

Python Imaging Library - Today's web is full of images, and PIL makes image manipulation easy, although it's not especially efficient at very large scale.


In a similar vein, I recently discovered Wand (http://dahlia.kr/wand/) which is a gorgeous API built on top of ImageMagick. It's a lot more limited in functionality than what PIL offers, but for common operations like scaling, cropping, or extracting EXIF data it's much nicer to work with.


actually, Pillow is a much better fork of PIL. It's much more friendly

http://pypi.python.org/pypi/Pillow/


I have a few scripts that could benefit from PyQuery, didn't know about that one. Also path.py looks like it will save me some time too. Thanks!


path.py is a must have, I am amazed how often it's being rediscovered, despite the fact that I try to advertise it whenever it's relevant. But it's quite old at this point, maybe we should be more excited about efforts like this one: http://www.python.org/dev/peps/pep-0428/
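
For later readers: PEP 428 was accepted and shipped as `pathlib` in Python 3.4. A minimal sketch of what it looks like, using `PurePosixPath` so the result is the same on every platform (the helper name `join_parts` is made up):

```python
from pathlib import PurePosixPath


def join_parts(first, *rest):
    """Join path segments one at a time with pathlib's / operator."""
    result = PurePosixPath(first)
    for part in rest:
        result = result / part
    return result


if __name__ == "__main__":
    p = join_parts("a", "b", "c.txt")
    print(p, p.name, p.suffix, p.parent)
```

`PurePosixPath` only manipulates the path string; for actual filesystem access you would use `pathlib.Path` instead.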


for fuzzy date work i tend to rely on the parsedatetime module:

http://code.google.com/p/parsedatetime/

it seems to accept syntax similar to what the 'at' command accepts (and obviates the need for my python C module to do that parsing based on the scheduler parser for 'at'). examples include "1 day ago", "ten hours from now" and the like. very useful.


Speaking as someone with approximately 4h of Python experience -- great list. I'll be using the sh lib right away.


Unless you're using Windows :(

The docs got my hopes high, but they fail to mention sh isn't supported on Windows -- I saw that by looking at its source on GitHub. I'll take a look at pbs, but this kinda bummed me out.


Do yourself a huge favor and develop on Linux. VirtualBox + Ubuntu 12.04 = Free.


Sure, I have an Ubuntu box. Thing is, while I would be doing myself a huge favor, the same wouldn't apply to the users of what I'm trying to create. Perhaps the world doesn't need yet another Git GUI, but if I leave out Windows support, the world will need it even less ;)


That would really not improve my workflow for developing windows phone applications.


Not everybody develops web apps that they deploy on servers they control. If 97% of my target audience is running Windows then developing on Linux isn't going to do me any favours.


Scratch that itch. :)


I'm considering that, but I still haven't managed to understand just why it isn't supported on Windows. Perhaps I haven't delved deep enough or perhaps it's my lack of solid experience with Python.

Of course, I could simply fork it, remove the "not supported on Windows" check and use it, but that feels lazy and dirty ;)


> I'm considering that, but I still haven't managed to understand just why it isn't supported on Windows.

From a quick look at the code, it's not supported on Windows because of its reliance on terminal utilities (pty, termios). Doesn't the banner state to use the previous version (pbs) for Windows support?

https://github.com/amoffat/sh/blob/master/sh.py#L33

And if for some reason it still doesn't work out, you can always use subprocess (which this library is using anyway) or https://github.com/kennethreitz/envoy


> From a quick look at the code, it's not supported on windows because of its reliance on terminal utilities(pty, termios).

Thanks! Like I said, I don't have lots of experience with Python, so this is very valuable info for me :)

> Doesn't the banner state to use the previous version(pbs) for windows support?

Yes, but the PyPI page for pbs 0.110 states: "PBS will no longer be supported." I'm not comfortable with basing my code on a library which won't be supported in the future.

> And if for some reason it still doesn't work out, you can always use subprocess(which this library is using anyway) or https://github.com/kennethreitz/envoy

I didn't know about envoy either. Thanks again! :)

EDIT: Upon closer examination, the whole OProc class depends on os.fork(), which is available only on Unix, according to Python docs.
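
A quick way to see the difference for yourself, plus the portable fallback mentioned above. This is just a sketch with made-up helper names (`can_fork`, `run`); it is not how sh or pbs are implemented.

```python
import os
import subprocess
import sys


def can_fork():
    """True where os.fork exists (Unix-like systems); False on Windows."""
    return hasattr(os, "fork")


def run(cmd):
    """Run a command via subprocess and return its stdout as text."""
    completed = subprocess.run(cmd, stdout=subprocess.PIPE, check=True,
                               universal_newlines=True)
    return completed.stdout


if __name__ == "__main__":
    print("os.fork available:", can_fork())
    # sys.executable is used so the example runs wherever Python does.
    print(run([sys.executable, "-c", "print('hello from a child process')"]))
```

`subprocess` picks the platform's own process-creation mechanism, so code written against it does not need `os.fork` to exist.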


If sh allows Pythonic idioms to call Posix functions, then its implementation on a Posix system such as Linux or Mac is probably very different than on non-Posix Windows.


i have reservations about dateutil. it's certainly true that the built-in time and date routines in python are ugly, but dateutil's parsing doesn't try to give a single, consistent interpretation of an input date/time - instead it parses it in chunks, where one chunk can effectively overwrite another. so you can have input that is illogical or inconsistent and dateutil will happily give a single "right" answer instead of flagging an error.

for example, see this bug on so - http://stackoverflow.com/questions/10575919/strange-date-par...

in short: it's ok for non-critical cases where you just need "some date" from input. but don't use it if you would rather have an error than an incorrect interpretation.
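
if you do want the error, the standard library's `datetime.strptime` behaves that way: you give it one explicit format and it raises `ValueError` on anything that doesn't match. a minimal sketch (the helper name and the default format are just choices made for this example):

```python
from datetime import datetime


def parse_strict(text, fmt="%Y-%m-%d %H:%M"):
    """Parse text against one explicit format; raise ValueError on any mismatch."""
    return datetime.strptime(text, fmt)


if __name__ == "__main__":
    print(parse_strict("2012-11-12 09:30"))
    try:
        parse_strict("2012-13-40 09:30")  # month 13 does not exist
    except ValueError as exc:
        print("rejected:", exc)
```

the cost is that you have to know the input format up front, which is exactly what dateutil's parser lets you skip.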


Thanks for sharing. These are a nice set of libraries I never used. I am bookmarking this page.


The latest BeautifulSoup uses lxml (version 4+). How does it compare to PyQuery?


You have a typo. Search for 'PyQyery', twice.


Thank you, fixed.


This is quite a cool idea:

  >>> path('a') / 'b' / 'c'
  path('a/b/c')

It would be fun to have that in Ruby!
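
In case anyone wonders how the trick works on the Python side: it is plain operator overloading. Below is a minimal sketch, not path.py's actual code. `SlashPath` is a made-up class, and it defines `__truediv__`, which is the Python 3 hook for `/` (Python 2 without true division used `__div__`).

```python
import posixpath


class SlashPath(str):
    """Hypothetical str subclass where / joins path segments."""

    def __truediv__(self, other):
        # posixpath.join keeps the separator fixed at "/" on every platform.
        return SlashPath(posixpath.join(self, other))

    def __repr__(self):
        return "SlashPath(%r)" % str(self)


if __name__ == "__main__":
    print(repr(SlashPath("a") / "b" / "c"))
```

Because each `/` returns another `SlashPath`, the operator chains left to right.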


Alright, let me say that I do not like this part of path. Operator overloading is fine, I mean the + is overloaded all to hell. But seeing the division symbol in weird places in a project I took over was distracting at first, then irritating, and by the time I figured out what was going on I did not want it despite seeing some benefit.


I agree. It reminds me of FancyRoutes which I also found off putting - http://news.ycombinator.com/item?id=1100248

Path.py may already do it (docs are limited so I'm not sure) but something like below would be better:

  my $foo = dir('a', 'b', 'c');

This is how Path::Class works in Perl (https://metacpan.org/module/Path::Class). This way the path is a filesystem agnostic directory object.

NB. Twisting operator overloading isn't all bad though. E.g. IO::All nearly gets it all right and being a little mad I will sometimes use it :) https://metacpan.org/module/IO::All


I actually wrote up a module a while ago; what it did was in some ways similar to path:

  >>> print SimpleFS('a').b.c
  a/b/c

The bigger thing (and original purpose) was uses like this:

  >>> root = SimpleFS("/")
  >>> sda_size = root.sys.block.sda.size
  >>> for fd in root.proc.self.fd:
  ...     os.close(int(fd))
  >>> root.var.log[sys.argv[0] + ".log"] += "Size of disk: %d" % sda_size
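
The path-building part of that can be sketched with `__getattr__`. This is a guess at the general approach, not the module's real code: `AttrPath` is a made-up name, and it only builds path strings; reading file contents, iterating directories and the `+=` append are left out.

```python
import posixpath


class AttrPath(object):
    """Hypothetical stand-in: attribute access appends a path segment."""

    def __init__(self, path):
        self._path = path

    def __getattr__(self, name):
        # Only called for names not found normally, e.g. .sys, .block, .sda
        if name.startswith("_"):
            raise AttributeError(name)
        return AttrPath(posixpath.join(self._path, name))

    def __str__(self):
        return self._path


if __name__ == "__main__":
    print(AttrPath("a").b.c)
    print(AttrPath("/").sys.block.sda.size)
```

The obvious limitation is that segments must be valid Python identifiers, which is presumably why the original falls back to item access for names like `sys.argv[0] + ".log"`.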


watchdog is pretty buggy on linux at least.


Hey, can you talk more on that? I found it to work quite well on my machine and servers, it's also an integral part of another open source library that I'm building. [0]

[0] http://github.com/doda/imagy



