Thinking Psycopg3 (varrazzo.com)
268 points by fanf2 on March 11, 2020 | 49 comments



At Mesosphere/D2iQ we used psycopg2-backed SQLAlchemy to interact with CockroachDB.

Daniele, one point that I'd like to get your opinion on, and that's maybe worth considering for API development around psycopg3: we found it difficult to implement timeout control for a transaction context. Consider application code waiting for a transaction to complete. The calling thread is in a blocking recv() system call, waiting for the DB to return some bytes over TCP. My claim is that it should be easy to error out from here after a given amount of time if the database (for whichever reason) does not respond in a timely fashion, or not at all (a scenario we sometimes ran into with early versions of CockroachDB). Ideally, of course, the database always responds promptly or has its internal timeout mechanisms working well. But for building robust systems, I believe it would be quite advantageous for the database client API to expose TCP recv() timeout control. I think when we looked at the details back then we found that it's libpq itself which didn't quite expose the socket configuration aspects we needed, but it's been a while.

On the topic of doing "async I/O" against a database, I would love to share Mike Bayer's article "Asynchronous Python and Databases" from 2015: https://techspot.zzzeek.org/2015/02/15/asynchronous-python-a... -- I think it's still highly relevant (not just to the Python ecosystem) and I think it's pure gold. Thanks, Mike!


Hi! Much of the difficulty comes from using blocking libpq calls in psycopg2. I'm thinking of avoiding them altogether in psycopg3 and only using async calls, with Python in charge of blocking.

Note that you can obtain a similar result in psycopg2 by going into green mode and using select as the wait callback (see https://www.psycopg.org/docs/extras.html#psycopg2.extras.wai...). This trick makes it possible, for instance, to stop long-running queries with Ctrl-C.
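
A minimal sketch of that setup (the connection string and query are placeholders):

    import psycopg2
    import psycopg2.extensions
    import psycopg2.extras

    # Route all libpq waits through select(), so Python (not libpq) does the
    # blocking; Ctrl-C now interrupts the wait instead of hanging in C code.
    psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select)

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()
    cur.execute("SELECT pg_sleep(60)")  # can now be interrupted from the keyboard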

You can also register a timeout in the server to terminate a query that runs too long. I guess they are two complementary approaches. In the first case you don't know the state of the connection anymore: maybe it should be cancelled or discarded; we should work out what to do with it. A server timeout is easier to recover from: just rollback and off you go again.
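
The server-side variant, sketched with placeholder DSN, timeout value, and query:

    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    # Ask the server to cancel any statement in this session running longer than 5s.
    cur.execute("SET statement_timeout = '5s'")
    try:
        cur.execute("SELECT pg_sleep(60)")
    except psycopg2.extensions.QueryCanceledError:
        conn.rollback()  # roll back and the connection is usable again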


:-)


I'm again reminded how much we depend on, and how much we owe to, these unnamed heroes of the modern age who maintain and develop the software we take for granted in our everyday lives. Kudos to the author.

And it's interesting to find out, after years of using it, that psycopg2 actually doesn't use prepared statements underneath.


The one thing I dislike about the Python DBAPI spec is the "cursor". Maybe the idea was good when the creators of the spec came up with it, but no other database API exposes a cursor so prominently to its users. When someone new to databases first learns about the Python DBAPI, they will probably think that cursors are something really important, but really, they aren't. Developers just want to fire queries at the database; the cursor is just something between the db connection and the query result. Maybe it is needed to keep the state of query objects, but it should have some other name.


I don't understand this comment. A cursor is just an iterator over the results. Would you prefer "iterator" as a name? A concept like it exists in basically every database API (for row-oriented databases) so that you can start processing results as soon as they start coming in rather than having to load all results into application memory first.


I think OP has a point. He's not complaining about the functionality (of course a DB client needs to iterate through results) but that there's no reason a user needs to understand cursors for running simple DB queries. I should be able to do:

    results = conn.execute("select * from mytable").fetchall()
or

    query = conn.execute("select * from mytable")
    for row in query:
        print(row.id)
You can do the same with cursors, but it's extra mental overhead and a somewhat awkward API.

Plus the name conflicts with Postgres' CURSOR, which confused me when I first started. So, yes, I would prefer "iterator" :)


Yes, that's what I meant: it doesn't have to be exposed to the user, I just want to execute a query and get the results. If needed, then maybe I should be able to create a cursor to go through the results. It's just that other database access APIs don't expose a cursor or force developers to use one, so it's mildly annoying in Python.


You are right and I had forgotten about that. I mostly use SQLite from Python, which actually does allow .execute() on the connection as a non-standard extension to DBAPI, so it supports exactly the two snippets you posted. It is a bit awkward that you have to create your cursor before you execute the query; I don't quite see the reason for that.
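
For reference, the sqlite3 shortcut looks like this (in-memory database and table just for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO mytable VALUES (1, 'a'), (2, 'b')")

    # .execute() on the connection creates a cursor behind the scenes and
    # returns it, so it never has to appear in user code.
    for row in conn.execute("SELECT id, name FROM mytable"):
        print(row)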


This seems non-obvious to me, purely because one has to create the cursor before ever executing any queries [1]. It therefore feels like a "sub-connection", or something the database is aware of at any rate.

It would be much clearer to me that the cursor is an iterator if it was actually returned as the result of executing a query.

[1] https://www.psycopg.org/docs/usage.html


> one has to create the cursor before ever executing any queries

You are right, that is a bit weird, although only a very slight annoyance IMO. As I said in reply to the sibling comment, I had forgotten this as I mostly use SQLite from Python and it doesn't actually require this: you can call .execute() on the connection object and it returns the cursor, which does seem a lot cleaner.


Yes, but with other database access APIs I can just get a connection and execute a statement on it. Other APIs don't have an additional abstraction between the connection and the execute() method.

It's especially useless when executing anything besides SELECT. An UPDATE or INSERT shouldn't need a cursor.


Though, except with very large result sets (100k+ rows), there is little performance gain from having a cursor.


That's objectively not true though, so I'm not sure why you would say something like that. Ignorance?

A hugely significant portion of database workloads involve iterating over the rows returned from a query. Exposing a cursor/iterator/whatever in the spec is the only sensible way of handling this - your code works regardless of if the results are fetched in a single bulk operation or streamed to the client, or if you're returning 1 row or 1,000,000.

It's the same reason you wouldn't do `for i in list(range(100_000))`, you'd just do `for i in range(100_000)` - iterating over a generator (which is what streaming results from the database really is) is far more efficient than creating a huge structure upfront and _then_ iterating over it.

Also 100k rows isn't very large.
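
To make the contrast concrete, here is a rough sketch with psycopg2, streaming through a named (server-side) cursor versus fetching everything at once; the DSN and table are placeholders:

    import psycopg2

    conn = psycopg2.connect("dbname=test")

    # Client-side cursor: the whole result set is transferred and materialized.
    cur = conn.cursor()
    cur.execute("SELECT * FROM big_table")
    rows = cur.fetchall()               # everything in memory at once

    # Named (server-side) cursor: rows are streamed in batches of itersize.
    stream = conn.cursor(name="stream_big_table")
    stream.itersize = 2000
    stream.execute("SELECT * FROM big_table")
    for row in stream:                  # memory stays bounded regardless of result size
        print(row)                      # stand-in for real per-row processing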


One could argue that pumping huge amounts of data with psycopg2 is slow in any case due to the blocking nature of the network calls.

I recently copied some big tables (100M+ rows) into a different system; when I switched the simple script to asyncpg I gained 5x over psycopg2. Experimenting with the number of rows cursor.fetchmany returns changed absolutely nothing.

I wonder how much gain there is from using a cursor; for CRUD applications you most likely need the full result set before being able to do anything with it. Having a cursor could be made optional for when you actually want to iterate over the result set.

The cursor interface could also be hidden: I just want to iterate over the result set - I don't care what happens in the background.
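
For bulk table-to-table moves like the one described above, psycopg2's COPY support is usually a better fit than fetching rows one by one; a rough sketch (file path and table name are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=source")
    cur = conn.cursor()

    # Stream the whole table out as CSV in a single COPY operation; the data
    # transfer happens without building a Python object for every row.
    with open("/tmp/big_table.csv", "w") as f:
        cur.copy_expert("COPY big_table TO STDOUT WITH CSV", f)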


> when I switched the simple script to asyncpg I gained 5x over psycopg2.

This should really be a comparison between asyncpg and psycopg2+gevent+psycogreen to be technically aligned, as otherwise it's a comparison between two scripts with non-blocking and blocking IO.
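
A sketch of that aligned setup (assuming gevent and psycogreen are installed; the DSN is a placeholder):

    from gevent import monkey
    monkey.patch_all()

    from psycogreen.gevent import patch_psycopg
    patch_psycopg()  # register gevent wait callbacks: libpq waits yield to the hub

    import psycopg2
    conn = psycopg2.connect("dbname=test")  # this connection is now cooperative under gevent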


Here's a benchmark [1] against aiopg, which is psycopg2+asyncio.

[1] http://magic.io/blog/asyncpg-1m-rows-from-postgres-to-python...


Unless aiopg patches psycopg2 around its C extension with available callback hooks (see [1] and [2] - psycogreen uses it to register gevent callbacks), this comparison is still between blocking and non-blocking IO scripts. Quick GitHub search for `set_wait_callback` didn't return any result in the aiopg repository - https://github.com/aio-libs/aiopg/search?q=set_wait_callback...

[1] https://www.psycopg.org/docs/extensions.html#coroutines-supp...

[2] https://www.psycopg.org/docs/advanced.html#green-support


You can't scale that. Adding clients will have a lot of overhead (e.g. 100K clients to 1 db server). The only way is autocommit with a proxy (pgbouncer etc.).


The point is to compare the blocking vs non-blocking though.


I was addressing OP's impression that psycopg2 was slow "in any case" and that it cannot be used in non-blocking fashion.

> psycopg2 is slow in any case due to the blocking nature of the network calls.


That vaguely touches (warning, going off topic here, my apologies) my main problem with Python. dict.keys() returns an iterator; this is all fine and good, very efficient I suppose. However, the only reason I ever use dict.keys() is so I can sort the keys, and you can't sort an iterator, so every single dict.keys() in my code base is in the form list(dict.keys()). And it is not even like you need the iterator: a dict works perfectly fine as a key iterator on its own. /rant


> sorted(my_dict.keys())

Or just:

> sorted(my_dict)

Sorting keys like that is also weird. Keys are insertion-ordered; utilize that property where you can and avoid needlessly re-sorting.


Glorious. Thank you, kind soul, for listening to my slightly raving rant.

Now I feel several sorts of ashamed at missing what is kind of a core function.

salutes


Fwiw, make sure you're using a newish Python version. Dict keys weren't always insertion-ordered by default.


3.6+ (an implementation detail in CPython 3.6, guaranteed by the language spec from 3.7).


> the only reason I ever use dict.keys() is so I can sort the keys

Hum... I use dict.keys() every time I use dictionaries to describe some unspecified data set, which happens way more frequently than any use case that requires sorting the keys. It's even encouraged by the language with the kwargs construct, and it's mainstream usage within libraries.

I do agree that the result of dict.keys() should have a sort method. There is no reason not to, but it doesn't follow from there that making it an iterator is only a minor gain.

That's the same situation as the GP asking for the removal of a main feature of the library just because he doesn't work with the kind of software that uses it. Congratulations, remove cursors from the library and suddenly Python is a lot less useful for data science.


It doesn’t and shouldn’t have a sort method because Python prefers functions that act on objects rather than methods attached to objects.

That’s why you use “len(obj)” rather than “obj.len()”, and why you use “sorted(dict_keys)” rather than adding a “sort” method to random iterables.


Oh, ok. I was comparing it with 'list.sort()', but yeah, that one mutates the list and returns None (just checked it), so 'sorted' is the expected way to sort a dictionary's keys, and my post is just wrong.


'list.sort()' is one of those leftovers from "Python before it was really Python" and isn't particularly worth changing because it's useful sometimes.

In general Python prefers builtins that operate on protocols. You use "len(x)" because it's consistent; otherwise you might call "x.len()", "x.length", "x.size()" or any other number of possibilities. And if an object doesn't define it? You're out of luck. With the protocol, any object that supports it (__len__) can be passed to len().

The same applies to "sorted()". list.sort() is useful in some specific circumstances where "sorted()" is not adequate, but in general the answer to "how do I sort something" is to use sorted.
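
A tiny illustration of the protocol point (the class and names are made up):

    class RingBuffer:
        def __init__(self, items):
            self._items = list(items)

        def __len__(self):            # opt in to the protocol...
            return len(self._items)

        def __iter__(self):
            return iter(self._items)

    buf = RingBuffer([3, 1, 2])
    print(len(buf))      # ...and the len() builtin works on it
    print(sorted(buf))   # sorted() likewise works on any iterable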


Wait... are you saying 100,000 records is a lot? If so, I disagree. Any serious enterprise workload is millions, usually tens or hundreds of millions, and oftentimes billions or tens of billions or more over the lifetime of any given production UNIX process. Hundreds of thousands is dev testing range. I've dealt with up to single-digit billions, and I can say having a cursor is absolutely a requirement when working with production data.


Yes, I'm claiming that retrieving 100,000 records in a single query is a lot.

If thousands of records weren't a lot, HN wouldn't have to manually break up comment threads anytime a U.S. president is elected.


For the database. If your application is retrieving millions of rows, you might want to reconsider what using a database with a complex query protocol is giving you and whether you are using it correctly.


If my application is retrieving millions of rows, it may be because I need to pass a window function over millions of rows. Or it may be that I need to copy millions of rows from one place to another. Or I may need to compare them with other millions of rows from another source somewhere else. Or I may need to transform those millions of rows on my application because I can scale it better than the database...

Just because you didn't think of some valid usage for a feature, it doesn't mean that there isn't any.


> it may be because I need to pass a window function over millions of rows

Every major RDBMS supports window functions (even MySQL, though recent). Why would you bring remote data into a local process to do something the remote process can do with its own local data vastly more efficiently?


Probably related to why the database has millions and millions of rows in the first place. Maybe someone read about denormalization in "Super Webscale SQL!" and blew up a trivial database. Or they're hoovering up vast quantities of garbage data and running it through a BI tract to produce bullshit.

I worked at a company that bilked the government for vast sums of money by doing all this and charging them for the hardware. The guvvies knew what was going on, but their little empires were more important if they got a bigger budget, so they went along with it.


Oh, God forbid a company actually has a few million sales it must account for, or a government body actually wants to control and verify what it does. No, it's all because of stupid webscale! developers, because nobody ever does anything that isn't a web or finance startup! Ever!


I don't really understand your point. You and the person you replied to are both agreeing that cursors are useful for large datasets. Whatever defines "large" isn't that interesting and is dependent on context: got a resource-constrained DB? Or very wide tables? Constraints on the client side for how much data it can consume and process? Then, maybe your "large" is a lot different to mine.


> Whatever defines "large" isn't that interesting

Some people get excited about huge datasets; I don't get it myself, but "big data" is a successful buzzword because it pushes people's buttons.


They are arguing it’s useful for smaller datasets as well.


If you're exporting data en masse from the database then this is a concern, but for the typical CRUD application use case, it's never an issue.

If there are enough results that loading them is causing performance problems, there are enough results that you need to be paging them in your UI.


If we are doing wishlists: I would love to see better support for fast creation of Arrow/Dataframes/numpy objects from query results (possibly through the COPY support, maybe through some kind of plugin or compile time optional support). See for example Turbodbc (https://turbodbc.readthedocs.io/en/latest/pages/advanced_usa...).

At the moment we are forced to have query data converted from the binary stream into (boxed) Python objects, then unbox them back into arrays - this can add a lot of overhead. I did some very rough experiments in Cython and got a 3x speedup for bulk loads of queries into arrays.
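
In the absence of such a plugin, one workaround today is to go through COPY and let pandas do the bulk parsing; a rough sketch (DSN, table, and columns are placeholders), though it is still not a true binary path:

    import io

    import pandas as pd
    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    # Dump the query result as CSV via COPY and parse it in bulk with pandas,
    # skipping the per-value boxing into Python objects inside the driver loop.
    buf = io.StringIO()
    cur.copy_expert(
        "COPY (SELECT a, b FROM measurements) TO STDOUT WITH CSV HEADER", buf
    )
    buf.seek(0)
    df = pd.read_csv(buf)
    arr = df.to_numpy()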


Off topic:

The comments system on the site is awesome, can't believe I haven't seen it before!

https://utteranc.es/


Did a triple-take when I scrolled to the bottom and saw.. GitHub. Super cool.


The first part, Query parameters adaptation, is perhaps problematic. Postgres allows only 32K parameters (they use a signed short? really?), which is relatively few. This limits the number of rows you can INSERT in a batch. Expanding client-side neatly side-steps this.

Also: with copy, is there a way to ignore, or replace, duplicates?


For batch insert it makes more sense to parse an insert statement for one row and then send bind-execute messages for each row. That would be much more efficient than a huge insert statement with interpolated values.

COPY doesn't do ON CONFLICT handling yet, although there doesn't seem to be a major reason why it couldn't. A workaround is to copy into a temporary table and do an insert-select from there.
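
A sketch of that workaround from psycopg2 (table, file, and conflict key are placeholders):

    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    # COPY into a temporary staging table, then merge with ON CONFLICT.
    cur.execute(
        "CREATE TEMP TABLE staging (LIKE items INCLUDING DEFAULTS) ON COMMIT DROP"
    )
    with open("items.csv") as f:
        cur.copy_expert("COPY staging FROM STDIN WITH CSV", f)
    cur.execute(
        "INSERT INTO items SELECT * FROM staging ON CONFLICT (id) DO NOTHING"
    )
    conn.commit()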


What do you mean by bind-execute? I'm guessing you mean sending multiple statements in a string where the first PREPAREs and the rest EXECUTE? I haven't benchmarked anything like that.

I've done lots of exploring and profiling and comparing of the best ways to do big upserts into a variety of DBs. For Postgres, I've messed around with CTEs but found INSERT ON CONFLICT UPDATE to be fastest. And inserting large numbers of rows is a clear win over individual statements round-tripping.


The PostgreSQL wire protocol has a concept called the extended query protocol, which splits query execution into three steps: parse, bind, and execute. https://www.postgresql.org/docs/devel/protocol-flow.html#PRO... If you send one parse message and multiple bind and execute messages, you can insert multiple rows while parsing the statement only once. Currently psycopg2 doesn't support that, which is the first thing the blog post talks about.

If working in the confines of psycopg2, you could do something like insert .. on conflict .. select * from unnest(%s) rows(a int, b text, ...); and pass in an array of row types. Parsing a huge array is cheaper than parsing a huge SQL statement.
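
A variation of that idea which works today from psycopg2 without defining a composite type is to pass one array per column (table, columns, and conflict key are made up; assumes a unique constraint on "a"):

    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    rows = [(1, "foo"), (2, "bar"), (3, "baz")]
    a_vals = [a for a, _ in rows]
    b_vals = [b for _, b in rows]

    # One short statement to parse, two array parameters to bind: the row count
    # no longer inflates the SQL text the server has to parse.
    cur.execute("""
        INSERT INTO items (a, b)
        SELECT * FROM unnest(%s::int[], %s::text[])
        ON CONFLICT (a) DO UPDATE SET b = EXCLUDED.b
    """, (a_vals, b_vals))
    conn.commit()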


And relevant to the roundtrip issue at hand here: You can send bind/execute in a pipelined manner.

I'd benchmarked batch upserting at some point, and at that time, for large amounts of data, the fastest approach was somewhat unintuitive: a separate view with an INSTEAD OF trigger doing the upserting. That allows the use of COPY-based streaming (less traffic than doing separate bind/exec, less dispatch overhead), and still allows the use of upsert. Not a great solution, but ...
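
Roughly, under made-up table and column names, that setup looks like this (EXECUTE FUNCTION is PostgreSQL 11+ syntax; use EXECUTE PROCEDURE on older versions):

    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    # A view over the real table, with an INSTEAD OF INSERT trigger that upserts.
    cur.execute("""
        CREATE VIEW items_upsert AS SELECT * FROM items;

        CREATE FUNCTION items_upsert_fn() RETURNS trigger AS $$
        BEGIN
            INSERT INTO items (a, b) VALUES (NEW.a, NEW.b)
            ON CONFLICT (a) DO UPDATE SET b = EXCLUDED.b;
            RETURN NEW;
        END
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER items_upsert_trg
            INSTEAD OF INSERT ON items_upsert
            FOR EACH ROW EXECUTE FUNCTION items_upsert_fn();
    """)

    # COPY streams into the view; each row fires the trigger and gets upserted.
    with open("items.csv") as f:
        cur.copy_expert("COPY items_upsert FROM STDIN WITH CSV", f)
    conn.commit()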



