Hacker News new | past | comments | ask | show | jobs | submit login
PGX: Write Postgres extensions in Rust instead of C (github.com/zombodb)
305 points by adamnemecek on July 13, 2020 | hide | past | favorite | 59 comments



Guys, this is really really cool. Been looking to learn Rust and it just so happens I need to write a Postgres extension in the near future so two birds and one stone.

Btw - Also, I've seen ZomboDB and while I've not used it I must say it looks EXCELLENT. So congrats on that too.


blush


Wow it does look great. Going to try it out!


This is really cool! The only other similar thing I know of is WASMer's extension (which only supports Rust atm I believe):

https://github.com/wasmerio/postgres-ext-wasm

One question though: If you built and compiled an extension with this, how would you deploy it on top of a Postgres Docker image?

I see instructions for installing and testing locally, via the cargo command, but couldn't find anything on this.

Does it spit out a static asset that you put in the project folder, and copy over during the build pipeline? Something like:

  FROM postgres:12-alpine
  COPY ./my-rust-extension /some/container/directory
  RUN install-command-here


That's a great question, and one probably best answered over on pgx's GitHub page.

But! I plan on adding a command to "cargo-pgx" to package up the extension for you into a directory structure (or tarball, maybe).

The idea is that you'd just run: cargo pgx package

And it would just build a --release library, and create the proper directory structure (based on what pg_config) says for the shared library and the associated .sql.

I actually need this ASAP for supporting ZomboDB proper, so... Coming Soon!


Awesome, look forward to it =D



Interesting, it makes coding a postgres extension seem delightfully simple. I have been looking for something to try using rust on! Thanks for posting this.


Thank you!

If you decide to jump into it, definitely let us know any pain points you have.

It takes a bit of time to work out the kinks in a thing like this.


Just out of curiosity, what sort of requirements compel a person to write a postgres extension? I know of things like PostGIS but I'm not well versed enough in this world to know the totality of the problem.


Author of `pgx` here.

We developed pgx so that we could rewrite "ZomboDB" (https://github.com/zombodb/zombodb) in Rust. ZDB is a custom "Index Access Method" for Postgres. Think btree index, but stored in Elasticsearch.

So that's definitely a thing.

Other ideas might be custom data analytics/transformation/processing engines that you'd prefer to run in the database instead of externally.

Custom data types to represent, for example, street addresses or genetic information.

The only limit is yourself! ;)


> The only limit is yourself! ;)

For folks who don't get the joke, please visit zombo.com before Flash is finally dead :).


Alternatively, there's https://html5zombo.com


For folks who still didn't get it, like me: https://en.wikipedia.org/wiki/Zombo.com


Have you evaluated Tantivy/Toshi as possible alternative to ES? I only ask because of your native use of Rust. PGX seems like a more robust offering than pg-extend-rs. This is a great addition to the community!

Also, v.0.0.3 doesn't create much confidence about stability! I'll keep an eye on the project and wish the team the best.


re: v0.0.3 -- sure. I just published it last night.

We've been working on it since November last year, and have now fully ported ZomboDB to it.

It's proving out nicely, but keep in mind that Postgres' internals are infinitely complex. Getting safe wrappers around all its "things" is going to take a very very long time.

I'd rather get something that seems very stable now, and continue to iterate on it over time.


I think the comment was less about your code but rather about your v0.0.3 version number. I guess a v0.99 would imply more stability.

That's why lots of companies have switched to date based versioning, similar to Ubuntu and others.


I've looked into them. It seems they're designed to work within a single process, and it's not quite clear to me how sharing the underlying data files across postgres backends (even with proper Postgres locking) would work.

That's not say they aren't good frameworks. I'm sure they are. It just seems like they're designed for different use cases.

That said, I have other ideas on this front that I can't talk about today. ;)


> The only limit is yourself! ;)

You have no idea how significant of a limit that is... I'm kidding.

Thanks for the response! And zombodb looks very cool. I might bring it up since we have a couple of heavy analytics endpoints right now.


The most common reason to write a PG extension is to add native support for a structure/data type that PG doesn't already have native support for.

For example, despite Postgres having a (base-10000) numeric/decimal type, there are certain things that can be done much more cheaply in base-256 (e.g. ser/des from bit-packed and/or hex strings), and thus pgmp (https://github.com/dvarrazzo/pgmp) exists as a Postgres libgmp binding.

There are also "data service" type extensions—exposing some other data "into" Postgres so it can participate in a query. PG's own pg_stat_statements extension is an example. Postgres's support for Foreign Data Wrappers has mostly obviated writing these as extensions, though; it's much easier (and safer for your RDBMS-side data!) to just write these as standalone daemons that Postgres mounts as FDW servers.


A foreign data wrapper is an extension.

To install it, you run "create extension my_fdw"


Yes, a given Foreign Data Wrapper—that is, the module of functionality that describes the connection strategy for some remote resource, like an ORM "adapter" module—is a Postgres extension. To get PG to speak a new protocol, you need to write a new FDW extension.

But a Foreign Data Wrapper server is just an instance of that class. You don't need to write a Postgres extension, just to use an existing Foreign Data Wrapper.

And, crucially, many of these FDWs are written to be backed by standard (or at least well-documented) network protocols.

Which means that, a large percentage of the time, people thinking "I'll take this third-party thing and integrate its data into Postgres" these days, don't write a Foreign Data Wrapper, but rather choose one of the existing network protocols with an existing Foreign Data Wrapper that supports it, and make their third-party thing speak that protocol, so that they (and anyone else who likes) can mount it as a server using the existing FDW.

Specifically, many systems that want to be "Postgres-compatible" these days (CockroachDB, Materialize, etc.), speak the Postgres wire protocol as if they were Postgres itself. This is not (mainly) so that real clients can talk to them through a libpq binding (as they tend to have their own, more idiomatic client ABI); but rather so that actual Postgres instances can mount them using `postgres_fdw`, the FDW that assumes the other end is another Postgres server and uses libpq to talk to it.


This is a sort of broad question, but if I were to try and answer I would say that if there is anything you want to be handled by the database you would gravitate towards an extension.

PostGIS is a great example since one of the most well known it additions it makes is a datatype, but that's not the only thing it adds. A datatype on it's own wouldn't be that useful without Spatial Indexing (https://postgis.net/workshops/postgis-intro/indexing.html) or spatial utilities to make spatial joins more expressive (https://postgis.net/workshops/postgis-intro/joins_exercises....).

Rich data types aren't the only candidates for extensions, automated partition management (https://github.com/pgpartman/pg_partman), data sharding (https://www.citusdata.com/), or even database cron scheduling (https://github.com/citusdata/pg_cron) are also good examples of things that are well suited to be extensions.

To be honest, I have a hard time imagining alternative paths where some of this functionality _isn't_ an extension. You might imagine an external daemon for things like pg_cron and pg_partman, but if PostGIS wasn't an extension you would probably be using a different database/tool (or a fork) if you had geospatial requirements. It's worth noting that Citus was a fork of PostgreSQL before it was refactored to be an extension.

I know this isn't a direct answer to your question, but hopefully you find it somewhat useful.

Edit: I totally forgot to add 1 more thing about extensions. When functionality is packaged as an extension there is a much higher likelihood you can mix and match them. For example, I currently run a PostgreSQL cluster using Citus, with pg_partman for automatic time-based partitioning, pg_cron regularly scheduling partition creation, and PostGIS for geospatial datatypes. You could extrapolate the various ways you might use this kind of setup...one that jumps to my mind is a scalable GPS history that could be queried by time period as well as by geographic region.


Yeah, "rich data types" is a good point.

pgx provides a #[derive(PostgresType)] macro that lets you represent any normal Rust struct (that serde_cbor can (de)serialize) as a Postgres type. You write no code.

It even generates all the boring SQL boilerplate for you.

I plan on putting together an example about this and doing a twitch stream this week to discuss in detail.


I think “extension” covers any custom code (even functions that need to import a c library), not just major db features. I have had to do it recently [1] to customize partitioning function to be the way I like. It would be great to do it in a friendlier language

[1] https://harikb.github.io/posts/postgres-11-hash-partitioning...


> It would be great to do it in a friendlier language

Whether Rust is "friendlier" than C depends a lot on your definition of what that means.


You're not wrong, but barring bugs in `pgx` (of which I'm sure there are plenty right now), at least Rust gives you compile-time guarantees around not crashing.

And when running inside your database process, that's a huge win.


As a follow-up. Here's what Postgres' "generate_series" function looks like with pgx: https://github.com/zombodb/pgx/blob/0803bd3aa2f8ef2bd80d2e03...

And here's what you'd have to do to implement it in C: https://github.com/postgres/postgres/blob/dad75eb4a8d5835ecc...


One more follow-up...

The top one is pgx, the bottom is Postgres. So there's a little room for improvement here with pgx, but that's okay for a v0.0.3 release.

    test=# select count(*) from srf.generate_series(1, 10000000);
    Time: 1552.115 ms (00:01.552)

    test=# select count(*) from generate_series(1, 10000000);
    Time: 1406.357 ms (00:01.406)


The largest part of the time executing the above query isn't inside the function, so this isn't that a material comparison. The reason for that is that SRFs in FROM to be materialized into a tuplestore, which isn't free:

    postgres[607045][1]=# SELECT count(*) FROM generate_series(1, 10000000);
    ┌──────────┐
    │  count   │
    ├──────────┤
    │ 10000000 │
    └──────────┘
    (1 row)

    Time: 1249.224 ms (00:01.249)

    postgres[607045][1]=# SELECT count(*) FROM (SELECT generate_series(1, 10000000)) s;
    ┌──────────┐
    │  count   │
    ├──────────┤
    │ 10000000 │
    └──────────┘
    (1 row)

    Time: 460.206 ms
For mostly historical reasons SRFs in the target list can use the non-materializing SRF query protocol, but SRFs in the FROM list can't.

Any chance you could show the timings for the pgx version of the second query?


Sure! Top is Postgres, bottom is pgx, after running each 5 times...

    test=# SELECT count(*) FROM (SELECT generate_series(1, 10000000)) s;
      count   
    ----------
     10000000
    (1 row)

    Time: 399.630 ms
    test=# SELECT count(*) FROM (SELECT srf.generate_series(1, 10000000)) s;
      count   
    ----------
     10000000
    (1 row)

    Time: 478.194 ms

Thanks for the reply. I'm not surprised there's room for optimization in pgx, especially in the Iterator-->SRF path.

edit: PS, thanks for the idea of putting what I assume is the backend pid in psql's prompt. I need to go figure out how to do that right now!

edit edit: hmm, I guess that's not the PID.


Thanks for the update.

> edit: PS, thanks for the idea of putting what I assume is the backend pid in psql's prompt. I need to go figure out how to do that right now!

Here's my ~/.psqlrc:

    andres@awork3:~/src/postgresql$ cat ~/.psqlrc
    \set QUIET 1

    \pset pager always
    \set VERBOSITY verbose
    \pset null '(null)'
    \set COMP_KEYWORD_CASE upper
    \pset linestyle unicode
    \pset border 2
    \set PROMPT1 '%/[%p][%l]%x%R%# '
    \set PROMPT2 '%/[%p][%l]%x%R%# '
    \set PROMPT3 'c:%/[%p][%l]%x%R%# '
    \set HISTCONTROL ignoredups
    \set HISTSIZE 100000

    \timing on
    \set QUIET 0


In the function definition of `generate_series`, what is going on with the `default!` macro?

That isn't some sort of way of actually getting real default param values is it? I thought that wasn't possible in Rust.


That is certainly nicer.


Does Rust give you compile-time guarantees around not crashing? I thought any function could panic without so much as a peep from the compiler.


Fair. With pgx, however, Rust "panic!"s are translated into standard Postgres "ERROR"s, such that instead of crashing, only the current transaction aborts.

So while you're pretty much correct in general, pgx handles it in the way a PG extension author would expect.


One might say that Rust is friendlier downstream, to the ops people running the compiled binary :)


Really? I struggle to imagine a case for claiming that C is friendlier.


The ecosystem of preexisting C extensions to generalize from, for one


It just depends on the type of problem you'd like to solve.

I've got a whole bunch of ETL helper functions and patterns i've written for my business that could be packaged up as an extension if I were so inclined for example. Those are all SQL or PL/PGSQL functions, so no need for native code there though...

For PostGIS, a whole lot of that code is just SQL or PL/PGSQL functions...but there are quite a few C functions as well, to take care of the heavy lifting which is less efficient to implement in SQL or PL/PGSQL.


A few years ago I wrote an experimental Logical Decoding plugin that translates database transactions to DNS UPDATE messages. I was super impressed by how nice it was coding to PostgreSQL's plugin APIs in C. I'm currently not sure if a C plugin is the best approach for this particular task: it might be better to use one of the more popular Logical Decoding to JSON plugins and munge the JSON into DNS UPDATEs in something more high-level than C.


We have to write a Json logger versus the default log format


`pgx` would let you do that.

You could hook Postgres "emit_log_hook" and probably just use serde to xform the provided "ErrorData" pointer right to json and ship it off where ever you want.

(edit: typeos)


This has been on my back burner for about a year now...glad someone has gotten to it. It already looks full featured. I've been thinking that with the right library subset (just like libcore or no_std), it would be a very safe possibility for a trusted language...accessible to users that are not superusers. That would be extremely valuable to RDS users that would love to use something fast, as opposed to the standard trusted languages like pl/python.


I've noodled the idea of a pl/pg_rust, but the whole "Cargo.toml" and downloading dependencies from the internet seems bad.

I see what you're saying about no_std tho. Could be doable.


Recent similar thing in the Ada world: https://github.com/AdaCore/gnatcoll-db/tree/master/pgxs


Also of interest: rpgffi - https://github.com/posix4e/rpgffi

"R(Rust) PG(Postgresql) FFI (Foreign Function Interface)"


That is interesting.

pgx does a lot of that too, but for only pg10/11/12. And then it post-processes the bindings and builds a "common.rs" for symbols that are identical across versions.

That way you can target all three versions at once, and then you can use Rust's #[cfg(feature=)] directive to opt into support for specific versions.

pgx has also done a little bit of work to mask differences between the three versions so they appear as part of "common.rs" anyways.

I wanted a framework that let me know, at development time, if I was trying to do a thing that isn't supported (or the same) across various PG versions.


At first I thought it is a Postgresql driver for Go that has the same name

https://github.com/jackc/pgx


Yes, it's always mess like sqlx.


It would be awesome if someone with some Rust knowledge could make an extension to create a compressed string type like Antirez Smaz [1].

I know Postgres compresses TEXT types by default, but only when the data exceeds TOAST_TUPLE_THRESHOLD (default 2 kB)

Some examples on the project page [1] show reduction of size between 10% and more than 50% for some small strings (less than 100 bytes).

[1] https://github.com/antirez/smaz


I think the existing C API would be a better fit, and the work would likely be a fairly simple modification to the citext extension found in contrib section of the PostgreSQL source if anyone is game.


I'd be inclined to agree.

That said, it'd be easy to prototype with pgx, assuming that compression library exists in the Rust ecosystem too?


I guess this compression library should do the trick [1], but I haven't tested it yet.

[1] https://docs.rs/compress/0.1.2/compress/entropy/ari/index.ht...


Is there any way to make custom types using pgx?


Yes there is. It's not documented/example'd yet tho.

There's a derive macro called #[derive(PostgresType)]. Combine that with serde's Serialize, Deserialize, and you're gtg.

I'm going to be working on more docs and twitch streams over this week.


Typo in the title: oc->of


Fixed. Thanks!


[flagged]


JavaScript is still pretty strong these days.




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: