Hacker News new | past | comments | ask | show | jobs | submit login
Apache Arrow Flight SQL: Accelerating Database Access (apache.org)
150 points by ionychal on Feb 16, 2022 | hide | past | favorite | 17 comments



This is great and makes a ton of sense as a refinement to their work on Flight previously. InfluxDB IOx already supports Flight, but we'll almost certainly be updating to support Flight SQL before we launch. We've been thinking about adding Postgres wire protocol support, but this would be even better if enough downstream clients and projects end up adding SQL Flight support.

Great work to the Arrow team!


Oh, and for anyone interested in pitching in on the Rust implementation, there's an issue logged here along with some discussion: https://github.com/apache/arrow-rs/issues/1323


tldr; it's a new wire protocol that columnar databases can opt into that is better optimized for columnar data than ODBC/JDBC.

Key quotes:

    While it can be directly used for database access, it is not a direct replacement for JDBC/ODBC. Instead, Flight SQL serves as a concrete wire protocol/driver implementation that can support a JDBC/ODBC driver and reduces implementation burden on databases.

    Flight SQL means database servers can implement a standard interface that is designed around Apache Arrow and columnar data from the start. Just like how Arrow provides a standard in-memory format, Flight SQL saves developers from having to design and implement an entirely new wire protocol. As mentioned, Flight already implements features like encryption on the wire and authentication of requests, which databases do not need to re-implement.


This is true, but to clarify and expand a bit, you add support for FlightSQL to some datasource by adding Protobuf RPC handlers to it. This doesn't necessarily have to run in-process, the FlightSQL RPC service might run side-by-side.

The datasource also doesn't necessarily have to be a database. It just needs to implement the RPC methods. So you could implement your own data-access layer over any datasource (flat file, an external API, a regular DBMS, etc) by providing an implementation for "get all schemas", "get all tables", "execute a query", etc.

Some useful quotes:

  > The idea is that the driver is a “one-size-fits-all” driver — a user or tool vendor only needs to supply a generic driver that can connect to an infinite number of databases. This is even future-proof — if a new database comes out, it can work with existing tools as long as an Arrow Flight SQL endpoint is provided. In fact, by adding an Arrow Flight SQL endpoint they would automatically enable JDBC connectivity too.

  > Not only will Arrow Flight SQL reduce the technical burden on applications and users, but it leverages Arrow, which means it will provide better performance. ... having a single reference JDBC driver allows any data source that adds an Arrow Flight SQL endpoint to get JDBC “for free” as an onramp. So the selling point is add an Arrow Flight SQL endpoint to your data source and automatically get JDBC connectivity.

I have been experimenting in my free time with building a platform that autogenerates GraphQL CRUD API's on top of arbitrary datasources.

It lets you do federated/distributed queries and cross-databsource joins.

So you can connect it to a Postgres DB, a MySQL DB, a MongoDB, and a CSV file and then run a query that joins and filters data across them in a single operation.

I am using Apache Calcite for this, but am interested in potentially offloading the query plan execution using FlightSQL/Substrait[0] for better performance, since I am targeting OLTP workloads and it's latency-sensitive.

https://github.com/GavinRay97/GraphQLCalcite

[0] https://substrait.io/


I'm not tuned into Arrow all that much. I've read some of the about and stuff, but the code examples (to my eye) look really complex and complicated. [1]

Could someone point me to a more glossy "arrow flight sql for dummies" examples? What I'm gleaning from this (or am I wrong?) is you could use a JDBC driver + arrow jdbc client and write... SQL? Or is it something a lot different?

Is this the sort of thing where you could just add a plugin to postgres and be arrowified or something?

[1] https://github.com/apache/arrow/blob/release-7.0.0/java/flig...


There are several community members actively working on making things more accessible in general. (Flight SQL definitely needs some polish here.)

For Arrow itself, you could start with the Cookbook [1] (and feel free to file issues if there are things that are unclear, or that you would like to learn more about [2])

The example there is very long, but much of it is repetitive (that said, it could certainly be improved). But to your specific question, what you would eventually do is write a plugin or proxy for Postgres (or another database) implementing these RPC methods. Then you could submit SQL queries using the Flight SQL client (which calls those RPC methods internally) to get back columnar data. You could wrap this all in a JDBC driver as well (which is being actively worked on [3]), in which case you'd be mostly ignorant of what's going on underneath; the benefit there would (hopefully) be faster data access due to the lower layers being columnar (and zero-copy where possible), and enabling parallelism due to Flight (Postgres might not support this for instance, but other databases could, if they could expose internal parallelism or partitioning). Hope that helps.

[1]: https://arrow.apache.org/cookbook/ [2]: https://github.com/apache/arrow-cookbook/ [3]: https://github.com/apache/arrow/pull/12254


Ok, I think I'm getting it. Then some of the remaining work (it sounds like) is getting that flight sql proxy/plugin setup for 3rd party datastores but once that is done you could hook up and run either flight directly through the flight client or via ODBC/JDBC wrappers that make it feel a little more sqly.


Mark Raasveldt and Hannes Muhleisen at CWI wrote about these challenges in their 2017 paper Don't Hold My Data Hostage [0]. It's great to see progress on this front.

[0] https://ir.cwi.nl/pub/26415


Performance quote:

Compared to existing libraries like PyODBC, Arrow Flight is already as much as 20x faster (~00:21:00). Flight SQL will package these performance advantages into a standard interface, ready for clients and databases to implement.

Would be good to compare with more main stream ODBC/JDBC drivers, especially how it compares to the ones provided by the DBMS providers.


The Flight SQL spec landed a bit ahead of implementations. I would have liked to cite more relevant/concrete performance numbers, but we need a full implementation for that! (I believe there is at least one in the works, but I can't speak to those plans myself.)


Off topic, but the "Data Handling" landscape has become extremely confusing. By Data Handling, I mean

  - Collection
  - Distribution
  - Processing
  - Storage
  - Querying
Earlier, it was SQL + Software, but now, its a myriad of systems, a myriad of methods, with overlapping functions and a whole slew of libraries that come with related tools.

I do understand that the underlying need for all this is the sheer volume of data that is being generated in the world. But as an experienced programmer, but newbie to the big data section, I find it overwhelming.


> Earlier, it was SQL + Software, but now, its a myriad of systems, a myriad of methods, with overlapping functions and a whole slew of libraries that come with related tools.

Apache Arrow and the steadily growing ecosystem around is out to simplify those myraid ways of doing things and bring about interop between various frameworks (Spark, Trinio, Snowflake etc), libraries (pandas, dplyr, etc), and languages (R, Julia, Python, Rust etc).


You don’t have to use those systems. Fundamental systems are still valid and solid. Don’t be afraid to reinvent a wheel which suits your case perfectly.


Exciting, remember whiteboard conversations about this years ago, and benchmarking early ideas like Turbodc :)

I could not find: Is there a list of conformant DBs & their implementation statuses? Been curious both for Flight and Flight SQL....


There is not a list (though right now it is zero, unfortunately); Dremio has been doing some work around this (and contributed the spec and implementation) though I can't speak for their plans.

Just a minor distinction - Flight is lower-level and might be used to build a wire protocol for a database, but a "database supporting Flight" still requires something like Flight SQL to specify how the database expects to serve requests over Flight.


Yes -- we actively work w/ DB/compute companies supporting Flight w/ ~ad-hoc APIs and are thinking through integration. But seen 0 with Flight SQL (and saw a ~20% lift with Turbodc vs ODBC). So curious about either Flight (ad-hoc) or Flight SQL. The blogpost made it sound like at least 2-3 WIP for Flight SQL, which would be great to learn about & track :)

For Graphistry, we 'simply' expose Arrow-over-REST APIs for streaming/batch upload+download, so no Flight needed, and that's why we didn't build Flight into the original JS Arrow lib. But as we're getting into bigger graphs / multi GPU and wanting parallel IO for DB connectors, been watching...


Ah, interesting to hear what you're doing. I would like to see Flight supported in JavaScript/the browser someday.

Here, the Flight SQL spec landed ahead of implementations, so that's why there's nothing to list yet. I would stay tuned to hear what people do with this. (I can't speak to any plans myself, though.)




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: