Rust DataBase Connectivity (RDBC)

CodesInChaos · on Jan 12, 2020

1. Can `Box<dyn RowAccessor>` even work like that, considering it's not object safe? My understanding is that adding the `Self:Sized` bound to `get` makes it compile, but also means that this method won't be available in a dynamic context, so the accessor you get from the rowset is completely useless.

2. Why do the accessors return results? IO errors should already happen when loading into the rowset. So at that point the only error the column accessor could run into is an out of bounds index, which is a panic worthy programming error.

3. Why do the accessors return options? Shouldn't that be absorbed into the generic `T` and only be used for nullable columns?

4. Why return owned accessors (boxes) instead of references?

5. Columnar data storage without any low-level access to the in-memory representation seems rather pointless from a performance point of view. You can't access it with SIMD instructions, and you incur an indirect call overhead for each accessed element. Do you expect people to downcast the column accessor to a specific type for high performance access?
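
Point 1 can be illustrated with a minimal sketch. The names below (`RowAccessor`, `VecRow`, `get_i64`) are hypothetical stand-ins, not the API from the article. A generic method bounded by `Self: Sized` keeps the trait object safe, but it cannot be called through `dyn RowAccessor`, so only the non-generic method stays usable in a dynamic context.

```rust
// Hypothetical trait, loosely modelled on the accessor discussed above.
trait RowAccessor {
    // Non-generic method: callable through `dyn RowAccessor`.
    fn get_i64(&self, index: usize) -> Option<i64>;

    // Generic method: the `Self: Sized` bound keeps the trait object safe,
    // but it also removes this method from `dyn RowAccessor`.
    fn get<T: From<i64>>(&self, index: usize) -> Option<T>
    where
        Self: Sized,
    {
        self.get_i64(index).map(T::from)
    }
}

// Simple concrete accessor backed by a vector.
struct VecRow(Vec<i64>);

impl RowAccessor for VecRow {
    fn get_i64(&self, index: usize) -> Option<i64> {
        self.0.get(index).copied()
    }
}

fn read_static(row: &VecRow) -> Option<i64> {
    // Fine: `VecRow` is `Sized`, so the generic method is available.
    row.get::<i64>(0)
}

fn read_dynamic(row: &dyn RowAccessor) -> Option<i64> {
    // `row.get::<i64>(0)` would be rejected by the compiler here;
    // only the non-generic method can be called on the trait object.
    row.get_i64(0)
}
```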

IMO abstracting over storage formats via dynamic dispatch is the wrong approach. The proper way is to make all types take a generic parameter for the format.
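
A rough sketch of the generic alternative, assuming a hypothetical `ColumnFormat` trait and a hypothetical `ContiguousF32` format (neither comes from the article). With a type parameter the call is statically dispatched, so a format can hand out its raw slice and the loop has no per-element indirect call.

```rust
// Hypothetical abstraction over in-memory column formats.
trait ColumnFormat {
    // Low-level view of the column's values.
    fn as_f32_slice(&self) -> &[f32];
}

// One possible format: values stored contiguously.
struct ContiguousF32 {
    values: Vec<f32>,
}

impl ColumnFormat for ContiguousF32 {
    fn as_f32_slice(&self) -> &[f32] {
        &self.values
    }
}

// Generic over the format: resolved at compile time, and the loop runs
// over a plain slice rather than calling an accessor per element.
fn sum_column<F: ColumnFormat>(column: &F) -> f32 {
    column.as_f32_slice().iter().sum()
}
```

The trade-off is that every type touching a column now carries the format parameter, which is the cost of avoiding `dyn`.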

twic · on Jan 12, 2020

What you write sounds extremely sensible to me! Thoughts:

2. Loading the rows might be lazy, using a cursor etc, so you could encounter IO errors while traversing the rowset. You really, really want lazy rowsets for iterating enormous results without having to materialise the whole lot in memory.

5. Maybe as well as element-by-element access, there should be some sort of bulk access to get a buffer full of elements in one go.

A. I'd like to see some way to use a ColumnAccessor, or some other thing, to access elements in a row. I would like to write code like:

    use std::io::Result as IoResult;
    
    struct Column<T> {
        column_type: std::marker::PhantomData<T>,
    }
    
    struct Row {}
    
    impl Row {
        fn get<T>(&self, column: &Column<T>) -> T { unimplemented!(); }
    }
    
    struct RowSet {}
    
    impl RowSet {
        fn get_column<T>(&self, name: &str) -> IoResult<&Column<T>> { unimplemented!(); }
    
        fn get_row(&self, index: u64) -> IoResult<&Row> { unimplemented!(); }
    }
    
    pub fn main() -> IoResult<()> {
        let rows = RowSet {};
        let weight_column = rows.get_column::<f32>("weight")?;
        let row = rows.get_row(23)?;
        let weight = row.get(weight_column);
        Ok(())
    }

The point being that i can do the lookup of the column once, ensuring that it exists and has the type i expect, and then safely extract column values from rows later on.
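
Points 2 and 5 above could be combined, as a sketch: a lazy cursor yields one batch at a time, each step can fail with an IO error, and each batch exposes a whole buffer of values. The `BatchCursor` type and its in-memory `pending` source are hypothetical; a real driver would read from the network instead.

```rust
use std::io;

// Hypothetical lazy cursor over batches of a single f32 column.
struct BatchCursor {
    // Stand-in for a network stream: each entry is one fetch attempt.
    pending: std::vec::IntoIter<io::Result<Vec<f32>>>,
}

impl Iterator for BatchCursor {
    // Fetching a batch may fail, so errors surface during traversal.
    type Item = io::Result<Vec<f32>>;

    fn next(&mut self) -> Option<Self::Item> {
        self.pending.next()
    }
}

// Sums a column while holding at most one batch in memory.
fn total_weight(cursor: BatchCursor) -> io::Result<f32> {
    let mut total = 0.0_f32;
    for batch in cursor {
        let values = batch?; // propagate IO errors from lazy loading
        total += values.iter().sum::<f32>(); // bulk access to the buffer
    }
    Ok(total)
}
```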

CodesInChaos · on Jan 12, 2020

2. I understood the `RowSet` as representing a batch, not the whole data set.

A. I agree that the column should only be requested once, similar to what you propose. Though I'd move the lifetime into the `Row` type instead of returning a reference. Unfortunately you don't totally escape the runtime check, since you still need to check if the column and row come from the same RowSet.
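
One way to read this suggestion, as a sketch with hypothetical types: `Row<'a>` carries the borrow of its `RowSet` as a lifetime parameter and is returned by value, and each `Column` remembers the id of the `RowSet` it was looked up in, so a mismatch is caught at runtime.

```rust
use std::marker::PhantomData;

// Hypothetical row set holding a single f32 column for brevity.
struct RowSet {
    id: u32,
    weights: Vec<f32>,
}

// Typed column handle, tagged with the row set it was looked up in.
struct Column<T> {
    rowset_id: u32,
    column_type: PhantomData<T>,
}

// The lifetime lives in the type; the row itself is returned by value.
struct Row<'a> {
    rowset: &'a RowSet,
    index: usize,
}

impl RowSet {
    fn weight_column(&self) -> Column<f32> {
        Column { rowset_id: self.id, column_type: PhantomData }
    }

    fn get_row(&self, index: usize) -> Option<Row<'_>> {
        if index < self.weights.len() {
            Some(Row { rowset: self, index })
        } else {
            None
        }
    }
}

impl Row<'_> {
    fn get(&self, column: &Column<f32>) -> f32 {
        // The remaining runtime check: a column from another row set is
        // treated here as a programming error, hence the panic.
        assert_eq!(
            column.rowset_id, self.rowset.id,
            "column belongs to a different RowSet"
        );
        self.rowset.weights[self.index]
    }
}
```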

twic · on Jan 12, 2020

> I'd move the lifetime into the `Row` type instead of returning a reference

I'm afraid i don't understand; which lifetime, and which returned reference?

> Unfortunately you don't totally escape the runtime check, since you still need to check if the column and row come from the same RowSet.

True, but at that point it's a programmer error, so at least you can panic rather than returning a Result!

peatmoss · on Jan 12, 2020

I appreciate that the author here takes pains to review existing standards and implementations like ODBC/JDBC, and also reviews newer ideas like columnar stores and projects like Apache Arrow. It inspires confidence when I see engineers do some degree of review before swashbuckling their way into new code. It’s like the literature review in graduate school: first read, then code.

xxxtentachyon · on Jan 12, 2020

Andy Grove is a PMC on Apache Arrow, and a great columnar store community member. It was probably less of a literature review for him than a reflection on past experience.

peatmoss · on Jan 12, 2020

I didn’t have that context—thank you. Still, I guess it’s an endorsement when systematic review of prior art can be mutually confused with experience :-)

pimeys · on Jan 12, 2020

There are also other projects working on solving similar problems.

One is sqlx[0], an asynchronous crate for connecting to MySQL and PostgreSQL that validates queries at compile time and is built as a new ground-up implementation using async/await and async-std.

And our project is quaint[1], which builds on top of existing tokio-based database crates, giving a unified interface and a query builder.

[0] https://crates.io/crates/sqlx

[1] https://crates.io/crates/quaint/

cwp · on Jan 13, 2020

I've never understood the appeal of this sort of thing. The idea is to create a common interface to a bunch of different databases. But you still have to write or generate SQL that is specific to whatever database you're actually using. Outside of a very narrow range of apps (e.g. SQL-IDE kinds of tools), you're going to be both coupled to a specific database and limited by lowest-common-denominator design constraints in the interface library. Ugh.

pjmlp · on Jan 13, 2020

Having to fine-tune SQL, or change the stored procedure call syntax, is a tiny change compared to rewriting the complete database binding code from scratch.

Plus, for many kinds of applications, the ANSI SQL coverage across RDBMSs is quite OK.

lenkite · on Jan 13, 2020

We have the same product that runs against 4 different DBs with 99% of the SQL being identical. Sure, you sometimes need to hunt down whether a standard SQL feature is supported, but they mostly are. And even if the query dialect for the feature is a bit different, the DB access code remains the same.

de6u99er · on Jan 12, 2020

I think the initiative is great. I did lots of Java database development. Starting with JDBC, circumventing J2EE with Spring and Hibernate, and ending up doing JPA with Spring. I haven't seen anything similar in Rust.

Therefore I recommend looking at JPA for object-relational mapping, and Spring Framework's JdbcTemplate for things like basic CRUD support and JPA abstraction.

sverhagen · on Jan 12, 2020

These tools are my daily ones too, but they're abstractions on top of JDBC and as such a different gap than what the author is trying to fill.

eb0la · on Jan 12, 2020

The work done by this guy is incredible. Apache Arrow committer (DataFusion), wrote a SQL crate, Ballista (think spark+k8s-java+rust), and now this.

I feel too much envy.

thayne · on Jan 12, 2020

> RDBC is specifically for the use case where a developer needs the ability to execute arbitrary SQL against any database and then be able to fetch the results.

So RDBC solves the problem of different connection protocols. But what do you do about the fact that writing a query that works for every database is extremely difficult?

spullara · on Jan 12, 2020

You don't do anything about that in this context. The queries you write must target the database you are going to use.

thayne · on Jan 13, 2020

Right, I'm just curious what the use case is where you want to be able to abstract over the connection for a variety of relational databases, but don't need to abstract over the differing syntax for the queries.

jonfk · on Jan 13, 2020

The use case would be the same as the one covered by jdbc and odbc. That is creating a standard interface to connect to databases at a lower level than the orm. This could then be used by an orm or query builder to communicate with the database. That’s how for example I can use Spring JPA and if I write either fairly portable sql or use the crud repository methods, I can connect to multiple different sql databases by simply providing a different jdbc URL.

The differing syntax of each database would be handled at a higher level, by the query builder or ORM. The ORM could let the user decide for themselves whether to write portable SQL or not, and fail at runtime, or, like diesel in Rust, enforce the usable features through its types.

In my experience that is something that is currently missing in the Rust ecosystem. Each library seems to have its own connection logic, and if I want to write a program that could run on different databases in Rust right now, I have to write different code to connect to each database I would like to support.

Diesel has been serving me really well, but I currently need to choose a specific backend when I am building a query, and implementing the above dynamic style of connection would be quite cumbersome. It would be nice if we could choose a standard SQL backend where I would only be able to use the subset of features supported by most SQL databases and build queries against that. Hopefully the RDBC project could help with something like that.

kyllo · on Jan 13, 2020

One is data analysis. I often have data in two or three different datastores that I need to analyze together (a column-oriented DB, one or more row-oriented DBs, a key-value store, a document store, a Hadoop/Spark cluster) and it's nice to be able to use the same code to open and manage connections to any one of them. My queries are and always will be vendor-specific, but my connection handling is all just generic ODBC code.

It's also great for WYSIWYG BI tools like Tableau or Power BI, which can connect to and fetch results from anything that has an ODBC driver. Without that, the BI tool vendor would need to write their own driver specific to each datastore on the market. So if anyone ever wants to build a generic BI/analytics tool product in Rust, this is a prerequisite.

ww520 · on Jan 12, 2020

This is great. Using the traits as the interface is clever. That means as long as a DB-specific driver implements the traits, it will work.
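
A minimal sketch of that idea, with a hypothetical `Connection` trait and two fake drivers. The real RDBC traits may look different; this only shows application code depending on the trait rather than on a specific driver.

```rust
// Hypothetical driver-facing trait; the real RDBC traits may differ.
trait Connection {
    // Runs a query and returns the rows as strings, for simplicity.
    fn query(&mut self, sql: &str) -> Result<Vec<Vec<String>>, String>;
}

// Two stand-in drivers; real ones would speak a wire protocol.
struct FakePostgres;
struct FakeMySql;

impl Connection for FakePostgres {
    fn query(&mut self, sql: &str) -> Result<Vec<Vec<String>>, String> {
        Ok(vec![vec![format!("postgres ran: {}", sql)]])
    }
}

impl Connection for FakeMySql {
    fn query(&mut self, sql: &str) -> Result<Vec<Vec<String>>, String> {
        Ok(vec![vec![format!("mysql ran: {}", sql)]])
    }
}

// Application code depends only on the trait, not on a specific driver.
fn count_rows(conn: &mut dyn Connection, sql: &str) -> Result<usize, String> {
    Ok(conn.query(sql)?.len())
}
```

Note that the SQL string is passed through untouched, so the query itself still has to suit whichever database sits behind the trait.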

snuxoll · on Jan 13, 2020

That’s the point, just like JDBC, ODBC, DB-API, ADO.Net, etc.

dingribanda · on Jan 12, 2020

Efficient access is about more than the client-side interface. The wire format matters. If the server sends data from a columnar table in row form, it won't be efficient; there will be too many transformations. Perhaps a way for the client to tell the server which format it wants the data in would be useful.

pylua · on Jan 13, 2020

Why not use macros to map the result back? I understand that that is not consistent with jdbc and may be crossing into ORM territory, but it does seem like a clean way to handle it in rust.
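
As a sketch of what such a macro might look like (the `row_struct!` macro and the `RawRow` type are hypothetical, and a real crate would more likely use a derive macro): a declarative macro declares a struct and a `from_row` constructor that parses each field from the column with the same name.

```rust
use std::collections::HashMap;

// Hypothetical untyped row: column name -> text value.
type RawRow = HashMap<String, String>;

// Declares a struct plus a `from_row` constructor that parses each field
// from the column of the same name. Returns None if a column is missing
// or its text does not parse as the field's type.
macro_rules! row_struct {
    ($name:ident { $($field:ident : $ty:ty),* $(,)? }) => {
        #[derive(Debug, PartialEq)]
        struct $name {
            $($field: $ty),*
        }

        impl $name {
            fn from_row(row: &RawRow) -> Option<Self> {
                Some($name {
                    $($field: row.get(stringify!($field))?.parse().ok()?),*
                })
            }
        }
    };
}

// Generates `struct Person` and `Person::from_row`.
row_struct!(Person { name: String, weight: f32 });
```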