>The single SQL endpoint is well suited for a data marketplace. Data vendors currently ship data in CSV files or other ad-hoc formats. They have to maintain pages of instructions on ingesting this data. With Splitgraph, data consumers will be able to acquire and interact with data directly from their applications and clients.
I appreciate the effort to make it easier for users to access heterogeneous data sets, but I really hope data vendors keep shipping raw CSV files. I don't want a company to gate access to the data, merely offering a proxy. I want to be able to download the whole raw datasets from the vendor directly if I want to.
Absolutely, having the ability to download the actual data and keep it is always going to be important. We want to facilitate access to data and think it should be available from the source. But there will inevitably be fragmentation, so it's valuable to have a service that catalogs and aggregates it and makes it available over a single protocol.
For what it's worth, we run PostgREST [0] on top of the DDN, so you can also get your query results as JSON or CSV over HTTP.
It's limited to 10k rows, but in the future we might add the ability to "order" a CSV dump asynchronously and have it placed in a destination of your choice (like an S3 bucket).
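To give a rough idea (the host and path below are made-up placeholders, not the real endpoint layout), fetching results over HTTP from a PostgREST-style endpoint looks something like this in Python:

# Sketch only: the URL below is a placeholder, not the actual DDN endpoint.
import requests

resp = requests.get(
    "https://data.example.com/some-namespace/some-repo/latest/-/rest/some_table",
    headers={"Accept": "application/json"},  # PostgREST can also return "text/csv"
    params={"limit": 100},                   # PostgREST-style paging parameter
)
resp.raise_for_status()
print(resp.json()[:5])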
For the REST API endpoints (e.g. [0] [1]), we do set the Access-Control-Allow-Origin header to *, but for GET requests we don't send it at all. I'm not sure if that's equivalent to setting it to *. We can set it explicitly, though; we do want to enable this use case (eventually we'll have quotas).
Soon we're releasing a Web UI for writing SQL queries. Part of that will include an HTTP (or websocket maybe) endpoint that takes a SQL query and returns the rows, so you'll be able to execute SQL queries directly from the browser.
As an aside, we'd love to get this working with Observable! Let us know anything we can do to help. You can email me at miles@splitgraph.com or join the Discord and we can figure it out :)
In my view, our collective reliance on CSV as a medium for data distribution has resulted in far too much information loss and, consequently, time wasted on input sanitization, validity checking, and unresolvable conversations about the intent of data values like "1.12345E+11" and "".
Having spent way too much time wrangling vendor-provided CSVs, I 100% agree. I'd love for there to be a common, well-understood format for typed tabular data that supports multiple tables and enforces foreign keys between them. Ideally with a concept of "patching" to enable incremental updates.
Probably the closest thing I'm aware of is handing around a sqlite file, but I'm a little uneasy using a format that's meant to be a database as a transfer format. Dolt looks promising here too. Are there other ways?
At work, we use Parquet (https://parquet.apache.org/) for almost everything related to dataframes. We don't really care about the performance gains (although they're nice to have), but we really like having a schema.
Note: we mostly use Python, some R, and a range of ML and optimisation tools, depending on the project.
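For example, a minimal pandas/pyarrow sketch (column names and values invented for illustration) of round-tripping a typed dataframe through Parquet:

# Requires pandas and pyarrow; the columns are illustrative only.
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({
    "id": pd.array([1, 2, 3], dtype="int64"),
    "price": pd.array([1.5, 2.25, 3.0], dtype="float64"),
    "label": pd.array(["a", "b", None], dtype="string"),  # nullable string, unlike CSV
})
df.to_parquet("example.parquet", engine="pyarrow", index=False)

print(pq.read_schema("example.parquet"))  # column types travel with the file
print(pd.read_parquet("example.parquet"))

The point is that the schema travels with the file, so the reader never has to guess whether "00123" is a string or an integer.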
Parquet is kind of a royal pain in the ass compared to CSV/JSON/plaintext, mostly because it uses a ton of Thrift encodings, resulting in largely terrible/broken implementations anywhere outside of the Java/JVM ecosystem. If you're running Apache <Whatever> then sure, it'll probably be fine, but I'd recommend avoiding it if you start having to go down the rabbit hole of implementing support for it in your language du jour.
The Rust and Python implementations are fine. But I get it, Parquet may not be perfect or optimal or whatever. It works as a simple, typed, columnar format.
We had to pick a single file format to recommend for sending 100GB+ tables over FTP servers or Dropbox, scanning terabytes of useless stuff only to grab a key-value pair, and properly reading integer and UTF-8 columns. Turns out, Parquet is practical enough for users to start using it instead of CSV. It could have been Avro, but it's just not as easy.
> But I get it, Parquet may not be perfect or optimal or whatever.
I actually think Parquet is pretty great in practice; I just have some issues with the sheer volume of abstractions necessary to implement it. I just wish it were anything other than Thrift.
I would probably choose Parquet over anything else, though.
Parquet files don't reference anything outside of the file, usually. A group of Parquet files in a folder is usually considered a table, where the schema is the union of the schemas of the files.
The Splitgraph core code on GitHub [0], around which we've built the DDN, is all about managing "data images" which are basically snapshots of PostgreSQL schemata. You can build them with a format similar to Dockerfiles as well as do a "checkout" into a local instance of Splitgraph (which you can connect to with any PG client) -- this enables change tracking and delta compression too.
Behind the scenes, we store them as cstore_fdw [2] files, a columnar storage format that helps with analytical queries.
Seconding parquet and sqlite, but hdf5 and netcdf4 certainly deserve a mention, especially for multidimensional, scientific, and generally large datasets.
Both JSON and XML are self-documenting, and most modern databases support directly importing and exporting them. Though the tools to accomplish this could be better, these formats are far better suited as a "medium for data distribution" than CSV files are.
That said, a simple compressed .sql file of INSERT statements can often go a long way.
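As a sketch of that idea (standard library only; sqlite3 stands in for "a database" and the table is invented), here's a table exported both as JSON and as a gzipped .sql dump of INSERT statements:

import gzip
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, taken_at TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements (taken_at, value) VALUES (?, ?)",
    [("2020-08-09", 1.5), ("2020-08-10", None)],
)

# JSON export: keys carry column names, and NULL stays null instead of becoming "".
cols = ["id", "taken_at", "value"]
rows = conn.execute("SELECT id, taken_at, value FROM measurements").fetchall()
with open("measurements.json", "w") as f:
    json.dump([dict(zip(cols, r)) for r in rows], f, indent=2)

# Compressed .sql dump of CREATE TABLE + INSERT statements.
with gzip.open("measurements.sql.gz", "wt") as f:
    f.write("\n".join(conn.iterdump()))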
The only reason CSV is widely used is that normal people think of data as spreadsheets, and asking them to fire up a database and shred JSON data into it is ridiculous when they just want to whip up a line graph or answer a simple question (e.g. "What was value X on this particular date?").
The thing that attracted me to SplitGraph from the very start is that they are proposing to make the PostgreSQL wire protocol and SQL dialect a general interface to remote data. The interface is not only well known but backed by permissively licensed, open source libraries. Plus there are hundreds of tools that already connect to PostgreSQL.
This idea makes such sense it's a little surprising nobody did it before.
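To make that concrete (the connection string and table name below are placeholders I'm making up, not Splitgraph's actual DSN), any stock Postgres driver can be pointed at such an endpoint:

# Sketch: querying a remote, Postgres-compatible endpoint with an ordinary driver.
# The DSN and table name are placeholders for illustration only.
import psycopg2

conn = psycopg2.connect("postgresql://user:password@ddn.example.com:5432/ddn")
with conn.cursor() as cur:
    cur.execute('SELECT COUNT(*) FROM "some-namespace/some-repo".some_table')
    print(cur.fetchone())
conn.close()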
It is not a design flaw to make a reasonable choice about which data types you support, especially given that the types JSON captures are overwhelmingly the most commonly used and necessary ones.
In cases where that's not enough, you could roll your own types by putting the values in plain strings, and it would still be strictly more expressive than CSV.
That's a good point. Any idea if there's a well-spec'd JSON format around for database data? E.g. something that handles binary data, ternary logic (e.g. null values), and hopefully referential integrity?
You can generate CSV if you need it. See psql --csv. [1] What's brilliant about this approach is that you can generate any format that's supported by the interface defined by the PostgreSQL SQL dialect and wire protocol. (Obvious caveats about timeouts, network bandwidth, etc. apply.)
I see a new job title coming into being - Enterprise Data Librarian.
40,000 data sets - even if many are just different versions - is a ridiculous number to manage or even know about on a non-full-time basis.
Data-driven decisions need data, yes, but they also need people to know the data exists, and what it means.
And this is just external curated data - use this as the standard for what each department should be producing internally.
In fact that's a good idea - a data publishing standard - not just the data types / schema, but actually supplying it through a format that is consumable by others.
As someone who has spent the last two years trying, and almost succeeding, to get rid of Pachyderm, I like what I just read.
Something is not entirely clear to me right now: An image is an immutable snapshot of a dataset at a given point-in-time - great - but, can I query the same dataset at two different PIT using layered querying in SQL ? Something like this: SELECT * FROM dataset:version-1, dataset:version-2
Also, are you storing the entire dataset as new or only the diff between versions (and later reconstruct the full image) ?
Now, onto the things that could be improved...
- Git-like semantics (pull, push, checkout, commit) are poorly suited for versioned, immutable datasets. Just (intelligently) abstract fetching and sending datasets by looking at the SQL query (dataset:version-2, above)
- Versions should be at least partially ordered and monotonically increasing. Hashes don't convey the information necessary to decide whether dataset:de4d is an earlier version of dataset:123a or not.
- Tracing a derived dataset's provenance will only work if you can assert that the "code" or transformations applied to the original dataset are deterministic (side-effect free). So either you have your own ETL language that you can execute in a sandbox and add a myriad of useless stuff for creating and scheduling pipelines (please don't do that!), or you just let it go and don't end up becoming Pachyderm (sounds great!).
Great comment! We're pretty much in agreement with all of this.
> can I query the same dataset at two different PIT using layered querying in SQL
Yes. We have a query that does this on our home page (it's the second in the carousel at the top):
SELECT
    new.domain AS domain,
    new.id AS id,
    new.name AS name
FROM
    "splitgraph/socrata:20200809".datasets old
RIGHT OUTER JOIN
    "splitgraph/socrata:20200810".datasets new
    ON old.id = new.id  -- assumed join key
This is querying two images (versions) of the same repository.
> are you storing the entire dataset as new or only the diff between versions
Basically we store the diffs between versions, yes. The simplified explanation is that data images are an initial snapshot + chain of diffs ("delta-compressed"). You can read more in the concepts section of the docs. [0]
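Conceptually (a toy illustration of delta compression in general, not Splitgraph's actual object format), materializing a version means starting from the snapshot and replaying diffs:

# Toy sketch of "initial snapshot + chain of diffs" versioning; illustration only.
snapshot = {1: "alpha", 2: "beta"}                   # v0: full copy
diffs = [
    {"upsert": {3: "gamma"}, "delete": []},          # v0 -> v1
    {"upsert": {2: "beta-fixed"}, "delete": [1]},    # v1 -> v2
]

def materialize(snapshot, diffs, version):
    """Rebuild the table as of `version` by replaying diffs onto the snapshot."""
    table = dict(snapshot)
    for diff in diffs[:version]:
        table.update(diff["upsert"])
        for key in diff["delete"]:
            table.pop(key, None)
    return table

print(materialize(snapshot, diffs, 0))  # {1: 'alpha', 2: 'beta'}
print(materialize(snapshot, diffs, 2))  # {2: 'beta-fixed', 3: 'gamma'}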
> Git-like semantics (pull, push, checkout, commit) are poorly suited for versioned, immutable datasets
We pretty much agree with this. Some similar projects take the wrong approach and try to implement 100% of git commands "for data." But it's a square peg trying to fit into a round hole. The use cases are not the same. There are some similarities, but they're limited. Our goal was to step back and think from first principles: What do we like about our coding tools? How can we get the same benefits with data?
> Versions should be at least partially ordered and monotonically increasing. Hashes don't convey the information necessary to decide whether dataset:de4d is an earlier version of dataset:123a or not.
Sure. If you want monotonically increasing versions, you can do that by tagging an image with `sgr tag`. (FWIW, image hashes are currently random, but the hashes of objects that comprise them are not -- they're actually content addressable representations of the changeset from the root to that version using an algorithm called LTHash.)
> Tracing a derived dataset's provenance will only work if you can assert that the "code" or transformations applied to the original dataset are deterministic (side-effect free).
Yes, definitely. Dataset provenance works best when you use Splitfiles [1] to construct images. It's still possible to run into problems with non-determinism, but for the OLAP use case, we think those situations should be rare.
Ah! I initially missed the Splitfiles page! Declarative, cacheable... it's a great approach you have here. I think you could go even further and infer some SQL transformations from the schema definition.
Anyway, I'm curious to see where your project will end up. You've put much more effort into the design than most "for data" tools!
This is very cool. Relatedly, as a data scientist, I wish companies would expose their APIs through SQL. I've spent a lot of time pulling data into ETL jobs from things like mixpanel, adwords etc., and having a unified interface would make things much simpler.
I'm trying to understand the architecture of Splitgraph. Are all foreign data wrappers controlled directly by you, or can third parties host a database and connect it to Splitgraph in a federation?
Currently we control and set up all the FDWs (well, an orchestration layer does it on the fly as the query comes in and routes the query to the correct schema with foreign tables).
You can also run a Splitgraph engine locally and add your own FDWs to it. We have a lot of scaffolding around FDWs to make instantiating them much simpler, and we even wrote a blog post [0] about adding a custom FDW to Splitgraph.
However, in the future we'll be adding the ability to add your own backend data sources to Splitgraph that it can proxy to (whether as a private dataset on the public Splitgraph instance or as a "data virtualization" layer when you have an in-house Splitgraph deployment).
The cool thing about this is that this can be a single gateway to all your data silos (Snowflake, third-party SaaS, public datasets) that can handle federated query execution, data discovery and access control (e.g. firewalling queries to sensitive columns even if the backend data source doesn't support this level of granularity).
Is there a CPU limit or timeout for queries? I’d be a little concerned that an intentionally slow and inefficient query could pin the CPU at 100% and ruin the performance for other users
We currently limit all queries to 30s of execution and 10,000 rows returned (by adding a `LIMIT` clause to queries that don't have one). We also have some mechanisms like query result caching and rate limiting for better QoS. One of our directions is building what is basically a CDN for databases, so it's good to figure these things out as early as possible.
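A toy illustration of the general idea (not the query-rewriting code we actually run), using subselect wrapping rather than LIMIT injection:

# Illustration only: cap result size by wrapping the query in a subselect.
MAX_ROWS = 10000

def cap_rows(query: str, max_rows: int = MAX_ROWS) -> str:
    """Return a query that yields at most max_rows rows of the original query."""
    return f"SELECT * FROM ({query.rstrip().rstrip(';')}) AS capped LIMIT {max_rows}"

print(cap_rows("SELECT domain, id FROM datasets ORDER BY id"))
# SELECT * FROM (SELECT domain, id FROM datasets ORDER BY id) AS capped LIMIT 10000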
That’s actually pretty cool, to see a public URL with PostgreSQL protocol signifier like that.
Makes me wonder if any developers or DB Architects ever thought of putting their resume in a DB and putting a public read-only postgresql:// URL on their business card :D
I wouldn't expect much from going this far, but yes, too many hiring managers either forget or never realize that they're being interviewed, too.
I saw this more before I started refusing interviews with the HugeCos, but the strangest flavor of this I've run into was at an early startup - I suspect the guy was acting out an arrogant-genius script, hoping to convince people he was one. (It came across as very weird, not in a good way.)
- usually the first person to screen or introduce you won't be a developer or perhaps not even technical, so you've put yourself out of the running early on. I can query it, but many who might be in a position to put you forward might not. If you want to do that though then fair enough, that's your choice.
- reflecting on my current situation where I'm busier than normal right now and I want an easy run on non-development tasks ;)
And, with respect, you missed one of my points, regarding the code.
Sure, websites are normal, but if you present me with the source and you wrote it (i.e. it's not just a WordPress site) and it's something worth looking at, then I can already see that you can do something interesting and of value.
Very neat indeed. I thought Postgres had a max identifier length of 63 characters so I was surprised to see "cityofchicago/covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc".covid19_daily_cases_deaths_and_hospitalizations in the FROM part of the statement. Does the max identifier length not apply for some reason here or have Splitgraph done something to increase it?
On a related note, I've long wanted longer identifier lengths in Postgres so we can have more meaningful column names but the powers-that-be have always refused... hopefully one day it'll increase in the default distribution.
Co-founder here. The 63-char limit still applies (we didn't recompile Postgres!) but we have some code in front, embedded in a layer of PgBouncers, that intercepts the query, parses it and rewrites it into a shorter dataset ID hash that we then "mount" on the database on the fly using Postgres FDWs before forwarding it.
We also use this to drop unwanted queries and rewrite clients' introspection queries (e.g. information_schema.tables) to give them a list of featured datasets instead of normal Postgres schema names.
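For illustration (a toy sketch, not the code we actually run), the rewrite boils down to mapping the long, human-readable schema name onto a short, stable identifier before the query reaches Postgres:

# Toy sketch: rewrite a long schema name to a short hash that fits within
# Postgres's 63-character identifier limit. Illustration only.
import hashlib

def short_schema_id(long_name: str) -> str:
    """Derive a stable, short schema identifier from a long dataset name."""
    return "ds_" + hashlib.sha256(long_name.encode()).hexdigest()[:16]

long_name = "cityofchicago/covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc"
query = f'SELECT * FROM "{long_name}".covid19_daily_cases_deaths_and_hospitalizations LIMIT 5'
rewritten = query.replace(f'"{long_name}"', f'"{short_schema_id(long_name)}"')
print(rewritten)  # the schema identifier is now well under the 63-char limit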
Looks like you have to modify the source code and rebuild from source to get longer identifiers.
"... The system uses no more than NAMEDATALEN-1 bytes of an identifier; longer names can be written in commands, but they will be truncated. By default, NAMEDATALEN is 64 so the maximum identifier length is 63 bytes. If this limit is problematic, it can be raised by changing the NAMEDATALEN constant in src/include/pg_config_manual.h."
The only annoying part is that you're plugging your code into an interface that might break (and sometimes has) between releases of PG. So it's kind of the same fun as maintaining a gcc plugin...
By the way, does anyone have any idea about the licensing terms/issues of PG FDWs and PG extensions in general?
FDWs are a powerful feature. We’ve done a lot of work to make scaffolding them easier, if you use sgr. One of our earlier blog posts includes an example of making an FDW for the HN API and packaging it as a Splitgraph mount handler. [0]
There’s also this great post about using FDWs for parallel query execution, by David Grier at Swarm64. [1] They seem to be doing cool things with FDWs too.
Personally I think proxies are a really powerful abstraction in general. Cloudflare and Twilio are two examples of companies built around proxies.
I'm not sure where you're going with this product, but I like the idea of proxies and your idea of a DDN, and I wish you the best.
I'm trying to 'bind them all' at my job right now, and Postgres is very inspiring with its extension success stories and all the FDW work happening. And once you can bind to C, you can mostly choose your language :-)
Sorry about that! I just tested with the same DBeaver version and was unable to reproduce the bug. If you email support@splitgraph.com with your username, we can check the logs and figure out the problem for you.
In general, some clients will have issues with introspection queries (which they use to show the list of available schemas). And even when introspection works, we can't show you all 40k datasets or it might break your client. So we just show "featured" datasets, and for a more exhaustive list you can go to the website to find a dataset.
But, you can still query any dataset on Splitgraph regardless of what shows in your client's schema list. You can go to https://www.splitgraph.com/explore to find datasets, and from each dataset's repo page, you can click "tables" or "query SQL" to get a sample SQL query to run.
Thanks for opening up a new way to work with public data and discover it. I have several ideas regarding this. I've used free public APIs, and the worst thing about them is that they're all unreliable: unreliable in their terms and limits, and they usually don't scale. And you can't blame the API providers, because you don't pay for them. I vote for premium, resource-based access to the data with a free tier: you can pay for the level of service you need, or use the tiny, limited free access.
'Data' is quite broad. If you need results in a tabular format I agree with you - without doubt SQL is the API.
But for nested data (XML, JSON, etc.) it really isn't the best language. I'm talking here specifically not about querying data that is nested, but about actually getting query results in a nested format. SQL can do it (almost all major databases have XML and JSON support), but it really isn't the easiest thing to use.
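For example (a sketch with invented table and column names), nesting child rows as a JSON array per parent is doable with json_build_object and json_agg, it's just verbose:

# Sketch: producing nested JSON directly from Postgres. The DSN, tables and
# columns are invented for illustration only.
import psycopg2

NESTED_QUERY = """
SELECT json_build_object(
    'id', o.id,
    'customer', o.customer_name,
    'items', (
        SELECT coalesce(json_agg(json_build_object('sku', i.sku, 'qty', i.qty)), '[]'::json)
        FROM order_items i
        WHERE i.order_id = o.id
    )
)
FROM orders o;
"""

conn = psycopg2.connect("postgresql://user:password@localhost:5432/shop")
with conn.cursor() as cur:
    cur.execute(NESTED_QUERY)
    for (doc,) in cur.fetchall():  # each row comes back as one nested JSON document
        print(doc)
conn.close()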
I'm wondering if anyone has (good or bad) experience with pg's composite types? They're supposed to allow getting hierarchical data as query results, right? This and arrays should cover most 'simple' use cases?
Hey, co-founder here. There's no pricing yet as we've just launched. Our plan is basically a public/private instance model, kind of like GitHub. The public instance will eventually have quotas (definitely for storage, maybe for server-side query execution). But the main product will be private deployments of Splitgraph, for companies that want a catalog for their internal data. Eventually you'll be able to use the web UI to connect arbitrary upstream sources (Snowflake, BigQuery, SaaS, etc.) to the catalog. You'll be able to manage services on top of each source (e.g. caching, snapshotting, access control, query rewriting, firewalling) and share data with colleagues from the web UI. Basically we can provide a sort of aggregation layer on top of your disparate data sources. We think combining the proxy with a catalog is a really powerful combination.
mm interesting... we have this open Postgres instance (read only) for covid19 research: https://covid19.eng.ox.ac.uk/
we have it running on our own (cheap) server, but we fear we may get overwhelmed by too much traffic if the project becomes very successful. Would this be a solution for us? Is it free?
Very cool! This is a great use case for Splitgraph. We'd be happy to help you deliver that data. The easiest method would probably be for us to proxy queries to your Postgres instance (you can't do this yourself from the website yet, but it's a planned feature, and we can work with you to set it up), and then you could benefit from our connection pooling and caching. Another option would be for you to push the data to Splitgraph as an image (to keep it up to date, you can set up a local Splitgraph instance as a PG replication client and periodically `sgr commit` a new image [0]). If you'd like to chat details, feel free to email support@splitgraph.com or join the Discord (https://discord.gg/eFEFRKm).
In terms of price, we'll eventually add quotas (storage + server side query execution) on the public tier. But the main monetization will be private deployments. In an ideal world, the private deployments will be able to subsidize the costs of some of the open data on the platform. Certainly we'd like to be able to support projects like this one.
Postgres foreign data wrappers are a weird choice of engine. Most queries to this service will be scans, in which case a column-oriented, vectorized, massively parallel engine like Presto will be 1000 times faster or so. Postgres' underlying engine is optimized for scenarios where you read a small number of rows using an index.
Hey George, thanks for the comment and for the good points!
We want to initially focus on the use case with lots of diverse backend datasets and ad-hoc APIs (maybe with a no-code like solution on top of a generic FDW) where performance won't be the bottleneck. If necessary, the backend data sources can perform aggregation and fast query execution. For example, you can also put Splitgraph in front of Presto (through JDBC). The value we want to provide in these cases is:
* granular access control (e.g. masking for PII columns, auditing etc)
* firewalling/query rewriting/rate limiting (for publicly accessible endpoints that proxy to internal databases that vendors want to publish more easily than through cronjobs with data dumps)
* cataloguing (so you get to discover datasets/data silos, get their metadata and query it over multiple interfaces in the same product)
We also like keeping the PG wire format in any case, as there are so many BI tools and clients that use it that it makes sense to not break that abstraction. We started with PG FDWs just because of the simplicity and the availability of FDWs, but we might swap the actual Postgres FDW layer for some faster execution in the future, if it's needed.
Behind the scenes, for Splitgraph images, we use cstore_fdw as an intermediate storage format (it's a columnar store similar to ORC, with support for all PG types, like PostGIS geodata). There's potential in using it as the format for a local query/table cache on the Splitgraph nodes that we intend to deploy around the world for low-latency, read-only query execution.
Looks lovely. I can see real use for this in my work; Postgres and the availability of the PostGIS extension are really useful for mapping data and spatially related queries.
I'm not sure if you're asking about (a) querying Oracle from a Postgres client through Splitgraph, or (b) querying Splitgraph from Oracle.
We want to support both these use cases. For (a), Oracle would be an "upstream" to Splitgraph. We'll need to write a plugin that implements the FDW and does introspection. Eventually, we want you to be able to configure upstreams from the Web UI.
For (b), you can probably find a way to query Splitgraph from Oracle, e.g. using Oracle's "gateway" feature [0]. What's nice about Splitgraph is that it's compatible with any SQL client that can speak the Postgres protocol (or ODBC). So if Oracle can connect to a Postgres database, it can connect to Splitgraph.
We have instructions for how to query Splitgraph from within ClickHouse at [1]. We're actually giving a presentation about this to a ClickHouse meetup on Sep 10, feel free to join. [2]
SF ClickHouse meetup organizer here. Thanks for the shout-out for the SF ClickHouse Meetup. We're looking forward to hearing about SplitGraph on September 10th.