The Splitgraph Data Delivery Network – query over 40k public datasets (splitgraph.com)
297 points by mildbyte on Aug 21, 2020 | 95 comments



>The single SQL endpoint is well suited for a data marketplace. Data vendors currently ship data in CSV files or other ad-hoc formats. They have to maintain pages of instructions on ingesting this data. With Splitgraph, data consumers will be able to acquire and interact with data directly from their applications and clients.

I appreciate the effort to make it easier for users to access heterogeneous data sets, but I really hope data vendors keep shipping raw CSV files. I don't want a company to gate access to the data, merely offering a proxy. I want to be able to download the whole raw datasets from the vendor directly if I want to.


Absolutely, having the ability to download the actual data and keep it is always going to be important. We want to facilitate access to data and think it should be available from the source. But there will inevitably be fragmentation, so it's valuable to have a service that catalogs and aggregates it and makes it available over a single protocol.

For what it's worth, we run PostgREST [0] on top of the DDN, so you can get your query results as JSON or CSV. For example:

    $ curl -sSH "Content-Type: text/csv" https://data.splitgraph.com/cityofchicago/covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc/latest/-/rest/covid19_daily_cases_deaths_and_hospitalizations | wc -l

    172
It's limited to 10k rows, but in the future we might add the ability to "order" a CSV dump asynchronously and have it placed in a destination of your choice (like an S3 bucket).

[0] https://postgrest.org/en/latest/


Any chance you’re willing to enable CORS on these endpoints? (Access-Control-Allow-Origin: * response header.)


Hey! Big fan of your work.

For the REST API endpoints (e.g. [0] [1]), we do set the Access-Control-Allow-Origin header to *, but for GET requests we do not send it at all. I'm not sure if this is the same as setting it to star. We can set it explicitly though; we do want to enable this use case (eventually we'll have quotas).

Soon we're releasing a Web UI for writing SQL queries. Part of that will include an HTTP (or websocket maybe) endpoint that takes a SQL query and returns the rows, so you'll be able to execute SQL queries directly from the browser.

As an aside, we'd love to get this working with Observable! Let us know anything we can do to help. You can email me at miles@splitgraph.com or join the Discord and we can figure it out :)

[0] Docs: https://www.splitgraph.com/cambridgema-gov/covid19-cumulativ...

[1] Endpoint (powered by PostgREST): https://data.splitgraph.com:443/cambridgema-gov/covid19-cumu...


Not setting it is equivalent to a deny.


Ok, so this should be working now, for all types of requests. Let us know if it's not.

If you're testing with the same endpoints, you may need to give the cache a few minutes to evict the old values missing the header.
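
For anyone else wanting to verify, a minimal sketch using the Python requests library against the Chicago endpoint from upthread (the Origin value is just an illustration; the `limit` parameter is standard PostgREST):

    import requests

    # Endpoint copied from the curl example earlier in the thread.
    url = ("https://data.splitgraph.com/cityofchicago/"
           "covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc/"
           "latest/-/rest/covid19_daily_cases_deaths_and_hospitalizations")

    resp = requests.get(url,
                        headers={"Origin": "https://example.com"},  # hypothetical origin
                        params={"limit": 1})                        # keep the response tiny
    print(resp.status_code)
    print(resp.headers.get("Access-Control-Allow-Origin"))          # expect "*" once enabled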



Did you mean to use the "Accept" header? "Content-Type" describes the format of the request body.


Yeah, sorry, got too excited! It will return JSON by default and CSV with Accept: text/csv:

    $ curl -sH "Accept: text/csv" https://data.splitgraph.com/cityofchicago/covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc/latest/-/rest/covid19_daily_cases_deaths_and_hospitalizations?select=lab_report_date,cases_total,deaths_total | head -n5
    lab_report_date,cases_total,deaths_total
    2020-03-01,0,0
    2020-03-02,0,0
    2020-03-03,0,0
    2020-03-04,0,0


Very nice!


In my view our collective interest in CSV as a medium for data distribution has resulted in far too much information loss, and consequently, time wasted on input sanitization, validity checking, and unresolvable conversations about the intent of data values like ”1.12345E+11” and "".


Having spent way too much time wrangling vendor-provided CSVs, I 100% agree. I'd love for there to be a common, well-understood format for typed tabular data that supports multiple tables and enforces foreign keys between them. Ideally with a concept of "patching" to enable incremental updates.

Probably the closest thing I'm aware of is handing around a sqlite file, but I'm a little uneasy using a format that's meant to be a database as a transfer format. Dolt looks promising here too. Are there other ways?
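
For what it's worth, SQLite can already carry the "typed tables plus enforced foreign keys" part. A minimal sketch of building such a transfer file (the file, table, and column names here are made up for illustration):

    import sqlite3

    conn = sqlite3.connect("dataset.sqlite")     # the file you'd hand to consumers
    conn.execute("PRAGMA foreign_keys = ON")     # FK enforcement is opt-in per connection

    conn.executescript("""
    CREATE TABLE region (
        region_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE daily_cases (
        report_date TEXT NOT NULL,               -- ISO 8601 date
        region_id   INTEGER NOT NULL REFERENCES region(region_id),
        cases_total INTEGER NOT NULL,
        PRIMARY KEY (report_date, region_id)
    );
    """)

    conn.execute("INSERT INTO region VALUES (1, 'Chicago')")
    conn.execute("INSERT INTO daily_cases VALUES ('2020-03-01', 1, 0)")
    conn.commit()

    # This would raise sqlite3.IntegrityError: region 99 doesn't exist.
    # conn.execute("INSERT INTO daily_cases VALUES ('2020-03-01', 99, 0)")

The caveat is that consumers also have to turn on the foreign_keys pragma on their side; it's off by default.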


At work, we use Parquet (https://parquet.apache.org/) for almost everything related to a dataframe. We don't really care about performance gains (although it's nice to have), but we really like to have a schema.

Note: we mostly use Python, some R, and a range of ML and optimisation tools, depending on the project.
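
A minimal sketch of the "schema travels with the file" point, using pyarrow (the column names are illustrative, not from any particular dataset):

    import datetime
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Declare column types explicitly instead of hoping a CSV reader guesses them.
    schema = pa.schema([
        ("lab_report_date", pa.date32()),
        ("cases_total", pa.int64()),
        ("deaths_total", pa.int64()),
    ])

    table = pa.Table.from_pydict(
        {
            "lab_report_date": [datetime.date(2020, 3, 1), datetime.date(2020, 3, 2)],
            "cases_total": [0, 0],
            "deaths_total": [0, 0],
        },
        schema=schema,
    )

    pq.write_table(table, "cases.parquet")

    # The reader gets the declared types back; no guessing involved.
    print(pq.read_table("cases.parquet").schema)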


Parquet is kind of a royal pain in the ass compared to CSV/JSON/plaintext mostly because it uses a ton of Thrift encodings, resulting in mostly terrible/broken implementations anywhere outside of the Java/JVM ecosystem. If you're running Apache <Whatever> then sure, it'll probably be fine, but I'd recommend avoiding it if you start having to go down the rabbit hole of implementing support for things in your language du jour.


The Rust and Python implementations are fine. But I get it, Parquet may not be perfect or optimal or whatever. It works as a simple, typed, columnar format.

We had to pick a single file-format recommendation for sending 100GB+ tables over FTP servers or Dropbox, for scanning terabytes of useless stuff only to grab a key-value pair, and for properly reading integer and UTF-8 columns. Turns out Parquet is practical, enough so that users started using it instead of CSV. It could have been Avro, but it's just not as easy.


> But I get it, Parquet may not be perfect or optimal or whatever.

I actually think Parquet is pretty great in practice; I just have some issues with the sheer volume of abstractions necessary to implement it, and I wish it were based on anything other than Thrift.

I would probably choose Parquet over anything else, though.


For exhaustiveness' sake, Apache Arrow could also do the job, but Parquet seems more appropriate here: https://stackoverflow.com/questions/56472727/difference-betw...


Does Parquet provide a way to define foreign keys between different tables in a single dataset?


Parquet files don't usually reference anything outside of the file. A group of Parquet files in a folder is usually considered a table, where the schema is the union of the schemas of the individual files.
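
In other words, the "table" is mostly a directory convention. Roughly (a sketch with a made-up directory name):

    import pyarrow.parquet as pq

    # A directory of part files is treated as one logical dataset/table,
    # e.g. part-0000.parquet, part-0001.parquet, ...
    table = pq.read_table("daily_cases/")
    print(table.num_rows, table.schema)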


Avro is a strong contender for this use case: https://avro.apache.org/docs/current/

https://medium.com/ssense-tech/csv-vs-parquet-vs-avro-choosi...

It's well-supported in Pandas and Kafka, has good schema support, and reasonably small compressed file sizes.
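
A minimal fastavro sketch of the schema-carrying part (the record and field names are made up):

    from fastavro import writer, reader, parse_schema

    schema = parse_schema({
        "type": "record",
        "name": "DailyCases",
        "fields": [
            {"name": "lab_report_date", "type": "string"},
            {"name": "cases_total", "type": "long"},
            {"name": "deaths_total", "type": ["null", "long"], "default": None},  # nullable, explicitly
        ],
    })

    records = [{"lab_report_date": "2020-03-01", "cases_total": 0, "deaths_total": 0}]

    with open("cases.avro", "wb") as out:
        writer(out, schema, records, codec="deflate")   # schema is embedded in the file

    with open("cases.avro", "rb") as fo:
        for rec in reader(fo):
            print(rec)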


https://www.sqlite.org/appfileformat.html - it is OK to use sqlite that way! =)


The Splitgraph core code on GitHub [0], around which we've built the DDN, is all about managing "data images" which are basically snapshots of PostgreSQL schemata. You can build them with a format similar to Dockerfiles as well as do a "checkout" into a local instance of Splitgraph (which you can connect to with any PG client) -- this enables change tracking and delta compression too.

Behind the scenes, we store them as cstore_fdw [2] files, a columnar storage format that helps with analytical queries.

[0] https://github.com/splitgraph/splitgraph/

[1] https://splitgraph.com/docs/concepts/images

[2] https://github.com/citusdata/cstore_fdw


Seconding parquet and sqlite, but hdf5 and netcdf4 certainly deserve a mention, especially for multidimensional, scientific, and generally large datasets.



Both JSON and XML are self-documenting, and most modern databases support directly importing and exporting them. Though the tools to accomplish this could be better, these formats are far better suited as a "medium for data distribution" than CSV files are.

That said, a simple compressed .sql file of INSERT statements can often go a long way.

The only reason CSV is widely used is because normal people think of data as spreadsheets, and asking them to fire up a database and shred JSON data into it is ridiculous when they just want to whip up a line graph or answer a simple question (e.g. "What was value X on this particular date?").


The thing that attracted me to SplitGraph from the very start is that they are proposing to make the PostgreSQL wire protocol and SQL dialect a general interface to remote data. The interface is not only well known but backed by permissively licensed, open source libraries. Plus there are hundreds of tools that already connect to PostgreSQL.

This idea makes so much sense that it's a little surprising nobody did it before.
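
Concretely, that interface looks something like this from Python, a rough sketch with psycopg2 using the DDN endpoint and the Chicago dataset mentioned elsewhere in the thread (credentials, if Splitgraph requires them, are left out):

    import psycopg2

    # Any driver that speaks the Postgres wire protocol works the same way.
    conn = psycopg2.connect("postgresql://data.splitgraph.com:5432/ddn")

    with conn.cursor() as cur:
        cur.execute("""
            SELECT lab_report_date, cases_total, deaths_total
            FROM "cityofchicago/covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc".covid19_daily_cases_deaths_and_hospitalizations
            LIMIT 5
        """)
        for row in cur.fetchall():
            print(row)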


Runaway queries become a problem when direct SQL is exposed to the public.


CSV is just an example; data vendors can choose e.g. JSON if they want a saner, better-defined format.


Regarding "sane & well-defined":

http://seriot.ch/parsing_json.php


JSON has design flaws of its own: no support for large integers, floating-point infinity, or NaN, for instance.
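
A quick illustration of the NaN/Infinity point (Python's stdlib only emits them as a non-standard extension):

    import json

    # json.dumps will happily write NaN, but the result isn't valid JSON
    # and many parsers will reject it.
    print(json.dumps({"value": float("nan")}))                 # {"value": NaN}

    try:
        json.dumps({"value": float("nan")}, allow_nan=False)   # strict mode
    except ValueError as e:
        print("strict JSON refuses it:", e)

    # Large integers survive a Python round trip, but JavaScript's JSON.parse
    # silently loses precision above 2**53 - 1.
    print(json.loads(json.dumps(2**60)) == 2**60)              # True here, not guaranteed elsewhere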


It is not a design flaw to make a reasonable choice about what data types you support. Especially given the ones captured in JSON are overwhelmingly the most commonly used and necessary types.

In cases where that's not enough, you could roll your own types by putting the values in plain strings, and it would still be strictly more expressive than CSV.


As an interchange format for database data, CSV is pretty terrible.

There's no comprehensive single specification. The closest is an RFC which completely misses major pieces (e.g. handling of binary data, and more).

Using SQLite as the interchange format is better. :)


IIUC, JSON is available too...


That's a good point. Any idea if there's a well-spec'd JSON format around for database data? E.g. something that handles binary data, three-valued logic (e.g. null values), and hopefully referential integrity.


You can generate CSV if you need it. See psql --csv. [1] What's brilliant about this approach is that you can generate any format that's supported by the interface defined by the PostgreSQL SQL dialect and wire protocol. (Obvious caveats about timeouts, network bandwidth, etc. apply.)

[1] https://www.postgresql.org/docs/12/app-psql.html
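
Or roughly the same thing from Python, a sketch with pandas + SQLAlchemy against the DDN endpoint from the thread (credentials, if the DDN requires them, are omitted):

    import pandas as pd
    from sqlalchemy import create_engine

    # The DDN speaks the Postgres protocol, so any Postgres driver/ORM works.
    engine = create_engine("postgresql://data.splitgraph.com:5432/ddn")

    query = """
        SELECT lab_report_date, cases_total, deaths_total
        FROM "cityofchicago/covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc".covid19_daily_cases_deaths_and_hospitalizations
        LIMIT 100
    """
    df = pd.read_sql(query, engine)

    df.to_csv("cases.csv", index=False)   # or to_parquet, to_json, ...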


I see a new job title coming into being: Enterprise Data Librarian.

40,000 datasets, even if many are just different versions, is a ridiculous number to manage, or even know about, on anything less than a full-time basis.

Data driven decisions need data yes, but they also need people to know the data exists. And what it means.

And this is just external curated data - use this as the standard for what each department should be producing internally.

In fact, that's a good idea: a data publishing standard, covering not just the data types/schema but actually supplying the data in a format that others can consume.


This job already exists as either Business Intelligence Business Analyst or Business Intelligence Developer. The professions have existed for decades.


As someone who has tried, and almost succeeded, to get rid of Pachyderm over the last two years, I like what I just read.

Something is not entirely clear to me right now: an image is an immutable snapshot of a dataset at a given point in time - great - but can I query the same dataset at two different points in time using layered querying in SQL? Something like this: SELECT * FROM dataset:version-1, dataset:version-2

Also, are you storing the entire dataset as new, or only the diff between versions (and later reconstructing the full image)?

Now, onto the things that could be improved...

- Git-like semantics (pull, push, checkout, commit) are poorly suited for versioned, immutable datasets. Just (intelligently) abstract fetching and sending datasets by looking at the SQL query (dataset:version-2, above)

- Versions should be at least partially ordered and monotonically increasing. Hashes don't convey the information necessary to decide whether dataset:de4d is an earlier version of dataset:123a or not.

- Tracing a derived dataset's provenance will only work if you can assert that the "code" or transformations applied to the original dataset are deterministic (side-effect free). So either you have your own ETL language that you can execute in a sandbox, and add a myriad of useless stuff for creating and scheduling pipelines (please don't do that!), or you just let it go and don't end up becoming Pachyderm (sounds great!).


What's wrong with Pachyderm? Sounds like you are having a lot of pain with them.

I am not affiliated at all with Pachyderm, I am just curious about this problem-space.


There are a good number of specific reasons, but it really comes down to this: they have too many features.


Is something like dolthub [0] what you have in mind?

[0] https://www.dolthub.com/


No. They embraced the "Git for data" mindset, like Pachyderm and DVC. This design is restricted by Git semantics right where we need to go further.

Think more of Kafka's topic/message objects, with each message being an entire table, and its Streams API.


Maybe a JCR implementation like Apache Jackrabbit Oak would be a good tool for this data. See here for sql semantics: https://jackrabbit.apache.org/oak/docs/query/grammar-sql2.ht...


Great comment! We're pretty much in agreement with all of this.

> can I query the same dataset at two different PIT using layered querying in SQL

Yes. We have a query that does this on our home page (it's the second in the carousel at the top):

    SELECT
        new.domain AS domain,
        new.id AS id,
        new.name AS name
    FROM
        "splitgraph/socrata:20200809".datasets old
    RIGHT OUTER JOIN
        "splitgraph/socrata:20200810".datasets new
    ON
        old.id = new.id  -- join the two image versions on the dataset id
This is querying two versions of the same image.

> are you storing the entire dataset as new or only the diff between versions

Basically we store the diffs between versions, yes. The simplified explanation is that data images are an initial snapshot + chain of diffs ("delta-compressed"). You can read more in the concepts section of the docs. [0]

> Git-like semantics (pull, push, checkout, commit) are poorly suited for versioned, immutable datasets

We pretty much agree with this. Some similar projects take the wrong approach and try to implement 100% of git commands "for data." But it's a square peg trying to fit into a round hole. The use cases are not the same. There are some similarities, but they're limited. Our goal was to step back and think from first principles: What do we like about our coding tools? How can we get the same benefits with data?

> Versions should be at least partially ordered and monotonically increasing. Hashes doesn't convey the information necessary to decide if dataset:de4d is an earlier version of dataset:123a, or not.

Sure. If you want monotonically increasing versions, you can do that by tagging an image with `sgr tag`. (FWIW, image hashes are currently random, but the hashes of objects that comprise them are not -- they're actually content addressable representations of the changeset from the root to that version using an algorithm called LTHash.)

> Tracing a derived dataset provenance will only work if you can assert that the "code" or transformations applied to the original dataset is deterministic (side-effect free).

Yes, definitely. Dataset provenance works best when you use Splitfiles [1] to construct images. It's still possible to run into problems with non-determinism, but for the OLAP use case, we think those situations should be rare.

[0] https://www.splitgraph.com/docs/concepts/images

[1] https://www.splitgraph.com/docs/concepts/splitfiles


Ah! I initially missed the Splitfiles page! Declarative, cacheable... It's a great approach you have here. I think you could go even further and infer some SQL transformations from the schema definition.

Anyway, I'm curious to see where your project will end up. You put much more effort into the design than most "for data" tools!

Cheers!


This is very cool. Relatedly, as a data scientist, I wish companies would expose their APIs through SQL. I've spent a lot of time pulling data into ETL jobs from things like Mixpanel, AdWords, etc., and having a unified interface would make things much simpler.

I'm trying to understand the architecture of Splitgraph. Are all foreign data wrappers controlled directly by you, or can third parties host a database and connect it to Splitgraph in a federation?


Currently we control and set up all the FDWs (well, an orchestration layer does it on the fly as the query comes in and routes the query to the correct schema with foreign tables).

You can also run a Splitgraph engine locally and add your own FDWs to it. We have a lot of scaffolding around FDWs to make their instantiation much simpler, and we even wrote a blog post [0] about adding a custom FDW to Splitgraph.

However, in the future we'll be adding the ability to add your own backend data sources to Splitgraph that it can proxy to (whether as a private dataset on the public Splitgraph instance or as a "data virtualization" layer when you have an in-house Splitgraph deployment).

The cool thing is that this can be a single gateway to all your data silos (Snowflake, third-party SaaS, public datasets) that can handle federated query execution, data discovery and access control (e.g. firewalling queries to sensitive columns even if the backend data source doesn't support this level of granularity).

[0] https://www.splitgraph.com/blog/foreign-data-wrappers


Is there a CPU limit or timeout for queries? I’d be a little concerned that an intentionally slow and inefficient query could pin the CPU at 100% and ruin the performance for other users


We currently limit all queries to 30s of execution and 10000 rows returned (by adding a `LIMIT` clause to queries that don't have it). We also have some mechanisms like query result caching and rate limiting for better QoS. One of our directions is building what is basically a CDN for databases, so it's good to figure these things out as early as possible.


Interesting! I’ll be keeping an eye on this project, I love postgres


> postgresql://data.splitgraph.com:5432/ddn

That’s actually pretty cool, to see a public URL with PostgreSQL protocol signifier like that.

Makes me wonder if any developers or DB Architects ever thought of putting their resume in a DB and putting a public read-only postgresql:// URL on their business card :D


I would recommend you don't do that. Nobody who is reviewing candidates has time to write a client to get your raw information out of a database.

If you build the client, and a web page/app and also document how you did it and also show us the code, that'd be much more useful.


I once applied to a local job ad in 1996. The ad read: “PUT text webmaster resume 123.123.123.123”

So I PUT my resume to the actual IP address and got an interview. Alas, they were a strict Windows shop, and at the time I was a Unix Perl dev.


I'm sorry, what. If you can't query that, why should I work for you? ;)


I wouldn't expect much from going this far, but yes, too many HMs either forget or never realize that they're being interviewed, too.

Saw this more before I started refusing interviews with the HugeCos, but the strangest flavor of this I've run into was at an early startup - I suspect the dude was acting out an "arrogant genius" script, hoping to convince people he was one. (It came off as very weird, not in a good way.)


My point was regarding the following:

- usually the first person to screen or introduce you won't be a developer, and perhaps not even technical, so you've put yourself out of the running early on. I can query it, but many who might be in a position to put you forward might not. If you want to do that, though, then fair enough; that's your choice.

- reflecting on my current situation where I'm busier than normal right now and I want an easy run on non-development tasks ;)


> "If you build the client, and a web page/app and also document how you did it and also show us the code, that'd be much more useful."

So a website resume, like literally every other candidate...

I think you've missed the entire point of the idea... which is to stand out with a different approach


And, with respect, you missed one of my points, regarding the code.

Sure, websites are normal, but if you present me with the source and you wrote it (i.e. it's not just a WordPress site) and it's something worth looking at, then I can already see that you can do something interesting and of value.


Write a client? Just use psql, or any of a dozen other interactive DB clients.


Is this an interview or a consultation?


Very neat indeed. I thought Postgres had a max identifier length of 63 characters so I was surprised to see "cityofchicago/covid19-daily-cases-deaths-and-hospitalizations-naz8-j4nc".covid19_daily_cases_deaths_and_hospitalizations in the FROM part of the statement. Does the max identifier length not apply for some reason here or have Splitgraph done something to increase it?

On a related note, I've long wanted longer identifier lengths in Postgres so we can have more meaningful column names but the powers-that-be have always refused... hopefully one day it'll increase in the default distribution.


Co-founder here. The 63-char limit still applies (we didn't recompile Postgres!), but we have some code in front, embedded in a layer of PgBouncers, that intercepts the query, parses it, and rewrites the long schema name into a shorter dataset ID hash that we then "mount" on the database on the fly using Postgres FDWs before forwarding the query.

We also use this to drop unwanted queries and rewrite clients' introspection queries (e.g. information_schema.tables) to give them a list of featured datasets instead of normal Postgres schema names.


Fascinating and impressive - nice work, and a nice solution until the day comes when the identifier length is increased by default.


Nice! And how do you parse SQL? Do you use Postgres's parser or did you have to implement one yourself?


We use pglast [0]: it's basically a Python wrapper around Postgres's query parsing code.

[0] https://github.com/lelit/pglast
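
For anyone curious what that looks like, a tiny sketch of the library itself (this is just pglast, not Splitgraph's actual rewriting code):

    from pglast import parse_sql, prettify

    stmt = 'SELECT domain, name FROM "splitgraph/socrata:20200810".datasets LIMIT 10'

    # parse_sql returns the Postgres parse tree, which is what you'd walk
    # to find and rewrite long schema-qualified identifiers.
    tree = parse_sql(stmt)
    print(tree)

    # prettify re-renders a statement from its parse tree.
    print(prettify(stmt))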


Looks like you have to modify the source code and rebuild from source to get longer identifiers.

> The system uses no more than NAMEDATALEN-1 bytes of an identifier; longer names can be written in commands, but they will be truncated. By default, NAMEDATALEN is 64 so the maximum identifier length is 63 bytes. If this limit is problematic, it can be raised by changing the NAMEDATALEN constant in src/include/pg_config_manual.h.

https://www.postgresql.org/docs/current/sql-syntax-lexical.h...


https://www.postgresql.org/docs/12/limits.html

> identifier length 63 bytes can be increased by recompiling PostgreSQL

Maybe they compiled their Postgres with a different limit? Just a guess, I haven't tried it.


I think FDWs aren't used more because they're not easy to get into.

The best link/example I found was https://github.com/beargiles/passwd-fdw and it's quite easy to follow the code and understand all the moving parts.

Once you've written a FDW you'll see them everywhere.

In fact the same author wrote https://github.com/beargiles/zipfile-fdw and https://github.com/beargiles/tarfile-fdw

If you (still?) need inspiration and want to see what already exists: https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Simpler, using a 'generic' file-based FDW shipping with pg's sources: https://aaronparecki.com/2015/02/19/8/monitoring-cpu-memory-...

There's a Python wrapper to get your feet wet (or prototype an idea): https://github.com/Segfault-Inc/Multicorn (though I'm not sure how well-maintained it is).
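
To give a flavor of the Multicorn route, a bare-bones sketch (class and column names are made up, and a real wrapper would hit an API or a file in execute()):

    from multicorn import ForeignDataWrapper

    class StaticRowsFdw(ForeignDataWrapper):
        """Toy FDW that serves rows from an in-memory list."""

        def __init__(self, options, columns):
            super().__init__(options, columns)
            self.columns = columns
            self.rows = [
                {"lab_report_date": "2020-03-01", "cases_total": 0},
                {"lab_report_date": "2020-03-02", "cases_total": 0},
            ]

        def execute(self, quals, columns):
            # quals carry pushed-down WHERE clauses you can use to filter early.
            for row in self.rows:
                yield {col: row.get(col) for col in columns}

On the Postgres side you then create the multicorn extension, a server pointing at the class (via the wrapper option), and foreign tables against that server.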

The only annoying part is that you're plugging your code into an interface that might break (and sometimes has) between releases of PG. So, kind of the same fun as maintaining a gcc plugin...

By the way, does anyone have any idea about the licensing terms/issues for PG FDWs and PG extensions in general?


FDWs are a powerful feature. We’ve done a lot of work to make scaffolding them easier, if you use sgr. One of our earlier blog posts includes an example of making an FDW for the HN API and packaging it as a Splitgraph mount handler. [0]

There’s also this great post about using FDWs for parallel query execution, by David Grier at Swarm64. [1] They seem to be doing cool things with FDWs too.

Personally I think proxies are a really powerful abstraction in general. Cloudflare and Twilio are two examples of companies built around proxies.

[0] https://www.splitgraph.com/blog/foreign-data-wrappers

[1] https://swarm64.com/post/scaling-elastic-postgres-cluster/


Thanks for both links!

I'm not sure where you're going with this product, but I like the idea of proxies and your idea of a DDN, and I wish you the best.

I'm trying to 'bind them all' at my job right now, and Postgres is very inspiring with its extension success stories and all the FDW work happening. And once you can bind to C, you can mostly choose your language :-)


OK, I signed up and used your recommended client (DBeaver 7.1.5), but I don't see the schemas shown in your picture.

https://ibb.co/gwLfHVz


Sorry about that! I just tested with the same DBeaver version and was unable to reproduce the bug. If you email support@splitgraph.com with your username, we can check the logs and figure out the problem for you.

In general, some clients will have issues with introspection queries (which they use to show the list of available schemas). And even when introspection works, we can't show you all 40k datasets or it might break your client. So we just show "featured" datasets, and for a more exhaustive list you can go to the website to find a dataset.

But, you can still query any dataset on Splitgraph regardless of what shows in your client's schema list. You can go to https://www.splitgraph.com/explore to find datasets, and from each dataset's repo page, you can click "tables" or "query SQL" to get a sample SQL query to run.


Same issue with DBeaver 7.1.5 on Windows 10. I've sent an email to support@splitgraph.com.

Congrats on the product BTW. Very promising!

I've been looking forward to something like this for years.


It's fixed now, thank you!


I think we fixed this now; it was a bug in how we did our client introspection query rewrites (it wasn't actually DBeaver-specific).

Could you try again when you have the chance?


Works now, thanks!


Thanks for opening up a new way to work with public data and discover it. I have several ideas regarding this. I've used free public APIs, and the worst thing about them is that they're all unreliable: unreliable in their conditions and limits, and they usually don't scale. And you can't blame the API providers, because you don't pay for it. I vote for premium, resource-based access to the data with a free tier: you pay for the level of service you need, or use a tiny, limited free allowance.


SQL is the API for data.


'Data' is quite broad. If you need results in a tabular format, I agree with you: without doubt, SQL is the API.

But for nested data (XML, JSON, etc.) it really isn't the best language. I'm talking here specifically not about querying data that is nested, but about actually getting query results in a nested format. SQL can do it (almost all major databases have XML and JSON support), but it really isn't the easiest thing to use.


Right. As the name implies, it's for structured data. Another way of looking at it: if we make the effort to structure the data, SQL is pretty good.

We should also consider the capabilities of PostgreSQL to query JSON documents.
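
For example, Postgres can hand back nested structures directly. A minimal sketch with psycopg2 (the connection string and the VALUES literals are made up; any Postgres, the DDN included, works the same way):

    import psycopg2

    conn = psycopg2.connect("postgresql://localhost:5432/postgres")  # hypothetical local DB

    with conn.cursor() as cur:
        # row_to_json / json_agg turn flat rows into a nested JSON document.
        cur.execute("""
            SELECT json_agg(row_to_json(t))
            FROM (VALUES ('2020-03-01'::date, 0),
                         ('2020-03-02', 0)) AS t(report_date, cases_total)
        """)
        nested = cur.fetchone()[0]
        print(nested)   # a Python list of dicts, e.g. [{'report_date': '2020-03-01', ...}]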


I'm wondering if anyone has (good or bad) experience with pg's composite types? They're supposed to allow getting hierarchical data as query results, right? This and arrays should cover most 'simple' use cases?


I can find a privacy policy. It's not awful.

I can't find pricing.


Hey, co-founder here. There's no pricing yet as we've just launched. Our plan is basically a public/private instance model, kind of like GitHub. The public instance will eventually have quotas (definitely for storage, maybe for server-side query execution). But the main product will be private deployments of Splitgraph, for companies that want a catalog for their internal data. Eventually you'll be able to use the web UI to connect arbitrary upstream sources (Snowflake, BigQuery, SaaS, etc.) to the catalog. You'll be able to manage services on top of each source (e.g. caching, snapshotting, access control, query rewriting, firewalling) and share data with colleagues from the web UI. Basically we can provide a sort of aggregation layer on top of your disparate data sources. We think combining the proxy with a catalog is a really powerful combination.

We're actually looking for early customers to beta test the private deployment, more on that here: https://www.splitgraph.com/about/company/private-cloud-beta

p.s. Glad you like our privacy policy at least. :) We hate third party trackers.


Does the Splitgraph Data Delivery Network allow queries that ORDER BY an unsorted column? This seems like a vector for a Denial of Service attack.


mm interesting... we have this open Postgres instance (read only) for covid19 research: https://covid19.eng.ox.ac.uk/

We have it running on our own (cheap) server, but we fear we may get overwhelmed by too much traffic if the project becomes very successful. Would this be a solution for us? Is it free?


Very cool! This is a great use case for Splitgraph. We'd be happy to help you deliver that data. The easiest method would probably be for us to proxy queries to your Postgres instance (you can't do this yourself from the website yet, but it's a planned feature, and we can work with you to set it up), and then you could benefit from our connection pooling and caching. Another option would be for you to push the data to Splitgraph as an image (to keep up to date, you can set up a local Splitgraph instance as a PG replication client and periodically `sgr commit` a new image [0]). If you'd like to chat details, feel free to email support@splitgraph.com or join the Discord (https://discord.gg/eFEFRKm).

In terms of price, we'll eventually add quotas (storage + server side query execution) on the public tier. But the main monetization will be private deployments. In an ideal world, the private deployments will be able to subsidize the costs of some of the open data on the platform. Certainly we'd like to be able to support projects like this one.

[0] https://github.com/splitgraph/splitgraph/tree/master/example...


Excellent, thanks a lot for this! I will bring your suggestions to the rest of the team and come back to you via Discord.


Postgres foreign data wrappers are a weird choice of engine. Most queries to this service will be scans, in which case a column-oriented, vectorized, massively parallel engine like Presto would be 1000 times faster or so. Postgres' underlying engine is optimized for scenarios where you read a small number of rows using an index.


Hey George, thanks for the comment and for the good points!

We want to initially focus on the use case with lots of diverse backend datasets and ad-hoc APIs (maybe with a no-code like solution on top of a generic FDW) where performance won't be the bottleneck. If necessary, the backend data sources can perform aggregation and fast query execution. For example, you can also put Splitgraph in front of Presto (through JDBC). The value we want to provide in these cases is:

* granular access control (e.g. masking for PII columns, auditing etc)

* firewalling/query rewriting/rate limiting (for publicly accessible endpoints that proxy to internal databases that vendors want to publish more easily than through cronjobs with data dumps)

* cataloguing (so you get to discover datasets/data silos, get their metadata and query it over multiple interfaces in the same product)

We also like keeping the PG wire format in any case, as there are so many BI tools and clients that use it that it makes sense to not break that abstraction. We started with PG FDWs just because of the simplicity and the availability of FDWs, but we might swap the actual Postgres FDW layer for some faster execution in the future, if it's needed.

Behind the scenes, for Splitgraph images, we use cstore_fdw as an intermediate storage format (it's a columnar store similar to ORC with support for all PG types like PostGIS geodata). There's a potential in using this as a format for a local query/table cache on Splitgraph nodes that we intend to deploy around the world for low-latency read-only query execution.


How do you handle expensive queries? Several JOINs over multiple large data sources can easily take minutes, if not hours.


Looks lovely. I can see real use for this in my work; Postgres and the availability of the PostGIS extension are really useful for mapping data and spatially related queries.


Can other database systems, like SQL Server or Oracle, be used to do this? Many of us are forced to use systems other than Postgres.


I'm not sure if you're asking about (a) querying Oracle from a Postgres client through Splitgraph, or (b) querying Splitgraph from Oracle.

We want to support both these use cases. For (a), Oracle would be an "upstream" to Splitgraph. We'll need to write a plugin that implements the FDW and does introspection. Eventually, we want you to be able to configure upstreams from the Web UI.

For (b), you can probably find a way to query Splitgraph from Oracle, e.g. using Oracle's "gateway" feature [0]. What's nice about Splitgraph is that it's compatible with any SQL client that can speak the Postgres protocol (or ODBC). So if Oracle can connect to a Postgres database, it can connect to Splitgraph.

We have instructions for how to query Splitgraph from within ClickHouse at [1]. We're actually giving a presentation about this to a ClickHouse meetup on Sep 10, feel free to join. [2]

[0] https://docs.oracle.com/cd/E18283_01/owb.112/e10582/gateways...

[1] https://www.splitgraph.com/connect

[2] https://www.meetup.com/San-Francisco-Bay-Area-ClickHouse-Mee...


SF ClickHouse meetup organizer here. Thanks for the shout-out for the SF ClickHouse Meetup. We're looking forward to hearing about SplitGraph on September 10th.


Great project! Congrats on the launch.



