Streaming data in Postgres to 1M clients with GraphQL (github.com/hasura)
192 points by tango12 on Oct 16, 2022 | 73 comments



Streaming data to clients at the edge (apps or services) is a hard problem. We built an approach that keeps Postgres at the center and allows clients to consume a stream of data that is already in Postgres without any additional moving parts.

This works with read-replica-style scaling or with Postgres flavours that support scaling out easily (e.g. Cosmos Postgres; Yugabyte & Cockroach coming soon).
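
For a flavor of the API, a streaming subscription looks roughly like this (table and field names here are hypothetical; the `_stream` root field with `batch_size` and `cursor` arguments is the shape of Hasura's streaming subscriptions):

    subscription {
      messages_stream(
        batch_size: 10,
        cursor: { initial_value: { created_at: "2022-10-01" }, ordering: ASC }
      ) {
        id
        body
        created_at
      }
    }

Each client holds a single WebSocket connection and receives batches of new rows as they are committed to Postgres.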


I think there’s a huge unacknowledged gap in the database industry for good streaming products.

If we are building a report or dashboard that we pull up a few times a day, then a pull-based model where we query the database on page load is fine.

For almost anything else, such as an app, a microservice, an alerting system, a web page, or a dashboard, we want to be able to update it in near real time for the user experience. Receiving a stream of query results is by far the easiest way to do this.

Polling is obviously a poor interim solution.

I think streaming will be a huge story in data over the next decade. The products are coming through now which is a start.


Materialize gives a vaguely PostgreSQL-ish userland on top of this idea. Obviously Postgres has LISTEN/NOTIFY, which is fine for most places you need a queue and push updates, but the key is incremental view updates, so you get instant results on complex queries.
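
For reference, the LISTEN/NOTIFY primitive amounts to this (channel name and payload are illustrative):

    -- session A: subscribe to a channel
    LISTEN order_changes;

    -- session B: publish, e.g. from an AFTER trigger
    SELECT pg_notify('order_changes', json_build_object('id', 42)::text);

It pushes small payloads to connected sessions, but it won't incrementally maintain a complex query result for you; that's the gap Materialize targets.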


Unfortunately, Materialize has a pretty restrictive license.

https://github.com/MaterializeInc/materialize/blob/main/LICE...

I'm generally pretty happy to pay for open source software, but licensing like this is just too risky. I need to be able to experiment with something at scale, in production, before I start paying someone.


Free for 1 node, paid after. Code reverts to a permissive license after 4 years. I think this is fair: you can bootstrap for free and only pay if it works out.


It's more than fair, of course. It's just a barrier.


There's https://www.risingwave.dev which is similar to materialize.


I wouldn't quite say that this need is unacknowledged: most DBs have some kind of streaming interface for the actual data contents (not for queries, though), which Debezium, for instance, leverages to expose change event streams in one uniform event format (disclaimer: I work on Debezium). Going forward, I expect these capabilities to be substantially built out, including streaming query support (incrementally updated materialized views), not locked behind separate licensing tiers or bolted on as a more or less complex afterthought, but a primary part of database offerings, just as SELECT is today.
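
As a sketch, registering a Postgres connector with Kafka Connect looks something like this (host/database names are placeholders; property names follow recent Debezium releases, where older versions use database.server.name instead of topic.prefix):

    {
      "name": "inventory-connector",
      "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "********",
        "database.dbname": "inventory",
        "topic.prefix": "inventory"
      }
    }

Each committed change then appears as an event on a per-table Kafka topic.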


Streaming has been the story for the past couple of decades already. But once you accept the value of an event-first architecture, the benefits of traditional RDBMSes become much weaker.


It's been discussed for a long time, but the ecosystem is still spotty. Many databases have some kind of "live query" feature, for instance, but with limitations or not intended for production use.

A lot of work has taken place in the Kafka, Flink, DataFlow ecosystem but that still leaves a lot of work for the developer over a simple subscribe to query results.

I do think a lot of work has been done, but it all needs to move up a few levels of abstraction.


> A lot of work has taken place in the Kafka, Flink, DataFlow ecosystem but that still leaves a lot of work for the developer over a simple subscribe to query results.

Personally I find it much easier to write the code explicitly than try to understand what a query planner is doing, especially if performance is relevant. (That's not to say there's not plenty of room for improvement in the streaming world - but I'd rather have a helper library that I can use on top of the low-level API, than have to go through a parser and planner for every query even when I know exactly what I want to do) But I seem to be an outlier in this regard.


Yep. Exactly this.

Our focus (Hasura) is on the last mile, so that innovations on the data side (e.g. Materialize, ksqlDB, Timescale continuous aggregates) become "just obvious" to build applications and services against.


Maybe if you're a really big company with petabytes of data, but I wouldn't say it's worth the operational complexity for 90% of tech companies out there. Seriously, design your schema so you can have tables optimized for aggregate functions. You can use materialized views or populate them via triggers. Either is still going to be an order of magnitude less work than dealing with all the edge cases that are going to come up when you try to build a distributed data pipeline. That's death by a thousand cuts for any small-to-medium startup that doesn't already have a team of engineers experienced in making such systems work.
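
For example, something like this (hypothetical orders/order_counts tables):

    -- keep a per-product order count current on every insert
    CREATE TABLE order_counts (
      product_id int PRIMARY KEY,
      n bigint NOT NULL DEFAULT 0
    );

    CREATE FUNCTION bump_order_count() RETURNS trigger AS $$
    BEGIN
      INSERT INTO order_counts AS oc (product_id, n) VALUES (NEW.product_id, 1)
      ON CONFLICT (product_id) DO UPDATE SET n = oc.n + 1;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_count_trg
    AFTER INSERT ON orders
    FOR EACH ROW EXECUTE FUNCTION bump_order_count();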

Plus you can stream db changes from postgres to kafka for those edge cases where you really need it

TLDR: if you're in a startup and thinking of building a distributed system... DON'T. Stick with a monolith and spin out services as NEEDED.


> Maybe if you're a really big company with petabytes of data, but I wouldn't say it's worth the operational complexity for 90% of tech companies out there. Seriously, design your schema so you can have tables optimized for aggregate functions. You can use materialized views or populate them via triggers. Either is still going to be an order of magnitude less work than dealing with all the edge cases that are going to come up when you try to build a distributed data pipeline. That's death by a thousand cuts for any small-to-medium startup that doesn't already have a team of engineers experienced in making such systems work.

Funny, my experience is the exact opposite. Materialized views and triggers are death by a thousand cuts with all their edge cases. Whereas if you just use Kafka for everything from day 1 then everything works consistently and you have no problems.


So who's managing the Kafka cluster? Are you running scheduled backups of it? What happens if the system is down while ingesting data? I can only imagine you have someone that's already very experienced in Kafka if you find that easier to manage than an RDBMS, which already has established hosted services available.


Running Kafka is fiddly but a lot easier than running a true master-master RDBMS setup (which none of my employers have ever managed; at best some have had a read replica and an untested shell script from five years ago that supposedly promotes that read replica to master if the master goes down). Backups are the same problem either way (and, while it's not a full replacement for actual backups, the fact that Kafka is clustered means you're protected from some failure scenarios out of the box in a way that you aren't with something like Postgres). And there are plenty of established hosted kafka services if that's your preferred approach.


Would you not be able to solve this by having a messaging queue and getting the database to publish its updates?

Curious what products you’ve seen as well.


A message queue is a database too.

So the question is: Why have two copies of your data, two products to learn and monitor and operate, write boilerplate to move data between the DBs, etc.?

A "message queue" comes down to being another index on your table/set of tables ordered by a post-commit sequence number. These are things all SQL DBs have already, it just lacks a bit of exposing/packaging to be as convenient to use as a messaging queue.


That's why I was watching RethinkDB, but that didn't pan out.


While I love the ingenuity of the developers at Hasura (because I've personally been through these scaling challenges), I always get a gag reaction with GraphQL. I've honestly tried hard to digest it, and I can tell you that at large scale, where a single DB won't cut it, you would either need to develop a large federation layer like [Netflix](https://netflixtechblog.com/how-netflix-scales-its-api-with-...), or just rip it out. Streaming might alleviate the problem a little, but the real problem still lurks under the hood. The fact that the front-end community wants to fit everything under GraphQL bothers me, because every backend developer knows that a single tool/technology is usually not the best tool for solving all the problems in your life. Remember the golden words: THERE IS NO SILVER BULLET!


If database capacity is your main concern and not independent schema deployability, then federation is overkill. You can just connect to whichever databases contain your data in your resolvers within a single service.

You have to be at pretty massive scale before federation becomes necessary and by then (if ever) your frontend teams have experienced benefits that are pretty much miraculous. The reason frontend wants to fit as much as possible into it is because it's vastly better than what came before it unless you have a 95th percentile org that is really doing an outstanding job managing the API via other means.


I think graphql with a single db is an odd thing to gain popularity - as the raison d'être for graphql was to federate and proxy different, heterogeneous data sources (databases, json/rest services, soap, other graphql services...).

Speaking of Netflix - I think they had an alternative API federation service that used some clever tricks with JSON string vs number keys to allow for alternating HTTP PUT/GET, and through that leverage HTTP-level caching. But I can't find the link...


GraphQL's raison d'être was not to federate and proxy different, heterogeneous data sources. It was initially developed as a better API for the Facebook monolith. I asked one of GraphQL's coauthors directly: https://twitter.com/ngrilly/status/1317415232717860866?s=46&....


Interesting - I didn't even know that there was a monolith at Facebook [ed: beyond there being reports that they use a large monorepo for code]. I stand corrected, thank you.

This doesn't directly address if graphql at Facebook federated different data sources (eg graph and sql databases) - but I suppose not?


I don't have any "insider" information regarding the data sources used by the Facebook monolith (written in PHP) when GraphQL was developed, but I would expect it to talk to multiple/different datastores, and not just MySQL. Then yes, it could be argued that it has been used to federate multiple data stores, even though that was already the case before GraphQL :)


The main thing I see it being used for is field selection and model nesting. From an end user perspective, it's pretty nice for this I think. Certainly nice that if I know it's a graphql endpoint I've already got a fairly solid idea about how to query it.


Are you talking about Falcor? https://github.com/Netflix/falcor


I believe so; I think there was a blog post that went into the details, but I can't find it. I did find this comment, though:

https://news.ycombinator.com/item?id=10076235


FD: I work at hasura and work with users/customers.

Fair point about the DB scaling but not sure if everyone is going to run into this issue. Also, lots of solutions are emerging for this specific problem (with different trade-offs of course) like distributed databases (crunchy, YugaByte, Spanner, etc.). Most folks I work with get by with a reasonably sized DB and some read replicas.

Not a GraphQL problem though IMO.


In my experience, if you have more than a small complexity to your data, and you're a medium scale business (or are going in that direction), you're going to end up with this issue. And then it becomes really painful to get out of the situation as migrating data and splitting things up is a real PITA.

While you are correct that not everyone is going to end up with this issue, those thinking of getting to a medium-sized business should be working to avoid it, which unfortunately means solutions such as Hasura lose value. It would be good to see more ability to collect data from multiple sources (please reply and correct me if you already do this, I'm not super familiar with the service).


Hasura supports multiple data-sources and creating relationships across multiple data sources.

Sources can be databases (the Postgres family, SQL Server, and BigQuery currently, with more as we add support for new databases), REST APIs, or GraphQL APIs.

This composition results in a final GraphQL API that federates across the different sources.

From our docs: https://hasura.io/docs/latest/data-federation/data-federatio...


With Hasura you can create a federation of databases and expose them as GraphQL or REST endpoints. You can also wrap your existing REST/GraphQL services. It's almost like an API gateway: a place where you unify your services, manage access / row-level permissions, etc.


I also have issues with GraphQL at times, but mostly because I find the syntax unintuitive.

What about the concept doesn't work? It's just a syntax for queries, I'm confused why it wouldn't scale.


I've been thinking about this problem, and the more I think about it (I've tried many different alternatives), the more I believe it's best to send plain SQL for queries.

The problem there is: who has the right to run which queries?

So the real problem is authorization.


It's not just a syntax for queries: GraphQL is essentially a one-stop shop for entity resolution.


The front-end (read: Node) community is fairly young. They have yet to learn that complexity kills. The hard way.

As for GraphQL - it's great for clients. As a backend engineer, you still have to do the work and a LOT of it.

This is just like microservices. No due diligence on whether the added complexity and destroyed productivity is worth it. "Everyone else is doing it".


> The fact that front-end community wants to fit everything under GraphQL

Maybe you’re too out of touch. This isn’t true, at least not anymore. GraphQL is losing appeal.


It was never the solution to begin with, IMHO. It's lacking a permissions or let's say authz framework. It's only half the solution. And it's too complicated and there's no real standard, as in real world standard. Everyone has their own little soup cooking, because the manifest is incomplete and always evolving.


I've been using Hasura in production for small commercial projects deployed on AWS and I've been positively impressed by the stability and the speed up - it makes spinning up a graphQL backend with row-level security straightforward.


Indeed, and it consumes very little memory as well, even compared to any “native” Node.js GraphQL servers.


Just curious: my understanding is Hasura kind of discourages using RLS in lieu of their own access control layer[1]. Did you consider pros/cons of either approach?

[1] https://hasura.io/docs/latest/auth/authorization/basics/


In short, they aren't really comparable. The thing is, you use postgres RLS for DB users (or give access to an app). But hasura ACL is for app users.


People do use RLS for app user-level access control. It has been advocated by PostgREST for years [1] and also widely practiced by the PostGraphile community [2]. Hasura distinguishes itself by not actively advocating db-level access control. From what I recall, the first versions of Hasura were created at a time when Postgres' RLS was still very slow, but it's not clear to me why they still push their own solution. (Not saying there can't be a good reason for it, just not clear to me what it is, and I'd rather default to using functionality built into the database.)

[1] https://postgrest.org/en/stable/auth.html#roles-for-each-web... [2] https://www.graphile.org/postgraphile/security/
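
For a flavor of the PostgREST-style pattern, it boils down to a policy like this (hypothetical chat table; request.jwt.claims is the setting recent PostgREST versions populate from the verified JWT):

    ALTER TABLE chat ENABLE ROW LEVEL SECURITY;

    -- each web user sees only their own rows
    CREATE POLICY chat_owner ON chat
      USING (user_id::text =
             current_setting('request.jwt.claims', true)::jsonb ->> 'sub');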


But for column level permissions views are still needed, right? This isn't the case with hasura. (Of course hasura's approach has own limitations).

> Anyone accessing the generated API endpoint for the chat table will see exactly the rows they should, without our needing custom imperative server-side coding.

That looks like a bad joke. It is imperative, but now in the database. I'm not sure that it is better. Hasura's approach is declarative.

From my point of view, Postgres' security model is still not suitable for users. It is more complex, imperative, and therefore more error-prone. Hasura's approach is still not perfect, but the combination of Postgres and Hasura functionality makes a huge difference.


My company is using this in production to stream messages/online presence to our Unity game clients and it works very well.


Which is it? Sounds cool.


FD: I work at Hasura.

Seeing some feedback on GraphQL - Hasura has support for converting templated GraphQL into RESTish endpoints (with OpenAPI Spec docs if needed). We are planning to do the same for this streaming API as well - does anyone have good examples of existing REST/RESTish endpoints that do something similar?


We've been using Hasura at work, but we've stopped using it for everything other than subscriptions, in favour of hand-written REST APIs. The problem for us wasn't really GraphQL itself, but the fact that the client app was determining the query. If the client could request a "named query" that was then determined by the backend (perhaps via a webhook?), we'd have been able to use Hasura more.


Hasura also has controls for allow-listing operations, which addresses that composed-on-the-frontend problem (https://hasura.io/docs/latest/security/allow-list/).

There are also a few NPM packages for auto-generating that allow list from your project (https://www.npmjs.com/search?q=hasura%20allow%20list -- the one I've used before was from `tallerdevs`).


I've been a big Hasura user for a while. Give their RESTified endpoints a go; it solves this issue for you and still gives you all the access control goodness and subscriptions under one roof.


I would have hoped to see some benchmark latency numbers. What does it mean to easily handle 1M clients?



It shows only connection count and CPU. No other metrics, unfortunately.


We have written a post[1] on building a real-time chat app with Streaming Subscriptions on Postgres. It gives a quick overview of the architecture used and how you can leverage the API on the client side with AuthZ. There’s a live demo that you all can try out.[2]

[1] https://hasura.io/blog/building-real-time-chat-apps-with-gra... [2] https://eclectic-dragon-25a38c.netlify.app/

Would love to see more use cases coming out of this :)


Has anyone come across neat tools for load-testing streaming APIs?

We used https://github.com/hasura/graphql-bench and a set of scripts to monitor runtime characteristics of Hasura and Postgres, and reconciliation to make sure data was received as expected and in-order.

But would love to see if there's other tools that folks have come across!


How different is this from Supabase Realtime?


Disclaimer - I work on Hasura, so I won't comment too much on which is better.

They're all similar flavors of producing realtime results, each taking a similar, but different, approach.

My understanding (please feel free to correct me if I'm wrong):

- Supabase Realtime uses WAL.

- Hasura Streaming Subscriptions uses a query which will be append-only (could be a sort-by or also WAL).

- Hasura Live Queries uses interval polling, refetching, and multiplexing.

- Supabase uses Postgres RLS for authorization, while Hasura uses an internal RLS system which composes queries (which allows for features like the multiplexing above).

- All 3 use websockets for their client communication.

Supabase Realtime

https://github.com/supabase/realtime#introduction

https://supabase.com/docs/guides/realtime

Hasura Streaming Subscriptions / Live Queries

https://github.com/hasura/graphql-engine/blob/master/archite...

https://github.com/hasura/graphql-engine/blob/master/archite...


Thanks for this. I was more interested with the differences in implementation rather than which one is better.


Haven't looked at the internals or the API, but I think for starters, Hasura is designed to work with any Postgres, including databases that have existing data, and especially modern Postgres offerings (Crunchy, AlloyDB, Neon, Aurora, Cosmos Postgres, Timescale, Yugabyte).


I believe they had a POC for this already a few years ago, but great to see the examples expanded, and seems like it’s more “prod-ready” now.


I work at Hasura.

We have 2 interfaces for real-time subscriptions: live queries and streaming. The former has been around for a few years now. You can read more here: https://hasura.io/docs/latest/subscriptions/postgres/index/#...


I'm a (very satisfied) hasura user and trying to figure out which realtime interface to pick for a chat component. Can you explain the difference between the two and how/when to pick the right one?


You would benefit from using a streaming subscription. With a live query, you have to manually maintain a "cursor" (i.e. the last message you fetched) and update your GraphQL subscription accordingly every time a new message is received or sent. If you don't, you'll be refetching the same messages repeatedly, and the refetches will just grow as more messages are added.

With a streaming subscription this is all taken care of; you simply do something like this:

1) Load the last 50 messages (or whatever number of existing messages suits you)

2) Run your streaming subscription using the most recent of the messages from above as your cursor

3) When you send a message, do nothing and let the subscription handle getting the message you just sent
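
Roughly, step 2 looks like this (field names are hypothetical):

    subscription ChatStream($since: timestamptz!) {
      messages_stream(
        batch_size: 50,
        cursor: { initial_value: { created_at: $since }, ordering: ASC }
      ) {
        id
        sender
        body
        created_at
      }
    }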


It sounds like the stream must be an append-only table. This is awkward - I would expect streaming updated results for a query. If I want clients to refetch changed records in real time, do I still need to build that on top of this stream primitive? Like, I stream an audit-log-style table, and then refetch any IDs mentioned in the stream separately?


I work at Hasura. You don't always need an append-only table. You may have to add a monotonic ID to your table that increments on DMLs (and use it as the cursor for streaming); this can be done using a post-DML trigger in Postgres. Streaming also works over tables with relationships, so you can get related data in the same stream.
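
A sketch of that trigger (hypothetical items table; a BEFORE trigger that stamps the row from a sequence is one way to do it):

    ALTER TABLE items ADD COLUMN change_seq bigint;
    CREATE SEQUENCE items_change_seq;

    CREATE FUNCTION bump_change_seq() RETURNS trigger AS $$
    BEGIN
      NEW.change_seq := nextval('items_change_seq');
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER items_change_trg
    BEFORE INSERT OR UPDATE ON items
    FOR EACH ROW EXECUTE FUNCTION bump_change_seq();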

Also, there is another interface for real-time subscriptions called live queries which might be more appropriate depending on the use-case: https://hasura.io/docs/latest/subscriptions/postgres/index/#... .


This is my favorite kind of schema -- do not UPDATE Person.Name, instead INSERT into PersonNameChange.

With the right indexes and lateral joins/cross applies you can make the aggregate you need when you query instead of when you write.
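
For instance, assuming hypothetical person(id) and append-only person_name_change(person_id, name, changed_at) tables:

    -- latest name per person, computed at query time
    SELECT p.id, latest.name
    FROM person p
    CROSS JOIN LATERAL (
      SELECT c.name
      FROM person_name_change c
      WHERE c.person_id = p.id
      ORDER BY c.changed_at DESC
      LIMIT 1
    ) latest;

An index on person_name_change (person_id, changed_at DESC) keeps the lookup cheap.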


I think you’re missing the key part which is that this is delivered over a single GraphQL subscription.

So as new data is added to the table, hasura automatically pushes it to the client via the Websocket subscription.

No refetch needed.


This looks really cool and solves my major gripe with something like Hasura, which is the duplication of data.


Can you go into more detail about what you mean by "duplication of data"? I worked on a team whose product uses Hasura; my understanding is that data still comes directly from Postgres, even if Hasura is sitting in the middle. Where is the data being duplicated?


They're saying products like Hasura suffer from this, and Hasura doesn't.


Nope, they're saying that products like Hasura resolve the issue of duplicated data. Or to put it another way: you don't need multiple stores like Kafka with redundant data storage. Just use Hasura.


How does it do on the Pi4 4GB, the universal standard benchmark system?


I sort of hate GraphQL. Multi-mutation syntax sucks, no select-all.. do not like.



