Show HN: PostgresML, now with analytics and project management (postgresml.org)
444 points by levkk on May 2, 2022 | 70 comments
We've been hard at work for a few weeks and thought it was time for another update.

In case you missed our first post, PostgresML is an end-to-end machine learning solution, running alongside your favorite database.

This time we have more of a suite offering: project management, visibility into the datasets, and insight into the deployment pipeline's decision making.

Let us know what you think!

Demo link is on the page, and also here: https://demo.postgresml.org




Seems like a great idea. When you look at many ML frameworks, half the code and learning overhead is data-schlepping code and table-like structures that "reinvent" the schema that already exists inside a database. Not to mention, there can be security concerns from dumping large amounts of data out of the primary store (how are you going to GDPR-delete that stuff later on?). So why not use it natively where the data already is?

For anything substantive it seems like a bad idea to run this on your primary store since the last thing you want to do is eat up precious CPU and RAM needed by your OLTP database. But in a data warehouse or similar replicated setup, it seems like a really neat idea.


Yeah — seems like all you need is Snowflake-esque separation of storage and compute and Bob's your uncle.


Looks like a good fit to run on Hydras.io (YC W22), the Postgres data warehouse that separates the PG query layer from compute/storage. Disclaimer: I'm the co-founder.


Nice! Feel free to file any issues on GitHub if something doesn't work. I'd love to understand more about the internals of the Hydra architecture.


It looks like this is running Python to do the actual ML?

There’s nothing stopping you from reading database table structures directly into memory in Python or R now. You don’t need an intermediate data store.

I agree that running training on production instances would be a bad idea. First, you need to denormalise data for ML, and secondly you typically don’t want your training data to be constantly changing.


Outside of academia, you need to be able to constantly adapt to changing training data. That is a constant pitfall for many projects that try to transition from offline to online.


Outside academia you need to do a lot of data cleaning and feature engineering, and if you’re constantly changing the model as well as the data you’ll never be able to attribute changes to either.

Chances are a DBA wouldn’t consider letting you do data engineering in a live production database anyway, so this really is all academic.


Haven't played with it yet, but this tooling is exactly what I'm looking for in my SQL-first, Python-second "data lake" workflows. It doesn't exactly matter to me if data in a table gets rewritten in place, since it's built on a model meant to be destroyed at any time from an original source file - generally data from FOIA or gov sources that's write-once.

In a way, having to pull things out of the DB and into Python is something that requires change attribution on its own, since it's a conversion between abstractions, which I tend to think of as a lossy process (even if it isn't). This sort of tooling keeps the abstractions localized, so it's much easier to maintain a mental model of model changes.


> Outside academia you need to do a lot of data cleaning and feature engineering, and if you’re constantly changing the model as well as the data you’ll never be able to attribute changes to either.

I get the concern but sometimes I really just do want a black box regressor or classifier. Model performance monitoring is important, but I don't care about attribution.

> Chances are a DBA wouldn’t consider letting you do data engineering in a live production database anyway, so this really is all academic.

Maybe it isn't data engineering, but I'm curious what you'd call using Google's BigQuery ML? "BigQuery ML enables users to create and execute machine learning models in BigQuery by using standard SQL queries."

I haven't used it in production, but I'd use it in a heartbeat if I was on BigQuery.
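
For anyone who hasn't seen it, a BigQuery ML workflow looks roughly like this (dataset, table, and column names invented for illustration):

  -- Train a model with standard SQL; BigQuery handles the ML plumbing
  CREATE OR REPLACE MODEL mydataset.churn_model
  OPTIONS (model_type = 'logistic_reg') AS
  SELECT churned AS label, tenure_months, monthly_spend
  FROM mydataset.customers;

  -- Batch predictions, also in SQL
  SELECT * FROM ML.PREDICT(MODEL mydataset.churn_model,
                           TABLE mydataset.customers);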


This is really cool; running ML workloads on top of SQL is a very practical way of doing ML for a lot of businesses. Many companies don't have fancy ML workloads like you see at OpenAI; they just have a SQL database with some data that could greatly help their business with some simple ML models trained on it. This looks like a nice way to do it. A slightly different approach that I've been working on involves hooking data warehouses up to Pachyderm [0] so you can do offline training on it. Not as good for online stuff as this, but for longer-running batch-style jobs it works really well.

[0] http://github.com/pachyderm/pachyderm


Can this be used to deploy an "active learning" model that learns from fresh data and model auto-updates?


That's exactly the target use case. Models make online predictions as part of Postgres queries, and can be periodically retrained on a cadence that makes sense for the particular data set. In my experience, the real value of retraining at a fixed cadence is that you learn when your data set changes, and you have fewer changes to work through when some data bug/anomaly is introduced into the ecosystem. Models that aren't routinely retrained tend to die in a catastrophic manner when business logic changes and their ingestion pipeline hasn't been updated since they were originally created.


Well, you would also need new labels to retrain.


Yep! Part of the power of being inside the OLTP is that you can just create a VIEW of your training data, which could be anything from customer purchases to search results, and that VIEW can be re-scanned every time you do a training run to pick up the latest data.
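
A minimal sketch of that pattern, using the pgml.train(project, objective, relation, label) call from the README (view and column names invented):

  -- Training data as a live VIEW over the OLTP tables
  CREATE VIEW purchase_training AS
  SELECT user_id, product_id, price, bought_again
  FROM purchases
  WHERE created_at > now() - interval '90 days';

  -- Each training run re-scans the VIEW, picking up fresh rows and labels
  SELECT * FROM pgml.train('Buy It Again', 'classification',
                           'purchase_training', 'bought_again');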


Do you plan on adding support for managed PostgreSQL services like RDS in the future?


We can’t control the extensions RDS allows to be installed, and they are historically conservative. Lev and I do have some fairly extensive experience with replication patterns to Postgres instances running in EC2. Foreign data wrappers are also an option and, depending on workload, may be a good horizontal scaling strategy in addition.
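
To sketch the replication route (host and table names invented), plain logical replication can feed a separate Postgres instance that does have the extension installed:

  -- On the RDS primary: publish the tables the models need
  CREATE PUBLICATION ml_tables FOR TABLE purchases, users;

  -- On the EC2 instance running pgml: subscribe to the primary
  CREATE SUBSCRIPTION ml_feed
      CONNECTION 'host=primary.example.com dbname=app user=replicator'
      PUBLICATION ml_tables;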


I just gave it a try, installing on Crunchy Bridge [1] (disclaimer: I work at Crunchy Data). I did this as a standard user, without any special access. I got quite close, but it looks like I'm limited by Python version: we give you plpython3u with Python 3.6. Is there any chance of supporting an earlier 3.x version? If so, I'm pretty sure we could give a guide on how to self-install on top of us.

[1] https://www.crunchydata.com/products/crunchy-bridge/


Awesome. I'll see what it'll take to get the extension running w/ Python 3.6. It's good to know what people's ecosystem dependencies look like.


Just in case you were not aware, Python 3.6 reached end of life ~4 months ago: https://endoflife.date/python


Will that also apply to query plans? I'd love to see ML models come up with better query plans, if possible.


Amazon Aurora already has ML extensions for connecting to AWS SageMaker and AWS Comprehend. I doubt they'd add another, especially one they didn't write and isn't integrated with their existing lineup. https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...


How do you deal with the train/validation/test dataset split? How do you measure the degradation of the model? Is there any way to select the metric you target (accuracy, F1 score, or any other)?


The data split technique is one of the optional parameters for the call to ‘train’. Model degradation is a really interesting topic that is hopefully made less difficult when retraining is trivialized, but we also want to add deeper analytics into individual model predictions, as well as better model explanations with tools like SHAP. We haven’t exposed custom performance metrics in the API yet, but we’re computing a few right now and can add more. The next thing we build may be a configuration wizard to help make these decisions easy, based on some guided data analysis.
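
For illustration, a call with a data split might look like this (the parameter names here are hypothetical; check the docs for the actual signature):

  SELECT * FROM pgml.train(
      'Buy It Again', 'classification', 'purchase_training', 'bought_again',
      test_size     => 0.25,     -- hold out 25% of rows for evaluation
      test_sampling => 'random'  -- vs. e.g. 'last' for time-ordered data
  );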


When I was talking about metrics I meant metrics of the model (accuracy, precision, recall, mean squared error, etc.), not system performance.


We currently calculate those as applicable to classification and regression, but they are only displayed on a model detail page like here:

https://demo.postgresml.org/models/1
https://demo.postgresml.org/models/15

The short term goal would be to expose more metrics from the toolkit.


This is great! FYI for those who haven't seen, BigQuery can also run statistical learning methods directly on your data as part of the query. Really cool to see ML going this direction.


Hello, really nice!

Can you explain the differences with https://madlib.apache.org/ ? Wouldn't an OLAP DB be better suited than PG for this kind of workload?

Does being a PostgreSQL module make it compatible with Citus, Greenplum, or Timescale?


OLAP vs. OLTP will depend on your ML use case. Online predictions will likely be better served by OLTP, while offline batch predictions are better served by OLAP.

OLAP use cases often involve a lot of extra complexity out of the gate, and something we're targeting is to help startups maintain the simplest possible tech stack early on while they are still growing and exploring PMF. At a high enough level, it should just work with any database that supports Postgres extensions, since it's all just tables going into algos, but the devil in big data is always in evaluating the performance tradeoffs for the different workloads. Maybe we'll eventually need an "enterprise" edition.


How about the difference between this and the Madlib project? Better ergonomics?

I've used Madlib in the past and although it was 'successful', the constraint was unfamiliarity with the library from our data scientists, who preferred the classic Python libraries.


Can we offload model training to a different server? Can it be parallelized? Anyway, nice API and a promising project.


This can be done "manually" by configuring Postgres replication and/or foreign data wrappers. We don't have a magic button for that, but if we have a few examples in the wild we can establish best practices and then put those into code. I say this with some optimism that we may be able to see more targeted ML-specific scalability use cases that can be solved more completely than general database scalability.
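
As a sketch of what that "manual" configuration looks like today (server, credential, and table names invented):

  -- On a dedicated training server: pull the primary's tables via postgres_fdw
  CREATE EXTENSION postgres_fdw;
  CREATE SERVER primary_db FOREIGN DATA WRAPPER postgres_fdw
      OPTIONS (host 'primary.example.com', dbname 'app');
  CREATE USER MAPPING FOR CURRENT_USER SERVER primary_db
      OPTIONS (user 'ml_reader', password 'secret');
  IMPORT FOREIGN SCHEMA public LIMIT TO (purchases)
      FROM SERVER primary_db INTO public;

  -- Training reads through the FDW; the heavy CPU work stays on this box
  SELECT * FROM pgml.train('Buy It Again', 'classification',
                           'purchases', 'bought_again');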


Congratulations on the launch!

This is the most exciting ML-related project I've seen in a while, mainly because the barrier to entry seems low: anyone with a PG database could apply a model to their data using PostgresML, if I understood the premise correctly.

Most of the comments here seem to be about separating the compute from the database machine, which it seems isn't possible right now with PostgresML. But the GitHub README says near the start:

> The system runs Postgres with the pgml-extension installed on port 5433 by default, *just in case you happen to be running Postgres already*:

  $ psql -U postgres -h 127.0.0.1 -p 5433 -d pgml_development

I think the second part needs to be clarified better: is it installing the PGML extension on a machine running an existing PG database and connecting to it, or does it mean just starting the Postgres session of the PGML Docker package?


Great idea! I see this is implemented using the Python language interface supported by PostgreSQL and importing sklearn models. I always wonder how scalable this is considering the serialization-deserialization overhead between Postgres' core and Python. Do you see any significant performance difference between this and training the sklearn models directly on something like Dataframes?


This is an interesting benchmark I'll try to code up. Although it seems a bit like an apples/oranges comparison, since a Dataframe in memory had to come from somewhere, either a CSV or a database like Postgres, in which case I have my money on Postgres outcompeting the standalone process parsing CSV.

In the end though, it'll be important to have benchmarks for all the key steps in the process, both in terms of memory and compute. Off a hunch, I think the memory inefficiency involved in high level pandas operations is more likely to be a driving force to move operations into lower layers, than CPU runtime.


The Dataframe is loaded from disk, true, but it is possible that batch loading is faster (esp. with structured data) than row-by-row translation of Postgres types into Python types. Would be interesting to see the benchmark results.

> I think the memory inefficiency involved in high level pandas operations is more likely to be a driving force to move operations into lower layers, than CPU runtime.

Indeed. Not only memory but also inefficiency related to Python itself. It would be great if feature engineering pipelines could be pushed down to lower layers as well. But for now, the usability of Python is still unparalleled.


This looks amazing!

The animated GIF on your homepage moves a little bit too fast for me to follow.


Interesting concept, but I think BigQuery ML [1] has been providing similar features for years now. Curious to learn what the differences are, other than offering this as a Postgres plugin.

[1] https://cloud.google.com/bigquery-ml/docs/introduction


I don't understand the example on the homepage. How does the extension know what "buy it again" is?


"Buy it again" is simply a name for the PostgresML project that the model is being trained for.

There is deeper explanation in the README: https://github.com/postgresml/postgresml
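
To make it concrete, the project name is just the handle that ties training and inference together (signature per the README; column names invented here):

  -- Train under a project name...
  SELECT * FROM pgml.train('Buy it again', 'classification',
                           'purchases', 'bought_again');

  -- ...then predict against whatever model that project has deployed
  SELECT pgml.predict('Buy it again', ARRAY[user_id, product_id, price])
  FROM purchases;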


Never mind, figured it out from GitHub. This is cool.


Reminds me of https://riverml.xyz/latest/ (which is awesome) but the idea is even better because it skips all the copying and preprocessing yak shaving. Can't wait to kick the tires!


Cool approach. This nicely fits in the trend of SQL-as-much-as-possible because that makes it just a tiny bit more accessible. Definitely going to play with this in the next few days. (edit:) Being able to get training data from a SQL view is by far the nicest. Keep it up!


I feel like a lot of issues with ML systems come from the fact that some person got a CSV dump of the data and then iterated for a month to build a fantastic model which nobody knows how to integrate with the DB.

So, this is why I really like this idea and about 3 years ago I seriously thought about starting this thing as well. I went ahead and built a specific data company (so not a tooling one) and now I don't like this idea anymore.

To me this is a lot like proposing: "let's get rid of REST APIs and GraphQL and connect the frontend directly to the DB" (ignoring security issues for a bit).

In frontend work, the view in which you want to display your data is different from how it should be stored. It's exactly the same in ML: the form your data can be trained/predicted on is very different from how it should be stored.

They are connected, but IMO there always has to be a transformation layer. (And Python is just a much better way to do that transformation, but that's another story.)


Neat project. Any roadmap for cross-validation support (GridSearchCV and friends)?


Yep! Traditional hyperparameter tuning techniques and automated broader surveys across multiple algos are both on the roadmap for "soon".


Landed a PR to enable GridSearchCV and RandomizedSearchCV: https://github.com/postgresml/postgresml/pull/83
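
For illustration, the call could look something like this (the parameter names here are hypothetical; the PR has the real ones):

  SELECT * FROM pgml.train(
      'Buy It Again', 'classification', 'purchase_training', 'bought_again',
      algorithm     => 'xgboost',
      search        => 'grid',
      search_params => '{"max_depth": [2, 4, 6], "n_estimators": [50, 100]}'
  );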


SQL to rule the world! Now just SQL to create GUIs and websites and I'm set.


This is awesome. I’m guessing the models are executed on the database server and not a separate cluster? What about GPU training? How is that handled? I’d love to see more docs.


This is great - will be experimenting with it one weekend soon...


I wonder if/how a PostgreSQL plug-in can provide an optimal mix of computing and storage resources for varying machine learning workloads.


Stateful services like Postgres cannot be rapidly scaled up/down to adjust to daily load, even when that load is very predictably busy at noon but often idle overnight. Scheduling ML jobs for windows when the DB is otherwise wasting resources might be an efficient strategy.
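
For example, a sketch with the pg_cron extension (project, schedule, and table names invented):

  -- Retrain nightly at 03:00, while the instance would otherwise sit idle
  SELECT cron.schedule(
      '0 3 * * *',
      $$SELECT * FROM pgml.train('Buy It Again', 'classification',
                                 'purchase_training', 'bought_again')$$
  );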


Anyone care to reframe the question or improve it?


How does this compare to https://mindsdb.com/ ?


Is this a competitor to BigQuery AutoML or to something like Kubeflow?


This looks awesome! I’m not an expert, but wouldn’t typical database hardware be suboptimal for running ML? Is this meant to run on a replica (which is quite straightforward to set up) that has ML-optimised hardware?


It’s probably a stretch to run GPT-3 inside a DB, but most of the “deep learning” models I’ve run in more traditional environments are a few megabytes. That’s millions of params, but Moore’s law has been generous enough to us over the decades that I think there is a good case to spend a few megabytes of DB RAM on ML. I would think this idea has really landed, though, when we start hearing about Postgres deployments with GPUs on board :)


You can already do that by using the pl/python or pl/java extensions with the right environment. However, the interface between SQL and model inference is typically narrow enough that IMO it's better to: read from Postgres --> process in an external Python/Java process --> persist to Postgres.

Maybe enhance this with an FDW to an external inference process to allow triggering of inference from PostgreSQL itself.
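
For reference, a minimal sketch of the pl/python route (model path and function names invented): load a pickled sklearn model once per session and serve it from SQL.

  CREATE EXTENSION IF NOT EXISTS plpython3u;

  -- SD is plpython's per-session dictionary, so the model loads only once
  CREATE FUNCTION predict_churn(features REAL[]) RETURNS REAL AS $$
      import pickle
      if 'model' not in SD:
          with open('/models/churn.pkl', 'rb') as f:
              SD['model'] = pickle.load(f)
      return float(SD['model'].predict([features])[0])
  $$ LANGUAGE plpython3u;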


Very cool! Will probably use it soon.


Thanks! We’d still consider this an early stage project and would love your feedback for which features to prioritize. Our roadmap is only getting longer…


Is it possible, or how hard is it, to plug in custom proprietary models?


Custom proprietary models are on the TODO list. There are open questions about the mechanism to distribute the model params and code, which will lead to additional complexity, but I think it's something we'll eventually need to support.

In the opposite direction of bespoke fully custom models being the norm, I'd like to build more "rails" for ML. Hopefully we can expose enough hyperparams, even for deep learning architectures, and automatically adapt the inputs with configurable transformers so that we cover 90% of the custom use cases out of the box.


Do I need to know about ML/statistics to interpret the results?


WHY!?


What affiliation does this have with PostgreSQL?


None, it's just an extension.

Which is part of what is so awesome about PostgreSQL, everyone can build extensions that look and feel native and can do almost anything.


You should review the PostgreSQL trademark policy: https://www.postgresql.org/about/policies/trademarks/


You might want to use a different elephant logo... the branding implies that it is an official part of the project.

E.g. Postico uses an elephant but not the _same_ elephant - https://eggerapps.at/postico/


I think the logo is clever, and if there’s no rights issue it’s great. The fact that it’s a different website, domain, theme etc doesn’t make it ambiguous IMO


This logo and name, to me at least, implied some sort of official affiliation with the PostgreSQL project. Which I think should at least be clarified on the site/readme.


I found this to be very confusing, and it took some sorting out that this is not an official project of PostgreSQL. I personally think a name change and logo redo are in order here.



