Migrating from RethinkDB to Postgres – An Experience Report (medium.com/fuzzy-sharp)
135 points by okket on Sept 15, 2017 | 78 comments



Interesting quote: "we decided to compute all statistics on demand. This was something we previously tried in RethinkDB, but the results were not good... When we tried implementing statistics as SQL queries in Postgres, we were amazed by the performance. We could implement complex statistics involving data from many tables."

I think the standard line "use the right tool for the job" is still the ultimate answer. Data in most applications is relational, and you need to query it in different ways that weren't anticipated at the beginning, hence the longevity of SQL.

That said, I too often see HN commentators say something like "this data was only 100 GB? Why didn't they just put it in Postgres?" which is not as clever as the writer may think. Try doing text search on a few million SQL rows, or generating product recommendations, or finding trending topics... Elasticsearch and other 'big data' tools will do it much quicker than SQL because it's a different category of problem. It's not about the data size, it's about the type of processing required. (Edited my last line here a bit based on replies below.)


We do exactly what you describe on an ~80TB dataset stored across a number of PG instances. The 100GB comments are extremely clever, because running a query against a dataset that fully fits in RAM will be blazingly fast. "Try doing text search on a few million SQL rows" - we are doing it on billions of rows.


Exactly. People overestimate the amount of effort and cleverness it takes to horizontally shard/scale your data yourself rather than trust a specialized database to do it as a black box. And overestimate the cleverness of the black box.

What you retain by staying with Postgres rather than going to a more exotic database is priceless. There is a threshold of data size or product need that makes a more specialized database the right choice. It's just well above 100 GB and your application should have some very specific needs to justify it.


I think some of this can be attributed to expectations: many systems such as Elastic are expected to run in RAM, whereas people tend to expect relational systems to perform reasonably well on disk. I think we all want a magic black box that makes such problems disappear, and we always seem to be disappointed with the system in question when at some point it doesn't.


You are 100% right: the perpetual search for a "magic black box" is the cause of much pain. On the bright side, it's also the source of a significant amount of high-paying work to clean up the mess.


Do you have FULLTEXT indexes on those rows? I find it hard to believe that searching something like e.g. a fragment of text in Youtube comments can be as fast in a SQL system (even in RAM) as in Elasticsearch.

As for the other stuff I mentioned (recommendations, etc.), I'm not just basing it on my personal experience - here's a write-up from Pinterest about having to dump all their MySQL data to Hadoop to drive various types of analysis. I doubt they would do it if just putting the SQL DBs in RAM was adequate! https://medium.com/@Pinterest_Engineering/tracker-ingesting-...


PG is something of a different beast too. With MySQL the standard mode of operation for most non-basic functionality is to dump the data into another tool.

Full text with Postgres is pretty fantastic and configurable. Not putting the data somewhere else keeps you from having to maintain a second system, keep things in sync, etc.

People jump to putting data in a search tool because it's a search tool waaaaay too quickly IMHO. If the use case justifies it, go for it...but don't add unnecessary complexity unless you have to.


> Not putting the data somewhere else keeps you from having to maintain a second system, keep things in sync, etc.

To be fair there is some amount of "keeping things in sync" one has to do with postgres, even if it's just setting up the right triggers to update the FTS index on updates.
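
For reference, that trigger setup is roughly this (a minimal sketch; the posts table and its title/body columns are made up):

    ALTER TABLE posts ADD COLUMN tsv tsvector;
    CREATE INDEX posts_tsv_idx ON posts USING gin (tsv);

    -- Built-in trigger keeps the tsvector column in sync on every write.
    CREATE TRIGGER posts_tsv_update
      BEFORE INSERT OR UPDATE ON posts
      FOR EACH ROW
      EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);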


True, but it's transaction-safe, so it should be impossible to get out of sync, which is what I was really getting at.


I can assure you that Pinterest's dataset is vastly bigger than 100GB :) At a certain scale an RDBMS will obviously experience issues and might no longer be the optimal solution. For PG it's hard to generalize, but beyond 10-20TB things become painful. Now, the thing is that "limit" is constantly shifting, so if you are starting with a 100GB dataset and it is growing at 200GB a year, you can basically stay on a single-instance RDBMS forever.


Sorry, I'm just not buying that it's about the file size of data.

Right now I'm working on an app that works with tweets. I want to find all tweets that link to iTunes.

When I was using:

> select * from `tweets` where `url` not like '%twitter.com%' and `url` like '%itunes.apple.com%'

I could scale my server up to 16 CPUs and it would still take several minutes to search a few million tweets.

Yesterday I added another field to the database, `is_audio_url`, where I pre-compute whether the URL is an iTunes link (by string matching in the app code) when I insert the record into the database. So I can do:

> select * from `tweets` where `is_audio_url` = 1

And now it's blazing fast. It is just my most recent of many experiences that MySQL really struggles with text matching.


Full text search in Postgres is fast if you have it configured correctly.

> It is just my most recent of many experiences that MySQL

1) You're using MySQL not Postgres; given that this is a discussion about whether Postgres can compete with Elasticsearch, that's not super relevant. :)

> select * from `tweets` where `url` not like '%twitter.com%' and `url` like '%itunes.apple.com%'

2) That's not how you query a full text index; that's going to be glacially slow.

You need a FULLTEXT index and to use a MATCH...AGAINST query. Check out the docs[1].

[1]: https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
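
Roughly like this (a sketch against the tweets table from the parent comment; note that MySQL's default tokenizer splits URLs on punctuation, so you match individual terms rather than whole hostnames):

    ALTER TABLE tweets ADD FULLTEXT INDEX tweets_url_ft (url);

    -- Boolean mode: must contain "itunes", must not contain "twitter".
    SELECT * FROM tweets
    WHERE MATCH(url) AGAINST('+itunes -twitter' IN BOOLEAN MODE);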


Fair enough, I wasn't aware of MATCH...AGAINST - I'll have to check it out. I'm not going to keep going back and forth on this in HN comments, but it really has been my experience that Elasticsearch can do some very cool things that I've struggled to do with MySQL. I've already named the top three examples I have in mind (full text search, recommendations/related items, and trending topics, i.e. finding terms used in text fields in the last week that weren't as frequently used in the last month).


But didn't you just say that you had not used full text search? You haven't struggled with FT - you haven't used it.


I am not selling anything to you :) You can scale your server to even 100 vCPUs or whatever black-box name the provider is using and still have the same speed, mainly because MySQL does not have parallel query :). BTW, no one ever mentioned MySQL, but you keep bringing it up.


Queries with %text% syntax do not use indexes by default.

You can use the pg_trgm module to index these kinds of queries.

https://www.postgresql.org/docs/9.1/static/pgtrgm.html
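
Something like this, assuming the tweets/url schema from upthread:

    CREATE EXTENSION pg_trgm;
    CREATE INDEX tweets_url_trgm_idx ON tweets USING gin (url gin_trgm_ops);

    -- Can now use the trigram index despite the leading wildcard.
    SELECT * FROM tweets WHERE url LIKE '%itunes.apple.com%';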


Prob better to point to the current (9.6) version of that doc, rather than the no-longer-supported ancient v9.1 version. There are some differences. :)

https://www.postgresql.org/docs/current/static/pgtrgm.html


You should really take the time to learn how storage and indexes work, and the difference between IO and CPU constraints on query performance. The database isn't magic... with the possible exception of elasticsearch, they will all disappoint you if you chuck data into them blindly and expect optimal performance on inefficient queries.


The problem is that you are probably not using the correct index. As far as I remember, when you use a '%' at the beginning of a string you need at least a GIN (or was it GiST?) index with string ops, but using % at the beginning is not really recommended anyway. I think a tsquery might be better suited for contains-style searches, at least on PostgreSQL.


He is using MySQL, not PG.


You can directly define is_audio_url as a partial index in Postgres. No need for app code: postgresql.org/docs/current/static/indexes-partial.html
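
i.e. something like this (a sketch, reusing the column names from the parent comment):

    CREATE INDEX tweets_itunes_idx ON tweets (id)
      WHERE url LIKE '%itunes.apple.com%' AND url NOT LIKE '%twitter.com%';

    -- A query repeating the same predicate can scan the (much smaller) partial index.
    SELECT * FROM tweets
    WHERE url LIKE '%itunes.apple.com%' AND url NOT LIKE '%twitter.com%';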


You could do a number of things to speed it up.

- Full text index

- Extract the domain name to another column and index it

- Change MySQL defaults

- Change engine types. Maybe go in-memory

- Create a lookup table of domains


Create a fulltext index and use MATCH AGAINST instead of LIKE


If you wanted to, you could even use a trigger to compute that field, and then backfill it - no changes to app code required!
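
For example (a MySQL sketch, since that's what the parent is using; a matching BEFORE UPDATE trigger would be needed to cover edits):

    CREATE TRIGGER tweets_set_is_audio_url
    BEFORE INSERT ON tweets
    FOR EACH ROW
      SET NEW.is_audio_url = (NEW.url LIKE '%itunes.apple.com%'
                              AND NEW.url NOT LIKE '%twitter.com%');

    -- Backfill existing rows once.
    UPDATE tweets
    SET is_audio_url = (url LIKE '%itunes.apple.com%' AND url NOT LIKE '%twitter.com%');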


I'm doing fulltext search on 270 million rows in Postgres (with tsvector indexes) in under 65ms on commodity hardware (sub 8€/mo).

And it works better than ElasticSearch, just due to the lower additional overhead.


How do you sort the result? Do you use ts_rank?

What is the size of your database on disk?


I use ts_rank combined with many other factors (as I also rank by time, by message type, etc).

The size of the database is a few dozen gigabytes by now, but that isn't relevant with tsvector; only the row count has an effect on search speed.


Thanks for following up :-)

I was asking because ranking can be slow in PostgreSQL. PostgreSQL can use a GIN or GiST index for filtering, but not for ranking, because the index doesn't contain the positional information needed for ranking.

This is not an issue when your query is highly selective and returns a low number of matching rows. But when the query returns a large number of matching rows, PostgreSQL has to fetch the ts_vector from heap for each matching row, and this can be really slow.
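
Concretely, the pattern I mean is the standard one from the docs (a sketch; the messages table and its stored tsv column are hypothetical):

    SELECT id, ts_rank(tsv, query) AS rank
    FROM messages, to_tsquery('english', 'postgres & search') AS query
    WHERE tsv @@ query          -- the GIN index can answer this part
    ORDER BY rank DESC          -- but ranking reads tsv from the heap for every match
    LIMIT 20;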

People are working on this but it's not in PostgreSQL core yet: https://github.com/postgrespro/rum.

This is why I'm a bit surprised by the numbers you shared: fulltext search on 270 million rows in under 65ms on commodity hardware (sub 8€/mo).

A few questions, if I may:

- What is the average number of rows returned by your queries? Is there a LIMIT?

- Is the ts_vector stored in the table?

- Do you use a GIN or GiST index on the ts_vector?

Cheers.


It will not be as fast, but it will be fast enough for the vast majority of use cases.


Is there some resource from which one can figure out what category of problem needs what storage solution?

What storage solutions would you use for product recommendations or trending topics?


I used Elasticsearch. But I didn't use it as a primary datastore, more just a mirror of the MySQL data.

It comes out of the box with a way to find "More Like This" https://www.elastic.co/guide/en/elasticsearch/reference/curr...

For trending topics I believe I used something related to this: https://www.elastic.co/guide/en/elasticsearch/reference/curr...

An interesting article on using ES: https://auth0.engineering/from-slow-queries-to-over-the-top-...


Document databases are for data you know will be hierarchical and you know you won't have to do much cross-referencing on. That's not as broadly useful as it might sound, and once requirements change you end up reimplementing an RDBMS in your NoSQL. I don't hate doc dbs, I just don't think they're as generally useful as they're made out to be.

Relational is well understood. The modeling problem is well understood, so you don't have to guess too much about how your data should be structured. By default I start projects with an RDBMS and then carve out the portions that truly are hierarchy-only.

That said, I'm starting to investigate graph databases more closely and what I like about them is what I like about relational: queryability. It's so easy to write a short query and extract your data. I like the model quite a bit too because graphs are a pain to shoehorn into a relational db. I'm still not entirely sold but I would love to find more good resources on graph dbs.


But data you store in a relational database is also hierarchical...

Order / OrderLine

In a document db we store it as a single document because an order is the root aggregate and the order line cannot exist without an order.

It relates to other objects in the database in the sense that it may relate back to a User...

We hack together our "understood" objects to fit into a relational database.


How much does WITH RECURSIVE help to reduce pain with graphs in PostgreSQL?
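
The kind of traversal I have in mind is roughly this (sketch over a hypothetical edges(src, dst) table):

    -- All nodes reachable from node 1; UNION (not UNION ALL) keeps cycles from looping forever.
    WITH RECURSIVE reachable(id) AS (
      SELECT 1
      UNION
      SELECT e.dst FROM edges e JOIN reachable r ON e.src = r.id
    )
    SELECT id FROM reachable;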


Interesting timing... literally just finished moving from RethinkDB to Postgres myself: https://github.com/heiskr/sagefy .

When I was first building the project, I didn't totally know what schema I would need. I started on Postgres. But having to constantly do database migrations while I was developing was a pain. RethinkDB was great for development while I was figuring out what schema I needed. But now that the website is live https://sagefy.org , the schema started getting stable.

I'm not using perfect third normal form, but in many cases I have no query needs there, so JSONB makes sense. JSONB columns can actually remove much of the need for NoSQL. The biggest gains from moving were: not having to do foreign key validations myself (yay!) and about a 2/3 reduction in database file size. My database isn't large enough for performance to matter, but I'm sure it would eventually.
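
As a sketch of the kind of JSONB usage I mean (table and fields made up):

    CREATE TABLE cards (
      id    uuid PRIMARY KEY,
      kind  text NOT NULL,
      data  jsonb NOT NULL DEFAULT '{}'
    );

    -- A GIN index makes containment queries over the document fast.
    CREATE INDEX cards_data_idx ON cards USING gin (data);
    SELECT * FROM cards WHERE data @> '{"language": "en"}';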

If I were to do this again, I would probably start again with a NoSQL database (maybe Redis or ES, or just files even) until I figured out the schema and then move to Postgres before launching. I'm sure there are smarter people out there who can start with SQL and easily predict what the feature set is going to be before they start, but I don't usually have that foresight. `ALTER TABLE ...` gets old really quickly.

I'm keeping an eye on CockroachDB too... if they can really make something similar to Postgres that easily scales horizontally... that would be amazing.


After starting up a big-data time-series project at work over the past several years, and spending a big chunk of that time looking into and testing various DBs, whenever I am asked "What database should I use for X?" I tell them Postgres, unless they have a well-researched and proven reason not to. I picked Druid for our project, and am very happy with the result. However, for some of our smaller customers, Postgres would have been a much better choice. Only because most of our big, important customers have data needs starting at petabytes am I happy we went with Druid. Well, that and a bunch of other reasons ;)

Much of the "reasoning" I see for people choosing Elasticsearch or InfluxDB or MongoDB seems to come down to, "It showed up a lot on Hacker News".


I am currently working on a thesis paper evaluating persistence technologies regarding their fitness for a petabyte-scale sensor data analytics product. I am interested to learn more about your experience with Druid.

I would be grateful if you contact me at 8nvyve+wj7zvh7q5e0@sharklasers.com (it's a throwaway email as I avoid posting my private email publicly). Thank you.


I am beginning to see a trend on HN. Teams are moving away from new and young databases to old and "boring" databases as their need for stability and correctness increases.


Which is why we greybeards just wait for the caravan to pass and the dust to settle; then we get back into action and collect our dues.


Would be interesting to hear how they arrived at the RethinkDB/Elasticsearch combination over Postgres at the start of the project. It doesn't seem they lost any features after the transition. Was something lacking in Postgres at the time of initial implementation?


However, they do say:

> leaning heavily on Haskell in order to fill in some of the gaps quickly.

So it looks like they had to do some work to cover some features from RethinkDB/Elasticsearch.


The code in the post is just used to migrate from an untyped interface to a typed one. Unless I'm missing something, there seems to be no mention of any missing features in postgres.


They did mention missing JSON schema validation in Postgres.


And also many mentions of no schema validation in Rethink, so that's probably not a gap.


Hmm, not really. It doesn't make any sense (text search included) to put everything into PG.

PG doesn't even have a clustering solution outside of Citus; how do you scale / HA Postgres using the default setup without doing sharding yourself?


What? There are over a dozen. See:

https://wiki.postgresql.org/wiki/Clustering

Some commercial.


Those are third-party extensions; ES does clustering out of the box well. Most of the things on your list don't have a proper clustering solution built in, which means you have to support it yourself.


You said that PG had no clustering solutions; I showed you there are. Some of them may require more work than others, but there ARE clustering options nonetheless for PG. It's inaccurate to say there are none.


"Postgres is basically going to be around forever". My favourite quote from the article and a really great point. Even with RethinkDB going opensource there's only been one single release in the last year, and it was a bug fix release.


With RethinkDB and Basho going out of business, it has been an interesting year for databases. I think we are seeing the dawn of a new golden era for databases, with the world converging on 3-4 of them: Postgres, Cassandra, Mongo, and Elasticsearch.


MySQL still has massive usage, and isn't going anywhere.


Agreed, I meant other than Oracle, MySQL/MariaDB and MSSQL, which are the top 3 anyway.


I talked to Phil about this migration; I maintain two of the libraries mentioned in the post. I mentioned the Elasticsearch client (which I actually wrote, unlike the SQL library) in this comment: https://news.ycombinator.com/item?id=15241886

Maintaining each library has enhanced my appreciation and respect for how the Postgres people do their thing.


In addition to the Postgres knowledge, this post also provided me with tremendous help in building a sane data architecture for RESTful web services in Haskell.


We solved this differently. We needed to maintain a two-DB solution (a two-state solution, as some of my team has called it). For this, we took advantage of RethinkDB's changefeeds to deal with the foreign key constraints. We have some Elixir supervisors that link to its direct dependencies. So e.g. a Comment can't exist without a Post, so the Comment changefeeds don't start until the Posts insert changefeed has caught up with at least upserting the IDs. We then have a follow-up task that makes sure the data is also up to date. This has allowed us to keep our FKs, while allowing functionality to be moved over as we needed it to. As new_app became the writer, the changefeed was removed, and the process continued. If one dies for whatever reason, it kills the dependent changefeed listener processes and the VM restarts the process at that point.


I really wish that there was a column-oriented storage native to Postgres. My impression is that people are forced to abandon Postgres for analytical processing way too early, at relatively small scales.


FWIW, I think column- vs. row-store is currently not in the critical path of making Postgres more suitable for analytical query processing. The CPU bottlenecks are just elsewhere - and we (various Postgres devs) are working on them. The storage density benefits of column stores can currently partially be achieved by using a filesystem that does compression - far from great, but not disastrous either.


I thought the bigger advantage was that you could do sequential-only I/O on projections?

Or are you saying that you're only CPU-bottlenecked?


> I thought the bigger advantage was that you could do sequential-only I/O on projections?

Not quite following - a column store will usually not have more sequential IO than a row store. Often enough it's the contrary, because you have to combine column[-groups] for some queries. What you get is: higher compression ratios, better IO & cache access patterns for filter-heavy queries, and computations that are easier to vectorize. Especially if you either filter heavily or aggregate only a few rows, you can do a lot less IO in total, but the sequential-ness doesn't really improve.

> Or are you saying that you're only CPU-bottlenecked?

Oftentimes, yes. You might be storage-space constrained, but storage speeds for individual sequential-IO-type queries are usually fast enough. Parallelism helps with that (if you can push down enough work; a lot of it was added in 9.6 & 10), as do plain old code optimizations (better hash tables, the new expression evaluation framework, both in 10), as does JITing parts of the query processing (WIP, patches posted for 11).


There is the cstore foreign data wrapper, https://github.com/citusdata/cstore_fdw, but it is still an extension. Hopefully it will go into mainline at some point.
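
For reference, a minimal sketch of how it's used, going from the project README as I remember it (the table and columns are made up):

    CREATE EXTENSION cstore_fdw;
    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    -- Columnar, compressed storage behind a regular-looking table.
    CREATE FOREIGN TABLE events_columnar (
      event_time  timestamptz,
      user_id     bigint,
      payload     jsonb
    ) SERVER cstore_server
      OPTIONS (compression 'pglz');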


There sort of was, but unfortunately Amazon bought it and renamed it to Redshift and now you have to rent it.


I thought Redshift was built completely from scratch, and only looks like Postgres at the wire protocol level?

Also, it would be really nice if I could do both my transactions and my analytics in the same box. Then ETL is basically just maintaining a materialized view.
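
e.g. something as simple as this (a sketch; the tables are hypothetical):

    CREATE MATERIALIZED VIEW daily_signups AS
      SELECT date_trunc('day', created_at) AS day, count(*) AS signups
      FROM users
      GROUP BY 1;

    -- A unique index lets the refresh run without locking out readers.
    CREATE UNIQUE INDEX ON daily_signups (day);
    REFRESH MATERIALIZED VIEW CONCURRENTLY daily_signups;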


Amazon built Redshift using technology from ParAccel. ParAccel was built on Postgres.

See a Quora answer I wrote:

https://www.quora.com/Amazon-redshift-uses-Actians-ParaAccel...


Wow, we had to do the same thing recently. Rather than replace all the model queries across a complex codebase, we created a RethinkDB-to-RDBMS (Knex.js in Node) adapter layer, which gave us a nice 80/20 solution.

https://github.com/MoveOnOrg/rethink-knex-adapter


>  Rethink is no longer commercially maintained.

Just wondering, seeing as I am not a Postgres user - is Postgres commercially maintained? Or is it pure open source (like I believe RethinkDB is currently)?


There is no single company behind PostgreSQL. There's a foundation which manages development, but there are a number of well-established businesses offering commercial-grade support. Some of these companies employ some of the major contributors to PostgreSQL.


> there's a foundation which manages development,

There's not really a foundation that manages development - there's one that holds the trademark etc., but that's largely the extent of its activities, and there are some geographical associations. The development is just managed by the community - there's a number of committers that have the technical authority to make decisions, there's the "core team" that resolves conflicts should they otherwise not be resolvable, a release team, an infrastructure team, ... but these are just people working together.

> ..., but there are a number of well established businesses offering commercial grade support. Some of these companies employ some of the major contributors to PostgreSQL.

Indeed, most of the active PG devs work for one of them.


Wasn't there in fact an agreement amongst the PostgreSQL contributors that they'd never all go and work for the same company?


I'm not sure (others here, naturally, may be in the know). Having said that, there do seem to be some larger concentrations of them in a few companies. I don't keep up too much with that either, but it seems like EnterpriseDB was in this camp (and logically so).

I do think there are enough companies now that develop PostgreSQL based products (EnterpriseDB, Citus Data, Greenplum, etc) along with the larger PostgreSQL consultancies which would probably raise their hand if they thought one player or another was becoming dominant in some way hostile to the others.

All of this is outside observer speculation, but it's the way I read the tea leaves.


>> Wasn't there in fact an agreement amongst the PostgreSQL contributors that they'd never all go and work for the same company?

At least some of us have that understanding - and looking at where at least the committers work it's fairly well distributed.

> Having said that, there do seem to be some larger concentrations of them in a few companies. I don't keep up too much with that either, but it seems like EnterpriseDB was in this camp (and logically so).

There is some concentration, but if you look at the list of committers (a smaller number, so I don't have to look up affiliations), they're fairly well distributed across the larger players (alphabetically 2ndQuadrant, Crunchy Data, EnterpriseDB) and various other orgs with some.


The thing is, Postgres has grown to where it is without being commercially maintained. So the open-source ecosystem is proven for Postgres.

Whereas RethinkDB was primarily developed by one company, then open-sourced. I hope that the RethinkDB project is successful, but it's reasonable to suggest that the future of Postgres is 'safer' at the current time.


Postgres is Open Source, but multiple companies sell support, employ devs and/or provide distributions of it.


Postgres is always the answer.


No it's not. It's question dependent. "Would you like a burrito?" "Postgres". See? It doesn't work.


Works for me. I'd probably lose weight if I started using that answer.


Pos que cres?

(Well, what do you think?)

Seems like a legit answer to me.


Well, not always, but for a vast majority of cases it's a very strong option.


I've had a prodigal son moment with Postgres myself. I found that even for many tasks that seem to favour NoSQL (time series, etc.), you get a lot more bang for the buck in terms of performance for the same resource usage with Postgres.



