How to Migrate 50M Records from Mongo to PostgreSQL in Less Than a Day (plug.dj)
63 points by zenogais on Sept 24, 2014 | 37 comments



Umm... Isn't a 50 million row export/import usually an hour-long task, tops? I mean, assuming you have the scripts?

Dump the first DB, import into the second? Maybe a couple ETL scripts in between to massage the data?

This task sounds like mongodump export_db (or whatever they call their backup tool) | etl.sh | psql -d import_db. Of course shell scripts probably aren't 'webscale' enough to handle 150 GB of data.

Protip: When IO is saturated, make the IO sequential before ever even thinking about threads. Exactly how are more CPUs supposed to make the disk faster? The best you'll do is exploit the command queue more fully.


Wondering the same thing - What was the slow part, reading from Mongo or writing to Postgres?

It should not be the latter, unless there are a lot of indexes that were not dropped before the bulk load, or the ETL works one...row...at...a...time...

Even a simple 'mongoexport --csv | psql -c "COPY FROM stdin"' should be faster for 50M records.


Spot on.

This is not hard work. I'm not sure exactly what was being done that made this take more than an hour.

On our kit, which is pretty hefty, I'd schedule a 30-minute window for a 150 GB ETL even with heavy transformations.

When we moved our stuff from Mongo to SQL Server it took an entire 6 minutes to drag 30 GB over, including SQL full text. That was a 16-core Xeon with 64 GB of RAM and 15K SAS RAID 10, so not a big machine either.


It does seem like it should be faster, but this isn't a vanilla export/import. He is going from unstructured to (I assume) structured data, so he needs to deduplicate and possibly do other operations on the data before importing.


What about importing the unstructured data into one table in Postgres and querying that table to populate a structured schema on the same Postgres server? It shouldn't be as IO bound, since it's copying data from one Postgres table to another rather than from Mongo. Can anyone tell me if that would likely bring a performance improvement?
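Concretely, something like this is what I have in mind. All names here are invented, the target table is assumed to exist already, and this glosses over COPY's text-format escaping rules and the exact shape mongoexport gives _id:

  -- staging table: one json document per row
  CREATE TABLE staging_media (doc json);

  -- load the export, assuming one JSON document per line
  \copy staging_media FROM 'media_export.json'

  -- then build the structured schema entirely inside Postgres
  INSERT INTO media (mongo_id, title, duration)
  SELECT doc->'_id'->>'$oid', doc->>'title', (doc->>'duration')::int
  FROM staging_media;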


Probably. When dealing with IO-bound problems it's important to think of IO as a really long, extremely fat pipe. Think oil pipeline.

The key is keeping the pipe full; requesting a piece of data based on what's already in the pipe means it's going to be hard to keep the pipe full. You want to request the data long before you're ever going to need it.

If your SQL queries are written so that they don't depend on data still in the pipe, they'll be fast; if they aren't, you'll be not much better off than whatever was in place before.


That's usually part of the ETL pipeline. If you look at SSIS or Spring Integration, for example, they have the ability to do this in the pipeline.


You mean import the raw data into a json column and then SELECT DISTINCT INTO?

There are only 50 million rows... this sounds like a 5 minute query.


I was assuming he was moving the JSON into a structured foreign-key -> data kind of relationship.


It was actually a combination of several queries in the first database followed by several inserts into postgresql.


Ratio of query time vs insert time?


I'm curious as well. Even if the data was only somewhat structured on the Mongo side, an import into a single-column JSON table should be fairly quick with \COPY; then, with a couple of indexes applied to the disparate fields in the JSON data, you should have been able to query that into a new table structure quite easily.
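For example, an expression index on whatever field the fill queries join or dedupe on (names hypothetical, assuming 9.3's json operators):

  -- index a field inside the json doc so the fill queries
  -- don't have to seq-scan 50M rows each time
  CREATE INDEX staging_media_cid_idx ON staging_media ((doc->>'cid'));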

Just wondering what the tradeoffs were in OP's eyes.


This is great. Appreciate the feedback :)


This requires you to wait for the whole batch to complete, instead of starting a new process as soon as one finishes.

For these cases, I started using GNU parallel instead of plain shell. It does the heavy concurrency lifting for you, and even allows almost seamless parallelization across different machines.

For example, your final code could be written as such:

  parallel ./settings.sh python bin/migrate_media.py {} $LIMIT ::: $(seq 0 $LIMIT $NUMPLAYLISTS)
http://www.gnu.org/software/parallel/


This is great! Appreciate the tip


Seems like this would've been a completely ideal place to rock some mongo_fdw, which would give postgres the ability to query and extract data directly from mongo. https://github.com/citusdata/mongo_fdw
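Roughly something like this; the names are invented and the option names may vary between mongo_fdw versions, so check its README:

  CREATE EXTENSION mongo_fdw;

  CREATE SERVER mongo_server
    FOREIGN DATA WRAPPER mongo_fdw
    OPTIONS (address '127.0.0.1', port '27017');

  -- expose a Mongo collection as a Postgres table
  CREATE FOREIGN TABLE mongo_media (title text, duration integer)
    SERVER mongo_server
    OPTIONS (database 'plugdj', collection 'media');

  -- now Postgres itself pulls the data over, no export files involved
  INSERT INTO media (title, duration)
  SELECT title, duration FROM mongo_media;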


That adds quite a bit of latency to your database, which is already frequently the bottleneck.

In reality, most of what Mongo does can be done just as well, if not better, in PostgreSQL 9.3 (and I hear good things about 9.4). Then you can simultaneously have structured data via Postgres.


I don't think the parent's talking about doing all operations via the fdw, but rather using the foreign data wrapper to fill the table. So instead of pulling all the data into a Python script and then pumping it out into the PostgreSQL database, you're using PostgreSQL itself to do the fill operation. Additionally, by using the foreign data wrapper you could also create a materialized view based on the results, if you still want most activity going through the MongoDB instances for whatever reason.
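Something like this, assuming a mongo_media foreign table with these columns has already been defined via mongo_fdw:

  CREATE MATERIALIZED VIEW media_snapshot AS
    SELECT title, duration FROM mongo_media;

  -- re-pull from Mongo whenever you want a fresher copy
  REFRESH MATERIALIZED VIEW media_snapshot;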


And then you could use mongres when you eventually decided to replace that Mongo server without confusing the original PG database. It's elephants all the way down! :)


Typically a \COPY is going to be much faster than INSERTs in general, so a SELECT/INSERT may end up a bit slower. I haven't benchmarked it, but am pretty sure I'm right here.


This is a perfect example of how doing something row by row instead of in data sets is bad for anything with over a million rows. Bulk dump / COPY load, and then run the rules to clean up the data.
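For instance, the dedup rule becomes a single set-based statement over a hypothetical staging_media(doc json) table, instead of a per-row check in a script:

  -- keep one row per Mongo _id in one pass (adjust the key to your export format)
  CREATE TABLE staging_media_dedup AS
  SELECT DISTINCT ON (doc->>'_id') doc
  FROM staging_media;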


This would have been great, but the lack of a 1-to-1 mapping between data models complicated this. To migrate a single "record" required several queries in the old database, some restructuring, and then several inserts into the new database. We were restructuring nested records in mongo into several separate relations in Postgres.


You could still copy the raw data in and then run INSERT ... SELECT statements to load the real tables. This could have been done in under an hour, I am sure of it.


Would honestly be great to know about, but how do you raw copy a nested mongo document? The big hurdle was the lack of 1-to-1 mapping between primitive types in the databases.


Totally agree that would've likely worked too, though I'm not convinced it would have been significantly faster to do a dump, import, and then data processing. Dumps from Mongo typically took upwards of two hours, not to mention transferring the file, importing it, and then processing the data. Just a different way to skin the cat.


Mongoexport will let you export into flat files using --fields and --csv: http://docs.mongodb.org/v2.2/reference/mongoexport/


"We were mostly I/O bound so we knew we’d like to use gevent’s quick task switching so we could write the maximal amount of data per-process. A single process really constrained us to one vCPU. Therefore, to speed it up meant running more than one process."

Wait. What?! The response to being I/O bound is not using more processes...


So he really means they were "waiting for IO" bound.


Why use Python for real work?

For reference, I was handling 30M daily records in about an hour on hardware 15 years ago. This performance is absurd.


I'd be more interested in why you moved from Mongo to Postgres.


Probably because postgres can do what mongo does just as well, and is a full RDBMS.

Take a look at postgres hstore with a gin index.
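A minimal sketch of that (table and key names made up):

  CREATE EXTENSION hstore;
  CREATE TABLE docs (attrs hstore);
  CREATE INDEX docs_attrs_idx ON docs USING gin (attrs);

  -- containment queries like this can use the GIN index
  SELECT * FROM docs WHERE attrs @> 'genre=>electronica';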


hstore isn't really as flexible as a Mongo document, given that both keys and values must be strings, and you can't query into a document nested inside an hstore value.

Now, with the jsonb support coming in 9.4, PG gets pretty much everything I once liked Mongo for.

http://www.postgresql.org/message-id/E1WRpmB-0002et-MT@gemul...
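For example, with a hypothetical jsonb table in 9.4:

  CREATE TABLE docs (doc jsonb);
  CREATE INDEX docs_doc_idx ON docs USING gin (doc);

  -- query into a nested document, which plain hstore can't do
  SELECT * FROM docs WHERE doc @> '{"media": {"format": 1}}';
  SELECT doc#>>'{media,cid}' FROM docs;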


Look at performance numbers:

http://thebuild.com/presentations/pg-as-nosql-pgday-fosdem-2...

Furthermore, consider that PG is pretty much an amazing RDBMS in addition to having a good document story.


If numbers are the deciding factor for anyone, have a look at TokuMX which I believe has much better performance and storage characteristics compared to plain Mongo.

http://www.tokutek.com/products/tokumx-for-mongodb/


I don't know about the author, but I've found MongoDB to have very poor performance compared to Postgres.


To answer your question - we were rebuilding the app and in the process moving from Rackspace into AWS. We wanted to use as many amazon managed services as possible since we don't have a huge team. PostgreSQL seemed like a nice managed solution that would give us plenty of room to scale. The ACID guarantees aren't too bad either.


Somewhere around 2004 we pumped 5 million customers' worth of customer and product data into Oracle (basically a whole telecom company's customer portfolio, quite complicated relational data) in around half an hour. Table partitioning and a tiny Java program to manage the parallel PL/SQL scripts and the transitions between migration steps.



