How to move from Amazon RDS to a dedicated PostgreSQL server (authentise.com)
161 points by cinnamonman on May 24, 2016 | 71 comments



>To my surprise, PostgreSQL v9.5 on Ubuntu 14.04 does not default to unicode schema.

PostgreSQL does (if a Unicode character set is specified when the database cluster is first initialized); it's the Ubuntu packaging that creates the PostgreSQL database cluster with an ASCII character set (encoding) after PostgreSQL is installed.

> Here is the snippet that I had to use: ...

Instead of resorting to those hacks, follow the PostgreSQL documentation[1] to do it the right way. The simplest way is to initialize your PostgreSQL database cluster as:

    initdb --encoding=UTF8 --pgdata=<database-cluster-dir>

If you want to use an existing database cluster, the PostgreSQL documentation shows how you can do it.
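
Before deciding whether to rebuild anything, it's easy to check what an existing cluster was actually initialized with; a quick sketch, assuming you can connect locally:

    psql -c "SHOW server_encoding;"   # encoding of the database you're connected to
    psql -l                           # lists every database with its encoding/collate/ctype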

--

[1] - https://www.postgresql.org/docs/9.5/static/multibyte.html


Using the distro-supplied packages for PG (or any database server) is typically not the best choice for a production deployment.


    > Using the distro-supplied packages for PG
    > (or any database server)
Or almost any of your application environment, really.


Distro-supplied packages are a handy way of receiving timely security fixes without keeping too close an eye on recent CVEs and application changelogs.


For regular packages, sure. But for a highly active, well-maintained codebase like Postgres – which packages its own software for your OS – upstream is going to be quicker with security fixes than the distro.


I would like to hear more about that. Our server team swears by using CentOS packages with backports for everything, for security purposes. This slows us down a bit, but so far we haven't had any security or audit related issues. How do you deal with CVEs while staying audited and secure?


We use the official PG packages from their repo on CentOS. It appears there is an apt repository for Ubuntu too: https://www.postgresql.org/download/linux/ubuntu/


We use CentOS but PostgreSQL's official packages from their repo, just like we did when we were Debian based. http://yum.postgresql.org/


Ubuntu has lots of poor defaults and this is a good example of one.


Wouldn't this be worth raising a bug report for?


Probably, though it's an uphill battle IMO; it's not like Ubuntu has a good reputation for sane defaults, quality packaging, or timely security updates.


Actually, that is not true for Ubuntu 16.04; there it will use the locale setting:

Newly set up:

                                      List of databases
       Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
    -----------+----------+----------+-------------+-------------+-----------------------
     postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
     template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
               |          |          |             |             | postgres=CTc/postgres
     template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
               |          |          |             |             | postgres=CTc/postgres
    (3 rows)


It's good news that they've changed the default PostgreSQL character set (encoding) from ASCII to UTF8 in Ubuntu 16.04. It'll definitely save a good amount of time for new PostgreSQL users who need UTF8 encoding as the default.


No, it's not the "default"; the default is 'use locale settings'.

So LANG=de_DE.UTF-8 will give you a de_DE.UTF-8 database. While my LANG previously was en_US.UTF-8, I can change LANG to something different and the cluster will be created according to whatever value I've set LANG to.


> Instead of resorting to those hacks, follow the PostgreSQL documentation[1] to do it the right way. The simplest way is to initialize your PostgreSQL database cluster as:

This is actually poor advice. The right way on Ubuntu is to use the packages provided by PostgreSQL Development Group (https://wiki.postgresql.org/wiki/Apt), and in that setup you use pg_createcluster instead of initdb.

The snippet in the original article was so hacky because it fixed an existing cluster instead of destroying it and creating a new one in UTF-8.
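
For reference, a minimal sketch of the recreate-it approach using the Debian/Ubuntu cluster tools (the 9.5/main version and cluster name here are just the usual defaults, adjust to taste):

    pg_dropcluster --stop 9.5 main                    # throw away the ASCII cluster
    pg_createcluster --locale=en_US.UTF-8 9.5 main    # recreate it with a UTF-8 locale
    pg_ctlcluster 9.5 main start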


> This is actually poor advice. The right way on Ubuntu is to use the packages provided by PostgreSQL Development Group (https://wiki.postgresql.org/wiki/Apt), and in that setup you use pg_createcluster instead of initdb.

initdb is part of the standard PostgreSQL utilities/tools[1] for initializing your PostgreSQL server, which is why I suggested using it. On the other hand, I don't see any reference to the pg_createcluster command in the official PostgreSQL documentation. So I don't know why you think using a non-standard PostgreSQL tool (in place of a standard one, i.e. initdb) is the right way of initializing a PostgreSQL database cluster.

[1] https://www.postgresql.org/docs/9.5/static/reference-server....


Since we're talking about Ubuntu, I would use the Ubuntu tools. They make it very easy to maintain several Postgres clusters side-by-side, e.g. /var/lib/postgresql/9.4/main vs /var/lib/postgresql/9.5/main. You can also control which binaries you use with the PGCLUSTER variable, like this:

    PGCLUSTER=9.4/main pg_dump foo
    PGCLUSTER=9.5/main pg_dump foo


Not that it necessarily should, but the article did not really address why they were moving away from RDS. Was it cost, performance, other issues?

I've been trying to compare RDS for Postgres to other offerings like Compose or Heroku but have come up surprisingly dry on comparisons.


I'm the decision maker for the company in question - Authentise. It was a combination of cost and the fact that we need to support an in-datacenter install option for our customers. RDS is really great if you need all of the failover and backup options it provides and you don't have the manpower to invest in it. That's how we got our services started. But, as our expertise and automation grew it ended up being a modest cost savings to handle our own backups and redundancies. Primarily, though, we just can't install RDS in someone else's datacenter on premises.


Thanks for the background info, very helpful. How has the on-prem install gone for you guys? There was an article on HN yesterday suggesting it was a royal PITA (for ops and support); it would be interesting to hear about your experience.

Also FYI, your homepage has a 2015 copyright notice on it. No big deal, but thought you might like to know since it could put off particularly "nit picky" types of customer :)


> Also FYI, your homepage has a 2015 copyright notice on it. No big deal, but thought you might like to know since it could put off particularly "nit picky" types of customer :)

If that's when the work was created, then that's correct. Copyright notices aren't there to tell you what year it is today. They are there to tell you when the work was created. If they change it to 2016 when the work was really created in 2015, then that's an invalid copyright notice and equivalent to no notice at all.


This is the first time I've heard this, so I'm really curious (I always have the most recent year in a range on my sites).

What if the page is dynamically created on the fly? What copyright year should I have? What if the page contains snippets/work created in different years?


Exactly. I've actually had a manager tell me to update a copyright notice from 2016 to 2015. These are the real nitpickers.


Well done on not being married to AWS. It's usually a costly lesson for those who don't prepare for that earlier in their environment's lifecycle.


In my personal comparison, a huge factor for RDS and against Heroku is something that I have never seen mentioned in these comparisons: you can run multiple databases on a single RDS instance and you can't do that with Heroku.

My company has a bunch of microservices that each need a Postgres database, but none of them are particularly high traffic. We do however need high availability guarantees for all of them, and so Heroku's pricing starts at $200/mo per database.

With RDS, we get adequate performance for a similar total cost (around $300/mo), but get to run about a dozen separate high-availability databases on that same instance, each with their own usernames and passwords. That means significant savings.
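
For example, each service can get its own database and role on the one instance; a sketch with hypothetical service names, run as the master user:

    # repeat per microservice; 'billing' / 'billing_svc' are made-up names
    createuser -h <rds-endpoint> -U master --pwprompt billing_svc
    createdb   -h <rds-endpoint> -U master --owner=billing_svc billing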

(of course you could just share a single Heroku Postgres database and user account between these services, perhaps separating them by schema, but that wasn't really to my taste.)


FYI, the Aiven database service allows you to run multiple databases per instance. HA plans for Aiven start at US$200/mo and are available in multiple clouds including AWS and Google; see https://aiven.io/postgresql#comparison


First time hearing about Aiven - and you guys released PGHoard, right? Interesting offerings. Are you connected to UpCloud? (Both are based in Helsinki from what I could gather.)


Yeah, PGHoard is the PG cloud backup solution we initially developed for Aiven and later open sourced. It's now used in a lot of other environments as well.

Aiven services are available on UpCloud and we're friends with the UpCloud people as we're both based in Helsinki and have met at various events, but there's no other connection between the companies.


It's because it's a pretty situational thing to talk about. IMO, the most sensible reasons to move away from RDS would be customer requirements or you're not using AWS for anything else.

What traits are you comparing on? What kind of a thing are you building? What features (cost, performance, ease of use, time) are the most important to you?


Agreed, I always look for the reasoning behind these types of decisions. It helps others understand the issues that people face with different platforms.


OVH is likely to launch a dedicated instance for Postgres in the near future. It's currently in the beta stage of RunAbove.

https://www.runabove.com/PaaSDBPGSQL.xml

No affiliation, just a happy customer of OVH's hardware.


I wrote one up for Codeship earlier this year:

https://blog.codeship.com/heroku-postgresql-versus-amazon-rd...


Cost, performance, size limitations


At what scale have you found performance and size limitations to be an issue? I've seen people leave due to cost before (not that I thought they were always correct), but not for the other two you listed.


30,000 IOPS is half of a crappy consumer-grade SSD. 6TB is fairly low for projects that generate any sizable amount of data. We don't run things in AWS unless it's a strong client preference.


Not to mention 30k IOPS and 6TB is very expensive.

You're looking at $2700/mo for the IO provisioning and $750/mo for the 6TB of storage. Double those if you want Multi-AZ. Then you get to price the server size which I imagine is one of the more expensive options if you need 30k/6TB.


Overall a nice post. It doesn't quite get into all the options for pg_dump, as there are a number of other flags that can be useful to reduce downtime. It's also probably worth taking a look at the Amazon Database Migration Service as well (though I haven't personally used it or heard from others that have yet) - https://aws.amazon.com/dms/.
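
As one example, the directory format lets both the dump and the restore run parallel jobs, which can shrink the migration window considerably (a sketch; job counts depend on your cores and I/O):

    pg_dump -Fd -j 4 -f /tmp/mydb.dump mydb
    pg_restore -j 4 -d mydb /tmp/mydb.dump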

One tool they did miss for continuous archiving is WAL-E, which tends to be the most used one, including by us at Citus Cloud and Heroku Postgres - https://github.com/wal-e/wal-e
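
For anyone who hasn't used it, the usual WAL-E wiring is just an archive_command plus a periodic base backup; a rough sketch, where the envdir path and data directory are assumptions:

    # postgresql.conf
    wal_level = archive
    archive_mode = on
    archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
    # base backup, e.g. nightly from cron
    envdir /etc/wal-e.d/env wal-e backup-push /var/lib/postgresql/9.5/main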


We've tested out DMS to migrate from a 3 node Oracle RAC to Oracle RDS, and I was incredibly impressed. They list a bunch of limitations in their documentation that you should look at first. It migrated our roughly 1 TB database in two hours and ten minutes and handled replication flawlessly - not a single record discrepancy between the two database systems over a week long test. We were not expecting it to work that well and doubted that it would work with RAC as the source.


Yeah, I would* do a `pg_dump | pg_restore` with these settings: https://github.com/heroku/heroku/blob/master/lib/heroku/help...

*disclaimer: I wrote that code while at heroku


Unfortunately, DMS is quite limited for Postgres. We trialled it but found these two limitations deal breakers:

* "AWS DMS doesn't support change processing on Amazon RDS for PostgreSQL. You can only do a full load."

* "AWS DMS doesn't map some PostgreSQL data types, including the JSON data type. The JSON is converted to CLOB."

http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source....

Hopefully it improves soon, since it's an otherwise awesome tool.


Maybe Yani himself can chime in when it's daytime in his time zone, but if memory serves correctly he did look at WAL-E. It had some nice features in its favor. I'm not sure why it didn't make it into the blog post.


Yeah, came here to say that the migration service was built to solve this exact problem (among others).


It would be interesting to hear the motivation of why they did this. My own experience is moving the opposite direction (from master-slave self-managed to RDS master-replica).

I wrote up my experiences and the reasons for the switch here: https://www.theguardian.com/info/developer-blog/2016/feb/04/... .


For those that have done this, I'd be interested in the performance gain (say, IOPS at an equivalent hosting + hardware cost). I'd also be interested in RDS vs. Heroku Postgres.

Do people do this for security / policy purposes, or are they motivated by cost savings? Does removing all that virtualization buy improved performance?


I'm not sure that the article moved to bare iron. The author could have just set up postgres on a regular ec2 instance.


Yes, we did move to a regular EC2 instance for our cloud service. Where we install databases on premises for our customers it can be bare iron, but it can also be a similar virtualized setup.


Does that bring performance or cost savings? What is the motivation and calculus for making that transition?


It can bring cost savings, but the savings are very slight if you're comparing apples to apples. For example, if you're using RDS in a multi-AZ replication setup with periodic S3 backups, you have to take into account running each of the replica instances and the S3 costs. What we found was that RDS was treated like a black box by our automation (which was a good thing for a long time), and so we were less attentive to being over-provisioned and to actually testing failover situations than we should have been.


30,000 IOPS is less than a single crappy consumer-grade SSD. 6TB is a fairly low limit. The largest instance has 244GB RAM. So: improved performance, running larger instances, lower cost.


Amazon DMS (Database Migration Service) actually works really well and allows you to do this without writing any code. You may need to change your Postgres config a bit.
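
A sketch of the commonly cited settings for change capture from a self-managed Postgres source (an assumption on my part about which changes are meant; check the DMS docs for your versions):

    # postgresql.conf on the source
    wal_level = logical
    max_replication_slots = 5
    max_wal_senders = 5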

DMS isn't just for moving to RDS; it can also be used to move off of RDS, or even to move from one non-RDS server to another non-RDS server that isn't even on AWS.

It can also do Postgres to MySQL, MySQL to Postgres, etc.

It also does continuous replication, so once the copy is done it will keep up to date (yes, even if it is Postgres -> MySQL).

One gotcha, though: it does not bring over secondary keys, so make sure to recreate them before sending traffic over.

But this article does provide some insight on how to configure Postgres to be similar to RDS in terms of functionality.


OK, a bit off topic, but just curious: we are considering rolling out a hosted Postgres service that would support much larger instances (up to 50TB), much higher IOPS (1,000,000+), lower latency (InfiniBand vs. Ethernet SAN), pricing in line with AWS RDS +/-, optional ElasticSearch integration, and optional change-notification pub/sub. Would you guys be interested / think such a service is a good idea? Would appreciate learning what you think.


When it comes to data, trust is key.

Think about it this way: OCZ has, on paper, SSDs that are simultaneously both faster and cheaper than, say, Intel. But I wouldn't put an OCZ drive into the lowest-budget ricer gaming PC I could imagine (let's just say they have a bit of a reputation, to put it lightly), and I should be rightfully fired if I suggested putting one in a server.

Yes, your service might be massively faster, but... at 10-15% cheaper? No way! Not until you've been running for years with a great reputation. Now, if you enter the managed database space and pull a Digital Ocean ($5-$40 pricing vs comparable EC2 at $80-$400 AND better raw performance on top), then you can essentially create a new market niche. That's my suggestion to focus on, because businesses who operate 50TB databases aren't looking to save a few dollars. But people who use DO, need a database to go with it that is both faster and cheaper than AWS.

Honestly, performance ranks pretty low on the things I've found people look for when choosing a managed database service. You can serve a million users per day and average a whopping 12 IOPS, which might just about push a Raspberry Pi with sqlite slightly. Got more than a million users? Congrats, you now officially qualify to throw money at problems, the kind of problems RDS and Amazon loves to solve by injecting money.


The latest OCZ Trion series at least honor O_DSYNC correctly. You get shitty write performance, but it is the same for Samsung 950 Pro. The read performance is another story for OCZ though :(

Anyway, enterprise SSDs have so much better overall performance than consumer-grade SSDs that I consider the latter a joke. A joke even for consumer applications, web surfing, and email clients. Remember that a lot of applications out there use SQLite, and poor DB performance means poor application performance.

I highly recommend reading this: http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test...


You do not have an option to host anything larger than 6TB on RDS. I guess what we are trying to figure out is whether there is a need for a service for people whose needs go beyond what RDS addresses.


Not only disk space; you are limited to just two nodes (one primary and one standby). Of course you can create read replicas, but if you are write-intensive, you can end up consuming all of your pools.


Actually, you would shard then, which of course is possible with RDS as well, since you mostly shard at the application level. There are some users that use extensions, but I guess calling AWS support will get those extensions into your RDS instances.


My 2 cents: you're promising a lot more for the same price, which seems dubious. How are you going to beat the wholesaler?

Even if you did I probably wouldn't use it. I don't want a 3rd party dependency on the most critical part of my infrastructure.


The price would be comparable to the appropriate AWS tier, e.g. 6TB storage for a bit less (10-15%) than Amazon's 6TB RDS option, but with much better performance. I totally get your reluctance to rely on a 3rd party, but the question is more for people who already do rely on a 3rd party such as AWS or Azure.


> I totally get your reluctance to rely on 3rd party, but the question is more for people who already do rely on 3rd party such as AWS or Azure.

Just a little input: Amazon and Microsoft are well known companies. I (and my superiors) think they can support us. Your company that I haven't heard of is not.

It may be dumb sounding, but I won't get in trouble if I used AWS and something went wrong. How could I have known?

With an unknown company, if something goes wrong, it's on me because I selected that vendor.

Stupid perhaps, but important to understand.


How are you imagining solving latency issues? Many people want their DB to be close to their appservers.

edit: and if it's a hosted service, I'm assuming decent networking options (IPsec et al)?


Exactly. I want my app server and my db server to be as close as possible. That's why services like Compose and ElephantSQL offer just that. So then it comes down to performance/price (value) between RDS and those.


Latency will be higher. If we do go forward, we are considering: a) partnerships with smaller providers where there is an option to colocate equipment, b) colocating as close as possible to targeted providers.

The main thing we are trying to figure out is whether there is a need for a service with significantly higher tiers than what Azure and AWS offer.


I would imagine people who need either more than AWS can supply (or cheaper), but cannot pay for it, and also haven't figured out how to run it themselves, to be a very limited set. Happy to be proven wrong though, and curious about what you will find out :) Is there a way to be kept in the loop?


The thing that is impractical to do on premises is "transparent scaling": say your load increases and you need a much bigger box; that's the kind of problem the service can solve (take a slave offline, remount its storage on a bigger box, catch up on replication, switch over to the slave, repeat for the former master). We are at an early stage, just investigating whether we should move forward, so @andre1sk on Twitter would probably be the only place :)


Don't go for IOPS, but price. Make sure that there is replication and failover... people at this stage just want to make sure that their data isn't lost. Use cheap hardware like OVH - you can beat AWS on price. Plus, googling for "hosted postgresql" is still something that can be competed on.


My priorities are: Reliability > Size ~= Price > Performance

Reliability of a database is always no. 1; without it, nothing else matters. So the customer must have enough access to the machine to set up replication and failover themselves (e.g. using something like repmgr), or have damn good guarantees that you're doing it well for them, i.e. multiple replicas, verifiable backups, offsite backups, etc.

Second is size. This is a major sticking point for me when looking at hosted Postgres solutions - they're mostly for relatively small databases, <1TB or so.

Closely behind in third is price. Even when you do find a service that provides a decent size, like when RDS eventually went up to 6TB (still not quite enough for me), the price they charge for it (and the RAM) over buying dedicated iron is exorbitant. Given that 6-8TB drives are mainstream now, it's a real head-scratcher why they'll charge $750-1500/mth for a 6TB drive that costs $250 outright. And then charge you for the IOPS on top of that... Even factoring in multiple redundant drives/replicas, SSD vs HDD, it's still very unfavourable. I can install 5 machines (at least) in a cluster for the price of 1 on RDS, giving me not just price savings, but my no. 1 need - reliability.

Finally, performance. The number of high-IOPS workloads out there is rather small. You're far more likely to have a high storage need with small IOPS than high IOPS with smaller storage. Everyone overestimates the amount of traffic they'll get, and generally, if you have an IOPS problem, it's better fixed elsewhere in the app. Usually some bad code or an ORM is thrashing the DB.

Take a simple login DB as a thought experiment: just 10,000 IOPS is up to 864 million users logging in per day (10,000 IOPS x 86,400 seconds in a day). Whereas the storage for that number of users, at a very generous 1KB per user, is 864GB. Size is way more important than IOPS. And if you've got that number of users, you've also got your own datacentre :)

You might also run into the problem that the kinds of organisations that need high storage + high IOPS are also the kinds that wouldn't use the cloud for it in a million years (banks, high-freq traders, etc).

In summary, I think the main thing that would attract me is high capacity storage for a decent price. Reliability is a given, but it must be easy to manage. Ideally I'd like a mixed storage system, with the ability to arrange the tablespaces across the drives as I need. 50Tb is great, but I need the ability to put the high IOPS tables on an SSD, then spread the infrequently accessed tables onto large HDDs.


Thanks for your input; many of the frustrations you outline are exactly the motivation for why we are considering building this service.


I can tell you what I really want out of a managed Postgres provider. It has nothing to do with anything that was listed: it's configurability. None of the managed providers allow me access to many of Postgres's features, like custom full text dictionaries or third-party extensions, which is a real shame because those are some of its biggest differentiators. I would use a managed service that offered those things in a heartbeat.
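
For the curious, this is the kind of thing that's usually off the table on a managed service; a sketch where the extension and dictionary names are made up:

    # a third-party extension has to be installed on the host and whitelisted by the provider
    psql -d mydb -c "CREATE EXTENSION some_third_party_extension"
    # a custom ispell dictionary needs its .dict/.affix files copied into the server's
    # tsearch_data directory first, which is exactly what a managed service won't let you do
    psql -d mydb -c "CREATE TEXT SEARCH DICTIONARY my_ispell (TEMPLATE = ispell, DictFile = my_words, AffFile = my_affixes, StopWords = english)"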


From the article (emphasis mine):

> [pg_dump ...] The main disadvantage of this method is that it will not provide high reliability.

Pretty sure the author means "availability", not "reliability". pg_dump is completely reliable, arguably more so than ANY other backup mechanism, as it creates logical, machine-independent backups.

Best intro tip regarding pg_dump: use -Fc (custom format)


Yes, and when you are using pg_restore, use the -j flag (restore in parallel)!


tldr - a howto



