PostgreSQL indexing in Rails

mapgrep · on Dec 27, 2013

This is an incredibly suboptimal approach.

Postgresql (and even Mysql for that matter) lets you explicitly declare foreign key references when you create/alter a table and then the database will enforce integrity of those references for you. Which is great because when someone writes some kind of code outside of Rails to work with your db, and that code has bugs, and those bugs impact data integrity, there is a hard stop at the DB layer preventing dangling/invalid references that will blow something up later.

A side effect of creating these explicit foreign key relationships is that any needed indices are created on both sides of the key. [UPDATE: Nope, I'm wrong, see below.]

Here the author has a Rails setup that fails to declare foreign key constraints when it's making a table with relations. You can see this in the DB description of the `products` table which references the `categories` tables via a plain int called `category_id`. As a result of not having an explicit foreign key constraint this table also has no index on `category_id`.

So given a table with poor data integrity at the SQL/db level and lacking an index as a sympton of this problem, the author treats the symptom and advises the creation of an index on `category_id`, leaving the real problem woefully intact. And that real problem, to be clear, is the fact that the database is not being used properly but rather treated as a relatively dumb place to dump columnar data; things that Postgres gives you for free are set aside and pseudo-duplicated in Rails code.

Now I'm not blasting the author because I don't know if this is a tactical issue or an issue with Rails itself. Does Rails not allow foreign key constrains in the migrations?

Whether this is a flaw in Rails or how it's being used, this is a bad solution to the problem. The RDBMS is your friend, use it. (And if you're not going to use the integrity constraints of the DB why are you using Postgres instead of say BerkeleyDB or a loosely configured MySQL or Mongo or whatever?)

rosser · on Dec 27, 2013

A side effect of creating these explicit foreign key relationships is that any needed indices are created on both sides of the key.

This is simply not true.

Foreign key relationships typically use the primary key of the referenced table, which is implicitly indexed, by virtue of its being a primary key, not its being a foreign key. The referring column is never implicitly indexed, though it often should be explicitly indexed.

  rosser=> create table referenced (id int primary key, blah text);
  NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "referenced_pkey" for table "referenced"
  CREATE TABLE
  rosser=> create table referencing (id int primary key, referenced_id int references referenced(id), stuff text);
  NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "referencing_pkey" for table "referencing"
  CREATE TABLE
  rosser=> \d referenced
    Table "public.referenced"
   Column |  Type   | Modifiers 
  --------+---------+-----------
   id     | integer | not null
   blah   | text    | 
  Indexes:
      "referenced_pkey" PRIMARY KEY, btree (id)
  Referenced by:
      TABLE "referencing" CONSTRAINT "referencing_referenced_id_fkey" FOREIGN KEY (referenced_id) REFERENCES referenced(id)

  rosser=> \d referencing
       Table "public.referencing"
      Column     |  Type   | Modifiers 
  ---------------+---------+-----------
   id            | integer | not null
   referenced_id | integer | 
   stuff         | text    | 
  Indexes:
      "referencing_pkey" PRIMARY KEY, btree (id)
  Foreign-key constraints:
      "referencing_referenced_id_fkey" FOREIGN KEY (referenced_id) REFERENCES referenced(id)

mapgrep · on Dec 27, 2013

I stand corrected! Fair point. While I wish the original author had not sidestepped the data integrity issue, I can see why he'd want to address the index issue independently.

electrum · on Dec 27, 2013

Rails migrations do not support creating foreign keys in the database, but you can do it with this library: https://github.com/matthuhiggins/foreigner

bratsche · on Dec 27, 2013

Anyone know why ActiveRecord doesn't just include this functionality already? Or does anyone have any idea if it's planned to add?

tominated · on Dec 28, 2013

I was searching for the same thing myself recently, and it seems that DHH is against having any sort of logic in the database - he believes that all validations and constraints should be in the application instead of being in both (which I believe is the correct way).

xentronium · on Dec 28, 2013

The problem is that sometimes you simply can't do it properly in application. Uniqueness is the most obvious example, although not the only one.

tominated · on Dec 28, 2013

I probably phrased that incorrectly. I definitely think that you should have validation logic at the database level. I've had a couple of major issues in the past (while still learning web dev) where lack of indexes/constraints has pretty much ruined my database when I tried use them in production

jeltz · on Dec 28, 2013

You also want to implement foreign keys in the database since you do not want to have to remember to lock referenced rows when inserting new referring rows. The database helps you out with concurrency issues.

gingerlime · on Dec 27, 2013

I was about to suggest the same thing. And if you go with foreigner, then you can / should also use the aptly-named immigrant[1] gem to automatically detect where foreign keys are missing and create the migrations for you.

[1]https://github.com/jenseng/immigrant

sandGorgon · on Dec 27, 2013

So ... this might sound like a really dumb question but I come from a less-than-illustrious MySQL background and have built up "rules of thumb" over the years - especially around master-slave replication and security. Do note that I'm not specialized in devops, but just doing something good enough at a startup stage - for a database it means that downtime is acceptable in exchange for data safety.

I'm really, really unsure about how to use postgres in such a way: most posts that I read are around these really complex setups with zero downtime or these Postgresql-quickstart guides (which I'm not looking for) or Heroku howtos.

Does anyone have a non-devops set of Postgres best practices that could be used to get your transactional MVP up and running on Softlayer ?

hahainternet · on Dec 27, 2013

I don't know Softlayer too well, but Postgres is better every day.

1. Install Postgres 9.3, the version after the replication fix.

2. Tune Postgres until it reaches a reasonable speed (lots of tutorials available, primarily just assigning it more RAM)

3. Create a second instance, do a backup, restore backup on second instance

4. Turn on streaming replication, ensure WALs are being received by second instance

5. If you're paranoid, use synchronous replication, which won't finish a transaction until it's been committed on the primary and a streaming secondary.

I'm not our DBA, but I believe this covers the primary setup. You will have to test whether Postgres correctly moves into a new history chain on migration as support for this via WALSender is pretty new I believe.

rosser · on Dec 27, 2013

I am our DBA, and, though brief, this is a reasonable list.

A caveat, though: be careful with synchronous replication; it actually increases your chances of having an outage. If your slave goes down with sync rep, the master will no longer accept writes. (To mitigate that risk, have multiple slaves. But that can further increase the latency sync rep introduces, as now the master has to wait for all the slaves to report back a successful write.)

hahainternet · on Dec 27, 2013

Good point, I believe if you list multiple sync rep targets it only requires one of them, but I would have to go check that properly :)

Still, Postgres in a hot standby configuration is relatively simple, still high performance and an actual real database that doesn't need tens of grand in licensing. You really can't go much wrong with it.

sandGorgon · on Dec 27, 2013

thanks! just to be clear, you do mean host standby with streaming replication right ? And why did you specifically recommend 9.3 - I cant google for anything specific to standby fixes.

The authentication defaults for postgres and mysql are vastly different - I am always tempted to move all authentication to md5 (pretty much the same as mysql). Am I doing it wrong ?

rosser · on Dec 27, 2013

You want either 9.3.2 or 9.2 >= 9.2.5. Before either of those, there was a bug in the streaming replication code such that, on failover, the last few outstanding transactions on the master might not be replayed on the slave, resulting in data loss.

And, no, you aren't doing it wrong. You generally want to use MD5 auth.

hahainternet · on Dec 27, 2013

The reason I recommend 9.3 specifically is that when a migration occurs, Postgres switches to a new historical branch. As of 9.2 this change gets written only to the WAL directory, and you must use something like Ceph / Gluster in order to have these migration files available to each replicating server.

In 9.3 at some point this is slated for inclusion in the WALSender. This allows you to do full streaming replication without the requirement for a cluster filesystem to hold your binary logs.

sandGorgon · on Dec 27, 2013

but this is bleeding edge stuff ( never mind that it will work perfectly given past experience).

What did hackers (not whole development + ops) teams do before 9.3.2 ? with all due respect, is this the reason why startups still default to mysql over postgres ?

hahainternet · on Dec 27, 2013

Postgres historically has been tricky to configure with 'proper' replication. This is partially because replication is not a remotely simple problem to fix as it appears. The reason people still default to mysql is simply because it occupies more mindshare and is marginally easier to set up. I still don't think it even allows such things as DDL rollbacks so it's hard to talk about it next to Postgres when you're talking about safety.

jeltz · on Dec 28, 2013

People did custom scripting or used third party tools for easy failover before 9.2/9.3. What PostgreSQL has improved the last years is simplicity of setting up replication and speed of failover.

> with all due respect, is this the reason why startups still default to mysql over postgres ?

I do not these particular issues. But if we go back before streaming replication (when people had to transfer WAL files with rsync) then I am pretty sure this was part of the reason for why small companies chose MySQL over PostgreSQL:

erichurkman · on Dec 27, 2013

Also, with Postgres memory tuning: you may have to change kernel parameters to permit more shared memory. Make sure you make these changes persistent between reboots!

For example, in FreeBSD, you can set the parameters at runtime via sysctl, but unless you also modify /etc/sysctl.conf, they won't be applied after a reboot.

It's not fun scrambling to figure out what your kernel parameters were after you've applied a security update and rebooted your machine. Not fun at all.

Test out reboots.

jeltz · on Dec 28, 2013

In 9.3 or later you should not have to do this.

losvedir · on Dec 27, 2013

I tried adding a partial index the other day but Rails didn't seem to particularly like it. Are the examples here from Rails 4? (I'm on 3.2.16).

The issue I had was that, while the partial index syntax worked and actually did create it, it wasn't reflected anywhere in the generated schema. So I was worried that future developers wouldn't realize there was a partial index (we use schema.rb as the canonical reference of DB state), and if we ever get rid of migrations by condensing them into one big one we might miss that there should be a partial index on the column.

And changing the config to generate schema.sql instead of schema.rb didn't help. It's broken with the latest version of postgresql...

twerquie · on Dec 27, 2013

Switch from schema.rb to structure.sql by setting config.active_record.schema_format = :sql in your application config if you want to do anything remotely interesting with your schema.

ricny046 · on Dec 27, 2013

The article covers only Rails 4.

losvedir · on Dec 27, 2013

Ah, bummer. Anyone have any luck getting partial indexes to work in Rails 3.2?

netghost · on Dec 27, 2013

You can always execute arbitrary SQL in migrations. For instance:

    def up
      execute('ALTER TABLE people ADD INDEX ...')
    end

The one thing to keep in mind though is that if you use ruby schema format, your tests won't pick up those execute statements. In that case it's best to either use the sql format or re-run migrations in the test environment to set up the test db.

srgpqt · on Dec 28, 2013

Upgrading from Rails 3.2 to Rails 4.0 takes very little time, assuming the gems you depend on are compatible (which they should be, by now, or you might want to look for replacement gems..)

atombender · on Dec 27, 2013

You could just do it manually:

    execute %{
      create index index_users_on_name
      on users (name)
      where state = 'NY';
    }

ufmace · on Dec 27, 2013

Interesting, I was just puzzling over a postgresql indexing question myself. The generated migration to create one of my tables automatically generated indexes on the foreign keys. However, after further considering my model, I want to make the index on one of these foreign keys unique. Is there any way for a migration to alter the index, or do I just have to delete the old index and create a new one? Based on what I've seen so far, it looks like I have to delete and recreate, but I wouldn't mind confirming that.

brasetvik · on Dec 27, 2013

You mean on the from side? Yes, just drop the existing index and create a unique index. You can do that within the same transaction.

Postgres already requires the pointed to relation to have a unique index:

    => create table foo (id int); -- No unique index here.
    CREATE TABLE
    => create index foo_id on foo(id);
    CREATE INDEX
    => create table bar(foo int references foo(id));
    ERROR:  there is no unique constraint matching given keys for referenced table "foo"

pnathan · on Dec 27, 2013

Question about pg/rails approach...

This seems to tie the database closely to the ORM. What if there are multiple applications using the schema? Wouldn't it be better to have the database code & controlling system separate?

I've found it more convenient to have a set of SQL files for specifying tables & indexes and to have the app not control the database schema directly (Perhaps I just don't see the clear advantage, or perhaps Rails tooling is so good it obviates my concerns. :-) ).

josephlord · on Dec 27, 2013

Rails makes it easier to do migrations and roll them back if necessary but you can dump the SQL from Postgres easily and you can choose whether Rails stores the schema in SQL or in its own format. I use SQL as at least in 3.2 Rails' own format didn't support all the features of Postgres (partial indexes etc.)

When doing a new deploy you shouldn't usually go through the migrations but just load the schema.

bratsche · on Dec 27, 2013

If you prefer, you can store your schema in SQL instead of schema.rb.

In config/application.rb:

config.active_record.schema_format = :sql

stplsd · on Dec 27, 2013

Ok, obviously I never done web development, but why you would want declare schema from rails? And ActiveRecord is some kind of ORM (the buzzword I was hearing lately, but never actually care to fully understand it)?

Now I am reading book High Performance MySQL: Optimization, Backups, and Replication (yes, I know this post about PostgreSQL) and there are so much stuff to know, so this ORM thing seems like a toy.

So what is wrong to creating schema with plain old (database specific) SQL?

gerjomarty · on Dec 27, 2013

Rails creates the database tables for whatever ORM you're using using its Migrations. Rails keeps track of which migration files have been run.

The usual syntax for creating these migration files is in Ruby, as this makes them database-independent, arguably more easily readable, and in newer versions of Rails, reversible.

Of course, you can put plain old SQL in these migrations if you want to as well.

phsr · on Dec 27, 2013

You can also create indexes concurrently, using `algorithm: :concurrently` when creating the index[1]. You can't create concurrent indexes within a transaction, so be sure to call `disable_ddl_transaction!` in your migration definition.

[1]: https://github.com/rails/rails/commit/2d33796457b139a58539c8...

asdasf · on Dec 27, 2013

Postgresql advice from a mysql user. Foreign keys don't need indexes on the from side, and they should be pointing to a primary key on the to side so that is already covered. In mysql you likely want an index on the from side because you are probably joining on it and mysql only knows how to do nested loop joins. But postgresql is a real database, so we have hash joins and merge joins, which don't need indexes on the join conditions, rather on the where clause (like you would want anyways). http://use-the-index-luke.com/sql/join/hash-join-partial-obj...

mfenniak · on Dec 27, 2013

> Foreign keys don't need indexes on the from side

Yes, they're not a hard requirement. However, be aware that any field with a foreign key constraint will be searched during a DELETE that affects the related table, to check RESTRICT DELETE or CASCADE DELETE constraints. It can be very beneficial to performance to have an index on that field for those internal database searches.

The same is true for an UPDATE that affects the PK, but that's relatively rare if you're using surrogate primary keys.

philwelch · on Dec 27, 2013

In the case that you use your database for more than just a persistence backend to your Rails app (blasphemy, I know, but odds are your Rails app is generating transactional data that's worthy of analysis you need real SQL for and not just ActiveRecord), indexing foreign keys is still worthwhile because sometimes you'll join the same foreign key in two different tables without ever needing to join in the table they normally map to.

asdasf · on Dec 27, 2013

What you describe doesn't need an index on the foreign key, that's the point I was making.

philwelch · on Dec 28, 2013

If you designate it as a foreign key in the schema it'll just use the PK index even if you don't join that table in at all? Wait, how does that even work? Does the PK index include the foreign key rows then? That's an awesome feature.

dspillett · on Dec 27, 2013

This is why FKs don't automatically get indexed in many DBMSs as some people expect (I find peopel assume FKs imply indexes due to the fact that PKs and unique constrains tend to). Of course if any code searches for rows matching a given key witohut joining to the target table of the FK and without other relavent filtering clauses, then an index definitely is useful, and in some circumstances the query planmner will determine (given the choice) that the hash join is not the most efficient way to go for a given query anyway, so it is sometimes worth indexing your FKs. Testing/benchmarking (with real or real-a-like data of realistic scale) is your friend here when you are not sure. Unless disk space and write performance are a significant issue (which for many applications they may not be) I tend to err on the side of indexing FKs.