I don't usually go on tirades like this, but this is what I do for a living.
You'll want to use schemas rather than separate databases. Not only is it significantly more convenient (i.e., you'll administer 1 database instead of n), it will let you scale. For instance, a Postgres database has a pool of shared memory that it uses for query plans, caching, etc., and you'll be able to optimize its use across all of your clients, rather than one by one.
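To make that concrete, here's a minimal sketch of the schema-per-tenant layout in Postgres (the tenant names are made up):

    -- One database; one schema per tenant, each holding its own copy of
    -- the tables. All of them are served from the same shared memory pool.
    CREATE SCHEMA tenant_acme;
    CREATE SCHEMA tenant_globex;

    CREATE TABLE tenant_acme.widgets (
        id       serial PRIMARY KEY,
        category text,
        price    numeric
    );
    -- ...repeat the same DDL in tenant_globex, and so on.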
Related is the impact on connection resources. A connection to a database may be used for multiple schemas, so you'll be able to make much better use of those. Suppose you have 1000 clients, each of which typically has 1-2 users on your application at any given time. In the multiple database config, you'll need one connection pool per database, each pool having at least 1 open connection, so that under no load you're still maintaining 1000 connections. With the schemas approach, you can maximize throughput and balance the connections in proportion to your active users rather than to the number of customers you have.
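The mechanics, roughly (assuming the tenant_acme schema from the sketch above): any pooled connection can serve any tenant by switching its search path per request.

    -- On checkout from the shared pool, point the connection at the
    -- requesting tenant's schema (public holds anything shared):
    SET search_path TO tenant_acme, public;

    -- Unqualified table names now resolve inside tenant_acme:
    SELECT price FROM widgets WHERE category = 'gadgets';

    -- On check-in, reset so the connection can be handed to another tenant:
    RESET search_path;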
Oh, one more biggie: remember those cached query plans? They're typically per-session (connection). So if you frequently run "select price from widgets where category=?", you'll want to parse, rewrite, plan, compile, and cache that as infrequently as possible, so keep those connections and sessions open as long as possible to get the benefit of prepared statements. You will see load go down if you do this, from reduced CPU for parsing and plan generation, and reduced system wait since you won't be churning your caches as much.
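In Postgres terms, that's the PREPARE/EXECUTE cycle; a quick sketch using the query above (the statement name is mine):

    -- Parse, rewrite, and plan once per session...
    PREPARE widget_price(text) AS
        SELECT price FROM widgets WHERE category = $1;

    -- ...then execution is cheap for the life of the connection:
    EXECUTE widget_price('gadgets');
    EXECUTE widget_price('doodads');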
Could you please clarify what you mean by "schemas"?
My understanding is that a schema is an abstract definition of what your database fields/relations are. So that's too vague for me.
Adding a customer_id field to every single table in the database would technically fit the definition of "schema", but I suspect you mean something else altogether. Please elaborate.
Neeleshs' reply is what I meant. In Postgres, it's like a namespace.
But you have a point: adding customer_id is not unreasonable. In fact, this is my preferred logical approach, but I'll typically only do this on Oracle. It's DRY, meaning that, in the abstract definition of schema, you're modeling your relations once. If you have widgets belonging to customers, you have one widgets relation containing a customer id, rather than one widgets relation in each customer schema. It's an elegant design, and it takes even greater advantage of the benefits I described earlier.
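That is, one shared set of relations keyed by customer. A minimal sketch of the shape:

    -- One widgets relation for all customers, instead of one per schema:
    CREATE TABLE customers (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );

    CREATE TABLE widgets (
        id          serial PRIMARY KEY,
        customer_id integer NOT NULL REFERENCES customers(id),
        category    text,
        price       numeric
    );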
I'd advocate against it when building a multitenant SaaS application on Postgres if the following hold:
1. You have many, many customers
2. Each customer has a lot of data
The reasons:
1. You'll quickly hit a place where it's hard to throw more vertical processing power at it. You'll need a serious SAN with lots of spindles to be able to service many requests against a lot of data, and plenty of CPU and RAM to handle the grunt work. Let's face it: you're on Postgres because you're broke :) I really love it, but sometimes I'd rather buy a solution to a problem (e.g., VPD). However, if you're rolling out schemas, you can build a new database and move them discretely. This isn't automatic or transparent, but it's very clean and very easy.
2. You're likely dealing with clients with special needs. It might not be appropriate to roll out new features to every customer all at the same time. If you ALTER TABLE widgets... it's all or nothing. Sometimes that's great, but sometimes it's not, and in building multitenant SaaS apps, I find it's most often not. In repeating your design, you're buying the flexibility to deal with your customers on a more granular level (see the sketch after this list).
3. Others have already mentioned this: sometimes it's easier to articulate the integrity of your system to technical auditors with a hard separation like independent schemas. It's a little easier to show that a particular database account has access only to a particular schema than it is to convince them that your code does a sufficient job of enforcing the rules. I've done a lot of work where regulatory compliance is a concern, and this is always an easier sell.
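To put points 2 and 3 in concrete terms, a sketch (the schema and role names are invented; GRANT ... ON ALL TABLES IN SCHEMA needs Postgres 9.0+):

    -- Point 2: roll a new feature out one customer at a time:
    ALTER TABLE tenant_acme.widgets ADD COLUMN sku text;
    -- tenant_globex.widgets stays untouched until you're ready.

    -- Point 3: a per-customer account that can only reach its own schema,
    -- which is easy to demonstrate to an auditor:
    CREATE ROLE acme_app LOGIN;
    GRANT USAGE ON SCHEMA tenant_acme TO acme_app;
    GRANT SELECT, INSERT, UPDATE, DELETE
        ON ALL TABLES IN SCHEMA tenant_acme TO acme_app;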
The thing is that most vendors have solutions to these problems, and if Postgres doesn't have one already built, you can probably build it from its primitives. This is just my general thinking under this particular set of circumstances.
"As in a typical multi-tenant architecture, one database per customer."
Maybe I've misunderstood what "multi-tenant" means for the last 5 years, but I always thought of it as "multiple customers on a single installation/instance and using a single database" ... as in, every customer has some sort of account model and other models are scoped to the account.
I'm sure there are valid reasons for having a unique db per customer (we do it in one product) but it seems easier to maintain/back up a single one.
If your data is easily and completely partitioned by customer, it may make sense to do so. It's about as easy a partitioning scheme as you can come up with. It gives you the power to do staged rollouts and other flexibility in deployment, as well as security isolation that makes your life easier and that some customers may prefer.
I don't really see much of a security benefit to "one DB per customer". The actual data will be stored on the same disks, and the queries issued through the same DB processes. This seems like a huge amount of overhead for very little realized gain.
snprbob86's point from another comment makes sense: it gives you more control. For example, doing staged rollouts of different features for different customers, etc. I'm not saying it's worth the headaches, but there are advantages.
Also consider behind-the-firewall installs or premium dedicated hosting.
One of the nice things about enterprise software is that you can charge for things like this. If you have a large enough customer, even if you are hosting, you can simply forward to them your bill for a dedicated server.
This approach will not scale. In fact, I've seen it personally with multiple of my hosting clients: they fall over after even 5000 databases. There usually has to be a file per table with InnoDB, so this approach will work until you have a few thousand customers, and then you will feel the pain of refactoring this into a true multi-tenant db system.
Seems like row-level permissions would be a better paradigm to make this work. Last I checked, Django had some support for it, recently improved in their 1.2 release.
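For what it's worth, Postgres itself can now enforce this at the row level (9.5 and later); a sketch, assuming a shared widgets table keyed by customer_id and an app.customer_id setting the application fills in:

    -- Every query against widgets is filtered to the current customer:
    ALTER TABLE widgets ENABLE ROW LEVEL SECURITY;

    CREATE POLICY customer_isolation ON widgets
        USING (customer_id = current_setting('app.customer_id')::integer);

    -- The application identifies the tenant once per session/transaction:
    SET app.customer_id = '5';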
Here's a Django Advent post about row-level/object permissions, published while they were ramping up for the 1.2 release:
I assume by "like this" you mean a separate db for each customer.
The reason is probably a perceived improvement in security. I've worked in systems where they've done this for security reasons. I believe Joel Spolsky did it for this reason when he was betaing a release of a product (FogBugz?).
It certainly does jack up the maintenance overhead. I wouldn't design a system this way myself.
I'm really curious if this is in fact a common practice out there in the wide world.
I mean, any other export would just be the result of some query. And it doesn't have to be a complicated query, either. Just "WHERE customer_id=5" or whatever.
There are a lot of cases where the query will indeed be complicated. Reports, for example. I have seen report SQL queries which are quite complex; making sure that such SQL is always qualified with WHERE customer_id=5 is a maintenance nightmare.
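One way to hedge against that: centralize the qualification in a view so report queries can't forget it. A sketch (the app.customer_id setting is an invention; it could equally be a session variable your framework sets):

    -- Qualify once, in one place:
    CREATE VIEW my_widgets AS
        SELECT * FROM widgets
        WHERE customer_id = current_setting('app.customer_id')::integer;

    -- Complex reports then read from the view, not the raw table:
    SELECT category, avg(price)
    FROM my_widgets
    GROUP BY category;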
There is also the customer perception thing, as petethomas said in this thread.
Scaling: it provides cheap, quick sharding from Day 1. If your db is overloaded, just move some customer databases to a second machine. FB used a similar approach with MySQL and isolated university databases in the early days, I believe.
This solution seems overcomplicated. It might work fine, I've never tried it, but I can't imagine databases (even Postgres) being optimized for such things. The way we did it at http://historio.us was to just have a foreign key to the user (this can be easily extended to groups of users, i.e. companies) and perform a lookup of the related fields depending on the domain.
This has worked very well so far; it's very fast, scalable, and has given us zero problems. Since all the tables are the same, I imagine it should work very well for the poster, with an appropriate index, too...
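The shape of that design, roughly (table and column names are invented):

    -- Every row carries its owner; a single index keeps the per-user
    -- lookup fast no matter how many users share the table:
    CREATE TABLE users (
        id serial PRIMARY KEY
    );

    CREATE TABLE bookmarks (
        id      serial PRIMARY KEY,
        user_id integer NOT NULL REFERENCES users(id),
        url     text NOT NULL
    );

    CREATE INDEX bookmarks_user_id_idx ON bookmarks (user_id);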
The main advantage of one-database-per-business-customer isn't isolating customer data, it's the ability to serve business customers who are willing to pay for heavy customizations and the extra support those entail, while not polluting your core offering through the introduction of attributes, features, etc. for which other customers will have no use.
For a lot of B2B web applications aimed at serving "bigger" companies, customer adoption can still hinge on scotch-sipping and rounds of golf more than well-placed sign-up buttons or great viral marketing campaigns. It's been argued of late that maybe this part of the industry is becoming more "web-like"[1], but in any case it's not very web-like yet, when it comes to sealing deals.
Many times the companies who were convinced by your talented sales team to buy your product eventually demand that the product be tailored to their (sometimes very company-specific) needs. If you want to keep the customer happy, or maybe just keep the customer, you have to keep customizing.
One could make the argument that the most talented sales teams are the ones that sign deals with customers who are willing to "stay on core", but over time as your customers' own businesses evolve, the challenge grows for everyone involved.
"it's the ability to serve business customers who are willing to pay for heavy customizations and the extra support those entail"
You'd need to have flags in the app itself, which would likely turn into an unmaintainable monstrosity. If you have a handful of large clients, are heavily customizing the app for each one, and are using a modern framework like Django or Rails and a DVCS, the obvious maintainable approach is to have full individual installs, forks for each install, and a modular application that pushes the customizations into pluggable components as much as possible.
Having a shared core and a lot of hooks for customization is the road we have taken for our application. I wrote a lengthy post about it just last Friday:
The 1 DB/customer approach is usually a bad idea. Strict data security requirements are one of the only situations where this approach makes sense compared to the traditional single-DB approach or individual installs.