I don't usually go on tirades like this, but this is what I do for a living.
You'll want to use schemas rather than separate databases. Not only is it significantly more convenient (i.e., you'll administer 1 database instead of n), it will let you scale. For instance, a Postgres database has a pool of shared memory that it uses for query plans, caching, etc., and you'll be able to optimize its use across all of your clients, rather than one by one.
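To make that concrete, here's a minimal sketch of the schema-per-tenant layout in Postgres (the tenant names are made up):

    -- One database; one schema per tenant, each holding its own copy of
    -- the tables. All of them are served from the same shared memory pool.
    CREATE SCHEMA tenant_acme;
    CREATE SCHEMA tenant_globex;

    CREATE TABLE tenant_acme.widgets (
        id       serial PRIMARY KEY,
        category text,
        price    numeric
    );
    -- ...repeat the same DDL in tenant_globex, and so on.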
Related is the impact on connection resources. A connection to a database may be used for multiple schemas, so you'll be able to make much better use of those. Suppose you have 1000 clients, each of which typically has 1-2 users on your application at any given time. In the multiple database config, you'll need one connection pool per database, each pool having at least 1 open connection, so that under no load you're still maintaining 1000 connections. With the schemas approach, you can maximize throughput and balance the connections in proportion to your active users rather than to the number of customers you have.
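The mechanics, roughly (assuming the tenant_acme schema from the sketch above): any pooled connection can serve any tenant by switching its search path per request.

    -- On checkout from the shared pool, point the connection at the
    -- requesting tenant's schema (public holds anything shared):
    SET search_path TO tenant_acme, public;

    -- Unqualified table names now resolve inside tenant_acme:
    SELECT price FROM widgets WHERE category = 'gadgets';

    -- On check-in, reset so the connection can be handed to another tenant:
    RESET search_path;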
Oh, one more biggie: remember those cached query plans? They're typically per-session (connection). So if you frequently run "select price from widgets where category=?", you'll want to parse, rewrite, plan, compile, and cache that as infrequently as possible, so keep those connections and sessions open as long as possible to get the benefit of prepared statements. You will see load go down if you do this, from reduced CPU for parsing and plan generation, and reduced system wait since you won't be churning your caches as much.
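In Postgres terms, that's the PREPARE/EXECUTE cycle; a quick sketch using the query above (the statement name is mine):

    -- Parse, rewrite, and plan once per session...
    PREPARE widget_price(text) AS
        SELECT price FROM widgets WHERE category = $1;

    -- ...then execution is cheap for the life of the connection:
    EXECUTE widget_price('gadgets');
    EXECUTE widget_price('doodads');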
Could you please clarify what you mean by "schemas"?
My understanding is that a schema is an abstract definition of what your database fields/relations are. So that's too vague for me.
Adding a customer_id field to every single table in the database would technically fit the definition of "schema", but I suspect you mean something else altogether. Please elaborate.
Neeleshs' reply is what I meant. In Postgres, it's like a namespace.
But you have a point: adding customer_id is not unreasonable. In fact, this is my preferred logical approach, but I'll typically only do this on Oracle. It's DRY, meaning that, in the abstract definition of schema, you're modeling your relations once. If you have widgets belonging to customers, you have one widgets relation containing a customer id, rather than one widgets relation in each customer schema. It's an elegant design, and it takes even greater advantage of the benefits I described earlier.
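That is, one shared set of relations keyed by customer. A minimal sketch of the shape:

    -- One widgets relation for all customers, instead of one per schema:
    CREATE TABLE customers (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );

    CREATE TABLE widgets (
        id          serial PRIMARY KEY,
        customer_id integer NOT NULL REFERENCES customers(id),
        category    text,
        price       numeric
    );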
I'd advocate against it when building a multitenant SaaS application on Postgres if the following hold:
1. You have many, many customers
2. Each customer has a lot of data
The reasons:
1. You'll quickly hit a place where it's hard to throw more vertical processing power at it. You'll need a serious SAN with lots of spindles to be able to service many requests against a lot of data, and plenty of CPU and RAM to handle the grunt work. Let's face it: you're on Postgres because you're broke :) I really love it, but sometimes I'd rather buy a solution to a problem (e.g., VPD). However, if you're rolling out schemas, you can build a new database and move them discretely. This isn't automatic or transparent, but it's very clean and very easy.
2. You're likely dealing with clients with special needs. It might not be appropriate to roll out new features to every customer all at the same time. If you ALTER TABLE widgets... it's all or nothing. Sometimes that's great, but sometimes it's not, and in building multitenant SaaS apps, I find it's most often not. In repeating your design, you're buying the flexibility to deal with your customers on a more granular level (see the sketch after this list).
3. Others have already mentioned this: sometimes it's easier to articulate the integrity of your system to technical auditors with a hard separation like independent schemas. It's a little easier to show that a particular database account has access only to a particular schema than it is to convince them that your code does a sufficient job of enforcing the rules. I've done a lot of work where regulatory compliance is a concern, and this is always an easier sell.
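To put points 2 and 3 in concrete terms, a sketch (the schema and role names are invented; GRANT ... ON ALL TABLES IN SCHEMA needs Postgres 9.0+):

    -- Point 2: roll a new feature out one customer at a time:
    ALTER TABLE tenant_acme.widgets ADD COLUMN sku text;
    -- tenant_globex.widgets stays untouched until you're ready.

    -- Point 3: a per-customer account that can only reach its own schema,
    -- which is easy to demonstrate to an auditor:
    CREATE ROLE acme_app LOGIN;
    GRANT USAGE ON SCHEMA tenant_acme TO acme_app;
    GRANT SELECT, INSERT, UPDATE, DELETE
        ON ALL TABLES IN SCHEMA tenant_acme TO acme_app;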
The thing is that most vendors have solutions to these problems, and if Postgres doesn't have one already built, you can probably build it from its primitives. This is just my general thinking under this particular set of circumstances.
"As in a typical multi-tenant architecture, one database per customer."
Maybe I've misunderstood what "multi-tenant" means for the last 5 years, but I always thought of it as "multiple customers on a single installation/instance and using a single database" ... as in, every customer has some sort of account model and other models are scoped to the account.
I'm sure there are valid reasons for having a unique db per customer (we do it in one product) but it seems easier to maintain/back up a single one.
If your data is easily and completely partitioned by customer, it may make sense to do so. It's about as easy a partitioning scheme as you can come up with. It gives you the power to do staged rollouts and other flexibility in deployment, as well as security isolation that makes your life easier and that some customers may prefer.
I don't really see much of a security benefit to "one DB per customer". The actual data will be stored on the same disks, and the queries issued through the same DB processes. This seems like a huge amount of overhead for very little realized gain.
snprbob86's point from another comment makes sense: it gives you more control. For example, doing staged rollouts of different features for different customers, etc. I'm not saying it's worth the headaches, but there are advantages.
Also consider behind-the-firewall installs or premium dedicated hosting.
One of the nice things about enterprise software is that you can charge for things like this. If you have a large enough customer, even if you are hosting, you can simply forward to them your bill for a dedicated server.
This approach will not scale. In fact, I've seen it personally with multiple of my hosting clients: they fall over after even 5000 databases. There usually has to be a file per table with InnoDB, so this approach will work until you have a few thousand customers, and then you will feel the pain of refactoring this into a true multi-tenant db system.
Seems like row-level permissions would be a better paradigm to make this work. Last I checked, Django had some support for it, recently improved in their 1.2 release.
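For what it's worth, Postgres itself can now enforce this at the row level (9.5 and later); a sketch, assuming a shared widgets table keyed by customer_id and an app.customer_id setting the application fills in:

    -- Every query against widgets is filtered to the current customer:
    ALTER TABLE widgets ENABLE ROW LEVEL SECURITY;

    CREATE POLICY customer_isolation ON widgets
        USING (customer_id = current_setting('app.customer_id')::integer);

    -- The application identifies the tenant once per session/transaction:
    SET app.customer_id = '5';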
Here's a Django Advent post about row-level/object permissions, published while they were ramping up for the 1.2 release:
I assume by "like this" you mean a separate db for each customer.
The reason is probably a perceived improvement in security. I've worked in systems where they've done this for security reasons. I believe Joel Spolsky did it for this reason when he was betaing a release of a product (FogBugz?).
It certainly does jack up the maintenance overhead. I wouldn't design a system this way myself.
I'm really curious if this is in fact a common practice out there in the wide world.
I mean, any other export would just be the result of some query. And it doesn't have to be a complicated query, either. Just "WHERE customer_id=5" or whatever.
There are a lot of cases where the query will indeed be complicated. Reports, for example. I have seen report SQL queries which are quite complex; making sure that such SQL is always qualified with WHERE customer_id=5 is a maintenance nightmare.
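One way to hedge against that: centralize the qualification in a view so report queries can't forget it. A sketch (the app.customer_id setting is an invention; it could equally be a session variable your framework sets):

    -- Qualify once, in one place:
    CREATE VIEW my_widgets AS
        SELECT * FROM widgets
        WHERE customer_id = current_setting('app.customer_id')::integer;

    -- Complex reports then read from the view, not the raw table:
    SELECT category, avg(price)
    FROM my_widgets
    GROUP BY category;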
There is also the customer perception thing, as petethomas said in this thread.
Scaling: it provides cheap, quick sharding from Day 1. If your db is overloaded, just move some customer databases to a second machine. FB used a similar approach with MySQL and isolated university databases in the early days, I believe.
This solution seems overcomplicated. It might work fine, I've never tried it, but I can't imagine databases (even Postgres) being optimized for such things. The way we did it at http://historio.us was to just have a foreign key to the user (this can be easily extended to groups of users, i.e. companies) and perform a lookup of the related fields depending on the domain.
This has worked very well so far; it's very fast, scalable, and has given us zero problems. Since all the tables are the same, I imagine it should work very well for the poster, with an appropriate index, too...
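The shape of that design, roughly (table and column names are invented):

    -- Every row carries its owner; a single index keeps the per-user
    -- lookup fast no matter how many users share the table:
    CREATE TABLE users (
        id serial PRIMARY KEY
    );

    CREATE TABLE bookmarks (
        id      serial PRIMARY KEY,
        user_id integer NOT NULL REFERENCES users(id),
        url     text NOT NULL
    );

    CREATE INDEX bookmarks_user_id_idx ON bookmarks (user_id);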
The main advantage of one-database-per-business-customer isn't isolating customer data, it's the ability to serve business customers who are willing to pay for heavy customizations and the extra support those entail, while not polluting your core offering through the introduction of attributes, features, etc. for which other customers will have no use.
For a lot of B2B web applications aimed at serving "bigger" companies, customer adoption can still hinge on scotch-sipping and rounds of golf more than well-placed sign-up buttons or great viral marketing campaigns. It's been argued of late that maybe this part of the industry is becoming more "web-like"[1], but in any case it's not very web-like yet, when it comes to sealing deals.
Many times the companies who were convinced by your talented sales team to buy your product eventually demand that the product be tailored to their (sometimes very company-specific) needs. If you want to keep the customer happy, or maybe just keep the customer, you have to keep customizing.
One could make the argument that the most talented sales teams are the ones that sign deals with customers who are willing to "stay on core", but over time as your customers' own businesses evolve, the challenge grows for everyone involved.
"it's the ability to serve business customers who are willing to pay for heavy customizations and the extra support those entail"
You'd need to have flags in the app itself, which would likely turn into an unmaintainable monstrosity. If you have a handful of large clients, are heavily customizing the app for each one, and are using a modern framework like Django or Rails and a DVCS, the obvious maintainable approach is to have full individual installs, forks for each install, and a modular application that pushes the customizations into pluggable components as much as possible.
Having a shared core and a lot of hooks for customization is the road we have taken for our application. I wrote a lengthy post about it just last Friday:
The 1 DB/customer approach is usually a bad idea. Strict data security requirements are one of the only situations where this approach makes sense compared to the traditional single-DB approach or individual installs.