How FriendFeed uses MySQL to store schema-less data (bret.appspot.com)
347 points by paul on Feb 27, 2009 | 84 comments



Another interesting detail is that this is roughly the 4th iteration on the FriendFeed backend since we launched 17 months ago. If you look at the graphs at the bottom of Bret's post, you can see that our previous system was about to die -- average pageview latency had increased from about 135ms to 260ms in less than a month! (not a good trend) This new design also accommodates some important upcoming features that would have been problematic in the old system.

This experience reinforces my belief that it's better to be quick than brilliant. If we had wasted a lot of time trying to build some really smart, super-scalable system right from the start, it would have inevitably been over-optimized for the wrong things, since the product and requirements have changed quite a bit since launch. By continually rebuilding the system, it stays relatively simple and close to our actual needs (plus it incorporates everything we learned from the previous iterations).


This experience reinforces my belief that it's better to be quick than brilliant...

I've come to the same conclusion during the past year and a half of working on Justin.TV. If I was writing a list of pieces of startup advice, this would be #1.


After reflecting on my own experience with a failed B2C product, I definitely agree with the statement in your second paragraph. We spent a significant amount of time worrying about 'scaling' on a lot of fronts, not just technical, but the highest CCU we ever hit was a couple hundred users over the course of a year and a half.

From a design standpoint, I think it's perfectly acceptable to keep said objectives a high priority... but from an execution standpoint, it's more important to be as nimble and flexible as possible. Re-writes shouldn't be feared too much; they take less and less time if the team is applying what they've learned.


Hi, is there any chance you could have a look at this comment:

http://news.ycombinator.com/item?id=497070

and let me know how you resolve this problem in practice? (Or correct my understanding if there is no problem really).


Increasing the number of shards is similar to changing the backend infrastructure, but simpler. Downtime obviously isn't acceptable, so when switching from one system to the next, we have a period during which we write to both so that it is safe to read from either one. Other schemes could be used for resharding, but this is simple enough and also works for other changes, and in practice we've changed the schema more often than the number of shards.
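A dual-write window of that sort might be sketched roughly like this (illustrative Python only, not FriendFeed's actual code; the store objects and method names are made up):

    # Hypothetical sketch of a dual-write migration window.
    # old_store / new_store are assumed to expose put/get; not a real API.
    class MigratingStore(object):
        def __init__(self, old_store, new_store, reads_from_new=False):
            self.old = old_store
            self.new = new_store
            self.reads_from_new = reads_from_new

        def put(self, key, value):
            # During the transition, every write goes to both systems,
            # so either one is safe to read from.
            self.old.put(key, value)
            self.new.put(key, value)

        def get(self, key):
            # Flip reads to the new system once it has caught up
            # (e.g. after a backfill of pre-existing rows).
            store = self.new if self.reads_from_new else self.old
            return store.get(key)

Once the new system has been backfilled and verified, reads flip over and the old one can be retired.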

Also, keep in mind that it is ok to have more shards than you really need (multiple shards can run on the same machine, for example), so resharding needn't be a common operation.


I tried something like this before but thought it was a "naive way to do things".

I love the advice that sometimes doing what may seem like a naive thing is good enough.


Thanks for sharing. That was very educational.

We store data in a really similar way at Spock, pointing all our reads at a big MySQL table which essentially just has an id column and a really big text column containing all the values, JSON-encoded, w/roughly the same number of rows as the FriendFeed table. Our DBA wrote a blog post about it a while back: http://www.frankf.us/wp/?p=17

The way you guys create/delete indices and avoid all the replication nightmares sounds super cool.

All of these solutions ultimately end up sounding like a poor man's Big Table.


This is really interesting.

We did something similar for GameClay. I stored game properties as JSON-encoded dicts stored in MogileFS, then had a "regular" MySQL table that would point to the MogileFS key for the file, then the Python code would just read it out, use a JSON library to parse it, and manipulate it as a Python object. We had normal MySQL indexes on all the game metadata that appeared in the UI, so if you want all arcade games that nostrademons has posted, it would do a normal index search on (nostrademons, "arcade"), find the Mogile key for that, and then fetch it.
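The read path for that kind of setup is basically a pointer lookup followed by a blob fetch; a hedged sketch (table and column names invented, and the MogileFS client is assumed to expose a get_file_data-style call):

    import json
    import MySQLdb  # metadata + index lookups live in MySQL

    def load_games(db, mogile_client, owner, category):
        # Normal MySQL index search on the metadata shown in the UI...
        cur = db.cursor()
        cur.execute(
            "SELECT mogile_key FROM games WHERE owner = %s AND category = %s",
            (owner, category))
        games = []
        for (mogile_key,) in cur.fetchall():
            # ...then fetch the JSON blob from MogileFS and parse it
            # into a plain Python dict (get_file_data is the assumed client call).
            blob = mogile_client.get_file_data(mogile_key)
            games.append(json.loads(blob))
        return games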

We never got to the point where we'd need to shard, alas. I'd like to think I would've thought of the index-per-table approach independently, but probably not.

Curiously, I did things this way for ease of development, not performance (and there were probably lots of little things that would've killed our performance anyway - it's really hard to diagnose performance bugs until you have real users banging away). We needed a storage system - MogileFS and MySQL work out of the box. We needed a data format - JSON works out of the box. We needed to be able to change game schemas rapidly, since we kept finding our existing design was not really fun to play with (and that's what killed us, ultimately). The rest of it followed pretty obviously.


I just recently wrapped up using a similar "store serialized data" set-up myself. In my example, I'm allowing the user to insert/remove/reorder items on a list. This type of operation is a pain to do with SQL operations. If you have a list A,B,C,D,E,F and you want to insert G before C, then you either have to:

  1. delete all, then insert A,B,G,C,D,E,F.
  2. set "sortnumber" on G to "3", and increment sortnumber on all >= 3.
  3. set "sortnumber" of G to be the average of B and C.
The third option is the cleverest, but eventually you run out of space in floating-point land, unless you have a cron come in and clean everything up periodically. And of course, reordering (the example was an insert) and removing need to be considered as well. So it's a choice between lots of deletes and inserts (but really easy), semi-annoying reordering/removing logic with a few updates, or a clever hack that requires a cron to cover your ass.

What I did instead was store the list as JSON, convert it to an array, and use array splicing functions to reorder things. Then I convert it back to JSON and store it. It's worked extraordinarily well... It takes a fraction of the amount of time to do native data structure stuff than it does to touch the DB several times.
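In code, the whole round trip is one read, a native splice, and one write; a rough Python sketch (the fetch/save helpers are hypothetical one-row SELECT/UPDATE wrappers):

    import json

    def insert_before(db, list_id, new_item, before_item):
        # One read: pull the serialized list out of the DB.
        items = json.loads(fetch_list_json(db, list_id))  # e.g. ["A","B","C","D","E","F"]
        # Native splice instead of a pile of UPDATEs.
        pos = items.index(before_item)
        items.insert(pos, new_item)
        # One write: store the re-serialized list.
        save_list_json(db, list_id, json.dumps(items))

    # fetch_list_json / save_list_json are hypothetical helpers wrapping a
    # single SELECT/UPDATE against the table holding the JSON column.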


> It takes a fraction of the amount of time to do native data structure stuff than it does to touch the DB several times.

And app processes are much easier to scale than a database. Even in "slow" languages (Ruby in my case) array operations are extremely fast, definitely "fast enough".


While your approach to do this in JSON is cool, I think you have overlooked the 'direct' solution to do this in a RDBMS - with a linked list. Here is a quick suggestion (works in Postgres):

    create table l (
           id char primary key references l(prev) deferrable initially deferred,
           prev char unique not null references l(id) deferrable initially deferred,
           mydata text not null
    );
then I populate the table with your example items:

    insert into l (id, prev, mydata) values ('A', 'F', 'dA'),
                                            ('B', 'A', 'dB'),
                                            ('C', 'B', 'dC'),
                                            ('D', 'C', 'dD'),
                                            ('E', 'D', 'dE'),
                                            ('F', 'E', 'dF');
let's see what that looks like:

    test=# select * from l;
    select * from l;
     id | prev | mydata 
    ----+------+--------
     A  | F    | dA
     B  | A    | dB
     C  | B    | dC
     D  | C    | dD
     E  | D    | dE
     F  | E    | dF
     (6 rows)
to insert a new item into the list, you would do:

    begin;
    update l set prev='G' where prev='C';
    insert into l (id, prev, mydata) values ('G', 'C', 'data for G');
    commit;
so that's one update, one insert for an insertion into the list. Note that the two commands have to be in one transaction, because inside the transaction the foreign key constraint is violated (as allowed by the deferrable initially deferred modifier).

Let's inspect our list again:

    test=# select * from l;
    select * from l;
     id | prev |   mydata   
    ----+------+------------
     A  | F    | dA
     B  | A    | dB
     C  | B    | dC
     E  | D    | dE
     F  | E    | dF
     D  | G    | dD
     G  | C    | data for G
so the predecessor of G is C, and the predecessor of D is G, like specified.

Of course, you lose the ability to sort with 'order by', but that's no big deal: you know the predecessor and successor of each item, so it's easy to traverse the list in either order. This could be done on the client side [probably the best solution in your case], in the application code, or inside the database with a stored procedure or with a recursive query (coming in PostgreSQL 8.4); in Oracle it could probably be done with 'connect by'.
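Client-side traversal is just a dict walk once the rows are selected; a minimal Python sketch, assuming rows of (id, prev, mydata) and a known starting id:

    def in_order(rows, first_id):
        # rows: [(id, prev, mydata), ...] as returned by "select * from l".
        # Invert prev to get next-pointers, then walk from the first element.
        next_of = {prev: id_ for (id_, prev, _) in rows}
        data_of = {id_: mydata for (id_, _, mydata) in rows}
        ordered, cur = [], first_id
        for _ in rows:                 # the list is circular, so stop after len(rows)
            ordered.append((cur, data_of[cur]))
            cur = next_of[cur]
        return ordered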

In reality, you would of course choose other datatypes for id and prev (probably integer), but I wanted to translate your example as literally as possible. Another problem that's easily solved: how do I get all elements of one list? Solution: Either give me one 'starting element' and the list is traversed and returned. Or introduce a listId attribute and select by that, which is probably faster but without sort order.


Aha! I did forget the linked list approach. So essentially, each item on the list stores what comes before (or after) it. I'm assuming it's an arbitrary choice that you're using "prev" instead of "next," correct?

I'd never heard of deferred constraints, but they make perfect sense in this case. Unless, of course, you want to insert the record first and then modify the update to exclude the item you just inserted.

Right now, I'm using MySQL.. I only have 4 tables, and product is not launched. Would you advise switching to Postgres?


> I'm assuming it's an arbitrary choice that you're using "prev" instead of "next," correct?

yes, in effect it's a doubly linked list (circularly doubly linked), so you could remember the id of the first element of the list and then traverse in any order as long as this id does not reappear.

> Unless, of course, you want to insert the record first and then modify the update to exclude the item you just inserted.

yes - in this case you would get a violation of the unique constraint:

    begin;
    insert into l (id, prev, mydata) values ('G', 'C', 'data for G');
    update l set prev='G' where prev='C';
    commit;
this would violate the unique constraint for prev, because after the insert (but before the update!), both the new element and the element not yet updated have prev set to 'C' - the transaction will then be rolled back.

In standard SQL this would be possible because it allows you to declare unique constraints (and I think other constraints, such as check clauses) as deferrable, too - PostgreSQL doesn't implement this; it allows deferrable only for foreign key constraints. But usually it's no problem to order the commands in a way that only foreign key constraints get violated during a transaction.

> Right now, I'm using MySQL.. I only have 4 tables, and product is not launched. Would you advise switching to Postgres?

as an entrepreneur, you should probably do what's best for your customers - and they will likely not care which RDBMS you use:-) Perhaps you could play around a little bit with PostgreSQL on the side and (perhaps) make the switch once you are comfortable with it. And keep your JSON-based lists - if the system works, why bother with a rewrite / schema change (for now?).

Considering momentum, I think PostgreSQL is gaining steam while MySQL is losing momentum (some key developers left after the acquisition by Sun) - of course, that's my subjective impression.

Technically, of course I think that PostgreSQL is better - here is a good comparison: http://www.wikivs.com/wiki/MySQL_vs_PostgreSQL for amusement, read the discussion here: http://www.reddit.com/r/programming/comments/764fp/mysql_vs_...


There's one more thing I forgot to mention...

The data I'm having the user order is not only order-specific, but recursive. That is, one of the items on the list, rather than being a letter like "B", can be a list in and of itself. It's basically like "files and folders".

A major problem isn't only the UI, which took me days to make (drag and drop for files/folders... has anybody seen anything like this done before in HTML+JS?), but mostly the updating/validation. I receive information like "moved item: /path/to/item to before /some/other/path" and I now need to make sure this is a valid action (e.g. can't put a folder into one of its own subfolders, can't put a file in a file, etc.), and also update the database to reflect this.

I chose to use JSON to encode objects in the DB, then decode them into an array, and do some easy/fast array stuff to perform the action they requested. Just judging by the array code, which does several "isset()" calls and splices, I'd imagine doing this with a DB would be a major headache... but I have no doubt you'd be able to come up with some brilliant way to do it.
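For what it's worth, the "can't drop a folder into its own subtree" check stays cheap once paths are plain Python lists; an illustrative sketch (names made up):

    def is_valid_move(src_path, dest_path, dest_is_folder):
        # Paths as lists of components, e.g. ["path", "to", "item"]; illustrative only.
        if not dest_is_folder:
            return False                  # can't put anything inside a file
        if dest_path[:len(src_path)] == src_path:
            return False                  # can't move a folder into its own subtree
        return True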

To each their own :)


> To each their own :)

yeah, of course! And if it works, then, by definition, it's good for your customers and thus for you! Also, your approach is probably more flexible (no fixed schema for the recursive list structure), which makes it easier and faster for you to iterate and react to customer feedback.

If at some time you want to have a more fixed structure or let the database do some of the server-side validation work of your tree-like structure, here is a good writeup by Phil Greenspun showing how to model and query tree-like datastructures with an RDBMS: http://philip.greenspun.com/sql/trees.html (the rest of the document 'SQL for Web Nerds' is also quite good stuff: http://philip.greenspun.com/sql )

He uses 'connect by', which is a non-standard Oracle extension; the same thing can be achieved with 'with recursive' queries, which are standard SQL (1999) and part of PostgreSQL 8.4. Also, Joe Celko's book 'SQL for Smarties' has a good chapter covering hierarchical data structures in SQL (he also has a separate book about 'trees and hierarchies' in SQL).

Oh, and good luck and much success with your startup!


A tree widget with DD? http://extjs.com/deploy/dev/examples/tree/reorder.html

I bet there are lots of others too. This doesn't sound like something you'd ever want to write yourself (unless the widget IS your project).


This is really interesting...could you do the same thing using S3 instead of MogileFS? The advantage would be cost and simple scalability, especially if you're running on EC2 so you don't have to pay for all the back and forth transfer from EC2 to S3. Concerns might be latency issues and whether this would scale to billions of objects. Other than that, it would seem like you could run a pretty huge site off of just a couple beefy servers.


Absolutely. It'd work with any sort of distributed filestorage. It wouldn't be too hard to wrap the S3 methods so that the API is consistent (the MogileFS API is basically a dict), so app code wouldn't even have to change.
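Something dict-shaped over S3 could be as thin as this (hedged sketch; the bucket object is assumed to expose get/put methods rather than any particular client library's API):

    class BlobStore(object):
        """Minimal dict-style facade so app code doesn't care what's behind it."""
        def __init__(self, bucket):
            self.bucket = bucket   # any object with get(key) / put(key, data)

        def __getitem__(self, key):
            return self.bucket.get(key)

        def __setitem__(self, key, data):
            self.bucket.put(key, data)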

We used Mogile over S3 mostly because I didn't realize there was no minimum monthly fee for S3.


This is great. I sketched out a similar design for the Delicious backend, but the team that built it went in a completely different and stupid direction. Oh well.

Mine also had:

- some way to do server locality (I don't think FF needs this)

- some way to do table locality (I don't think FF needs this either)

- per blob versioning and type

- a virtual shard layer, so new machines can be added inexpensively

Sorry for any inadvertent whining, incidentally.


Excellent article. I work in an environment where we have serious vendor lock-in and the people in charge of infrastructure have a real fear of change. I have free rein to implement any improvements to our performance and code so long as I don't make things difficult for them.

Thus, the idea of a MySQL cache system is something I've been planning to look at for a while as it plays with the existing infrastructure. This (seemingly proven) system will make things a lot easier, so thanks for sharing.


"However, none of them seemed widely-used enough by large sites to inspire confidence. In the tests we read about and ran ourselves, none of the projects were stable or battle-tested enough for our needs"

OK, just an hour or so ago I released beta-3 of Redis (http://code.google.com/p/redis/ if you care) and I'm nearing feature-freeze with exactly this goal: to make it rock solid (I'm going to use it in my startup's web stuff, with a lot of users/month, so I care about stability).

The question is: what, in your opinion, is the right path to making a system like Redis stable and reliable for real-world usage? What should I publish on the site in order to inspire a good feeling about stability? Thanks.


Not to be a naysayer, but I don't know that there's a lot you can do except run it yourself on a large, popular site that people have heard of. Companies like FriendFeed will use memcached because LiveJournal/Facebook use it; they'll use MySQL because just about every web startup uses it; but no matter how awesome a project is, they're not going to use something complex that a few developers or some other startup wrote that hasn't been battle-tested on a large, well-known site. It's too risky compared to writing another one yourself that you understand.


OK, so it sounds like I can have these goals:

1) Keep Redis simple enough that a single developer can understand the implementation in little time.

2) Run Redis on my own large sites. The problem is that while these sites are really large, they are famous only in Italy. One example is http://oknotizie.alice.it

3) In the future, when Redis is stable and some little startups start using it for real work, collect success stories and put them on the front page of the project.


I've been evaluating databases recently. It doesn't matter if the site is not well-known in the US as long as you can post traffic numbers. TokyoTyrant is Japanese, after all.

Also these help:

   * able to handle large (>200GB) datasets
   * client libraries for top N languages
   * easy way to write client libs (eg simple protocol, a C library, etc)
   * connection pooling and/or cheap connections
   * easy to install on Mac, Linux and Windows development machines
   * repl (this is not much of a problem with Python & Ruby)
   * stable dump / restore format
One problem for you is that MySQL plus serialize() basically does these already, with 10+ years of testing on top. Your system has to do a lot more to make it worth the risk.


Agreed on most of the points... about MySQL, I developed Redis just because MySQL does not scale enough with certain kinds of datasets :)


Scale enough in which way? Read traffic? Write traffic? Locking? If you use MySQL as a btree it is pretty fast and consistent. I gather from this post that a lot of companies started in the last 18 months are doing this. It's a hack but a remarkably useful one. Good luck! I will keep my eye on redis.


Probably it just depends on the dataset. The issue we have with our service (http://lloogg.com) is that you need to keep the last N log items for every site. Asking for the latest M items should be fast. Pushing a new log line onto the list should be fast. Every MySQL configuration we tried was unable to reach the 10k writes/second we reach with Redis. Obviously, even when you use MySQL as a btree implementation you get a lot of overhead, starting with the protocol and the format of the statements, for example.
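With Redis lists that access pattern maps onto a couple of commands; a hedged example using the Python client (key naming is invented):

    import redis

    r = redis.Redis()
    N = 1000  # how many recent log lines to keep per site

    def push_log_line(site_id, line):
        key = "log:%s" % site_id
        r.lpush(key, line)        # prepend the newest line
        r.ltrim(key, 0, N - 1)    # cap the list at the last N items

    def latest(site_id, m):
        return r.lrange("log:%s" % site_id, 0, m - 1)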

The idea of encoding things as JSON or other formats in a blob of text is just an ugly hack. People are using this because they are desperate, not because it is good computer science. They started with MySQL, know MySQL, hacked with MySQL. Clearly they will try to fix their site with MySQL.

The JSON+blob approach can work as long as the data stored in these fields is trivial to serialize/deserialize. What about having a 10,000-element list in every blob, with an element to append on every page view?

So: great hack, you found a way to work with the tools you have, but this does not mean in any way that fast key-value persistent DBs don't have something to say in the web-scale theater.


I added consistent ring hashing and a rediscloud class to the Ruby client library today, so you can now use a cluster of Redis servers and have the client use ring hashing to determine the server for a given key.


How is redis different from TokyoTyrant?


It's higher level, not plain key-value stuff. For instance, a value can be a list or a set; you can push/pop elements, ask the server for all the keys matching a given glob-style pattern, and so on. Most of these operations are atomic in order to make sure there are no race conditions.

You can read more about the difference between Redis and other key-value stores here: http://code.google.com/p/redis/wiki/FAQ

Basically the long term goal is to have something between a relational DB and a key-value DB. Not all the higher level features must be killed in order to be scalable.


Has anyone here got TokyoTyrant working under Windows? I found a static library binary for Windows, but the header files needed to compile are either not there, or the code references some bizarre POSIX stuff which isn't easy to hack into working under Windows.


That's reassuring, I thought I was the only crazy fool who was storing json objects in database columns =). We do something similar for some of our data models at thesixtyone.com. It's really nice for not having to bring down the site for schema upgrades.


It's really nice for not having to bring down the site for schema upgrades.

All the other popular databases let you modify the schema online. This feature has been taken for granted for over a decade.


Actually we're on postgres... It just makes me nervous when the migration is taking upwards of 10 minutes to complete since some of our tables have millions of rows and lots of concurrent read/writes are happening.


I don't think executing long-running DDL with concurrent read/writes is that bad, especially if the DDL is additive and doesn't lock the whole table.

Concurrency - isn't that what a DB is supposed to do well?


If the site isn't taken down during an update, some parts of it will always be broken during a schema migration since our ORM expects each table corresponding to a data model to be in sync after a code update. I guess you could always version your data models and handle the pre-upgrade and post-upgrade cases, but that seems like a lot of work in our situation.


I find this interesting because for a previous project, we had to manage separate database instances for each client (regulations). Even though we had a database management tool that allowed us to push schema changes out to the various DBs automatically, it would be better to not have to push those changes out at all.


Without locking tables?


Yes.


Indeed. This is not an RDBMS problem. It's a MySQL problem, and the correct solution is to use a mature DB.


Even in Oracle, we've had to schedule downtime for a single column schema change and population for 45 minutes. I would call Oracle "mature."

Note: I wasn't involved in doing the update, so it's possible there was a better way to do it. The table is used by about 50 different applications.


Almost all major Oracle operations can be done online... index rebuilds, table reorgs, etc. How much it affects things depends on how much hardware you have...


True in this case, but not always. The issue often isn't upgrading the schema; it's that there can be no single schema that can hold the data, because it varies on a per-row basis. No RDBMS can deal with this well, no matter how mature; it's just not what they're designed for. This is trivial for an OODBMS, which just stores raw objects, no matter their shape.


We're doing something very similar to store RDF data. Our reason for doing it this way isn't performance but rather RDF's pretty intricate schema requirements (and opportunities). The drawback of using this scheme is that it roughly doubles the amount of data stored.

Unlike FriendFeed, we have to use joins a lot because analysing data is the purpose of our application. We tried to do it with MySQL, but MySQL turns out to be completely unsuitable for the task due to its lack of merge or hash joins.

I'm quite surprised that someone like FriendFeed would change their entire data model for performance reasons instead of considering a stronger RDBMS (of which there are many). Their problem with index maintenance isn't exactly new. It's a solved problem that needs no wheel reinventing and doesn't merit the increased complexity of asynchronous index updates, in my view.


Really a great way to do things... thank you so much for sharing this.

I can't really see any disadvantages to doing things this way. You still get access to data in a relational sense, though where you'd normally compare columns of a table, you'd now do joins. For example, you can get the unique users that have submitted a link by joining the index_link table to the entities table, then joining that to the index_user table.

The problem here seems to be that sharding would prevent this type of operation... so how do they get around this? It's possible they just don't need this data, but let's assume they do. I'm presuming they have some slave(s) munging non-realtime-needed data into whatever real relational tables they choose. But if the problem becomes realtime, I'm at a loss as to how they'd do it.


They explained the sharding in the text - I glossed over it on first read-through too. When they say "join" in quotation marks, they don't actually mean a join in the MySQL sense. Rather, the initial call to user_id_index.get_all reads all entity_ids for that user into the Python code (they say it consults all shards for this, but isn't the index sharded on user_id, so all entities for a given user_id live on one shard?). The Python code then uses whatever shard function applies to the entities table to query its database backends, selecting the relevant entities. Then the Python code filters the returned records by the indexed field (in case the indices are out of date) and returns it.
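In rough pseudo-Python, the "join" is two fan-out reads plus a filter (a sketch of this reading of the post, not FriendFeed's actual code; shard.query, group_by_shard and deserialize are hypothetical helpers):

    def entities_for_user(user_id, shards):
        # 1. Ask every shard's index table for matching entity ids.
        entity_ids = set()
        for shard in shards:
            entity_ids.update(shard.query(
                "SELECT entity_id FROM index_user WHERE user_id = %s", user_id))

        # 2. Fetch the entity blobs from whichever shards hold them,
        #    grouping ids by the entities table's own shard function.
        entities = []
        for shard, ids in group_by_shard(entity_ids, shards).items():
            for body in shard.query(
                    "SELECT body FROM entities WHERE id IN %s", tuple(ids)):
                entities.append(deserialize(body))

        # 3. Drop rows whose asynchronously-maintained index entry is stale.
        return [e for e in entities if e.get("user_id") == user_id]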

As for disadvantages - well, it's denormalized, for starters. ("Normalization is for sissies", says Cal Henderson.) If an indexed field changes, you need to update it in both the index and the relevant entities. There're also a bunch of little inefficiencies, places where they traded performance for scalability. Imagine if you naively plugged this engine into an app with only 10 records: instead of a simple index search, it'd have to go to the index table, fetch the relevant entities, go to the entities table, fetch them, filter on indexed value, and then return them all. But then, if your database fits on one machine, you don't have the same sort of engineering challenges FriendFeed does.


When they say it's sharded on "user_id", what they mean is that's the field that decides which database the record is stored in. It might go something like: if the user_id starts with 0-8, store in DB1; otherwise, store in DB2. It's up to their Datastore controller to decide how to hash based on the user_id and the number of databases.


Yeah. They actually explained that in the article too - the shard number = user_id % num_of_shards. So user 1 is on DB1, user 2 is on DB2, etc. If they have 10 shards, user 11 starts back on DB1, etc.
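In code the pick is essentially a one-liner (trivial illustrative sketch):

    # Illustrative only: pick a shard (and its DB connection) by user_id modulo.
    def shard_for(user_id, num_shards):
        return user_id % num_shards

    def connection_for(user_id, connections):
        # connections: one DB handle per shard, indexed by shard number
        return connections[shard_for(user_id, len(connections))]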


I don't understand how that scheme can work, since changing the number of shards changes the location of most users.

e.g. we have 4 shards, so user 5 is on shard1. If we go to 6 shards, user5 is now on shard5.

I guess it works with downtime to move the users, or another layer of indirection, where the newly created shards can "point back" to existing shards, but otherwise I don't see it.

My understanding of sharding is that it's best to have a global lookup table user->shard. That's not a huge amount of data, even for millions of users.

Anyone care to educate me as to how they get user%num_shards working in practice?


You don't change the number of shards, you change the number of machines and re-balance the shards across them. Shard != machine.

    12 shards on 1 machine
    6 shards on 2 machines
    4 shards on 3 machines
    3 shards on 4 machines
    2 shards on 6 machines
And on and on...
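So the key-to-shard mapping stays fixed, and only the shard-to-machine map changes when you rebalance; a minimal sketch (machine names made up):

    NUM_SHARDS = 12  # fixed "for ever"; what moves is the mapping below

    # Rebalancing = editing this map and copying shard data, not rehashing keys.
    SHARD_TO_MACHINE = {
        0: "db-a", 1: "db-a", 2: "db-a", 3: "db-a", 4: "db-a", 5: "db-a",
        6: "db-b", 7: "db-b", 8: "db-b", 9: "db-b", 10: "db-b", 11: "db-b",
    }

    def machine_for(user_id):
        return SHARD_TO_MACHINE[user_id % NUM_SHARDS]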


Ah, OK. So you start with enough shards "for ever" and never change the number?

Edit: no wait. If you can split a shard across multiple machines, what's the benefit of having more than 1 shard? Why not have 1 shard split across 1000 machines?


The above table wasn't too clear, but my understanding is that a shard exists on only a single machine, but that a single machine could host multiple shards.


Correct, you don't split shards across machines. Each machine hosts x number of shards.


OK, but # of shards is fixed "for ever" under the modulus scheme? You pick it once, when you first shard and then you're looking at downtime to adjust it?

In order to split across multiple dbs, you're looking at creating say 100/1000 dbs in our initial split (when you've got maybe 2-3 machines). And that number then caps the number of machines you can scale to without adding another layer (sharding-shards) or having downtime?


> You pick it once, when you first shard and then you're looking at downtime to adjust it?

Yes, but you say that like it's a bad thing; it's not. If you're ever forced to reshard, that means you grew beyond what you ever hoped... hurray, awesome, nice problem to have. The reality is, however, that there's a 99% chance that'll never happen.

Maybe 1 in 1000 people ever runs into a scaling problem of this magnitude, but if you read the blogs you'd come away thinking scaling issues that require sharding are common and everyone needs this stuff. They aren't, and they don't.


Yes. In the simple modulo scheme, the number of shards is fixed. Unless you want to re-shard the entire thing. (AFAIK)


Instead of using normal hashing you can use consistent hashing so you don't need to move that many users. Here's a post explaining it: http://www.spiteful.com/2008/03/17/programmers-toolbox-part-...
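A bare-bones consistent-hash ring, just to show why few keys move when a node is added (hedged sketch; no virtual-node replicas or other refinements a real client would use):

    import bisect
    import hashlib

    class Ring(object):
        def __init__(self, nodes):
            # Place each node on the ring at the position of its hash.
            self.ring = sorted((self._hash(n), n) for n in nodes)

        def _hash(self, key):
            return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

        def node_for(self, key):
            # Walk clockwise to the first node whose hash is >= the key's hash.
            h = self._hash(key)
            positions = [p for p, _ in self.ring]
            i = bisect.bisect(positions, h) % len(self.ring)
            return self.ring[i][1]

    # Adding a node only reassigns the keys that fall between it and its
    # predecessor on the ring, instead of nearly all keys as with modulo hashing.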


I don't see how this scheme can scale simply for the reason that there's no built in balancer. What's to stop shardN from becoming overwhelmed when all the power users end up there, while shardN-1 has no activity?


Edge-Case YAGNI. Have lots of shards, put multiple shards on each machine, shuffle the shards around as necessary. Rinse. Repeat.


flickr deals with that problem by manually moving power users if necessary. They have a tool that lets them do that. See: http://highscalability.com/how-flickr-handles-moving-you-ano...

In practice it's going to be very rare that you'll overwhelm a single shard, and adding an extra layer to point to where people actually are is quite simple and fast. You can use a simple cache key (read-through cache, of course) that, if it exists, means the user is on a specific shard, overriding the default algorithmic pick.
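The override can sit right in front of the algorithmic pick; a hedged sketch (the cache key naming and get() interface are invented):

    def shard_for_user(user_id, cache, num_shards):
        # A manually-moved "power user" gets a cache entry pointing at their
        # new shard; everyone else falls through to the default modulo pick.
        override = cache.get("user_shard:%d" % user_id)
        if override is not None:
            return int(override)
        return user_id % num_shards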


You make a large number of virtual shards. Each virtual shard is mapped to some machine. When you add a new machine, you move some shards to the new real machine.

Presumably consistent hashing is also helpful here.


Hopefully they simplified things for the sake of brevity and they're using consistent hashing or some other predictable algorithm in practice.


We don't do any joins in MySQL. We query the indexes in all of the shards in parallel to get a list of entity IDs and then query the entities tables in all of the shards in parallel to get the entity bodies as a second operation.


But would you, if MySQL could do hash joins?


I like the trick for getting around index creation and deletion; I wasn't aware that MySQL required a full table lock. I checked the docs for DB2, and it allows read/write during index creation -- is this a MySQL-only limitation, or do the other major databases impose the same restriction?


Pretty much it's MySQL only.

Most of the MySQL posts here are actually about clever ways to do things most DBAs and developers have taken for granted for years. Do you wonder why no-one seems to be writing blog posts about "sharding" Sybase or DB2 or Oracle or Postgres...?


Actually, they do, e.g. http://highscalability.com/skype-plans-postgresql-scale-1-bi...

MySQL is popular with startups though, so it's not surprising that there's a lot written about it.

It's true that MySQL has some lame limitations, but I don't believe that there are any silver bullets out there. Google tried to switch their ads system from MySQL to a "real" database once, and it was basically a disaster and had to be abandoned in favor of MySQL (I wasn't working on it, so I can't really give all the details).

Another problem we had with MySQL that Bret didn't mention was that it would try to be "smart", and sometimes it would "randomly" (from our perspective) choose a very inefficient strategy, and we would have to waste a lot of time figuring out what it was doing and how to force it to do the right thing. The approach Bret describes basically avoids any MySQL "smarts" and treats it as a dumb, but fast and well-tested, B-tree. This gives us fairly reliable and predictable performance characteristics because we know exactly what it's doing (mostly).


No, there are no silver bullets, but it seems to me that people reach for MySQL a bit too quickly, without considering the pros and cons. And while it's improving, MySQL has had many frustrating things in the past... to me it's always seemed like a "worse is better" kind of thing. Sure, it's "fast", but at what cost? Once you go to InnoDB, you lose that speed advantage.

One thing that's not a tech tradeoff, and is generally a Postgres win, is the BSD style licensing. You can take Postgres and do whatever you want with it with no worries.

> Google tried to switch their ads system from MySQL to a "real" database once,

I hope the part that actually handles money has been fiddled with by Google to be robust.


Actually, Postgres doesn't have any great out-of-the-box solution for partitioning the database across machines. The usual suggestion is Slony, but that is nowhere near as robust and widely deployed as MySQL replication. The GPL licence for MySQL isn't really a problem for webapps anyway.

OTOH, Postgres does somewhat better than MySQL on a single box with multiple cores (it's fairly linear up to 8 CPUs, which is much better than MySQL) - mainly because of the work Sun put into scaling it before they bought MySQL. (At least, that's according to some people from Sun who do a lot of performance work with both databases.)


It's not as robust? Do you have a source for that? You're doubtless right about it not being as widely used.

I've always preferred the fact that Postgres tried to do things correctly. Most recently, I bumped into this with Mysql, and it reminded me why I get irritated when I use it:

http://journal.dedasys.com/2008/11/11/another-mysql-doesnt-d...

I've often bumped into things like that that just irk me.


Postgres does somewhat better than MySQL on a single box with multiple cores ... mainly because of the work Sun put into scaling it before they bought MySQL.

That's not really the case. Sun has done some benchmarking, but most of the work on Postgres SMP performance was done by others (mostly Tom Lane, who works for Red Hat, and various EnterpriseDB employees).


Excuse my ignorance, but what's the Postgres solution to sharding?


It's to write efficient hash joins, partitioning and row versioning into your core database engine.


How do hash joins help with sharding? Surely once you go above a certain number of writes per second you're going to need to have more than one writable database server, at which point you need to start partitioning and better join algorithms aren't going to do anything to help.


If you can't hash join then you can't join over large datasets anyway, so sharding costs you nothing in that respect.

Right now, using off the shelf kit and doing nothing particularly clever, running a major commercial RDBMS you could do 10,000 commits/sec and handle 100T of data on a single instance. Sure it would cost you a pretty penny, but the thing is, unless running a database is the one competitive advantage your company has, you're better off keeping your people focussed on the thing that does make you money.


I'm not too sure if you are arguing for or against the commercial RDBMS.

To me, that sounds like a pretty good argument against it - the great thing about MySQL/Postgres+Sharding is that it scales down, as well as up. You can start out with a single server, then gradually add in extra servers as you need them. With the "single big DB server model" it doesn't work like that - you have to make a pretty decent investment early on in the software licence, build your software to use it, and then the pricing isn't linear, either.

Plus, the backup/redundancy thing sucks too - with sharded MySQL you need a couple of spare cheap servers, but with Oracle etc you need to pay twice for the licence and a second server.

OTOH, in a corporate environment it's much easier to predict your usage and the pricing is easier to justify. Also your priority is safety+justifiability first rather than price or even price/performance.


Which commercial database? Any idea which web scale applications are using it, and why the other ones aren't?



Yes, and it's partitioned out the Wazzooo.


Reading this makes me realize how naive my DB setup is for my startup.


Did you read the part where they said this was the third or fourth iteration of their DB backend? Or maybe I read that somewhere else.

Point being, experience will come.


This looks better suited to the B-Tree version of Berkeley-db than MySQL.


I love this key-value store trend. Looking at CouchDB in particular has made databases fun again.


cool. very interesting.



