The limits of the RDBMS are orders of magnitude greater than the NoSQL crowd think they are.

Agreed. There are a couple of orders of magnitude available just from I/O improvements, using conventional[1] hardware configured intelligently.

There's at least another order of magnitude in optimizations that aren't possible in NoSQL's strawman[2], such as separate tablespaces for data and indexes, partial indexes, and a galaxy of query tuning from an EXPLAIN that actually provides a query plan (a quick sketch follows the footnotes).

[1] Meaning commodity-priced, nothing fancier than $400 RAID cards and spinning disks.

[2] MySQL
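
To make the partial-index point concrete, here's a minimal sketch. It uses Python's bundled SQLite only because that makes it self-contained (the orders table and query are invented); the same two statements work almost verbatim in Postgres:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)")
    # Partial index: only the rows the hot query touches get indexed.
    con.execute("CREATE INDEX open_idx ON orders (status) WHERE status = 'open'")
    # A real query plan shows whether the index is actually used.
    for row in con.execute("EXPLAIN QUERY PLAN "
                           "SELECT id, total FROM orders WHERE status = 'open'"):
        print(row)  # ... 'SEARCH orders USING INDEX open_idx (status=?)'

That last line is the whole point of a useful EXPLAIN: you find out whether the planner used your index before the query ever hits production.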




In discussions about "NoSQL" systems, I've found that some of the developers complaining about RDBMS performance didn't even know what indexes were.

Usually they learned how to use MySQL from third-hand PHP & MySQL tutorials on somebody's blog, and assumed it was representative of all RDBMSs.

Not saying everyone using "NoSQL" is poorly informed, just that sometimes people's impressions of performance aren't very accurate. It makes me suspicious when somebody's benchmark only uses MySQL.


I've slowly come to the same opinion over the past few years. I'm just going to leave this here for posterity:

http://www.dbdebunk.com/


There has been nothing approaching even a single order of magnitude of improvement in random I/O with conventional HDDs over the last, say, 20 years. Not even a 2x improvement. You're under 500 IOPS per HDD, period; use them wisely.

Moore's law doesn't apply to RPMs of spinning disks.
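
The arithmetic behind that ceiling: each random read pays an average seek plus half a rotation, and neither has improved much. A back-of-envelope sketch, with typical (assumed, not benchmarked) figures:

    # Random IOPS ceiling = 1 / (average seek + half a rotation)
    def hdd_iops(avg_seek_ms, rpm):
        half_rotation_ms = 60000.0 / rpm / 2
        return 1000 / (avg_seek_ms + half_rotation_ms)

    print(hdd_iops(8.5, 7200))   # ~79 IOPS, commodity 7.2k SATA
    print(hdd_iops(3.5, 15000))  # ~182 IOPS, 15k SAS -- still well under 500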


Yes, but a battery-backed write cache will turn random I/O into sequential I/O, eliminating this '500 IOPS' barrier. Put 512 MB of BBWC into a system and see how your 'random I/O' performs. The whole point is that NO ONE scans their ENTIRE dataset randomly; if you have to scan your entire dataset, just access it sequentially. Plus, nothing in NoSQL solves ANY of the points you are outlining.
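
A toy model of that coalescing (just the reordering idea, not a real controller): buffer the incoming writes, flush them sorted by block address, and total head travel collapses by orders of magnitude.

    import random

    # Buffer 1024 random writes across a 1M-block disk, then compare head
    # travel for flushing them as they arrived vs. sorted (elevator order).
    writes = [random.randrange(1000000) for _ in range(1024)]

    def head_travel(blocks):
        return sum(abs(b - a) for a, b in zip(blocks, blocks[1:]))

    print(head_travel(writes))          # arrival order: ~340M blocks of travel
    print(head_travel(sorted(writes)))  # elevator order: under one full sweep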


Write cache is useful, of course, but it doesn't make random reads any faster. If your dataset is too big to cache, disk latency becomes your limiting factor. The question then becomes how to most effectively deploy as many spindles as possible in your solution: SANs are one way, sharding/distribution across multiple nodes another.


No, you would never waste write cache on random reads; instead you'd buy more RAM for your server. Why would anyone buy a whole new chassis, CPU, drives, etc. when all you need is more RAM? As for how to effectively deploy spindles, the answer is generally external drive enclosures: you can fill a rack with disks and save 2-4U for the server.


I said "where it's too big to cache" -- that means too big for RAM, aka >50GB or >200GB depending on what kind of server you have.


SANs are one way,

Agreed, if you include fast interconnects like SAS and exclude the network[1] requirement of SANs.

sharding/distribution across multiple nodes another.

I disagree, for the same reason that iSCSI over Ethernet isn't one: too much added latency.

InfiniBand may help, but I have yet to try it empirically.

[1] Switching/routing, multiple initiators, distances longer than a few dozen meters.


You're under 500 IOPS per HDD, period

This is only significant if one is limited to a trivial number of spinning disks. 20 years ago, with separate disk controllers, this was the case.

If you run some benchmarks, I expect you'll find that, for random I/O, N disks perform better than N times the performance of one of those disks.

SCSI provided (arguably) an order-of-magnitude increase in the number of disks per system.

Now, SAS provides another. $8k will buy 100 disks (plus enclosures, expanders, etc.). How many IOPS is that?

ETA: The Fujitsu Eagle (my archetype of 20-ish-year-old disk technology) had, IIRC, an average access time of 28ms. If its sequential transfer rate was 1/60th to 1/100th of a modern disk's, what fraction of a modern disk's 4k IOPS could it do?
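
Rough answers to both questions, with every figure assumed rather than benchmarked (28ms for the Eagle as above, ~5.5ms average access for a modern 15k disk):

    eagle_ms, modern_ms = 28.0, 5.5       # average access times (assumed)
    eagle, modern = 1000 / eagle_ms, 1000 / modern_ms
    print(eagle, modern, eagle / modern)  # ~36 vs ~182 IOPS: roughly a fifth
    print(100 * modern)                   # 100 SAS spindles: ~18,000 random IOPS

So random IOPS per spindle improved maybe 5x in 20 years while sequential rates improved 60-100x; the real win comes from spindle count.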


Yes, I agree that the solution is to throw more spindles at the problem.

PL/SQL, though, with global data reach and advanced locking states for every single transaction, makes it really hard to move off of a single host. So it's more and more work to get more disks attached to that host, and CPU is a hard upper limit.



