Am I the only one that finds certain aspects of NoSQL databases easier to use than SQL? Despite having used different SQL ORMs and writing two on my own, I'm a big fan of MongoDB. It's fun using it for small projects, too, even if I know scalability and speed won't be an issue.
You were surprised? In my experience most hackers love redoing the last project, only better. There's all the joy of victory over your past mistakes, and less of the pain of struggling with an unfamiliar problem.
They love it so much they usually have to be counseled out of doing it when it isn't a good idea for strategic or other reasons.
On the contrary, both were done with the employers' knowledge. The first ORM was very... magic-heavy, relying on monkey patching classes and stuff like that. Therefore, it was a pain to debug. The second one I wrote later using code generation. Much simpler, yet just as flexible.
I agree with you. I mostly use MongoDB with Ruby and mongo_record. I find this combination so easy to use that I don't have to think about it (like driving a car). I always keep MongoDB running as a service on my laptops and it is installed but not always running on my servers. BTW, if you use Ruby, have you noticed how much faster mongo_record accesses MongoDB using Ruby 1.9.1 instead of 1.8.x? (Faster than I expected from the usual 1.9 speedup.)
NoSQL is the duck-billed platypus in the evolution of databases. Absolutely fascinating, provides great material for PhD students, but a total dead end. Might survive in some isolated corners of the world; not one to bet on to be the next dominant species.
A while ago I held the same opinion, up until I faced a database problem that genuinely wasn't very RDBMS-friendly, so I was "forced" to have a look at the alternatives. Since then I have realized that there are large domains of problems that can be solved a lot more easily with "nosql" configurations than with RDBMS systems. I mean stuff that I used to solve with SQL databases before.
My point is: it is always a good idea to have a larger perspective.
I had a similar change of perspective in 1997 when I worked at a well known Internet retailer. The entire catalog of items we sold (as well as customer reviews and other data) was stored in a number of key-value stores (Berkeley DBs) that were routinely built and pushed out to each web front end. This was very fast and for our purposes was much better than storing this information in a centralized SQL database.
I worked at said retailer as well; as a matter of fact, I was in charge of said millions of customer reviews. And you know what? BDBs were mostly used just because of distribution. They were pushed to all live servers nightly, so that you had no dependency on one giant server.
This eventually changed as everything moved to a service architecture, and those BDBs were also eventually stuck in an Oracle database as well.
Basically BDBs, although great for some things, were more a product of how to scale initially, and the pain quickly became so big that they moved away from them for most things.
So I wouldn't consider said retailer's experience a terribly pro-NoSQL one.
If you'd had viable open source SQL databases in 1997, would you have spent the engineering time on BerkeleyDB, or would you simply have replicated the master database onto each server? In 1997, you weren't choosing NoSQL vs SQL, you were choosing open-source vs commercial.
Anyway, you're essentially storing pre-generated pages, which isn't a use case that I think anyone considers particularly database-appropriate (SQL or NoSQL). Using memcached to cache the data that is in active use seems more efficient, faster, and gives you the option to force through out-of-sequence updates (though again, that wasn't available to you off-the-shelf in 1997.) Complete re-generation of data might have worked for Amazon in 1997, but is this what they're using today?
With open source SQL databases there is no "simply" when it comes to replication. Even today, MySQL replication is brittle, and master/slave inconsistencies are the rule rather than the exception. Slave crashes often cause replayed transactions due to lack of atomicity in writing master.info and relay-log.info. The replication landscape with PostgreSQL is varied and essentially a bag on the side. Last I counted there were more than 10 different ways of doing it, a number involving trigger based log shipping. It wasn't about open-source vs commercial, it was about scaling the reads.
The detail pages weren't pre-generated, they were based on read-only catalog data, which I think is entirely database appropriate. I imagine that complete re-generation of data is no longer done, but I'd be willing to bet that Berkeley DBs are still used in production somewhere.
I agree that built-in replication can be difficult to administer even today, but you're being completely revisionist here. Replication wasn't introduced into MySQL until 2000. In 1997, you would by necessity have rolled your own replication system tailored to your needs (much simpler than solving the general-case problem). That's basically what you did anyway, but you solved it in the most trivial way possible: you 'replicated' by doing a complete database dump and re-distributing the entire DB. If you'd had a viable open-source relational database, you could have scaled the reads and got more developer productivity by distributing a SQL database (e.g. SQLite) rather than a key-value database (BDB).
I appreciate your standing up and giving a concrete example of NoSQL usage - nobody else has been brave enough to do so. But it seems that the reasons for it were highly specific to the time: there were no viable open-source databases, Amazon was just introducing the idea of customer reviews (i.e. pre Web 2.0) so data was primarily read-only, memory was comparatively expensive and memcached didn't exist, and you had a comparatively small product catalog where complete re-generation was an option. I don't think you can carry forward the optimizations you made in that framework into today's world.
I actually was responsible for that system, and for moving away from BDBs being pushed to servers sometime in '00 or so.
As you said, these weren't really databases by any stretch of the imagination, simply snapshots, and built for a very specific type of query. (by asin, by time, reverse ordered)
The building of the DBs was a pain in the ass, because they were so big that you had to do clean builds (instead of incrementals) fairly often to keep them from wasting space. There was also all sorts of voodoo magic going on to work around various BDB issues.
The system did eventually move to a service architecture (as all of AMZN did), for three main reasons:
1) pushing that much data to more and more servers was getting insane, even on their inner networks.
2) we wanted faster turnaround for new reviews
3) rebuilding the BDBs was becoming more and more cumbersome with scale
All that said, the original system did take us pretty darn far, both in scalability of traffic and scalability of data, farther than most websites will ever reach.
Fun times working there, you really get to work on some unique problems.
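For anyone curious what a snapshot "built for a very specific type of query" (by ASIN, by time, reverse ordered) can look like, here is a minimal sketch. It is not the actual system: the function names, the tuple layout, and the `MAX_TS` bound are all assumptions made for illustration. The idea is a composite key of (asin, inverted timestamp), so a plain ascending sort already gives newest-first order within each ASIN.

```python
import bisect

# Assumption: timestamps are non-negative integers below this bound.
MAX_TS = 2**63 - 1


def build_snapshot(reviews):
    """Build a sorted, read-only snapshot from (asin, timestamp, text) tuples.

    Hypothetical helper: storing MAX_TS - timestamp means an ascending sort
    yields newest-first order within each ASIN.
    """
    return sorted((asin, MAX_TS - ts, text) for asin, ts, text in reviews)


def latest_reviews(snapshot, asin, limit=10):
    """Return up to `limit` (timestamp, text) pairs for `asin`, newest first."""
    # (asin,) sorts before every (asin, inverted_ts, text) entry, so this
    # finds the first entry for the requested ASIN.
    start = bisect.bisect_left(snapshot, (asin,))
    out = []
    for key_asin, inverted_ts, text in snapshot[start:start + limit]:
        if key_asin != asin:
            break  # ran past the last entry for this ASIN
        out.append((MAX_TS - inverted_ts, text))
    return out
```

The snapshot only answers this one query shape. Anything else (say, all reviews by one customer) would need a second snapshot built with a different key order, which is roughly why rebuilds get painful as the data grows.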
It was essentially a key-value store of 256-bit audio fingerprints for an audio recognition system not unlike Shazam but used for different purposes. The system was required to work on real-time audio input, so low response times were essential. And we're talking about many billions of keys here and a lot of queries per second. We tried PostgreSQL initially but response times were a real bottleneck. Then we made a proof of concept with BerkeleyDB and it worked really well, but we couldn't use it due to licensing issues, so we ended up coding our own key-value store.
I presume you used B-Tree indexing in PostgreSQL vs hash-indexing in BDB & your own code? PostgreSQL does support hash-indexing, but the documentation suggests it's not faster than B-trees anyway (probably because it would involve a second disk access).
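To make the two lookup strategies concrete, here is a toy in-memory sketch of exact-match lookups on 256-bit keys. Everything in it is an assumption for illustration: `fingerprint` just uses SHA-256 as a stand-in for a real audio fingerprint, `HashIndex` wraps a Python dict, and `SortedIndex` uses binary search over a sorted list as a rough stand-in for an ordered (B-tree-like) index. It says nothing about disk access, which is where the real difference would show up.

```python
import bisect
import hashlib


def fingerprint(data: bytes) -> bytes:
    """Stand-in 256-bit key; a real audio fingerprint is computed differently."""
    return hashlib.sha256(data).digest()


class HashIndex:
    """Exact-match lookup through a hash table (average O(1) per probe)."""

    def __init__(self, items):
        self._table = dict(items)

    def get(self, key):
        return self._table.get(key)


class SortedIndex:
    """Exact-match lookup through binary search over sorted keys (O(log n))."""

    def __init__(self, items):
        pairs = sorted(items)
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None
```

Both return the same answers for point lookups; the ordered version can additionally serve range scans, which fingerprint matching by exact key does not need.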
Let's be honest though - this is an incredibly niche use-case. There's nothing in the relational model that precludes supporting this use case optimally, but in practice engineering resources get focused on more mainstream uses. You did your due diligence and found that support for your scenario was lacking in existing RDBMS systems. That's great, but I think you should be qualifying your support for NoSQL lest others with traditional use cases (best served by RDBMS systems) simply copy your conclusions where they don't apply.
I was initially sceptical of the NOSQL approach, but I have since changed my mind. Let me be very clear: SQL is not dead, and it's the right tool for many jobs.
However, if you want a cheap, replicated, multi-master, write-heavy system - there is no simple way to do this. Oracle is too expensive. I've used mysql-cluster, pgpool-II etc., but in my experience they tend to introduce more failure points than they remove (at least in projects requiring a short development cycle).
NOSQL pushes a degree of the complexity of the data model back to the application (e.g. you end up building some of your own indexes), but the data model is often no more complex in the end (less normalised, and greater flexibility of columns).
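As a rough illustration of "building some of your own indexes", here is a minimal sketch in which the application keeps a secondary index (email to user id) next to the primary records. The `UserStore` class, its method names, and the key layout are hypothetical, and a plain dict stands in for the datastore.

```python
class UserStore:
    """Toy key-value store with an application-maintained secondary index."""

    def __init__(self):
        # Stand-in for the datastore: it only understands key -> value.
        self._kv = {}

    def put_user(self, user_id, record):
        """Write the primary record and keep the email index in step with it."""
        old = self._kv.get(("user", user_id))
        if old is not None and old["email"] != record["email"]:
            # The application, not the datastore, removes the stale entry.
            self._kv.pop(("email", old["email"]), None)
        self._kv[("user", user_id)] = dict(record)
        self._kv[("email", record["email"])] = user_id

    def get_user(self, user_id):
        return self._kv.get(("user", user_id))

    def find_by_email(self, email):
        """Look up via the secondary index, then fetch the primary record."""
        user_id = self._kv.get(("email", email))
        return None if user_id is None else self._kv[("user", user_id)]
```

In a real distributed store the two writes in `put_user` would not be atomic, so the application also has to tolerate or repair a briefly stale index. That is part of the complexity being pushed back to the application.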
The reward for this is massively simpler scaling. Something like Cassandra is so much easier to replicate to a multi-master multi-site system than PostgreSQL or MySQL.
If you're building any form of semi-mission critical site (where you don't want to rely on or be tied to a single hosting provider) and you need to be multi-site & replicated from an early point, then NOSQL philosophy has a lot going for it.
That said - I strongly hope that the MySQL/Postgres camp will fight back with a simple replication strategy that just works rather than being a bolt-on/afterthought.
I find it helps to think of NOSQL being about simplifying the datastore to make it easier to replicate. So far I think only Cassandra does a persistent multi-master solution properly, but others will catch up.
Cassandra isn't just multi-master, it's symmetric and masterless (peer-to-peer). It also isn't the only system with that property: other Dynamo "implementations" (Dynomite, Riak, Voldemort) are the same way.
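For readers unfamiliar with the masterless layout: Dynamo-style systems place data with consistent hashing, so any node can work out which peers own a key without asking a master. Below is a heavily simplified sketch of that idea. The `Ring` class, the `vnodes` parameter, and the choice of MD5 are assumptions for illustration, not how any of the systems named above is actually implemented.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring (MD5 used only for spreading)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class Ring:
    """Toy consistent-hash ring; every node can compute the same placement."""

    def __init__(self, nodes, vnodes=64):
        nodes = list(nodes)
        # Each node gets several positions ("virtual nodes") to even out load.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]
        self._node_count = len(set(nodes))

    def replicas(self, key, n=3):
        """Return up to n distinct nodes for key, walking clockwise on the ring."""
        n = min(n, self._node_count)
        i = bisect.bisect_right(self._hashes, _hash(key))
        found = []
        while len(found) < n:
            node = self._points[i % len(self._points)][1]
            if node not in found:
                found.append(node)
            i += 1
        return found
```

Since placement depends only on the node list and the key, there is no coordinator to fail over. Real systems add membership gossip, hinted handoff, and read repair on top, none of which is shown here.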
There are also multi-master systems that employ 3-PC like fault tolerant distributed commit protocols (e.g., Paxos with Leases, ZAB) and leader elections, allowing transactions to go through even if one of the coordinators fails.
I contend NoSQL is the introduction of nailguns to scaling up large projects with few consistency constraints, where previously there were only screwdrivers and hammers.
It's useful in its arena. As are SQL based solutions in their arena. And there are tons of projects that can be done in either. Anyone who contends either is the only way hasn't spent enough time using the other.
IMO, it's the equivalent of panting as opposed to sweating: a functional way to get something done that fits some things quite well.
While I agree on the NoSQL movement in general (really, how many websites don't fit well on an RDBMS?), there are many useful cases for it. In my case, I'm storing massive binary analysis databases in a key-value store that allows quick graph traversal. Traversing code paths in a standard database or KV-store is insanely, insanely expensive, so a custom solution was really the only route to take.