This movement sounds reactionary, and perhaps for good reason. SQL has been the default method for storing and retrieving data for a long time, but obviously it isn't the best fit for every case.
That said, I like SQL. I find it extremely easy and natural to keep relational models in my head and write queries. While I do appreciate some of the tools that free me from writing very basic CRUD stuff or persisting objects, I don't like to be too insulated from my database, and I'll gladly drop down to SQL when I need to.
The problem is that SQL has been used for all data persistence and retrieval for far too long, when all kinds of different models might have been appropriate. So I guess this is a kind of backlash. And eventually, there will be a backlash against the backlash - or, more likely, people will forget there ever was an argument, and we'll get a new generation of developers who think they've discovered relational algebra for the first time.
Another thing is that some people have problems with SQL itself, not relational databases per se. SQL is just the compromise-of-a-standard that came out of several RDBMS query languages competing for mindshare.
And, naturally, in some people's minds any problems that MySQL has are joined to SQL, and in fact to all relational databases.
There seem to be two quite contradictory arguments for NoSQL. One camp says, we already have what we need in the form of [programming language] objects, so why would we take everything apart and map it to something as reductionist as relational tables?
The other camp says, relational tables are soooo complicated, we need something even more reductionist and free form, so key-value pairs are just right.
I can understand the second argument, but the first is a step backwards unless your programming language is something like lisp, which has a reductionist data model itself.
Object oriented models are fine if you write one application. But they tie functionality to data very closely, and that makes repurposing the data for separate applications much more difficult. So how do they analyse their data or use it in more than one application?
They have to write a lot of code against an application-specific object oriented API instead of a general purpose reductionist data model. That's horrible, but just how horrible only shows a few years down the road. It's a disaster in the making. A whole army of programmers will be required to extract data from all those big APIs of legacy apps.
I'm convinced that data has a different life cycle than procedural code and therefore needs to be expressed in a simple, uniform, reductionist model independent of all application code.
>>Object oriented models are fine if you write one application. But they tie functionality to data very closely, and that makes repurposing the data for separate applications much more difficult.<<
I've taken a different tack in my research - create a generalized user object and then build application objects on top of it, with the base user object responsible for all persistent data. (These objects can each be mapped into an XML document, so there's a fair amount of flexibility.)
The work has gone very slowly, but here's the site: http://agilewiki.ning.com/ and mind, I have mixed in some hype/BS, though mostly it is things which I've previously prototyped (I've been working on this for over 6 years now).
"Object oriented models are fine if you write one application. But they tie functionality to data very closely and that makes repurposing of data for seperate applications much more difficult. So how do they analyse their data or use it in more than one application?"
This was an extremely poorly-written and poorly-researched article. One problem is that the projects being discussed are not object databases in the sense of storing straight-up serialized representations of business objects.
Take CouchDB for example; the basic "thing" you store in it and get back from it is a JSON object, which is a set of key/value pairs. In this sense it's not too far removed from SQL-based DBs, because you still choose which bits of data you're going to store (the keys in the object). It departs from the SQL point of view in not requiring that all the records have the same schema, and in not having a representation (at the data-storage level) of relations between records.
This turns out to offer some big advantages: in my experience, the number-one cause of unwieldy SQL is a schema that's had to grow over time to accommodate ever more edge cases. Maybe it's been done through lots of nullable columns, maybe it's been done through lots of related tables or some other mechanism, but it frequently has to be done and ends up making the database painful to work with.
CouchDB throws that out the window: edge-case records simply go in like anything else, and if they don't have some particular field present in other records, so what? The "query" is actually hitting a map/reduce in which the "map" function can take whatever action it wants with a record that's missing some particular key. It can skip that record, it can spit out a default value for a missing key, it can do anything it likes.
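To make that concrete, here's a toy Python sketch - not CouchDB's actual API (its views are typically written in JavaScript), just the shape of the idea, with invented documents:

    # Toy sketch: documents are plain dicts with no shared schema.
    docs = [
        {"_id": "1", "type": "post", "title": "Hello", "tags": ["intro"]},
        {"_id": "2", "type": "post", "title": "Edge case"},   # no "tags" key
        {"_id": "3", "type": "comment", "body": "First!"},    # different shape
    ]

    def map_fn(doc):
        """Emit (key, value) pairs; decide per-document how to handle it."""
        if doc.get("type") != "post":
            return                                 # skip records of other shapes
        for tag in doc.get("tags", ["untagged"]):  # default for a missing key
            yield tag, doc["title"]

    view = [pair for d in docs for pair in map_fn(d)]
    print(view)  # [('intro', 'Hello'), ('untagged', 'Edge case')]

The schema-bending edge cases live in the view code instead of in a pile of nullable columns.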
"I'm convinced that data has a different life cycle than procedural code and therefore needs to be expressed in a simple, uniform, reductionist model independent of all application code."
Simple and independent, yes. Reductionist and uniform, no.
To run with CouchDB as the example, those assumptions are thrown out because CouchDB essentially adds a layer to the stack. Traditionally, you have application code over here, querying data over there. In CouchDB, you have application code which queries a CouchDB view which returns data, but the query doesn't necessarily know anything at all about what the actual data in the DB is, or how it's structured (or even if it is structured in any sensible way; maybe it's just a bunch of random key/value pairs). The view layer is the part which cares about that.
And views are not static (or mostly static) things like the schemas in relational DBs; views are free to evolve over time, you're free to add or remove views in response to changing needs, and the underlying data never has to change as a result. And so you don't need to agonize over the most efficient way to reduce your data to a uniform schema. You don't need to "migrate" your underlying data storage representation to change the types of things you can store or the types of queries you can run.
I think data independence all but implies a reductionist data model (uniform probably isn't a very clear term). And when I say data model I don't mean a particular schema but the primitives that are used to model the data, like relations and attributes in the relational model or key value pairs in Berkeley DB.
I cannot say anything useful about CouchDB as I don't know it nearly well enough. What I think doesn't work is to hide data behind a procedural API when it comes to read access (write access is a different matter).
A procedural API is a black box that you cannot reason about and has a very application specific purpose that doesn't lend itself to analytics apps. Analytics apps should know as little as possible about particular applications. They cannot easily call arbitrary functions.
My experience with data centric apps is that it's a good thing to have that situation where everything is a table and each of the few operations you have creates another table. Tables in, tables out. The same thing works with lists, key/value pairs, etc.
My battle cry would be "No application-specific APIs" (for data access).
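To make the "tables in, tables out" point concrete, a throwaway sqlite3 example (table and column names invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER, customer TEXT, total REAL);
        INSERT INTO orders VALUES (1, 'ann', 30), (2, 'bob', 5), (3, 'ann', 12);
    """)
    # The inner SELECT produces a table; the outer SELECT consumes it.
    # Every operation takes tables in and gives a table out.
    rows = conn.execute("""
        SELECT customer, spent
        FROM (SELECT customer, SUM(total) AS spent FROM orders GROUP BY customer)
        WHERE spent > 10
    """).fetchall()
    print(rows)  # [('ann', 42.0)]

A general-purpose analytics tool only has to understand one primitive - the table - rather than someone's application API.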
My biggest hold up to investing time learning how to harness "noSQL" data stores is that there are so many young-ish projects and I'm afraid of picking a loser. CouchDB, MongoDB, Redis, Tokyo Cabinet, etc, etc.
Seems like it's too early to cast lots, and I'll just stick with what I know for the time being. Is that prudent of me or just naive?
The noSQL school includes both key/value databases and bona fide object stores. None of this stuff is new; BerkeleyDB, circa 1986, is as industrial strength as they come. O2, GemStone, Statice, etc. all go back to the 70s and 80s.
Just because something is on github doesn't mean it's ground breaking ;-) The future is full of wonderful things that are yet to reach us from Lisp and Smalltalk research of the past.
The sticking point with BDB is typically its license - if used commercially, you suddenly need a commercial license from Oracle, and the cost is unspecified.
The tricky thing is that there is not going to be a single winner here. What we are seeing is data moving out of a creaky (albeit very capable under certain conditions) silo, and managing it is becoming more task-specific. Each of the databases you mention has its own strengths and weaknesses depending on what you want to get out of it, the nature of your data and access patterns, and the environment in which it will be deployed. Right now I would suggest that you just watch the space, become conversant in the various models (especially the weaknesses of each) and take each for a small test drive so that you have a basic understanding of what each particular model is capable of.
Cassandra can write 50GB of data in 0.12 milliseconds, more than 2,500 times faster than MySQL.
Really? Cassandra can write 416,666GB per second? If Cassandra can write 50GB in 0.12ms, then it can write over 416TB in a second and 25 petabytes per minute. Which clearly isn't true. Of course, that quote would also mean that MySQL can save 166GB per second.
The article is a piece about technology that the writer doesn't understand. They all have trade-offs. Key-value stores eliminate your ability to access data by anything other than its key. So, you have an article with an id of 5. What if you want to look up the articles by author 9? You're just unable. Column based databases aren't magic either. They just change the orientation of the data. Rather than seeing 1,Adam,West;2,Mark,Twain;3,Will,Smith you see 1,2,3;Adam,Mark,Will;West,Twain,Smith. That offers some advantages - such as being able to do metrics on a column easily - but it also means that if you want to get the data for one row, it has to do MORE work than a regular row-based database. To retrieve 1,Adam,West it has to do multiple lookups since the data isn't located next to each other. And frankly, random access of a row is what you're likely to want.
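A toy illustration of the two orientations in Python (data made up):

    rows = [(1, "Adam", "West"), (2, "Mark", "Twain"), (3, "Will", "Smith")]

    # Column orientation: one sequence per attribute.
    ids, firsts, lasts = zip(*rows)

    # Column-friendly: metrics over one column touch nothing else...
    print(max(ids))                       # scans a single contiguous column

    # ...but rebuilding one row takes a separate lookup per column.
    i = ids.index(1)
    print((ids[i], firsts[i], lasts[i]))  # three fetches to get back one row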
They run on clusters of cheap PC servers. PC clusters can be easily and cheaply expanded without the complexity and cost of "sharding," which involves cutting up databases into multiple tables to run on large clusters or grids.
That's like saying that if you plant a hamburger you'll get a McDonalds. They don't just magically run on clusters of cheap PCs. The fact is that there are tradeoffs. If you put 50% of the data on server1 and 50% on server2 you'll get faster query speeds except that you then have to know where to look for a specific piece of data.
So, sharding: basically, you split your data along logical (or non-logical) lines into separate databases. Like registration at a conference, users A-M are in DB1 and users N-Z are in DB2. So, when someone comes up with a query, you can easily tell which place to send them and then the query is run there. Or you can non-logically split and have a map that tells you where to go. So you query metadb and ask "where's Frank" and it says "Frank is in DB1" and then you query DB1 for Frank.
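In code, both routing schemes are just a few lines (shard and key names here are made up):

    # Logical (range) split: users A-M in db1, N-Z in db2.
    range_shards = {("A", "M"): "db1", ("N", "Z"): "db2"}

    def route_by_range(user):
        first = user[0].upper()
        for (lo, hi), db in range_shards.items():
            if lo <= first <= hi:
                return db
        raise KeyError(user)

    # Non-logical split: a directory ("metadb") costs one extra lookup
    # but lets you place records wherever you like.
    metadb = {"frank": "db1", "zoe": "db2"}

    def route_by_directory(user):
        return metadb[user.lower()]      # ask "where's Frank?", then go there

    print(route_by_range("Frank"), route_by_directory("Frank"))  # db1 db1

The catch is that every query now has to go through a router like this, and cross-shard queries get ugly fast.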
BigTable isn't so different. It does do some awesome auto-splitting of tables, but when it comes down to it, it's much the same. There's a single META0 tablet that gets queried to find the location of the META1 tablet which knows the location of the actual data. And, in order not to overload the META0 tablet, one needs to cache the location that you get back.
--
Oftentimes, the problem is that you have web programmers who wouldn't know a B-Tree from a linked list. All of these technologies exist because there are certain things that they're good at. This article, however, purports to know that these NoSQL databases are just superior. In most ways, they're inferior. A key-value store has its place, but it's severely limiting the moment you need more querying power than a key lookup. Likewise, whether a column-oriented database is right for you depends more on how you want to access the data (by column or by row).
Oh, and really, learn SQL indexing and check whether your queries are doing full table scans. That's the root of a lot of problems. I mean, a good index should do lookups in log(n) time, which means that on 10,000 rows a query should take 1/769th the time; on a million rows a query should take 1/50,000th the time; on a billion rows it will take 1/33,333,333th the time of doing a full table scan of a table that size. Yeah, indexes make a huge difference, but they aren't magic either. They merely order the data in a certain way that makes it easy to pluck out certain rows since you don't have to look at every value. For example, dictionaries are alphabetized and so you don't have to look at every word. You just start jumping toward the area of the word you're looking for. Indices work the same way. Now imagine if the dictionary was unordered. You'd have to look at every single word to see if it was the word you were looking for.
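If the dictionary analogy isn't enough, the whole idea fits in a few lines of Python (word list invented):

    import bisect

    # A sorted "index" over a column, like the alphabetized dictionary.
    index = sorted(["apple", "banana", "cherry", "grape", "melon", "pear"])

    def indexed_lookup(word):
        i = bisect.bisect_left(index, word)      # O(log n): jump, don't scan
        return i < len(index) and index[i] == word

    def full_table_scan(word):
        return any(w == word for w in index)     # O(n): look at every value

    print(indexed_lookup("grape"), full_table_scan("grape"))  # True True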
Oh, and just saying "put an index on stuff" isn't enough to know. It isn't too complicated, but there's a Google talk that goes over some of the mistakes one can make with assuming that indexes are doing what you want (http://www.youtube.com/watch?v=u70mkgDnDdU&feature=chann...). It's in the middle somewhere, but it's a good watch for the whole thing.
--
Really, is this the type of article we've become? Column-oriented datastores and key-value datastores are both valuable technologies, but this is a NoSQL flame written by someone who doesn't know that Facebook is still MySQL-backed, uses Cassandra for only specific things, and uses many different technologies where they're good (like heavy use of memcached as a key-value store to reduce load on MySQL where queries aren't needed). Nor does the article acknowledge the drawbacks that the authors of these systems would talk about. For example, MongoDB states that it lacks ACID integrity and that it's more suited for "high volume, low value data" - which should be read as something like whether someone has upvoted or not; if that data gets lost it doesn't matter much, because the person can click the vote button again and all is well.
Why are we supporting "we hate this technology just because"? Technologies are tools that have their places. It's foolish to write off a good technology just because you want to make fun of it or deride it. Use the right tool for the job. In your web application, you might find that using multiple tools in concert is the way to go.
"Really? Cassandra can write 416,666GB per second? If Cassandra can write 50GB in 0.12ms, then it can write over 416TB in a second and 25 petabytes per minute. Which clearly isn't true. Of course, that quote would also mean that MySQL can save 166GB per second."
If you click through to the slides, they are actually referring to latency. The slides say that for accesses of > 50GB, Cassandra's latency is 0.12ms, and MySQL's is around 300ms.
This, of course, has very little to do with the rate at which data is written.
(Yay for IT journals... quality information for PHBs...)
"They all have trade-offs. Key-value stores eliminate your ability to access data by anything other than its key. So, you have an article with an id of 5. What if you want to look up the articles by author 9? "
For cases like that you don't have one source of data; you do something almost like an AFTER INSERT trigger to update the other flattened tables of data that you need.
For instance, where you would normally perform a JOIN, you would instead, just after saving, update a join table or a view that is maintained by incremental updates.
So then you can filter by many other things, and you really aren't limited too much by space in making those duplications.
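A sketch of that write path against a plain key-value store (key layout invented, dict standing in for the store):

    kv = {}  # stand-in for the key-value store

    def save_article(article):
        kv["article:%s" % article["id"]] = article
        # The AFTER INSERT trigger equivalent: update the flattened
        # "join table" keyed by author, so author lookups stay one get().
        key = "author:%s:articles" % article["author_id"]
        kv.setdefault(key, []).append(article["id"])

    save_article({"id": 5, "author_id": 9, "title": "NoSQL!"})
    print(kv["author:9:articles"])  # [5] -- "articles by author 9", answered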
It happens with large RDBMS systems anyway after about 5GB of data; the same types of scalability decisions have to be made.
But yes the article is off and you are right, most likely systems will approach this with multiple solutions for specific problems.
> It happens with large RDBMS systems anyway after about 5GB of data; the same types of scalability decisions have to be made.
I work every day with a > 4TB database and it's all on one machine (a very big machine) with some very fast SAN. If you have money, Oracle can get very big on the right hardware. I am willing to bet MySQL or Postgres could do very well on that sort of hardware too.
True, but scaling vertically is really a luxury; in most situations, especially free-to-play or open systems at a certain level of usage, it's typically not economical. For a good vertical scaling strategy you need servers that run into the tens or hundreds of thousands of dollars.
The interesting thing about the article is that the NoSQL movement has succeeded in reaching outside the database nerd circles! To get some insight into the actual trade-offs and use cases for non-RDBMS database systems, this article is a much better read: http://thinkvitamin.com/dev/should-you-go-beyond-relational-...
I should have clarified: I meant around 5GB per one type of data - maybe a table or dataset that needs to be merged. For instance, if you had a product database table that needed to be joined with another gigabytes-large set of data at runtime.
At that point, even in RDBMS you have to stop with the joins and flatten. I have experienced these limits in Oracle and MSSQL around 20-30 million row tables.
Scalable fundamentals like database flattening, dimensional modeling, etc. are all what key-value stores give you from the start. But really a good mix, or something project-specific, works best; I'm just saying that in a future with TBs and PBs of data, the JOIN is a historic remnant.
It is possible one day that RDBMS will be seen as one of those evil optimizations that we made in our small relative worlds at the time.
"It is possible one day that RDBMS will be seen as one of those evil optimizations that we made in our small relative worlds at the time."
I doubt that for the simple reason that relational databases are maths: the relational algebra and relational calculus. There's no such theoretical underpinning to object databases.
"I doubt that for the simple reason that relational databases are maths: the relational algebra and relational calculus. There's no such theoretical underpinning to object databases."
True, but the relational architecture moves to the object or code level; if you think about it in relation to size, it makes more sense there for the future.
The RDBMS was the brain, but it was also the storage. Code will now dictate how the storage is used, not the storage itself; the storage is becoming a component.
Well, the disk space is not the controlling parameter usually. Getting a ridiculously huge amount of fast, reliable disk is pretty much a solved problem. If all you were doing was writing logs at a constant rate then you could "scale" pretty much indefinitely just adding larger and larger disks to the same poky DB server, and it wouldn't matter if it was 5GB or 5PB.
The proper scaling issue is, of course, with the cpu and memory of your single DB server - and its disk speed - which can only be expanded so much, and at exponentially increasing cost. It's conceivable that even with a fixed amount of data, say 100G, you might have to confront scaling issues with increasing site usage, long before you reached anywhere near the limits of your raw disk space.
One database I work on was under 2TB when I got it a little over 3 years ago and is nearly 30TB now, so I am very familiar with the scaling issues. No need for sharding or any of that nonsense either!
Who said that? Certainly not me, and I doubt gaius has 30TB of RAM so not him either. How much RAM you need is a complex question and revolves to a certain extent around the DB caching the "hot points" but differs so radically app by app there's really no rules or general guidance you can point to. It comes down to measurement of your particular installation, as always.
And fyi I prefer PostgreSQL. But MySQL isn't THAT bad, c'mon.
I'm not saying MySQL is that bad, but I exclusively hear this kind of advice and anti-DB rhetoric from people running MySQL and MySQL only, who are complaining about scaling.
MySQL may be able to perform better - as I haven't done in-depth tests I wouldn't know how far it can be stretched - but my point is that almost everyone complaining about the performance of the RDBMS model seems to come from MySQL, thinking that it represents the utmost limits of what relational databases can do, and that is quite sad.
Hm. Well, I agree with your observations. But maybe they're more willing to suggest alternatives because they haven't invested so much into learning an alternate system? You would expect someone who has, say, spent the last 15 years becoming an expert on DB2 to refrain - even subconsciously - from questioning the very fundamentals his career is built on. MySQL users may not be advanced but at least they are unbiased.
MySQL is not that bad, no. With proper knowledge it can do a lot, and beyond the limits of a capably configured MySQL, I am not sure any other RDBMS can help. Facebook is not developing Cassandra because they can't afford Oracle.
I first used MySQL a decade ago... There are plenty of people who've fully invested their careers in it. Which is as sad as someone who thinks MS Access is the state of the art.
This article is really terrible, and it's sad because most of the pieces of software discussed are at least interesting. Several are indispensable to anybody in the business of dealing with PB-scale data. (I'm pretty sure Hadoop is seeing healthy enterprise adoption.)
That said, I think that calling the whole event NoSQL was kind of asking for it; of course people are going to try to squeeze some drama out of that stone. (I would have gone to the meetup anyway, though, if I hadn't already been booked.)
I agree with you. There's a lot of hype around these nosql databases, because a lot of people simply don't understand sql databases or how to make them performant or they haven't run into the problems RDBMSs solve yet.
I would like to add one quick correction, or maybe clarification, to "[Indexes] merely order the data in a certain way that makes it easy to pluck out certain rows since you don't have to look at every value."
Indexes don't always order the data, only clustered indexes order the data and there can only be one clustered index on a table. You can have lots of unclustered indexes that are merely pointers to data within the larger table.
http://en.wikipedia.org/wiki/Index_(database)#Clustered
"Cassandra can write 50GB of data in 0.12 milliseconds ..."
These numbers have been taken from p.21 of the linked PDF by the Facebook engineer Avinash Lakshman where it shows this MySQL vs Cassandra comparison:
"MySQL > 50GB Data
Writes Average : ~300ms
Reads Average : ~350ms
Cassandra > 50GB Data
Writes Average : 0.12ms
Reads Average : 15ms"
Which I'm guessing actually means: with a database table which is more than 50GB total size they have measured individual row accesses at these speeds. Maybe Avinash has accidentally transposed the read and write figures - I can't think why the read would be slower than the write.
15ms to write a row to disk seems possible given that a fast modern disk has a latency of 2ms, a seek time of 4ms, and a sustained transfer rate of about 100MB/s. The track-to-track time is only 0.4ms, so maybe if you just wrote all the data to disk serial-log-style you could reconstruct from the log after a failure and handle all reads from memory. I don't know Cassandra. Obviously, from these figures, the disk couldn't do a row read in 0.12ms.
Cassandra uses log-structured merge storage, so reads really _are_ slower than writes. (But still much faster than MySQL!)
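For anyone unfamiliar, here's a drastically simplified Python sketch of the log-structured idea (real LSM storage adds sorted runs, compaction, bloom filters, etc.; this only shows the read/write asymmetry):

    memtable, segments = {}, []

    def write(key, value):
        memtable[key] = value                 # O(1), no seek
        if len(memtable) >= 2:                # tiny flush threshold for demo
            segments.append(dict(memtable))   # sequential flush to "disk"
            memtable.clear()

    def read(key):
        if key in memtable:
            return memtable[key]
        for seg in reversed(segments):        # newest first: reads do the work
            if key in seg:
                return seg[key]

    write("a", 1); write("b", 2); write("a", 3)
    print(read("a"), read("b"))  # 3 2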
You can (usually) make reads faster by throwing things like memcached at the problem. Writes are harder. So I think this is the right tradeoff for a modern system.
"They all have trade-offs. Key-value stores eliminate your ability to access data by anything other than its key. So, you have an article with an id of 5. What if you want to look up the articles by author 9"
You build and maintain indexes, just like you do with an SQL db.
Oh don't get me wrong, I use CouchDB and love it. I think it's going to be big.
It is very good at what it does. But it is not good at, nor ever will it be good at, ad hoc queries like the GP describes. The second time you query that view it will be blindingly fast. The first time, however, it has to build the view index from scratch, which in a large-ish DB might well take hours.
You just can't assume you will have pretty fast "random access" queries like you would on MySQL or similar. Of course, it goes the other way as well, and there are many examples of views you can trivially do in Couch which would be prohibitively expensive in MySQL.
As always, you pays your money and you makes your choice.
I know you don't mean it this way but this sounds like SQL databases can magically index their data without scanning it. Add a new index to a large table in any RDBMS and it'll do exactly what you're describing.
Query on an un-indexed column and it'll take forever every time. (There's an Oracle database I deal with occasionally that takes 50 seconds to count 78 rows. No, we can't add an index.)
All NoSQL (eewww) non-RDBMSs do is move the pain around. The efficiency and lookup times are (usually, mostly) orthogonal to the orientation of the data; all you can do is align your use case to hit as few pain points as possible.
Oh yeah I know. But RDBMSs tend to be much faster at ad hoc queries.
I don't know what's going on with your Oracle install but even taking the small example of 1k rows, an unindexed ad hoc query in MySQL will return pretty quickly, well under a second on a decent machine/disk. That might be fine for, say, occasional use of a "reports" web page - and you don't need to then store and update an index. The same query on CouchDB will be at least 10 times slower, possibly making the page unusably slow, and if you want it to be usable you need to store the index - no choice.
But yeah, just "moving the pain around" is absolutely right. Ain't no silver bullets.
This process is only slow if your DB is already huge. CouchDB forces you to think your queries through a bit more before building your DB. To be fair, though, try to add an index to a huge SQL database: it will take a long time too, and SQL indexes are needed on any frequently used query.
I agree, the article is really sloppy, a typical piece of journalism I would say.
I just want to add that column stores are not in any way anti SQL. They don't change the data model at all. They're just a different implementation of the relational model, more suitable for analytics than row stores. For OLTP apps they're bad. Reconstructing one row isn't necessarily expensive (as column stores don't just store individual columns), but writing one row is expensive.
To be fair to the noSQL guys, the technologies that drive the alternatives are very, very young. People are just starting to play with them. There's no reason to think that techniques for performing more traditional SQL-style queries on distributed data won't quickly surpass it.
And as far as a lookup occurring in log(n) time, imagine a broadcast: "hey everyone, who has the row associated with 1030923?" Each node checks its bounds (in parallel) and then, if the key exists within those bounds, does a log(n) search. That's pretty reasonable.
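Roughly like this (node layout invented):

    import bisect

    # Each node owns a key range and keeps its keys sorted.
    nodes = [
        {"bounds": (0, 999999),        "keys": [42, 1030]},
        {"bounds": (1000000, 1999999), "keys": [1030923, 1500000]},
    ]

    def lookup(key):
        # The "broadcast": every node checks its bounds (in parallel IRL)...
        for node in nodes:
            lo, hi = node["bounds"]
            if lo <= key <= hi:
                # ...and the owning node does the O(log n) search locally.
                i = bisect.bisect_left(node["keys"], key)
                return i < len(node["keys"]) and node["keys"][i] == key
        return False

    print(lookup(1030923))  # True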
In short, my prediction is that these things will grow way outside of key-value pairs and into a complete solution that scales very well (and is _incredibly_ redundant).
Are you kidding? I wouldn't say it's impossible, but "no reason to think that techniques for performing more traditional SQL-style queries on distributed data won't quickly surpass it" is quite a stretch. There's plenty of reason to believe they won't, mainly the number of databases and the sheer volume of thought put into RDBMSes, both theoretical and practical. Sure, when it comes to massive scalability there are dozens of promising new technologies that beat SQL databases, but they all make tradeoffs; there's just no way not to when you're talking about something as powerful as SQL.
I imagine there will be some great improvements that come out, but I think we're past the point of any major data store revolutions unless something fundamentally changes like quantum computing becomes a reality and we have to all basically start over.
I suspect that solid state drives optimized for databases will appear before quantum computing. Fusion-io is already targeting media distribution applications. Just as filesystems have been optimized for spinning platters and are adjusting for SSDs, databases will have to adjust as well. (I'm not just talking about flash, either. There will be other SSD technologies.)
The software that runs on any node is pretty trivial compared to a SQL database - very easily written and understood. In three years, I assure you, these systems will be significantly better and will start to rival many of the features of RDBMSes. At least the ones most relevant to anyone running something like Hacker News or Stack Overflow.
So I'm working on a small project that will develop into a startup if the prototype works. It's heavily based on information handling, and to supplement the functionality we plan on having an internal user profile database.
You are obviously familiar with the functionality of modern database tech. I was thinking SQL - might there be something with faster/lower-latency lookups?
Haven't nailed down all the fields yet, but the baseline will include taxonomy (language topics) and historical preferences (general and specific).
(I can go into more detail by email if you're curious.)
I also noticed this outrageous claim and left a comment on the article. It's shocking that the author can just blurt out random BS and post it without verifying.
Performance is always a trade-off; you rarely get something for nothing. These object stores are great when all you're doing is storing objects, but that's a pretty simplistic use-case. If you need to do more with your data than simply read it back in object-sized chunks, you're going to find that a relational database probably provides better performance. For web sites especially, retrieving and constructing objects is a fairly expensive operation if you're just generating list pages.
I wonder if we just have a generation of programmers who can't think in SQL. They're used to using ORMs (which are useful abstractions) but can't work at a lower level. Sending a query over the line to get exactly the results you want, and no more, and have it optimized and run entirely on server is pretty damn efficient.
"I wonder if we just have a generation of programmers who can't think in SQL."
Maybe we have. I'm not so sure.
What I am sure of is that we've had multiple generations of developers who've seen that real-world data almost never fit into a uniform, enforceable schema, and so are happy to have solutions which accept that reality instead of demanding that developers and DBAs try to shoehorn the data into uniformity anyway.
"I wonder if we just have a generation of programmers who can't think in SQL."
I think that's probably true. SQL requires thinking declaratively, not imperatively, and in sets, not objects. Kids these days don't understand the difference between a table definition and a class and the difference between a row and an object. The declarative/functional style isn't really taught anymore; everyone just wants to learn Java and get a job.
If anything, ORMs are less "scalable" than SQL because under the hood the system is based on SQL, and in an effort to be completely generic, the ORM generates really bad SQL.
Indeed, "kids" (by which I assume you mean "assembly-line CS grads") these days are pretty much using Hibernate, ActiveRecord or (I guess?) LINQ. This is a good thing: The O/R mismatch exists. For good programmers ORMs are more productive in most cases (and you can write raw SQL for the exceptions); bad programmers have never gotten the relational model (and there's plenty of legacy code to prove it.)
However, the discussion about "kids" is a total red herring, because I'm pretty sure all of the people involved in talking up Tokyo Cabinet, memcachedb, Couch etc on one end and Cassandra, Hypertable or HBase on the other are fully aware of the difference between a class and a table. The truly clueless don't even know the discussion is happening.
The object/document database revival circa 2009 is about three things:
A. People are already using object-based access for almost everything, so if they're not using relational features it's trivial to drop in an object store and get better performance/less overhead.
B. On the low-end, sometimes a relational database is too much overhead -- if not in performance, in administration. (SQLite is cheap, sure, but sometimes you want just a disk-backed hash table.)
C. On the high-volume-data-analytics end, RDBMSs don't perform well enough - unless you shell out serious cash for Oracle or Greenplum, Aster Data, Teradata, etc, and even then they still can't handle the volumes MapReduce and Bigtable-alikes can.
A. That is fine until the first time somebody has to write a report.
B. With the right product, administration is easy. MySQL is completely ubiquitous, and something like SQL Server is easy. This certainly conflicts with point C.
C. Most of these people aren't doing high-volume anything. If you are doing high-volume stuff (like Google) then of course it makes sense to use something very specific to your task. But that doesn't mean Bigtable, for example, makes for a good generic solution.
If you anticipate having to write a report, obviously you should use the right tool. If you don't, use what's simple and gets the job done. It's not incredibly hard to switch later.
On the low-end, administration for MySQL is certainly not easy; for an app you build in a day or two on Rails or Django or Sinatra (or as a CLI tool for that matter) you can easily spend more time doing sysadmin work to provision, configure and maintain MySQL than you do writing software. (MySQL shared hosting isn't everywhere.) There are plenty of throwaway webapps out there that just aren't worth setting up a database server for.
On the high end, it's not a matter of what's easy; it's a matter of what's possible.
These are different use cases, and there are different software packages being advocated for them. No one who needs a disk-backed hash table is using Hypertable, and no one who needs massively parallel analytical capability is using Tokyo Cabinet. And no one's advocating reconsidering RDBMS use as a whole; just RDBMS use as the default choice.
I think the key insight in all this is that the same database you store your objects in... shouldn't be the same one you write reports from. You can have two. One object store for your application, and another analytic DB for your reports.
The complaint about SQL is that it's optimized for report writing, not application development. Fine, so split that out.
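The plumbing for that split can be as dumb as an event log feeding the reporting side; a toy sketch (all names invented, in-memory stand-ins for both stores):

    object_store, event_log, report_rows = {}, [], []

    def save(obj):
        object_store[obj["id"]] = obj     # the application reads/writes here
        event_log.append(obj)             # feed for the analytic/reporting DB

    def sync_reporting():                 # in practice a background ETL job
        while event_log:
            o = event_log.pop(0)
            report_rows.append((o["id"], o["widgets_sold"]))

    save({"id": 1, "widgets_sold": 7})
    sync_reporting()
    print(sum(n for _, n in report_rows))  # the report's number: 7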
Do you happen to know any articles on people splitting their data across two types of storage this way?
If there is data that should be stored in two separate stores, I can see applications becoming messy rather fast in trying to replicate changes.
Unless, of course, this is usually done by having a clean separation between which data goes where - but I doubt any domain can ever really be split up that easily.
Well, for starters - most companies don't run reporting queries on their production SQL database. They mirror, summarize, partition, index and cube a separate reporting DB, so that big/mean queries don't cause massive latency on their site/product/production system. Which isn't quite the same as using a different data-store altogether, but some kind of split is commonplace. Which means that some kind of difference between queries/applications in reports/production is already common. The key here, though, is that setting up an RDBMS that can handle analytics on even a moderately large data-set is a major task, can be complex/pricey, tends to use big iron, and only scales so far before it gets very, very expensive.
But, yes there are examples of what I just described. In practice, in many problem domains, most data of interest for reports does not change once it is written, so syncing up is not a major issue.
CouchDB sort of takes this approach, albeit with key/value and pre-defined and materialized map/reduce views on the same store. I think this dichotomy will become increasingly common, and will be less cumbersome than it currently is as the tools mature.
Key/Value for the front end and Map/Reduce on the back end makes a lot of sense for a lot of problems, since key/value is how many applications actually work, and there is the added benefit that systems like these scale linearly on commodity hardware using FOSS, which can make them cost-effective - and much simpler - compared to scaling a traditional RDBMS as an analytic data-store. The upside is too good for these systems not to win a big chunk of the market. And you can have your SQL - albeit on top of MapReduce - in reports, where it belongs :)
It's a running joke where I work how bad Hibernate's SQL is compared to hand-written by experienced developers. Maybe it's "good enough" for some applications, but we don't really get out of bed for less than 5000 transactions/second...
You speak as if writing SQL was some unpleasant or arduous task, whereas in reality it's just a DSL for data. In most cases an ORM is just another layer of complexity.
> In most cases an ORM is just another layer of complexity.
That, unlike SQL, drastically reduces the amount of code the programmer is forced to write for the vast majority of applications that programmers write.
SQL can't do the one thing most programs actually need: give them the ability to select a starting point and then navigate the conceptual graph of data as the user moves around the application. ORMs provide that, object databases provide that, SQL doesn't.
You're right, and that's what's great about ORMs! But if you're asking a question of your data, like "how many widgets did I sell today?", then navigating a conceptual graph of data is the slowest, most convoluted way of getting at that information.
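Compare (sqlite3 as a stand-in, schema invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (widget_id INTEGER, qty INTEGER, day TEXT);
        INSERT INTO sales VALUES (1, 3, '2009-07-01'), (2, 4, '2009-07-01');
    """)
    # Declarative: one round trip, and the server does the summing.
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(qty), 0) FROM sales WHERE day = ?",
        ("2009-07-01",),
    ).fetchone()
    print(total)  # 7 -- no walking an object graph just to add up a field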
"I wonder if we just have a generation of programmers who can't think in SQL."
And they don't even teach set theory in elementary school anymore like they did in the early seventies, which may be among the root causes of the malaise. Understanding things like relational databases and conditional probabilities is so much easier if you have been taught a solid foundation from early on.
> I wonder if we just have a generation of programmers who can't think in SQL. They're used to using ORMs (which are useful abstractions) but can't work at a lower level. Sending a query over the line to get exactly the results you want, and no more, and have it optimized and run entirely on server is pretty damn efficient.
Yeah, it's really obvious if you ever work with a really, really senior DBA. I don't think it's the queries though, it's the modeling. I think a lot of software teams model their database sort of like they model data structures and then use various ORM tools as the glue. DBAs tend to model the data in ways that make it most acceptable to the database and minimize the loss of any information; it's almost always more complex. The database has become an object persistence engine in a lot of cases, one that has some relational properties that may or may not be used. Instead of writing and managing files in the filesystem, you shove stuff into a database.
Just for starters, your ORM will model an object for a row of data; how many times do you get the whole row when you're really interested in just a column or two? Does your ORM let you just specify the parts of the row you're interested in, or does it hydrate an object and populate all of the columns? (Those extra columns being copied does add up...) ORMs are a religious war; as a software engineer it's a really beautiful idea, but in reality I've never seen one that really works well with the database - they're too softwarey.
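The difference is easy to see with plain sqlite3 (schema invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER, name TEXT, email TEXT, bio TEXT);
        INSERT INTO users VALUES (1, 'ann', 'a@x.com', 'a very long bio...');
    """)

    # ORM-style hydration: the whole row comes back, needed or not.
    full_row = conn.execute("SELECT * FROM users WHERE id = 1").fetchone()

    # Projection: ask for just the column you care about.
    name_only = conn.execute("SELECT name FROM users WHERE id = 1").fetchone()

    print(len(full_row), len(name_only))  # 4 columns copied vs 1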
The other thing that seems to be rampant is that the traditional 3-tier application model has kind of collapsed. It's not the case everywhere, but I've seen it at more than a couple of places where there is a persistence tier and then kind of a combined presentation/business tier. With a more traditional data model the business layer is absolutely critical and shows a lot of value; you might have to glue some more complicated queries together into objects inside a transaction rather than just hydrating a row from a table. When you use a database as a store for your data structures, a business layer just doesn't seem to be as valuable.
If you just want keyed data storage, a relational database does become the wrong tool. I would think, and this might just be parochial thinking, that at some point keyed data storage would essentially reinvent the relational database as the problems grow in complexity.
"Just for starters, your ORM will model an object for a row of data; how many times do you get the whole row when you're really interested in just a column or two? Does your ORM let you just specify the parts of the row you're interested in, or does it hydrate an object and populate all of the columns?"
I don't know about (N)Hibernate or other ORMs, but LINQ to SQL lets you specify exactly what you want, and when you inspect the SQL generated, it is usually quite efficient, although not 100% optimal for complex queries.
There is also the issue with nested objects (relations) and if they should be prefetched or not (to avoid in effect nested-loop type SQL), but that is perfectly controllable.
My biggest issue with ORM layers is transaction handling when you are doing some things in the DB, some things with data from this query, and some things with data from that other query. This can quickly promote what would be a simple transaction in the DB to a distributed transaction.
When used correctly ORMs are very nice tools indeed, but to use them efficiently, you have to know how they work and how the underlying DB they interface with works as well.
> I wonder if we just have a generation of programmers who can't think in SQL. They're used to using ORMs (which are useful abstractions) but can't work at a lower level.
There is probably some truth to this, but another reason is that they want to avoid certain scaling hassles or costs.
Edit: I'm not saying this is necessarily a good thing.
I'm sorry, but this graph says nothing. Load average is meaningless; for example, one process using 100% CPU at all times would show a load of 1.0, and one process that repeatedly spawns threads but uses little cpu can push the load average very high.
What's the difference in maximum number of pages served per second using ab? Or average CPU usage? Or average disk throughput?
I'm not saying I disagree with your conclusion, but this doesn't prove it.
That's much more interesting data, but it clearly shows that the reason Redis was performing better is because it's caching a ton of data in memory, whereas MySQL was constantly hitting the disk for access. This seems to be a misconfiguration of either your kernel settings or MySQL settings. If Redis effectively cached your entire dataset in RAM, then MySQL could have done so as well (or the kernel could have). I'm guessing that you configured MySQL in a way that caused the kernel to fight with the database over memory usage.
That sheds some light on things. One thing that doesn't make sense to me is that your disk writes are completely flat now, no real variation through the day.
MySQL can hardly be called representative for good relational databases. That MySQL is underfeatured and performs like shit for any non-trivial query is hardly news.
This seems oversimplified to me. When I think "database" I think of ACID properties. You can have SQL without ACID, and you can have ACID properties without SQL.
Fully agree. Look at graph databases like Neo4j, which has ACID semantics but no SQL interface. And IMHO SQL is not only geared towards a specific RDBMS type of underlying database but even towards a specific problem domain that is not always applicable, leading to query languages like SPARQL.
The problem with this simplification of the problem-space is that ACID properties cannot be maintained (see CAP) when you start moving to a distributed model. The "no SQL" group is just using the name as a catchy hook upon which to hang their hat; it is not SQL per se but the traditional RDBMS that is reaching its limits - it was just hard to make a memorable name that included "RDBMS" :)
"The problem with this simplification of the problem-space is that ACID properties cannot be maintained (see CAP) when you start moving to a distributed model."
That assumes you need the system to remain available in the presence of partitions. How often is that really true?
I think SQL has its limits too, for example in hierarchical use cases. While it is possible to write extremely complicated CONNECT BY clauses in SQL, I would prefer XPath and XQuery for that job.
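For what it's worth, the portable way to do it in SQL these days is a recursive CTE (SQL:1999) rather than Oracle's CONNECT BY; a minimal sqlite3 sketch (schema invented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE nodes (id INTEGER, parent INTEGER, name TEXT);
        INSERT INTO nodes VALUES (1, NULL, 'root'), (2, 1, 'child'),
                                 (3, 2, 'grandchild');
    """)
    tree = conn.execute("""
        WITH RECURSIVE sub(id, name, depth) AS (
            SELECT id, name, 0 FROM nodes WHERE parent IS NULL
            UNION ALL
            SELECT n.id, n.name, s.depth + 1
            FROM nodes n JOIN sub s ON n.parent = s.id
        )
        SELECT name, depth FROM sub
    """).fetchall()
    print(tree)  # [('root', 0), ('child', 1), ('grandchild', 2)]

Still verbose compared to XPath for genuinely hierarchical data, which is rather the point.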
I wrote a fairly detailed post about the Death of the Swiss-Army RDBMS on my blog, http://www.roadtofailure.com :)
The RDBMS has been used for way too long as a generic "bucket" for any sort of data storage and analytics. As a result, speed and scalability have suffered. With Social Media and the advent of "web-scale" data, engineers need to realize that the storage and analysis solutions they use must be driven by the type of data they're storing, and what they want to do with it.
As I state, "The ACIDy, Transactional, RDBMS doesn’t scale, and it needs to be relegated to the proper dustbin before it does any more damage to engineers trying to write scalable software."
This article is the biggest load of fucking bullshit I've ever read. That's all I'm going to say about it. The author couldn't tell his ass from a hole in the ground.
I am glad you are getting modded back up from the grave; I agree and appreciate someone calling a spade a spade.
We don't always need to get into thesis-length technical arguments about the merits of so and so viewpoint - sometimes an article is just bullshit and deserves to be dismissed as such.
I'm not convinced. Of course these projects are useful for some applications but most online services will never grow to the size where you HAVE to dump your SQL server for something more scalable. And in the meanwhile SQL is just sooo convenient and easy to develop on. Why would I want to switch unless I have to?
You'll still be able to use SQL. You will still use RDBMSes for things they are good at. You will stop using SQL for things SQL is not good at.
You will start collecting much more data. You'll be able to throw as much processing power at your problem as you want to, for short periods of time. You'll be able to effortlessly pull in disparate data-sets and work with them without the effort required to import new data and pre-define relationships and indexes in a SQL schema. You will be able to query unstructured data in multiple formats including video, images, audio, html, plaintext, etc. with more structured data. You will live in a culture of data, where large numbers of PCs are brought to bear on all of your data whenever and however you want. You will discover new relationships you did not know were there. You will get increased value from data you didn't know had value because you didn't bother collecting it, because your DBMS wasn't good at working with that kind of data.
All these are new opportunities that exist in addition to traditional RDBMSes, not in place of them.
> And in the meanwhile SQL is just sooo convenient and easy to develop on.
No, it isn't, it's a pain in the ass because it doesn't match how applications are programmed. I know SQL pretty damn well, and I know how to tune a database very well, and SQL still sucks.
It has its pain points, to be sure, but the fact that "it doesn't match how applications are programmed" is because it's modeling data, not code. Yes, your object store will make the application much easier to write initially. But then what happens when you need reporting, or worse, some feature you never anticipated that requires a different view into the data?
That's not to say SQL is the end-all-be-all, but people are so quick to dismiss it, and I bet there is going to be a lot of pain felt as code based on simpler databases ages.
Then I replicate the application's data out into a relational database for reporting. 99% of what applications need isn't reporting. I want to use many different data structures, not be forced to fake everything with a table or continually reassemble things from tables.
SQL has its strong points, being a convenient data store for applications isn't one of them.
Again, it's not convenient for greenfield, but applications don't spend most of their time in that state. A good ORM is a compromise between holding onto that flexibility and being able to reconstruct your data objects easily.
I realize that there are a lot of applications where you will never need that flexibility and a simpler database would be the right choice. My attitude is based a response to the unbelievable hype surrounding these technologies among a certain class of tech blogger. Given how quick many people seem to be willing to chuck the relational model out the window without acknowledging any of its benefits, I predict there will be a lot of rude awakenings and failed projects in the next 10-20 years.
> I predict there will be a lot of rude awakenings and failed projects in the next 10-20 years.
As opposed to the many many failed or stuck projects out there now because they made the mistake of using a relational database as an integration point between many applications and can't risk changing anything now because that one big global variable (the db) can't be touched without breaking 15 applications.
The relational folk don't like to admit their shortcomings either.
I should have said "some people". I don't think developers in general are dismissing SQL. I just think there's a certain vocal minority who has a loathing so great for SQL that they willfully ignore the power of the relational model in their analyses of why some of these new DBs are so great.