They are different beasts really. It'd be hard to implement something complex, like the PageRank algorithm, in SQL. On the other hand, MapReduce is not designed for tasks such as "SELECT * FROM table WHERE field > 5", which is one of the benchmarks the paper uses.
Also, parallel DBMSs don't handle failure well. Ask any of the MapReduce developers at Google and they'll tell you failure handling is the hardest part of the problem.
With all that said, I'd be very interested to see a comparison between Vertica, DBMS-X and MySQL, running in sharded or MySQL Cluster mode.

PS The paper is coauthored by Microsoft.
The paper was co-authored by David DeWitt, not Microsoft. DeWitt is a well-known database researcher from Wisconsin. And Microsoft, despite their awful history of business practices, pours tons of funding into really good quality research (via MSR).
I'm not saying people aren't biased by their associations -- certainly, the traditional DBMS community feels like it has to defend its research & products against new techniques like MapReduce -- but if you're going to criticize the paper, criticize it because you found flaws in its contents, not because of the organization listed beneath an author's name.
A single-statement SQL PageRank implementation sounds feasible to me. Most commercial SQL implementations have had hierarchical, ranking, and windowing functions for a while now. They all have XML (maybe HTML?) document parsers in them as well. I think moderately complex hierarchical rankings can be built from those primitives, even as a single SQL statement.
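For what it's worth, here's a rough sketch of what one PageRank iteration could look like in plain SQL. The table names (links, ranks), the 0.85 damping factor, and the omission of dangling-node handling are all my own assumptions for illustration; the fixed point would still have to be iterated from outside (or via a recursive CTE where the dialect allows it):

    -- Hypothetical schema: links(src, dst), ranks(page, pr).
    -- Computes one simplified PageRank step; repeat until the
    -- scores converge.
    WITH out_degree AS (
        SELECT src, COUNT(*) AS n_out
        FROM links
        GROUP BY src
    )
    SELECT l.dst AS page,
           0.15 + 0.85 * SUM(r.pr / d.n_out) AS pr
    FROM links l
    JOIN ranks r      ON r.page = l.src
    JOIN out_degree d ON d.src  = l.src
    GROUP BY l.dst;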
The performance of that code, and if/how it gets parallelized by the system, is really a quality-of-implementation issue. If Oracle, Microsoft, or IBM thinks they can make money by implementing automatically parallelized hierarchical ranking and windowing functions, they will do so; maybe some of them already have.
I share Oracle's philosophy about how abstract SQL should be. Oracle says you should be able to write the same SQL against a small single-instance database that you would write against a giant multi-instance database; the database is responsible for optimizing the performance, not the developer. There are a lot of cases where the theory doesn't match practice with Oracle, but every new release gets the two closer. I wouldn't be surprised to see Oracle surpass general-purpose MapReduce-based approaches in terms of performance soon. And, I am expecting Oracle to release a tool to efficiently run MapReduce-style programs against Oracle (converting them to SQL) any time now.
Did no one else notice that this paper was authored by the C-Store guys, the ones who went on to found Vertica (specifically Stonebraker)? They have an intense bias against map/reduce, as could be expected.
Also, one important fact that people seem to have missed: they admit that m/r was better at loading and tuning the data, a task which under its usual name (ETL) dominates most data warehousing workloads.
Finally, please note the price tags on the solutions advocated by this paper. They are far out of the reach of any startup founders reading this paper. I looked at them at both Commerce360 and Invite Media and they were just nowhere near reasonable enough for us to replace our existing Hadoop infrastructure with.
CAP Theorem:
Consistency, Availability and (network) Partition Tolerance: pick two.
RDBMSs are excellent at providing consistency and availability with low latency. Yet imagine building a massive parallel web map/crawling infrastructure on top of an RDBMS -- particularly with multiple datacenters and high-latency links between them: it may be feasible for an enterprise/intranet crawler, but not for something at the scale of Yahoo or Google search (both of which build up their datasets/indices using map/reduce and then distribute those indices to search nodes for low-latency querying via more traditional approaches).
Keep in mind that the article itself admits that loading the data into an RDBMS took much longer than loading it onto a Hadoop cluster.
That's the crux of it: if your data isn't relational to start with (e.g. messages in a message queue, being written to files in HDFS), you're better off with Map/Reduce. If your data is relational to start with, you're better off doing traditional OLAP on an RDBMS cluster.
When I worked for a start-up ad network, we couldn't afford the latency/contention implications of writing to a database each time we served an ad. Initially, we moved the model to "log impressions to text files, rsync the text files over, run a cron job to enter them into an RDBMS". Problem is, processing the data (on a single node) was taking so long that the data would no longer be meaningful by the time it made it into the RDBMS (and the OLAP queries finished running).
I thought about how I could ensure that a specific machine only processes a specific log file, or a specific section of a log file -- but then I realized that's what the map part of map/reduce is. In the end I set up a Hadoop cluster, which made both "soft real time" (hourly/bi-hourly) data processing and long-term queries (running on multi-month, multi-terabyte datasets) much faster and easier.
Theoretically, yes: an RDBMS is the most efficient way to run complex analytics queries. Practically, we have to handle issues such as asynchronicity, contention, ease of scalability and getting the data into a relational format (which itself is something that could be done using Hadoop: one of the tasks we used it for was transforming all sorts of text data -- crawled webpages, log files -- into CSV data that could be loaded onto the OLAP/OLTP clusters with "LOAD DATA INFILE").
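To make that last step concrete, the final load is just a plain bulk import. Something along these lines -- the file path, table, and column names are invented for illustration, not our actual schema:

    -- Hypothetical MySQL bulk import of the CSV that the Hadoop
    -- transform step produced; names are illustrative only.
    LOAD DATA INFILE '/data/impressions-2009-04-01.csv'
    INTO TABLE impressions
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    (served_at, ad_id, site_id, user_hash);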
if your data isn't relational to start with (e.g. messages in a message queue, being written to files in HDFS), you're better off with Map/Reduce
If you query the data once and then throw it away, then certainly load performance is a critical issue. If you load the data once and then query it many times, load performance is far less important -- trading a longer load time for much better query performance would be a good idea. So it depends on the workload as much as the format the data happens to start in.
The conclusion is a pretty good summary, if you don't want to read the whole lot.
A big thing that's missing from parallel DBMSs is fault tolerance, which the paper cops to (but also plays down):
"MR is able to recover from faults in the middle of query execution in a way most parallel database systems cannot. ... If a MR system needs 1,000 nodes to match the performance of a 100 node parallel database system, it is ten times more likely that a node will fail while a query is executing. That said, better tolerance to failures is a capability that any database user would appreciate."
Also: "Although we were not surprised by the relative performance advantages provided by the two parallel database systems, we were impressed by how easy Hadoop was to set up and use in comparison to the databases."
Speed isn't necessarily the primary advantage that MapReduce provides; it's more about shifting the burden of scaling backend processes from the programmer to the infrastructure.
That said (caveat: I don't have a lot of real-world MapReduce experience), in my line of work I do a lot of analysis at less than web scale (there's still a lot of business value there too, folks!), and the thought of writing a bunch of MapReduce jobs just to match the analytical capabilities of vanilla SQL is not a pleasant one.
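To give a hedged example of what I mean (the schema here is invented): something like the query below is a single statement in any SQL dialect with window functions, but would typically take two or three chained MapReduce jobs -- one for the aggregation, another for the per-group ranking:

    -- Daily revenue per advertiser, plus each day's rank within
    -- that advertiser: a GROUP BY feeding a window function.
    -- Table and column names are hypothetical.
    SELECT advertiser_id,
           DATE(served_at) AS day,
           SUM(revenue)    AS daily_revenue,
           RANK() OVER (PARTITION BY advertiser_id
                        ORDER BY SUM(revenue) DESC) AS revenue_rank
    FROM impressions
    GROUP BY advertiser_id, DATE(served_at);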
If people have experience in getting around this, I'd love to hear about it.
The empirical evidence Google provides gives a strong argument that MapReduce can perform pretty well:
' Results 1 - 10 of about 1,990,000,000 ... (0.13 seconds) '
I would say it really comes down to preference, the data set, and how you plan to handle scaling/development.