I reread DeWitt and Stonebraker’s (D&S) MapReduce criticism [1] and I still find it misguided 12 years later.

Map() is not equivalent to a SQL GROUP BY clause; it is equivalent to a user-defined Table Function used in a FROM clause. This mimics the Extract and Transform stages of a SQL ETL pipeline, with the Extract implied by the input format.
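
Roughly, as a sketch (the function name and tab-separated input format are mine, purely for illustration): a map() takes one input record and emits zero or more keyed rows, which is exactly the shape of a Table Function producing rows for a FROM clause.

    # Hypothetical mapper: one input record in, zero or more
    # (key, value) rows out -- a table function, not a GROUP BY.
    def map_fn(line):
        # Extract is implied by the input format; Transform happens here.
        fields = line.split("\t")
        if len(fields) >= 2:
            yield fields[0], int(fields[1])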

The Reduce() is very much equivalent to a user-defined Aggregate Function. D&S accurately criticize the suboptimal materialization of intermediate data sets, but they underappreciate the implicit input split and distributed sorting mechanism, which dominated the TeraSort benchmark (a Jim Gray creation) at the time.
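
Again as an illustrative sketch: the framework sorts and groups the mapped pairs by key across the cluster (the part D&S underappreciate), then calls reduce() once per key with an iterator over that key's values, which is precisely the contract of a user-defined Aggregate Function.

    # Hypothetical reducer: called once per key, after the framework's
    # distributed sort has grouped the values -- a user-defined
    # aggregate in everything but name.
    def reduce_fn(key, values):
        yield key, sum(values)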

On-premises commodity Hadoop clusters lost out to public Infrastructure-as-a-Service clusters. None of D&S's five takedown categories turned out to be important. The tools have evolved, and cloud-native data warehouses and ETL systems are now the best of both worlds.

[1] https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_...




"Map() is not equivalent to a SQL GROUP BY clause, it is equivalent to a user-defined Table Function that is used in a FROM clause."

No, the projection doesn't remove redundancy in most cases. There also isn't any reason you couldn't have UDFs in the GROUP BY clause. I've written implementations of both, and I think GROUP BY is an excellent comparison for understanding Map in MapReduce systems.
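
To make the comparison concrete, here's a toy single-process sketch (all names are mine, not from either system): the "shuffle" that sorts and groups mapped pairs by key is doing the same work GROUP BY does before its aggregate runs.

    from itertools import groupby
    from operator import itemgetter

    def map_reduce(records, map_fn, reduce_fn):
        # "Shuffle": sort the mapped pairs by key, then group them --
        # the same step GROUP BY performs before an aggregate runs.
        pairs = sorted((kv for r in records for kv in map_fn(r)),
                       key=itemgetter(0))
        for key, group in groupby(pairs, key=itemgetter(0)):
            yield from reduce_fn(key, (v for _, v in group))

    # Word count, the canonical example:
    words = "the quick fox the fox".split()
    counts = map_reduce(words,
                        lambda w: [(w, 1)],
                        lambda k, vs: [(k, sum(vs))])
    print(dict(counts))  # {'fox': 2, 'quick': 1, 'the': 2}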


> the projection doesn't remove redundancy under most cases

Maybe I'm missing something; I don't understand why projections are part of this discussion. Perhaps I should have been more precise. I was thinking about the type of Table UDF that Aster Data made popular around the time DeWitt and Stonebraker wrote their article (Jan 2008). These Table UDFs were written in languages like Java or C/C++ and generally accessed data external to the database engine. Aster Data's marketing defined the functionality in terms of MapReduce.

The point I was trying to make was that the Map() part of MapReduce is equivalent to a distributed ETL pipeline. This remains one of the key use cases for Spark. The Reduce() part is no longer relevant in the new world of cheap and scalable column stores. DeWitt and Stonebraker's Teradata-like enterprise data warehouses suffered the same fate.
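
As a hedged PySpark sketch of that Map-as-ETL point (the paths and column names are placeholders, not a real pipeline): read raw input, transform each record, and load into a columnar format, with no Reduce anywhere.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()
    (spark.read.json("s3://raw/events/")           # Extract
          .withColumn("day", F.to_date("ts"))      # Transform
          .write.partitionBy("day")
          .parquet("s3://warehouse/events/"))      # Load into a column store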


When you say cloud-native data warehouses, do you mean things like Snowflake/Redshift/BigQuery or something else? As part of an org making the transition from Spark to these, I can definitely agree that these tools are better suited for practical data engineering at the medium-big-data scale (anything that isn't Google/Facebook).


I was thinking AWS Athena (Presto) for the data warehouse and AWS Glue (Spark) for ETL. Redshift has always had the feel of a column store appliance that runs side by side with your other IaaS resources. There is nothing particularly cloud-native about it other than the way it is provisioned and managed in the AWS web console. Amazon QuickSight seems like an excellent alternative to enterprise BI pivot tables like Tableau, Excel, PowerPivot, Business Objects, and Cognos. Amazon seems to be ahead of the competition (again) when it comes to ETL/DW/BI-as-a-Service, at least in terms of price/performance.

I don't know anything about Snowflake. SQL makes BigQuery and Hive easier to program than MapReduce/Pig, but I don't think of those technologies as data warehouses.

Column stores (compressed bitmap indexes batch-updated with an ETL-like process) make exceptional data warehouses. Row-oriented data warehouses all feel like anachronisms now.



