Hacker News
Ask HN: Does (or why does) anyone use MapReduce anymore?
106 points by bk146 5 months ago | hide | past | favorite | 67 comments
Excluding the Hadoop ecosystem, I see some references to MapReduce in other database and analysis tools (e.g., MatLab). My perception was that Spark completely superseded MapReduce. Are there just different implementations of MapReduce and the one that Hadoop implemented was replaced by Spark?



(2nd user & developer of Spark here.) It depends on what you're asking.

MapReduce the framework is proprietary to Google, and some pipelines are still running inside google.

MapReduce as a concept is very much in use. Hadoop was inspired by MapReduce. Spark was originally built around the primitives of MapReduce, and you can still see that in the description of its operations (exchange, collect). However, Spark and all the other modern frameworks realized that:

- users did not care about mapping and reducing; they wanted higher-level primitives (filtering, joins, ...)

- MapReduce was great for one-shot batch processing of data, but struggled to accommodate other very common use cases at scale (low latency, graph processing, streaming, distributed machine learning, ...). You can do these on top of MapReduce, but once you really start tuning for the specific case, you end up with something rather different. For example, Kafka (a scalable streaming engine) is inspired by the general principles of MR, but the use cases and APIs are now quite different.



There really was always only Map and Shuffle (Reduce is just Shuffle+Map; also another name for Shuffle is GroupByKey). And you see those primitives under the hood of most parallel systems.


Shuffle is interesting, I gotta read up on that. Maybe I've been hearing "reduce" for too long and have too strong a built-in visual sense of it, but shuffle does not seem like the right name at all; I picture randomizing some set N, where the input and output counts are the same.


Shuffle is an operation that converts "{k1, v1}, {k1, v2}, {k2, v3}" into "{k1, [v1, v2]}, {k2, [v3]}".


Reduce is useful for aggregate metrics.


My point is that Reduce is Shuffle+Map, without materializing the intermediate result (the result after Shuffle).
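A minimal sketch of these primitives in Python (the function names here are illustrative, not any framework's API):

```python
from collections import defaultdict

def shuffle(pairs):
    """Shuffle (a.k.a. GroupByKey): [(k, v), ...] -> {k: [v, ...]}."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(pairs, fn):
    """Reduce is just Shuffle followed by a Map over each group."""
    return {k: fn(vs) for k, vs in shuffle(pairs).items()}

pairs = [("k1", 1), ("k1", 2), ("k2", 3)]
print(shuffle(pairs))             # {'k1': [1, 2], 'k2': [3]}
print(reduce_by_key(pairs, sum))  # {'k1': 3, 'k2': 3}
```

A real system would avoid materializing the intermediate grouped result, but the data flow is the same.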


> For example, Kafka (a scalable streaming engine) is inspired by the general principles of MR, but the use cases and APIs are now quite different.

Are you confusing Kafka with something else? Kafka is a persistent, append-only queue.


The "streaming systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts at scaling by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).

As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.


This book looks interesting, should I buy it or does anyone else have newer recommendations? I have Designing Data-Intensive Applications which is a fantastic overview and still holds up well.


That was one of "the" books in the space prior to DDIA. In my opinion, Akidau mixes the logic for processing events with the stream-infrastructure implementation because he was writing from the context of his particular use cases. The time I spoke with him, it seemed his influence had shaped the design of Google's systems and GCP such that they didn't properly prioritize ordering/linearity/consistency requirements. At this point my copy is of historic interest to me.


It has the most interesting/conceptual/detailed discussion of the streaming system semantics (e.g. interplay of windows and stateful stream operations) I'm aware of to this day. At least as far as Manning/O'Reilly-level books go. So I'd put it on the same bookshelf as DDIA.

It's a little biased towards Beam and away from Spark/Flink though. Which makes it less practical and more conceptual. So as long as it's your cup of tea go for it.


Thank you, I'll check this out!


I feel like that’s kind of like saying we don’t use Assembly anymore now that we have C. We’ve just built higher level abstractions on top of it.


Yeah, that's exactly how I read the question, i.e. analogous to "does anyone still code in assembly, or has everyone switched to using abstractions?" and I think it's a very interesting one.


It's more like asking whether anyone uses goto.

Why use map/reduce when you can have an entire DAG for fan-out/fan-in?


At a high level, most distributed data systems look something like MapReduce, and that's really just fancy divide-and-conquer. It's hard to reason about, and most data at this size is tabular, so you're usually better off using something where you can write a SQL query and let the query engine do the low-level map-reduce work.


The concept is quite alive, and even the fancy deep-learning frameworks have it: jax.lax.map, jax.lax.reduce.

It's going to stay because it is useful:

Any operation that you can express with an associative combine is automatically parallelizable. And in both Spark and Torch/JAX this means scalable to a cluster, with the code going to the data. This is the unfair advantage when solving bigger problems.
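A sketch of why associativity is the key property, in plain Python (sequential here, but each chunk could just as well be reduced on a different machine):

```python
from functools import reduce

def chunked(xs, n):
    """Split a list into chunks of at most n elements."""
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def parallel_style_reduce(op, xs, chunk_size):
    """Because op is associative, each chunk can be reduced
    independently, and the partial results combined afterwards.
    Regrouping the operations does not change the answer."""
    partials = [reduce(op, chunk) for chunk in chunked(xs, chunk_size)]
    return reduce(op, partials)

data = list(range(1, 101))
print(parallel_style_reduce(lambda a, b: a + b, data, 7))  # 5050
```

The same regrouping trick fails for non-associative operations (e.g. subtraction), which is why frameworks insist on associative combiners.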

If you were asking about the Hadoop ecosystem, then yes, Spark pretty much nailed it and is dominant (no need for another implementation).


That's my understanding. MR is very simplistic, and many problems are awkward or impossible to express in it, whereas dataflow processors like Spark and Apache Beam support complex DAGs built from a rich set of operators for grouping, windowing, joining, etc. that you just don't have in MR. You can do MR within a DAG, so you could say that dataflows are a generalization or superset of the MR model.


> You can do MR within a DAG, so you could say that dataflows are a generalization or superset of the MR model.

I think it's the opposite of this. MapReduce is a very generic mechanism for splitting computation up so that it can be distributed. It would be possible to build Spark/Beam and all their higher level DAG components out of MapReduce operations.


I don't mean generalization that way. Dataflow operators can be expressed with MR as the underlying primitive, as you say. But MR itself, as described in the original paper at least, only has the two stages, map and reduce; it's not a dataflow system. And it turns out people want dataflow systems, not to hand-code MR and wire up the DAG manually.


I'm not sure what you describe is the opposite?

I mean, you can implement function calls (and other control flow operators like exceptions or loops) as GOTOs and conditional branches, and that's what your compiler does.

But that doesn't really mean it's useful to think of GOTOs being the generalisation.

Most of the time, it's just the opposite: you can think of a GOTO as a very specific kind of function call, a tail-call without any arguments. See eg https://www2.cs.sfu.ca/CourseCentral/383/havens/pubs/lambda-...


MapReduce was basically a very verbose, imperative way to perform scalable, larger-than-memory aggregate-by-key operations.

It was necessary as a first step, but as soon as we had better abstractions, everyone stopped using it directly, except for legacy maintenance of course.


The abstraction came first. MapReduce was quickly used as a basis for larger-than-machine SQL (Google Dremel and Hadoop Pig). MapReduce was separately useful when the processing pieces required a lot of custom code that doesn't fit well into SQL (because you have hierarchical records, not purely relational ones, for example).


Can you point, please, to the better abstractions?


SQL comes to mind.

Every time you run a SQL query on BigQuery, for example, you are executing those same fundamental map/shuffle primitives on the underlying data; it's just that the interface is very different.
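A toy illustration of that correspondence, with SQLite standing in for a distributed engine: the GROUP BY below is, conceptually, a map (emit a key/value per row), a shuffle (group rows by key), and a per-group reduce (SUM).

```python
import sqlite3

# In-memory database with a tiny events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 1), ("a", 2), ("b", 3)])

# The declarative query; the engine picks the map/shuffle/reduce plan.
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('a', 3), ('b', 3)]
```

The interface hides the primitives entirely, which is exactly the point being made.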


Now you rarely use basic MapReduce primitives; you have another layer of abstraction that runs on the same infrastructure that used to run MR jobs. That infrastructure can efficiently allocate compute resources for "long"-running tasks in a large cluster, respecting memory/CPU/network and other constraints. The schedulers of MapReduce jobs and the cluster-management tools became that good because the MR methodology had trivial abstractions but required an efficient implementation to work seamlessly.

Abstraction layers on top of this infrastructure can now optimize the pipeline as a whole, merging several steps into one where possible and adding combiners (a partial reduce before the shuffle). This requires the whole processing pipeline to be defined in terms of more specific operations. Some systems use SQL to formulate the task, but it can be done with other primitives. Given such a pipeline, it is easy to implement optimizations that make the whole system much more user-friendly and efficient than raw MapReduce, where the user has to think about all the optimizations and implement them inside each map/reduce/(combine) operation.


The correct language for querying data is, as always, SQL. No one cares about the implementation details.

“I have data and I know SQL. What is it about your database that makes retrieving it better?”

Any other paradigm is going to be a niche at best, likely outright fail.


Spark is really failing, all right.

SQL lacks type safety, testability, and composability.


It’s crazy to think how old I am now. But give it 20 more years and you’ll come around.


Agree/disagree. I wish (and maybe this is "the answer") that you could take a basic SQL query and "invert it" into composable components.

A common thing I ended up doing for some "small data" hack projects was extremely liberal usage of SQLite: SELECT ... UNION ( SELECT ... ... GROUP BY ... ( UNION ... etc ) ) ... absolutely terrible SQL, but it got the job done to return the 100 or so records I was interested in.

It'd be great if I could write me some SQL then pop it out into: fn_group001, fn_join(g1, g2, cols(c1, c2)), ...etc...

...and then have composable sub-components of what the janky SQL-COBOL syntax supports, but in a group().chain().join(...) style.

I think I keep running across Datalog as something that's recommended, and of course Prolog has some similarities.

Nothing has been compelling enough to warrant jumping off of SQL, but I really do agree with the grandparent comment: SQL (aka: COBOL) is pretty clunky and non-composable in a way that complicates what you'd think would be straightforward for interactive, non-programming usage.
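A toy sketch of the composable style the comment is wishing for (all function names hypothetical): small functions that each return a SQL fragment, so the nested UNION/GROUP BY spaghetti becomes reusable pieces.

```python
def select(cols, source):
    """Project columns out of any sub-query."""
    return f"SELECT {', '.join(cols)} FROM ({source})"

def group_by(query, cols, agg):
    """Wrap a sub-query in a grouped aggregation."""
    return (f"SELECT {', '.join(cols)}, {agg} FROM ({query}) "
            f"GROUP BY {', '.join(cols)}")

def union(q1, q2):
    """Combine two sub-queries."""
    return f"{q1} UNION {q2}"

base = "SELECT * FROM events"
q = group_by(select(["user", "amount"], base),
             ["user"], "SUM(amount) AS total")
print(q)
```

String pasting like this is exactly what real tools (query builders, dbt, LINQ) do more safely, but it shows the shape: each fragment is a testable value instead of a blob of SQL.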


20 years ago SQL lacked type safety, testability, and composability. Today the same is true. I doubt it will be different 2 decades from now.

SQL is powerful. It is also very old and has very large warts.


I think one of the biggest missed opportunities in language design is the integration of powerful relational database and query models directly into a modern language. Not as a bolt-on or an ORM but as a first class part of the language in the same way as maps and arrays. Make a language that deals with data relationally and where relational queries are a core part of the language and relational query execution is baked into its runtime.

If persistence hooks were also baked in then you'd have something a little bit like stored procedures in databases but far more powerful and with a modern syntax. Couple this with a distributed database layer supporting either eventual consistency built on CRDTs or synchronization via raft/paxos and you'd have an amazing application platform.

It's always seemed dumb to me that data, which is in the very center of everything we do, feels like a bolted-on second class citizen from the perspective of pretty much all programming languages and runtime environments. "Oh, you want to work with your data? Well we didn't think about that..." Accessing the data requires weird incantations and hacks that feel like you're entering a 1970s time warp into a PDP-11 mainframe.

Instead the language and runtime environment should be built around the data. Put the data in the center like Copernicus did with the sun.

Why has nobody done this? Has anyone even tried?


Data access is of primary concern to all programming languages, using either memory or disk. All files on disk can be considered a form of database. Reading and writing files is standard in all languages.

Once you start getting fancier in your files, and the data grows large, you need special ways to read it. A Postgres database can be considered a single big file on disk. It is the Postgres server that is required to access the file in the most efficient way to store and randomly access enormous amounts of general data.

SQLite is interesting in that there is no server, it's just a special library that enables efficient random access of a single file, which can be thought of as a black box that only SQLite knows how to interpret.

Unless you mean, making something like SQL built directly into the language as a first class citizen. Mumps did something like this https://en.wikipedia.org/wiki/MUMPS


Like language integrated queries in c#?

https://learn.microsoft.com/en-us/dotnet/csharp/linq/


All data query languages eventually reduce themselves into SQL, or something equivalent to it.


Ok? Even if that were true — and I’m not entirely sure how you would even prove that — SQL still lacks type safety, testability, and composability.


SQL has type safety. Its primary purpose is to have typed schemas. I can't imagine why you would say this, other than some pedantic reason. SQL implementations like Postgres will happily throw errors when your types are off.

Testability - you use a general purpose language to execute SQL. Again, I don't know what you mean.

Composability - I suppose, but remember SQL is a language to retrieve data. I reuse fragments everywhere in a general purpose language.


So use a better language, and let the compiler optimize it??


The query planner optimizes it. Why would you want a compiler to optimize SQL? The nature of your data affects how it is optimized! The declarative statement to retrieve data must be interpreted based upon the nature of that data. You can't pre-optimize without knowing something about your data, in which case, you are basically storing some of the information outside of the database.


I mean don't use SQL at all. Use a real programming language like Scala, and let Spark (or Flink etc.) do the translation and optimization: https://www.databricks.com/glossary/catalyst-optimizer

I don't understand why anyone would prefer SQL to that for anything beyond a simple SQL query. And it's not just my opinion: industry at large uses Spark for production with complex queries. SQL is for analysts.


Now I’m totally confused. SQL is a syntax for querying data. Spark SQL is SQL. You are talking about a different implementation of a server.

SQL is for analysts? Everyone uses SQL.


How familiar are you with Spark and the like? This is what it looks like:

https://spark.apache.org/examples.html

SQL is just a DSL; it is not the only or primary API for Spark, and there's nothing magical about it. If you ditch it you can get your type safety, composability, and testability back, like so:

https://medium.com/@sergey.kotlov/unit-testing-of-spark-appl...

See those case classes that neatly encapsulate business objects? Add to that functional transforms that concisely express typical operations like filtering, mapping, and so on, and you get something that is simply superior to SQL.
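The same typed, composable style can be sketched in plain Python (the classes and helpers here are illustrative, not Spark's API): each transform is a named, independently unit-testable function over typed records, instead of an inline clause in a query string.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    """A typed business object, analogous to a Scala case class."""
    user: str
    amount: float

def large_orders(orders, threshold):
    """A named, testable transform instead of an inline WHERE clause."""
    return [o for o in orders if o.amount >= threshold]

def total_by_user(orders):
    """A named, testable aggregation instead of an inline GROUP BY."""
    totals = {}
    for o in orders:
        totals[o.user] = totals.get(o.user, 0.0) + o.amount
    return totals

orders = [Order("a", 10.0), Order("b", 5.0), Order("a", 7.5)]
print(total_by_user(large_orders(orders, 6.0)))  # {'a': 17.5}
```

Typos in field names fail up front rather than at query time, and each stage can be tested in isolation, which is the composability/testability argument in a nutshell.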


Any ORM provides the same features for any SQL database. There is nothing special going on here. If perhaps the database autogenerated a bunch of classes, maybe that’s interesting? I think some projects have introspected a database and created all the boilerplate language classes before.

There is nothing magical about wrapping database objects in language classes. This has been happening forever.

https://docs.sqlalchemy.org/en/20/orm/quickstart.html#select...

Nothing magical about using a function call rather than raw SQL.


No, it does not. Furthermore, if you're using an ORM you're not programming in SQL any more. You're using a poor man's Spark. Spark lets you drop down to SQL in all its APIs too.

I don't understand your argument if you're comfortable with ORMs.


I’m not even sure what we are debating. Spark is for big data work, it isn’t something you would typically use as a general purpose database backing your application. My original point way back is that expecting customers to learn an entirely new paradigm for storing and querying data is a poor decision, and limits your work to niche use cases. Spark has a large niche but it isn’t comparable to MySQL / Postgres or other newish data stores like Dynamo.

Furthermore any competent engineer knows SQL because ORMs are cumbersome and annoying for anything except basic use.


Doubtful, given my decades of experience and expected retirement age.

Working with 1000+-line SQL scripts written by other people is no fun. Why wouldn't you want to decompose that into legible, testable functions using an expressive language like Scala?


Being old doesn’t automatically make you more right. You don’t get wisdom as a birthday gift.


But if you did a lot of things, you might know what does not work and why. Similar to science articles, nobody talks enough about "X technology does not work for Y domain", so a lot of people try to reinvent the wheel only to realize "X does not work for Y". Occasionally there is a surprise (because of some technology advancement), but being old definitely gives you more insight into what can go wrong.


On the other hand you often gain wisdom with experience, and often experience is proportional to age.


It does depend on experience, which everyone gets one way or the other.

I don't know about automatically, but definitely more likely.


At my current job I work with several people who have one to three years of experience repeated over the span of twenty. It might be less likely than some parts of HN would like it to be.


no, but you do get experience, as in, when last somebody “invented” this idea thirty years ago, it was a total crapshoot, so wonder what’s changed?

oh right, new language. that’ll definitely fix it. :eyeroll:


So you're still using COBOL, FORTRAN and BASIC, since new languages don't fix things?


I know I'm replying to a troll comment, but:

> “I have data and I know SQL. What is it about your database that makes retrieving it better?”

Because my data comes from a variety of unstructured, possibly dirty sources which need cleaning and transforming before they can be made sense of.


> Because my data comes from a variety of unstructured, possibly dirty sources which need cleaning and transforming before they can be made sense of.

Seattle data guy had a great end of year top 10 memes post recently and one of them went like this

> oh cool you’ve hired a data scientist. so you have a collection of reliable and easy to query data sources, right?

> …

> you do have a collection of reliable and easy to query data sources, right?

—-

Like, most of the time in businesses… if the data can’t be queried with SQL then it’s not ready to be used by the rest of the business. Whether that’s for dashboards, monitoring, downstream analytics or reporting. Data engineers do the dirty data cleaning. Data scientists do the actual science.

That’s what I took from the parent at least.

YMMV obviously depending on your domain. ML being a good example where things like end to end speech-to-text operates on wav files directly.


That's true. With dbt (= SQL + Jinja templating in an opinionated framework) a large SQL codebase actually becomes maintainable. If in any way possible I'll usually load my raw data into an OLAP store (Snowflake, BigQuery) and do all the transforms there. At least for JSON data that works really well. Combine it with dbt tests and you're safe.

See https://www.getdbt.com/


It's amazing that you think I'm trolling! The #1 way to get more customers of something as extreme as a new database is to use the tool that potential customers already know and have integrated into their systems. That's SQL. The same logic is for any new paradigm.

Ignore that statement, and fight the uphill battle.


You have no idea how long the tail of legacy MR-based daily stat aggregation workflows is in BigCorps.

The batch daily log processor jobs will last longer than Fortran. Longer than Cobol. Longer than earth itself.


> The batch daily log processor jobs will last longer than Fortran. Longer than Cobol.

Nonsense... They'll end at the same time. Which is approximately concurrently with the universe.


Or 2038.


It definitely played its role in highlighting what most functional-ish coders already knew: that map/filter/reduce is an awesome model for data processing pipelines.


Huge caveat to this: systems like Hadoop accomplish the parallelization by adding keys, and those FP constructs don't broadly have that. It's better to think of it as parallel divide-and-conquer.
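A sketch of what "adding keys" buys you (the partitioner here is illustrative): hashing the key routes every record for the same key to the same worker, which is what makes a distributed group-by possible at all.

```python
import zlib

def partition(key, num_workers):
    """Route a record to a worker by hashing its key. Records that
    share a key are guaranteed to land on the same worker, so each
    worker can reduce its groups without talking to the others.
    (crc32 is used for a hash that is stable across runs.)"""
    return zlib.crc32(key.encode()) % num_workers

pairs = [("a", 1), ("b", 2), ("a", 3)]
workers = {}
for k, v in pairs:
    workers.setdefault(partition(k, 4), []).append((k, v))

# Both ("a", 1) and ("a", 3) end up in the same bucket.
print(workers[partition("a", 4)])
```

A plain `functools.reduce` has no notion of keys, so it gives you a single sequential fold rather than this per-key parallelism.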


the idea of map reduce remains a good one.

there are a number of interesting innovations in streaming systems that followed, mostly around reducing latency, reducing batch size, and failure strategies.

even hadoop could be hard to debug when hitting a performance ceiling for challenging workloads. the streaming systems took this even further, spark being notorious for fiddling with knobs and praying the next job doesn't fail after a few hours, again.

i played around with the thinnest possible distributed data stack a while back[1][2]. i wanted to understand the performance ceiling for different workloads without all the impenetrable layers of software bureaucracy. turns out modern network and cpu are really fast when you stop adding random layers like lasagna.

i think the future of data, for serious workloads, is gonna be bespoke. the primitives are just too good now, and the tradeoff for understandability is often worth the cost.

1. https://github.com/nathants/s4

2. https://github.com/nathants/bsv


I feel like a lot of the underlying concepts of mapreduce live large in multi-threaded applications even on a single machine.

Its definitely not a dead concept, I guess its not sexy to talk about though.
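For instance, a single-machine parallel word count already has the map-reduce shape (a sketch using threads to stand in for cluster workers):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_chunk(lines):
    """Map phase: each worker counts words in its own chunk."""
    return Counter(w for line in lines for w in line.split())

# Two "partitions" of the input, as a cluster would split a file.
corpus = [["a b", "b c"], ["c c", "a"]]

# Map in parallel, then reduce by merging the partial counters.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_chunk, corpus))

total = Counter()
for p in partials:
    total.update(p)
print(total)
```

Swap the thread pool for machines and the counter merge for a shuffle, and you have the distributed version.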


From what I recall, MapReduce was basically a weird representation of a DAG-based pipeline.

I think because it looked sort of like an automatic dictionary-to-multi-thread converter it became popular. But it's pretty useless unless you know how to split up and process your data.

Basically, if you can cut your data up into a queue, you can MapReduce. But most pipelines are more complex than that, so you probably need a proper DAG with dependencies.


The paradigm is very much alive. Look at query_then_fetch in Elasticsearch.



