Ask HN: To everybody who uses MapReduce: what problems do you solve?
153 points by valevk on Nov 10, 2013 | 120 comments



A large telco has a 600 node cluster of powerful hardware. They barely use it. Moving Big Data around is hard. Managing is harder.

A lot of people fail to understand the overheads and limitations of this kind of architecture, or how hard it is to program, especially considering that salaries for this skill have skyrocketed. More often than not, a couple of large 1TB PCIe SSDs and a lot of RAM can handle your "big" data problem.

Before doing any Map/Reduce (or equivalent), please I beg you to check out Introduction to Data Science at Coursera https://www.coursera.org/course/datasci


I strongly agree. Although there are clearly uses for map/reduce at large scale, there is also a tendency to use it for small problems where the overhead is objectionable. At work I've taken multiple Mao/reduce systems and converted them to run on my desktop, in one case taking a job that used to take 5 minutes just to start up down to a few seconds total.

Right tool for the job and all that. If you need to process a 50PB input though, map/reduce is the way to go.


I completely agree as well, but I don't consider myself much of an expert in NoSQL technologies (which is why I read up on threads like this to find out).

Does anyone have a use case where data is on a single machine and map reduce is still relevant?

(I am involved in a project at work where the other guys seem to have enthusiastically jumped on MongoDB without great reasons in my opinion).


> Does anyone have a use case where data is on a single machine and map reduce is still relevant?

What matters is the running data structure. For example, you can have Petabytes of logs but you need a map/table of some kind to do aggregations/transformations. Or a sparse-matrix based model. There are types of problems that can partition the data structure and work in parallel in the RAM of many servers.

Related: it's extremely common to mistake for a CPU bottleneck what is actually a problem of TLB misses, cache misses, or bandwidth limits (RAM, disk, network). I'd rather invest time in improving the basic DS/algorithm than in porting to Map/Reduce.


OK, maybe I am not understanding you correctly, but what you describe seems to be: if the data is on one machine, connect to a cluster of machines and run the processing in parallel there.

That doesn't imply a NoSQL solution to me. Just parallel processing on different parts of the data. If I am wrong can you point me to a clearer example?


It sounds to me like the poster above restructured the input data to exploit locality of reference better.

http://en.wikipedia.org/wiki/Locality_of_reference


So assuming the data is on one machine (as I asked), why would an index not solve this problem? And why does MapReduce solve it?


Indexes do not solve the locality problem (see Non-clustered indexes). Even for in-memory databases, it is non-trivial to minimize cache misses in irregular data structures like B-trees.

Now, why MapReduce might be a better fit even for a problem where the data fits on one disk: consider a program which is embarrassingly parallel. It just reads a tuple and writes a new tuple back to disk. The parallel IO provided by map/reduce can offer a significant benefit even in this simple case.

Also NoSQL != parallel processing.


It's one of the issues, yes. But I wanted to be more general on purpose.


Maybe I misunderstood what you were asking.

Note both MapReduce and NoSQL are overhyped solutions. They are useful in a handful of cases, but they are often applied to problems they are not well suited for.


I'm not sure that the two concepts are related at all. Obviously Mongo has map reduce baked in - but that's not that relevant. Map/reduce is a reasonable paradigm for crunching information. I have a heavily CPU-bound system that I parallelise by running on different machines and aggregating the results. I probably wouldn't call it map reduce - but really it's the same thing.

How do you parallelise your long running tasks otherwise?


I can't say without more information on the problem to solve. As I said above, there are cases where MapReduce is a good tool.

And even if you improve the DS/algorithm first, usually that work carries over to the MapReduce port, and you save a lot of time and cost.


I completely agree with the parents about Map Reduce. However, I would justify using MongoDB for totally different reasons, not scalability. It is easy to set up, easy to manage and above all easy to use, which are all important factors if you are developing something new. However, it does have a few "less than perfect" solutions to some problems (removing data does not always free disk space, no support for big decimals, ...) and it definitely takes some getting used to. But it is a quite acceptable solution for "small data" too.

Edit: ...and I wouldn't dream of using MongoDB's implementation of MapReduce.


On modern hardware with many cpu cores you can use a similar process of fork and join to maximise throughput of large datasets.
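For instance, a minimal single-machine fork/join sketch in Python (the per-record work and the merge step are just placeholders):

    from multiprocessing import Pool

    def process_record(record):
        return len(record)            # placeholder per-record work ("map")

    def merge(partials):
        return sum(partials)          # placeholder aggregation ("join"/"reduce")

    if __name__ == "__main__":
        records = [line.strip() for line in open("input.txt")]
        with Pool() as pool:          # one worker per CPU core by default
            partials = pool.map(process_record, records, chunksize=10000)
        print(merge(partials))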


A single machine with many cores does not give the same performance as multiple machines. For example, consider disk throughput. If the data is striped across multiple nodes then read requests can be executed in parallel, resulting in linear speed-up! On a single machine you have issues of cache misses, inefficient scatter-gather operations in main memory, etc.

And it is much easier to let the MapReduce framework handle parallelism than to write error-prone code with locks/threads/MPI/architecture-dependent parallelism etc.


That sounds more like parallelisation rather than a use case for NoSQL.

Why is that any better than keeping all of your data on one database server and having each cluster node query for part of the data to process? Obviously there will be a bottleneck if all nodes try to access the database at the same time, but I see no benefit otherwise. And depending on the data organisation, I don't even see NoSQL solving that problem (you are going to have to separate the data across different servers for the NoSQL solution, so why is that any better than cached queries from a central server?).


Forks/threads on (e.g.) 12 core CPUs works up to a point. But that point probably does solve many problems without further complication :-)


"Does anyone have a use case where data is on a single machine and map reduce is still relevant?"

No! MapReduce is a programming pattern for massively parallelizing computational tasks. If you are doing it on one machine, you are not massively parallelizing your compute and you don't need MapReduce.


You can imagine cases where map-reduce is useful without any starting data. If you are analyzing combinations or permutations, you can create a massive amount of data in an intermediate step, even if the initial and final data sets are small.
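A toy sketch of that shape of problem (the numbers and the pair test are made up): the input and output are tiny, but the intermediate set of pairs is enormous, which is exactly the part you would shard across a cluster:

    from itertools import combinations

    items = range(10000)                                   # small input

    # "expand"/map phase: ~50 million candidate pairs (the intermediate blow-up)
    pairs = combinations(items, 2)

    # reduce phase: collapse back down to one small summary
    matching = sum(1 for a, b in pairs if (a * b) % 9973 == 0)
    print(matching)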


Have you got any links on how to do that? It sounds very like a problem I am trying to solve just now - combinations of DNA sequences that work together on a sequencing machine.

At the moment I am self-joining a table of the sequences to itself in MySQL, but after a certain number of self joins the table gets massive. Time to compute is more the problem than storage space, though, as I am only storing the ones that work (> 2 mismatches in the sequence). Would MapReduce help in this scenario?


If I had your problem, the first thing that I would do is try PostgreSQL to see if it does the joins fast enough. Second thing that I would try is to put the data in a SOLR db and translate the queries to a SOLR base query (q=) plus filter queries (fq=) on top.

Only if both of these fail to provide sufficient performance would I look at a map reduce solution based on the Hadoop ecosystem. Actually, I wouldn't necessarily use the Hadoop ecosystem. It has a lot of parts/layers, and the newer and generally better parts are not as well known, so it is a bit more leading edge than lots of folks like. I'd also look at something like Riak http://docs.basho.com/riak/latest/dev/using/mapreduce/ because then you have your data storage and clustering issues solved in a bulletproof way (unlike Mongo) but you can do MapReduce as well.
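To make the SOLR suggestion concrete, the query pattern would be something like this (the core name, field names and mismatch cutoff are made up for illustration):

    import requests

    resp = requests.get(
        "http://localhost:8983/solr/sequences/select",
        params={
            "q": "sequence:ACGT*",            # base query (made-up field)
            "fq": "mismatches:[3 TO *]",      # filter query: > 2 mismatches (made-up field)
            "wt": "json",
            "rows": 100,
        },
    )
    print(resp.json()["response"]["numFound"])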


> Mao/reduce systems

Well, that certainly sounds like ...

puts on sunglasses

... a Great Leap Forward.


"A lot of people fail to understand the overheads and limitations of this kind of architecture. Or how hard it is to program, especially considering salaries for this skyrocketed. More often than not a couple of large 1TB SSD PCIe and a lot of RAM can handle your "big" data problem."

It's not that hard to program... it does take a shift in how you attack problems.

If your data set fits on a few SSDs, then you probably don't have a real big data problem.

"Moving Big Data around is hard. Managing is harder."

Moving big data around is hard--that's why you have Hadoop--you send the compute to where the data is, which requires a new way of thinking about how you do computations.

"Before doing any Map/Reduce (or equivalent), please I beg you to check out Introduction to Data Science at Coursera https://www.coursera.org/course/datasci"

Data science does not solve the big data problem. Here's my favorite definition of a big data problem: "a big data problem is when the size of the data becomes part of the problem." You can't use traditional linear programming models to handle a true big data problem; you have to have some strategy to parallelize the compute. Hadoop is great for that.

"A large telco has a 600 node cluster of powerful hardware. They barely use it."

Sounds more like organizational issues, poor planning and execution than a criticism of Hadoop!


Consider using gocircuit.org.

It was expressly designed to avoid the non-essential technical problems that get in the way of cloud application developers when they try to orchestrate algorithms across multiple machines.

Identical HBase SQL queries and their counterparts implemented in the Circuit Language are approximately the same length in code, and the Circuit versions are orders of magnitude faster than HBase. In both cases the data is pulled out of the same HBase cluster, for a fair comparison.


What does Go Circuit do differently that makes it orders of magnitude faster than HBase SQL?


> Moving Big Data around is hard.

I never had any issues with Hadoop. It took about 2 days for me to familiarize myself with it and hack together an ad hoc script to do the staging and set up the local functions processing the data.

I really would like to understand what you consider "hard" about Hadoop or managing a cluster. It's a pretty straightforward idea, the architecture is dead simple, and it requires no specialized hardware at any level. Anyone who is familiar with the Linux CLI and running a dynamic website should be able to grok it easily, imho.

Then again, I come from the /. crowd, so YC isn't really my kind of people, generally.


> Then again, I come from the /. crowd, so YC isn't really my kind of people, generally.

You sound like a snob.


See what I mean?


Is this serious? Have you ported a program to Hadoop? Unless you use Pig or one of those helper layers it is quite hard for non-trivial problems. And those helper layers usually come with some overhead cost for non-trivial cases, too.

Edit: no downvote from me.


It was a pretty easy problem, parsing logs for performance statistics. But moving the data is the easy part and that's why I was incredulous of the OP's statement.

I'm starting to wonder if this is really "Hacker News" or if it's "we want free advice and comments from engineers on our startups, so let's start a forum with technical articles".


Big Data should be at the peta+ level. Even with 10G Ethernet it takes a lot of bandwidth and time to move things around (and it's very hard to keep 10G Ethernet full at a constant rate from storage). This is hard even for telcos. Note that terabyte-level data today fits on SSDs.


Not really, "Big Data" has nothing to do with how many bytes you're pushing around.

Some types of data analytics are CPU heavy and require distributed resources. Your comment about 10G isn't true. You can move around a TB every 10 minutes or so. SSDs or a medium-sized SAN could easily keep up with that bandwidth.

If your data isn't latency sensitive and runs in batches, building a Hadoop cluster is a great solution to a lot of problems.


Of course big data is about the number of bytes. That's what something like map reduce helps with. It depends on breaking down your input into smaller chunks, and the number of chunks is certainly related to the number of bytes.


Then again, I come from the /. crowd, so YC isn't really my kind of people, generally.

W T F does this even mean? I genuinely do not understand what point you are trying to make?

I have used Slashdot longer than you have (ok, possibly not.. but username registered in 1998 here...).

I find HN has generally much more experienced people on it, who understand more about the tradeoffs different solutions provide.

The old Slashdot hidden forums like wahiscool etc were good like this too, but I don't think they exist anymore do they?


We had been using Hadoop+Hive+MR to run targeting expressions over billions of time series events from users.

But we have recently moved a lot back to MySQL+TokuDB+SQL, which can compress the data well and keep it to just a few terabytes.

Seems we weren't big data enough and we were tired of the execution times, although Impala and FB's newly released Presto might also have fit.

Add: can the downvoters explain their problem with this data point?


This doesn't answer the question - you described /how/ you solved some problem, not what problem you're actually solving.

(Mind you, at my writing this is the top comment, so I don't think you're getting many downvotes. But your comment irked me, so there you go.)


> not what problem you're actually solving.

> targeting expressions over billions of time series events from users.


You moved back to MySQL.

What was the motivation to move to Map Reduce in the first place if a well understood technology, MySQL, works fine?

(sorry if I am posting a lot on this topic. I am really interested in finding answers rather than trying to prove any point that relational databases are better in case anyone thinks otherwise).


(Poster) we moved from mysql+innodb/myisam to hadoop because of performance problems. We did test and evaluate hadoop, and then jumped. Then tokudb comes along (technically we moved back to mariadb) and puts performance advantage firmly back on mysql's side. I imagine impala and presto and any other column based, non-map-reduce engines would give tokudb a fairer fight though.


Hearing what problems a technology failed to solve is usually even more interesting. It cuts the hype.


Sure - but that's not the original question. Maybe the OP is just curious to understand some in-production use cases (and the approach to their implementation), and isn't that interested in the edge cases where it doesn't work.


Maybe his case is an edge case.


right - but then he would've framed the question as "what are the cases where mapreduce really doesn't make sense".


> run targeting expressions over billions of time series events from users.


This system isn't in production just yet, but should be shortly. We're parsing Dota 2 replays and generating statistics and visualisation data from them, which can then be used by casters and analysts for tournaments, as well as by players. The replay file format breaks the game down into 1-minute chunks, which are the natural thing to iterate over.

Before someone comes along and says "this isn't big data!", I know. It's medium data at best. However, we are bound by CPU in a big way, so between throwing more cores at the problem and rewriting everything we can in C, we think we can reduce processing times to an acceptable point (currently about ~4 mins, hoping to hit <30s).


This sounds like something you could just do in SQL and have it all done in milliseconds.


For our highly unstructured, untyped, non-relational artefacts? It'd be nearly impossible to use it as a datastore in this particular case, and regardless, it would provide little to no speed increase over our current application, as the limiting factor is the CPU cost of the map function.


Maybe you should consider transforming the data to structure it a bit. Have a unique identifier per object and arrays for each observed characteristic with (value, id), plus its reverse index. Then decompose the processing of each object into sub-problems matching the characteristics n-way. It doesn't have to be in SQL, though. I'd try a columnar RDBMS. YMMV


I have extremely strong doubts about that. Everything can be modelled in modern databases, and SQL is probably much more powerful than you understand here.


Most of the NoSQL cases I have heard seem to be that they could be done at least as well in SQL.

I have asked a lot of questions on this topic, and no one has yet convinced me (please do if you have a legitimate NoSQL case).


Execution time, after a point, is less important than development time. NoSQL is often faster for development and refactoring, because the schema is easier to change in code than in the database.

Also there's the scale issue. Append-only is helpful. Why have SQL if you can't use all the features?


Wait, it takes more than a few minutes just to spin up a Hadoop MapReduce job. Are you using MapReduce for preprocessing, or did you optimize Hadoop for this scenario?


Are you talking about match statistics or player stats? Dota 2 has a logging system that records various events during a match.


Match stats atm, though we can get pretty much everything. We can go far beyond the combat log, reproducing an entire minimap, along with things like hero position trails, map control, etc. Here's a screenshot of a minimap recreation we produced for the recent Fragbite Masters tournament: http://i.imgur.com/kuOvUZX.jpg. We have more complex tools such as XP and gold source breakdowns over time (i.e. jungle creeps, lane creeps, kills, assists etc), along with other fun stuff still in the pipeline.


Replays contain more detailed data (I believe) like player positioning and actions. I think logging is pretty limited.


Which service is this for? It sounds very interesting.


We use Elastic MapReduce at Snowplow to validate and enrich raw user events (collected in S3 from web, Lua, Arduino etc clients) into "full fat" schema'ed Snowplow events containing geo-IP information, referer attribution etc. We then load those events from S3 into Redshift and Postgres.

So we are solving the problem of processing raw user behavioural data at scale using MapReduce.

All of our MapReduce code is written in Scalding, which is a Scala DSL on top of Cascading, which is an ETL/query framework for Hadoop. You can check out our MapReduce code here:

https://github.com/snowplow/snowplow/tree/master/3-enrich/ha...


Thanks for this. Snowplow has been an amazing source of learning. I'm more interested in the ETL process than in the actual MapReduce.

Have you seen your ETL used to pull data from Twitter or Facebook? I am wondering what the state of the art is there, considering throttling, etc.


Hi sandGorgon! Thanks for the encouraging words. We haven't yet seen people use the existing Scalding ETL to pull data from Twitter or Facebook. As you suggest, there are some considerations around using Hadoop to access web APIs without getting throttled/banned. Here's a couple of links which might be helpful:

- http://stackoverflow.com/questions/6206105/running-web-fetch...

- http://petewarden.com/2011/05/02/using-hadoop-with-external-...

I think a gatekeeper service could make sense; or alternatively you could write something which runs prior to your MapReduce job and e.g. just loads your results into HDFS/HBase, for the MapReduce job to then look up against. Akka or maybe Storm could be choices here.

We have done a small prototype project to pull data out of Twitter & Facebook - that was only a Python/RDS pilot, but it gave us some ideas for how a proper social feed into Snowplow could work.


is your python code part of the snowplow repository/gist?

it would be very interesting to take a look at it.


It's worth noting that CouchDB is using map-reduce to define materialized views. Whereas normally MR parallelization is used to scale out, in this case it's used instead to allow incremental updates for the materialized views, which is to say incremental updates for arbitrarily defined indexes! By contrast SQL databases allow incremental updates only for indexes whose definition is well understood by the database engine. I found this to be pretty clever.
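For anyone who hasn't seen it: a view is just a map function plus an (often built-in) reduce stored in a design document, and CouchDB keeps the resulting index up to date incrementally as documents change. A rough sketch of defining and querying one from Python (the database name, field names and URL are made up):

    import requests

    design_doc = {
        "_id": "_design/stats",
        "language": "javascript",
        "views": {
            "count_by_type": {
                "map": "function (doc) { if (doc.type) emit(doc.type, 1); }",
                "reduce": "_sum",   # built-in reduce; CouchDB maintains it incrementally
            }
        },
    }
    requests.put("http://localhost:5984/mydb/_design/stats", json=design_doc)

    # Query the materialized view; results stay current as documents are added/changed.
    r = requests.get("http://localhost:5984/mydb/_design/stats/_view/count_by_type",
                     params={"group": "true"})
    print(r.json()["rows"])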


I've been using CouchDB (and now BigCouch) for about four years and it's both clever and useful. We're storing engineering documents (as attachments) and using map/reduce (CouchDB views) to segment the documents by the metadata stored in the fields. The only downside is that adding a view with trillions of rows can take quite a while.


Strangely, there aren't many discussions of the Couch* family on Hacker News. Do you know why that would be?

I'm thinking about basing a new product around couchbase lite, but lack of popular acceptance is one of the things holding me back.


I'd be happy to take the discussion off-line if you want to get into more depth but I have to wonder if the merger of Memcached and CouchDB happened right as CouchDB would have made a name for itself. Right around that time, I think CouchDB and Mongo were equally well-known.


I'll tell you where it's not used: High Energy Physics. We use a workflow engine/scheduler to run jobs over a few thousand nodes at several different locations/batch systems in the world.

If processing latencies don't matter much, it's an easier, more flexible system to use.


At least on the experimental LHC side, we process/analyse each event independently from every other event, so it's an embarrassingly parallel workload. All we do is split our input dataset up into N files, run N jobs, combine the N outputs.

Because we have so much data (on the order of 25+ PB of raw data per year; it actually balloons to much more than this due to copies in many slightly different formats) and so many users (several thousand physicists on LHC experiments), we have hundreds of GRID sites across the world. The scheduler sends your jobs to sites where the data is located. The output can then be transferred back via various academic/research internet networks.

HEP also tends to invent many of its own 'large-scale computing' solutions. For example most sites tend to use Condor[1] as the batch system, dcache[2] as the distributed storage system, XRootD[3] as the file access protocol, GridFTP[4] as the file transfer protocol. I know there are some sites that use Lustre but it's pretty uncommon.

[1] http://research.cs.wisc.edu/htcondor/ [2] http://www.dcache.org/ [3] http://xrootd.slac.stanford.edu/ [4] http://www.globus.org/toolkit/docs/latest-stable/gridftp/


I remember when HEP invented that WWW technology stuff which turned out to be rather popular outside of HEP as well.


It is built into QDP++, which is the low-level code underlying lattice QCD code bases like Chroma and MILC, though!

Not for gauge configuration generation, which has to be done sequentially, but inversions and measurements can be mapped / reduced.


Yes, people in sciences write distributed code (e.g., using MPI) and employ job queueing systems. I think they work well for embarrassingly parallel tasks. But what if you need to do a group-by operation? MapReduce makes these operations easy. Also the Hadoop system takes care of failing nodes, data partitioning, communication, etc.

I do hope physics people consider MR systems in addition to the existing ones they use.


MPI has a reduce operator. On some supercomputers, the operator is implemented in an ASIC.

It is quite easy to implement map/reduce-style programs in MPI. It is quite difficult or impossible to implement in Hadoop the parallel algorithms that win the Gordon Bell prize, almost all of which are implemented in MPI.
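For what it's worth, a map/reduce-style program in MPI really is short. A sketch with mpi4py (the local work is a placeholder):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # "map": each rank processes its own shard of the work (placeholder computation)
    local_total = sum(i * i for i in range(rank, 10000000, size))

    # "reduce": MPI_Reduce combines the partial results onto rank 0
    total = comm.reduce(local_total, op=MPI.SUM, root=0)
    if rank == 0:
        print(total)

    # run with e.g.:  mpirun -n 8 python sum_squares.py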


MPI can't schedule to optimize IO/data shuffling like MapReduce; this is the price of being more general. The biggest competition to MPI in the HPC space is actually CUDA and GPUs. In either case, the data pipeline is exposed to the compiler or runtime for optimization, whereas with MPI everything is just messaging.


This is a system in early development, but my research group is planning on using MapReduce for each iteration of an MCMC algorithm to infer latent characteristics from 70TB of astronomical images. Far too much to store on one node. We're planning on using something like PySpark as the MapReduce framework.
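Roughly, one iteration might look like this in PySpark (the manifest path and the statistics/combine functions are placeholders, not the real model code):

    from pyspark import SparkContext

    def local_statistics(image_path, params):
        # load one image chunk and compute its contribution to the likelihood
        return 0.0                              # placeholder

    def combine(a, b):
        return a + b                            # placeholder aggregation

    sc = SparkContext(appName="mcmc-iteration")
    image_paths = sc.textFile("hdfs:///images/manifest.txt")   # one chunk path per line

    params = {"theta": 0.5}                     # current MCMC state
    stats = image_paths.map(lambda p: local_statistics(p, params)).reduce(combine)
    # use `stats` to accept/reject the proposal, then repeat with updated params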


How strongly does a RNA binding protein bind to each possible sequence of RNA?


How big is your data?

Yours is the first example where I have decent knowledge of the field, so I can understand the needs accurately. In most cases I see people using NoSQL in places where MySQL could handle it easily.

Maybe I am just too set in my relational database ways of thinking (having used them for 13 years), but there are few cases where I see NoSQL solutions being beneficial, other than for ease of use (most people are not running Facebook or Google).

I am a bit sceptical in most cases (though I would certainly like to know where they are appropriate).


First, I should note that my needs are fairly specific, and not typical of the rest of the NGS world. The datasets are essentially the same though.

The rate at which we are acquiring new data has been accelerating, but each of our Illumina datasets is only 30GB or so. The total accumulated data is still just a few TB. The real imperative for using MR is more about the processing of that data. Integrating HMMER, for instance, into Postgres wouldn't be impossible, but I don't know of anything that's available now.

Edit: A FDW for PostgreSQL around HMMER just made my to do list.


So is it fair to say it is an "ease of use" use case?


Is that the same as an 'impossible to do otherwise' case?

Edit: I should say 'currently impossible' since as I noted, I can imagine being able to build SQL queries around PSSM comparisons and the like. I just can't build a system to last 5+ years around something that might be available at some point.

Since I can't reply directly- agreed :)


That comment was based on your "into Postgres wouldn't be impossible" phrase.

No, fair enough, if it's the only way you can get things to work. I still see lots of people jumping on the "big data" bandwagon with very moderately sized data.


Interesting, I'm using it for protein localization prediction, PTMs, and NGS data.


NGS phage display data here.


A host of batch jobs that source data from up to 28 different systems and then apply business rules to extract a substrate of useful data.


We use MR via Pig (data in Cassandra/CFS) with a 6-node Hadoop cluster to process time-series data. The events contain user metrics like which view was tapped, user behavior, search results, clicks etc.

We process these events for downstream use in our search relevancy, internal metrics, and top-products reporting.

We did this on MySQL for a long time but things went really slow. We could have optimized MySQL for performance, but Cassandra was an easier way to go about it and it works for us for now.


Can you estimate how much faster your processing is now vs. before on MySQL? I find it interesting that your cluster is only 6 nodes - relatively small compared to what I've seen and from what I've read about. It'd be interesting to know the benefits of small-scale usage of big data tech.


Did you try a columnar database first? Vertica, InfiniDB, MonetDB... There are many.


"How do I make all my projects take 10x longer"


While the push-back against Map/Reduce & "Big Data" in general is totally valid, it's important to put it into context.

In 2007-2010, when Hadoop first started to gain momentum it was very useful because disk sizes were smaller, 64 bit machines weren't ubiquitous and (perhaps most importantly) SSDs were insanely expensive for anything more than tiny amounts of data.

That meant if you had more than a couple of terabytes of data you either invested in a SAN, or you started looking at ways to split your data across multiple machines.

HDFS grew out of those constraints, and once you have data distributed like that, with each machine having a decently powerful CPU as well, Map/Reduce is a sensible way of dealing with it.


It still is a valid pattern. The bigger disks get, the harder it is to move the data and the more efficient it is to move the compute to the data.


Generating web traffic summaries from nginx logs for a CDN with 150 servers, 10-15 billion hits/day. Summaries then stored in MySQL/TokuDB.


MapReduce is great for ETL problems where there is a large mass of data and you want to filter and summarize it.


Yup. This. Once I filter down to the data I actually care about, I typically find I'm no longer anywhere near "Big Data" size.


I came here to say the same thing. We use it to rip through anywhere from ~100M to 3B rows of data every day. Updates to this dataset are time sensitive, so we use it to finish the job as quickly as possible.


MongoDB: where you need Map/Reduce to do any aggregation at all.


You may consider using Mongo's aggregation framework[1] instead, which is a zillion times faster[2].

[1] http://docs.mongodb.org/manual/aggregation/

[2] http://stackoverflow.com/questions/13908438/is-mongodb-aggre...
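For a simple group-by style summary, the pipeline looks something like this from Python (the database, collection and field names are made up):

    from pymongo import MongoClient

    events = MongoClient()["analytics"]["events"]   # made-up db/collection

    pipeline = [
        {"$match": {"type": "click"}},
        {"$group": {"_id": "$page", "clicks": {"$sum": 1}}},
        {"$sort": {"clicks": -1}},
    ]
    for row in events.aggregate(pipeline):
        print(row["_id"], row["clicks"])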


A lot of people in this thread are saying that most data is not big enough for MapReduce. I use Hadoop on a single node for ~20GB of data because it is an excellent utility for sorting and grouping data, not because of its size.

What should I be using instead?


Obviously you should throw out the solution that worked for you and start over. 20GB just isn't cool enough to use M/R.


To most people who use MapReduce in a cluster: You probably don't need to use MapReduce. You are either vastly overstating the amount of data you are dealing with and the complexity of what you need to do with that data, or you are vastly understating the amount of computational power a single node actually has. Either way, see how fast you can do it on a single machine before trying to run it on a cluster.


We use our own highly customized fork of Hadoop to generate traffic graphs [1] and demographic information for hundreds of thousands of sites from petabytes of data, as well as building predictive models that power targeted display advertising.

[1]: https://www.quantcast.com/tumblr.com


We are using Map/Reduce to analyze raw XML as well as event activity streams, for example analyzing a collection of events and metadata to understand how discrete events relate to each other as well as patterns leading to certain outcomes. I am primarily using Ruby+Wukong via the Hadoop Streaming interface, as well as Hive to analyze output and for more normalized data problems.

The company is a large Fortune 500 P&C insurer and has a small (30 node) Cloudera 4 based cluster in heavy use by different R&D, analytic and technology groups within the company. Those other groups use a variety of toolsets in the environment, I know of Python, R, Java, Pig, Hive, Ruby in use as well as more traditional tools on the periphery in the BI and R&D spaces such as Microstrategy, Ab Initio, SAS, etc.


I was using it on a D3.js chart to aggregate data flowing through our custom real-time analytics pipeline.


Would you mind sharing some more info on this real-time analytics using D3? I am also planning to do something like that.


look up crossfilter.js


Validating performance testing simulations. Tie the inputs of the load generators to the outputs from the application server logs and verify the system works as designed at scale.


Taking dumps of analytics logs and pulling out relevant info for our customers on app usage


This is the `grep/awk` use case. The nice thing about the streaming MR interface to Hadoop (calling external programs) is that you can literally take your grep/awk workflow and move it to the cluster. Retaining line-oriented records is a huge step toward having a portable data processing workflow.
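A sketch of what that looks like: the mapper and reducer are just line-oriented scripts reading stdin, and Hadoop streaming runs them across the cluster (the log format and field position here are made up, and the streaming jar invocation depends on your install):

    #!/usr/bin/env python
    # grep_count.py -- a grep/awk-style log summary as a streaming mapper + reducer.
    # Invoked via Hadoop streaming as  -mapper "grep_count.py map"
    # and  -reducer "grep_count.py reduce".
    import sys

    def mapper():
        for line in sys.stdin:
            if "ERROR" in line:                  # the grep part
                fields = line.split()
                print("%s\t1" % fields[3])       # the awk '{print $4}' part (made-up field)

    def reducer():
        current, count = None, 0
        for line in sys.stdin:                   # streaming sorts by key between phases
            key, value = line.rstrip("\n").split("\t")
            if key != current:
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = key, 0
            count += int(value)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()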


Obviously, in today's theme, Facebook is not using mapreduce effectively to figure out who of our "friends" we actually care about :)


Could you be any more edgy?



"Things that could be trivially done in SQL :(" We use HIVE over HDFS. Sure the type of things we are doing could have been done in SQL, if we had carefully curated our data set and chosen which parts to keep and which to throw away. Unfortunately, we are greedy in the storage phase. Hive allows us to save much more than we reasonably would need, which actually is great when we need to get it long after the insert was made.


Airbnb uses distributed computing to analyze the huge amount of data that it generates every day[1]. The results are used for all sorts of things: assessing how Airbnb affects the local economy (for government relations), optimizing user growth, etc.

[1] http://nerds.airbnb.com/distributed-computing-at-airbnb/


Billions of metric values.


Image restoration, OCR, face detection, full text indexing. Mostly just a parallel job scheduler.


It would be nice if someone collected/counted the actual answers. I read the whole thread and "analyzing server logs" was the only answer. If you don't have funding or have dreams/plans, you're not really _using_ the technology.


Just FYI: even small problems can be solved using the concept of MapReduce. It is a concept, not an implementation tied to Big Data and NoSQL. A simple example would be merge sort, which uses the concept of MR to sort data.
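A toy illustration on a single machine: "map" sorts each chunk independently, "reduce" merges the sorted chunks:

    from functools import reduce
    from heapq import merge

    def chunks(seq, n):
        for i in range(0, len(seq), n):
            yield seq[i:i + n]

    data = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]

    sorted_chunks = map(sorted, chunks(data, 3))                    # "map": sort each chunk
    result = reduce(lambda a, b: list(merge(a, b)), sorted_chunks)  # "reduce": merge them
    print(result)   # [0, 1, 2, ..., 9]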


Things that could be trivially done in SQL :(


Process a huge amount of Spanish text in paper surveys.


"How can I make Google shareholders richer?"


Data aggregation to calculate analytics.


Billions


How can I make myself feel better about what I do by trying to diminish the work other people do using the same technologies?


How much can I save in my mobile phone and plan?



