Command-line tools can be faster than your Hadoop cluster (aadrake.com)
812 points by wglb on Jan 18, 2015 | 309 comments



I'm becoming a stronger and stronger advocate of teaching command-line interfaces even to programmers at the novice level...it's easier in many ways to think of how data is being worked on by "filters" and "pipes"...and more importantly, every time you try a step, something happens...making it much easier to iterate interactively through a process.

That it also happens to be very fast and powerful (when memory isn't a limiting factor) is nice icing on the cake. I moved over to doing much more on the CLI after realizing that doing something as simple as "head -n 1 massive.csv" to inspect the headers of corrupt multi-GB CSV files made my data-munging life substantially more enjoyable than opening them up in Sublime Text.


A few years ago between projects, my coworkers cooked up some satirical amazing Web 2.0 data science tools. They used git, did a screencast and distributed it internally.

It was basically a few compiled perl scripts and some obfuscated shell scripts with a layer of glitz. People actually used it and LOVED it... It was supposedly better than the real tools some groups were using.

It was one of the more epic work trolls I've ever seen!


Maybe I'm misreading you, but it sounds like you're saying "my coworkers made something with a really great UI and people loved it!"


Your CSV peeking epiphany was in essence a matter of code vs. tools though rather than necessarily CLI vs. GUI. On Windows you might just as well have discovered you could fire up Linqpad and enter File.ReadLines("massive.csv").First() for example.


Running a shell in a GUI doesn't make it lose its "I am a CLI" property. That is a CLI.


I disagree. It's a REPL, but a REPL is not always a CLI.

(Frankly, most REPLs are smarter than shells. I go to irb way more than I do bash, these days.)


Or just use vim or any other editor smart enough not to try to slurp the whole file in one go.


Actually, mmapping the file should Just Work (tm)?


Do you not see the horrific syntax of what you just suggested as simple?


It's pretty clear what it does. It's also C#, so building up to a less trivial task will be much less horrific than

find . -type f -name '*.pgn' -print0 | xargs -0 -n4 -P4 mawk '/Result/ { split($0, a, "-"); res = substr(a[1], length(a[1]), 1); if (res == 1) white++; if (res == 0) black++; if (res == 2) draw++ } END { print white+black+draw, white, black, draw }' | mawk '{games += $1; white += $2; black += $3; draw += $4; } END { print games, white, black, draw }'


In a real production environment that command line would be put into a script parametrized with named variables and the embedded awk scripts would be changed to here-docs.
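
For instance, something like the following rough sketch, where the script name, parameter names and defaults are all made up, and a quoted variable stands in for the here-doc to keep it short:

    #!/bin/sh
    # count_results.sh DIR [JOBS] -- a parametrized version of the article's one-liner
    dir=${1:?usage: count_results.sh DIR [JOBS]}
    jobs=${2:-4}

    # per-file tally: total games, white wins, black wins, draws
    count_one='
    /Result/ {
        split($0, a, "-")
        res = substr(a[1], length(a[1]), 1)
        if (res == 1) white++
        if (res == 0) black++
        if (res == 2) draw++
    }
    END { print white+black+draw, white, black, draw }
    '

    find "$dir" -type f -name '*.pgn' -print0 |
        xargs -0 -n4 -P"$jobs" mawk "$count_one" |
        mawk '{ games += $1; white += $2; black += $3; draw += $4 }
              END { print games, white, black, draw }'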


Sounds good although at that point it's just programming, and there are tools that are cleaner and faster and more robust than piping semi-structured strings around from a command line.

The one real benefit that can be argued is ubiquity (on *ix). Not every system has Perl, Python, or Ruby installed - or Hadoop for that matter - but there's usually a programmable shell and some variant of the standard utilities that will get something done in a pinch. If it happens to be 200x faster than some enormous framework, so much the better.


Are you arguing that shell scripts scale to larger applications better than C#?


The example was a multi-gigabyte CSV file. You just sucked the whole thing off the disk into RAM so that you could shave off the first line.

If you're unlucky, you started swapping out to disk about halfway through.


The code you're replying to was carefully and correctly written. You replied as if you knew how it works, just so you could look like you know what you're talking about.

If you're unlucky, someone who actually knows how File.ReadLines() works will show up in an hour or two and explain that it's lazily evaluated.


:) touche


Wrong. ReadLines returns an IEnumerable<string> and lets you read line by line without loading the entire file into memory: http://msdn.microsoft.com/en-us/library/dd383503%28v=vs.110%....


Perhaps I'm missing something. It appears that the author is recommending against using Hadoop (and related tools) for processing 3.5GB of data. Who in the world thought that would be a good idea to begin with?

The underlying problem here isn't unique to Hadoop. People who are minimally familiar with how technology works and who are very much into BuzzWords™ will always throw around the wrong tool for the job so they can sound intelligent with a certain segment of the population.

That said, I like seeing how people put together their own CLI-based processing pipelines.


I've used hadoop at petabyte scale (2+ PB input; 10+ PB sorted for the job) for machine learning tasks. If you have such a thing on your resume, you will be inundated with employers who have "big data", and at least half will be under 50GB, with a good chunk of those under 10GB. You'll also see multiple (shitty) 16-machine clusters, any of which -- for any task -- could be destroyed by code running on a single decent server with SSDs. Let alone hadoop jobs running in EMR, which is glacially slow (slow disk, slow network, slow everything.)

Also, hadoop is so painfully slow to develop in that it's practically a full employment act for software engineers. I imagine it's similar to early EJB coding.


> Also, hadoop is so painfully slow to develop in that it's practically a full employment act for software engineers.

It's comical how bad Hadoop is compared even to the CM Lisp described in Daniel Hillis' PhD dissertation. How do you devolve all the way from that down to "It's like map/reduce. You get one map and one reduce!"


Programming is very faddish. It's amazing how bad commonly used technologies are. I'm so happy I'm mostly a native developer and don't have to use the shitty web stack and its shitty replacements.


What really puzzles me is that Doug Cutting worked at Xerox PARC and Mike Cafarella has two (!) CS Masters degrees, a PhD degree, and is a professor at the University of Michigan. It's not like they were unaware of the previous work in the field (Connection Machine languages, Paralations, NESL).


It sounds a little bit like BizTalk :-)


Imho, Hadoop is only for 100TB or more. Anything less can be easily handled by other tools.


Exactly this just happened where I work. The CIO was recommending Hadoop on AWS for our image processing/analysis jobs. We process a single set of images at a time which come in around ~1.5GB. The output data size is about 1.2GB. Not a good candidate for Hadoop but, you know... "big data", right?


Indeed. For more real-world examples of people who thought that they had "big data", but didn't, see https://news.ycombinator.com/item?id=6398650 ("Don't use Hadoop - your data isn't that big"). The linked essay has:

> They handed me a flash drive with all 600MB of their data on it (not a sample, everything). For reasons I can't understand, they were unhappy when my solution involved pandas.read_csv rather than Hadoop.

User w_t_payne commented:

> I have worked for at least 3 different employers that claimed to be using "Big Data". Only one of them was really telling the truth.

> All of them wanted to feel like they were doing something special.

I think that last line is critical to understanding why a CIO might feel this way.


"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." -Dan Ariely


In a similar situation. In fact stupider. We have a 120GB baseline of data inside a relational store. The vendor has a file stream option that allows blobs to be stored on disk instead of the transaction log and be pushed through the DB server rather than using our current CIFS DFS cluster. So let's stick our 950GB static document load in there too (while I was on holiday, typically) and off we go.

So the thing starts going like shit, as expected, now that it has to stream all of that through the DB bottleneck.

So what's the solution? Well we're in big data territory now apparently at 1.2TiB (comedically small data and almost entirely static data set) and have every vendor licking arse with the CEO and CTO to sell us Hadoop, more DB features and SAN kit.

We don't even need it for processing. Just a big CRUD system. Total Muppets.


Resume-Driven Development


Being in charge of large budgets is better for CEOs and CTOs than being in charge of small budgets. It makes you look more impressive so they go for it like lemmings over the cliff, sometimes taking entire companies with them.


No, it's called enterprise software, and it's driven by "appeal to the enormous egos and matching budgets of leaders of companies that are held back more by their own bloat than by anything resembling competition." This is just standard practice for the enterprise software sector in general, and after everyone's spent their billions and the next recession is upon us, they'll probably axe most of these utter failures to produce business value. But all the consultants will just blame someone other than the client being dysfunctional because that's just the fastest way to get kicked off a contract.

Consulting for enterprise customers tends to be a lot like marriages - you can be right, or you can be happy (or paid). It takes a unique customer to have gotten past their cultural dysfunctions to accept responsibilities for their problems and to take legitimate, serious action. But like marriage, there can be great, great upsides when everyone gets on the same page and works towards mutual goals with the spirit of selflessness and growth. Yeah....


I have had this exact conversation at various past employers, usually when they started talking about "big data" and hadoop/friends:

"How much data do you expect to have?"

"We don't know, but we want it to scale up to be able to cover the whole market."

"Okay, so let's make some massive overestimates about the size of the market and scope of the problem... and that works out to about 100Mb/sec. That's about the speed at which you can write data to two hard drives. This is small data even in the most absurdly extreme scaling that I can think of. Use postgres."

Even experienced people do not have meaningful intuitions about what things are big or small on modern hardware. Always work out the actual numbers. If you don't know what they are then work out an upper bound. Write all these numbers down and compare them to your measured growth rates. Make your plans based on data. Anything that you've read about in the news is rare or it wouldn't be news, so it is unlikely to be relevant to your problem space.
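
For example, a 30-second upper-bound estimate with made-up numbers:

    # say 50M events/day at ~1KB each (hypothetical figures)
    echo $(( 50000000 * 1024 / 86400 ))   # ~592,000 bytes/sec, i.e. well under 1MB/s sustained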


That's not even medium data. Most people probably would be surprised to find out that their data could be stored and processed on an iPhone, and that using heavier duty tools isn't necessary or worthwhile.


Perhaps a good way of explaining it to management is, "If it fits on a smartphone SD card, it's tiny data. If it fits on a laptop hard drive, it's maybe medium data." I think at that point many of these conversations would end.


If the data can fit on a thumb drive it's not big data.


I think I read this somewhere here a few months ago (paraphrasing, obviously): "When the indices for your DB don't fit into a single machines RAM, then you're dealing with Big Data, not before."


And following up: Your laptop does not count as a "single machine" for purposes of RAM size. If you can fit the index of your DB in memory on anything you can get through EC2, it's still not Big Data.


There's still a ~25x difference between the biggest EC2 instance and a maxed-out Dell server (244GB for EC2 vs 6TB for an R920). Not to mention non-PC hardware like SPARC, POWER and SGI UV systems that fit even more.


This is true, but at the upper end the "it isn't Big Data if it fits in a single system's memory" rule starts to get fuzzy. If you're using an SGI UV 2000 with 64 TB of memory to do your processing, I'm not going to argue with you about using the words "Big Data". ;-) I figured using an EC2 instance was a decent compromise.


Would it be fair to approximate it as, "if you can lift it, it's not big data"?


If a single file of the data can fit on the biggest disk commonly available, it's not big data.


Another explanation is that your CIO is not an idiot but rather they know about future projects that you don't. CIOs want to build capabilities (skills and technologies) not just one off implementations every time.

Not saying this is the case but CIO bashing is all too easy when you're an engineer.


A good CIO would know that leaving out key parts of the project is unlikely to produce good results. Even if the details aren't final, a simple “… and we probably need to scale this up considerably by next year” would be useful when weighing tradeoffs


The CIO is probably better at playing corporate politics than a grunt. (That's why they are CIO and not a grunt.)


The last people you want to keep secrets from are your engineers.


I think that hiding requirements from the people building the system would, in fact, be an idiotic move.


"Although Tom was doing the project for fun, often people use Hadoop and other so-called Big Data (tm) tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques."

I think the point the author is making is that although they knew from the start that Hadoop wasn't necessary for the job, many people probably don't.


Lots of people think that is "big data". For most people, if it's too big for an Excel spreadsheet, it's "big data", and the way you process big data is with Hadoop. Of course, once you show them the billable-hours difference between setting up a Hadoop cluster and (in my case at least) using Python libraries on a MBP, they change their minds real fast. It's just a matter of "big data" being a new thing; people will figure it out as time goes on and things settle down.


People love the idea of being big and having "big problems". Wanting to use Hadoop isn't that different from wanting to use all sorts of fancy "web-scale" databases.

Most of us don't have scaling issues or big data, but that sort of excludes us from using all the fancy new tools that we want to play with. I'm still convinced that most of the stuff I work on at work could run on SQLite, if designed a bit more carefully.

The truth is that most of us will never do anything that couldn't be solved with 10-year-old technology. And honestly we should be happy; there's a certain comfort in being able to use simple and generally understood tools.


Hi all, original author here.

Some have questioned why I would spend the time advocating against the use of Hadoop for such small data processing tasks as that's clearly not when it should be used anyway. Sadly, Big Data (tm) frameworks are often recommended, required, or used more often than they should be. I know to many of us it seems crazy, but it's true. The worst I've seen was Hadoop used for a processing task of less than 1MB. Seriously.

Also, much agreement with those saying there should be more education effort when it comes to teaching command line tools. O'Reilly even has a book out on the topic: http://shop.oreilly.com/product/0636920032823.do

Thank you for all the comments and support.


Author of Data Science at the Command Line here. Thanks for the nice blog post and for mentioning my book here. While we're talking about the subject of education, allow me to shamelessly promote a two-day workshop that I'll be giving next month in London: http://datascienceatthecommandline.com/#workshop


This is a great article and a fun read. A friend sent it over to me and I wrote him some notes about my thoughts. Having since realised it's on HN, I thought I'd post them here as well.

Some of my wording is a bit terse; sorry! :-) The article is great and I really enjoyed it. He's certainly got the right solution for the particular task at hand (which I think is his entire point) but he's generally right for the wrong reasons so I pick a few holes in that: I'm not trying to be mean. :-)

----- Classic system sizing problem!

1.75GiB will fit in the page cache on any machine with >2GiB RAM.

One of the big problems is that people really don't know what to expect so they don't realise that their performance is orders of magnitude lower than it "should" be.

Part of this is because the numbers involved are really large: 1,879,048,192 bytes (1.75GiB) is an incomprehensibly large number. 2,600,000,000 times per second (2.6GHz) is an incomprehensibly large number of things that can be done in the blink of an eye.

...But if you divide them using simple units analysis (things divided by things per second gives you seconds) you get about 0.72 seconds. That's assuming that you can process 1 byte per clock cycle, which might be reasonable if the data is small and fits in cache. If we're going to be doing things in streaming mode then we'll be limited by memory bandwidth, not clock speed.

http://www.techspot.com/review/266-value-cpu-roundup/page5.h... reckons that the Intel Core 2 Duo E7500 (@ 2.93GHz) has 7,634MiB/s of memory bandwidth for reads.

That's 8,004,829,184 bytes per second.

Which means we should be able to squeeze our data through the processor in...

bytes divided by bytes per second = seconds =>

1,879,048,192 / 8,004,829,184 ≈ 0.23 seconds

so roughly a quarter of a second per streaming pass over the data.

We probably want to assume that there are plenty of other stalls and overheads on top of that, plus multiple passes through the data, so a number in the tens of seconds seems reasonable for that workload (he gets 12): the article says it's just a summary plus aggregate workload so we don't really need to allocate much in the way of arithmetic power.

As with most things on x86, memory bandwidth is usually the bottleneck. If you're not getting numbers within an order of magnitude or so of memory bandwidth then either you have an arithmetic workload (and you know it) or you have a crap tool.

Due to the memory fetch patterns and latencies on x86, it's often possible to reorder your data access to get a nominally arithmetic workload close to the speed you'd expect from memory bandwidth.

His analysis about the parallelisation of the workload due to shell commands is incorrect. The speedup comes from accessing the stuff straight from the page cache.

His analysis about loading the data into memory on Hadoop is incorrect. The slowdown in Hadoop probably comes from the memory copying, allocation and GC involved in transforming the raw data from the page cache into objects in the language that Hadoop is written in and then throwing them away again. That's just a guess because you want memory to fill up (to about 1.75GiB) so that you don't have to go to disk. That memory is held by the OS rather than the userland apps tho'.

His conclusion about how `sleep 3 | echo "Hello"` is done is incorrect. They're "done at the same time" because every command in a pipeline is started concurrently, and echo never reads from stdin, so it doesn't wait for sleep at all. With a tool like uniq or sort it has to ingest all the data before it can begin because that's the nature of the algorithm. A tool like cat will give you line-by-line flow because it can, but the data flow through the pipeline is strictly serial in nature and (as with uniq or sort) might stall in certain places.
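
A quick way to see this for yourself in any POSIX shell:

    # "Hello" appears immediately (echo never reads stdin), but the pipeline
    # as a whole still takes about 3 seconds, because it only finishes when sleep exits
    time (sleep 3 | echo "Hello")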

He claims that the processing is "non-IO-bound" but also encourages the user to clear the page cache. Clearing the page cache forces the workload to be IO bound by definition. The page cache is there to "hide" the IO bottlenecks where possible. If you're doing a few different calculations using a few different pipelines then you want the page cache to remain full as it will mean that the data doesn't have to be reread from disk for each pipeline invocation.
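
A rough way to see the difference on Linux (dropping the cache needs root; the filename is just a placeholder):

    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
    time wc -l big.pgn    # cold cache: IO-bound, limited by the disk
    time wc -l big.pgn    # warm cache: served from RAM, limited by memory bandwidth/CPU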

For example, when I ingest photos from my CF card, I use "cp" to get the data from the card to my local disk. The card is 8GiB. I have 16GiB of RAM. That cp usually reads ahead almost the whole card and then bottlenecks on the write part of getting it onto disk. That data then sits in RAM for as long as it can (until the memory is needed by something else) which is good because after the "cp" I invoke md5sum to calculate some checksums of all the files. This is CPU bound and runs way faster than it would if it was IO bound due to having to reread all that data from disk. (My arrangement is still suboptimal but this gives an example of how I can get advantages from the architecture without having to do early optimisations in my app: my ingest script is "fast enough" and I can just about afford to do the md5sum later because I can be almost certain it's going to use the exact same data that was read from the card rather than the copied data that is reread from disk and, theoretically, might read differently.)

He's firmly in the realm of "small data" by 4 or 5 base 10 orders of magnitude (at least) so he's nowhere close to getting a "scaling curve" that will tell him where best to optimise for the general case. When he starts getting to workloads 2 or 3 orders of magnitude bigger than what he's doing he might find that there are a certain class of optimisations that present themselves but that probably won't be the "big data" general case.

Having said that, this makes his approach entirely appropriate for the particular task at hand (which I think is his entire point).

Through his use of xargs he implies (but does not directly acknowledge) that he realises this is a so-called "embarrassingly parallel" problem. -----


I think it is unsafe to parallelize grep with xargs as is done in the article because, beyond shuffling the delivery order, the output of the parallel greps could get mixed up (the beginning of a line comes from one grep and the end of the line from a different grep, so, reading line by line afterwards, you get garbled lines).

See https://www.gnu.org/software/parallel/man.html#DIFFERENCES-B...
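
A hedged sketch of the alternative, assuming the per-file mawk script from the article has been dropped into a wrapper called count_results.sh (a made-up name): GNU parallel buffers each job's output by default (--group), so lines from different jobs can't interleave.

    find . -type f -name '*.pgn' -print0 |
        parallel -0 -j4 ./count_results.sh |
        mawk '{ games += $1; white += $2; black += $3; draw += $4 }
              END { print games, white, black, draw }'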


The example in the article with cat, grep and awk:

    cat *.pgn | \
    grep "Result" | \
    awk '
     {
        split($0, a, "-");
        res = substr(a[1], length(a[1]), 1);
        if (res == 1) white++;
        if (res == 0) black++;
        if (res == 2) draw++;
      }
      END { print white+black+draw, white, black, draw }
    '
Can be written much more succinctly with just awk, and you don't even need to split the string or use substr:

    awk '
      /Result/ {
        if (/1\/2/) draw++;
        else if (/1-0/) white++;
        else if (/0-1/) black++;
      }
      END { print white+black+draw, white, black, draw }
    ' *.pgn


Keep reading, he removes the cat and grep in the final solution.


Yes, but he still keeps the awkward Awk code with the substr and such. I haven't benchmarked, maybe that's faster than the pretty regex matches.


I believe this is to be a bit more educative about how to build a pipeline. Also, iteratively building such solutions quickly often leads to such "inefficiencies" but makes things easier to reason about. Besides, the awk step might have been factored out in the end, so it wouldn't make sense to optimise early. Also, by the time the author reaches the end, he gets IO-bound, so there's not much need to optimise further (in the context of the exercise).


The author begins with a fairly idiomatic shell pipeline, but in the search for performance it transforms into an awk script. Not that I have anything against awk, but I feel like that kinda runs against the premise of the article. The article ends up demonstrating the power of awk over pipelines of small utilities.

Another interesting note is that there is a possibility that the script as-is could mis-parse the data. The grep should use '^\[Result' instead of 'Result'. I think this demonstrates nicely the fragility of these sorts of ad-hoc parsers that are common in shell pipelines.


It probably depends on what you are trying to accomplish... I think a lot of us would reach for a scripting language to run through this (relatively small) amount of data... node.js does piped streams of input/output really well. And Perl is the granddaddy of this type of input processing.

I wouldn't typically reach for a big data solution short of hundreds of gigs of data (which is borderline, but will only grow from there). I might even reach for something like ElasticSearch as an interim step, which will usually be enough.

If you can dedicate a VM in a cloud service to a single one-off task, that's probably a better option than creating a Hadoop cluster for most work loads.


Bottom line is - you do not need hadoop until you cross 2TB of data to be processed (uncompressed). Modern servers ( bare metal ones, not what AWS sells you ) are REALLY FAST and can crunch massive amounts of data.

Just use proper tools: well-optimized code written in C/C++/Go/etc - not the crappy Java framework-in-a-framework^N architecture that abstracts away any thinking about CPU speed.

Bottom line, the popular saying is true: "Hadoop is about writing crappy code and then running it on a massive scale."


Dell sells a server with 6TB of ram (I believe.) I think the limit is way over 2TB. If you want to be able to query it quickly for analytical workloads, MPPs like Vertica scale up to 150+TB (at Facebook.) I honestly don't know what the scale is where you need Hadoop, but it's gotten to be a large number very quickly.


They do, I checked. It comes in at a cool half million (Helloooo, investors!)


My question is what do you mean by 2TB? At my current client, we have 5 TBs of data sitting (that's relatively recent). Before we had 2-ish. However, we had over 30 applications doing complex fraud calculations on that. "Moving data" (data being read and then worked) is about 40 TB daily. Even with SSD and 256 GB of RAM, a single machine would get overwhelmed on this.

If you're only working one app on less than 1 TB, maybe you don't need something as complex as Hadoop. But given that a cluster is easy to setup (I made a really simple NameNode + Two Data nodes in 45 minutes, going cold), it might not be a bad idea.

I'll take this further and say that some tools for Hadoop that are not from Apache are really nice to work with even for non-Hadoop work. For example, I had to join several 1 GB files together to go from a relational, CSV model into a document store model. Can I do this with command line tools? Maybe. Cascading makes this really easy. Each file family is a tap. I get tuple joins naturally. I wrote an ArangoDB tap to auto-load into ArangoDB. It was fun, testable and easy. All of this runs sans-Hadoop on my little MBP.

Fun fact about the Cascading tool set is that I can take my little app from my desktop and plop it onto a Hadoop cluster with little change (taps from local to hadoop). Will I do that in my present example? No. Can I think of places where that's really useful? Yes, daily 35 fraud models' regression tests executed with each build. That's somewhere around 500 full model executions over limited, but meaningful data. All easily done courtesy of a framework that targets Hadoop.


What makes 2TB the cutoff?


I think the consensus is that as long as your data fits on a single (affordable) machine, 'big data' tools are probably not the best solution.


Don't shoot me, but out of curiosity I wrote the thing in javascript: https://gist.github.com/ricardobeat/ee2fb2a6d704205446b7

Results: 4.4GB[1] processed in 47 seconds. That's around 96MB/s; it can probably be made faster, and nodejs is not the best at munging data...

[1] 3201 files taken from http://github.com/rozim/ChessData


This article echoes a talk Bryan Cantrill gave two years ago: https://youtu.be/S0mviKhVmBI

It's about how Joyent took the concept of a UNIX pipeline as a true powertool and built a distributed version atop an object filesystem with some little map/reduce syntactic sugar to replace Hadoop jobs with pipelines.

The Bryan Cantrill talk is definitely worth your time, but you can get an understanding of Manta with their 3m screencast: https://youtu.be/d2KQ2SQLQgg


I have developed a one-liner toolset for Hadoop (for when I have to use it). It's refreshing to see a ZFS-based take on the concept. I don't like the JavaScript choice though.

GNU parallel should be a widely adopted choice. Lightweight. Fast. Low cost. Extendable.


You can use command-line tools for Manta without touching any Javascript. That's probably the best way to go. Although I do like Javascript.


Besides using `xargs -P 8 -n 1` to parallelize jobs locally, take a look at paexec, a GNU parallel replacement that just works.

See https://github.com/cheusov/paexec


What's the advantage of using paexec over GNU parallel?



See this very good comment by Bane:

https://news.ycombinator.com/item?id=8902739


I had an intern over the summer, working on a basic A/B Testing framework for our application (a very simple industrial handscanner tool used inside warehouses by a few thousand employees).

When we came to the last stage, analysis, he was keen to use MapReduce so we let him. In the end though, his analysis didn't work well, took ages to process when it did, and didn't provide the answers we needed. The code wasn't maintainable or reusable. shrug It happens. I had worse internships.

I put together some command line scripts to parse the files instead- grep, awk, sed, really basic stuff piped into each other and written to other files. They took 10 minutes or so to process, and provided reliable answers. The scripts were added as an appendix to the report I provided on the A/B test, and after formatting and explanations, took up a couple pages.


I used Hadoop a few times this semester for different classes and the code seemed easy to write: everything is either a Mapper or a Reducer, so you just read enough of the docs to figure out what is intended to be done and then build on top of it. Can I ask how it wasn't maintainable?

Just curious


On a tangent, I'd be interested in how you format heavily piped bash code for documentation. Can comments be interspersed there?


Functions, mostly - the big `awk` command in the example goes into something like

    # @param $1 whatever
    chess_extract_scores() {
         awk blah blah blah
    }
and then your whole pipeline simplifies to

    cat foo | grep bar | chess_extract_scores
which is pretty readable. You can even do most of this in a live bash session with ^X ^E.


You can actually do without cat:

grep bar foo | chess_extract_scores

http://en.wikipedia.org/wiki/Cat_%28Unix%29#Useless_use_of_c...


Sure you can, but premature optimization is also a real thing http://en.wikipedia.org/wiki/Program_optimization#When_to_op...


bash has functions; functions are just like commands. Write your comments in the functions, and your final line will be the pipeline of awesomeness:

    generate_data () {
        :   # make it rain
    }

    process () {
        :   # chunky
    }

    gather () {
        :   # puree
    }

    generate_data | process | gather


We have a proprietary algorithm for assigning foods a "suitability score" based on a user's personal health conditions and body data.

It used to be a fairly slow algorithm, so we ran it in a hadoop cluster and it cached the scores for every user vs. every food in a massive table on a distributed database.

Another developer, who is quite clever, rewrote our algorithm in C, and compiled it as a database function, which was about 100x faster. He also did some algebra work and found a way to change our calculations, yielding a measly 4-5x improvement.

It was so, so, so much faster that in one swoop we eliminated our entire Hadoop cluster and the massive scores table, and were actually able to sort your food search results by score, calculating scores on the fly.


May I ask: Who is we?


This also isn't a straight either-or proposition. I build local command-line pipelines and do testing and/or processing there. When the amount of data to be processed passes into the range where memory or network bandwidth makes the processing more efficient on a Hadoop cluster, I make some fairly minimal conversions and run the same stream processing on the Hadoop cluster in streaming mode. It hasn't been uncommon for my jobs to be much faster than the same jobs run on the cluster with Hive or some other framework. Much of the speed boils down to the optimizer and the planner.

Overall I find it very efficient to use the same toolset locally and then scale it up to a cluster when and if I need to.


What toolset are you using that you can run both locally and on a Hadoop cluster?


Almost all of them?

The vocabulary of the grandparent comment implies they are using hadoop's streaming mode, and thus one can use a map-reduce streaming abstraction such as MRJob or just plain stdin/stdout; both will work locally and in cluster mode.

Or, if static typing is more agreeable to your development process, running hadoop in "single machine cluster" mode is relatively painless. The same goes for other distributed processing frameworks like Spark.
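
A minimal sketch of that local/cluster symmetry with plain stdin/stdout streaming, where mapper.sh and reducer.sh are stand-ins for your own scripts and the streaming jar path depends on the install:

    # locally: an ordinary pipeline
    cat input/* | ./mapper.sh | sort | ./reducer.sh > results.txt

    # on the cluster: the same two scripts via Hadoop streaming
    hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /data/input -output /data/results \
        -mapper mapper.sh -reducer reducer.sh \
        -file mapper.sh -file reducer.sh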


I believe he mentioned it. The Hadoop streaming mode.


If bash is the shell (assuming recursive search is required), maybe it would be even faster to just do:

    shopt -s globstar
    mawk '/Result/ {
        game++
        split($0, a, "-")
        res = substr(a[1], length(a[1]), 1)
        if(res == 1)
            white++
        if(res == 0)
            black++
        if(res == 2)
            draw++
    } END {
        print game, white, black, draw
    }' **/*.pgn

?


This is a great exercise of how to take a Unix command line and iteratively optimize it with advanced use of awk.

In that spirit, one can optimize the xargs mawk invocation by 1) getting rid of string-manipulation function calls (which are slow in awk), 2) using regular expressions in the pattern expression (which allows awk to short-circuit the evaluation of lines), and 3) avoiding the use of field variables like $1 and $2, which allows the mawk virtual machine to avoid implicit field splitting. A bonus is that you end up with an awk script which is more idiomatic:

  mawk '
  /^\[Result "1\/2-1\/2"\]/ { draw++ }
  /^\[Result "1-0"\]/ { white++ }
  /^\[Result "0-1"\]/ { black++ }

  END { print white, black, draw }'  
Notice that I got rid of the printing out of the intermediate totals per file. Since we are only tabulating the final total, we can modify the 'reduce' mawk invocation to be as follows:

  mawk '
  {games += ($1+$2+$3); white += $1; black += $2; draw += $3}
  END { print games, white, black, draw }'
Making the bottleneck data stream thinner always helps with overall throughput.


First, you don't score points with me for saying not to use Hadoop when you don't need to use Hadoop.

Second, you don't get to pretend you invented shell scripting because you came up with a new name for it.

Third, there are very few cases if any where writing a shell script is better than writing a Perl script.


To quote the memorable Ted Dziuba[0]:

"Here's a concrete example: suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it? The cool-kids answer is to write a distributed crawler in Clojure and run it on EC2, handing out jobs with a message queue like SQS or ZeroMQ.

The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A "distributed crawler" is really only like 10 lines of shell script."

[0] since his blog is gone: http://readwrite.com/2011/01/22/data-mining-and-taco-bell-pr...
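
Roughly the Taco Bell version, as a sketch (urls.txt, one URL per line, is hypothetical):

    # 16 parallel wget workers, 64 URLs per invocation;
    # --no-clobber means re-running the same list skips anything already on disk
    xargs -P 16 -n 64 wget --quiet --force-directories --no-clobber < urls.txt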


Oh right the "cool kids" approach.

Here's what the "sensible adults" think about when they see problems like this. Operational supportability: how do you monitor the operation? Restart recovery: do you have the ability to restart the operation midway through if something fails? Maintainability: can we run the same application on our desktop as on our production servers? Extensibility: can we extend the platform easily to do X, Y, Z after the crawling?

I can't stand developers who come up with the xargs/wget approach, hack something together and then walk away from it. I've seen it far too often and it's great for the short term. Dreadful for the long term.


The Unix people have thought of these things. You can easily do them with command line tools.

> Operational Supportability: How do you monitor the operation ?

Downloading files with wget will create files and directories as it proceeds. You can observe and count them to determine progress, or pass a shell script to xargs that writes whatever progress data you like to a file before/after calling wget.

> Restart Recovery: Do you have the ability to restart the operation mid way through if something fails ?

wget has command line options to skip downloading files that already exist. Or you can use tail to skip as many lines of the input file as there are complete entries in the destination directory.

> Maintainability: Can we run the same application on our desktop as on our production servers ?

I'm not sure how this is supposed to be an argument against using the standard utilities that are on everybody's machine already.

> Extensibility: Can we extend the platform easily to do X, Y, Z after the crawling ?

Again, what? Extensibility is the wheelhouse of the thing you're complaining about.


Unix tools are composable. Functional languages (e.g. Clojure) are all about composability. While bash might be a reasonable glue language, I wonder why Clojure wouldn't be — and it could probably be as compact, if not terser.

The problem of the Hadoop approach is that the overhead of parallelization over multiple hosts is serious, and the task fits one machine neatly. A few GBs of data can and should be processed on one node; Hadoop is for terabytes.


> [...] I wonder why Clojure wouldn't be — and it could probably be as compact, if not terser.

Because Clojure is a glue language, that question depends mostly on the libraries available for Clojure.

(Whereas some other languages are worse at gluing, so libraries will only help you so far.)


That's if you want to do everything in Clojure, but if Clojure were to be used as a glue language, it seems to me it has a clear syntax for it:

http://clojuredocs.org/clojure.java.shell/sh

Someone even went further to make it more useful:

https://github.com/Raynes/conch


I'd never heard of conch, thanks loads for the reference; really useful.


I could not agree more. And even with the things you mentioned, such a script will still be tiny and very readable.

You just have to love the simplicity.


I love Unix, but it's just a local minimum in the design space.

For example, its typical text processing pipelines are hard to branch. I have hacked up some solutions, but never found them very elegant. I would love to hear some solutions to this. I ended up switching to Clojure (Prismatic's) Graph.


> For example, its typical text processing pipelines are hard to branch.

I'm not entirely sure what you mean by this, but it sounds like you should use "tee" pointing at a fifo.


The problem: you have a file, and you want to do one thing for lines matching a REGEX and another thing for lines not matching it.

How do you do it without iterating over the file twice? You can do a while-read loop, of course, but that defeats the reason to use the shell.

I would love to have a two-way grep that writes matching lines to stdout and non-matching lines to stderr. I wonder if the grep maintainers would accept a new "--two-way" option.


Write to more than one fifo from awk. If you're composing a dag rather than a pipeline, fifos are one way to go.

Personally though, I'd output to temporary files. The extra cost in disk usage and lack of pipelining is made up for by the easier debugging, and most shell pipelines aren't so slow that they need that level of optimization.
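
For example, a one-pass sketch of the "two-way grep" asked about above (plain files here; the same thing works with FIFOs):

    awk '/REGEX/ { print > "matched.txt"; next }
                 { print > "unmatched.txt" }' input.txt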


awk can write to stderr.


Which is for errors.


wget isn't the only part of the puzzle you may need Restart Recovery for - the CPU-bound map/reduce portion may also need to recover from partial progress. Unix tools aren't well-designed for that.


> Downloading files with wget will create files and directories as it proceeds. You can observe and count them to determine progress, or pass a shell script to xargs that writes whatever progress data you like to a file before/after calling wget.

Which means using wget as your HTTP module and a scripting language as the glue for the logic you'll ultimately need to implement to create a robust crawler (robust to failures and edge cases).

> wget has command line options to skip downloading files that already exist. Or you can use tail to skip the number of lines in the input file as there exist complete entries in the destination directory.

Is wget able to check whether a previously failed page exists on disk [in some kind of index] before making any new HTTP requests? It sounds like this would try fetching every failed URL until it reaches the point where it left off before the restart. If it's not possible to maintain an index of unfetchable URLs and reasons for the failures then this would be one reason why wget wouldn't work in place of software designed for the task of crawling (as opposed to just fetching).

This is one of those tasks that seems like you could glue together wget and some scripts and call it a day but you would ultimately discover the reasons why nobody does this in practice. At least not for anything but one-off crawl jobs.

Thought of another possible issue:

If you're trying to saturate your connection with multiple wget instances, how do you make sure that you're not fetching more than one page from a single server at once (being a friendly crawler)? Or how would you honor robots.txt's Crawl-delay with multiple instances?

Edit: `previously fetched` -> `previously failed`


> Which means using wget as your HTTP module and a scripting language as the glue for the logic you'll ultimately need to implement to create a robust crawler (robust to failures and edge cases).

This is kind of the premise of this discussion. You don't use Hadoop to process 2GB of data, but you don't build Googlebot using bash and wget. There is a scale past which it makes sense to use the Big Data toolbox. The point is that most people never get there. Your crawler is never going to be Googlebot.

> Is wget able to check whether a previously failed page exists on disk [in some kind of index] before making any new HTTP requests? It sounds like this would try fetching every failed URL until it reaches the point where it left off before the restart. If it's not possible to maintain an index of unfetchable URLs and reasons for the failures then this would be one reason why wget wouldn't work in place of software designed for the task of crawling (as opposed to just fetching).

It really depends what you're trying to do here. If the reason you're restarting the crawler is because e.g. your internet connection flapped while it was running or some server was temporarily giving spurious HTTP errors then you want the failed URLs to be retried. If you're only restarting the crawler because you had to pause it momentarily and you want to carry on from where you left off then you can easily record what the last URL you tried was and strip all of the previous ones from the list before restarting.

But I think what you're really running into is that we ended up talking about wget and wget isn't really designed in the Unix tradition. The recursive mode in particular doesn't compose well. It should be at least two separate programs, one that fetches via HTTP and one that parses HTML. Then you can see the easy solution to that class of problems: When you fetch a URL you write the URL and the retrieval status to a file which you can parse later to do the things you're referring to.

> If you're trying to saturate your connection with multiple wget instances, how do you make sure that you're not fetching more than one page from a single server at once (being a friendly crawler)? Or how would you honor robots.txt's Crawl-delay with multiple instances?

Give each process a FIFO to read URLs from. Then you choose which FIFO to add a URL to based on the address so that all URLs with the same address are assigned to the same process.
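
One hedged way to do that partitioning up front with standard tools (file names and worker count are made up):

    # group URLs by host, then deal hosts out to 4 worker queues, so any given
    # host is only ever fetched by one (serial) worker
    awk -F/ '{ print $3 "\t" $0 }' urls.txt | sort -s -k1,1 |
        awk -F'\t' '{ if ($1 != prev) { prev = $1; w = (w + 1) % 4 }
                      print $2 > ("queue." w) }'

    for q in queue.*; do
        xargs -n 32 wget --quiet --wait=1 < "$q" &
    done
    wait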


> Give each process a FIFO to read URLs from. Then you choose which FIFO to add a URL to based on the address so that all URLs with the same address are assigned to the same process.

I wrote this in a reply to myself a moment after you posted your comment so I'll just move it here:

Regarding the last two issues I mentioned, you could sort the list of URLs by domain and split the list when the new list's length is >= n URLs and domain on the current line is different from the domain on the previous line. As long as wget can at least honor robots.txt directives between consecutive requests to a domain, it should all work out fine.

It looks like an easily solvable problem however you go about it.

> It really depends what you're trying to do here.

I was thinking about HTTP requests that respond with 4xx and 5xx errors. It would need to be possible to either remove those from the frontier and store them in a separate list, or mark them with the error code so that it can be checked at some point before being passed on to wget.


Open file on disk. See that it's 404. Delete file. Re-run crawler.

You'd turn that into code by doing grep -R 404 . or whatever the actual unique error string is and deleting any file containing the error message. (You'd be careful not to run that recursive delete on any unexpected data.)

Really, these problems are pretty easy. It's easy to overthink it.


> grep -R 404

This isn't 1995 anymore. When you hit a 404 error, you no longer get Apache's default 404 page. You really can't count on there being any consistency between 404 pages on different sites.

If wget somehow stored the header response info to disk (e.g. "FILENAME.header-info") you could whip something up to do what you are suggesting though.


Yeah, wget stores response info to disk. Besides, even if it didn't, you could still visit a 404 page of the website and figure out a unique string of text to search for.


> Dreadful for the long term.

Here comes a bubble-bursting: I've led a team that built data processing tools exactly like this, and the performance and ease of manipulating vast amounts of text using classic shell tools is hard to beat. We had no problems with any of: operational supportability, restart recovery, or maintainability. Highly testable, even. No, it's not just cowboy-coded crappy shell scripts and pipelines. Sure, there's a discipline to building pipelined tooling well, just as with any other kind of software. Your problems seem to stem from a lack of disciplined software engineering rather than from the tools, or maybe just from an environment that encouraged high technical debt.

The kicker? We were using pipeline-based tooling ... running on a Hadoop cluster. Honestly, I'm a bit surprised to see such an apparent mindshare split (judging by some recent HN posts) between performant single-system approaches and techniques used in-cluster. The point that "be sure your data is really, truly big data" is obviously well made, and still bears repetition. Yet the logical follow-on is that these techniques are even more applicable to cluster usage. Why would anyone throw away multiple orders of magnitude of performance when going to a cluster-based approach?


Unix/POSIX backgrounds are pretty common among the Hacker News crowd. Not so in "Enterprise" development. (Beam me up Scottie, there's no intelligent life here, only risk avoidance)

Enterprise development is dominated by 2 or 3 trusted platforms: Windows (/.NET) and the JVM. POSIX systems are only useful in so far as they are a cheaper (or sometimes more reliable) place to host Java virtual machines. Enterprise dev groups generally have very limited exposure to, and a lot of fear of, things like the Bourne shell, AWK, Perl, Python. These languages don't have Visual Studio or Eclipse to hold your hand while you make far-reaching refactorings like renaming a variable.

Sure, you and I would crawl log/data files trivially with a few piped commands, but that's a rare skill in most shops, at least since the turn of the century.

Ugh, that sounds cliche, but it's hard not to feel that way after being drowned in "Java or nothing" for so long at work.

http://tvtropes.org/pmwiki/pmwiki.php/Main/ElegantWeaponForA...


I agree with @roboprog. Most software shops employ engineers who don't have exposure to UNIX tools. Only a few hardcore engineers have exposure to them or an interest in learning them. For the majority of engineers it is just a job. They simply use the same tool for everything. And they tend to use the tools that seem to get them into well paying jobs. If Hadoop can get them good paying jobs, they would like to use "hadoop" for something in their current job, even if that job could be performed by a set of CLI utils. I have seen hundreds of resume-builder projects in my past experience.


Windows has PowerShell, which can be even more powerful than Unix shell depending on the data and processing required.


Does anybody without a unix/POSIX background even bother tinkering with PowerShell? Yes, it's a cool idea, as well, especially if your source data is in MS-Office, but I've not seen it put to much use.


The problem with shell scripting is that nearly nobody is very, very good at it. The Steam bug doing an rm -rf / is an example, but it's very common for shell scripts to have horrible error handling and to be missing checks for important things. The shell is just not suitable for extremely robust programs. I would bet that 80%+ of people who think they're good at shell scripting... aren't.


> The problem with shell scripting is that nearly nobody is very, very good at it. The Steam bug doing an rm -rf / is an example

The steam bug is an example of utter incompetence, not of someone not being very, very good at it. Whoever is happy with shipping `rm -rf $VAR/` without extreme checking around it should get their computer driving license revoked.

> The shell is just not suitable for extremely robust programs.

Incorrect. "The shell" can go as robust as you can handle. In bash, `set -e` will kill your script if any of the sub-commands fail (although ideally you'll be testing $? (exit code of prev. op) at the critical junctions), `set -u` will error on usage of undefined variables, etc.

A huge part of the "glue" that holds your favourite linux distro together is bash files.

> I would bet that 80%+ of people who think they're good at shell scripting... aren't.

The same probably goes for driving[1], this doesn't make cars any less robust.

[1] http://en.wikipedia.org/wiki/Illusory_superiority


> The same probably goes for driving[1], this doesn't make cars any less robust.

I don't think I can imagine anything less robust than cars, in terms of the frequency and severity of operational failure. They're pretty much the deadliest thing we've ever invented that wasn't actually designed to kill people.

It's actually a good example of the point developer1 was making: cars and shell scripts are perfectly safe if operated by highly competent people, and only become (extremely) dangerous when operated by incompetents, but in practice most operators are incompetent, in denial, and refuse to learn from others' mistakes.


> I don't think I can imagine anything less robust than cars, in terms of the frequency and severity of operational failure.

Maybe US cars :P

> They're pretty much the deadliest thing we've ever invented that wasn't actually designed to kill people.

It's a box weighing 1-2 tons that travels at 100kmh+. Millions (billions?) of km are driven every year. There will be accidents for both good drivers and bad. This won't change.

> cars and shell scripts are perfectly safe if operated by highly competent people, and only become (extremely) dangerous when operated by incompetents, but in practice most operators are incompetent, in denial, and refuse to learn from others' mistakes.

That's simply untrue - both points. Highly competent drivers will have accidents. I highly doubt you feel extreme danger when you get behind the wheel/in a car. The way you phrase it, one expects millions of fatalities daily.


> rm -rf $VAR/

> /

facepaw.jpg

Without the trailing slash, null or undefined $VAR would cause an error instead of a request to delete all the things.


The Steam line was more like ` rm -rf $VAR/* `, i.e. they didn't want to delete the $VAR directory itself. Still, ` rm -rf /* ` is no fun.


Part of the journey into Linuxdom is learning a healthy dose of fear for that command. I always pause over the enter key for a few seconds, even when I'm sure I haven't typo'ed.


Yeah but 80% of the people writing Java and think they're good aren't as well. And plenty of companies support Java.

The answer isn't "don't use it", it's "train your programmers in the languages they use".


The sorts of bugs people experience with Java mostly result in a crashed/stalled/hung process. Bash bugs erase your entire file system. The thing about Bash is that it is trivially easy to make these sorts of mistakes - the language just isn't suitable for general purpose scripting.


> Bash bugs erase your entire file system.

    EMPTY=""
    rm -rf $EMPTY/
Is this the kind of bug you're referring to?

> The thing about Bash is that it is trivially easy to make these sorts of mistakes

I fail to see how any other scripting language would have a different effect when you told it to do:

    system("rm -rf "+""+"/")
> the language just isn't suitable to general purpose scripting.

Yes, it is. Bash is deeper than it looks, but not by much. Learn how to handle errors and you'll be fine.


It shouldn't be able to erase your filesystem unless you are running as root or doing something equally stupid. That's pretty much common sense stuff for anyone that isn't a beginner.


Yeah the "common sense stuff for anyone that isn't a beginner" argument is repeated ad nauseam, and even the largest companies make this mistake in their largest products. Take Valve - they should know how to write good code, right? And yet, last week an article was on top of HN, outlining how they put:

"rm -rf '$STEAMROOT'/*" in their code, used to remove the library. But hey, no one checked if $STEAMROOT is not empty, so when it was for one user, Steam deleted all of his personal files, whole /home and /media directories next time it started.

I'm not saying that command line tools shouldn't be used, but sometimes they are just too powerful for some users, and stupid mistakes like this happen.


You're right to an extent, but this isn't relevant to the Java vs Bash discussion. The largest companies make this kind of mistake in whatever language they happen to use.

People delete data and screw things up in MapReduce jobs for Hadoop. A lot.


If you're worried about that, don't give the script permissions to access your entire filesystem. Easily handled with separate users, cgroups, assorted containerisation, and more.


> The shell is just not suitable for extremely robust programs.

Absolute statements like this are usually wrong. This one does not escape the rule. When Linux distros' init is mostly bash scripting, there is very little need to further prove that robust systems can be written in bash without the language fighting the developer.


Wait, is that really a good argument for the shell-based approach when all major distros are switching to systemd due to the configuration/maintainability/boilerplate issues with bash init scripting?


I'm not going into the systemd vs sysvinit discussion. For my argument, it is enough to recognize that bash-based sysvinit has been with us for circa 20 years with no stability problems.


I think most of what was written also applies to any normal programming language. You could write this in Python, Ruby, JavaScript, Java or C# without any problems. The code would probably be easier to read, too. The only special thing is the web page scraping, which could be done by a library, but the same thinking about scalability and the use of a single computer instead of a Hadoop cluster still holds even if you're reading from file systems or databases.


Yeah I learned this trying to massage some data with Elasticsearch. curl -XDELETE host/index/type/$id ... Except $id didn't exist.
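
In bash, the ${var:?} guard mentioned elsewhere in the thread would have stopped that before any request went out:

    curl -XDELETE "host/index/type/${id:?id is empty}"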


Something to keep in mind is that while a single app might be best served on a single machine piping data, multiple apps working on the same data set probably wouldn't scale. Hadoop, for all its faults, does provide a nice, relatively simple programming platform to support multiple data processes.

Edited for phone swipe mistakes.


This "orders of magnitude" less performance going to a Hadoop cluster approach is nonsense.

There are plenty of options for Hadoop that make it dramatically faster than the naive example in the article. Spark ? MR-Redis ? Storm ?


>Here's what the "sensible adults" think about when they see problems like this. Operational Supportability: How do you monitor the operation? Restart Recovery: Do you have the ability to restart the operation midway through if something fails? Maintainability: Can we run the same application on our desktop as on our production servers? Extensibility: Can we extend the platform easily to do X, Y, Z after the crawling?

Yeah, and then they produce some over-engineered monstrosity, late, over budget, and barely able to run...


I look at this article as a criticism of Hadoop being the wrong tool for small data sets.

This starts to become a question of data locality and size. 1.75 GB isn't enough data to justify a Hadoop solution. That data size fits easily in memory, and without doubt on a single system. From that point you only need some degree of parallelism to maximize the performance. That being said, when it's 35TB of data, the answer starts to change.

The fact that shell commands were used makes for an easy demo that might be hard to support, but if a solution were written using a traditional language with threading or IPC instead of relying on Hadoop, it should always be faster, since you don't incur the latency costs of the network.


> That data size fits easily in memory, and without doubt on a single system. From that point you only need some degree of parallelism to maximize the performance. That being said when its 35TB of data, the answer starts to change.

Not at all, because data is being streamed. It could just as easily be 35TB and only use a few MB of RAM.


The I/O bandwidth of the system will limit you more when loading 35TB of data on a single system, even if it is streamed. You'll need more than one disk and network card to do this in a timely fashion.


1.75 GB isn't enough data to justify a hadoop solution. That data size fits easily in memory, and without doubt on a single system.

It depends on what you do with the data. If you are processing the data in 512KB chunks and each chunk takes a day to process (because expensive computation), you probably do want to spread the work over some cluster.


I don't think of Hadoop as being built for high-complexity computation, but for high I/O throughput.

When you describe this kind of setup, I imagine things that involve proof by exhaustion. For example, prime number search is something with a small input and a large calculation time. However, these solutions don't really benefit from Hadoop, since you don't really need the data management facilities, and a simpler MPI solution could handle this better.

Search indexing could fit this description (url -> results), but generally you want the additional network cards for throughput, and the disks to store the results. Then again, the aggregate space on disk starts looking closer to TB instead of GB. Plus, in the end you need to do something with all those pages.


I think the article said that you don't need to use Hadoop for everything and that it might be much faster to just use command line tools on a single computer. Of course you might find a use case where the total computing time is massive and in that case a cluster is better. I still don't think many use cases have that problem.

We are doing some simple statistics at work for much smaller data sizes and the computing time is usually around 10-100 ms so it could probably compute small batches at almost network speed.


Definitely. I was reacting to my parent poster, because size does not say everything. 1TB can be small, 1GB can be big - it depends on the amount of computation time that is necessary for whatever processing of the data you do.


I hate developers who over engineer everything and then when it's time to perform some of that support and extensibility, they leave because maintenance is beneath them.

They put this behemoth together with a thousand moving parts and then walk away from it.

This, too, happens often.


And I can't stand developers who overengineer things. We have a couple of them at my company, and something that should take a few hours always takes several weeks just because of all the reasons you mention. Most things don't need that kind of features and maintainability, and if they do in the future we can just rewrite them from scratch. The overall expected return on investment is still better, since we seldom need to.


Because in all too many companies, rewriting from scratch is a no-go, no matter how quickly and sloppily the initial solution was thrown together. I've worked on a prototype => production type project where the throwaway was never thrown away. (The initial team made some mistakes; chief among them was building one prototype of the whole system rather than one per major risk.)


This is a systemic problem. Engineering is always subordinate to business. This simply should not be the case. We desperately need new business organization models.



Quite the opposite, and quite simple: engineers over-engineer things in order to make them generic, and generic makes solutions robust. That's basic science. Unless the problem and solution are well understood, your investment won't guarantee a return at all.


Generic, by default, does not in any way make things more robust. We've gone from engineering solutions to meet specific problems to engineering solution frameworks that (supposedly) will solve the problem and allow for any unknowns. The problem is, no matter how hard the engineer tries, he can never anticipate the unknowns to the extent that the application framework can support all of them. We should go back to solving the specific problem at hand. In both scenarios you get the customer who wants a feature that absolutely doesn't fit with the current application, therefore a rewrite is necessary. And with the specific solution, you don't have nearly the man hours wasted.


No, developers over-engineer because setting up a 20-node Hadoop cluster is fun, whereas doing the same task in an hour in Excel means you have to move onto some other boring task.

Generic doesn't mean robust either. I don't know where you got that from; the two concepts are entirely unrelated.


Generic -> robust. I... I don't know how to explain that. Honestly, I haven't thought about the necessity of explaining things like this. It's... basic mathematics.


I don't think these words mean what you think they mean. Like "science" and "mathematics".


I'm sorry, but if you cannot explain it, you simply do not understand it yourself. That's harsh, I get it, and I'm truly sorry, but that's a basic fact.


Generic != Robust.

It can quite easily be the opposite, they are certainly orthogonal concepts.


This smacks of an unexamined bias. Or maybe we're not using the words to mean the same things?


No. Look to safety-critical software for intuition on why.

Simpler is more reliable. Also, it's hard to know enough about a problem to make a generic solution until you've solved the problem 2-3 times already. But ... having solved a problem multiple times increases the risk that you will be biased towards seeing new problems as some instance of the old problem and therefore applying unsuitable "generic" solutions.


'pv' shows a progress bar; something like '&& touch "$x.success"' can help with restart recovery.
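Roughly the pattern I mean, with made-up file names:

    mkdir -p results
    for f in data/*.pgn; do
        [ -e "$f.success" ] && continue                        # finished on a previous run
        pv "$f" | awk '/Result/ { n++ } END { print n+0 }' \
            > "results/$(basename "$f").count" &&
        touch "$f.success"                                     # only marked done if the step succeeded
    done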

I'd probably pick the shell approach for something I expect to be a one-off, but reconsider each time the task is repeated.

I printed http://xkcd.com/1205/ and stuck it to the wall. It's a useful reference when someone seems to be under or overengineering something.


While it drives some point home, the chart elides the question of robustness (a written script will run the same way twice, whereas human error, especially on routine tasks, may hit one hard) and documentation (writing even a lightly commented script to do yearly maintenance is guaranteed to help your future self remember things one year from now).


That chart assumes 24-hour days. The reality of (my?) productivity is that I have perhaps six productive hours in a day. If I can save eight productive hours per month, that's sixteen days a year, not four.


Sure -- you can level multiple complaints.

What about the failed pages? How about shoving those on a queue and retrying n times with an exponential backoff between. What about the total number of failed pages? What about failed pages by site? etc etc etc

But so what -- the principle is still sound. All I described is still a 100-line Python script, written in an afternoon, instead of 3 weeks of work bringing up EMR, installing and configuring Nutch, figuring out network issues around EMR nodes talking to the commodity internet, installing a persistent queue, performing remote debugging, building a task DAG in either code or (god help you) Oozie/XML, and on and on.
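Even the queue-and-retry-with-backoff part is only a handful of lines of shell (a sketch, with a hypothetical urls.txt):

    retry() {
        local tries=0 max=5 delay=1
        until "$@"; do
            tries=$((tries + 1))
            [ "$tries" -ge "$max" ] && return 1
            sleep "$delay"; delay=$((delay * 2))    # exponential backoff between attempts
        done
    }
    while IFS= read -r url; do
        retry wget -q -P pages/ "$url" || printf '%s\n' "$url" >> failed.txt
    done < urls.txt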


Anybody can throw some crap together and make it stick. And it's a perfectly valid solution.

My issue is when there is criticism laid against those solutions which are actually engineered in a way that allows for supportability and extensibility. They are arguably far more important than execution time.


I can't figure out why you're arguing against standard Unix tools/idioms in the name of supportability and extensibility. It defies logic.


I think, in many people's minds, extensibility == pain; either lots of code configuration (hello Java, EJB), or XML (hello Hadoop, Java, Spring, EJB), or tons of code (hello Java, C++), etc. When nice languages don't make things painful, it sometimes feels like it's wrong, or not really enough work, or in some other way insufficient. But people can mistake the rituals of programming for getting actual work accomplished.

:shrug: just .02


I imagine what would have happened to Linux if Linus designed it with supportability and extensibility in mind.


Simple: because standard utils are programs that do what they're supposed to do. If the problem's bounds are well within the domain of a standard util, then it's all good. Supportability and extensibility are way too generic for you to draw a line saying standard utils can handle them all. After all, they are programs, not programming languages.


There are command line tools available to help the transition from 'hack' one-liner to a more maintainable/supportable solution. For instance, drake (https://github.com/Factual/drake), a 'Make for data' which does dependency checking, would allow for sensible restarts of the pipeline.

The O'Reilly Data Science at the Command Line book (linked elsewhere in the comments) has a good deal to say on the subject: turning one-liners into extensible shell scripts, using drake, using GNU Parallel.


I've been using GNU Parallel for orchestrating running remote scripts/tools on a bunch of machines in my compute and storage cluster. It's now my go-to tool for almost any remote ssh task that needs to hit a bunch of machines at once.

An excellent tool... apparently an improvement on xargs even for local parallel tasks (see http://unix.stackexchange.com/questions/104778/gnu-parallel-... )
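For anyone who hasn't tried it, the remote side looks roughly like this (host names are made up; assumes passwordless ssh):

    # run the same command on every machine in the list
    parallel --nonall -S host1,host2,host3 'uptime; df -h /data'
    # fan file-based work out over local cores (:) plus the remote boxes
    find . -name '*.log.gz' |
        parallel -S :,host1,host2 --transfer --cleanup 'zcat {} | grep -c ERROR'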


Just these pointed remarks:

- There are things like pv(1) which allow you to monitor pipes. Things like systemd open other interesting possibilities for implementing, grouping and monitoring your processes.

- Recovery could be implemented by keeping a logfile of completed steps, like a list of completely processed files, or by moving processed files elsewhere in the file system (could be done in memory only using ramfs or tmpfs); a sketch follows below. Of course, it depends on the case whether it's feasible or not.

- Extensibility: Scripts and configurations can be done in shell syntax. Hook systems and frameworks of varying complexity exist. I agree that doing extensibility in shell code is going to turn out to be hazardous when done without a proper concept and understanding of the tool at hand.
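The completed-steps logfile from the second point, as a sketch (process_one stands in for the real step; with GNU grep an empty done.log matches nothing, so everything passes on the first run):

    touch done.log
    find data -name '*.csv' |
    grep -vxF -f done.log |                        # drop inputs already recorded as finished
    while IFS= read -r f; do
        process_one "$f" && printf '%s\n' "$f" >> done.log
    done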


I fully agree with all the operational/restart/features comments. However, I've often been surprised at how a little thought/research can build all these requirements on top of off-the-shelf components. I also agree that it is likely that one will eventually outgrow wget, but, for example, one may run out of business/pivot before that.


We don't really "come up" with the xargs/wget approach. The approach is already there, waiting to be utilized by someone who understands the tools. The "cool kids" don't like (or are not able) to understand the tools.

The author (I think) is trying to point out that these problems are already solved, decades ago, with existing UNIX tools.


I've implemented this. It isn't too bad up to a certain point. You have to be a bit careful about your filesystem/layout of files; lots of filesystems don't particularly like it when you have a few hundred million files in one directory.
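The usual workaround is to shard into prefix buckets; a sketch with made-up directory names:

    # spread a flat dump of files across 256 hash-prefixed subdirectories
    find pages -maxdepth 1 -type f -print0 |
    while IFS= read -r -d '' f; do
        h=$(printf '%s' "${f##*/}" | md5sum | cut -c1-2)
        mkdir -p "sharded/$h"
        mv -- "$f" "sharded/$h/"
    done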


What kind of operational supportability do you need for a script that took a few hours to write and takes <5 minutes to run?


I agree, it's much better to re-invent the wheel.


Alternative, real life scenario: navigate through 6 months of daily MySQL dumps, assorted YAML files and Rails production.log, looking for some cross product between tables, requests and serialised entities, for analysis and/or data recovery (pinpoint or retrieval).

zcat/cut/sed/grep/awk/perl crawled through it in a couple of minutes and required less than half an hour to craft a reliable enough implementation (including looking up relations from foreign keys straight from the SQL dumps).
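The flavour of thing I mean, with a hypothetical schema and dumps made with --skip-extended-insert so each row is its own INSERT:

    # find every row touching one order id across six months of gzipped dumps
    for dump in dumps/*.sql.gz; do
        zcat "$dump" |
        grep -F 'INSERT INTO `orders`' |
        grep -F "'ORD-12345'" |
        sed "s|^|$(basename "$dump"): |"
    done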

My colleagues, who still don't get the point of a command line, would still be restoring each dump individually and running SQL requests manually to this day (or more probably declare it "too complex" and mark it as a loss). Side note: I'm torn between leaving this place where nobody seems to understand the point of anything remotely like engineering or keeping this job where I'm obviously being extremely useful to our customers.


> Side note: I'm torn between leaving this place where nobody seems to understand the point of anything remotely like engineering or keeping this job where I'm obviously being extremely useful to our customers.

You should always aim at working with people who are smarter or better than you. Unless they have stack ranking.


To be fair, you could have achieved all that with a simple Python script. Sometimes I feel Python is the new bash.


Note that you lose some of the parallelism you get effortlessly from the Unix pipeline if you do it as a single (simple) Python script.


I use the multiprocessing module in Python all the time for quick parallelism <https://medium.com/@thechriskiehl/parallelism-in-one-line-40... Lets you easily map to multiple cores. I use it a lot for image processing tasks. Quickly crawl through directories to find all the files, then spin up all the cores on my machine to crunch through them. Wish there was an easy way to enlist multiple machines.
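GNU Parallel's remote execution gets surprisingly close to "enlist multiple machines" for this kind of job; a sketch with made-up host names, assuming passwordless ssh and the tools installed everywhere:

    # spread image conversion over local cores (:) plus two remote boxes;
    # --trc transfers the input, returns the named output, and cleans up
    find photos -name '*.tif' |
        parallel -S :,box1,box2 --trc {.}.png 'convert {} {.}.png'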


Yeah. This is a great article. While I was reading it, I was thinking about what I would have done, and my answer was Python. Bash is just too easy to do wrong (see the recent Steam rm -rf bug), and I don't code in it often enough to know the pitfalls by heart.

I'd be interested to see another article about doing this job in Python and how its performance compares to this simple one-liner.


Python is a great tool for this, and is even better when used with the pythonpy tool, which allows for convenient integration of python commands inside unix pipelines - https://github.com/Russell91/pythonpy


Are these "web pages" all on the same website?

If so, using wget is a poor solution. I have not used wget in over a decade but as I recall it does not do HTTP pipelining; I could be wrong on that - please correct me.

I do recall with certainty that when wget was first written and disseminated in the 1990s, "webmasters" wanted to ban it. httpds were not as resilient then as they are today, nor were bandwidth and hardware as inexpensive.

HTTP pipelining is a smarter alternative than burdening the remote host with thousands of consecutive or simultaneous connections.

Depending on the remote host's httpd settings, HTTP pipelining usually lets you make 100 or maybe more requests using a single connection. It can be accomplished with only a simple TCP client like the original nc and the shell.
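Roughly like this (host and paths are made up; whether the responses actually come back pipelined is up to the server):

    # several requests written down one TCP connection
    {
        printf 'GET /games/1.pgn HTTP/1.1\r\nHost: example.com\r\n\r\n'
        printf 'GET /games/2.pgn HTTP/1.1\r\nHost: example.com\r\n\r\n'
        printf 'GET /games/3.pgn HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n'
    } | nc example.com 80 > responses.http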

In any event, the line about a "distributed crawler" is spot on. Never underestimate the power of marketing to suspend common sense.

Also, I find that I can often speed my scripts up a little by using exec in shell pipelines, e.g., util1 |exec util2 or exec util1 |exec util2.

There are other, better approaches besides using the builtin exec, but I will leave those for another day.


You could just use Scrapy [1]. Easy to set up, and plenty of options you can activate if needed. Likely more robust than shell scripts as well. No Hadoop involved.

1: http://doc.scrapy.org/en/0.24/intro/tutorial.html


archive.org has Ted's blog post, "Taco Bell Programming": https://web.archive.org/web/20101025124303/http://teddziuba....


Having written distributed crawlers, saturating the network connection is quite easy to do and is the main reason for even distributing that type of work in the first place.


You know, or the real-world, reasonable, mature engineering answer: a Java/C#/C++ scalable parallel tool using modern libraries, and MPI if it ever needs to scale.


You usually don't need that though, that's the point. If you're building an entire service around page crawling, sure. If you're doing a one-off task, don't bother.


That would be my first gut reaction too, but if it is as simple as downloading webpages, this is actually a really great solution. I suspect he used that when he built Milo, a now defunct startup sold to eBay where they had to update prices and inventory data regularly. A startup should make different choices than Google.


Yeah if you're Google. Most people are not, and wget is plenty. After all it's written in C.


Or use curl, for a slightly better engineered wget.


I'm pretty proficient in both, and I think that's a mischaracterization of the two tools. wget is more suited to pulling down large files, groups of files, etc. curl is more suited to API calls where you might need to do something complicated at the protocol level. Each has their use.


I feel ag (the silver searcher, a grep-ish alternative) should be mentioned (even though he dropped it in his final awk/mawk commands), as it tends to be much faster than grep, and considering he cites performance throughout.


GitHub link for those who don't know about it: https://github.com/ggreer/the_silver_searcher/

I built ag for searching code. It can be (ab)used for other stuff, but the defaults are optimized for a developer searching a codebase. Also, when writing ag, I don't go out of my way to make sure behavior is correct on all platforms in all corner cases. Grep, on the other hand, has been around for decades. It probably handles cases I've never even thought of.


A similar story: http://blogs.law.harvard.edu/philg/2009/05/18/ruby-on-rails-...: Tools used not quite the right way.

edit: with HN commentary: https://news.ycombinator.com/item?id=615587


On a couple of GB this is true. Actually, if you have SSDs I'd expect any non-compute-bound task to be faster on a single machine up to ~10GB, after which the disk parallelism should kick in and Hadoop should start to win.


Depends on the dick, depends on the storage.

HDFS is a pseudo block interface. If you have a real filesystem like Lustre or GPFS, not only do you have the ability to use other tools, you can use that storage for other things.

In the case of GPFS, you have configurable redundancy. Sadly with Lustre, you need decent hardware, otherwise you're going to lose data.

In all these things, paying bottom dollar for hardware and forgoing support is a false economy. At scales of 1PB+ (which is about 1/2 a rack now) it's much, much cheaper to use off-the-shelf parts with 24/7 support than "softwaring" your way out.


> Depends on the dick

not really, sorry I had to

Back to the topic: HDFS is really somewhat of a waste of disk space, especially when used for something like munching logs.

> At scales of 1PB+ (which is about 1/2 a rack now) it's much, much cheaper to use off-the-shelf parts with 24/7 support than "softwaring" your way out.

Depends. If you need monthly reports from logs, then as long as you don't lose storage completely, using even second-hand hardware or hardware decommissioned from prod is the cheapest choice.


Ahem

Disk....


If you want disk parallelism, RAID 0 is probably easier than Hadoop.
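For a single box, that's a couple of commands (hypothetical device names; this wipes whatever is on them):

    # two-disk striped array, filesystem on top, mounted for the data set
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs /dev/md0
    mount /dev/md0 /data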


That would depend on the data set and the stripe size. Striping is good for streaming. A linear/concat of 2+ drives with XFS would be faster with a lot of files that end up in separate AGs on separate drives, which can be accessed in parallel.


So don't use Hadoop to crunch data that fits on a memory stick, or that a single disk spindle can read in a few seconds.

Why is this first on the HN front-page?

Reminds me of the C++ is better than Java, Go is better than C++, etc, pieces.

Yes, the right tool for the right job. That's what makes a good engineer.

Somebody who thinks there is _no_ valid use case for Hadoop is a fool. (The author did not say that, but many of the comments here seem to imply that view)


> Why is this first on the HN front-page?

Because controversial topics are always fun! d:-)


Here is an analysis from a developer who looked at Hadoop- http://ossectools.blogspot.ca/2012/03/why-elsa-doesnt-use-ha...

(ELSA is a logger that claims to be able to handle 100000 entries/sec (!!))

When to Use Hadoop

This is a description of why Hadoop isn't always the right solution to Big Data problems, but that certainly doesn't mean that it's not a valuable project or that it isn't the best solution for a lot of challenges. It's important to use the right tool for the job, and thinking critically about what features each tool provides is paramount to a project's success. In general, you should use Hadoop when:

    Data access patterns will be very basic but analytics will be very complicated.
    Your data needs absolutely guaranteed availability for both reading and writing.
    The traditional database-oriented tools which currently exist are inadequate for your problem.
Do not use Hadoop if:

    You don't know exactly why you're using it.
    You want to maximize hardware efficiency.
    Your data fits on a single "beefy" server.
    You don't have full-time staff to dedicate to it.
The easiest alternative to using Hadoop for Big Data is to use multiple traditional databases and architect your read and write patterns such that the data in one database does not rely on the data in another. Once that is established, it is much easier than you'd think to write basic aggregation routines in languages you're already invested in and familiar with. This means you need to think very critically about your app architecture before you throw more hardware at it.


Shell commands are great for data processing pipelines because you get parallelism for free. For proof, try a simple example in your terminal.

    sleep 3 | echo "Hello world."
That doesn't really prove anything about data processing pipelines, since echo "Hello world." doesn't need to wait for any input from the other process; it can run as soon as the process is forked.

    cat *.pgn | grep "Result" | sort | uniq -c
Does this have any advantage over the more straightforward version below?

    grep -h "Result" *.pgn | sort | uniq -c
Either the cat process or the grep process is going to be waiting for disk I/Os to complete before any of the later processes have data to work on, so splitting it into two processes doesn't seem to buy you any additional concurrency. You would, however, be spending extra time in the kernel to execute the read() and write() system calls to do the interprocess communication on the pipe between cat and grep.

Also, the parallelism of a data processing pipeline is going to be constrained by the speed of the slowest process in it: all the processes after it are going to be idle while waiting for the slow process to produce output, and all the processes before it are going to be idle once the slow process has filled its pipe's input buffers. So if one of the processes in the pipeline takes 100 times as long as the other three, Amdahl's Law[1] suggests that you won't get a big win from breaking it up into multiple processes.

[1] https://en.wikipedia.org/wiki/Amdahl%27s_law
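If you want to see which stage is actually the bottleneck, pv can sit between the stages and report per-stage throughput (assuming pv is installed):

    # name each stage and watch where the bytes slow down
    grep -h "Result" *.pgn | pv -cN grep | sort | pv -cN sort | uniq -c > /dev/null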

Edit: As someone pointed out, my example needed "grep -h". Fixed.


"grep <pattern> <files>" is not the same as "cat <files> | grep <pattern>", in that the former will prefix lines with filenames if there is more than one input file. What you want instead is "grep -h <pattern> <files>".

The advantage of using cat, therefore, is the few seconds of laziness saved in not reading the manual.


The advantage to using "cat foo | grep pattern" is that it is trivial to ^p and edit the pattern before adding the next pipeline sequence.


fwiw

    $ <filename grep <pattern>
no shell I'm aware of restricts you to placing redirections at the end; you can throw them at the beginning, no problem.


You can make it even faster by using fgrep since you're not searching for a regex.


The article addresses this:

> The -F for grep indicates that we are only matching on fixed strings and not doing any fancy regex, and can offer a small speedup, which I did not notice in my testing.

I guess grep is probably clever enough to choose a faster matching algorithm once it's parsed the pattern and discovered it doesn't contain any regex fun.


In general grep seems smart enough that it would do that, but that hasn't been my experience. Just last week I was searching through a couple hundred gigs of small XML files. I found that:

    $ LC_ALL=C fgrep -r STRING .
was much faster than plain grep. This was on a CentOS 5 box, so maybe newer versions of grep are smarter.

But then again, if I was on a newer box I'd just install and use ack or ag.


LC_ALL=C makes grep faster because text matching is normally locale-sensitive, for example 'S' ('\x53' in Big5) is not a substring of '兄' ('\xA5\x53' in Big5).


Thank you for the comment. The fact that it can run independently as soon as it's forked was exactly the point I was making.

Also, the cat | grep pipeline is illustrative. I remove it at the end.


You could skip both cat and grep and do it all in awk. Also if speed was an issue you would want to make sure LANG=C is set for grep.
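For instance, the grep | sort | uniq -c stage collapses into one awk process (the output order isn't sorted, but the counts are the same):

    LC_ALL=C awk '/Result/ { count[$0]++ } END { for (line in count) print count[line], line }' *.pgn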

Edit: I see they did use awk later in the article; I should really read all of a thing before commenting.


About 5 years ago I worked at a company that took the "pile of shell scripts" approach to processing data. Our data was big enough and our algorithms computationally heavy enough that a single machine wasn't a good solution. So we had a bunch of little binaries that were glued together with sed, awk, perl, and pbsnodes.

It was horrible. It was tough to maintain-- we all know how hard to read even the best awk and perl are. It was difficult to optimize, and you always found yourself worrying about things like the maximum length of command lines, how to figure out what the "real" error was in a bash pipeline, and so on. When parts of the job failed, we had to manually figure out what parts of the job had failed, and re-run them. Then we had to copy the files over to the right place to create the full final output.

The company was a startup and the next VC milestone or pivot was always just around the corner. There was never any time to clean things up. A lot of the code had come out of early tech demos that management just asked us to "just scale up." But oops, you can't do that with a pile of shell scripts and custom C binaries. So the technical debt just kept piling up. I would advise anyone in this situation not to do this. Yeah, shell scripts are great for making rough guesses about things in a pile of data. They are great for ad hoc exploration on small data or on individual log files. But that's it. Do not check them into a source code repo and don't use them in production. The moment someone tries to check in a shell script longer than a page, you need to drop the hammer. Ask them to rewrite it in a language (and ideally, framework), that is maintainable in the long term.

Now I work on Hadoop, mostly on the storage side of things. Hadoop is many things-- a storage system, a set of computation frameworks that are robust against node failures, a Java API. But above all it's a framework for doing things in a standardized way so that you can understand what you've done 6 months from now. And you will be able to scale up by adding more nodes, when your data is 2x or 4x as big down the line. On average, the customers we work with are seeing their data grow by 2x every year.

I feel like people on Hacker News often don't have a clear picture of how people interact with Hadoop. Writing MapReduce jobs is very 2008. Nowadays, more than half of our users write SQL that gets processed by an execution engine such as Hive or Impala. Most users are not developers; they're analysts. If you have needs that go beyond SQL, you would use something like Spark, which has a great and very concise API based on functional programming. Reading about how clunky MR jobs are just feels to me like reading an article about how hard it is to make boot and root floppy disks for Linux. Nobody's done that in years.


I've had the pleasure and displeasure of working with small datasets (~7.5GB of images) in the shell. One often needs to send SIGINT to the shell when it starts to glob-expand or tab-complete a folder with millions of files. But besides minor issues like that, command line tools get the job done.


Until semi-recently, millions of files in a directory would not only choke up the shell, but the filesystem too. ext4 is a huge improvement over ext3 in that regard; with 10m files in an ext3 directory you ended up with long hangs on various operations. And even with ext4, make sure not to NFS-export the volume that directory is on!


I've encountered this (or a similar) issue in production.

We had a C++ system that wrote temporary files to /tmp when printing; /tmp was cleared on system startup, and it worked OK for years, but the files accumulated. At some point it started to randomly throw file access errors when trying to create these temporary files. Not for each file - only for some of them.

The disk wasn't full, and some files could be created in /tmp while others couldn't. It turned out, after a few days of tracking it down, that the filesystem can be overwhelmed by too many similarly named files in one directory - it can't create file XXXX99999 even if there is no such file in the directory, but it can create files like YYYYY99999 :)

I just love such bugs where your basic assumptions turn out to be wrong.


'xargs -n' elicits fond memories of spawning large jobs on my OpenMosix cluster! I miss OpenMosix.


Oh! I lost track of it.


There's a perception that Hadoop commands are terribly complex. If you run

    $ spark-shell

you can execute (interactively)

    val file = spark.textFile("hdfs://...")
    val errors = file.filter(line => line.contains("ERROR"))
    errors.count()

And count the matching lines in a file - OK, the wget is not there, but this is really not complex!


This kind of approach can probably scale out pretty far before actually needing to resort to true distributed processing. Compression, simple transforms, R, etc... You can probably get away with even more by just using a networked filesystem and inotify.


One common misconception about Hadoop is that you should use it if your data is large. Usage of Hadoop should be driven more by the growth of the data than by its size.

I agree that for the given use case, the solution is appropriate and works fine. The problem mentioned in the post is not a Big Data problem.

Hadoop will be helpful if there are millions of games played every day and we need to update the statistics daily, etc. In that case, the given solution will hit a bottleneck and some optimisation/code change will be needed to keep the code running.

Hadoop and its ecosystem are not a silver bullet and hence should not be used for everything. The problem has to be a Big Data problem.


It is that buzz surrounding Hadoop that makes people misunderstand its use and capability. I have met non-technical analysts who want RDBMS performance on Hadoop. They expect seconds-to-minutes-scale queries on hundreds of GB of data.

I always throw this analogy at people who misunderstand Hadoop: a stone to crack an egg, or a spoon?

Hadoop and RDBMS only have a thin overlapping region in the Venn diagram that describes their capabilities and use cases.

Ultimately, it is cost vs efficiency. Hadoop can solve all data problems. Likewise for RDBMS. This is an engineering tradeoff that people have to make.


I totally agree with you. Capabilities "LIKE" that will drive Hadoop adoption. Hadoop should not be seen as a replacement for an RDBMS. These are two different tools made for different purposes.


> They expect seconds to minutes scale queries on hundreds of GB of data.

Use BigQuery from Google.


On-premise cluster.

Cloud solutions are totally out due to the nature of the data. Not everything can be done in the cloud.

If you have such a huge amount of data, the total amount of time it takes to transfer it there and compute is not as competitive as an on-premise solution, unless all your data already lives in the cloud.


I would look into https://spark.apache.org/ then. You can get quite good performance out of it, but you need to spend more effort in babysitting your data.


Here's a probably unpopular opinion... Pipes make things a bit slow. A native pipeless program would be a good bit faster - including an ACID DB. Note that doing this in Python and expecting it to beat grep won't work...

The other thing is that Hadoop - and some others - are slow on big data (petabytes or more) vs your own tools. They're necessary/used because of massive clustering (10x the hardware deployed easily beats making your own, financially).

I suspect it's a general lack of understanding of the way computers work (hardware, OS, i.e. system architecture) vs "why care, it works, and Python/Go/Java/etc are easy for me, I don't need to know what happens under the hood".


> incl. an acid db.

Why would you want to use a database for this problem? The input data would take time to load into an ACID db and we're only interested in a single ternary value within that data. The output data is just a few lists of boolean values so it has no reason to be in a database either.

This is a textbook stream processing problem. Adding a database creates more complexity for literally no benefit assuming the requirements in the linked article were complete. I would be baffled to see a solution to this problem that was anything more than a stream processor, to say nothing of a database being involved.


If it really is just a one-shot with one simple-ish filter, I agree. But I often find myself incrementally building shell-pipeline tangles that are sped up massively by being replaced with SQLite. Once your processing pipeline is making liberal use of the sort/grep/cut/tee/uniq/tac/awk/join/paste suite of tools, things get slow. The tangle of Unix tools effectively does repeated full-table scans without the benefit of indexes, and is especially bad if you have to re-sort the data at different stages of the pipeline, e.g. on different columns, or need to split and then re-join columns in different stages of the pipeline. In that kind of scenario a database (at least SQLite, haven't tried a more "heavyweight" database) ends up being a win even for stream-processing tasks. You pay for a load/index step up front, but you more than get it back if the pipeline is nontrivial.
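A sketch of what that looks like in practice, assuming a CSV with a header row that includes a result column (names are made up):

    # load once, index once, then query as often as you like without re-sorting
    printf '.mode csv\n.import games.csv games\n' | sqlite3 games.db
    sqlite3 games.db 'CREATE INDEX IF NOT EXISTS idx_result ON games(result);'
    sqlite3 games.db 'SELECT result, COUNT(*) FROM games GROUP BY result;'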


The interesting part is that it's still faster, not that it's the best-case solution. The main reason is that the data set fits in memory and is no slower to load (you need to read the data in all cases, duh; both the piped version and the DB will read the data from disk exactly once, in a sequential fashion).

There is no locking issue, and you can be smart in the filtering steps (most dbs do some of that automagically anyway). You don't have that level of control with the pipes, you are limited by the program's ability to process stdin, and additional locking.

This is exactly where knowing how things really work under the hood gives you an advantage vs "but in theory...". You can reimplement a complete program, or even a set of programs, that will outperform the DB and the piped example. But will you? No, you want the best balance between the fastest solution and the least amount of work.


In the final solution at the end of the article there are only two pipes:

1. A pipe to feed the file names into xargs for starting up parallel `mawk` processes.

2. A pipe to a final `mawk` process which aggregates the data from the parallel processes.

There's still some performance that could be gained by using a single process with threads and shared memory, but this is pretty good for something that can be whipped together quickly.


Yeah, it's not bad. In the final command, it is basically leveraging mawk for everything, which works out well since there are fewer pipes.

But in this case it's about replacing Hadoop with mawk, basically. Which is indeed a good point as well - and incidentally also confirms my own comment =)


The gotcha here is: he is talking about 1.75GB of data. Of course you don't use Hadoop for this. Hadoop is for Big Data, not for a few gigs.

Use the right tool for the job. If you think you will scale to terabyte size, don't start out with command line tools.


This is blown out of proportion... actually the increase is probably a factor of 10-20x, not 100s. The fact that EMR is used is a problem; provisioning and bootstrapping the cluster alone probably accounts for half the time.

The fact that shell commands were run repeatedly means that the data ends up in the OS buffer cache and basically in memory.

I'm not discounting that CLI is faster than Hadoop by an order of magnitude on small datasets. Nor will I dive into Hadoop vs CLI. The answer to all that IMO is that it depends. And in this case, it's not well warranted.

What I do take exception to is the Fox News-style headlines that are disproportionate to the truth. EMR != Hadoop.


Maybe I come from a weird world, or even a weird generation. But when I was in high school, Linux fanboyism was at its peak and just like people get all wound up on bands and such, us geeks got wound up on open-source and linux and fck Micro$oft etc. etc. This was early-ish 2000's.

As a result. Every serious programmer I know, especially those who are about my age, lives their life in the CLI.

It always comes as a surprise when somebody suggests that there are professional developers out there who do not predominantly use a CLI.


This shouldn't be a surprise. Tons of development is done on Windows. Most game development, obviously Windows app development, .NET websites, etc.

There are command line tools there, but in my 10 years of being a Windows developer, GUI tools were more the norm.

There's a time and a place for both. Now developing predominantly under Linux, it amazes me how time consuming and clunky some tasks are on the command line compared to using a GUI (e.g. debugging, Visual Studio is just a fantastic IDE), but also how much faster and easier other tasks are with a CLI.


Windows is just inherently GUI-centric, if you force developers to use Windows they are naturally going to gravitate toward using GUI based tools because Windows command line is so crippled. Tools like Cygwin and Chocolatey are nice in a pinch but they just don't compare to being in a real UNIX environment.


Why use a loaded word like 'force'? Developing on Windows was excellent. Microsoft provides great tools and support; it was incredibly productive. The only real negative is the licensing requirements.

As of 2011 (I couldn't find more recent data publicly available) Windows was far and away the most popular development environment. http://www.cnet.com/news/coders-choosing-mac-os-over-linux-e...

It's interesting to see the culture of development differ so much from place to place. When I worked in Australia, Windows was an incredibly common development environment, while here in Silicon Valley it's all Mac/Linux.


> Why use a loaded word like 'force'?

Because a lot of us would always pick alternatives when there's a choice. In fact, some of us won't take jobs where we have to develop on Windows.


Yes, but my point was that there is no evidence that is the majority, nor are they really 'forced' to use it - if it's that much of an issue, just don't take the job (as you say).

Many, many developers have no issue developing on Windows or even enjoy it. There seems to be a mindset amongst certain people that Windows developers are not 'real' developers, which is what I'm arguing against.


I say "force" because a lot of employers simply do not allow developers to choose their workstation OS. Windows is very popular with the non-programmers responsible for making IT purchasing deals within most major corporations. Among programmers, it's not quite so popular.


It still depends on which programmers you ask. Windows works fine for me; Linux did not work well when I tried it (but it was many years ago). Unix through X Windows from a PC was actually quite nice. I think developers, like users, use feelings more than thoughts when choosing an environment. Yes, you can do some things on Linux that are harder on Windows, but saying that missing "grep" is severely limiting you is strange, since it is easy to install. And you have find in most editors. You can run Vim or Emacs if you want to on Windows. PowerShell is very good but with a strange syntax. If that is someone's problem, they sound like they don't want to learn something new. Listing thousands of files in a folder is probably still faster on Linux, but I rarely do that (I can't see the point in collecting many years of logs in one folder, for example). The main problem I have with Windows is that I have to restart the computer every now and then.


Yes, the way Windows locks shared DLLs and requires a restart of the whole OS instead of just the programs currently using the DLL when you install an update is, in a word, obnoxious.

Also, again I dislike the fact that Windows is inherently GUI-centric. Interacting with programs is designed to be done using primarily the mouse or a touch screen. Even Windows Server is designed to be used via remote desktop. Whereas for UNIX-compliant OSes the desktop and mouse are not required for anything, a skilled user can easily be more productive using only a keyboard, and anything a user can do can just as easily be placed in a script and automated.

Package management is also a major deficiency of Windows. Chocolatey is a nice workaround, but is a relatively young project. Windows did not have any package management for a long time.

Also for a lot of programming language packages with native code extensions, it can be difficult or even impossible to get them to compile on Windows. I can think of a handful of Ruby packages that flat-out do not compile, making Rails dev on Windows a non-starter.


> Developing on Windows was excellent. Microsoft provides great tools and support; it was incredibly productive.

I don't know about that. I've made several good faith efforts to really see what people like about the Windows development ecosystem, and I consistently come away dismayed. "Great tools and support" could never be used to describe msbuild, for example. Or any of the MSDN documentation with incredible antipattern code that people like to blithely copy and paste into their programs. Visual Studio is slow, brittle, and makes it difficult to do version control right.

I could list examples for days but that's not the point. Nowhere is perfect, but there's no way anyone could consider Microsoft a clear leader here.


> The only real negative is the licensing requirements

Fix: Some of the real negatives are:

* the licensing requirements,

* lack of virtual desktops,

* having to manually update everything except the office suite and the OS itself

* having to use a different platform than what is used in production

* many standard tools work slower / are less tested / etc

(PS: I'm working happily from a Windows workstation now.)


Windows does have virtual desktops... just not the way X allows. The main thing that bugs me about Windows's desktops is that it is given a logical display surface composed of all the display units. Whereas with X (going by my use of xmonad) it's easy to separate the display and desktop abstractions and have desktop 2 on display 1 and desktop 5 on display 2.

There's a nice overview of the limitations and intentions of the Windows model here: http://msdn.microsoft.com/en-us/library/windows/desktop/ms68...

When on windows, I've been using this (it crashes sometimes, but only itself, never takes other programs with it): https://github.com/Tzbob/python-windows-tiler


> lack of virtual desktops

They're not built in, but a few of my colleagues use Dexpot and like it: http://www.dexpot.de/index.php?lang=en


Also note, if I'm remembering correctly, there are native virtual desktops slated to come with Windows 10. Kinda late to the game, but it's nice that they're finally getting them.


Windows Powershell is actually quite nice and powerful, while also avoiding some of the shell legacy traps around escaping. It suffers from not being very discoverable and not having a community.


I wish they had just been less stubborn and made something that would run bash and standard Unix commands. I've used it for production jobs and it's worked as advertised, but I would rather have just had bash.


Because there are a lot of shitty things about bash, too: tasks that anybody with half a brain would think should be blindingly simple from the command line. For example, add a virtual host to Apache with an Allow for localhost on /var/www/localtest. And do it in a portable way, i.e. no 'put every vhost in a separate file and Include those'. Or fetch a configuration parameter from another machine, where that other machine might run any of 5 different distros. The list goes on and on - look, I'm no PowerShell fan, or even a 'real' user, but we have to admit that the old Unix approach is reaching EOL (well, I should say 'it should reach EOL'; unfortunately it seems like it's not going to go away soon, with there not even being an alternative).


They weren't stubborn, they set out to "give windows a Unix command line" but then they discovered how unsuitable a Unix command line is, and had to rethink.

See the original lead architects comments here: http://stackoverflow.com/a/573861


A lot of bash is using awk, sed, cut, tr and others to munge the output of one command into the input of another. The killer feature of Powershell is that it does away with all of that.


.. having advertised Powershell upthread, I found it fell on its face at the last hurdle. I wanted to grep a file, and discovered that the default output formatters will either truncate or wrap lines, even when directed into a file. This is wrong and destroys data. I ended up with

Select-String " 23:56.00" -Path .\input.txt | ForEach-Object { Write-Output $_.Line } | Out-File outpu.txt


Yeah, so installing Cygwin and/or gnuwin is just a lot faster than learning how to invoke "CLR objects" from the command line...


It also has a surprising number of WTFs, which was a great disappointment to me because it really seemed like MS had done scripting better. Of course, later versions fix some of the pitfalls if you use them right, but then you need to make sure you're always running machines that will have the latest Powershell on them.

(And, of course, it would have been nice if they could improve its interactive use to even be on par with the Unix shells of the 80s)


People are always surprised when I mention that the Microsoft devs I worked with had free access to the highest tiers of Visual Studio, yet what they actually worked in was vim and the internal fork of make. I don't know whether that's still true; it's been a decade now.


When I was an MS dev (8 to 18 years ago), Visual Studio had nothing to do with "real work" on Windows systems programming.

Quite a few people used a rather obscure editor called Source Insight (http://www.sourceinsight.com/) because of its code-navigation abilities, which were similar to an IDE's but worked on huge codebases that would take hours to actually parse and analyze "properly". Sort of a supercharged ctags.


It's still true (source: Microsoft engineer for 8 years, used vim and internal version of Unix tools).

A lot of this has to do with the sheer size of many of Microsoft products' codebases. Visual Studio just can't handle projects with millions of lines of code, whereas vim + ctags elegantly handles the fragments of projects you build locally.


When you say "Microsoft Devs", do you mean MS employees?

If so, I heard a similar tale back in the first days of .NET, where most of the guys in one particular group were using Emacs.


15 years ago when I first started programming, I learned on the *nix CLI. I worked that way for years; that's just how it was done. Well, I started using an IDE on Windows and now... I really like it. I don't want to go back to the command line. That being said, I still use the Windows command line from time to time.


Yeah, me too. I do quite a bit of 'bit bigger than average' data processing, and I write the tools in cmd.exe or as command line C++ applications first - and then, when they work and I need to run them more often than a few times, I make a small GUI in front of them so that I don't have to spend half an hour every time setting the parameters up right.


Can confirm, was there in high school at the same time and had the same experience :-)

I still use programming languages like AWK to this day (I even list it as a known language on my resume; someone commented on it once, and it's always a +1 for the company if they do). Recently though I've been exposed to some stuff in the Windows world that makes me think that, sure, they evolved slower, but they evolved in a really interesting direction and possibly pulled themselves out of the local maximum that is the Unix world. (I'm referring to PowerShell.)

It's a little weird to me when devs don't immediately go to the command line, given that's just how I learned, but you've gotta recognize that it's not a flaw that they don't. Everyone just learned a little differently.


Every serious programmer I know, especially those who are about my age, lives their life in the CLI.

This is not even close to true. Many of the developers that I know, quite a few of them very serious, live in the Windows world and are very happy with, for example, VS or Eclipse. This is very likely to be the case in most BFEs.


> Every serious programmer I know

So, you know all the same programmers the parent poster knows? The parent explicitly stated that this was anecdotal, and from (not your) personal experience.


Indeed, there is a part of me that hurts when people mention teaching something that is a signifier of legitimacy. As if you could fake being a real programmer by listening to a lecture in CS 102!


If the entire dataset fits into memory on your laptop then it's not big data, and the only reason for using map reduce etc is to build experience with it or proof of concept for a larger dataset.


A similar article that articulated this well.

https://www.chrisstucchio.com/blog/2013/hadoop_hatred.html


Reminds me of "filemap" - a commandline-like map/reduce tool: https://github.com/mfisk/filemap


Related: sometimes I wish most of my data was in JSON streams so I could simply map and reduce the data using jq (http://stedolan.github.io/jq/), pipes and possibly filemap.


There is also an interesting and fun talk to watch by John Graham-Cumming from CloudFlare: http://www.youtube.com/watch?v=woCg2zaIVzQ using Go instead of xargs. Kind of fits into "using the right tool for the job". There is no Big Data involved, but it shows a sweet spot where it might make sense (make it easier) to not use a shell script (i.e. retries, network failure).


Oh yeah, and it turns out when all is said and done the average data set for most Hadoop jobs is no more than 20GB, which again fits comfortably on a modern desktop machine.


Hadoop is replacing many data warehousing DBs like Netezza, Teradata, and Exadata. In the process, many data warehousing developers have become Hadoop developers who write SQL code; after all, Hadoop got a SQL interface via Hive.

Informatica (another ETL tool) also provides another tool called PowerExchange, which automatically generates MR code for Hadoop.

Whenever you hear Hadoop, first ask yourself whether it is just data warehousing in disguise.


Yes, this is very much happening -- mostly based on the insane pricing difference of supporting Hadoop clusters vs ntz or td infrastructure. Just following a simple 3-year lifecycle of HW depreciation essentially boosts your performance for next to nothing. The same cannot be said of the big DWH vendors.


What is missed in the article and many of these comments is that Hadoop isn't always going to be the best tool for one job. It shines in its multitenancy: when many users are running many jobs, each developed in their favorite framework or language (bash/awk pipeline? No problem), running over datasets bigger than single machines can handle.

It also comes in handy when your dataset grows dramatically in size.


Data analysis using a shell can be amazingly productive. We also had a talk about this at TLUG (http://tlug.jp/wiki/Meetings:2014:05#Living_on_the_command_l...).


Any decent tutorials out there to get me up to speed on CL tools? I use grep and a few others regularly, but have avoided sed and awk as they seem difficult to jump into.


A couple of days ago this clear, brief introduction to AWK was submitted: http://ferd.ca/awk-in-20-minutes.html


Is there a cached version of the original article that's referenced in this anywhere? Site appears to be down.


On a tangential note, sometimes I use a slower method for UI reasons. For example avoiding blocking the UI, or allowing for canceling the computation, or displaying partial results during the computation (that last one might completely trash the cache).


Great article! PS: probably some hardcore Unix guy would tell you that you are abusing cat. The first cat can be avoided, and you might gain even better performance. Also, using GNU grep seems to be faster.


Thank you for the compliment. Point taken on cat, but that's the way I like to introduce the process to people. I took cat and grep out at the end of the article anyway.


1.75GB is not big data. It's not even small data.


> It's not even small data.

We get it. Your data's pretty big.


1.75GB is three CD-ROMs. It's not big data.

1.75GB of RAM is less than the virtual address space of 32-bit Windows XP; it's not big data.

Protip: if it fits on the computer on your desk, it's not big data.


Until ~10 GB you'd better keep going with single machines; you'll see some improvements with bigger sets > 100 GB.


tl;dr: You do not have a big data problem.


I am glad to know the 4 million requests per second I am processing isn't big data...


I would have said Right tool for the job.


Flamebait


Everyone with basic knowledge of CS could realize that Hadoop is a waste.

Unfortunately, it isn't about efficiency at all. It's just memeization. Big Data? Hadoop! Runs everywhere. The same BS as the "Webscale? MongoDB!" meme.


Well sorry but you don't have a clue what you're talking about.

I very much work in "big data", with about 2 terabytes of new data coming in every day that has to be ingested and processed, with hundreds of jobs running against it. The data needs to be queryable via an SQL-like language and analyzed by a dozen data scientists using R or MapReduce.

There isn't anything on the market today that has been proven to work in environments like this and has the tooling to back it up. Unless you want to prove everyone, e.g. Netflix, LinkedIn, Spotify, Apple, Microsoft, wrong?


>There isn't anything on the market today that has been proven to work in environments like this and has the tooling to back it up.

Here you go:

http://kx.com/


> Well, sorry, but you don't have a clue what you're talking about.

From the Guidelines:

> Be civil. Don't say things you wouldn't say in a face-to-face conversation.

> When disagreeing, please reply to the argument instead of calling names.


I don't see how he broke the guidelines.


A few large organizations and microscopy scientists have big data. 99.9% of users do not.


The point here is that the same functionality could be implemented far more efficiently, even with shell scripts.

The idea of using standard UNIX tools for the showcase is a good one. Basically, it tells you that a modern FS is very good at storing chunks of read-only data (you don't need Java for that), with efficient caching and in-kernel procedures, and that using pthreads for jobs is a waste because context switching has its costs, etc.

To put it simply: by merely rewriting the basic functionality in, say, Erlang, one could get an implementation that is orders of magnitude more efficient.

The only selling point of Hadoop is that it exists (mature, stable, blah-blah). It also has one problem: Java. But as long as hardware is cheap and credit is easy, who cares?


What if you are processing 100 petabytes? And you are comparing to a 1000-node Hadoop cluster with each node running 64 cores and 1 TB of main memory?


Right tool for the right job. 100 petabytes is roughly 50,000,000 times larger than the data in the post. It's the difference between touching something within reach and flying around the world.[1]

1. Earth is 40 megameters in circumference. 40 Mm / 50M = 0.8 m


Then you're hardly using commodity hardware anymore. While jobs like that probably actually work on Hadoop, I'd imagine a problem like that might be better suited for specialized systems.


IME, in most installations where Hadoop is "successfully" used, it's running on pretty high-end machines. "Commodity hardware" really means standard hardware, not cheap hardware (as opposed to buying proprietary appliances and mainframes).


Or it could be a company with 30M records a month that buys 100 x $200 servers off eBay and is still unable to query their data.


I'm not sure what your point is with a hypothetical situation. Why wouldn't they be able to query their data? All I'm saying is, from my actual experience with real users, it's best to build a Hadoop cluster with high-quality hardware if you can.


It's a sensational headline... the reality is that someone applied the wrong tool and got bad results.


Hadoop is highly inefficient with the default MapReduce configuration. And a single MacBook Pro is much stronger than 7 c1.medium instances.

Bottom line: run the same thing over Apache Tez with a cluster that has the same computational resources as your laptop, and I'm pretty sure you'll see the same results.
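For reference, if the jobs are expressed in Hive, switching the execution engine away from classic MapReduce is a one-line setting. A sketch, assuming a Hive 0.13+ install with Tez available; the table name is made up:

    # Run a query on Tez instead of MapReduce (hypothetical 'games' table).
    hive --hiveconf hive.execution.engine=tez \
         -e 'SELECT COUNT(*) FROM games;'

Whether that actually closes the gap with a laptop on a 1.75 GB input is another question, but it does remove a lot of the per-stage job-launch and intermediate-write overhead.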


Awk and sed aren't very accessible to most people who did not grow up learning those tools.

The whole point of the tools built on top of Hadoop (Hive/Pig/HBase) is to make large-scale data processing more accessible (by hiding the map-reduce as much as possible). Not everyone will want to write a Java map-reduce job in Hadoop, but many can write a HiveQL statement or a textual Pig script. Amazon Redshift takes it even further: it's a Postgres-compatible database, meaning you can connect your Crystal Reports/Tableau analysis tool to it and treat it like a traditional SQL database.
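To make the accessibility point concrete, a hedged sketch of what the kind of white/black/draw tally from the article might look like as a query an analyst could run themselves from the Hive CLI. The table and column are hypothetical, assuming the PGN results were already parsed into a table:

    # Hypothetical 'chess_games' table with a 'result' column ('1-0', '0-1', '1/2-1/2').
    hive -e "SELECT result, COUNT(*) AS games FROM chess_games GROUP BY result;"

No Java, no explicit map or reduce phases; the trade-off is the cluster and ingestion overhead discussed elsewhere in this thread.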


I think the author's point was that the example in question was orders of magnitude smaller than "big data" and that it was more efficient to process it on a single machine, not that Hadoop and friends aren't easy to use.



