I'm becoming a stronger and stronger advocate of teaching command-line interfaces even to programmers at the novice level...it's easier in many ways to think of how data is being worked on by "filters" and "pipes"...and more importantly, every time you try a step, something happens...making it much easier to interactively iterate through a process.
That it also happens to be very fast and powerful (when memory isn't a limiting factor) is nice icing on the cake. I moved over to doing much more on the CLI after realizing that doing something as simple as "head -n 1 massive.csv" to inspect the headers of corrupt multi-GB CSV files made my data-munging life substantially more enjoyable than opening them up in Sublime Text.
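(For reference, that kind of quick look is just a couple of commands; the column and row counts below are illustrative additions, assuming a plain comma-delimited file:)

    head -n 1 massive.csv                          # peek at the header row
    head -n 1 massive.csv | awk -F, '{print NF}'   # count the columns
    wc -l massive.csv                              # count the rows (reads the whole file)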
A few years ago between projects, my coworkers cooked up some satirical "amazing Web 2.0 data science tools". They used git, did a screencast and distributed it internally.
It was basically a few compiled perl scripts and some obfuscated shell scripts with a layer of glitz. People actually used it and LOVED it... It was supposedly better than the real tools some groups were using.
It was one of the more epic work trolls I've ever seen!
Your CSV peeking epiphany was in essence a matter of code vs. tools though rather than necessarily CLI vs. GUI. On Windows you might just as well have discovered you could fire up Linqpad and enter File.ReadLines("massive.csv").First() for example.
In a real production environment that command line would be put into a script parametrized with named variables and the embedded awk scripts would be changed to here-docs.
Sounds good although at that point it's just programming, and there are tools that are cleaner and faster and more robust than piping semi-structured strings around from a command line.
The one real benefit that can be argued is ubiquity (on *ix). Not every system has Perl, Python, or Ruby installed - or Hadoop for that matter - but there's usually a programmable shell and some variant of the standard utilities that will get something done in a pinch. If it happens to be 200x faster than some enormous framework, so much the better.
The code you're replying to was carefully and correctly written. You just replied as if you knew how it works so you could look like you know what you're talking about.
If you're unlucky, someone who actually knows how File.ReadLines() works will show up in an hour or two and explain that it's lazily evaluated.
Perhaps I'm missing something. It appears that the author is recommending against using Hadoop (and related tools) for processing 3.5GB of data. Who in the world thought that would be a good idea to begin with?
The underlying problem here isn't unique to Hadoop. People who are minimally familiar with how technology works and who are very much into BuzzWords™ will always throw around the wrong tool for the job so they can sound intelligent with a certain segment of the population.
That said, I like seeing how people put together their own CLI-based processing pipelines.
I've used hadoop at petabyte scale (2+pb input; 10+pb sorted for the job) for machine learning tasks. If you have such a thing on your resume, you will be inundated with employers who have "big data", and at least half will be under 50g with a good chunk of those under 10g. You'll also see multiple (shitty) 16 machine clusters, any of which -- for any task -- could be destroyed by code running on a single decent server with ssds. Let alone hadoop jobs running in emr, which is glacially slow (slow disk, slow network, slow everything.)
Also, hadoop is so painfully slow to develop in it's practically a full employment act for software engineers. I imagine it's similar to early ejb coding.
> Also, hadoop is so painfully slow to develop in it's practically a full employment act for software engineers.
It's comical how bad Hadoop is compared even to the CM Lisp described in Daniel Hillis' PhD dissertation. How do you devolve all the way from that down to "It's like map/reduce. You get one map and one reduce!"
Programming is very faddish. It's amazing how bad commonly used technologies are. I'm so happy I'm mostly a native developer and don't have to use the shitty web stack and its shitty replacements.
What really puzzles me is that Doug Cutting worked at Xerox PARC and Mike Cafarella has two (!) CS Masters degrees, a PhD degree, and is a professor at the University of Michigan. It's not like they were unaware of the previous work in the field (Connection Machine languages, Paralations, NESL).
Exactly this just happened where I work. The CIO was recommending Hadoop on AWS for our image processing/analysis jobs. We process a single set of images at a time which come in around ~1.5GB. The output data size is about 1.2GB. Not a good candidate for Hadoop but, you know... "big data", right?
Indeed. For more real-world examples of people who thought that they had "big data", but didn't, see https://news.ycombinator.com/item?id=6398650 ("Don't use Hadoop - your data isn't that big"). The linked essay has:
> They handed me a flash drive with all 600MB of their data on it (not a sample, everything). For reasons I can't understand, they were unhappy when my solution involved pandas.read_csv rather than Hadoop.
User w_t_payne commented:
> I have worked for at least 3 different employers that claimed to be using "Big Data". Only one of them was really telling the truth.
> All of them wanted to feel like they were doing something special.
I think that last line is critical to understanding why a CIO might feel this way.
"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it." -Dan Ariely
In a similar situation. In fact stupider. We have a 120GB baseline of data inside a relational store. The vendor has a filestream option that allows blobs to be stored on disk instead of in the transaction log and be pushed through the DB server rather than using our current CIFS DFS cluster. So let's stick our 950GB static document load in there too (while I was on holiday, typically) and off we go.
So the thing starts going like shit, as expected, now that it has to stream all of that through the DB bottleneck.
So what's the solution? Well, we're apparently in big data territory now at 1.2TiB (a comedically small and almost entirely static data set) and have every vendor licking arse with the CEO and CTO to sell us Hadoop, more DB features and SAN kit.
We don't even need it for processing. Just a big CRUD system. Total Muppets.
Being in charge of large budgets is better for CEOs and CTOs than being in charge of small budgets. It makes you look more impressive so they go for it like lemmings over the cliff, sometimes taking entire companies with them.
No, it's called enterprise software, and it's driven by "appeal to the enormous egos and matching budgets of leaders of companies that are held back more by their own bloat than by anything resembling competition." This is just standard practice for the enterprise software sector in general, and after everyone's spent their billions and the next recession is upon us, they'll probably axe most of these utter failures to produce business value. But all the consultants will just blame someone other than the client being dysfunctional because that's just the fastest way to get kicked off a contract.
Consulting for enterprise customers tends to be a lot like marriages - you can be right, or you can be happy (or paid). It takes a unique customer to have gotten past their cultural dysfunctions to accept responsibilities for their problems and to take legitimate, serious action. But like marriage, there can be great, great upsides when everyone gets on the same page and works towards mutual goals with the spirit of selflessness and growth. Yeah....
I have had this exact conversation at various past employers, usually when they started talking about "big data" and hadoop/friends:
"How much data do you expect to have?"
"We don't know, but we want it to scale up to be able to cover the whole market."
"Okay, so let's make some massive overestimates about the size of the market and scope of the problem... and that works out to about 100Mb/sec. That's about the speed at which you can write data to two hard drives. This is small data even in the most absurdly extreme scaling that I can think of. Use postgres."
Even experienced people do not have meaningful intuitions about what things are big or small on modern hardware. Always work out the actual numbers. If you don't know what they are then work out an upper bound. Write all these numbers down and compare them to your measured growth rates. Make your plans based on data. Anything that you've read about in the news is rare or it wouldn't be news, so it is unlikely to be relevant to your problem space.
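A hedged sketch of that kind of back-of-the-envelope arithmetic (every number below is an assumption for illustration, not a measurement):

    awk 'BEGIN {
      data_gb   = 100    # assumed total data set size
      disk_mb_s = 150    # assumed sequential throughput of one spinning disk
      ram_gb_s  = 10     # assumed memory bandwidth actually achieved by one core
      printf "one pass from disk: %.0f seconds\n", data_gb * 1024 / disk_mb_s
      printf "one pass from RAM:  %.0f seconds\n", data_gb / ram_gb_s
    }'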
That's not even medium data. Most people probably would be surprised to find out that their data could be stored and processed on an iPhone, and that using heavier duty tools isn't necessary or worthwhile.
Perhaps a good way of explaining it to management is, "If it fits on a smartphone SD card, it's tiny data. If it fits on a laptop hard drive, it's maybe medium data." I think at that point many of these conversations would end.
I think I read this somewhere here a few months ago (paraphrasing, obviously): "When the indices for your DB don't fit into a single machines RAM, then you're dealing with Big Data, not before."
And following up: Your laptop does not count as a "single machine" for purposes of RAM size. If you can fit the index of your DB in memory on anything you can get through EC2, it's still not Big Data.
There's still a roughly 25x difference between the biggest EC2 instance and a maxed-out Dell server (244 GB for EC2 vs 6 TB for an R920). Not to mention non-PC hardware like SPARC, POWER and SGI UV systems that fit even more.
This is true, but at the upper end the "it isn't Big Data if it fits in a single system's memory" rule starts to get fuzzy. If you're using an SGI UV 2000 with 64 TB of memory to do your processing, I'm not going to argue with you about using the words "Big Data". ;-) I figured using an EC2 instance was a decent compromise.
Another explanation is that your CIO is not an idiot but rather they know about future projects that you don't. CIOs want to build capabilities (skills and technologies) not just one off implementations every time.
Not saying this is the case but CIO bashing is all too easy when you're an engineer.
A good CIO would know that leaving out key parts of the project is unlikely to produce good results. Even if the details aren't final, a simple “… and we probably need to scale this up considerably by next year” would be useful when weighing tradeoffs.
"Although Tom was doing the project for fun, often people use Hadoop and other so-called Big Data (tm) tools for real-world processing and analysis jobs that can be done faster with simpler tools and different techniques."
I think the point the author is making is that although they knew from the start that Hadoop wasn't necessary for the job, many people probably don't.
Lots of people think that is "big data". For most people, if it's too big for an Excel spreadsheet, it's "big data", and the way you process big data is with Hadoop. Of course, once you show them the billable-hours difference between setting up a Hadoop cluster and (in my case at least) using Python libraries on a MBP, they change their minds real fast. It's just a matter of "big data" being a new thing; people will figure it out as time goes on and things settle down.
People love the idea of being big and having "big problems". Wanting to use Hadoop isn't that different from wanting to use all sorts of fancy "web-scale" databases.
Most of us don't have scaling issues or big data, but that sort of excludes us from using all the fancy new tools that we want to play with. I'm still convinced that most of the stuff I work on at work could run on SQLite, if designed a bit more carefully.
The truth is that most of us will never do anything that couldn't be solved with 10-year-old technology. And honestly we should be happy; there's a certain comfort in being able to use simple and generally understood tools.
Some have questioned why I would spend the time advocating against the use of Hadoop for such small data processing tasks as that's clearly not when it should be used anyway. Sadly, Big Data (tm) frameworks are often recommended, required, or used more often than they should be. I know to many of us it seems crazy, but it's true. The worst I've seen was Hadoop used for a processing task of less than 1MB. Seriously.
Also, much agreement with those saying there should be more education effort when it comes to teaching command line tools. O'Reilly even has a book out on the topic: http://shop.oreilly.com/product/0636920032823.do
Author of Data Science at the Command Line here. Thanks for the nice blog post and for mentioning my book here. While we're talking about the subject of education, allow me to shamelessly promote a two-day workshop that I'll be giving next month in London: http://datascienceatthecommandline.com/#workshop
This is a great article and a fun read. A friend sent it over to me and I wrote him some notes about my thoughts. Having since realised it's on HN, I thought I'd post them here as well.
Some of my wording is a bit terse; sorry! :-) The article is great and I really enjoyed it. He's certainly got the right solution for the particular task at hand (which I think is his entire point) but he's generally right for the wrong reasons so I pick a few holes in that: I'm not trying to be mean. :-)
-----
Classic system sizing problem!
1.75GiB will fit in the page cache on any machine with >2GiB RAM.
One of the big problems is that people really don't know what to expect, so they don't realise that their performance is orders of magnitude lower than it "should" be.
Part of this is because the numbers involved are really large: 1,879,048,192 bytes (1.75GiB) is an incomprehensibly large number. 2,600,000,000 times per second (2.6GHz) is an incomprehensibly large number of things that can be done in the blink of an eye.
...But if you divide them using simple units analysis; things per second divided by things gives you seconds: 1.383. That's assuming that you can process 1 byte per clock cycle, which might be reasonable if the data is small and fits in cache. If we're going to be doing things in streaming mode then we'll be limited by memory bandwidth, not clock speed. Which means we should be able to squeeze our data through the processor in...
bytes per second divided by bytes = seconds =>
>>> 8004829184 / 1879048192.0
4.260044642857143
so less than 5 seconds.
We probably want to assume that there are other stalls and overheads, but a number between 20 and 60 seconds seems reasonable for that workload (he gets 12): the article says it's just a summary-plus-aggregate workload, so we don't really need to allocate much in the way of arithmetic power.
As with most things on x86, memory bandwidth is usually the bottleneck. If you're not getting numbers within an order of magnitude or so of memory bandwidth then either you have an arithmetic workload (and you know it) or you have a crap tool.
Due to the memory fetch patterns and latencies on x86, it's often possible to reorder your data access to get a nominal arithmetic workload close to the speed you'd expect from memory bandwidth.
His analysis about the parallelisation of the workload due to shell commands is incorrect. The speedup comes from accessing the stuff straight from the page cache.
His analysis about loading the data into memory on Hadoop is incorrect. The slowdown in Hadoop probably comes from the memory copying, allocation and GC involved in transforming the raw data from the page cache into objects in the language that Hadoop is written in and then throwing them away again. That's just a guess, because you want memory to fill up (to about 1.75GiB) so that you don't have to go to disk. That memory is held by the OS rather than the userland apps tho'.
His conclusion about how `sleep 3 | echo "Hello"` is done is incorrect. They're "done at the same time" because the shell starts every command in a pipeline concurrently and echo never reads from stdin, so it doesn't wait on sleep at all. A tool like uniq or sort has to ingest all the data before it can begin because that's the nature of the algorithm. A tool like cat will give you line-by-line flow because it can, but the pipeline is strictly serial in nature and (as with uniq or sort) might stall in certain places.
He claims that the processing is "non-IO-bound" but also encourages the user to clear the page cache. Clearing the page cache forces the workload to be IO-bound by definition. The page cache is there to "hide" the IO bottlenecks where possible. If you're doing a few different calculations using a few different pipelines then you want the page cache to remain full, as it means the data doesn't have to be reread from disk for each pipeline invocation.
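(For reference, the usual way the cache gets dropped for a "cold" benchmark run is, as root, something like the following; it only makes sense when you actually want to measure the IO-bound case:)

    sync; echo 3 > /proc/sys/vm/drop_caches   # flush dirty pages, then drop the page cache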
For example, when I ingest photos from my CF card, I use "cp" to get the data from the card to my local disk. The card is 8GiB. I have 16GiB of RAM. That cp usually reads ahead almost the whole card and then bottlenecks on the write part of getting it onto disk. That data then sits in RAM for as long as it can (until the memory is needed by something else), which is good because after the "cp" I invoke md5sum to calculate some checksums of all the files. This is CPU bound and runs way faster than it would if it were IO bound due to having to reread all that data from disk. (My arrangement is still suboptimal but this gives an example of how I can get advantages from the architecture without having to do early optimisations in my app: my ingest script is "fast enough" and I can just about afford to do the md5sum later because I can be almost certain it's going to use the exact same data that was read from the card rather than the copied data that is reread from disk and, theoretically, might read differently.)
He's firmly in the realm of "small data" by 4 or 5 base-10 orders of magnitude (at least), so he's nowhere close to getting a "scaling curve" that will tell him where best to optimise for the general case. When he starts getting to workloads 2 or 3 orders of magnitude bigger than what he's doing, he might find that there are a certain class of optimisations that present themselves, but that probably won't be the "big data" general case.
Having said that, this makes his approach entirely appropriate for the particular task at hand (which I think is his entire point).
Through his use of xargs he implies (but does not directly acknowledge) that he realises this is a so-called "embarrassingly parallel" problem.
-----
I think it is unsafe to parallelize grep with xargs as is done in the article because, beyond delivery-order shuffling, the output of the parallel greps could get mixed up (the beginning of a line comes from one grep and the end of the line from a different grep, so, reading line by line afterwards, you get garbled lines).
I believe this is meant to be a bit more educative about how to build a pipeline. Also, iteratively building such solutions quickly often leads to such "inefficiencies" but makes things easier to reason about. Besides, the awk step may have been factored out in the end, so it wouldn't make sense to optimise early. Also, by the time the author reaches the end, he's IO-bound, so there's not much need to optimise further (in the context of the exercise).
The author begins with a fairly idiomatic shell pipeline, but in the search for performance the pipeline transforms into an awk script. Not that I have anything against awk, but I feel like that kinda runs against the premise of the article. The article ends up demonstrating the power of awk over pipelines of small utilities.
Another interesting note is that there is a possibility that the script as-is could mis-parse the data. The grep should use '^\[Result' instead of 'Result'. I think this demonstrates nicely the fragility of these sorts of ad-hoc parsers that are common in shell pipelines.
It probably depends on what you are trying to accomplish... I think a lot of us would reach for a scripting language to run through this (relatively small amount of data)... node.js does piped streams of input/output really well. And Perl is the granddaddy of this type of input processing.
I wouldn't typically reach for a big data solution short of hundreds of gigs of data (which is borderline, but will only grow from there). I might even reach for something like ElasticSearch as an interim step, which will usually be enough.
If you can dedicate a VM in a cloud service to a single one-off task, that's probably a better option than creating a Hadoop cluster for most work loads.
Bottom line is - you do not need hadoop until you cross 2TB of data to be processed (uncompressed).
Modern servers ( bare metal ones, not what AWS sells you ) are REALLY FAST and can crunch massive amounts of data.
Just use proper tools: well-optimized code written in C/C++/Go/etc., not all the crappy Java framework-in-a-framework^N architecture that abstracts away thinking about CPU speed.
Bottom line, the popular saying is true:
"Hadoop is about writing crappy code and then running it on a massive scale."
Dell sells a server with 6TB of ram (I believe.) I think the limit is way over 2TB. If you want to be able to query it quickly for analytical workloads, MPPs like Vertica scale up to 150+TB (at Facebook.) I honestly don't know what the scale is where you need Hadoop, but it's gotten to be a large number very quickly.
My question is what do you mean by 2TB? At my current client, we have 5 TBs of data sitting (that's relatively recent). Before we had 2-ish. However, we had over 30 applications doing complex fraud calculations on that. "Moving data" (data being read and then worked) is about 40 TB daily. Even with SSD and 256 GB of RAM, a single machine would get overwhelmed on this.
If you're only working one app on less than 1 TB, maybe you don't need something as complex as Hadoop. But given that a cluster is easy to setup (I made a really simple NameNode + Two Data nodes in 45 minutes, going cold), it might not be a bad idea.
I'll take this further and say that some tools for Hadoop that are not from Apache are really nice to work with even for non-Hadoop work. For example, I've got to join several 1 GB files together to go from a relational, CSV model into a document-store model. Can I do this with command line tools? Maybe. Cascading makes this really easy. Each file family is a tap. I get tuple joins naturally. I wrote an ArangoDB tap to auto-load into ArangoDB. It was fun, testable and easy. All of this runs sans Hadoop on my little MBP.
Fun fact about the Cascading tool set is that I can take my little app from my desktop and plop it onto a Hadoop cluster with little change (taps from local to hadoop). Will I do that in my present example? No. Can I think of places where that's really useful? Yes, daily 35 fraud models' regression tests executed with each build. That's somewhere around 500 full model executions over limited, but meaningful data. All easily done courtesy of a framework that targets Hadoop.
It's about how Joyent took the concept of a UNIX pipeline as a true powertool and built a distributed version atop an object filesystem with some little map/reduce syntactic sugar to replace Hadoop jobs with pipelines.
The Bryan Cantrill talk is definitely worth your time, but
you can get an understanding of Manta with their 3m screencast: https://youtu.be/d2KQ2SQLQgg
I have developed a one-liner toolset for Hadoop (when I have to use it). It's refreshing to see a ZFS-based alternative take on the concept. Don't like the JavaScript choice though.
GNU Parallel should be a widely adopted choice. Lightweight. Fast. Low cost. Extendable.
I had an intern over the summer, working on a basic A/B Testing framework for our application (a very simple industrial handscanner tool used inside warehouses by a few thousand employees).
When we came to the last stage, analysis, he was keen to use MapReduce so we let him. In the end though, his analysis didn't work well, took ages to process when it did, and didn't provide the answers we needed. The code wasn't maintainable or reusable. shrug It happens. I had worse internships.
I put together some command line scripts to parse the files instead- grep, awk, sed, really basic stuff piped into each other and written to other files. They took 10 minutes or so to process, and provided reliable answers. The scripts were added as an appendix to the report I provided on the A/B test, and after formatting and explanations, took up a couple pages.
I used Hadoop a few times this semester for different classes and it seemed like the code was easy to write: because everything is either a Mapper or a Reducer, you just read enough of the docs to figure out what is intended to be done and then build on top of it. Can I ask how it wasn't maintainable?
We have a proprietary algorithm for assigning foods a "suitability score" based on a user's personal health conditions and body data.
It used to be a fairly slow algorithm, so we ran it in a hadoop cluster and it cached the scores for every user vs. every food in a massive table on a distributed database.
Another developer, who is quite clever, rewrote our algorithm in C, and compiled it as a database function, which was about 100x faster. He also did some algebra work and found a way to change our calculations, yielding a measly 4-5x improvement.
It was so, so, so much faster that in one swoop we eliminated our entire Hadoop cluster, and the massive scores table, and were actually able to sort your food search results by score, calculating scores on the fly.
This also isn't a straight either/or proposition. I build local command-line pipelines and do testing and/or processing. When the amount of data to be processed passes into the range where memory or network bandwidth makes the processing more efficient on a Hadoop cluster, I make some fairly minimal conversions and run the stream processing on the Hadoop cluster in streaming mode. It hasn't been uncommon for my jobs to be much faster than the same jobs run on the cluster with Hive or some other framework. Much of the speed difference boils down to the optimizer and the planner.
Overall I find it very efficient to use the same toolset locally and then scale it up to a cluster when and if I need to.
The vocabulary of the grandparent comment implies they are using hadoop's streaming mode, and thus one can use a map-reduce streaming abstraction such as MRJob or just plain stdin/stdout; both will work locally and in cluster mode.
Or, if static typing is more agreeable to your development process, running hadoop in "single machine cluster" mode is relatively painless. The same goes for other distributed processing frameworks like Spark.
This is a great exercise of how to take a Unix command line and iteratively optimize it with advanced use of awk.
In that spirit, one can optimize the xargs mawk invocation by 1) getting rid of string-manipulation function calls (which are slow in awk), 2) using regular expressions in the pattern expression (which allows awk to short-circuit the evaluation of lines), and 3) avoiding use of field variables like $1 and $2, which allows the mawk virtual machine to avoid implicit field splitting. A bonus is that you end up with an awk script which is more idiomatic:
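(The comment's exact script isn't reproduced here; a sketch of what such an invocation might look like, assuming the standard PGN tags [Result "1-0"], [Result "0-1"] and [Result "1/2-1/2"]:)

    mawk '
    /^\[Result "1-0"/       { white++ }   # pattern-only matching: no split(), substr() or $1
    /^\[Result "0-1"/       { black++ }
    /^\[Result "1\/2-1\/2"/ { draw++  }
    END { print white + 0, black + 0, draw + 0 }'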
Notice that I got rid of the printing out of the intermediate totals per file. Since we are only tabulating the final total, we can modify the 'reduce' mawk invocation to be as follows:
mawk '
{games += ($1+$2+$3); white += $1; black += $2; draw += $3}
END { print games, white, black, draw }'
Making the bottle-neck data stream thinner always helps with overall throughput.
"Here's a concrete example: suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it? The cool-kids answer is to write a distributed crawler in Clojure and run it on EC2, handing out jobs with a message queue like SQS or ZeroMQ.
The Taco Bell answer? xargs and wget. In the rare case that you saturate the network connection, add some split and rsync. A "distributed crawler" is really only like 10 lines of shell script."
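In that spirit, the quoted approach is roughly the following (urls.txt, the parallelism level and the wget flags are my own illustrative choices, not from the quote):

    # fetch a list of URLs, 16 at a time, mirroring each one into a host/path directory tree
    <urls.txt xargs -n 1 -P 16 wget -q -x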
Here's what the "sensible adults" think about when they see problems like this. Operational Supportability: How do you monitor the operation ? Restart Recovery: Do you have the ability to restart the operation mid way through if something fails ? Maintainability: Can we run the same application on our desktop as on our production servers ? Extensibility: Can we extend the platform easily to do X, Y, Z after the crawling ?
I can't stand developers who come up with the xargs/wget approach, hack something together and then walk away from it. I've seen it far too often and it's great for the short term. Dreadful for the long term.
The Unix people have thought of these things. You can easily do them with command line tools.
> Operational Supportability: How do you monitor the operation ?
Downloading files with wget will create files and directories as it proceeds. You can observe and count them to determine progress, or pass a shell script to xargs that writes whatever progress data you like to a file before/after calling wget.
> Restart Recovery: Do you have the ability to restart the operation mid way through if something fails ?
wget has command-line options to skip downloading files that already exist. Or you can use tail to skip as many lines of the input file as there are complete entries in the destination directory.
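A hedged illustration of both points, assuming the xargs/wget layout sketched above:

    # progress: compare the number of files fetched so far against the size of the input list
    find . -type f | wc -l
    wc -l < urls.txt

    # restart: --no-clobber makes wget skip anything that is already on disk
    <urls.txt xargs -n 1 -P 16 wget -q -x --no-clobber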
> Maintainability: Can we run the same application on our desktop as on our production servers ?
I'm not sure how this is supposed to be an argument against using the standard utilities that are on everybody's machine already.
> Extensibility: Can we extend the platform easily to do X, Y, Z after the crawling ?
Again, what? Extensibility is the wheelhouse of the thing you're complaining about.
Unix tools are composable. Functional languages (e.g. Clojure) are all about composability. While bash might be a reasonable glue language, I wonder why Clojure wouldn't be — and it could probably be as compact, if not terser.
The problem of the Hadoop approach is that the overhead of parallelization over multiple hosts is serious, and the task fits one machine neatly. A few GBs of data can and should be processed on one node; Hadoop is for terabytes.
I love Unix, but it's just a local minimum in the design space.
For example, its typical text-processing pipelines are hard to branch. I have hacked up some solutions, but never found them very elegant. I would love to hear some solutions to this. I ended up switching to Clojure and Prismatic's Graph.
The problem: you have a file, and you want to do one thing for lines matching REGEX and another thing for lines not matching REGEX.
How do you do it without iterating over the file twice? You can do a while loop of course, but that defeats the reason to use the shell.
I would love to have a two-way grep that writes matching lines to stdout and non-matching lines to stderr. I wonder if the grep maintainers would accept a new option like "--two-way".
Write to more than one fifo from awk. If you're composing a dag rather than a pipeline, fifos are one way to go.
Personally though, I'd output to temporary files. The extra cost in disk usage and lack of pipelining is made up for by the easier debugging, and most shell pipelines aren't so slow that they need that level of optimization.
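For what it's worth, a single-pass version of the "two-way grep" wished for above falls out of awk (REGEX, input.txt and the output names are placeholders):

    # matching lines go to stdout, everything else to stderr, in one pass over the file
    awk '/REGEX/ { print; next } { print > "/dev/stderr" }' input.txt \
        1> matched.txt 2> unmatched.txt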
wget isn't the only part of the puzzle you may need Restart Recovery for - the CPU-bound map/reduce portion may also need to recover from partial progress. Unix tools aren't well-designed for that.
> Downloading files with wget will create files and directories as it proceeds. You can observe and count them to determine progress, or pass a shell script to xargs that writes whatever progress data you like to a file before/after calling wget.
Which means using wget as your HTTP module and a scripting language as the glue for the logic you'll ultimately need to implement to create a robust crawler (robust to failures and edge cases).
> wget has command line options to skip downloading files that already exist. Or you can use tail to skip the number of lines in the input file as there exist complete entries in the destination directory.
Is wget able to check whether a previously failed page exists on disk [in some kind of index] before making any new HTTP requests? It sounds like this would try fetching every failed URL until it reaches the point where it left off before the restart. If it's not possible to maintain an index of unfetchable URLs and reasons for the failures then this would be one reason why wget wouldn't work in place of software designed for the task of crawling (as opposed to just fetching).
This is one of those tasks that seems like you could glue together wget and some scripts and call it a day but you would ultimately discover the reasons why nobody does this in practice. At least not for anything but one-off crawl jobs.
Thought of another possible issue:
If you're trying to saturate your connection with multiple wget instances, how do you make sure that you're not fetching more than one page from a single server at once (being a friendly crawler)? Or how would you honor robots.txt's Crawl-delay with multiple instances?
> Which means using wget as your HTTP module and a scripting language as the glue for the logic you'll ultimately need to implement to create a robust crawler (robust to failures and edge cases).
This is kind of the premise of this discussion. You don't use Hadoop to process 2GB of data, but you don't build Googlebot using bash and wget. There is a scale past which it makes sense to use the Big Data toolbox. The point is that most people never get there. Your crawler is never going to be Googlebot.
> Is wget able to check whether a previously failed page exists on disk [in some kind of index] before making any new HTTP requests? It sounds like this would try fetching every failed URL until it reaches the point where it left off before the restart. If it's not possible to maintain an index of unfetchable URLs and reasons for the failures then this would be one reason why wget wouldn't work in place of software designed for the task of crawling (as opposed to just fetching).
It really depends what you're trying to do here. If the reason you're restarting the crawler is because e.g. your internet connection flapped while it was running or some server was temporarily giving spurious HTTP errors then you want the failed URLs to be retried. If you're only restarting the crawler because you had to pause it momentarily and you want to carry on from where you left off then you can easily record what the last URL you tried was and strip all of the previous ones from the list before restarting.
But I think what you're really running into is that we ended up talking about wget and wget isn't really designed in the Unix tradition. The recursive mode in particular doesn't compose well. It should be at least two separate programs, one that fetches via HTTP and one that parses HTML. Then you can see the easy solution to that class of problems: When you fetch a URL you write the URL and the retrieval status to a file which you can parse later to do the things you're referring to.
> If you're trying to saturate your connection with multiple wget instances, how do you make sure that you're not fetching more than one page from a single server at once (being a friendly crawler)? Or how would you honor robots.txt's Crawl-delay with multiple instances?
Give each process a FIFO to read URLs from. Then you choose which FIFO to add a URL to based on the address so that all URLs with the same address are assigned to the same process.
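A rough sketch of that scheme (assuming bash >= 4.1 for the automatic file-descriptor allocation; the worker count, hash and delay are placeholders):

    #!/usr/bin/env bash
    # Route every URL to a worker chosen by its host, so each worker can serialise
    # requests to "its" hosts and apply a crawl delay on its own.
    N=4
    declare -a fds
    for ((i = 0; i < N; i++)); do
      mkfifo "queue$i"
      ( while IFS= read -r url; do
          wget -q -x "$url"   # one request at a time within this shard
          sleep 1             # crude per-host politeness delay
        done < "queue$i" ) &
      exec {fd}>"queue$i"     # hold a writer open so the worker doesn't see EOF early
      fds[i]=$fd
    done

    while IFS= read -r url; do
      host=${url#*://}; host=${host%%/*}
      shard=$(( $(printf '%s' "$host" | cksum | cut -d' ' -f1) % N ))
      printf '%s\n' "$url" >&"${fds[shard]}"
    done < urls.txt

    for ((i = 0; i < N; i++)); do eval "exec ${fds[i]}>&-"; done   # close the queues
    wait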
> Give each process a FIFO to read URLs from. Then you choose which FIFO to add a URL to based on the address so that all URLs with the same address are assigned to the same process.
I wrote this in a reply to myself a moment after you posted your comment so I'll just move it here:
Regarding the last two issues I mentioned, you could sort the list of URLs by domain and split the list when the new list's length is >= n URLs and domain on the current line is different from the domain on the previous line. As long as wget can at least honor robots.txt directives between consecutive requests to a domain, it should all work out fine.
It looks like an easily solvable problem however you go about it.
> It really depends what you're trying to do here.
I was thinking about HTTP requests that respond with 4xx and 5xx errors. It would need to be possible to either remove those from the frontier and store them in a separate list, or mark them with the error code so that it can be checked at some point before being passed on to wget.
Open file on disk. See that it's 404. Delete file. Re-run crawler.
You'd turn that into code by doing grep -R 404 . or whatever the actual unique error string is and deleting any file containing the error message. (You'd be careful not to run that recursive delete on any unexpected data.)
Really, these problems are pretty easy. It's easy to overthink it.
This isn't 1995 anymore. When you hit a 404 error, you no longer get Apache's default 404 page. You really can't count on there being any consistency between 404 pages on different sites.
If wget somehow stored the header response info to disk (e.g. "FILENAME.header-info") you could whip something up to do what you are suggesting though.
Yeah, wget stores response info to disk. Besides, even if it didn't, you could still visit a 404 page of the website and figure out a unique string of text to search for.
Here comes a bubble-bursting: I've lead a team that built data processing tools exactly like this, and the performance and ease of manipulating vast amounts of text using classic shell tools is hard to beat. We had no problems with any of: operational supportability, restart recovery, or maintainability. Highly testable, even. No, it's not just cowboy-coded crappy shell scripts and pipelines. Sure, there's a discipline to building pipelined tooling well, just as with any other kind of software. Your problems seem to stem from a lack of disciplined software engineering rather than the tools, or maybe just an environment that encouraged high technical debt.
The kicker? We were using pipeline-based tooling ... running on a Hadoop cluster. Honestly, I'm a bit surprised to see such an apparent mindshare split (judging by some recent HN posts) between performant single-system approaches and techniques used in-cluster. The point that "be sure your data is really, truly big data" is obviously well made, and still bears repetition. Yet the logical follow-on is that these techniques are even more applicable to cluster usage. Why would anyone throw away multiple orders of magnitude of performance going to a cluster-based approach?
Unix/POSIX backgrounds are pretty common among the Hacker News crowd. Not so in "Enterprise" development. (Beam me up Scottie, there's no intelligent life here, only risk avoidance)
Enterprise development is dominated by 2 or 3 trusted platforms: Windows (/.NET) and the JVM. POSIX systems are only useful insofar as they are a cheaper (or sometimes more reliable) place to host Java virtual machines. Enterprise dev groups generally have very limited exposure to, and a lot of fear of, things like the Bourne shell, AWK, Perl, Python. These languages don't have Visual Studio or Eclipse to hold your hand while you make far-reaching refactorings like renaming a variable.
Sure, you and I would crawl log/data files trivially with a few piped commands, but that's a rare skill in most shops, at least since the turn of the century.
Ugh, that sounds cliche, but it's hard not to feel that way after being drowned in "Java or nothing" for so long at work.
I agree with @roboprog. Most software shops employ engineers who don't have exposure to UNIX tools. Only a few hardcore engineers have exposure to, or interest in learning, UNIX tools. For the majority of engineers it is just a job. They simply use the same tool for everything, and they tend to use the tools that seem to get them into well-paying jobs. If Hadoop can get them good-paying jobs, they would like to use "hadoop" for something in their current job, even if that job can be performed by a set of CLI utils. I have seen hundreds of resume-builder projects in my past experience.
Does anybody without a unix/POSIX background even bother tinkering with PowerShell? Yes, it's a cool idea, as well, especially if your source data is in MS-Office, but I've not seen it put to much use.
The problem with shell scripting is that nearly nobody is very, very good at it. The Steam bug doing an rm -rf / is an example, but it's very common for shell scripts to have horrible error handling and checks for important things. The shell is just not suitable for extremely robust programs. I would bet that 80%+ of people who think they're good at shell scripting... aren't.
> The problem with shell scripting is that nearly nobody is very, very good at it. The Steam bug doing an rm -rf / is an example
The Steam bug is an example of utter incompetence, not of someone not being very, very good at it. Whoever is happy with shipping `rm -rf $VAR/` without extreme checking around it should get their computer driving license revoked.
> The shell is just not suitable for extremely robust programs.
Incorrect. "The shell" can go as robust as you can handle. In bash, `set -e` will kill your script if any of the sub-commands fail (although ideally you'll be testing $? (exit code of prev. op) at the critical junctions), `set -u` will error on usage of undefined variables, etc.
A huge part of the "glue" that holds your favourite linux distro together is bash files.
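A small sketch of that defensive style, reusing the Steam example discussed elsewhere in this thread (the variable name and messages are illustrative):

    #!/usr/bin/env bash
    set -euo pipefail                 # abort on errors, unset variables and failed pipes

    # ${VAR:?} aborts the script instead of silently expanding to an empty string
    STEAMROOT="${STEAMROOT:?STEAMROOT is unset or empty}"
    if [[ ! -d "$STEAMROOT" || "$STEAMROOT" == "/" ]]; then
      echo "refusing to remove suspicious path: '$STEAMROOT'" >&2
      exit 1
    fi
    rm -rf -- "${STEAMROOT:?}"/*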
> I would bet that 80%+ of people who think they're good at shell scripting... aren't.
The same probably goes for driving[1], this doesn't make cars any less robust.
> The same probably goes for driving[1], this doesn't make cars any less robust.
I don't think I can imagine anything less robust than cars, in terms of the frequency and severity of operational failure. They're pretty much the deadliest thing we've ever invented that wasn't actually designed to kill people.
It's actually a good example of the point developer1 was making: cars and shell scripts are perfectly safe if operated by highly competent people, and only become (extremely) dangerous when operated by incompetents, but in practice most operators are incompetent, in denial, and refuse to learn from others' mistakes.
> I don't think I can imagine anything less robust than cars, in terms of the frequency and severity of operational failure.
Maybe US cars :P
> They're pretty much the deadliest thing we've ever invented that wasn't actually designed to kill people.
It's a box weighing 1-2 tons that travels at 100kmh+. Millions (billions?) of km are driven every year. There will be accidents for both good drivers and bad. This won't change.
> cars and shell scripts are perfectly safe if operated by highly competent people, and only become (extremely) dangerous when operated by incompetents, but in practice most operators are incompetent, in denial, and refuse to learn from others' mistakes.
That's simply untrue - both points. Highly competent drivers will have accidents. I highly doubt you feel extreme danger when you get behind the wheel/in a car. The way you phrase it, one expects millions of fatalities daily.
Part of the journey into Linuxdom is learning a healthy dose of fear for that command. I always pause over the enter key for a few seconds, even when I'm sure I haven't typo'ed.
The sorts of bugs people experience with Java mostly result in a crashed/stalled/hung process. Bash bugs erase your entire file system. The thing about Bash is that it is trivially easy to make these sorts of mistakes- the language just isn't suitable to general purpose scripting.
It shouldn't be able to erase your filesystem unless you are running as root or doing something equally stupid. That's pretty much common sense stuff for anyone that isn't a beginner.
Yeah the "common sense stuff for anyone that isn't a beginner" argument is repeated ad nauseam, and even the largest companies make this mistake in their largest products. Take Valve - they should know how to write good code, right? And yet, last week an article was on top of HN, outlining how they put:
"rm -rf '$STEAMROOT'/*" in their code, used to remove the library. But hey, no one checked if $STEAMROOT is not empty, so when it was for one user, Steam deleted all of his personal files, whole /home and /media directories next time it started.
I'm not saying that command line tools shouldn't be used,but sometimes they are just too powerful for some users,and stupid mistakes like this happen.
You're right to an extent, but this isn't relevant to the Java vs Bash discussion. The largest companies make this kind of mistake in whatever language they happen to use.
People delete data and screw things up in MapReduce jobs for Hadoop. A lot.
If you're worried about that, don't give the script permissions to access your entire filesystem. Easily handled with separate users, cgroups, assorted containerisation, and more.
> The shell is just not suitable for extremely robust programs.
Absolute statements like this are usually wrong. This one does not escape the rule. When Linux distros' init is mostly bash scripting, there is very little need to further prove that robust systems can be written in bash without the language fighting the developer.
Wait, is it really a good argument for shell-based approach when all major distros are switching to the systemd due to the configuration/maintainability/boilerplate issues with bash init scripting?
I'm not going into the systemd VS sysvinit discussion. For my argumentation, it is enough to recognize bash based sysvinit has been with us for circa 20 years with no stability problems.
I think most of what was written also applies to any normal programming language. You could write this in Python, Ruby, JavaScript, Java or C# without any problems. The code would probably be easier to read also. The only special thing is the web-page scraping, which could be done by a library, but the same thinking about scalability and the use of a single computer instead of a Hadoop cluster still holds even if you're reading from file systems or databases.
Something to keep in mind is that while a single app might be best served on a single machine piping data, multiple apps working the same data set probably wouldn't scale. Hadoop, for all its faults, does provide a nice, relatively simple programming platform to support multiple data processes.
>Here's what the "sensible adults" think about when they see problems like this. Operational Supportability: How do you monitor the operation? Restart Recovery: Do you have the ability to restart the operation midway through if something fails? Maintainability: Can we run the same application on our desktop as on our production servers? Extensibility: Can we extend the platform easily to do X, Y, Z after the crawling?
Yeah, and then they produce some over-engineered monstrosity, late, over budget and barely able to run...
I look at this article as a criticism of the hadoop being the wrong tool for small data sets.
This starts to become a question of data locality and size. 1.75 GB isn't enough data to justify a hadoop solution. That data size fits easily in memory, and without doubt on a single system. From that point you only need some degree of parallelism to maximize the performance. That being said, when it's 35TB of data, the answer starts to change.
The fact that shell commands were used makes for an easy demo that might be hard to support, but if a solution were written using a traditional language with threading or IPC instead of relying on hadoop, it should always be faster, since you don't incur the latency costs of the network.
> That data size fits easily in memory, and without doubt on a single system. From that point you only need some degree of parallelism to maximize the performance. That being said when its 35TB of data, the answer starts to change.
Not at all, because data is being streamed. It could just as easily be 35TB and only use a few MB of RAM.
The IO bandwidth of the system will limit you more when pulling 35TB of data through a single system, even if it is streamed. You'll need more than one disk and network card to do this in a timely fashion.
1.75 GB isn't enough data to justify a hadoop solution. That data size fits easily in memory, and without doubt on a single system.
It depends on what you do with the data. If you are processing the data in 512KB chunks and each chunk takes a day to process (because expensive computation), you probably do want to spread the work over some cluster.
I don't think of hadoop being built for high complexity computation, but high IO throughput.
When you describe this kind of setup, I imagine things that involve proof through exhaustion. For example, prime number search is something with a small input and large calculation time. However, these solutions don't really benefit from hadoop since you don't really need the data-management facilities, and a simpler MPI solution could handle this better.
Search indexing could fit this description(url -> results), but generally you want the additional network cards for throughput, and the disks to store the results. Then again the aggregate space on disk starts looking closer to TB instead of GB. Plus in the end you need to do something with all those pages.
I think the article said that you don't need to use Hadoop for everything and that it might be much faster to just use command line tools on a single computer. Of course you might find a use case where the total computing time is massive and in that case a cluster is better. I still don't think many use cases have that problem.
We are doing some simple statistics at work for much smaller data sizes and the computing time is usually around 10-100 ms so it could probably compute small batches at almost network speed.
Definitely. I was reacting to my parent poster, because size does not say everything. 1TB can be small, 1GB can be big - it depends on the amount of computation time that is necessary for whatever processing of the data you do.
I hate developers who over engineer everything and then when it's time to perform some of that support and extensibility, they leave because maintenance is beneath them.
They put this behemoth together with a thousand moving parts and then walk away from it.
And I can't stand developers who overengineer things. We have a couple of them at my company, and something that should take a few hours always takes several weeks for all the reasons you mention. Most things don't need that kind of features and maintainability, and if they do in the future we can just rewrite them from scratch. The overall expected return on investment is still better since we seldom need to.
Because in all too many companies, re-writing from scratch is a no-go, no matter how quickly and sloppily an initial solution was thrown together. I've worked on a prototype => production type project, where the throwaway was never thrown away. (the initial team made some mistakes, chief among them was building one prototype of a whole system, rather than one per major risk)
This is a systemic problem. Engineering is always subordinate to business. This simply should not be the case. We desperately need new business organization models.
Quite the opposite, and quite simple: engineers over-engineer things in order to make them generic, and generic makes solutions robust. That's basic science. Unless the problem and solution are well understood, your investment won't guarantee a return at all.
Generic, by default, does not in any way make things more robust.
We've gone from engineering solutions to meet specific problems to engineering solution frameworks that (supposedly) will solve the problem and allow for any unknowns. The problem is, no matter how hard the engineer tries, he can never anticipate the unknowns to the extent that the application framework can support all of them.
We should go back to solving the specific problem at hand. In both scenarios you get the customer who wants a feature that absolutely doesn't fit with the current application, therefore a rewrite is necessary. And with the specific solution, you don't have nearly the man hours wasted.
No, developers over-engineer because setting up a 20-node Hadoop cluster is fun, whereas doing the same task in an hour in Excel means you have to move onto some other boring task.
Generic doesn't mean robust either; I don't know where you got that from. The two concepts are entirely unrelated.
Generic -> robust. I... I don't know how to explain that. Honestly I haven't thought about the necessity of explaining things like this. It's... basic mathematics.
I'm sorry, but if you cannot explain it, you simply do not understand it yourself.
That's harsh, I get it, and I'm truly sorry, but that's a basic fact.
No. Look to safety-critical software for intuition on why.
Simpler is more reliable. Also, it's hard to know enough about a problem to make a generic solution until you've solved the problem 2-3 times already. But ... having solved a problem multiple times increases the risk that you will be biased towards seeing new problems as some instance of the old problem and therefore applying unsuitable "generic" solutions.
While it drives some point home, the chart elides the question of robustness (a written script will run the same way twice, whereas human error, especially on routine tasks, may hit one hard) and documentation (writing even a lightly commented script to do yearly maintenance is guaranteed to help your future self remember things one year from now).
That chart assumes 24 hours days. The reality of (my?) productivity is that I have perhaps six productive hours in a day. If I can save eight productive hours per month, that's sixteen days a year, not four.
What about the failed pages? How about shoving those on a queue and retrying n times with an exponential backoff between. What about the total number of failed pages? What about failed pages by site? etc etc etc
But so what -- the principle is still sound. All I described is still a 100-line Python script, written in an afternoon, instead of 3 weeks of work bringing up EMR, installing and configuring Nutch, figuring out network issues around EMR nodes talking to the commodity internet, installing a persistent queue, performing remote debugging, building a task DAG in either code or (god help you) Oozie/XML, and on and on.
Anybody can throw some crap together and make it stick. And it's a perfectly valid solution.
My issue is when there is criticism laid against those solutions which are actually engineered in a way that allows for supportability and extensibility. They are arguably far more important than execution time.
I think, in many peoples minds, extensibility == pain; either lots of code configuration (hello java, ejb), or xml (hello hadoop, java, spring, ejb), or tons of code (hello java, c++), etc. When nice languages don't make things painful, it sometimes feels like it's wrong, or not really enough work, or in some other way, insufficient. But people can mistake the rituals of programming for getting actual work accomplished.
Simple: because the standard utils are programs that do what they're supposed to do. If the problem's bounds are well within the definition domain of a standard util, then it's all good. "Supportability" and "extensibility" are way too generic for you to draw a line saying the standard utils can handle them all. After all, they are programs, not programming languages.
There are command line tools available to help the transition from 'hack' one liner to a more maintainable / supportable solution. For instance drake (https://github.com/Factual/drake) a 'Make for data' which does dependency checking would allow for sensible restarts of the pipeline.
The O'Reilly Data Science at the Command Line book (linked elsewhere in the comments) has a good deal to say on the subject: turning one liners into extensible shell scripts, using drake, using Gnu Parallel.
I've been using GNU Parallel for orchestrating running remote scripts/tools on a bunch of machines in my compute and storage cluster. It's now my go-to tool for almost any remote ssh task that needs to hit a bunch of machines at once.
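For anyone who hasn't tried it that way, it looks roughly like this (hosts.txt and process.sh are placeholders):

    # run one command once on every machine listed in hosts.txt (one ssh login per line)
    parallel --nonall --slf hosts.txt uptime

    # or spread a list of jobs across those machines
    cat tasks.txt | parallel --slf hosts.txt 'process.sh {}'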
- There are things like pv(1) which allow you to monitor pipes (a small sketch follows this list). Things like systemd open other interesting possibilities for implementing, grouping and monitoring your processes.
- Recovery could be implemented by keeping a logfile of completed steps, such as a list of fully processed files, or by moving processed files elsewhere in the file system (could be done in memory only, using ramfs or tmpfs). Of course, it depends on the case whether it's feasible or not.
- Extensibility: Scripts and configurations can be done in shell syntax. Hook systems and frameworks of varying complexity exist. I agree that doing extensibility in shell code is going to turn out to be hazardous when done without a proper design and an understanding of the tools at hand.
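A minimal sketch of the monitoring and recovery points above (file and script names are invented):

$ pv -pert huge.csv.gz | gunzip | ./process.sh > out.tsv   # progress bar, rate and ETA for the pipe
# recovery: skip inputs already recorded in done.log, append each one on success
for f in data/*.csv; do
  grep -qxF "$f" done.log 2>/dev/null && continue
  ./process_one.sh "$f" && echo "$f" >> done.log
done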
I fully agree with all the operational / restart / feature comments. However, I've often been surprised at how a little thought / research can build all these requirements on top of off-the-shelf components. I also agree that it is likely that one will eventually outgrow wget, but, for example, one may run out of business / pivot before that.
We don't really "come up" with the xargs/wget approach. The approach is already there, waiting to be utilized by someone who understands the tools. The "cool kids" don't like(or are not able) to understand the tools.
The author (I think) is trying to point out that these problems are already solved, decades ago, with existing UNIX tools.
I've implemented this. It isn't too bad up to a certain point. You have to be a bit careful about your filesystem/layout of files, lots of filesystems don't particularly like it when you have a few hundred million files in one directory.
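One common workaround (a sketch; the URL and directory layout are made up) is to fan files out into hashed subdirectories so that no single directory gets too large:

url="http://example.com/page/42"                 # hypothetical page to fetch
h=$(printf '%s' "$url" | md5sum | cut -d' ' -f1)
mkdir -p "pages/${h:0:2}"                        # first two hex chars -> 256 buckets
wget -q -O "pages/${h:0:2}/$h" "$url"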
Alternative, real life scenario: navigate through 6 months of daily MySQL dumps, assorted YAML files and Rails production.log, looking for some cross product between tables, requests and serialised entities, for analysis and/or data recovery (pinpoint or retrieval).
zcat/cut/sed/grep/awk/perl crawled through it in a couple of minutes and required less than half an hour to craft a reliable enough implementation (including looking up relations from foreign keys straight from the SQL dumps).
My colleagues, who still don't get the point of a command line, would still be restoring each dump individually and running SQL requests manually to this day (or more probably declare it "too complex" and mark it as a loss). Side note: I'm torn between leaving this place where nobody seems to understand the point of anything remotely like engineering or keeping this job where I'm obviously being extremely useful to our customers.
> Side note: I'm torn between leaving this place where nobody seems to understand the point of anything remotely like engineering or keeping this job where I'm obviously being extremely useful to our customers.
You should always aim at working with people who are smarter or better than you. Unless they have stack ranking.
I use the multiprocessing module in Python all the time for quick parallelism <https://medium.com/@thechriskiehl/parallelism-in-one-line-40... Lets you easily map to multiple cores. I use it a lot for image processing tasks: quickly crawl through directories to find all the files, then spin up all the cores on my machine to crunch through them. Wish there was an easy way to enlist multiple machines.
Yeah. This is a great article. While I was reading it, I was thinking about what I would have done, and my answer was Python. Bash is just too easy to do wrong (see the recent Steam rm -rf bug), and I don't code in it often enough to know the pitfalls by heart.
I'd be interested to see another article about doing this job in Python and how its performance compares to this simple one-liner.
Python is a great tool for this, and is even better when used with the pythonpy tool, which allows for convenient integration of python commands inside unix pipelines - https://github.com/Russell91/pythonpy
If so, using wget is a poor solution. I have not used wget in over a decade but as I recall it does not do HTTP pipelining; I could be wrong on that - please correct me.
I do recall with certainty that when wget was first written and disseminated in the 1990's, "webmasters" wanted to ban it. httpd's were not as resilient then as they are today, nor were bandwidth and hardware as inexpensive.
HTTP pipelining is a smarter alternative than burdening the remote host with thousands of consecutive or simultaneous connections.
Depending on the remote host's httpd settings, HTTP pipelining usually lets you make 100 or maybe more requests using a single connection. It can be accomplished with only a simple TCP client like the original nc and the shell.
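Something along these lines (the host and paths are hypothetical, and servers differ in how many pipelined requests they will accept on one connection):

$ { for p in /page1 /page2 /page3; do
      printf 'GET %s HTTP/1.1\r\nHost: example.com\r\n\r\n' "$p"
    done
    printf 'GET /last HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n'
  } | nc example.com 80 > responses.txt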
In any event, the line about a "distributed crawler" is spot on. Never underestimate the power of marketing to suspend common sense.
Also, I find that I can often speed my scripts up a little by using exec in shell pipelines, e.g., util1 |exec util2 or exec util1 |exec util2.
There are other, better approaches besides using the builtin exec, but I will leave those for another day.
You could just use Scrapy [1]. Easy to set up, with plenty of options you can activate if needed. Likely more robust than shell scripts as well. No Hadoop involved.
Having written distributed crawlers, saturating the network connection is quite easy to do and is the main reason for even distributing that type of work in the first place.
You know, or the real-world, reasonable, mature engineering answer: a Java/C#/C++ scalable parallel tool using modern libraries, and MPI if it ever needs to scale.
You usually don't need that though, that's the point. If you're building an entire service around page crawling, sure. If you're doing a one-off task, don't bother.
That would be my first gut reaction too, but if it is as simple as downloading webpages, this is actually a really great solution. I suspect he used that when he built Milo, a now defunct startup sold to eBay where they had to update prices and inventory data regularly. A startup should make different choices than Google.
I'm pretty proficient in both, and I think that's a mischaracterization of the two tools. wget is more suited to pulling down large files, groups of files, etc. curl is more suited to API calls where you might need to do something complicated at the protocol level. Each has their use.
I feel ag (the silver searcher, a grep-ish alternative) should be mentioned (even though he dropped it in his final awk/mawk commands) as it tends to be much faster than grep, and considering he cites performance throughout.
I built ag for searching code. It can be (ab)used for other stuff, but the defaults are optimized for a developer searching a codebase. Also, when writing ag, I don't go out of my way to make sure behavior is correct on all platforms in all corner cases. Grep, on the other hand, has been around for decades. It probably handles cases I've never even thought of.
On a couple of GB this is true. Actually, if you have SSDs I'd expect any non-compute-bound task to be faster on a single machine up to ~10 GB, after which the disk parallelism should kick in and Hadoop should start to win.
HDFS is a pseudo block interface. If you have a real filesystem like Lustre or GPFS, not only do you have the ability to use other tools, you can use that storage for other things.
In the case of GPFS, you have configurable redundancy. Sadly, with Lustre you need decent hardware, otherwise you're going to lose data.
In all these things, paying bottom dollar for hardware and forgoing support is a false economy. At scales of 1 PB+ (which is about half a rack now) it's much, much cheaper to use off-the-shelf parts with 24/7 support than "softwaring" your way out.
Back to the topic: HDFS is really somewhat of a waste of disk space, especially when used for something like munching logs.
> At scales of 1 PB+ (which is about half a rack now) it's much, much cheaper to use off-the-shelf parts with 24/7 support than "softwaring" your way out.
Depends. If you need monthly reports from logs, then as long as you don't lose the storage completely, even second-hand hardware or machines decommissioned from prod is the cheapest choice.
That would depend on the data set and the stripe size. Striping is good for streaming. A linear/concat of 2+ drives with XFS would be faster with a lot of files that end up in separate AGs on separate drives, which can be accessed in parallel.
So don't use Hadoop to crunch data that fits on a memory stick, or that a single disk spindle can read in a few seconds.
Why is this first on the HN front-page?
Reminds me of the C++ is better than Java, Go is better than C++, etc, pieces.
Yes, the right tool for the right job. That's what makes a good engineer.
Somebody who thinks there is _no_ valid use case for Hadoop is a fool. (The author did not say that, but many of the comments here seem to imply that view)
(ELSA is a logger that claims to be able to handle 100000 entries/sec (!!))
When to Use Hadoop
This is a description of why Hadoop isn't always the right solution to Big Data problems, but that certainly doesn't mean that it's not a valuable project or that it isn't the best solution for a lot of challenges. It's important to use the right tool for the job, and thinking critically about what features each tool provides is paramount to a project's success. In general, you should use Hadoop when:
Data access patterns will be very basic but analytics will be very complicated.
Your data needs absolutely guaranteed availability for both reading and writing.
The traditional database-oriented tools that currently exist are inadequate for your problem.
Do not use Hadoop if:
You don't know exactly why you're using it.
You want to maximize hardware efficiency.
Your data fits on a single "beefy" server.
You don't have full-time staff to dedicate to it.
The easiest alternative to using Hadoop for Big Data is to use multiple traditional databases and architect your read and write patterns such that the data in one database does not rely on the data in another. Once that is established, it is much easier than you'd think to write basic aggregation routines in languages you're already invested in and familiar with. This means you need to think very critically about your app architecture before you throw more hardware at it.
Shell commands are great for data processing pipelines because you get parallelism for free. For proof, try a simple example in your terminal.
sleep 3 | echo "Hello world."
That doesn't really prove anything about data processing pipelines, since echo "Hello world." doesn't need to wait for any input from the other process; it can run as soon as the process is forked.
cat *.pgn | grep "Result" | sort | uniq -c
Does this have any advantage over the more straightforward version below?
grep -h "Result" *.pgn | sort | uniq -c
Either the cat process or the grep process is going to be waiting for disk I/Os to complete before any of the later processes have data to work on, so splitting it into two processes doesn't seem to buy you any additional concurrency. You would, however, be spending extra time in the kernel to execute the read() and write() system calls to do the interprocess communication on the pipe between cat and grep.
Also, the parallelism of a data processing pipeline is going to be constrained by the speed of the slowest process in it: all the processes after it are going to be idle while waiting for the slow process to produce output, and all the processes before it are going to be idle once the slow process has filled its pipe's input buffers. So if one of the processes in the pipeline takes 100 times as long as the other three, Amdahl's Law[1] suggests that you won't get a big win from breaking it up into multiple processes (with the slow stage at 100 time units and the other three at one unit each, serial execution takes 103 units and the pipeline can never beat the 100-unit stage, so the best-case speedup is about 1.03x).
"grep <pattern> <files>" is not the same as "cat <files> | grep <pattern>", in that the former will prefix lines with filenames if there is more than one input file. What you want instead is "grep -h <pattern> <files>".
The advantage of using cat, therefore, is the few seconds of laziness saved in not reading the manual.
> The -F for grep indicates that we are only matching on fixed strings and not doing any fancy regex, and can offer a small speedup, which I did not notice in my testing.
I guess grep is probably clever enough to choose a faster matching algorithm once it's parsed the pattern and discovered it doesn't contain any regex fun.
In general grep seems smart enough that it would do that, but that hasn't been my experience. Just last week I was searching through a couple hundred gigs of small XML files. I found that:
$ LC_ALL=C fgrep -r STRING .
was much faster than plain grep. This was on a CentOS 5 box, so maybe newer versions of grep are smarter.
But then again, if I was on a newer box I'd just install and use ack or ag.
LC_ALL=C makes grep faster because text matching is normally locale-sensitive, for example 'S' ('\x53' in Big5) is not a substring of '兄' ('\xA5\x53' in Big5).
About 5 years ago I worked at a company that took the "pile of shell scripts" approach to processing data. Our data was big enough and our algorithms computationally heavy enough that a single machine wasn't a good solution. So we had a bunch of little binaries that were glued together with sed, awk, perl, and pbsnodes.
It was horrible. It was tough to maintain -- we all know how hard even the best awk and perl are to read. It was difficult to optimize, and you always found yourself worrying about things like the maximum length of command lines, how to figure out what the "real" error was in a bash pipeline, and so on. When parts of the job failed, we had to manually figure out what parts of the job had failed, and re-run them. Then we had to copy the files over to the right place to create the full final output.
The company was a startup and the next VC milestone or pivot was always just around the corner. There was never any time to clean things up. A lot of the code had come out of early tech demos that management asked us to "just scale up." But oops, you can't do that with a pile of shell scripts and custom C binaries. So the technical debt just kept piling up. I would advise anyone in this situation not to do this. Yeah, shell scripts are great for making rough guesses about things in a pile of data. They are great for ad hoc exploration on small data or on individual log files. But that's it. Do not check them into a source code repo and don't use them in production. The moment someone tries to check in a shell script longer than a page, you need to drop the hammer. Ask them to rewrite it in a language (and ideally, framework) that is maintainable in the long term.
Now I work on Hadoop, mostly on the storage side of things. Hadoop is many things-- a storage system, a set of computation frameworks that are robust against node failures, a Java API. But above all it's a framework for doing things in a standardized way so that you can understand what you've done 6 months from now. And you will be able to scale up by adding more nodes, when your data is 2x or 4x as big down the line. On average, the customers we work with are seeing their data grow by 2x every year.
I feel like people on Hacker News often don't have a clear picture of how people interact with Hadoop. Writing MapReduce jobs is very 2008. Nowadays, more than half of our users write SQL that gets processed by an execution engine such as Hive or Impala. Most users are not developers, they're analysts. If you have needs that go beyond SQL, you would use something like Spark, which has a great and very concise API based on functional programming. Reading about how clunky MR jobs are just feels to me like reading an article about how hard it is to make boot and root floppy disks for Linux. Nobody's done that in years.
I've had the pleasure and displeasure of working with small datasets (~7.5GB of images) in shell. One often needs to send SIGINT to the shell when it starts to glob expand or tab complete a folder with millions of files. But besides minor issues like that, command line tools get the job done.
Until semi-recently, millions of files in a directory would not only choke up the shell, but the filesystem too. ext4 is a huge improvement over ext3 in that regard; with 10m files in an ext3 directory you ended up with long hangs on various operations. And even with ext4, make sure not to NFS-export the volume that directory is on!
I've encountered this (or similar) issue on production.
We had a C++ system that wrote temporary files to /tmp when printing; /tmp was cleared on system startup, and it worked OK for years, but the files accumulated. At some point it started to randomly throw file access errors when trying to create these temporary files. Not for each file - only for some of them.
The disk wasn't full, and some files could be created in /tmp while others couldn't. It turned out, after a few days of tracking it down, that the filesystem can be overwhelmed by too many similarly named files in one directory - it couldn't create file XXXX99999 even if there was no such file in the directory, but it could create files like YYYYY99999 :)
I just love such bugs where your basic assumptions turn out to be wrong.
This kind of approach can probably scale out pretty far before actually needing to resort to true distributed processing. Compression, simple transforms, R, etc... You can probably get away with even more by just using a networked filesystem and inotify.
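For example, a sketch with inotify-tools (the directory and script names are made up) that processes each file as soon as it lands on the shared volume:

$ inotifywait -m -e close_write --format '%w%f' incoming/ |
    while read -r path; do ./transform.sh "$path" >> results.tsv; done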
One common misconception is that you should use Hadoop if your data is large. Hadoop adoption should be driven more by the growth of the data than by its size.
I agree that for the given use case, the solution is appropriate and works fine. The problem mentioned in the post is not a Big Data problem.
Hadoop would be helpful if there were millions of games played every day and we needed to update the statistics daily, etc. In that case, the given solution would hit a bottleneck and some optimisation/code change would be needed to keep the code running.
Hadoop and its ecosystem are not a silver bullet and hence should not be used for everything. The problem has to be a Big Data problem.
It is the buzz surrounding Hadoop that makes people misunderstand its use and capability. I have met non-technical analysts who want RDBMS performance on Hadoop. They expect seconds-to-minutes-scale queries on hundreds of GB of data.
I always throw this analogy at people who misunderstand Hadoop: do you crack an egg with a stone or with a spoon?
Hadoop and RDBMS only have a thin overlapping region in the Venn diagram that describes their capabilities and use cases.
Ultimately, it is cost vs efficiency. Hadoop can solve all data problems. Likewise for RDBMS. This is an engineering tradeoff that people have to make.
I totally agree with you. Capability "LIKE" will drive Hadoop adoption. Hadoop should not be seen as a replacement for an R.D.B.M.S. These are two different tools made for different purposes.
Cloud solutions are totally out due to the nature of the data. Not everything can be done in the cloud.
If you have such a huge amount of data, the total time it takes to transfer it there and compute is not competitive with an on-premise solution, unless all your data already live in the cloud.
I would look into https://spark.apache.org/ then. You can get quite good performance out of it, but you need to spend more effort in babysitting your data.
Here's a probably unpopular opinion...
Pipes make things a bit slow. A native pipeless program would be a good bit faster - including an ACID DB. Note that doing this in Python and expecting it to beat grep won't work...
The other thing is that Hadoop - and some others - are slow on big data (petabytes or more) vs. your own tools. They're necessary/used because of massive clustering (deploying 10x the hardware easily beats building your own, financially).
I suspect it's a general lack of understanding of the way computers work (hardware, OS, i.e. system architecture) vs. "why care, it works, and Python/Go/Java/etc. are easy for me, I don't need to know what happens under the hood".
Why would you want to use a database for this problem? The input data would take time to load into an ACID db and we're only interested in a single ternary value within that data. The output data is just a few lists of boolean values so it has no reason to be in a database either.
This is a textbook stream processing problem. Adding a database creates more complexity for literally no benefit assuming the requirements in the linked article were complete. I would be baffled to see a solution to this problem that was anything more than a stream processor, to say nothing of a database being involved.
If it really is just a one-shot with one simple-ish filter, I agree. But I often find myself incrementally building shell-pipeline tangles that are sped up massively by being replaced with SQLite. Once your processing pipeline is making liberal use of the sort/grep/cut/tee/uniq/tac/awk/join/paste suite of tools, things get slow. The tangle of Unix tools effectively does repeated full-table scans without the benefit of indexes, and is especially bad if you have to re-sort the data at different stages of the pipeline, e.g. on different columns, or need to split and then re-join columns in different stages of the pipeline. In that kind of scenario a database (at least SQLite, haven't tried a more "heavyweight" database) ends up being a win even for stream-processing tasks. You pay for a load/index step up front, but you more than get it back if the pipeline is nontrivial.
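A minimal sketch of that load-then-query step with the sqlite3 CLI (the file and column names are invented):

$ sqlite3 analysis.db <<'EOF'
.mode csv
.import events.csv events
CREATE INDEX idx_events_user ON events(user_id);
SELECT user_id, COUNT(*) FROM events GROUP BY user_id ORDER BY 2 DESC LIMIT 10;
EOF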
The interesting part is that it's still faster, not that it's the best-case solution. The main reason is that the data set fits in memory and is no slower to load (you need to read the data in all cases; both the piped version and the DB will read the data from disk exactly once, in a sequential fashion).
There is no locking issue, and you can be smart in the filtering steps (most DBs do some of that automagically anyway). You don't have that level of control with the pipes; you are limited by the program's ability to process stdin, and by additional locking.
This is exactly where knowing how things really work under the hood gives you an advantage vs. "but in theory...". You can reimplement a complete program, or even a set of programs, that will outperform the DB and the piped example. But will you? No, you want the best balance between the fastest solution and the least amount of work.
In the final solution at the end of the article there are only two pipes:
1. A pipe to feed the file names into xargs for starting up parallel `mawk` processes.
2. A pipe to a final `mawk` process which aggregates the data from the parallel processes.
There's still some performance that could be gained by using a single process with threads and shared memory, but this is pretty good for something that can be whipped together quickly.
Yeah, it's not bad. In the final command, it is basically leveraging mawk for everything, which works out well since there are fewer pipes.
But in this case it's about replacing Hadoop with mawk, basically. Which is indeed a good point as well - and incidentally also confirms my own comment =)
This is blown out of proportion... actually the increase is probably a factor of 10-20x, not 100s. The fact that EMR is used is a problem; provisioning and bootstrapping the cluster alone accounts for probably half the time.
The fact that shell commands were run repeatedly means that the data ends up in the OS buffer cache and basically in memory.
I'm not discounting that CLI is faster than Hadoop by an order of magnitude on small datasets. Nor will I dive into Hadoop vs CLI. The answer to all that IMO is that it depends. And in this case, it's not well warranted.
What I do take exception to is the Fox News-style headlines that are disproportional to the truth. EMR != Hadoop.
Maybe I come from a weird world, or even a weird generation. But when I was in high school, Linux fanboyism was at its peak and just like people get all wound up on bands and such, us geeks got wound up on open-source and linux and fck Micro$oft etc. etc. This was early-ish 2000's.
As a result, every serious programmer I know, especially those who are about my age, lives their life in the CLI.
It always comes as a surprise when somebody suggests that there are professional developers out there who do not predominantly use the CLI.
This shouldn't be a surprise. Tons of development is done on Windows. Most game development, obviously Windows app development, .NET websites, etc.
There are command line tools there, but in my 10 years of being a Windows developer, GUI tools were more the norm.
There's a time and a place for both. Now developing predominantly under Linux, it amazes me how time consuming and clunky some tasks are on the command line compared to using a GUI (e.g. debugging, Visual Studio is just a fantastic IDE), but also how much faster and easier other tasks are with a CLI.
Windows is just inherently GUI-centric; if you force developers to use Windows they are naturally going to gravitate toward using GUI-based tools because the Windows command line is so crippled. Tools like Cygwin and Chocolatey are nice in a pinch but they just don't compare to being in a real UNIX environment.
Why use a loaded word like 'force'? Developing on Windows was excellent. Microsoft provides great tools and support; it was incredibly productive. The only real negative is the licensing requirements.
It's interesting to see the culture of development differ so much from place to place. When I worked in Australia, Windows was an incredibly common development environment, while here in Silicon Valley it's all Mac/Linux.
Yes, but my point was that there is no evidence that is the majority, nor are they really 'forced' to use it - if it's that much of an issue, just don't take the job (as you say).
Many, many developers have no issue developing on Windows or even enjoy it. There seems to be a mindset amongst certain people that Windows developers are not 'real' developers, which is what I'm arguing against.
I say "force" because a lot of employers simply do not allow developers to choose their workstation OS. Windows is very popular with the non-programmers responsible for making IT purchasing deals within most major corporations. Among programmers, it's not quite so popular.
It still depends on what programmers you ask. Windows works fine for me, Linux did not work well when I tried it (but it was many years ago). Unix through XWindows from a pc was actually quite nice.
I think developers, like users, use feelings more than thoughts when choosing an environment. Yes, you can do some things on Linux that are harder on Windows, but saying that missing "grep" is severely limiting you is strange since it is easy to install. And you have find in most editors. You can run Vim or Emacs if you want to on Windows. Powershell is very good but has a strange syntax; if that is someone's problem, they sound like they don't want to learn something new. Listing thousands of files in a folder is probably still faster on Linux, but I rarely do that (I can't see the point in collecting many years of logs in one folder, for example). The main problem I have with Windows is that I have to restart the computer every now and then.
Yes, the way Windows locks shared DLLs and requires a restart of the whole OS instead of just the programs currently using the DLL when you install an update is, in a word, obnoxious.
Also, again I dislike the fact that Windows is inherently GUI-centric. Interacting with programs is designed to be done using primarily the mouse or a touch screen. Even Windows Server is designed to be used via remote desktop. Whereas for UNIX-compliant OSes the desktop and mouse are not required for anything, a skilled user can easily be more productive using only a keyboard, and anything a user can do can just as easily be placed in a script and automated.
Package management is also a major deficiency of Windows. Chocolatey is a nice workaround, but is a relatively young project. Windows did not have any package management for a long time.
Also for a lot of programming language packages with native code extensions, it can be difficult or even impossible to get them to compile on Windows. I can think of a handful of Ruby packages that flat-out do not compile, making Rails dev on Windows a non-starter.
> Developing on Windows was excellent. Microsoft provides great tools and support; it was incredibly productive.
I don't know about that. I've made several good faith efforts to really see what people like about the Windows development ecosystem, and I consistently come away dismayed. "Great tools and support" could never be used to describe msbuild, for example. Or any of the MSDN documentation with incredible antipattern code that people like to blithely copy and paste into their programs. Visual Studio is slow, brittle, and makes it difficult to do version control right.
I could list examples for days but that's not the point. Nowhere is perfect, but there's no way anyone could consider Microsoft a clear leader here.
Windows does have virtual desktops... just not the way X allows. The main thing that bugs me about Windows's desktops is that it is given a logical display surface composed of all the display units. Whereas with X (going by my use of xmonad) it's easy to separate the display and desktop abstractions and have desktop 2 on display 1 and desktop 5 on display 2.
Also note, if I'm remembering correctly, there are native virtual desktops slated to come with Windows 10. Kinda late to the game, but it's nice that they're finally getting them.
Windows Powershell is actually quite nice and powerful, while also avoiding some of the shell legacy traps around escaping. It suffers from not being very discoverable and not having a community.
I wish they had just been less stubborn and made something that would run bash and standard unix commands. I've used it for production jobs and it's worked as advertised, but I would rather have just had bash.
Because there are a lot of shitty things about bash, too, that anybody with half a brain would think should be blindingly simple from the command line. For example, add a virtual host to apache with an Allow for localhost on /var/www/localtest. And do it in a portable way, i.e. no 'put every vhost in a separate file and Include those'. Or fetch a configuration parameter from another machine when that other machine might run any of 5 different distros. The list goes on and on - look, I'm no PowerShell fan, or 'real' user even, but we have to admit that the old Unix approach is reaching EOL (well, I should say 'it should reach EOL'; unfortunately it seems like it's not going to go away soon, with there not even being an alternative).
They weren't stubborn, they set out to "give windows a Unix command line" but then they discovered how unsuitable a Unix command line is, and had to rethink.
A lot of bash is using awk, sed, cut, tr and others to munge the output of one command into the input of another. The killer feature of Powershell is that it does away with all of that.
.. having advertised Powershell upthread, I found it fell on its face at the last hurdle. I wanted to grep a file, and discovered that the default output formatters will either truncate or wrap lines, even when directed into a file. This is wrong and destroys data. I ended up with
It also has a surprising number of WTFs, which was a great disappointment to me because it really seemed like MS had done scripting better. Of course, later versions fix some of the pitfalls if you use them right, but then you need to make sure you're always running machines that will have the latest Powershell on them.
(And, of course, it would have been nice if they could improve its interactive use to even be on par with the Unix shells of the 80s)
People are always surprised when I mention that the Microsoft devs I worked with had free access to the highest tiers of Visual Studio, yet what they actually worked in was vim and the internal fork of make. I don't know whether that's still true; it's been a decade now.
When I was an MS dev (8 to 18 years ago), Visual Studio had nothing to do with "real work" on Windows systems programming.
Quite a few people used a rather obscure editor called Source Insight (http://www.sourceinsight.com/) because of its code-navigation abilities, which were similar to an IDE's but worked on huge codebases that would take hours to actually parse and analyze "properly". Sort of a supercharged ctags.
It's still true (source: Microsoft engineer for 8 years, used vim and internal version of Unix tools).
A lot of this has to do with the sheer size of many of Microsoft products' codebases. Visual Studio just can't handle projects with millions of lines of code, whereas vim + ctags elegantly handles the fragments of projects you build locally.
15 years ago, when I first started programming, I learned on the *nix CLI. I worked that way for years; that's just how it was done. Well, I started using an IDE on Windows and now... I really like it. I don't want to go back to the command line. That being said, I still use the Windows command line from time to time.
Yeah, me too. I do quite a bit of 'bit bigger than average' data processing, and I write the tools in cmd.exe or as command line C++ applications first - and then, when they work and I need to run them more often than a few times, I make a small GUI in front of them so that I don't have to spend half an hour every time setting up the parameters right.
Can confirm, was there in high school at the same time and had the same experience :-)
I still use programming languages like AWK to this day (I even list it as a known language on my resume; someone commented on it once, and it's always a +1 for the company if they do). Recently though I've been exposed to some stuff in the Windows world that makes me think that, sure, they evolved slower, but they evolved in a really interesting direction and possibly pulled themselves out of the local maximum that is the Unix world. (I'm referring to PowerShell.)
It's a little weird to me when devs don't immediately go to the command line, given that's just how I learned, but you've gotta recognize that it's not a flaw that they don't. Everyone just learned a little differently.
Every serious programmer I know, especially those who are about my age, lives their life in the CLI.
This is not even close to true. Many of the developers that I know, quite a few of them serious, live in the Windows world and are very happy with, for example, VS or Eclipse. This is very likely to be the case in most BFEs.
So, you know all the same programmers the parent poster knows? The parent explicitly stated that this was anecdotal, and from (not your) personal experience.
Indeed, there is a part of me that hurts when people mention teaching something that is a signifier of legitimacy. As if you could fake being a real programmer by listening to a lecture in CS 102!
If the entire dataset fits into memory on your laptop then it's not big data, and the only reason for using map reduce etc is to build experience with it or proof of concept for a larger dataset.
Related: sometimes I wish most of my data was in JSON streams so I could simply map and reduce the data using jq (http://stedolan.github.io/jq/), pipes and possibly filemap.
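For example, a couple of simple map/reduce passes over a stream of JSON objects (the field names are invented):

$ cat events.json | jq -r 'select(.status == "error") | .host' | sort | uniq -c | sort -rn
$ jq -s 'map(select(.type == "purchase") | .amount) | add' events.json     # total purchase amount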
There is also an interesting and fun talk to watch by John Graham-Cumming from CloudFlare: http://www.youtube.com/watch?v=woCg2zaIVzQ using Go instead of xargs. Kind of fits into "using the right tool for the job". There is no Big Data involved, but it shows a sweet spot where it might make sense (make it easier) not to use a shell script (i.e. retries, network failure).
Oh yeah, and it turns out that, when all is said and done, the average data set for most Hadoop jobs is no more than 20 GB, which again fits comfortably on a modern desktop machine.
Hadoop is replacing many data-warehousing DBs like Netezza, Teradata and Exadata. In the process, many data-warehousing developers have become Hadoop developers who write SQL code; after all, Hadoop got a SQL interface via Hive.
Informatica (another ETL tool) also provides a tool called PowerExchange, which automatically generates MR code for Hadoop.
Whenever you hear Hadoop, first ask yourself whether it is just data warehousing in disguise.
Yes, this is very much happening -- mostly based on the insane pricing difference of supporting Hadoop clusters vs. Netezza or Teradata infrastructure. Just following a simple 3-year lifecycle of HW depreciation essentially boosts your performance for next to nothing. The same cannot be said of the big DWH vendors.
What is missed in the article and many of these comments is that Hadoop isn't always going to be the best tool for one job. It shines in its multitenancy: when many users are running many jobs, each developed in their favorite framework or language (bash/awk pipeline? No problem), running over datasets bigger than single machines can handle.
It also comes in handy when your dataset grows dramatically in size.
Any decent tutorials out there to get me up to speed on CL tools? I use grep and a few others regularly, but have avoided sed and awk as they seem difficult to jump into.
On a tangential note, sometimes I use a slower method for UI reasons. For example, avoiding blocking the UI, or allowing for canceling the computation, or displaying partial results during the computation (that last one might completely trash the cache).
great article!
PS: probably some hardcore Unix guy would tell you that you are abusing cat. The first cat can be avoided, and you might gain even better performance. Also, using GNU grep seems to be faster.
Thank you for the compliment. Point taken on cat, but that's the way I like to introduce the process to people. I took cat and grep out at the end of the article anyway.
Well sorry but you don't have a clue what you're talking about.
I very much work in "big data" with about 2 terabytes of new data coming in every day that has to be ingested and processed with hundreds of jobs running against them. The data needs to be queryable via an SQL like language and analyzed by a dozen data scientists using R or Map Reduce.
There isn't anything on the market today that has been proven to work in environments like this and has the tooling to back it up. Unless you want to prove everyone - e.g. Netflix, LinkedIn, Spotify, Apple, Microsoft - wrong?
The point here is that the same functionality could be implemented way more efficiently even with shell scripts.
The idea of using standard UNIX tools for the showcase is a good one. Basically, it tells you that a modern FS is very good at storing chunks of read-only data (one doesn't need Java for that) with efficient caching and in-kernel procedures, and that using pthreads for jobs is a waste, because context-switching has its costs, etc.
To put it simply: by merely rewriting the basic functionality in, say, Erlang, one could get an orders-of-magnitude more efficient implementation.
The only selling point of Hadoop is that it exists (mature, stable, blah-blah). It also has one problem - Java. But as long as hardware is cheap and credit is easy - who cares?
What about if you are processing 100 Petabytes? And you are comparing to a 1000-node Hadoop cluster with each node running 64 cores and 1TB of main memory?
Right tool for the right job. 100 petabytes is 50,000,000 times larger than the data in the post. It's the difference between touching something within reach and flying around the world.[1]
1. Earth is 40 megameters in circumference. 40Mm / 50M = 0.8m
Then you're hardly using commodity hardware anymore. While jobs like that probably actually work on Hadoop, I'd imagine a problem like that might be better suited for specialized systems.
IME, most installations where Hadoop is "successfully" used it's running on pretty high-end machines. "Commodity hardware" really means standard hardware, not cheap hardware (as opposed to buying proprietary appliances and mainframes).
I'm not sure what your point is with a hypothetical situation. Why wouldn't they be able to query their data? All I'm saying is, from my actual experience with real users, it's best to build a Hadoop cluster with high-quality hardware if you can.
Hadoop is highly inefficient when using the default MapReduce configuration. And a single MacBook Pro is much stronger than 7 c1.medium instances.
Bottom line - run the same thing over Apache Tez with a cluster that has the same computational resources as your laptop, and I'm pretty sure you'll see the same results.
Awk and Sed aren't very accessible to most people who did not grow up learning those tools.
The whole point of tools built on top of Hadoop (Hive/Pig/HBase) is to make large-scale data processing more accessible (by hiding the map-reduce as much as possible). Not everyone will want to write a Java map-reduce job in Hadoop; however, many can write a HiveQL statement or a Pig script. Amazon Redshift takes it even further - it is a Postgres-compatible database, meaning you can connect your Crystal Reports/Tableau data analysis tool to it, treating it like a traditional SQL database.
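For instance, the result-counting task from the article might be a single statement against a hypothetical games table (a sketch, not the author's setup) rather than a hand-written MapReduce job:

$ hive -e 'SELECT result, COUNT(*) FROM games GROUP BY result;'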
I think the author's point was that the example in question was orders of magnitude smaller than "big data" and that it was more efficient to process it on a single machine, not that Hadoop and friends aren't easy to use.