Fancy algorithms lose; dumb C code wins (begriffsschrift.com)
84 points by begriffs on Nov 12, 2011 | 30 comments



This doesn't strike me as particularly interesting. CL is just a programming language, a kind of functional assembly, so it seems intuitive that it would be infeasible to cache all computations. Even just caching all programs up to a given size is a big problem, because their number grows exponentially with program size, so even if you could cache every program of a given size it wouldn't necessarily help much in solving (reducing) larger programs.

It would be like trying to write a compiler with a peephole optimization for every form of the language: you just end up with a large, slow compiler.

Still, it makes me want to write an SKI interpreter.


> Still, it makes me want to write an SKI interpreter.

How about in Perl-style regexes? http://www.perlmonks.org/?node_id=809842


> Still, it makes me want to write an SKI interpreter.

See unlambda at http://en.wikipedia.org/wiki/Unlambda.


Bah, Unlambda is clearly artificial. If you want something that is ridiculous in a natural way, use Iota: http://en.wikipedia.org/wiki/Iota_and_Jot


The first solution that didn't involve disk or network IO was the fast one?


It also doesn't free anything.


My thoughts exactly. It has nothing to do with language at all.


Potentially it still might have: since C is more efficient with memory, it may not have been practical to take that approach in PHP.


Ah, nothing like a false dichotomy to start the weekend.


So not putting stuff into a transactional database and using a compiled language instead of an interpreted one is better for computationally intensive tasks? Big surprise.


Anyone wanting to experiment with pre-computing logical expressions should take a look at Binary Decision Diagrams.

http://en.wikipedia.org/wiki/Binary_decision_diagram


This story reminds me of the (apocryphal?) claim that C++ running on a single machine can often beat a cluster of Hadoop machines. Scalability is another matter.


That's a totally problem/domain-specific assertion. A single machine with 8 or so cores simply isn't going to make it through a hundred terabytes of data, and many (most?) Hadoop jobs are IO-bound anyway.

If you find that micro-optimization greatly increases your performance, you probably shouldn't be using Hadoop anyway...


> If you find that micro-optimization greatly increases your performance, you probably shouldn't be using Hadoop anyway...

It's all a matter of what you optimize for. In a recent project, a C++ version using memory-mapped files with fixed-length records was about 20 times faster than the Hadoop version, and that's just the CPU.

The C++ version needed no deserializing, gave random access to everything, and multiple runs on the same data had NO I/O requirements (the mmap was cached between runs). It was about as hard to write.
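Roughly this pattern (just an illustrative plain-C sketch with a made-up record layout, not the actual project code):

    /* Illustrative only: map the file once, treat it as an array of
     * fixed-length records, and let the page cache do the "loading". */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct record {                    /* hypothetical record layout */
        uint64_t key;
        double   values[6];
    };

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s datafile\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Repeated runs hit the page cache, so there is no real I/O after the first. */
        struct record *recs = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (recs == MAP_FAILED) { perror("mmap"); return 1; }

        size_t n = st.st_size / sizeof(struct record);
        double sum = 0;
        for (size_t i = 0; i < n; i++)   /* random or sequential access, no parsing */
            sum += recs[i].values[0];
        printf("%zu records, sum %f\n", n, sum);

        munmap(recs, st.st_size);
        close(fd);
        return 0;
    }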

So, instead of running a 100-core-strong Hadoop install (which is small, but far from trivial), I was able to make do with one hefty 8-core machine.

Scaling up is "easier" with Hadoop, in the sense that you can just throw money at it and get more EC2/Rackspace nodes when needed. But it is a lot of money.

Hadoop makes sense if you've got lots of money to waste and actually need thousands of cores (because you'll only need a few tens of them if you use your hardware effectively).


If your working set fits entirely in RAM on a single machine, Hadoop is the wrong tool for the job. Hadoop is more about getting obscene IO bandwidth from hundreds of hard-disk spindles going in parallel, not about raw number of cores or parallel computation.


While that is what Hadoop is about, and it is successful at it, it is very far from being efficient; in my experience you pay 10x compared to good use of the same hardware. "Obscene IO bandwidth" is right, but at 10% of the practical maximum.

This is a tradeoff that people in computing have been happily taking for years -- e.g. using a higher level language is a similar kind of tradeoff (use Python instead of C - pay x10 in performance, get x10 shorter development time).

But this tradeoff only makes sense if the x10 thing you get is cheaper than the x10 thing you pay for. With Hadoop, you pay x10 for hardware/computing time and more for administration (unless you use something like EC2 Elastic MapReduce, at which point you pay x20 for hardware/computing but have no administration cost), and you save some development time.

Two Hadoop projects I was consulting on (or rather, consulting _off_ Hadoop) were paying ~$10K/month each to Amazon for a while, until <$20K in programmer time for each reduced it to running on one beefy machine that was already sitting in the office, at a cost of ~$100/month (power, cooling).


I've found the slowdown from Hadoop to be very problem-dependent; I've seen anywhere from no slowdown to 1000x slower. It sounds to me like those projects should never have been using Hadoop in the first place. If your problem is not naturally IO-constrained, moving it to Hadoop will make it IO-constrained. In my experience, doing that will get you a 10x-100x slowdown versus a more appropriate solution on the same hardware.

If your problem is IO-constrained, for example processing 10 TB of log data, Hadoop will get all your disks to 100% utilization just as well as anything else will. It will leave your processors horribly underutilized, but that's just the nature of IO-constrained problems. If you could get 10 TB of RAM dedicated to your dataset, you could easily get a massive speedup over the Hadoop solution, but that's just not realistic.


Have you any useful examples of non-trivial Hadoop MapReduce jobs that can process input data at hundreds of megabytes per second?


building sharded inverted indexes


Doesn't sound apocryphal at all. A well-written piece of custom code for a single multi-core CPU can easily run circles around an algorithm shoehorned into a useful but somewhat generic multiprocessor framework, where the interconnect between the processors is the real culprit that makes the computation communication-limited.

A related specific example would be molecular dynamics on GPUs versus CPUs. Not only are GPUs faster, they're faster than CPUs can be at this time because of communication issues. As in you can throw as many CPUs and servers as you want at the problem and they'll never beat a single $600 consumer GPU with its 120+ GB/s internal bus (note the absence of hyperbolic claims of 100x to 1000x faster, just faster).

Or in other words, strong-scaling tasks don't work out so well on a weak-scaling architecture.


Must be a slow news day when this hits the front page.


I think this might be much faster with caching, but the mistake was putting the cache in an external database rather than in local process memory.

There are only 4 possible characters: `, i, k, s. That means that on a 64-bit machine you can fit 29 elements into a native integer (5 bits for the length + 29 2-bit elements). Use that as the key for a hash-table cache and compute the longer terms in the standard way; to avoid lots of small mallocs, preallocate large chunks of memory that can be filled linearly... I'd hope for a 1.5-2x speedup over the original "dumb C".
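Something like this sketch is what I have in mind for the key (names are made up, and anything longer than 29 symbols would just skip the cache and go through the normal reducer):

    /* Pack a CL term (a string over `iks) into one 64-bit hash key:
     * low 5 bits hold the length, then 2 bits per symbol. */
    #include <stddef.h>
    #include <stdint.h>

    static uint64_t sym_code(char c)
    {
        switch (c) {
        case '`': return 0;
        case 'i': return 1;
        case 'k': return 2;
        case 's': return 3;
        }
        return 0;    /* unreachable for well-formed terms */
    }

    /* Only valid for len <= 29; longer terms don't fit and bypass the cache. */
    uint64_t pack_term(const char *term, size_t len)
    {
        uint64_t key = (uint64_t)len;             /* 5 bits: length */
        for (size_t i = 0; i < len; i++)
            key |= sym_code(term[i]) << (5 + 2 * i);
        return key;
    }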

Sigh... now I'm tempted to actually try that. Does anyone know of a good library of SKI fragments for testing?


I'd love to see what you come up with! Here are lots of CL terms you can use: http://share.begriffsschrift.com/joe/bank-17.txt.bz2

It is a tab delimited file. The first column is all CL terms of length 17 which reduce to a normal form. The second column is what they reduce to, and the last column is the number of reduction steps it took in a leftmost-outermost evaluation order.
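A few lines of C along these lines should be enough to pull the columns apart (buffer size is a guess, and the parsing is deliberately naive):

    /* Read term <TAB> normal form <TAB> steps from stdin, one row per line. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        char line[8192];                      /* buffer size is a guess */
        long rows = 0;
        while (fgets(line, sizeof line, stdin)) {
            char *term   = strtok(line, "\t");
            char *normal = strtok(NULL, "\t");
            char *steps  = strtok(NULL, "\t\n");
            if (!term || !normal || !steps)
                continue;
            /* hand `term` to the reducer, then compare against
             * `normal` and atol(steps) */
            rows++;
        }
        printf("%ld rows\n", rows);
        return 0;
    }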


I'm not sure how comparable it is to your program, since I cannot run mine in batch mode (it leaks memory...), but I'm able to process the first 340k rows of that file in under 7 seconds with the cache limited to 15 characters. That makes 20 microseconds per row (including checking the result).

Or, put another way, 7 microseconds per reduction step on average.

Without the cache it takes minutes, so I didn't want to wait for the actual time. See http://imgur.com/FL2Eg for seconds per row vs. cache limit. Ah... it's written in Python without the bit-packing tricks. I guess the 100+ times speedup is enough anyway... https://bitbucket.org/viraptor/ski/src


Nicely done, and good analysis. It's a fascinating problem, isn't it? Simple rules, but an opportunity to approach it several ways and refine the solution.


From the post: "This post is about sucking it up and just writing low-level code"

So I'm quite sure that C is a high-level language.

Assembly is low level. It allows you to control the CPU directly by telling it which instructions to run and which registers should be used for what. In C, the closest thing you have is specifying whether a register should be used at all. Or inline assembly, which, again... is just assembly.
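For example, the most portable C gives you is a hint, and anything more specific means writing assembly anyway (GCC/Clang extended-asm syntax shown, which is itself compiler-specific):

    int add_one(int x)
    {
        register int r = x;    /* only a hint; the compiler may ignore it */
    #if defined(__GNUC__) && defined(__x86_64__)
        /* naming an actual instruction means writing assembly after all */
        __asm__("addl $1, %0" : "+r"(r));
    #else
        r += 1;
    #endif
        return r;
    }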

And yes... PHP is even higher level.

Caching to disk is only useful if your operation takes longer than reading from disk, and in this case reading from disk was slower than doing the calculation. What he should have considered is an in-memory cache.


> So I'm quite sure that C is a high-level language.

Except it isn't. At least not anymore. That statement was true a few decades back, when C was the highest level available. Today, C is barely a level above assembly on the abstraction scale of programming languages.

Mind you, I'm not saying that's bad (and most other people don't mean it as an insult, either). C is a powerful language that will probably be needed by generations to come. But calling it "high level" today is being stuck in the 1970s.


Once you get good enough at Python or Ruby, knowing C is so very, very important. It opens up whole swaths of problems to you, and binding it to Ruby or Python is easy.


Hey begriffs, next time we're hanging at CE or a tech social, let's talk about this kinda stuff. I may be able to shed some light on making "fancy" stuff work faster/better... though a straight-up C implementation without I/O is gonna win (though it perhaps won't be as persistent as, say, anything with actual persistence)


You complain about App Engine expiring requests, but I think the new 'backend instances' concept allows you to have really long-running requests.



