
I look at this article as a criticism of Hadoop being the wrong tool for small data sets.

This starts to become a question of data locality and size. 1.75 GB isn't enough data to justify a Hadoop solution. That data size fits easily in memory, and without doubt on a single system. From that point you only need some degree of parallelism to maximize performance. That said, when it's 35 TB of data, the answer starts to change.

Using shell commands makes for an easy demo that might be hard to support, but a solution written in a traditional language with threading or IPC, instead of relying on Hadoop, should always be faster, since it doesn't incur the latency costs of the network.
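
As a rough illustration of the non-Hadoop route, here is a minimal sketch using Python's multiprocessing (local IPC, no network). The input file name and the "count the first tab-separated field" task are made up for the example:

    import multiprocessing as mp
    from collections import Counter

    def count_chunk(lines):
        # Tally the first tab-separated field of each line in this batch.
        c = Counter()
        for line in lines:
            c[line.split("\t", 1)[0]] += 1
        return c

    def batches(f, size=100_000):
        # Yield lists of lines so each worker gets a sizeable unit of work.
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) >= size:
                yield batch
                batch = []
        if batch:
            yield batch

    if __name__ == "__main__":
        total = Counter()
        with open("data.tsv") as f, mp.Pool() as pool:
            # imap_unordered streams batches to worker processes as they free up.
            for partial in pool.imap_unordered(count_chunk, batches(f)):
                total.update(partial)
        print(total.most_common(10))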




> That data size fits easily in memory, and without doubt on a single system. From that point you only need some degree of parallelism to maximize performance. That said, when it's 35 TB of data, the answer starts to change.

Not at all, because the data is being streamed. It could just as easily be 35 TB and only use a few MB of RAM.
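
A sketch of what that looks like in practice: a running aggregate over a file read line by line holds only one line (plus a couple of accumulators) in memory at a time, whether the file is 1.75 GB or 35 TB. The file name and the "mean of one number per line" task are just placeholders:

    total = 0.0
    n = 0
    with open("measurements.txt") as f:   # hypothetical input file
        for line in f:                     # one line in memory at a time
            total += float(line)
            n += 1
    print(total / n if n else 0.0)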


With 35 TB on a single system, the IO bandwidth will limit you more than RAM, even if the data is streamed. You'll need more than one disk and network card to get through it in a timely fashion.
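
To put rough numbers on it (assuming a single spinning disk at around 150 MB/s and a 10 GbE NIC at around 1.25 GB/s): reading 35 TB once at 150 MB/s takes roughly 65 hours, and even at full 10 GbE line rate it is still on the order of 8 hours, so you need several disks and NICs, or several machines, to get through it quickly.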


> 1.75 GB isn't enough data to justify a Hadoop solution. That data size fits easily in memory, and without doubt on a single system.

It depends on what you do with the data. If you are processing the data in 512 KB chunks and each chunk takes a day to process (because of expensive computation), you probably do want to spread the work over some cluster.
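
As a rough illustration: 1.75 GB split into 512 KB chunks is about 3,500 chunks, so at a day per chunk that is nearly 10 years of sequential work, while spreading it over a 100-node cluster brings it down to about five weeks.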


I don't think of Hadoop as being built for high-complexity computation, but for high IO throughput.

When you describe this kind of setup, I imagine things that involve proof by exhaustion. For example, prime number search has a small input and a large calculation time. However, these solutions don't really benefit from Hadoop, since you don't need its data management facilities, and a simpler MPI solution could handle this better.
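
For the prime search case, a minimal MPI sketch (here with mpi4py; the search bound is arbitrary) shows how little infrastructure that kind of job needs: each rank checks a strided slice of the range and a single reduce combines the counts.

    # Run with e.g.: mpiexec -n 8 python primes.py
    from mpi4py import MPI

    def is_prime(n):
        if n < 2:
            return False
        i = 2
        while i * i <= n:
            if n % i == 0:
                return False
            i += 1
        return True

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    LIMIT = 10_000_000  # arbitrary search bound
    local = sum(1 for n in range(rank, LIMIT, size) if is_prime(n))
    total = comm.reduce(local, op=MPI.SUM, root=0)

    if rank == 0:
        print(f"{total} primes below {LIMIT}")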

Search indexing could fit this description (url -> results), but generally you want the additional network cards for throughput and the disks to store the results. Then again, the aggregate space on disk starts looking closer to TB instead of GB. Plus, in the end you need to do something with all those pages.


I think the article said that you don't need to use Hadoop for everything and that it might be much faster to just use command line tools on a single computer. Of course, you might find a use case where the total computing time is massive, and in that case a cluster is better. I still don't think many use cases have that problem.

We do some simple statistics at work on much smaller data sizes, and the computing time is usually around 10-100 ms, so it could probably process small batches at almost network speed.


Definitely. I was reacting to my parent poster, because size does not say everything. 1 TB can be small and 1 GB can be big; it depends on the amount of computation time necessary for whatever processing you do on the data.



