On a couple of GB this is true. Actually, if you have SSDs I'd expect any non-compute-bound task to be faster on a single machine up to ~10 GB, after which disk parallelism should kick in and Hadoop should start to win.
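To make that concrete, here's a minimal single-machine sketch in Python: a toy "count the first field of each log line" job fanned out over a few worker processes. The path, the job itself, and the chunking numbers are all made up for illustration; the point is just that one box can chew through a few GB like this without a cluster.

    # Toy single-machine log crunch: count the first field of each line
    # using a handful of worker processes. Path and field choice are placeholders.
    import sys
    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def count_chunk(lines):
        c = Counter()
        for line in lines:
            fields = line.split()
            if fields:
                c[fields[0]] += 1
        return c

    def main(path, workers=4, chunk_lines=200_000):
        # For truly huge inputs you'd bound the number of in-flight chunks;
        # for a few GB this naive version is fine.
        total = Counter()
        futures = []
        with open(path, errors="replace") as f, ProcessPoolExecutor(workers) as pool:
            chunk = []
            for line in f:
                chunk.append(line)
                if len(chunk) == chunk_lines:
                    futures.append(pool.submit(count_chunk, chunk))
                    chunk = []
            if chunk:
                futures.append(pool.submit(count_chunk, chunk))
            for fut in futures:
                total.update(fut.result())
        for key, n in total.most_common(10):
            print(n, key)

    if __name__ == "__main__":
        main(sys.argv[1])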



Depends on the dick, depends on the storage.

HDFS is a pseudo block interface. If you have a real filesystem like Lustre or GPFS, not only do you have the ability to use other tools, you can also use that storage for other things.

In the case of GPFS, you have configurable redundancy. Sadly, with Lustre you need decent hardware, otherwise you're going to lose data.

In all these things, paying bottom dollar for hardware and forgoing support is a false economy. At scales of 1 PB+ (which is about half a rack now) it's much, much cheaper to use off-the-shelf parts with 24/7 support than "softwareing" your way out.
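Rough back-of-envelope for the "about half a rack" bit; the drive size and chassis density are my own assumptions, not numbers from the comment:

    import math

    usable_pb = 1.0
    drive_tb = 10           # assumed drive size
    drives = math.ceil(usable_pb * 1000 / drive_tb)
    drives_per_2u = 12      # assumed 12-bay 2U chassis
    rack_units = math.ceil(drives / drives_per_2u) * 2

    print(drives, "drives, roughly", rack_units, "U of a 42U rack")
    # -> 100 drives, roughly 18U: about half a rack, before any replication.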


> Depends on the dick

Not really, sorry, I had to.

Back to the topic: HDFS is really somewhat of a waste of disk space, especially when used for something like munching logs.
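To put a number on it (50 TB of logs is an arbitrary example; the 3x is HDFS's default replication factor, and the 10+2 RAID 6 group on a plain filesystem is my assumption for the comparison):

    logical_tb = 50
    hdfs_raw = logical_tb * 3          # HDFS default replication factor of 3
    raid6_raw = logical_tb * 12 / 10   # e.g. a 10+2 RAID 6 group on a plain FS

    print("HDFS raw TB:  ", hdfs_raw)    # 150
    print("RAID 6 raw TB:", raid6_raw)   # 60.0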

> At scales of 1 PB+ (which is about half a rack now) it's much, much cheaper to use off-the-shelf parts with 24/7 support than "softwareing" your way out.

Depends. If you only need monthly reports from logs, then as long as you don't lose the storage completely, using second-hand hardware, or hardware decommissioned from prod, is the cheapest choice.


Ahem

Disk....


If you want disk parallelism, RAID 0 is probably easier than Hadoop.


That would depend on the data set and the stripe size. Striping is good for streaming. A linear/concat of 2+ drives with XFS would be faster with a lot of files that end up in separate AGs on separate drives, which can be accessed in parallel.
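Quick toy of the many-files case in Python: read whatever sits under a (placeholder) mount point with a small thread pool, which is the access pattern where files landing in different AGs on different drives can actually be pulled in parallel. Real throughput obviously depends on the drives and on where XFS placed the files.

    import glob
    from concurrent.futures import ThreadPoolExecutor

    def read_file(path):
        with open(path, "rb") as f:
            return len(f.read())

    paths = glob.glob("/mnt/concat/logs/*.log")   # placeholder path
    with ThreadPoolExecutor(max_workers=8) as pool:
        total = sum(pool.map(read_file, paths))
    print("read", total, "bytes from", len(paths), "files")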



