seiji pretty much nails it. Hadoop seems to have come out of a weird culture. It is a distributed system with a single point of failure (name node) because its designers insisted on avoiding Paxos (distributed systems are too hard so we'll just make a broken-by-design protocol instead). Another example is that a lot of the database code built on top of Hadoop is designed around one Java hashmap per row which really limits performance.
There are all sorts of oddities and you can mostly work around them but it is...exhausting, and I spend a lot of time thinking "surely there must be a better way".
Wait, so Zookeeper (= distributed consensus thingie that I think implements the Paxos algorithm) is a Hadoop project but not actually used in Hadoop mapreduce?
There are all sorts of oddities and you can mostly work around them but it is...exhausting, and I spend a lot of time thinking "surely there must be a better way".