How to Learn Hadoop for Free (johnwittenauer.net)
163 points by jdwittenauer on April 3, 2017 | 13 comments



A better idea: work as a consultant, get a Hadoop-related assignment, accept it, learn on the go and swear A LOT in the process, still deliver on time.

You got paid AND you got to know some Hadoop! (Worked for me; YMMV)


Is Hadoop/MapReduce still as relevant now as it was a few years ago? What stack would you set up today for a standard Big Data processing system?


Hadoop certified engineer here. I think Hadoop is losing its popularity, or, from a different point of view, it has saturated 90% of its potential market and is having trouble entering other markets. The biggest challenges are operational stability and performance, and the lack of understanding from the Hadoop companies about the performance characteristics of their own systems. On top of that, there are always two versions of everything (Tez vs Impala, ORC vs Parquet, etc.) because HWX and Cloudera cannot really work together in an open-source fashion.

On top of everything, there are better products on the market for many of Hadoop's use cases. The following list is incomplete: Alluxio, Apache Beam, Apache Kudu. These systems try to address some of the aforementioned shortcomings of Hadoop. There are other products like PrestoDB that take a slightly different approach to a particular problem (accessing data via a SQL-like interface), mix in some extra goodness (in-memory caching), and deliver an entirely different customer experience.

If you leave Hadoop land you can also play with Spark or Storm (depending on your use case). Now that Facebook uses Spark, there is a good chance that an average user won't run into scaling issues with it. I left out products from vendors that target the same customers as Hadoop vendors on purpose. There are plenty of closed-source solutions that will leave Hadoop in the dust in almost every aspect of big data processing (performance, security, UI, stability, availability, etc.).


Disclaimer: I'm a developer of Hops Hadoop.

I agree Hadoop is no longer MapReduce. It's HDFS+YARN. That's it. Distributions package up Spark/Flink/Kafka/PrestoDB with the HDFS/YARN core.
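
To make that concrete: a Spark job on such a distribution only touches "Hadoop" through YARN (scheduling) and HDFS (storage). A minimal PySpark sketch, assuming a cluster where spark-submit is configured for YARN; the HDFS paths are placeholders:

    # Submit with: spark-submit --master yarn wordcount.py
    # Hadoop here is just YARN (scheduling) + HDFS (storage).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-on-yarn").getOrCreate()

    lines = spark.read.text("hdfs:///data/events.txt")    # read from HDFS
    counts = (lines.rdd.flatMap(lambda row: row.value.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("hdfs:///data/wordcounts")       # write back to HDFS

    spark.stop()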

At Hops, we've scaled the core HDFS by >16-37X (https://blog.acolyer.org/2017/03/06/hopfs-scaling-hierarchic...) and we have a distribution called Hopsworks with support for Spark/Flink/TensorFlow. Nobody uses MapReduce on our platform.

The thing that has killed Hadoop, imo, is Kerberos. In Hops, we have switched to using TLS/SSL certificates instead of Kerberos, and that enables us to implement dynamic roles. Dynamic roles allow us to build a software-as-a-service platform where projects are securely isolated from one another.
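
For anyone unfamiliar with the approach: certificate-based auth is essentially mutual TLS, where the client's identity travels in its certificate rather than in a Kerberos ticket. A toy Python sketch of the general idea (not Hops code; the host and file paths are made up):

    import socket
    import ssl

    # Trust the cluster CA, and present our own project-scoped certificate.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile="ca.pem")
    ctx.load_cert_chain(certfile="project-user.pem", keyfile="project-user.key")

    with socket.create_connection(("namenode.example.com", 8020)) as sock:
        with ctx.wrap_socket(sock, server_hostname="namenode.example.com") as tls:
            # The server derives the client's identity (and project role) from
            # the client certificate itself; no ticket-granting server involved.
            print(tls.version())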


I think the term "Hadoop" is becoming almost meaningless. It seems to now be more of a pointer referencing a basket of distributed processing technologies that run on YARN/HDFS. Agree completely about there being multiple technologies to solve every problem; that's one of the most confusing parts to learn.

My own perspective is that there are lots of businesses that haven't yet needed the capabilities provided by a platform like Hadoop, but they likely will in the future. So the market may be saturated based on current needs but that market will continue to expand. Whether it's Hadoop (YARN/HDFS/etc.) that wins that market share or some other stack like Spark/Mesos remains to be seen.


> It seems to now be more of a pointer referencing a basket of distributed processing technologies that run on YARN/HDFS

You reference the MapR distribution for their training material, and it's interesting that their version of HDFS is a reimplementation in C++ (MapR-FS). It's part of the reason I settled on MapR to use tools like Apache Drill, because the filesystem becomes usable to non-Hadoop tools via NFS (e.g. Awk).
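
To illustrate: once MapR-FS is NFS-mounted, plain POSIX I/O is enough, with no Hadoop client libraries involved. A small sketch with a made-up mount point:

    from collections import Counter

    # MapR-FS over NFS looks like an ordinary directory tree; the mount
    # point below is a hypothetical example.
    counts = Counter()
    with open("/mapr/my.cluster.com/data/events.log") as f:
        for line in f:
            fields = line.split()
            if fields:
                counts[fields[0]] += 1  # tally the first field, awk-style

    for key, n in counts.most_common(10):
        print(key, n)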

Given a shift in some categories away from map-reduce to other approaches, could Hadoop eventually just become a collection of distributed filesystems and job schedulers?


Hadoop is not the same as MapReduce (anymore). For instance, folks in my organization run both Spark and MapReduce on top of Hadoop/YARN.


I think Spark is consistently eating into Hadoop's market.


Spark is eating the MapReduce market. HDFS is still going strong.


I've recently started working in the big data space and it definitely seems like the primary use case for Hadoop these days is HDFS. Spark seems to have entirely subsumed Hadoop MapReduce for most batch processing workloads.


Probably the one tool missing from this list is Impala, which is essentially Hive's successor. Uses the same metastore and runs an order of magnitude faster. Almost the same flavor of SQL too.
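
To make the "same metastore" point concrete: the exact same table can be hit from Impala with an ordinary DB-API client. A minimal sketch using the impyla library (host, port, and table name are placeholders):

    # pip install impyla
    from impala.dbapi import connect

    conn = connect(host="impalad.example.com", port=21050)  # Impala daemon
    cur = conn.cursor()
    cur.execute("SELECT country, COUNT(*) FROM web_logs GROUP BY country")
    for row in cur.fetchall():
        print(row)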


Agree that Impala would fit well on this list. They didn't have any training on it, presumably because it's a Cloudera-led technology, but my understanding is it's very popular. Not sure that it truly replaces Hive/Tez though. I think they each excel at certain types of workloads.


Spark SQL does the same thing; it's slower than Impala (it doesn't even have join reordering) but probably more popular.
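
For comparison, the same query against the same Hive metastore through Spark SQL; a minimal sketch with a placeholder table name:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sparksql-example")
             .enableHiveSupport()   # reuse the existing Hive metastore
             .getOrCreate())

    spark.sql("SELECT country, COUNT(*) FROM web_logs GROUP BY country").show()
    spark.stop()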



