Hacker News new | past | comments | ask | show | jobs | submit login

Hadoop is a framework, not a library, so user applications need to link against the jars they need. That implies using CLASSPATH to locate them. Whether or not there is 1 jar or 100, the fact that there's no standard place to install jars in Java is a problem.

Hadoop jars are hundreds of megabytes, and we have multiple daemons. Duplicating all those jars in each daemon would multiply the size of the installation many times over. That's also a nontrivial amount of memory to be giving up because jars can no longer be shared in the page cache.

Some of these problems could be mitigated by making Hadoop a library rather than a framework (as Google's MR is), or by pruning unnecessary dependencies.




Most of these issues could be addressed by actually modularizing the core of Hadoop, some of which has been done in the latest code. Also, many things could be provided at runtime by the system with only the interfaces required to be in the jars that customers depend on, thus making their jars backwards compatible and more robust. BTW, let's say you didn't want to put the jars in one jar but didn't want a classpath. You can use META-INF/manifest to include those jars automatically as long as they are in a well defined place relative to the host jar. Redesign with the requirement that end users don't have to worry about CLASSPATH and you will find that there are solutions.

I do sympathize that something akin to the maven repository and dependency mechanism hasn't been integrated into the JDK. I was on the module JSR and continually pushed them to do something like that but it turns out IBM would rather have OSGI standardized and so it deadlocked. Maybe something will come in JDK 9.

Sad email thread from years ago: http://markmail.org/message/y2a6nzhcsp62p5yv


Well, I work on Hadoop. I don't know what you mean by "modularizing the core." There was an abortive attempt a few years ago to split HDFS, MapReduce, and common off into separate source code repositories. At some point it became clear that this was not going to work (I wasn't contributing to the project at the time, so I don't have more perspective than that).

Right now, we have several Maven subprojects. Maven does seem to enforce dependency ordering-- you cannot depend on HDFS code in common, for example. So it's "modular" in that sense. But you certainly never could run HDFS without the code in hadoop-common.

None of this really has much to do with CLASSPATH. Well, I guess it means that the common jars are shared between potentially many daemons. Dependencies get a lot more complicated than that, but that's just one example.

Really, the bottom line here is that there should be reasonable, sane conventions for where things are installed on the system. This is a lesson that old UNIX people knew well. There are even conventions for how to install multiple different versions of C/C++ shared libraries at the same time, and a tool for finding out what depends on what (ldd). Java's CLASSPATH mechanism itself is just a version of LD_LIBRARY_PATH, which also has a very well-justified bad reputation.

I don't know of anyone who actually uses OSGI. I think it might be one of those technologies that just kind of passed some kind of complexity singularity and imploded on itself, like CORBA. But I have no direct experience with it, so maybe that is unfair.

I like what Golang is doing with build systems and dependency management. They still lack the equivalent of shared libraries, though. Hopefully, when they do implement that feature, they'll learn from the lessons of the past.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: