* They insert special x86 instructions around the object-access instructions emitted by the JIT, to "trap" uses of references to objects whose cleanup/relocation is in progress. In those rare cases the trap works around the in-progress work and "heals" the reference; the rest of the time, branch prediction on the x86 lets the check simply fall through.
* It almost sounds like part of the relocation work by the GC uses a mechanism not unlike "transactional memory" to either "commit" a block of moves of active objects, or roll them back in case of a conflict caused by the running application accessing/updating/creating something at an inopportune moment.
* One of the diagrams suggests that there are N GC threads corresponding to N application threads. If there is in fact a one-to-one correspondence, rather than just "there are many of both kinds of thread", I wonder if they have thread specific sub-heaps, and employ some kind of processor affinity binding together the application thread and its corresponding GC "shadow" on the same processor? Maybe that's automatic anyway, based on memory region in use? Anyway, localizing these tasks together might avoid processor cache misses. I may have read much more into one of the diagrams than was really meant, though. Even if they don't have thread specific heaps, I think I like the idea of having heaps tied to individual threads, only migrating objects/references to a global heap when they have in fact been shared between threads, or are anchored to some sort of static context.
Anybody care to provide an alternate interpretation of some of this?
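To make my reading of the "trap and heal" bullet concrete, here is a toy Java sketch. It is only my interpretation, not Azul's implementation: the real thing lives in JIT-emitted machine code, while this simulates it with a forwarding map. All names here (HealingBarrierSketch, Obj, relocate, load) are made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

public class HealingBarrierSketch {
    /** Toy heap object; 'relocating' stands in for "lives on a page being compacted". */
    public static final class Obj {
        final String payload;
        volatile boolean relocating;
        Obj(String payload) { this.payload = payload; }
    }

    /** Old copy -> new copy, published by the simulated collector (identity-keyed). */
    static final Map<Obj, Obj> forwarding = new ConcurrentHashMap<>();

    /** Simulated collector step: copy the object, publish the forwarding entry, then flag the old copy. */
    public static Obj relocate(Obj old) {
        Obj copy = new Obj(old.payload);
        forwarding.put(old, copy);   // publish before flagging, so readers always find an entry
        old.relocating = true;
        return copy;
    }

    /** Simulated read barrier: the check that would surround each reference load. */
    public static Obj load(AtomicReference<Obj> slot) {
        Obj ref = slot.get();
        if (ref == null || !ref.relocating) {
            return ref;              // common case: the check falls through
        }
        Obj moved = forwarding.get(ref);
        slot.compareAndSet(ref, moved); // "heal" the slot so later loads take the fast path
        return moved;
    }
}
```

The point of the compareAndSet is that the slow path is paid once per stale slot: after the first load heals it, subsequent loads of that slot see the new copy and take the fast path.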
Azul's systems are pretty much the definition of high-volume. Nobody (sane) buys an embarrassingly parallel machine with custom hardware and software just to host their blog on it.
Understood, I'm just curious if anyone has hands-on experience. I have a high-volume, low-latency app and I'm curious about the ROI on this vs non-commercial solutions.
Also, it looks like they're now selling software as well as HW solutions. That's potentially more interesting than when it was just an appliance solution.
Some of that would depend upon how much the work can be localized per processor, since there are multiple caches with multiple CPUs. But, yes, I very much wonder as well.
The linked article is naive as to the GC algorithms used in current Java VMs. These go to a lot of trouble to avoid tracing or scanning the whole heap. Card marking (http://www.ibm.com/developerworks/java/library/j-jtp11253/#2...) is one way to avoid this cost. As to building something like memcached, you can use native memory directly using direct byte buffers in Java; this is essentially what Terracotta BigMemory (http://www.terracotta.org/bigmemory) does.
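For anyone who hasn't used direct byte buffers, a minimal sketch of the idea: values are stored in native memory obtained from ByteBuffer.allocateDirect, and only a small index stays on the Java heap. This is an illustration under my own assumptions, not how BigMemory is actually implemented; the class name OffHeapStoreSketch and its put/get methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class OffHeapStoreSketch {
    // Native memory: the contents are not traced or copied by the garbage collector.
    private final ByteBuffer slab;
    // On-heap index: key -> {offset, length} into the slab.
    private final Map<String, int[]> index = new HashMap<>();

    public OffHeapStoreSketch(int capacityBytes) {
        slab = ByteBuffer.allocateDirect(capacityBytes);
    }

    /** Appends the value to the slab; returns false when the slab is full. */
    public boolean put(String key, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > slab.remaining()) {
            return false;
        }
        int offset = slab.position();
        slab.put(bytes);
        index.put(key, new int[] {offset, bytes.length});
        return true;
    }

    /** Copies the value back onto the heap, or returns null if the key is absent. */
    public String get(String key) {
        int[] loc = index.get(key);
        if (loc == null) {
            return null;
        }
        byte[] bytes = new byte[loc[1]];
        ByteBuffer view = slab.duplicate(); // independent position, same memory
        view.position(loc[0]);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

This toy version is append-only (overwriting a key strands the old bytes) and not thread-safe; a real cache would need a free list or eviction on top. The trade-off is that every get pays for a copy and decode, in exchange for the collector never having to look at the stored values.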
At some point, I am going to have to experiment with the java command line options and redo my string whacking benchmark program. Still, I can't help but think that while these refinements limit the trips made wandering about the heap, there are still a good number of times when all of those gigabytes of pages have to be marched through CPU caches, displacing active work, to check on things that haven't changed status. Perhaps with enough cores, some of them are simply left alone the vast majority of the time to do productive work with data in cache. I'd like to see some measurements of this, and how the effectiveness is affected by worker threads vs CPU cores available, as well as how many background GC threads there are. Data, anybody???
At any rate, the defaults for Java are slower than those for Perl when doing many string operations on a single thread. Measurements: http://roboprogs.com/devel/2009.12.html. I have since rerun these tests on a 6 core AMD, with largely similar results. Of course, when doing threads or fork, the comparison breaks down as these constructs are implemented so differently between these languages, to say the least.
There are plenty of "distributed hash table" solutions for Java, some are at least as old as memcached.
It's true that large heaps (multiple GBs) can cause long GC pause times (which is one of the problems Azul tried to solve). This can be mitigated by simply running more cache servers with smaller heaps.
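A minimal sketch of what "more cache servers with smaller heaps" means on the client side: each key is mapped to one of several servers, so no single JVM has to hold the whole data set. The class name, method name, and host names below are hypothetical, and plain modulo hashing is just the simplest scheme to show; real clients often use consistent hashing so that adding a server moves fewer keys.

```java
import java.util.List;

public class ShardPickerSketch {
    /** Maps a key to one server; the same key always lands on the same server for a fixed list. */
    public static String pickServer(String key, List<String> servers) {
        // floorMod keeps the bucket non-negative even when hashCode() is negative.
        int bucket = Math.floorMod(key.hashCode(), servers.size());
        return servers.get(bucket);
    }

    public static void main(String[] args) {
        // Hypothetical hosts, each running a cache JVM with a modest heap.
        List<String> servers = List.of("cache1:11211", "cache2:11211", "cache3:11211");
        System.out.println(pickServer("user:42", servers));
    }
}
```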