JVM implementation challenges: Why the future is hard but worth it [pdf] (java.net)
98 points by pron on Feb 4, 2015 | 74 comments



Cache, cache and cache. Three of the most important things Java gets wrong today. Mostly because Java doesn't have true object arrays at all: only primitive types are stored inline as array elements, so you get an array of references to objects instead of an array of objects.

There are only 512 L1D cache lines in a CPU core. After just 512 references to memory areas covered by different cache lines (or earlier), L1D entries start getting evicted. In Java this happens very easily, such as when iterating over a single, not especially large, data structure.

Everything else (except maybe SIMD/GPU-related items for compute purposes) is nowhere near as important as fixing Java's current problems with CPU caches.
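
To make the array-of-references point concrete, here's a small illustrative sketch (the Point class is made up for the example):

  class LayoutDemo {
      static class Point { double x, y, z; }

      public static void main(String[] args) {
          // int[]: 1000 ints stored contiguously in one block of memory.
          int[] xs = new int[1000];

          // Point[]: really an array of 1000 references; every Point is a
          // separate heap allocation, placed wherever the allocator put it.
          Point[] ps = new Point[1000];
          for (int i = 0; i < ps.length; i++) ps[i] = new Point();

          // Iterating ps chases one pointer per element and can touch a
          // different cache line for each Point, instead of streaming
          // through contiguous memory as the int[] loop does.
          double sum = 0;
          for (Point p : ps) sum += p.x;
          System.out.println(sum + " " + xs.length);
      }
  }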


This is a focus for Java 10 — value types. John Rose is actually one of the guys working on it. He mentions this several times in the presentation as a key issue to solve in the JVM.

https://www.google.com/search?client=safari&rls=en&q=value+t...


Back before Java was widely known, we had Oberon(-2), Component Pascal and Modula-3.

All system programming languages with GC, value types and AOT compilers (some implementations of Oberon did have JITs).

Apparently a 20-year detour was needed to go back to those features.


> Apparently a 20-year detour was needed to go back to those features.

I think this is explainable by Java's history. When it was created, the design docs said Java was a simple, interpreted language with distributed objects, and the target hardware was set-top boxes. When Java unexpectedly took off, I suspect Sun weren't quite sure which ingredients were key to that success but fixated on "simple". They unlocked a huge market that was unserved at the time: companies that needed a much simpler C++. So that wasn't unreasonable. They then spent many years trying to handle customer requests for ever more runtime speed, whilst simultaneously not making Java as complex as C++. Hence the focus on super whizzy optimisations inside HotSpot ... many of which are just undoing performance hits introduced by the simplicity of the Java language.

Value types increase the complexity of the language, no doubt about it. Heck, even having a difference between int and Integer increases complexity. When value types come out, I expect much rejoicing from people who need more runtime speed, and a whole lot of confused questions on Stack Overflow from people who aren't totally sure about the difference between a value type and a regular class.

I also expect to come across codebases where classes were turned into value types more or less at random, or where the style guide forbids value types entirely, etc. Java sort of has this problem already, where weaker programmers who are using threads just slap synchronized keywords on every method until the crashes stop. Then HotSpot goes in, tries to figure out whether the locks are actually needed, and removes them at runtime.

Don't get me wrong. I'm A++++ in favour of having value types on the JVM and in Java. But increasing complexity of the language is absolutely not cost free, especially not given Java's value proposition to big corporates.


The CLR has had value types from the outset.


True, but only now is it getting static compilation to native code.

There were Bartok, Mono's AOT compiler and Cosmos OS, but it seems that if something isn't part of the reference platform, developers tend to ignore it.

I have met a few .NET devs unaware of NGEN.


NGEN is a detail. If you don't care about it, it can remain an obscure detail :)


This reminds me of something I'm curious about: why have L1 caches been the same size for quite a few generations of Intel Core processors?


Because that size is optimal, regardless of transistor size. Making the cache larger makes it slower (because of the speed of light, AFAIK) and more power-hungry, which reduces overall performance.


Currently Intel's L1D (level 1 data) cache is 512 lines with 64 bytes each, 32 kB. Been that way for a pretty long time.

L1D latency for a pointer load is typically 4 cycles. I'm not sure, but I think having 1024 entries would increase that to 5 cycles.

Increasing the cache line size from 64 bytes to 128 bytes would require more memory (DDRx SDRAM) bandwidth, because the CPU always needs to fill a whole cache line (128 bytes after such a change, instead of 64) on a memory load, even if you want just one byte. Wasted bandwidth for non-streaming workloads would increase significantly. This change would also force L2 to have 128-byte cache lines, etc. LFBs and other buffers would of course also need to accommodate this.

Skylake will be able to process 64 bytes (512 bits) with a single SIMD instruction, with AVX-512. Perhaps Intel will need to soon increase cache line size, if 1024 bits wide SIMD instructions are desired.

SIMD gather instructions might be a hint that in the future memory loads will work differently. Maybe cache lines will no longer be the smallest unit of data the CPU memory subsystem can deal with?


This reminds me of something I'm curious about: why aren't large SRAM arrays displacing DRAM? I know 6Ts are larger than 1T+1C, but even if SRAM were 6x as expensive (which ought to be a severe upper bound) that would put it at $50/GB, which sounds like it could easily be worthwhile for certain workloads -- not unlike SSDs a few years ago. The latency advantage would be gigantic, especially if you could crowd the chips near the CPU.

Is SRAM comparatively power hungry or something?


RLDRAM exists, so SRAM would have to beat that. In your 6x upper bound maybe you're not taking into account (dis)economies of scale that could cause SRAM to be much more expensive.


What diseconomies of scale do you mean? If you mean that demand is low right now and so there is no economy of scale yet, that's hardly evidence against a gradual market takeover. That's what investors and early adopters are for, see: SSDs. If you really do mean diseconomy of scale, where unit price will rise with the number of units produced, what would produce such an effect in the SRAM market? Chip fab is almost always in the opposite regime, what's so unique about SRAM?

Also, I looked up RLDRAM [1] and it looks like their "low" latencies are still 10ns, which is good for DRAM but abysmal in comparison to SRAM.

[1] http://www.micron.com/products/dram/rldram-memory/1_15Gb#/


I think SRAM cost is more like $1500/GB.


Are you familiar with an immensely super-linear cost scaling in the # of transistors vs DRAM or are you quoting a cost figure for putting SRAM on the CPU die?

If you put DRAM on the CPU die I would also expect it to cost hundreds per GB, but that's not a very interesting comparison because we know from current DRAM best practices that you can put it on its own die at a much lower cost.


Some of the talks [1] on the LMAX Disruptor [2] were interesting with regard to how it interacted with the various caches and how important it could be.

1. https://www.youtube.com/watch?v=DCdGlxBbKU4 2. https://lmax-exchange.github.io/disruptor/


It'd go against OOP, but you could take all the member variables and put them in separate arrays.

E.g instead of an array of 3-vectors, you'd have an x-array, a y-array and a z-array.
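
A minimal sketch of that structure-of-arrays layout (class and field names are just illustrative):

  final class Vec3SoA {
      final double[] x, y, z;   // one flat primitive array per component

      Vec3SoA(int n) { x = new double[n]; y = new double[n]; z = new double[n]; }

      // Sequential loops over a single component stream straight through
      // one contiguous array, with no pointer chasing.
      double sumX() {
          double s = 0;
          for (double v : x) s += v;
          return s;
      }
  }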


Still, if you need all three together, that's potentially 3 cache misses instead of 1.


And if there are too many fields, this technique can confuse the CPU's hardware prefetcher so that it no longer sees the sequential access pattern, making prefetching less useful (or even completely useless). This will increase memory latency and make the CPU stall longer while waiting for memory.

If you're unlucky and the distance in bytes between the physical memory read pointers is a multiple of 64 * <number of memory channels>, all 3 reads might hit the same physical memory module, but in different DRAM pages. All accesses would queue up on just one memory controller and one memory module, significantly reducing the available bandwidth.

At physical memory level read pointer distance between different arrays could also cause cache aliasing and consequently extra cache invalidations.

See http://www.intel.com/content/dam/www/public/us/en/documents/... for details. Search for "aliasing" in it with your favorite PDF viewer.

If you are just sequentially and continuously streaming from one array, none of this should normally happen. You can of course still get conflicts leading to CPU stalls when writing data, but that's a different matter.

Considering the above, you might want to store your vector x, y, z in a same array, interleaving the values:

  x = array[objIndex * 3 + 0];
  y = array[objIndex * 3 + 1];
  z = array[objIndex * 3 + 2];
Not pretty! In real code, you might want to use 4 as the multiplier. Then a JVM with vectorization support might (or might not) have better opportunities to load the whole vector with a single SIMD (SSE/AVX) instruction.
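
A hedged sketch of that interleaved layout wrapped in a helper class, with the multiplier of 4 as suggested (names are illustrative; whether a given JVM actually vectorizes the loads is another matter):

  final class Vec3Packed {
      static final int STRIDE = 4;          // x, y, z, one padding slot
      final double[] data;

      Vec3Packed(int n) { data = new double[n * STRIDE]; }

      double x(int i) { return data[i * STRIDE]; }
      double y(int i) { return data[i * STRIDE + 1]; }
      double z(int i) { return data[i * STRIDE + 2]; }

      void set(int i, double x, double y, double z) {
          int base = i * STRIDE;
          data[base] = x; data[base + 1] = y; data[base + 2] = z;
      }
  }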


And now we're back to value structs.


High-performance, cache-friendly Java code that emulates value structs can be unbelievably ugly: just one object with primitive-type array members.


And here is where I wish Java actually had a struct type.

Alas, like so many other things about the Java language (no operator/assignment overloading, no unsigned integer types, etc.), it seems to have been a deliberate, although potentially under-thought, design decision.


But Java and the JVM are getting value types, which are immutable structs.


What about for mutable structs?


Mutable structs are a very bad idea.


Elaborate?

What makes it a bad idea?

Having to rewrite an entire struct to change one value within it seems... counterproductive, considering that the purpose of having structs is largely speed-based.

I can see it being useful to be able to flag a struct as immutable for certain cases, but for there not being a mutable version?


The main point of value types is that they're passed by value. As such, they have no concept of identity. The value 2 stored in variable x is the same as the value 2 stored in variable y. So, too, the value 2+3i, represented as the value type Complex { double r, i; }, is the same no matter where it is stored. Introducing mutation changes that. Values can now be destroyed. This is the conceptual reason.

The practical reason is that structs help you with speed when their size is no more than the size of the machine's cache line, i.e. up to 64 bytes. Beyond that, you incur an extra cache miss when accessing them (even below, if they're not aligned, but the smaller they are the smaller the chance), so you might as well use a pointer, and your copying costs are no longer negligible, so a pointer might be a better idea anyway.

So conceptually mutable "values" are a bad idea, and there's no reason to have them because beyond the cache line size they might cost you in performance more than they buy you, and below it mutation doesn't help as copying costs are very low.

The only benefit mutable structs might have is when you want to lay out a large number of objects, which you'll be accessing sequentially, consecutively in RAM. This way, the CPU's prefetcher will help you stream them into the cache. But the way HotSpot's allocator works, you're almost certain to get the same behavior with objects anyway, and if whatever tiny performance difference this might buy is that important to you, you might be better off using C.

The goal of the JVM is to get you the best performance for the least amount of work; not to let you work hard to squeeze out a few more performance improvement percentages. It sometimes certainly allows for that, but not at the cost of such a fundamental change, which is bound to do more harm than good.
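
Going back to the Complex example above: in today's Java the closest approximation is a final, immutable class; a rough sketch (a real value type could additionally be flattened into arrays and fields instead of living behind a pointer):

  final class Complex {
      final double r, i;
      Complex(double r, double i) { this.r = r; this.i = i; }

      Complex plus(Complex o) { return new Complex(r + o.r, i + o.i); }

      // No identity: equality is purely by value, as with 2 == 2.
      @Override public boolean equals(Object o) {
          return o instanceof Complex
              && r == ((Complex) o).r && i == ((Complex) o).i;
      }
      @Override public int hashCode() {
          return 31 * Double.hashCode(r) + Double.hashCode(i);
      }
  }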


ObjectLayout [1] should help with that, but it's not in standard Java.

[1] http://objectlayout.org/


There's some exciting stuff in the presentation. In particular: next-gen threading (fibers/warps, fork-join), plus attention to Java's profligate use of memory bandwidth because of its reliance on pointers-for-all-the-things. Plus the idea of the JVM helping out languages besides Java is also exciting.

That said, it's a little scary to contemplate the power that microsoft, steward of the clr, and oracle, steward of the jvm have over our industry. It's staggering to see the amount of engineering that's gone into bullet-proofing the jvm. What happens when microsoft or oracle decline to continue paying for hundreds of very expensive senior compiler/vm/language engineers?


> What happens when microsoft or oracle decline to continue paying for hundreds of very expensive senior compiler/vm/language engineers?

People will move away to something else, while a few will keep on maintaining them.

I remember when the answer for any database related problem was Clipper. Now it is gone.

I remember when no one would question the use of Turbo Pascal for both systems programming and application programming.

I remember when using a spreadsheet meant Lotus 1-2-3.

I remember ....


Sorry, I "arrived late at the party": was Turbo Pascal ever used beyond business applications? I've always thought it was a cool programming language, but that C was the massive leader.


In the 80's and 90's, C had hardly any meaning outside UNIX.

The 8 bit home computers were all about BASIC, Assembly and Forth.

Those powerful enough to run CP/M, had some C dialects available, but no one cared.

When 16 bit arrived in the home computers, BASIC and Assembly were still the way.

On *-DOS variants, the OS was developed in straight Assembly.

Turbo Pascal was widely used in my home country, from systems programming all the way to business applications. Very few cared about C.

For CRUD applications there were dBase and Clipper.

We only started caring about C when the need to take code to Windows 3.x started to be a reality and Turbo Pascal for Windows started to get behind the times.

Then Borland went crazy in their business decisions and many moved away from Delphi into Visual Basic and Visual C++ (MFC).

In the Amiga world, the OS was coded in a mix of Assembly, BCPL and C.

For the coders it was all about Assembly and AMOS. Although I think there were some using C with the likes of MUI and similar.

MacOS was originally developed in Object Pascal, the dialect later added to Turbo Pascal, and Assembly.

Apple eventually added C and C++ support and, while trying to cater to developers, went C and C++, rewriting the Object Pascal parts.

So it always saddens me to see young generations think C was the one and only systems programming language.

It only became that because UNIX-based workstations succeeded in the market and everyone wanted a piece of the pie, which partially meant using C.


Well, the Windows API was a fundamentally C based API and that probably did more for it than most things. The Microsoft example code was all C and C++. Delphi was great for many years, Borland didn't really start to sink until around the Delphi 5 era, if I recall correctly. Until then it was widely regarded as a much superior solution to MSVC++ and was used for many things, though mostly business apps.

I used to be a Delphi programmer and worked on, amongst other things, an open source video game project :) Ah, good times.


You are right regarding Delphi; I just omitted the part where Borland's tools used to get out of sync with the Windows SDK.

I stopped at Turbo Pascal for Windows 1.5, because by the Delphi 1.0 timeframe I was at university and wanted something UNIX-friendly, so I went C++.

I knew C and C++ from MS-DOS days, but never liked C compared to what Turbo Pascal could offer, hence C++.

Also, I eventually got fed up with writing Pascal wrappers for all the Windows APIs not provided by Borland.


Well, there's OpenJDK and now the CLR is open source, so companies that are heavily invested in the platforms can pay for people to work on them.

As for the languages and compilers, I'm less worried. There are already plenty of open source languages targeting these platforms, many of them better than Java/C#.


I've long wondered what's in it for Oracle and now Microsoft. Microsoft stands to profit from applications written for Windows that are hard to port, but they seem to be abandoning that approach. Oracle seems to have no real upside to maintaining Java. No one is paying for C# and Java directly, so I don't get it.


> Oracle seems to have no real upside to maintaining Java

Since the early 2000s, the majority of Oracle's GUI tools across their products have been written in Java.

Long before acquiring Sun, Oracle did acquire Bea.

Oracle still sells the Bea WebLogic JEE container and the JRockit JVM, which also has a real-time version.

They have their own JSF framework and IDE.

So becoming Java steward is not so strange.


I always believed Sun shot itself in the head with Java. They had this idea that getting people invested in a cross-platform ecosystem meant new software could be run on Sun hardware.

But what really happened is people who had been locked into Sun with legacy software wrote applications in Java and then switched from Sun to cheaper hardware running Linux or Windows.


They did have Microsoft removing their helmet and doing the reloading.

Sun called it: Mankind vs Microsoft [1]

It drained their energy and stole their focus.

It was the apex of Embrace, Extend, Extinguish

[1] http://www.theregister.co.uk/2004/04/03/why_sun_threw/

Interesting, from that 2004 piece:

> Microsoft's biggest global competitors are exactly as they were on Thursday: Nokia and Sony.

How different things looked then!


Similarly, IBM tried to make OS/2 a "better Windows than Windows" by supporting both OS/2 and Windows applications. This just encouraged application developers to write Windows applications because they would run on OS/2 and Windows.


Java is the corporate powerhouse, especially for banking. Together with the Oracle DB. Owning two out of three crucial components (the third being the hardware) creates a lot of good business opportunities for them and makes it easier to entrench a monopoly.


Oracle already released new SPARC processors that are optimized for the Oracle DB and Java: https://blogs.oracle.com/rajadurai/entry/sparc_m7_chip_32_co...


> Oracle seems to have no real upside to maintaining Java.

No, but they have an awful lot of downside from not maintaining it.

Consider the history of IBM, their grand strategy in the early 90s was that Smalltalk would be the enterprise language of the next few decades. The Java juggernaut rode roughshod over IBM. Or Microsoft, they would have loved to have kept Visual Basic and Visual C++ cash cows going forever, Java completely blindsided them. Altho' it may be technically less powerful than Smalltalk or C++, in terms of mindshare (or hype if you prefer) Java is too powerful for Oracle to risk falling into anyone else's hands.


Both companies have a lot of enterprise software that exposes either Java or .NET APIs, and lots of companies pay them for support and troubleshooting, and licensing those other products.


> No one is paying for C# and Java directly

No one you know.

Suppose some large and wealthy entity has trouble running some money-making legacy application that happens to be running atop the JVM. Who do you think they will call to fix their issues ASAP?

.. Which is why there's a lot of interest in having the people working on the JVM on your payroll ..


There are now many large companies participating in OpenJDK. Of them, IBM, AMD and Intel are contributing code (Google, sadly, not so much).


For Java there's always OpenJDK, which I believe is 100% open source.


Great to see AOT compilation coming to the reference JVM instead of relying on third parties.

Also the desire to go meta-circular and further reduce the amount of C++ code.

Most likely based on the Graal/SubstrateVM work.

Finally, JNI being replaced by something developer-friendly.


Lightweight threads (fibers) look especially interesting to me. I wonder if there is some library that already tries to get around threads with something more lightweight...


An interesting one is

http://docs.paralleluniverse.co/quasar/

They claim to have "true lightweight" threads. (I am sure pron, if he is around, can jump in and expand on it.)

It takes a lot from Erlang, even pattern matching.


Well, the pattern matching is only in the Clojure API, but yeah, these are true fibers based on continuations implemented with bytecode instrumentation, scheduled by a work stealing scheduler (or any other scheduler of your choice). It lets you write simple blocking code with all its familiarity and advantages (exceptions, thread-local data) but enjoy the performance of async code.

The downside is that libraries have to be integrated in order to block gracefully when called on fibers, but we already have a long and growing list of integration modules for popular libraries.
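
For reference, a minimal fiber sketch written from memory of Quasar's docs, so treat the package names and signatures as assumptions (you'd also need the Quasar java agent for the bytecode instrumentation):

  import co.paralleluniverse.fibers.Fiber;
  import co.paralleluniverse.strands.Strand;
  import co.paralleluniverse.strands.SuspendableRunnable;

  public class FiberDemo {
      public static void main(String[] args) throws Exception {
          Fiber<Void> f = new Fiber<Void>((SuspendableRunnable) () -> {
              // Reads like ordinary blocking code, but only suspends the
              // fiber; the underlying worker thread can run other fibers.
              Strand.sleep(100);
              System.out.println("hello from a fiber");
          });
          f.start();
          f.join();
      }
  }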


Assuming this is not rhetorical: yes, there is: http://www.paralleluniverse.co/quasar/


I would love to see threads based on this work (OS support for lightweight threads): http://www.linuxplumbersconf.org/2013/ocw//system/presentati...

It seems this would move most of the heavy lifting into the kernel (and out of the already-very-complicated JVM). Anyone know what happened to that work?


AFAIK, Google uses it (to some extent, at least), but the solution suffers from some issues that make it less than ideal. While it relinquishes scheduling to the application, stack management remains the kernel's responsibility, so while the task-switching overhead is greatly reduced, the threads are still not quite lightweight. The advantage of the approach is that it automatically crosses "language barriers", say, between the JVM and native code. This is very important for Google, who use C++ libraries as common code in their polyglot projects, but maybe not so important in other shops.



I love Akka, but I thought their implementation relies on regular JVM threads at the moment, doesn't it? (I wishfully thought at first that they were using user threads, but they are not like Quasar or goroutines in any way from what I read. Actors might reuse threads and seem like a "lighter thread", but they are still using plain old JVM threads, for now at least.)


Correct. Like goroutines, but even lighter, as they have no stack. Java threads map directly to OS threads, do they not?


I'm not sure I understand what distinction you are making. All of the examples you mention are built by reusing OS/JVM threads. That is how you have to implement user threads. How do you feel Akka differs from Quasar/Go (aside from the coroutine/actor distinction)?


Quasar uses bytecode instrumentation to do things you can't do with "regular" Java. How exactly can they do that when the JVM is the same JVM? Good question; I don't know. I do know that implementing user threads simply means keeping stack information, keeping a scheduler, and handling signals, without asking the kernel/OS for a native thread (I had to implement user threads in C in the past; not fun).

Maybe the JVM allows more fine-grained control over the PC location and allows it to "go to" different places in the code without invoking a native thread. But I know that when you create a Quasar coroutine / green thread / user thread / continuation, or whatever we want to call it, it's not the same as Akka.

In Akka, if you have a "conversation" between a million actors in a circle, you can still get 1 thread to be reused, if there is no need to use more than one. But if you try to do 3 things in 100% parallel, you might find yourself with 3 threads (from the little that I know).

In Quasar, you can do all that without invoking an OS thread; I think this is the main difference.

http://docs.paralleluniverse.co/quasar/



Oh please give me continuations in the VM and I will take back every angry thought I've ever directed at Oracle.


On slide 16, which talks about cache, there is this:

> Rule #1: Cache lines should contain 50% of each bit (1/0)
> – E.g., if cache lines are 75% zeroes, your D$ size is effectively halved

Can anyone explain this?


A cache line is the unit of reads from RAM, and cache lines are at a premium.

At least for data, a common win is to compress the data in ram and decompress once it is in-core; this is often essentially free as many machine learning algorithms are bandwidth starved but have plenty of compute available. Suppose you are eg storing small integer counts in ints; if it's a java int you are using 4B to store 1B, while if it's a java.lang.Integer it costs 16 bytes plus most likely an 8B pointer.

Another way to consider this is that if you are using 8B pointers, you waste a lot of that as constant zeros -- 1TB is 2^40, so even on a 1TB machine 3B/24b are wasted. Particularly with (all? at least that I know of) JVMs that have 8B alignment, another 3 bits are wasted. This is how the CompressedOops hack works to access 32G of RAM with 32b pointers.


I didn't fully get that. Sorry for the novice questions, but I wish to understand your statement completely.

>Suppose you are eg storing small integer counts in ints; if it's a java int you are using 4B to store 1B, while if it's a java.lang.Integer it costs 16 bytes plus most likely an 8B pointer.

Does this wastage happen only with reference types? If value types are created tomorrow, will this sort of issue get resolved?

>Another way to consider this is if you are using 8B pointers, you waste a lot of that as constant zeros

Why constant zeros? And in the line where you say "-- 1TB is 2^40", are you referring to 1TB of cache lines?


> Does this wastage happen only with reference types? If value types are created tomorrow, will this sort of issue get resolved?

That would solve most of the issue, yes.

> Why constant zeros? And in the line where you say "-- 1TB is 2^40", are you referring to 1TB of cache lines?

He was not referring to cache lines or anything like that. He was referring to the fact that if you can address just 40 bits, the upper 24 bits of a 64-bit pointer are wasted. I think the real virtual address size on x86 is currently 48 bits, so the upper 16 bits are "wasted". This wastage will occur with heaps larger than 32 GB when using the compressed-pointers JVM option, and of course above 4 GB (or 2 GB?) with that JVM option disabled.

Similarly, if you use the boolean type, it still takes a byte in current JVM implementations. So you could say 7 bits are wasted there.
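
If those wasted bits ever matter, you can pack the flags yourself; a small illustrative sketch (java.util.BitSet does essentially the same thing for you):

  final class PackedFlags {
      private final long[] bits;               // 64 flags per long

      PackedFlags(int n) { bits = new long[(n + 63) >>> 6]; }

      void set(int i, boolean v) {
          if (v) bits[i >>> 6] |=  1L << (i & 63);
          else   bits[i >>> 6] &= ~(1L << (i & 63));
      }

      boolean get(int i) {
          return (bits[i >>> 6] & (1L << (i & 63))) != 0;
      }
  }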


Thanks for the explanation. Looks like Java has more memory problems:

1. If the OOP pointer is going to eat up our memory, we lose/waste some bits of memory.

2. And as you have said, even using the boolean type wastes 7 bits!

Since we are dealing mostly with objects in the Java world, I guess 32-bit VMs will be much faster than 64-bit VMs. Or even 16-bit would be faster, from my understanding.

Still, I have a final question here: do all the OOP pointers go through the CPU caches and registers?


A 32-bit OOP is just multiplied by 8, because all objects have addresses divisible by 8. 4 GB * 8 = 32 GB.
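
In other words, roughly this (illustrative Java, not HotSpot's actual code; the real thing is the shift or base + index*8 addressing you can see in the listings below):

  final class CompressedOops {
      // Decode a 32-bit compressed oop into a full 64-bit address.
      static long decode(long heapBase, int compressedOop) {
          // Treat the oop as unsigned and scale by the 8-byte alignment.
          return heapBase + ((compressedOop & 0xFFFFFFFFL) << 3);
      }
      // Maximum reachable offset: 2^32 * 8 bytes = 32 GB above heapBase.
  }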

Example of generated code from some random web page that happened to be there:

http://cr.openjdk.java.net/~shade/8050147/rsp-minus-8.perfas...

  [0x7f189918ae07:0x7f189918ae54] in org.openjdk.VolatileBarrierBench::testWith

  #           [sp+0x30]  (sp of caller)
  0x00007f189918ade0: mov    0x8(%rsi),%r10d
  0x00007f189918ade4: shl    $0x3,%r10 
That "shl $0x3,%r10" multiplies the pointer in register r10 by 8 (2^3). So looks like JVM computes the base pointer to an object to a register (r10 in this case) and uses relative addressing from it.

Here's an example of using an x86 indexed addressing mode to multiply the pointer by 8:

http://shipilev.net/blog/2014/safe-public-construction/stead...

  0x00007fc7b4971644: mov    0xc(%r12,%r11,8),%r10d  ;*getfield instance
In this case the effective address for the load from the heap is r12 + r11 * 8. Register r12 probably holds the heap base pointer or similar in this case; I'm not sure.

16-bit would be significantly slower, but for entirely different reasons. Also, you could have at most a 512 kB heap with 8-byte address alignment...

The issue with 64-bit pointers is that Java (and the JVM) is a very pointer-happy language, so the cost is significantly higher than for pretty much any other language I can think of. Memory usage and more cache misses are the issue.


HotSpot compresses pointers on 64-bit systems to 32 bits by default.

https://wikis.oracle.com/display/HotSpotInternals/Compressed...


>>At least for data, a common win is to compress the data in ram and decompress once it is in-core

I have trouble imagining how it could be done. Would you mind elaborating and/or sharing some examples?


The insight is that on a modern processor memory bandwidth is scarce but it can issue several floating point ops per clock. So multiplying and adding stuff in L1 or already in registers is cheap. Streaming algorithms that walk large chunks of ram will be repeatedly memory starved, so anything you can do to effectively increase memory b/w is valuable.

Say you have 8-bit integers; store them packed in RAM, then unpack upon reading. So instead of storing an array of int[], you have an internal array of long[] and you read it with a function. Your memory read will suck in 8 at once.

The same technique works for floats or ints with a small-ish range and limited precision; you can store a scale and offset, then pack on write / unpack on read. It's common to be able to quadruple your effective memory bandwidth, and then the read operation -- i.e.

   // instead of:
   double[] _data;
   // accessed as
   _data[idx];

   // instead you do
   getd(idx);

   // using the below
   byte[] _mem;            // packed 1-byte values
   float _bias, _scale;    // dequantization parameters
   double getd(int idx) {
      long res = _mem[idx];            // one byte read from the packed array
      return (res + _bias) * _scale;   // rescale entirely in registers
   }
executes entirely from registers, and is essentially free. The price of all this is you have to process your data on ingestion, but if you run iterative algorithms -- like convex optimizers -- that repeatedly walk your entire dataset, this is often a big win. You can often lose some of the low precision bits on the float or double, but those probably don't matter much anyway.
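
The ingestion side of that scheme might look roughly like this, reusing _mem, _bias and _scale from the snippet above (a sketch; the rounding and clamping policy is up to you):

   // Inverse of getd(): quantize a double into one byte on write.
   void putd(int idx, double v) {
      long q = Math.round(v / _scale - _bias);     // undo scale and bias
      if (q < Byte.MIN_VALUE) q = Byte.MIN_VALUE;  // clamp to byte range
      if (q > Byte.MAX_VALUE) q = Byte.MAX_VALUE;
      _mem[idx] = (byte) q;
   }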

Like anything else, you'll have to measure.


For example, like what the instructions below do. This is pretty standard nowadays when high performance is desired.

Here are some specific instruction examples of how it's done.

Bit field operations; Bit field extract, etc.

http://docs.oracle.com/cd/E36784_01/html/E36859/gnydm.html#s...

Parallel bits extract / deposit.

http://docs.oracle.com/cd/E36784_01/html/E36859/gnyak.html#s...

SIMD (AVX2 in this example link); Pack, unpack, shuffle, permute, broadcast, etc.:

http://docs.oracle.com/cd/E36784_01/html/E36859/gntae.html#s...


If you build your own hardware, you don't even have to decompress it. Let's limit each process to 2^40 bytes of RAM. Build your virtual address space so that it masks out the top 24 bits of each address before doing address translation. Then, you can store 24 bits of extra information in each pointer. For example, you could store 3 bytes of a C++ object in a vtable pointer (with 'interesting' repercussions for the sizeof and offsetof macros)

That is more or less what the original Macintosh did with its handles. The hardware wrapped addresses around at 24 bits. That allowed Apple to store 3 bits of data in each handle ('data can be purged from memory if needed', 'data cannot be moved', 'data is read from a resource'). Third parties sometimes used the other 5 bits. With the introduction of Macs with more memory, that led to the "program has special memory requirements" system error, which actually meant "this is Excel version so and so. It cannot run on a machine that has 32 address bus lines because it mangles pointers"


If your cache line is 75% zeroes, then half of it respects the objective of "should contain 50% of each bit", and the other half is all zeroes, which is a waste (according to this objective)... In other words, only half the cache is "correctly" used. This is just another way of saying that your D$ size is effectively halved.


Is there a video?



