Typical problem of over-tuning. Not every tunable needs to be tweaked (or, worse, cut'n'pasted from somewhere on the internet). Your street cred as an admin does not depend on how many settings you can change. If they had just kept using the defaults, everything would have been fine.
That said, 250MB is really a lot of code. Their binaries must be gigantic. This by itself likely already causes performance problems, because all the caches will be thrashing.
Yes, but he said it was previously only 50MB. That 250MB contains a lot of duplicated compiles: tiered compilation can generate not just two but, I think, up to 5 different compiles of the same method! I think code cache management in HotSpot is not that great: eventually the server would probably converge on somewhat more than 50MB due to the more aggressive inlining, but 5x suggests the low-value compiles aren't being properly discarded.
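If you want to see what the cache actually converges on, one option (a sketch, assuming HotSpot; app.jar is a placeholder) is to have the JVM print a code cache summary when it exits:

    java -XX:+PrintCodeCache -jar app.jar    # prints code cache size/used/free on JVM exit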
To the author: It would be useful to put a little blurb paragraph at the top to say that if you haven't overridden the default codecache size via JVM options, no action is necessary. This information is buried way down at the bottom of the page, which may lead hurried readers to assume that something may already be wrong with their Java 8 JVMs.
Would a good approach to major JVM upgrades in production be to remove all but critical JVM flags in the new version, and run it on a subset of nodes, then?
After getting a baseline measurement of performance with JVM defaults, experiment with tuning a couple of settings, measure some more, and make sure nothing breaks under load. Repeat until the new JVM version seems stable / performs better / etc. even under worst-case load, and then upgrade all nodes to the latest and greatest?
Yes. The good approach when you're given a legacy Java application with tons of flags is to remove ALL the optimization flags for the heap and the GC.
Most of the settings found in legacy projects or on the internet are either obsolete or defaults in the latest version of the JVM.
The only mandatory heap flags are "-Xms???M -Xmx???M", which set the heap size of the application to ??? megabytes.
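For example (a sketch; the 4096M value and app.jar are just placeholders):

    java -Xms4096M -Xmx4096M -jar app.jar    # heap pinned at 4096 MB, no resizing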
I wonder how many projects have production issues because they are not able to hit expected production volume in a QA environment? This is certainly the cause of a lot of grief I've experienced.
For us, it's not production volumes that are the main issue but production dynamics, i.e. how the volumes change over time. It's such a big issue that standard practice is to run QA against live production data by default and manage the consequences.
The downside is that when we do need to use test data, e.g. when there's a large change to the structure of the upstream or we want to run stress scenarios, it's a pain in the backside.
We struggle with this too. There are so many ways the dynamics can change--daily/weekly/seasonal cycles, robots, client caching, etc. One best practice we use for high-volume services is to minimize the variance in call mixtures--if there are two calls with vastly different call patterns, it's probably worth it to split them into separate services, so you can tune throttling, GC, load balancing, etc. specifically for those calls, instead of having to tune them to support both calls (which is often difficult or impossible to do). Of course, it's hard to predict how your service will evolve over time, so making the split is often painful for you and your clients. Some of our services can't be handled by a single load balancer, so we use DNS round robins, which have a whole other class of problems when you have mixed call patterns. Gotta earn your pay...
Some other techniques we use are one-box deployments that receive a proportion of production traffic and "bake" new changes before deploying to the whole fleet, and shadow fleets which let you tune and test against live traffic. We've found that simply replaying production traffic at higher volumes sometimes isn't sufficient, because our calls don't necessarily scale that way (some of them scale with upstream traffic, some of them scale by downstream fleet sizes due to client caching).
How much Java source code does it take to need 256MB of codecache? The author says they're using a service architecture where each transaction uses about 20 services. There's no indication of why their program is so huge.
Author here. We know 128MB codecache was not enough and 256MB was sufficient. I think we could have gotten by with less, e.g. 200MB, but we stopped experimenting once we found a value that worked.
We don't have a complete understanding of why this app uses so much more codecache than other apps we've switched to Java 8. Now that we expose the codecache size in Datadog, I may try plotting code size vs codecache size for a variety of our apps.
Datadog employee here: you don't need to run your own collector to collect codecache usage. Since it's exposed through JMX, the JMX collector can collect it (http://docs.datadoghq.com/integrations/java/). It doesn't by default, though, so you'll need to configure it to collect the metrics from the "java.lang:name=Code Cache,type=MemoryPool" bean. We should add that to our default configuration.
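If you'd rather sanity-check the same numbers from inside the JVM, that bean is just the standard memory pool MXBean; a minimal sketch (assuming Java 8, where the pool is literally named "Code Cache"):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;

    public class CodeCachePrinter {
        public static void main(String[] args) {
            // Same data the "java.lang:name=Code Cache,type=MemoryPool" bean exposes over JMX
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if (pool.getName().contains("Code Cache")) {
                    System.out.printf("%s: used=%d max=%d%n",
                            pool.getName(), pool.getUsage().getUsed(), pool.getUsage().getMax());
                }
            }
        }
    }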
Codecache is for compiled code, so it doesn't necessarily correlate to the original program's source code size. You can have the same methods inlined in many places, load the same classes in multiple class loaders (which means they are separate classes to the JVM), generate code at runtime, etc.
For example, Presto is a SQL query engine that generates code for each query (a SQL query is effectively a program), so it can need a lot of codecache depending on the query rate and concurrency.
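To make the class loader point concrete: loading the same class through two independent loaders gives the JVM two distinct classes, and each gets its own compiled code. A sketch (the classpath URL and class name are hypothetical):

    import java.net.URL;
    import java.net.URLClassLoader;

    public class TwoLoaders {
        public static void main(String[] args) throws Exception {
            URL[] cp = { new URL("file:/tmp/classes/") };   // hypothetical classpath entry
            ClassLoader a = new URLClassLoader(cp, null);   // null parent: no delegation,
            ClassLoader b = new URLClassLoader(cp, null);   // so each loader defines its own copy
            Class<?> c1 = a.loadClass("com.example.Foo");   // hypothetical class name
            Class<?> c2 = b.loadClass("com.example.Foo");
            System.out.println(c1 == c2);                   // false: two separate classes to the JVM
        }
    }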
Now that's something to look for. If something is generating code for each transaction, and that code shares a cache with other, more permanent code, cache thrashing is possible. If the cache management favors new code over old code, old code is likely to be pushed out as the cache fills up with query code. Then the old code gets recompiled the next time it's needed. Could that be why so much time is being spent in the JIT compiler?
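One way to check (a sketch; HotSpot flags, app.jar is a placeholder, and -XX:+CITime availability may vary by build):

    java -XX:+CITime -jar app.jar              # prints cumulative JIT compiler time at exit
    java -XX:+PrintCompilation -jar app.jar    # logs every compile; the same methods showing up repeatedly is the tell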
> You can have the same methods inlined in many places
Isn't the JVM more conservative about inlining already inlined methods? I can't find a cite right now, but I swear I've seen this when inspecting the compiler output.
Yes,
if the method was already compiled into a big or medium-sized method (in terms of assembly code), the JIT will not try to inline it again; otherwise you'd get too much code duplication, and nobody wants extra cache misses from the instruction cache.
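You can watch those decisions with the diagnostic inlining trace (a sketch; app.jar is a placeholder):

    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:+PrintCompilation -jar app.jar
    # messages along the lines of "already compiled into a big method" show the case described above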
So the unspoken thing seems to be that they kept the same JVM arguments as Java 7, and that caused problems in Java 8? And they had a JVM argument that was setting a value to the same as the default?
We knew to adjust other JVM settings (e.g. PermGen replaced by Metaspace) but we overlooked that (1) we used a non-default max codecache size and (2) the default for that setting was 3x higher for Java 8.
An unfortunate thing about using non-default JVM settings is that you need to scrutinize them every time you switch Java versions. If I ever go through this again, I will know to pay more attention to every JVM setting we override.
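One habit that helps (a sketch; the paths stand in for whichever launchers you have installed): dump the effective flags on both versions and diff them, so an overridden value that used to be an improvement stands out once the new default passes it:

    /path/to/java7/bin/java -XX:+PrintFlagsFinal -version | grep -i codecache
    /path/to/java8/bin/java -XX:+PrintFlagsFinal -version | grep -i codecache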
-XX:-TieredCompilation is the magic option to disable tiered compilation.
Beginning with Java 8, instead of having the VM magically choose between the client JIT (c1: think V8) and the server JIT (c2: think gcc -O2), the default configuration is to run in so-called "tiered mode": it first starts with the interpreter (as usual), then c1 (which also keeps profiling info here), then c2. Because the code is compiled twice, you need a code cache that's twice as big.
From my own experience, tiered compilation is nice when you run something interactive like an IDE (IntelliJ IDEA) and useless when you run a server app.
That said, I've never had to have a 250MB code cache.
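For reference, the two knobs involved (a sketch; app.jar is a placeholder, and 256m just mirrors the value the author mentions):

    java -XX:-TieredCompilation -jar app.jar              # old behaviour: interpreter + c2 only
    java -XX:ReservedCodeCacheSize=256m -jar app.jar      # keep tiered mode but give it a bigger code cache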
If the JVM has one bytecode compiler that it uses only during startup and another that it uses the rest of the time, shouldn't it dump the bytecode generated by the first compiler from the cache at the point when it switches to the second compiler?
The article's wording is a bit misleading here. The first-tier and second-tier compilers are always available throughout the entire lifetime of the program. The first-tier compiler performs very few optimizations, but compiles very quickly. It is a simple and literal compiler, essentially the most direct translation of bytecode to machine code. It's meant to be a quick win to get out of interpreted mode, which is vastly slower. The second-tier compiler is slower, but also takes advantage of the data collected during runtime profiling to make optimizations. Basically, there's a trade-off between the time it takes to run the second-tier compiler and the time it will save. By comparison, the first-tier compiler is almost always a win over interpreted code, even if the code is only run a small number of times.
Now, sometimes optimizations can be "wrong", in the sense that something new has happened that invalidates an assumption used during second-tier compilation. Here's a real-world example: I have an interface and only one loaded class that implements that interface. The second-tier compiler will use this information to basically do this:
    void doStuff(MyInterface a) {
        if (a.getClass() != MyClass.class)
            deoptimize();
        // Do everything from here on assuming
        // that a is an instance of MyClass.
        // This includes inlining simple getter
        // methods to pure field accesses, etc.
    }
Now, when a second class that implements `MyInterface` is loaded and makes it to that function, it will hit the `deoptimize` branch and go back to first-tier or even interpreted mode. Eventually the function will be recompiled by both tiers with the new assumption -- two implementing classes.
So, in the case of deoptimization, it can be a win to keep the first-tier code to fall-back to.
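If you want to watch that happen, here's a minimal sketch (hypothetical class names) that usually reproduces it; run it with -XX:+PrintCompilation and look for doStuff being marked "made not entrant" once the second implementation appears:

    interface MyInterface { int value(); }
    class MyClass implements MyInterface { public int value() { return 1; } }
    class OtherClass implements MyInterface { public int value() { return 2; } }

    public class DeoptDemo {
        // The call site the JIT will specialize while MyClass is the only implementor it has seen
        static int doStuff(MyInterface a) { return a.value(); }

        public static void main(String[] args) {
            long sum = 0;
            MyInterface first = new MyClass();
            for (int i = 0; i < 1_000_000; i++) sum += doStuff(first);   // warm up: compiled assuming MyClass only
            MyInterface second = new OtherClass();
            for (int i = 0; i < 1_000_000; i++) sum += doStuff(second);  // invalidates the assumption, triggers deopt
            System.out.println(sum);
        }
    }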
I think you meant the machine code. The Java compiler produces bytecode from Java source. The C1 and C2 bytecode compilers in the JVM convert bytecode to machine code.
Yes, I agree if C2 recompiles something C1 already compiled, the C1-compiled machine code should be freed from the codecache.
It will be freed, but when c2 has emitted the new machine code you can still have frames on the stacks of user threads that are executing the code compiled by c1.
Once no threads are using the c1-generated code any more, it will be swept from the code cache.
Yikes! Can somebody with deep AWS knowledge comment on how this impacts AWS Lambda (Java) and AWS Elastic Beanstalk applications? What configuration is recommended?