Inside the JVM: Arrays and how they differ from other objects (oracle.com)
167 points by ternaryoperator on Aug 24, 2023 | 124 comments



> (Note that the comma after the last value is accepted in Java and won’t cause an error.)

How have I been programming in Java for 15 years and only discovered this today?

On a more serious note, I prefer these kinds of posts to the traditional "Look at this shiny new thing" posts, because without the fundamentals the next big thing can't be built, and this kind of information usually makes you a slightly better programmer.


Likely because it's inconsistent with all other parts of the language where trailing commas are not allowed


Indeed. Whereas JavaScript and Python even allow a trailing comma in function parameters:

    function foo(x,) {}
    
    def foo(x,): pass
Before anyone questions why anyone would do this, it's for putting arguments (or array elements) on separate lines:

    function foo(
        a,
        b,
        c,
    ) {}


Although be careful with that in Python because there are other situations where (x) and (x,) are both legal but mean very different things.


In Python the comma acts as a tuple creation operator inside parentheses with no other context like function invocation.

Also, last I remember in JS a trailing comma meant the creation of an extra array element that’s null so I became conditioned to only use trailing commas for separators in languages where it’s consistent and specified very explicitly like in HCL


Tuple creation and destructuring does not require parentheses in Python.

     w   =  0,   # w == (0,)
    (x)  = (0,)  # x == (0,)
     y,  =  0,   # y == 0
    (z,) = (0,)  # z == 0
Though in Rust, parentheses are required for tuple creation and destructuring.


Trailing commas in JS don't do that, but having more commas than values does. (eg. [1,,].length === 2)


hey now, Enum allows trailing comma before semicolon, and ever since discovering that, I've been putting extra comma at the end of last Enum value and semicolon on new line.

Why? Because whenever there's a new value added, git diff shows single line change :)
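For anyone who wants to try it, here is a minimal sketch of the two places mentioned in this subthread where Java accepts a trailing comma: array initializers and enum constant lists. The class, enum, and method names are made up for illustration.

```java
// Sketch only: TrailingCommaDemo, Color and primes() are hypothetical names.
final class TrailingCommaDemo {
    // Enum constants may end with a comma; the semicolon can sit on its own
    // line, so adding a constant later is a one-line diff.
    enum Color {
        RED,
        GREEN,
        BLUE,
        ;
    }

    // The comma after the last element of an array initializer is ignored.
    static int[] primes() {
        return new int[] {
            2,
            3,
            5,
        };
    }

    public static void main(String[] args) {
        System.out.println(primes().length);       // 3
        System.out.println(Color.values().length); // 3
    }
}
```

Neither trailing comma adds an element; both counts stay at 3.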


Read a book on preparing for Java SE Certification, the exam is full of questions about things like this


I believe that comma thing was added recently.


Yup, as recent as 1996:

"A trailing comma may appear after the last expression in an array initializer and is ignored"

https://titanium.cs.berkeley.edu/doc/java-langspec-1.0/10.do...


The comma thing is inherited from C and has always been there. Newer bits of array-like syntax, like varargs and array literals in annotations, do not allow trailing commas.


I have this feeling the C comma operator was an accident from making function calls work in the beginning. After doing some hobby languages of my own I have come to realize many features are kinda automatic and accidental when writing languages, in a way other programming almost never is.


You may be thinking of javascript, which has added various kinds of trailing comma over the years, though arrays had them from the start. See https://stackoverflow.com/a/67793166


[flagged]


Did you ask yourself this question before you wrote this comment?


To answer your question: no, I didn't. It doesn't make sense to—I'm not here saying anything like, "I believe it's because X" without knowing whether X is true or not. If you think otherwise, explain how.

(I have now answered your question. If you respond, how about doing so in the form of an answer to mine—and not evading it with another question?)


I bothered to comment because I have been writing Java for 15 years and I thought that my knowledge was relevant to this topic. I’m not going to check every comment of mine against the reference documentation. In this particular case my memory failed me and I was quickly corrected and downvoted, so no harm was done, I guess, other than a few people spending a few minutes, for which I feel sorry, but not very much.


In my experience, just a little bit of insider knowledge goes a long ways to making better code. Arrays are fun things, especially when you do a deep dive into the System.arraycopy() function. But the same goes for all Collections in Java. For instance, most of them have a default size (mostly 10), and growing them is a costly operation. So knowing beforehand how large your collection can or may be, can benefit code. I could use this effectively when working with large document XML parsing.

I recommend everyone that uses a managed language (Java, C# or others) to at least get a basic understanding of these fundamentals. And also know which collection type to use when.


> For instance, most of them have a default size (mostly 10), and growing them is a costly operation. So knowing beforehand how large your collection can or may be, can benefit code.

It's really a tricky balance. Over-allocating collections "just in case" can quite often be very expensive as well, since large array allocations tend to be fairly slow (since e.g. they typically won't fit in the TLAB).


It's one of those things where you usually have to let profiling and other observations guide your approach. 99.9% of the time it doesn't really matter and the default behavior is fine. But I can think of a few times where this has been a big deal.

One in particular - I was profiling an application with low-latency needs and GC was taking up a ton of time. Mission control showed tons of allocations of arrays - at one point it was creating a bunch of lists in a loop and adding stuff to them, which triggered creating a new underlying array. We found that a) Many of the arrays were just over the first resizing size, and b) There was a good heuristic that we could use to give them an initial size that would never have to be expanded and wouldn't result in huge amounts of waste.

This had a pretty dramatic effect on our GC times and the overall latency. I think this is where the JVM really shines - tons of tooling to help you profile and observe these kinds of details to help you figure out when you actually need to care about stuff like the initial array capacity.
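A minimal sketch of the kind of change described here, assuming the input itself gives a usable upper bound. The class and method names are hypothetical, and whether this helps at all is something only profiling can tell you.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: PresizedListDemo and collect() are hypothetical names.
final class PresizedListDemo {
    // Copies the input into a list whose backing array is sized once, up front.
    static List<Integer> collect(int[] input) {
        // input.length is a known upper bound, so the backing array never
        // has to grow while elements are added.
        List<Integer> out = new ArrayList<>(input.length);
        for (int value : input) {
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(new int[] {1, 2, 3}).size()); // 3
    }
}
```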


Depends a lot on what you're doing too. I do a fair bit of heavy data processing work with my search engine (tokenizing something like a billion documents into arrays of words etc), and allocator contention has a pretty huge performance impact for that type of work.

My intuition is that the best thing is to aim for the expected median size, rather than the maximum as one might assume would be the most performant. The maximum strategy minimizes re-allocations, but at the expense of always making costlier allocations.


I think it depends a lot on the other details, especially how expensive the extra GC will be vs the wasted space. Hard to give a rule that will work in all contexts.

In our case, it wasn't a single hard-coded number - the input data gave us the upper bound, and the difference between the upper bound and the median case was so small that going with the upper bound worked out best.


> It's really a tricky balance. Over-allocating collections "just in case" can quite often be very expensive as well

It is sometimes really tricky. When I worked with streaming XML documents that were gigabytes in size, there is a really fine margin you have to work with.

However, some general knowledge can be pretty useful. I saw colleagues just do "= new ArrayList<>(1000);" without considering the collection type or possible size. And besides being a bit ignorant, it can also be really confusing for other developers who take a first look at such code.


> TLAB

TLB?


The TLAB is the Thread Local Allocation Buffer.

In short and a bit simplified, normally when you allocate memory, the allocator needs to synchronize between threads because RAM is a shared resource. This means that a thread that allocates a lot can disrupt the performance of other threads, among other weird effects. But there's a small buffer called the TLAB owned by each thread where this isn't true: Allocation in the TLAB doesn't require synchronization. The TLAB makes allocating small ephemeral objects much faster.


This is a good explanation. See also Shipilev's JVM Anatomy {Park|Quark} episode: https://shipilev.net/jvm/anatomy-quarks/4-tlab-allocation/


Thread Local Allocation Buffer


I cringe every single time I see a for loop for what System.arraycopy () has been providing since early days.

For better or worse, it shows me that the author isn't that into Java.


I cannot for my life remember the argument order, so I write the manual code and let IntelliJ convert it.


Doesn't autocomplete show the arguments? I usually use NetBeans when I write Java, so no idea if IntelliJ is just that bad.


IntelliJ definitely shows the arguments lol


It shows the argument names and even highlights the current one where the cursor is. But sometimes my thought process is just different.


In Java, it's src array, src offset, dest array, dest offset, length. It's a natural order of from, to.

It's C memcpy() that's the odd one out by putting the destination before the source.
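For reference, a small sketch of the Java argument order described above. The array contents and the names are arbitrary, chosen only for illustration.

```java
import java.util.Arrays;

// Sketch only: ArrayCopyDemo and copyMiddle() are hypothetical names.
final class ArrayCopyDemo {
    static int[] copyMiddle() {
        int[] src = {10, 20, 30, 40, 50};
        int[] dest = new int[5];
        // Order is (src, srcPos, dest, destPos, length):
        // copy 3 elements starting at src[1] into dest[2], dest[3], dest[4].
        System.arraycopy(src, 1, dest, 2, 3);
        return dest;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(copyMiddle())); // [0, 0, 20, 30, 40]
    }
}
```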


memcpy argument order matches the left-to-right arrangement of assignment. lhs=rhs is rhs is copied to lhs. memcpy(lhs,rhs) is the contents at rhs is copied to lhs.


British people generally don't, but Americans very often use "to...from".


They do, and it's jarring!

Even though I find memcpy and friends to be perfectly logical using the assignment analogy suggested by thwarted, I often need to re-read english sentences written that way.


> I cringe every single time I see a for loop for what System.arraycopy () has been providing since early days.

The worst thing is, that System.arraycopy() is an optimized JNI call which is much faster than copying it by hand [1].

> For better or worse, it shows me that the author isn't that into Java.

The thing is though, most of the time arrays in Java are used because of performance. Or maybe ignorance. Because why would anyone voluntarily give up all the comforts of a List<T>? It's not that Collections are very hard to find in the documentation. And most of the time IntelliJ suggests switching to a Collection anyway.

1. https://www.javaspecialists.eu/archive/Issue124-Copying-Arra...


Or it might be that the person has used multiple programming languages, across which the order/meaning of copy arguments varies a lot, and thus prefer to not remember the decision of each language (if not for writing (at which point the IDE could help), then for reading). Whereas a loop is always easy to read and write equally in all languages, and it's really not unreasonable to expect it to perform well enough (if not as good as System.arraycopy, then at least good enough to be insignificant compared to the actual important logic in the code).


Given that I have programmed in dozens of languages since 1986, and have to jump between C#, Java, C++, TypeScript, Transact-SQL and PL/SQL for work, plus whatever is needed to keep the customer happy, that isn't an argument I would sympathise with in code reviews.


Yet, somehow I doubt you write perfect code in all those languages. Do you cringe at yourself and conclude you just don’t care also?


No I don't, and when someone cringes looking at my code, I shut up, apply the fix and get to improve my skills on the language, instead of excusing myself.


So you agree you should be judged as someone who doesn’t care in those cases also, right? You didn’t mention anything about people excusing themselves initially, just that you judged them. I just hope you hold yourself to the same standard.


So is life, made up of judgments, not always fair.


There are more options than "write perfect code" and "not even attempt to learn how to write idiomatic code".


Yeah, of course. But maybe you should just be happy to share knowledge with those lacking it, rather than cringing and making some kind of personal judgement of them.


I agree. Understanding the inner workings of languages and their runtimes is IMHO what gets you one step closer to senior. Luckily, early in my career I had a few seniors on the team who knew a lot about Java and shared their knowledge about its behavior.


Does anyone have any good book recommendations or links for insider knowledge of the JVM/Java? If Clojure focused all the better :)


There is JVM Anatomy Quarks.

I can also recommend reading the JVM specification itself. It is surprisingly less dry than one might think; it’s no novel, but it’s a good read. Oh, and of course anything written by Brian Goetz, usually about some new feature.


Maybe it's not really up your alley. But I learned Java with Java in Action with BlueJ [1]. Although it's pretty basic, the textbook really explains all the Java (and OOP) basics in a pretty clear way. The book is called Objects First [2].

In addition, I really enjoyed exploring the JDK documentation. Especially Java <1.7 is extremely manageable. Java 8 introduced NIO and lambdas, which make Java way more fun, but also a tad harder to learn.

It's not exactly JVM, but just wanted to share anyway :).

1. https://www.java.com/en/java_in_action/bluej.jsp

2. https://www.bluej.org/objects-first/


The default size of an ArrayList has been 0 for a while. On the first insertion, it is initialized to 10.


That's a bit semantic, isn't it? Because in practice it's still 10, but lazily initialized [1]. And an empty ArrayList is useless anyway.

1. https://stackoverflow.com/a/34250231


Things I learned by reading this post

* Java arrays can have 0 dimensions

* When declaring arrays, trailing comma is allowed after the last element

* In multi-dimensional arrays, only the last dimension contains actual values. Other dimensions are just pointers to arrays.

Good read.


> Java arrays can have 0 dimensions

The way you've phrased it is a bit ambiguous.

Having 0 dimension sounds like

    x = new int[7][5]; // 2 dimensions
    y = new int[9];    // 1 dimension
    z = new int;       // no dimensions, not an array, not allowed
The phrasing you mean is "dimensions can have zero size".


Unfortunately the article author uses the exact same terminology:

> (It’s somewhat counterintuitive that the zero dimension is not the first one in the array.)


No, this is a quirk of the English language that I'm not sure how to describe generally. Best I can do:

"Dimensions" plural agrees with "zero" as a count, so that reads as "the count of dimensions is zero".

But they used singular "dimension" with "the" which treats "zero" as an adjective. It means "the dimension that is zero".


> "Dimensions" plural agrees with "zero" as a count, so that [always?] reads as "the count of dimensions is zero".

Out of curiosity, how do you rate each of:

"Arrays can have a zero dimension."

"Arrays can have two zero dimensions."

"Arrays can have some zero dimensions."

"Arrays can have zero dimensions."

in this regard?


The first three I read like "__ dimensions that are zero", the last one useless/borderline gibberish - it sounds like "isn't that just an int, not an array of ints?"

So yeah, my description above is a bit off, I guess it's the articles that do it? Is "some" considered an article?


> "isn't that just an int, not an array of ints?"

A two-dimensional (int[x][y] not int[]...[2]...[]) "array of" int isn't really an array of ints either - it's an array of arrays of int. So a zero-dimensional "array of" int is an int.

I'd call the last one ambiguous: it's either a group of (dimensions that are zero), or a (group of dimensions) that is [of group size] zero, and the text doesn't provide enough information to know which without context, just like "I read books." when you don't know whether someone's talking about their hobbies or what they did over the summer.


Thanks for the clarification.


Arrays cannot have zero dimensions. They can have zero-sized dimensions.
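A quick sketch of the distinction, with hypothetical names: a dimension may have size zero, while the number of dimensions is fixed by the type and is never zero.

```java
// Sketch only: ZeroSizeDemo and its methods are hypothetical names.
final class ZeroSizeDemo {
    // One dimension, of size zero: a perfectly valid, empty int[].
    static int[] empty() {
        return new int[0];
    }

    // Two dimensions; the second has size zero, so there are
    // three rows and each row is an empty int[].
    static int[][] threeEmptyRows() {
        return new int[3][0];
    }

    public static void main(String[] args) {
        System.out.println(empty().length);              // 0
        System.out.println(threeEmptyRows().length);     // 3
        System.out.println(threeEmptyRows()[2].length);  // 0
    }
}
```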


> (It’s somewhat counterintuitive that the zero dimension is not the first one in the array.)

I must've read this sentence at least 7 times, but don't understand what this means. Can anyone illuminate?


The article is very confused in general about multidimensional arrays (which are really just arrays of array references in Java). It’s badly written and IMO doesn’t deserve to be on the HN front page.

The author seems to have expectations that Java multidimensional arrays violate, and seems to assume the reader would also have those expectations, but they just seem confused to me.

Except for TFA’s mention of the bytecode instructions and trailing comma, the classic tutorial article is much better: https://docs.oracle.com/javase/tutorial/java/nutsandbolts/ar...


To elaborate further, consider this quote from TFA:

For example,

strangePoints = new int[3][4][0][2]

In this declaration, all dimensions after the zero-size dimension are ignored. So, the result of this declaration is equivalent to a two-dimensional array of ints.

This is just plain wrong. It’s a four-dimensional array of ints, just one that cannot contain more than zero ints, because one of the dimensions is zero.

To illustrate,

    int[][] a = strangePoints[0][0];
will typecheck while

    int b = strangePoints[0][0];
will not (which however it should if the author’s claim that strangePoints is a two-dimensional array of ints was true).

The talk about Object having no size() method and arrays having therefore a length field is also confused. Arrays have distinct types (and classes, in the sense of getClass()), and therefore could very well have a size() method. It’s merely a stylistic choice of Java that they opted for the simpler .length syntax.

The article is so misguided as to be harmful.


Author here. Your "correction" is wrong. All dimensions after the zero dimension are ignored by the JVM. This is explicitly stated in the JVM spec[0]. The typecheck you're leaning on is a purely syntactical construct. Inside the JVM, the arrays function and are sized just as I described.

> Object having no size() method and arrays having therefore a length field is also confused. Arrays have distinct types (and classes, in the sense of getClass()), and therefore could very well have a size() method. It’s merely a stylistic choice of Java that they opted for the simpler .length syntax.

What's the "confused" part? Again, my description is accurate and you're simply saying that Java could have chosen a different way to do the same thing, but decided not to.

[0] https://docs.oracle.com/javase/specs/jvms/se16/html/jvms-6.h...


> All dimensions after the zero dimension are ignored by the JVM. This is explicitly stated in the JVM spec[0].

What is stated is: “If any count value is zero, no subsequent dimensions are allocated.” This is mostly just for clarification, because what could the alternative possibly be? It is equivalent to saying that count subarrays are allocated, which when count is zero, of course means that none are allocated. Again, what would the alternative possibly be?

Furthermore, strangePoints.getClass() results in `[[[[I` (so it’s not just syntactical, it’s the actual runtime type), i.e. a four-dimensional array of ints, and not a two-dimensional array of ints as the article claims.

What is true is that what is allocated is a two-dimensional array, but it is an array of empty arrays, not of ints. For example, the expression strangePoints[0][0].length is valid (and yields 0), whereas it wouldn’t be valid for a two-dimensional array of ints. And again that’s on the JVM level, because otherwise the ARRAYLENGTH operation wouldn’t work here. Furthermore, strangePoints[0][0].getClass() of course is valid as well and yields `[[I`, showing that the element type of the two-dimensional array is another two-dimensional array (of ints), and not int.

> What's the "confused" part?

The confused part is this: “Many Java collections have a method called size(), which returns an integer stating the number of elements in the collection. Arrays have no such method. There are several reasons for this, but the principal one is that arrays are simple Object instances—they are not collections. The Object class has no size() method, so arrays don’t either.”

There is no reason why array objects couldn’t have a size() method even though Object doesn’t. Invoking getClass() on an array doesn’t return Object.class, but a subclass of that, and those subclasses could have the additional size() method.

This is also wrong: “Eagle-eyed readers of my earlier statement about length being a field rather than a method call might wonder how a direct subclass of Object would have a field called length to begin with, as Object has no such field.” Arrays don’t have a length field, as calling .getClass().getFields() (or getDeclaredFields()) on an array shows. The .length property is mere syntax, much like .class on an object type. But, again, Java/the JVM could have chosen to imbue array classes with a fully-fledged size() method.
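The runtime checks mentioned in this comment can be reproduced with a short sketch like the following. The class and method names are made up; the reflection calls are the standard `Class.getName()` and `Class.getFields()`.

```java
// Sketch only: StrangePointsDemo and make() are hypothetical names.
final class StrangePointsDemo {
    static int[][][][] make() {
        return new int[3][4][0][2];
    }

    public static void main(String[] args) {
        int[][][][] strangePoints = make();
        // Runtime class of the whole thing: four levels of array over int.
        System.out.println(strangePoints.getClass().getName());       // [[[[I
        // Each element two levels down is itself a (here empty) int[][].
        System.out.println(strangePoints[0][0].getClass().getName()); // [[I
        System.out.println(strangePoints[0][0].length);               // 0
        // Reflection does not report a "length" field on array classes.
        System.out.println(int[].class.getFields().length);           // 0
    }
}
```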


> What is stated is: “If any count value is zero, no subsequent dimensions are allocated.” This is mostly just for clarification, because what could the alternative possibly be?

The key part comes a bit later:

> The components of the last !!allocated!! dimension of the array are initialized to the default initial value (§2.3, §2.4) for the element type of the array type.

Since the last allocated dimension is the one that has zero count... well, that means each component (of which there are none) is initialized to a value of type `int`, so technically speaking, the type of each component (of which there are none) is `int`. Due to the rest of the spec, that means the type of the dimension before that is `int[]`; and that's how you arrive at the conclusion of TFA.

It's quirky, and a semantic argument at best, and certainly not worth fighting over here on HN.


Yeah this was the comment I was coming to look for. I saw a couple sentences that made me go "hmm" but the one that I copied into my clipboard to post about was:

"Arrays, however, have a property called length, which can be queried to get the number of elements in the specified array."

This is not correct and causes bugs. Length is the length of the array, not the number of elements within it. Maybe the author understands this, but precision of language is important here and as-written this is either ambiguous at best or incorrect at worst.

Also the entire section for "Array size and the concept of arrays of arrays" is terribly confused. Again I'm not sure if this is imprecise use of language or if the author is truly this confused about the topic, but:

"As you can see, what you have is truly three arrays, working together to create the equivalent of a three-dimensional array."

No, in the naive case of "new int[M][N][10]" you have 1 + M + MN arrays. And if you are working with only the type int[][][], just checking arr.length, arr[0].length, and arr[0][0].length will not give you the dimensions of the 3d array - any of the arrays can be of any length (the length of the array is not part of the type system).

And I feel like I'm not nitpicking here - note some algorithms, like DP ones for instance, only use a "diagonal" half of a 2d array and allocating the other half would be wasted space. So it's very possible you'll encounter such non-trivially-shaped 2 and 3D arrays in practice.
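As a rough sketch of the "diagonal half" case (names are hypothetical), here is a triangular table where row i only has i + 1 cells, so checking a single row's length says nothing about the others:

```java
// Sketch only: TriangularDemo and its methods are hypothetical names.
final class TriangularDemo {
    // Row i gets i + 1 cells, so only the lower-left half is allocated.
    static long[][] lowerTriangle(int n) {
        long[][] rows = new long[n][];
        for (int i = 0; i < n; i++) {
            rows[i] = new long[i + 1];
        }
        return rows;
    }

    // Total cells must be summed per row; rows.length * rows[0].length is wrong here.
    static int cellCount(long[][] rows) {
        int total = 0;
        for (long[] row : rows) {
            total += row.length;
        }
        return total;
    }

    public static void main(String[] args) {
        long[][] t = lowerTriangle(4);
        System.out.println(t[0].length);  // 1
        System.out.println(t[3].length);  // 4
        System.out.println(cellCount(t)); // 10
    }
}
```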

"Here’s an interesting question: What would happen in a multidimensional array if one of the dimensions were declared with a size of 0? For example,

strangePoints = new int[3][4][0][2]"

This was a TIL to me though, I'm kind of surprised this doesn't throw at runtime? But also, I suppose, not surprised. It's weird because, is int[3][4][0][2] really an int[][][][]? I guess it's not not an int[][][][] but it's also kind of weird that it is. I guarantee this has caused a bug at some point. This would be a cursed interview question.


Author here. This is simply a comment on the way that Java refers to the subarrays. I believe most developers would think of the first subarray as being referenced with [0], rather than with no index at all. That's all the comment was intended to point out.


I think that it means the definition `int[2][3][4]` is effectively `((int[4])[3])[2]`, *not* `((int[2])[3])[4]`. So it declares an array of 2 (array of 3 (array of 4 ints)).
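One way to check that reading, as a small sketch with made-up names: the leftmost size in the declaration is the length of the outermost array.

```java
// Sketch only: DimensionOrderDemo and shape() are hypothetical names.
final class DimensionOrderDemo {
    // Returns the lengths found at each nesting level of new int[2][3][4].
    static int[] shape() {
        int[][][] a = new int[2][3][4];
        return new int[] { a.length, a[0].length, a[0][0].length };
    }

    public static void main(String[] args) {
        int[] s = shape();
        System.out.println(s[0] + " " + s[1] + " " + s[2]); // 2 3 4
    }
}
```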


It's the difference between length and indexing. The new int[0] is providing the 0 as length. Mostly when seeing the construct arr[i] you might think of ‘i’ as the index.


What are the advantages of representing multidimensional arrays with pointers to arrays instead of a "flat" version where everything is stored contiguously and access is simply pointer arithmetic?

EDIT: For the JVM, not manually. I'm asking about the internal representation, not a manual flattening by the user.


They're not the same thing. A flat array is fixed in dimension (modulo sum each size = total size). An array of arrays can have its sub-arrays replaced.

        long[][] foo = new long[2][];
        foo[0] = new long[5];
        foo[1] = new long[3];
        // ...
        foo[0] = new long[7];
Although I think as a developer you probably almost always want to index a single flat array if the dimensions are fixed. This is much faster.


Ragged arrays. You can do this:

    int[][] x = new int[2][];
    x[0] = new int[] { 1, 2, 3 };
    x[1] = new int[] { 1, 2, 3, 4 };


The JVM doesn’t really have multidimensional arrays, it just has arrays with a type, and that type may itself be an array.

Sometimes that can be useful, and sometimes not.


> The JVM doesn’t really have multidirectional arrays, it just has arrays with a type, and that type may itself be an array.

That contradicts the article:

> When the compiler encounters this code, it emits a unique bytecode, MULTIANEWARRAY, which creates an array with dimensions that are each set to the specified size.


Does that bytecode do anything more efficient than allocating arrays inside of arrays inside of arrays etc? Does it ensure they’re contiguous? Does a multidimensional array allocated in this way still require pointer chasing to get to the values?

It’s not really a helpful statement. Initializing a single flat array with stride metadata alongside gives you memory locality and arithmetic access. Intuitively, I wouldn’t expect the special multidimensional array bytecode to provide any of that.


Well, /u/aardvark179 claimed that the JVM did not treat multidimensional arrays in any special way at all, i.e. that it is completely just a composition of features already available in the JVM, but that was wrong, since multidimensional arrays have their own special bytecode. Anything beyond that is irrelevant, I'm just correcting his mistake there.


OK, so they're treated in a special way that's completely irrelevant to the concerns one might have when using multidimensional arrays. It's just an implementation detail with no practical considerations.


Yes


Sorry, missed this yesterday. From the spec:

“A new multidimensional array of the array type is allocated from the garbage-collected heap. If any count value is zero, no subsequent dimensions are allocated. The components of the array in the first dimension are initialized to subarrays of the type of the second dimension, and so on. The components of the last allocated dimension of the array are initialized to the default initial value (§2.3, §2.4) for the element type of the array type. A reference arrayref to the new array is pushed onto the operand stack.”

It’s really a little utility method that happens to be pushed all the way to the byte code for historical reasons. The arrays it produces aren’t multidimensional in the ways you might hope, they are just arrays of arrays, which remain mutable and so can become ragged, or be reassigned in sneaky or horrible ways (e.g. with a subtyped array).
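To make the "sneaky or horrible" part concrete, here is a sketch (hypothetical names) of a row being swapped for a subtyped array of a different length. This relies on Java array covariance: a String[] is accepted wherever an Object[] is expected, and the element type is only checked at store time.

```java
// Sketch only: SneakyRowsDemo and its methods are hypothetical names.
final class SneakyRowsDemo {
    static Object[][] raggedGrid() {
        Object[][] grid = new Object[2][2];
        // Legal: String[] is a subtype of Object[]. Row 0 now has length 3,
        // so the "2 x 2" grid is ragged.
        grid[0] = new String[3];
        return grid;
    }

    // Returns true if storing an Integer into the swapped row fails at runtime.
    static boolean storeFails() {
        Object[][] grid = raggedGrid();
        try {
            grid[0][0] = Integer.valueOf(1); // compiles, but row 0 is really a String[]
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(raggedGrid()[0].length); // 3
        System.out.println(storeFails());           // true
    }
}
```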

There has been lots of talk over the years about how we could do better, often motivated by projects like Panama which are concerned with interfacing with other languages which do have firmer concepts of multidimensional arrays, but these are generally at a slightly higher level and try to avoid messing with lower level things the VM has to know about.



It is a bit more flexible would be my guess.

On the flip side, separate arrays and pointers should be slower on modern CPUs because the multiplication will be faster than the cache misses from jumping around in memory.


Maintenance, probably.

Putting data in a structure that mirrors the real thing normally helps when trying to understand it.


No, I mean for the JVM.... the interface could remain the same but the pointer indirections could be avoided. I don't see the downsides, that's why I'm asking.


It just means you don't have to special case any particular scenario such as array-of-array, it's no different than array-of-whatever. Also jagged arrays form a complication.

But even for the simple case of large non-jagged arrays: In the situation where you needed this perf gain as a developer you would still need to be able to specify the order e.g. row-major or column-major, since the VM won't know your access pattern.

So if you make your own matrix class you would be better off storing a flat array yourself and doing the access arithmetic yourself like index = y*cols + x. It would be a pretty cumbersome API in Java if they had this in the declaration api and you declared double[100][100] how would you declare whether that 10k flat array is row major or column major? If the VM gets it wrong for your access pattern then a lot of the perf gain from fewer pointer indirections will be eaten by cache misses from looking in the wrong order.
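A bare-bones sketch of such a matrix class, assuming row-major order and the index = y*cols + x arithmetic mentioned above. The class name and API are made up for illustration, and bounds checking is left out.

```java
// Sketch only: FlatMatrix is a hypothetical class, row-major by assumption.
final class FlatMatrix {
    private final double[] cells; // one contiguous backing array
    private final int cols;

    FlatMatrix(int rows, int cols) {
        this.cells = new double[rows * cols];
        this.cols = cols;
    }

    // Row-major: all of row y is stored before row y + 1.
    int index(int y, int x) {
        return y * cols + x;
    }

    double get(int y, int x) {
        return cells[index(y, x)];
    }

    void set(int y, int x, double value) {
        cells[index(y, x)] = value;
    }

    public static void main(String[] args) {
        FlatMatrix m = new FlatMatrix(2, 3);
        m.set(1, 2, 7.5);
        System.out.println(m.index(1, 2)); // 5
        System.out.println(m.get(1, 2));   // 7.5
    }
}
```

A column-major variant would use x * rows + y instead; which one is faster depends on whether the inner loop walks along rows or along columns.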


This information could be made part of the type of the array. Based on the type, Java could then compute the proper index.


Yes, but you'd have a whole set of additional data (number of dimensions e.g. 2 and then the order of access for those dimensions). And that would affect the size of every array instance, which is something you really don't want. Or the compiler would need to keep track of it on the type level so that a row-major array can't be passed where a column major is expected etc. In any case it would really make the APIs overly complicated.


It would be 4 bytes more per array for metadata, plus the sizes. It would actually require less memory than ragged arrays, which have to record the size in each subarray.

API complexity could be handled by a polymorphism mechanism to make it possible to write code that is generic regarding row/column access order.


If you keep the interface the exact same, you run into some pretty big complications, as with the current interface you can change one row out with another object with a single statement, or similarly get a row object as such (which you do implicitly in "arr[a][b]"!). So you'd have to still store an object per row in addition to the data to keep their identity the same each time they're gotten, and have some way to transform out of this representation if the last reference to the 2D array is via a single row of it, GCing away the rest of the rows.


I guess it was just to simplify the original specification of the JVM (if two dimensional arrays are just arrays of arrays, you don't need special instructions for them). I can also imagine that the original JVM designers did not expect JIT compilers to one day become so good that the performance difference becomes relevant.


But there is actually a special bytecode instruction to create multidimensional arrays: MULTIANEWARRAY. My suspicion is that it allows a sufficiently sophisticated JVM to allocate all the memory required at once such that all of it is contiguous.

Also, the performance impact has nothing to do with the JIT. It follows from how indexes in contiguous vs. ragged arrays are computed. The JIT can't do much in this case apart from providing speedup by a constant factor. (Actually, it could do more if Java had true multidimensional arrays)


> MULTIANEWARRAY

Oh, I didn't know that. Weird. IIRC there are no corresponding special instructions for accesses to multidimensional arrays.

> Also, the performance impact has nothing to do with the JIT

What I meant: The authors didn't expect that the JIT compiler would become so good at compiling the bytecode that the performance difference between contiguous and ragged arrays would actually become relevant.


tl;dr: Arrays have some special opcodes dedicated to them, and otherwise are completely unsurprising, unless you are an ancient Roman, or Andrew Binstock, who can't wrap his head around the concept of zero and thinks it should be special somehow.

And nobody can give a good reason why String#length(), Array#length, and Collection#size() are all spelled differently, but if pressed, they'll use 'special bytecodes!' as an excuse.


An array is a fixed-size block of memory, so it’s unnecessary to invoke a method.

A string has a length, because it counts the number of characters, which can have different sizes.

Both arrays and strings represent a 1-dimensional string or array of things.

Collections have sizes, because collections are more generic. Would a binary tree have a length? Maybe.. but it’s ambiguous and in most cases not correct


Uh-huh. And yet they can (according to the article) turn .length into a special opcode, even though it looks like any other property access. So the Java compiler could just as easily see size() being called on what it knows is an array and turn that into a special opcode. We could argue about the semantic differences between length-as-in-number-of-characters and size-as-in-bytes, but for most purposes, i.e. when any of String, char[], or List<Character> would be logically equivalent, there's no technical reason for having different APIs for what are essentially different implementations of the same thing.
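
For reference, here are the three spellings side by side, as a minimal sketch (the class name and the `count` helpers are hypothetical, only there to line the calls up):

```java
import java.util.List;

// Hypothetical illustration; LengthSpellings and count(...) are made-up names.
final class LengthSpellings {
    // Field-like syntax; javac compiles this to the arraylength opcode.
    static int count(char[] chars) { return chars.length; }

    // Ordinary method call on String.
    static int count(String s) { return s.length(); }

    // Interface method call on a collection.
    static int count(List<Character> list) { return list.size(); }

    public static void main(String[] args) {
        char[] chars = {'a', 'b', 'c'};
        String s = "abc";
        List<Character> list = List.of('a', 'b', 'c');
        System.out.println(count(chars) + " " + count(s) + " " + count(list)); // prints 3 3 3
    }
}
```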

Like when Rasmus couldn't stick to a naming convention for PHP functions, and it turned out to be because he couldn't be bothered to write a decent hash function. So everyone who uses the language from now to forever has to memorize this inconsistent naming scheme because the language designers couldn't get their shit together. Yeah, I get it; sometimes it's hard to foresee these things from the beginning. But it annoys me when people make up excuses for these inconsistencies that don't hold water instead of just admitting that someone messed up.


> Another curiosity of Java arrays is that they can have a size of zero.

> This code will not result in an error message. This surprising feature is used primarily by code generators, which might create an array and then discover there are no values to place in it.

What? How can someone at Oracle have written this?

Zero-length arrays are used all the time when you call a function asking for an array of "the latest stuff" and it needs to be an array, not an ArrayList say (maybe it's an array of bytes). If there's no stuff, you get a zero-length array of course. The myriad foo.toArray(...) functions in Java's library do this for example.
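
A minimal sketch of that pattern, assuming a hypothetical `latestNames` helper (the name and class are mine, not from the article or the JDK; `toArray` is the real library call):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a "latest stuff" query that returns an array.
final class EmptyResults {
    // Returns an array that may be empty but is never null.
    static String[] latestNames(List<String> source) {
        return source.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] none = latestNames(new ArrayList<>());
        System.out.println(none.length); // prints 0
        for (String name : none) {       // body never runs; no null check needed
            System.out.println(name);
        }
        byte[] noBytes = new byte[0];    // zero-length primitive arrays work too
        System.out.println(noBytes.length); // prints 0
    }
}
```

The caller can loop over the result unconditionally, which is the main reason to return an empty array rather than null.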


Right, and

> create an array and then discover there are no values to place in it.

I mean you can't change the size of an array after you've created it, so if you create an array with a certain size intending to put values in it, then discover there aren't any values, you've still got a non-zero size array...


Author here. Zero-length arrays are used in the narrow domain you mention, but they are rare in bread-and-butter Java programming. Given some of the other comments on this page, you can see that this aspect is new to multiple readers. And in my experience speaking to Java devs, the reaction of surprise is far more common than "of course, I use them."


Thing is, practically every modern programming language has zero-length arrays or lists. This is derived from Lisp, which had zero-length arrays (and of course zero-length lists). It's not at all surprising. What ought to be surprising is that among languages developed in the last 40 years, C++ almost uniquely does not have them. I use zero-length arrays all the time in my coding, as does everyone I know. I think they're probably much more common than you imagine.


In Common Lisp I might want to push elements to an array, and start with zero elements:

  CL-USER 9 > (make-array 0 :adjustable t :fill-pointer 0)
  #()

  CL-USER 10 > (vector-push-extend 'foobar *)
  0

  CL-USER 11 > **
  #(FOOBAR)


An adjustable vector probably doesn't qualify as an "array" in the Java sense of the term: it's closer to an ArrayList. However Lisp is perfectly comfortable making zero-length simple-vectors, which are arrays in the Java sense.


An ArrayList is not an Array?

> comfortable making zero-length simple-vectors

Btw., CL also allows zero-dimensional arrays:

  CL-USER 3 > (make-array '())
  #0ANIL


> An ArrayList is not an Array?

Not in Java, no.

An ArrayList is an object which represents a variable-length random-access list and does not support primitive types (int, double, etc.) directly. An array is fixed-length and supports primitive types. It is also an object in Java, but not an instance of an ordinary library class.
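
A minimal sketch of the difference (the class and the `toList` helper are hypothetical names for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; ArrayVsArrayList and toList are made-up names.
final class ArrayVsArrayList {
    // Copies a primitive int[] into a growable list; each int is boxed to an Integer.
    static List<Integer> toList(int[] values) {
        List<Integer> list = new ArrayList<>(values.length);
        for (int v : values) {
            list.add(v); // autoboxing: int -> Integer
        }
        return list;
    }

    public static void main(String[] args) {
        int[] fixed = {1, 2};              // length is fixed at creation
        List<Integer> growable = toList(fixed);
        growable.add(3);                   // the list grows; the array cannot
        System.out.println(fixed.length);    // prints 2
        System.out.println(growable.size()); // prints 3
    }
}
```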

The closest analog to an ArrayList in Lisp is an extensible vector, and the closest analog to an array is a simple-vector.

> Btw., CL allows also zero-dimensional arrays:

As any decent language would!


This choice probably pre-dates Oracle?


I think he's saying "how could he write that 0 length array is a surprising feature".


And the use of "primarily", when there is a much more common use case, so common that it's baked right into the standard libraries.


I was pretty disappointed that, for a blog called "Inside the JVM", very little in the blog entry discussed goings on inside the JVM. For example, when does the JVM typically optimize away bounds or null checks? How are arrays of booleans packed and what is their efficiency compared to arrays of bytes or words?


You want something like https://shipilev.net/jvm/anatomy-quarks/ (the author is a JVM maintainer, formerly at Red Hat, now at AWS).


Thanks for the link, that's an amazing resource!


For that you need an Inside Hotspot, Inside OpenJ9, Inside GraalVM, Inside Azul, Inside ART, Inside microEJ, Inside PTC, Inside JamaicaVM, Inside....

Otherwise it is like trying to discuss what a C compiler does while only looking through the lens of the C abstract machine in ISO C.


This is an Oracle blog, and it's called "Inside the JVM". What VMs do Oracle build besides Hotspot and its ilk?


as far as i know, graalvm is also an oracle project


GraalVM, the JVM inside various database products, historically the embedded VMs, and maybe a few more I’ve forgotten. :-)


As far as I know, all those distributions use OpenJDK for that kind of stuff and don't really do much more than apply a few patches here and there, not change stuff like how the JVM packs bytes in memory.

Would be happy to be proven wrong.


IBM OpenJ9 uses a mix of OpenJDK and their J9 toolchain.

Azul uses parts of OpenJDK, alongside their JIT Falcon infrastructure.

Microsoft's OpenJDK-based distribution has better escape analysis than the regular one, although OpenJDK 22 should have those improvements merged.

And no, not all of them use OpenJDK; that is an urban myth, as usual.


That would be much more convincing if you linked to a proper JDK distribution that's not based on the OpenJDK.


Like on street markets, we are having bonus today, and you get two for the price of one.

https://www.ptc.com/en/products/developer-tools/perc

https://www.aicas.com/wp/products-services/jamaicavm/


Both of these are meant for embedded development only.

You implied before that there are major JDK distributions which are not based on OpenJDK, but these two examples do not show that: these are niche JDKs.


Not only am I not here to do people's work for them, those were two examples when you asked for one, and now that you've been proven wrong, you move the goalposts with additional stuff, because "those don't count oh oh".

A JVM is a JVM, regardless of deployment scenarios and who gets to use them.


Of course there are some niche JDKs around that do not use OpenJDK... Hell, I've written a subset of the JDK from scratch just for fun.

I am not moving goalposts, I am just saying that a JVM specializing in embedded devices is not a JDK most people would use, only a small number of niche applications... so you haven't shown anything other than that there are alternative JDKs available for niche applications, which I agree with! But I was saying that all the JDKs that people working on most applications rely on are based on OpenJDK. If you want to include a niche JDK for embedded devices, fine, but that's not what I or most people would care about, I would think.

It would be more pleasant to discuss with you if you expressed yourself a bit more like an adult (even if you're still a teenager which I am guessing is the case), by the way...


Author here. Thanks for your comment. It's sometimes a little difficult to know how deeply to go into the innards of the JVM before readers' eyes glaze over and they can't follow. I'll bear in mind for future articles in this series that I can/should go deeper than the present level.


I was honestly hoping for a little more considering the title is "Inside the JVM" and not "Basic data structures in Java". Oh well...


Me too. At first, I thought that the article would show me how to make my programs faster by utilizing arrays more.


well he does talk about how the array object emits special bytecode in several cases



