
What are the advantages of representing multidimensional arrays with pointers to arrays instead of a "flat" version where everything is stored contiguously and access is simply pointer arithmetic?

EDIT: For the JVM, not manually. I'm asking about the internal representation, not a manual flattening by the user.




They're not the same thing. A flat array has fixed dimensions (the per-dimension sizes multiply out to one fixed total size). An array of arrays can have its sub-arrays replaced.

        long[][] foo = new long[2][];
        foo[0] = new long[5];
        foo[1] = new long[3];
        // ...
        foo[0] = new long[7];
Although I think as a developer you almost always want to index a single flat array if the dimensions are fixed. This is much faster.
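A minimal sketch of what that manual flattening might look like; the class and method names here are made up for illustration:

```java
// Hypothetical sketch: manual flat indexing for a fixed 2x5 grid,
// as an alternative to new long[2][5].
public class FlatGrid {
    static final int ROWS = 2, COLS = 5;
    static long[] grid = new long[ROWS * COLS];

    // One multiply-add per access instead of a pointer dereference
    // per dimension.
    static long get(int row, int col) {
        return grid[row * COLS + col];
    }

    static void set(int row, int col, long value) {
        grid[row * COLS + col] = value;
    }

    public static void main(String[] args) {
        set(1, 3, 42L);
        System.out.println(get(1, 3)); // prints 42
    }
}
```

All ten elements live in one contiguous allocation, so sequential traversal walks memory in order.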


Ragged arrays. You can do this:

    int[][] x = new int[2][];
    x[0] = new int[] { 1, 2, 3 };
    x[1] = new int[] { 1, 2, 3, 4 };


The JVM doesn’t really have multidimensional arrays, it just has arrays with a type, and that type may itself be an array.

Sometimes that can be useful, and sometimes not.


> The JVM doesn’t really have multidimensional arrays, it just has arrays with a type, and that type may itself be an array.

That contradicts the article:

> When the compiler encounters this code, it emits a unique bytecode, MULTIANEWARRAY, which creates an array with dimensions that are each set to the specified size.


Does that bytecode do anything more efficient than allocating arrays inside of arrays inside of arrays etc? Does it ensure they’re contiguous? Does a multidimensional array allocated in this way still require pointer chasing to get to the values?

It’s not really a helpful statement. Initializing a single flat array with stride metadata alongside gives you memory locality and arithmetic access. Intuitively, I wouldn’t expect the special multidimensional array bytecode to provide any of that.


Well, /u/aardvark179 claimed that the JVM did not treat multidimensional arrays in any special way at all, i.e. that they are completely just a composition of features already available in the JVM, but that was wrong, since multidimensional arrays have their own special bytecode. Anything beyond that is irrelevant; I'm just correcting his mistake there.


OK, so they're treated in a special way that's completely irrelevant to the concerns one might have when using multidimensional arrays. It's just an implementation detail with no practical considerations.


Yes


Sorry, missed this yesterday. From the spec:

“A new multidimensional array of the array type is allocated from the garbage-collected heap. If any count value is zero, no subsequent dimensions are allocated. The components of the array in the first dimension are initialized to subarrays of the type of the second dimension, and so on. The components of the last allocated dimension of the array are initialized to the default initial value (§2.3, §2.4) for the element type of the array type. A reference arrayref to the new array is pushed onto the operand stack.”

It’s really a little utility method that happens to be pushed all the way down to the bytecode for historical reasons. The arrays it produces aren’t multidimensional in the ways you might hope; they are just arrays of arrays, which remain mutable and so can become ragged, or be reassigned in sneaky or horrible ways (e.g. with a subtyped array).
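A hedged sketch of both failure modes described here; the class and method names are made up for illustration:

```java
public class SneakyRows {
    // Java arrays are covariant, so this compiles for any Object[][],
    // but storing a row of the wrong element type is only caught at
    // runtime with an ArrayStoreException.
    static boolean storeWrongRowType(Object[][] a) {
        try {
            a[0] = new String[1];
            return false; // store succeeded
        } catch (ArrayStoreException e) {
            return true;  // store rejected at runtime
        }
    }

    public static void main(String[] args) {
        Integer[][] m = new Integer[2][3];

        // Rows are independently mutable, so the array can become ragged.
        m[0] = new Integer[7];

        System.out.println(storeWrongRowType(m)); // prints true
    }
}
```

So even an array freshly allocated by MULTIANEWARRAY carries no guarantee that it stays rectangular or type-uniform at the row level.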

There has been lots of talk over the years about how we could do better, often motivated by projects like Panama which are concerned with interfacing with other languages which do have firmer concepts of multidimensional arrays, but these are generally at a slightly higher level and try to avoid messing with lower level things the VM has to know about.



It is a bit more flexible, would be my guess.

On the flip side, separate arrays and pointers should be slower on modern CPUs because the multiplication will be faster than the cache misses from jumping around in memory.


Maintenance, probably.

Putting data in a structure that mirrors the real thing normally helps when trying to understand it.


No, I mean for the JVM.... the interface could remain the same but the pointer indirections could be avoided. I don't see the downsides, that's why I'm asking.


It just means you don't have to special-case any particular scenario such as array-of-array; it's no different than array-of-whatever. Jagged arrays would also complicate things.

But even for the simple case of large non-jagged arrays: In the situation where you needed this perf gain as a developer you would still need to be able to specify the order e.g. row-major or column-major, since the VM won't know your access pattern.

So if you make your own matrix class, you would be better off storing a flat array yourself and doing the index arithmetic yourself, like index = y * cols + x. It would also make for a pretty cumbersome API in Java: if the language supported this in the declaration and you declared new double[100][100], how would you declare whether that 10k-element flat array is row-major or column-major? If the VM gets it wrong for your access pattern, a lot of the perf gain from fewer pointer indirections will be eaten by cache misses from traversing memory in the wrong order.
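A minimal sketch of such a matrix class over a flat row-major array; the class and names are illustrative, not an actual API:

```java
// Row-major matrix stored in one contiguous double[].
public class Matrix {
    final int rows, cols;
    final double[] data;

    Matrix(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    // index = y * cols + x: one multiply-add, no pointer chasing.
    double get(int y, int x) { return data[y * cols + x]; }

    void set(int y, int x, double v) { data[y * cols + x] = v; }
}
```

With this layout, iterating x in the inner loop walks memory sequentially, which is the cache-friendly order; a column-major variant would want the loops the other way around.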


This information could be made part of the type of the array. Based on the type, Java could then compute the proper index.


Yes, but you'd have a whole set of additional data (the number of dimensions, e.g. 2, and then the access order for those dimensions). And that would affect the size of every array instance, which is something you really don't want. Or the compiler would need to track it at the type level, so that a row-major array can't be passed where a column-major one is expected, etc. In any case it would really make the APIs overly complicated.


It would be 4 bytes more per array for metadata, plus the sizes. It would actually require less memory than ragged arrays, which have to record the size in each subarray.

API complexity could be handled by a polymorphism mechanism to make it possible to write code that is generic regarding row/column access order.


If you keep the interface exactly the same, you run into some pretty big complications. With the current interface you can swap one row out for another object in a single statement, or get a row as an object in its own right (which you do implicitly in "arr[a][b]"!). So you'd still have to store an object per row, in addition to the data, to keep row identity stable each time it's fetched, and you'd need some way to transform out of this representation if the last reference to the 2D array is through a single row of it, GCing away the rest of the rows.
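A small sketch of the row-identity behavior the current interface guarantees (names made up for illustration):

```java
public class RowIdentity {
    public static void main(String[] args) {
        int[][] m = new int[2][3];

        // "m[0][1]" implicitly loads the row object m[0] first.
        int[] row = m[0];

        // The row keeps its identity: writes through either reference
        // are visible through the other.
        row[1] = 42;
        System.out.println(m[0][1]);     // prints 42
        System.out.println(m[0] == row); // prints true

        // Swapping a row out is a single statement; the old row object
        // remains reachable through "row".
        m[0] = new int[3];
        System.out.println(m[0] == row); // prints false
    }
}
```

A flat internal representation would have to preserve all of this observable behavior, which is why it isn't a free optimization.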


I guess it was just to simplify the original specification of the JVM (if two dimensional arrays are just arrays of arrays, you don't need special instructions for them). I can also imagine that the original JVM designers did not expect JIT compilers to one day become so good that the performance difference becomes relevant.


But there is actually a special bytecode instruction to create multidimensional arrays: MULTIANEWARRAY. My suspicion is that it allows a sufficiently sophisticated JVM to allocate all the memory required at once such that all of it is contiguous.

Also, the performance impact has nothing to do with the JIT. It follows from how indexes in contiguous vs. ragged arrays are computed. The JIT can't do much in this case apart from providing speedup by a constant factor. (Actually, it could do more if Java had true multidimensional arrays)


> MULTIANEWARRAY

Oh, I didn't know that. Weird. IIRC there are no corresponding special instructions for accesses to multidimensional arrays.

> Also, the performance impact has nothing to do with the JIT

What I meant: The authors didn't expect that the JIT compiler would become so good at compiling the bytecode that the performance difference between contiguous vs. ragged arrays actually becomes relevant.



