String.intern() (shipilev.net)
145 points by dmit on May 15, 2017 | hide | past | favorite | 63 comments



Every time I see String.intern() my mind leaps to the problem of new Java programmers who are misled into this:

    String a = "hello";
    String b = "world";
    assert a != b;
    b = "hello";
    assert a == b;
    // OH NICE I'LL USE == FOR STRING COMPARISONS NOW
It works because source-code literals are interned into identical objects by the compiler, but that's a special case that won't apply to strings created at runtime.
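A minimal sketch of that special case, contrasting pooled literals with strings created at runtime (concat() is used here to defeat compile-time constant folding; the class and method names are mine, for illustration):

```java
public class InternDemo {
    // Compile-time literals are pooled, so two "hello" literals are one object.
    public static boolean literalsShareIdentity() {
        String a = "hello";
        String b = "hello";
        return a == b;  // true: both refer to the pooled constant
    }

    // Strings built at runtime are distinct heap objects, even if equal.
    public static boolean runtimeStringsShareIdentity() {
        String a = "hello";
        String b = new String("hello");   // explicit copy
        String c = "hel".concat("lo");    // runtime concatenation, not folded
        return a == b || a == c;          // false: distinct objects
    }

    // intern() maps a runtime string back to the pooled instance.
    public static boolean internRestoresIdentity() {
        String a = "hello";
        String b = new String("hello").intern();
        return a == b;  // true: intern() returns the pooled instance
    }
}
```

Note that `"hel" + "lo"` with constant operands would be folded by the compiler into a single pooled literal, which is why `concat()` is needed to show the runtime case.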


new Java programmers

Or perhaps interns? ;-P

In fact, I've wondered about the origins of the term "intern" since I first heard of it; the technique was already known to me and likely others as "tokenising" before that, however, since compilers would often have to compare identifiers and doing this to them made it much faster.


'intern'ing in the Lisp world usually means to create a new object and put it under its name into some datastructure for later retrieval, unless the object already exists -> then the object is just retrieved.

The typical example is to intern a symbol into a symbol table (list, array, or in Common Lisp into a package). Here the symbol table is searched for a symbol with the required name. If it exists, the symbol will be returned. If it does not exist, the symbol will be created and registered in the symbol table.

But one might also use the word 'intern' for an operation to retrieve/put a user into a group, retrieve/put a resource into a pool, or similar operations.

The function INTERN dates back at least to Lisp 1.5 from 1962. See the Lisp 1.5 Programmer's Manual.
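The look-up-or-create behaviour described above translates directly to Java; here is a minimal sketch using a hypothetical Symbol class (not any real Lisp or JDK API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of Lisp-style INTERN: look up a symbol by name in a table,
// creating and registering it only if it does not already exist.
public class SymbolTable {
    public static final class Symbol {
        public final String name;
        Symbol(String name) { this.name = name; }
    }

    private final Map<String, Symbol> table = new HashMap<>();

    // Returns the existing symbol for `name`, or creates and registers one.
    public Symbol intern(String name) {
        return table.computeIfAbsent(name, Symbol::new);
    }
}
```

Interning the same name twice yields the same Symbol object, so symbols can then be compared by reference, which is exactly the speedup the parent comments describe.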


It's from Lisp, and where Lisp took it from in the sixties I don't know. But I'll bet that this particular Java method comes straight from Lisp, as there were some Lisp folks involved in making Java (in particular, Guy Steele worked on both the ANSI Common Lisp standard and the Java language specification).


And was one of the Scheme creators.


> In fact, I've wondered about the origins of the term "intern" since I first heard of it

I always assumed it came from internalize, or something like that.


I was curious, but a good amount of Googling didn't turn up any etymology.

I'll take a guess, though, that it's from Common Lisp, because CL's intern function (http://clhs.lisp.se/Body/f_intern.htm) is clearly part of a suite of names (intern, unintern, :internal, :external) where, in context, "intern" makes complete sense as the name for the function.


Usage of "intern" in this sense is much older than Common Lisp. The earliest reference I could find in a few minutes of searching is from 1974, from the MACLISP Reference Manual:

http://www.softwarepreservation.org/projects/LISP/MIT/Moon-M...

See sections 6.3 and 6.4 and the glossary for definitions and uses of the word.

Unfortunately they don't shed any light on where the term "intern" comes from. Typical usage is that a string is «"interned" or registered in an obarray» but it doesn't explain much more than that. Interning occurs most often at read time, when a string read from the input is turned into an atom, so I'd guess that "intern" is short for "internalize" or similar.


Well, I was about to say, the Common Lisp version makes enough sense (and the MacLisp version might be equivalent, for all I know):

In CL, you don't really have a global "symbol table" or "atom table"; instead, interning is about the runtime lexical scopes under which code compilation happens, which consist of stacks of imported namespaces ("packages"), where each namespace exists as a mutable object holding identifier "slots".

(intern foo) is a mutation on the current namespace that creates a local variable "slot" for the identifier foo, shadowing any other foo that might have been in lexical scope from an import. That is, (intern foo) makes foo resolve internally to the current namespace.

(unintern foo) is the complementary mutation; it drops the foo identifier from the current namespace, allowing other symbols (slots in packages) that were previously shadowed by it to re-appear.


Internment, perhaps?


OED:

†1. intr. To enter or pass in; to become incorporated or united with another being. Obs.

2. trans. To confine within the limits of a country, district, or place; to oblige to reside within prescribed limits without permission to leave them.


I believe the original term was for recent medical grads working at a hospital before they had their license. It was "okay" because they were internal, and supervised. Once they had the license, they could work unsupervised or externally.


> In fact, I've wondered about the origins of the term "intern" since I first heard of it; the technique was already known to me and likely others as "tokenising" before that, however, since compilers would often have to compare identifiers and doing this to them made it much faster.

My first hearing of it was in conjunction with LISP macros, or more specifically creating hygienic LISP macros. I'm pretty sure it goes way back.

You can see an example of this here: http://stackoverflow.com/a/30016344


I think the problem is that == compares pointers, when it should've been sugar for .equals() instead.


it's tough. If a language has pointers, then you would expect == to compare whether the two sides have the same value (an address). I don't see a way around that.

How would you expect a programmer to ask "do these two pointers hold the same address?" .holdsSameAddressAs()?


D and Python use the "is" operator to compare reference identity. It is the lesser used operator and the more advanced concept, so using == for reference comparison is just a newbie trap, doubly so since you have to use == to compare equality for primitives.


But it's not true. Mary is a sign-holder. She's holding a sign that says "32 Crestview Road". John is also a sign-holder. Today his sign also says "32 Crestview Road".

Question: is Mary John?

You want the answer to be "yes" (Mary is John) since the is operator asks whether they hold addresses which have the same value.

But of course it's broken terminology, because Mary and John are two different people, with their own physical locations etc. So you're overloading the word "is". I get that it works, but so does anything else you make programmers learn.


I think Haskell gets it right. All values are immutable. Calling == for two values of some type invokes the Eq implementation for that type, e.g. for strings it's string equality. For the type "IORef a" (box that can contain different a's at different points in time) it compares the identity of the box, not the contents. In fact it can't compare contents, because the type of == stops it from using impure operations like accessing current contents.

The same approach could work in imperative or OO languages. For immutable types == should be value equality, while mutable types are analogous to IORef or IOArray, so == should be reference equality. It comes down to observational equality, different instances of the number 2 shouldn't be distinguishable, but different mutable containers with the same contents should be distinguishable because you can make their contents different. I think observational equality is a good fit for pretty much all uses of equality in programming.
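A rough Java rendering of that observational-equality argument (a sketch of the reasoning, not a claim about how Java's == actually behaves; class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class ObservationalEquality {
    // Immutable values with the same content are interchangeable: nothing
    // you can do to one Integer 1000 distinguishes it from another.
    public static boolean immutablesEqual() {
        return Integer.valueOf(1000).equals(Integer.valueOf(1000));
    }

    // Mutable containers with equal contents can be told apart: mutate one
    // and the other does not follow, so identity is what distinguishes them.
    public static boolean mutablesDiverge() {
        List<Integer> a = new ArrayList<>(List.of(1, 2));
        List<Integer> b = new ArrayList<>(List.of(1, 2));
        boolean equalBefore = a.equals(b);  // true: same contents now
        a.add(3);                           // observable mutation of a only
        boolean equalAfter = a.equals(b);   // false: contents diverged
        return equalBefore && !equalAfter && a != b;
    }
}
```

This is the sense in which value equality fits immutable types while reference equality fits mutable ones: equal immutables stay indistinguishable forever, equal mutables need not.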


But in the ideal situation it would be

    assertTrue(mary.getSign == john.getSign)
    assertFalse(mary.getSign is john.getSign)
    assertFalse(mary == john)
    assertFalse(mary is john)


right, you and I are agreed, and your third assertion is exactly what I was arguing for; it goes against the OP's request that that comparison evaluate to true.

this was the OP in this thread:

https://news.ycombinator.com/item?id=14348066


> How would you expect a programmer to ask "do these two pointers hold the same address?"

In Java and C#, you rarely need to ask this question.

In fact, the .NET Framework overloads == for String, DateTime[0], etc., so you can use == for almost all comparisons.

[0]: http://stackoverflow.com/a/1205462/266535


Except Java doesn't have pointers; it has references. You can't do pointer arithmetic on references, nor can you take addresses and dereference them.

I'd expect a programmer to ask: "is this object the same as this other object" and have the language support that by an operator like `is`. Instead of overloading `==` to mean value and reference equality depending on the context.


Java references are pointers; the fact that you can't do math on them is immaterial. In Java, pointers and primitive types are first class (they can be assigned, compared, passed to and returned from functions, and be members of other objects) while the objects themselves are not (they only live in the heap and can't be aggregated).


I think the distinction between "this value is a reference that can be dereferenced to get another value that is stored somewhere else" and "this value is an address into some contiguous storage where another value may be stored" is a useful one. Personally I like "reference" and "pointer" to make that distinction, where a pointer is a specific implementation of a reference.

However, that terminology is by no means universal. While, from what I can tell, "reference" as defined above is favoured in the Java and C# community, "pointer" is still used plenty. For example, the Java spec says "The reference values (often just references) are pointers to these objects"

All that being said, I do agree with what I think is the spirit of your comment, which is that when your language doesn't allow pointer manipulation it is less useful to ask whether two values point to/reference the same value.


> How would you expect a programmer to ask "do these two pointers hold the same address?" .holdsSameAddressAs()?

In C++, yes, you do

    (&a == &b)
and the plain == is used for the common case of actually doing what you mean (what in Java would be .equals()).


Actually, that's not always correct in C++ (it is in C). It is possible that either type has operator& overloaded, so the proper way to do it is:

    (std::addressof(a) == std::addressof(b))
http://en.cppreference.com/w/cpp/memory/addressof


Thanks (I didn't even know this operator was overloadable)!


==& ?

I use .equals far more than ==.


I ran into this once when I accidentally used == for a comparison to an empty string... and the condition went through part of the time, because sometimes the empty string it was being compared against was also a constant.


new Java programmers, especially those coming from C# where == compares the strings.


This is admittedly an unscientific claim, but I ran a hand-crafted/hacked benchmark just to get a feel for the numbers. For 5-to-35-character Strings, == is 20 to 40 times faster than String.equals().

Given that s1.equals(s2) if and only if s1.intern() == s2.intern() (assuming you haven't filled the string table), then this looks like an opportunity for a significant optimization.

Before doing this, I had hoped that String.equals might check whether both strings were "interned" and shortcut the character-by-character comparison in that case by just comparing references. But the results of my rough benchmark suggest this isn't what is happening, which agrees with the source provided for the String.equals method.

Java String comparison is absolutely ubiquitous, so I would have expected that an optimization like this had been considered.

Having said that, the supplied rt.jar source also suggests that the String.hashCode() computation isn't cached/memoized. This strikes me as odd given that Strings are immutable and one of the most common key types for Maps.


This is from Oracle JDK sources:

    public boolean equals(Object anObject) {
        if (this == anObject) {
            return true;
        }
        ...
So if both strings are interned, this shortcut will work. I guess your benchmark was wrong; microbenchmarks in Java are hard to implement correctly.

hashCode is cached as well in field `hash` and lazily computed:

    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }
Everything there is optimized down to the last cycle, I'm sure about that.


I hadn't spotted that in the hashCode() implementation. Thanks, that makes sense.

Regarding .equals(), indeed that shortcut exists, but it didn't help my benchmark because it spent nearly all its time comparing Strings which were not equal.

I switched the benchmark to compare all equal (interned) Strings, and even then .equals() is about half the speed of the reference equality check. I guess because of the method call overhead?

Yes I would have been sure that everything to do with Strings would have been optimized to the last cycle hence my surprise. I'm not sure about the state of Java intrinsics with the Oracle JDK but I would have thought that most of the String methods would be handled by intrinsics and be very heavily optimized.

Edit: correction - indeed microbenchmarking is tricky and I had a bug. If the (interned) Strings are equal then .equals() and reference comparison run at roughly the same speed.


I don't think the JVM will do magic and execute something different from the actual sources, even for String. On the other hand, optimizing those bytecodes to compile to perfect machine code, or shaping the source to produce bytecode the JIT will optimize well, is a more reasonable approach: not only java.lang.String would benefit from those optimizations, but any other developer with similar code.

Method call overhead is real, and calling the equals method is slower than a reference comparison, but eventually the JIT might decide to inline this method call, so there'll be no overhead after that. Anyway, it's unlikely that those differences will be noticeable in real code, IMO.


You should compare the cost of one (or two) .intern() plus '==' against a single .equals()

For example, if your code needs 10 .intern() and then performs 100 '==', that is possibly going to be faster than performing the equivalent 100 .equals() (or maybe not)

But if you are comparing a single couple of strings, .equals() is always going to be faster.

Not to mention that if you .intern() random strings, you risk running out of PermGen space. Even in the case of 100 comparisons among 10 strings, you may be better off using a custom HashMap.

You can find all this information in the article.
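The trade-off described above (pay intern() once per string, then compare by reference many times) can be sketched like this; it only shows the shape of the code and makes no timing claims, and the class and method names are mine:

```java
public class InternTradeoff {
    // Pay the intern() cost once per element, up front.
    public static String[] internAll(String[] values) {
        String[] out = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i].intern();
        }
        return out;
    }

    // After interning, every comparison is a cheap reference check.
    public static int countMatches(String[] interned, String needle) {
        String n = needle.intern();
        int count = 0;
        for (String s : interned) {
            if (s == n) {  // safe only because both sides are interned
                count++;
            }
        }
        return count;
    }
}
```

Whether this wins over plain .equals() depends on how many comparisons amortize each intern() call, which is exactly the parent's point.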


I'm not suggesting that .equals() should be implemented by two intern() operations followed by a reference comparison, but that the implementation check whether the pair of Strings happened to be already intern()ed, and then use the short circuit of a reference comparison.

A programmer could then decide to intern _some_ non-constant Strings in their application if it made sense (i.e. they were frequently used in comparisons or as hash keys).

And I don't think you can run out of PermGen, as the String pool is a fixed size - unless I am misreading that section of the article.

However, now that I've thought about it a bit more, I think the flaw in the idea is unrelated to your points: currently it's not cheap to determine whether a particular String is interned. The StringTable implementation would have to be changed; otherwise it would require adding storage overhead to every String, which would not be acceptable.


Total aside from main topic, I love shipilev's posts.

If there are other core JVM developers that have similar blogs, I'd love to hear about them here.


We built an in-memory map and were using String.intern() for both keys and values. We could see that we were saving lots of memory, but we had the problems described in the article. We then built our own 'String.intern' using yet another static HashMap. It worked; it was the simplest alternative and it just did the job. Thanks Aleksey for the nice article.
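A minimal version of such an application-level interner might look like this (a sketch only; the commenter's actual code isn't shown, and the use of ConcurrentHashMap for thread safety is my assumption):

```java
import java.util.concurrent.ConcurrentHashMap;

// An application-level interner: a static map that deduplicates equal
// strings without touching the JVM's native string table, so it grows,
// resizes, and gets collected like any other Java object.
public final class AppInterner {
    private static final ConcurrentHashMap<String, String> POOL =
            new ConcurrentHashMap<>();

    private AppInterner() {}

    public static String intern(String s) {
        String existing = POOL.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}
```

Unlike String.intern(), this pool lives entirely on the Java heap and its size is under the application's control, which is the main advantage the article attributes to the do-it-yourself approach.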


I'd never seen the @Benchmark annotation before so I looked it up.

The blog author is also one of JMH's authors.

http://openjdk.java.net/projects/code-tools/jmh/


Have been doing Java for 14 years so far, and never needed .intern(). I can imagine its use case, but it does seem pretty rare.


Deserializing tons of records all having one of a few values for a particular field is an easy and probably fairly common use case.
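For instance, a hypothetical record parser could deduplicate a low-cardinality field as it loads (the CSV layout, field position, and class name here are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class RecordLoader {
    // Per-loader dedup table: each distinct field value is stored once.
    private final Map<String, String> dedup = new HashMap<>();

    // Return the canonical instance for s, registering it if new.
    private String shared(String s) {
        return dedup.computeIfAbsent(s, k -> k);
    }

    // Parse one CSV line, deduplicating the low-cardinality second column
    // (e.g. a status field that only ever holds a few distinct values).
    public String[] parse(String csvLine) {
        String[] fields = csvLine.split(",");
        fields[1] = shared(fields[1]);
        return fields;
    }
}
```

With millions of records and a handful of distinct values, this keeps one String per distinct value instead of one per record, without involving the JVM string table.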


Is "Anatomy Park" a Rick and Morty reference? http://rickandmorty.wikia.com/wiki/Anatomy_Park_(episode)


Please consider adding "JVM Anatomy Park" to the title.


The code interns unique strings, which most likely isn't what would happen in a real-world application (unless, you know... you code without thought); you'd usually intern strings with low variance. Not saying that it won't be slower, but the memory usage might be lower.


Yeah, blindly interning strings isn't a good idea, but it can be useful for certain use cases. For example, keys of maps where you know the keys won't vary much. It can reduce memory usage, and improve lookup performance (since doing a lookup requires a string comparison).


This, and the few other articles up, are a great series. Having done Java development now for 30% of my life, these are some amazing pointers.

I'd love to buy a hard copy of these if they ever get up to a few dozen articles. Would be good to give to middle-experience devs (like myself) in the future.


Some good content along the same vein with 16 years worth of articles: http://www.javaspecialists.eu/


"The performance is at the mercy of the native HashTable implementation, which may lag behind what is available in high-performance Java world, especially under concurrent access."

What native HashTable is used? Shouldn't the JVM be using an optimized one?


I think by "native" they mean "implemented in C++ by the JVM", which potentially varies; not java.util.Hashtable, which should be pretty standard across JVM implementations.


Yes, although Hashtable is legacy from JDK 1.0. New code should use HashMap or ConcurrentHashMap.


Please be careful with blanket statements like this. HashMap, Hashtable and ConcurrentHashMap behave differently in certain (subtle?) ways. ConcurrentHashMap doesn't like NULL but is thread-safe. Hashtable is synchronized, slow and thread-safe, but doesn't mind NULL. HashMap is not thread safe and doesn't mind NULL.

Edit: pardon the dupes, on unreliable mobile link :-(


> Hashtable is synchronized, slow and thread-safe, but doesn't mind NULL.

The issue with Hashtable, or rather all the legacy collection classes, is that the API is rarely useful without additional synchronization. So you might as well use a HashMap wrapped by Collections.synchronizedMap, or use a ConcurrentHashMap with a placeholder object instead of null.
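The two suggested alternatives might look like this; the NULL sentinel object is the usual workaround for ConcurrentHashMap rejecting null values (the class and method names here are illustrative):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MapChoices {
    // Sentinel standing in for null, since ConcurrentHashMap rejects
    // null keys and values.
    public static final Object NULL = new Object();

    // Alternative 1: a plain HashMap wrapped for coarse-grained locking.
    public static Map<String, Object> synchronizedChoice() {
        return Collections.synchronizedMap(new HashMap<>());
    }

    // Alternative 2: a ConcurrentHashMap, mapping null to the sentinel.
    public static Map<String, Object> concurrentChoice(Object value) {
        Map<String, Object> m = new ConcurrentHashMap<>();
        m.put("key", value != null ? value : NULL);
        return m;
    }
}
```

Callers of the second variant translate the sentinel back to null on read; that bookkeeping is the price of ConcurrentHashMap's fine-grained concurrency.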


It's using a custom C++ hashtable:

https://github.com/JetBrains/jdk8u_hotspot/blob/master/src/s...

Interestingly, the hash table implementation it's subclassing has changed from `HashTable` to `RehashableHashtable` between JDK7 and JDK8, so I guess there's at least been work to try and improve the performance there.

The hashmap header is here, code lives in the matching cpp file in the same directory:

https://github.com/JetBrains/jdk8u_hotspot/blob/master/src/s...


That is an implementation detail, there are many JVMs to choose from.


Absolutely - that's what the blog post is about, how OpenJDK's particular implementation behaves. When Aleksey is talking about the hashtables not being resizeable, he's talking about the project he works on, which is the code above.


I missed that.




String.intern() would suck much less if strings had an "IS_INTERNED" flag which would prevent hashtable lookups for already interned strings. Really sad given the insane overhead Java strings have.


No, it would suck more, because currently you have the option to avoid String.intern() altogether (and that is what you should do), and pay nothing for that at runtime. Another boolean flag may cost an extra 4-8 bytes on the heap for each String object, whether you use String.intern() or not.


> in OpenJDK, String.intern() is native, and it actually calls into JVM, to intern the String in the native JVM String pool.

How much of this also applies when using the standard Oracle JDK?


It would be interesting to know where it's used; was it used in the JDK itself, for example?


Also see:

XInternAtom (X Window System function)

RegisterClass (Windows)


The instrumentation here is impressive. The amount of data inspection done with just a few simple commands is a bit overwhelming. Frankly, I hope to rarely find myself needing this level of metrics.

There's a lot down there I like to take for granted. But more likely I try to use methods like String.intern() exactly never.

Use code you know and understand. Frankly, use code you can trust. And who would trust a method called String.intern() to do... exactly what?

If you are writing a function to do something, the name of the function must be the thing being done. What the heck is a 'static internalize'? The explicit HashMap was a few lines of code, and it's the most basic, obvious, and surprisingly performant approach. So I definitely agree you should use your own HashMap and not a static internalizer.



