Allocation is cheap in .NET until it is not (tooslowexception.com)
124 points by matthewwarren on Jan 24, 2018 | 65 comments



"Managed memory is free. Not as in free beer, as in free puppy."

Dev manager of Exchange used that line in a talk. Never were more insightful words spoken. Devs will move from C++, where they obsess about every allocation, to .NET and they'll totally forget that allocation is expensive no matter what the platform or runtime.


>they'll totally forget that allocation is expensive no matter what the platform or runtime.

Well, it's easier to do in a managed language. When you literally don't have to agonize or obsess over every allocation because you aren't responsible for cleaning it up (unmanaged resources notwithstanding), you tend not to do so.

P.S.: You're always free to drop down into C or C++ if you want to get some speed, but of course you need to clean up after yourself there. A friend of mine wrote a good guide on doing so, if anyone cares https://github.com/TheBlackCentipede/PlatformInvocationIndep...
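
For the curious, here's a bare-bones sketch of what dropping into C from C# via P/Invoke looks like; the library name and export are made up, not taken from the linked guide:

    using System.Runtime.InteropServices;

    static class Native
    {
        // Hypothetical native library/export; in practice this would
        // point at your own compiled C library.
        [DllImport("fastmath")]
        public static extern int sum_array(int[] data, int length);
    }

    // usage: int total = Native.sum_array(values, values.Length);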


>You're always free to drop down into C or C++ if you want to get some speed

Wouldn't C# with structs and pointers do the job in many cases? I've been able to get 50-fold increases in speed through heavy optimizations, without switching to another language. Using C or C++ solely for a "speed boost" over C# is not only unnecessary, but it creates more problems than it solves. If you don't know how to optimize within C# (as a C# developer), how are you going to succeed in writing efficient C++ code?

Once you learn the nuances and limitations of making optimizations in C#, then you should start looking into how and when other languages such as C can wisely be used. To name an example, C makes it easier to micromanage assembly instructions (can be done in C# too, but not in a very practical way, and yes I mean assembly and not IL). C also contains more syntax and features which are suitable for bitwise micromanagement, whereas with C# it can be more awkward.


Yes they would, and the C# 7 improvements taken from the Midori experience make it much better.

I think in general it is a culture problem.

Those of us that embraced managed languages, including for systems programming (Oberon, D, ...), know that we can be productive 99% of the time and only have to care about speed-boost tricks on that 1%, using a profiler and low-level language tricks.

In C and C++ communities there is a sub-culture of thinking too much ahead of time about how much each line of code costs, thus spending too much time on design decisions that actually have zero value in the context of the application being delivered.

The problem is not making those decisions, but rather making them without validating them with a profiler, or without regard to the goals that have to be met for the application.

Beyond that, any low-level fine tuning, while fun, is needless engineering.


Midori was so beautiful. I think it would have succeeded as a .NET runtime replacement with picoprocesses. It frustrates me that we didn't open-source it.


As a believer in GC-enabled systems programming languages, I do feel it was indeed a missed opportunity, especially to change the minds of those who think C and C++ are the only way to write OSes.


Can you please point to any resources that talk about heavy optimization options in C#? That 50-fold increase you talk about is very interesting. I would like to learn more.


Here is a list; not easy to track all of them down, but maybe they'll serve as keywords for easy googling (a small sketch combining a couple of them follows the list).

- structs

- unsafe code

- stack allocation in unsafe code (think alloca())

- attribute annotations for packing and inline calls across assemblies

- ref parameters

- ref returns

- readonly ref parameters

- Span<> and Memory<>

- Native memory allocation via MarshalInterop, SafeHandles

- Buffer and ArraySegment

- SIMD (with RyuJIT)

- Profiled code cache for JIT code background generation between executions (System.Runtime.ProfileOptimization)
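
A tiny sketch combining a couple of these (assumes C# 7.2+ and the System.Memory package; ProcessHeader is a made-up consumer):

    // Stack allocation + Span<>: the temporary buffer never touches
    // the GC heap and needs no cleanup.
    Span<byte> buffer = stackalloc byte[256];
    buffer[0] = 0xFF;
    ProcessHeader(buffer.Slice(0, 16)); // hypothetical method taking a Span<byte>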


Great list. It's important to understand when to use each one of these. Identify your bottleneck through the use of profilers. Execution time is largely dominated by blocking on memory access rather than by CPU calculations, so if you start by writing SIMD, you're not going to get anywhere.

Accessing data on the stack instead of the heap is the #1 saver of execution time, in my experience. But your bottlenecks might be different. Locally scoped value-type variables are generally on the stack. Object-scoped and static fields and properties are on the heap.

Writes to local variables seem to be faster than reads, IIRC. The fastest operators seem to be the bitwise instructions, IIRC. If running in 32-bit mode, try to work with 32-bit integers. If running in 64-bit mode, try to work with 64-bit integers.

Here's an example of a major, major improvement in performance:

    for (int x = 0; x < this.Width; x++)
    {
        for (int y = 0; y < this.Height; y++) { foo = bar; }
    }

Much faster version (due to storing a copy of Width and Height on the stack instead of the heap):

    int width = this.Width;
    int height = this.Height;

    for (int x = 0; x < width; x++)
    {
        for (int y = 0; y < height; y++) { foo = bar; }
    }

My comment here describes roughly the approach I used to take advantage of stack-allocated memory (before Span<T> was available). https://news.ycombinator.com/item?id=15136627


Thanks! Your example is pretty interesting. Any reason why this is the case? In both cases, it is just accessing a memory location to read the value. Are there compiler optimization heuristics at play here? E.g., for the local variable, the compiler knows that its value is not changing during loop execution, so it can be kept in a register for faster access.


Register access isn't the issue. In the first example, this.Width and this.Height are accessing the Width and Height property of the current object. This requires a heap fetch on each iteration of the loop. There may be OS-specific nuances with automatic caching that I can't remember clearly enough to reliably mention.

If you can get rid of all heap lookups in your iterative loop, then you'll see a large speed boost if that was the bottleneck. Local variables exist on the stack, which tends to exist in the CPU cache when the current thread is active. https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...

Unfortunately, method calls in C# have a much higher overhead than in C and C++. If you must do a method call in your loop, be sure to read this to see if your method can be inlined. Only very small methods of 32 IL bytes or less can be inlined: https://stackoverflow.com/questions/473782/inline-functions-...
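
If you do need to nudge the JIT, something like this is the usual approach (just a sketch; the attribute is a hint, not a guarantee):

    using System.Runtime.CompilerServices;

    // Hint to the JIT; it can still refuse to inline (e.g. virtual calls,
    // exception handling blocks, bodies over the size threshold).
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    static int Clamp(int value, int min, int max)
    {
        return value < min ? min : (value > max ? max : value);
    }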


Thanks!


Also be sure to check this out: https://gist.github.com/jboner/2841832 Notably the L1/L2 cache vs main memory reference latencies.


This article demonstrates why a generational GC with a bump-allocating nursery is so important. Without a semispace copying collector (which is usually impractical without a generational GC) you can't have bump allocation at all, and not having that fast path is a huge performance loss.
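
Roughly, that fast path is nothing more than a pointer bump plus a limit check; a toy sketch of the idea (not the CLR's actual allocator):

    class BumpRegion
    {
        private readonly byte[] _space = new byte[1 << 20];
        private int _next;

        // Fast path: bump an offset and check it against the limit.
        // A real allocator's slow path would ask the GC for a fresh
        // region or trigger a collection instead of throwing.
        public int Allocate(int size)
        {
            if (_next + size > _space.Length)
                throw new OutOfMemoryException();
            int offset = _next;
            _next += size;
            return offset; // "address" of the new object within the region
        }
    }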


Immix [1] does bump allocation without the space overhead of the copying young generation. It's been working really well for us in Scala Native.

[1] https://dl.acm.org/citation.cfm?id=1375586


Mark-sweep-compact is another GC algorithm that supports bump allocation, and doesn't have the 2x overhead of semispace.


The point is that you need a moving GC for bump allocation to be possible. Traditional mark-and-sweep is non-moving, while a semispace collector is the simplest moving GC to describe. The practical takeaway from all that is that the usual generational GC constructions, with semispace minor collections and mark-and-sweep/compact major collections, are in the complexity/efficiency sweet spot.

By the way, it is possible and trivial, although somewhat non-obvious, to construct a non-moving generational GC. The general idea is that for simple M&S you have some list of live objects, which you replace with multiple per-generation lists (Claus Gittinger mentions this in his VM design talks, which probably means that at some point such a GC was used by ST/X).


Moving collectors get you the best allocation throughput but impose other costs, which are hard to measure because they are design constraints. Obviously you cannot have a moving conservative collector, so you must have stack maps, safe points, etc.

Or interactions with native code. How can native code hold a reference to a potentially movable object? .NET allows pinned pointers (obviously hurting compaction efficiency) while JNI uses double-dereferenced handles, of which there's a limit (65k in Android!). Compare to, say, JavaScriptCore, which uses a non-moving collector and simply conservatively scans native stacks.

Whether these costs are important depends on your use case, but it's important to remember that we're rarely building isolated systems.

> non-moving generational GC

Yes, that's what Apple built for its ill-fated experiment with GC! Amusingly Apple also built its inverse: a moving manual memory manager. Google MoreMasters for some retro fun!


The design space has one non-obvious but fundamental strict dividing line: if you can constrain the mutator code enough to insert write barriers for generational GC, you can also constrain it enough to have all the metadata needed to support a moving GC, at least to the extent of opportunistic compaction (e.g. what the CLR and SBCL on i386 do, both of which have conservatively scanned stacks because building a stack map for register-constrained platforms like i386 is essentially impossible).

On the other hand, BEAM AFAIK has a non-moving generational GC implemented in the aforementioned way, which is in this case trivial, as you don't need any write barriers or a remembered set when heap objects are inherently immutable. (In this context it is somewhat relevant that Boehm GC for some time had an API that allowed you to signal the time extent of mutability of a given heap object to the GC; AFAIK it has been a no-op since some 6.x version, and the concurrent/incremental/generational bits of it work on the basis of mprotect() and handling SIGSEGV.)


Aside from the stack, another key challenge for moving GCs is hash tables keyed on object identity. If the object can move, the raw address is no longer a suitable hash.

.NET at one point stored an extra per-object word, which was either (via its LSB) a random hash code or a pointer to a metadata object that held the lock, etc.

Python did this cute thing where moved objects would get an extra word allocated to store the moved-from address, which was where the hash was stored.

Apple's plan for its (not-released) ObjC moving GC was to directly teach the GC about hash tables, so it could rehash on collection (!).

Do you know how other implementations handle this?

BEAM's GC is indeed a marvel both because of the immutability and also the enforced process isolation, so that each process can do its own GC independently. As you say, constrain the mutator...


CPython has a non-moving GC, so this is not an issue. On the other hand, CPython's hashmap implementation is probably the most educational open-source hashmap implementation you can find, because it is full of wonderful and portable performance hacks (for one thing, the hash values of CPython's objects are intentionally computed such that the distribution is non-uniform, which allows tuning the hashmap implementation for common cases).

As for tight coupling between the moving GC and the hashmap implementation, there is another reason why you want to do that: weak pointers and hashmaps that are either weak-keyed or weak-valued, both of which are useful for the VM implementation itself (symbol tables, method dispatch caches, identity maps for FFI...).


> CPython has a non-moving GC, so this is not an issue.

I believe the GP was referring to PyPy.

From https://morepypy.blogspot.co.uk/2009/10/gc-improvements.html

> The hash field is not necessarily there; it is only present in classes whose hash is ever taken in the RPython program (which includes being keys in a dictionary). It is an "identity hash": it works like object.__hash__() in Python, but it cannot just be the address of the object in case of a GC that moves objects around.


The BEAM GC algorithm is explained in detail here:

https://www.erlang-solutions.com/blog/erlang-garbage-collect...

I think it’s moving, unless I misunderstood something?


That is why the "AFAIK" was there :)

A few years ago I read a paper from Ericsson that described a non-moving generational GC explicitly designed for Erlang's data model, which could even be implemented in terms of malloc()/free(). I'm not sure whether it is relevant to how the GC is implemented in the current BEAM.


That’s quite possible. The GC has changed several times. There was even a unified heap fork of Erlang at some point: https://www.researchgate.net/profile/Marc_Feeley/publication...


> of which there's a limit (65k in Android!)

Oh man JNI and Android.

I've never heard a developer curse up a storm like I did when one of my co-workers inadvertently stumbled across the 512 LocalRef limit (also a fun one) during an intermittent crash repro.

By the time he got done with his rant we had to talk him out of purchasing a one-way plane ticket to Mountain View.


As a Java dev, there are so many things that just feel wrong on Android.

Leaving aside how Google treated Sun, the whole framework has the feel of having been written by former C and C++ devs learning Java on the job while implementing Android.


Sorry, but if you have constrained platforms, all the frameworks/software tends to look this way. This is how many game codebases look internally, and they do horrible things behind the scenes to keep it safe.

So if a native Java dev had written the whole platform, how would he have handled the limitations of the platform any differently, except with a lot of abstract wrapping and exceptional exceptions?

There is a reason why there is no runtime error every time something hits the limits of what the machine can feasibly do. Projecting one's unwillingness to cope with reality onto a developer with a difficult job is sort(off, sad);


That wasn't my point about the Android APIs feeling wrong; surely there are some restrictions due to programming resource-constrained devices.

Any Java developer that used Java Card, J2ME or Embedded Java is quite aware of them.

A native Java dev would not have used Hungarian notation, snake-case identifiers, allocation of classes just to fake out parameters that could have been returned as results, event handlers with unused parameters repeating AWT design errors, and a couple of other things that I could still rant about.


I've mentioned it earlier, but: if you need native code, use direct byte buffers as shared memory between Java and C, and hold/use no refs inside the native code.


Those lucky enough to focus on Oreo can make use of the new shared memory classes that were introduced with the new microkernel-like architecture for Treble drivers.


Objective-C GC's main failure was that it is impossible to have a good GC implementation in a language with C semantics.

On top of that, having developers mix GC-compiled libraries with others compiled the traditional way was a good recipe for random runtime crashes.


A point on JNI: use direct byte buffers. The address remains the same within the virtual address space of the process (hence non-movable).

Of course that leaves some memory management in Java itself, but the price is usually fine for having void* access in C.


I didn't understand a bit of that


The GC equivalent of stack allocation of temporary objects, but without having to be 100% accurate about object lifetimes.

It works even when the object is usually thrown away but occasionally isn’t.


This brought back memories of that pattern (flyweight?) where the data was stored outside the objects, possibly in an array. An object was instantiated only to hold an index into the array and allow access. That's dirt cheap!
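
Roughly this kind of thing, as a sketch with made-up names: the payload lives in one big array and the handle is just an index.

    // All data lives in plain arrays; the handle is a tiny struct,
    // so accessing an item causes no per-item heap allocation at all
    // (a class handle works too, at the cost of one small object).
    struct PersonHandle
    {
        public readonly int Index;
        public PersonHandle(int index) { Index = index; }
    }

    class PersonTable
    {
        public int[] Ages = new int[1000];
        public int GetAge(PersonHandle h) => Ages[h.Index];
    }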


It's still commonly used. The NFX library has Pile which does this well for holding large data:

http://nfxlib.com/book/caching/pile.html

https://www.infoq.com/articles/Big-Memory-Part-1


This pattern was also used by Java and .NET for implementing cheap String.substring calls, where all substrings would use the same underlying array with just offsets changed. Unfortunately, it turns out that people read entire files into one big String and then keep a reference to just a small piece of it (via substring), marking the big underlying array as reachable for the GC and holding a lot of memory for no reason. That's why newer implementations of substring always copy :)


I know this was changed recently-ish in Java, but I hadn't heard of anybody doing the old substring trick in .NET; do you know when they cut over?


AFAIK this trick was never possible in .NET because Substring always 'deep copied' the relevant data into a completely new string


There is a lot of good things going on in .NET Core about this. Here is a good description of Span: http://adamsitnik.com/Span/#slicing-without-managed-heap-all...
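
For example, roughly (a trivial sketch, assuming the System.Memory package for AsSpan):

    // Slicing with Span<T>: each slice is just (reference, offset, length),
    // no new array and no copy on the managed heap.
    byte[] payload = new byte[1024];
    Span<byte> header = payload.AsSpan(0, 16);
    Span<byte> body = payload.AsSpan(16);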


Regex's implementation in the standard library used to do this, and I think maybe still does. Bonus points because they leaked a reference to the last string you ran a regex against (????), so if it was big you'd just eat up 50 MB of heap for a while.


That pattern can be done in .NET via native memory allocation (MarshalInterop and SafeHandles).

With the latest C# 7.x features it will become easier to use.
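
A bare-bones sketch of the Marshal route (SafeHandle wrapping omitted for brevity):

    using System;
    using System.Runtime.InteropServices;

    // Unmanaged memory: the GC never scans or moves it, and you are
    // responsible for freeing it.
    IntPtr block = Marshal.AllocHGlobal(64 * 1024);
    try
    {
        Marshal.WriteInt32(block, 0, 42);        // write at offset 0
        int value = Marshal.ReadInt32(block, 0); // read it back
    }
    finally
    {
        Marshal.FreeHGlobal(block);
    }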


QPX also stored their database in non-GC space in order to prevent the GC from having to walk it.


It's a common pattern — e.g. a lot of C++ libraries have some sort of string piece type.


Good old-fashioned separation of data from logic.


How does .NET support pinned pointers with a bump pointer allocator? Does it just eagerly move pinned objects out of the contiguous heap?


The .NET GC hands out "allocation contexts" to every thread. An allocation context is little more than two pointers: the bump pointer and the bump pointer limit. If the runtime allocates too much and exceeds the bump pointer limit, it asks the .NET GC for a "quantum" of memory (usually a few KB). Each quantum that the GC gives out is guaranteed to be free of pinned objects - it'll find a contiguous block of memory to hand out.

Pins on the ephemeral segment are generally bad in that the quantum allocator has to be aware of them and squeeze objects between them.

The GC is not permitted to eagerly move pinned objects out of the heap. This is because there are two ways an object can be pinned: a pinning GC handle or a stack scan reports a local as pinned (e.g. the "fixed" keyword in C#). The GC does not know until a GC is already in progress that an object has been pinned and, at that point, it's not legal to move the object so it must stay where it is at the current point in time.
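
For reference, the stack-reported flavor looks like this: the array can't move while the fixed block is live, and is unpinned automatically afterwards (minimal sketch, needs an unsafe context):

    byte[] buffer = new byte[4096];

    unsafe
    {
        // The stack scan reports `buffer` as pinned for the duration
        // of the fixed block.
        fixed (byte* p = buffer)
        {
            p[0] = 0x7F;
            // e.g. hand `p` to native code here
        }
    }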


Pinning typically just means it is left in place and exempted from compaction. This does mean that you can end up with a performance penalty and nasty holes in your heap layout. Sometimes marshaling code will opt to make a copy of the data instead (and then perhaps pin that), it depends on the type. There's not a lot of explicit documentation on this (probably because some of it is an optimization). Pinned objects can't be moved without breaking semantics - once you get a pinned-type GCHandle to an object, you can just directly get the address and it won't ever change. (I believe once the GCHandle is freed/finalized by the GC, it will automatically unpin the object.)

Typically this isn't a big problem - pinned data structures in .NET code are either pinned for short periods of time (to pass to native code), or are large reusable buffers that stay pinned forever. Large buffers are always allocated in the large object heap right away. You can always allocate native memory directly, in which case the GC doesn't care about it.

This may be changing since recent updates to C# and the runtime have introduced the concept of interior pointers to objects, where you can have a raw pointer to a field within a GCable object. Right now those are constrained to living on the stack only, so the period of time in which the object can't be moved/compacted as a result is relatively short.


> (I believe once the GCHandle is freed/finalized by the GC, it will automatically unpin the object.)

GCHandle is a struct so you have to explicitly call GCHandle.Free().
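
The usual pattern looks roughly like this (a minimal sketch):

    using System;
    using System.Runtime.InteropServices;

    byte[] buffer = new byte[4096];

    // Pinning handle: the buffer cannot move until Free() is called.
    GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
    try
    {
        IntPtr address = handle.AddrOfPinnedObject(); // stable address for native code
        // ... pass `address` to native code ...
    }
    finally
    {
        handle.Free(); // unpins; forgetting this leaks the pin
    }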


Makes sense. I make a point of calling Free but it wasn't clear to me whether the pin was attached to the object reference (since the handle contains a reference).


Is there work ongoing or planned to try to add/improve escape analysis, as the article suggests?


.NET already has in-place allocated types (i.e. structs) which are dropped on the stack or inside the allocation of their owner (in the case of member fields), which explicitly covers most cases that escape analysis tries to handle implicitly. However, they still need to be heap-allocated when boxed (e.g. when used as an IEnumerator or something), which would definitely benefit from HotSpot-style allocation inlining (though the C# compiler already tries to do this pre-JIT for common cases like iterators, too, for practicality's sake).
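
A minimal illustration of the boxing case (only the interface-typed assignment allocates, since List<int>.Enumerator is a struct):

    using System.Collections.Generic;

    var list = new List<int> { 1, 2, 3 };

    // No allocation: the struct enumerator lives in place.
    List<int>.Enumerator direct = list.GetEnumerator();

    // Heap allocation: assigning the struct to the interface type
    // boxes it - exactly the case escape analysis / allocation
    // inlining would help with.
    IEnumerator<int> boxed = list.GetEnumerator();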


In general, yes: https://github.com/dotnet/coreclr/issues/1784. However, I can't say anything more specific...


For a VM as mature as .NET, I am surprised they don't do escape analysis. Java and Go both do escape analysis, though the coreclr issue you linked to makes the point that Java really needs escape analysis because it doesn't have value types like C# structs.


I'd say .NET is surprisingly immature considering its age and its popularity, at least compared to JVM and JS options.

Don't get me wrong I think .NET is great, and they've made some smart decisions that allow it to be competitive with much less engineering work on the VM.

As well as not having escape analysis, it also lacks tiered compilation, which is a much more pressing problem from my perspective and is being actively tackled now.


Looking from the outside, but also following everything that comes out of MS and MSR, I think it was a political consequence.

There were always the DevTools vs WinDev differences regarding the role of .NET in Windows, and for a while it seemed Microsoft was happy having .NET just be good enough.

Thankfully they have changed their mindset and are improving .NET to be as close to C++ as possible, at least for 99% of the use cases where using a GC language is an acceptable trade off.


I realize it's a pretty hard problem; had Java not already demonstrated the feasibility of it, I would have doubted it was possible at all without major surgery to both language and runtime (special scoped types, etc.). So I guess my question is: is there something about C# or .NET that makes escape analysis much harder than it is in the Java world?

An evil example is

    class Something
    {
        private static Something _inst;

        public Something()
        {
            _inst = this;
        }
    }
Where the reference is leaked just by instantiating it. Does java detect that this escaped the stack? How?


This post has a nice investigation into 'Escape Analysis' in Java, https://shipilev.net/jvm-anatomy-park/18-scalar-replacement/

Shows that HotSpot doesn't handle it in all scenarios:

> But, EA is not ideal: if we cannot statically determine the object is not escaping, we have to assume it does. Complicated control flow may bail earlier. Calling non-inlined — and thus opaque for current analysis — instance method bails. Doing some things that rely on object identity bail, although trivial things like reference comparison with non-escaping objects gets folded efficiently.

> This is not an ideal optimization, but when it works, it works magnificently well. Further improvements in compiler technology might widen the number of cases where EA works well.


Thanks, that clears some of it up. It seems that Java runtimes that do EA actually do this sort of crazy difficult analysis that quickly breaks down with branches and non-inlined code.


FWIW, Graal has partial escape analysis, which can avoid a lot of those pitfalls by allowing allocations to escape through just some of the branches. For instance, if you have

    X thing = new ...;
    if (slowpath) {
        unlikely_function(thing);
    }
    ...
Even if unlikely_function isn't inlined, it can still perform scalar replacement, and push the allocation site into the branch (reconstructing the state of the allocated object as it would've been at that point), which is a big improvement.

This in turn lets the inliner be smarter about what it does and doesn't inline, vs C2, which tries to greedily inline everything, partly to assist escape analysis.


I believe it is not a problem of difficulty, but rather that, due to the existence of value types in .NET, it was OK to live without escape analysis for so many years (as @cpeterso also pointed out).


> Does java detect that this escaped the stack? How?

There isn't one Java, rather lots of them.

So, as far as Oracle is concerned, there are reports from Oracle Labs that Graal is better than HotSpot on escape analysis.

No idea where IBM, Azul, Aicas, microEJ, PTC stand.

Even less regarding Android, because not only has Google done their own thing, each OEM also has the freedom to change how ART works.

Regarding your actual question, I have this feeling none of them is able to detect it.


Construction is a two-step process: the space for the object is allocated and then the constructor is called, so there isn't anything magical in that example.

It does, however, also remove half the guarantees about uninitialized object visibility in the Java memory model, because you're leaking the reference inside the constructor.


> Does java detect that this escaped the stack? How?

Well, you assigned the instance to a global variable. That's how Java knows. If you can see that, then so can the Java compiler.




