After quickly glancing at the code, I concluded that they wrote C++ as if it had no static types. It seems they faithfully ported the very dynamic nature of their existing code to C++ without much thought.
auto alns = any_cast<shared_ptr<deque<shared_ptr<sam_alignment>>>>(data);
So the data is a sam_alignment inside a shared_ptr inside a deque inside another shared_ptr inside, God forbid, an any? Why did they do that? What kind of abomination is this? Also, from the context, it is the only possible type they use.
If you're so sure an object of any only ever holds exactly one type and everything else is an unexpected error, you shouldn't use any at all!
They really like std::deque<T> and use it everywhere, even though sizeof(T) is a few dozen bytes at best, so they should use std::vector instead. A deque is a list of arrays: while it can amortize continual insertion at the front or back, with elements this small a vector is still the better choice.
Speaking of data structures, they also use std::unordered_map<int, any>. unordered_map is very slow (it's a node-based hash map, not well suited to modern hardware), and sizeof(int) + sizeof(any) is only about 20 bytes (an int plus two pointers), so they get no benefit from a node-based structure here. They should use a sorted vector and binary-search it instead.
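A minimal sketch of that alternative (the float value type is just a stand-in for whatever they keep in the any):

    #include <algorithm>
    #include <utility>
    #include <vector>

    using Entry = std::pair<int, float>;  // kept sorted by key

    // One contiguous allocation and a cache-friendly binary search;
    // no per-node pointer chasing like unordered_map's buckets.
    const float* find_key(const std::vector<Entry>& table, int key) {
        auto it = std::lower_bound(table.begin(), table.end(), key,
            [](const Entry& e, int k) { return e.first < k; });
        return (it != table.end() && it->first == key) ? &it->second : nullptr;
    }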
My conclusion: it's slow because they wrote C++ like a dynamically typed language and chose the wrong data structures.
For relatively small objects with no internal allocation, definitely yes.
Otherwise only a qualified yes: if you are copying the argument anyway in the body, pass-by-value allows moving the object into place and actually avoids copies, but the linked code doesn't use move anywhere at all.
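A minimal sketch of that pattern (the Record type and its field are illustrative):

    #include <string>
    #include <utility>

    class Record {
        std::string name_;
    public:
        // By value: an rvalue argument is moved in, never deep-copied.
        explicit Record(std::string name) : name_(std::move(name)) {}
    };

    // Record r{load_name()};  // hypothetical rvalue source: zero copies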
As they seem fairly versed in C++, the only reason for doing that is that they want their code to be modifiable by non-programmers, so they want to use a style that is safe and avoids possible use-after-move.
For this scenario, then yes, C++ is going to be slower than a language with a proper garbage collector (although I would like to see a version with a proper shared_string implementation instead of the abomination that is shared_ptr<std::string>).
Def a qualified yes. Pass-by-value will often give the compiler less to worry about, since pass-by-ref/ptr forces it to be more pessimistic about aliasing. You see this play out when trying to get the auto-vectorizer going by copying N values to a local buffer in the loop prior to operating on them.
The optimizers in C++ compilers are really good at inlining, even more so within a single TU, so object size isn't the issue in those cases either. In addition, on the return side, most optimizers have a form of return value optimization that elides the copy/move of the return value.
Where they often fall short is heap allocations. Those hide a lot of cost, and only some compilers can elide a heap allocation.
Use after move is interesting. Many uses of move are unnecessary anyway (e.g. returning an object with automatic storage duration), but the complexity of the code is already on par with what using std::move would entail. I don't think an optimizer will do that one for you, even if it is the last use.
It’s unfortunately not just ABI, but also API. The standard specifies that you can get iterators to specific buckets in O(1)[1], and also specifies bucket_count(), max_bucket_count(), bucket_size() (which is specified to be O(n)), and bucket(). Those functions and their specified performance make it effectively impossible to implement a standards-compliant std::unordered_map without using separate chaining.
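For reference, this is the bucket interface in question; any conforming implementation must support per-bucket iteration like this, which an open-addressing table can't reasonably provide:

    #include <cstdio>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> m{{1, 10}, {2, 20}};
        std::printf("%zu buckets\n", m.bucket_count());
        auto b = m.bucket(1);                             // O(1) per the standard
        for (auto it = m.begin(b); it != m.end(b); ++it)  // local (bucket) iterators
            std::printf("%d -> %d\n", it->first, it->second);
    }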
I'm not even sure that is an unpopular opinion. It's a minority opinion, but people on the committee have been grumbling about how ABI consistency is hamstringing C++ for a while now.
Its implementation is a node-based hash map divided into buckets. Theoretically its asymptotic behavior is good, but it's not that efficient on modern hardware, where memory access is heavily cached, so data locality matters more than saving memory copies or memory size. Using a linked list to save some copying doesn't help at all; the overhead is far greater than simply copying a large contiguous chunk of memory. Especially so if the data type is trivial, where a simple byte-by-byte memcpy suffices.
To be useful that would require that the nodes are allocated in the order they appear in the map (and it still has pointer overhead). Is that a given with an unordered map, or does it break down the moment the map is rehashed?
Because you should be able to find a good hash data structure for your language. You shouldn't be able to accidentally get a bad one. It should arguably be the first data structure you reach for.
Deprecate the old data structure. Even Java did that. Originally it had the Vector class as its primary List type. Nowadays everyone uses the List interface with ArrayList as the default implementation.
You can even find this in the javadoc of the Vector class:
>As of the Java 2 platform v1.2, this class was retrofitted to implement the List interface, making it a member of the Java Collections Framework. Unlike the new collection implementations, Vector is synchronized. If a thread-safe implementation is not needed, it is recommended to use ArrayList in place of Vector.
The problem is, a hash map isn't that effective on modern hardware; the overhead costs more than the benefit brings. On a cluster of network-connected computers it may be, but not on a single local computer for most problems.
> My conclusion: it's slow because they wrote C++ like a dynamically typed language
Here's my conclusion without needing to look at the code: is C++17 so complicated a language that having to understand deeper concepts such as the ones you've mentioned costs you "time to market" for only marginal gains over the same code written in Go?
Or put another way: did you just further prove that Go is the better option because it's providing extremely similar performance yet doesn't require this very esoteric knowledge you've mentioned?
If I can get the same work done in Go and I can learn Go in a week versus months for C++17 (which is far more complex), why would I pick C++17?
If you know Go and don't know C++ then yes absolutely use Go. If you're trying to do a performance comparison though it doesn't seem unreasonable to have "basic" familiarity with the languages being compared?
None of the knowledge mentioned is "esoteric". We're not talking hyper optimized insane C++ optimizations here with custom allocators and specialized code generation with templates. We're talking "basic" memory management & standard library data structures.
> If you're trying to do a performance comparison though it doesn't seem unreasonable to have "basic" familiarity with the languages being compared?
No. Because it's not just a performance comparison, really. Without realising it, it's also a learning test. The very fact they wrote sub-par C++ code, likely without realising it, and then proceeded to produce (sub-par?) Go code that was faster demonstrates the difficulty of one language over the other. It demonstrates the cognitive effort required by the two options: one of them was easier to produce, and the results were more than suitable.
These technology comparisons are important, for sure, but at the end of the day people want an ROI from everything they do, even the things that don't involve money. So if something can be solved in six hours with one solution or 36 hours with another, and the latter only yields a 10% performance gain, then I'd rather my end goal be ready and in the market 30 hours ahead of the other guy using the slower solution for a 10% speed improvement.
> No. Because it's not just a performance comparison, really. Without realising it, it's also a learning test. The very fact they wrote sub-par C++ code, likely without realising it, and then proceeded to produce (sub-par?
Who or what tells you it's a learning test? That is a pretty weak assumption.
The publication says nothing about the developers' actual past experience in each language. They might very well be experienced in Go or Java.
From their C++, it is pretty obvious it is not their main language.
In my experience, this kind of code of nested shared_ptr everywhere is pretty typical of developers with a Java background just starting C++.
they clearly want the performance - the algorithms (as you can easily see in the repo) are trivial, it's just a huge amount of data to process
C++ can shine here - and still look nearly like the Go or Java code (due to the simple algorithm) - but someone tried to write highly sophisticated code (there is no tutorial or book about C++ that teaches you to do it that way), with the result that the C++ is far slower than it could be - we're definitely not talking about a 10% gain here
> they clearly want the performance - the algorithms (as you can easily see in the repo) are trivial, it's just a huge amount of data to process
> C++ can shine here
And my point is: it didn't. And it didn't because of the learning curve involved. It's only the better solution if the knowledge is invested in ahead of time, and if the gains don't outweigh those you'd get from Go by an amount that greatly exceeds the time invested to net them, then it's not the better choice.
but your point is just not correct - i know go, Java and C++ very well and i don't understand why the C++ code isn't just like the Java code - there is absolutely no need for these deeply nested, ugly container parties that are not in the Java code at all
The flip side to this is that for someone who knows roughly what a CPU does (you don't need in-depth knowledge to think about pre-fetching and branch prediction) C++ is going to be easier to use to leverage that. That's really the end of it.
I agree with you that writing C++ is harder than writing Go and that writing naïve C++ leading to disproportional pessimization of your program shows that it's not the right tool for people who lack the knowledge to use it.
At the same time it's really not hard to translate simple concepts into simple code for lower-level languages that runs well on modern processors. Basic usage of (possibly) growable, homogeneous arrays will get you far. Contrary to what most C++ app developers seem to think, custom allocators aren't complicated either, and will guarantee that you get much better cache locality and memory management.
Most of these issues are made hard by people, including the premise that learning the basics of what you need to do for the processor to execute your code fast is itself hard.
That being said, if the time invested in learning C++ plus the extra time needed to write and optimise it isn't repaid by how much the result outperforms the same solution written in (optimised) Go, then it's not the right choice overall.
you are just wrong - someone without any good C++ knowledge ported the Java port 1:1 to C++ using every crude feature and wild combination of shared_ptr and container nesting, in a way you will hardly find in any open-source or tutorial project - it's not about the ease of go, the benchmark results are just wrong in so many ways
‘Esoteric’ is in the eye of the beholder. Even in Go you gain performance by avoiding boxing.
> If I can get the same work done in Go and I can learn Go in a week versus months for C++17 (which is far more complex), why would I pick C++17?
Because C++17 code if written well is likely to be much faster. If you don’t care about that, by all means stick with Go. But keep in mind that learning is a one-time cost, and you can keep applying the same knowledge to project after project.
> I was hoping someone could prove this, actually.
Maybe this presentation by Jason Turner from CppCon 2016 can illustrate just what can be done in C++17. His target in the presentation is a C64, but I think it illustrates well just what you can do with the language, and hence what tools are available for writing fast code.
that is really not the case here - the Java and C++ implementations are very plain, boring algorithm code (some maps, lists, searching); none of the fancy language features are used to reduce or optimize the code - it's absolutely not a C++-is-hard-to-get-right thing
the C++ code is just using rarely used features in over-wild combinations - as stated: it seems that someone tried to port highly dynamic script code 1:1 to C++ without any knowledge of C++, and that just makes no sense; even porting the current Java code 1:1 to plain old C++ code would result in a much better performing version of the algorithm
everything (even the simplest stuff) is FORCED to be heap-allocated (with a huge amount of code), in the case of deque and shared_ptrs even two or three times over - without any benefit of doing it
i would bet that a cleaner, less every-feature-using version of the C++ algorithm would beat the Java and Go code by a huge factor
The problem with that idea is that he actually used more advanced features to make it worse (any_cast). This is not how you would do it if you picked up a beginner C++ book and read the first few chapters.
> unordered_map is very slow (it's a node-based hash map, not well suited to modern hardware)
It is very slow?
I wrote my app in FreePascal. I need a hashmap, but FreePascal has no real standard hashmap, so I have been benchmarking Pascal hashmaps for weeks/months.
Today I added std::unordered_map for comparison. It will still take a day to run the benchmark, but so far it looks that std::unordered_map is 25% faster than the fastest map in the FreePascal standard library. And the best map of 45 Pascal maps is only 30% faster than std::unordered_map. Only 10 maps are faster, and 35 maps are slower.
std::unordered_map is supposedly slow because it has fairly stringent iterator-invalidation requirements. In my experience, unless you're wringing the most performance out of your code that you can, it's not a huge issue; it's generally faster than most casual implementations.
Yeah people here are dumping on it for being non-performance-optimal while ignoring that higher performing versions have restrictions and design limitations. unordered_map is fast enough for probably 98% of uses, and the other 2% may be better off finding a specialized version best for their particular case.
Depends what you do. If you populate it only once and then you do a bunch of lookups, assuming a low load factor, you'll be paying for a modulo operation and an extra indirection, not great but not terrible either. If you do a lot of inserts and removal or your load factor is high, then the performance is going to be not great.
Maybe they wrote it really badly so someone will take offence and rewrite it for them!
Also in my experience `std::deque` is almost always slower than `std::vector`. In theory it shouldn't be, but copying contiguous data is so fast on modern CPUs that the reallocation of `std::vector` (or occasional front-insertion) is very cheap.
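If you do need queue-like access, a vector plus a head index often still wins; a minimal, unbenchmarked sketch:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // FIFO on one contiguous buffer: pops advance an index instead of
    // deallocating, and storage is recycled once the queue drains.
    template <typename T>
    class vec_queue {
        std::vector<T> items_;
        std::size_t head_ = 0;
    public:
        void push(T v) { items_.push_back(std::move(v)); }
        bool empty() const { return head_ == items_.size(); }
        T& front() { return items_[head_]; }  // precondition: !empty()
        void pop() {
            if (++head_ == items_.size()) { items_.clear(); head_ = 0; }
        }
    };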
> They really like std::deque<T> and use it everywhere, even though sizeof(T) is a few dozen bytes at best, so they should use std::vector instead. A deque is a list of arrays: while it can amortize continual insertion at the front or back, with elements this small a vector is still the better choice.
Are you saying that access patterns (push/pop front/back) never matter and std::vector is always better than std::deque if the element size is very small? I don't think that this is true in this generality.
In CppCon 2016: Chandler Carruth, “High Performance Code 201: Hybrid Data Structures” [1], he says:
"but the standard deque data type is really quite bad. I wouldn't recommend anyone use it for anything. if you look at the implementation constraints, the constraints placed upon its iterators, and its invalidation constraints, it's painted into a very unpleasant corner and it has very few opportunities to be an efficient data structure."
For a language whose purpose is high-performance, the std:: containers are an embarrassment. `deque` is slow, `map` is slow - so the language added `unordered_map` which is also slow...
Are any of the containers (other than `vector`) worth using in high-performance code?
I’m sympathetic to the game developers who I’ve heard say “we use C++, but nothing from std::”.
STL is mostly garbage, which is a big reason it's not used in places that actually need performance, both for build times and runtime.
Yes, it's embarrassing. I think it should be noted that the performance mantra is mostly for show with C++, though. The same people who spout it will also blatantly use pessimizing language features just because, when normal, procedural code would've done the job faster and ultimately simpler.
I think the performance overhead of a few things in C++ make sense, but in general you get performance in C++ by turning things off and abstaining from most of the features. Exceptions complicate mostly everything, so the best thing to do is to turn them off and not rely on anything that uses them, for example.
Modern C++ isn't fast and most of C++ wasn't even before the modern variant was invented. The C "subset" is.
> STL is mostly garbage, which is a big reason it's not used in places that actually need performance, both for build times and runtime.
It is NOT garbage. It is more than sufficient for the 99% of devs who need turnkey data structures and good-enough performance (meaning already faster than 99% of other programming languages).
If what you need is sub-microsecond perf, then yes, redefine your data structures.
BTW, you will very likely have to do that in any language anyway. Because it is impossible to design a fast, forever-living data structure. They (almost) all become obsolete when architectures evolve. Red-black trees were the state-of-the-art DS taught in school 10 years ago; they are useless garbage nowadays if you seek performance.
> It is more than sufficient for the 99% of devs who need turnkey data structures and good-enough performance (meaning already faster than 99% of other programming languages).
I really don't get this argument. If you don't need pedal-to-the-metal performance then why are you using C++ in the first place? (Unless, of course, your answer is "legacy code".)
C++ is being touted as being high performance, but basically every standard data structure besides `std::vector` is garbage for high performance, pedal-to-the-metal code. And not only data structures - `std::regex`'s performance is bad, `std::unique_ptr` doesn't optimize as well as plain pointers, no vendor has a best-in-class `std::hash` implementation (they're neither DoS-safe nor the fastest), etc.
> BTW, you will very likely have to do that in any language anyway. Because it is impossible to design a fast, forever-living data structure. They (almost) all become obsolete when architectures evolve.
Do you, though? Rust already replaced their standard hash map implementation with a completely different one which was faster, so it shows that it can be done.
> I really don't get this argument. If you don't need pedal-to-the-metal performance then why are you using C++ in the first place? (Unless, of course, your answer is "legacy code".)
There are many things other than pure compute-bound CPU performance that can drive you to a GC-less language like C++. Fine-grained memory usage is one of them; latency control is another.
> `std::regex`'s performance is bad, `std::unique_ptr` doesn't optimize as well as plain pointers
However, do you have any benchmark or source for std::unique_ptr? You are the first one I've heard criticize it.
> Do you, though? Rust already replaced their standard hash map implementation with a completely different one which was faster, so it shows that it can be done.
Rust does not have 25 years of codebases behind it. We can talk about that again when it is 10 years older.
C++ cannot afford to randomly break the API of one of its core STL components just to hypothetically gain a bit of performance.
It can be done, but it should be done with a deprecation process and an alternative implementation, like it was done for auto_ptr -> unique_ptr.
Rust didn't break the API of HashMap. The trick is that the STL decided that certain algorithmic requirements were a part of the interface. For std::map, the case here is that it must be sorted. Its API also means you can't be polymorphic over a hashing strategy. This constrains possible valid algorithms. Rust made the opposite choices.
And so they did exactly what you say, for that reason. std::unordered_map is a better comparison.
std::array is great except it's annoying to statically initialize. I really wish make_array or to_array had become non-experimental a lot sooner (to_array finally did land in C++20 at least).
But yeah it does mean you can finally stop doing that silly `sizeof(a)/sizeof(a[0]);` trick.
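For example (std::to_array assumes C++20; std::size is the C++17 replacement for the sizeof trick):

    #include <array>
    #include <iterator>

    int main() {
        auto a = std::to_array({1, 2, 3, 4});   // element type and length deduced

        int c[] = {1, 2, 3, 4};
        auto n_old = sizeof(c) / sizeof(c[0]);  // the silly trick
        auto n_new = std::size(c);              // C++17 equivalent

        return (a.size() == n_old && n_old == n_new) ? 0 : 1;
    }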
Thanks for the reference. But this is still specific to access patterns: You can use a deque as a queue without iterating, so iterators and their invalidation constraints should not matter.
As suspected, everything is dynamically allocated and no memory mapping (see e.g. http://man7.org/linux/man-pages/man2/mmap.2.html) is used. No wonder this is slow and eats a lot of memory. At the moment I have no information about why this design was chosen, whether there is a justification for it, or whether the developers only knew this option. Maybe I can find some hints in the paper. From what I've seen up to now it can be safely assumed that with optimal use of data structures and system functions the C++ results would be at least one order of magnitude better.
I think this applies to any kind of advanced software development or programming languages, not only C++.
There may be many reasons why scientists who are not computer scientists feel more comfortable with Go than with C++, but performance is certainly not one of them.
It depends on your definition of "wrong".
You can mess up your domain logic in any language, but the chances of messing up perf (and even some ref vs. copy semantics that affect logic) in C++ are vastly greater than in GC'd langs.
I've ported C++ code to C# with much perf gain, due to c'tors of every kind getting called all over the place. Doing it right in C++ is too much cognitive overload -- at least for me. If you're writing inner game loops or device drivers, I feel the complexity is worth paying for - but for LOB apps, I can't see choosing C++, or even Rust for that matter.
Rust certainly comes with its share of complexity, but it might be worth clarifying that Rust doesn't have this specific problem of implicit constructors called all over the place. Instead, everything uses move semantics by default.
But most life scientists using the library for their work are not; Go is a language they can master, as well as Python or Lua; if they're biophysicists they're likely to know Fortran and C as well, but I rarely meet scientists fluent in C++. Also not sure how much experience the authors have with such software projects.
What is easy is to write C++ as if it was Java. (You just stick to the standard containers and use smart pointers when allocating objects on the heap.)
They address that somewhat in the discussion section:
> C++ provides many features for more explicit memory management than is possible with reference counting. For example, it provides allocators [35] to decouple memory management from handling of objects in containers. In principle, this may make it possible to use such an allocator to allocate temporary objects that are known to become obsolete during the deallocation pause described above. Such an allocator could then be freed instantly, removing the described pause from the runtime. However, this approach would require a very detailed, error-prone analysis which objects must and must not be managed by such an allocator, and would not translate well to other kinds of pipelines beyond this particular use case. Since elPrep’s focus is on being an open-ended software framework, this approach is therefore not practical.
Custom allocators are a relatively niche topic, and I agree that library users cannot be expected to customize memory behaviour to that level of detail (especially if the library users typically are not professional developers).
However, a certain level of proficiency must be expected, and in the case of C++ this includes "know when to use const&, unique_ptr<T> or shared_ptr<T>". If this cannot be expected of the user, the comparison becomes less a question about performance and more about which language is the best at being the lowest common denominator.
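Roughly the decision table that boils down to (Widget is a placeholder type):

    #include <memory>

    struct Widget { };

    void inspect(const Widget& w);           // read-only access, no ownership implied
    void adopt(std::unique_ptr<Widget> w);   // transfer of sole ownership
    void share(std::shared_ptr<Widget> w);   // genuinely shared lifetime only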
Of course you can use better allocators; but it's faster to avoid dynamic allocation (e.g. by pointing to memory mapped from the input file by the OS) altogether. If they allocate memory for each flyspeck of a 200 GB file and also create and change a reference counter for it, nobody should be surprised about the low performance. Have a look what e.g. shared_ptr does behind the scenes.
Unless you're streaming, in which case mmap'ed access on Linux is generally slower than read/write. At least it was the last time we checked for the Ardour project (probably about 3 years ago).
> elPrep is an open-ended software framework that allows for arbitrary combinations of different functional steps in a pipeline, like duplicate marking, sorting reads, replacing read groups, and so on; additionally, elPrep also accommodates functional steps provided by third-party tool writers. This openness makes it difficult to precisely determine the lifetime of allocated objects during a program run
> Phase 1 allocates various data structures while parsing the read representations from BAM files into heap objects. A subset of these objects become obsolete after phase 1.
> Therefore, manual memory management is not a practical candidate for elPrep,
TL;DR: it's non-deterministic which objects will be garbage when, so some form of dynamic memory management is needed.
Well, one thing that jumps out immediately to me is that everything appears to be using shared_ptr. And by that I mean everything. Why is everything shared? What does it even mean to have a shared_ptr<stringstream>? Arbitrary mixed writes to a string seems like a bad idea, so shouldn't that be unique_ptr?
Or like this:
auto alns = make_shared<deque<shared_ptr<sam_alignment>>>();
A shared_ptr to a deque of shared_ptrs? deque isn't thread-safe, why would it be shared? And why does the deque instance need to be heap allocated at all? It just contains a pointer to the actual allocation anyway, moving it around by value is super cheap?
It's like make_shared is the only way they know to allocate an object. They even put string inside of shared_ptr:
class istream_wrapper {
...
shared_ptr<string> buffer;
It could be that it does need to be shared for some reason, but this looks like a pointer to a pointer for no obvious reason. Even ignoring the atomics that shared_ptr results in, the dependent data loads are going to be brutal.
EDIT: And they don't even seem to use std::move anywhere :/ There's a huge gap between this code and custom allocator territory.
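For contrast, a minimal sketch of the value-based alternative (the sam_alignment fields here are stand-ins for the real type in the repo):

    #include <string>
    #include <vector>

    struct sam_alignment {
        std::string qname;
        int flag = 0;
        int pos = 0;
        // ... remaining fields elided
    };

    std::vector<sam_alignment> parse_batch() {
        std::vector<sam_alignment> alns;        // elements owned by value, contiguously
        alns.push_back({"read_0001", 99, 12345});
        return alns;                            // moved or elided, never deep-copied
    }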
they heap allocate strings because their string_range implementation is a shared_ptr to the original string plus two indices. It might be somewhat worth it assuming the strings are large enough. But if most strings are small, passing them around by move-value would probably be an overall win. One would need to benchmark it.
In that case you'd still want a shared_string not a shared_ptr<string>. The double pointer chases add up quick.
Edit: It might help to keep in mind that string is essentially an alias for unique_ptr<char[]>. As in, string is already heap allocated. They heap allocated a pointer to a heap allocation.
In my experience, this kind of error is the cause of 90% of the "wait, I thought C++ was supposed to be fast so why is it chugging compared to my Java implementation" problems. People turn everything into two dereferences.
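A minimal sketch of what such a shared_string could look like (assumes C++20 for make_shared<char[]>, which puts the refcount and the characters in one allocation):

    #include <cstddef>
    #include <cstring>
    #include <memory>
    #include <string_view>

    class shared_string {
        std::shared_ptr<char[]> data_;   // control block + chars, ONE allocation
        std::size_t offset_ = 0, length_ = 0;
    public:
        explicit shared_string(std::string_view s)
            : data_(std::make_shared<char[]>(s.size())), length_(s.size()) {
            std::memcpy(data_.get(), s.data(), s.size());
        }
        shared_string substr(std::size_t pos, std::size_t len) const {
            shared_string out = *this;   // bumps one refcount, copies no bytes
            out.offset_ += pos;
            out.length_ = len;
            return out;
        }
        std::string_view view() const { return {data_.get() + offset_, length_}; }
    };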
The problem starts with the design decision to stream the file into memory; if you map the file instead you can directly use the mapped data and only have to allocate supporting structures (if required). And anyway, the "Therefore" part does not follow from the "Phase 1" statement.
For objects allocated in phase 1 and not needed afterwards,
I would use std::pmr::monotonic_buffer_resource, added in C++17. Its implementation consists of just a pointer into a large contiguous block of memory. You want n bytes of memory? Return the pointer while adding ptr += n. Deallocation does nothing; destroying the monotonic_buffer_resource object deallocates the memory. If the objects have trivial destructors, which appears to be the case from glancing at the code, this is very efficient.
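A minimal sketch of how that could look (sizes and element types are illustrative):

    #include <memory_resource>
    #include <vector>

    int main() {
        // Everything below is carved out of this arena; deallocate() is a
        // no-op, and the whole region is released when the arena dies.
        std::pmr::monotonic_buffer_resource phase1_arena;

        std::pmr::vector<int> reads{&phase1_arena};
        for (int i = 0; i < 1'000'000; ++i)
            reads.push_back(i);   // bump-pointer allocation, no per-object frees

    }   // one deallocation for the entire phase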
> To achieve good performance, it was therefore necessary to explicitly control how often and when the garbage collector would run to avoid needless interruptions of the main program, especially during parallel phases.
Why? This is a batch program, interruptions don't matter, only the end-to-end time does.
> The goal of elPrep is to simultaneously keep both the runtime and the memory use low.
Why? Keeping runtime low lets you get more work done. Keeping memory use low means what? They are using a machine with 384 GB RAM, make use of it.
Worth noting also that they used GCC 7.2.1, Go 1.9.5, and Java 10. That's a pretty old GCC.
They don't seem to explicitly select a GC with Java, so they'll be using G1. G1 is still not entirely mature. It got much faster between 9 and 10, and somewhat faster from 10 to 11. For a batch process like this, though, the parallel collector is still probably a better choice. Using a newer JDK and a different collector should give better performance - but admittedly, probably won't reduce heap usage.
It has to be noted: it's quite a strange approach all round - this framework reads all the data into memory. So if you have a 100GB genome it will read 100GB into memory. Presumably it stays in memory uncompressed so we are talking hundreds of GB to process even a single whole genome sample.
This may indeed have some performance benefits, but it's a very impractical approach from a hardware point of view. Few places doing processing of genomic data will have many compute nodes with > 256GB memory, yet that would barely process 1 sample with this framework. God forbid you have a family of samples or tumor/normal comparison samples to analyse and need several genomes in memory together.
Genomes are for the most part massively parallelisable and nearly every other toolkit I have seen has put that first and foremost in its design approach. Ensuring tools process data in a streaming manner and pipe between each other is a basic expectation of most genomic data tools.
Which is all to say ... this is a very strange beast and I'm not sure a lot of conclusions can be drawn from it that generalise to other activities or approaches.
Untrue. Large memory nodes are par for the course for genomics workloads. Only a few stages of the analysis pipeline can stream their input/output effectively. Even then, going to disk for the result only for it to be read back is going to bottleneck your pipeline.
We once asked our cluster department to give us bare metal access to their smallest machine. They gave us a dedicated server with 256GB RAM. The real cluster nodes are bigger than that because they handle multiple jobs at the same time and they have hundreds of them.
This isn't some Electron framework where it is inexcusable to use that much RAM. The hardware that is available to scientists is more than capable of handling these workloads.
> Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.
They don't seem to take into account that their results depend on their own proficiency in each programming language.
They could've just saved themselves the trouble and said "Anyone fancy doing this in C++? No? Great, that's that tricky question off the table!" rather than going through this ridiculous exercise of writing bad C++ in order to justify not using C++. If we really believe that this is just a way of identifying a revealed preference, fine, but engineers should be smart enough not to need to trick themselves into coming to the obvious conclusion.
This is not surprising. Parallel GC will almost always be faster than refcounting. If you wrote a C or C++ program to accomplish this task after carefully planning out exactly when stuff needs to get manually malloced/freed, you could outperform any of their approaches.
And like another commenter mentioned, if you're writing a program which streams a lot of data sequentially from disk, and where throughput is important (such as in sequencing), you should always be using mmap
That's a case where mmap isn't actually all that much faster than read, and due to the inter processor interrupts needed to synchronize the memory mappings across cores, it may end up much slower. You're grabbing large chunks and flushing the TLB a whole lot.
If you are seeking randomly and doing small reads, then mmap will help quite a bit: the data will be faulted in, and accessing it a second, third, or hundredth time will not cost much.
That makes the assumption that any refcounted object lifetime management can be replaced with manual management, which is not the case.
As the paper states: it is not possible to manually determine the object lifetimes due to the nature of the problem, data and openness of the configurable analysis pipeline. Manual alloc/free would in their case mean holding on to much more memory than a GC/refcounting solution would.
for complicated stuff involving multiple threads and shared memory, it's good to plan it out in advance in some way. but you are right that most data can just get stack allocated
if you look at their code it's like they have no clue that the stack even exists. seriously. they heap allocate EVERYTHING
Yeah i hadn't really looked into their c++ code when I wrote my comment. it looks like code written by someone who doesn't even know the stack exists. insane
Very surprising result. I wouldn’t have bet that this is what would have happened.
But anyone working on language perf should take note even though it’s just one result from one team and one application. Of course they probably used C++ in a not great way and probably use Go in a better way. But maybe that is caused by something in Go that encourages good behavior or at least encourages the kind of behavior that Go optimizes for.
So, even if this result doesn’t mean that C++ devs should switch to Go to get more speed, it’s a result that is worth pondering at least a bit, particularly if you like thinking about what it is that makes languages fast or slow.
> I wouldn’t have bet that this is what would have happened.
Which part of the results are you referring to? It's well known that reference counting has significantly lower throughput than tracing garbage collectors, so the fact that C++ is outperformed here isn't surprising at all.
As a Lisp programmer, I confess that when I see a performance shootout between Go, Java, and C++, I don't expect to hear the result that Go won -- and that its performance is on par with their old Common Lisp program.
Great point. Here’s the issue: there are tons of ways of doing reference counting in C++. Some go all-in with implied borrowing. Some make great use of C++’s various references. Some use the reference counting only for a subset of objects and rely on unique_ptr for hot things.
So, there is no universal answer to how C++ reference counting compares to GC.
There is a well known answer, that you’re probably referring to, if you reference count every object and you do it soundly by conservatively having every pointer be a smart pointer. But it just isn’t how everyone who does reference counting in C++ does it, and I was surprised because I’m used to folks going in the other direction: being unsound as fuck but hella fast.
> there are tons of ways of doing reference counting in C++
There are but critically there are also a lot of ways to not do ref counting at all. C++ isn't a refcounted language, it's a language where you can use refcounting (shared_ptr), but you don't have to (unique_ptr, value types). It's not even recommended to be primarily refcounted.
They chose a really odd subset of C++ to use here (shared_ptr exclusively), very unorthodox and not something I've ever seen elsewhere or recommended.
Sure, but I guess that's the point --- the GCs in Java and Go can handle any allocation pattern you throw at them reasonably well, but there's not, as far as I know, such a "one-size-fits-all" solution in C++ (not that it needs one.)
If you don't want to plan out proper (as in performant) data structures and memory management, C++ is indeed the wrong language. And judging by the stuff discussed in comments above, this is exactly what happened here.
I'm not sure I would call 'using your entire CPU to do trivially small heap allocations' a pattern unless you mean that it's a pattern you see when people wonder why their software is so slow.
Their software seems to be mostly just allocating lots of memory. It seems like they are comparing language speed when you don't know the first thing about optimizing (which is to not allocate memory nonstop).
On one hand this comparison sucks because their C++ is quite non-idiomatic; on the other hand it reflects something about C++. Why is it so easy to write non-idiomatic C++? And so much so that it's much slower than a garbage-collected language.
The fact that GC is well suited to this application isn't surprising, but what I thought was interesting was that subtracting the time to do the deallocation from the C++ benchmark brought it into line with Go. In other words, ignoring memory management, for this application, Go and C++ performed on par.
There is a huge amount of gratuitous reference count updates and double ptr indirection (see for example their string_slice). Those add up quickly.
The rest is a lot of string manipulation. If you are not taking advantage of being able to lay out your objects carefully and avoid memory allocations, I wouldn't expect C++ to have any particular advantage over Go or Java in this particular scenario.
It's not well known, because it's not true, at least not absolutely. Speed could either mean latency or throughput. Reference counting is almost always worse than GC at throughput, but is usually better for latency.
Sorry about the naive question, but if the memory management overhead is worse in Swift, is the hardware it runs on typically better? I'm assuming some of this because I've noticed Android devices tend to require more CPU/memory compared to iOS devices in the same generation.
Lattner on Swift (2016): "...while it is true that modern GC's can provide high performance, they can only do that when they are granted much more memory than the process is actually using. Generally, unless you give the GC 3-4x more memory than is needed, you’ll get thrashing and incredibly poor performance..."
Apparently the custom processors used in iOS devices are actually some of the fastest out there in that form factor.
Designed by Apple (I think using the ARM instruction set) and made on some of the latest, most advanced process nodes by TSMC.
If you look at raw benchmarks, they handily beat the best Qualcomm/"Android SoC" chips out there.
I think this changes very rapidly - given that the market segment of high-end phones has a yearly turnaround. However, I did read an article today that claimed the "cheapest iPhone" (the new SE model) is faster than the most expensive Android phone currently out there.
> elPrep allows users to specify arbitrary combinations of SAM/BAM operations as a single pipeline in one command line.
The assumption is that your native environment for data analysis is bash and you have to expose anything that you might want to do in bash. Further,
> elPrep also accommodates functional steps provided by third-party tool writers
That is, they are attempting to provide semantics general enough in a bash command line that they can handle arbitrary functions being dropped into their program.
Remember, the average salary of a job requiring both programming and biology knowledge is much lower than the average salary of a job requiring programming knowledge alone, so as bioinformaticists build skill to the point where they can be employable as programmers, they mostly leave bioinformatics.
Those who have escaped bash have usually done so into Python these days, which creates a class system of those gluing things together in Python versus those implementing algorithms in C or C++.
"We have not performed a detailed comparison against the original version of elPrep implemented in Common Lisp, but based on previous performance benchmarks, the Go implementation seems to perform close to the Common Lisp implementation." LMFAO.
If I understand correctly, new Go code was as fast as old Lisp code, with much less effort and much clearer code.
> Most existing Common Lisp implementations use stop-the-world, sequential garbage collectors. To achieve good performance, it was therefore necessary to explicitly control how often and when the garbage collector would run to avoid needless interruptions of the main program, especially during parallel phases. As a consequence, we also had to avoid unnecessary memory allocations, and reuse already allocated memory as far as possible, to reduce the number of garbage collector runs. However, our more recent attempts to add more functionality to elPrep (like optical duplicate marking, base quality score recalibration, and so on) required allocating additional memory for these new steps, and it became an even more complex task and a serious productivity bottleneck to keep memory allocation and garbage collection in check.
It seems that all the language implementations in the benchmarking exercise utilize garbage collection (C++ using ref counting). It would be very interesting if someone extended this benchmark with the D programming language, arguably the fastest language with a GC. The fact that the elPrep tool uses a functional software architecture [1] and that D natively supports functional programming makes me think this type of computing is very well suited to it.
It would be interesting to see an output of the compiler flags used in the C++ program that was benchmarked. The build script doesn't set an optimization level so it would default to -O0:
They use shared_ptr pervasively in their C++ implementation. It's not the best way to manage objects in C++ by far. The only thing that this performance comparison really tells me is that shared_ptr is worse than a modern GC. This is not really a surprising find.
Probably worth noting in the title that this is a June 2019 publication, so the versions used (gcc 7.2.1, go 1.9.5) etc. are not exceptionally ancient - basically they used what was readily available in CentOS 7 around that time.
I see, it sounded like a buzzword term (which I guess it still might be). The point is that the current title makes it sound like a general comparison, while the original title makes it clear that it's comparing three implementations of a single tool.
Ignoring the terrible C++, why would they bother to write the same program in three languages. I feel like spending 3x time on just one of them would have produced the best outcome.
In academic publication it is a convention to compare your proposed implementation with at least two other comparable competitors (languages, frameworks, algorithms, schemes, etc.). I think it is naturally intriguing and refreshing to see a performance comparison of programming languages by non-authors of the languages themselves, even though the implementations are probably not optimized to death.
It seems that the performance was dominated by memory management, so the comparison is not between the languages per se, but between their current garbage collectors, and respectively the reference counting implementation for C++.
Yeah, further down in the article they track down the gap between cpp and Go/Java to the ref-counting deallocation work. I'm not a cpp expert, but it seems surprising to me that GC would beat ref-counting in any scenario.
That might seem counter-intuitive, but tracing GC is the fastest way to manage memory in general (arbitrary life time, graph shapes and allocation patterns).
It permits really fast allocation and defragmentation, and is sometimes the only way to manage memory (cyclic data structures for which reachability cannot be known directly from the program text).
Reference-counted GC on the other hand is notoriously slow, incomplete, and exhibits lots of bad performance patterns (which can be somewhat managed by borrowing some elements of tracing GCs).
The drawback is that tracing GCs are very hard to implement, to tune, and sometimes not all of the benefits are available at the same time. And because they are so counter intuitive, programmers in general have a hard time understanding their runtime behavior... Even GC experts struggle (the skills required range from low-level hardware to control theory, otherwise performance at scale is unpredictable).
In C/C++ we have the possibility to avoid dynamic allocation altogether and to use system features like memory mapping. If we use C++ the same way as Java (everything dynamically) it's not too surprising the result is not (much) faster than Java.
Their original implementation was in common lisp. Common Lisp also lets you use mmap (in fact it's not that uncommon to do so if you have a large amount of mostly static data) to manage your memory manually.
They clearly wanted automatic memory management, so the C++ implementation is reasonable. A fairer comparison might have used MPS or boehm instead of refcounting, but I suspect the results would have been similar.
Fig. 4 shows that deallocation, presumably of objects allocated during the first phase, takes half as long as the first phase itself. Unless the nature of the problem solved by the program requires you to have a ton of objects with shared mutability (the use case for shared_ptr), and requires you to make a lot of allocations up-front without knowing if you will need them, the memory thrashing occurring does not seem reasonable.
> But not without allocating dynamic memory and copying data.
Sure it does. In SBCL you can force a stack allocation (though rarely does it improve performance), and very short-lived values do not leave registers in any case.
> > They clearly wanted automatic memory management
> Most likely because of some misconceptions.
There are both good and bad reasons to want automatic memory management. At least one good reason it it would decrease the porting effort by keeping the code similar.
> > so the C++ implementation is reasonable.
> How so?
Using reference counting is a reasonable way to get automatic memory management in C++
> > but I suspect the results would have been similar
> Don't forget the data sets to be filtered, sorted an analyzed are up to 200 GB.
Which is going to be rough on any automatic memory management system, which makes using a language with a better ecosystem of automatic memory management more performant.
> > Don't forget the data sets to be filtered, sorted an analyzed are up to 200 GB.
> Which is going to be rough on any automatic memory management system
You could equally say that it makes the case for actually designing memory allocation strategy (which only C++ really supports) that much more important.
You always see this in Java programs for large data analysis. They pick Java because of memory management and the tooling. But it's just SO slow, and after optimisation the only thing that stubbornly remains up there in the profiler data is memory and GC. And what do they do?
A global object of the following form:
    class DataStore {
        // preallocated once; constHowMany/anotherConst are compile-time constants
        static final float[] theFloatsWeNeed = new float[constHowMany];
        static final int[] theIntsWeNeed = new int[anotherConst];
    }
You get the idea. Because this avoids memory allocation in Java. And you use the flyweight pattern to pass data around. Or you fake pointer arithmetic in Java: you create your own pointers by specifying indexes and you, oh the horror, use math on those indexes. Even then, just bounds-checking those indexes actually becomes a significant time sink (and then you disable that, which of course kills memory safety in Java, but you won't care).
The truth is you don't want memory management for large amounts of data. You don't want to allocate it, track it or deallocate it at all. You leave it in its on-disk data format and never serialize/deserialize it at all. You want to mmap it into your program, operate on it, and then just close the mmap when you're done. C++ definitely has the best tools for this way of working.
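A minimal sketch of that workflow on POSIX (the file name and the newline counting are purely illustrative):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = open("reads.sam", O_RDONLY);
        if (fd < 0) return 1;

        struct stat st;
        if (fstat(fd, &st) != 0) return 1;

        // The kernel pages the file in on demand: no read() copies into
        // userspace buffers, no per-record allocation, no deserialization.
        const char* data = static_cast<const char*>(
            mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
        if (data == MAP_FAILED) return 1;

        std::size_t newlines = 0;   // stand-in for real parsing
        for (off_t i = 0; i < st.st_size; ++i)
            newlines += (data[i] == '\n');
        std::printf("%zu lines\n", newlines);

        munmap(const_cast<char*>(data), st.st_size);
        close(fd);
    }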
Yeah, right. The people who know this can apparently be counted on one hand when I look at the advertised publication and the discussion in this forum.
C++: It is permissible for an implementation to allocate every local variable on the heap.
Going back to my original point, you are suggesting a complete rearchitecture of their allocation system. That does not require switching languages to C++. If we are talking about working with 100s of GB of data, that's probably even the correct approach!
TFA does not, however, claim that they have a working set of 100s of GB of data. The data is 100s of GB at rest, but can be processed in chunks with a single pass. That, by itself, does not scream "mmap" to me. On top of that, the data is compressed at rest, so copying is inevitable.
So we are glad that we can use the data of the memory-mapped file directly and do not have to allocate or copy anything. But of course you may solve the problem badly if you prefer; after all, there are even "scientific" publications that do it that way, as the example shows.
1. Reference counting is a form of GC; you could implement a JVM that used reference counting (though in order to be general a small amount of additional work is needed)
2. Reference counting causes extra work every time a reference appears or disappears. Tracing GCs amortize that cost across many allocations.
2.b. This is particularly hurtful to performance for short-lived objects, since most tracing GCs have zero GC overhead for short-lived objects (the cost of a nursery collection under most implementations scales with the amount of live data in the nursery, so objects that appear and disappear in the time-span of a single nursery GC are freed at zero extra cost). Furthermore a tracing GC
3. Malloc cannot move allocated data, so many implementations have a lot of complexity to avoid heap fragmentation, which comes at a cost to both allocating and freeing data. Many GC'd languages allocate small objects with a single instruction in the typical case (just incrementing a pointer; the non-typical case is when the nursery is full and a GC happens) - see the sketch after this list.
4. The JVM and Go both have a lot of effort put into their GC; the ref-counting implementation used by this test is probably a bit more naive. In particular they talk about large delays when a chain of links causes many allocations to die at the same time. A less naive refcounting implementation would queue deleted objects and spread that work out across a larger time period.
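To make point 3 concrete, here is a toy nursery-style bump allocator; not any particular runtime's implementation, just the fast path being described:

    #include <cstddef>
    #include <vector>

    class Nursery {
        std::vector<std::byte> buf_;
        std::size_t top_ = 0;
    public:
        explicit Nursery(std::size_t bytes) : buf_(bytes) {}

        void* alloc(std::size_t n) {
            n = (n + 15) & ~std::size_t{15};             // keep 16-byte alignment
            if (top_ + n > buf_.size()) return nullptr;  // full: a real GC runs here
            void* p = buf_.data() + top_;
            top_ += n;                                   // the "single instruction"
            return p;
        }
        // Short-lived objects die for free: a real collector would copy the
        // survivors out, then the whole nursery resets in O(1).
        void reset() { top_ = 0; }
    };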
Performance characteristics depend where on that continuum your workload falls. For example, Erlang/BEAM uses a generational GC for most common heap objects, but refcounts large binary blobs. This is pretty much a perfect case for refcounting: new references are created infrequently, copying or moving is expensive, destruction is deterministic and happens immediately after the last reference disappears, and there're no pointers within the blob that would require tracing or cycle-detection.
Similarly, UI components within a GUI is another good case for refcounting (and presumably why Apple continues to use this for Objective-C and Swift in Cocoa). New references happen only in non-performance-critical code, most data remains live across collections, and copying/moving existing data would a.) be slow and b.) would invalidate any C pointers into contained data.
Sounds like the particular problem domain described in this article is one where heap allocations are frequent, which makes generational GCs more appropriate. That's probably the case with the vast majority of computational algorithms, but there are definitely problem domains where refcounting continues to beat GC.
> (though in order to be general a small amount of additional work is needed)
Note also that there are two methods of cycle detection for a reference-counted GC that are not just a backup tracing GC:
1. Trial deletion (known since at least the mid 80s)
2. Various tracing systems that exploit extra information known to reference-counted systems e.g. Levanoni/Petrank[1] which actually implemented a reference counted GC for Java.
Refcounting can mess with caches in multithreaded scenarios, so the results aren't really surprising to me. It also generates unnecessary work to maintain counters on all writing threads, while a concurrent tracing GC batches all its actions on a separate thread.
"With reference counting, objects are recognized as obsolete due to their reference counts dropping to zero. Deallocation of these objects leads to transitive deallocations of other objects because of their reference counts transitively dropping to zero. Since this is an inherently sequential process, this leads to a similar significant pause as with a stop-the-world garbage collector."
There are three primary performance penalties for reference counting:
1. This, where dropping one link causes a deallocation chain.
2. A typical heap implementation will leave live objects scattered throughout memory, leading to cache issues.
3. Maintaining the reference counts may cause writes to memory, leading to other cache issues, plus atomicity.
It was mentioned in the article and was not considered a candidate due to a specific API requirement:
> Other mature programming languages with support for reference counting include Objective-C, Swift, and Rust [50]. However, in its algorithm for duplicate marking, elPrep requires an atomic compare-and-swap operation on reference-counted pointers, which does not exist in those languages, but exists in C++17.
Except that this "atomic" operation over shared_ptr's in the C++ standard isn't actually required to be lock free, and is in fact not lock free in common stdlib implementations. So they're not actually gaining anything over, e.g. RwLock<Arc<RwLock<T>>> in Rust.
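For the curious, this is the C++11 free-function API in question (deprecated in C++20 in favor of std::atomic<std::shared_ptr<T>>); on common standard libraries the lock-free query reports false:

    #include <atomic>
    #include <iostream>
    #include <memory>

    int main() {
        auto sp = std::make_shared<int>(42);
        std::cout << std::atomic_is_lock_free(&sp) << '\n';  // typically 0

        auto expected = std::atomic_load(&sp);
        auto desired  = std::make_shared<int>(43);
        // CAS on the shared_ptr itself; internally a spinlock pool on
        // common implementations, not a hardware DCAS.
        std::atomic_compare_exchange_strong(&sp, &expected, desired);
    }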
They are atomic in the sense that it is not possible to observe the intermediate states and they are not data races.
It is extremely hard though to implement them in a lock-free way without a true DCAS (which pretty much no architecture implements). I think all implementations use spinlock pools. I guess transactional hardware could be used.
This is one of the reasons why GC does make it easier to implement some lock-free algorithms: you do not have to worry about races during memory deallocation. Without GC, I guess the next best thing is RCU or hazard pointers, but they are significantly more complex.
From a cursory read I do not see any fancy lock free algorithm in the code though, so I'm not sure why they need atomic operations on shared_ptrs.
I am aware, but that restriction shouldn't affect its use for an atomic ref-counted pointer since you can place the count and the pointer next to each other.
How would it work exactly? The layout of a generic shared ptr (but not specifically std::shared_ptr) is:
ptr -> (count, payload)
instead of:
(count, ptr) -> payload
As you can see, count is not alongside the pointer itself as distinct instances of ptrs need to share the count.
Let's say you want to acquire an additional reference to ptr. You need to both copy the current value of ptr and increment count atomically to protect against concurrent modifications of ptr that might drop the last reference to payload.
Delaying the deallocation of payload via hazard pointers, RCU or some other deferred reclamation scheme works, but it is significantly more complex. I believe this is what the Rust arc-swap crate does internally.