After quickly glancing at the code, I concluded that they wrote C++ as if it had no static types. It seems they faithfully ported the very dynamic nature of their existing code to C++ without much thought.
auto alns = any_cast<shared_ptr<deque<shared_ptr<sam_alignment>>>>(data);
So the data is a sam_alignment inside a shared_ptr inside a deque inside another shared_ptr inside, God forbid, an any? Why did they do that? What kind of abomination is this? Also, from the context, it is the only possible type they use.
If you're so sure an object of any only ever holds exactly one type and everything else is an unexpected error, you shouldn't use any at all!
They really like std::deque<T> and use it everywhere, even though sizeof(T) is a few dozen bytes at best, so they should use std::vector instead. A deque is a list of arrays: while it can amortize continual insertion at the front or back, with elements this small a vector is still the better choice.
Speaking of data structures, they also use std::unordered_map<int, any>. unordered_map is very slow (it's a node-based hash map, not well suited to modern hardware), and sizeof(int) + sizeof(any) is only about 20 bytes (an int plus two pointers), so they get no benefit from a node-based structure here. They should use a sorted vector and binary-search it instead.
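A minimal sketch of that alternative (the float value type is just a stand-in for whatever they keep in the any):

    #include <algorithm>
    #include <utility>
    #include <vector>

    using Entry = std::pair<int, float>;  // kept sorted by key

    // One contiguous allocation and a cache-friendly binary search;
    // no per-node pointer chasing like unordered_map's buckets.
    const float* find_key(const std::vector<Entry>& table, int key) {
        auto it = std::lower_bound(table.begin(), table.end(), key,
            [](const Entry& e, int k) { return e.first < k; });
        return (it != table.end() && it->first == key) ? &it->second : nullptr;
    }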
My conclusion: it's slow because they wrote C++ like a dynamically typed language and chose the wrong data structures.
For relatively small objects with no internal allocation, definitely yes.
Otherwise only a qualified yes: if you are copying the argument anyway in the body, pass-by-value allows moving the object into place and actually avoids copies, but the linked code doesn't use move anywhere at all.
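A minimal sketch of that pattern (the Record type and its field are illustrative):

    #include <string>
    #include <utility>

    class Record {
        std::string name_;
    public:
        // By value: an rvalue argument is moved in, never deep-copied.
        explicit Record(std::string name) : name_(std::move(name)) {}
    };

    // Record r{load_name()};  // hypothetical rvalue source: zero copies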
As they seem fairly versed in C++, the only reason for doing that is that they want their code to be modifiable by non-programmers, so they want to use a style that is safe and avoids possible use-after-move.
For this scenario, then yes, C++ is going to be slower than a language with a proper garbage collector (although I would like to see a version with a proper shared_string implementation instead of the abomination that is shared_ptr<std::string>).
Def a qualified yes. Pass-by-value will often give the compiler less to worry about, since pass-by-ref/ptr forces it to be more pessimistic about aliasing. You see this play out when trying to get the auto-vectorizer going by copying N values to a local buffer in the loop prior to operating on them.
The optimizers in C++ compilers are really good at inlining, even more so within a single TU, so object size isn't the issue in those cases either. In addition, on the return side, most optimizers have a form of return value optimization that elides the copy/move of the return value.
Where they often fall short is heap allocations. Those hide a lot of cost, and only some compilers can elide a heap allocation.
Use after move is interesting. Many uses of move are unnecessary anyway (e.g. returning an object with automatic storage duration), but the complexity of the code is already on par with what using std::move would entail. I don't think an optimizer will do that one for you, even if it is the last use.
It’s unfortunately not just ABI, but also API. The standard specifies that you can get iterators to specific buckets in O(1)[1], and also specifies bucket_count(), max_bucket_count(), bucket_size() (which is specified to be O(n)), and bucket(). Those functions and their specified performance make it effectively impossible to implement a standards-compliant std::unordered_map without using separate chaining.
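For reference, this is the bucket interface in question; any conforming implementation must support per-bucket iteration like this, which an open-addressing table can't reasonably provide:

    #include <cstdio>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, int> m{{1, 10}, {2, 20}};
        std::printf("%zu buckets\n", m.bucket_count());
        auto b = m.bucket(1);                             // O(1) per the standard
        for (auto it = m.begin(b); it != m.end(b); ++it)  // local (bucket) iterators
            std::printf("%d -> %d\n", it->first, it->second);
    }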
I'm not even sure that is an unpopular opinion. It's a minority opinion, but people on the committee have been grumbling about how ABI consistency is hamstringing C++ for a while now.
Its implementation is a node-based hash map divided into buckets. Theoretically its asymptotic behavior is good, but it's not that efficient on modern hardware, where memory access is heavily cached, so data locality matters more than saving memory copies or memory size. Using a linked list to save some copying doesn't help at all; the overhead is far greater than simply copying a large contiguous chunk of memory. Especially so if the data type is trivial, where a simple byte-by-byte memcpy suffices.
To be useful that would require that the nodes are allocated in the order they appear in the map (and it still has pointer overhead). Is that a given with an unordered map, or does it break down the moment the map is rehashed?
Because you should be able to find a good hash data structure for your language. You shouldn't be able to accidentally get a bad one. It should arguably be the first data structure you reach for.
Deprecate the old data structure. Even Java did that. Originally it had the Vector class as its primary List type. Nowadays everyone uses the List interface with ArrayList as the default implementation.
You can even find this in the javadoc of the Vector class:
>As of the Java 2 platform v1.2, this class was retrofitted to implement the List interface, making it a member of the Java Collections Framework. Unlike the new collection implementations, Vector is synchronized. If a thread-safe implementation is not needed, it is recommended to use ArrayList in place of Vector.
The problem is, a hash map isn't that effective on modern hardware; the overhead costs more than the benefit brings. On a cluster of network-connected computers it may be, but not on a single local computer for most problems.
> My conclusion: it's slow because they wrote C++ like a dynamically typed language
Here's my conclusion without needing to look at the code: is C++17 so complicated a language that having to understand deeper concepts such as the ones you've mentioned costs you "time to market" for only marginal gains over the same code written in Go?
Or put another way: did you just further prove that Go is the better option because it's providing extremely similar performance yet doesn't require this very esoteric knowledge you've mentioned?
If I can get the same work done in Go and I can learn Go in a week versus months for C++17 (which is far more complex), why would I pick C++17?
If you know Go and don't know C++ then yes absolutely use Go. If you're trying to do a performance comparison though it doesn't seem unreasonable to have "basic" familiarity with the languages being compared?
None of the knowledge mentioned is "esoteric". We're not talking hyper optimized insane C++ optimizations here with custom allocators and specialized code generation with templates. We're talking "basic" memory management & standard library data structures.
> If you're trying to do a performance comparison though it doesn't seem unreasonable to have "basic" familiarity with the languages being compared?
No. Because it's not just a performance comparison, really. Without realising it, it's also a learning test. The very fact they wrote sub-par C++ code, likely without realising it, and then proceeded to produce (sub-par?) Go code that was faster demonstrates the difficulty of one language over the other. It demonstrates the cognitive effort required by the two options: one of them was easier to produce, and the results were more than suitable.
These technology comparisons are important, for sure, but at the end of the day people want an ROI from everything they do, even the things that don't involve money. So if something can be solved in six hours with one solution or 36 hours with another, and the latter only yields a 10% performance gain, then I'd rather my end goal be ready and in the market 30 hours ahead of the other guy using the slower solution for a 10% speed improvement.
> No. Because it's not just a performance comparison, really. Without realising it, it's also a learning test. The very fact they wrote sub-par C++ code, likely without realising it, and then proceeded to produce (sub-par?
Who or what tells you it's a learning test? That is a pretty weak assumption.
The publication says nothing about the developers' actual past experience in each language. They might very well be experienced in Go or Java.
From their C++, it is pretty obvious it is not their main language.
In my experience, this kind of code of nested shared_ptr everywhere is pretty typical of developers with a Java background just starting C++.
they clearly want the performance - the algorithms (as you can easily see in the repo) are trivial, it's just a huge amount of data to process
C++ can shine here - and still look nearly like the Go or Java code (due to the simple algorithm) - but someone tried to write highly sophisticated code (there is no tutorial or book about C++ that teaches you to do it that way), with the result that the C++ is far slower than it could be - we're definitely not talking about a 10% gain here
> they clearly want the performance - the algorithms (as you can easily see in the repo) are trivial, it's just a huge amount of data to process
> C++ can shine here
And my point is: it didn't. And it didn't because of the learning curve involved. It's only the better solution if the knowledge is invested in ahead of time, and if the gains don't outweigh those you'd get from Go by an amount that greatly exceeds the time invested to net them, then it's not the better choice.
but your point is just not correct - i know go, Java and C++ very well and i don't understand why the C++ code isn't just like the Java code - there is absolutely no need for these deeply nested, ugly container parties that are not in the Java code at all
The flip side to this is that for someone who knows roughly what a CPU does (you don't need in-depth knowledge to think about pre-fetching and branch prediction) C++ is going to be easier to use to leverage that. That's really the end of it.
I agree with you that writing C++ is harder than writing Go and that writing naïve C++ leading to disproportional pessimization of your program shows that it's not the right tool for people who lack the knowledge to use it.
At the same time it's really not hard to translate simple concepts into simple code for lower-level languages that runs well on modern processors. Basic usage of (possibly) growable, homogeneous arrays will get you far. Contrary to what most C++ app developers seem to think, custom allocators aren't complicated either, and will guarantee that you get much better cache locality and memory management.
Most of these issues are made hard by people, including the premise that learning the basics of what you need to do for the processor to execute your code fast is itself hard.
That being said, if the time invested in learning C++ plus the extra time needed to write and optimise it isn't repaid by how much the result outperforms the same solution written in (optimised) Go, then it's not the right choice overall.
you are just wrong - someone without any good C++ knowledge ported the Java port 1:1 to C++ using every crude feature and wild combination of shared_ptr and container nesting, in a way you will hardly find in any open-source or tutorial project - it's not about the ease of go, the benchmark results are just wrong in so many ways
‘Esoteric’ is in the eye of the beholder. Even in Go you gain performance by avoiding boxing.
> If I can get the same work done in Go and I can learn Go in a week versus months for C++17 (which is far more complex), why would I pick C++17?
Because C++17 code if written well is likely to be much faster. If you don’t care about that, by all means stick with Go. But keep in mind that learning is a one-time cost, and you can keep applying the same knowledge to project after project.
> I was hoping someone could prove this, actually.
Maybe this presentation by Jason Turner from CppCon 2016 can illustrate just what can be done in C++17. His target in the presentation is a C64, but I think it illustrates well just what you can do with the language, and hence what tools are available for writing fast code.
that is really not the case here - the Java and C++ implementations are very plain, boring algorithm code (some maps, lists, searching); none of the fancy language features are used to reduce or optimize the code - it's absolutely not a C++-is-hard-to-get-right thing
the C++ code is just using rarely used features in over-wild combinations - as stated: it seems that someone tried to port highly dynamic script code 1:1 to C++ without any knowledge of C++, and that just makes no sense; even porting the current Java code 1:1 to plain old C++ code would result in a much better performing version of the algorithm
everything (even the simplest stuff) is FORCED to be heap-allocated (with a huge amount of code), in the case of deque and shared_ptrs even two or three times over - without any benefit of doing it
i would bet that a cleaner, less every-feature-using version of the C++ algorithm would beat the Java and Go code by a huge factor
The problem with that idea is that he actually used more advanced features to make it worse (any_cast). This is not how you would do it if you picked up a beginner C++ book and read the first few chapters.
> unordered_map is very slow (it's a node-based hash map, not well suited to modern hardware)
It is very slow?
I wrote my app in FreePascal. I need a hashmap, but FreePascal has no real standard hashmap, so I have been benchmarking Pascal hashmaps for weeks/months.
Today I added std::unordered_map for comparison. It will still take a day to run the benchmark, but so far it looks that std::unordered_map is 25% faster than the fastest map in the FreePascal standard library. And the best map of 45 Pascal maps is only 30% faster than std::unordered_map. Only 10 maps are faster, and 35 maps are slower.
std::unordered_map is supposedly slow because it has fairly stringent iterator-invalidation requirements. In my experience, unless you're wringing the most performance out of your code that you can, it's not a huge issue; it's generally faster than most casual implementations.
Yeah people here are dumping on it for being non-performance-optimal while ignoring that higher performing versions have restrictions and design limitations. unordered_map is fast enough for probably 98% of uses, and the other 2% may be better off finding a specialized version best for their particular case.
Depends what you do. If you populate it only once and then you do a bunch of lookups, assuming a low load factor, you'll be paying for a modulo operation and an extra indirection, not great but not terrible either. If you do a lot of inserts and removal or your load factor is high, then the performance is going to be not great.
Maybe they wrote it really badly so someone will take offence and rewrite it for them!
Also in my experience `std::deque` is almost always slower than `std::vector`. In theory it shouldn't be, but copying contiguous data is so fast on modern CPUs that the reallocation of `std::vector` (or occasional front-insertion) is very cheap.
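If you do need queue-like access, a vector plus a head index often still wins; a minimal, unbenchmarked sketch:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // FIFO on one contiguous buffer: pops advance an index instead of
    // deallocating, and storage is recycled once the queue drains.
    template <typename T>
    class vec_queue {
        std::vector<T> items_;
        std::size_t head_ = 0;
    public:
        void push(T v) { items_.push_back(std::move(v)); }
        bool empty() const { return head_ == items_.size(); }
        T& front() { return items_[head_]; }  // precondition: !empty()
        void pop() {
            if (++head_ == items_.size()) { items_.clear(); head_ = 0; }
        }
    };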
> They really like std::deque<T> and use it everywhere, even though sizeof(T) is a few dozen bytes at best, so they should use std::vector instead. A deque is a list of arrays: while it can amortize continual insertion at the front or back, with elements this small a vector is still the better choice.
Are you saying that access patterns (push/pop front/back) never matter and std::vector is always better than std::deque if the element size is very small? I don't think that this is true in this generality.
In CppCon 2016: Chandler Carruth, “High Performance Code 201: Hybrid Data Structures” [1], he says:
"but the standard deque data type is really quite bad. I wouldn't recommend anyone use it for anything. if you look at the implementation constraints, the constraints placed upon its iterators, and its invalidation constraints, it's painted into a very unpleasant corner and it has very few opportunities to be an efficient data structure."
For a language whose purpose is high-performance, the std:: containers are an embarrassment. `deque` is slow, `map` is slow - so the language added `unordered_map` which is also slow...
Are any of the containers (other than `vector`) worth using in high-performance code?
I’m sympathetic to the game developers who I’ve heard say “we use C++, but nothing from std::”.
STL is mostly garbage, which is a big reason it's not used in places that actually need performance, both for build times and runtime.
Yes, it's embarrassing. I think it should be noted that the performance mantra is mostly for show with C++, though. The same people who spout it will also blatantly use pessimizing language features just because, when normal, procedural code would've done the job faster and ultimately simpler.
I think the performance overhead of a few things in C++ make sense, but in general you get performance in C++ by turning things off and abstaining from most of the features. Exceptions complicate mostly everything, so the best thing to do is to turn them off and not rely on anything that uses them, for example.
Modern C++ isn't fast and most of C++ wasn't even before the modern variant was invented. The C "subset" is.
> STL is mostly garbage, which is a big reason it's not used in places that actually need performance, both for build times and runtime.
It is NOT garbage. It is more than sufficient for the 99% of devs who need turnkey data structures and good-enough performance (meaning already faster than 99% of other programming languages).
If what you need is sub-microsecond perf, then yes, redefine your data structures.
BTW, you will very likely have to do that in any language anyway. Because it is impossible to design a fast, forever-living data structure. They (almost) all become obsolete when architectures evolve. Red-black trees were the state-of-the-art DS taught in school 10 years ago; they are useless garbage nowadays if you seek performance.
> It is more than sufficient for the 99% of devs who need turnkey data structures and good-enough performance (meaning already faster than 99% of other programming languages).
I really don't get this argument. If you don't need pedal-to-the-metal performance then why are you using C++ in the first place? (Unless, of course, your answer is "legacy code".)
C++ is being touted as being high performance, but basically every standard data structure besides `std::vector` is garbage for high performance, pedal-to-the-metal code. And not only data structures - `std::regex`'s performance is bad, `std::unique_ptr` doesn't optimize as well as plain pointers, no vendor has a best-in-class `std::hash` implementation (they're neither DoS-safe nor the fastest), etc.
> BTW, you will very likely have to do that in any language anyway. Because it is impossible to design a fast, forever-living data structure. They (almost) all become obsolete when architectures evolve.
Do you, though? Rust already replaced their standard hash map implementation with a completely different one which was faster, so it shows that it can be done.
> I really don't get this argument. If you don't need pedal-to-the-metal performance then why are you using C++ in the first place? (Unless, of course, your answer is "legacy code".)
There are many things other than pure compute-bound CPU performance that can drive you to a GC-less language like C++. Fine-grained memory usage is one of them; latency control is another.
> `std::regex`'s performance is bad, `std::unique_ptr` doesn't optimize as well as plain pointers
However, do you have any benchmark or source for std::unique_ptr? You are the first one I've heard criticize it.
> Do you, though? Rust already replaced their standard hash map implementation with a completely different one which was faster, so it shows that it can be done.
Rust does not have 25 years of codebases behind it. We can talk about that again when it is 10 years older.
C++ cannot afford to randomly break the API of one of its core STL components just to hypothetically gain a bit of performance.
It can be done, but it should be done with a deprecation process and an alternative implementation, like it was done for auto_ptr -> unique_ptr.
Rust didn't break the API of HashMap. The trick is that the STL decided that certain algorithmic requirements were a part of the interface. For std::map, the case here is that it must be sorted. Its API also means you can't be polymorphic over a hashing strategy. This constrains possible valid algorithms. Rust made the opposite choices.
And so they did exactly what you say, for that reason. std::unordered_map is a better comparison.
std::array is great except it's annoying to statically initialize. I really wish make_array or to_array had become non-experimental a lot sooner (to_array finally did land in C++20 at least).
But yeah it does mean you can finally stop doing that silly `sizeof(a)/sizeof(a[0]);` trick.
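For example (std::to_array assumes C++20; std::size is the C++17 replacement for the sizeof trick):

    #include <array>
    #include <iterator>

    int main() {
        auto a = std::to_array({1, 2, 3, 4});   // element type and length deduced

        int c[] = {1, 2, 3, 4};
        auto n_old = sizeof(c) / sizeof(c[0]);  // the silly trick
        auto n_new = std::size(c);              // C++17 equivalent

        return (a.size() == n_old && n_old == n_new) ? 0 : 1;
    }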
Thanks for the reference. But this is still specific to access patterns: You can use a deque as a queue without iterating, so iterators and their invalidation constraints should not matter.
As suspected, everything is dynamically allocated and no memory mapping (see e.g. http://man7.org/linux/man-pages/man2/mmap.2.html) is used. No wonder this is slow and eats a lot of memory. At the moment I have no information about why this design was chosen, whether there is a justification for it, or whether the developers only knew this option. Maybe I can find some hints in the paper. From what I've seen up to now it can be safely assumed that with optimal use of data structures and system functions the C++ results would be at least one order of magnitude better.
I think this applies to any kind of advanced software development or programming languages, not only C++.
There may be many reasons why scientists who are not computer scientists feel more comfortable with Go than with C++, but performance is certainly not one of them.
It depends on your definition of "wrong".
You can mess up your domain logic in any language, but the chances of messing up perf (and even some ref vs. copy semantics that affect logic) in C++ are vastly greater than in GC'd langs.
I've ported C++ code to C# with much perf gain, due to c'tors of every kind getting called all over the place. Doing it right in C++ is too much cognitive overload -- at least for me. If you're writing inner game loops or device drivers, I feel the complexity is worth paying for - but for LOB apps, I can't see choosing C++, or even Rust for that matter.
Rust certainly comes with its share of complexity, but it might be worth clarifying that Rust doesn't have this specific problem of implicit constructors called all over the place. Instead, everything uses move semantics by default.
But most life scientists using the library for their work are not; Go is a language they can master, as well as Python or Lua; if they're biophysicists they're likely to know Fortran and C as well, but I rarely meet scientists fluent in C++. Also not sure how much experience the authors have with such software projects.
What is easy is to write C++ as if it was Java. (You just stick to the standard containers and use smart pointers when allocating objects on the heap.)
They address that somewhat in the discussion section:
> C++ provides many features for more explicit memory management than is possible with reference counting. For example, it provides allocators [35] to decouple memory management from handling of objects in containers. In principle, this may make it possible to use such an allocator to allocate temporary objects that are known to become obsolete during the deallocation pause described above. Such an allocator could then be freed instantly, removing the described pause from the runtime. However, this approach would require a very detailed, error-prone analysis which objects must and must not be managed by such an allocator, and would not translate well to other kinds of pipelines beyond this particular use case. Since elPrep’s focus is on being an open-ended software framework, this approach is therefore not practical.
Custom allocators are a relatively niche topic, and I agree that library users cannot be expected to customize memory behaviour to that level of detail (especially if the library users typically are not professional developers).
However, a certain level of proficiency must be expected, and in the case of C++ this includes "know when to use const&, unique_ptr<T> or shared_ptr<T>". If this cannot be expected of the user, the comparison becomes less a question about performance and more about which language is the best at being the lowest common denominator.
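Roughly the decision table that boils down to (Widget is a placeholder type):

    #include <memory>

    struct Widget { };

    void inspect(const Widget& w);           // read-only access, no ownership implied
    void adopt(std::unique_ptr<Widget> w);   // transfer of sole ownership
    void share(std::shared_ptr<Widget> w);   // genuinely shared lifetime only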
Of course you can use better allocators; but it's faster to avoid dynamic allocation (e.g. by pointing to memory mapped from the input file by the OS) altogether. If they allocate memory for each flyspeck of a 200 GB file and also create and change a reference counter for it, nobody should be surprised about the low performance. Have a look what e.g. shared_ptr does behind the scenes.
Unless you're streaming, in which case mmap'ed access on Linux is generally slower than read/write. At least it was the last time we checked for the Ardour project (probably about 3 years ago).
> elPrep is an open-ended software framework that allows for arbitrary combinations of different functional steps in a pipeline, like duplicate marking, sorting reads, replacing read groups, and so on; additionally, elPrep also accommodates functional steps provided by third-party tool writers. This openness makes it difficult to precisely determine the lifetime of allocated objects during a program run
> Phase 1 allocates various data structures while parsing the read representations from BAM files into heap objects. A subset of these objects become obsolete after phase 1.
> Therefore, manual memory management is not a practical candidate for elPrep,
TL;DR: it's non-deterministic which objects will be garbage when, so some form of dynamic memory management is needed.
Well, one thing that jumps out immediately to me is that everything appears to be using shared_ptr. And by that I mean everything. Why is everything shared? What does it even mean to have a shared_ptr<stringstream>? Arbitrary mixed writes to a string seems like a bad idea, so shouldn't that be unique_ptr?
Or like this:
auto alns = make_shared<deque<shared_ptr<sam_alignment>>>();
A shared_ptr to a deque of shared_ptrs? deque isn't thread-safe, why would it be shared? And why does the deque instance need to be heap allocated at all? It just contains a pointer to the actual allocation anyway, moving it around by value is super cheap?
It's like make_shared is the only way they know to allocate an object. They even put string inside of shared_ptr:
class istream_wrapper {
...
shared_ptr<string> buffer;
It could be that it does need to be shared for some reason, but this looks like a pointer to a pointer for no obvious reason. Even ignoring the atomics that shared_ptr results in, the dependent data loads are going to be brutal.
EDIT: And they don't even seem to use std::move anywhere :/ There's a huge gap between this code and custom allocator territory.
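For contrast, a minimal sketch of the value-based alternative (the sam_alignment fields here are stand-ins for the real type in the repo):

    #include <string>
    #include <vector>

    struct sam_alignment {
        std::string qname;
        int flag = 0;
        int pos = 0;
        // ... remaining fields elided
    };

    std::vector<sam_alignment> parse_batch() {
        std::vector<sam_alignment> alns;        // elements owned by value, contiguously
        alns.push_back({"read_0001", 99, 12345});
        return alns;                            // moved or elided, never deep-copied
    }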
they heap allocate strings because their string_range implementation is a shared_ptr to the original string plus two indices. It might be somewhat worth it assuming the strings are large enough. But if most strings are small, passing them around by move-value would probably be an overall win. One would need to benchmark it.
In that case you'd still want a shared_string not a shared_ptr<string>. The double pointer chases add up quick.
Edit: It might help to keep in mind that string is essentially an alias for unique_ptr<char[]>. As in, string is already heap allocated. They heap allocated a pointer to a heap allocation.
In my experience, this kind of error is the cause of 90% of the "wait, I thought C++ was supposed to be fast so why is it chugging compared to my Java implementation" problems. People turn everything into two dereferences.
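A minimal sketch of what such a shared_string could look like (assumes C++20 for make_shared<char[]>, which puts the refcount and the characters in one allocation):

    #include <cstddef>
    #include <cstring>
    #include <memory>
    #include <string_view>

    class shared_string {
        std::shared_ptr<char[]> data_;   // control block + chars, ONE allocation
        std::size_t offset_ = 0, length_ = 0;
    public:
        explicit shared_string(std::string_view s)
            : data_(std::make_shared<char[]>(s.size())), length_(s.size()) {
            std::memcpy(data_.get(), s.data(), s.size());
        }
        shared_string substr(std::size_t pos, std::size_t len) const {
            shared_string out = *this;   // bumps one refcount, copies no bytes
            out.offset_ += pos;
            out.length_ = len;
            return out;
        }
        std::string_view view() const { return {data_.get() + offset_, length_}; }
    };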
The problem starts with the design decision to stream the file into memory; if you map the file instead you can directly use the mapped data and only have to allocate supporting structures (if required). And anyway, the "Therefore" part does not follow from the "Phase 1" statement.
For objects allocated in phase 1 and not needed afterwards,
I would use std::pmr::monotonic_buffer_resource, added in C++17. Its implementation consists of just a pointer into a large contiguous block of memory. You want n bytes of memory? Return the pointer while adding ptr += n. Deallocation does nothing; destroying the monotonic_buffer_resource object deallocates the memory. If the objects have trivial destructors, which appears to be the case from glancing at the code, this is very efficient.
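A minimal sketch of how that could look (sizes and element types are illustrative):

    #include <memory_resource>
    #include <vector>

    int main() {
        // Everything below is carved out of this arena; deallocate() is a
        // no-op, and the whole region is released when the arena dies.
        std::pmr::monotonic_buffer_resource phase1_arena;

        std::pmr::vector<int> reads{&phase1_arena};
        for (int i = 0; i < 1'000'000; ++i)
            reads.push_back(i);   // bump-pointer allocation, no per-object frees

    }   // one deallocation for the entire phase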
> To achieve good performance, it was therefore necessary to explicitly control how often and when the garbage collector would run to avoid needless interruptions of the main program, especially during parallel phases.
Why? This is a batch program, interruptions don't matter, only the end-to-end time does.
> The goal of elPrep is to simultaneously keep both the runtime and the memory use low.
Why? Keeping runtime low lets you get more work done. Keeping memory use low means what? They are using a machine with 384 GB RAM, make use of it.
Worth noting also that they used GCC 7.2.1, Go 1.9.5, and Java 10. That's a pretty old GCC.
They don't seem to explicitly select a GC with Java, so they'll be using G1. G1 is still not entirely mature. It got much faster between 9 and 10, and somewhat faster from 10 to 11. For a batch process like this, though, the parallel collector is still probably a better choice. Using a newer JDK and a different collector should give better performance - but admittedly, probably won't reduce heap usage.
It has to be noted: it's quite a strange approach all round - this framework reads all the data into memory. So if you have a 100GB genome it will read 100GB into memory. Presumably it stays in memory uncompressed so we are talking hundreds of GB to process even a single whole genome sample.
This may indeed have some performance benefits, but it's a very impractical approach from a hardware point of view. Few places doing processing of genomic data will have many compute nodes with > 256GB memory, yet that would barely process 1 sample with this framework. God forbid you have a family of samples or tumor/normal comparison samples to analyse and need several genomes in memory together.
Genomes are for the most part massively parallelisable and nearly every other toolkit I have seen has put that first and foremost in its design approach. Ensuring tools process data in a streaming manner and pipe between each other is a basic expectation of most genomic data tools.
Which is all to say ... this is a very strange beast and I'm not sure a lot of conclusions can be drawn from it that generalise to other activities or approaches.
Untrue. Large memory nodes are par for the course for genomics workloads. Only a few stages of the analysis pipeline can stream their input/output effectively. Even then, going to disk for the result only for it to be read back is going to bottleneck your pipeline.
We once asked our cluster department to give us bare metal access to their smallest machine. They gave us a dedicated server with 256GB RAM. The real cluster nodes are bigger than that because they handle multiple jobs at the same time and they have hundreds of them.
This isn't some Electron framework where it is inexcusable to use that much RAM. The hardware that is available to scientists is more than capable of handling these workloads.
> Based on our benchmark results, we selected Go as our new implementation language for elPrep, and recommend considering Go as a good candidate for developing other bioinformatics tools for processing SAM/BAM data as well.
They don't seem to take into account that their results depend on their own proficiency in each programming language.
They could've just saved themselves the trouble and said "Anyone fancy doing this in C++? No? Great, that's that tricky question off the table!" rather than going through this ridiculous exercise of writing bad C++ in order to justify not using C++. If we really believe that this is just a way of identifying a revealed preference, fine, but engineers should be smart enough not to need to trick themselves into coming to the obvious conclusion.
This is not surprising. Parallel GC will almost always be faster than refcounting. If you wrote a C or C++ program to accomplish this task after carefully planning out exactly when stuff needs to get manually malloced/freed, you could outperform any of their approaches.
And like another commenter mentioned, if you're writing a program which streams a lot of data sequentially from disk, and where throughput is important (such as in sequencing), you should always be using mmap
That's a case where mmap isn't actually all that much faster than read, and due to the inter processor interrupts needed to synchronize the memory mappings across cores, it may end up much slower. You're grabbing large chunks and flushing the TLB a whole lot.
If you are seeking randomly and doing small reads, then mmap will help quite a bit: the data will be faulted in, and accessing it a second, third, or hundredth time will not cost much.
That makes the assumption that any refcounted object lifetime management can be replaced with manual management, which is not the case.
As the paper states: it is not possible to manually determine the object lifetimes due to the nature of the problem, data and openness of the configurable analysis pipeline. Manual alloc/free would in their case mean holding on to much more memory than a GC/refcounting solution would.
for complicated stuff involving multiple threads and shared memory, it's good to plan it out in advance in some way. but you are right that most data can just get stack allocated
if you look at their code it's like they have no clue that the stack even exists. seriously. they heap allocate EVERYTHING
Yeah i hadn't really looked into their c++ code when I wrote my comment. it looks like code written by someone who doesn't even know the stack exists. insane
Very surprising result. I wouldn’t have bet that this is what would have happened.
But anyone working on language perf should take note even though it’s just one result from one team and one application. Of course they probably used C++ in a not great way and probably use Go in a better way. But maybe that is caused by something in Go that encourages good behavior or at least encourages the kind of behavior that Go optimizes for.
So, even if this result doesn’t mean that C++ devs should switch to Go to get more speed, it’s a result that is worth pondering at least a bit, particularly if you like thinking about what it is that makes languages fast or slow.
> I wouldn’t have bet that this is what would have happened.
Which part of the results are you referring to? It's well known that reference counting has significantly lower throughput than tracing garbage collectors, so the fact that C++ is outperformed here isn't surprising at all.
As a Lisp programmer, I confess that when I see a performance shootout between Go, Java, and C++, I don't expect to hear the result that Go won -- and that its performance is on par with their old Common Lisp program.
Great point. Here’s the issue: there are tons of ways of doing reference counting in C++. Some go all-in with implied borrowing. Some make great use of C++’s various references. Some use the reference counting only for a subset of objects and rely on unique_ptr for hot things.
So, there is no universal answer to how C++ reference counting compares to GC.
There is a well known answer, that you’re probably referring to, if you reference count every object and you do it soundly by conservatively having every pointer be a smart pointer. But it just isn’t how everyone who does reference counting in C++ does it, and I was surprised because I’m used to folks going in the other direction: being unsound as fuck but hella fast.
> there are tons of ways of doing reference counting in C++
There are but critically there are also a lot of ways to not do ref counting at all. C++ isn't a refcounted language, it's a language where you can use refcounting (shared_ptr), but you don't have to (unique_ptr, value types). It's not even recommended to be primarily refcounted.
They chose a really odd subset of C++ to use here (shared_ptr exclusively), very unorthodox and not something I've ever seen elsewhere or recommended.
Sure, but I guess that's the point --- the GCs in Java and Go can handle any allocation pattern you throw at them reasonably well, but there's not, as far as I know, such a "one-size-fits-all" solution in C++ (not that it needs one.)
If you don't want to plan out proper (as in performant) data structures and memory management, C++ is indeed the wrong language. And judging by the stuff discussed in comments above, this is exactly what happened here.
I'm not sure I would call 'using your entire CPU to do trivially small heap allocations' a pattern unless you mean that it's a pattern you see when people wonder why their software is so slow.
Their software seems to be mostly just allocating lots of memory. It seems like they are comparing language speed when you don't know the first thing about optimizing (which is to not allocate memory nonstop).
On one hand this comparison sucks because their C++ is quite non-idiomatic; on the other hand it reflects something about C++. Why is it so easy to write non-idiomatic C++? And so much so that it's much slower than a garbage-collected language.
The fact that GC is well suited to this application isn't surprising, but what I thought was interesting was that subtracting the time to do the deallocation from the C++ benchmark brought it into line with Go. In other words, ignoring memory management, for this application, Go and C++ performed on par.
There is a huge amount of gratuitous reference count updates and double ptr indirection (see for example their string_slice). Those add up quickly.
The rest is a lot of string manipulation. If you are not taking advantage of being able to lay out your objects carefully and avoid memory allocations, I wouldn't expect C++ to have any particular advantage over Go or Java in this particular scenario.
It's not well known, because it's not true, at least not absolutely. Speed could either mean latency or throughput. Reference counting is almost always worse than GC at throughput, but is usually better for latency.
Sorry about the naive question, but if the memory management overhead is worse in Swift, is the hardware it runs on typically better? I'm assuming some of this because I've noticed Android devices tend to require more CPU/memory compared to iOS devices in the same generation.
Lattner on Swift (2016): "...while it is true that modern GC's can provide high performance, they can only do that when they are granted much more memory than the process is actually using. Generally, unless you give the GC 3-4x more memory than is needed, you’ll get thrashing and incredibly poor performance..."
Apparently the custom processors used in iOS devices are actually some of the fastest out there in that form factor.
Designed by Apple (I think using the ARM instruction set) and made on some of the latest, most advanced process nodes by TSMC.
If you look at raw benchmarks, they handily beat the best Qualcomm/"Android SoC" chips out there.
I think this changes very rapidly - given that the market segment of high-end phones has a yearly turnaround. However, I did read an article today that claimed the "cheapest iPhone" (the new SE model) is faster than the most expensive Android phone currently out there.
> elPrep allows users to specify arbitrary combinations of SAM/BAM operations as a single pipeline in one command line.
The assumption is that your native environment for data analysis is bash and you have to expose anything that you might want to do in bash. Further,
> elPrep also accommodates functional steps provided by third-party tool writers
That is, they are attempting to provide semantics general enough in a bash command line that they can handle arbitrary functions being dropped into their program.
Remember, the average salary of a job requiring both programming and biology knowledge is much lower than the average salary of a job requiring programming knowledge alone, so as bioinformaticists build skill to the point where they can be employable as programmers, they mostly leave bioinformatics.
Those who have escaped bash have usually done so into Python these days, which creates a class system of those gluing things together in Python versus those implementing algorithms in C or C++.
"We have not performed a detailed comparison against the original version of elPrep implemented in Common Lisp, but based on previous performance benchmarks, the Go implementation seems to perform close to the Common Lisp implementation." LMFAO.
If I understand correctly, new Go code was as fast as old Lisp code, with much less effort and much clearer code.
> Most existing Common Lisp implementations use stop-the-world, sequential garbage collectors. To achieve good performance, it was therefore necessary to explicitly control how often and when the garbage collector would run to avoid needless interruptions of the main program, especially during parallel phases. As a consequence, we also had to avoid unnecessary memory allocations, and reuse already allocated memory as far as possible, to reduce the number of garbage collector runs. However, our more recent attempts to add more functionality to elPrep (like optical duplicate marking, base quality score recalibration, and so on) required allocating additional memory for these new steps, and it became an even more complex task and a serious productivity bottleneck to keep memory allocation and garbage collection in check.
It seems that all the language implementations in the benchmarking exercise utilize garbage collection (C++ using ref counting). It would be very interesting if someone extended this benchmark with the D programming language, arguably the fastest language with a GC. The fact that the elPrep tool uses a functional software architecture [1] and that D natively supports functional programming makes me think this type of computing is very well suited to it.
It would be interesting to see an output of the compiler flags used in the C++ program that was benchmarked. The build script doesn't set an optimization level so it would default to -O0:
They use shared_ptr pervasively in their C++ implementation. It's not the best way to manage objects in C++ by far. The only thing that this performance comparison really tells me is that shared_ptr is worse than a modern GC. This is not really a surprising find.
Probably worth noting in the title that this is a June 2019 publication, so the versions used (gcc 7.2.1, go 1.9.5) etc. are not exceptionally ancient - basically they used what was readily available in CentOS 7 around that time.
I see, it sounded like a buzzword term (which I guess it still might be). The point is that the current title makes it sound like a general comparison, while the original title makes it clear that it's comparing three implementations of a single tool.
Ignoring the terrible C++, why would they bother to write the same program in three languages. I feel like spending 3x time on just one of them would have produced the best outcome.
In academic publication it is a convention to compare your proposed implementation with at least two other comparable competitors (languages, frameworks, algorithms, schemes, etc.). I think it is naturally intriguing and refreshing to see a performance comparison of programming languages by non-authors of the languages themselves, even though the implementations are probably not optimized to death.
It seems that the performance was dominated by memory management, so the comparison is not between the languages per se, but between their current garbage collectors, and respectively the reference counting implementation for C++.
Yeah, further down in the article they track down the gap between cpp and Go/Java to the ref-counting deallocation work. I'm not a cpp expert, but it seems surprising to me that GC would beat ref-counting in any scenario.
That might seem counter-intuitive, but tracing GC is the fastest way to manage memory in general (arbitrary life time, graph shapes and allocation patterns).
It permits really fast allocation and defragmentation, and is sometimes the only way to manage memory (cyclic data structures for which reachability cannot be known directly from the program text).
Reference-counted GC on the other hand is notoriously slow, incomplete, and exhibits lots of bad performance patterns (which can be somewhat managed by borrowing some elements of tracing GCs).
The drawback is that tracing GCs are very hard to implement, to tune, and sometimes not all of the benefits are available at the same time. And because they are so counter intuitive, programmers in general have a hard time understanding their runtime behavior... Even GC experts struggle (the skills required range from low-level hardware to control theory, otherwise performance at scale is unpredictable).
In C/C++ we have the possibility to avoid dynamic allocation altogether and to use system features like memory mapping. If we use C++ the same way as Java (everything dynamically) it's not too surprising the result is not (much) faster than Java.
Their original implementation was in common lisp. Common Lisp also lets you use mmap (in fact it's not that uncommon to do so if you have a large amount of mostly static data) to manage your memory manually.
They clearly wanted automatic memory management, so the C++ implementation is reasonable. A fairer comparison might have used MPS or boehm instead of refcounting, but I suspect the results would have been similar.
Fig. 4 shows that deallocation, presumably of objects allocated during the first phase, takes half as long as the first phase itself. Unless the nature of the problem solved by the program requires you to have a ton of objects with shared mutability (the use case for shared_ptr), and requires you to make a lot of allocations up-front without knowing if you will need them, the memory thrashing occurring does not seem reasonable.
> But not without allocating dynamic memory and copying data.
Sure it does. In SBCL you can force a stack allocation (though rarely does it improve performance), and very short-lived values do not leave registers in any case.
> > They clearly wanted automatic memory management
> Most likely because of some misconceptions.
There are both good and bad reasons to want automatic memory management. At least one good reason it it would decrease the porting effort by keeping the code similar.
> > so the C++ implementation is reasonable.
> How so?
Using reference counting is a reasonable way to get automatic memory management in C++
> > but I suspect the results would have been similar
> Don't forget the data sets to be filtered, sorted an analyzed are up to 200 GB.
Which is going to be rough on any automatic memory management system, which makes using a language with a better ecosystem of automatic memory management more performant.
> > Don't forget the data sets to be filtered, sorted an analyzed are up to 200 GB.
> Which is going to be rough on any automatic memory management system
You could equally say that it makes the case for actually designing memory allocation strategy (which only C++ really supports) that much more important.
You always see this in Java programs for large data analysis. They pick Java because of memory management and the tooling. But it's just SO slow, and after optimisation the only thing that stubbornly remains up there in the profiler data is memory and GC. And what do they do?
A global object of the following form:
    class DataStore {
        // preallocated once; constHowMany/anotherConst are compile-time constants
        static final float[] theFloatsWeNeed = new float[constHowMany];
        static final int[] theIntsWeNeed = new int[anotherConst];
    }
You get the idea. Because this avoids memory allocation in Java. And you use the flyweight pattern to pass data around. Or you fake pointer arithmetic in Java: you create your own pointers by specifying indexes and you, oh the horror, use math on those indexes. Even then, just bounds-checking those indexes actually becomes a significant time sink (and then you disable that, which of course kills memory safety in Java, but you won't care).
The truth is you don't want memory management for large amounts of data. You don't want to allocate it, track it or deallocate it at all. You leave it in its on-disk data format and never serialize/deserialize it at all. You want to mmap it into your program, operate on it, and then just close the mmap when you're done. C++ definitely has the best tools for this way of working.
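A minimal sketch of that workflow on POSIX (the file name and the newline counting are purely illustrative):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = open("reads.sam", O_RDONLY);
        if (fd < 0) return 1;

        struct stat st;
        if (fstat(fd, &st) != 0) return 1;

        // The kernel pages the file in on demand: no read() copies into
        // userspace buffers, no per-record allocation, no deserialization.
        const char* data = static_cast<const char*>(
            mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
        if (data == MAP_FAILED) return 1;

        std::size_t newlines = 0;   // stand-in for real parsing
        for (off_t i = 0; i < st.st_size; ++i)
            newlines += (data[i] == '\n');
        std::printf("%zu lines\n", newlines);

        munmap(const_cast<char*>(data), st.st_size);
        close(fd);
    }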
Yeah, right. The people who know this can apparently be counted on one hand when I look at the advertised publication and the discussion in this forum.
C++: It is permissible for an implementation to allocate every local variable on the heap.
Going back to my original point, you are suggesting a complete rearchitecture of their allocation system. That does not require switching languages to C++. If we are talking about working with 100s of GB of data, that's probably even the correct approach!
TFA does not, however, claim that they have a working set of 100s of GB of data. The data is 100s of GB at rest, but can be processed in chunks with a single pass. That, by itself, does not scream "mmap" to me. On top of that, the data is compressed at rest, so copying is inevitable.
So we are glad that we can use the data of the memory-mapped file directly and do not have to allocate or copy anything. But of course you may solve the problem badly if you prefer; after all, there are even "scientific" publications that do it that way, as the example shows.
1. Reference counting is a form of GC; you could implement a JVM that used reference counting (though in order to be general a small amount of additional work is needed)
2. Reference counting causes extra work every time a reference appears or disappears. Tracing GCs amortize that cost across many allocations.
2.b. This is particularly hurtful to performance for short-lived objects, since most tracing GCs have zero GC overhead for short-lived objects (the cost of a nursery collection under most implementations scales with the amount of live data in the nursery, so objects that appear and disappear in the time-span of a single nursery GC are freed at zero extra cost). Furthermore a tracing GC
3. Malloc cannot move allocated data, so many implementations have a lot of complexity to avoid heap fragmentation, which comes at a cost to both allocating and freeing data. Many GC'd languages allocate small objects with a single instruction in the typical case (just incrementing a pointer; the non-typical case is when the nursery is full and a GC happens) - see the sketch after this list.
4. The JVM and Go both have a lot of effort put into their GC; the ref-counting implementation used by this test is probably a bit more naive. In particular they talk about large delays when a chain of links causes many allocations to die at the same time. A less naive refcounting implementation would queue deleted objects and spread that work out across a larger time period.
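To make point 3 concrete, here is a toy nursery-style bump allocator; not any particular runtime's implementation, just the fast path being described:

    #include <cstddef>
    #include <vector>

    class Nursery {
        std::vector<std::byte> buf_;
        std::size_t top_ = 0;
    public:
        explicit Nursery(std::size_t bytes) : buf_(bytes) {}

        void* alloc(std::size_t n) {
            n = (n + 15) & ~std::size_t{15};             // keep 16-byte alignment
            if (top_ + n > buf_.size()) return nullptr;  // full: a real GC runs here
            void* p = buf_.data() + top_;
            top_ += n;                                   // the "single instruction"
            return p;
        }
        // Short-lived objects die for free: a real collector would copy the
        // survivors out, then the whole nursery resets in O(1).
        void reset() { top_ = 0; }
    };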
Performance characteristics depend where on that continuum your workload falls. For example, Erlang/BEAM uses a generational GC for most common heap objects, but refcounts large binary blobs. This is pretty much a perfect case for refcounting: new references are created infrequently, copying or moving is expensive, destruction is deterministic and happens immediately after the last reference disappears, and there're no pointers within the blob that would require tracing or cycle-detection.
Similarly, UI components within a GUI is another good case for refcounting (and presumably why Apple continues to use this for Objective-C and Swift in Cocoa). New references happen only in non-performance-critical code, most data remains live across collections, and copying/moving existing data would a.) be slow and b.) would invalidate any C pointers into contained data.
Sounds like the particular problem domain described in this article is one where heap allocations are frequent, which makes generational GCs more appropriate. That's probably the case with the vast majority of computational algorithms, but there are definitely problem domains where refcounting continues to beat GC.
> (though in order to be general a small amount of additional work is needed)
Note also that there are two methods of cycle detection for a reference-counted GC that are not just a backup tracing GC:
1. Trial deletion (known since at least the mid 80s)
2. Various tracing systems that exploit extra information known to reference-counted systems e.g. Levanoni/Petrank[1] which actually implemented a reference counted GC for Java.
Refcounting can mess with caches in multithreaded scenarios, so the results aren't really surprising to me. It also generates unnecessary work to maintain counters on all writing threads, while a concurrent tracing GC batches all its actions on a separate thread.
"With reference counting, objects are recognized as obsolete due to their reference counts dropping to zero. Deallocation of these objects leads to transitive deallocations of other objects because of their reference counts transitively dropping to zero. Since this is an inherently sequential process, this leads to a similar significant pause as with a stop-the-world garbage collector."
There are three primary performance penalties for reference counting:
1. This, where dropping one link causes a deallocation chain.
2. A typical heap implementation will leave live objects scattered throughout memory, leading to cache issues.
3. Maintaining the reference counts may cause writes to memory, leading to other cache issues, plus atomicity.
It was mentioned in the article and was not considered a candidate due to a specific API requirement:
> Other mature programming languages with support for reference counting include Objective-C, Swift, and Rust [50]. However, in its algorithm for duplicate marking, elPrep requires an atomic compare-and-swap operation on reference-counted pointers, which does not exist in those languages, but exists in C++17.
Except that this "atomic" operation over shared_ptr's in the C++ standard isn't actually required to be lock free, and is in fact not lock free in common stdlib implementations. So they're not actually gaining anything over, e.g. RwLock<Arc<RwLock<T>>> in Rust.
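For the curious, this is the C++11 free-function API in question (deprecated in C++20 in favor of std::atomic<std::shared_ptr<T>>); on common standard libraries the lock-free query reports false:

    #include <atomic>
    #include <iostream>
    #include <memory>

    int main() {
        auto sp = std::make_shared<int>(42);
        std::cout << std::atomic_is_lock_free(&sp) << '\n';  // typically 0

        auto expected = std::atomic_load(&sp);
        auto desired  = std::make_shared<int>(43);
        // CAS on the shared_ptr itself; internally a spinlock pool on
        // common implementations, not a hardware DCAS.
        std::atomic_compare_exchange_strong(&sp, &expected, desired);
    }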
They are atomic in the sense that it is not possible to observe the intermediate states and they are not data races.
It is extremely hard though to implement them in a lock-free way without a true DCAS (which pretty much no architecture implements). I think all implementations use spinlock pools. I guess transactional hardware could be used.
This is one of the reasons why GC does make it easier to implement some lock-free algorithms: you do not have to worry about races during memory deallocation. Without GC, I guess the next best thing is RCU or hazard pointers, but they are significantly more complex.
From a cursory read I do not see any fancy lock free algorithm in the code though, so I'm not sure why they need atomic operations on shared_ptrs.
I am aware, but that restriction shouldn't affect its use for an atomic ref-counted pointer since you can place the count and the pointer next to each other.
How would it work exactly? The layout of a generic shared ptr (but not specifically std::shared_ptr) is:
ptr -> (count, payload)
instead of:
(count, ptr) -> payload
As you can see, count is not alongside the pointer itself as distinct instances of ptrs need to share the count.
Let's say you want to acquire an additional reference to ptr. You need to both copy the current value of ptr and increment count atomically to protect against concurrent modifications of ptr that might drop the last reference to payload.
Delaying the deallocation of payload via hazard pointers, RCU or some other deferred reclamation scheme works, but it is significantly more complex. I believe this is what the Rust arc-swap crate does internally.