
> elPrep is an open-ended software framework that allows for arbitrary combinations of different functional steps in a pipeline, like duplicate marking, sorting reads, replacing read groups, and so on; additionally, elPrep also accommodates functional steps provided by third-party tool writers. This openness makes it difficult to precisely determine the lifetime of allocated objects during a program run

> Phase 1 allocates various data structures while parsing the read representations from BAM files into heap objects. A subset of these objects become obsolete after phase 1.

> Therefore, manual memory management is not a practical candidate for elPrep,

TL;DR: it's non-deterministic which objects will become garbage and when, so some form of dynamic memory management is needed.

If you disagree, what would you use instead?




Well, one thing that jumps out immediately to me is that everything appears to be using shared_ptr. And by that I mean everything. Why is everything shared? What does it even mean to have a shared_ptr<stringstream>? Arbitrary mixed writes to a string seem like a bad idea, so shouldn't that be unique_ptr?

Or like this:

    auto alns = make_shared<deque<shared_ptr<sam_alignment>>>();
A shared_ptr to a deque of shared_ptrs? deque isn't thread-safe, so why would it be shared? And why does the deque instance need to be heap-allocated at all? It just contains a pointer to the actual allocation anyway; moving it around by value is super cheap.
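For contrast, a minimal sketch of just passing the deque by value (sam_alignment here is a stand-in for the real type):

    #include <deque>
    #include <string>
    #include <utility>

    struct sam_alignment { std::string qname, seq; };  // stand-in

    // Takes ownership by value; moving the deque in just transfers its
    // internal pointers, none of the elements are touched.
    void process(std::deque<sam_alignment> alns) {
        // ... duplicate marking, sorting, etc. ...
    }

    int main() {
        std::deque<sam_alignment> alns;
        alns.push_back({"read1", "ACGT"});
        process(std::move(alns));  // no heap allocation for the container handle
    }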

It's like make_shared is the only way they know to allocate an object. They even put string inside of shared_ptr:

    class istream_wrapper {
        ...
        shared_ptr<string> buffer;
It could be that it does need to be shared for some reason, but this looks like a pointer to a pointer for no obvious reason. Even ignoring the atomics that shared_ptr results in, the dependent data loads are going to be brutal.

EDIT: And they don't even seem to use std::move anywhere :/ There's a huge gap between this code and custom allocator territory.


They heap-allocate strings because their string_range implementation is a shared_ptr to the original string plus two indices. It might be somewhat worth it, assuming the strings are large enough. But if most strings are small, passing them around by value with moves would probably be an overall win. One would need to benchmark it.
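From that description, the layout is presumably something like this (reconstructed; the field names are guesses):

    #include <cstddef>
    #include <memory>
    #include <string>
    #include <string_view>

    // A refcounted handle to the whole original string plus [begin, end)
    // indices into it. Copies are cheap and keep the backing string alive,
    // but every access is two pointer chases.
    struct string_range {
        std::shared_ptr<std::string> source;
        std::size_t begin = 0, end = 0;

        std::string_view view() const {
            return std::string_view(*source).substr(begin, end - begin);
        }
    };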


In that case you'd still want a shared_string not a shared_ptr<string>. The double pointer chases add up quick.

Edit: It might help to keep in mind that string is essentially an alias for unique_ptr<char[]>. As in, string's data is already heap-allocated. They heap-allocated a pointer to a heap allocation.
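A shared_string along those lines just needs to keep the refcount next to the characters. A minimal sketch using C++20's array make_shared, which puts the control block and the data in one allocation:

    #include <cstddef>
    #include <cstring>
    #include <memory>
    #include <string_view>

    // One allocation holds both the refcount and the characters, so reading
    // the text is a single pointer chase instead of two.
    class shared_string {
        std::shared_ptr<char[]> data_;
        std::size_t size_ = 0;
    public:
        shared_string() = default;
        explicit shared_string(std::string_view s)
            : data_(std::make_shared<char[]>(s.size())), size_(s.size()) {
            std::memcpy(data_.get(), s.data(), s.size());
        }
        std::string_view view() const { return {data_.get(), size_}; }
    };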


In my experience, this kind of error is the cause of 90% of the "wait, I thought C++ was supposed to be fast so why is it chugging compared to my Java implementation" problems. People turn everything into two dereferences.


The problem starts with the design decision to stream the file into memory; if you map the file instead, you can use the mapped data directly and only have to allocate supporting structures (if required). And anyway, the "Therefore" part does not follow from the "Phase 1" statement.
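Roughly, on POSIX (error handling kept minimal; for BAM specifically you'd still have to deal with the BGZF compression before parsing in place):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>

    // Map a file read-only. The returned pointer views the page cache
    // directly, so a parser can hand out pointers/indices into it instead
    // of copying each record onto the heap.
    const char* map_file(const char* path, std::size_t& len) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return nullptr;
        struct stat st;
        if (fstat(fd, &st) < 0) { close(fd); return nullptr; }
        len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  // the mapping outlives the descriptor
        return p == MAP_FAILED ? nullptr : static_cast<const char*>(p);
    }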


For objects allocated in phase 1 and not needed afterwards, I would use std::pmr::monotonic_buffer_resource, added in C++17. Its implementation is essentially just a pointer into a large contiguous block: you want n bytes of memory? Return the current pointer and bump it, ptr += n. Deallocation does nothing. Destroying the monotonic_buffer_resource object releases all the memory at once. If the objects have trivial destructors, which seems to be the case from glancing at the code, this is very efficient.
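Rough sketch of how that looks with the pmr containers (the size and element type are made up):

    #include <memory_resource>
    #include <string>
    #include <vector>

    int main() {
        // One upfront block for all of phase 1: allocation is a pointer
        // bump, deallocate() is a no-op.
        std::pmr::monotonic_buffer_resource arena(64 * 1024 * 1024);

        // pmr containers propagate the arena to their elements, so the
        // pmr::strings below allocate out of it too.
        std::pmr::vector<std::pmr::string> reads(&arena);
        reads.emplace_back("read1\tACGT");

        // ... phase 1 work ...

    }   // `reads` dies first, then the arena frees everything at once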



