
> The only thing you can really quibble about with std::vector is [...]

That's actually not true, though I certainly don't fault you for believing it :-) There are definitely more things to quibble about around vector if you're serious about performance. As an example, try writing a can_fit(n) function, which tells you whether the vector can fit n more elements without reallocating. Observe the performance difference between (a) a smart manual version, (b) a naive manual version, and (c) the only STL version: https://godbolt.org/z/88sfM1sxW

  #include <vector>

  template<class T>
  struct Vec { T *b, *e, *f; };

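  // (a) smart manual version: compare the raw byte distance against n * sizeof(T) (no division)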
  template<class T>
  bool can_fit_fast(Vec<T> const &v, size_t n) {
    return reinterpret_cast<char*>(v.f) - reinterpret_cast<char*>(v.e) >= n * sizeof(T);
  }

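  // (b) naive manual version: pointer difference, i.e. byte difference divided by sizeof(T)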
  template<class T>
  bool can_fit(Vec<T> const &v, size_t n) {
    return v.f - v.e >= n;
  }

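  // (c) STL version: capacity() - size(), each itself a pointer difference (two divisions)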
  template<class T>
  bool can_fit(std::vector<T> const &v, size_t n) {
    return v.capacity() - v.size() >= n;
  }

  struct S { size_t a[3]; };

  template bool can_fit_fast(Vec<S> const &, size_t);
  template bool can_fit(Vec<S> const &, size_t);
  template bool can_fit(std::vector<S> const &, size_t);



I was curious why the std::vector version produced so much more assembly, so I tried to dig a little. At least in libc++, the relevant code is (cleaned up):

    template<typename T, class Allocator>
    class vector {
    private:
        pointer __begin_;
        pointer __end_;
        __compressed_pair<pointer, allocator_type> __end_cap_;
    public:
        constexpr const pointer& __end_cap() const noexcept {
            return this->__end_cap_.first();
        }
        constexpr size_type size() const noexcept {
            return static_cast<size_type>(this->__end_ - this->__begin_);
        }
        constexpr size_type capacity() const noexcept {
            return static_cast<size_type>(__end_cap() - this->__begin_);
        }
    };
So if the functions are fully inlined we should end up with

    template<class T>
    bool can_fit(std::vector<T> const &v, size_t n) {
        return static_cast<size_t>(v.__end_cap_.first() - v.__begin_) - static_cast<size_t>(v.__end_ - v.__begin_) >= n;
    }
At least algebraically (and ignoring the casts), that should reduce to v.__end_cap_.first() - v.__end_ >= n, which is more or less what the manual implementations do. Maybe the optimizer can't make that transformation for some reason or another (overflow and/or not knowing the relationship between the pointers involved, maybe)?

If you change can_fit(Vec<S>) to:

    return (v.f - v.b) - (v.e - v.b) >= n;
You end up with code that looks pretty similar to the can_fit(std::vector<S>) overload (identical for Clang, a bit different for GCC), so it does seem to be something about the extra pointer math that can't be safely reduced; the casts aren't really relevant.

(I'm also a bit surprised that can_fit_fast produces different assembly than the Vec can_fit overload)


Kind of, yeah. The main thing to realize here is that pointer subtraction is not just arithmetic subtraction of the addresses; it's also followed by a division by sizeof(T). (This is not obvious in this context unless you've seen it before.) Thus for the compiler to prove that (f - b) - (e - b) == f - e, it has to prove the remainders of each address subtraction (mod sizeof(T)) don't affect the result. This is certainly possible, but it requires compilers to actually do some extra legwork (e.g.: assume these come from the same array, then prove that implies the remainders don't affect the result, etc.) prior to reducing these to arithmetic operations.

For whatever reason they don't do that. My guess is the reason is something between "we believe the general case would break too much existing code" and "we believe this special case isn't common enough to be worth it". But who knows; maybe if someone files this as a missed optimization, they would try to implement it. I haven't tried.
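To make the semantics concrete, here's a rough sketch (my own spelling-out, not from the thread or the godbolt link) of what the two forms mean once pointer subtraction is written as byte subtraction followed by division:

    #include <cstddef>

    struct S { std::size_t a[3]; };  // sizeof(S) == 24 on typical 64-bit targets

    // What "f - e" means for S*: byte distance, then one division by sizeof(S).
    std::ptrdiff_t direct(S *e, S *f) {
      return ((char*)f - (char*)e) / (std::ptrdiff_t)sizeof(S);
    }

    // What "(f - b) - (e - b)" means: two such divisions. Folding them into one
    // requires proving that the two remainders cancel out.
    std::ptrdiff_t via_base(S *b, S *e, S *f) {
      return ((char*)f - (char*)b) / (std::ptrdiff_t)sizeof(S)
           - ((char*)e - (char*)b) / (std::ptrdiff_t)sizeof(S);
    }

The can_fit_fast version above sidesteps the division entirely by comparing the raw byte distance against n * sizeof(T).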

(There's also a secondary issue here, which is that sizeof(S) isn't a power of 2. That's what's introducing multiplications, instead of bit shifts. You might not see as drastic of a difference if sizeof(S) == alignof(S).)
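As a quick illustration of that point (my own example, not from the original link): with a power-of-two element size, the exact division in the pointer subtraction lowers to a plain shift, whereas dividing by 24 typically needs a shift plus a multiply by a magic constant.

    #include <cstddef>

    struct S24 { std::size_t a[3]; };  // sizeof == 24
    struct S32 { std::size_t a[4]; };  // sizeof == 32

    // dist24 typically compiles to sub + shift + multiply; dist32 to sub + shift.
    std::ptrdiff_t dist24(S24 *e, S24 *f) { return f - e; }
    std::ptrdiff_t dist32(S32 *e, S32 *f) { return f - e; }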


Yeah, the extra instructions from the division helped clue me into what might be going on, since the individual (f - b) and (e - b) calculations are visible in Clang's output.

I feel the division by sizeof(T) shouldn't matter that much, since the compiler knows it has pointers to T, so I don't think the divisions would have remainders. I want to say pointer overflow and arithmetic on pointers into different objects (allocations?) should also be UB, so I suppose that might clear up most obstacles? I think I'm still missing something...

Does make me wonder how frequently this pattern might pop up elsewhere if it does turn out to be optimizable.


My 1st paragraph was directly answering your 2nd paragraph here (starting from "This is certainly possible..." to the end). I was saying, compilers can optimize this if they want to, but it requires work to implement, and I can only guess (the reasons I listed) as to why they might not have done so yet.

> Does make me wonder how frequently this pattern might pop up elsewhere if it does turn out to be optimizable.

Probably a fair bit, but as I mentioned, it might break a lot of code too, because there's too much code in the wild doing illegal things with pointers (like shoving random state into the lower bits, etc.). Or not... the Clang folks would probably know better.


Right, I suppose my second paragraph was waffling on the extra legwork the compiler would have to do, but I know validating optimizations is just the start of the work.

Maybe this could be a good way to jump into messing with LLVM...

Out of curiosity, how much of a performance difference did you observe in practice when you made this optimization?


I don't recall unfortunately, it's been a few years.


I wonder if you actually measured it with proper compiler options? I've seen some folks insist that the STL is somehow slow, blaming unnecessary layers of abstraction, but many of those claims are pretty much groundless. There are some genuine spots where the STL has real performance problems, but I don't think your example illustrates them well.


Click the link in my post, I added it so you can see everything in action for yourself.

(Although, note I didn't claim "the STL is slow"... that's painting with a much broader stroke than I made.)


Why would that be faster? Those function calls are going to be inlined.


It's not the function calls they are alluding to; it's the way the compiler generates a bunch of shifts and multiplies except in the can_fit_fast case.


Turns out GCC is more clever than Clang here, and MSVC is just horrendous. (See update, I posted a link.)


I somewhat agree with your point (esp. that MSVC is hideous) but I also stand by mine. I don't feel that checking the capacity is something that would be in my tight loop, because checking for capacity to hold N is something you do before adding N items, which amortizes the cost, meaning the capacity check is just off the real hot path. So it doesn't feel like a realistic use case.

Speaking of realism, putting these in quickbench seems to confirm that the differences between them are not material, and that the STL version is in fact the quickest, but they are all essentially free. There's not a way to make a realistic microbenchmark for this, for the same reason that it doesn't feel like a real-world performance issue.

By the way clang does a much better job here: https://quick-bench.com/q/XcKK782d-7A6YHbiBRTlOnIRnPY


> I don't feel that checking the capacity is something that would be in my tight loop, because checking for capacity to hold N is something you do before adding N items, which amortizes the cost, meaning the capacity check is just off the real hot path.

Your fallacy here is assuming that just because N is a variable, N must be large. N can easily be 0, 1, 2, etc... it's a variable because it's not a fixed compile-time value, not because it's necessarily large. (This isn't complexity analysis!)
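To sketch the shape of loop I have in mind (hypothetical record type and sizes, not the actual code I was working with):

    #include <cstddef>
    #include <vector>

    // Hypothetical: each record appends only a handful of bytes, so the
    // capacity check runs once per record, right on the hot path, and n is
    // tiny every time.
    struct Record { unsigned char payload[8]; std::size_t len; };  // len <= 8

    void encode_all(std::vector<unsigned char> &out,
                    const std::vector<Record> &records) {
      for (const Record &r : records) {
        std::size_t n = r.len;                 // typically 1..8, never "large"
        if (out.capacity() - out.size() < n)   // the check under discussion
          out.reserve(out.size() + n);         // growth policy elided for brevity
        out.insert(out.end(), r.payload, r.payload + r.len);
      }
    }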

> Speaking of realism, putting these in quickbench seems to confirm that the differences between them are not material, and that the STL version is in fact the quickest,

Your benchmark is what's unrealistic, not my example.

I'm guessing you didn't look at the disassembly (?) because your STL version is using SSE2 instructions (vectorizing?), which should tell you something funny is going on, because this isn't vectorizable, and it's not like we have floating-point here (which uses SSE2 by default).

Notice you're just doing arithmetic on the pointers repeatedly. You're not actually modifying them. It sure looks like Clang is noticing this and vectorizing your (completely useless) math. This is as far from "realistic" as you could possibly make the benchmark. Nobody would do this and then throw away the result. They would actually try to modify the vector in between.
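For what it's worth, here's a sketch of the shape I'd expect a more defensible benchmark body to take (assuming the usual google-benchmark setup that quick-bench provides; the workload here is mine and purely illustrative):

    #include <benchmark/benchmark.h>
    #include <cstddef>
    #include <vector>

    struct S { std::size_t a[3]; };

    template<class T>
    bool can_fit(std::vector<T> const &v, std::size_t n) {
      return v.capacity() - v.size() >= n;
    }

    static void BM_CanFitThenPush(benchmark::State &state) {
      std::vector<S> v;
      v.reserve(1024);
      for (auto _ : state) {
        // Consume the result and mutate the vector between checks, so the
        // check can't be hoisted out of the loop or vectorized away.
        if (!can_fit(v, 1))
          v.clear();
        v.push_back(S{});
        benchmark::ClobberMemory();
      }
    }
    BENCHMARK(BM_CanFitThenPush);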

I don't have the energy to play with your example, but I would suggest playing around with it more and trying to disprove your position before assuming you've proven anything. Benchmarks are notoriously easy to get wrong.

> So it doesn't feel like a realistic use case.

I only knew of this example because I've literally had to optimize this before. Not everyone has seen every performance problem; clearly you hadn't run across this issue before. That's fine, but that doesn't mean reality is limited to what you've seen.

I really recommend not replying with this sentiment in the future. This sort of denial of people's reality (sadly too common in the performance space) just turns people off from the conversation, and makes them immediately discredit (or just ignore/disengage from) everything you say afterward. Which is a shame, because this kind of a conversation can be a learning experience for both sides, but it can't do that if you just cement yourself in your own position and assume others are painting a false reality.


I can't believe you typed all that just to show your condescending, point-missing aspect to its fullest.


I gave you an example from my personal experience in the real world. Instead of asking for the context, a benchmark, or anything else, you just went ahead and declared that I'm giving you an unrealistic example... surely you can see why that's kind of offensive? I suggested you not do that, and I even explained specifically what looked off in your benchmark. Is that missing the point and being condescending?



