A person who works at Intel.


Four times the price of fully unlimited 4G for a much better connection with more availability than 4G seems fair to me.

If you don't need starlink, then don't get it.

But if you have a cabin in the Italian Alps where it's either Starlink or nothing, then €124/month seems like a much better deal than having no internet at all. Particularly if, for those €124/month, one or more people can live there and work, earning multiple thousands of euros per month of income (vs. zero income living there without Starlink).


lol please no

If you are using C++, and want to parallelize something, just add "std::execution::par" to your algorithms.

Instead of writing "std::for_each(...)" just write "std::for_each(std::execution::par, ...)".

That's it. It really is that simple. And with the right compilers you can compile the same code to run on FPGAs, GPUs, or whatever.

For someone who knows C++, that is the lowest barrier to entry, and it gets you most of the way there without having to learn "some other programming language" like OpenMP (or anything else).
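
For example (a minimal, self-contained sketch; any C++17 standard library with parallel algorithm support will do):

    #include <algorithm>
    #include <execution>
    #include <vector>
    int main() {
        std::vector<int> v(1'000'000, 1);
        // Same algorithm as the serial version; the only change
        // is the execution policy as the first argument.
        std::for_each(std::execution::par, v.begin(), v.end(),
                      [](int& x) { x *= 2; });
    }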


OpenMP can parallelize a for loop like:

    #pragma omp parallel for
    for(int i=0; i<1000000; i++)
        C[i] = A[i] + B[i];
It's normal C++ and the pragma auto-parallelizes the loop. It's actually really easy and convenient.

OpenMP is probably the easiest fork/join model of parallelism on any C++ system I've used. It doesn't always get the best utilization of your CPU cores, but it's really, really simple.

It's the best way to start IMO, far easier than std::thread, or switching to functional style for other libraries. Just write the same code as before but with a few #pragma omp statements here and there.
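
As a complete, self-contained sketch (file name and sizes are arbitrary; build with g++ -fopenmp or clang++ -fopenmp):

    #include <vector>
    int main() {
        const int n = 1'000'000;
        std::vector<float> A(n, 1.0f), B(n, 2.0f), C(n);
        // Each iteration is independent, so OpenMP can split
        // the range across a team of threads.
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i];
    }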


> It's normal C++

C++ is an ISO standard, you can use it _everywhere_, in space, in automotive, in aviation, in trains, in medical devices, _everywhere_.

OpenMP is not C++; it is a different programming language from C++.

OpenMP is not an ISO standard, and you can't use it in _most_ domains in which you can use C++.

Your example:

    #pragma omp parallel for
    for(int i=0; i<1000000; i++)
        C[i] = A[i] + B[i];
shows how bad OpenMP is.

It does not run in parallel on GPUs or on FPGAs (lacking target offload directives), and you can't use it in most domains in which you can use C++.

The following is ISO standard C++:

    std::for_each_n(std::execution::par, std::views::iota(0).begin(), 1000000, [](int i) {
        C[i] = A[i] + B[i];
    });
it runs in _parallel_ EVERYWHERE: GPUs, CPUs, FPGAs, and it is certified for all domains for which C++ is (that is: all domains).

Show me how to sort an array in parallel on _ANY_ hardware (CPUs, GPUs, FPGAs) with OpenMP. With C++ it's as simple as:

    std::sort(std::execution::par, array.begin(), array.end());
If you have a GPU, this offloads to the GPU. If you have an FPGA, this offloads to the FPGA. If you have a CPU with 200 cores, this uses those 200 cores.

There is no need to turn your ISO C++ compliant program or libraries into OpenMP. That prevents them from being used in many domains in which C++ runs. It also adds an external dependency for parallelism, for no good reason.

For any problem that OpenMP can solve, OpenMP is _always_ a worse solution than just using strictly ISO standard and compliant C++.

OpenMP has completely lost a reason to exist. It's not 1990 anymore.


You sure know how to ruin a good thing with bad demeanor. When someone likes something and you tell them they shouldn't like it, because something else does exactly the same thing in a different way, you are actively driving people away from the better thing.


It's ok to like OpenMP.

What I disagree with is that it should be suggested to beginners as the way to parallelize their C++ programs.

That's like telling a Javascript programmer that they should parallelize their programs by using Python or C.

Show them how to do it in Javascript, or in this case, in C++, so that they don't have to learn a whole new programming model or language to just write parallel code.

Particularly when C++ has supported this for so long now.


OpenMP is a set of #pragma that just sit in your C++ code directly.

> What I disagree with is that it should be suggested to beginners as the way to parallelize their C++ programs.

I guess we can agree to disagree then. If beginners think your way is easier, they're welcome to try. But there's plenty of production code examples that show the simplicity of OpenMP.


        #pragma omp target parallel for
The "target" now makes the for-loop discussed a GPU or FPGA algorithm. Now what's strange about this is... you seemed to have known this already? So I've had difficulty making an actual response to you.

OpenMP is just one tool in my toolbox. To be honest, I've found it to be not flexible enough for most of my usage, but its sheer simplicity makes it, again, one of the easiest C++ / C tools I've ever used. Yes, even for playing or dabbling in GPGPU programming.

Furthermore, OpenMP is usable on GCC and Clang. It's even available (OpenMP 2.0 at least) on MSVC++ (and though OMP 2.0 leaves much to be desired, it's still enough for some degree of parallel programming on Windows). So OpenMP code in, say, Blender (the 3D raytracing program) runs on pretty much all important C++ platforms.

GPU and FPGA programming is complicated to actually perform well, because GPUs and FPGAs sit behind a huge PCIe 3.0 bottleneck. A lot of code in CPU-land can stay in L1, L2, or L3 cache and outperform the PCIe transfer alone. In contrast, CPU-to-CPU transfers are very quick (they happen at L3-to-L3 or L2-to-L2 transfer speeds), so your "cost of communication" is very low.

I don't want to discourage any beginner from playing with GPU code (especially if they're "just messing around"). GPU code is easier to write than most expect.

But it's surprisingly difficult to actually beat CPU code with GPU-offload code.

If it's not something that works out for you, that's fine, I guess? There are a lot of different tools for a lot of different situations.

--------

A fun OMP thing btw, is...

        #pragma omp parallel for simd
Which (tries to) compile your loop into SIMD code, like AVX-512 or NEON for ARM.

OMP isn't as flexible as writing your own threads by hand, but it's easy to experiment with many forms of parallelism with the same code. There are also nifty clauses, like "firstprivate", "collapse", or "reduction"... as well as different schedulers (static, dynamic, guided).

Honestly, it's really good for prototyping. You write one for-loop, but have all these knobs and dials to try out a bunch of different strategies. But for "final code", hand-crafted threaded code really can't be beaten.
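
For example, several of those clauses and schedulers compose on a single loop (a sketch; assumes "data" is an array of n doubles):

        // Threads + SIMD, dynamic scheduling in chunks of 1024 iterations,
        // and a reduction to combine the per-thread partial sums:
        double sum = 0.0;
        #pragma omp parallel for simd reduction(+:sum) schedule(dynamic, 1024)
        for (int i = 0; i < n; i++)
            sum += data[i];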

--------

BTW: I don't think that C++ platforms like NVidia nvcc or AMD's ROCm support for_each_n. And even if they did, that's not how you really write GPU-parallelism programs.


> The "target" now makes the for-loop discussed a GPU or FPGA algorithm. Now what's strange about this is... you seemed to have known this already? So I've had difficulty making an actual response to you.

The target alone does not suffice; you need to make sure the memory is manually moved to the GPU or the FPGA, so you need to handle that as well.
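
Something along these lines (a sketch, assuming a compiler with target offload enabled, and that A, B, C are plain arrays of length n; the map clauses do the host-to-device and device-to-host copies):

        #pragma omp target teams distribute parallel for \
            map(to: A[0:n], B[0:n]) map(from: C[0:n])
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i];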

> BTW: I don't think that C++ platforms like NVidia nvcc or AMD's ROCm support for_each_n. And even if they did, that's not how you really write GPU-parallelism programs.

They do, and performance is pretty much the same as native GPU code in my experience, and according to all peer-reviewed publications about it.


> The target alone does not suffice; you need to make sure the memory is manually moved to the GPU or the FPGA, so you need to handle that as well.

Yes. That's the bottleneck and difficulty of GPU / FPGA programming. Knowing where your memory is. PCIe is very high latency, especially compared to L1, L2, L3, or DDR4 RAM.

Look, even in NVidia's "Thrust" library, you want to think very carefully about CPU vs GPU RAM. If your operations are primarily on the CPU side, you want a CPU malloc. If your operations are primarily on the GPU side, you want a GPU malloc.

Modern PCIe can "handle the details" for you, but if your GPU memory accesses all go through the PCIe bus to read CPU-DDR4 RAM to do anything, it will be simply slower than using the CPU itself.

This isn't a beginner subject anymore, not in the slightest.

> They do, and performance is pretty much the same as native GPU code in my experience, and according to all peer-reviewed publications about it.

I severely doubt that std::for_each_n exists on GPU code.

I see that for_each_n exists in NVidia's "Thrust" library, which is also a beginner-level API / library to use in the CUDA system (but not as efficient as dedicated GPU code). And I can imagine that NVidia Thrust might be compatible with more recent C++ standards.

But I cannot imagine the underlying API knowing whether to do a CPU malloc or a GPU malloc efficiently. And I'm not seeing any std:: API that handles this detail. (NVidia Thrust has the programmer explicitly choose whether they're using a device_vector vs host_vector.)
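
For reference, that explicit choice in Thrust looks like this (a minimal sketch; the host_vector/device_vector assignments are the PCIe transfers):

        #include <thrust/host_vector.h>
        #include <thrust/device_vector.h>
        #include <thrust/sort.h>
        int main() {
            thrust::host_vector<float> h(1'000'000, 1.0f); // CPU-side allocation
            thrust::device_vector<float> d = h;            // explicit copy to GPU RAM
            thrust::sort(d.begin(), d.end());              // runs on the GPU
            h = d;                                         // explicit copy back over PCIe
        }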

-------

The __ONLY__ API that ever tried to "automagically" figure out the CPU-malloc vs GPU-malloc issue was C++ AMP by Microsoft. It was interesting, but performance issues and DirectX 11 compatibility prevented progress (when DX12 GPUs came out, the C++ AMP project didn't keep up).

I liked their array_view abstraction and its "automagic" at trying to figure out this memory-management issue. But... I really haven't seen anything like that since C++AMP.


> I severely doubt that std::for_each_n exists on GPU code.

https://docs.nvidia.com/hpc-sdk/compilers/c++-parallel-algor...

This is 4 years old. Been using it in production for the last 2 years. Works fine.
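
For anyone who wants to try it, the whole thing is just a compiler flag (a minimal sketch; file name is arbitrary):

    // Build: nvc++ -stdpar=gpu saxpy.cpp        (offload to GPU)
    //    or: nvc++ -stdpar=multicore saxpy.cpp  (CPU threads)
    #include <algorithm>
    #include <execution>
    #include <vector>
    int main() {
        std::vector<float> x(1'000'000, 1.0f), y(1'000'000, 2.0f);
        // Standard parallel algorithm; the compiler decides where it runs.
        std::transform(std::execution::par, x.begin(), x.end(), y.begin(),
                       y.begin(), [](float xi, float yi) { return 2.0f * xi + yi; });
    }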

Pretty much everyone I've talked to from other research groups using this in production was able to remove all their CUDA code and replace it with this, without any performance hit.

There are publications about this, but most of them are a few years old by now because this is not new anymore: https://arxiv.org/abs/2010.11751


OpenMP code can be compiled as single-threaded on compilers that don't support it, without code changes. It's not a language but more like a set of annotations to be added.

I was not aware of C++ having something similar. Is that a new feature?

Edit: YES it's C++17 and later: https://en.cppreference.com/w/cpp/algorithm/execution_policy...


> If you are using C++, and want to parallelize something, just add "std::execution::par" to your algorithms.

Do any of the shipping standard libraries actually implement execution policies? I only use gcc and clang so have to resort to TBB to get this capability.


>> Do any of the shipping standard libraries actually implement execution policies? I only use gcc and clang so have to resort to TBB to get this capability.

Looks like it's a C++17 and C++20 feature:

https://en.cppreference.com/w/cpp/algorithm/execution_policy...
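
With libstdc++ (GCC, and Clang when using libstdc++), the execution policies are implemented on top of TBB, so you need to link against it (a minimal sketch, assuming TBB is installed):

    // Build: g++ -std=c++17 par_sort.cpp -ltbb
    #include <algorithm>
    #include <execution>
    #include <vector>
    int main() {
        std::vector<int> v(1'000'000);
        // Fill in reverse so there is something to sort.
        for (int i = 0; i < (int)v.size(); i++) v[i] = (int)v.size() - i;
        std::sort(std::execution::par, v.begin(), v.end());
    }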


I agree that in general most people want to integrate new languages with their existing code bases (rust-bindgen uses libclang to automatically generate bindings, ABI tests, etc. for calling C and C++ from Rust, and for each language there are tools to generate bindings and ABI tests for Rust code, e.g., the cpp crate generates C++ wrappers around Rust libraries).

The consultancy company developing c2rust specifically helps clients translate their apps to Rust. IIUC these clients want to move from C to a memory- and thread-safe language without losing performance.

c2rust is the first step in that process. It mechanically translates C into "C-looking unsafe Rust".

The engineers then go and start migrating from unsafe Rust to safe Rust incrementally.

This is a long process; c2rust speeds up a small fraction of it, but most of the engineers' time is spent translating unsafe Rust into safe Rust, and then refactoring safe Rust into idiomatic Rust.


You can’t drive on a German motorway at 110kph. You’d be in the truck lane at 80kph for your whole trip, having to overtake every single truck, of which there are many. So 130kph+ is the minimum unless you are towing.


I drove a similar route to the GP’s, an 1100km round trip for a ski trip to Austria, aiming at 190kph.

I didn’t have to refuel for the round trip, but I decided to refuel on the way back because fuel is much cheaper in Austria.

My car isn’t particularly new, but the lowest I’ve seen it use is 4.5l/100km, and at 180-190 it uses 6.5l/100km.

In Germany most people don’t drive huge SUVs. Mine is a relatively sleek station wagon.

Reducing the aerodynamic drag of cars significantly improves fuel economy.


So consumption increases about x1.5 from the "best case" to "autobahn speeds"? Sounds about right -- a 500km range would get reduced to ~340km. It's better than the 500km -> 300~250km claimed by GP, but it's still a big reduction in range.


> So consumption increases about x1.5 from the "best case" to "autobahn speeds"?

Yes, pretty much. I wouldn't say 1.5x is a constant factor.

If I drive at 200+kph (e.g. 240kph) then consumption explodes.

This stuff must be online somewhere, but for me at least it seems that consumption increases roughly quadratically (~v^2) rather than exponentially, which makes sense since air resistance increases with v^2.
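
Back of the envelope (assuming aerodynamic drag dominates at these speeds):

    F_drag = 1/2 * rho * C_d * A * v^2    (drag force)
    fuel per km ~ F_drag ~ v^2            (energy per distance)

Rolling resistance and drivetrain losses don't scale with v^2, which is why the measured increase (4.5 -> 6.5 l/100km, ~1.4x) is milder than a pure v^2 prediction.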


In Germany, the autobahn is the motorway.

I typically drive at 180-190kph, that’s my “relaxed driving” speed. 160kph is “falling asleep” kind of speed unless you have a super loud car.


About 50 minutes per session.


I did a similar trip earlier in January. 1100 km from Germany to the alps (550km per way).

I had to stop once for the full 1100km trip to refuel (before leaving Austria on the way back because fuel is cheap there, not because we actually needed to refuel).

It was ~4:30 per way. Nobody in the car had to pee, etc.

I think a 15 min break per way would have been ok. But 1h of breaks per way or more is really not great. Particularly with kids in the car (we leave really early and put them to sleep; they woke up for the last ~2:30h or so and that was borderline).


This is not rare. Just open Google Street View in Germany. Most buildings are blurred like this.

