If you feel like you've finally grokked GPU/massively parallel software programming and need more challenges, I highly recommend playing around with digital circuits! The level of parallelism available to you in hardware is truly unmatched and it's incredibly fun, especially once you start really pushing implementations of your designs on FPGAs. Granted, an FPGA design is frequently less useful than what you could do on a GPU, because ASICs like GPUs run at much higher clock speeds (if your GPU core clock is 3GHz and your FPGA design maxes out at 500MHz [which would be admirable!], the GPU gets 6x as many cycles to match or beat your implementation!).
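To put a number on that clock gap, here is a quick back-of-the-envelope sketch. The function name is made up for illustration, and it assumes the simplest possible model: throughput is just clock rate times work done per cycle.

```python
def cycles_ratio(fast_hz, slow_hz):
    """How many cycles the faster clock gets for each cycle of the slower one."""
    return fast_hz / slow_hz

gpu_hz = 3e9      # 3 GHz GPU core clock (figure from the comment above)
fpga_hz = 500e6   # 500 MHz FPGA design (figure from the comment above)

ratio = cycles_ratio(gpu_hz, fpga_hz)
print(ratio)  # 6.0

# Under this simplified model, the FPGA design has to do about `ratio`
# times as much useful work per cycle just to break even on throughput.
```

In other words, the FPGA only wins when the design extracts enough extra per-cycle parallelism to cover that factor.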
I started by reading “Digital Design and Computer Architecture”. There’s a new RISC-V edition: https://a.co/d/imzGBK5. The book starts from Boolean logic and transistor technology and goes all the way to assembly programming, with everything in between. Most importantly, it gives a great introduction to HDLs. Next, I played with a bunch of hardware projects specifically targeting the inexpensive Arty A7 board to get comfortable with FPGA tooling.
FPGAs are hardware, so you generally program them with hardware description languages like Verilog.
If you don’t want to buy hardware (reasonable, IMO, since HDL is kind of niche and the boards can be pricey), you could try out the language with something like Verilator. This will let you write Verilog, and then compile it to generate C++ classes which simulate your design.
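If "simulate your design" sounds abstract, here is a toy sketch of what a cycle-based simulation boils down to: set the inputs, advance the clock, read the outputs. This is plain Python and deliberately not Verilator's generated C++ API; the `Counter4` class and `run` helper are hypothetical stand-ins for a 4-bit synchronous counter and its testbench.

```python
class Counter4:
    """Toy model of a 4-bit synchronous counter with reset and enable."""

    def __init__(self):
        self.rst = 0    # synchronous reset input
        self.en = 0     # count-enable input
        self.count = 0  # registered output

    def posedge(self):
        # Work out the next state from the current inputs, then commit it,
        # roughly like a nonblocking assignment in a clocked always block.
        if self.rst:
            nxt = 0
        elif self.en:
            nxt = (self.count + 1) & 0xF  # wrap at 4 bits
        else:
            nxt = self.count
        self.count = nxt


def run(cycles):
    """Hypothetical testbench: reset for one cycle, then count for `cycles`."""
    dut = Counter4()
    dut.rst = 1
    dut.posedge()
    dut.rst = 0
    dut.en = 1
    trace = []
    for _ in range(cycles):
        dut.posedge()
        trace.append(dut.count)
    return trace


print(run(3))  # [1, 2, 3]
```

A real Verilator testbench has the same shape (drive inputs, step the clock, check outputs), just written in C++ against the class generated from your Verilog.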
The OSS CAD Suite [0] is a good open-source toolchain for this stuff. You can then write hardware designs in the SystemVerilog language (VSCode has some plugins, I believe, but I've just been using a basic text editor) and use the build toolchain to compile ("synthesize") your designs and program e.g. an FPGA with them.
(FWIW, I've only just taken a class on Verilog this past Spring, but we used oss-cad-suite and I found it pretty straightforward to use. The bundled version of Verilator had some issues on my Mac though, so I had to compile my own copy of Verilator.)
I know it depends on the analysis, but I'm often doing somewhat embarrassingly parallel things. So just knowing GNU parallel for mid-scale things (plus basic parallelism in R/Python, although shared memory is a bear), and how to temporarily scale out across the cloud to something like 500 cores, is huge.
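For the Python side of that, a minimal sketch of the embarrassingly parallel pattern using the standard library's `concurrent.futures`. The `count_primes` workload and `run_all` helper are made up for illustration; the point is that each task is independent, so nothing has to be shared between processes.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """Count primes below `limit` by trial division (deliberately CPU-heavy)."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_all(limits, workers=4):
    # Each input is handled in its own worker process; results come back
    # in input order, and no memory is shared between tasks.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    print(run_all([10_000, 20_000, 30_000, 40_000]))
```

The same shape is what GNU parallel gives you at the shell level: one independent job per input, fanned out over however many cores you have.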