Blaze: A high-performance C++ math library (bitbucket.org/blaze-lib)
91 points by optimalsolver on Jan 16, 2023 | 47 comments



Since the library claims high performance, I went to look at the benchmarks section:

https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks

Turns out all the benchmarks were done circa 2016/17 and use pretty old versions of libraries and compilers. That's a long time. Any excitement I had was definitely tempered; it would be good to have a recent update.


I'd also be curious to see how Eigen, configured to dispatch to MKL where appropriate, would do in these comparisons. Eigen seems to be a pretty popular library, and of course MKL is vendor-supplied so it should be around forever, so I'd be more confident in the ability of that pairing to stay up to date…
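
For reference, the wiring is just a compile-time switch (a minimal sketch, assuming MKL is installed and the usual MKL link flags are used; the matrix size is illustrative):

    // Define before including any Eigen header to route the heavy BLAS/LAPACK
    // work (GEMM, decompositions, ...) to MKL while keeping the Eigen API.
    #define EIGEN_USE_MKL_ALL
    #include <Eigen/Dense>

    int main() {
        Eigen::MatrixXd a = Eigen::MatrixXd::Random(2000, 2000);
        Eigen::MatrixXd b = Eigen::MatrixXd::Random(2000, 2000);
        Eigen::MatrixXd c = a * b;   // this product now goes through MKL's dgemm
        return c.size() > 0 ? 0 : 1;
    }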


Eigen's latest performance numbers are also much better than the Blaze page paints them.

Eigen has had commit-based performance monitoring for some time, arewefastyet style: https://eigen.tuxfamily.org/index.php?title=Performance_moni...

I've used Eigen to play with 3K by 3K dense matrices and solve them in some cases, and it doesn't even blink an eye, neither in the time nor the space department.
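
For scale, the whole thing is only a few lines (a minimal sketch; the random 3000x3000 system is just for illustration):

    #include <Eigen/Dense>
    #include <iostream>

    int main() {
        const int n = 3000;
        Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
        Eigen::VectorXd b = Eigen::VectorXd::Random(n);
        // Blocked, vectorized dense LU solve
        Eigen::VectorXd x = A.partialPivLu().solve(b);
        std::cout << "residual: " << (A * x - b).norm() << "\n";
        return 0;
    }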


This surprised me enough to plink around in Octave a bit, which led me to be curious about the difference between the built-in BLAS and MKL (3K isn't huge even for dense, but I didn't expect to see "blink of an eye" associated with it). 4 seconds down to 0.13 on that sort of matrix, with like zero effort on my part. Bit of a meandering path on my part, but thanks for kicking it off.


I bet it didn't get any slower.

I gather Kompute.cc, which runs on GPUs via Vulkan, is the modern choice. Running floating point array kernels on CPU cores seems pointless these days.

(That said, I run QubesOS, and literally everything I run does all its work on CPU cores--including all Vulkan code--because nothing can see the GPU.)


> I gather Kompute.cc, which runs on GPUs via Vulkan, is the modern choice

Or you can just use Eigen via CUDA:

https://eigen.tuxfamily.org/dox/TopicCUDA.html

> Running floating point array kernels on CPU cores seems pointless these days.

CPU-level vectorization can do wonders in terms of speed with the right libraries (like Eigen, MKL, etc.). I can solve the same problem in seconds with Eigen where its MATLAB implementation takes hours.
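
As a toy illustration (a sketch, not my actual workload): a coefficient-wise Eigen expression like the one below compiles down to a single fused, SIMD-vectorized loop when built with something like -O3 -march=native:

    #include <Eigen/Dense>

    // The whole right-hand side is one expression template: a single pass over
    // the data, SIMD across the expression, no intermediate arrays.
    Eigen::ArrayXd transform(const Eigen::ArrayXd& v, const Eigen::ArrayXd& w) {
        return (v * w + 0.5 * v.square()).exp();
    }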


> Or you can just use Eigen via CUDA:

The link you've posted doesn't mean you can do large dense matmul in CUDA right away; it just means you can use fixed-size vector/matrix operations inside kernels (which might be useful if you're writing graphics code and need some vec3/mat3/quats). You still have to write your own customized kernel for a large dynamic-sized GEMM computation (with tiling and shared memory and all that jazz), and at that point it's best to just use cuBLAS.
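
For the large dynamic-sized case, the host-side cuBLAS route looks roughly like this (a sketch only: error checking is omitted, and the square column-major matrices are an assumption for the example):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    // C = A * B for N x N column-major double matrices; C must be sized N*N.
    void big_gemm(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int N) {
        const size_t bytes = sizeof(double) * N * N;
        double *dA, *dB, *dC;
        cudaMalloc((void**)&dA, bytes);
        cudaMalloc((void**)&dB, bytes);
        cudaMalloc((void**)&dC, bytes);
        cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;
        // The tiling/shared-memory work is cuBLAS's problem, not yours.
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);

        cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }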


As you have read, the document states:

> By default, when Eigen's headers are included within a .cu file compiled by nvcc most Eigen's functions and methods are prefixed by the __device__ __host__ keywords making them callable from both host and device code.

Eigen casually overloads "*" to do any kind of multiplication, so I guess it'll also carry the GEMM functions along to the kernel during compilation, but this needs testing.

However, considering the speed I got from running Eigen on the CPU, I still need to find larger problems to make that effort worthwhile.


That makes sense if you want to and can depend on CUDA.

I gather there are efforts to provide a conversion layer from CUDA to other targets, but they seem to involve adapting the CUDA code to be more portable.


I personally don't prefer CUDA (or any other form of lock-in), but they started CUDA integration ~5 years ago (or even before that), so they would need to start another migration effort to support other technologies.

While I use Eigen extensively on the CPU, I don't follow any of their mailing lists, so I don't know their internal state and mindset.


> "Running floating point array kernels on CPU cores seems pointless"

This is true only for 32-bit or lower precision floating-point numbers.

For 64-bit or higher precision floating-point numbers, which are required for the bulk of the scientific computation applications, CPUs are much better than the modern GPUs.

In the latest generation of both NVIDIA and AMD "consumer" GPUs, the throughput ratio of FP64 vs. FP32 has been reduced to 1:64. This ensures that these GPUs have both a performance per dollar and a performance per watt that are worse than those of CPUs.

The "datacenter" GPUs have excellent performance per watt, but their immense prices ensure that these GPUs either have a performance per dollar that is worse than that of CPUs (except for those who buy hundreds or thousands, to be used 24/7, so the savings in power consumption can balance the high acquisition cost) or even when the performance is so high that the ratio performance/price is similar to CPUs, the magnitude of the price is so great that it is far outside the range that can be afforded by a small business or by an individual.

More than a decade ago, NVIDIA claimed continuously that GPUs would completely replace traditional CPUs for all high-volume computations. Nevertheless, they themselves are the ones who have prevented this from happening, by segmenting the GPU market and raising the price of 64-bit-capable GPUs by more than an order of magnitude.


I think you'd need to qualify what "better" means here. GPUs will give much higher performance for FP64 than CPUs will. The price of CPUs has continued to increase significantly in the last 5 years, and now high-end Xeons are more expensive than GPUs. Consumer GPUs aren't what you'd typically compare against, since people doing large scientific simulations wouldn't use a desktop unless it's a very small model.


I am corrected.


It makes plenty of sense to run array kernels on the CPU if the kernel finishes faster than the CPU <-> GPU data shuffling latency or bandwidth hurdles would allow.


It makes plenty of sense to do isolated computations on the CPU. The crossover point where you are better off shipping the input to a GPU and the results back is pretty small, presuming you have one and the program knows about it and can get to it.

But, as always, fast enough is fast enough.


It makes sense if you don't have a GPU, which is quite often the case in the embedded world.


Or if you have a latency-sensitive computation, and the gpu round trip time is just too long.


If you work hard enough, you can use multiple DMA channels and stream the data in and out of the GPU simultaneously; however, it needs a datacenter-class card (Tesla, Radeon Instinct, etc.). Also, in NVIDIA's case, you can only do this with CUDA, because NVIDIA doesn't want you to get full performance out of these cards without it.


Explain? I didn't think there was any way around paying the pcie cost. If your data is hosted in memory/cache, and the computation takes less than a few mics, I didn't think there was any way to make GPUs competitive.


It depends on what you’re doing. I’m strongly against moving to GPU just for the sake of it.

However, in some cases, where the GPU brings substantial performance benefits to a long-running (hours, days, etc.) application, you can set up pinned memory regions that are mapped to the GPU. Then you write to the input region, which feeds the GPU. If the GPU has multiple DMA engines, you can stream the data in, through the GPU, and out, and read the results from your second pinned memory region.
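
In CUDA terms the plumbing looks roughly like this (a sketch only: my_kernel, the launch configuration, and the chunk handling are placeholders, and error checking is omitted):

    #include <cuda_runtime.h>

    // Double-buffered streaming: while one chunk is being computed, the next
    // chunk is DMA'd in and the previous result is DMA'd out. Requires pinned
    // host memory; overlapping both directions needs separate copy engines.
    void stream_chunks(size_t chunk, int n_chunks) {
        float *h_in, *h_out, *d_in, *d_out;
        cudaMallocHost((void**)&h_in,  2 * chunk * sizeof(float));  // pinned input staging
        cudaMallocHost((void**)&h_out, 2 * chunk * sizeof(float));  // pinned output staging
        cudaMalloc((void**)&d_in,  2 * chunk * sizeof(float));
        cudaMalloc((void**)&d_out, 2 * chunk * sizeof(float));

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int i = 0; i < n_chunks; ++i) {
            const int b = i % 2;   // which buffer/stream this chunk uses
            // ...fill h_in + b*chunk with the next chunk of input here...
            cudaMemcpyAsync(d_in + b * chunk, h_in + b * chunk,
                            chunk * sizeof(float), cudaMemcpyHostToDevice, s[b]);
            // my_kernel<<<blocks, threads, 0, s[b]>>>(d_in + b*chunk, d_out + b*chunk);
            cudaMemcpyAsync(h_out + b * chunk, d_out + b * chunk,
                            chunk * sizeof(float), cudaMemcpyDeviceToHost, s[b]);
            // Wait on the *other* stream so its buffer can be reused next time,
            // while this chunk's copies and kernel stay in flight.
            cudaStreamSynchronize(s[1 - b]);
            // ...consume h_out + (1-b)*chunk here (for i > 0)...
        }
        cudaStreamSynchronize(s[0]);
        cudaStreamSynchronize(s[1]);
        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
        cudaFreeHost(h_in);
        cudaFreeHost(h_out);
        cudaFree(d_in);
        cudaFree(d_out);
    }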

If the speed bump is big enough, and you can eat the first startup cost, PCIe cost can become negligible.

However, it’s always horses for courses, and it may not fit your case at all.

In my case, Eigen is already extremely fast for what I do on the CPU, and my task doesn't run long, so adding a GPU doubles my wall-clock time.


PCIe cost is typically in the noise for realistic applications. There are ways to reduce latency and signal flags across PCIe, but you want to do most of the workload on the GPU to keep the latency down.


I'm guessing the commenter raising latency sensitivity is talking about the applications where that does matter. There are plenty of applications in, say, finance where they need to run a 20-parameter dot product and get the result back in nanoseconds, and I don't think GPUs can ever handle workloads like that.


That's true, but for those types of applications you're looking at a custom ASIC in most cases. High ns or very low us are FPGA territory, and low us and above are suitable for a GPU.


This video asserts that FPGAs do the 10 ns work, and C++ does the >100 ns work. I've never heard of ASICs being used, since the deployment cycle is too frequent to justify the expense of fabbing.

https://m.youtube.com/watch?v=8uAW5FQtcvE


I think Eigen is more popular these days. https://eigen.tuxfamily.org/index.php?title=Main_Page


I replaced lapackc with Eigen and it was largely performance-neutral, but with a nicer interface and header-only. Those benchmark charts seem to mostly imply Blaze is better at fine-grained parallelism, but I largely use thread pools with coarse-grained parallelism for stuff like this. Still quite interesting; not sure why I haven't heard of it before. I'd like to see nalgebra in the benchmarks...


Agreed. Some of the upsides of Eigen compared to LAPACK relate to the use of expression templates to eliminate temporaries, and the ability to specialize code for small fixed-size arrays and matrices. If you have large matrices and don't make use of the expression templates, I would expect performance to be about the same.
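
Roughly, for both points (a minimal sketch; the function names are made up for illustration):

    #include <Eigen/Dense>

    // The sum on the right is an expression template: evaluated in one loop,
    // with no intermediate matrices allocated.
    Eigen::MatrixXd sum3(const Eigen::MatrixXd& a, const Eigen::MatrixXd& b,
                         const Eigen::MatrixXd& c) {
        return a + b + c;
    }

    // noalias() promises the result doesn't overlap the operands, so Eigen
    // writes the product straight into 'out' instead of via a temporary.
    void multiply_into(const Eigen::MatrixXd& a, const Eigen::MatrixXd& b,
                       Eigen::MatrixXd& out) {
        out.noalias() = a * b;
    }

    // Small fixed-size types get unrolled, stack-allocated code paths.
    Eigen::Vector3d rotate(const Eigen::Matrix3d& r, const Eigen::Vector3d& v) {
        return r * v;
    }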


I was doing some medium-sized level-3 stuff with a small temporary and a small decomposition, but over and over again, pretty fast.


Last I tried Eigen, adding those headers to the project added a full 3 seconds to my compile (it was under a second before), seriously harming the ability to quickly play with the code the way you would in NumPy.


Related:

Blaze: High Performance Vector/Matrix Arithmetic Library For C++ - https://news.ycombinator.com/item?id=28493373 - Sept 2021 (28 comments)

Benchmarks for Blaze, A high-performance C++ math library - https://news.ycombinator.com/item?id=10117971 - Aug 2015 (30 comments)


It's getting rather complicated to select a framework for high-performance computation. But it is also more important than ever before, because the automatic speed upgrades of CPUs are long gone.

Ideally one would like a library that checks at least all of the following (in no particular order):

* open source with a good community around it

* written in a popular, modern, easy-to-use programming language

* offers complete functionality: linear algebra (including ndarrays), special functions etc. Numerical Recipes is a reasonable checklist of what is needed

* has an intuitive, "math-like" API that expresses mathematical expressions well

* makes optimal use of all available hardware (multi-core, GPU, cluster) without the need for (too much) specialized programming or tuning

* can integrate easily with other programming languages / frameworks as part of bigger projects

These days there are quite a few candidates, from numpy/python, to armadillo, eigen, blaze, and xla in C++, to various Rust and Julia libraries. What we don't have is an easy way to select :-). On the plus side, if the project is still in the early phases / not too big, it is relatively easy to try out and migrate. But a detailed and updated comparison table (e.g. in Wikipedia) would indeed be quite handy.


> These days there are quite a few candidates, from numpy/python, to armadillo, eigen, blaze, and xla in C++, to various Rust and Julia libraries. What we don't have is an easy way to select :-)

They all use, or can use, the same core libraries (BLAS, LAPACK) to do the actual computationally intensive parts, so they mostly share the code where optimisations would make the most impact. So, at the end of the day, it's mostly about the API you like, because performance-wise it's likely to be a toss-up.


In principle there is this low-level convergence, but differences quickly pile up at the user level. E.g. Armadillo does not support ndarrays except in an indirect way via "field" objects. 2D and 3D fixed-size arrays are in general better supported by all libraries, as they have wider applications in viz and games, but arbitrary-size / arbitrary-dimension tensors are hit and miss (e.g. in Eigen they are an "unsupported" extension).

It gets even more complicated if you expand into efficient random number generation, graph processing, etc. Then it becomes important that the library is extensible with some plugin mechanism, as, ideally, you don't want your project to be a potpourri of different libraries / APIs / dependencies.

All in all, there are some amazing projects out there, but sometimes more options means it is harder to decide without spending serious time. I think this is one advantage of the Python SciPy stack: at the expense of a (possibly small) performance penalty you have only one choice, but it is fairly complete and consistent.


> easy to use programming language

> C++


If blaze is of interest to you and you use GPUs, check out our library called MatX: https://github.com/NVIDIA/MatX

I haven't used Blaze, but the README sounds similar. We generate kernels using expression templates and have a syntax similar to MATLAB or Python.


My collaborators and I settled on Blaze for the linear algebra backend of a small physics library. We ultimately regretted it, as there was no GPU support and no roadmap toward it.


For GPU support take a look at our library:

https://github.com/NVIDIA/MatX

If anything is missing we're happy to take feature requests.


AMD support? No? :)


No, but also consider that it's a pretty new project and is heavily based on NVIDIA math libraries. Some of these libraries don't have exact counterparts on the AMD side.


I would like to learn about experience with Kompute.cc, particularly as backed by AMD GPUs.


This made me wonder if RLibm has been extended to doubles yet [0]. Sadly that doesn't seem to be the case. It's a floating-point math library that always produces correctly rounded results, but it is limited to 32-bit floats (and posits, for that matter). Interestingly, despite being the only library that always rounds correctly, it also claims to be faster than other libraries (at the same bit depth).

I can recommend the papers; they're quite accessible, and there is something very satisfying about the math tricks they came up with to build the polynomials.

They do seem to have added more rounding modes though, which presumably makes the library more generally useful.[1]

[0] https://people.cs.rutgers.edu/~sn349/rlibm/

[1] https://blog.sigplan.org/2022/04/28/one-polynomial-approxima...


Here is a benchmark from 2020 comparing Eigen, Blaze, Fastor, Armadillo, and XTensor (run-time and compile time performance):

https://romanpoya.medium.com/a-look-at-the-performance-of-ex...


A question, since I never tried it myself: how do this and similar libraries compare in terms of performance to just using the likes of TensorFlow as a computation library and compiling it ahead of time [1]?

[1] https://www.tensorflow.org/xla/tfcompile


I've never used Blaze, but the main developer, Klaus Iglberger, is a terrific C++ conference speaker/teacher. His talks are really well structured and pedagogical, whether the topic is basic or advanced. I recommend checking out his talks and watching them if they are relevant to you!


It is also worth mentioning Armadillo here: C++ library for linear algebra & scientific computing, https://arma.sourceforge.net/docs.html


Armadillo is particularly notable in that it has a Python interface (based on pybind11), although sparse matrix support still needs to be added. Also, the PyArmadillo API resembles that of MATLAB, which can be convenient when translating code.

https://pyarma.sourceforge.io/


I love that it starts out by telling you what it is before it gets into what is new in the latest release. If only more announcements were like that...

(Hint, hint.)



