A blast from the past, I used to work in particle physics and used ROOT a lot. I had a love/hate relationship with it. On the one hand, it had a lot of technical debt and idiosyncrasies. But on the other hand, there are a bunch of things that are easier in ROOT than in more "modern" options like matplotlib. For example, anything that has to do with histograms. Or highly structured data (where your 'columns' contain objects with fields). Or just plotting functions (without having to allocate arrays for the x and y values). I also like the very straightforward object-oriented API. It feels like old-school C++ or Java, as opposed to pandas/matplotlib which has a lot of method chaining, abuse of [] syntax and other magic. It is not elegant, and quite verbose, but that is probably a good thing when doing a scientific analysis.
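For flavor, here is a minimal PyROOT sketch of what I mean (the histogram and function names are made up): you book a histogram as an object, fill it, and draw a function straight from its formula, without ever allocating x/y arrays.

    import ROOT  # PyROOT ships with ROOT itself

    # Book a histogram; the binning lives on the object.
    h = ROOT.TH1F("h_pt", "Muon pT;pT [GeV];events", 100, 0.0, 100.0)
    for _ in range(10_000):
        h.Fill(ROOT.gRandom.Exp(20.0))

    # A function is drawn from its formula; no x/y arrays allocated.
    f = ROOT.TF1("f_decay", "500*exp(-x/20)", 0.0, 100.0)

    c = ROOT.TCanvas("c")
    h.Draw()
    f.Draw("SAME")
    c.SaveAs("pt.png")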
I left about 5 years ago, and ROOT was in a process of change. They already ripped out the old CINT interpreter and moved to a clang-based codebase, and now you can run your analyses in Jupyter as far as I know (in C++ or Python). I heard the code quality has improved a lot, too.
The best thing about root was how it handled data loading. TTree's, with their column based slicing on disk, are such a good idea. Ever since I graduated and moved into industry, I've been looking for something that works the same way.
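To make the column slicing concrete, here is a minimal PyROOT sketch with hypothetical file/tree/branch names: switch every branch off, switch one back on, and only that column is read from disk.

    import ROOT

    f = ROOT.TFile.Open("events.root")  # hypothetical file
    tree = f.Get("Events")              # hypothetical tree name

    tree.SetBranchStatus("*", 0)        # read no branches...
    tree.SetBranchStatus("nMuon", 1)    # ...except this one column

    total = 0
    for event in tree:                  # only nMuon is deserialized per entry
        total += event.nMuon
    print(total)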
I was hosting one of the leads of ROOT at Google and we got to talking about ROOT. I mentioned sstables and columnio and he said "oh, yeah, we've been doing that for years".
Because matplotlib is not so histogram focused (I guess because the kids these days have plenty of RAM), people always show these abominable scatter plots that have so many points on top of each other that they're useless. Yuck.
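The histogram fix is cheap even in matplotlib; a minimal sketch with synthetic data:

    import numpy as np
    import matplotlib.pyplot as plt

    # A million correlated points: a scatter plot would be one opaque blob.
    x = np.random.randn(1_000_000)
    y = x + np.random.randn(1_000_000)

    # Bin the density instead of drawing every point on top of each other.
    plt.hist2d(x, y, bins=200)
    plt.colorbar(label="counts per bin")
    plt.savefig("density.png")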
Haskell would be great for designing the interface of a library like this, but not for implementing it. It would definitely not look like "old-school C++ or Java" but, well, that's the whole point :P
I haven't used ROOT so I don't know how well it would work to write bindings for it in Haskell; it can be hard to provide a good interface to an implementation that was designed for a totally different style of use. Possible, just difficult.
I think having Haskell bindings to it would be quite valuable. For the implementation of the core structures, though, it's better to stick with C++ to max out performance and keep finer control over resource usage. Haskell isn't particularly good at that.
There's a number of reasons for this. The first is that the quant physics community has never really adopted functional programming. It's not particularly obvious to scientists, who typically want to express their computation the way they want to, something that C, C++, and Fortran are all long-established at doing. The second is that much of physics depends on old libraries written over the last 30-40 years, and it's easiest to use them from the language a library is written in, or one with a highly similar interface (for example, Python is similar enough to C++ that many foreign function interfaces are literally just direct wrappers). The third is that types (other than simple scalars, arrays, and trees/graphs) have never been a high priority in quant physics. The fourth is that undergrad education outside CS rarely teaches students Haskell, while most undergrads in a quant field graduate knowing some amount of Python.
It's much more likely the physics community would adopt Julia, or maybe Rust, and even that has been pretty slow.
(Nothing I said above should be construed as taking a position about the suitability, or lack thereof, of any specific language for scientific computing. I have opinions, but I am attempting to explain the reasons factually with a minimum of bias.)
Using something like Haskell for ROOT is ridiculous for a lot of obvious reasons. A simple and dismissive "no" invites the cautious reader to discover them on their own rather than wasting time in a protracted debate. Maybe it's better to reject the idea out of hand and spend our time elsewhere.
That’s just not how technical discussions work. Not everyone knows what you know, and the point of this community is to share knowledge, not gatekeep it behind some “discover it yourself” bullshit. The fastest thing to do is not to dismiss it with no explanation but rather to explain, for all the readers, why that is the case. Because if one person doesn’t know, I can guarantee there are plenty out there who are just as interested to know. And it’s a waste of everyone’s time to have each person independently come to the same conclusion when it’s apparently easily explainable.
You’re free to not do any of that, of course, but be prepared to defend the fact that you’d rather not engage in discussion and instead just shallowly dismiss something.
This is a great example of why the age of truly terrible software is going to be ushered in as LLMs get better.
When the cost of complexity of interacting with an API is paid by the LLM, optimizing this particular part of software design (also one of the hardest to get right) will be less fashionable.
There are not many reasons why new analyses should default to ROOT instead of more user-friendly and sane options like uproot [1]. Maybe some people have a legacy workflow, or their experiment carries many custom patches on top of ROOT (common practice) for other things, but for physics analysis you might just be torturing yourself.
Also I really like their 404 page [2]. And no it is not about room 404 :)
One common criticism of uproot is that it's not flexible when per-row computation gets complicated because for-loops in Python is too slow. For that one can either use Numba (when it works), or, here's the shameless plug, use Julia: https://github.com/JuliaHEP/UnROOT.jl
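For the Numba route, a minimal sketch (the array here is a synthetic stand-in for a branch you'd read with uproot):

    import numba
    import numpy as np

    @numba.njit
    def count_passing(pt, threshold):
        # A plain per-row for-loop; @njit compiles it to machine code,
        # so it runs orders of magnitude faster than interpreted Python.
        n = 0
        for i in range(pt.shape[0]):
            if pt[i] > threshold:
                n += 1
        return n

    pt = np.random.exponential(20.0, 10_000_000)  # stand-in for a real branch
    print(count_passing(pt, 25.0))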
That's true, and Julia might be a solution, but I don't see the adoption happening anytime soon.
But there are now different options for tackling this particular problem (per-row computation) in the HEP Python ecosystem. One approach is to leverage array programming with NumPy to vectorize operations as much as possible. By operating on entire arrays rather than looping over individual elements, significant speedups can often be achieved.
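A minimal sketch of that vectorized style, with a synthetic column standing in for real data:

    import numpy as np

    pt = np.random.exponential(20.0, 10_000_000)  # toy stand-in for a branch

    # One vectorized pass over the whole column instead of a Python loop.
    mask = pt > 25.0
    print(mask.sum(), pt[mask].mean())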
Another possibility is to use a library like Awkward Array, which is designed to work with nested, variable-sized data structures. Awkward Array integrates well with uproot and provides a powerful and flexible framework for performing fast computations on, e.g., jagged arrays.
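For instance, a minimal Awkward sketch on a toy jagged array (a variable number of muon pTs per event):

    import awkward as ak

    pt = ak.Array([[31.2, 12.5], [], [45.0, 22.1, 9.8]])

    good = pt[pt > 20.0]     # element-wise cut, preserving the event structure
    print(ak.num(good))      # muons passing the cut per event: [1, 0, 2]
    print(ak.flatten(good))  # all passing values as one flat array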
Uproot already returns Awkward Arrays, so both things you mentioned are different ways of saying the same thing. The irreducible complexity of data analysis is there no matter how you do it, and "one-vector-at-a-time" sometimes feels like shoehorning (other terms people come up with include "vector-style mental gymnastics").
For the record, vector-style programming is great when it works; I mean, Julia even has dedicated syntax for broadcasting. I'm saying that when the irreducible complexity arrives, you don't want to NOT be able to just write a for-loop.
A great alternative to Numba for accelerated Python is Taichi. It's trivial to convert a regular Python program into a Taichi kernel, and it can then target CUDA (and a variety of other backends). No need to worry about block/grid/thread allocation, etc. At the same time, it's super deep, with great support for data classes, custom memory layouts for complexly nested classes, autograd, and so on. I'm a huge fan: it makes writing code that runs on the GPU and integrates with your Python libraries an absolute breeze. Super powerful. By far the best tool in the accelerated-Python toolbox, IMO.
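A minimal sketch of what that conversion looks like (toy kernel; the field name and size are made up):

    import taichi as ti

    ti.init(arch=ti.gpu)  # falls back to CPU if no GPU is available

    n = 1_000_000
    x = ti.field(dtype=ti.f32, shape=n)

    @ti.kernel
    def fill_squares():
        # The outermost loop of a Taichi kernel is parallelized
        # automatically; no manual block/grid/thread bookkeeping.
        for i in x:
            x[i] = 0.5 * i * i

    fill_squares()
    print(x[10])  # fields can be read back from Python directly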
>they made a lame excuse that Pytorch didn't support 3.12
How is this a lame excuse?
>but it fails on a bunch of PyTorch-related tests. We then figured out that PyTorch does not have Python 3.12 support
They have a dep that was blocking them from upgrading. You would have them do what? Push PyTorch to upgrade?
>Later, even when Pytorch added support for 3.12, nothing changed (so far) in Taichi.
my friend, that "Later" is Feb/March of this year, i.e. 2-3 months ago. Exactly how fast would you like this open source project to service your needs? Not to mention there is a PR up for the bump.
Here it's more the other way around. CERN needs a data analysis framework, so CERN develops, maintains and publishes it for other users.
That being said, I don't know whether it's actually a good idea for someone external to actually use it. My experience may be a little outdated, but it's quite clunky and dated. The big advantage of using it for CERN or particle physics stuff is that it's basically a standard, so it's easy to collaborate internally.
Well, these are two very different examples. One, ROOT, is a powerful data analysis framework that, for all its power, never became general or easy enough to use to make it out of the HEP world.
The other one, gstreamer, is a beautifully designed platform with an architecture so nice it can be easily abstracted and reused in completely different scenarios, even ones that probably never occurred to the authors.
What is not cool is that ROOT was "designed" and built by people who had absolutely no idea how to run a large-scale software project. And it shows everywhere: it's one huge monolith that you have to constantly fight to do anything slightly non-trivial. I'm happy that I don't have to use it frequently, though I still have some exposure.
IMHO, ROOT 3-5 was too many things, with a lot of poorly designed APIs and, most importantly, a lack of separation between ROOT-the-library and ROOT-the-program (lots of globals and assumptions that ROOT-the-program is how people should use it).
ROOT 6 started to correct some of these things, but it takes time (and IMHO they are buying too much into LLVM and Clang, increasing the build times even more and worsening the hackability of ROOT as a project).
Also, for the longest time, the I/O format wasn't very well documented, with only one implementation.
Now, thanks to groot [1], uproot (which was developed building on the work from groot), and others (freehep, openscientist, ...), it's possible to read/write ROOT data without bringing in the whole TWorld.
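For example, a minimal uproot sketch (hypothetical file/tree/branch names) that reads ROOT data with no ROOT installation at all:

    import uproot  # pip install uproot; pure Python + NumPy, no ROOT needed

    with uproot.open("events.root") as f:         # hypothetical file
        tree = f["Events"]                        # hypothetical tree name
        pt = tree["Muon_pt"].array(library="np")  # one column, straight to NumPy
    print(pt[:5])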
Interoperability. For data, it's very much paramount in my book to have some hope of being able to read back that unique data 20, 30, ... years down the line.
You don’t have to. I worked on data analysis (mostly cleaning and correction) for CMS (one of the two main experiments at LHC) for a while and didn’t have to touch it. Disclaimer: I was a high energy theorist, but did the aforementioned experimental work early in my PhD for funding.
I mean, most of the researchers I know at least use PyRoot (or the Julia equivalent) as much as possible, rather than actually interacting with Root itself. Which probably saves their sanity...
I did my master's and PhD around the time NumPy/SciPy got competitive for a lot of analysis (for me, a complete replacement), but the Python bindings for ROOT weren't there or were in beta. ROOT-the-data-format remained the main output of Geant4, however, so I set up a tiny Python wrapper around a ROOT script that would dump any .root contents and load them into a NumPy file.
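The modern equivalent of that dump script is a few lines of uproot; a sketch assuming flat (non-jagged) branches and made-up names:

    import numpy as np
    import uproot

    with uproot.open("geant4_output.root") as f:  # hypothetical file
        arrays = f["Hits"].arrays(library="np")   # dict of branch name -> ndarray
    np.savez("hits.npz", **arrays)                # one NumPy archive, no ROOT needed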
I'm still waiting for the interface-breaking, let's-finally-make-root-good, version 7, which I think I first heard about in 2016 or so... true vapourware.
Hehe. I worked at an online lending website around 2013 with a group of particle physicists hired to build risk prediction models. They used ROOT for the modeling and built some interface through Ruby... From the software engineering POV it was an abomination, but from the statistics POV it was pretty neat.
This was way before the Python ecosystem gained traction. And R ML packages were also just starting.
ROOT is definitely the backbone of a ton of work done in experimental particle physics, but it is also the nightmare of new graduate students. It's effectively ingrained into particle physics, and I don't expect that to change anytime soon.
It is not that bad now, with PyROOT (ROOT's Python interface) and uproot being options that are easy for new graduate students to learn. The problem is legacy code, which they usually have to maintain as part of their experiment service work.
I can’t count the number of times a beginner did some stuff in PyROOT that was horrifically slow, and just implementing the exact same algorithm in C++ was two orders of magnitude faster.
Unless you use RDataFrame, or it’s just histogram plotting, be very careful with PyROOT.
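A minimal sketch of the RDataFrame route from Python (hypothetical tree/file/branch names); the event loop runs in compiled C++, with PyROOT only steering:

    import ROOT

    df = ROOT.RDataFrame("Events", "events.root")     # hypothetical tree/file
    h = df.Filter("Muon_pt > 25").Histo1D("Muon_pt")  # cut runs in compiled C++

    c = ROOT.TCanvas()
    h.Draw()  # triggers the lazy event loop, then draws
    c.SaveAs("muon_pt.png")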
The part of ROOT I use is Cling, the C++ interpreter, along with Xeus in a Jupyter notebook. I decided one night to test the fastest n-body program from the Benchmarks Game, comparing Xeus and Python 3. With Xeus I get 15.58 seconds; running the fastest Python code with the Python 3 kernel, both on Binder using the same instance, I get 5 minutes. Output is exactly the same for both runs. Even with an overhead tax of ~300% for running dynamic C++ on this program, Cling is very quick. SIMD and vectorization were not used, just purely the code from the Benchmarks Game. I use Cling primarily as a quick stand-in JIT for languages that compile to C++.
It was a nice guest post on the website about Eclipse, but most people just use gdb. It is now possible to step through ROOT macros with gdb by exporting CLING_DEBUG=1. See https://indico.jlab.org/event/459/contributions/11563/
Without Cling, this sort of thing wouldn't be feasible in C++, at least not in the way Clojure dialects work. The runtime is a library, and the generated code just uses that library.
Have they released v7 yet? When I started my PhD they announced it, and I looked forward to the consistency it would introduce between certain parts of the software (some mismatches really don't make sense and are clearly organic), and now I'm already 2 years past my graduation.
Certain gTLDs have been borderline scams. The most infamous one might be .sucks, an extortion scheme charging an annual protection fee of $$$, complete with the pre-registration process when you could buy <yourtrademark>.sucks for $$$$ before it’s snatched up by your enemies.
They also screwed up some old URL/email parsers/sniffers hardcoding TLDs. Largely the fault of bad assumptions to begin with.
Other than the above, I don’t see much of a problem. Whatever problems people like to point out about gTLDs already existed with numerous sketchy ccTLDs, like .io. Guess what, the latest hotness, .ai, is also one of those.
If we allowed all possible TLDs, then we'd need a default organization to administer them. The current setup requires an organization to control each TLD, which allows us to grant control to countries or large organizations. The web should be decentralized, which means TLD ownership should be spread across multiple organizations. More TLDs with more distinct owners is a better situation than one default.
I struggle to see why one may want to use an interactive analysis toolkit via C++. Could anyone who has used ROOT enlighten me on this? I understand why you may write it in C++, but why would you want to invoke it with C++ for this sort of work?
All of our other code is C++. The data reconstruction framework writing ROOT files, the analysis frameworks doing stat analysis. The event data model is implemented in C++.
It has its rough edges, but you do get a lot of good synergy out of this setup for sure.
Comments here have already mentioned a couple of horror stories of people, by accident or inexperience, doing a lot of work above the framework; if you can avoid that by not being slow, why not?
What I remember about ROOT CINT is that it was an absolute nightmare to work with, mostly because it couldn't handle STL containers very well. It was a weird time to do language interop for physicists.
Back in the day, one always had to have two terminals open to work with ROOT: one to work in and the other to 'kill -9 root.exe', thanks to CINT happily and completely destroying your TTY.