A blast from the past, I used to work in particle physics and used ROOT a lot. I had a love/hate relationship with it. On the one hand, it had a lot of technical debt and idiosyncrasies. But on the other hand, there are a bunch of things that are easier in ROOT than in more "modern" options like matplotlib. For example, anything that has to do with histograms. Or highly structured data (where your 'columns' contain objects with fields). Or just plotting functions (without having to allocate arrays for the x and y values). I also like the very straightforward object-oriented API. It feels like old-school C++ or Java, as opposed to pandas/matplotlib which has a lot of method chaining, abuse of [] syntax and other magic. It is not elegant, and quite verbose, but that is probably a good thing when doing a scientific analysis.
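For flavor, here is a minimal PyROOT sketch of what I mean (the histogram and function names are made up): you book a histogram as an object, fill it, and draw a function straight from its formula, without ever allocating x/y arrays.

    import ROOT  # PyROOT ships with ROOT itself

    # Book a histogram; the binning lives on the object.
    h = ROOT.TH1F("h_pt", "Muon pT;pT [GeV];events", 100, 0.0, 100.0)
    for _ in range(10_000):
        h.Fill(ROOT.gRandom.Exp(20.0))

    # A function is drawn from its formula; no x/y arrays allocated.
    f = ROOT.TF1("f_decay", "500*exp(-x/20)", 0.0, 100.0)

    c = ROOT.TCanvas("c")
    h.Draw()
    f.Draw("SAME")
    c.SaveAs("pt.png")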
I left about 5 years ago, and ROOT was in a process of change. They already ripped out the old CINT interpreter and moved to a clang-based codebase, and now you can run your analyses in Jupyter as far as I know (in C++ or Python). I heard the code quality has improved a lot, too.
The best thing about root was how it handled data loading. TTree's, with their column based slicing on disk, are such a good idea. Ever since I graduated and moved into industry, I've been looking for something that works the same way.
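To make the column slicing concrete, here is a minimal PyROOT sketch with hypothetical file/tree/branch names: switch every branch off, switch one back on, and only that column is read from disk.

    import ROOT

    f = ROOT.TFile.Open("events.root")  # hypothetical file
    tree = f.Get("Events")              # hypothetical tree name

    tree.SetBranchStatus("*", 0)        # read no branches...
    tree.SetBranchStatus("nMuon", 1)    # ...except this one column

    total = 0
    for event in tree:                  # only nMuon is deserialized per entry
        total += event.nMuon
    print(total)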
I was hosting one of the leads of ROOT at Google and we got to talking about ROOT. I mentioned sstables and columnio and he said "oh, yeah, we've been doing that for years".
Because matplotlib is not so histogram focused (I guess because the kids these days have plenty of RAM), people always show these abominable scatter plots that have so many points on top of each other that they're useless. Yuck.
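The histogram fix is cheap even in matplotlib; a minimal sketch with synthetic data:

    import numpy as np
    import matplotlib.pyplot as plt

    # A million correlated points: a scatter plot would be one opaque blob.
    x = np.random.randn(1_000_000)
    y = x + np.random.randn(1_000_000)

    # Bin the density instead of drawing every point on top of each other.
    plt.hist2d(x, y, bins=200)
    plt.colorbar(label="counts per bin")
    plt.savefig("density.png")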
Haskell would be great for designing the interface of a library like this, but not for implementing it. It would definitely not look like "old-school C++ or Java" but, well, that's the whole point :P
I haven't used ROOT so I don't know how well it would work to write bindings for it in Haskell; it can be hard to provide a good interface to an implementation that was designed for a totally different style of use. Possible, just difficult.
I think having Haskell bindings to it would be quite valuable. For the implementation of the core structures, though, it's better to stick with C++ to max out performance and keep finer control over resource usage. Haskell isn't particularly good at that.
There's a number of reasons for this. The first is that the quant physics community has never really adopted functional programming. It's not particularly obvious to scientists, who typically want to express their computation the way they want to, something that C, C++, and Fortran are all long-established at doing. The second is that much of physics depends on old libraries written over the last 30-40 years, and it's easiest to use them from the language a library is written in, or one with a highly similar interface (for example, Python is similar enough to C++ that many foreign function interfaces are literally just direct wrappers). The third is that types (other than simple scalars, arrays, and trees/graphs) have never been a high priority in quant physics. The fourth is that undergrad education outside CS rarely teaches students Haskell, while most undergrads in a quant field graduate knowing some amount of Python.
It's much more likely the physics community would adopt Julia, or maybe Rust, and even that has been pretty slow.
(Nothing I said above should be construed as taking a position about the suitability, or lack thereof, of any specific language for scientific computing. I have opinions, but I am attempting to explain the reasons factually with a minimum of bias.)
Using something like Haskell for ROOT is ridiculous for a lot of obvious reasons. A simple and dismissive "no" invites the cautious reader to discover them on their own rather than wasting time in a protracted debate. Maybe it's better to reject the idea out of hand and spend our time elsewhere.
That’s just not how technical discussions work. Not everyone knows what you know, and the point of this community is to share knowledge, not gatekeep it behind some “discover it yourself” bullshit. The fastest thing to do is not to dismiss it with no explanation but rather to explain, for all the readers, why that is the case. Because if one person doesn’t know, I can guarantee there are plenty out there who are just as interested to know. And it’s a waste of everyone’s time to have each person independently come to the same conclusion when it’s apparently easily explainable.
You’re free to not do any of that, of course, but be prepared to defend the fact that you’d rather not engage in discussion and instead just shallowly dismiss something.
This is a great example of why the age of truly terrible software is going to be ushered in as LLMs get better.
When the cost of complexity of interacting with an API is paid by the LLM, optimizing this particular part of software design (also one of the hardest to get right) will be less fashionable.
There are not many reasons why new analyses should default to ROOT instead of more user-friendly and sane options like uproot [1]. Maybe some people have a legacy workflow, or their experiment carries many custom patches on top of ROOT (common practice) for other things, but for physics analysis you might just be torturing yourself.
Also I really like their 404 page [2]. And no it is not about room 404 :)
One common criticism of uproot is that it's not flexible when per-row computation gets complicated because for-loops in Python is too slow. For that one can either use Numba (when it works), or, here's the shameless plug, use Julia: https://github.com/JuliaHEP/UnROOT.jl
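For the Numba route, a minimal sketch (the array here is a synthetic stand-in for a branch you'd read with uproot):

    import numba
    import numpy as np

    @numba.njit
    def count_passing(pt, threshold):
        # A plain per-row for-loop; @njit compiles it to machine code,
        # so it runs orders of magnitude faster than interpreted Python.
        n = 0
        for i in range(pt.shape[0]):
            if pt[i] > threshold:
                n += 1
        return n

    pt = np.random.exponential(20.0, 10_000_000)  # stand-in for a real branch
    print(count_passing(pt, 25.0))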
That's true, and Julia might be a solution, but I don't see the adoption happening anytime soon.
But there are now different options for tackling this particular problem (per-row computation) in the HEP Python ecosystem. One approach is to leverage array programming with NumPy to vectorize operations as much as possible. By operating on entire arrays rather than looping over individual elements, significant speedups can often be achieved.
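A minimal sketch of that vectorized style, with a synthetic column standing in for real data:

    import numpy as np

    pt = np.random.exponential(20.0, 10_000_000)  # toy stand-in for a branch

    # One vectorized pass over the whole column instead of a Python loop.
    mask = pt > 25.0
    print(mask.sum(), pt[mask].mean())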
Another possibility is to use a library like Awkward Array, which is designed to work with nested, variable-sized data structures. Awkward Array integrates well with uproot and provides a powerful and flexible framework for performing fast computations on, e.g., jagged arrays.
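For instance, a minimal Awkward sketch on a toy jagged array (a variable number of muon pTs per event):

    import awkward as ak

    pt = ak.Array([[31.2, 12.5], [], [45.0, 22.1, 9.8]])

    good = pt[pt > 20.0]     # element-wise cut, preserving the event structure
    print(ak.num(good))      # muons passing the cut per event: [1, 0, 2]
    print(ak.flatten(good))  # all passing values as one flat array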
Uproot already returns Awkward Arrays, so both things you mentioned are different ways of saying the same thing. The irreducible complexity of data analysis is there no matter how you do it, and "one-vector-at-a-time" sometimes feels like shoehorning (other terms people come up with include "vector-style mental gymnastics").
For the record, vector-style programming is great when it works; I mean, Julia even has dedicated syntax for broadcasting. I'm saying that when the irreducible complexity arrives, you don't want to NOT be able to just write a for-loop.
A great alternative to Numba for accelerated Python is Taichi. It's trivial to convert a regular Python program into a Taichi kernel, and it can then target CUDA (and a variety of other backends). No need to worry about block/grid/thread allocation, etc. At the same time, it's super deep, with great support for data classes, custom memory layouts for complexly nested classes, autograd, and so on. I'm a huge fan: it makes writing code that runs on the GPU and integrates with your Python libraries an absolute breeze. Super powerful. By far the best tool in the accelerated-Python toolbox, IMO.
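A minimal sketch of what that conversion looks like (toy kernel; the field name and size are made up):

    import taichi as ti

    ti.init(arch=ti.gpu)  # falls back to CPU if no GPU is available

    n = 1_000_000
    x = ti.field(dtype=ti.f32, shape=n)

    @ti.kernel
    def fill_squares():
        # The outermost loop of a Taichi kernel is parallelized
        # automatically; no manual block/grid/thread bookkeeping.
        for i in x:
            x[i] = 0.5 * i * i

    fill_squares()
    print(x[10])  # fields can be read back from Python directly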
>they made a lame excuse that Pytorch didn't support 3.12
How is this a lame excuse?
>but it fails on a bunch of PyTorch-related tests. We then figured out that PyTorch does not have Python 3.12 support
They have a dep that was blocking them from upgrading. You would have them do what? Push PyTorch to upgrade?
>Later, even when Pytorch added support for 3.12, nothing changed (so far) in Taichi.
my friend, that "Later" is Feb/March of this year, i.e. 2-3 months ago. Exactly how fast would you like this open source project to service your needs? Not to mention there is a PR up for the bump.
Here it's more the other way around. CERN needs a data analysis framework, so CERN develops, maintains and publishes it for other users.
That being said, I don't know whether it's actually a good idea for someone external to actually use it. My experience may be a little outdated, but it's quite clunky and dated. The big advantage of using it for CERN or particle physics stuff is that it's basically a standard, so it's easy to collaborate internally.
Well, these are two very different examples. One, ROOT, is a powerful data analysis framework that, for all its power, never became general or easy enough to use to make it out of the HEP world.
The other one, gstreamer, is a beautifully designed platform with an architecture so nice it can be easily abstracted and reused in completely different scenarios, even ones that probably never occurred to the authors.
What is not cool is that ROOT was "designed" and built by people who had absolutely no idea how to run a large-scale software project. And it shows everywhere: it's one huge monolith that you have to constantly fight to do anything slightly non-trivial. I'm happy that I don't have to use it frequently, though I still have some exposure.
IMHO, ROOT 3-5 was too many things, with a lot of poorly designed APIs and, most importantly, a lack of separation between ROOT-the-library and ROOT-the-program (lots of globals and assumptions that ROOT-the-program is how people should use it).
ROOT 6 started to correct some of these things, but it takes time (and IMHO they are buying too much into LLVM and Clang, increasing the build times even more and worsening the hackability of ROOT as a project).
Also, for the longest time, the I/O format wasn't very well documented, with only one implementation.
Now, thanks to groot [1], uproot (which was developed building on the work from groot), and others (freehep, openscientist, ...), it's possible to read/write ROOT data without bringing in the whole TWorld.
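For example, a minimal uproot sketch (hypothetical file/tree/branch names) that reads ROOT data with no ROOT installation at all:

    import uproot  # pip install uproot; pure Python + NumPy, no ROOT needed

    with uproot.open("events.root") as f:         # hypothetical file
        tree = f["Events"]                        # hypothetical tree name
        pt = tree["Muon_pt"].array(library="np")  # one column, straight to NumPy
    print(pt[:5])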
Interoperability. For data, it's very much paramount in my book to have some hope of being able to read back that unique data 20, 30, ... years down the line.
You don’t have to. I worked on data analysis (mostly cleaning and correction) for CMS (one of the two main experiments at LHC) for a while and didn’t have to touch it. Disclaimer: I was a high energy theorist, but did the aforementioned experimental work early in my PhD for funding.
I mean, most of the researchers I know at least use PyRoot (or the Julia equivalent) as much as possible, rather than actually interacting with Root itself. Which probably saves their sanity...
I did my master's and PhD around the time NumPy/SciPy got competitive for a lot of analysis (for me, a complete replacement), but the Python bindings for ROOT weren't there or were in beta. ROOT-the-data-format remained the main output of Geant4, however, so I set up a tiny Python wrapper around a ROOT script that would dump any .root contents and load them into a NumPy file.
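The modern equivalent of that dump script is a few lines of uproot; a sketch assuming flat (non-jagged) branches and made-up names:

    import numpy as np
    import uproot

    with uproot.open("geant4_output.root") as f:  # hypothetical file
        arrays = f["Hits"].arrays(library="np")   # dict of branch name -> ndarray
    np.savez("hits.npz", **arrays)                # one NumPy archive, no ROOT needed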
I'm still waiting for the interface-breaking, let's-finally-make-root-good, version 7, which I think I first heard about in 2016 or so... true vapourware.
Hehe. I worked at an online lending website around 2013 with a group of particle physicists hired to build risk prediction models. They used ROOT for the modeling and built some interface through Ruby... From the software engineering POV it was an abomination, but from the statistics POV it was pretty neat.
This was way before the Python ecosystem gained traction. And R ML packages were also just starting.
ROOT is definitely the backbone of a ton of work done in experimental particle physics, but it is also the nightmare of new graduate students. It's effectively ingrained into particle physics, and I don't expect that to change anytime soon.
It is not that bad now, with PyROOT (ROOT's Python interface) and uproot being options that are easy for new graduate students to learn. The problem is legacy code, which they usually have to maintain as part of their experiment service work.
I can’t count the number of times a beginner did some stuff in PyROOT that was horrifically slow, and just implementing the exact same algorithm in C++ was two orders of magnitude faster.
Unless you use RDataFrame, or it’s just histogram plotting, be very careful with PyROOT.
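A minimal sketch of the RDataFrame route from Python (hypothetical tree/file/branch names); the event loop runs in compiled C++, with PyROOT only steering:

    import ROOT

    df = ROOT.RDataFrame("Events", "events.root")     # hypothetical tree/file
    h = df.Filter("Muon_pt > 25").Histo1D("Muon_pt")  # cut runs in compiled C++

    c = ROOT.TCanvas()
    h.Draw()  # triggers the lazy event loop, then draws
    c.SaveAs("muon_pt.png")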
The part of ROOT I use is Cling, the C++ interpreter, along with Xeus in a Jupyter notebook. I decided one night to test the fastest n-body program from the Benchmarks Game, comparing Xeus and Python 3. With Xeus I get 15.58 seconds; running the fastest Python code with the Python 3 kernel, both on Binder using the same instance, I get 5 minutes. Output is exactly the same for both runs. Even with an overhead tax of ~300% for running dynamic C++ on this program, Cling is very quick. SIMD and vectorization were not used, just purely the code from the Benchmarks Game. I use Cling primarily as a quick stand-in JIT for languages that compile to C++.
It was a nice guest post on the website about Eclipse, but most people just use gdb. It is now possible to step through ROOT macros with gdb by exporting CLING_DEBUG=1. See https://indico.jlab.org/event/459/contributions/11563/
Without Cling, this sort of thing wouldn't be feasible in C++, at least not in the way Clojure dialects work. The runtime is a library, and the generated code just uses that library.
Have they released v7 yet? When I started my PhD they announced it, and I looked forward to the consistency it would introduce between certain parts of the software (some mismatches really don't make sense and are clearly organic), and now I'm already 2 years past my graduation.
Certain gTLDs have been borderline scams. The most infamous one might be .sucks, an extortion scheme charging an annual protection fee of $$$, complete with the pre-registration process when you could buy <yourtrademark>.sucks for $$$$ before it’s snatched up by your enemies.
They also screwed up some old URL/email parsers/sniffers hardcoding TLDs. Largely the fault of bad assumptions to begin with.
Other than the above, I don’t see much of a problem. Whatever problems people like to point out about gTLDs already existed with numerous sketchy ccTLDs, like .io. Guess what, the latest hotness, .ai, is also one of those.
If we allowed all possible TLDs, then we'd need a default organization to administer them. The current setup requires an organization to control each TLD, which allows us to grant control to countries or large organizations. The web should be decentralized, which means TLD ownership should be spread across multiple organizations. More TLDs with more distinct owners is a better situation than one default.
I struggle to see why one may want to use an interactive analysis toolkit via C++. Could anyone who has used ROOT enlighten me on this? I understand why you may write it in C++, but why would you want to invoke it with C++ for this sort of work?
All of our other code is C++. The data reconstruction framework writing ROOT files, the analysis frameworks doing stat analysis. The event data model is implemented in C++.
It has its rough edges, but you do get a lot of good synergy out of this setup for sure.
Comments here have already mentioned a couple of horror stories of people, by accident or inexperience, doing a lot of work above the framework; if you can avoid that by not being slow, why not?
What I remember about ROOT CINT is that it was an absolute nightmare to work with, mostly because it couldn't handle STL containers very well. It was a weird time to do language interop for physicists.
Back in the day, one always had to have two terminals open to work with ROOT: one to work in and the other to 'kill -9 root.exe', thanks to CINT happily and completely destroying your TTY.