Root is absolutely, mind-blowingly amazing. It gets a bad rap because it forces you to use primitives that were designed back in the early nineties. If you're "just" trying to analyze some data, your experience will indeed be "horrible" compared to what's offered by Python, R, Matlab, or Julia. But beyond that...

Root adds fully working reflection to C++. Root gives you dynamic library loading and reloading: you can fix a bug or add a new feature, recompile parts of your program, and keep working without restarting it. Root has a feature-complete C++ interpreter, with scripting and a REPL. You can work with it completely interactively. After prototyping, you can save your code as a script, and after identifying the performance-critical parts, you can compile them and get the full power of bare-metal C++ without changing anything about the code. Yes, this is technically possible with e.g. Python + numba as well, but not as straightforward. Root is fully interoperable with Python and R: you can mix scripts and REPLs between the languages and pass objects between them.

Root can serialize any object without requiring any custom code whatsoever (some serious dark magic is needed for this). In fact, you can pause your entire program and save it to disk, or send it over the network to keep running somewhere else. Root has its own file format for efficiently storing massive amounts of data in arbitrarily complex structures. It can stream it over the network too, with probabilistic read-ahead and caching for maximum efficiency. Root comes with libraries for physics/math/stats that rival those of the largest commercial and open source offerings.

Each one of these is a massive technical achievement, and Root has had most of them for decades now. Oh, and it has largely maintained backwards compatibility through all this time as well.
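To make the interpret-then-compile point concrete, here's a minimal sketch of a ROOT macro (the file name and contents are made up, but the classes are standard ROOT). The same file runs interpreted with `root -l demo.C`, or compiled to native code with `root -l demo.C+`, unchanged:

```cpp
// demo.C - a minimal, hypothetical ROOT macro
#include "TFile.h"
#include "TH1D.h"
#include "TRandom3.h"

void demo() {
    TRandom3 rng(42);
    TH1D h("h", "Gaussian sample;x;entries", 100, -5, 5);
    for (int i = 0; i < 100000; ++i)
        h.Fill(rng.Gaus(0, 1));

    // ROOT's I/O serializes the object as-is: no schema, no custom code.
    TFile f("demo.root", "RECREATE");
    h.Write();
    f.Close();
}
```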
Of course, very few people outside of CERN need all of this. Even within CERN, many projects don't. But for those who do, there are very few - if any - alternatives.
Python can do maybe 1 percent of all that. (Hell, Python has real trouble not shitting itself and dying after a "pip install", you can definitely forget about seamless native code compilation.)
But do you really need these features, which are already available in Matlab/Python/R/Julia/Lisp? Or did the C++ folks simply refuse to learn other languages?
From what I have seen in R and Python, the main reason for speed issues is incompetent programmers. Certainly, bad C++ code is much faster than bad Python code, but there is also the effort to build/maintain/document/teach Root to noobs.
Hot take: It's really about preferences, not features.
Let me paint you a picture: You have data coming off the detectors at a rate of a couple hundred GB/s (after pre-filters implemented in FPGAs etc.) that needs to be processed and filtered in real time, with output written to disk and tape at about one GB/s. We're talking really CPU-intensive processing here: Kalman filters, clustering algorithms, evaluating machine learning models. The facility is one of a kind and operating costs are in the billions per year, so downtime is unthinkable; this stuff needs to work. Offline, you're running very, very detailed (and CPU-heavy) simulations.

All in all, you have some hundreds of petabytes of data that are constantly being processed and reprocessed for hundreds of different purposes. These systems have many millions of lines of code between them, a lot of which needs to be shared: offline analysis needs to re-run online algorithms and so on, so you need a single stack for all systems. You have some hundreds of thousands of CPU cores to run all of this. Due to how academia works, beyond a couple of large core datacenters, resources are mostly spread across hundreds of locations globally, so that each participating university can maintain a cluster on its premises for teaching/research/funding reasons. You need an efficient way to get the data a program needs to where it is running, or preferably to move the program to where the data is. This is not a tech company; there's no revenue, so throwing money at the problem is not an option. It's all funded by taxpayers, so efficiency is paramount.

What language do you reach for? Matlab? Lol. The closest analogy I can think of are some big trading systems and large-scale ML inference and content serving at FAANG and the like. That's all usually Java or C++.
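For a sense of what "CPU-intensive per-event processing" means, here's the scalar textbook version of a Kalman filter update, the kind of step that track reconstruction repeats millions of times per second. Real detector code uses full matrices and is vastly more elaborate; the noise constants below are made up:

```cpp
// Illustrative only: a 1-D Kalman filter for a static state, the simplest
// instance of the per-hit computation done in track reconstruction.
struct Kalman1D {
    double x = 0.0;   // state estimate
    double p = 1.0;   // estimate variance
    double q = 1e-4;  // process noise (assumed)
    double r = 1e-2;  // measurement noise (assumed)

    double update(double z) {
        p += q;                  // predict: variance grows by process noise
        double k = p / (p + r);  // Kalman gain
        x += k * (z - x);        // correct with measurement z
        p *= (1.0 - k);          // shrink variance after the measurement
        return x;
    }
};
```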
Oh, one more thing: There are very few professional developers dedicated to this. A lot of it is built and maintained by grad students and researchers in between writing papers. They're smart people, and they can code, but they have neither the time nor the interest to learn a new language or framework every other year. They move around. A lot. It wouldn't work to have different tech stacks for different projects: you need to pick one solution, not just for one area but for the entire field, so people can spend less time learning and more time doing. There's no one available to migrate legacy code because some cool new language appeared or because yesterday's cool library isn't maintained anymore. These projects run for decades. Whatever tech you pick, you must be certain that it will still be around and supported 10, 20, 30 years later, and that the code still runs and the data you paid billions for can still be read.
Thanks for the detailed answer, I really appreciate the insight. I work in research myself, so I'm familiar with the general constraints.
I was certainly unaware of the size of the data coming from the detectors. If speed is the argument that beats all others, I rest my case. From what I read on the root.cern website, root presents itself as a data analysis and simulation environment, so I was not aware of its role in prototyping for online use.
Because I have spent a lot of time thinking about how software development can work in an analysis-heavy research environment, I would still like to comment on some of your points. For distributing binaries and source code, packages work very well for us. Especially if you want to reuse software components in unforeseen contexts, packages and a package registry make the most sense.
The use case "re-run online algorithms in offline analysis" is a very familiar one. In my line of work, we do that daily: switching between online and offline to test and deploy algorithms. Vastly smaller scale, of course. But to us, packages are the first part of the solution. All you do is change the data source: for offline, it's local data or a remote DB; for online, it's an interface such as a websocket (see the sketch below).
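A minimal sketch of that "swap the data source" pattern; the names (EventSource, FileSource, LiveSource) are hypothetical, since the actual interfaces depend on the stack:

```cpp
#include <optional>
#include <string>
#include <vector>

struct Event { std::vector<double> samples; };

// One interface, two backends: the analysis code never knows which it got.
struct EventSource {
    virtual std::optional<Event> next() = 0;
    virtual ~EventSource() = default;
};

// Offline: replay recorded events from a local file or a database dump.
struct FileSource : EventSource {
    explicit FileSource(const std::string& /*path*/) { /* open the file */ }
    std::optional<Event> next() override {
        return std::nullopt;  // stub: decode the next event, nullopt at EOF
    }
};

// Online: pull events from a live feed, e.g. a websocket subscription.
struct LiveSource : EventSource {
    explicit LiveSource(const std::string& /*url*/) { /* connect */ }
    std::optional<Event> next() override {
        return std::nullopt;  // stub: block until the next message arrives
    }
};

// Identical algorithm online and offline; only the source object differs.
void run(EventSource& src) {
    while (auto ev = src.next()) {
        // ... same filtering/clustering code in both contexts ...
    }
}
```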
The second part of the solution is unit and integration tests. Other users will immediately see what you did (or didn't) test. Again, packages are the distribution system of choice. This has nothing to do with Matlab/Python/R/Julia: Rust has crates.io, JS has npmjs, even Java has something like Maven Central.
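To illustrate, a toy test against the EventSource sketch above. Plain asserts keep it framework-agnostic; in practice you'd reach for Catch2, GoogleTest, or whatever your package ships with. StubSource is made up for the example:

```cpp
#include <cassert>
#include <optional>

struct Event {};

// A stub source that yields a fixed number of events, for tests only.
struct StubSource {
    int remaining = 3;
    std::optional<Event> next() {
        if (remaining <= 0) return std::nullopt;
        --remaining;
        return Event{};
    }
};

int main() {
    StubSource src;
    int n = 0;
    while (src.next()) ++n;
    assert(n == 3);  // the consuming loop sees exactly the stubbed events
    return 0;
}
```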
Regarding the funded-by-taxpayers argument: The issue I see here is that the cool ML, simulation, and data analysis stuff that the CERN people do stays inside the root ecosystem. If they used something like PyPI, I could use their stuff too. I have a lot of clustering problems, especially on time series. With a more or less closed-world system like root, I can't use any of CERN's implementations.
Regarding "researchers don't have time to learn new languages": If you look on github.com/root-project/root/issues and root-forum.cern.ch, there are suspiciously many questions regarding "how can I make use of root and Python libraries", and "X doesn't work in root, what do". Newbs have to learn root as well, and they seem to like using Python at least as an enhancement.
> C++ folks simply refuse to learn other languages
root is decades old, going back to the mid-nineties. And I understand that scripts developed under it are often directly incorporated into C++ applications. CERN is a big C++ user (I have a little experience with their GEANT4 framework), and being able to do everything in one language is a big productivity boost (see, for example, the rise of node.js for web-related work).
yeah... basically... in fact they're leaning more into JIT + string programming now and seeing big JIT latency etc. (imo it's kinda approaching Julia from the C++ side)
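"JIT + string programming" here means handing C++ source to the embedded Cling interpreter as a string, roughly like this (the snippet itself is made up, but gROOT->ProcessLine is the standard entry point):

```cpp
#include "TROOT.h"

void jit_demo() {
    // Cling JIT-compiles the string on first use; that first call is where
    // the JIT latency mentioned above shows up.
    gROOT->ProcessLine("int x = 40 + 2; printf(\"x = %d\\n\", x);");
}
```

The newer RDataFrame interface leans on the same machinery, taking filter and define expressions as C++ strings that get JIT-compiled at runtime.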