
I originally posted this on reddit[1], but figured I'd share this here. I checked out ripgrep 0.8.0 and compiled it with both Rust 1.20 (from ~5.5 years ago) and Rust 1.67 (just released):

    $ git clone https://github.com/BurntSushi/ripgrep
    $ cd ripgrep
    $ git checkout 0.8.0
    $ time cargo +1.20.0 build --release
    real    34.367
    user    1:07.36
    sys     1.568
    maxmem  520 MB
    faults  1575
    
    $ time cargo +1.67.0 build --release
    [... snip sooooo many warnings, lol ...]
    real    7.761
    user    1:32.29
    sys     4.489
    maxmem  609 MB
    faults  7503
As kryps pointed out on reddit, I believe at some point there was a change that improved compilation times by making more effective use of parallelism. So forcing the build to use a single thread produces more sobering results, but still a huge win:

    $ time cargo +1.20.0 build -j1 --release
    real    1:03.11
    user    1:01.90
    sys     1.156
    maxmem  518 MB
    faults  0

    $ time cargo +1.67.0 build -j1 --release
    real    46.112
    user    44.259
    sys     1.930
    maxmem  344 MB
    faults  0
(My CPU is an i9-12900K.)

These are from-scratch release builds, which probably matter less than incremental builds. But they still matter. This is just one barometer of many.

[1]: https://old.reddit.com/r/rust/comments/10s5nkq/improving_rus...




Re parallelism: I have 12 cores, and cargo indeed effectively uses them all. As a result, the computer becomes extremely sluggish during a long compilation. Is there a way to tell Rust to only use 11 cores or, perhaps, nice its processes/threads to a lower priority on a few cores?

I suppose it's not the worst problem to have. Makes me realize how spoiled I got after multiple-core computers became the norm.


`cargo build -j11` will limit parallelism to eleven cores. Cargo and rustc use the Make jobserver protocol [0][1][2] to coordinate their use of threads and processes, even when multiple rustc processes are running (as long as they are part of the same `cargo` or `make` invocation):

[0]: https://www.gnu.org/software/make/manual/html_node/Job-Slots...

[1]: https://github.com/rust-lang/cargo/issues/1744

[2]: https://github.com/rust-lang/rust/pull/42682
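
A sketch of what that sharing looks like in practice (hypothetical Makefile; the `+` prefix marks the recipe as recursive, so make passes its jobserver down to cargo):

    $ cat Makefile
    build:
    	+cargo build --release
    $ make -j8 build    # one -j8 budget shared by make, cargo, and rustc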

`nice cargo build` will run all threads at low priority, which is generally a good idea if you want to prioritize interactive processes while running a build in the background.


To add: in Rust 1.63, cargo added support for negative numbers, so you can say `cargo build --jobs -2` to leave two cores available.

See https://github.com/rust-lang/cargo/blob/master/CHANGELOG.md#...


Small quality of life changes like this really make the cargo and rust community shine, I feel. I'm not a heavy rust user but following all the little improvements in warnings, hints, they build up to a great experience. I wish we had the mindshare and peoplepower to do that in my language and tooling of choice (I'm specifically talking about effort and muscle, because motivation is clearly already there).


Are they real cores or hyperthreads/SMT? I've found that hyperthreading doesn't really live up to the hype; if interactive software gets scheduled on the same physical core as a busy hyperthread, latency suffers. Meanwhile, Linux seems to do pretty well these days handling interactive workloads while a 32 core compilation goes on in the background.

SMT is a throughput thing, and I honestly turn it off on my workstation for that reason. It's great for cloud providers that want to charge you for a "vCPU" that can't use all of that core's features. Not amazing for a workstation where you want to chill out on YouTube while something CPU intensive happens in the background. (For a bazel C++ build, having SMT on, on a Threadripper 3970X, does increase performance by 15%. But at the cost of using ~100GB of RAM at peak! I have 128GB, so no big deal, but SMT can be pretty expensive. It's probably not worth it for most workloads. 32 cores builds my Go projects quickly enough, and if I have to build C++ code, well, I wait. ;)
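
If you want to try this, on Linux SMT can be checked and toggled at runtime via sysfs (kernel 4.19+; needs root):

    $ cat /sys/devices/system/cpu/smt/active
    1
    $ echo off | sudo tee /sys/devices/system/cpu/smt/control
    $ cat /sys/devices/system/cpu/smt/active
    0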


    exec ionice -c 3 nice -n 20 "$@"

Make it a shell script like `takeiteasy`, and run `takeiteasy cargo ...`
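
Something like this (a minimal sketch; note that nice's maximum niceness on Linux is 19, so -n 20 just gets clamped):

    #!/bin/sh
    # takeiteasy: run a command with idle I/O priority and lowest CPU priority
    exec ionice -c 3 nice -n 19 "$@"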


Partly because of being a Dudeist, and partly because it's just fun to say, I just borrowed this and called it "dude" on my system.

  dude cargo ...
has a nice flow to it.


This also relates to something not directly about rustc: many-core CPUs are much easier to get than five years ago, so a CPU-hungry compiler needn't be such a drag if its big jobs can use all your cores.


It's true!

Steam hardware survey, Jan 2017 [1] vs Jan 2023, "Physical CPUs (Windows)"

           2017    2023
  1 CPU    1.9%    0.2%
  2 CPUs  45.8%    9.6%
  3 CPUs   2.6%    0.4%
  4 CPUs  47.8%   29.6%
  6 CPUs   1.4%   33.0%
  8 CPUs   0.2%   18.8%
  More     0.3%    8.4%
[1] https://web.archive.org/web/20170225152808/https://store.ste...


However, rustc currently has limited ability to parallelise at a sub-crate level, which makes for not-so-great tradeoffs on large projects.


Someone asked (and then deleted their comment):

> How many LoC there is in ripgrep? 46sec to build a grep like tool with a powerful CPU seems crazy.

I wrote out an answer before I knew the comment was deleted, so... I'll just post it as a reply to myself...

-----

Well it takes 46 seconds with only a single thread. It takes ~7 seconds with many threads. In the 0.8.0 checkout, if I run `cargo vendor` and then tokei, I get:

    $ tokei -trust src/ vendor/
    ===============================================================================
     Language            Files        Lines         Code     Comments       Blanks
    ===============================================================================
     Rust                  765       299692       276218        10274        13200
     |- Markdown           387        21647         2902        14886         3859
     (Total)                         321339       279120        25160        17059
    ===============================================================================
     Total                 765       299692       276218        10274        13200
    ===============================================================================
So that's about a quarter million lines. But this is very likely to be a poor representation of actual complexity. If I had to guess, I'd say the vast majority of those lines are some kind of auto-generated thing. (Like Unicode tables.) That count also includes tests. Just by excluding winapi, for example, the count goes down to ~150,000.

If you only look at the code in the ripgrep repo (in the 0.8.0 checkout), then you get something like ~13K:

    $ tokei -trust src globset grep ignore termcolor wincolor
    ===============================================================================
     Language            Files        Lines         Code     Comments       Blanks
    ===============================================================================
     Rust                   34        15484        13205          780         1499
     |- Markdown            30         2300            6         1905          389
     (Total)                          17784        13211         2685         1888
    ===============================================================================
     Total                  34        15484        13205          780         1499
    ===============================================================================
It's probably also fair to count the regex engine too (version 0.2.6):

    $ tokei -trust src regex-syntax                          
    ===============================================================================
     Language            Files        Lines         Code     Comments       Blanks
    ===============================================================================
     Rust                   29        22745        18873         2225         1647
     |- Markdown            23         3250          285         2399          566
     (Total)                          25995        19158         4624         2213
    ===============================================================================
     Total                  29        22745        18873         2225         1647
    ===============================================================================
Where about 5K of that are Unicode tables.

So I don't know. Answering questions like this is actually a little tricky, and presumably you're looking for a barometer of how big the project is.

For comparison, GNU grep takes about 17s single threaded to build from scratch from its tarball:

    $ time (./configure --prefix=/usr && make -j1)
    real    17.639
    user    9.948
    sys     2.418
    maxmem  77 MB
    faults  31
Using `-j16` decreases the time to 14s, which is actually slower than a from-scratch ripgrep 0.8.0 build, primarily due to what appears to be a single-threaded configure script for GNU grep.

So I dunno what seems crazy to you here honestly. It's also worth pointing out that ripgrep has quite a bit more functionality than something like GNU grep, and that functionality comes with a fair bit of code. (Gitignore matching, transcoding and Unicode come to mind.)


It was me, and thanks for the details. I missed the multi-threaded compilation in the second part; I thought it was 46sec with -jx.


In addition, it's worth mentioning here that the measurement is for release builds, which are doing far more work than just reading a quarter million lines off of a disk.


The most annoying thing in my experience is not really the raw compilation times, but the lack of - or very rudimentary - incremental build feature. If I'm debugging a function and make a small local change that does not trickle down to some generic type used throughout the project, then 1-second build times should be the norm, or better yet, edit & continue debug.

It's beyond frustrating that any "i+=1" change requires relinking a 50MB binary from scratch and rebuilding a good chunk of the Win32 crate for good measure. Until such enterprise features become available, high developer productivity in Rust remains elusive.


To be clear, Rust has an "incremental" compilation feature, and I believe it is enabled by default for debug builds.

I don't think it's enabled by default in release builds (because it might sacrifice perf too much?) and it doesn't make linking incremental.

Making the entire pipeline incremental, including release builds, probably requires some very fundamental changes to how our compilers function. I think Cranelift is making inroads in this direction by caching the results of compiling individual functions, but I know very little about it and might even be describing it incorrectly here in this comment.
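
If you want to experiment, incremental compilation can be force-enabled for release builds, either with the CARGO_INCREMENTAL environment variable or with `incremental = true` under `[profile.release]` in Cargo.toml (whether the optimization trade-off is acceptable is project-specific):

    $ CARGO_INCREMENTAL=1 cargo build --release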


As far as I remember, Energize C++ allowed you to do just that (and VC++ does a similar thing), and it feels quite fast with VC++'s incremental compilation and linking.


> It's beyond frustrating that any "i+=1" change requires relinking a 50mb binary from scratch

It’s especially hard to solve this with a language like rust, but I agree!

I’ve long wanted to experiment with a compiler architecture which could do fully incremental compilation, maybe down to function-level granularity. In the linked (debug) executable, use a malloc-style library to manage disk space. When a function changes, recompile it, free the old copy in the binary, allocate space for the new function, and update jump addresses. You’d need to cache a whole lot of the compiler’s context between invocations - but honestly that should be doable with a little database like LMDB.

Or alternately, we could run our compiler in “interactive mode” and leave all the type information and everything else resident in memory between compilation runs. When the compiler notices some functions have changed, it flushes the old function definitions, compiles the new functions, and updates everything just like when the DOM updates and needs to recompute layout and styles.

A well optimized incremental compiler should be able to do a “i += 1” line change faster than my monitor’s refresh rate. It’s crazy we still design compilers to do a mountain of processing work, generate a huge amount of state and then when they’re done throw all that work out. Next time we run the compiler, we redo all of that work again. And the work is all almost identical.

Unfortunately this would be a particularly difficult change to make in the rust compiler. Might want to experiment with a simpler language first to figure out the architecture and the fully incremental linker. It would be a super fun project though!


Here's Energize C++ doing just that in 1993.

https://www.youtube.com/watch?v=yLZwLSzkH3E

VC++ has similar kind of support nowadays.


Most of the time for most changes you should just be relying on "cargo check" anyway. You don't need a full re-build to just check for syntax issues. It runs very fast and will find almost all compile errors and it caches metadata for files that are unchanged.
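
Concretely, the inner loop can be just this (`cargo check` parses, type-checks, and borrow-checks, but skips codegen and linking entirely):

    $ cargo check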

Are you really running your test suite for every "i+=1" change on other languages?


> Are you really running your test suite for every "i+=1" change on other languages?

You don't have to run your testsuite for a small bugfix (that's what CI is for), but you DO need to restart, reset the testcase that triggers the code you are interested in, and step through it again. Rinse and repeat 20 or so times, with various data, etc. - at least that's my debug-heavy workflow. If any trivial recompile takes a minute or so, that's frustrating time spent waiting, as opposed to using something like a dynamic language to accomplish the same task.

So you would instinctively avoid Rust for any task that can be accomplished with Python or JS, a real shame since it's very close to being a universal language.


Can you explain why the user time goes down when using a single thread? Does that mean that there's a huge amount of contention in the parallelism?


This is caused by hyperthreading. It's not an actual inefficiency, but an artifact of the way CPU time is counted.

The HT cores aren't real CPU cores. They're just an opportunistic reuse of hardware cores when another thread is waiting for RAM (and RAM is so slow, relatively speaking, that threads wait a lot, for a long time).

So code on the HT "core" doesn't run all the time, only when the other thread is blocked. But the time HT threads spend waiting for their turn is still counted as CPU time, which makes them look slow.


Back in the early days of HT I was so happy to get a desktop with it, that I enabled it.

The end result was that doing WebSphere development actually got slower, because of their virtual nature and everything else on the CPU being shared.

So I ended up disabling it again to get the original performance back.


Yeah, the earliest attempts weren't good, but I haven't heard of any HT problems post Pentium 4 (apart from Spectre-like vulnerabilities).

I assume OSes have since then developed proper support for scheduling and pre-empting hyperthreading. Also the gap between RAM and CPU speed only got worse, and CPUs have grown more various internal compute units, so there's even more idle hardware to throw HT threads at.


To be honest, I don't know. My understanding of 'user' time is that is represents the sum of all CPU time spent in "user mode" (as opposed to "kernel mode"). In theory, given that understanding and perfect scaling, the user time of a multi-threaded task should roughly match the user time of a single-threaded task. Of course, "perfect" scaling is unlikely to be real, but still, you'd expect better scaling here.

If I had to guess as to what's happening, it's that there's some thread pool, and at some point, near the end of compilation, only one or two of those threads is busy doing anything while the other threads are sitting and idling. Now whether and how that "idling" gets interpreted as "CPU being actively used in user mode" isn't quite clear to me. (It may not, in which case, my guess is bunk.)

Perhaps someone more familiar with what 'user' time actually means and how it interplays with multi-threaded programs will be able to chime in.

(I do not think faults have anything to do with it. The number of faults reported here is quite small, and if I re-run the build, the number can change quite a bit---including going to zero---and the overall time remains unaffected.)


Idle time doesn't count as user-time unless it's a spinlock (please don't do those in user-mode).

I suspect the answer is: Perfect scaling doesn't happen on real CPUs.

Turboboost lets a single thread go to higher frequencies than a fully loaded CPU. So you would expect "sum of user times" to increase even if "sum of user clock cycles" is scaling perfectly.

Hyperthreading is the next issue: multiple threads are not running independently, but might be fighting for resources on a single CPU core.

In a pure number-crunching algorithm limited by functional units, this means using $(nproc) threads instead of 1 thread should be expected to more than double the user time, based on these first two points alone!

Compilers of course are rarely limited by functional units: they do a decent bit of pointer-chasing, branching, etc. and are stalled a good bit of time. (While OS-level blocking doesn't count as user time; the OS isn't aware of these CPU-level stalls, so these count as user time!) This is what makes hyperthreading actually helpful.

But compilers also tend to be memory/cache-limited. L1 is shared between the hyperthreads, and other caches are shared between multiple/all cores. This means running multiple threads compiling different parts of the program in parallel means each thread of computation gets to work with a smaller portion of the cache -- the effective cache size is decreasing. That's another reason for the user time to go up.

And once you have a significant number of cache misses from a bunch of cores, you might be limited on memory bandwidth. At that point, also putting the last few remaining idle cores to work will not be able to speed up the real-time runtime anymore -- but it will make "user time" tick up faster.

In particularly unlucky combinations of working set size vs. cache size, adding another thread (bringing along another working set) may even increase the real time. Putting more cores to work isn't always good!

That said, compilers are more limited by memory/cache latency than bandwidth, so adding cores is usually pretty good. But it's not perfect scaling even if the compiler has "perfect parallelism" without any locks.


> Turboboost lets a single thread go to higher frequencies than a fully loaded CPU. So you would expect "sum of user times" to increase even if "sum of user clock cycles" is scaling perfectly.

Ah yes, this is a good one! I did not account for this. Mental model updated.

Your other points are good too. I considered some of them as well, but maybe not enough in the context of competition making many things just a bit slower. Makes sense.


User time is the amount of CPU time spent actually doing things. Unless you're using spinlocks, it won't include time spent waiting on locks or otherwise sleeping -- though it will include time spent setting up for locks, reloading cache lines and such.

Extremely parallel programs can improve on this, but it's perfectly normal to see 2x overhead for fine-grained parallelism.


I'd say there's still a gap in my mental model. I agree that it's normal to observe this, definitely. I see it in other tools that utilize parallelism too. I just can't square the 2x overhead part of it in a workload like Cargo's, which I assume is not fine-grained. I see the same increase in user time with ripgrep too, and its parallelism is maybe more fine grained than Cargo's, but is still at the level of a single file, so it isn't that fine grained.

But maybe for Cargo, parallelism is more fine grained than I think it is. Perhaps because of codegen-units. And similarly for ripgrep, if it's searching a lot of tiny files, that might result in fine grained parallelism in practice.


Well, as mentioned elsewhere, most of that overhead is just hyperthreads slowing down when they have active siblings.

Which is fine; it’s still faster overall. Disable SMT and you’ll see much lower overhead, but higher time spent overall.


Yes, I know it's fine. I just don't understand the full details of why hyperthreading slows things down that much. There are more experiments that could be done to confirm or deny this explanation, e.g., disabling hyperthreading and playing with the thread count a bit more.


Hyperthreading only duplicates the frontend of the CPU.

That's really it. That's the entire explanation. It's useful if and only if there are unused resources behind it, due to pipeline stalls or because the siblings are doing different things. It's virtually impossible to fully utilize a CPU core with a single thread; having two threads therefore boosts performance, but only to the degree that the first thread is incapable of using the whole thing.

That's why the speedup is around 20%, not 100%.


I know all of that. There's still a gap because it doesn't explain in full detail how contended resources lead to the specific slowdown seen here. Hell, nobody in this thread has done the experiments necessary to confirm that HT is even the cause in the first place.


Spinlocks are normal userspace code issuing machine instructions in a loop that do memory operations. It is counted in user time, unless the platform is unusual and for some reason enters the kernel to spin on the lock. Spinning is the opposite of sleeping.

edit: misparsed, as corrected below, my bad.


I think you're saying the same thing as the GP. You might have parsed their comment incorrectly.


User time is the amount of CPU time spent in user mode. It is aggregated across threads. If you have 8 threads running at 100% in user mode for 1 second, that gives you 8s of user time.

Total CPU time in user mode will normally increase when you add more threads, unless you're getting perfect or better-than-perfect scaling.
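
A quick way to see the aggregation (a sketch using coreutils' `timeout` and `yes`; on a machine with at least two idle cores, expect real to be about 2s but user to be about 4s):

    $ time sh -c 'timeout 2 yes >/dev/null & timeout 2 yes >/dev/null & wait'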


There are hardware reasons even if you leave any software scaling inefficiency to the side. For tasks that can use lots of threads, modern hardware trades off per-thread performance for getting more overall throughput from a given amount of silicon.

When you max out parallelism, you're using 1) hardware threads which "split" a physical core and (ideally) each run at a bit more than half the CPU's single-thread speed, and 2) the small "efficiency" cores on newer Intel and Apple chips. Also, single-threaded runs can feed a ton of watts to the one active core since it doesn't have to share much power/cooling budget with the others, letting it run at a higher clock rate.

All these tricks improve the throughput, or you wouldn't see that wall-time reduction and chipmakers wouldn't want to ship them, but they do increase how long it takes each thread to get a unit of work done in a very multithreaded context, which contributes to the total CPU time being higher than it is in a single-threaded run.


Faults also drop to zero. Might be worth trying to flush the cache before each cargo build?


As someone who uses Rust on various hobby projects, I never understood why people were complaining about compile times.

Perhaps they were on old builds or some massive projects?


Wait, like, you don't understand, or you don't share their complaint? I don't really understand how you don't understand. If I make a change to ripgrep because I'm debugging its perf and need to therefore create a release build, it can take several seconds to rebuild. Compared to some other projects that probably sounds amazing, but it's still annoying enough to impact my flow state.

ripgrep is probably on the smallish side. It's not hard to get a lot bigger than that and have those incremental times also get correspondingly bigger.

And complaining about compile times doesn't mean compile times haven't improved.


I do understand some factors, but I never noticed it being like super slow to build.

My personal project takes seconds to compile, but fair enough it's small, but even bigger projects like a game in Bevy don't take that much to compile. Minute or two tops. About 30 seconds when incremental.

People complained of 10x slower perf. Essentially 15min build times.

The fact that older versions might be slower to compile fills in another part of the puzzle.

That, and the fact that I have a 24-hyperthread monster of a CPU.


30 seconds isn't incremental, that is way too long.

I work on a large'ish C++ project and incremental is generally 1-2 seconds.

Incremental must work in release builds (someone else said it only works in debug for Rust), although it is fine to disable link-time optimizations, as those are obviously kinda slow.


> 30 seconds isn't incremental

I don't recall exact numbers. But Bevy can pull in a lot of dependencies. Enough for the `target` directory to rival NPM's worst offenders (e.g. ~1GB).


I'll echo Ygg2's comments. At my previous job the minimum compile times were around 30 minutes, so compile times under a minute feel like they're happening almost instantly. It's enough that I don't need to break my thought process every time I compile.


Surely you can see how 1) it's all relative and 2) different people work differently. Like is this really so hard to understand? As far as I can tell, your comment is basically, "be grateful for what you have." But I am already. Yet I still want faster compile times because I think it will help me iterate more quickly.

I truly just do not see what is difficult to understand here.


First, compile times can differ wildly based on the code in question. Big projects can take minutes where hobby projects take seconds.

Also, people have vastly different workflows. Some people tend to slowly write a lot of code and compile rarely. Maybe they have runtime tools to tweak things. Others like to iterate really fast: try a code change, see if the UI looks better or things run faster. When you work like this, even a compile time of 3 seconds can be a little bit annoying, and 30 seconds maddening.


It's less about "big projects" and more about "what features are used". It's entirely possible for a 10kloc project to take much more time to build than a 100kloc project. Proc macros, heavy generic use, and the like will drive compile time way up. It's like comparing a C++ project that is basically "C with classes" vs one that does really heavy template dances.

Notably, serde can drive up compile times a lot, which is why miniserde still exists and gets some use.
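
One way to see where the time actually goes in a given project is cargo's timing report (stable since Rust 1.60):

    $ cargo build --release --timings
    # writes an HTML report under target/cargo-timings/ showing per-crate
    # compile times and how well the build kept all cores busy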


People are enabling serde codegen on every type, for no reason. That's it, that's the whole story. Those of us who don't do this will continue to read these "rustc is slow!!1!" posts and roll our eyes. Rustc isn't slow, serde is slow.


Completely agree. Coming from a job where the C project I worked on took 30 minutes for basic software builds (you don't generally compile the code while writing it, and you spend a lot of time manually scanning for typos), Rust compile times are crazy fast.


Code gen takes quite a while. Diesel features are one way to see the effect...

    diesel = { version = "*", features = ["128-column-tables"], ... }


I remember I would spend hours looking at my code change because it would take hours to days to build what I was working on. I would build small examples to test and debug. I was shocked at Netscape with the amazing build system they had that could continuously build and tell you within a short few hours if you’ve broken the build on their N platforms they cross compiled to. I was bedazzled when I had IDEs that could tell me whether I had introduced bugs and could do JIT compilation and feedback to me in real time if I had made a mistake and provide inline lints. I was floored when I saw what amazing things rust was doing in the compiler to make my code awesome and how incredibly fast it builds. But what really amazed me more than anything was realizing how unhappy folks were that it took 30 seconds to build their code. :-)

GET OFF MY LAWN


I dare to want better tools. And I build them when I can. Like ripgrep. ¯\_(ツ)_/¯


Keep keeping me amazed and I’ll keep loving the life I’ve lived



