Google Is Uncovering Hundreds of Race Conditions Within the Linux Kernel (phoronix.com)
603 points by pjmlp on Oct 3, 2019 | 322 comments



Yay Konstantin Serebryany and his team! The dude behind these sanitizers is quite brilliant. The Universe should know its heroes.


Race conditions create meta-stable states. Fixing them always increases predictability of the code. It can also result in fixing formerly "cosmic ray" type bugs that occurred once or twice and were never seen again.

This is because one of the sources of very hard to reproduce bugs is a set of race conditions aligning just right.


Meta-stability [0] is a hardware issue, surely? Race conditions can create some crazy effects (BTDT) but at least they are fixable in software.

[0] https://en.wikipedia.org/wiki/Metastability_%28electronics%2...


With all due respect to Wikipedia, the term meta-stability is applicable to any system issue where the behavior of the system is undefined when one or more of its inputs can be undefined.

The inputs to "software" finite state machines typically take the form of variables. Those inputs define which branches within the routine will be taken during processing.

You can model a subroutine as an FSM whose "outputs" are its state: the FSM takes inputs, processes them, and sets new outputs or a new state.

We use fuzzing to expose the state machine to all possible combinations of inputs and this identifies all possible exit states from all possible input states, but assumes the input states are stable.

Multi-processing introduces the possibility of race conditions. In a race condition, one input state is present when the state machine is entered, but during execution the race completes which changes the input value.

We use 'stable' to define an input that has the same value across the entire duration of the state machine's state execution.

During a race condition, an input may change value one or more times across the time interval of the state machine's state execution. Thus at any instant in time the input has a single value, but over an interval in time it may take on many different values. These inputs are meta-stable.

And yes, you can fix meta-stability in software with things like mutexes and execution exclusion. Just like you can fix meta-stability bugs in hardware signals by adding a clock synchronization domain that spans the widest period of meta-stability possible for a signal.
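
To make that concrete, here's a minimal userspace sketch in C of the kind of unstable input described above. The names (shared_limit, count_up_*) are made up purely for illustration:

  #include <pthread.h>

  /* Hypothetical shared input, written by another thread at any time. */
  static int shared_limit = 10;
  static pthread_mutex_t limit_lock = PTHREAD_MUTEX_INITIALIZER;

  /* Racy version: the "input" is read twice and can change between the
   * two reads, so the bounds check and the loop may disagree. */
  int count_up_racy(void) {
      int n = 0;
      if (shared_limit > 0) {                     /* first read */
          for (int i = 0; i < shared_limit; i++)  /* second read */
              n++;
      }
      return n;
  }

  /* Fixed version: take one snapshot under the lock so the input has a
   * single, stable value for the whole duration of the call. */
  int count_up_fixed(void) {
      pthread_mutex_lock(&limit_lock);
      int limit = shared_limit;                   /* stable copy */
      pthread_mutex_unlock(&limit_lock);

      int n = 0;
      for (int i = 0; i < limit; i++)
          n++;
      return n;
  }

  /* Writers update shared_limit while holding the same lock. */
  void set_limit(int v) {
      pthread_mutex_lock(&limit_lock);
      shared_limit = v;
      pthread_mutex_unlock(&limit_lock);
  }

(Build with cc -pthread. The racy version is also exactly the shape that makes trace logs lie to you: the value logged on entry may have nothing to do with what the loop actually saw.)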

One of the things that makes race condition bugs so hard to debug is that your typical tracing facility assumes that the inputs passed to a function are stable and don't change. Thus you'll see a trace record of a function call, with its parameters, and walk through the code and say "Wait, with those parameters this code could never do what it just did." Or conversely, that variables that have shared write status across execution domains will be the same throughout a function.

Does that make it clearer what I was talking about? Race conditions suck :-)


I think Wikipedia would benefit from your contribution!


I think the parent meant (emphasis on the scare quotes) “cosmic ray” type bugs as the kind of bugs we programmers sometimes attribute to cosmic rays, despite actually being a complex race condition.


Good point. As it happens, I had one today that appeared impossible, yet (of course) it was easily understandable once the root cause was found.

The Zebra test [0] is a good friend.

[0] https://en.wikipedia.org/wiki/Zebra_%28medicine%29


This is why open source for lower-level systems is amazing!! That is why graphics card drivers should be open sourced


They are for 2 of the biggest GPU vendors.


Is nVidia no longer one of the two biggest GPU vendors?


I don't think that's what was meant. "2 of the biggest GPU vendors" !== "the 2 biggest GPU vendors".


Nice distinction! Thank you.


Depends on what you count. If mobile GPUs are included, I wouldn't be surprised if one of PowerVR, Mali (ARM), or Adreno (Qualcomm) was on top.


I'm not surprised that there are vast numbers of latent low-probability race conditions in the Linux kernel or any major software project. Having seen the testing process in both hardware and software projects, the two are not even comparable. The testing process in most software projects is dominated by fully deterministic unit tests that are very simple in nature, and make large numbers of assumptions about the behaviors and interactions with other components. Low-probability race conditions between different components are exactly the outcome I would expect from this testing process.

https://software.rajivprab.com/2019/04/28/rethinking-softwar...


Maybe this will do a better job of identifying the race condition that made the AMD card in my laptop unusable. About 5 years ago a bunch of changes evinced a change in behavior when trying to switch hybrid graphics controllers. I worked on a bug report for a long time after I and a few others fingered it as a race condition (it failed maybe 1/10 switches). A cluster of other changes meant it was difficult to bisect (it broke one way at a few points and broke differently at other points, but in a way that made it difficult to identify whether the bug we were triaging existed at that point.)


Solid contribution. Will hopefully improve stability even further.

Does this automatically generate fixes too, or does someone need to invest time investigating each one by hand?


It needs investigation by hand. I don't see how that can be automated.


I would have thought so too, but the GitHub repo says:

>kcsan-with-fixes: Contains KCSAN with various bugfixes for races detected; the commit messages for those bugfixes include the KCSAN report as-is.

Which to me seems to imply some sort of automatic mitigation. Maybe I'm reading too much into it


Nope, that's just the repository with manual fixes.


For those interested, here are some of the bugs which their tool has found: https://github.com/google/ktsan/wiki/KTSAN-Found-Bugs


No, that's KTSAN, which is different from KCSAN, although they do find similar bugs.


Are there any HN readers here that use MINIX or possibly another OS that does not have the Linux kernel? I'm not about to argue pros/cons but would like to see people's use cases; my own use cases have not necessitated using anything more complicated than Ubuntu, and quite happily so.

Still, I think from my interest point of view it would be interesting to not only understand the Linux kernel better, but also other OS design paradigms.


I swear by OpenBSD. I use it on my home server. I use it to write C code (their C standard library has a few very nice things, such as arc4random, strl{cat,cpy}, explicit_bzero, timingsafe_memcmp, libtls, and other things that I run into often enough that I don't want to think about them before I start adding portability stuff).
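
For anyone who hasn't run into them, a tiny sketch of the libc additions mentioned above, written against OpenBSD's headers (on Linux you'd typically need libbsd or your own copies); the demo function itself is made up:

  #include <stdint.h>
  #include <stdlib.h>   /* arc4random_uniform (OpenBSD) */
  #include <string.h>   /* strlcpy, explicit_bzero, timingsafe_memcmp (OpenBSD) */

  void demo(const char *name) {
      char buf[32];

      /* strlcpy always NUL-terminates and returns the length it tried
       * to create, so truncation is trivial to detect. */
      if (strlcpy(buf, name, sizeof(buf)) >= sizeof(buf)) {
          /* handle truncation */
      }

      /* Uniform random number in [0, 100): no seeding, no modulo bias. */
      uint32_t roll = arc4random_uniform(100);
      (void)roll;

      /* Compare secrets without leaking timing, then wipe them in a way
       * the compiler is not allowed to optimize away. */
      char key[16] = {0}, expected[16] = {0};
      if (timingsafe_memcmp(key, expected, sizeof(key)) == 0) {
          /* keys match */
      }
      explicit_bzero(key, sizeof(key));
  }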

Of course, it's not much of an option for anything for which I require proprietary software.


OpenBSD has opened my eyes to how deficient documentation is in Linux.


Meaning it has good documentation?


+1 for OpenBSD. For C, it’s the nicest platform around. I love the system man pages, and like you said, they have a lot of nice system and library functions to use as well.

Also it’s pretty easy to copy the C stuff to new OSes as well for portability.


> Are there any HN readers here that use MINIX or possibly another OS that does not have the Linux kernel?

Like... Windows?


To be fair, Windows has the Linux kernel (WSL 2).


Which is basically a VM. The Windows kernel is still running even when using WSL.


It's just a compatibility layer satisfying the kernel API (= syscalls). It's an API on top of the Windows kernel.


That's WSL 1 you're thinking about.


Crazy, wasn't aware of this. I stand corrected


WSL 2 has a complete kernel, not just syscall translation


I’m setting up a SmartOS server on a little NUC right now and loving it. Zones are very nice, pkgsrc package manager is better than I expected, man pages wonderfully written, even ipfilter feels more elegant than iptables. Service and zone admin very well thought out. It feels much more like a cohesive /system/ than say fedora or Ubuntu.

Can run Linux without virtualization (native performance via syscall translation) in a zone if I hit any issues, which I’ve not yet (simple web stack).

Google around for some of Bryan Cantrill’s talks on SmartOS; it's a Solaris derivative. He’s been quite critical of several Linux engineering decisions and what he frames as sloppy thinking in the kernel community.

I found this helpful/inspiring - https://timboudreau.com/blog/smartos/read - if you follow his example be sure to change dataset_uuid to image_uuid in the zone manifest.


> Are there any HN readers here that use MINIX or possibly another OS that does not have the Linux kernel?

I have both macOS and Ubuntu on my MacBook. I have always predominantly used macOS for all my day-to-day uses and only use Ubuntu to test software that needs to run on Linux before releasing.

I have known many people who use FreeBSD or OpenBSD daily and I have even heard some people installing Haiku as a secondary OS because they have now found it 'very interesting' to use.

I don't know anyone using MINIX though, but I believe that Fuchsia looks like one of the most technically interesting new OSes to be developed in terms of OS design paradigms.


> Are there any HN readers here that use MINIX or possibly another OS that does not have the Linux kernel?

I use macOS all the time. My use-case is systems programming.

I also use iOS on my phone.

Or is that not what you meant?


I don't know if you go through old comments, but for your reference:

"Mac OS X is sort of microkernelish. Inside, it consists of Berkeley UNIX riding on top of a modified version of the Mach microkernel. Since all of it runs in kernel mode (to get that little extra bit of performance) it is not a true microkernel, but Carnegie Mellon University had Berkeley UNIX running on Mach in user space years ago, so it probably could be done again, albeit with a small amount of performance loss, as with L4Linux. Work is underway to port the Apple BSD code (Darwin) to L4 to make it a true microkernel system." [1]

[1] https://www.cs.vu.nl/~ast/reliable-os/


I assume you mean Open Source, and that would exclude Windows and macOS. Apart from MINIX, which we now know lives on every Intel chip, there are the BSDs: NetBSD on the Apple AirPort, FreeBSD on the Netflix Open Connect Appliance and the PlayStation. Even Solaris is still being worked on. L4 is used in every Apple SoC with Secure Enclave.

And many many others.


There's always the BSDs.


I'd be shocked if something written in C that's as complex as the Linux kernel didn't have race conditions.


The language a program is written in has no bearing on whether or not the program is susceptible to race conditions. You can have data races in Python, Javascript and Rust as easily as you can in C.


You cannot have data races in safe Rust.


Race conditions aren't necessarily data races. This article and the parent comment were about race conditions in general.


The parent comment literally says

> You can have data races in Python, Javascript and Rust as easily as you can in C.

Don't use "race conditions" and "data races" interchangeably if you understand the difference...


That's a false equivalence. Some languages make it much easier to create race conditions than others. For example, you have to work hard to avoid them in C, but you'd have to actually try to introduce them in Clojure.


I wonder if any software relies on these bugs

EDIT: windows famously had bugs and had to add special code to preserve the buggy behaviour to keep certain applications that relied on them working https://www.joelonsoftware.com/2004/06/13/how-microsoft-lost...


Seems very tricky, almost by definition, to rely on a race condition?


My previous company preserved bad server caching behavior for one or two customers, as they had custom code depending on it. Given that it was a cache, it was a race condition, just not one likely to trigger (would only get evicted if there was a storm of requests).


You assume every race is a bug.

Part of the job of a kernel is to resolve races. Requests come to it from multiple threads and processes in a parallel fashion. Sometimes your timing will be one way and get you one result, sometimes it will go another way. That's OK. It's the nature of the beast.


This is a misunderstanding. A situation where "either process A or process B gets a resource first" is not a race condition. A situation where both of them get it at the same time is.


Ok, I was being imprecise.

Even if there is parallelism and multi-core, it is still a question of ordering, and from there we can use language like who gets there "first" [indeed "race" suggests this]. Or, did this read of a machine word see somebody else's write, etc.


This tool finds data races. All data races are bugs.


You cannot say all data races are bugs independent of what the data represents or how it is handled.

Perhaps a read not seeing someone else's write in a timely fashion is OK in your situation. Perhaps you will resolve conflicts later using proper synchronization.


Data races are undefined behavior, hence bugs, under the C standard. You do have a point in that the Linux kernel is not written against the C standard and could be declared to be written in a dialect of C where data races are defined...


Undefined behavior is not necessarily a "bug", even in the context of the C standard. Undefined behavior is undefined behavior--the whole point of the term is that the standard has nothing to say about the actual behavior. Undefined behavior includes behaviors about which the standard literally says nothing, like the effect of a supernova. But the term is normally used to distinguish certain behaviors on which, for various reasons, the standard chooses (and chooses loudly) not to impose any requirements even though in context it could easily choose to do so; behaviors people might otherwise mistakenly assume are defined (or "natural") in reliance on their own outside experience and knowledge.

Considering that C didn't support threading or even an atomics API until C11, and that most C code was written well before C11, by your definition the entire Linux kernel has been one giant "bug" ever since it added SMP support.

Are data race conditions bugs? Usually--the term itself is conclusory of that fact. They're bugs not because they constitute undefined behavior as defined by the C standard, but because neither the hardware, compiler, application logic, or anything else guarantees consistent behavior; and if you have no guarantee of behavior then your software is ipso facto incorrect. If something were to guarantee the behavior, such as the notably generous cache coherency semantics of the x86 ISA, and nothing else conflicts with that guarantee, like an aggressive compiler, then it wouldn't be a bug, at least to the extent you knowingly relied on that guarantee.

Undefined behavior is not an epithet. Just because a language specification doesn't use the term doesn't mean it doesn't have undefined behavior. Gobs of code in languages like Python and PHP rely on undefined behavior because those languages have very loose specifications. Even Java has undefined behavior despite its otherwise extremely rigorous specifications. (What Java does have is a very generous memory model which usually cabins the consequences of undefined behavior. Even then you can never rule out undefined behavior resulting in nasal daemons through some cascade of unexpected code paths in the application.)

One of the reasons the term is so important in C and C++ is because they're two of the few languages with numerous, diverse implementations. The odds of two implementations diverging while programmers unwittingly rely on one of several potential behaviors is much greater. If there are only a few relevant implementations then there's less pressure to explicitly distinguish undefined behavior. Behaviors will tend to converge through informal communication. There's almost no pressure beyond the good diligence of the engineers if there's only a single implementation. Any accidental semantics relied upon by users can be made retrospectively well defined, and if there's no reliance then behaviors silently change (sometimes breaking code).

For reference, here's what undefined behavior means, per C11 3.4.3 (N1570):

  1 undefined behavior
    behavior, upon use of a nonportable or erroneous
    program construct or of erroneous data, for which this
    International Standard imposes no requirements

  2 NOTE Possible undefined behavior ranges from ignoring
    the situation completely with unpredictable results, to
    behaving during translation or program execution in a
    documented manner characteristic of the environment
    (with or without the issuance of a diagnostic message),
    to terminating a translation or execution (with the
    issuance of a diagnostic message).

  3 EXAMPLE An example of undefined behavior is the
    behavior on integer overflow.


> You do have a point in that Linux kernel is not written against the C standard and could declare to be written in the dialect of C where data races are defined...

You had my retort already written. Thank you.

And in fact the linux kernel does rely on very specific compilers, and breaks the standard's machine abstraction quite a bit.


It is not a given that those compilers will support the unspecified memory model forever. And in fact the linux kernel has been slowly moving to the C++11 memory model for a while.


Is this actually what the system was attempting to observe? I was under the impression that most of these race conditions were unknown and indeed were supposed to be implicit.


I am wondering if machine learning can be used to solve these problems. I have heard that Google uses machine learning to automatically find bugs and raise merge requests against them. Someone from Google can confirm.


You don't need machine learning, you need a better architecture. Micro kernels can be made in such a way that they are guaranteed race condition free.


If you're willing to throw a large part of the existing code away to move to a microkernel architecture, you might as well rewrite it in a language that doesn't have race conditions by design.


Race conditions can be constructed in almost any language. It is more of a systemic thing than a language artifact though some languages are more race condition prone than others.

You can even have race conditions in hardware. Multi-threading or distributed software are excellent recipes for the introduction of race conditions, sometimes so subtle that the code looks good and the only proof that there is something bad going on is the once-in-a-fortnight system lockup.


One of my professors in processor design had a story about debugging a register leak in an out of order processor with register renaming.


I believe the undocumented instructions in the 6502 were caused by race conditions in the CPU. Some of these instructions are actually usable while others give random results.


I wrote a 6502 assembler and had a really hard time not giving $88 its own mnemonic in the main table rather than loading it at the start of every file from an include. It was just too useful :)


Why wouldn't you? These opcodes seem to be used by most people these days, so there seems to be no reason not to treat them just like any other opcode?


Ah, no. Those are caused by don't care states in the logic of the PLA instruction decode block.


Race conditions can be created by I/O paths. I'm willing to state that unless the said language abstracts away all I/O, it will always be possible to introduce a bug :)


Haskell Monads!

#Sarcasm


The funny thing about this being that Linux was originally announced on comp.os.minix.



How does Rust negate data race conditions? These aren't memory errors.


The borrow checker prevents it. Race conditions that aren’t data races are not prevented, though.


It's an OS with microkernel written in Rust. It's literally the first thing you see on the website. So it's both microkernel, and Rust, so it is an example of people doing what parent mentioned.


Yikes.


Yeah, I exploded a bit. Sorry about that :(


Is the trade off efficiency? Because I'm sure a language that is willing to wait 5 seconds between operations can do pretty well at eliminating race conditions.


Race conditions do not have anything to do with absolute time; a race is simply an assumption about the sequence in which things are going to happen, combined with finding out in practice that the implementation allows other sequences whose output is undefined.


Most absolute statements about anything are fallacies. The truth is that they can. Hell, I see Apple devices that don't update anything, including the time, until you take them out of screen lock, and you see the previous date and charge amount sometimes even after unlock. That is an efficiency-based form of race condition.


Perhaps the microkernel is free of race conditions. But now you've introduced a zoo of interacting processes (drivers, servers, etc.) and with that a new class of race condition in their interactions, which may or may not be harder to debug than what you started out with.


That zoo will still have a lot of advantages over a monolith, and one of those advantages will be that if the components stick to their defined interfaces for interaction, your chances of race conditions are substantially limited.


What magic in microkernels prevents race conditions?


That they allow for reduction of scope to the point that race conditions can be mostly eliminated. A macro kernel has untold opportunities for race conditions because the kernel itself is multi-threaded. Effectively every driver is a potential candidate.

In a - properly implemented - micro kernel you will have each of those drivers isolated in its own user process, and if there is an issue (memory corruption, crash, logic error or race condition) the possibility of its effects spreading is limited.

Isolation is half the battle when dealing with issues such as these.


> Isolation is half the battle when dealing with issues such as these.

having had to deal with race conditions between processes running at opposite sides of a continent, color me unimpressed.


That's a sign of a very poor architecture, see also: the fallacies of distributed systems.


Do you mean that each driver is forced to be single-threaded, somehow? Or just that the drivers are isolated, so they can have as many race conditions as they like, but the microkernel will keep running?

Because I don't see the advantage here. So what if my filesystem driver is isolated, a race condition in there is still going to corrupt all my data and I'm not going to be cheered up by the fact that the kernel can keep on ticking.


That's one way of looking at it.

Another is to think of this as layers of essential services: kernel, memory management, networking services, file services and so on. For a layer to be reliable the layers below it have to be reliable too. Isolating the layers in processes allows for faster resolution of where a problem originates which in turn will help to solve it.

And being able to identify the source of a problem allows for several other options: restart the process, proper identification of a longer term resolution because of the ability to log the error much closer to where it originated.

These are not silver bullets but structural approaches acknowledging the fact that software over a certain level of complexity is going to be faulty.

In my opinion the only system that gets this right is Erlang. (not the language, but the whole system)


I take your point about Erlang, and its systems for restarting/replacing failed code & processes, but I don't see how this eliminates race conditions or their consequences.

One problem here is that it's not always a simple layer on layer order of dependencies, they can work both ways. Take swapping/paging for example. It can't be isolated as a separate driver or service. The kernel relies upon it, and so do processes, and yet the service itself relies upon filesystems, a separate layer again. A race condition in the swap system (like one of the bugs found by the tool in question) would break multiple layers, or a race in the filesystem could cause the swap system to fail. There is no real isolation here that saves an OS from failures.


It's not 'magic', but will eliminate a whole swath of possibilities.

One reason why race conditions in monolithic kernels are very common is due to the requirement that code is re-entrant because of multi-threaded execution of the kernel itself. In a micro kernel situation the multi-threading can be avoided because you have enough processes to spread the cpu cores over to have a number of them active at the same time without having two or more of them active within the same process.


> Another is to think of this as layers of essential services: kernel, memory management, networking services, file services and so on. For a layer to be reliable the layers below it have to be reliable too. Isolating the layers in processes allows for faster resolution of where a problem originates which in turn will help to solve it.

is there any reason to believe that most (or even just a considerable part) of race conditions in a kernel are cross module?


Not necessarily, but the effects of a race condition (for instance: a privilege escalation) would be limited to the process the race was found in. That would substantially limit the potential for damage.


They're small :P!

Less facetiously, some micro kernels have been proven formally correct. You need a small codebase to prove it formally correct.

And even on an intuitive level, if the code is small I can "hold" it all in my head - really grok it in a way you can't with 1E6 LOC.


why would a microkernel be smaller than a monolithic kernel, given the same feature set? Unless you claim that async message passing leads to shorter code than procedure calls.


Because it gets spread around over multiple processes you get process based isolation for each module that normally all lives within the same address space. So you may be able to break a module but you can't use that to elevate yourself to lord of the realm.


I very much agree that a microkernel can be both more robust and secure. I'm arguing against being easier to understand. Modularity does not require address space separation, which is a runtime property, not a code organization feature.


The one tends to go hand-in-hand with the other, in my experience. Of course, that's anecdata, but the more complex projects that I've worked on that used message passing separated much more naturally and cleanly into isolated chunks of code that produced their own binaries. Essentially message passing and micro kernels dictate a services based approach, which is an excellent match for OS development.


It's very likely. Code linting has been around for a long time. In the security world, static code analysis tools are used to find security vulnerabilities. It makes sense that Google would have something similar for identifying all manner of code defects.


It'd be a really interesting project to data-mine a large number of IDEs at a company like google. Use machine learning to predict fixes to lint/compile errors, and surface those suggestions in the IDE. Would also be fun to do it with code reviews to see what kind of garbage suggestions you can generate there. Probably a lot of "I don't understand what this does" and "please write tests for this."

Would also be cool for auto-complete suggestions. I'm thinking more along the lines of extracting larger patterns than for loops, such as stubbing out classes that follow a project's existing style. For example, adding an auth check at the start of a web request handler if that's what it sees elsewhere.


I really like the development of all the sanitizers we got in the last decade or so. I only wish that more could be done at compile time as opposed to these runtime checks.


That's the killer feature that made me give F# a chance instead of my usual C# - it 'does more for me' at compile time vs. run time.

Yes, these are both much higher level than C, but if what I'm reading among other comments is correct, Rust accomplishes an equivalent improvement in compile time checks while still being a systems language.


and Rust will prevent data races, which F# won't.


As someone who has personally written Rust code that deadlocked under specific circumstances, I think I can say that Rust does not prevent deadlocks. Not sure about specific types of race conditions, but I don't think it's possible for the compiler to detect all race conditions at compile time.

It is a heck of a lot easier to not make them in Rust than in something like C and C++. But Rust is not a panacea for asynchronous code.


Rust does prevent:

* Data races

Rust does not prevent:

* Deadlocks

* Race conditions

* Memory leaks

* Logic bugs

That said, it does make some of these things harder to accidentally introduce, but strictly speaking does not prevent them, it's true.


I thought resources were freed when they went out of scope preventing any memory leaks in Rust.


You can still introduce them via circular strong reference-counted pointers. The solution of course is to have only one strong reference and many weak ones, but the compiler in principle will let you create a cycle that will never get deallocated.


That's the common case, but you can still make a reference counting cycle, or have a thread hang while holding onto something, or even just call `mem::forget` on an allocation.

This is different from Rust's guarantees about memory and thread safety, which you can't break without `unsafe`.


They are, and in a typical program you don't need to worry about leaking memory.

However, it is possible to leak, e.g. if you use a reference-counted type and create a cycle. There's also Box::leak() that does what it says (it's useful for singleton-like things).


Rust would have tradeoffs here, though. For example, in C# you'd be able to do (non-leaking) graphs and other structures in 100% safe code, while Rust would need unsafe.


You don't need unsafe. This point is brought up a lot, though I understand where the misunderstanding can come from.

Rust is perfectly capable of expressing graphs with zero unsafe code, just currently not in the most optimum implementation.


Is it fine, or suboptimal to the point of being impractical? Genuine question: how bad is this, if quantified?


You need to use reference counting or a scoped memory pool. It's totally fine in almost all cases.

Refcounting in Rust is still faster than refcounting in ObjC or Swift (because Rust can avoid using atomics or increasing counts in many cases), but people who use Rust tend to insist on zero overhead, and a truly flexible zero-overhead solution can't be verified statically.


It is fine and practical.


Adjacency matrices are trivial to implement in Rust. I leverage graphs constantly.


It's slow.


You don't even need "not the most optimum implementation" (assuming you're talking about index-based graphs or something).

You can make non-leaking pointer-based graphs in safe Rust using reference counting (basically what C# does, if you squint) or arenas (if your nodes' lifetimes fit that pattern).


It doesn't need to be unsafe or unoptimal, just use indices into flat memory instead of pointers.


There are sound static analyzers, like Trust-in-Soft, that can statically guarantee the absence of undefined behaviors in C programs, but require interactive work to get rid of the false positives.

But the distinction between "runtime" and "compile-time" is not entirely binary. What you call "compile time" checks can be viewed as an abstract interpretation [1] of the program, or as running the program in some abstract domain -- e.g., where a concrete value could be 3, its abstract value could be called `int`, and the + operation could be interpreted as `int + int = int`; type inference could be viewed as abstract interpretation. The problem is that sound abstract interpretation (i.e., one without false negatives -- if it tells you your program is "safe" then it will definitely be safe, for some appropriate definition of safe) is limited, and often comes with big tradeoffs, e.g. either false positives given by sound static analyzers or the pain something like Rust's borrow checker sometimes causes -- these are both instances of the same underlying difficulty with abstract interpretation.

One of the most promising directions in formal methods research is "concolic testing" [2], which means a combination of concrete (i.e. non-abstract, or "runtime") execution and symbolic execution (an instance of abstract interpretation). It is not sound, but can be made quite safe, and at the same time it can be very powerful and flexible, checking properties that can be either impossible or extremely tedious, to the point of infeasibility, with abstract methods like types. It might prove to be a good sweet spot between the power, expressivity and cost of concrete (AKA dynamic, AKA runtime) methods and the soundness of abstract (AKA static, AKA compile-time) methods.

Even dynamic methods are often "stronger" (more sound) than just checking assertions during testing. For example, fuzzers can sample the program's state space, offering more coverage, either by analyzing the program (whitebox) or just randomly (blackbox), and various sanitizers sample the scheduling space by fuzzing thread scheduling.

[1]: https://en.wikipedia.org/wiki/Abstract_interpretation

[2]: https://en.wikipedia.org/wiki/Concolic_testing
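
If it helps to see the "run the program in an abstract domain" idea in code, here's a toy interval-domain evaluator for a single expression. It's purely illustrative, unrelated to any real analyzer, and all names are invented:

  #include <stdio.h>

  /* Abstract value: an interval [lo, hi] over-approximating the set of
   * concrete values a variable can take. */
  typedef struct { long lo, hi; } interval;

  static long min4(long a, long b, long c, long d) {
      long m = a;
      if (b < m) m = b;
      if (c < m) m = c;
      if (d < m) m = d;
      return m;
  }

  static long max4(long a, long b, long c, long d) {
      long m = a;
      if (b > m) m = b;
      if (c > m) m = c;
      if (d > m) m = d;
      return m;
  }

  /* Abstract transformers: what + and * mean in the abstract domain. */
  static interval add(interval a, interval b) {
      return (interval){ a.lo + b.lo, a.hi + b.hi };
  }

  static interval mul(interval a, interval b) {
      long p1 = a.lo * b.lo, p2 = a.lo * b.hi;
      long p3 = a.hi * b.lo, p4 = a.hi * b.hi;
      return (interval){ min4(p1, p2, p3, p4), max4(p1, p2, p3, p4) };
  }

  int main(void) {
      /* "x is anything in [0, 10], y is anything in [-3, 3]": evaluate
       * z = x * y + 1 abstractly, without running on concrete inputs. */
      interval x = {0, 10}, y = {-3, 3};
      interval z = add(mul(x, y), (interval){1, 1});
      printf("z is in [%ld, %ld]\n", z.lo, z.hi);   /* prints [-29, 31] */
      return 0;
  }

The result is sound (every concrete execution lands inside the interval) but imprecise, which is exactly the false-positive/expressivity tradeoff described above.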


Sure, just look at the languages that have more stringent type systems.


They don't really help me make the mountains of C we have in the world less buggy.


C code doesn't have enough information in it to do a whole lot of mechanical analysis. Runtime analysis of its behavior is what you've got.


Have you never heard of program analysis for security? Genuine question. RVI, the Mayhem developers, and many others would strongly disagree with you:

https://runtimeverification.com/match/1.0-SNAPSHOT/docs/benc...

https://spectrum.ieee.org/computing/software/mayhem-the-mach...

The field evidence would support their position, too. Your claim about C is true just enough to encourage people to not use it if they want to get more out of program analysis. Yet, current tools make up for it enough to find most or all bugs in benchmarks with Mayhem fixing them, too.

The only problem is that almost nobody that cares about FOSS security is working on those tools that automate it. The companies doing it end up black boxing, patenting, etc the tools that they sell for exorbitant prices. Something like RV-Match, well integrated with repos, either FOSS or just priced to scale toward mass adoption over high profit, would be a game changer. Especially if 3rd parties could use it on repos.


Aren't those runtime tools like I just suggested? The other ones in the 1st link are theorem provers (at least the one that I recognized, frama-c) that require a lot more than off-the-shelf C code for their inputs. You can't just run CC=my-compiler-frontend make to suddenly get better diagnostics.


Companies are mixing static and runtime analysis more than they used to. Most static analyzers look at just the code itself, build models of it, check their properties, and report the presence/absence of errors. The first links were more about effectiveness on C. I just dug around in search to find you something that explains it, shows its effectiveness, and threw in a case study you might enjoy:

https://www.cs.colorado.edu/~kena/classes/5828/s12/presentat...

https://cacm.acm.org/magazines/2010/2/69354-a-few-billion-li...

Here's the main competition evaluating them so you can see criteria and current state-of-the-art:

https://www.sosy-lab.org/research/pub/2019-TACAS.Automatic_V...

The provers do something similar but with more manual effort and logic-based approach. Frama-C and SPARK Ada are interesting in that they use programmer annotations plus supporting functions (eg ghost code). These are converted into logical representation (eg facts about the program), fed into logic solvers with properties intended to be proved, and success/failure maybe tells you something. Like with static analyzers, you get the results without running the program. A recent, great addition is that, if proof is too hard, the condition you didn't prove can be turned into a runtime check. You could even do that by default only proving the ones that dragged performance down too much. Flexible.

The best methods are a mix. I've always advocated mixing static analyzers, dynamic analysis, fuzzing etc since each technique might spot something the other will miss. They also cover for implementation bugs each might have. That adding runtime analysis can improve effectiveness does not negate my original claim that static, non-runtime analysis of C code can tell you a lot about it and/or find a ton of bugs. I also agreed with one point you made about it not conveying enough information: these clever tools pull it off despite C lacking design attributes that would make that easier. SPARK Ada is a great example of a language designed for low-level, efficient coding and easy verification. Although, gotta give credit to Frama-C people for improving things on C side.
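
To give a flavor of the annotation style being described, here's a rough ACSL-style sketch (the contract language Frama-C uses). Treat it as illustrative rather than a polished, machine-checked proof:

  /* The ACSL contract becomes proof obligations that Frama-C/WP hands
   * to SMT solvers -- no execution involved. */
  /*@ requires n > 0;
    @ requires \valid_read(a + (0 .. n-1));
    @ assigns \nothing;
    @ ensures \forall integer i; 0 <= i < n ==> \result >= a[i];
    @ ensures \exists integer i; 0 <= i < n && \result == a[i];
    @*/
  int max_elem(const int *a, int n)
  {
      int best = a[0];
      /*@ loop invariant 1 <= i <= n;
        @ loop invariant \forall integer k; 0 <= k < i ==> best >= a[k];
        @ loop invariant \exists integer k; 0 <= k < i && best == a[k];
        @ loop assigns i, best;
        @ loop variant n - i;
        @*/
      for (int i = 1; i < n; i++)
          if (a[i] > best)
              best = a[i];
      return best;
  }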


This is really interesting stuff, thanks. I've assumed that without source annotation, the only things that you can verify statically are violations of the runtime model (e.g., stack overflow, heap corruption). Mostly through inferences on possible value sets of function arguments and return values.

I'm going to have a look at your links more. I guess the question I don't have answered quite yet is how to tell the tool what to look for beyond the language semantics.


What you've assumed is mostly correct. Almost all the annotation-free tools look for violations like you described since that's what most people want. The correctness-oriented work starts with annotated code. The only other thing I know that works without annotations, but sort of similar, is creating the specs from example inputs and outputs. Then, the code can be analyzed against them.

What pro's often do, though, is prove the algorithm itself in a higher-level form that captures its intrinsic details, create an equivalent form in lower-level code, and prove the equivalence. That lets you deal with the fundamentals first. Then address whatever added complexity comes in from lower-level details and language semantics.


For some reason this feels like a rediscovery of the virtues of leaving asserts in production code to me. Though I need to give these a read. Thanks!


That's a good comparison given asserts are specs. If you like asserts, the easiest, high-ROI method of getting into verification is Design-by-Contract. Here's both a way to sell it to your boss and some articles highlighting it:

https://www.win.tue.nl/~wstomv/edu/2ip30/references/design-b...

https://www.hillelwayne.com/post/contracts/

https://www.hillelwayne.com/post/pbt-contracts/

Many of us independently converged on a consensus about how to do it quickly, too. You put the contracts into the code, do property-based testing to test them, and have them turn into runtime or trace checks combined with fuzzers. Eclipser is one that runs on binaries finding problems way faster than others. This combo lets you iterate fast knocking problems out early. If you want more assurance, you can feed those contracts (aka formal specifications) into tools such as Frama-C and SPARK Ada to get proofs they hold for all inputs. Obviously, run lots of analyzers and testing tools on it overnight, too, so you get their benefits without just staring at the screen.

https://github.com/SoftSec-KAIST/Eclipser
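
At its most bare-bones, contracts-in-code in plain C can just be asserts marking the requires/ensures clauses, with the property-based tests supplying random valid inputs and the asserts acting as the oracle. A made-up example (real DbC libraries and SPARK/ACSL give you far more than this):

  #include <assert.h>

  typedef struct {
      int items[64];
      int count;
  } stack;

  int stack_pop(stack *s)
  {
      /* requires: the contract callers must honor */
      assert(s != NULL);
      assert(s->count > 0 && s->count <= 64);

      int old_count = s->count;
      int value = s->items[s->count - 1];
      s->count--;

      /* ensures: the promise the routine makes back */
      assert(s->count == old_count - 1);
      return value;
  }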

Another lightweight-ish, formal method you might like was Cleanroom Software Engineering. It was one of the first to make writing software more like engineering it. Stavely's introduction is good. His book was, too, with the chapter on semi-formal verification being easier to use and high ROI having excellent arguments. Fortunately, the automated and semi-automated tools doing formal analysis look good enough to replace some or all of the manual analysis Cleanroom developers do. Languages such as Haskell with advanced type systems might let it go even further. I advise coding in a simplified, hierarchical style like in Cleanroom to ease analysis of programs even if you do nothing else from the method. If you're curious, that technique was invented by Dijkstra in the 1960s to build his "THE Multiprogramming System."

https://web.archive.org/web/20190301154112/http://infohost.n...

https://en.wikipedia.org/wiki/THE_multiprogramming_system

Have fun with all of that. I'm pretty sure straight-forward coding, careful layering, DbC, and PbT will help you on realistic projects, too.


Are you Santa? Isn't it a bit early for Christmas?

Thank you so much! I've been looking into building a foundation with regards to formal methods and safety-critical systems, and besides getting familiar with the regulations and Standards around that type of software, I've also been looking to build up some familiarity with the tools to achieve those higher levels of provability!

This all looks like it fits the bill perfectly! Thank you! Thank you! Thank you!


"Are you Santa? "

Shhhhhh!!

"safety-critical systems"

Interestingly enough, the safety-critical field doesn't actually use formal methods much. It's more about lots of documentation, review, and testing. Mostly just human effort. They do buy static analyzers and test generators more than some other segments. There's slowly-slowly-increasing adoption of formal methods in that space via SPARK Ada and recently-certified CompCert. The main incentive so far is they want to get bugs out early to avoid costly re-certifications.

DO-178C and the SILs are the main regulations driving it if you want to look into them. DO-178B, the prior version, is also a good argument for regulating software for safety/security. It worked at improving quality. We even saw our first graphics driver done robustly. :)


It does suggest we should delete it, which is pretty helpful. It's a terrible liability.


I have an alternate opinion, which is that compile times should be fast because compilation is done often, whereas sanitization is something you only need to do in later release QA stages. So it can be slow and that's okay.


You can have different "thoroughness" levels while compiling, though.


Is there even an approach to doing that? Intuition tells me that's a mathematically intractable problem.


Rust prevents data races at compile time. Note that data races, while a common form of race condition, are not the only kind, and Rust won't prevent the others.


You're right to an extent. I think that because of the halting problem, a number of useful properties a computer could tell you about itself are limited, because they boil down to solving the halting problem.

However, one way to avoid such a problem is to settle for less accurate, but still strict, subproblems. I.e., Rust doesn't stop all race conditions, but it does solve the subset that are data races.

In other words, you reject some valid programs (false positives?) in order to make sure that every program you do accept is really valid (no false negatives?).


The ATS language. It compiles to C, and I believe I have even seen that, experimentally, a kernel module was developed in it.

Here are some resources:

The Wikipedia page: https://en.wikipedia.org/wiki/ATS_(programming_language)

"A (Not So Gentle) Introduction To Systems Programming In ATS" by Aditya Siram at the StrangeLoop 2017: https://www.youtube.com/watch?v=zt0OQb1DBko

Introduction to ATS, a series of screen casts: https://www.youtube.com/playlist?list=PL6BIXG1a4elsauhh56i5n...

I do not program in ATS because I am not doing anything that needs to be as efficient, so that I have the luxury to enjoy programming and learning Haskell, in the hope that by the time when I would need to write some super secure and performant program that has to run without interruption, and is used by more users than myself, well, I hope that by that time Haskell will evolve enough to allow me to write such a program. Otherwise I might end up using ATS, though honestly the syntax is so ugly.


Yes, just probably not with a language as semantically ambiguous as C. Rust for example has guarantees against data races.


The good news is it's also the golden age of static analysis right now. There are open source and commercial tools that find most of the problems. Some are scaling up on codebases the size of the Linux kernel.

So, not only can you do that: it's standard practice for some developers already in and out of commercial space. Mostly commercial developers doing it that I see, though.


Note that this is a dynamic analysis, and precisely because equivalent static analysis won't scale to the Linux kernel.


It is dynamic analysis. I don't think it's that reason. Both Saturn and Facebook's tool scaled analysis to the size of the Linux kernel. I think most developers and companies aren't using or pushing them like they could for cultural reasons. Look at the number of people that used sanitizers vs low-F.P. static analyzers even if free. The latter would've gotten them results sooner. It wasn't empirical evidence or effectiveness that led to that decision. Human factors trump technical factors in adoption decisions most of the time.

http://saturn.stanford.edu/

Note: Linking to Saturn since it's old work by small team. I theorized big investment could achieve even more. Facebook's acquisition and continued investment with excellent results proved it.

If tech and money is the limit, tell me why Google hasn't straight-up bought the companies behind RV-Match and Mayhem to turn their tech on all their code plus open-source dependencies. Even if over-priced, it might actually turn out cheaper than all the bug bounties adding up. Maybe re-sell it cheap as a service Amazon-style. The precedent is that Facebook bought one that built a scalable tool they're applying to their codebase. Then, they were nice enough to open-source it. Google, Microsoft, Apple, etc could get one, too.

What's Google's excuse? They sure aren't broke or dumb. Gotta be cultural. Likewise, Microsoft built lots of internal tools. They do use at least two static analyzers: one for drivers, one for software. I heard they even ship 2nd one in VS. They don't use most of the MS Research tools, though. The reason is cultural: Operations side knows they'll keep making money regardless of crap security.

As far as FOSS developers go, most don't care about security. Those that do tend to follow both what takes little effort (naturally) and what's trending. The trending topics are sanitizers, fuzzers, and reproducible builds. Hence, you see them constantly instead of people learning how to build and scale program analyzers, provers, etc with much better track record. Note that I'm not against the others being built to complement the better methods. I just think, if this is rational vs emotional, you'd see most volunteer and corporate effort going to what has had the highest payoff with the most automation so far. For a positive example, a slice of academics are going all-in on adaptive fuzzers for that reason, too. They're getting great results like Eclipser.


I know a bunch of the people at Google who are responsible for program analysis. The idea that they haven't bought some company doing symbolic execution because of some internal vendetta against static analysis is just ridiculous.

I also find it weird that you reference Saturn, since Alex Aiken hasn't been driving heavy SAT stuff for a while. Clark Barrett is at Stanford now and is doing more things related to what you want. And... oh wait he spent a few years at Google at few years ago.

Fuzzing plus sanitizers work like mad. They aren't magic and interprocedural static analysis provides value too. But your claim that Google is ignoring these techniques and doing so because of cultural idiocy just isn't correct.


I know they do static analysis thanks to this article that we discussed a year ago on Lobsters:

https://cacm.acm.org/magazines/2018/4/226371-lessons-from-bu...

I'm saying they didn't care enough to do the kind of investment others were doing which would've solved lots of their problems.

The article also indicates they couldn't get developers to do their job of dealing with the alerts despite false positives. If Coverity's numbers are right, there's over a thousand organizations whose managers did it better.

Since they didn't address it, Google would be about the best company to acquire expensive tech like Mayhem that finds and fixes bugs so their developers can keep ignoring them. Alternatively, start double-teaming the problem by investing in FB's tool, too, moving in the best advances from SV-COMP winners.

I mean, their size, talent, and finances don't go together with results delivered (or not) in static analysis. They could do a lot more with better payoff.

EDIT: Forgot to mention I referenced Saturn because parent said the methods couldn't scale to the Linux kernel. And Saturn was used on the Linux kernel over a decade ago. A few scale big now.


That's just one team. There's entire other teams at Google that aren't represented in this article.


You keep missing my point. I keep giving examples of large-scale, catch-about-everything systems. I'm not talking merely has a few teams on something. This is Google, not a mid-sized business. You win if they already have their own version of Mayhem that fixes their bugs for them. Otherwise, they're behind Facebook in terms of financial effort they'll put in to get great capabilities. I'm also going to assume they're playing catch-up to Infer unless they've released their analyzers at least for peer review.


The infer team is only like 15 people, last I checked. That's certainly not more than "a few teams on something". Also, separation logic and symbolic/concolic execution are such wildly different approaches that it seems odd to pivot to Infer here.

I obviously won't be able to convince you since they aren't publishing at the same rate as facebook. So you'll just have to take my word that Google doesn't have a vendetta against static analysis.


"The infer team is only like 15 people, last I checked."

The Infer team is claiming to get results on C/C++ with both sequential and concurrency errors via a tool they open sourced. I value scalable results over theories, team counts, etc. Does Google have a tool like that which we can verify by using it ourselves? Even with restrictions on non-commercial use? Anything other than their word they're great?

"that it seems odd to pivot to Infer here." "So you'll just have to take my word that Google doesn't have a vendetta against static analysis. "

You really messed up on the 2nd sentence since I linked an article on Google's static analysis work. I told people they're doing it. I mentioned Infer was Facebook pouring a lot of money into a top-notch, independent team to get lots of results. I mentioned Google could do that for companies like the one behind Mayhem that built the exact kind of stuff they seem to want in their published paper. They could do it many times over. If they did it, I haven't seen it posted even once. They don't care that much.

Your claims about team size and "vendetta against static analysis" are misdirection used to defend Google instead of explain why they don't buy and expand tech like Infer and Mayhem. And heck, they can use Infer for free. My theory is something along the lines of company culture.


golden age of static analysis PR maybe.


There are ways to achieve some of this statically with annotations:

https://clang.llvm.org/docs/ThreadSafetyAnalysis.html
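
Roughly, those annotations let Clang's -Wthread-safety analysis check lock discipline at compile time. A hedged sketch in C: the attribute names follow the Clang documentation, but real code normally hides them behind macros, the wrapper type and function names here are invented, and some features may need the -Wthread-safety-beta flag:

  #include <pthread.h>

  /* Wrap the mutex in a type tagged as a "capability" so the analysis
   * can track who holds it. */
  typedef struct __attribute__((capability("mutex"))) {
      pthread_mutex_t mu;
  } checked_mutex;

  void checked_lock(checked_mutex *m) __attribute__((acquire_capability(m)));
  void checked_unlock(checked_mutex *m) __attribute__((release_capability(m)));

  checked_mutex balance_lock;

  /* Compile-time warning on any access made without holding balance_lock. */
  int balance __attribute__((guarded_by(balance_lock)));

  void deposit(int amount)
  {
      checked_lock(&balance_lock);
      balance += amount;        /* OK: lock is held */
      checked_unlock(&balance_lock);
  }

  void bad_deposit(int amount)
  {
      balance += amount;        /* warning: writing 'balance' requires
                                   holding 'balance_lock' */
  }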


Linting for concurrency bugs is very hard. Modifying thread scheduling to make rare manifestations much less rare is much simpler.


Doesn't the kernel make use of some lock free data structures? Like RCU? I would be surprised if there are no races there, but the point is that they are harmless and handled.


Lock free data structures still use atomics, which this tool understands. All bugs found really are data races.


The tool doesn't understand all lockless techniques, so there will be some false positives. For example, there are some cases where people have used the low-level primitives barrier() and cmpxchg(), which this tool (not possessing human-level intelligence) can't analyze.

Also, not all data races can be exploited into serious bugs (in some cases, some stats might just be incorrect, for example).

That doesn't make the tool useless, of course! Just that one should take the numbers with a grain of salt.


I don't see how any lockless techniques can cause false positives on this tool, since this tool ignores them. The tool instruments plain memory accesses. For each access, with some probability, 1. setup watchpoint, delay, and then delete watchpoint, or, 2. check watchpoint. If there is a matching watchpoint, two threads made "simultaneous" accesses, hence data race.
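
Here's a toy userspace rendition of that idea, just to show the mechanism. It is heavily simplified and is not how KCSAN is actually implemented (one watchpoint slot, rand()-based sampling, a fixed delay, and it assumes the watched object is a long):

  /* build: cc -O2 -pthread toy_watchpoint.c */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  static _Atomic uintptr_t watchpoint;    /* 0 = no watchpoint armed */

  static void check_access(void *addr, bool is_write)
  {
      uintptr_t a = (uintptr_t)addr;
      uintptr_t armed = atomic_load(&watchpoint);

      /* Another thread armed a watchpoint on this address and is stalled
       * below: two concurrent accesses, so report a race.  Consume the
       * watchpoint so each window reports at most once. */
      if (armed == a && atomic_compare_exchange_strong(&watchpoint, &armed, 0)) {
          fprintf(stderr, "race on %p (racing %s)\n",
                  addr, is_write ? "write" : "read");
          return;
      }

      /* Like KCSAN, sample only a small fraction of accesses. */
      if (rand() % 128 != 0)              /* rand() is fine for a toy */
          return;

      uintptr_t expected = 0;
      if (!atomic_compare_exchange_strong(&watchpoint, &expected, a))
          return;                         /* the single slot is busy */

      long before = *(volatile long *)addr;
      usleep(100);                        /* stall, waiting for a racer */
      long after = *(volatile long *)addr;

      /* Value changed while stalled: a concurrent write happened even if
       * the other side was not instrumented. */
      if (before != after)
          fprintf(stderr, "race on %p (value changed)\n", addr);

      atomic_store(&watchpoint, 0);
  }

  static long counter;                    /* shared, deliberately racy */

  static void *worker(void *arg)
  {
      (void)arg;
      for (int i = 0; i < 200000; i++) {
          check_access(&counter, true);   /* the "instrumentation" */
          counter++;                      /* the actual plain access */
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker, NULL);
      pthread_create(&t2, NULL, worker, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("counter = %ld (lost updates expected)\n", counter);
      return 0;
  }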


1) data races are different from race conditions. and data races are what this sanitizer detects.

2) data races are undefined behavior in C, but:

3) they don't necessarily translate to bugs in practice. for example:

https://godbolt.org/z/Z9RvpB

this is a data race, but in practice everything works as expected.

in cases like this, fixing the data race adds overhead without giving us much benefit.
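
For readers wondering what "fixing" such a race even looks like: a common shape is a flag one thread sets and another polls, and the C11-level fix is just to mark the accesses atomic. A sketch with relaxed ordering (names invented; whether the cost is worth it is exactly the tradeoff being debated here):

  #include <stdatomic.h>
  #include <stdbool.h>

  /* The plain version --  static bool done;  with  done = true;  in one
   * thread and  while (!done);  in another -- is a data race and UB in
   * C11, even if it "works" on today's compilers and hardware. */

  /* Marked version: same flag, same intent, but the accesses are
   * annotated.  With relaxed ordering the loads and stores are still
   * plain MOVs on x86, and the compiler may no longer hoist the load
   * out of the loop or otherwise assume the flag never changes. */
  static atomic_bool done;

  void signal_done(void)
  {
      atomic_store_explicit(&done, true, memory_order_relaxed);
  }

  void wait_for_done(void)
  {
      while (!atomic_load_explicit(&done, memory_order_relaxed))
          ;   /* spin; use acquire/release if data is published too */
  }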


Any chance of this sanitizer being generalized for use outside of the kernel? Someday could we use this like UBSan?


I'm not sure about other platforms, but in Xcode, there is the ThreadSanitizer option in the Scheme diagnostics. Sounds similar to the Kernel Thread Sanitizer mentioned in the article. I assume it's part of clang, but am not positive of the under-the-hood implementation. If so, it may be available on other platforms. (I mention it because Xcode also has options to use Address Sanitizer and Undefined Behavior Sanitizer, so perhaps it's the same thing as the Kernel Thread Sanitizer but for user-space stuff?)


Could you please stop creating accounts for every few comments you post? We ban accounts that do that. This is in the site guidelines: https://news.ycombinator.com/newsguidelines.html.

HN is a community. Users needn't use their real name, but do need some identity for others to relate to. Otherwise we may as well have no usernames and no community, and that would be a different kind of forum. https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme...


Reading the article, it seems like these are from an automated tool. I wonder how many of these are actual race conditions, vs false positives because of logic that the tooling can't decipher.


This is a non-issue and really a false narrative blown up into something that isn't really an issue. Any concern of society can be fed into an AI and out comes a desired result. It's a computer program.


Boy, there are a lot of negative comments here. Google is contributing to the Linux kernel and released their tool as open source. Why all the hate?

Anyway, back on topic: would fixing these race conditions be only beneficial to stability, or would this also improve performance/responsiveness?


Fixing races generally means stricter synchronization, which makes for slower code. The world is complex and there are always exceptions, but I'd expect performance to decrease.

Edit: it goes without saying, but it's much better to fix the race conditions, regardless of what it means for performance. Some people seem to think I'm advocating for not fixing them.


As far as I understand, performance is not a valid concern.

For example, from dvyukov's KTSAN wiki [1]:

  Given a sufficiently expressive atomic API and a good
  implementation, you pay only for what you really need (if
  you pay just a bit less, generated code becomes incorrect).
  So performance is not an argument here. 
[1] https://github.com/google/ktsan/wiki/READ_ONCE-and-WRITE_ONC...
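
For anyone not familiar with the primitives being referenced: READ_ONCE/WRITE_ONCE mark intentional lockless accesses so the compiler emits exactly one, untorn access (and so KCSAN knows the race is deliberate). A simplified, scalar-only sketch of the idea -- the real kernel macros are considerably more involved:

  /* Toy stand-ins for the kernel macros: force exactly one access
   * through a volatile-qualified cast. */
  #define READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
  #define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

  static int need_resched_hint;   /* invented example variable */

  /* Writer: a plain "need_resched_hint = 1;" would be flagged as a data
   * race; WRITE_ONCE documents that the lockless store is intentional
   * and keeps the compiler from tearing or duplicating it. */
  void set_hint(void)
  {
      WRITE_ONCE(need_resched_hint, 1);
  }

  /* Reader: without READ_ONCE the compiler may legally load the value
   * once, keep it in a register, and spin forever. */
  void wait_for_hint(void)
  {
      while (!READ_ONCE(need_resched_hint))
          ;   /* the kernel would use cpu_relax() or sleep here */
  }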


You're talking about the sanitizer performance. We were talking about the kernel performance in the real world. Fixing these race conditions will have some effect.


No, that page is about kernel performance, not sanitizer performance.

The idea is that if you omit READ_ONCE/WRITE_ONCE when they are needed, it doesn't matter if the code is faster because it is wrong.


"brakes make my car slower, so I'd expect my average velocity to decrease if you fixed them"


"brakes make my car slower, so I'd expect my average velocity to decrease if you fixed them"

Just to be pedantic... in some contexts better brakes actually allow you to go faster. Consider racing on an oval track... a car with better braking ability can maintain speed longer as it approaches a corner, then scrub off speed more quickly to navigate the corner. A car with inferior brakes has to start slowing down sooner or risk crashing by taking the corner too fast. So better brakes can lead directly to faster lap times.


>Just to be pedantic... in some contexts better brakes actually allow you to go faster.

Yes, and in other contexts, better brakes make you go slower. Better brakes means bigger brakes: larger rotors (discs), larger and heavier calipers, etc. Larger brake rotors take more energy to accelerate (they're basically a bigger flywheel), so they decrease the car's acceleration. So if your goal is to have a car that accelerates as fast as possible, better brakes are actually a big detriment.


Indeed, disk brakes were first used on race tracks before being available on street cars.


And some people now argue that disc brakes on road bicycles are more dangerous because you learn that you can brake later and more suddenly, i.e. you will more frequently approach the limit of friction between the (very narrow) tyre and the road.


>And some people now argue that disc brakes on road bicycles are more dangerous

Some people argue that the Earth is flat. That doesn't mean they have a good point.

The morons you're referencing probably think cars should all go back to drum brakes and bias-ply tires so people think more about brake fade and tire grip. It's an idiotic argument. Disc brakes on bikes are better in every way, except weight (they add about 1 pound, maybe). Reducing performance available to a cyclist doesn't make any sense at all; you never know when you're going to need to stop suddenly in the real world.


I had not heard that one. I always thought the main argument against disc brakes on road bikes was that they aren't good for long descents where you're riding the brake for long periods of time... due to heat fade or whatever.

Of course both things could be true...


>I always thought the main argument against disc brakes on road bikes was that they aren't good for long descents where you're riding the brake for long periods of time... due to heat fade or whatever.

No, you have it the other way around. Rubber brake pads on wheel rims fade with heat; disc brakes are able to dissipate far more heat, and are basically essential for long descents.

The only valid argument against disc brakes on bikes is that they weigh a little more (maybe 1 pound). I suppose you could also argue that brake fluid is more trouble to deal with than a cable, and certainly not as easy to jerry-rig, but hundreds of millions of cars use hydraulic brakes without any trouble these days, and I wouldn't want to have jerry-rigged brakes anyway.


Indeed. The insane thing about a modern F1 or LMP1 car isn't the acceleration (or at least not as much; the hybrid systems in those cars change that) but the braking.

Some drivers have complained that tears are literally pulled out of their eyes under the deceleration.


I think dekhn may be making the same point, so no need for "Just to be pedantic." You raise an interesting factoid that many people probably haven't thought about, although it's pretty easy to intuit once you are encouraged to think about it.


No, I wasn't making that point. The side-pedants are talking about racing cars, which represent a tiny fraction of braking activity worldwide, so my statement about expecting the average velocity to go down still stands.

Also, I drift corners instead of braking, it seems to be a lot faster if you nail the tire orientation on the exit.


Also, I drift corners instead of braking, it seems to be a lot faster if you nail the tire orientation on the exit.

Fair point. As the old saying goes "looser is faster". The risk, of course, is that you wind up in the wall with your velocity = 0. :-)


In factorio the research you perform to get your trains to run faster is braking force research.


That helps that poster's argument.


That's fine by me. I don't have a dog in that fight, so to speak...


It's more like traffic lights being turned off.


I grew up near a city where they set up the traffic lights along one of the main roads in such a way that if you kept to the speed limit you would always arrive at the next light just after it turned green, so you never would have to brake or accelerate. I wouldn't be surprised if there are concurrency patterns analogous to that.


Ah yeah, groene golf (green wave [1]). I remember my parents using that when I was growing up. Except it never bloody worked.

[1] https://en.wikipedia.org/wiki/Green_wave


Wow, from that article:

> In the UK, in 2009, it was revealed that the Department for Transport had previously discouraged green waves as they reduced fuel usage, and thus less revenue was raised from fuel taxes.

Ugh...


Perverse behaviors like this are what make me worry about the calls to tax negative externalities without earmarking the funds for addressing those externalities. If you tax the externalities, but get to use the revenue for something else, what is your incentive to reduce the behavior causing the problems in the first place?


Yes! This is a point I raise frequently. My go-to example is cigarette taxes being used to fund public schools. They should be used to cover costs (borne by the relevant government) of people smoking cigarettes! Otherwise, as has happened, school funding can be hurt by people smoking less, which is not a good situation to have created.

The degree by which governments routinely violate the accounting principle of matching revenues to expenses is awful.


I think you're describing the problem of earmarking funds at all. In general it's just a stupid posturing maneuver. Spending is fungible; it just replaces the amount contributed from the legal body's general fund. As needs change, general fund money can be spent as needed. Earmarked funds are stuck.

A carbon tax on consumers (incl. businesses) is effective at producing good market behavior regardless of how the tax revenue is spent.


> A carbon tax on consumers (incl. businesses) is effective at producing good market behavior regardless of how the tax revenue is spent.

In the example I reacted to, a government agency actively worked to inhibit that behavior in order to preserve its revenue. Your point would hold if governments didn't have the power to do things like this.

> Earmarked funds are stuck.

If the revenue is being raised to pay for damage caused by society, those funds should be stuck to that purpose and should rise and fall based on the level of damage being done.


> In the example I reacted to, a government agency actively worked to inhibit that behavior in order to preserve its revenue. Your point would hold if governments didn't have the power to do things like this.

A governmental bad actor is basically orthogonal to the idea of a carbon tax. The government could equally try to increase sales of liquor and cigarettes for sin taxes, or go around murdering people for the estate taxes. The UK example is shameful but reflects on that government body rather than the specific tax on petrol.


This is very likely an urban myth.


The problem isn't earmarking funds in general. Money has to be allocated somehow and it's good to do so explicitly.

The problem is earmarking or allocating a variable set of revenue as funding for something unrelated, and in particular something unrelated that's otherwise valuable or important.

And NOT earmarking or allocating 'sin taxes' to pay for the supposed costs of those sins, borne by the public in the form of the government, seems pretty stupid (to me) if it's actually necessary or desirable for the government to 'manage' those sins or those sins' consequences.


Ave, Government, morituri te financiert.


It makes me really disappointed in the world every time I hear stories like this. Supposedly we chose 110v power over a higher voltage in the US so we could sell more copper in wires. And everyone can see Health Care is a two-faced profiteer-filled industry here.


Yes - but smart motorways are doing a pretty good job of handling congestion at peak times by keeping traffic moving at a slower pace, which ends up with everyone using less fuel ... so at least UK roads are becoming slightly better in that regard :)


Yes, well, the green wave is a good idea in principle, but it's sorta mistaking averages for reality.

There is this thing called the "waiting time paradox" [0] (or more generally the "inspection paradox") that says something surprising: the average time a randomly arriving observer waits for the next event is the same as the average time between events. This should be surprising, because naively you'd expect to wait only about half an interval on average; it means enough people experience waits LONGER than the average interval to balance out the people who happened to arrive just before an event.

This happens because longer intervals, when they occur, capture a larger share of the random observations than the shorter intervals do. More precisely and generically, when the quantity being observed affects how likely it is to be observed, the observations get distorted by that quantity and have to be reweighted.

In the case of a green wave, think of those rare times when someone sits too long at a green because they were looking at their phone, or at a naked person in a nearby window, and creates slowdowns and stalls in the pipeline. Those disruptions last longer, so there is a longer window in which you could experience them.

This actually comes up a lot in observations of things in nature. Imagine that instead of light changes or bus arrivals we were observing clicks on a Geiger counter, and you'll realize how fundamental this is to experiencing the world.

tl;dr: The Green Wave only works on average, and it can be correct even as your experience of it never working is also correct. :)

[0]: https://jakevdp.github.io/blog/2018/09/13/waiting-time-parad...
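If you want to convince yourself numerically, here's a quick toy simulation (mine, not from the linked post): buses arrive as a Poisson process with a 10-minute mean gap, and riders arriving at random still wait about 10 minutes on average, not 5.

  /* Toy Monte Carlo of the waiting-time paradox; not from the linked
     post. Buses arrive as a Poisson process with a 10-minute mean gap;
     riders show up at uniformly random times and wait for the next bus. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <math.h>

  int main(void)
  {
      const double mean = 10.0, horizon = 1e6;
      static double bus[200000];
      double t = 0.0, total_wait = 0.0;
      int n = 0;
      const int riders = 100000;

      srand(42);
      while (t < horizon && n < 200000) {
          t += -mean * log((rand() + 1.0) / (RAND_MAX + 2.0)); /* exponential gap */
          bus[n++] = t;
      }

      for (int i = 0; i < riders; i++) {
          double r = horizon * rand() / (double)RAND_MAX;  /* rider arrival time */
          int lo = 0, hi = n - 1;                          /* find next bus after r */
          while (lo < hi) {
              int mid = (lo + hi) / 2;
              if (bus[mid] < r) lo = mid + 1; else hi = mid;
          }
          total_wait += bus[lo] - r;
      }
      printf("mean gap: %.2f  mean wait: %.2f  (naive guess: %.2f)\n",
             horizon / n, total_wait / riders, mean / 2);
      return 0;
  }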


Here in Vienna, it works great. It's often quite amusing to me, as I ride a slow moped - about horse speed - and get passed by irate drivers all the time. However, I ride that green wave, while they are constantly stopping/starting at the lights ...


Works fine in my town (population around 100k). 50kmph (around 31mph) and it's all green for some 20+ crossroads.


There is a road near where I used to live that was the opposite. Speed about 10mph over the limit (55 in a 45) and it would be green all the way. Go the speed limit and you get stopped at every single light. It was shockingly reliable. The difference between speeding and going the speed limit was 15 minutes on a 20 minute drive. The only time it got broken up was the one intersection with a pedestrian crosswalk where people sometimes hit the button.

Of course traffic was heavy enough that you'd be stuck behind someone doing the speed limit most days, but sometimes you got to ride the wave.


As I wrote in another post, there's a highway near me where you almost always hit every red light, but it's designed such that going say 20% over the limit won't work either. I believe some people deliberately go 20% under to reduce the waiting.


I'd be willing to give that a shot; I drive the speed limit in my town (population estimated at about 15.6k), but the lights seem to respond more to current traffic at the intersection. So you catch a red light while going straight, and you stop and wait. Now when you approach the second light, because you and others were held back, no one was approaching that light, so it will likely be red or turning red when you get there. To actually get out of this loop, you basically have to speed up a bit, which seems like a behavior that shouldn't be rewarded.


Sounds like sensor-based lights and not timed lights? Regardless, Redmond, WA is full of sensor-based lights. Hence I catch myself speeding up to an otherwise empty lane and a green light. 'cuz if I don't hurry up, it'll turn red.

Timed lights work, I've seen it in action. Top of my head, Indianapolis 20 some years ago, Capitol Avenue going north I'd bet you could go from (what is effectively) Zero Street to 38th and not hit a red light.


Worked fine for my family near Drachten, but that had the advantage that they also had a sign showing your speed next to the road. I suspect one issue is that the speedometers in cars are not that accurate.


I live somewhere with lots of roadside speed signs (they flash your speed in red if you're speeding, green if under). In the tens of cars I've driven/ridden here, the speedometer is always within a small range (±2 km/h) of what the road sign says.


The thing is that you have to look at this from the perspective of the expected largest difference between different cars, not one speedometer and actual speed.

If two cars both say they drive at 50 km/h, and one is actually doing 45 km/h and the other 55 km/h, that is quite a large relative margin of error. And we only need two cars like that in the whole group to start causing problems.


But I don't think you have to look that hard for some technical error causing this variance. Drivers' speeds vary between 40 and 60 pretty regularly on residential roads. A speedometer error doesn't even have to enter into the conversation.

If you can't design a green wave to be tolerant of varying speeds, then it isn't useful.


There are some weird speed signs near me. One that particularly irritates me is set up to flash if you're over 25 mph. But the speed limit is 30! There's also a flashing sign by a school, that is left on at times that the regular sign specifically says the school speed limit is not in effect.


Roadside radar speed signs always read slightly lower than your true speed, since they can't be positioned directly in your line of travel (the cosine error).


In the situation I experience on my commute daily: I'm 40m back and the detector is 3m perpendicular off my line of travel (a reasonable approximation, as they're on a roadside light standard, the lane width is 3.5m, and the car is centered in the lane). That is a very minor discrepancy, on the order of 0.2 km/h.


Systolic matrix multiplication is roughly analogous to that.

Great Highway in SF: if you drive 35 mph you will just get greens on the evenly spaced lights. The worst is when you get stuck behind a person who drives 40, then brakes heavily, then drives 40, then brakes heavily, etc.



There's a road in Berlin (and it's probably not an exception) that has overhead displays showing how fast you should go to be on a green wave. At least that's how I understood it from driving there several times. Additionally, it has a variable number of lanes for each direction, signaled by lights.

Here's a street view of the road; it's not perfectly visible, because those are LED displays: https://goo.gl/maps/T7RzQCUS4nk3QVZN9


There's a highway near me with periodic lights triggered by the cross traffic such that you almost always have to stop. The limit is 55 mph, but if you see someone on the side street and speed up to, say 70, to try to beat the light, you will probably see it turn yellow within a second and have to stop anyway. There aren't that many lights, but you almost always have to stop for every one because of the cross traffic taking precedence.


Slow drivers must cause extreme road rage on that road


In my experience, it is the “fast” drivers that cause road rage by being slow. They race to the next light, stop, and because they are stopped, prevent everyone from riding the green wave because they now must accelerate from a stop. Sooooo frustrating. Just go the speed limit and none of us will need to stop, you idiots!

Edit: (or go double the speed limit if you must but please stop creating stops)


You run the risk of the computational equivalent of the speedo not being 100% accurate, e.g. clock inaccuracy.


There's a 3 or 4 mile stretch of road in San Francisco called the Great Highway that's like this (https://www.google.com/maps/@37.750302,-122.5062241,14.64z)

It was one of my little pleasures to drive along that at 35 (or 30, I forget what the speed limit was) and see cars zip ahead at each light at 45, only to watch them ride the brakes as they hit the next red.


When there’s not too much traffic, Ocean Parkway in Brooklyn also has a green wave.


They did this on a few roads around where I grew up. (Southern Tier of New York.)


The road I take to the gym is exactly like this, if you keep just under 40mph.


In both directions? How would that work?


You can actually do all 4 directions.

Imagine a translucent checker board of blue and black squares overlaid on top of a map of roads. You can see the color of the squares but you can also see the roads underneath them.

If a light falls on a blue square, make its principal north/south direction red at t=0 and its principal east/west direction green. For black squares do the opposite.

Calculate how long it would take cars to travel through each square unimpeded. It helps if this time is roughly the same regardless of direction. It also helps if the squares are large enough that it takes about one light phase's worth of time to travel through one. Call that time N.

At t=N, switch the lights to the opposite color, so now north/southbound traffic gets green lights on blue squares, and east/west gets red. Black squares get the opposite.

What you’ll see if you imagine yourself as a car driving through this, is that as you enter a black square as it turns green, the green lights will stay green until you enter the nearest blue square, at which point the phase shifts and now the blue square gets green lights as you drive through it.

This works in both directions, with both north- and southbound traffic.

Now, this falls apart when roads are diagonal, although it’s mitigated if diagonal roads have higher speed limits. It also means if you make a turn, you’re in the wrong light cycle, but it will correct itself after the next red light.

It also doesn’t work well at all if you have left turn arrows complicating your light cycles, so you’ll need to have something like a Michigan Left and the requisite wide medians to make this work.

It’s not as ideal as I described in real life, but the basic principle works.
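If it helps, the phasing rule above can be written down in a few lines (my own toy encoding of the description, not any real signal controller):

  /* Toy encoding of the checkerboard scheme described above. Each light
     sits in a grid square; the square's color is the parity of its
     coordinates, and the whole board flips phase every phase_len seconds. */
  #include <stdbool.h>

  /* True if the north/south direction is green at time t (in seconds).
     East/west green is simply the negation. */
  bool ns_is_green(int square_x, int square_y, int t, int phase_len)
  {
      bool blue_square = ((square_x + square_y) % 2) == 0;  /* "blue" square */
      bool even_phase  = ((t / phase_len) % 2) == 0;        /* current phase */

      /* Blue squares start with north/south red; every phase the board flips. */
      return blue_square != even_phase;
  }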


I'm guessing it doesn't work so well if the stretches of road between lights are of significantly different distances.


Or, worse, if the roads aren't on a rectangular grid. But for square blocks, it can work very well.


So lock-free datastructures such as ring buffers are like roundabouts, huh?


I see your point, though in racing brakes make a car faster, not slower. :)


Brakes slow a car down in terms of velocity over time. But they may make a car faster in the sense of distance over time (aka lap times).

I guess the real issue here is that 'fast' is ambiguous. In terms of racing it can be used for either lap times or velocity, and lap times and maximum velocity are not necessarily correlated.

</nerdsnipe>


You could say that brakes make the car faster in race conditions.


Daaaaaaaymn. Good one, holds up!


They often are correlated though. Better brakes mean you can stay on the accelerator longer before braking, which means you usually have higher maximum velocity and faster lap times (faster lap times are literally higher average velocities, so this makes sense).


Not really, if the brakes are broken, your average will approach 0 mph after the vehicle crashes.


Why are you comparing the Linux kernel and cars?


Because analogy is the foundation of understanding things?


So, so far away from the glory days of Slashdot where it wouldn't even be considered a legitimate comment without a car analogy.


But many race conditions are harmless. 'Fixing' those that are with extra synchronisation is not a good use of anyone's time.


By definition, data races in C and C++ are UB. Even if you believe that the compiler could not possibly miscompile your code [1], a large quantity of these false positives will make data race detection tools (like the one used by G) less useful. Just mark your shared memory locations as atomic; you can always use relaxed ordering and non-atomic RMW operations if you worry about the cost of the extra synchronization.

[1] these sort of assumptions of course tend not to age well.


Data races are. Race conditions are not.


True.


Note that Phoronix is being imprecise here and those hundreds are in fact data races (with no false positives), which are never okay.


> data races (with no false positives), which are never okay

This just isn’t true.

Here’s a concrete example - many JIT compilers use counters to work out when to compile a method. They allow data races in updates to the counters because performance is more important than correctness for them and the impact of a lost write is basically zero. It’s only a bug if you decide it’s a bug.


I've never heard the term "race condition" refer to something that was harmless. Can you point to some examples in the literature?


Even Wikipedia says it's not always a bug

> A race condition or race hazard is the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.

Or read the canonical Encyclopaedia of Parallel Computing page 1693

> [A race condition is when there is] some order among events [where the] order is not determined by the program


Some race conditions are intentional, accepted with the knowledge that the data will not be perfect. For example, if you have a structure of different metrics and one thread updates them while another thread reads them, clearly the best way to handle that is to lock while writing and reading so you always get the intended results. However, if you don't necessarily care about all of your metrics being snapshotted at the exact same point in time, you can just have the other thread share the variables, knowing that this race condition will actually improve performance a little bit. Obviously you need to take care of things like word sizes that take multiple loads and stores.
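A sketch of that pattern (hypothetical metrics struct; written here with relaxed C11 atomics, which keep the "no lock, the snapshot may be skewed in time" behavior while avoiding the formal UB of plain shared accesses):

  /* Hypothetical metrics block updated by a worker thread and read by a
     reporting thread with no lock. Each field is accessed atomically, but
     the reader's snapshot of the *set* of fields can be skewed in time,
     which is the accepted trade-off. */
  #include <stdatomic.h>
  #include <stdint.h>

  struct metrics {
      _Atomic uint64_t requests;
      _Atomic uint64_t errors;
  };

  static struct metrics m;

  void worker_update(int failed)
  {
      atomic_fetch_add_explicit(&m.requests, 1, memory_order_relaxed);
      if (failed)
          atomic_fetch_add_explicit(&m.errors, 1, memory_order_relaxed);
  }

  void reporter_snapshot(uint64_t *req, uint64_t *err)
  {
      /* The two loads are not taken at the same instant; the counts may
         not correspond to the same moment, and that's fine here. */
      *req = atomic_load_explicit(&m.requests, memory_order_relaxed);
      *err = atomic_load_explicit(&m.errors, memory_order_relaxed);
  }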


Hmm. I thought data races due to concurrent thread access are actually undefined behavior (up to and including termination of the program).


Are you thinking of a particular programming language specification here? It seems like one programming language might define it this way, while another might simply say that the order of the events is not guaranteed (and may not be the one that the programmer expected), but not that the events themselves may do something other than what the normal language semantics provide.


C++ and C both explicitly say that data races are undefined behavior, not unspecified behavior.

Data races are one of those cases where "anything may go" undefined behavior is actually quite necessary. Actual hardware memory models (particularly on "weak" architectures like ARM or PowerPC, or especially the infamously weak Alpha) result in cases where the order of memory traffic is not a consistent view between different threads. Formally specifying the hardware memory model is actually surprisingly difficult, made all the more difficult when you want to give a hardware-agnostic definition [1]. And should you want to introduce hardware transactional memory, break down in despair at the complexity [2].

The great breakthrough in memory models is the definition of the "data-race-free" model. This model says that, if memory accesses are properly protected with locks (i.e., there are no "data races" [3]), you cannot observe a violation of sequential consistency, which makes reasoning much easier. Java used this model for its corrected memory model in Java 5, and then C++11 adapted Java's memory model, and C11 took C++11's model verbatim. The upshot of this approach is that, if you make data races undefined behavior, you don't have to try to figure out how typical memory optimizations, both by the compiler and by the physical hardware, affect visible memory. Trying to work out the effects of these optimizations on the memory system is extraordinarily difficult, and the benefits of doing so are manifestly unclear, since most programmers will need to properly synchronize their code anyways.

[1] The C++ memory model attempts to do this with atomic memory references. The resulting specification (in C++11) is generally considered to be wrong, and the question of how to fix it is still up in the air as of right now.

[2] PLDI 2018 had the first formal semantics that combined transactional memory and weak memory semantics, and went on to prove that ARM's hardware lock elision was broken. This should be showing just how difficult this is to grasp even for theoreticians pushing the boundary, let alone a language trying to make it accessible to typical programmers.

[3] There are two definitions of "data race" going on here. Vernacular definitions usually define it as accesses not protected by synchronization, which permit "benign" data races for regular loads/stores. C/C++ tweaks the definition so as to use it to proscribe undefined behavior, so "benign" data race is oxymoronic in those languages. However, the use of memory_order_relaxed is intended to indicate a permitted data race in the first sense but not the second sense.


Thanks for the tremendous detail on this point.

Could you or someone else please give a minimal example of an undefined access in C, just so that I can be sure of whether we're talking about the same phenomena here?


A minimal program (using C11 threads):

   #include <threads.h>
   int x;

   /* Both the spawned thread and main() write x with plain stores. */
   int do_thread(void *arg) {
     (void)arg;
     for (int i = 0; i < 10000; i++)
       x = i;
     return 0;
   }
   int main(void) {
     thrd_t t;
     thrd_create(&t, do_thread, NULL);
     do_thread(NULL);        /* races with the other thread's stores */
     thrd_join(&t, NULL);
     return 0;
   }
There is an undefined data race on x, since it is simultaneously accessed from two threads without any synchronization. Replace the declaration of x with `_Atomic int x;`, and the resulting code is well-defined.
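And if the only goal is to make the race defined without asking for any ordering, the store can be made explicitly relaxed, something like:

   /* Variant of the loop above with an explicitly relaxed atomic store.
      The threads' stores still interleave unpredictably (a "race" in the
      informal sense), but it is no longer undefined behavior, and no
      ordering beyond atomicity is requested. */
   #include <stdatomic.h>
   _Atomic int x;

   int do_thread(void *arg) {
     (void)arg;
     for (int i = 0; i < 10000; i++)
       atomic_store_explicit(&x, i, memory_order_relaxed);
     return 0;
   }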


thanks, this comment (point 3 in particular) did a great service in clearing up the disconnect.


no programming languages, just the hardware. I think the only way to reasonably think about data races is at the hardware instruction/memory level, not the programming language.


they very much are.


Can you point to a source on that? If you have two pthreads in C accessing the same variable, they will definitely not necessarily see the same load/store order, but it's perfectly valid as a form of lazy evaluation to assume the value eventually can be seen.


Can't check chapter and verse on the standard right now. Will you take cppreference[1] as authoritative? See the section on data races. The C11 memory model only differs minimally from C++11's.

[1] https://en.cppreference.com/w/cpp/language/memory_model


I'm aware of the C++ memory model, but this is exactly what I was talking about. The actual behavior is defined by the hardware; the programming language is an abstraction on top of that. For certain sizes, you will not get a load or a store that is only partially there. So on x64, you can safely share data between threads knowing that the observed order of that data is completely undefined. But in cases like looking at a view of statistics, that may be perfectly acceptable given the trade-offs.


> The actual behavior is defined by the hardware, in the programming language is an abstraction on top of that.

This is a very dangerous way to understand semantics of programming languages. Programming languages are usually defined via some sort of operational semantics on an abstract machine which occasionally bears some resemblance to how processor semantics are defined. But sometimes there is a massive disconnect between processor semantics and language semantics--and perhaps nowhere is that disconnect greater than memory.

In hardware, there is a firm distinction between memory and registers. Any language which approximates hardware semantics (which includes any mid-level or low-level compiler IR) is going to mirror this distinction with the notion of a potentially infinite register set (if pre-register allocation) and use of explicit instructions to move values to and from memory. But in every high-level language, there is no concept of a memory/register distinction: just values floating around in a giant bag of them, often no notion of loads or stores. You can approximate this by annotating values (as C does, with the volatile specifier), but the resulting semantics end up being muddy because you're relying on an imperfect mapping to get to the intended semantics (don't delete this load/store). Optimization that screws up the mapping of the language memory model to the hardware memory model is anticipated, even 30 years ago--that's why C has a volatile keyword.

It is possible, if you are very careful, to write C code that exposes the underlying hardware model that allows you to reason about your program using that model. But this is not typical C code, and instead requires modifying your code in such a way that you make the intended loads and stores explicit. And if you write your code in the typical way and expect it to map cleanly to hardware semantics, you're going to have a bad time.


Coming from a world where latency and performance is paramount, I think it's perfectly acceptable to write code targeted at a specific architecture you know won't change. I understand your points though, and it's definitely not a style you should use in 99% of programming.


I would call this sampling that could turn into a race condition if it stops being treated as sampling.


Random imagined example: Two files with different names being created in the same directory. The race decides which one gets which inode number, but both are created correctly and successfully. That would be a race condition, but as long as only the assignment of inode numbers changes between outcomes of the race nobody would care. But again, this is just a contrived example.


Examples in the literature? This is not a bloody academic exercise?!

I use the term 'race condition' about once a week to point out something may happen before or after something else, because the things are happening in parallel.

Simple example:

me: I just uploaded the file to our shared dropbox

you, refreshing: I don't see it yet

me: race condition


eventual consistency is not a race condition.


Of course it is, that's exactly the point. 'Eventual consistency' is the name for situations where a race condition is acceptable and even a feature. It doesn't matter whether it is threads, processes, or entirely different computers that are racing, or whether the data is shared via CPU cache, memory, disk, or a network. We don't call it "failed eventual consistency" when a data race occurs.


I would never call this a race condition, but mostly because it does not cause an unrecoverable / system crashing failure state (you and your coworker have a protocol to resolve the inconsistent state).


The phrase "race condition" is usually not used to describe the harmless asynchronous code you're referring to.


No, you're thinking of 'data races'. Almost all software that interacts with the outside world, or with clocks, like of course a kernel, is going to have many harmless 'race conditions'. Erlang, for example, is a language extremely prone to race conditions, but not to data races.


The tool specifically looks for data races.


I think by definition, you are wrong. If the race condition was harmless then it wouldn't be a race condition.


By whose definition? If you go by Padua's classic definition, it's any behaviour that depends on timing sequences. For example, HN has a race condition in that the ID your comment gets allocated depends on when it's submitted. But that's not a bug, is it? It only becomes a bug when it's undesirable. I think, like a few people here, you're confusing it with a 'data race'.


You still understand, though, that the colloquial definition being used here is what you're referring to as a "data race." Your comments are disingenuous in that I believe you know that you're being pedantic; I just don't know why you've brought it up multiple times.


I didn't realise something as specific as 'race condition' could have a colloquial definition, but ok.


For many terms of (programming) jargon, there's a strict academic definition, perhaps as defined in a paper in which the term was originally introduced, and the usage of the word in the broader (programming) community.

The latter is the colloquial definition.


I feel like this comment is a classic example of reading a wikipedia entry instead of understanding the topic.

You can use it to refer to any race, but in reality the race you are referring to exists only from the user's point of view. The server is generally processing all requests in the order they are received.

A race condition usually is referring to a multithreaded system where the threads interfere with each other, causing undesired effects.

Relaxing "race condition" to refer to all causality... is such a loose definition as to give it no meaning at all.


> I feel like this comment is a classic example of reading a wikipedia entry instead of understanding the topic

I did a PhD partially on irregular parallelism where understanding race conditions are essential for understanding the topic.


Didn't we go over this already? It's not any behavior that depends on timing sequences, it's only if your code assumes it can rely on timing sequences while it can't, i.e. a bug.


> It's not any behavior that depends on timing sequences,

That's literally how 'race condition' is defined in the industry.


Not true, it's pretty hard to encounter anyone in the industry using the phrase "race condition" without implying a bug.


What? You need to describe what you're saying better. How is an ID allocated to a comment a "race condition"? A race condition, desirable or not, has to be related to some sequence of events, i.e. it's a race with a particular set of outcomes. What you're describing sounds like basic cause and effect.


Event A is your comment. Event B is my comment. If your comment gets to the server first, it gets ID 1. If mine gets to the server first, it gets ID 1. Which ID you get depends on timing outside your control. They race to get the ID. A race condition.


So what? What's missing is that unless something depends on the ordering of those IDs, it really isn't a race condition. Just plain happening of events is the inevitable passage of time, not a race condition. What is interesting about what happens downstream with those IDs?

You're not wrong that not all race conditions are harmful, but I feel you're doing a pretty bad job of explaining what a race condition is.


Well, if we want true security, we need to accept that all computers must take MASSIVE performance hits.

It's the price to be paid for security. No more branch prediction, no more cutting corners with UB. It's time to do things the right way once and for all and that means setting back performance a good 10-20 years.


Would 'stricter' mean 'more' in this context?

Will the fixes consist of adding more synchronization mechanisms (mutexes, semaphores, etc) to prevent potential race conditions?


Would you rather live with a less secure but faster platform?


It is a good question, the answer is "it depends" but usually it would be "more secure", unless you know WTF you are doing. Usually, people don't, or they think they do but they don't.


No good deed goes unpunished. My OSS projects had a few haters recently and it's amazing the amount of vitriol you can get when you give someone something for free.

I think my follow on comment is to just suggest that they fork it and maintain the project themselves.



That is horrifying.


We're getting closer to "How Google got me fired for the 532 bugs it found in my code..."

EDIT: It was meant as a joke as to how media manages to make something that's good into a problem.


In any decent organisation, a bug that goes out to live isn't the fault of the developer who wrote the code.

It's a team fault: maybe the PO didn't spec correctly, then the developer implemented something wrong, then it was missed in the code review, then missed in team testing, then missed in testing before going live, then missed in the PO sign-off before live...

If anything I would hope that stuff like this by Google helps to encourage orgs to build better processes. If you succeed as a team, you fail as a team. No scapegoats!


> In any decent organisation

Yes, and in my physics classes I learned a lot about spherical, frictionless cows in a perfect vacuum.

In the real world, the shape of cows is not a sphere, nor even a closed-form equation. And real organizations care about blame, and as the saying goes, shit rolls downhill.


It's your responsibility to work on a team that handles bugs like adult engineers rather than school children or politicians. Engineers have a duty to be spokespersons for rational and humane engineering processes.

I've shipped more than my share of bugs live. I've never felt like I didn't have a team behind me. Obviously I shipped a lot of things that work, as well.

Working someplace humane and rational doesn't just happen, it requires work.

I expect anyone with "Senior" in their job title to do that work, both in terms of setting team culture and establishing post-mortem policies with management.


There are actually a lot of really good organizations. The ones that don't do basic engineering management well don't tend to keep their good engineers these days.


>In the real world, the shape of cows is not a sphere, nor even a closed-form equation. And real organizations care about blame, and as the saying goes, shit rolls downhill

The difference is that in the real world, organizations exist that don't merely blame the developer, and they aren't really that rare. To give you an idea, I simply don't work for managers that are focused on blame. I probe this during any interview I have, or for any position I'm considering. It's on my list of "Life is too short to put up with this."

I don't have trouble finding jobs.


What is a PO?


Product owner


Product owner


Depends. What if you represented yourself as the multithreading expert and were hired for that reason? The buck needs to stop somewhere.


Is this a joke, or a true objection to improving the kernel?


It's a joke, I welcome our robot overlords.


Sounds like you are still unhappy about a tool finding bugs.


I use linters, and having an ML one that can find even more difficult issues is more than welcome.

As long as they are done locally and not sending source code to Google.


What is sending which code to Google? It is a fricking open source analyzer.


Linux will be the mainstream kernel for anything eventually. And companies will bundle it with their proprietary solutions.


Not to mention there are 9K+ open source contributors on the KCSAN project.


Nope, that's because it is developed as a fork of the Linux kernel. Those 9K+ people are Linux kernel contributors.


I think it would be more helpful if Google invested in modeling and describing the Linux kernel with TLA+. I googled and found only one person trying to do it (https://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/ker...)


Seriously...?

“Google is uncovering millions of gender preference in literature...” /sarcasm


"Eschew flamebait. Don't introduce flamewar topics unless you have something genuinely new to say. Avoid unrelated controversies and generic tangents."

https://news.ycombinator.com/newsguidelines.html


I can't help but feel most of these bugs and vulnerabilities in the Linux kernel could be avoided with a more robust foundation than C. It's nice to see that Rust is starting to be tolerated in the Linux kernel for modules.

It would be nice to see an effort to migrate some of the kernel core, but I can't see Rust gaining widespread acceptance in the kernel development community any time soon.


> It would be nice to see an effort to migrate some of the kernel core, but I can't see Rust gaining widespread acceptance in the kernel development community any time soon.

I don't think that a migration effort from C to Rust for the Linux kernel is even feasible, or worth it, given the scale of the project. At this point, you might as well start from scratch.

Google is already experimenting with Rust for OS development with their new Fuchsia operating system [0], which has some drivers written in Rust with a capability based security model.

[0] https://fuchsia.dev


Developing a new kernel or operating system in Rust is a good idea, but a massive undertaking, and it's hard to see it gaining adoption quickly. I think a more credible way to benefit from Rust in the short term is to rewrite small but critical components of existing kernels, or adopt Rust components from experimental Rust OS projects.

Similarly, Firefox's Project Quantum seeks to bring Rust to Firefox from the more experimental Servo project.


Rust for OS development has a pretty big following, even though there are no large production kernels (yet) obviously.

There are 2 reasons the Linux kernel in particular uses C and will continue to use C: 1) it was C originally and 2) it should compile without requiring some other compiler.

Anyone who sets out to write a new OS kernel (or similar) today, for platforms where Rust is available, would probably/hopefully use Rust over C. That doesn't mean it's a good idea to start making existing code bases "hybrids". No one wants a huge codebase where, to build it, you need N compilers for languages that were popular at various times during the codebase's history, and where you need to be an expert in N languages to maintain it.


I think one of the reasons OSs are written in low-level, shoot-yourself-in-the-foot-ready languages is that sometimes you need to do truly stupid things to make stuff work. Hardware doesn't do what it's supposed to, and getting around it via some nutty hack is the way the world works.


Rust wouldn't fix these problems. They exist, at least some subset of them, because the kernel needs to do low-level concurrency and lock-free algorithms. That's all unsafe code in Rust. So assuming you got the buy-in and managed to scrape together enough engineering talent for a mammoth task, it wouldn't help much with these kinds of issues.


Unless these linting tools go out there with source, I'm going to be skeptical of Google's magic code checking and bug finding process.

And IMHO, such large number of patches / fixes on Linux should always use a decent open audit process.


> Unless these linting tools go out there with source

The tool is released as open source.

> such large number of patches / fixes on Linux should always use a decent open audit process

All contributions to the Linux kernel are audited by the kernel development team. Look at kernel.org, you'll see that contributions are always signed off by at least one other kernel dev, and often also Linus himself.



Regardless of how the bug is found, you can verify if it is a bug or not without knowing how it was found.


Google is a corporation. The Linux kernel is [by implication] developed and maintained by Linus. Please don't do this. If there are race conditions, fix them following the process of kernel.org.


Something like 80+% of commits to the kernel are by people paid to work on Linux (Google employees among them). This is just part of the Linux development model and has been for a long time.


Off topic, but what companies, besides Red Hat or Google, will pay you to work on Linux? I imagine that would be an enjoyable job.


I don't understand, how is Google not following the official process?


The last sentence of TFA had an ominous cast:

> For those wanting to learn more, the code at least for now is being hosted on GitHub.


Where in the 'Linux process' does it mandate that 'code MUST NOT be made available on GitHub'?

Google basically announced their project and published the code. That's it. In their email to the LKML (https://lkml.org/lkml/2019/9/20/394) they specifically mention:

  In the coming weeks we're planning to:
   * Attempt to send fixes for some races upstream […]

  There are a few open questions:
   * How/when to upstream KCSAN?
So they intend to follow the standard contribution process of sending fixes upstream.


The sanitizer code is being hosted on GitHub.

How is this a sufficient point of confusion to lead to this comment chain? It’s a link. Click on it and see.


This sounds more like "on github for now instead of kernel.org".



