Weird that this treats uninitialized variables as unknown values. For example in ex3.c, the program
    int main(){
        int x;
        if (x <= 42){
            assert(x != 12345);
        }
    }
is of course UB in C, even though under a "uninitialized is random" model the program is valid and does not assert (as the model checker concludes).
(Even at -O1, clang gets rid of the whole function, including the ret instruction. I'm surprised it doesn't at least leave a ud2 for an empty function to help debugging, since it would not cost anything: https://godbolt.org/z/eK8cz3EPe )
Well, specifically, it counts uninitialized variables as being set to a non-deterministic value. The point of this tool isn't to optimize functions as a compiler would, but rather to find bad behavior based on its machine model.
This isn't perfect, of course. In this case, the compiler can rightly treat x as uninitialized, meaning that its value could be <= 42 at one point, and not be <= 42 at another point. Since the uninitialized variable isn't "pinned", it could technically be different values in different locations.
CBMC's machine model works differently. In this case, x is assigned a non-deterministic value. The branch condition creates a refinement of this value within the scope of that statement. If the branch is taken, then by SMT rules, x can't equal 12345, because it was already refined as being <= 42.
On its own, a model checker can miss situations like these. It's why I recommend -Wall -Werror -Wpedantic in conjunction with CBMC. The compiler should catch this as a warning, and it should be upgraded to an error.
> In this case, the compiler can rightly treat x as uninitialized, meaning that its value could be <= 42 at one point, and not be <= 42 at another point. Since the uninitialized variable isn't "pinned", it could technically be different values in different locations.
LLVM has a "freeze" instruction to stop propagation of undefined values (although I think it was added after the first version), so that the value is "pinned" as you say. A plain "undef" value without "freeze" will not do this.
I think that the C compiler should "pin" such undefined values where they are used, but I don't know which compilers have an option to do this. (Perhaps CBMC should also have such a switch, so that you can use the same options that you will use with the C compiler.)
With this ex3.c file, the optimizer should be allowed to produce a function that does nothing and has an unspecified return value (which might or might not be the same each time it is executed). (If optimizations are disabled, then it should actually compile the conditional branch instruction.)
This is one reason why I explored model checking machine code output, since at this point, the behavior is very much defined, even if it differs from implied source behavior. But, this gets complicated for other reasons.
This is actually one of the reasons I've been tinkering with CBMC https://www.philipzucker.com/pcode2c/ The idea is to use Ghidra to lift a specialized C interpreter of the machine code and then compare to source or ask other questions via CBMC
I do recommend standardizing on hardware and toolchain for projects like these, as it can ensure that the machine code is matched.
The third phase of my work will round the corner with machine code analysis. Currently, I'm working on constructive proofs and equivalence proofs of imported C code to handle the pieces that CBMC doesn't do so well, as part of my second phase. So far, I can extract efficient C from Lean 4, but I want to import C directly into Lean 4 to prove it equivalent to extracted code. Hand-written C is easier to read than the line noise I can extract.
In particular, model checking doesn't fare well with data structures, algorithms, and cryptography. These can be torn down quite a bit, and at least we can verify that the algorithms don't leak memory or make bad integer assumptions. But, if we want to verify other properties, this requires constructive proofs.
Between 80 and 95 percent of the time, depending on the type of code being written, CBMC is enough. I'm now solving for that 5 - 20 percent of code that CBMC can't easily tackle.
> model checking doesn't fare well with data structures, algorithms, and cryptography
To the contrary, that's what I'm using it for in most of my projects. It found interesting algorithmic bugs in my ctl find function (a 3-way comparison callback missing cases), and it's used to crack poor hashes.
Also my tiny-regex matcher benefited from it.
What do you mean by standardizing on hardware and toolchain? I am currently choosing ghidra and cbmc, and it seems like the approach is applicable to any compiler or arch that ghidra supports without too much change. I have only been doing x86-64 and gcc so far though
Currently, I'm feeding commits into libcparse, which is the start of this effort. That stream is currently about 180 days out of phase with what's in my local repositories, but it is slowly trickling in. The first step of this second phase will be to use libcparse to verify itself via constructive proofs in Lean. Currently, and by that I mean what hits in March or April, libcparse has a mostly C18 compliant lexical scanner for the C preprocessor. The goal is to have a SAX-like C18 parser that can detect all C declarations and definitions. Tooling will include a utility that imports C declarations and definitions into both Lean 4 and Coq. This is a moonlighting effort for an upcoming book I intend to write.
As for standardizing on hardware and toolchain, what I mean is that I try to get projects to commit to using specific hardware and a specific toolchain. The goal is to limit variables for automated testing and analysis.
OpenBSD, in particular, is adamant about testing their OS on real hardware for platforms that they support. Their definition of "working" is on real hardware. I think that's a reasonable goal.
It helps with analysis like this, as you don't have to worry about different compilers or different hardware. GCC, in particular, can produce different output depending on which hardware it was compiled on, unless a specific configuration is locked down. If there is a flaw in how it does vector operations on a particular Xeon processor, this won't be caught if testing locally on a Core i5 processor that produces different code.
Since any use of the variable is undefined behavior, that's what we want to be informed about.
The reasoning about nondeterministic values should be spared for situations when it's other than undefined behavior. For instance, accessing an uninitialized structure member that came from malloc isn't undefined behavior. It's (I think) unspecified behavior. In an implementation that has trap representations for some types, it could hit a trap.
That's understandable, and I would suggest reaching out to the CBMC developers. It shouldn't be difficult to add an uninitialized variable pass to the solver.
Practically speaking, -Wall -Werror should catch this. Any use of a tool like CBMC should be part of a defense in depth strategy for code safety.
It does in clang.
    $ cc -Wall -Werror bar.c
    bar.c:5:11: error: variable 'x' is uninitialized when used here [-Werror,-Wuninitialized]
        if (x <= 42){
            ^
    bar.c:4:12: note: initialize the variable 'x' to silence this warning
        int x;
             ^
              = 0
    $ cc --version
    clang version 16.0.6
It also does in gcc.
    $ gcc -Wall -Werror bar.c
    bar.c: In function 'main':
    bar.c:5:10: error: 'x' is used uninitialized [-Werror=uninitialized]
        5 |   if (x <= 42){
          |       ^
    bar.c:4:11: note: 'x' was declared here
        4 |   int x;
          |       ^
    cc1: all warnings being treated as errors
    $ gcc --version
    gcc 12.2.0
I do that! Some of my projects are packaged in distros. In the past, I've found discussions of problems among distro people, who didn't contact upstream at all, but just tried to patch things on their own. I fixed things on my end, and in such a way that their patch would fail to apply when they update. No words exchanged at all.
I'm not a user of CBMC. I read about it here, I wrote a comment here, and that's the end of it.
But for a minute I will bite. The submission has no link to the actual project. I found that by a web search.
The project has no project trappings: no mailing list, or anything of the sort. The page says, "For questions about CBMC, contact Daniel Kroening". The name is a link to his home page. His home page has no obvious contact information. There are links to papers; probably his e-mail address is given in some of them. Let's click on "A Tool for Checking ANSI-C Programs". Hey, e-mail links in that page, before we even get to the PDF. But ... from 2004. Is k-----g@cs.cmu.edu from two decades ago going to work, given that he's at Oxford?
Suppose I hunt down his current e-mail address. What do I say? "Dear Sir, I feel really awkward to be contacting you about your CBMC program, which I have never actually used, but believe me when I say I have a valuable suggestion ..."
If I used the thing, I'd fix it myself and send a patch!
https://github.com/diffblue/cbmc The github repo is active and they are very responsive to issues. As you say, it is unnecessary and probably undesirable to post an issue for a tool you don't use. Links to it appear in multiple places in the submission. The state of the cprover website is unfortunate and represents the state of the project unfairly imo.
FWIW I have had luck mailing old research emails, I read my 20 year old university email. It is one of the best signal to noise ratios in communications I have.
I would not mail if it is only about uninitialized variables.
When I say that a variable is used, I mean that its value is accessed. This is in line with compiler jargon. In this basic block, the variable x has no next use at the first line; the assignment of 42 is a dead assignment.
I recall trying to find some fun bugs caused by uninitialized memory, which were made harder to find by the fact that the debug runtime zeroed memory allocations while the release runtime did not.
A declaration doesn't reserve an address. It just means that "x is a variable of type M." How this actually works depends on the compiler.
Weird things happen to uninitialized variables during optimization. Basically, nothing can be assumed about the value, because using an uninitialized variable results in undefined behavior. The original un-optimized generated code may very well have some default value, maybe. Probably not. But, the optimizer can decide that this value is represented by register R0 here and R4 there, and since the variable is uninitialized, there's no point in doing a spill or a load. It really comes down to how the optimization passes were written.
I'll take your word for it, but I would say there is a point, which is that the symbol was declared.
And I'm not talking about a default value; this doesn't imply a default value, even though assigning an address will come with some value. But I do get that defining the name and type is a separate thing that doesn't necessarily require assigning an address any more than assigning a value.
The difference is, any default value would be arbitrary. 0 is just a value that has a meaning that the compiler can't know. It's almost as arbitrary as 37.
But picking the next available address to attach to the symbol is not like that.
The compiler always picks that itself and the programmer always has to ask what it is when they want to know it. The compiler can safely do that much right at declare time without making any assumptions. The value is still random, which is fine because 0 is almost the same as random, in that it's random if 0 would actually be the best default value for this particular item vs 1, -1, size, -size, size/2, etc.
> But picking the next available address to attach to the symbol is not like that.
> The compiler always picks that itself
One of the best analogies I heard, years ago on IRC, is that good source code is like select cuts of meat and fat, and the compiler is like a meat grinder. What comes out of the meat grinder is hamburger. It bears little resemblance to the source code, syntax, or semantics. But, it is meat.
I'll talk about register machines with a stack, since code is generated this way on most architectures. But, that's not necessarily something that can be relied upon. The compiler will preserve "relevant meaning" once the optimization passes are finished. Side-effects and return values (if used by the caller, in the case of LTO) may be preserved, but that's largely it.

That being said, let's talk about what a variable is, from the perspective of a compiler. A variable is decomposed into a value, which may be referenced, may be changed, or may need to be preserved. The optimizer aggressively strips away anything that isn't immediately needed. If the compiler needs to preserve the value of a register, then it can choose several ways of doing this. If the value is fixed, then it may just become a constant in code. Fixed values can also be produced synthetically. If the value is set at runtime and referenced later, then it could be set in one register and transferred to another register. If neither is possible, then the value could be spilled to the stack.

But, this spill location is not necessarily assigned to this variable. It could be used for another variable later in the function when this one falls out of scope. Spill locations aren't necessarily fixed. On many ABIs, there are scratch registers, registers that must be preserved between calls (the called function must preserve them if used), and registers that can be overwritten after a call. How a value is preserved depends entirely on whether it must be preserved, and which is the most efficient way to do so.
If a variable in source code is referenced but never assigned, then the optimizer can optimize away any "storage" (i.e. stack or a register) for this variable. When the variable is "read", the value used is arbitrary, if the read happens at all. The "read" has no meaning and has been optimized away. The value isn't fixed to a random value either at compile time or at runtime. It's whatever flotsam happens to be at that particular place in code at that particular time. If the value is "read" once in one location and once again in another location, even on the next line of source code, these values could be the same or entirely different.

There's no telling what the compiler will do, because the optimizer is really free to interpret this in any way. The only way to ensure that a value is preserved by the optimizer is to explicitly assign this variable a value in the source code. This assignment is respected, assuming that the variable is referenced afterward. Beyond that, we can't make any meaningful assumptions.
It seems that CBMC is not intended to check production sources directly against C semantics, but to prove things about programs written in a C-like syntax.
I see the calling an undefined prototype style more often, perhaps for this reason. You are probably right this is undefined behavior, but it's subtle. https://stackoverflow.com/questions/11962457/why-is-using-an... I suspect CBMC just picks some concrete behavior for undefined behavior. It may not be a good detector for that. I'm not sure. This gets into shaky territory of understanding for me.
Any invocation of undefined behavior should be considered an assertion failure as far as any model checker should be concerned. Compilers can--and will--treat undefined behavior as license to alter the semantics of your program without any constraint, and most instances of undefined behavior are clearly programmer error (there is no good reason to read uninitialized memory, for example).
Reading an uninitialized variable is not "subtle" undefined behavior. It's one of the most readily accessible examples not only of what undefined behavior can exist, but also the ways compilers will mutilate your code just because you did it. To be honest, if something as simple as the consequences of reading uninitialized memory are shaky understanding for someone trying to prove code correct, that will completely undermine any trust I have in the validity of your proofs.
Also, even if the compiler does what a "Real programmer" type thinks it "should" do for this case (and I agree with you that you're not entitled to expect that), you aren't guaranteed that there's some particular value since you never initialized it.
Your operating system likely feels entitled to assume that if you never wrote to this page of RAM you don't care what exactly is in it. After all what kind of lunatic reads a bunch of unknown data, says "Yeah, that's coincidentally what I wanted" and just leaves it unmodified? No, almost anybody would write data they want to keep instead. So, if you never wrote to this particular page of RAM and your OS finds it convenient to swap that page for a different one, no harm no foul right? But now the contents of your uninitialized variable changed!
> So, if you never wrote to this particular page of RAM and your OS finds it convenient to swap that page for a different one, no harm no foul right? But now the contents of your uninitialized variable changed!
No sane OS will do this. Any page that's handed to a process that was last written by a different process must be zero'd (or otherwise have every address initialized) by the OS to avoid leaking information across process boundaries. You could, in theory, have a page that was munmap'd by /this/ process be handed back to the same process to fill a request for a different virtual address without zeroing it, but I can't imagine that any OS tracks the last writer to enable this "optimization" in the few cases it would apply.
glibc marks freed pages with MADV_FREE, and on Linux, an MADV_FREE page won't be unmapped immediately. It's possible to read a value of an MADV_FREE'd page, have Linux swap the page out, and then read it again and now observe a zero-init'd page. If malloc returns a new allocation to a freed page, it isn't written to by the allocator (which would prevent the page from being subsequently zeroed if it hasn't already done so), so it doesn't guarantee that the contents won't be randomly zeroed.
Whether or not this is a sane OS I leave an exercise to the reader, but it is nonetheless the property of a common OS.
> glibc marks freed pages with MADV_FREE, and on Linux, an MADV_FREE page won't be unmapped immediately. It's possible to read a value of an MADV_FREE'd page, have Linux swap the page out, and then read it again and now observe a zero-init'd page. If malloc returns a new allocation to a freed page, it isn't written to by the allocator (which would prevent the page from being subsequently zeroed if it hasn't already done so), so it doesn't guarantee that the contents won't be randomly zeroed.
Oh.
I feel like I should have more to say than that, but... oh.
(there's also usually a write, unless the allocation-head-metadata is on a different page than the parts of the allocation you're acting spooky with)
Edit: after watching the video elsewhere in the thread, it was indeed crossing a page boundary, but that behavior has to be a kernel bug, since the documentation says "accesses" are sufficient to repopulate it.
No, it's specifically a write that causes the page to actually be freed. Per the man page:
> However, subsequent writes to pages in the range will succeed and then kernel cannot free those dirtied pages, so that the caller can always see just written data. If there is no subsequent write, the kernel can free the pages at any time. Once pages in the range have been freed, the caller will see zero-fill-on-demand pages upon subsequent page references.
At least Linux does this, and it is not a security hole because it only reuses pages with memory of the current process (as in the worst case you end up reading data from some other part of your program, but not other programs).
The problem is that uninitialized memory really is subtle.
After initializing all members of a struct, the padding bytes of that struct are still uninitialized. Yet, it is valid to memcpy the whole struct somewhere else; even with a hand-written char*-based copying loop. (and indeed, this is a good reason for reading uninitialized memory!)
So reading uninitialized memory is not always undefined behavior; just most of the time.
The C standard text is barely sufficient for the purposes of compilers, but wholly insufficient for what a model checker would need to know.
> After initializing all members of a struct, the padding bytes of that struct are still uninitialized.
Depends on how you initialize them. If you do it all in one go, then yes. If you partially initialize it, then no: some of the padding bytes are guaranteed to be zero. (Observing this and maintaining this invariant is an exercise for the reader.)
> there is no good reason to read uninitialized memory, for example
That's an assumption on your part.
One very good reason might be to ensure that memory really is there given the optimistic allocation schemes and to force a page to be read into core. You could also write to it to get the same effect. Another good reason to do a read on uninitialized memory is to see what's still there from the previous program that ran. Now arguably that's a nefarious reason but it is still a good one and it may help uncover bugs in the underlying system.
The code doesn't read uninitialized memory. Objects with automatic storage duration need not be in memory; the C99 standard only uses the word "memory" in normative sections when talking about "malloc".
I apologize for using the word subtle. What I should have said is that I don't understand the topic, and reading about it has left me with the sense that one should be very careful.
I've used CBMC in large commercial projects ranging around half a million lines of code. The secret is to break down every one of those functions into small pieces that the model checker can analyze.
CBMC works best with a contract-based programming style, where contracts are enforced by assertions and shadow functions exist to simplify model checking.
A single reply on HN is hardly a place where idiomatic styles can be expounded, but it is quite possible to build a resource-oriented idiomatic style that is reinforced with CBMC.
So it's probably best to consider it for unit and "small scale" aggregated functional testing, then? This is actually the first time I've come across it. It seems like it would be good to consider for new projects or new features, but it would be quite an ask to retrofit it onto legacy code that wasn't already well "unit tested"?
In particular, one of the more important rules I made when using CBMC was to keep the search depth for each invocation as shallow as possible. If CBMC runs in less than five minutes, you're doing it right. If it takes more than that, then you're asking it to do too much.
This led to the creation of function contracts and shadow functions. When evaluating functions, we actually want any calls that it makes to go to shadow functions instead of real functions. When we write a function, we include a contract that includes anything it may return, any side-effects it may have, and any memory it may touch / allocate / initialize. We then write a twin function -- its shadow function -- that technically follows the same contract in the most simplified terms, randomly choosing whether to succeed or fail. From CBMC's perspective, it's equivalent enough for model checking. But, it removes additional stack depth or recursion.
A good example of this would be a shadow function for the POSIX read function. Its contract should verify that the descriptor it is passed is valid. It should also assert that the memory region it is passed is bounded by the size it is given. The shadow function picks a non-deterministic state based on how it should set errno and how it should return. This state follows the POSIX standard, but the underlying system calls don't matter. Likewise, depending on the return code, it should initialize a number of bytes in the read buffer with non-deterministic values.
I used this shadow read function to find a buffer overflow in networking code and also to find a DOS infinite loop error in a tagged binary file format reader we were using.
CBMC isn't perfect, but when coupled with a good linter and -Wall / -Werror / -Wpedantic, it is a very useful layer in a defense in depth strategy for safer software.
> CBMC isn't perfect, but when coupled with a good linter and -Wall / -Werror / -Wpedantic, it is a very useful layer in a defense in depth strategy for safer software.
I agree to a certain extent, but at what point are you putting in more effort than it would take to migrate to a stricter language that offers more safety by default?
I'm not aware of any language that can't be improved by this strategy. Amazon even added SMT solving tooling around Rust, because Rust's built-in safety guarantees aren't enough.
I'd say that the added effort is minimal. The compiler spits out more errors. We think a little more carefully about how we write functions so CBMC doesn't complain. There are no silver bullets.
>Satisfiability Modulo theories (SMT) is an area of automated deduction that studies methods for checking the satisfiability of first-order formulas with respect to some logical theory T of interest. It differs from general automated deduction in that the background theory T need not be finitely or even first-order axiomatizable, and specialized inference methods are used for each theory. By being theory-specific and restricting their language to certain classes of formulas (such as, typically but not exclusively, quantifier-free formulas), these specialized methods can be implemented in solvers that are more efficient in practice than general-purpose theorem provers.
>While SMT techniques have been traditionally used to support deductive software verification, they are now finding applications in other areas of computer science such as, for instance, planning, model checking and automated test generation. Typical theories of interest in these applications include formalizations of arithmetic, arrays, bit vectors, algebraic datatypes, equality with uninterpreted functions, and various combinations of these.
Not that I'm an expert in processing what exactly is being tested, but it basically looks like it's only able to prove that an individual function doesn't overrun buffers, and only if you tell it to assume that integer overflows can't happen (!). So I'm not impressed.
If you only need to test for buffer overruns, the VC++ static analyzer + SAL2 can do an excellent job. Basically, if you annotate every pointer with a length, the compiler can tell you whether a given piece of pointer arithmetic is safe.
Klee (another tool mentioned in TFA) was run successfully against unmodified versions of all GNU textutils (or maybe it was coreutils... in any case, except for GNU sort, which was tricky due to large dynamically allocated buffers), and found bugs in quite a number of them.
Actually, these GNU tools are relatively simple compared to the code we usually write as C/C++ software engineers in daily work. For example, if you have a function that takes just one single protobuf object, Klee cannot help you, because the input space is too large. Klee can only operate at the unit test level, with specially crafted code. It did a good job on GNU textutils because the inputs of each tool are relatively independent, and most inputs are just boolean flags that are either true or false.
Also, please be aware that Klee cannot provide any kind of assurance. Normally it cannot give you a proof that your code is 100% safe, because most code is too complex for that. While it tries to find all the execution paths of your code and execute them symbolically, it is usually not possible to finish executing all the paths.
Though Klee can support C++, you will find fuzzing C programs is much easier than C++, because C++ data structures are way more complicated: a C-style string vs a C++ std::string, a C array vs a std::vector. So, in order to get broader use out of Klee, we need to rewrite our code in a simpler way.
Frama-C is also very cool. Also Coq VST and however seL4 does it. I think out of all of these, CBMC requires the least expertise / has highest automation. I also don't see how it can scale beyond the unit test level, or patch together verification conditions into global properties. Perhaps this could be done by some meta framework plugging together calls to CBMC. Given the level of penetration of formal methods generally, I think the relatively low ambition of CBMC compared to these higher ceiling techniques is good.
I miss this kind of stuff in computer programming languages.
In hardware design, verification is done simultaneously with design, and semiconductor companies would be bankrupt if they did not verify the hell out of their designs before committing millions in manufacturing these chips.
Even among hobbyists this is getting traction with yosys. Perhaps it's time for programmers to adopt this kind of tooling so there will be less buggy software released...
There are probably orders of magnitude of difference between hardware design output and programming output. At least at the time, seL4's verification was quite impressive for a codebase on the order of 10,000 lines. But we should work towards the goal of improved formal modelling and checking all the same.
The point isn't to have a checker that is clever enough to know that the assert will not go off, but to be informed that the automatic variable x is being accessed without being initialized!
There are trivial cases of the use of an uninitialized variable that elementary data flow analysis will uncover.
When the variable definition and its use are together in the same basic block, it's immediately obvious that the variable has a next-use on entry into the block, and that its definition hasn't supplied it with a value.
From there it gets complicated. GCC has a history of emitting overly zealous diagnostics in this area, warning about possibly uninitialized variables that can be proven not to be.
The difficulty is equivalent to the halting problem:
    void f()
    {
        int x;
        if (invokes_undefined_behavior(f))
            x = 'A';
        putchar(x);
    }
No, it does not. It probably does constraint-based verification, or looks for "proofs" of possible error conditions or asserts. In the case shown, it is trivial to prove that an uninitialized variable could equal the asserted value.
Technically speaking, it a) unrolls all loops up to their length (if it can be statically determined) or to a given verification bound; b) transforms the code into SSA; c) transforms the SSA into a SAT formula. If the SAT formula is satisfiable, its certificate is a counterexample to the property under verification.
Of course the toy example is a toy, and of course it does not "run" anything, but this is a sound technique: if there is an execution that violates a property, it will be found.
I highly suggest reading the paper that introduced it [1], it is remarkably clear!
[1] Edmund Clarke, Daniel Kroening, and Flavio Lerda. 2004. A Tool for Checking ANSI-C Programs. In 10th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS) (LNCS), 2004, Barcelona, Spain. Springer, 168–176. https://doi.org/10.1007/978-3-540-24730-2_15
Yes, correct. It does not and could not _actually_ run all possible executions unless the program state space is quite small. But from a user's perspective, it is similar. I have tried this pedagogical approach to describing verifiers like these as "infinite" unit tests before (to mixed results). It feels to me that what symbolic or constraint based reasoning (algebraic identities, programs, integrals, what have you) in general is doing is finding a way to finitely reason about a very large or actually infinite number of concretized cases.
> There are many footguns. People love their guns. Makes them feel powerful.
Well, duh: if you remove the most egregious footguns from C, you end up basically with Pascal, and here is Why Pascal Is Not My Favorite Programming Language.
And when you try to explain to them that C's abstract machine has the semantics of "when you omit a safety check, you actually make a promise to the compiler that this check is redundant, that you'll make sure in some other way that in no possible execution will the illegal values ever show up in this place, and the compiler will actually hold you to this promise", they throw tantrums and try to argue that no, it doesn't have these semantics, and even if it has, it's not the semantics as originally intended by K&R, and it is the authors of the C compilers who are wrong (even though the authors of C compilers are the people who actually write both the C standards and the C compilers...)
Must have something to do with imprinting: I learned C as my 3rd language, and from the start I was taught that its UB is absolutely insane, and I've never felt the "nah, it should just do what I meant, dangit" denial about it, only "this is sad; how can I even know, while writing code, when I am making a baseless promise, especially when it's made by omitting something?" But like you've said, people like their footguns.
Pascal (the dialects most people actually used, not the original Pascal that Kernighan critiqued) had ways to pass around dynamically sized arrays and strings.
Which anyway was not an issue in the PL/I variants that predated C, nor in the Mesa/Modula-2 descendants from the mid-70's.
EDIT:
Walter's example with 1978's Modula-2.
MODULE Example;

FROM InOut IMPORT WriteString, Write, WriteLn;

PROCEDURE get(buffer : ARRAY OF CHAR) : CHAR;
BEGIN
    RETURN buffer[10]; (* This will boom unless compiled with bounds checking disabled, in which case who knows *)
END get;

VAR
    buffer : ARRAY [0..9] OF CHAR;

BEGIN
    WriteString('The out of bounds character is');
    Write(get(buffer));
    WriteLn;
END Example.
Eh, I think most C programmers' frustrations with UB stem from knowing that the C standard has fundamental flaws, and that modern compilers abuse that fact for "optimizations" on UB. This paper covers the topic pretty well [0].
I fully support Casey Muratori's viewpoint that undefined behavior should not exist in the standard [1]. Instead, the C standard should enumerate all valid behaviors compliant compilers can implement. This would allow compilers to make unintuitive optimizations for the platforms that need them, but still allow programmers to be certain that their program's semantics will not change in different versions of that compiler.
> I think most C programmers frustrations with UB stem from knowing that the C standard has fundamental flaws and modern compilers abuse that fact for "optimizations" on UB
Do they actually know that? I don't think so. Let's take [0] for instance (the author is an experienced C programmer who loves the language and wrote a re-implementation of bc in it):
There seemed to be a lot of misunderstandings; I could not get a handle on what this person thought UB meant.
I finally figured it out: this person’s definition of UB was not “the language spec can’t guarantee anything.” Instead, it was “compilers can assume UB does not exist and optimize accordingly.”
Wat.
Yep. Apparently, that was news to him, even though C implementations have been behaving like that for about 30 years already, and you can read the rest of the post for the "this is evil, we the users must do something about it" take. And the proposed "something" is not "we should instead use a language with actually defined semantics", oh no. It's "use compiler flags to force more reasonable behaviour, hopefully" and "somebody should write boringcc; unfortunately, I am myself a bit too busy for that". Well, despite numerous pleas and several attempts, nobody has managed to write boringcc, which is telling of something, though I'm not sure of what exactly. So... I think it is about imprinting: "Oh, it's a wonderful language (well, it would be if it were actually implemented the way I used to think it is implemented, and I still think it should be implemented that way), but still, it's a wonderful language if only not for that pesky reality" is forcing unfounded expectations onto reality. And where do these expectations even come from in the first place?
And yeah, I fully agree with you that the C standard should've probably done that. But it didn't happen, and it can't happen because of backwards compatibility [1]. But even then, C programs would still be non-portable, in the sense that you have to use #ifdefs to tinker with platform-specific behaviour for anything interesting, because the C standard even today leaves a lot of stuff completely up to the implementation; see [2] for an especially appalling example, even without touching UB.
> Well, despite numerous pleas and several attempts, nobody has managed to write boringcc which is telling of something, I'm just not sure of what exactly.
John Regehr has a blog post that touches upon why this is the case [0]:
> I’ll assume you’re familiar with the Proposal for Friendly C and perhaps also Dan Bernstein’s recent call for a Boring C compiler. Both proposals are reactions to creeping exploitation of undefined behaviors as C/C++ compilers get better optimizers. In contrast, we want old code to just keep working, with latent bugs remaining latent.
> After publishing the Friendly C Proposal, I spent some time discussing its design with people, and eventually I came to the depressing conclusion that there’s no way to get a group of C experts — even if they are knowledgable, intelligent, and otherwise reasonable — to agree on the Friendly C dialect. There are just too many variations, each with its own set of performance tradeoffs, for consensus to be possible.
Granted, at the end he says that it's not that it's impossible, it's just not for him, so it's still possible that someone else would succeed, but they'd have an uphill battle for sure.
Those ideas are not convincing. Consider the glibc bug from yesterday. If I understood it correctly, there is an integer overflow causing a heap overflow. The integer overflow was unsigned wraparound, so fully defined behavior. So no, we do not want latent bugs to remain latent. We want UB sanitizers and checkers, and to fix all this crap.
I think that's arguably a distinct concern from what boringcc/friendly C are intended to address. My understanding is that the latent bugs those are intended to keep latent are those that would be unearthed by aggressive exploitation of UB. Sanitizers/checkers/etc. would still be useful under boringcc/friendly C for finding bugs arising out of fully-defined behavior.
Sanitizers/checkers are much less useful if the behavior is defined, because you then cannot be sure it is not intentionally used in the way the definition allows. So you cannot check at run time in production and stop the program, as you can with UB, and during testing you would need to analyze each case individually.
I think sanitizer usefulness for defined behaviors is going to depend a lot on how frequently those behaviors are intentionally used and how easy/hard it would be to suppress false positives.
There might be some behaviors that sanitizers don't currently catch but are used for optimizations as well, and boringcc/friendly C might be useful for those? Not entirely sure what those might be, if they even exist, though; IIRC sanitizers don't currently flag strict aliasing violations, but -fno-strict-aliasing exists so there's no need for a new compiler/dialect (for GCC/Clang?)
As the author of the first blog post, I think you got my conclusion wrong because I'm writing a language without such UB. So I would like to have a better language.
But I was also talking about UB in the context of already existing code! My argument is that compiler writers are breaking existing code when they could very well avoid breaking it.
We have tried to convince compiler writers to not do this, but they have refused. So that's why I said that users must do something: because compiler writers won't.
And yes, I knew compilers would take every advantage that they could before that; my surprise was that someone considered UB's sole purpose to be for the benefit of compiler writers at the expense of everyone else, including non-programmers who suffer catastrophic consequences for security bugs.
Actually the definition of UB is not "compilers can assume UB does not exist" but indeed "the language spec can't guarantee anything". We clarified this in the C23 spec. (although a limited form of "compilers can assume UB does not exist" can be derived from "the spec can't guarantee anything")
Well, I was under the impression that the word "anything" actually means "anything", so e.g. "the language spec can't guarantee that compilers won't assume that the execution paths with UB on them never happen, and then use that assumption for optimization" is a valid instantiation of "the language spec can't guarantee anything".
Actually, the spec does not say exactly "the language spec can't guarantee anything". It says "undefined behavior --- behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this document imposes no requirements." So it says "no requirements" but this refers specifically ("for which") to the behavior in question and not the whole program. This implies that preceding observable behavior can not be affected. (but non-observable behavior can according to the "as-if" rule and this is sufficient for optimization). We added a note to clarify this in C23.
CBMC's default contract for getc returns a non-deterministic integer value. A non-deterministic integer value basically amounts to "pick whatever value would cause this dereference to crash the machine model".
Sounds like it's doing a solid job of data flow analysis. But I can still produce cases where it cannot determine either the range of the index or the length of the array being pointed to.
The programmer will have to add code to keep track of the length of the array, and add code to check the value of the index. The good thing is the analyzer will tell you where you'll need to add those checks.
With the proposal I made, none of this is necessary. It just works. Two decades of experience with this approach in D show that it just works.
The machine model tracks the array with its size and compares this to any index. This is loaded into an SMT solver. This is not a standard static analyzer: it compiles the code to an abstract machine model and then runs through this model with an SMT solver.
CBMC isn't perfect, but the strategies you are talking about are meant to defeat a linter or a shallow analyzer in a compiler. CBMC isn't a linter.
SMT cannot solve cases where it simply cannot know what the length of an array is. For example, when array is allocated in code not available to the solver, or is set by the environment.
I remember trying out SPARK a few years ago, which advertised its compile-time checking abilities. I tried using an integer equation that used an OR, and it gave up on it.
If you need to add in range checking code to help the solver along, then the range check itself is a source of bugs, and the range limit has to be made available somewhere and be correct.
In my proposal, the array length is set at the time of creation of the array, and the length is carried along with the pointer. The solver's role then becomes one of eliminating the bounds check in the cases where it can prove the index is within range. The user doesn't have to do anything.
We've been using it in the implementation of the D compiler for a very long time now, and problems with buffer overflows are a faded memory.
P.S. I also added manual integer overflow checks when passing the size to malloc(), no matter how unlikely an overflow would be :-)
> SMT cannot solve cases where it simply cannot know what the length of an array is.
Sure it can. It sets the array length to a non-deterministic unsigned integer, and then finds a counter-example where this array length is invalid. CBMC will also make the array pointer itself non-deterministic if you haven't refined it. So, the pointer could be NULL, could be pointing to invalid memory, etc.
Don't think of values as fixed numbers, but rather as functions shaping a non-deterministic value. The range check made upstream becomes part of this function. The goal of the SMT solver is to find a counter-example that crashes the machine model. If you want a rather leaky abstraction, imagine that by transforming the source code into an SMT equation, it's effectively turning it inside out and transforming it into a logic equation with modulo math.
I agree with you that it would be nice if bounds were included in C arrays. But, we have to work with what we have. Unfortunately, especially in firmware that requires a commercial C compiler, we are stuck with C. In those cases, if we want safer software, we need tooling like this.
I agree that if one isn't going to enhance C, one is going to have to resort to these tools.
C gets new features now and then. Why not add something incredibly useful, like the slice proposal? Instead, C23 got enhanced with the crazy Unicode identifiers. Richard Cattermole has been adding them to D's C support, requiring 6000 lines of code!!
Yeah... I'm not too happy with some of the choices made in C18 and C23. One of the reasons why I chose to standardize my C SAX-like parser on C18 for now was to avoid the Unicode complexity. There's also a widening feature gap between modern C and modern C++ that makes interop harder.
> Richard Cattermole has been adding them to D's C support, requiring 6000 lines of code!!
I wondered if you were implementing normalization (which is complicated), but that doesn't seem to be the case. I have read the PR, and it's clear that it solves multiple issues at once: no start-continue distinction (a bug), a code generator, and the table itself (which is just an inefficiently formatted block of data). 6000 lines of code just for Unicode identifiers is simply an exaggeration.
The newly generated code alone is 4000 lines long [1]. The actual code added is just 2000 lines, and some of it is used to pay down debts (I mean, to make a proper code generator). It is debatable whether that generator should count toward the line count, since you can always write a generator in a simpler scripting language (for example, Rust uses Python to generate the `unicode-xid` crate). In any case, however, it is far less complex and, in my opinion, much smaller than the entire C parser proper.
> C23 got enhanced with the crazy Unicode identifiers.
Actually, this spec security bug was added in C11. But nobody really implemented it until gcc decided that this new "feature" needed to be supported, around C23 time. C23 tried to mitigate that mess by disallowing unnormalized variants of identifiers, but it still does not care about the Unicode security guidelines.
A lot of static analysis tools explicitly support forcing you to check the value of "tainted" values such as return values of getc and so on i.e. even with slices you want to be able to catch this kind of thing ahead of time.
With all respect, please don't say nobody cares. There are many people in the C community who want this, but it's not easy to add a fat pointer to C after all these years. Though, there are people who are planning to work on this exact solution [1]. The best thing the rest of the ecosystem can do is support them.
(Just to be clear, I am not the author of the draft I posted in the parent comment, but it's the feature I want the most to be added in C2y. I was actually thinking of writing a proposal, but realized that I don't know a thing about compilers and language design.)
I have examined many other ways one could add such functionality in C. I found that the most pleasant one is a combination of your syntax and some runtime operators to retrieve the length. As for the lack of implementation burden, I would send an email to thephd in your position. I think that having that comment coming from you would be really beneficial for the paper.
I get what you mean and I'm not saying that not having extra operators is a dealbreaker that will make spans unusable. I just find it more pleasant/convenient for the following reasons:
1. Helps clarify intent
2. Makes teaching the feature easier, since its name is closer to what the user wants to write. Also, rejecting pointers helps people not make mistakes.
3. Keeps sizeof consistent. The sizeof operator is supposed to return the size of the storage occupied by the object. A span is a pointer and some sizes (one could also design a span to be strided, adding more integers to the mix), so sizeof could just return the size of that pointer and the integers. Having sizeof be somewhat overloaded for spans could be surprising to some people.
But again, just having sizeof is fine.
Very nice comment. This was a blocking point I ran into too. There is another option: to not allow spans on the return channel, but I don't know how people who handle resources with lifetimes that exceed the function scope will take this. Just like you, I'm curious to see how this will be solved.
A possible answer is that one could modify n inside the function body, which means that the length is lost. But honestly, one could just use const to avoid this. Though, pointers to VLAs are really useful and convenient when allocating dynamic arrays, especially multidimensional ones.
GCC only works in the first case anyway: https://godbolt.org/z/jo57r1MW4
One could add in the other too, but language semantics currently do not allow this.
Changing n later has no impact on the size of the array, and the compiler is perfectly able to remember the original size.