It isn't a use case; it is a drawback of the C array and pointer semantics:
- Array values decay to pointers in rvalue contexts (though not as the argument of sizeof);
- a[b] is syntactic sugar for *(a+b).
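A minimal sketch of both rules in action (the sizes shown are typical, not guaranteed):

    #include <stdio.h>

    int main(void)
    {
        int a[3] = {1, 2, 3};
        int *p = a;                              /* the array value decays to &a[0]          */
        printf("%d %d\n", a[1], *(a + 1));       /* a[1] is defined as *(a + 1): prints 2 2  */
        printf("%zu %zu\n", sizeof a, sizeof p); /* sizeof does not decay: e.g. 12 vs 8      */
        return 0;
    }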
— ⁂ —
These two design decisions have some desirable results:
- Arrays, including strings, can be in effect passed as arguments to functions without implementing a special parameter-passing mechanism for arrays.
- Functions on arrays are implicitly generic over the array length, rather than that length being a part of their type. (When this isn't what you want you should probably be using a struct instead.)
- Array iteration state can be represented as a pointer, preventing bugs in which you index the wrong array. In a sense a single pointer represents an array range or slice, as long as you have some way to identify the array end, like nul-termination in strings or a separate length argument.
- You can change a variable (including a struct field) from being an embedded array to being a pointer to an array allocated elsewhere—or vice versa—without changing the code that uses it. (But if this had been a significant design consideration, -> wouldn't be a separate operator from . in C.)
- It's easy to create new "arrays" at runtime: just return a pointer to some memory.
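A short sketch of several of the points above: the function below takes a plain pointer, is generic over the array length, and uses a pointer as its iteration state (my_strlen is a hypothetical stand-in for the library strlen):

    #include <stdio.h>

    /* Works on any nul-terminated char array, whatever its length. */
    static long my_strlen(const char *s)
    {
        const char *p = s;      /* the pointer doubles as the iteration state */
        while (*p)              /* the nul terminator marks the end           */
            p++;
        return p - s;
    }

    int main(void)
    {
        char embedded[] = "embedded array";
        const char *pointed = "array reached through a pointer";
        /* The same call works for both: the argument decays to a pointer. */
        printf("%ld %ld\n", my_strlen(embedded), my_strlen(pointed));
        return 0;
    }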
— ⁂ —
Like all design tradeoffs, these also have some drawbacks, which are so severe that no language of the current millennium has followed C's lead on this, although many of C's other design decisions are wildly popular:
- Bounds checking is impossible.
- Alias analysis for optimization is infeasible.
- If you aren't using a sentinel, you have to pass in a separate argument containing the array length whenever you pass in an array pointer, or stuff these base and limit fields into a slice struct, or something.
- Arguably, these decisions are hard to separate from the fact that C strings are terminated by a sentinel value and thus are not binary-safe.
- 3["hello"] is legal C.
— ⁂ —
Of these five drawbacks, the fifth seems like it may not be as severe as the other four?
The thing is, the more complex a spec is (or rather, the more it allows that will never actually be used), the bigger the danger that somewhere down the line this will introduce a security issue or some other problem.
It is, however, not "natural" for someone who doesn't know the obscure bits of history behind a standard written many decades ago.
Someone writing, say, a static code analysis tool or an IDE may not anticipate that in the expression `a[b]`, `a` can be something other than a pointer or array.
Speculating, but I don't think it's about a use case so much as about it being a simple way to implement indexing ('C is portable assembly'), which probably carried through to our more current notion of this being a 'language-level' thing.
I imagine there are more and less optimal ways of actually doing the indexing in machine code, and one spelling may read better than the other, but I would think a compiler would generate identical machine code for both.
"Pointer arithmetic" takes care of that. Adding an integer to a pointer will multiply the size of the type pointed to by the integer and adds that to the pointer.
> (yes it will probably blow up in modern compilers, or at least give you a warning)
Nope. For the code snippet I posted an hour ago, even with -pedantic -Wall -Wextra gcc won't issue any warnings. And why should it? It's perfectly standards conformant, because the standard actually defines the [] operator through the equivalent addition expression.
I think the reason the behavior is still there is that it is not used. There is no gain in changing the standard, and a compiler warning could draw criticism. Why waste your time solving a non-problem?
Most compilers will warn about misleading indentation. This is misleading indexing. A program containing misleading indentation is also standards-compliant, but that's completely irrelevant when talking about what code should trigger warnings.
Misleading indentation, unused variables, unused goto labels and the like are quite good indicators that something is wrong. The thing we are talking about here is issuing warnings for "but that's not how we usually do it".
When you add a new warning to a C compiler, you will break build processes all over the planet that have "-Werror" turned on and/or have management that insists on warnings being addressed. Some of those build processes compile decades-old, safety-critical production code. Code that has a couple of hairy, stylistically sucky places in it. Code that sometimes does weird but perfectly valid things because those portions were ported over from assembly back in the '80s. (And yes, I can guarantee you first-hand that the situation I describe here is very real.)
C compilers have become critical infrastructure, and meddling with their internals and their behavior poses real-world risks. Adding a whole new compiler warning must be carefully considered and had better have a damn good reason.
"This pattern in the syntax tree strongly indicates that there is something wrong in the code" is a good reason.
"This is not how I usually write code" needlessly forces people to rewrite finicky code that has been working perfectly for decades in safety critical environments, for no reason other than you not liking e.g. the order of operator arguments.
> To encourage people to pay more attention to the official language rules, to detect legal but suspicious constructions, and to help find interface mismatches undetectable with simple mechanisms for separate compilation, Steve Johnson adapted his pcc compiler to produce lint [Johnson 79b], which scanned a set of files and remarked on dubious constructions.
Yes, I'm not arguing against such warnings in general. I'm arguing against pure coding style type warnings.
Here's an example: If you do an '==' comparison inside an if, you might accidentally type '=' instead, making it a perfectly valid assignment.
The gcc developers eventually decided to issue a warning if you do an assignment inside an 'if' conditional, but to give you the option of putting another set of parentheses around it if that's really what you want to do. I think this is perfectly reasonable.
However, in the meantime, a lot of people decided to adopt a coding style where you always put the constant or literal on the left-hand side if possible, to avoid this issue. In theory, the gcc developers could also have opted to warn on comparisons where the left-hand side is an lvalue and the right-hand side a constant or literal, suggesting that you flip them around, thus enforcing a "safer coding style" through compiler warnings.
I'm arguing that the former is a perfectly reasonable thing for a compiler to do, while the latter isn't.
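Roughly what the two policies look like side by side (a sketch; the exact gcc diagnostics and flags, e.g. -Wparentheses under -Wall, may vary by version):

    int f(int x)
    {
        if (x = 5)        /* gcc warns: assignment used as truth value            */
            return 1;
        if ((x = 5))      /* extra parentheses signal intent: no warning          */
            return 2;
        if (5 == x)       /* "Yoda" style: a mistyped '=' would not even compile  */
            return 3;
        return 0;
    }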
Then again, it's not common that people make the mistake of confusing an array with an index. Misleading indentation is a somewhat common issue. So it makes sense to have the latter as a warning, but probably nobody thought of adding a warning for the former, or just decided to not bother coding it up.
I think that perhaps this has ventured into the job of linter or stylechecker. It's definitely not compiler warning territory.
When I learned C as a teenager from K&R, I learned that these statements are absolutely equivalent, and I was surprised to see it even mentioned in TFA's README.
Yes, compilers should warn on indisputably poor style, even when program behaviour might still be correct. This is helpful to the programmer, who probably didn't intend to write their code that way.
Fortunately compilers already do this. GCC will warn you about unused variables, for instance.
> To encourage people to pay more attention to the official language rules, to detect legal but suspicious constructions, and to help find interface mismatches undetectable with simple mechanisms for separate compilation, Steve Johnson adapted his pcc compiler to produce lint [Johnson 79b], which scanned a set of files and remarked on dubious constructions.
To this day, the best figures for adoption of such tooling place it at around 11%.
I wonder how much education we need to keep fighting for adoption.
Auto is the implicit default, right? As in function-scoped, stack-allocated, and alive until the function returns?
K&R (Second Ed.) makes no mention of the auto keyword in Section 1.10, but it does say,
> Each local variable in a function comes into existence only when the function is called, and disappears when the function is exited. This is why such variables are usually known as automatic [sic] variables[...]
Yes, exactly. That's why there is no need for it in modern C. This compiler, however, is different: the type is optional (and assumed to be int). Say you have a variable declaration "auto int i;". Back then you could omit int; now you can omit auto.
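A sketch of what that looks like in practice (the implicit-int forms were removed in C99, so a modern compiler will reject some of these):

    f()                 /* return type omitted: implicitly int               */
    {
        auto int a;     /* fully spelled out: automatic storage, type int    */
        auto b;         /* type omitted: still an int in this early dialect  */
        int c;          /* storage class omitted: still automatic            */
        a = b = c = 0;
        return a + b + c;
    }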
"auto" probably is the storage class, it tells what kind of variable this is. Automatic as opposed to "register" which would force the variable to be a register, or "static" or "extern".
The type is not given at all, I think by default it would be "int".
One of the unusual things in this early version of C is that "int" can be used for any word-sized value, including pointers. The type system was very loose.
Even back then this was considered poor practice, however. The first edition of K&R had a subsection entitled "Pointers are Not Integers" (I don't know if that's still in modern editions).
It looks to me like that section was removed in the 2nd edition. Some sections moved around, so maybe I'm just looking in the wrong place, but it's not nestled between "5.5 Character Pointers and Functions" and "5.7 Multi-Dimensional Arrays" like it is in the 1st edition.
The interesting thing for me is that a variable without a type annotation could potentially store anything. It kind of explains why the language used "int" as the default type of variables declared without a type annotation.
No. "auto" is not a type but a storage class that means automatically allocated instead of being allocated to a register, extern-al to the file, or in the static code segment.
I think he means that int and pointer address must be interchangeable. As long as that holds, the size can be either 16 bits or 32 bits.
On a PDP-11 int would have been 16-bit. On x86 32 bits. But on x86_64 int is 32 bits but pointers are 64-bit. The easiest way to retain the original assumption with minimal changes to the historical source code while targeting a modern CPU is to compile in 32-bit mode.
My original comment was rather tongue-in-cheek, but I have actually ported this compiler (well, a later version of it, from the V6 release) to a 32-bit target. It was a different time, and C was a different, definitely more forgiving and simpler language. With other systems languages like BCPL and BLISS around at the time, the whole 'int is the same as a pointer' mindset was definitely a common way of thinking about things.
Why can't it be 64-bit? I don't see any reason why we can't have an ILP64 data model. If int and int* were both 64-bit then it would restore so much of the original beauty of C.
It can be, and is, on platforms where supporting large arrays (if integers are 32 bits, arrays can ‘only’ have 2³¹ entries) is deemed more important than memory usage.
Oh, that's an Intel Math Kernel Library thing. They probably just arbitrarily chose a 32-bit type in one of their FORTRAN interfaces. The x86_64 architecture itself is pretty much word-size agnostic. Arrays can be indexed using a 64-bit index, for example: mov (%rax,%rbx,8),%rcx where %rbx is the 64-bit word index. The only caveat with the instruction encoding is that fixed-size displacements can't be more than 32 bits. So, for example, you can access the second 64-bit word in the array as mov 8(%rax),%rcx but you can't say mov 0x1000000000(%rax),%rcx. You'd instead have to load 0x1000000000/8 into %rbx and use the first form. This poses headaches for programs that compile to 3 GB executables, since unlike arrays, code is usually referenced using displacement notation, but workarounds exist and it's still fine. All that stuff is abstracted by the compiler.
It can be, but people have arranged for it not to be, presumably because they don't feel the storage cost of making all integers 8 bytes is justified.
Yes, and on 16-bit and 32-bit systems, sizeof(int) == sizeof(int*). On 64-bit systems, this is most probably not the case. This is a common roadblock when porting old C programs.
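A quick way to see which camp a target falls into (hypothetical snippet; on typical 64-bit targets it prints 4 and 8, on 32-bit targets 4 and 4):

    #include <stdio.h>

    int main(void)
    {
        printf("sizeof(int)   = %zu\n", sizeof(int));
        printf("sizeof(int *) = %zu\n", sizeof(int *));
        return 0;
    }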
The very first B compiler was written in BCPL by Ken Thompson. B later became self-hosting, i.e. the BCPL compiler compiled the B compiler, but this had another set of challenges due to the extreme memory constraints. It was an iterative process where a new feature was added such that it pushed the memory limit and then the compiler was rewritten to use the new feature to bring the memory usage down.
C was heavily inspired by B, and I suspect it was written in B as well. Alternatively, BCPL was extremely portable as it compiled to OCode (what we'd recognise today as bytecode), so that might have been another option. Assignment operators like =+ come straight from B and were later changed to += due to Dennis Ritchie's personal taste.
Wow, TMG was a new one for me. From the Wiki article on it:
"Douglas McIlroy ported TMG to an early version of Unix. According to Ken Thompson, McIlroy wrote TMG in TMG on a piece of paper and "decided to give his piece of paper his piece of paper," hand-compiling assembly language that he entered and assembled on Thompson's Unix system running on PDP-7."
Ehh... I taught myself machine code programming when I was 11, hand-translating programs I wrote in assembler on paper into raw bytes and typing them in byte by byte. And I am no programming god. So it might be less hard than you think :)
If it could bootstrap itself, then there would be no need to port it to GCC.
From how I read it, it is not capable of bootstrapping itself; an earlier C compiler written in BCPL existed, and this is the first C compiler written in C itself.
The question is: if the first C compiler was written in C, how could it be the first C compiler?
Because to be the first, it has to be bootstrapped in an intermediate host language…
You have to get a parser running, then the syntax, then the etc… etc…
(immense plug of the Ahl book here…)
To be the first compiler for a language, as was pointed out long before I was born, the compiler has to compile itself; so before it could compile itself, there had to be other language-processing programs handling the parsing, the syntax, the etc…
Porting it to GCC just means that they could compile it with GCC. The big test is to get it to compile itself on whatever platform is the target, because finally, if it cannot generate object code/machine language in the target machine's binary format, then it's not really ported.
Later on, UNIX came with tools to build compilers with: YACC and LEX.
If they got it to produce PDP-7 code, it's not much of a port, really.
Probably off topic, but are there real examples of compiler attacks due to bootstrapping?
I hadn't heard about them before reading the sci-fi classic Accelerando by C. Stross.
Brilliant. I was going to post this, but you posted it first. I read this while learning to program in C, and then looked at my compiler with extreme suspicion.