The case for a modern language (codeberg.page)
206 points by bshanks on Jan 22, 2022 | 196 comments



PL/I, created in 1964, had strings. Real strings, where the compiler knows the length even when it gets passed around and is declared char(*) var in the receiving function. You can't have buffer overflows because the compiler and runtime know every string's current length and allocated length.

This isn't a particularly hard problem. C just took a shitty shortcut to fake strings using byte arrays and the world glommed onto it. Now we're stuck with a crappy "standard" that people should have scoffed at when it first showed its ugly face.


Personally, I'm tired of people bitching about C. At the time, the choice was C or assembly language for embedded/operating systems. There was no other choice in the 1970's. In fact, it wasn't even an option for most of the 1970's.

If you worked at a company and wanted a team of people to develop on a multi-user system, and port it to a single-user stand-alone system, you were out of luck. Our company sold test equipment based on the Data General minicomputers, and while DG had multi-user systems and single-user systems, they had no common programming language besides FORTRAN. It was so frustrating.

And then Digital came to us and wanted to buy a lot of systems, but it had to be running on a PDP-11. Trouble is, our test system was written in Data General assembly language. We had to re-write the system in a portable high-level language that could run on RSX-11 OS. But how?

We searched for a suitable programming language we could buy support for, and ended up using PASCAL - which was a P-code interpreter. The P-Code was portable across operating systems. So I "ported" an assembly-based system to Pascal, and was able to get equivalent runtime performance, because the DEC system had RAM-based overlays and the DG had disk-based overlays. Otherwise, the performance penalty of Pascal versus assembly would have made it unfeasible.

A few years later, C was commercially available. Oh I wish it was a choice that was available then. The rule of thumb was that C would run with 90% of the performance of assembly language. And that was before they made incredible strides in compiler technology. PL/1 would have been a disaster, assuming it could run at all on a 16-bit machine.


> Personally, I'm tired of people bitching about C.

The complaining gets old, but then again, memory leaks, overruns and underruns and other C footguns get old, too. C has been and still is a great tool, but there is some level of... maybe we can do better 40 years later? You appreciate C more if you've had to implement anything reasonably large in assembly (which clearly you have).

> In fact, it wasn't even an option for most of the 1970's.

I started programming professionally in the mid 80s. There really wasn't much better than C. Pascal, compiled BASIC (it wasn't quite the VisualBASIC era yet) and ancient stuff like COBOL, PL/1 and FORTRAN were the only other real options. The old languages had a lot of limitations baked in. Pascal was better, but there were huge limitations imposed by Pascal arrays and Pascal's type system that rendered it very difficult to use for many entire classes of applications (anything where dynamic allocation of blocks of memory was needed, so for something like I/O... or video... or text editing (255-character lines much?) or whatever I happened to be working on). It wasn't impossible to do big projects with Pascal, but it was a lot more work.


>The complaining gets old, but then again, memory leaks, overruns and underruns and other C footguns get old, too.

Agreed. C is an old language, but at the time it was a very good language. One can argue the choice nowadays, but comparing it to PL/1?

A quick search on Linkedin:

* 117 Jobs for PL/1 programmers

* 300,000+ jobs for C programmers


I'm not promoting the PL/I language, although I did do significant work with it back in the day, including the Prime operating system which had huge chunks written in a system-programming version of PL/I. PL/I likely never "made it" because it's a huge, bloated language that's hard to implement, compilers were scant, and they were expensive. For the curious, here are Prime's 2 system-programming subsets, the first, PL/P, is from 1978:

PLP: https://sysovl.info/pages/blobs/prime/pet/pe-t-483%20PLP%201...

SPL: https://sysovl.info/pages/blobs/prime/pet/pe-t-xxx%20SPL%20R...

I am promoting the concept of strings having a built-in capacity and current length, and the language compiler and runtime understanding that rather than trying to use a byte array as a string. Even the compiled BASIC I used in the late 70's had real strings like that.
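
As a rough illustration (a made-up struct, not PL/I's actual layout or any particular library), this is the kind of thing such a string carries with it, so the compiler and runtime can check every operation against both lengths:

    #include <stddef.h>

    /* hypothetical counted string: current length and allocated capacity
       travel with the bytes, instead of relying on a terminating zero */
    typedef struct {
        size_t len;    /* current length in bytes */
        size_t cap;    /* allocated capacity in bytes */
        char   data[]; /* flexible array member holding the contents */
    } counted_str;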


> but comparing it to PL/1?

I think the point was that a language designed when pterodactyls ruled the skies had a better string implementation than C. Regardless, C's not going anywhere. You still need something close to the hardware that has better ergonomics than assembly language to implement that new safe language that mythical safe OS will be written in :-)


> maybe we can do better 40 years later?

But we do. There are plenty of programming languages besides C.

Also, while we're at it, UNIX is now a good 50 years old and if anything it contributes as much to the problem of unsafe software as anything else out there, every driver has the potential to hose the entire system.


> Also, while we're at it, UNIX is now a good 50 years old and if anything it contributes as much to the problem of unsafe software as anything else out there, every driver has the potential to hose the entire system.

If Unix were a single OS and codebase this would hold water, but Linux isn't Unix, and real Unix comes in lots of flavors, each with its own set of issues. In any OS, save some microkernels, interfacing to hardware creates problems. Incidentally, insecure hardware is a universal problem.


OK - but it's not the 1970s any more. Modern hardware is many orders of magnitude faster than it used to be. (If you check the numbers it's not just a linear jump in clock rate of a thousand or so, but a multiplier of another 10 or 100 because of pipelining, faster memory, larger caches, and bigger word sizes.)

So why are we still using a language designed as a quick hack in the 70s and which is a dinosaur now?

Beyond that - why are we still using the ideas from that period without modernising them? Why are so many 1970s constraints and hacks baked into POSIX and OS features when modern issues - security, stability, consistency, reliability, multi-national localisation and support, and so on - should be taking precedence?


You're right of course that hardware has improved immensely, but I'm not sure what your point was. There are plenty of domains where performance is still of great importance, and C still has excellent performance.

I think the real point is that modern languages can significantly improve on the major issues with C, particularly its undefined behaviour and how that translates to real-world security issues, without significantly impacting performance. Rust (and in particular its Safe Rust subset) has been competing more with C++ than with C, but the point is still there.

I admit though that I don't have hard numbers on what would be the performance cost of writing an OS (for example) in Rust rather than C.


Zig is apparently not much more complicated than C. It's got the same focus on low-level programming and manual memory management as C does. Doesn't support operator overloading though, and probably never will. :(


I didn't mention Zig because as far as I know there's no Safe Zig subset, nor are there plans to develop one. Zig itself is an unsafe language. [0][1]

That's the nice thing about Safe Rust, it's a proper safe language akin to Java and JavaScript, while retaining high performance, plain old ahead-of-time compilation, and no garbage collector. Zig isn't playing the same game.

[0] https://www.scattered-thoughts.net/writing/how-safe-is-zig/

[1] https://news.ycombinator.com/item?id=26537693


Arguably, it's not the prevalence of the language per se that's the problem: it's that the C function call interface has become the de-facto interop language for all shared libraries, which also means that every other language (whether it's Ada, Pascal, Rust, Python, Ocaml or Haskell) has to support the dinosaur-age ideas or exist only in its own niche.


> OK - but it's not the 1970s any more

Sure, but that can be taken both ways. eg: "then stop griping about a language that was at the top of the heap in the 70's".


It should rather be "then stop using a language that was at the top of the heap in the 70's". The entire discussion around C happens because C is still actively being used.


> At the time, the choice was C or assembly language for embedded/operating systems. There was no other choice in the 1970's. In fact, it wasn't even an option for most of the 1970's.

Unix was written in C because Thompson and Ritchie had been working on Multics, which was written in PL/1 in the 1960s. So the idea of an OS written in a high level language was hardly obscure and had nothing to do with C. It's hard to say that C was much of an option in the 1970s anyway, as K&R wasn't even published until 1978.

There was a lovely (and also annoying) Cambrian explosion of languages and OSs in the 70s and even into the mid 80s or later. Computer companies often wrote their own languages and OSs, which made porting difficult (but porting wasn’t hugely common).


> Unix was written in C because Thompson and Ritchie had been working on Multics, which was written in PL/1 in the 1960s. So the idea of an OS written in a high level language was hardly obscure and had nothing to do with C.

OK, but at the time they started working on Unix, Multics had not yet been delivered. Nor was it clear that it would ever be delivered. So the idea that an OS could be successfully written in a high-level language was not yet proven.


The Burroughs system for the B5000 was written in Algol and preceded Multics.


Thanks, excellent example! I had not known about that one.


I think you’re trying to split hairs for some reason I can’t figure out.

In any case your assertion is not correct: Multics was operational around campus in 1969.


I’ve never played with Rust, but could Rust have been viable on early 70s hardware? Had Rust existed, could Unix have been originally written in Rust?


I will answer 'Yes' to this the moment we have a viable mainstream OS written in Rust. There are people working towards this so with some luck we will be able to see what the brave new world of a whole system running production software built in Rust looks like.


> PL/1 would have been a disaster, assuming it could run at all on a 16-bit machine.

There was a PL/1 compiler of sorts that IBM flogged on MS-/PC-DOS in the early days of the IBM PC. IBM didn't write it and only distributed it IIRC, but I can't recall if the one I used was written by Digital Research or Language Processors, Inc. (LPI). It was riddled with compromises and indeed, slow as death compiling. That's saying something when I was used to fiddling with paper tape and audio cassette by then; floppy disks were considered lightning fast by comparison, and the compiler bogged down that experience. So. Many. Floppy. Swaps.

It was sold under the value proposition that your mainframe programmers could prototype small bits of code on their PC's (even from home!!!), then when satisfied with the results they'd upload the polished source to the mainframe. I shudder to imagine what it took to make that USP a reality for real production code snippets.


Perhaps you're thinking of PL/M [0]? I had a brief encounter with it (as sold by Intel) targeting (of all amazing things) the 8051 microcontroller ISA.

And yes: So. Many. Floppy. Swaps. The codegen was not horrendous, but IIRC its price kept it well out of reach of non-commercial users.

[0] https://en.wikipedia.org/wiki/PL/M


Sure, C was better than anything else in the 70s. The question is why aren't we using something better 40 years later? Even if that is just a better version of C with a better libc, error handling, protection from overflows, etc.


You can if you want. Pick a board, a chip set, read the manual and get cracking.

The Intel manual is only what, 2200 pages?

I bet you could bootstrap an operating system, compiler, tool chain, and basic tools in a few years. And maybe in ten or twenty years you could have your new development environment up so you can start publishing software for all your new users.

I think the reason it sticks around is because of network effects like platform exclusivity, ecosystem, etc.


From experience: it takes about two years so you are quite on the money with your estimate. The hard part is to gain traction.


It shocked me to realize this, but the 70s are now a half century ago.


1983 (Star Wars Return of the Jedi, Michael Jackson Thriller and Billie Jean) is closer to WWII than to now...


Also, with minimal C++ you can fix almost all the issues with C, and people who are new to embedded overlook it all the time in favor of fancier things like Rust/Zig/whatever, which would force a complete change in thought process. Luckily I've been coding away with C++ as a better C for decades in embedded projects. It does require some knowledge of what's going on underneath classes, inheritance, basic templates, etc., but it's all very doable. I generally avoid RTTI and exceptions, for example.


> Personally, I'm tired of people bitching about C ... There was no other choice in the 1970's.

This in fact proves the opposite. People don't complain hard enough about how terrible C/C++ are.

1970: That is more than 50 years of BEING WRONG.


Not really. For a long time C was a perfectly legitimate choice. The 'wrongs' of C have only really come to light with widespread usage of the Internet and the much higher focus on security. Back in the day people working with computers weren't necessarily doing so to make a quick buck, hose your system or to try to see if they could do damage. The bulk of the people working with computers was trying hard to produce something useful instead of to deconstruct that which was already there in a malicious way.

The parasites came long after C.


Even then, you had better options (main one: Pascal).

P.S.: My main gripe is not about why C was made the way it was made. It's that it has STAYED like that until now. It should have been deprecating dangerous stuff long ago...


I've used both. C over Pascal any day, warts included.


Nope, Pascal was never a real competitor to C. I have programmed serious software using both languages and I would never pick Pascal over C.


Disagree. Pascal was no safer than C.


>That is more than 50 years of BEING WRONG.

"I'd like to write software for you, but I have to wait 20 years before there's a suitable language to be invented first."


Code written in C/C++ is running the world. Python/JavaScript/PHP/Java etc. are all thin layers on top of runtimes written in C/C++, running on operating systems written in C/C++, using drivers written in C/C++/Assembler. There is a reason for this. It didn’t happen by accident. C/C++ competed against thousands of alternative languages and won. We are only now starting to see any real competition (Rust maybe?) but there is a looooooong way to go before C/C++ doesn’t run the world.


It seems the world is a pile of shitty shortcuts.


I don't know about the world, but our brains sure are! xD

We rely on cognitive biases (shitty shortcuts) for as long as we can. In many cases, for longer than would be optimal.


At the time the choice made sense for C; today it is a vanishingly rare case that you want a language that doesn’t know how long its own arrays and strings are, or that doesn’t know what might be null.


In practice it's _worse_ than that because you probably don't want a "long", you probably want a particular size like a 64 bit integer. So you have to add ifdefs to call either strtol or strtoll depending on the size of "long" and "long long".

And if you are using base 16 then strtol will allow an optional "0x" prefix. So if you didn't want that you have to check for it manually.

Strtol also accepts leading whitespace so if you didn't want that you have to test manually for it.

Don't pass a zero base thinking it means base ten. This works almost all the time but misinterprets a leading zero to mean octal.
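
A quick sketch of those surprises with plain standard strtol (expected output in the comments):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        printf("%ld\n", strtol("010", NULL, 0));   /* 8: base 0 plus a leading zero means octal */
        printf("%ld\n", strtol("  42", NULL, 10)); /* 42: leading whitespace silently skipped */
        printf("%ld\n", strtol("0x2A", NULL, 16)); /* 42: "0x" prefix accepted with base 16 */
        return 0;
    }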

Good luck!


>> you probably don't want a "long", you probably want a particular size like a 64 bit integer. So you have to add ifdefs to call either strtol or strtoll depending on the size of "long" and "long long".

stdint.h (https://en.cppreference.com/w/c/types/integer) provides fixed-width integers in specific sizes. It became a standard in C99.
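
A small sketch of what that buys you (inttypes.h supplies the matching printf format macros):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        int32_t a = INT32_C(2147483647);  /* exactly 32 bits on every conforming platform */
        int64_t b = INT64_C(1) << 40;     /* exactly 64 bits, no ifdef on "long" vs "long long" */
        printf("%" PRId32 " %" PRId64 "\n", a, b);
        return 0;
    }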


Doesn't provide strto32 and strto64 though so you still need an ifdef.


Newsflash, the C language has conversions. You can assign a long or long long to your int32_t.


Great, and what happens when you accidentally assign something greater than 2^31 to an int32_t when using strtol? You won't benefit from a range error if LONG_MAX is 2^63, and now you have to make sure to handle any implementation defined behavior.


That's true. If the range you're interested in is not the same as the range of long or long long, you'll have to check the range yourself before you go on and use the value. No ifdefs required. If you're not happy doing it yourself, I do recommend strtonum or any of the alternatives that allow you to explicitly specify the range you're interested in. I don't see the point in littering the standard library with functions having hard-coded range for every range you might be interested in.
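
A rough sketch of that "check the range yourself" approach, parsing into an int32_t with no ifdefs (parse_i32 is a made-up helper, not a standard function):

    #include <errno.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* hypothetical helper: parse a base-10 int32_t, rejecting trailing junk
       and anything outside int32_t's range, whatever size long happens to be */
    static bool parse_i32(const char *s, int32_t *out)
    {
        char *end;
        errno = 0;
        long v = strtol(s, &end, 10);
        if (end == s || *end != '\0')
            return false;                 /* nothing parsed, or junk after the number */
        if (errno == ERANGE || v < INT32_MIN || v > INT32_MAX)
            return false;                 /* out of range for int32_t */
        *out = (int32_t)v;
        return true;
    }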


And get undefined behavior where the compiler can do anything it wants...


Unlike other cases of signed overflow, you actually don’t get UB on integer-to-integer conversions. You still have a bug in your code, though.


This is crazy


Welcome to the real world. At least it is understandable crazy. After all, it is in a library call, not part of the mental model of C.

To compare, I can't say much about the mental baggage you carry with JavaScript, the core language. Still, it sort of works. The world moved on. Good luck.


The #1 problem with C is buffer overflows. The solution is pretty simple:

https://www.digitalmars.com/articles/C-biggest-mistake.html

and does not break existing code.


For strings, you need more than the allocated length: you need the current length too. Otherwise you end up with:

- O(n) algorithms that are constantly scanning strings looking for a zero byte

- strings can't contain zero bytes

- strings have to contain one zero byte

- putting a zero byte in the middle of a string chops it off

- probably other nonsense I haven't thought of

Please don't adopt another half-assed solution just because it fits more easily into C's existing set of crap. That's how we got fake strings in the first place.


This solution has been in D for 20 years. It works very well. It is fully assed.

It is very, very rare to see a buffer overflow in D because the use of these arrays is so easy, convenient, and robust.

Not only does it virtually eliminate buffer overflows (when used), it is more efficient than 0 terminated strings. It does not need to scan the strings, nor does it need to load the string into memory to determine its length.

I understand your concerns about mixing it up with 0 terminated strings. They are real, but have not been a particular problem in practice. What happens is one simply moves away from using 0 terminated strings. A zero terminated string can be converted to a length one with:

    a = s[0 .. strlen(s)];
Going the other way requires a memory allocation similar to what strdup() does.


I read your article before posting, and it says:

-----

void foo(char a[..])

meaning an array is passed as a so-called “fat pointer”, i.e. a pair consisting of a pointer to the start of the array, and a size_t of the array dimension.

-----

I didn't see a "current length" mentioned. Is it there? Can I have a string with an allocated length of 20 bytes and a current length of 10 bytes, without looking for a zero byte?


The capacity value is not part of it.

This proposal is not about memory management any more than 0 termination is about memory management. It is just about finding the end of the array.


The CHERI extensions for the ARM architecture allow for compilers that achieve this effect by making all pointers "fat", with bounds, and doing pervasive hardware bounds checks. They've been playing with FPGA versions and emulators for a while, but the first actual SoCs just got shipped: https://www.theregister.com/2022/01/21/arm_morello_testing/ -- software for it includes tweaked versions of BSD and I think Linux which use bounds-checked pointers throughout, including in the kernel.


Making them all fat doesn't fix code that uses strlen().


Not quite sure what you mean. All it can do is turn undefined behavior on an out-of-bounds reference into a segfault, but if that ends up turning an RCE vulnerability into something less severe, it's still an improvement -- at least from that perspective. The intent is for the hardware-checked bounds to be very close to the declared bounds for arrays in source code, at least in cases most typically subject to buffer overflow; see https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-947.pdf

(And I'm not sure how the alternative fat-pointer proposal you mentioned does better. Searched that page for references to strlen, but didn't find much.)


Most arrays do not have declared bounds, that's why a runtime check is necessary.


Bounds for storage are known when it is allocated. CHERI puts those bounds into the pointers (which double in size so they can fit), and has the processor do the runtime checks at every dereference. So, same effect as the proposal you reference, AFAICS -- just without altered syntax.


I'm fairly new to C, am I understanding correctly that the new syntax is just sugar for a struct containing two values: a pointer to the start of the array, and its length? This can of course be done without the new syntax, and it seems exceedingly useful. Are such structs commonly used in C projects?

What is the actual source of bugs in this regard? How does passing the length as a separate parameter lead to more bugs than having it bundled -- is the main source of error passing the wrong variable?


If the length is bundled with the pointer as syntax then the compiler and maybe even the runtime can provide checking on behalf of the programmer. Passing it by hand means it's the programmer's responsibility to (remember to) validate.


> How does passing the length as a separate parameter lead to more bugs than having it bundled

It boils down to being inconvenient, unreliable, error prone, and difficult to audit. That's why it isn't used and C's #1 problem remains buffer overflows. And so it goes for all the other solutions for C for this problem, except my proposal.

My proposal is how D works, and it's been convenient, reliable, robust and auditable for 20 years. You can still use raw pointers in D, they are fully supported, but the use of the arrays make use of raw pointers rare.


>This can of course be done without the new syntax, and it seems exceedingly useful. Are such structs commonly used in C projects?

There are string libraries that work that way in C.

If you do it manually, without a lib, then you need to check that the length is valid yourself (after every operation), so it's not as useful as a language with first class support for it.


Presumably you can add an inlined array access function or macro in the header to do this for you.
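
Something along those lines, as a sketch (slice and slice_at are made-up names; this is the "do it by hand" version, not the proposed syntax):

    #include <stdio.h>
    #include <stdlib.h>

    /* hand-rolled fat pointer: the pointer and the length travel together */
    typedef struct {
        char  *ptr;
        size_t len;
    } slice;

    /* inlined, bounds-checked accessor: fail loudly instead of overflowing */
    static inline char slice_at(slice s, size_t i)
    {
        if (i >= s.len) {
            fprintf(stderr, "index %zu out of bounds (len %zu)\n", i, s.len);
            abort();
        }
        return s.ptr[i];
    }

It still relies on everyone remembering to go through slice_at, which is the convenience/discipline problem the thread keeps coming back to.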


If that worked, people would have done it, and C buffer overflows would be a thing of the past.


The reason this hasn't been done is mostly because C programmers have an allergy to runtime checks that might slow their programs by even single digit percentages.


People don't use these other schemes because they are clumsy, inconvenient, look bad, and have never caught on.

With it as part of the syntax, it becomes natural to use them. I'm not making this up, it is based on extensive experience.

The runtime overflow checks can be turned on and off with a compiler switch, so it becomes trivial to see what performance effect it actually has. Critical loops can be coded with ordinary pointers as necessary. For the rest, the performance effect is not measurable.

Again, this is from experience, not supposition.



Neat idea. What does implementing new syntax in one of the established C compilers involve? Is it the kind of thing that could be reasonably tackled in a small patch just to play with?


It wouldn't be hard. The semantics are straightforward and don't interfere with the way a C compiler already works.


> The #1 problem with C is buffer overflows.

There is also the null hypothesis that buffer overflows are simply a category of bugs that is easy to identify, and thus, apparently prevalent.


Would it be reasonable to assert that bugs which have such rare and uninteresting consequences that nobody has either noticed (not widely prevalent) or identified them (worth investigating because the consequences were severe), could be the number one problem?


The cold is one of the most common human infections, does that make it the most severe health problem facing humanity as well?


You started off suggesting there may be more serious problems in C which haven't been identified or are not prevalent. In this comparison we have identified car crashes, heart disease and lung cancer as health problems, and they are prevalent.


I've never heard of anyone dying of the cold. I've heard of many major security breaches caused by buffer overflows.


Buffer overflows aren't easy to identify. That's why they lurk for years in shipping software, and then there's a panic when they're discovered.



> C is probably the patriarch of the longest list of languages. Notable among these are C++, the D programming language, and most recently, Go. There are endless discussion threads on how to fix C, going back to the 80’s.

Why is Java missing in that list?


Maybe because Java is more of a direct descendant of C++ than pure C (though obviously C++ is an iteration of C).

Or maybe they omitted it because there are literally hundreds of languages that were inspired by C and listing them all would have been boring for the reader.


Java has a C++-like syntax to sell it to the C++ devs at the time; however, its major influence was Objective-C.

https://cs.gmu.edu/~sean/stuff/java-objc.html


Or so somebody said in an interview or thought they did, but otherwise Java is nothing like Objective-C with respect to messages (and of course wrt syntax, but that's not that important).


If you want to do dynamic dispatch, there are ways to achieve it via reflection and dynamic proxies.

Interfaces, dynamic code loading, JAR bundles, lightweight class type reflection, all trace back to Objective-C, or Smalltalk, if one wants to be pedantic.

In case you missed it, even JEE started as an Objective-C framework for the Spring distributed OS, Distributed Objects Everywhere.

https://en.m.wikipedia.org/wiki/Distributed_Objects_Everywhe...


That's my posting that was linked to.

Of course, Java doesn't have Objective-C-style message dispatch. And it's also true that Gosling's team originally considered, then rejected, C++ in favor of building Oak. But Oak borrowed an awful lot directly from Obj-C, and only later underwent a lot of syntactical surgery (turning into Java) in order to "look" like C++ specifically to attract C++ programmers, even though it didn't feel like C++ at all. This is pretty well documented.


Yet it didn't have any of the dynamic nature of Objective-C which made the attempt to replace Objective-C in OS X with Java a failure.


What?

There was no attempt to replace Objective-C with Java on OS X.

Apple was unsure if the strange look of Objective-C would ever appeal to the Object Pascal/C++ communities of Apple developers, thus they used the Java wave as plan B, in case Objective-C was rejected by them.

As this did not happen, there was no reason to keep plan B around.


> There was no attempt to replace Objective-C with Java on OS X.

I would say there was a heck of an attempt with the Java-Cocoa bridge that didn't do well because Java didn't have a lot of the dynamic nature of Objective-C. To my eyes as a developer, they certainly tried to push Java.


As you wish, Java-Cocoa bridge could have never replaced Objective-C, when writing Objective-C was still part of the game to actually use it.

Do you actually believe that Jobs liked Java, when Apple was created on top of Object Pascal and C++, and then he was responsible for bringing Brad Cox to NeXT?


Yeah, given the crap they were sending in the monthly discs to developers at the time, it certainly seemed like we should have taken Java seriously. I was rather annoyed, given I had learned Objective-C on NeXTSTEP. I'm glad that someone realized it was no substitute, but they did push it. Jobs hated it later, but he changed his mind on things fairly often, judging just by iPod features.


How could Java be a substitute when the Cocoa bridge only allowed for a subset of OS X frameworks to be called from it?

Java was already available on System 7.


Well, it sure didn't, but the messaging in those CD-ROMs was pretty obvious, and I suppose they would have expanded it if someone hadn't realized it was a lost cause. I didn't develop for Macs until Apple bought NeXT, so I don't know what was available for System 7.


Sometimes one understands what they think they want to understand.

Java was definitely not in the picture when Apple went to CERN doing their OS X marketing sessions.

In fact, you now made me dig out some stuff.

https://developer.apple.com/library/archive/documentation/Ja...

https://developer.apple.com/library/archive/documentation/Co...

> This document discusses issues that arise when writing Java applications with Cocoa, which is implemented in Objective-C.

No Java here on the OS X announcement:

https://youtu.be/SjlLG1EzJ2k?t=4450

The only message was Java being first party on OS X; in the System 7 days, the JVM was not from Apple but from a third party. Thus the announcement at JavaOne 2000.

https://www.javacoffeebreak.com/articles/javaone00/index.htm...

You will not find in those CDs anything like this:

> Swift is a successor to both the C and Objective-C languages. It includes low-level primitives such as types, flow control, and operators. It also provides object-oriented features such as classes, protocols, and generics, giving Cocoa and Cocoa Touch developers the performance and power they demand.

https://developer.apple.com/swift/


Why would you think Java is a descendant of C++? There may be overlap in some syntax, mainly from C. C++ is not, and was not, the only OOP language, and I have heard no such claim that it should be a descendant of C++.


I was once at a talk given by James Gosling. He said that Java-the-language was «a trick to get C++ programmers to use the Java Virtual Machine». He deliberately made Java very similar to C++ but removed what he saw as the hard and risky parts (memory management, operator overloading, etc) that are typically not required for standard applications.

Well, that was my interpretation of what he said, errors are my own etc. But this would make Java a direct descendant of C++, in my mind.


I have never heard that quote from James before. Are you sure about the JVM? The JVM was quite controversial back then, Java first had to prove that you could make a performant virtual machine.

But Guy Steele claimed "We were not out to win over the Lisp programmers; we were after the C++ programmers. We managed to drag a lot of them about halfway to Lisp."


Well, it's many years ago and memory corruption is real. I got the impression that their goal was to get adoption of (what at some point became) the JVM, or the «compile once run anywhere» vision. They envisioned many languages coexisting on the JVM, which kinda happened but maybe not as much as they thought. So they designed a language to get started, Java, and made it familiar-looking to get people on board.


I was around at the time.

Java's object semantics are explicitly intended as a streamlining of C++, the keywords are the same for the most part, and it was sold as a C++ which runs anywhere with no memory leaks.

Note that I mentioned the semantics: the object semantics of Java and C++ are so similar as to have corrupted the entire concept of objects in their favor.

This wasn't an accident, and it wasn't malice, it just feels like it sometimes.


I'm not a Java programmer but as far as I can tell Java object semantics, far from being corrupted, do indeed come from Simula via C++.

Thanks to reflection and a featureful VM, Java does have a significant amount of dynamic behaviour that can be used to implement a lot of features of the Smalltalk side of the OO family tree.


Because at the time it was pretty clear.

'At that stage, C and C++ "absolutely owned the universe"' - https://www.zdnet.com/article/programming-languages-java-fou...

They took a lot of inspiration from C/C++'s syntax and seemed to be pretty concerned with improving memory management, security and developer velocity.


Another programming language being popular by no means means that it is a derived language of any sort. Any development is of course retrospective, but it is sorta like saying all music is descended from pop.


I understand the point you’re trying to make but writing music is a creative process whereas marketing programming languages isn’t.

I was around at the time and C++ was trendy so Sun were marketing it as the future for C++ developers. It was definitely influenced by what was in vogue at the time even if it doesn’t adopt all of the traits of C++.

I remember this because I wasn’t a fan of C++ back then as I’d come from the ALGOL family of languages so found C-style syntax a little alien (and tbh I still don’t like C++ now even though I’ve since warmed to C’s syntax) so it took me years before I warmed to Java.


Define "derived".

In particular, if Java kept (almost?) all the keywords, and the operators, and the statement terminators, and the block delimiters, and the same approach to object-oriented... how is it not derived from C++?


>C++ is not, and was not, the only OOP language

No, but it was the only one that mattered at the time, as far as adoption was concerned, and as far as marketing Java as something familiar to existing programmers and their managers went...

That's also how it was hyped at the time and the kind of people it was sold to (I was -barely- there).


I suspect it’s because manual memory management in Java isn’t built into the language. Is it even possible? I’m not a Java programmer and I don’t know. My understanding has been that the runtime doesn’t expose the memory model to you.


In a way it does. Java just likes to push most features to methods on special objects, instead of exposing them as native functionality (to avoid backwards incompatible changes).

So it would look something like MemorySegment.allocateNative(100, someScope). This new API has a runtime ownership model, so by default only a single thread can access this memory address, and it can be freed at will.


Seriously, the biggest gripe about C is the design of standard library?

Not the pervasive undefined behaviour and compilers that become more aggressive every release about breaking previously-working code?

Not the reams of code that assume sizes of integers and signedness of char?

Not the wild build process that makes it awfully hard to actually build anything that has any dependencies whatsoever.

strtol. Damn, what a nuisance!


It says (part 1) in the headline, even on HN!

As a place to start discussing why a successor systems language is necessary, comparing string parsing across Rust, Zig, and C? Pretty good place to start, because the problems it introduces are pervasive as the discussion continues.

Magic error values that get globally mutated? Check. Pointers which are either null or exist so you can do arithmetic to deduce a byte length? Check. An almost aggressive disdain for handling sum types of even the simplest sort? Check.

Turns out Zig and Rust are whupping the ol' llama on undefined behavior and build processes as well, not to mention memory safety. If only this author had indicated that they might continue writing on the subject...


Where does the article say that those other things are not also big problems? In fact, it specifically says strtol is one example of the things wrong with C. This comment seems needlessly dismissive.


Why complain about a function though? You can just write your own that does exactly what you want it to do. I don't see how this is a flaw in the language. You could at least mention things like the loss of size information when passing arrays between scopes; that is an annoyance that can be considered a problem in the language.


Read the last section. None of the real problems of C are even mentioned.


You do not consider it a sign of C's problems in the modern world that so many of its core functions (atoi, atol, atoll, atof, gets, strcat, strcpy, sprintf, etc.) are unsafe and yet still are out there in production and teaching?


The cumbersomeness of these functions is so trivial compared to the real problems of C that it's just not worth mentioning them.

Even if these functions disappeared overnight and C acquired a standard library redone using the knowledge accumulated over the years, it would not make a difference:

- C is a disaster to write any amount of code in due to the definition of undefined behaviour in the specification.

Not only are compilers free to do anything they want with code that exhibits undefined behaviour, they are unable even to detect undefined behaviour in the code, and hence the only way to write reliable code is to freeze your compiler, toolchain and target OSes: any minor change in any of the components (including OS headers) may completely ruin your program.

Ask Linux folks who were bitten by it many times, even though they do not have to deal with garbage in vendors' OS headers.

- C is a disaster due to its compilation model. Preprocessor defines wreak havoc on the source code, and it is a backbreaking job to maintain cross-compilability of any large-ish codebase. Forget about trying to cross-compile the whole software with dependencies: this is a full-time job by itself.

- C does not specify sizes of short, int, long and signedness of char. This means any codebase that ever touches these types (and there are OS interfaces that are expressed in them!) is inherently non-portable: every new target means combing over the whole codebase and checking breakages in all arithmetic operations.

There are other underspecified pieces in spec, but this one is just the most salient one.

> in production and teaching

If a teaching resource does not state "C spec has undefined behaviour, and you will have a very bad time if you don't know about it", then it's utter garbage.


> Cumbersomeness of these functions is so trivial compared to the problems of C that are real that it's just not worth mentioning them.

I fully agree! C has many problems, and libc sucking is but one of them (and rather easily worked around). That said..

> Ask Linux folks who were bitten by it many times

A three-decade old project with thousands of developers and millions of lines of code (and heavy reliance on platform & implementation specifics in certain parts) will inevitably at some point along its life get bitten by the rough edges of whatever technology they settle on. Coverage of the few issues they've hit is wildly blown out of proportion, probably precisely because it's so rare that people find these things surprising when it bites them and thus it makes the news.

If Rust gets well adopted in the kernel (as it might), I guarantee that in 30 years, they will have hit its rough edges.

Now if you actually ask Linux folks, they will tell you to flip on -fwrapv and -fno-delete-null-pointer-checks and move on with your life, because whining about these old and solved issues is not productive use of anyone's time.

> C is a disaster due to its compilation model. Preprocessor defines wreak havoc on the source code

Did you ask the Linux folks? They make pretty good use of the preprocessor.

> it is a backbreaking job to maintain cross-compilability of any large-ish codebase. Forget about trying to cross-compile the whole software with dependencies: this is a full-time job by itself

Did you ask the Linux folk? It's hilarious that you mention Linux folk, given that it is one of the most frequently cross-compiled code bases on earth (along with much of the Linux userspace). Btw, I cross compile Linux and various applications regularly at work. In fact, I compile (and maintain) an entire distro with custom kernels for different devices & architectures. And that's not a full-time job. Most of my work is application development, with some driver development now and then.

> - C does not specify sizes of short, int, long and signedness of char.

But it does specify their minimum sizes, which is often all you need. Signedness of char? It sucks, and it is not a big deal. The signed and unsigned keywords exist, btw, if you need a specific sign. If that's too much typing for you, I can sell you typedef. You could ask the Linux folk for advice; they have a few typedefs that seem very popular now.

> This means any codebase that ever touches these types (and there are OS interfaces that are expressed in them!) is inherently non-portable: every new target means combing over the whole codebase and checking breakages in all arithmetic operations.

That's not true at all. Minimum sizes are guaranteed, and there are times where using types with implementation defined size is exactly the thing you need because you're dealing with quantities that are inherently related to platform specific ranges. Using these types makes your code more portable, not less portable. I recommend you go ask the Linux folk, or take a look at their source code, which conveniently runs across quite a few platforms.


"None of the real problems of C are even mentioned."

In your opinion. It may surprise you to know, others might have differing opinions.

At least from how I read it, I completely agree with the post. C makes it hard for programmers to write safe code in general and the author was pointing out one example of this behaviour and what causes it.


It feels valid to me. Redis might be a good example. They had to write their own string library (sds) to make a cache server.


The standard library is part of the language. It's also how the language is used, both because the language design encourages it and because people have a tendency to copy how the standard library does things, since you learn to do things the way the libraries you use do them.


The build process of C is one of its absolute benefits. Each unit compiles on its own, producing an object file. The fact that people have now started to make header-only libraries makes the story even better! Each function gets a name. No namespaces, classes, scopes, modules, w/e. You can even just declare a function as extern at compile time!


The fact that people have to resort to header only libraries is a sign of how bad building portable C libraries really is.


C compilation is not that bad; what makes it atrocious is the preprocessing step.

Show me the large-ish (100K+ LOC) codebase with dependencies that can be cross-compiled, does not come with tons of cruft like autoconf or Meson, and does not require installing reams of software on the host as "libraries", and then we are talking.

(edited: typo)


While a programming language and its ecosystem include some of the culture, bad code and project structure IMO should not be blamed on C. Modern C projects are a breeze.


> Modern C projects are a breeze

So, where's the link to a modern C project that is a breeze to work with? Requirements are in the parent comment.


Can’t tell if this is sarcasm or not.


From the article:

  char *forty_two_bee = "42b";
  char *end;
  errno = 0; // remember errno?

  long i = strtol(forty_two_bee, &end, 10);
> This will return 0

No, this will return 42. strtol() parses greedily until a character cannot be parsed, but then it returns the conversion of what it did parse.

I guess the fact the author got this wrong... kind of proves their point that strtol()'s API is not great?

On the other hand, while the article purports to criticize a language, it then proceeds to only cover its standard library. Sure, C's stdlib is old-fashioned, but there are many things in C that are much worse than its standard library! (And I say that as someone who still likes the language.)
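
For what it's worth, here is roughly how that behaviour looks in practice with the article's input, including the end-pointer check the article skips (a small standalone sketch):

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *forty_two_bee = "42b";
        char *end;
        errno = 0;
        long i = strtol(forty_two_bee, &end, 10);
        /* prints: i=42, stopped at "b" -- strtol parses greedily, then stops */
        printf("i=%ld, stopped at \"%s\"\n", i, end);
        if (*end != '\0')
            fprintf(stderr, "trailing characters: not a pure number\n");
        return 0;
    }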


Author mentions four increasingly obscure C replacements (first I've heard of Odin) without mentioning that the creators of the original C and Unix went on to make Go.

Go does not have manual memory management. Despite (actually because of) that, it captures the spirit and design goal of the original C beautifully. It's a minimalist systems programming language.

One of the amazing things about Go is the standard library-- the thing he complains about with C. The Go standard library is incredibly readable. It's night and day from C/C++, where opening glibc/STL etc. is an assault on the senses.



What a weird post. The examples from Rust and Zig don’t fail gracefully, so they can’t be considered complete. Panicking on bad user input is bad code, too. And the main complaint seems to be that the C stdlib could be improved. But where it has been improved, the author complains that it’s really just doing the ugly stuff under the hood. What does the author think the Rust stdlib function is doing exactly?


Either I'm going mad - in which case please set me straight - or the Rust example doesn't even compile: had to remove the odd-looking borrows on the method calls, and replace the type annotation in the final 'if let' with a turbofish on the call.


Nope, you're right, I need to apply the following diff (wrapped in a fn main() {}) to avoid rustc complaining:

    # diff -up test.rs test.fixed.rs 
    --- test.rs 2022-01-22 16:03:57.302742242 +0100
    +++ test.fixed.rs 2022-01-22 16:03:27.766250377 +0100
    @@ -2,13 +2,13 @@ fn main() {
         // pretend that this was passed in on the command line
         let my_number_string = String::from("42");
         // If we just want to bubble up errors
    -    let my_number: u8 = &my_number_string.parse()?;
    +    let my_number: u8 = my_number_string.parse()?;
         assert_eq!(my_number, 42);
         // If we might like to panic!
    -    let my_number: u8 = &my_number_string.parse().unwrap();
    +    let my_number: u8 = my_number_string.parse().unwrap();
         assert_eq!(my_number, 42);
         // If we're a good Rustacean and check for errors before trying to use the data
    -    if let Ok(my_number: u8) = &my_number_string.parse() {
    +    if let Ok(my_number) = my_number_string.parse::<u8>() {
             assert_eq!(my_number, 42);
         }
     }


One of the best things to happen to C++ discussions was people starting to write godbolt links for their code. Immediately the code being discussed becomes code somebody actually compiled and maybe tried running and not just "oh, ignore the fact it's syntactically invalid - you know what I meant". No we don't.

You can obviously write a godbolt link for Rust too, but Rust's playground is also a reasonable choice.

I think that code maybe makes more sense with a turbofish for each parse call and type inference, but maybe that's just a matter of taste. If the author had used godbolt or playground or whatever they'd have written code that compiles and we'd not be guessing.


> It exists because it became part of the POSIX standard way back when a pdp7 was an advanced computer…

The PDP-7 was long obsolete by the time the POSIX effort started. By then the most common Unix host was a VAX (32 bits), though it, or Unix-alikes, ran on a variety of 16 and 32 bit machines, hence a desire for standardization.


One of C's design principles is to be fast at the cost of safety, just like an F1 car. It will let you make fast mistakes.

You drove a Corolla in college, then got a job and drove a cool BMW for several years and now you think you're hot shit, so you hop in an F1 car and not only does it take forever to learn how to drive it, it has to be driven on a special track and the gearbox is different, what a nuisance!

"If only we could add 4 doors, automatic transmission, snow tires, and a trunk to put our stuff in, people won't keep getting into accidents with this car", you say. Right, but then it becomes a BMW. If you want real speed, you need to first go slow and master the car because otherwise you'll crash and burn.

C is messy because real world hardware is very messy. You can't push bytes through the hardware at its speed limit without getting your hands dirty, and we all come out into the real world wearing "class Dog extends Animal" white gloves.

To use C effectively, you should not be coding in C in your mind. You should be thinking in assembly, but your fingers should be typing C code. It's not safe, but if you want to reach 230MPH and accelerate to 60MPH in 2.6 seconds, you better know exactly what you're doing when you hop behind the wheel of that car. It's not for the weak.


> C is messy because real world hardware is very messy

Ada was designed specifically for embedded systems and has guards against many of the pitfalls in C. Still, it provides easy access to in-depth low-level control when you need it (assembly, intrinsics, binding variables to specific memory locations, importing C, creating your own custom allocators). The difference is that you write intent, and then paint additional control on top of that. This makes Ada also suitable for higher level applications.


I think the brittleness of C's string handling functions is not a necessary consequence of anything you said. It's just sloppiness and inertia.


I like that analogy. You're saying that C should only be used in competitions, and not be allowed in the real world, right?


I don't see a problem with using C in the real world, but if you're going to attempt to race on the highway and you don't know how to steer clear of potholes, don't go blaming the car when the wheels fly off. The car requires you to know how to drive at high speeds and a lot of people don't know how to, so instead of being honest with themselves, they look around and conclude that it must be the car's fault, because this many people couldn't possibly be that bad at racing.

It's possible to become a better driver to handle the F1 car, just like it's possible to arrive at the same destination driving a Corolla, just 2 minutes later. If you want the speed though, you have to put in the effort.


You're right, we need to draw a distinction between the Real Programmers and the Quiche Eaters. A mere Java or Python user just isn't good enough, they can't write portable assembly like a Real Programmer can.


Absolutely not. Once a user reads "Head First Java" or customizes Django sites, they get their standard issue keyboard and they're ready to start writing interrupt handlers in C. If the code crashes, it must be the language.


The F1 analogy is easy to use against this line of argumentation. Today's F1 cars are way faster than their predecessors. They are also safer, more automated, and in large part faster because they are easier to drive. The racing is more boring, and cars are uglier, but those are different topics.

The idea that you can't maintain the runtime performance of C while innately supporting automated reasoning about invariants/safety just doesn't hold up. The idea is to move the whole Pareto front outward - that's what advancements in theory and technology do.


They came, they saw, and they went away, and C is still the smallest, fastest and most portable language.

I think the only way to dethrone C is to change the equation of what's expensive to do in hardware - accessing memory, and that's not a problem us software guys are going to solve.


I upvoted this because hey, no lies detected.

The problem is that this particular design principle of C is ready for a comfortable retirement in a beach community.

The machismo is probably why you're getting dragged a bit, but the bottom line is that being intimate with the hardware is orthogonal to pointlessly segfaulting. C does both, Zig is aiming for one of these things and I'll let you guess which.


A segfault lets the user know that the developer made a mistake, and where in the code it happened. Blaming C for segfaults is blaming the tool.

C is a small language with a spec designed to adapt to new hardware while remaining fast. The spec is ambiguous in precisely the places where resolving the ambiguity would mean either limiting its portability or its speed. This increases the learning curve significantly and also requires diligence on the part of the developer, so it's high effort to write.

It's a perfectionist's language, because, if you can steer clear of the known pitfalls, you get a working piece of software that's maximally portable and fast, and fast is still what we want our tools to be.

There is a place for Zig, and Nim and Rust in this world, but there is no world in which these tools make the same trade-offs as C and end up with a faster and more portable (across hardware) language.

They can sacrifice speed to make it more difficult for the developer to make mistakes. They can sacrifice portability to make assumptions that resolve undefined behavior, which would also decrease the burden on the developer, but they will never get all three - correctness, portability and speed, so in that sense, they will never replace C, they can only hope to starve C of developers.


I work with power tools when I have to. A table saw is dangerous, and I won't refuse to use one on that basis. I wouldn't blame a table saw for cutting someone's thumb off.

One of these days I'll have a big project space though, and I'll put a table saw in. That table saw will be one of the fancy ones which destroys blades instead of digits, when the two come into conflict.

> There is a place for Zig [...] but there is no world in which [this tool makes] the same trade-offs as C and end up with a faster and more portable (across hardware) language.

This isn't the bar it needs to clear. It needs to be as fast and as portable. C can be the fastest possible language, and Zig could be exactly as fast (with LLVM, say), and still be a language I would prefer because of comptime and some design choices which make it harder for me to lose a digit.


> That table saw will be one of the fancy ones which destroys blades instead of digits, when the two come into conflict.

SawStop. You can expect suddenly a lot of tool manufacturers who would have assured you ten years ago that this technology is either dangerous or compromises the saw's usefulness, will over the next ten years offer substantially the same features as the first patents run out.


> segfault lets the user know that the developer made a mistake, and where in the code it happened. Blaming C for segfaults is blaming the tool.

You can't even assume that most things will segfault though; with UB, you're lucky if it segfaults, since it's more noticeable and easier to debug! But there's no guarantee it will do that when you mess up.

> There is a place for Zig, and Nim and Rust in this world, but there is no world in which these tools make the same trade-offs as C and end up with a faster and more portable (across hardware) language.

I don't think anyone is arguing that a language would be faster or more portable, just that one could be written that's equally fast and portable enough to be useful for most things. I'd be happy to let C remain dominant on specialized hardware if it means that the OS for my laptop, desktop, and phone can be written in something safer and as fast.


I’m not aware of any case in which unsafe Rust has any overhead over C. The advantage of Rust, then, is that you can restrict your use of `unsafe` to places where you actually care about things like the overhead of bounds checking.


I think your breakdown of a language is a neat idea, decomposing implementations by their ‘scores’ in the three areas of correctness, portability, and speed. I think I’d like to replace speed with a performance score encompassing both speed and memory footprint, though. I also agree that achieving high ‘scores’ in all three areas is a relative impossibility.

For me, the best language is going to be the one that has a maximum in the performance area and is provably (at least to some reasonable measure) correct. I think portability between execution environments can be a loss for the types of things I enjoy programming.


There's nothing fast about zero-terminated strings. In fact, many operations on them are much slower than sane alternatives, because they first have to scan the entire string to compute its length. You can't even create a temporary substring without either modifying or copying part of the original string. How lame is that? Zero-terminated strings are almost never the best solution, so why are they the language-supported default?
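
For contrast, a minimal sketch of the usual alternative, a pointer-plus-length view (the `strview` type and helpers here are made up for illustration, not from any particular library):

    #include <stddef.h>
    #include <string.h>

    /* hypothetical pointer + length "view" of a string */
    struct strview { const char *ptr; size_t len; };

    /* scan once at the boundary; after that, length is O(1) forever */
    static struct strview sv_from_cstr(const char *s) {
        return (struct strview){ s, strlen(s) };
    }

    /* substring without copying or modifying the original
       (caller must ensure start + len <= s.len) */
    static struct strview sv_slice(struct strview s, size_t start, size_t len) {
        return (struct strview){ s.ptr + start, len };
    }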

> You should be thinking in assembly, [...]

Well, then you shouldn't be typing in C, because Undefined Behavior coupled with modern C compilers will make sure that what you get is not what you thought. *cough* signed integer overflow *cough*
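
A minimal sketch of the kind of thing this refers to (behavior depends on the compiler and flags, e.g. -fwrapv changes it):

    /* Intended as an overflow check, but signed overflow is undefined
       behavior in C, so an optimizer is allowed to assume it never
       happens and reduce this to `return 0;` (typical compilers do at -O2). */
    int will_overflow(int x) {
        return x + 1 < x;
    }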

> You can't push bytes through the hardware at its speed limit without getting your hands dirty

Rust proves you wrong (maybe some other languages, too, but I don't know them as well)


What you're missing is the difference between known issues and unknown issues. You're looking at a language that's been heavily used for 60 years and accumulated a long list of known issues and things not to do, that powers pretty much everything in computers, and you're comparing that with the new kid on the block with a vocal fanbase.

You could invest your time into learning that finite list, or you could invest your time into learning a new language with a long list of _unknown_ issues yet to be discovered - but out of sight, out of mind, right?

As far as runtime speed goes, assuming equal instructions being generated, if Rust spends even one CPU cycle checking array lengths, its generated code will be slower than C's, by definition. You can justify the trade-off ("it checks array lengths for me because I am human and I forget sometimes") or relax the restrictions ("it's not humanly noticeable"), but you can't claim it runs faster or even just as fast, because it's not.

The only thing Rust proved to me is that there was a whole generation of developers who didn't mind writing unreadable Perl code, whose kids are equally unaware of how unreadable Rust code is, and it'll take a few decades for them to see that, assuming that Rust stays relevant for another decade.


> You could invest your time into learning that finite list, or you could invest your time into learning a new language with a long list of _unknown_ issues yet to be discovered

That's a bad argument, because it could be used against any change or improvement. By that logic, humans should have never even come down from the trees.

> if Rust spends even one CPU cycle checking array lengths

That's the thing: Almost all checks and guarantees which make Rust safer than C are done at compile time and have no negative effect on the generated code.


That's not a bad argument; it's a statement of fact. I'm using it to point out that using Rust carries risk, whether you realize it or not. Just because you've accepted the risk doesn't mean it is a universally good decision and C is now bad. Maybe coming down from a tree pays off, maybe you get eaten by a jaguar.

People who don't put in the effort to really learn their tools need tools with training wheels. It's perfectly fine for a language to put in checks to protect you against yourself and be "fast enough for practical purposes", just don't confuse "almost fast" with "always fast". Rust programs have to pay the price for runtime checks because Rust doesn't trust you to know what you're doing.


The first paragraph of this is weird coming from someone who claims to think in assembler and type C. Do you realize, or not, that Rust and Zig use LLVM for release code (Rust uses it for everything)? What are these risks you refer to, looking funny?

> Rust programs have to pay the price for runtime checks because Rust doesn't trust you to know what you're doing.

Buddy, you talk a lot of game about knowing your tools. Don't say obviously ignorant stuff about other people's tools; it makes me think you're bluffing about C.


To answer your question, I'm well aware of the backends used, but using LLVM doesn't mean that the same IR or assembler gets generated. Enjoy the rest of this weekend.


There is nothing about the design of strtol that makes it particularly fast. If anything, the extra checks and accesses to errno (which on modern systems is generally an implicit function call) that are required to use strtol correctly represent unnecessary overhead, though only a trivial amount of it. But mostly it’s just an awkward API design.


C was not designed to be fast. It was designed to be a bit simpler and a whole lot more portable than assembly code. The speed is a byproduct of how it does not try to do anything other than basically mapping perfectly to the hardware.


This would be more justifiable if C had support for vector instructions, which are crucial for high-performance code on modern CPUs.
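
True as far as ISO C goes; in practice you reach for non-standard extensions or intrinsics. A tiny sketch using the GCC/Clang vector extension (not standard C):

    /* GCC/Clang extension: a vector of four floats */
    typedef float v4f __attribute__((vector_size(16)));

    static v4f add4(v4f a, v4f b) {
        return a + b;   /* typically lowered to a single SIMD add where available */
    }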


Wait, is this article saying that there is no good/obvious/standard function to parse a string into a number, one that has the two obvious outputs of such a function (the number, and a bool or error code)?

Even a person in the 60s would realize that that’s the api for conversion from a string to a number (or any conversion that might fail)! What happened? Why do these functions even exist?


This isn't the usual way this is coded:

  char *one = "one";
  char *end;
  errno = 0; // remember errno?
  long i = strtol(one, &end, 10);
  if (errno != 0) {
      perror("Error parsing integer from string: ");
  } else if (i == 0 && end == one) {
      fprintf(stderr, "Error: invalid input: %s\n", one);
  } else if (i == 0 && *end != '\0') {
      f__kMeGently(with_a_chainsaw); 
  }
It's actually like this:

  errno = 0;

  long i = strtol(input, &end, 10);

  if (end == input) {
    // no digits were found
  } else if (*end != 0 && no_ignore_trailing_junk) {
    // unwanted trailing junk
  } else if ((i == LONG_MIN || i == LONG_MAX) && errno != 0) {
    // overflow case
  } else {
    // good!
  }
errno only needs to be checked in the LONG_MIN or LONG_MAX case. These cases are ambiguous: LONG_MIN and LONG_MAX are valid values of type long, and they are also used for reporting an underflow or overflow. Therefore errno is reset to zero first. Otherwise, what if errno already contains a nonzero value, and LONG_MAX happens to be a valid, non-overflowing return value?

Anyway, you cannot get away from handling these cases no matter how you implement integer scanning; they are inherent to the problem.

It's not strtol's fault that the string could be empty, or that it could have a valid number followed by junk.

Overflows stem from the use of a fixed-width integer. But even if you use bignums, and parse them from a stream (e.g. network), you may need to set a cutoff: what if a malicious user feeds you an endless stream of digits?
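
E.g. a sketch of the kind of cutoff you end up writing anyway (the 64-digit limit is arbitrary; assumes bufsz > 0):

    #include <ctype.h>
    #include <stdio.h>

    enum { MAX_DIGITS = 64 };   /* arbitrary cap on accepted digits */

    /* read a run of digits from a stream, refusing absurdly long numbers */
    static int read_digits(FILE *in, char *buf, size_t bufsz) {
        size_t n = 0;
        int c;
        while ((c = fgetc(in)) != EOF && isdigit(c)) {
            if (n >= MAX_DIGITS || n + 1 >= bufsz)
                return -1;              /* too many digits: reject */
            buf[n++] = (char)c;
        }
        if (c != EOF)
            ungetc(c, in);              /* push back the first non-digit */
        buf[n] = '\0';
        return (int)n;
    }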

The bit with errno is a bit silly, given that the function has enough parameters that errno could have been dispensed with. We could write a function which is invoked exactly like strtoul, but which, in the overflow case, sets the *end pointer to NULL:

  // no assignment to errno before strtol

  long i = my_strtoul(input, &end, 10);

  if (end == 0) {
    // underflow or overflow, indicated by LONG_MIN or LONG_MAX value
  } else if (end == input) {
    // no digits were found
  } else if (*end != 0 && no_ignore_trailing_junk) {
    // unwanted trailing junk, but i is good
  } else {
    // no trailing junk, value in i
  }
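
A sketch of that wrapper, built on strtol here so the LONG_MIN/LONG_MAX comments above still apply (the name is the hypothetical one from this comment; assumes end is non-NULL):

  #include <errno.h>
  #include <stdlib.h>

  /* like strtol, but reports under/overflow by setting *end to NULL,
     so the caller never has to touch errno */
  static long my_strtoul(const char *s, char **end, int base) {
      errno = 0;
      long v = strtol(s, end, base);
      if (errno == ERANGE)
          *end = NULL;
      return v;
  }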
errno is a pig; under multiple threads, it has to access a thread-local value. E.g.:

  #define errno (*__thread_specific_errno_location())
The designer of strtoul likely didn't do this because of the overriding requirement that the end pointer is advanced past whatever the function was able to recognize as a number, no matter what. This lets the programmer write a tokenizer which can diagnose the overflow error and then keep going with the next token.


Sure, you can't get away from handling the cases, but as the article clearly demonstrates, there can be a much better interface for it.


> Sure, you can't get away from handling the cases, but as the article clearly demonstrates, there can be a much better interface for it.

It's a very apples to oranges comparison, to the point that it almost feels like a straw man. "Interface (that does X) sucks for doing Y; look at how easy the Rust interface for doing Y is!"

Yes, there can be a much simpler interface for the case when you want to assert that a string is nothing but digits and must fully convert. That's not what strtol is for though.

Now I think libc sucks (no surprise given its age; complaining about it is beating a dead horse), and it sucks more if you don't take various GNU & BSD extensions with it, but I'm kinda getting tired of people complaining that "foo in C is hard" when their unstated requirement is that they can't use any libraries to help them do it. Like this fellow the other day: https://news.ycombinator.com/item?id=29990897

If you look at programs written in "modern" languages, they almost invariably bring a plethora of libraries and dependencies with them anyway so why is C repeatedly judged on the merits of ancient libc interfaces that you don't have to use?


IMO, external libraries are for domain-specific tasks. If something is needed in pretty much every program, it should be a part of the language or the standard library.

Also, it's much easier to use external libraries in other languages. npm install, cargo install, nimble install, cabal install, gem install, …


> If something is needed in pretty much every program, it should be a part of the language or the standard library.

It sure would be convenient that way. That said, you don't need to convert strings in pretty much every program. There's a lot of C code out there that does very little with strings.

Now do you dismiss an entire language if its standard library is lacking or doesn't exist? IMO that would be throwing out the baby with the bathwater.

> npm install, cargo install, nimble install, cabal install, gem install, …

Yes, I've witnessed the mountain of unaudited dependencies that somehow turn a 300 line program into something the size of my kernel.. should I dismiss all those languages because people do something I don't like with their libraries?


>Now do you dismiss an entire language if its standard library is lacking or doesn't exist?

As anything much more than a toy, yes. If there's no standard library at all (or nearly so), the language ecosystem is quite likely to end up a complete mess of incompatible implementations of even the most basic functionality, which is a waste of everyone's time to deal with.


I wonder how your programs do I/O if not with strings. Reading numbers from STDIN is the next thing after Hello World.

As another comment pointed out, C has many flaws unrelated to its standard library. Also check out https://eev.ee/blog/2016/12/01/lets-stop-copying-c/.

You know what's the main cause of dependency hell? Needing a library for every basic thing. Notice that mountains of dependencies are much less common in “batteries-included” languages.


> I wonder how your programs do I/O if not with strings.

There's this one weird trick we call binary. Let me give you an example of how I did I/O yesterday:

    static void usb_tx(struct usb_ep *ep, const void *data, uint len) {
      if (len) memcpy(ep->buf, data, len);
      *ep->bufctl = BC_FULL | ep->datax << BC_DATAX_S | BC_AVAIL | len;
      ep->datax ^= 1;
    }
Usage example:

    struct kb_report r = {.m={.id=KB_ID_M, .x=-a[1], .y=-a[0]}};
    usb_tx(KB_IN, &r.m, sizeof r.m);
stdin does not exist in this program.

> As another comment pointed out, C has many flaws unrelated to its standard library.

Yes it does, but this thread has already become a tangent of a tangent. Let's not turn it into a general diatribe against C, as opposed to a discussion about the library interface that TFA takes issue with.

> You know what's the main cause of dependency hell? Needing a library for every basic thing. Notice that mountains of dependencies are much less common in “batteries-included” languages.

In theory, yes. Like I said, libc sucks, and I would love to have a better standard (or de-facto standard) library. But anecdotally C programs are not very prone to dependency bloat, perhaps precisely thanks to the fact that C doesn't have a de-facto package manager that allows you to just install a bunch of crap.

Anecdotally, "batteries included" languages are still prone to dependency bloat if there's a package manager. This includes recent experience with Python (I can't remember the last time I had to lay my hands on a python project that didn't need a bunch of things to be installed with pip) and somewhat less recently with Perl (isn't cpan pretty much the grandfather of "oh there's a library for that"?).

Hilariously, my recent experience has people using Python and depending on Python libraries which then depend on C and C++ libraries in order to implement the same things that I'm doing in plain C with no dependencies.

But I'll conclude my participation in this subthread with this message because it's gone too far off the rails into a pointless language flame war.


> isn't cpan pretty much the grandfather of "oh there's a library for that"?

The TeX CTAN in 1992 [1] was clearly the inspiration for CPAN a year or three later [2] (in both name & thing). So, maybe CTAN is the great grandfather? :-) { My intent is only to inform, not be disputatious. I know you said "pretty much". }

To be fair, C has an ecosystem. OS package managers/installers are a thing. There is surely a list of well more than one "core libs/programs" (terminfo/curses/some text editor/compilers/etc.) that would be in most "bare bones" OS installs upon which you could develop. One certainly depends upon OS kernels and device drivers. IMO, at least one mistake "language" package managers make is poor integration with OS package managers. Any way you cut it, it is hard to write a program without depending upon a lot of code. Yes, some of that is more audited.

As the "lump" gets giant, dark corners also proliferate. There was a recent article [3] and HN discussion [4] about trying to have the "optimal chunkiness/granularity" in various ecosystems. I agree that it is doubtful we will solve any of that in an HN sub-to-the-Nth thread. I think that article/discussion only scratched the surface.

I will close by saying I think it's relatively uncontentious (but maybe not unanimous) that packaging has gone awry when a simple program requires a transitive closure of many hundreds of packages. FWIW, I also often write my own stuff rather than relying on 3rd parties and have done so in many languages. Nim [5] is a nice one for it. It's not perfect - what is? - but it sucks the least in my experience.

[1] https://en.wikipedia.org/wiki/CTAN

[2] https://en.wikipedia.org/wiki/CPAN

[3] https://raku-advent.blog/2021/12/06/unix_philosophy_without_...

[4] https://news.ycombinator.com/item?id=29520182

[5] https://nim-lang.org/


I think my point remains valid: to do safe string stuff in C I have to think a lot harder about lengths than I do in Go. And I didn't want large dependencies because I was writing a .so to preload and intercept execve and open. And even after all these threads I don't know the name of a small string library to use in C, except TCL, because I used it before.


Would you be open to sharing what you did with strings?

My central argument in the response there is that writing buf[len] = '\0'; is almost always a sign that you either don't know libc functions, aren't willing to use them, are trying to outperform them (the performance of libc functions is a legitimate complaint for some use cases), or what you're dealing with is not a string but some arbitrary binary blobs that you're trying to make strings out of (in that case, you can't blame the string representation or string handling functions for not knowing what the extent of your binary is; yes, you'll have to first create a string, knowing the length).

To put it more explicitly, if you always provide a valid buffer and size, snprintf() will always terminate your string. strlcat() and strlcpy() will always terminate your string. If you need formatted catenation, you can make a trivial wrapper around snprintf that takes a pointer to the end of your string and updates the "head"; this can be called successively without ever having to compute a length outside the wrapper. asprintf() will allocate and terminate your string. Things that need the length of your string (strspn, strchr, etcetera) will figure it out since it is implied by the already-present nul byte. strtok & co (they have their issues) also work without requiring you to do any manual termination.
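
For instance, such a wrapper might look like this (a sketch with made-up names, not a libc function; *head is the current end of the string, end is one past the buffer):

    #include <stdarg.h>
    #include <stdio.h>

    /* append formatted text at *head, never writing past `end`;
       always nul-terminates; advances *head so calls can be chained */
    static void catf(char **head, char *end, const char *fmt, ...) {
        va_list ap;
        int n;
        if (*head >= end)
            return;                        /* buffer already full */
        va_start(ap, fmt);
        n = vsnprintf(*head, (size_t)(end - *head), fmt, ap);
        va_end(ap);
        if (n < 0)
            return;                        /* encoding error */
        /* advance to the nul; clamp if the output was truncated */
        *head += (n < end - *head) ? n : (end - *head) - 1;
    }

    /* usage: no strlen, no manual '\0', no length bookkeeping
       char buf[256], *p = buf;
       catf(&p, buf + sizeof buf, "x=%d", 42);
       catf(&p, buf + sizeof buf, ", y=%d", 7);  */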

What this means in practice is that you can have thousands of lines of string handling code that never manually terminates a string and only deals with lengths to the extent that your "business logic" needs to. Unless you're actually trying to use the string representation to your benefit by manually splicing it any which way, inserting nul bytes based on arcane computations.. in that case, it sounds like you got what you wanted. Yes, people actually do that sometimes: they figure out how easy it is to manipulate the string representation by hand and thus avoid library functions, and then they complain about doing it by hand.

There are always exceptions of course, so I'm giving you benefit of the doubt. That's why I'm curious to see what you were doing. Having to point out library functions however is a regular thing as people seem to always start out by hand-rolling it for some reason.

As for the question about string libraries.. well, I gotta point out that "small" wasn't a qualifier in the previous discussion. Popular libraries include sds, bstring, glib strings. Plan9port also has the extensible string library. There's icu for fancy unicode stuff but I have no experience with it and it probably isn't "small." There are plenty more if you look around, and I'll let you judge the size of the choices for yourself. I'm pretty sure one of these choices is always mentioned in these HN threads when someone asks for recommendations, including sds in the bchs thread.


I just always put the buf[len] = '\0' to cover myself if I screwed up something. Generally I also use calloc if it's standalone code as well.

I was copying strings from a file of allowed binary names into a list of char * and also logging the first two parameters to execve to disk, appending. It was fine but 100 times scarier than the same in Go would be.

I have used snprintf and strl* functions when I was doing fancier stuff, but have not tried asprintf. It has been a long time since I was doing large amounts of C code, and back then I was either doing binary with the length always passed along or else calling some template library, but I do thank you for the lib recommendations.

My point is, if you ask for a good string library that makes it as safe and easy as the same in Go, you will not see a pattern of answers to use well-known strlib X.


Regarding buf[len] = '\0', I've personally had to use it in many scenarios following strncpy, which doesn't add a null terminator if the maximum length is reached. Do you know of any simpler way of getting a prefix up to a certain length?


snprintf. If you want to stick to the (safest) pattern of only passing the buffer size for the second parameter, you'd do this:

    snprintf(buf, sizeof buf, "%.*s", prefix_length, source_str);
Example:

    $ cat x.c
    #include <stdio.h>
    int main(void) {
      char buf[128], tinybuf[5];
      const char *copythis = "hello there\n";
      snprintf(buf, sizeof buf, "%.*s", 5, copythis);
      snprintf(tinybuf, sizeof tinybuf, "%.*s", 5, copythis);
      printf("buf: %s\n", buf);
      printf("tinybuf: %s\n", tinybuf);
    }

    $ cc -W -Wall -O3 x.c
    x.c: In function ‘main’:
    x.c:6:41: warning: ‘snprintf’ output truncated before the last format character [-Wformat-truncation=]
        6 |  snprintf(tinybuf, sizeof tinybuf, "%.*s", 5, copythis);
          |                                         ^
    x.c:6:2: note: ‘snprintf’ output 6 bytes into a destination of size 5
        6 |  snprintf(tinybuf, sizeof tinybuf, "%.*s", 5, copythis);
          |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    $ ./a.out
    buf: hello
    tinybuf: hell


Thanks! I've never considered using snprintf in that way before; the default warnings are annoying, even though their intent is understandable.


The warning isn't a false positive here; truncation is going on in that line: the chosen prefix doesn't fit into tinybuf.


I'm not convinced. Can the Rust function which is shown (that function alone) tokenize a number out of a string such that the number is overflowing the target type? Yet indicate to the caller where that overflowing number ends, so that tokenization can continue with subsequent characters, if any?

E.g. suppose we have a string with this kind of syntax:

   "12345 : 12345   , 12345"
We can

1. use strtol to get the first integer and a pointer to just after it.

2. use ptr += strspn(ptr, " ") to skip spaces

3. check for the colon and if we find it, skip with ptr++

4. use strtol to get the second integer (possibly preceded by space).

5. similarly to the colon handling, do the comma

6. use strtol again to get the third integer.

This is efficient: no splitting of the string into pieces requiring memory allocation, and no extra list processing. (A sketch of these steps follows below.)

We can code this robustly: it can recognize valid syntax even if some of the numbers overflow. So for this kind of input:

    "1234523442345345345234534545454545 : 12345  12345" 
the code could diagnose the overflow, and the missing comma in one pass.

If you don't care about the details, just "is this number in range, with no trailing junk, or else is it bad", then raw strtol isn't convenient. But it takes only a little code to wrap it.
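
A minimal sketch of those six steps (the function name is mine; syntax errors stop the scan, overflow is just recorded so scanning can continue):

    #include <errno.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* parse "N : N , N" in one pass, no allocation, diagnosing overflow */
    static bool parse_triple(const char *s, long out[3], bool *overflowed) {
        const char *p = s;
        char *end;
        const char sep[2] = { ':', ',' };          /* expected after numbers 1 and 2 */

        *overflowed = false;
        for (int i = 0; i < 3; i++) {
            errno = 0;
            out[i] = strtol(p, &end, 10);          /* steps 1, 4 and 6 */
            if (end == p)
                return false;                      /* no digits found */
            if (errno == ERANGE)
                *overflowed = true;                /* diagnosed, but keep tokenizing */
            p = end + strspn(end, " ");            /* step 2: skip spaces */
            if (i < 2) {
                if (*p != sep[i])
                    return false;                  /* steps 3 and 5: ':' then ',' */
                p++;
            }
        }
        return *p == '\0';                         /* no trailing junk */
    }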


Rust can return string slices, so returning a tuple of the potential number and a slice of the still-unprocessed string would be an option, which is much safer than your proposed approach and arguably more readable.


How often do you actually need this?


Whenever you do lexical analysis on syntax containing numbers.

On today's hand-held supercomputers avoiding allocations and, generally, exercising memory-efficiency may not be a primary concern, but it very much was at the time when this stuff was built. And it's still relevant today on restricted systems, like microcontrollers, where C is still the primary language.


> Whenever you do lexical analysis on syntax containing numbers.

If that was the intention, C should have a full set of lexical analysis functions, but it doesn't (scanf doesn't count). strtol being able to distinguish two error cases, and thus being marginally useful for lexical analysis, is most likely accidental.


In my experience, reading a single number is a much much much more common operation than doing lexical analysis.


But this is not an appropriate place for that functionality.


It is entirely appropriate for a function which lexically analyzes a buffer in memory in order to match an integer to be able to tell you where that integer ends.


I think that's a reasonable opinion in the context of, say, language implementation.

For the relatively simple case of parsing CLI arguments I would want an equally simple API. "Is this string a valid representation of a number?" and "what number does this string represent?" should be separate APIs.

Even in language implementation, I'd want the identification of number tokens to be separate from the parsing of those number tokens. I would then have another, separate API for "where does the first number end in this string?", which would more likely be "return to me the next substring from this string that represents a number".


Often one can ignore the errno case, as the input is (by then) semi-constrained, and its ambiguity resolution is not needed.

E.g. an idiomatic pattern from some real code would be something like:

  static bool
  parse_thing (char *value, struct thing *th)
  {
      char *endp = NULL;
      unsigned long firstport = strtoul(value, &endp, 10);
      if (endp == value || *endp || firstport > 0xffff) {
          /* Do something with error, like log it */
          return false;
      }
  
      th->firstport = firstport;
      return true;
  }
But granted, for the general case I'd prefer to use some helper like either of:

  bool parse_decimal_uint32(struct string const *str, uint32_t *outval);
  bool parse_decimal_uint32(char const *str, unsigned slen, uint32_t *outval);
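
A sketch of the second form, wrapping strtoul and bouncing through a small local buffer (same error conventions as parse_thing above; note that strtoul quietly accepts a leading '-', which a stricter helper would reject):

  #include <errno.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  static bool
  parse_decimal_uint32 (char const *str, unsigned slen, uint32_t *outval)
  {
      char buf[16];               /* enough for 10 digits plus nul */
      char *endp = NULL;
      unsigned long v;

      if (slen == 0 || slen >= sizeof buf)
          return false;
      memcpy(buf, str, slen);
      buf[slen] = '\0';

      errno = 0;
      v = strtoul(buf, &endp, 10);
      if (endp == buf || *endp || errno == ERANGE || v > UINT32_MAX)
          return false;

      *outval = (uint32_t)v;
      return true;
  }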


If you can ignore the errno case due to a "semi-constrained" input, and don't care about trailing junk or having a pointer to more string material after the number is scanned, you can just call atol(str).


strto*() is the wrong API to use if you care about errors.

  char *forty_two = "42";
  int i;
  if (sscanf (forty_two, "%d", &i) != 1) {
      /* error */
  }
Sometimes, there's more than one way to skin a cat, and one of them is more suited to the task at hand.


Now what happens if you pass it a string without a null byte terminator?


I made no claim that using sscanf instead of strto* fixes all the issues with C string representation.


I thought Zig didn't have unicode strings?

https://www.reddit.com/r/Zig/comments/9q3or3/how_to_deal_wit...

If that's true, Zig is NOT a modern language. Modern languages use international strings and are Unicode-aware, with a good Unicode-aware string library.

For crap's sake, the code example for comparing modern languages USES A STRING. The fact it is not unicode doesn't matter.


Is a 3-year-old reddit comments section really a better source for this than the Zig standard library?

https://github.com/ziglang/zig/blob/master/lib/std/unicode.z...


That case has existed for 40 years now, yet C still stands. Guess it's the power of the network effect.


If a user wants to parse integers etc. from a string, the sscanf family of functions is often applied. It is a neatly simple function. This article seems to invent a problem rather than address an organic one.


The article argues that there is no easy way to detect whether the parsing finished successfully. As a consequence, the C standard library is unsafe when used normally.

It's interesting how beginners are encouraged to use various string functions which are not safe to use with external input.


The RSS feed is broken on that site; it outputs relative links (as opposed to absolute links).


[flagged]


Such comments are against the ethos of HN and its guidelines, which you can find at the bottom of this page.


You have complete freedom.


How about leaving the old stuff you want to "replace" alone? People are using it.



