
What surprises me about C developers is that C has existed for probably 40 years, yet they still don't have proper strings (not just pointers). In many cases there is no large performance penalty for storing the string length and checking it, but they still use bare pointers, or a separate pair of variables for pointer and buffer size, instead of a single object.
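To make the "single object" idea concrete, here is a minimal sketch. The names `str_view`, `sv_from_cstr` and `sv_at` are made up for illustration and are not part of standard C or of any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type: pointer and length travel together as one object. */
typedef struct {
    const char *data;
    size_t len;
} str_view;

/* Wrap a NUL-terminated string; the length is computed once, here. */
str_view sv_from_cstr(const char *s)
{
    assert(s != NULL);
    str_view v = { s, strlen(s) };
    return v;
}

/* Bounds-checked read: returns 1 and stores the byte in *out
 * if i is in range, otherwise returns 0 and leaves *out alone. */
int sv_at(str_view v, size_t i, char *out)
{
    if (i >= v.len)
        return 0;
    *out = v.data[i];
    return 1;
}
```

The check in `sv_at` is a single comparison against a value that is already at hand, which is the kind of cost being argued about in this thread.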



Because they have a security culture that you only get errors due to "holding it wrong" or "every good programmer does it right", despite evidence to the contrary.

Add to that the fact that early C compilers generated quite lousy code on 8- and 16-bit computers, and you get this idea of micro-optimizing each line of code as it is being written, without any profiling feedback on whether it actually matters, just cargo-culting how writing code like X is faster than Y.

For example, outside 3D rendering and audio software processing, I never saw a visible impact (to the end user) of bounds checking.


> For example, outside 3D rendering and audio software processing, I never saw a visible impact (to the end user) of bounds checking.

Indeed, the computational overhead of bounds checking is irrelevant for "cold" code, but consider this:

1) Pretty sure you can get bounds checking when compilers detect that you're accessing a static array.

2) Otherwise, it's unclear how to devise a system that integrates bounds checking with C semantics. (Yes, that's unfortunate!)

3) Bounds checking does at least increase code size.


1) Totally optional from the ISO C point of view; not all C compilers do it, and it usually requires static analysis or specific compiler warnings to be enabled. Which not everyone does.

2) Solaris does it perfectly fine on SPARC thanks to tagged memory (ADI). Which Google in collaboration with ARM will make mandatory on future Android releases as well. [0]

3) It hardly mattered in MS-DOS and Amiga LOB applications developed across Turbo Basic, Quick Basic, GFA Basic, Turbo Pascal, Clipper, so it matters even less nowadays unless we are speaking about PIC-like hardware.

Regarding bounds checking I usually refer to Hoare's Turing Award speech, back in 1981:

"Many years later we asked our customers whether they wished us to provide an option to switch off these checks in the interests of efficiency on production runs. Unanimously, they urged us not to--they already knew how frequently subscript errors occur on production runs where failure to detect them could be disastrous. I note with fear and horror that even in 1980, language designers and users have not learned this lesson. In any respectable branch of engineering, failure to observe such elementary precautions would have long been against the law."

Or for that matter, the DoD's B2 security evaluation of Multics[1], with several remarks on how PL/I made the system safer with its string handling, pointer integrity validation and bounds checking.

As noted, yes, there are niche cases where bounds checking does have an impact, but for a large spectrum of the code that gets written daily it isn't the case.

[0] - https://security.googleblog.com/2019/08/adopting-arm-memory-...

[1] - https://multicians.org/multics-fer.pdf


1) It's not standardized but it's very simple to implement nevertheless.

It is my understanding that 2) doesn't have anything to do with C (so given hardware support, you can have it for free whether working in C or not. I think this kind of invalidates your point).

And also, 2) is not perfect, only probabilistic (a source I found says 94% likelihood to detect OOB).

And it works only for the most basic situations where you use the system allocator and never subpartition these allocations. So, beyond these simple cases that we can get for free without any involvement from C semantics, I still maintain that it is unclear how to devise a reasonable and useful system to do bounds checking that can be added to the C memory model. We can make up an annotation syntax to cover many of the simpler cases, but these are hardly better than plain assertions (which I regularly use).

I doubt Hoare had a good idea to add general bounds checking on a low-level language like C, otherwise that would be standardized by now.


Pointer validation via hardware memory tagging has everything to do with C, because it is only due to C's shortcomings that millions of research dollars keep being spent to try to make it work.

In what concerns Solaris, and the requirements for future Android with ARM memory tagging, the system allocator is all there is, at least from official support point of view.

Hoare hardly needed to pursue such an endeavour, because all systems programming languages derived from Algol, like ESPOL, NEWP, PL/I, PL/S, PL/8, BLISS, ... were sane regarding bounds checking.

Hardware validation of memory is the only way to tame C, the alternative is to just dump the language, because as shown by ISO C11 making Annex K merely optional and the scant adoption it has seen since, very few actually care about making the language safe.

However, given UNIX's dependency on C, it is also quite clear to me that C will be around for the next couple of decades, long after I am gone, and so will the business opportunities to create companies on top of CVE exploits due to memory corruption bugs.


> Pointer validation via hardware memory tagging has everything to do with C, because it is only due to C's shortcomings that millions of research dollars keep being spent to try to make it work.

So, to restate, Solaris does not do it "perfectly fine". Thanks for making my point.

> In what concerns Solaris, and the requirements for future Android with ARM memory tagging, the system allocator is all there is, at least from official support point of view.

That's a pity, because if you're not doing your own allocators then you'll have to accept lock contention, extreme memory overhead (for smaller allocations, say <= 64 bytes), and you'll need to match every little allocation with a deallocation, instead of making e.g. custom pool allocators.
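For readers who haven't written one, a custom pool allocator can be sketched roughly as below. This is only an illustration under assumptions: the names (`pool`, `pool_init`, `pool_alloc`, `pool_free`, `pool_destroy`) and the fixed 64-byte slot size are made up, it is single-threaded, and it needs C11 for `max_align_t`. Note that the system allocator only ever sees one big `malloc`, so the slots inside it are exactly the kind of sub-partitioning discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed-size pool: all slots have the same size and
 * free slots are chained through their own storage. */
typedef union pool_slot {
    union pool_slot *next;        /* valid only while the slot is free */
    unsigned char payload[64];    /* assumed slot size for small objects */
    max_align_t align_;           /* C11: keep slots suitably aligned */
} pool_slot;

typedef struct {
    pool_slot *slots;             /* the single backing malloc */
    pool_slot *free_list;
    size_t count;
} pool;

/* Returns 1 on success, 0 on allocation failure or size overflow. */
int pool_init(pool *p, size_t count)
{
    assert(p != NULL && count > 0);
    if (count > SIZE_MAX / sizeof(pool_slot))
        return 0;
    p->slots = malloc(count * sizeof(pool_slot));
    if (p->slots == NULL)
        return 0;
    p->count = count;
    p->free_list = NULL;
    /* Push slots in reverse so slot 0 is handed out first. */
    for (size_t i = count; i-- > 0; ) {
        p->slots[i].next = p->free_list;
        p->free_list = &p->slots[i];
    }
    return 1;
}

/* O(1), no lock, no per-allocation header; NULL when the pool is full. */
void *pool_alloc(pool *p)
{
    pool_slot *s = p->free_list;
    if (s == NULL)
        return NULL;
    p->free_list = s->next;
    return s;
}

/* Return one slot to the pool; ptr must have come from pool_alloc(p). */
void pool_free(pool *p, void *ptr)
{
    pool_slot *s = ptr;
    assert(s != NULL);
    s->next = p->free_list;
    p->free_list = s;
}

/* Release everything at once, without matching each allocation. */
void pool_destroy(pool *p)
{
    free(p->slots);
    p->slots = NULL;
    p->free_list = NULL;
    p->count = 0;
}
```

A write that runs past one slot lands in the neighbouring slot of the same `malloc` block, so a checker that works at system-allocator granularity would presumably not notice it.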

You're just not going to write a large infrastructure (i.e. performance-oriented) system in this way.


There are countless libraries that add higher level string functions and no end to higher level languages. C fills the niche where you want something higher level than assembly but lower level than Perl, Ruby, Python, etc. Sometimes you want or need to manage your own memory. Arduino is a good contemporary example.


Exactly. The C philosophy is to use libraries and not put things like a better string library in the core functions of the language. It keeps the language relatively clean and easy to understand, unlike C++.


So the C philosophy is to provide a bad standard string library that is not thread-safe and makes it impossible to write programs without remote code execution vulnerabilities, rather than have a better library with exactly the same functionality but with more security and safety "bloat"?

/snark

Yes, C used to have a cavalier approach towards security in the past. But if you're asking why the standard library is not fixed yet, I think that instead of pinning it on some lofty philosophy, it's safer to say that good C developers realized long ago that the original C strings are a mistake. Most big C projects define their own string functions and often their own length-prefixed string types. The C standard committee just gave up on fixing this issue in the standard library, not due to philosophy, but because of the impracticality of forcing a standard solution at this stage.


Bonus points when the libraries are so incompatible among themselves that they require extra conversion steps, and then just get dropped 'cause "mind the performance".


When compatibility is needed, APIs should consume pointer+length pairs. It doesn't get better than that, in ANY language, in terms of simplicity and modularity.
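A small sketch of what such an API boundary might look like. `count_byte` and `struct lib_string` are hypothetical names invented for this example; `lib_string` merely stands in for whatever string type some library defines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical API: consumes a raw pointer+length pair, so it does not
 * care which string type (if any) the caller uses. */
size_t count_byte(const char *p, size_t n, char c)
{
    size_t hits = 0;
    assert(p != NULL || n == 0);
    while (n > 0) {
        const char *q = memchr(p, c, n);
        if (q == NULL)
            break;
        hits++;
        n -= (size_t)(q - p) + 1;   /* skip past the match */
        p = q + 1;
    }
    return hits;
}

/* Stand-in for some library's own string type (an assumption). */
struct lib_string {
    char *buf;
    size_t used;
};
```

A caller holding a `struct lib_string s` passes `s.buf, s.used`; a caller holding a literal passes the literal and its length. No conversion function is needed in either direction, though the caller is the one responsible for passing a length that really belongs to that pointer.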

Libraries like Qt with extreme lock-in are at the other end of the spectrum. If it works for them, that's nice. Doesn't work for me. I don't think using a string library is in the spirit of C programming.


Which means either pointer+length get packed into their own structure to avoid mix up errors, thus requiring conversion function calls across libraries, or they are given separately manually, thus opening the door to the copy-paste mistakes from pointer+bad length that they are supposed to protect against.

Qt is as locked-in as any LGPL 3 FOSS project is.


> Which means either pointer+length get packed into their own structure to avoid mix up errors, thus requiring conversion function calls across libraries, or they are given separately manually, thus opening the door to the copy-paste mistakes from pointer+bad length that they are supposed to protect against.

No, we were talking about compatibility/modularity. You're shifting the topic.

> Qt is as locked-in as any LGPL 3 FOSS project is.

I'm not speaking about license lock-in, but about lock-in from an engineering point of view. Are you aware of any significant Qt projects that don't have "Q" all over their codebase?

(And yes, by contrast to GPL, I believe you can use LGPL libraries without suffering a terrible amount of (license) lock-in)


How am I shifting the topic?

Compatibility/modularity doesn't happen in the air, rather in written code.

So either one passes structs around, and somehow they need to be compatible.

Or one passes pointer + length as two separate variables, with the consequence of having to keep in sync two variables that are unrelated from the compiler's point of view.


Arduino uses C++.


I've used talloc [1] where appropriate to greatly simplify memory allocation and string handling.

[1] https://talloc.samba.org/talloc/doc/html/index.html


C has _only_ pointers for variable size things, not just strings.

Roughly speaking C vars are either known fixed length, or accessed via pointer. There is nothing else.

(except arrays -- which are mostly pointers)


The thing you are pointing to could have its length prefixed. I've always assumed that this isn't the case because nobody could commit to the size of the length prefix. Is it string8, string16, string32, or string64? Using a sentinel value to denote length is less opinionated and more portable.
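To illustrate the trade-off, here is a rough sketch of what committing to an 8-bit prefix would mean. The `string8` type and both function names are hypothetical, chosen only to mirror the naming in the question above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "string8": a one-byte length prefix caps the
 * string at 255 bytes, but the bytes themselves may contain 0. */
typedef struct {
    uint8_t len;
    char bytes[255];
} string8;

/* Returns 1 on success, 0 if n does not fit in the 8-bit prefix. */
int string8_set(string8 *dst, const char *src, size_t n)
{
    assert(dst != NULL && (src != NULL || n == 0));
    if (n > UINT8_MAX)
        return 0;
    dst->len = (uint8_t)n;
    if (n > 0)
        memcpy(dst->bytes, src, n);
    return 1;
}

/* Sentinel version: no upper limit on length, but finding the
 * length is a scan and the first 0 byte ends the string. */
size_t sentinel_len(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}
```

A wider prefix lifts the 255-byte cap at the cost of more bytes per string, which is presumably the commitment nobody wanted to make for everyone.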


And also unbeatable in memory efficiency.


Unfortunately, it's often algorithmically less efficient and more error prone, so everybody ends up replicating what git is doing with 'strbuf' but in slightly incompatible ways. The effect makes dealing with strings in C unnecessarily unpleasant.
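For illustration, a stripped-down growable buffer in that spirit might look like the sketch below. This is not git's actual `strbuf` API; the `sbuf` type and its functions are hypothetical names, and the growth policy (start at 16, double) is just an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string buffer; start from { NULL, 0, 0 }. */
typedef struct {
    char *buf;    /* NUL-terminated once non-NULL */
    size_t len;   /* bytes in use, excluding the terminator */
    size_t cap;   /* bytes allocated */
} sbuf;

/* Append n bytes from src. Returns 1 on success, 0 on overflow or
 * allocation failure, in which case the buffer is left unchanged. */
int sbuf_add(sbuf *sb, const char *src, size_t n)
{
    assert(sb != NULL && (src != NULL || n == 0));
    if (n >= SIZE_MAX - sb->len)          /* len + n + 1 would overflow */
        return 0;
    size_t need = sb->len + n + 1;
    if (need > sb->cap) {
        size_t cap = sb->cap ? sb->cap : 16;
        while (cap < need) {
            if (cap > SIZE_MAX / 2) {     /* doubling would overflow */
                cap = need;
                break;
            }
            cap *= 2;
        }
        char *p = realloc(sb->buf, cap);
        if (p == NULL)
            return 0;
        sb->buf = p;
        sb->cap = cap;
    }
    if (n > 0)
        memcpy(sb->buf + sb->len, src, n);
    sb->len += n;
    sb->buf[sb->len] = '\0';              /* stay usable as a C string */
    return 1;
}

/* Release the storage and reset to the empty state. */
void sbuf_free(sbuf *sb)
{
    free(sb->buf);
    sb->buf = NULL;
    sb->len = 0;
    sb->cap = 0;
}
```

Appending is amortized constant time per byte because the length is already known, whereas repeated `strcat` on a sentinel-terminated string rescans from the start each time.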



