Storing the length of a string alongside the string is a viable option in C, but you have to do it yourself (whereas Java and almost every other language do it for you). See https://github.com/phusion/nginx/blob/master/src/core/ngx_st..., used by NGINX.
Which breaks down every time a C API needs to be called, and that linked API still has plenty of functions with separate pointer and length parameters.
You can do it like in Free Pascal/Delphi: store the string both length-prefixed (for fast length access and bounds checking) and zero terminated (for passing to functions that expect zero terminated strings).
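A rough C sketch of that dual layout, in case it helps. The `pstr` type and its functions are made-up names for illustration, and this is not Free Pascal's or Delphi's actual memory layout, just the same idea: a stored length in front, and a terminator always kept at the end.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: length stored in front of the bytes, and the bytes
   are always followed by a '\0' so data can be handed to C APIs. */
typedef struct {
    size_t len;     /* number of bytes, excluding the terminator */
    char   data[];  /* len bytes followed by one '\0' */
} pstr;

/* Copy n bytes from src into a new pstr; returns NULL if allocation fails.
   The caller releases the result with free(). */
pstr *pstr_new(const char *src, size_t n)
{
    assert(src != NULL);
    pstr *s = malloc(sizeof *s + n + 1);
    if (s == NULL)
        return NULL;
    memcpy(s->data, src, n);
    s->data[n] = '\0';  /* terminator for functions that expect one */
    s->len = n;         /* O(1) length for everything else */
    return s;
}

/* Bounds-checked read: returns the byte at index i, or -1 if out of range. */
int pstr_at(const pstr *s, size_t i)
{
    assert(s != NULL);
    return i < s->len ? (unsigned char)s->data[i] : -1;
}
```

The cost is one `size_t` plus one byte per string; `s->data` can be passed as-is to anything that takes a `const char *`.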
If the callee is meant to corrupt it then it is your fault for misusing it (unless the corruption is intentional and you plan to recalculate the length after calling it). If it wasn't meant to corrupt it then it is a bug and if it is in your code then it is your fault for not using the string functionality that you already have in place to avoid the bug in the first place. If it is not in your code, as long as you had to use it you'd have that bug regardless of what language or framework you used since it is out of your control and there isn't anything you can do about it.
I was actually more looking for practical issues. Usually the code that I write doesn't even handle strings a lot. Maybe I'm just using other languages for when I do that or maybe I'm using other approaches where others would use strings or maybe I just subjectively don't find them so bad as others. I'd just like to see exactly what people are complaining about so I could find out why I usually don't.
Like, apart from tons of well-known bugs and vulnerabilities caused by string manipulation? How exactly did you miss the news items, reports, posts, university lectures, books, and even your own personal experience with them?
Or is the insinuation that we are hand-wavy about it, and you doubt the existence and scale of the problem? It's a well researched, well established problem, known for almost half a century.
> 3 - C brags about performance and is probably the slowest language to compute string length
I have a feeling that of all axes of performance, C cares the most about memory overhead. Then the obvious idea is to keep it at exactly one byte per "simple" string, and you get to pick the class of programs that can't get away with that default string type:
• One-byte terminator: complicated text-handling application with a lot of (longer than a couple of pointers on average) string slices.
• One-byte length: anything that needs strings longer than 255 chars.
And then of these two solutions you pick the obviously more general one. What could possibly go wrong?
What was made around them was only ever the gorilla with jungle thing. Fine for small programs, or larger ones within walled gardens. But not fine for infrastructure work.
these seem like tradeoffs which are straightforward to understand, which allow for simpler ABI & runtime. sure, the "UX" of the language suffers compared to e.g. Python, but at least the mechanics are easier to understand. If you want Python style string handling in C you could just use the Python C-API.
>If it was a bad thing, UNIX wouldn't be so widespread.
Strange argument. JavaScript is widespread, COBOL was widespread, Windows is widespread, x86 is widespread. Widespread doesn't mean good. UNIX was a disaster, and the whole family of UNIX-like OSes spent decades just mitigating its errors and faults.
> To make substring in some other languages, you need to store pointer to beginning of the substring and length of substring.
Plus a pointer to the beginning, plus a reference counter, since the user expects it to manage lifetime. In C this is the user's job. Where they know the lifetime is guaranteed, they can optimize.
> That substring won't remain valid without copying it.
Unless it's not modified? And unless, when modified, that shouldn't be its new value?
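For reference, a minimal sketch of the pointer-plus-length substring being argued about, done by hand in C. `str_view`, `view_sub` and `view_eq` are hypothetical names, not from any library. The view owns nothing and has no reference counter, so keeping the underlying buffer alive and unmodified is entirely the caller's job.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning substring: a pointer into someone else's
   buffer plus a length. No copy, no terminator, no lifetime tracking. */
typedef struct {
    const char *ptr;
    size_t      len;
} str_view;

/* View of up to len bytes of base starting at start, clamped to the
   end of the NUL-terminated string base. */
str_view view_sub(const char *base, size_t start, size_t len)
{
    assert(base != NULL);
    size_t n = strlen(base);
    if (start > n)
        start = n;
    if (len > n - start)
        len = n - start;
    return (str_view){ base + start, len };
}

/* Compare a view against a NUL-terminated string; returns 1 if equal. */
int view_eq(str_view v, const char *s)
{
    size_t n = strlen(s);
    return v.len == n && memcmp(v.ptr, s, n) == 0;
}
```

A suffix needs no struct at all: `text + 6` is already a valid C string. Only substrings that end before the terminator need the explicit length.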
C strings (NUL-terminated) are the right approach for static storage of small static strings (like string literals in the source code), since they have low overhead, and "substrings" aren't second-class citizens.
For dynamically allocated strings that won't be modified after creation, the right approach is using a large memory chunk that is shared between many such strings, plus two indices for offset / length (or just offset if it's text that can be terminated with a sentinel).
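A minimal sketch of that shared-chunk idea, assuming an append-only pool. `str_pool`, `str_ref`, `pool_add`, `pool_get` and the 1 GiB cap are all made up for illustration. Handles store offsets rather than pointers, so they stay valid when the chunk is moved by `realloc`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define POOL_MAX ((size_t)1 << 30)  /* arbitrary cap; keeps offsets in 32 bits */

/* One growable chunk shared by many immutable strings. */
typedef struct {
    char  *buf;
    size_t used;
    size_t cap;
} str_pool;

/* Handle to a pooled string: 8 bytes instead of a separate allocation. */
typedef struct {
    uint32_t off;
    uint32_t len;
} str_ref;

/* Append n bytes of s to the pool. Returns 0 and fills *out on success,
   -1 if the pool would exceed POOL_MAX or allocation fails. */
int pool_add(str_pool *p, const char *s, size_t n, str_ref *out)
{
    assert(p != NULL && s != NULL && out != NULL);
    if (n > POOL_MAX - p->used)
        return -1;
    if (p->buf == NULL || p->cap - p->used < n) {
        size_t cap = p->cap ? p->cap : 64;
        while (cap - p->used < n)
            cap *= 2;               /* cannot overflow: need <= POOL_MAX */
        char *nb = realloc(p->buf, cap);
        if (nb == NULL)
            return -1;
        p->buf = nb;
        p->cap = cap;
    }
    memcpy(p->buf + p->used, s, n);
    out->off = (uint32_t)p->used;
    out->len = (uint32_t)n;
    p->used += n;
    return 0;
}

/* Pointer to the bytes of r. Not NUL-terminated: use r.len. The pointer
   is only valid until the next pool_add, which may move the chunk. */
const char *pool_get(const str_pool *p, str_ref r)
{
    return p->buf + r.off;
}
```

Start from a zeroed pool (`str_pool p = {0};`) and release everything at once with `free(p.buf)`. The sentinel variant mentioned above would append a terminator after each string and drop `len` from the handle.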
Having a short string of about 10 characters allocated as a dynamic object in its own allocation is wasteful. Slow to allocate and has about 2x to 3x overhead. This approach isn't good for applications that store a large amount of data.
> O(N) is not the same as O(1).
Don't call strlen() in situations where the strings are large and you need to know the length ahead of time, and running time is paramount. Instead, store the length.
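To make the difference concrete, here is a small sketch with two hypothetical functions that count spaces. The first calls `strlen()` in the loop condition, which rescans the string on every iteration unless the compiler manages to hoist the call; the second takes a length that the caller stored once.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Potentially O(N^2): strlen(s) is re-evaluated in the loop condition. */
size_t count_spaces_strlen(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == ' ')
            count++;
    return count;
}

/* O(N): the caller passes a length it computed or stored earlier. */
size_t count_spaces_cached(const char *s, size_t len)
{
    assert(s != NULL);
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
        if (s[i] == ' ')
            count++;
    return count;
}
```

For short strings the two are indistinguishable in practice; the gap only matters once the strings are large, which is the case described above.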
You don't need profiler feedback for these things, just back-of-the-envelope calculations. How much data you would like to store and process is not something a profiler can answer for you.
I can tell you that in one of my projects (a SAT solver in Java) handling a few million variables, the difference between garbage-collected strings and optimized ones (actually, strings converted into unique integer handles immediately) was something like a second until completion versus a couple of minutes before the garbage collector finally died from lack of oxygen, losing all data computed up to that point.
2 - no enforcement that a null terminator actually exists in the string
3 - C brags about performance and is probably the slowest language to compute string length
4 - manipulating strings requires very careful handling of buffers, usually forcing everyone to use the heap as the easier way out
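On point 2, the language won't enforce the terminator, but when the buffer capacity is known the check can be done by hand. A minimal sketch, where `bounded_len` is a hypothetical helper built on the standard `memchr`:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the string in buf if a '\0' occurs within the first cap
   bytes, or -1 if the buffer is not terminated. Never reads past cap. */
ptrdiff_t bounded_len(const char *buf, size_t cap)
{
    assert(buf != NULL);
    const char *end = memchr(buf, '\0', cap);
    return end != NULL ? end - buf : -1;
}
```

This only helps where the capacity travels with the pointer; given a bare `char *`, there is nothing to check against.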