So why insist on this NUL termination business? I expected this to be necessary in C libraries due to legacy code but the kernel can always do whatever it wants internally.
One thought: most targets are not built with -ffreestanding, and while the kernel doesn't link against a libc, it does provide most symbols with the same semantics. This allows the compiler to perform libcall optimizations.
For example, calls to printf can be transformed into calls to puts under certain conditions. The compiler can recognize and rewrite those calls, even after other optimizations have run. There are many of these tricks for the str* functions, all of which assume NUL-terminated C strings.
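A minimal sketch of that transformation (GCC and Clang both do this at -O1 and above):

    #include <stdio.h>

    int main(void)
    {
        /* A printf call whose format string ends in '\n' and contains no
           conversion specifiers is typically rewritten to puts("hello"),
           skipping format-string parsing entirely. */
        printf("hello\n");
        return 0;
    }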
Though perhaps if the mem* functions were used to implement a fat strings implementation, many of those might still apply.
Point being: the compiler can help when your string representation is first-class in the language spec.
The compiler can also help with FORTIFY; it can insert compile-time length checks in certain cases that can make code safer. This avoids treewide rewrites, which are relatively painful for the Linux kernel due to its development model, but not impossible. That's another barrier to a new string representation and a library of routines for it.
That said, strscpy is not part of the language spec, so I'm guessing that unless it's implemented in terms of language-defined functions, it gets neither libcall optimizations nor fortification.
I think that structure will be helpful, and you can also easily pass substrings. (With null-terminated strings, you can easily cut characters off the beginning of a string, but not off the end. Passing the length and the pointer together lets you do both, and also lets you pass strings containing null characters.) I think this is better than Pascal strings, precisely because you can make substrings like this.
(Also, the structure can be passed by value, without a pointer to the structure, which makes it easy to use in other functions, e.g. to convert a null-terminated string or a Pascal string into this structure without allocating additional memory.)
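Something like this, as a sketch (the struct and function names here are mine, not from any standard):

    #include <stddef.h>
    #include <string.h>

    struct bytes {
        const char *pointer;
        size_t size;
    };

    /* Wrap a null-terminated string without copying or allocating. */
    static struct bytes from_cstr(const char *s)
    {
        struct bytes b = { s, strlen(s) };
        return b;
    }

    /* Slice from either end: just arithmetic on the by-value copy. */
    static struct bytes slice(struct bytes b, size_t start, size_t len)
    {
        b.pointer += start; /* drop `start` bytes from the front */
        b.size = len;       /* keep only `len` bytes */
        return b;
    }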
It boggles my mind that the C standard hasn’t officially added support for this. It seems like such a small change that would dramatically improve the quality of C code.
It would not be a small change if you wanted to actually make them usable for the standard library, i.e. add a second 'struct bytes' variant of EVERY function that currently accepts a NUL-terminated string.
And no, just calling a conversion function beforehand (which would need a heap allocation) would never be accepted by C programmers, because of the overhead.
Then there is inertia: how many would really want to port their application to a different string type? Not to mention that all the libraries you're using would also have to be converted.
You can make the read-only conversion "free" by storing the size of the string as well as terminating the actual string contents with a NUL. So the string "bar" would be { length: 3, contents: ['b', 'a', 'r', '\0'] }. All your functions dealing with the "rich" string type work the same using the length, except they have to be aware of the need to preserve the terminating 0, and if you want to pass a read-only string to a legacy function you can just pass its contents.
Of course that also creates a giant footgun, because you might manage to get the old and new strlen to disagree on the size, since one reads the length field and the other searches for the \0.
SDS is cool, I use it in my projects and have also extended it further with some new features... but I worry about the amount of dynamic allocation going on, it would be nice to have an alternative solution such as using stack or pool allocations, or being able to declare SDS string literals at compile-time (probably only possible in C++). Every time I notice that I need to replace a const char* with an SDS I just know I'm slowing things down and adding more complexity.
It doesn't have to be 1:1 equal to SDS; rather, at least have WG14 do something, anything at all, instead of continuing to push the agenda of C being a Swiss cheese of security exploits, to the point where everyone is adopting hardware memory tagging as the ultimate mitigation.
For some reason C seems to be going through standard updates as often as C++ now. Not only that, they've added absolutely huge stuff like threading primitives, as well as an insane generic macro thing.
Surely they could add better designed structures and functions for dealing with memory.
Is that really so big a deal, considering they've done that multiple times already anyway? How many versions of strcpy, printf, and other string handling functions are there already?
Nul termination is not only in C but also in some file formats. It has some advantages - apart from modest space savings for short strings, it means that you can read a string from some given location to the end - without any out of band data (length) that necessarily has to be stored in a different, agreed on location. This is a very valuable property.
I'm not saying don't use length fields, I'm saying use nul terminators where possible and use length fields where needed. And, they are not mutually exclusive.
And C doesn't need a standardized length delineated string structure in my opinion. Nul terminators serve the job fine for most Standard APIs (which take only short strings), and can receive length fields as separate function parameters where required.
• On the most widely used architectures, reading a string is much easier if the string is a known length. x86 has its string instructions, ARM has its Load Multiple instructions.
• Even with length-prefixed strings, many uses of short strings are with string literals and so the length does not need to be stored anywhere.
It's inherent to the language. Writing "string" gives you a NUL-terminated string; converting that to another format takes effort.
Interestingly, some Mac OS compilers would let you write "\pABC" to get a structure containing the bytes {3, 'A', 'B', 'C'}. (The "p" stands for Pascal.)
> It's inherent to the language. Writing "string" gives you a NUL-terminated string; converting that to another format takes effort.
Couldn't we just leave the NUL byte in there and pretend it doesn't exist?
    struct bytes { size_t size; char *pointer; }; // assuming some such definition

    const char *literal = "string literal with NUL byte";
    struct bytes text;
    text.size = strlen(literal);    // strlen doesn't count the NUL terminator
    text.pointer = (char *)literal;
    // I know I discarded the const qualifier up there, but it's just to illustrate
Then a copy(text, other) function would conveniently ignore the entire NUL issue. The copy would not even have the terminating NUL.
> Interestingly, some Mac OS compilers would let you write "\pABC" to get a structure containing the bytes {3, 'A', 'B', 'C'}. (The "p" stands for Pascal.)
Pascal strings are nice, but their sizes are too limited. Same idea as my structure above, but with a uint8_t for the length instead of a size_t (uint64_t on typical 64-bit targets).
Pascal strings can be improved to support strings of any reasonable length.
Say the two highest bits of the counter set the size of the counter field: 00 = 6 remaining bits (1 byte), 01 = 14 bits (2 bytes), 10 = 30 bits (4 bytes), 11 = 62 bits (8 bytes).
A simple `*counter & 0x3f` would remove the width-setting bits, without any shifts, additions, etc.
This allows small strings to use only 1 byte for the counter, while allowing huge strings that span the entire RAM.
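A sketch of a decoder for that scheme, assuming the counter is stored big-endian so the width bits land in the first byte:

    #include <stddef.h>
    #include <stdint.h>

    static size_t decode_length(const uint8_t *p, size_t *header_bytes)
    {
        size_t width = (size_t)1 << (p[0] >> 6); /* 00->1, 01->2, 10->4, 11->8 bytes */
        size_t len = p[0] & 0x3f;                /* strip the two width-setting bits */
        for (size_t i = 1; i < width; i++)
            len = (len << 8) | p[i];             /* fold in continuation bytes */
        *header_bytes = width;
        return len;
    }

(The shift-free `& 0x3f` covers the one-byte case; wider counters still have to be assembled byte by byte.)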
That's correct. Using strlen on anything but C string literals is just asking for bugs. The thing is untrusted strings don't come from C itself, they come from I/O.
The kernel has perfectly reasonable I/O interfaces.
Well... At least you would always know the length if the standard C library didn't abstract that perfectly good interface away behind stdio just so it could do buffering and return NUL-terminated strings.
It's just like errno. The kernel simply returns a negated error constant on failure. The C standard library takes that sane interface and turns it into a thread local global variable.
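A hedged, x86-64-only sketch of that convention; the libc wrappers are what turn this into errno plus a -1 return:

    #include <stdio.h>

    /* Invoke the raw read(2) syscall, bypassing libc. On x86-64 the
       syscall number goes in rax, arguments in rdi/rsi/rdx, and the
       kernel returns either a byte count or a negated errno in rax. */
    static long raw_read(int fd, void *buf, unsigned long count)
    {
        long ret;
        __asm__ volatile ("syscall"
                          : "=a"(ret)
                          : "a"(0L /* __NR_read */), "D"((long)fd), "S"(buf), "d"(count)
                          : "rcx", "r11", "memory");
        return ret;
    }

    int main(void)
    {
        char c;
        printf("%ld\n", raw_read(-1, &c, 1)); /* prints -9, i.e. -EBADF */
        return 0;
    }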
No. You would have extra problems. Not just from the maintenance issues on the migration path that Linus Torvalds pointed out, but especially from the poor thinking generally involved, such as people deciding that the way to respond to the observation that the POSIX I/O API is reasonable is to "kill POSIX".
Apparently at some point people forgot that if you don't like an API, library, or interface, you can just put a wrapper on top of it. Many libraries are just "toolkits", after all; you are not supposed to use the raw API everywhere. E.g. if you don't quickly stop doing that with the BSD sockets API, you are part of the problem.
> if you don't like an API, library or interface, you can just put a wrapper on top of it
We can also simply get rid of all that bloat and just use the system calls directly.
> you are not supposed to use the raw API everywhere
Linux system calls are a stable interface and the entry points are even programming language agnostic. It's okay to use them directly.
> e.g. you if don't quickly stop doing that with the BSD sockets API, you are part of the problem
Yeah it's not a good idea on other operating systems since the system call interfaces are unstable. We have to use their C libraries on those platforms.
> Apparently at some point people forgot that if you don't like an API, library or interface, you can just put a wrapper on top of it.
If I were to survey the state of modern software development and try to characterize the skills lost compared to decades past, "not enough wrappers" would be nowhere on my list.
Ok, that was a bit overstated; s/some point people forgot/some people forget/. I don't know what the common practice was a few decades ago, because source code was not as visible as it is today, but in my experience what you see on GitHub is not what the average developer does. GitHub advertises itself as a social coding network; just like with other social networks, there is a selection bias in what gets posted.
> That's correct. Using strlen on anything but C string literals is just asking for bugs. The thing is untrusted strings don't come from C itself, they come from I/O.
Indeed, it is just an input check/"sanitization" issue - just like one carefully checks that a JSON or XML input is well formed, if a protocol spec says that some part is an ASCIIZ string, one has to check that there's indeed a zero byte before the end of the data packet.
I know. That string literal is likely to be located in a read only page. In real code, I'd have to allocate some memory and copy the text to the new location if I want the resulting structure to be writable. For clarity's sake I omitted these details.
This isn't unique to my example though. Traditional C strings have the exact same problem and they do get copied all the time.
Both GCC and Clang support this with -fpascal-strings. "\pABC" actually gives you {3, 'A', 'B', 'C', 0} -- the \p is a character and so extends the length of the string by one and it's still nul-terminated, it's not a pure (non-nul-terminated) pascal string.
> I think he's absolutely right. The str* functions are just worse versions of the mem* functions.
I think they aren't "worse versions of the mem* functions"; they are different functions. The "str" functions deal with null-terminated data, "mem" functions deal with data of a specified length, and "strn" deals with whichever is shorter.
Some "str" functions do not have "mem" variants (and vice-versa). For example, there is no "memdup" function.
> they are different functions. The "str" functions deal with null-terminated data
Well, yes. I say they're worse because the only reason they exist is to deal with this NUL terminator nonsense. The str* functions all reduce to the mem* functions once the string length is computed. To me it's like this:
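A sketch of what I mean:

    #include <string.h>

    /* strcpy is just memcpy after a hidden O(N) scan for the terminator. */
    char *my_strcpy(char *dst, const char *src)
    {
        size_t len = strlen(src);         /* the pass the NUL design forces */
        return memcpy(dst, src, len + 1); /* then it's a plain mem* call */
    }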
There is no need for these functions to exist if we get rid of this NUL terminator business.
> Some "str" functions do not have "mem" variants (and vice-versa). For example, there is no "memdup" function.
There could easily be. For example, strdup is essentially strcpy(malloc(strlen(string) + 1), string). A memdup function would be even more efficient because the length is already known: memcpy(malloc(length), source, length).
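For instance, a minimal memdup sketch (not a standard function) might look like:

    #include <stdlib.h>
    #include <string.h>

    /* Duplicate a buffer of known length; returns NULL on allocation failure. */
    void *memdup(const void *src, size_t len)
    {
        void *dst = malloc(len);
        if (dst != NULL)
            memcpy(dst, src, len);
        return dst;
    }

Note this also fixes what the strdup one-liner above glosses over: the malloc result is checked before use.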
In the early 1990s, there was a popular shareware string library for C in circulation that added a whole raft of extra strXXX() functions, such as strrtrim() and strend(), to a C runtime library. All of these were useful. Indeed, they were fundamental in some other languages, e.g. some dialects of BASIC with their various string functions like MID$ and RIGHT$. But the standard C library never gained them.
The strXXX() set in the C standard library is, rather, in large part an ad hoc set of useful wrappers around stuff that could be done with assembly language idioms, like REPNE SCASB on x86 instruction sets, that had grown up by 1987. The functions weren't intended to be reducible to memXXX(). They were intended to be reducible to assembly language, or even to compiler intrinsics.
The sad part is that the context here is kernel code, in particular Linux kernel code, where human-readable text interfaces (as opposed to machine-parseable interfaces) are the norm. Whitespace-terminated or LF-terminated strings are the norm, in things like procfs for example, and it is a double irony that the C standard library addresses NUL termination more readily than it does those, and even then provides only an ad hoc collection of NUL-terminated string functions with long-since well-known glaring holes unsuitable for kernel code.
Getting rid of the problem in the way that you suggest would necessitate redesigning a lot of kernel APIs to not be human-readable text that operates in terms of variable length strings terminated by special character values with no explicit length counters. No more redirecting the output of the "echo" command to /proc/something . This is exceedingly unlikely to happen.
With the { size_t len; char *data; } form, now you've got two allocations (or at least two separate memory regions, or at least a pointer wasted, assuming the data is constant) per dynamically-allocated thing.
On the other hand, if you take a more direct mirror of a Pascal string:
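(Presumably something like this, using a C99 flexible array member; the name is mine:)

    #include <stddef.h>

    struct pstr {
        size_t len;
        char data[]; /* the bytes live in the same allocation as the header */
    };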
You're back to one memory span but can't reslice it.
And of course the worst codebase is when someone uses the first one because they want to keep slicing and someone else uses the second one because they need to save memory / indirections. To support both you end up writing functions that take a separate size and data pointer, and... well, then what's the point?
Most if not all string algorithms will eventually do this anyway:
size_t length = strlen("some string");
It's so common. Might as well memoize it so it's always available, with no need to repeatedly walk the string, which is O(N) every time. So many string algorithms call strlen, often multiple times on the same string. I remember GTA V took 6 minutes to parse a goddamn JSON file because of stuff like this, and part of the fix was to store the string lengths.
Which representation you want depends heavily on what you're doing with the string(s), plus there are other common variations, like len+cap instead of just size, SSO, etc.
So if you can't standardize the data structure, what's the common interface? A function that takes a pointer and a length - which is what we already have. So everyone in this thread appealing to the C standardization process or stdlib to do something wants instead - what, exactly?
I'm not aware of any common string implementation that takes 3-4 words just to put an empty string in your struct especially if it also still requires external allocation with additional size words once the string gets above a certain size. Java takes 1, Go takes 2, SDS takes 1, libstdc++ takes 3 but doesn't require an external store later, etc.
Delphi uses a pointer to the latter, in addition to keeping the actual strings with a zero at the end. That way a cast to a C-style "string" is free.
In order to allow for the pointer to live on stack and minimize copying, the data is also reference counted and the compiler takes care of inserting the necessary reference counting calls where needed.
Overall it's pretty flexible, but the reference counting means it's not ideal to use shared strings in heavily threaded code. Of course the second a thread modifies a string, a new string is allocated and that thread can happily work on its "own" string.
Anyway, just yet another way of implementing strings.
> now you've got two allocations (or at least two separate memory regions, or at least a pointer wasted assuming it's constant) per dynamically-allocated thing.
Having a size_t on the stack is hardly an issue, it's what every low-level modern language does. It's fast, convenient, and pretty efficient. It also doesn't require deref'ing to get the length, which is a pretty common use case (e.g. checking if a string is empty, or too big, or something along those lines).
That approach only works if the string isn't being mutated in a way which could change its size, though. Otherwise you need to make sure it has a lexical lifetime (and be very careful with it), or if that's not possible pay the double alloc cost.
I would be also worried about any difficulties separating the length and data causes for prefetching/cache lines though.
The second form is also more amenable to SSO; I'm not sure how often that would come up in the kernel, but it's saved me a decent chunk of memory in at least one past project. (Still, today I'm sometimes frustrated by Go `string` when porting from Java `String`: great, now I don't have to pay a boxing overhead, but if it's often absent/empty my base size is now 2x what it would otherwise be...)
> I think he's absolutely right. The str* functions are just worse versions of the mem* functions.
Per Todd C. Miller (creator of strlcpy and maintainer of sudo) and Theo de Raadt (OpenBSD) in these 1999 (PostScript) slides, the simplest implementation of strlcpy() uses (can use) memcpy():
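Along those lines, a minimal strlcpy sketch in terms of strlen + memcpy (matching the documented semantics: always NUL-terminates when siz > 0, and returns strlen(src) so callers can detect truncation):

    #include <string.h>

    size_t strlcpy(char *dst, const char *src, size_t siz)
    {
        size_t len = strlen(src);             /* full length of the source */
        if (siz != 0) {
            size_t n = len < siz - 1 ? len : siz - 1;
            memcpy(dst, src, n);              /* copy what fits... */
            dst[n] = '\0';                    /* ...and always terminate */
        }
        return len;                           /* >= siz means truncation */
    }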
One problem with memcpy() is that it returns (void *), so there is no way to know if you've truncated things. From the USENIX paper on strlcpy():
> The strlcpy() and strlcat() functions return the total length of the string they tried to create. For strlcpy() that is simply the length of the source; for strlcat() that means the length of the destination (before concatenation) plus the length of the source. To check for truncation, the programmer need only verify that the return value is less than the size parameter. Thus, if truncation has occurred, the number of bytes needed to store the entire string is now known and the programmer may allocate more space and re-copy the strings if he or she wishes.
Yeah, that's an important observation especially in today's unicode world. It just strengthens my point that these "string" functions are really just bytes/memory functions in disguise.
Honestly "string" is a very harmful word that we've all grown used to. As an abstraction it sits somewhere between raw bytes and properly encoded text with proper unicode functions such as those provided by ICU. Python 3 finally forced people to start thinking about this stuff and nobody liked it.
The str functions aren't ASCII-only; they work perfectly fine with multi-byte strings such as UTF-8-encoded strings. The "length" just isn't the number of "characters", but the definition of a "character" is itself murky, and bytes are what you're usually interested in anyway.
> and bytes are what you're usually interested in anyway.
Bytes are relevant when I have to allocate memory; otherwise some definition of "character" is often more relevant. Even if I trim text to fit in a buffer, I don't want to trim inside a "character" but to get the largest number of fitting "characters". Now, "characters" are of course complicated, as grapheme clusters are what is most useful for human interaction... but those are quite out of scope for a "simple" string library...
I imagine it's simply not been analysed, so we're all assuming it's just "impossible".
Sometimes you just need to do the boring work so it's done. The Linux kernel is one of the most important pieces of software on the planet; limiting its performance and safety due to C's string handling legacy is madness.
Have you read the article? Linus's resistance to auto-conversion means one or a few people can't do it; every maintainer has to run a project involving many orgs and thousands of people over several years. It is not impossible, and it is doable given time, but the question is whether this is the best way to solve the underlying issue (and what exactly is that issue?). Assuming, of course, that the NUL replacement is backwards compatible in some way; if not, I'd say focus on migrating to Rust instead.
M. Torvalds never expressed an objection to auto-conversion. Xe expressed an objection to mass conversion, with patches that mass change function calls with little scrutiny and even less actual regression testing of all of the affected code paths. Because xyr past experience was that that was what happened in practice, and stuff broke as a consequence.
Who is M. Torvalds? I am not aware of anyone with first initial M. named Torvalds that is relevant to this discussion and would require such bizarre pronouns to be applied to them.
He's wrong; correct string handling means processing the string serially, one char at a time. Carrying a size_t along with your strings like that is a waste of a perfectly good GPR.
This was true in the 1980s, but now, if you want speed, you want to use vectorized instructions, so you want to do as much work as possible 32 bytes at a time for large strings. If you don't store a length, your string processing will be an order of magnitude slower.
If we define a `String` type as `{size_t len; size_t offset; char *data}`, then we can store up to 23 bytes directly in the struct, if we sacrifice one bit of the `len` field to indicate that.
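A sketch of that layout, assuming a little-endian 64-bit target so the low bit of `len` lands in the first byte of the union:

    #include <stddef.h>
    #include <stdint.h>

    typedef union {
        struct {
            size_t len;      /* lowest bit 0: heap representation */
            size_t offset;
            char  *data;
        } heap;
        struct {
            uint8_t tag;     /* lowest bit 1; remaining bits: inline length */
            char    buf[23]; /* up to 23 bytes stored right in the struct */
        } small;
    } String;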
strscpy, unlike literally everything in the standard but memccpy, is a good replacement. Not only does it avoid the usual performance pitfalls, it lets you safely create your own desired functionality on top of it: in other words, it is composable. This is not true of any of the other functions. You can make your own strlcpy or strncpy (not sure why you would, but bear with me) out of strscpy. Trying to go in the reverse direction without paying a large cost is not possible.
It's so funny to read all this. In college, we were always told, "don't use strcpy! Use strncpy instead!" just to find out later on that strncpy isn't that great either. Then I heard about the better strlcpy, and even now I'm hearing that isn't quite good enough.
> It's so funny to read all this. In college, we were always told, "don't use strcpy! Use strncpy instead!" just to find out later on that strncpy isn't that great either.
TBF that was definitely incorrect, as strncpy was never intended to work with C strings. Most (though not all, that would be too easy) of the n functions work with fixed-size strings. The behaviour of strncpy makes perfect sense then.
All the str* functions work on C strings, except for strncpy. It's the only str* function which doesn't expect or produce a C string, yet its behaviour is such that it will work as if it operates on C strings in most cases.
Because strncpy's peculiarities means it acts as a conversion from a C-string to a null-padded fixed-length string, which interacts interestingly with %s: "%.*s" will stop at the first null or the provided length. If you use memcpy, you need to terminate your padded string by hand.
memcpy is more useful if your fixed-size strings are e.g. space-padded. Obviously it also works if you absolutely know your input is already a fixed-size string, but strncpy will work on both fixed-size and null-terminated inputs.
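A sketch of that fixed-field use, tar-header style (the struct and names here are illustrative):

    #include <stdio.h>
    #include <string.h>

    struct record { char name[8]; }; /* fixed-size, NUL-padded field */

    int main(void)
    {
        struct record r;
        /* Copies 'a','b','c' then pads the remaining five bytes with '\0'.
           If src filled all 8 bytes, name would NOT be NUL-terminated. */
        strncpy(r.name, "abc", sizeof r.name);
        printf("%.*s\n", (int)sizeof r.name, r.name); /* stops at NUL or 8 */
        return 0;
    }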
It's funny to see all these trends come and go for such a fundamentally simple problem. I'm still of the opinion that strcpy() with a good calculation of lengths is the simplest on the occasion that you truly do need to copy a string, which IMHO is often done far more than really necessary. The main point is that the length calculation and check should've been done before the copy; if you find out only when you do the copy that it's too long, then it's already too late.
An "anti-pattern" I seem to be seeing increasingly often in newer code is that of allocating a string (or worse, a dynamically expanding buffer), copying several other strings to it, then only calling another function with that concatenated copy before freeing it, when the other function could've simply been called several times with the individual parts successively.
strlcpy's flaw is its return value. In order to get the length of the string, it has to walk the entire thing. If you are copying 20-character strings out of a 20 TB mmapped file, it will be outrageously slow, and that's an unnecessary footgun.
It should have just returned the number of bytes copied to dst if the string was successfully copied and a -1 with an errno of E2BIG if the dst was too small for the src. It would still do the copy and termination of course, and the programmer will know the length in this case because they specified it in the function call. Of course this is what strscpy does.
If you aren't sure about the length of the src buffer strlcpy can be a ticking time bomb. If you aren't sure if the src string is NULL terminated it's also a problem, which is especially bad since one of the big reasons to use strlcpy is to avoid buffer overruns. GTA Online suffered from outrageous load times due to this very same API quirk. This would also make it easy for the function to return -1 and set errno to EINVAL if you specify NULL for the src or dst.
I don’t understand. If you’re doing length checks before “strcpy()”, then you can easily call “memcpy()” instead. And.. memcpy() is faster because it can go word at a time. So, why would anyone ever need “strcpy()”?
With memcpy() you need to know the exact length, i.e. it needs to be kept around and passed to the point of the copy. With strcpy(), once you have determined that a string will always be below a maximum length, it is no longer necessary to retain its exact length until if/when it is actually needed by a copy.
If you use a microcontroller with 4 KB of RAM, you should know the length of all the buffers you use, so you don't need to store it. For NES programming, which gives you 2 KB, you don't program in C at all... much less waste cycles doing things like strlen. It's tedious, but ROM is bigger, and you can "store" the length of things in the instructions themselves (i.e., hardcoded lengths), whereas RAM is saved for more important things like the positions of sprites or whatever.
I really hope people aren't seriously doing string parsing on microcontrollers.
But I thought we were talking about compilers here, and defining what a compiler should do when it comes across a string literal, and not about what people should or shouldn’t do with strings on 4KB RAM MCUs.
String parsing is sometimes necessary, and it could be as simple as formatting and sending log lines through a UART.
I love how, from reading the article, you can't tell if the count parameter of the proposed replacement includes the null terminator (is the length of the dest buffer) or not (is the length - 1).
Also, it is unclear what side effects the function has if it returns ERR2BIG. Is dest null terminated or not in that case?
I could figure these things out, but my point is that all these str*cpy functions are fundamentally error prone because other people can't keep those details straight either, apparently.
I've never had an issue using 1+strlen() to figure out how many bytes to copy, checking the destination buffer size and then invoking memcpy. It's a bit inefficient (two passes, though both are usually using vectorized instructions) but at least it is really clear to the reader.
The main problem is that C lets you typo this:
strlen(s+1)
When you mean this:
strlen(s)+1
Also, of course, for untrusted buffers, you can't use strlen().
strscpy always null terminates, unless the destination buffer has size zero. -E2BIG is returned when the entire string could not be copied. The return value is essentially strlen of the resulting string. It’s good all around.
Even with that, I still don't have enough information to safely invoke it. (What is count?)
These APIs are too complicated/subtle. Honestly, at this point, I just use (ptr, len) pairs that point to non-null terminated strings whenever possible.
I don’t usually call people out for moving the goalposts, but when it comes to string copying routines I get real cranky about it. Stop moving the goalposts. You want a string copying routine. Most suck. This is the good one. Or at least, it’s one in the family of good ones. You have a C-style string, you want it copied elsewhere, you want it to be mindful of the bounds you give it, you want it to not do something you didn’t ask it for, this does that. It’s not trivial to write one, I wrote a whole thing about how it’s not and how this is good. People did it right. Now, shut up.
No, seriously, shut up. After all that work you don’t get to go “uh, this is actually unsafe because what if the original string is not null terminated and the buffer is not actually the size you said it is”. That’s not how strings in C work. Strings in C work in exactly one way and you don’t get to change that, because that’s just how they work, and there’s 50 years of code that uses string copying routines that work with this that would very much like to have the holy grail of string copies and doesn’t need you muddying the waters and saying that this cannot be invoked safely. It can. The invariants are challenging to uphold but this implementation provides every accommodation that the C standard can provide to this API.
Every single time string copying comes up on Hacker News some enlightened commenter is like “yeah why even bother with this I’m just going to use pointer+length pairs and solve everything”. No. You don’t get to drag this discussion about a very real problem into your half-baked “solution” for C. You especially do not get to reply to me explaining exactly why this API improves upon the state of the art with a lazy dismissal where you retreat to something completely different and go “yeah this is way safer lol”. This would be like if I just responded “yeah I’d just use Rust here I don’t get what the deal is” to you. It doesn’t help.
Look, I’d love to have a conversation about fat pointers and alternative strings in C. I’d love to talk about how we can migrate all this code to safer languages. But this isn’t the time or place for that, and I’m sick of this discussion repeating every single time we talk about, for the love of god, projects that are using normal C strings and safer ways to work with them. It’s not just you but you responding to my comment with something I perceive as fundamentally uninspired set me off.
Using snprintf is equivalent to using strlcpy, and also equivalent to using a strlen followed by a memcpy.
As explained in the parent article, using anything equivalent to strlen is inefficient when the source is much longer than the destination, because you do not need the length of the source; you just need to know that it is longer than the destination.
Instead of defining yet another strcpy-like function, I would have preferred an alternative to strlen, taking 2 arguments, the size of the destination and a pointer to the source, and returning either the length of the source or an error when the source is longer than the provided size.
Then every strcpy can be replaced with that strlen-replacement function followed by memcpy, and the same strlen replacement can be used with all other string functions, e.g. with strcat.
POSIX 2008 has added the strnlen function ("size_t strnlen(const char *s, size_t maxlen);"), which does not have the inefficiency of strlen, by returning at most the size of the destination, but it does not signal an error to indicate truncation.
You can still use strnlen if you give it a size larger by one and check the return value to detect an oversize source, but it is slightly less convenient than if strnlen's return value had been defined like the return value of strscpy.
Efficiency is not the goal. Security and safety are. strlen() in kernel code on an (intentionally malicious) unterminated input has the potential to go off reading some other process's data, which opens up possible attack vectors, or the potential to cause unexpected page faults. strnlen() with one greater than the true buffer size, with the one extra memory read that it implies, has the same potential.
No, strnlen with any given length does not have the potential to read anything it should not, the way strlen does.
The length argument of strnlen is under the control of the kernel writer, not under the control of whoever provides the source string.
The new strscpy function can also read anything: if the kernel writer allocates a large enough destination buffer, there is no difference between strscpy and strnlen from a security POV. What they read is determined by the length of the destination buffer.
Because the destination buffer must have an extra byte for the null terminating byte, strnlen must be called with the size of the destination buffer as argument, but if it returns that size, that signals an error, and then a number of bytes that is one less than the return value must be copied from the source (and a null byte must always be written after whatever is copied by memcpy).
The kernel function strscpy is exactly equivalent to using strnlen and memcpy correctly. strscpy encapsulates that behavior, so it is more convenient, but anyone who wants to use strscpy in a user C program must define a strscpy macro or function using strnlen and memcpy, which are standard functions.
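A sketch of that equivalence for user code (my own wrapper, returning the copied length or -1 on truncation, roughly in the spirit of strscpy):

    #include <string.h>

    long copy_str(char *dst, size_t dstsize, const char *src)
    {
        if (dstsize == 0)
            return -1;
        size_t len = strnlen(src, dstsize); /* never reads past dstsize bytes */
        if (len == dstsize) {               /* src plus its NUL doesn't fit */
            memcpy(dst, src, dstsize - 1);  /* truncate... */
            dst[dstsize - 1] = '\0';        /* ...but still terminate */
            return -1;                      /* strscpy would return -E2BIG */
        }
        memcpy(dst, src, len + 1);          /* copy the bytes and the NUL */
        return (long)len;
    }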
The length given to strnlen() is, as you yourself said, greater than the size of the actual buffer in order to detect an oversize source. So strnlen() has the potential to access the character beyond the end of the array, checking it for NUL, incurring page faults and whatnot.
memchr is even less convenient than strnlen, because it returns a pointer, so you must subtract two pointers to get the size that needs to be used in memcpy.
memchr is indeed an alternative for someone who would need to use some old libc, without the strnlen function.
The strscpy function is available only inside the Linux kernel, but strnlen should be available in any environment that conforms to POSIX 2008, so strnlen + memcpy is what I would recommend for copying null-terminated strings in any C program.
Not even close to accurate. I'm not aware of any large C library with a string copy function that guarantees a NULL-terminated result in all possible cases.
I mean, this ask is technically impossible since you may specify zero length to any of them. At that point signalling an error is the only thing it can do.
Unlike C++, the C language takes its freestanding mode seriously. In this mode there's no malloc(), and so your string type can't exist. But today, in freestanding C, I can write "ERROR" and get a string literal.
So that would mean either you abolish string literals in freestanding mode (making C worse), or you have a separate type for these string literals, distinct from the type for an expandable string.
Now, Rust pulls off the latter fairly elegantly, in my view, but that takes a lot of sophisticated type features in the language, including a Deref and several AsRef implementations. In Rust these are general features open to other types; C could special-case them instead, but that's still a lot of engineering for a small feature.
The result is definitely not the elegant language which can be described in a slim book. Maybe it's better anyway, but it's not C.
You mean the lightly specified section of the ISO C standard regarding freestanding use, or the compiler extensions that most people ignore aren't part of the standard?
The rules around const don't compose with pointers inside structs, which makes this more complicated than you'd like. (I guess, just one struct for mutable and one for view of the string would work.)
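To illustrate the composition problem (a sketch; the struct is mine):

    #include <stddef.h>

    struct str { char *data; size_t len; };

    void f(const struct str *s)
    {
        /* s->data = NULL;   -- error: the pointer member is const here */
        s->data[0] = 'x';    /* allowed: const doesn't reach the pointee */
    }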
I think there is a sufficient membership on the standards committee that doesn't want such "automated" features, especially when they touch memory allocation and freeing. And I too find it somewhat antithetical to C as a language.
Still, why there isn't a good quality library, or set of libraries, covering these features is my question.
Dumb(?) question from a non-kernel programmer: can strings in the kernel be limited to lengths < 2^16 (65536)? Then the length can be stored in the unused bits of the pointer. This probably creates more problems than it solves, especially if we ever have CPUs that want to address more than 256 TiB of memory. This would also only work on 64-bit architectures.
Or if say, most strings tend to be small (<256 chars), use 8 unused bits of the pointer for the length. If the string is longer than that, mask the bits to 0. Then the string handling functions can have a fast path that uses the length from pointer if available, if not, walks the entire string.
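A hedged sketch of that tagging idea, assuming 64-bit pointers whose top 8 bits are unused (canonical user-space addresses; real code would need to be far more careful):

    #include <stdint.h>
    #include <string.h>

    #define TAG_SHIFT 56
    #define PTR_MASK  ((1ULL << TAG_SHIFT) - 1)

    /* Stash lengths < 256 in the top byte; tag 0 means "unknown, walk it". */
    static char *tag_ptr(char *p, size_t len)
    {
        uint64_t tag = len < 256 ? (uint64_t)len : 0;
        return (char *)(uintptr_t)(((uintptr_t)p & PTR_MASK) | (tag << TAG_SHIFT));
    }

    static size_t tagged_strlen(char *tp)
    {
        uint64_t tag = (uintptr_t)tp >> TAG_SHIFT;
        char *p = (char *)((uintptr_t)tp & PTR_MASK);
        return tag ? (size_t)tag : strlen(p); /* fast path, else O(N) walk */
    }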
So if the length of the string is shorter than the size of a pointer, the string is stored inline in the struct, and if the length is longer it's a separate allocated object.
Tangentially related: FreeBSD tends to use sbuf(3) (https://www.freebsd.org/cgi/man.cgi?query=sbuf) for anything but the very simplest string manipulation. It’s safer and more convenient than NULL-terminated strings.
You can use strncpy if you are careful to do it right (e.g. if you know the end of the destination buffer is null-terminated, specify the length one less than the destination buffer size). Sometimes memcpy is better, though. You can use both in one program.
> You can use strncpy if you are careful to do it right
That’s like saying if you’re careful to do it right you can use a Bowie knife as a screwdriver.
Technically correct, but it will practically end in tears. If you're not working with fixed-size strings (which few are these days), just copy over strscpy or strlcpy; 5 lines won't kill you.
The express purpose of strncpy is to handle fixed-length strings in classic unix (file) structures (e.g. tar), I'm not convinced it's correct to use in basically any other situation.
I haven't used C in quite a while, but I did not remember strlcpy using strlen, and it seems the Linux kernel's implementation differs quite a bit from the OpenBSD one:
The while (*src++); on line 45 of lib/libc/string/strlcpy.c is an inline implementation of strlen here. It's done solely for the benefit of the return value (src - osrc - 1) but means that it traverses the entire length of src until the first NUL - even if that's gigabytes or wanders out of mapped memory.
The BSD kernels do not have nearly as many APIs that employ variable-length special-character-terminated strings that have to be parsed from human-readable forms, however. Whilst this is common for things like procfs in the Linux world, the BSDs favour general purpose interfaces like sysctl(3) where data are transferred in machine-parseable forms rather than human-readable ones.
So "all over the BSDs" rather glosses over the fact that whilst common in applications code, there's less need for these sorts of string functions in the kernel. (And it is the Linux kernel that is the headlined subject here.) strlcpy() is in libkern, but it's not anywhere near as frequent in the kernel as it is in applications code.
Glibc isn't much of a standard of excellence for rational behavior considering they have malloc_info() dumping out XML rather than actually returning a useful data structure.
I find myself agreeing with the glibc maintainer in the extended discussion linked from the article:
https://lwn.net/Articles/612244/
> Correct string handling means that you always know how long your strings are and therefore you can use memcpy (instead of strcpy).
I think he's absolutely right. The str* functions are just worse versions of the mem* functions.