So why insist on this NUL termination business? I expected this to be necessary in C libraries due to legacy code but the kernel can always do whatever it wants internally.
One thought: most targets are not built with -ffreestanding, and while the kernel doesn't link against a libc, it does provide most symbols with the same semantics. This allows the compiler to perform libcall optimizations.
For example, calls to printf can be transformed into calls to puts under certain conditions. The compiler can recognize and rewrite those calls, even after other optimizations have run. There are many of these tricks for the str* functions, all of which assume NUL-terminated C strings.
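A minimal sketch of that transformation (GCC and Clang both do this at -O1 and above):

    #include <stdio.h>

    int main(void)
    {
        /* A printf call whose format string ends in '\n' and contains no
           conversion specifiers is typically rewritten to puts("hello"),
           skipping format-string parsing entirely. */
        printf("hello\n");
        return 0;
    }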
Though perhaps if the mem* functions were used to implement a fat strings implementation, many of those might still apply.
Point being: the compiler can help when your string representation is first-class in the language spec.
The compiler can also help with FORTIFY; it can insert compile-time length checks in certain cases that can make code safer. This avoids treewide rewrites, which are relatively painful for the Linux kernel due to its development model, but not impossible. That's another barrier to a new string representation and a library of routines for it.
That said, strscpy is not part of the language spec, so I'm guessing that unless it's implemented in terms of language-defined functions, it gets neither libcall optimizations nor fortification.
I think that structure will be helpful, and you can also easily pass substrings. (With null-terminated strings, you can easily cut characters off the beginning of a string, but not off the end. Passing the length and the pointer together lets you do both, and also lets you pass strings containing null characters.) I think this is better than Pascal strings, precisely because you can make substrings like this.
(Also, the structure can be passed by value, without a pointer to the structure, which makes it easy to use in other functions, e.g. to convert a null-terminated string or a Pascal string into this structure without allocating additional memory.)
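Something like this, as a sketch (the struct and function names here are mine, not from any standard):

    #include <stddef.h>
    #include <string.h>

    struct bytes {
        const char *pointer;
        size_t size;
    };

    /* Wrap a null-terminated string without copying or allocating. */
    static struct bytes from_cstr(const char *s)
    {
        struct bytes b = { s, strlen(s) };
        return b;
    }

    /* Slice from either end: just arithmetic on the by-value copy. */
    static struct bytes slice(struct bytes b, size_t start, size_t len)
    {
        b.pointer += start; /* drop `start` bytes from the front */
        b.size = len;       /* keep only `len` bytes */
        return b;
    }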
It boggles my mind that the C standard hasn’t officially added support for this. It seems like such a small change that would dramatically improve the quality of C code.
It would not be a small change if you wanted to actually make them usable for the standard library, i.e. add a second 'struct bytes' variant of EVERY function that currently accepts a NUL-terminated string.
And no, just calling a conversion function beforehand (which would need a heap allocation) would never be accepted by C programmers, because of the overhead.
Then there is inertia: how many would really want to port their application to a different string type? Not to mention that all the libraries you're using would also have to be converted.
You can make the read-only conversion "free" by storing the size of the string as well as terminating the actual string contents with a NUL. So the string "bar" would be { length: 3, contents: ['b', 'a', 'r', '\0'] }. All your functions dealing with the "rich" string type work the same using the length, except they have to be aware of the need to preserve the terminating 0, and if you want to pass a read-only string to a legacy function you can just pass its contents.
Of course that also creates a giant footgun, because you might manage to get the old and new strlen to disagree on the size, since one reads the length field and the other searches for the \0.
SDS is cool, I use it in my projects and have also extended it further with some new features... but I worry about the amount of dynamic allocation going on, it would be nice to have an alternative solution such as using stack or pool allocations, or being able to declare SDS string literals at compile-time (probably only possible in C++). Every time I notice that I need to replace a const char* with an SDS I just know I'm slowing things down and adding more complexity.
It doesn't have to be 1:1 equal to SDS; rather, at least have WG14 do something, anything at all, instead of continuing to push the agenda of C being a Swiss cheese of security exploits, to the point where everyone is adopting hardware memory tagging as the ultimate mitigation.
For some reason C seems to be going through standard updates as often as C++ now. Not only that, they've added absolutely huge stuff like threading primitives, as well as an insane generic macro thing.
Surely they could add better designed structures and functions for dealing with memory.
Is that really so big a deal, considering they've done that multiple times already anyway? How many versions of strcpy, printf, and other string handling functions are there already?
Nul termination is not only in C but also in some file formats. It has some advantages - apart from modest space savings for short strings, it means that you can read a string from some given location to the end - without any out of band data (length) that necessarily has to be stored in a different, agreed on location. This is a very valuable property.
I'm not saying don't use length fields, I'm saying use nul terminators where possible and use length fields where needed. And, they are not mutually exclusive.
And C doesn't need a standardized length delineated string structure in my opinion. Nul terminators serve the job fine for most Standard APIs (which take only short strings), and can receive length fields as separate function parameters where required.
• On the most widely used architectures, reading a string is much easier if the string is a known length. x86 has its string instructions, ARM has its Load Multiple instructions.
• Even with length-prefixed strings, many uses of short strings are with string literals and so the length does not need to be stored anywhere.
It's inherent to the language. Writing "string" gives you a NUL-terminated string; converting that to another format takes effort.
Interestingly, some Mac OS compilers would let you write "\pABC" to get a structure containing the bytes {3, 'A', 'B', 'C'}. (The "p" stands for Pascal.)
> It's inherent to the language. Writing "string" gives you a NUL-terminated string; converting that to another format takes effort.
Couldn't we just leave the NUL byte in there and pretend it doesn't exist?
    struct bytes { size_t size; char *pointer; }; // assuming some such definition

    const char *literal = "string literal with NUL byte";
    struct bytes text;
    text.size = strlen(literal);    // strlen doesn't count the NUL terminator
    text.pointer = (char *)literal;
    // I know I discarded the const qualifier up there, but it's just to illustrate
Then a copy(text, other) function would conveniently ignore the entire NUL issue. The copy would not even have the terminating NUL.
> Interestingly, some Mac OS compilers would let you write "\pABC" to get a structure containing the bytes {3, 'A', 'B', 'C'}. (The "p" stands for Pascal.)
Pascal strings are nice, but their sizes are too limited. Same idea as my structure above, but with a uint8_t for the length instead of a size_t (uint64_t on typical 64-bit targets).
Pascal strings can be improved to support strings of any reasonable length.
Say the two highest bits of the counter set the size of the counter field: 00 = 6 remaining bits (1 byte), 01 = 14 bits (2 bytes), 10 = 30 bits (4 bytes), 11 = 62 bits (8 bytes).
A simple `*counter & 0x3f` would remove the width-setting bits, without any shifts, additions, etc.
This allows small strings to use only 1 byte for the counter, while allowing huge strings that span the entire RAM.
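A sketch of a decoder for that scheme, assuming the counter is stored big-endian so the width bits land in the first byte:

    #include <stddef.h>
    #include <stdint.h>

    static size_t decode_length(const uint8_t *p, size_t *header_bytes)
    {
        size_t width = (size_t)1 << (p[0] >> 6); /* 00->1, 01->2, 10->4, 11->8 bytes */
        size_t len = p[0] & 0x3f;                /* strip the two width-setting bits */
        for (size_t i = 1; i < width; i++)
            len = (len << 8) | p[i];             /* fold in continuation bytes */
        *header_bytes = width;
        return len;
    }

(The shift-free `& 0x3f` covers the one-byte case; wider counters still have to be assembled byte by byte.)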
That's correct. Using strlen on anything but C string literals is just asking for bugs. The thing is untrusted strings don't come from C itself, they come from I/O.
The kernel has perfectly reasonable I/O interfaces.
Well... At least you would always know the length if the standard C library didn't abstract that perfectly good interface away behind stdio just so it could do buffering and return NUL-terminated strings.
It's just like errno. The kernel simply returns a negated error constant on failure. The C standard library takes that sane interface and turns it into a thread local global variable.
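A hedged, x86-64-only sketch of that convention; the libc wrappers are what turn this into errno plus a -1 return:

    #include <stdio.h>

    /* Invoke the raw read(2) syscall, bypassing libc. On x86-64 the
       syscall number goes in rax, arguments in rdi/rsi/rdx, and the
       kernel returns either a byte count or a negated errno in rax. */
    static long raw_read(int fd, void *buf, unsigned long count)
    {
        long ret;
        __asm__ volatile ("syscall"
                          : "=a"(ret)
                          : "a"(0L /* __NR_read */), "D"((long)fd), "S"(buf), "d"(count)
                          : "rcx", "r11", "memory");
        return ret;
    }

    int main(void)
    {
        char c;
        printf("%ld\n", raw_read(-1, &c, 1)); /* prints -9, i.e. -EBADF */
        return 0;
    }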
No. You would have extra problems. Not just from the maintenance issues on the migration path that Linus Torvalds pointed out, but especially from the poor thinking generally involved, such as people deciding that the way to respond to the observation that the POSIX I/O API is reasonable is to "kill POSIX".
Apparently at some point people forgot that if you don't like an API, library, or interface, you can just put a wrapper on top of it. Many libraries are just "toolkits", after all; you are not supposed to use the raw API everywhere. E.g. if you don't quickly stop doing that with the BSD sockets API, you are part of the problem.
> if you don't like an API, library or interface, you can just put a wrapper on top of it
We can also simply get rid of all that bloat and just use the system calls directly.
> you are not supposed to use the raw API everywhere
Linux system calls are a stable interface and the entry points are even programming language agnostic. It's okay to use them directly.
> e.g. you if don't quickly stop doing that with the BSD sockets API, you are part of the problem
Yeah it's not a good idea on other operating systems since the system call interfaces are unstable. We have to use their C libraries on those platforms.
> Apparently at some point people forgot that if you don't like an API, library or interface, you can just put a wrapper on top of it.
If I were to survey the state of modern software development and try to characterize the skills lost compared to decades past, "not enough wrappers" would be nowhere on my list.
Ok, that was a bit overstated; s/some point people forgot/some people forget/. I don't know what the common practice was a few decades ago, because source code was not as visible as it is today, but in my experience what you see on GitHub is not what the average developer does. GitHub advertises itself as a social coding network; just like with other social networks, there is a selection bias in what gets posted.
> That's correct. Using strlen on anything but C string literals is just asking for bugs. The thing is untrusted strings don't come from C itself, they come from I/O.
Indeed, it is just an input check/"sanitization" issue - just like one carefully checks that a JSON or XML input is well formed, if a protocol spec says that some part is an ASCIIZ string, one has to check that there's indeed a zero byte before the end of the data packet.
I know. That string literal is likely to be located in a read only page. In real code, I'd have to allocate some memory and copy the text to the new location if I want the resulting structure to be writable. For clarity's sake I omitted these details.
This isn't unique to my example though. Traditional C strings have the exact same problem and they do get copied all the time.
Both GCC and Clang support this with -fpascal-strings. "\pABC" actually gives you {3, 'A', 'B', 'C', 0} -- the \p is a character and so extends the length of the string by one and it's still nul-terminated, it's not a pure (non-nul-terminated) pascal string.
> I think he's absolutely right. The str* functions are just worse versions of the mem* functions.
I think they aren't "worse versions of the mem* functions"; they are different functions. The "str" functions deal with null-terminated data, "mem" functions deal with data of a specified length, and "strn" deals with whichever is shorter.
Some "str" functions do not have "mem" variants (and vice-versa). For example, there is no "memdup" function.
> they are different functions. The "str" functions deal with null-terminated data
Well, yes. I say they're worse because the only reason they exist is to deal with this NUL terminator nonsense. The str* functions all reduce to the mem* functions once the string length is computed. To me it's like this:
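A sketch of what I mean:

    #include <string.h>

    /* strcpy is just memcpy after a hidden O(N) scan for the terminator. */
    char *my_strcpy(char *dst, const char *src)
    {
        size_t len = strlen(src);         /* the pass the NUL design forces */
        return memcpy(dst, src, len + 1); /* then it's a plain mem* call */
    }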
There is no need for these functions to exist if we get rid of this NUL terminator business.
> Some "str" functions do not have "mem" variants (and vice-versa). For example, there is no "memdup" function.
There could easily be. For example, strdup is essentially strcpy(malloc(strlen(string) + 1), string). A memdup function would be even more efficient because the length is already known: memcpy(malloc(length), source, length).
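For instance, a minimal memdup sketch (not a standard function) might look like:

    #include <stdlib.h>
    #include <string.h>

    /* Duplicate a buffer of known length; returns NULL on allocation failure. */
    void *memdup(const void *src, size_t len)
    {
        void *dst = malloc(len);
        if (dst != NULL)
            memcpy(dst, src, len);
        return dst;
    }

Note this also fixes what the strdup one-liner above glosses over: the malloc result is checked before use.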
In the early 1990s, there was a popular shareware string library for C in circulation that added a whole raft of extra strXXX() functions, such as strrtrim() and strend(), to a C runtime library. All of these were useful. Indeed, they were fundamental in some other languages, e.g. some dialects of BASIC with their various string functions like MID$ and RIGHT$. But the standard C library never gained them.
The strXXX() set in the C standard library is, rather, in large part an ad hoc set of useful wrappers around stuff that could be done with assembly language idioms, like REPNE SCASB on x86 instruction sets, that had grown up by 1987. The functions weren't intended to be reducible to memXXX(). They were intended to be reducible to assembly language, or even to compiler intrinsics.
The sad part is that the context here is kernel code, in particular Linux kernel code, where human-readable text interfaces (as opposed to machine-parseable interfaces) are the norm. Whitespace-terminated or LF-terminated strings are the norm, in things like procfs for example, and it is a double irony that the C standard library addresses NUL termination more readily than it does those, and even then provides only an ad hoc collection of NUL-terminated string functions with long-since well-known glaring holes unsuitable for kernel code.
Getting rid of the problem in the way that you suggest would necessitate redesigning a lot of kernel APIs to not be human-readable text that operates in terms of variable length strings terminated by special character values with no explicit length counters. No more redirecting the output of the "echo" command to /proc/something . This is exceedingly unlikely to happen.
With the { size_t len; char *data; } form, now you've got two allocations (or at least two separate memory regions, or at least a pointer wasted, assuming the data is constant) per dynamically-allocated thing.
On the other hand, if you take a more direct mirror of a Pascal string:
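(Presumably something like this, using a C99 flexible array member; the name is mine:)

    #include <stddef.h>

    struct pstr {
        size_t len;
        char data[]; /* the bytes live in the same allocation as the header */
    };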
You're back to one memory span but can't reslice it.
And of course the worst codebase is when someone uses the first one because they want to keep slicing and someone else uses the second one because they need to save memory / indirections. To support both you end up writing functions that take a separate size and data pointer, and... well, then what's the point?
Most if not all string algorithms will eventually do this anyway:
size_t length = strlen("some string");
It's so common. Might as well memoize it so it's always available, with no need to repeatedly walk the string, which is O(N) every time. So many string algorithms call strlen, often multiple times on the same string. I remember GTA V took 6 minutes to parse a goddamn JSON file because of stuff like this, and part of the fix was to store the string lengths.
Which representation you want depends heavily on what you're doing with the string(s), plus there are other common variations, like len+cap instead of just size, SSO, etc.
So if you can't standardize the data structure, what's the common interface? A function that takes a pointer and a length - which is what we already have. So everyone in this thread appealing to the C standardization process or stdlib to do something wants instead - what, exactly?
I'm not aware of any common string implementation that takes 3-4 words just to put an empty string in your struct especially if it also still requires external allocation with additional size words once the string gets above a certain size. Java takes 1, Go takes 2, SDS takes 1, libstdc++ takes 3 but doesn't require an external store later, etc.
Delphi uses a pointer to the latter, in addition to keeping the actual strings with a zero at the end. That way a cast to a C-style "string" is free.
In order to allow for the pointer to live on stack and minimize copying, the data is also reference counted and the compiler takes care of inserting the necessary reference counting calls where needed.
Overall it's pretty flexible, but the reference counting means it's not ideal to use shared strings in heavily threaded code. Of course the second a thread modifies a string, a new string is allocated and that thread can happily work on its "own" string.
Anyway, just yet another way of implementing strings.
> now you've got two allocations (or at least two separate memory regions, or at least a pointer wasted assuming it's constant) per dynamically-allocated thing.
Having a size_t on the stack is hardly an issue, it's what every low-level modern language does. It's fast, convenient, and pretty efficient. It also doesn't require deref'ing to get the length, which is a pretty common use case (e.g. checking if a string is empty, or too big, or something along those lines).
That approach only works if the string isn't being mutated in a way which could change its size, though. Otherwise you need to make sure it has a lexical lifetime (and be very careful with it), or if that's not possible pay the double alloc cost.
I would be also worried about any difficulties separating the length and data causes for prefetching/cache lines though.
The second form is also more amenable to SSO; I'm not sure how often that would come up in the kernel, but it's saved me a decent chunk of memory in at least one past project. (Still, today I'm sometimes frustrated by Go `string` when porting from Java `String`: great, now I don't have to pay a boxing overhead, but if it's often absent/empty my base size is now 2x what it would otherwise be...)
> I think he's absolutely right. The str* functions are just worse versions of the mem* functions.
Per Todd C. Miller (creator of strlcpy and maintainer of sudo) and Theo de Raadt (OpenBSD) in these 1999 (PostScript) slides, the simplest implementation of strlcpy() uses (can use) memcpy():
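Along those lines, a minimal strlcpy sketch in terms of strlen + memcpy (matching the documented semantics: always NUL-terminates when siz > 0, and returns strlen(src) so callers can detect truncation):

    #include <string.h>

    size_t strlcpy(char *dst, const char *src, size_t siz)
    {
        size_t len = strlen(src);             /* full length of the source */
        if (siz != 0) {
            size_t n = len < siz - 1 ? len : siz - 1;
            memcpy(dst, src, n);              /* copy what fits... */
            dst[n] = '\0';                    /* ...and always terminate */
        }
        return len;                           /* >= siz means truncation */
    }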
One problem with memcpy() is that it returns (void *), so there is no way to know if you've truncated things. From the USENIX paper on strlcpy():
> The strlcpy() and strlcat() functions return the total length of the string they tried to create. For strlcpy() that is simply the length of the source; for strlcat() that means the length of the destination (before concatenation) plus the length of the source. To check for truncation, the programmer need only verify that the return value is less than the size parameter. Thus, if truncation has occurred, the number of bytes needed to store the entire string is now known and the programmer may allocate more space and re-copy the strings if he or she wishes.
Yeah, that's an important observation especially in today's unicode world. It just strengthens my point that these "string" functions are really just bytes/memory functions in disguise.
Honestly "string" is a very harmful word that we've all grown used to. As an abstraction it sits somewhere between raw bytes and properly encoded text with proper unicode functions such as those provided by ICU. Python 3 finally forced people to start thinking about this stuff and nobody liked it.
The str functions aren't ASCII-only; they work perfectly fine with multi-byte strings such as UTF-8-encoded strings. The "length" just isn't the number of "characters", but the definition of a "character" is itself murky, and bytes are what you're usually interested in anyway.
> and bytes are what you're usually interested in anyway.
Bytes are relevant when I have to allocate memory; otherwise some definition of "character" is often more relevant. Even if I trim text to fit in a buffer, I don't want to trim inside a "character" but to get the largest number of fitting "characters". Now, "characters" are of course complicated, as grapheme clusters are what is most useful for human interaction... but those are quite out of scope for a "simple" string library...
I imagine it's simply not been analysed, so we're all assuming it's just "impossible".
Sometimes you just need to do the boring work so it's done. The Linux kernel is one of the most important pieces of software on the planet; limiting its performance and safety due to C's string handling legacy is madness.
Have you read the article? Linus's resistance to auto-conversion means one or a few people can't do it; every maintainer has to run a project involving many orgs and thousands of people over several years. It is not impossible, and it is doable given time, but the question is whether this is the best way to solve the underlying issue (and what exactly is that issue?). Assuming, of course, that the NUL replacement is backwards compatible in some way; if not, I'd say focus on migrating to Rust instead.
M. Torvalds never expressed an objection to auto-conversion. Xe expressed an objection to mass conversion, with patches that mass change function calls with little scrutiny and even less actual regression testing of all of the affected code paths. Because xyr past experience was that that was what happened in practice, and stuff broke as a consequence.
Who is M. Torvalds? I am not aware of anyone with first initial M. named Torvalds that is relevant to this discussion and would require such bizarre pronouns to be applied to them.
He's wrong; correct string handling means processing the string serially, one char at a time. Carrying a size_t along with your strings like that is a waste of a perfectly good GPR.
This was true in the 1980s, but now, if you want speed, you want to use vectorized instructions, so you want to do as much work as possible 32 bytes at a time for large strings. If you don't store a length, your string processing will be an order of magnitude slower.
If we define a `String` type as `{size_t len; size_t offset; char *data}`, then we can store up to 23 bytes directly in the struct, if we sacrifice one bit of the `len` field to indicate that.
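A sketch of that layout, assuming a little-endian 64-bit target so the low bit of `len` lands in the first byte of the union:

    #include <stddef.h>
    #include <stdint.h>

    typedef union {
        struct {
            size_t len;      /* lowest bit 0: heap representation */
            size_t offset;
            char  *data;
        } heap;
        struct {
            uint8_t tag;     /* lowest bit 1; remaining bits: inline length */
            char    buf[23]; /* up to 23 bytes stored right in the struct */
        } small;
    } String;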
strscpy, unlike literally everything in the standard but memccpy, is a good replacement. Not only does it avoid the usual performance pitfalls, it lets you safely create your own desired functionality on top of it: in other words, it is composable. This is not true of any of the other functions. You can make your own strlcpy or strncpy (not sure why you would, but bear with me) out of strscpy. Trying to go in the reverse direction without paying a large cost is not possible.
It's so funny to read all this. In college, we were always told, "don't use strcpy! Use strncpy instead!" just to find out later on that strncpy isn't that great either. Then I heard about the better strlcpy, and even now I'm hearing that isn't quite good enough.
> It's so funny to read all this. In college, we were always told, "don't use strcpy! Use strncpy instead!" just to find out later on that strncpy isn't that great either.
TBF that was definitely incorrect, as strncpy was never intended to work with C strings. Most (though not all, that would be too easy) of the n functions work with fixed-size strings. The behaviour of strncpy makes perfect sense then.
All the str* functions work on C strings, except for strncpy. It's the only str* function which doesn't expect or produce a C string, yet its behaviour is such that it will work as if it operates on C strings in most cases.
Because strncpy's peculiarities means it acts as a conversion from a C-string to a null-padded fixed-length string, which interacts interestingly with %s: "%.*s" will stop at the first null or the provided length. If you use memcpy, you need to terminate your padded string by hand.
memcpy is more useful if your fixed-size strings are e.g. space-padded. Obviously it also works if you absolutely know your input is already a fixed-size string, but strncpy will work on both fixed-size and null-terminated inputs.
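A sketch of that fixed-field use, tar-header style (the struct and names here are illustrative):

    #include <stdio.h>
    #include <string.h>

    struct record { char name[8]; }; /* fixed-size, NUL-padded field */

    int main(void)
    {
        struct record r;
        /* Copies 'a','b','c' then pads the remaining five bytes with '\0'.
           If src filled all 8 bytes, name would NOT be NUL-terminated. */
        strncpy(r.name, "abc", sizeof r.name);
        printf("%.*s\n", (int)sizeof r.name, r.name); /* stops at NUL or 8 */
        return 0;
    }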
It's funny to see all these trends come and go for such a fundamentally simple problem. I'm still of the opinion that strcpy() with a good calculation of lengths is the simplest on the occasion that you truly do need to copy a string, which IMHO is often done far more than really necessary. The main point is that the length calculation and check should've been done before the copy; if you find out only when you do the copy that it's too long, then it's already too late.
An "anti-pattern" I seem to be seeing increasingly often in newer code is that of allocating a string (or worse, a dynamically expanding buffer), copying several other strings to it, then only calling another function with that concatenated copy before freeing it, when the other function could've simply been called several times with the individual parts successively.
strlcpy's flaw is its return value. In order to get the length of the string, it has to walk the entire thing. If you are copying 20-character strings out of a 20 TB mmapped file, it will be outrageously slow, and that's an unnecessary footgun.
It should have just returned the number of bytes copied to dst if the string was successfully copied and a -1 with an errno of E2BIG if the dst was too small for the src. It would still do the copy and termination of course, and the programmer will know the length in this case because they specified it in the function call. Of course this is what strscpy does.
If you aren't sure about the length of the src buffer strlcpy can be a ticking time bomb. If you aren't sure if the src string is NULL terminated it's also a problem, which is especially bad since one of the big reasons to use strlcpy is to avoid buffer overruns. GTA Online suffered from outrageous load times due to this very same API quirk. This would also make it easy for the function to return -1 and set errno to EINVAL if you specify NULL for the src or dst.
I don’t understand. If you’re doing length checks before “strcpy()”, then you can easily call “memcpy()” instead. And.. memcpy() is faster because it can go word at a time. So, why would anyone ever need “strcpy()”?
With memcpy() you need to know the exact length, i.e. it needs to be kept around and passed to the point of the copy. With strcpy(), once you have determined that a string will always be below a maximum length, it is no longer necessary to retain its exact length until if/when it is actually needed by a copy.
If you use a microcontroller with 4 KB of RAM, you should know the length of all the buffers you use, so you don't need to store it. For NES programming, which gives you 2 KB, you don't program in C at all... much less waste cycles doing things like strlen. It's tedious, but ROM is bigger, and you can "store" the length of things in the instructions themselves (i.e., hardcoded lengths), whereas RAM is saved for more important things like the positions of sprites or whatever.
I really hope people aren't seriously doing string parsing on microcontrollers.
But I thought we were talking about compilers here, and defining what a compiler should do when it comes across a string literal, and not about what people should or shouldn’t do with strings on 4KB RAM MCUs.
String parsing is sometimes necessary, and it could be as simple as formatting and sending log lines through a UART.
I love how, from reading the article, you can't tell if the count parameter of the proposed replacement includes the null terminator (is the length of the dest buffer) or not (is the length - 1).
Also, it is unclear what side effects the function has if it returns ERR2BIG. Is dest null terminated or not in that case?
I could figure these things out, but my point is that all these str*cpy functions are fundamentally error prone because other people can't keep those details straight either, apparently.
I've never had an issue using 1+strlen() to figure out how many bytes to copy, checking the destination buffer size and then invoking memcpy. It's a bit inefficient (two passes, though both are usually using vectorized instructions) but at least it is really clear to the reader.
The main problem is that C lets you typo this:
strlen(s+1)
When you mean this:
strlen(s)+1
Also, of course, for untrusted buffers, you can't use strlen().
strscpy always null terminates, unless the destination buffer has size zero. -E2BIG is returned when the entire string could not be copied. The return value is essentially strlen of the resulting string. It’s good all around.
Even with that, I still don't have enough information to safely invoke it. (What is count?)
These APIs are too complicated/subtle. Honestly, at this point, I just use (ptr, len) pairs that point to non-null terminated strings whenever possible.
I don’t usually call people out for moving the goalposts, but when it comes to string copying routines I get real cranky about it. Stop moving the goalposts. You want a string copying routine. Most suck. This is the good one. Or at least, it’s one in the family of good ones. You have a C-style string, you want it copied elsewhere, you want it to be mindful of the bounds you give it, you want it to not do something you didn’t ask it for, this does that. It’s not trivial to write one, I wrote a whole thing about how it’s not and how this is good. People did it right. Now, shut up.
No, seriously, shut up. After all that work you don’t get to go “uh, this is actually unsafe because what if the original string is not null terminated and the buffer is not actually the size you said it is”. That’s not how strings in C work. Strings in C work in exactly one way and you don’t get to change that, because that’s just how they work, and there’s 50 years of code that uses string copying routines that work with this that would very much like to have the holy grail of string copies and doesn’t need you muddying the waters and saying that this cannot be invoked safely. It can. The invariants are challenging to uphold but this implementation provides every accommodation that the C standard can provide to this API.
Every single time string copying comes up on Hacker News some enlightened commenter is like “yeah why even bother with this I’m just going to use pointer+length pairs and solve everything”. No. You don’t get to drag this discussion about a very real problem into your half-baked “solution” for C. You especially do not get to reply to me explaining exactly why this API improves upon the state of the art with a lazy dismissal where you retreat to something completely different and go “yeah this is way safer lol”. This would be like if I just responded “yeah I’d just use Rust here I don’t get what the deal is” to you. It doesn’t help.
Look, I’d love to have a conversation about fat pointers and alternative strings in C. I’d love to talk about how we can migrate all this code to safer languages. But this isn’t the time or place for that, and I’m sick of this discussion repeating every single time we talk about, for the love of god, projects that are using normal C strings and safer ways to work with them. It’s not just you but you responding to my comment with something I perceive as fundamentally uninspired set me off.
Using snprintf is equivalent to using strlcpy, and also equivalent to using a strlen followed by a memcpy.
As explained in the parent article, using anything equivalent to strlen is inefficient when the source is much longer than the destination, because you do not need the length of the source; you just need to know that it is longer than the destination.
Instead of defining yet another strcpy-like function, I would have preferred an alternative to strlen, taking 2 arguments, the size of the destination and a pointer to the source, and returning either the length of the source or an error when the source is longer than the provided size.
Then every strcpy can be replaced with that strlen-replacement function followed by memcpy, and the same strlen replacement can be used with all other string functions, e.g. with strcat.
POSIX 2008 has added the strnlen function ("size_t strnlen(const char *s, size_t maxlen);"), which does not have the inefficiency of strlen, by returning at most the size of the destination, but it does not signal an error to indicate truncation.
You can still use strnlen if you give it a size larger by one and check the return value to detect an oversize source, but it is slightly less convenient than if strnlen's return value had been defined like the return value of strscpy.
Efficiency is not the goal. Security and safety are. strlen() in kernel code on an (intentionally malicious) unterminated input has the potential to go off reading some other process's data, which opens up possible attack vectors, or the potential to cause unexpected page faults. strnlen() with one greater than the true buffer size, with the one extra memory read that it implies, has the same potential.
No, strnlen with any given length does not have the potential to read anything it should not, the way strlen does.
The length argument of strnlen is under the control of the kernel writer, not under the control of whoever provides the source string.
The new strscpy function can also read anything: if the kernel writer allocates a large enough destination buffer, there is no difference between strscpy and strnlen from a security POV. What they read is determined by the length of the destination buffer.
Because the destination buffer must have an extra byte for the null terminating byte, strnlen must be called with the size of the destination buffer as argument, but if it returns that size, that signals an error, and then a number of bytes that is one less than the return value must be copied from the source (and a null byte must always be written after whatever is copied by memcpy).
The kernel function strscpy is exactly equivalent to using strnlen and memcpy correctly. strscpy encapsulates that behavior, so it is more convenient, but anyone who wants to use strscpy in a user C program must define a strscpy macro or function using strnlen and memcpy, which are standard functions.
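A sketch of that equivalence for user code (my own wrapper, returning the copied length or -1 on truncation, roughly in the spirit of strscpy):

    #include <string.h>

    long copy_str(char *dst, size_t dstsize, const char *src)
    {
        if (dstsize == 0)
            return -1;
        size_t len = strnlen(src, dstsize); /* never reads past dstsize bytes */
        if (len == dstsize) {               /* src plus its NUL doesn't fit */
            memcpy(dst, src, dstsize - 1);  /* truncate... */
            dst[dstsize - 1] = '\0';        /* ...but still terminate */
            return -1;                      /* strscpy would return -E2BIG */
        }
        memcpy(dst, src, len + 1);          /* copy the bytes and the NUL */
        return (long)len;
    }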
The length given to strnlen() is, as you yourself said, greater than the size of the actual buffer in order to detect an oversize source. So strnlen() has the potential to access the character beyond the end of the array, checking it for NUL, incurring page faults and whatnot.
memchr is even less convenient than strnlen, because it returns a pointer, so you must subtract two pointers to get the size that needs to be used in memcpy.
memchr is indeed an alternative for someone who would need to use some old libc, without the strnlen function.
The strscpy function is available only inside the Linux kernel, but strnlen should be available in any environment that conforms to POSIX 2008, so strnlen + memcpy is what I would recommend for copying null-terminated strings in any C program.
Not even close to accurate. I'm not aware of any large C library with a string copy function that guarantees a NULL-terminated result in all possible cases.
I mean, this ask is technically impossible since you may specify zero length to any of them. At that point signalling an error is the only thing it can do.
Unlike C++, the C language takes its freestanding mode seriously. In this mode there's no malloc(), and so your string type can't exist. But today, in freestanding C, I can write "ERROR" and get a string literal.
So that would mean either you abolish string literals in freestanding mode (making C worse), or you have a separate type for these string literals, distinct from the type for an expandable string.
Now, Rust pulls off the latter fairly elegantly, in my view, but that takes a lot of sophisticated type features in the language, including a Deref and several AsRef implementations. In Rust these are general features open to other types; C could special-case them instead, but that's still a lot of engineering for a small feature.
The result is definitely not the elegant language which can be described in a slim book. Maybe it's better anyway, but it's not C.
You mean the lightly specified section of the ISO C standard regarding freestanding use, or the compiler extensions that most people ignore aren't part of the standard?
The rules around const don't compose with pointers inside structs, which makes this more complicated than you'd like. (I guess, just one struct for mutable and one for view of the string would work.)
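To illustrate the composition problem (a sketch; the struct is mine):

    #include <stddef.h>

    struct str { char *data; size_t len; };

    void f(const struct str *s)
    {
        /* s->data = NULL;   -- error: the pointer member is const here */
        s->data[0] = 'x';    /* allowed: const doesn't reach the pointee */
    }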
I think there is a sufficient membership on the standards committee that doesn't want such "automated" features, especially when they touch memory allocation and freeing. And I too find it somewhat antithetical to C as a language.
Still, why there isn't a good quality library, or set of libraries, covering these features is my question.
Dumb(?) question from a non-kernel programmer: can strings in the kernel be limited to lengths < 2^16 (65536)? Then the length can be stored in the unused bits of the pointer. This probably creates more problems than it solves, especially if we ever have CPUs that want to address more than 256 TiB of memory. This would also only work on 64-bit architectures.
Or if say, most strings tend to be small (<256 chars), use 8 unused bits of the pointer for the length. If the string is longer than that, mask the bits to 0. Then the string handling functions can have a fast path that uses the length from pointer if available, if not, walks the entire string.
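A hedged sketch of that tagging idea, assuming 64-bit pointers whose top 8 bits are unused (canonical user-space addresses; real code would need to be far more careful):

    #include <stdint.h>
    #include <string.h>

    #define TAG_SHIFT 56
    #define PTR_MASK  ((1ULL << TAG_SHIFT) - 1)

    /* Stash lengths < 256 in the top byte; tag 0 means "unknown, walk it". */
    static char *tag_ptr(char *p, size_t len)
    {
        uint64_t tag = len < 256 ? (uint64_t)len : 0;
        return (char *)(uintptr_t)(((uintptr_t)p & PTR_MASK) | (tag << TAG_SHIFT));
    }

    static size_t tagged_strlen(char *tp)
    {
        uint64_t tag = (uintptr_t)tp >> TAG_SHIFT;
        char *p = (char *)((uintptr_t)tp & PTR_MASK);
        return tag ? (size_t)tag : strlen(p); /* fast path, else O(N) walk */
    }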
So if the length of the string is shorter than the size of a pointer, the string is stored inline in the struct, and if the length is longer it's a separate allocated object.
Tangentially related: FreeBSD tends to use sbuf(3) (https://www.freebsd.org/cgi/man.cgi?query=sbuf) for anything but the very simplest string manipulation. It’s safer and more convenient than NULL-terminated strings.
You can use strncpy if you are careful to do it right (e.g. if you know the end of the destination buffer is null-terminated, specify the length one less than the destination buffer size). Sometimes memcpy is better, though. You can use both in one program.
> You can use strncpy if you are careful to do it right
That’s like saying if you’re careful to do it right you can use a Bowie knife as a screwdriver.
Technically correct, but it will practically end in tears. If you're not working with fixed-size strings (which few are these days), just copy over strscpy or strlcpy; 5 lines won't kill you.
The express purpose of strncpy is to handle fixed-length strings in classic unix (file) structures (e.g. tar), I'm not convinced it's correct to use in basically any other situation.
I haven't used C in quite a while, but I did not remember strlcpy using strlen, and it seems the Linux kernel's implementation differs quite a bit from the OpenBSD one:
The while (*src++); on line 45 of lib/libc/string/strlcpy.c is an inline implementation of strlen here. It's done solely for the benefit of the return value (src - osrc - 1) but means that it traverses the entire length of src until the first NUL - even if that's gigabytes or wanders out of mapped memory.
The BSD kernels do not have nearly as many APIs that employ variable-length special-character-terminated strings that have to be parsed from human-readable forms, however. Whilst this is common for things like procfs in the Linux world, the BSDs favour general purpose interfaces like sysctl(3) where data are transferred in machine-parseable forms rather than human-readable ones.
So "all over the BSDs" rather glosses over the fact that whilst common in applications code, there's less need for these sorts of string functions in the kernel. (And it is the Linux kernel that is the headlined subject here.) strlcpy() is in libkern, but it's not anywhere near as frequent in the kernel as it is in applications code.
Glibc isn't much of a standard of excellence for rational behavior considering they have malloc_info() dumping out XML rather than actually returning a useful data structure.
I find myself agreeing with the glibc maintainer in the extended discussion linked from the article:
https://lwn.net/Articles/612244/
> Correct string handling means that you always know how long your strings are and therefore you can use memcpy (instead of strcpy).
I think he's absolutely right. The str* functions are just worse versions of the mem* functions.