Hacker News

What's wrong with `strncpy`?



strncpy won't always write a trailing nul byte, causing out of bounds reads elsewhere. It's a nasty little fellow. See the warning at https://linux.die.net/man/3/strncpy

strlcpy() is better, and is what most people think strncpy() is, but it still results in truncated strings if not used carefully, which can also lead to big problems.


Speaking of strlcpy, Linus has some colorful opinions on it:

> Note that we have so few 'strlcpy()' calls that we really should remove that horrid horrid interface. It's a buggy piece of sh*t. 'strlcpy()' is fundamentally unsafe BY DESIGN if you don't trust the source string - which is one of the alleged reasons to use it. --Linus

Maybe strscpy is finally the one true fixed design to fix them all. Personally I think the whole exercise is one of unbelievable stupidity when the real solution is obvious: using proper string buffer types with length and capacity for any sort of string manipulation.


> the real solution is obvious

If it were obvious it would have been done already. Witness the many variants that try to make it better but don't.

> using proper string buffer types with length and capacity

Which you then can't pass to any other library. String management is very easy to solve within the boundaries of your own code. But you'll need to interact with existing code as well.


> If it were obvious it would have been done already. Witness the many variants that try to make it better but don't.

Every other language with mutable strings, including C++, does it like that. It is obvious. The reason it is not done in C is not ignorance, it is laziness.

> Which you then can't pass to any other library. String management is very easy to solve within the boundaries of your own code. But you'll need to interact with existing code as well.

Ignoring the also obvious solution of just keeping a null terminator around (see: C++ std::string), you should only worry about it at the boundary with the other library.

Same as converting from utf-8 to utf-16 to talk to the Windows API for example.


> The reason it is not done in C is not ignorance, it is laziness.

Of course not. C has been around since the dawn of UNIX and the majority of important libraries at the OS level are written in it.

Compatibility with such a vast amount of code is a lot more important than anything else.

If it were so easy why do you think nobody has done it?

> Ignoring the also obvious solution of just keeping a null terminator around

That's not very useful for the general case. If your code relies on the extra metadata (length, size) being correct and you're passing that null-terminated buffer around to libraries outside your code, it won't be correct since nothing else is aware of it.


> If it were so easy why do you think nobody has done it?

People have done it, there are plenty strbuf implementations to go around. Even the kernel has seq_buf. How you handle string manipulation internally in your codebase does not matter for compatibility with existing libraries.

> That's not very useful for the general case. If your code relies on the extra metadata (length, size) being correct and you're passing that null-terminated buffer around to libraries outside your code, it won't be correct since nothing else is aware of it.

You can safely pass the char* buffer inside a std::string to any C library with no conversion. You're making up issues in your head. Don't excuse incompetence.


> People have done it, there are plenty strbuf implementations to go around.

Precisely!

Why plenty and why is none of them the standard in C?


The TL;DR on that is basically "lazy, security unconscious assholes keep shutting it down".

Dennis Ritchie strongly suggested C should add fat pointers all the way back in 1990. Other people have pointed out the issues with zero-terminated strings and arrays decaying into pointers (and the ways to deal with them even with backwards compatibility constraints) for years.

One of the most prominent was Walter Bright's article on "C's Biggest Mistake" back in 2009 and he was a C/C++ commercial compiler developer.

There is no excuse.


It is easy to document mistakes in hindsight, since hindsight is 20/20.

It is very easy to write your own one-off secure string handling library. This is a common assignment in intro to C programming classes.

So why isn't it standard in C already?

You offer a theory that there is a gang of "security unconscious assholes [who] keep shutting it down". This gang is so well organized that they have managed to block an easy improvement for many many decades for unknown reasons. That's a pretty wild theory.

Or Occam's razor suggests a different answer: It's actually difficult.

No, not the writing code part, that's easy. It's the seamlessly integrating with ~60 years of mission critical codebases part that's hard.


There's no need to integrate with 60 years of mission critical codebases, you're making up a problem in your head that doesn't exist.

Nothing needs to be fixed, all it takes is to stop doing the stupid thing.

It does not take a "coordinated gang" to shut down C standard proposals, them getting shut down is the default.

You seem to be familiar with neither the nature of the problem nor the struggle that is getting anything passed through ISO standardization. I don't mean to belittle you by saying this, I just hope to make you understand that you are assuming things that are simply not based in reality.

It doesn't even need to be in the standard btw. Just write your own. It's a few lines of code. As you say, a beginner exercise. Yet there is code written after the year 2000 that still uses the strxcpy family. Long after the issues have been known and what the solution is.

"Backwards compatibility" is a total red herring. C++ has the solution right there in its standard library. A backwards compatible string buffer implementation.


> Nothing needs to be fixed, all it takes is to stop doing the stupid thing.

Well we'll just agree to disagree I suppose, as I'm equally convinced that you're not grasping what the problem actually is.

All I can say is that if this were as easy to fix as you assert and "all it takes is to stop doing the stupid thing" and we both agree that writing code for the better thing is super easy, then consider why it has not been possible to fix in the C universe.


I don't know what to tell you. Look at the git codebase, they downright ban any usage of the strcpy family, going so far as to hide them under macros so people can't use them.

Banning them outright is not possible in old codebases written before the internet got really popular and people started pointing out how bad these functions were, but they sure could have stopped using them in any new code written in those codebases. That's what code review is for.

Any C code written after 2010 has absolutely no excuse to use these functions. They are inefficient, unsafe and more annoying to use than a strbuf implementation that takes half an hour to write.

So why have people continued to use them?

Option a) they were already there, the codebase is over 30 years old, and replacing the code entirely would be too much work. This is a valid reason.

Option b) ignorance, they don't know how to write a strbuf type. This one is downright impossible, any C dev knows how to do it, and like I said, literally every other language does it the same way.

Option c) laziness. This is for me the only real reason. As awful as these functions are, they're in the stdlib. You still see people saying "simple usages of strncpy are fine". They are not fine.

If you can think of an option d) I'd love to know, because I honestly can't think of anything else. Note that interfacing with existing 30 year old codebases does not count, as how you internally manipulate strings has no bearing on that, all you need to ensure is the 0 terminator at the end.

You get a mutable char* from the old function. You shove it in a `struct strbuf { size_t capacity; size_t length; char *data; }`. Done.

You get a constant char* from the old function. You call strlen followed by malloc and memcpy into a new buffer for the strbuf. Or if you don't need to actually mutate the string, you store it in a non-zero terminated struct strview {size_t length; char* data}.
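
Those two conversions might be sketched like this (the type and function names `strbuf_adopt`, `strbuf_from_cstr` and `strview_of` are made up for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* The two types described above. */
struct strbuf  { size_t capacity; size_t length; char *data; };
struct strview { size_t length; const char *data; };

/* Wrap a mutable, null-terminated char* that we now own. */
struct strbuf strbuf_adopt(char *s)
{
    struct strbuf b;
    b.length = strlen(s);
    b.capacity = b.length + 1;
    b.data = s;
    return b;
}

/* Copy a const char* into a freshly allocated buffer. */
struct strbuf strbuf_from_cstr(const char *s)
{
    struct strbuf b;
    b.length = strlen(s);
    b.capacity = b.length + 1;
    b.data = malloc(b.capacity);
    memcpy(b.data, s, b.capacity);   /* copies the '\0' too */
    return b;
}

/* Borrow without copying when no mutation is needed. */
struct strview strview_of(const char *s)
{
    struct strview v = { strlen(s), s };
    return v;
}
```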

So what is the challenge here? Why is usage of strcpy not banned in any codebase less than 20 years old?


> you should only worry about it at the boundary with the other library.

If this were actually done as a mitigation, that is, strict and error-checked conversions to nul-terminated strings at every boundary of the program, it would solve all the problems with nul-terminated strings, and nul-terminated and len-specified strings would be equivalently dangerous (or safe, depending on your perspective).

The problem is precisely that unsanitised input makes its way into the application, bypassing any checks.


It's impossible to avoid "sanitizing" input if you have a conversion step from a library provided char* to a strbuf type. Any use of the strbuf API is guaranteed to be correct.

That's very different from needing to be on your toes with every usage of the strxcpy family.


> It's impossible to avoid "sanitizing" input if you have a conversion step from a library provided char* to a strbuf type. Any use of the strbuf API is guaranteed to be correct.

I agree: having a datatype beats sanitising input (I think there's a popular essay somewhere about parsing input vs sanitising input which makes pretty much the same point as you do), but it's still only partially correct.

To get to fully correct you don't need a new string type, you need developers to recognise that the fields "Full Name" and "Email address" and "Phone number", while all being stored as strings, are actually different types and to handle them as such by making those types incompatible so that a `string_copy` function must produce a compilation failure when the destination is "EmailAddressType" and the source is "FullNameType".

Developers in C can, right now, do that with only a few minutes of extra typing effort. Adding a "proper" string type is still going to result in someone, somewhere, parsing a uint8_t from a string into a uint64_t, and then (after some computation) reversing that (now overflowing) uint64_t back into a uint8_t.

If you're doing the right thing and creating types because "Parse, Don't Validate", a better string type doesn't bring any benefits. If you're doing the wrong thing and validating inputs, then you're going to miss one anyway, no matter the underlying string type.


Sure but now we're talking about a universal problem across languages, rather than a C-specific problem.


> Sure but now we're talking about a universal problem across languages, rather than a C-specific problem.

Of course, but that's my point - C already gives you the ability to fix the incorrect typing problem, using the existing foundational `str*` functions.

A team who is not using the compiler's ability to warn when mixing types are still going to mix types when there is a safe strbuf_t type.

The problem with the `str*` functions can be fixed today without modifying the language or its stdlib.

Most C programmers don't do it (myself included). I think that, in one sense, you are correct in that removing the existing string representation (and functions for them) and replacing them with len+data representation for strings will fix some problems.

Trouble is, a lot of useful tokenising/parsing/etc string problems are not possible in a len+data representation (each strtok() type function, for example, needs to make a copy of what it returns) so programmers are just going to do their best to bypass them.

Having programmers trained to create new string types using existing C is just easier, because then you solve the whole 'mixing types' problem even when looking at replacements for things like `strtok`.

Or ... maybe I'm completely off-base and the reason that programmers don't create different types for string-stored data is because it is too much work in current C-as-we-know-it.


For me the "real" solution looks something like this:

    ssize_t strxcpy(char* restrict dst, const char* restrict src, ssize_t len)
Strxcpy copies the string from src to dst. The len parameter is the number of bytes available in the dst buffer. The dst buffer is always terminated with a null byte, so the maximum length of string that can be copied into it is len - 1. strxcpy returns the number of characters copied on success, but can return the following negative values:

    E_INVALID_PARAMETER: Either dst or src are NULL or len < 1, no data was copied
    W_TRUNCATED: len - 1 bytes were copied but more characters were available in src.
strxcat would work similarly. I have not decided if the return value should include the terminating null or not.
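
A sketch of how that contract might be implemented. The error-code values are assumptions, and this version returns the count excluding the terminating null, which is one way to settle the open question above:

```c
#include <stddef.h>
#include <sys/types.h>   /* ssize_t (POSIX) */

/* Hypothetical error codes for the proposed interface. */
#define E_INVALID_PARAMETER (-1)
#define W_TRUNCATED         (-2)

/* Sketch of the strxcpy described above: always null-terminates dst,
   returns the number of characters copied (excluding the nul), or a
   negative code on failure/truncation. */
ssize_t strxcpy(char *restrict dst, const char *restrict src, ssize_t len)
{
    if (dst == NULL || src == NULL || len < 1)
        return E_INVALID_PARAMETER;

    ssize_t i = 0;
    while (i < len - 1 && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }
    dst[i] = '\0';                 /* the buffer is always terminated */
    if (src[i] != '\0')
        return W_TRUNCATED;        /* more characters were available */
    return i;
}
```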


How is this useful though? I mean yes, it is useful in avoiding the buffer overruns. But that's not the only consideration, you also want code that handles data correctly. This just truncates at buffer size so data is lost.

So, if you want the code to work correctly, you need to check the return code, reallocate dst, and call the copy again. But if you're going to do that, you might as well check the length of src and allocate dst correctly before calling it so it never fails. And if you're already doing that, you can call strcpy just fine and never have a problem.


Sometimes truncation is fine or at least can be managed. Yes, strdup() is a better choice in a lot of situations, but depending on how your data is structured it may not be the correct option. I would say my version is useful in any situation where you were previously using strncpy/cat or strlcpy/cat.


Wow, yeah, this seems to summarize well the usual API flakiness and reshuffling in C.

It seems people keep coming up with "one more improvement" that's broken in one way or another.


The problem with strlcpy is the return value. You can be burned badly if you are using it, for example, to pull out a fixed chunk of string from a 10TB memory-mapped file, especially if you're pulling out all of the 32-byte chunks from that huge file and you just wanted a function to stick the trailing 0 on the string and handle short reads gracefully.

It's even worse if you are using it because you don't fully trust the input string to be null terminated. Maybe you have reasons to believe that it will be at least as long as you need, but can't trust that it is a real string. As a function that was theoretically written as a "fix" for strncpy, it is worse in some fundamental ways. At least strncpy is easy enough to make safe by always over-allocating your buffer by 1 byte and stuffing a 0 in the last byte.
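
For context, a sketch of the classic strlcpy contract (modeled on the OpenBSD description; `my_strlcpy` is an illustrative name). The return value is strlen(src), which is exactly why it must walk the entire source string even when copying only a few bytes, and why it blows up on untrusted, possibly unterminated input:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of strlcpy semantics: copy up to size-1 bytes, always
   null-terminate (if size > 0), return the full length of src so the
   caller can detect truncation with (ret >= size). */
size_t my_strlcpy(char *dst, const char *src, size_t size)
{
    size_t srclen = strlen(src);   /* scans ALL of src, however long */
    if (size > 0) {
        size_t n = srclen < size - 1 ? srclen : size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}
```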


strncpy() also zero pads the entire buffer. If it's significantly larger than the copied string you're wasting cycles on pointless move operations for normal, low-security string handling. This behavior is for filling in fixed length fields in data structures. It isn't suitable for general purpose string processing.


#define strncpyz(d,s,l) *(strncpy(d,s,l)+(l))=0

Of course this one is unsafe for macro expansion. But well, it's C :)


I'd rather put the final nul at d+l-1 than at d+l, so that l can be the size of d, not "one more than the size of d":

  strncpyz(buf,src,sizeof buf);
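
That variant might look like this (still a multiple-evaluation macro, so side effects in the arguments are unsafe):

```c
#include <string.h>

/* Put the nul at d + l - 1, so l can be the full size of d.
   Note: l is evaluated twice. */
#define strncpyz(d, s, l) (strncpy((d), (s), (l)), (d)[(l) - 1] = '\0')
```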


As others have already pointed out, it doesn't guarantee that the result is null-terminated. But that's not the only problem! In addition, it always pads the remaining space with zeros:

    char buf[1000];
    strncpy(buf, "foo", sizeof(buf));
This writes 3 characters and 997 zeros. It's probably not what you want 99% of the time.


It's not possible to use it safely unless you know that the source string fits in the destination buffer. Every strncpy must be followed by `dst[sizeof dst - 1] = 0`, and even if you do that you still have no idea if you truncated the source string, so you have to put in a further check.

    strncpy (dst, src, (sizeof dst) - 1);
    dst[(sizeof dst) - 1] = 0;
    int truncated = strlen (dst) - strlen (src);
Without the extra two lines after every strncpy, you're probably going to have a hard-to-discover transient bug.


if you really want to use standard C string functions, use instead:

    int ret = snprintf(dst, sizeof dst, "%s", src);
    if (ret < 0 || (size_t)ret >= sizeof dst)
    {
        /* failed */
    }
or as a function:

    bool ya_strcpy(const char* s, char* d, size_t n)
    {
        int cp = snprintf(d, n, "%s", s);
        return cp >= 0 && (size_t)cp < n;
    }


snprintf only returns negative if an "encoding error" occurs, which has to do with multi-byte characters.

I think for that to possibly happen, you have to be in a locale with some character encoding in effect and snprintf is asked to print some multi-byte sequence that is invalid for that encoding.

Thus, I suspect, if you don't call that "f...f...frob my C program" function known as setlocale, it will never happen.


> Thus, I suspect, if you don't call that "f...f...frob my C program" function known as setlocale, it will never happen.

Of all the footguns in a hosted C implementation, I believe setlocale (and locale in general) is so broken that even compilers and library developers can't workaround it to make it safe.

The only other unfixable C-standard footgun that comes close, I think, are the environment-reading-and-writing functions, but at least with those, worst-case is leaking a negligible amount of memory in normal usage, or using an old value even when a newer one is available.


I see that in Glibc, snprintf goes to the same general _IO_vsprintf function, which has various ominous -1 returns.

I don't think I see anything that looks like the detection of a conversion error, but rather other reasons. I would have to follow the code in detail to convince myself that glibc's snprintf cannot return -1 under some obscure conditions.

Defending against that value is probably wise.

As far as C locale goes, come on, the design was basically cemented in more or less its current form in 1989 ANSI C. What the hell did anyone know about internationalizing applications in 1989.


I actually do use `snprintf()` and friends.


Except no one does that return-code check, and worse, they often use the return value to advance a pointer when concatenating strings.


`strncpy` is commonly misunderstood. Its name misleads people into thinking it's a safely-truncating version of `strcpy`. It's not.

I've seen a lot of code where people changed from `strcpy` to `strncpy` because they thought that was safety and security best practice. Even sometimes creating a new security vulnerability which wasn't there with `strcpy`.

`strncpy` does two unexpected things which lead to safety, security and performance issues, especially in large codebases where the destination buffers are passed to other code:

• `strncpy` does NOT zero-terminate the copied string if it limits the length.

Whatever is given the copied string in future is vulnerable to a buffer-read-overrun and junk characters appended to the string, unless the reader has specific knowledge of the buffer length and is strict about NOT treating it as a null-terminated string. That's unusual C, so it's rarely done correctly. It also doesn't show up in testing or normal use, if `strncpy` was used "for safety" and nobody enters data that large.

• `strncpy` writes the entire destination buffer with zeros after the copied string.

Usually this isn't a safety and security problem, but it can be terrible for performance if large buffers are being used to ensure there's room for all likely input data.

I've seen these issues in large, commercial C code, with unfortunate effects:

The code had a security fault because under some circumstances, a password check would read characters after the end of a buffer due to lack of a zero-terminator, that authors over the years assumed would always be there.

A password change function could set the new password to something different than the user entered, so they couldn't login after.

The code was assumed to be "fast" because it was C, and avoided "slow" memory allocation and a string API when processing strings. It used preallocated char arrays all over the place to hold temporary strings and `strncpy` to "safely" copy. They were wrong: It would have run faster with a clean string API that did allocations (for multiple reasons, not just `strncpy`).

Those char arrays had the slight inconvenience of causing oddly mismatched string length limits in text fields all over the place. But it was worth it for performance, they thought. To avoid that being a real problem, buffers tended to be sized to be "larger" than any likely value, so buffer sizes like 256 or 1000, 10000 or other arbitrary lengths plucked at random depending on developer mood at the time, and mismatched between countless different places in the large codebase. `strncpy` was used to write to them.

Using `malloc`, or better a proper string object API, would have run much faster in real use, at the same time as being safer and cleaner code.

Even worse, sometimes strings would be appended in pieces, each time using `strncpy` with the remaining length of the destination buffer. That filled the destination with zeros repeatedly, for every few characters appended. Sometimes causing user-interactions that would take milliseconds if coded properly, to take minutes.

Ironically, even a slow scripting language like Python using ordinary string type would have probably run faster than the C application. (Also Python dictionaries would have been faster than the buggy C hash tables in that application which took O(n) lookup time, and SQLite database tables would have been faster, smaller and simpler than the slow and large C "optimised" data structures they used to store data).


It doesn't guarantee that the output is null terminated. Big source of exploits.



