
> It's inherent to the language. Writing "string" gives you a NUL-terminated string; converting that to another format takes effort.

Couldn't we just leave the NUL byte in there and pretend it doesn't exist?

  const char *literal = "string literal with NUL byte";

  struct bytes text;
  text.size    = strlen(literal); // strlen doesn't count the NUL terminator
  text.pointer = literal;

  // I know I discarded the const qualifier up there, but it's just to illustrate

Then a copy(text, other) function would conveniently ignore the entire NUL issue. The copy would not even have the terminating NUL.
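
A minimal sketch of what that could look like. The layout of `struct bytes` and the signature of `copy` are assumptions for illustration, since neither is defined above; the copy is truncated to the destination's capacity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout for the struct bytes used above. */
struct bytes {
    size_t size;     /* number of bytes, terminator not counted */
    char  *pointer;  /* not assumed to be NUL-terminated */
};

/* Copies at most destination.size bytes from source into destination.
 * Returns the number of bytes copied. No terminator is read or written. */
static size_t copy(struct bytes source, struct bytes destination)
{
    assert(source.size == 0 || source.pointer != NULL);
    assert(destination.size == 0 || destination.pointer != NULL);

    size_t count = source.size < destination.size ? source.size
                                                  : destination.size;
    if (count > 0)
        memcpy(destination.pointer, source.pointer, count);
    return count;
}
```

The length travels with the pointer, so nothing in `copy` ever scans for a terminator.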

> Interestingly, some Mac OS compilers would let you write "\pABC" to get a structure containing the bytes {3, 'A', 'B', 'C'}. (The "p" stands for Pascal.)

Pascal strings are nice but their sizes are too limited. Same idea as my structure above, but with a uint8_t for the length instead of a size_t (a uint64_t on typical 64-bit platforms), so they max out at 255 bytes.




Pascal strings can be improved to support strings of any reasonable length.

Say the two highest bits of the counter set the size of the counter field: 00 = 6 remaining bits (1 byte), 01 = 14 bits (2 bytes), 10 = 30 bits (4 bytes), 11 = 62 bits (8 bytes).

A simple `counter & 0x3f` on the first byte (or the equivalent mask on a wider counter) would remove the width-setting bits, without any shifts, additions, etc.

This allows small strings to use only 1 byte for the counter, while allowing huge strings that span the entire RAM.
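
A rough sketch of that scheme, under assumptions the comment does not state: the counter is stored big-endian so the two tag bits always sit in the first byte, and the names `prefix_encode` and `prefix_decode` are made up. This portable version does use shifts to assemble multi-byte counters; the shift-free mask applies when the counter is loaded as a native integer of the right width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the length prefix into out (room for 8 bytes required).
 * Returns the prefix width in bytes, or 0 if length needs more than 62 bits. */
static size_t prefix_encode(uint64_t length, unsigned char *out)
{
    unsigned tag;
    if      (length < (UINT64_C(1) << 6))  tag = 0;  /* 1 byte  */
    else if (length < (UINT64_C(1) << 14)) tag = 1;  /* 2 bytes */
    else if (length < (UINT64_C(1) << 30)) tag = 2;  /* 4 bytes */
    else if (length < (UINT64_C(1) << 62)) tag = 3;  /* 8 bytes */
    else return 0;

    size_t width = (size_t)1 << tag;
    for (size_t i = 0; i < width; i++)
        out[i] = (unsigned char)(length >> (8 * (width - 1 - i)));
    out[0] |= (unsigned char)(tag << 6);  /* top two bits hold the width tag */
    return width;
}

/* Reads a prefix written by prefix_encode. Returns its width in bytes. */
static size_t prefix_decode(const unsigned char *in, uint64_t *length)
{
    assert(in != NULL && length != NULL);
    size_t width = (size_t)1 << (in[0] >> 6);
    uint64_t value = in[0] & 0x3f;        /* strip the width-setting bits */
    for (size_t i = 1; i < width; i++)
        value = (value << 8) | in[i];
    *length = value;
    return width;
}
```

With this layout a string shorter than 64 bytes pays one byte of overhead, and the decoder learns the counter width from the first byte alone.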


How would that work with untrusted strings? As I understood TFA, strlen() is an issue if the string is not NUL-terminated.


That's correct. Using strlen on anything but C string literals is just asking for bugs. The thing is untrusted strings don't come from C itself, they come from I/O.

The kernel has perfectly reasonable I/O interfaces.

  ssize_t bytes_read    =  read(file_descriptor, buffer, size);
  ssize_t bytes_written = write(file_descriptor, buffer, size);

You always know the length.
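
As an illustration, here is a small helper built directly on read(). The name `read_fully` and the retry-on-EINTR policy are assumptions of this sketch, not something from the comment above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Reads until capacity bytes are stored or end of input is reached.
 * Returns the number of bytes stored, or -1 with errno set on failure. */
static ssize_t read_fully(int file_descriptor, char *buffer, size_t capacity)
{
    assert(buffer != NULL || capacity == 0);

    size_t total = 0;
    while (total < capacity) {
        ssize_t n = read(file_descriptor, buffer + total, capacity - total);
        if (n == 0)
            break;                 /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal, retry */
            return -1;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

The caller gets the exact byte count back, so data containing embedded NUL bytes is handled the same as any other input.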

Well... At least you would always know the length if the standard C library didn't abstract that perfectly good interface away behind stdio just so it could do buffering and return NUL-terminated strings.

It's just like errno. The Linux kernel simply returns a negated error constant on failure. The C standard library takes that sane interface and turns it into a thread-local global variable.
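
To make the contrast concrete, here is a sketch that folds the libc convention (-1 plus errno) back into the kernel-style convention of a single negated error code. The wrapper name `read_or_negative_errno` is hypothetical, and it still goes through the libc read() rather than a raw system call.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Returns the number of bytes read, or a negated errno value on failure,
 * so the caller gets the result and the error in one return value. */
static ssize_t read_or_negative_errno(int fd, void *buffer, size_t size)
{
    assert(buffer != NULL || size == 0);

    ssize_t result = read(fd, buffer, size);
    return result < 0 ? -(ssize_t)errno : result;
}
```

A caller can then test `result < 0` and use `-result` as the error code without touching errno.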


You could add real strings and arrays to C and kill POSIX and you'd have two fewer problems.


No. You would have extra problems, and not just the maintenance issues along the migration path that Linus Torvalds pointed out. They would also come from the poor thinking generally involved, such as believing that the way to respond to a comment stating that the POSIX API for I/O is reasonable is to "kill POSIX".


Apparently at some point people forgot that if you don't like an API, library or interface, you can just put a wrapper on top of it. Many libraries are just "toolkits" after all; you are not supposed to use the raw API everywhere. E.g. if you don't quickly stop doing that with the BSD sockets API, you are part of the problem.


> if you don't like an API, library or interface, you can just put a wrapper on top of it

We can also simply get rid of all that bloat and just use the system calls directly.

> you are not supposed to use the raw API everywhere

Linux system calls are a stable interface and the entry points are even programming language agnostic. It's okay to use them directly.

> e.g. if you don't quickly stop doing that with the BSD sockets API, you are part of the problem

Yeah it's not a good idea on other operating systems since the system call interfaces are unstable. We have to use their C libraries on those platforms.


> Apparently at some point people forgot that if you don't like an API, library or interface, you can just put a wrapper on top of it.

If I were to survey the state of modern software development and try to characterize the skills lost compared to decades past, "not enough wrappers" would be nowhere on my list.


OK, that was a bit overstated: s/some point people forgot/some people forget/. I don't know what the common practice was a few decades ago, because source code was not as visible as it is today, but in my experience what you see on GitHub is not what the average developer does. GitHub advertises itself as a social coding network; just like with other social networks, there is a selection bias regarding what is posted.


> That's correct. Using strlen on anything but C string literals is just asking for bugs. The thing is untrusted strings don't come from C itself, they come from I/O.

Indeed, it is just an input check/"sanitization" issue. Just as one carefully checks that a JSON or XML input is well formed, if a protocol spec says that some part is an ASCIIZ string, one has to check that there's indeed a zero byte before the end of the data packet.
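
A minimal sketch of that check using memchr, which never looks past the given bound. The function name `asciiz_length` and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Checks that an ASCIIZ field is terminated within the remaining packet bytes.
 * On success stores the string length (terminator excluded) and returns true.
 * Returns false if no zero byte is found before the end of the packet. */
static bool asciiz_length(const unsigned char *field, size_t remaining,
                          size_t *length)
{
    assert(length != NULL);
    if (field == NULL || remaining == 0)
        return false;

    const unsigned char *nul = memchr(field, '\0', remaining);
    if (nul == NULL)
        return false;              /* unterminated: reject the packet */

    *length = (size_t)(nul - field);
    return true;
}
```

Only after this check succeeds is it safe to hand the field to code that expects a NUL-terminated string.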


The 'discarding const' is quite a problem if you try to write real code like this.


I know. That string literal is likely to be located in a read-only page. In real code, I'd have to allocate some memory and copy the text to the new location if I wanted the resulting structure to be writable. For clarity's sake I omitted these details.
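
Roughly, the omitted step could look like the sketch below. The `struct bytes` layout and the helper name `bytes_from_literal` are assumptions, and the caller is expected to free() the pointer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout, matching the earlier example. */
struct bytes {
    size_t size;
    char  *pointer;
};

/* Copies a string literal into freshly allocated, writable memory.
 * The terminating NUL is not copied. On allocation failure the
 * returned pointer is NULL and the size is 0. */
static struct bytes bytes_from_literal(const char *literal)
{
    assert(literal != NULL);

    struct bytes result = { 0, NULL };
    size_t size = strlen(literal);
    char *storage = malloc(size ? size : 1);  /* avoid a zero-size request */
    if (storage == NULL)
        return result;

    memcpy(storage, literal, size);
    result.size = size;
    result.pointer = storage;
    return result;
}
```

The const qualifier is never discarded here, because the literal is only read and all writes go to the new allocation.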

This isn't unique to my example though. Traditional C strings have the exact same problem and they do get copied all the time.



