This isn't the usual way this is coded:

  char *one = "one";
  char *end;
  errno = 0; // remember errno?
  long i = strtol(one, &end, 10);
  if (errno != 0) {
      perror("Error parsing integer from string: ");
  } else if (i == 0 && end == one) {
      fprintf(stderr, "Error: invalid input: %s\n", one);
  } else if (i == 0 && *end != '\0') {
      f__kMeGently(with_a_chainsaw); 
  }
It's actually like this:

  errno = 0;

  long i = strtol(input, &end, 10);

  if (end == input) {
    // no digits were found
  } else if (*end != 0 && no_ignore_trailing_junk) {
    // unwanted trailing junk
  } else if ((i == LONG_MIN || i == LONG_MAX) && errno != 0) {
    // overflow case
  } else {
    // good!
  }
errno only needs to be checked in the LONG_MIN or LONG_MAX case. These cases are ambiguous: LONG_MIN and LONG_MAX are valid values of type long, and they are also used for reporting an underflow or overflow. Therefore errno is reset to zero first. Otherwise what if errno already contains a nonzero value, and LONG_MAX happens to be a valid, non-overflowing value returned by the function?
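For reference, the checks above can be packaged as a small function. This is only a sketch: the name parse_long and the parse_status codes are made up for illustration, and it tests for overflow before trailing junk so that an out-of-range value is never stored.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical result codes for this sketch; not part of any standard API. */
enum parse_status { PARSE_OK, PARSE_NO_DIGITS, PARSE_TRAILING_JUNK, PARSE_RANGE };

/* Converts a base-10 integer from input.
   On PARSE_OK and PARSE_TRAILING_JUNK, *out holds the converted value. */
static enum parse_status parse_long(const char *input, long *out)
{
    char *end;

    errno = 0;                      /* a stale errno must not look like ERANGE */
    long value = strtol(input, &end, 10);

    if (end == input)
        return PARSE_NO_DIGITS;     /* nothing was converted */
    if ((value == LONG_MIN || value == LONG_MAX) && errno == ERANGE)
        return PARSE_RANGE;         /* clamped result: underflow or overflow */

    *out = value;
    if (*end != '\0')
        return PARSE_TRAILING_JUNK; /* value is good, but input continues */
    return PARSE_OK;
}
```

A caller that tolerates trailing junk can treat PARSE_TRAILING_JUNK the same as PARSE_OK.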

Anyway, you cannot get away from handling these cases no matter how you implement integer scanning; they are inherent to the problem.

It's not strtol's fault that the string could be empty, or that it could have a valid number followed by junk.

Overflows stem from the use of a fixed-width integer. But even if you use bignums, and parse them from a stream (e.g. network), you may need to set a cutoff: what if a malicious user feeds you an endless stream of digits?

The bit with errno is a bit silly, given that the function has enough parameters that errno could have been dispensed with. We could write a function which is invoked exactly like strtoul, but which, in the overflow case, sets the *end pointer to NULL:

  // no assignment to errno before strtol

  int i = my_strtoul(input, &end, 10);

  if (end == 0) {
    // underflow or overflow, indicated by LONG_MIN or LONG_MAX value
  } else if (end == input) {
    // no digits were found
  } else if (*end != 0 && no_ignore_trailing_junk) {
    // unwanted trailing junk, but i is good
  } else {
    // no trailing junk, value in i
  }
errno is a pig; under multiple threads, it has to access a thread-local value. E.g.:

  #define errno (*__thread_specific_errno_location())
The designer of strtoul likely didn't do this because of the overriding requirement that the end pointer is advanced past whatever the function was able to recognize as a number, no matter what. This lets the programmer write a tokenizer which can diagnose the overflow error, and then keep going with the next token.
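A wrapper along those lines might look like the following sketch. The name my_strtol is hypothetical (a signed counterpart of the my_strtoul idea, since the branches above talk about LONG_MIN and LONG_MAX), and it assumes end is never NULL. It also shows the trade-off just described: once *end is NULL, the caller no longer knows where the overflowing number stopped.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical wrapper with the same call shape as strtol. A range error is
   reported by storing NULL in *end, so the caller never has to touch errno.
   Assumes end != NULL. */
static long my_strtol(const char *input, char **end, int base)
{
    int saved_errno = errno;        /* keep the caller's errno intact */

    errno = 0;
    long value = strtol(input, end, base);

    if (errno == ERANGE) {
        *end = NULL;                /* overflow/underflow: value is clamped */
        errno = saved_errno;
    } else if (errno == 0) {
        errno = saved_errno;        /* no error: restore the previous value */
    }
    /* any other error reported by strtol is left in errno */
    return value;
}
```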



Sure, you can't get away from handling the cases, but as the article clearly demonstrates, there can be a much better interface for it.


> Sure, you can't get away from handling the cases, but as the article clearly demonstrates, there can be a much better interface for it.

It's a very apples to oranges comparison, to the point that it almost feels like a straw man. "Interface (that does X) sucks for doing Y; look at how easy the Rust interface for doing Y is!"

Yes, there can be a much simpler interface for the case when you want to assert that a string is nothing but digits and must fully convert. That's not what strtol is for though.

Now I think libc sucks (no surprise given its age; complaining about it is beating a dead horse), and it sucks more if you don't take various GNU & BSD extensions with it, but I'm kinda getting tired of people complaining that "foo in C is hard" when their unstated requirement is that they can't use any libraries to help them do it. Like this fellow the other day: https://news.ycombinator.com/item?id=29990897

If you look at programs written in "modern" languages, they almost invariably bring a plethora of libraries and dependencies with them anyway so why is C repeatedly judged on the merits of ancient libc interfaces that you don't have to use?


IMO, external libraries are for domain-specific tasks. If something is needed in pretty much every program, it should be a part of the language or the standard library.

Also, it's much easier to use external libraries in other languages. npm install, cargo install, nimble install, cabal install, gem install, …


> If something is needed in pretty much every program, it should be a part of the language or the standard library.

It sure would be convenient that way. That said, you don't need to convert strings in pretty much every program. There's a lot of C code out there that does very little with strings.

Now do you dismiss an entire language if its standard library is lacking or doesn't exist? IMO that would be throwing out the baby with the bathwater.

> npm install, cargo install, nimble install, cabal install, gem install, …

Yes, I've witnessed the mountain of unaudited dependencies that somehow turn a 300 line program into something the size of my kernel.. should I dismiss all those languages because people do something I don't like with their libraries?


>Now do you dismiss an entire language if its standard library is lacking or doesn't exist?

As anything much more than a toy, yes. If there's no standard library at all (or nearly so), the language ecosystem is quite likely to end up a complete mess of incompatible implementations of even the most basic functionality, which is a waste of everyone's time to deal with.


I wonder how your programs do I/O if not with strings. Reading numbers from STDIN is the next thing after Hello World.

As another comment pointed out, C has many flaws unrelated to its standard library. Also check out https://eev.ee/blog/2016/12/01/lets-stop-copying-c/.

You know what's the main cause of dependency hell? Needing a library for every basic thing. Notice that mountains of dependencies are much less common in “batteries-included” languages.


> I wonder how your programs do I/O if not with strings.

There's this one weird trick we call binary. Let me give you an example of how I did I/O yesterday:

    static void usb_tx(struct usb_ep *ep, const void *data, uint len) {
      if (len) memcpy(ep->buf, data, len);
      *ep->bufctl = BC_FULL | ep->datax << BC_DATAX_S | BC_AVAIL | len;
      ep->datax ^= 1;
    }
Usage example:

    struct kb_report r = {.m={.id=KB_ID_M, .x=-a[1], .y=-a[0]}};
    usb_tx(KB_IN, &r.m, sizeof r.m);
stdin does not exist in this program.

> As another comment pointed out, C has many flaws unrelated to its standard library.

Yes it does, but this thread has already become a tangent of a tangent. Let's not turn it into a general diatribe against C, as opposed to a discussion about the library interface that TFA takes issue with.

> You know what's the main cause of dependency hell? Needing a library for every basic thing. Notice that mountains of dependencies are much less common in “batteries-included” languages.

In theory, yes. Like I said, libc sucks, and I would love to have a better standard (or de-facto standard) library. But anecdotally C programs are not very prone to dependency bloat, perhaps precisely thanks to the fact that C doesn't have a de-facto package manager that allows you to just install a bunch of crap.

Anecdotally, "batteries included" languages are still prone to dependency bloat if there's a package manager. This includes recent experience with Python (I can't remember the last time I had to lay my hands on a python project that didn't need a bunch of things to be installed with pip) and somewhat less recently with Perl (isn't cpan pretty much the grandfather of "oh there's a library for that"?).

Hilariously, my recent experience has people using Python and depending on Python libraries which then depend on C and C++ libraries in order to implement the same things that I'm doing in plain C with no dependencies.

But I'll conclude my participation in this subthread with this message because it's gone too far off the rails into a pointless language flame war.


> isn't cpan pretty much the grandfather of "oh there's a library for that"?

The TeX CTAN in 1992 [1] was clearly the inspiration for CPAN a year or three later [2] (in both name & thing). So, maybe CTAN is the great grandfather? :-) { My intent is only to inform, not be disputatious. I know you said "pretty much". }

To be fair, C has an ecosystem. OS package managers/installers are a thing. There is surely a list of much >1 "core libs/programs" (terminfo/curses/some text editor/compilers/etc.) that would be in most "bare bones" OS installs upon which you could develop. One certainly depends upon OS kernels and device drivers. IMO, at least one mistake "language" package managers make is poor integration with OS package managers. Anyway you cut it, it is hard to write a program without depending upon a lot of code. Yes, some of that is more audited.

As the "lump" gets giant, dark corners also proliferate. There was a recent article [3] and HN discussion [4] about trying to have the "optimal chunkiness/granularity" in various ecosystems. I agree that it is doubtful we will solve any of that in an HN sub-to-the-Nth thread. I think that article/discussion only scratched the surface.

I will close by saying I think it's relatively uncontentious (but maybe not unanimous) that packaging has gone awry when a simple program requires a transitive closure of many hundreds of packages. FWIW, I also often write my own stuff rather than relying on 3rd parties and have done so in many languages. Nim [5] is a nice one for it. It's not perfect - what is? - but it sucks the least in my experience.

[1] https://en.wikipedia.org/wiki/CTAN

[2] https://en.wikipedia.org/wiki/CPAN

[3] https://raku-advent.blog/2021/12/06/unix_philosophy_without_...

[4] https://news.ycombinator.com/item?id=29520182

[5] https://nim-lang.org/


I think my point remains valid: to do safe string stuff in C, I have to think a lot harder about lengths, which I don't have to think about in Go. And I didn't want large dependencies because I was writing a .so to preload and intercept execve and open. And even after all these threads I don't know the name of a small string library to use in C, except TCL because I used it before.


Would you be open to sharing what you did with strings?

My central argument in the response there is that writing buf[len] = '\0'; is almost always a sign that you either don't know libc functions, aren't willing to use them, are trying to outperform them (the performance of libc functions is a legitimate complaint for some use cases), or what you're dealing with is not a string but some arbitrary binary blobs that you're trying to make strings out of (in that case, you can't blame the string representation or string handling functions for not knowing what the extent of your binary is; yes, you'll have to first create a string, knowing the length).

To put it more explicitly, if you always provide a valid buffer and size, snprintf() will always terminate your string. strlcat() and strlcpy() will always terminate your string. If you need formatted catenation, you can make a trivial wrapper around snprintf that takes a pointer to the end of your string and updates the "head"; this can be called successively without ever having to compute a length outside the wrapper. asprintf() will allocate and terminate your string. Things that need the length of your string (strspn, strchr, etcetera) will figure it out since it is implied by the already-present nul byte. strtok & co (they have their issues) also work without requiring you to do any manual termination.
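A minimal sketch of that kind of wrapper, assuming a non-empty buffer. The name appendf and its return convention are invented here for illustration, not taken from any library:

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: appends formatted text at *head without writing past
   end (one past the last byte of the buffer), then moves *head to the new
   terminator. Returns 0 on success, -1 on truncation or encoding error.
   Assumes *head < end, which holds as long as the buffer is non-empty. */
static int appendf(char **head, char *end, const char *fmt, ...)
{
    size_t room = (size_t)(end - *head);
    va_list ap;

    va_start(ap, fmt);
    int n = vsnprintf(*head, room, fmt, ap);  /* always terminates if room > 0 */
    va_end(ap);

    if (n < 0)
        return -1;
    if ((size_t)n >= room) {
        *head += room ? room - 1 : 0;         /* truncated: head on the nul byte */
        return -1;
    }
    *head += n;
    return 0;
}
```

Typical use is `char *head = buf, *end = buf + sizeof buf;` followed by successive `appendf(&head, end, ...)` calls; the caller never computes a length or writes a terminator by hand.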

What this means in practice is that you can have thousands of lines of string handling code that never manually terminates a string and only deals with lengths to the extent that your "business logic" needs to. Unless you're actually trying to use the string representation to your benefit by manually splicing it any which way, inserting nul bytes based on arcane computations.. in that case, it sounds like you got what you wanted. Yes, people actually do that sometimes: they figure out how easy it is to manipulate the string representation by hand and thus avoid library functions, and then they complain about doing it by hand.

There are always exceptions of course, so I'm giving you benefit of the doubt. That's why I'm curious to see what you were doing. Having to point out library functions however is a regular thing as people seem to always start out by hand-rolling it for some reason.

As for the question about string libraries.. well, I gotta point out that "small" wasn't a qualifier in the previous discussion. Popular libraries include sds, bstring, glib strings. Plan9port also has the extensible string library. There's icu for fancy unicode stuff but I have no experience with it and it probably isn't "small." There are plenty more if you look around, and I'll let you judge the size of the choices for yourself. I'm pretty sure one of these choices is always mentioned in these HN threads when someone asks for recommendations, including sds in the bchs thread.


I just always put the buf[len] = '\0' to cover myself if I screwed up something. Generally I also use calloc if it’s standalone code as well.

I was copying strings from a file of allowed binary names into a list of char * and also logging the first two parameters to execve to disk, appending. It was fine but 100 times scarier than the same in Go would be.

I have used snprintf and strl* functions when I was doing fancier stuff, but have not tried asprintf. It has been a long time since I was doing large amounts of C code, and then I was either doing binary with len always passed along or else calling some template library, but I do thank you for the lib recommendations.

My point is that if you ask for a good string library that makes it as safe and easy as the same in Go, you will not see a pattern of answers to use well-known strlib X.


Regarding buf[len] = '\0', I've personally had to use it in many scenarios following strncpy, which doesn't add a null terminator if the maximum length is reached. Do you know of any simpler way of getting a prefix up to a certain length?


snprintf. If you want to stick to the (safest) pattern of only passing the buffer size for the second parameter, you'd do this:

    snprintf(buf, sizeof buf, "%.*s", prefix_length, source_str);
Example:

    $ cat x.c
    #include <stdio.h>
    int main(void) {
      char buf[128], tinybuf[5];
      const char *copythis = "hello there\n";
      snprintf(buf, sizeof buf, "%.*s", 5, copythis);
      snprintf(tinybuf, sizeof tinybuf, "%.*s", 5, copythis);
      printf("buf: %s\n", buf);
      printf("tinybuf: %s\n", tinybuf);
    }

    $ cc -W -Wall -O3 x.c
    x.c: In function ‘main’:
    x.c:6:41: warning: ‘snprintf’ output truncated before the last format character [-Wformat-truncation=]
        6 |  snprintf(tinybuf, sizeof tinybuf, "%.*s", 5, copythis);
          |                                         ^
    x.c:6:2: note: ‘snprintf’ output 6 bytes into a destination of size 5
        6 |  snprintf(tinybuf, sizeof tinybuf, "%.*s", 5, copythis);
          |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    $ ./a.out
    buf: hello
    tinybuf: hell


Thanks! I've never considered using snprintf in that way before; the default warnings are annoying, even though their intent is understandable.


The warning isn't a false positive here; truncation is going on in that line: the chosen prefix doesn't fit into tinybuf.


I'm not convinced. Can the Rust function which is shown (that function alone) tokenize a number out of a string such that the number is overflowing the target type? Yet indicate to the caller where that overflowing number ends, so that tokenization can continue with subsequent characters, if any?

E.g. suppose we have a string with this kind of syntax:

   "12345 : 12345   , 12345"
We can

1. use strtol to get the first integer and a pointer to just after it.

2. use ptr += strspn(ptr, " ") to skip spaces

3. check for the colon and if we find it, skip with ptr++

4. use strtol to get the second integer (possibly preceded by space).

5. similarly to the colon handling, do the comma

6. strtol again to get the integer.

This is efficient: no splitting of the string into pieces requiring memory allocation, and extra list processing.

We can code this robustly: it can recognize valid syntax even if some of the numbers overflow. So for this kind of input:

    "1234523442345345345234534545454545 : 12345  12345" 
the code could diagnose the overflow, and the missing comma in one pass.

If you don't care about the details, just "is this number in range, with no trailing junk, or else is it bad", then raw strtol isn't convenient. But it takes only a little code to wrap it.
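To make the six steps concrete, here is one way the single-pass scan could be written. It is a sketch under assumptions: the function names and the ERR_* flags are invented for illustration, and a missing separator is recorded without stopping the scan.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical diagnostic flags for this sketch. */
enum { ERR_OVERFLOW = 1, ERR_SYNTAX = 2 };

/* Scans one integer at *pp and advances *pp past it, even when it overflows.
   Returns 0 if no digits were found, 1 otherwise. */
static int scan_long(const char **pp, long *out, unsigned *errs)
{
    char *end;

    errno = 0;
    *out = strtol(*pp, &end, 10);
    if (end == *pp)
        return 0;
    if (errno == ERANGE)
        *errs |= ERR_OVERFLOW;      /* diagnose, but keep tokenizing */
    *pp = end;
    return 1;
}

/* Skips spaces, then expects sep. A missing separator is recorded and the
   scan carries on from the same position. */
static void scan_sep(const char **pp, char sep, unsigned *errs)
{
    *pp += strspn(*pp, " ");
    if (**pp == sep)
        ++*pp;
    else
        *errs |= ERR_SYNTAX;
}

/* Parses "A : B , C" in one pass with no allocation.
   Returns a bitmask of ERR_* flags, 0 when the input is clean. */
static unsigned parse_triple(const char *s, long v[3])
{
    static const char seps[2] = { ':', ',' };
    unsigned errs = 0;
    const char *p = s;

    for (int i = 0; i < 3; i++) {
        if (!scan_long(&p, &v[i], &errs))
            return errs | ERR_SYNTAX;
        if (i < 2)
            scan_sep(&p, seps[i], &errs);
    }
    p += strspn(p, " ");
    if (*p != '\0')
        errs |= ERR_SYNTAX;         /* trailing junk after the third number */
    return errs;
}
```

On the second example string, with the overflowing first number and the missing comma, this reports both ERR_OVERFLOW and ERR_SYNTAX from a single pass.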


Rust can return string slices, so a tuple of a return type with a potential number and a slice for the still unprocessed string would be an option, which is much safer than your proposed one and arguably more readable.


How often do you actually need this?


Whenever you do lexical analysis on syntax containing numbers.

On today's hand-held supercomputers avoiding allocations and, generally, exercising memory-efficiency may not be a primary concern, but it very much was at the time when this stuff was built. And it's still relevant today on restricted systems, like microcontrollers, where C is still the primary language.


> Whenever you do lexical analysis on syntax containing numbers.

If that was an intention C should have a full set of lexical analysis functions, but it doesn't (scanf doesn't count). strtol being able to distinguish two error cases and thus being marginally useful for lexical analysis is most likely accidental.


In my experience, reading a single number is a much much much more common operation than doing lexical analysis.


but this is not an appropriate place for that functionality.


It is entirely appropriate for a function which lexically analyzes a buffer in memory in order to match an integer to be able to tell you where that integer ends.


i think thats a reasonable opinion in the context of say, language implementation.

for the relatively simple case of parsing cli arguments i would want an equally simple api. "is this string a valid representation of a number?" and "what number does this string represent?" should be separate apis.

even in language implementation, i'd want the identification of number tokens to be separate from the parsing of those number tokens. i would then have another, separate api for "where does the first number end in this string?" which would probably more likely be "return to me the next substring from this string that represents a number"


Often one can ignore the errno case, as the input is (by then) semi-constrained, and its ambiguity resolution is not needed.

e.g. an idiomatic pattern from some real code would be something like:

  static bool
  parse_thing (char *value, struct thing *th)
  {
      char *endp = NULL;
      unsigned long firstport = strtoul(value, &endp, 10);
      if (endp == value || *endp || firstport > 0xffff) {
          /* Do something with error, like log it */
          return false;
      }
  
      th->firstport = firstport;
      return true;
  }
But granted, for the general case I'd prefer to use some helper like either of:

  bool parse_decimal_uint32(struct string const *str, uint32_t *outval);
  bool parse_decimal_uint32(char const *str, unsigned slen, uint32_t *outval);
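As a rough sketch of what the second, length-taking variant could look like (one guess at an implementation, assuming it should accept digits only, with no sign or whitespace, and reject anything that does not fit in 32 bits):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of a strict parser: the whole of str[0..slen) must be decimal digits
   and the value must fit in uint32_t. The input need not be nul-terminated.
   *outval is written only on success. */
static bool parse_decimal_uint32(char const *str, unsigned slen, uint32_t *outval)
{
    uint32_t value = 0;

    if (slen == 0)
        return false;                           /* empty input is not a number */

    for (unsigned i = 0; i < slen; i++) {
        if (str[i] < '0' || str[i] > '9')
            return false;                       /* sign, space, or junk */
        uint32_t digit = (uint32_t)(str[i] - '0');
        if (value > (UINT32_MAX - digit) / 10)
            return false;                       /* value * 10 + digit would overflow */
        value = value * 10 + digit;
    }

    *outval = value;
    return true;
}
```

Because the length is explicit, a caller can parse a field out of a larger buffer without copying it or terminating it first.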


If you can ignore the errno case due to a "semi-constrained" input, and don't care about trailing junk or having a pointer to more string material after the number is scanned, you can just call atol(str).



