I stopped writing C because string manipulation sucked.

ChrisRR · on April 25, 2019

I agree with that. Recently I started reading "Writing an interpreter in Go" and thought I'd follow along using C.

From the first chapter, the Go code starts using strings as a shortcut to represent tokens. In other languages this is trivial because strings are very easy to create, resize, change, etc. In C this became an issue, though: strings turned into a roadblock, and I had to start implementing different solutions rather than focusing my attention on the contents of the book.

aap_ · on April 25, 2019

Using strings for tokens is probably not how you would ever do this in C. Make the tokens enums.
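A minimal sketch of what that could look like, assuming a tiny expression language. The names `token_kind`, `token` and `next_token` are made up for illustration and are not from the book: each token is an enum tag plus a pointer/length slice into the source text, so no string is ever created or resized.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for a tiny expression language. */
enum token_kind { TOK_EOF, TOK_INT, TOK_IDENT, TOK_PLUS, TOK_ILLEGAL };

/* A token is a kind plus a slice of the source text; nothing is copied. */
struct token {
    enum token_kind kind;
    const char *start; /* points into the source buffer */
    size_t len;
};

/* Scan one token starting at *cursor and advance the cursor past it. */
struct token next_token(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    const char *p = *cursor;
    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;

    struct token t = { TOK_ILLEGAL, p, 1 };
    if (*p == '\0') {
        t.kind = TOK_EOF;
        t.len = 0;
    } else if (*p == '+') {
        t.kind = TOK_PLUS;
    } else if (isdigit((unsigned char)*p)) {
        t.kind = TOK_INT;
        while (isdigit((unsigned char)p[t.len]))
            t.len++;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        t.kind = TOK_IDENT;
        while (isalnum((unsigned char)p[t.len]) || p[t.len] == '_')
            t.len++;
    }
    *cursor = p + t.len;
    return t;
}
```

Comparing token types is then an integer comparison, and the literal text is still available through `start` and `len` when the parser needs it.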

drb91 · on April 26, 2019

Oddly, this is also true for Go...

aap_ · on April 25, 2019

I never really got that complaint, I'd like to see some examples of what people consider so ugly about C strings.

pjmlp · on April 25, 2019

1 - string contents and actual length are handled in separate variables without correlation

2 - no enforcement that a null terminator actually exists in the string

3 - C brags about performance and is probably the slowest language to compute string length

4 - manipulating strings requires very careful handling of buffers, usually forcing everyone to use the heap as the easier way out

abmackenzie · on April 25, 2019

Storing the length of a string alongside the string is a viable option in C, it's up to you to do it yourself though (whereas Java and almost every other language does it for you). See https://github.com/phusion/nginx/blob/master/src/core/ngx_st..., used by NGINX.

pjmlp · on April 25, 2019

Which breaks down every time a C API needs to be called, and that linked API still has plenty of functions with separate pointer and length parameters.

badsectoracula · on April 25, 2019

You can do it like in Free Pascal/Delphi: store the string both length-prefixed (for fast length access and bounds checking) and zero terminated (for passing to functions that expect zero terminated strings).
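A rough sketch of that layout in C, not how Free Pascal or Delphi actually implement it. `pstr`, `pstr_new` and `pstr_at` are hypothetical names: the length sits in front of the bytes, and a NUL is always kept after them so `data` can be handed to functions that expect a C string.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string: explicit length plus a trailing NUL. */
struct pstr {
    size_t len;   /* number of bytes, not counting the terminator */
    char data[];  /* len bytes followed by '\0' */
};

/* Copy n bytes from src into a new pstr; returns NULL on failure. */
struct pstr *pstr_new(const char *src, size_t n)
{
    assert(src != NULL || n == 0);
    if (n > SIZE_MAX - sizeof(struct pstr) - 1)
        return NULL; /* size computation would overflow */
    struct pstr *s = malloc(sizeof *s + n + 1);
    if (s == NULL)
        return NULL;
    if (n > 0)
        memcpy(s->data, src, n);
    s->data[n] = '\0';
    s->len = n;
    return s;
}

/* Bounds-checked byte access: returns -1 when i is out of range. */
int pstr_at(const struct pstr *s, size_t i)
{
    assert(s != NULL);
    return i < s->len ? (unsigned char)s->data[i] : -1;
}
```

The length is O(1) to read, and `s->data` still works with `puts` or `strlen`. Nothing here stops a callee from overwriting the terminator, which is the objection raised in the reply.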

pjmlp · on April 25, 2019

And then the callee corrupts it.

badsectoracula · on April 25, 2019

If the callee is meant to corrupt it then it is your fault for misusing it (unless the corruption is intentional and you plan to recalculate the length after calling it). If it wasn't meant to corrupt it then it is a bug and if it is in your code then it is your fault for not using the string functionality that you already have in place to avoid the bug in the first place. If it is not in your code, as long as you had to use it you'd have that bug regardless of what language or framework you used since it is out of your control and there isn't anything you can do about it.

pjmlp · on April 25, 2019

Aka, C's community version of "you are holding it wrong".

badsectoracula · on April 25, 2019

Is there a language that doesn't allow any abuse of an API, including APIs that were not written in that language?

pjmlp · on April 25, 2019

Yes, any systems language that doesn't need to depend on the existence of C.

If you are going to mention Assembly as a possibility, check ClearPath, where there is no Assembly; NEWP has full control over the hardware stack.

coldtea · on April 25, 2019

>Storing the length of a string alongside the string is a viable option in C, it's up to you to do it yourself though

Obviously, since many other languages that do so are implemented in C.

tyingq · on April 25, 2019

Also sds, used by redis: https://github.com/antirez/sds

aap_ · on April 25, 2019

I was actually more looking for practical issues. Usually the code that I write doesn't even handle strings a lot. Maybe I'm just using other languages for when I do that or maybe I'm using other approaches where others would use strings or maybe I just subjectively don't find them so bad as others. I'd just like to see exactly what people are complaining about so I could find out why I usually don't.

coldtea · on April 25, 2019

>I was actually more looking for practical issues.

Those are not practical? Billions of dollars have been wasted on issues stemming from this...

aap_ · on April 25, 2019

I'd just like to see a concrete example for once.

coldtea · on April 25, 2019

Like, apart from tons of well-known bugs and vulnerabilities caused by string manipulation? How exactly did you miss news items, reports, posts, university lectures, books, and even your own personal experience, on them?

Or is the insinuation that we are hand-wavy about it, and you doubt the existence and scale of the problem? It's a well researched, well established problem, known for almost half a century.

https://en.wikipedia.org/wiki/C_standard_library#Buffer_over...

https://security.web.cern.ch/security/recommendations/en/cod...

http://www.informit.com/articles/article.aspx?p=430402&seqNu...

https://randomascii.wordpress.com/2013/04/03/stop-using-strn...

https://courses.cs.washington.edu/courses/cse341/04wi/lectur... (null termination)

https://www.geeksforgeeks.org/why-strcpy-and-strncpy-are-not...

https://www.owasp.org/index.php/Reviewing_Code_for_Buffer_Ov...:

And let's not even get into format string issues...

owl57 · on April 25, 2019

> 3 - C brags about performance and is probably the slowest language to compute string length

I have a feeling that of all axes of performance C cares the most about memory overhead. Then the obvious idea is to have it at exactly one byte per "simple" string, and you get to pick the class of programs that can't get away with that default string type:

• One-byte terminator: complicated text-handling application with a lot of (longer than a couple of pointers on average) string slices.

• One-byte length: anything that needs strings longer than 255 chars.

And then of these two solutions you pick the obviously more general one. What could possibly go wrong?

pjmlp · on April 25, 2019

Anything that needs strings longer than 255 chars already had a solution in existing systems programming languages back when C was born.

Character arrays with open length, bound checked.

Naturally it requires better compiler support than C authors were willing to implement.

coldtea · on April 25, 2019

>Naturally it requires better compiler support than C authors were willing to implement.

Which is still the case with many things in Go, a language of close origin to C (though this time not about strings).

pjmlp · on April 25, 2019

It is interesting how in both cases they disregarded what was being made around them.

jstimpfle · on April 25, 2019

What was made around them was only ever the gorilla with jungle thing. Fine for small programs, or larger ones within walled gardens. But not fine for infrastructure work.

pjmlp · on April 25, 2019

Yet Multics was deemed safer than UNIX as per the DoD security assessment.

I guess security is not relevant as infrastructure work.

jstimpfle · on April 25, 2019

Oh, and what do they use today?

> I guess security is not relevant as infrastructure work.

...

pjmlp · on April 26, 2019

Ada and even Java (PTC/Aonix) when security matters.

marmaduke · on April 25, 2019

these seem like tradeoffs which are straightforward to understand, which allow for simpler ABI & runtime. sure, the "UX" of the language suffers compared to e.g. Python, but at least the mechanics are easier to understand. If you want Python style string handling in C you could just use the Python C-API.

sureaboutthis · on April 25, 2019

And yet I'm betting your language is run through a C program on its way to being interpreted or compiled.

pjmlp · on April 25, 2019

That is the unfortunate reality of having UNIX being widespread.

AnimalMuppet · on April 25, 2019

Windows is the fault of UNIX? I'm skeptical.

pjmlp · on April 26, 2019

Windows is not a pile of C code, rather C, C++ and .NET.

And nowadays C code is considered legacy, with C#, Rust and constrained C++ as the road to the future.

sureaboutthis · on April 25, 2019

It's not unfortunate. If it was a bad thing, UNIX wouldn't be so widespread.

ernst_klim · on April 25, 2019

>If it was a bad thing, UNIX wouldn't be so widespread.

Strange argument. JavaScript is widespread, COBOL was widespread, Windows is widespread, x86 is widespread. Widespread doesn't mean good. UNIX was a disaster, and the whole family of UNIX-like OSes spent decades just to mitigate its errors and faults.

pjmlp · on April 25, 2019

I guess we have someone here that enjoys using PHP, JavaScript and Perl.

sureaboutthis · on April 25, 2019

As a UNIX professional programmer, no. Other than Perl, which I don't use, why would you think that?

pjmlp · on April 26, 2019

Same set of language design qualities, and being widespread due to historical accident.

coldtea · on April 25, 2019

And our money goes through some COBOL program on its way to our bank/insurance.

And our data through some JS monstrosity.

So?

johannes1234321 · on April 25, 2019

About point 3: Computing the string length is fast in C. The point is: In other languages you always have the length around, so you never count it up.

All the listed weaknesses also have benefits. For instance it is easy to get a substring without need to copy.

But yes, many bugs in C software originate from string buffer overflows.

v_lisivka · on April 25, 2019

> All the listed weaknesses also have benefits. For instance it is easy to get a substring without need to copy.

To make a substring in some other languages, you need to store a pointer to the beginning of the substring and the length of the substring.

To make a substring in C, you need to store a pointer to the beginning of the substring and put '\0' into the original string.
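The two approaches side by side, as a sketch. `slice`, `slice_of` and `cut_in_place` are illustrative names only: the first leaves the original buffer alone, while the second has to overwrite a byte in it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Non-owning view: valid only as long as the underlying buffer is. */
struct slice {
    const char *ptr;
    size_t len;
};

/* View of s[start, start+len); the original buffer is left untouched. */
struct slice slice_of(const char *s, size_t start, size_t len)
{
    assert(s != NULL && start + len <= strlen(s));
    struct slice v = { s + start, len };
    return v;
}

/* The C-string way: overwrite s[start+len] with '\0' and return s+start.
   This truncates the original string and needs a writable buffer. */
char *cut_in_place(char *s, size_t start, size_t len)
{
    assert(s != NULL && start + len <= strlen(s));
    s[start + len] = '\0';
    return s + start;
}
```

A suffix is the one case where the C-string version is free: `s + start` is already a valid NUL-terminated substring with no write at all.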

johannes1234321 · on April 25, 2019

> To make a substring in some other languages, you need to store a pointer to the beginning of the substring and the length of the substring.

Plus a pointer to the beginning, plus a reference counter as the user expects it to manage lifetime. In C this is the user's job. Where they know the life time is guaranteed they can optimize.

Nursie · on April 25, 2019

I coded in C for over a decade. Never did I need a reference counter.

jstimpfle · on April 25, 2019

No, you store offset + length.

adrianN · on April 26, 2019

You can't use any of the stdlib string functions if you don't have that \0 at the end though, right?

jstimpfle · on April 26, 2019

Is there any function you would miss? Of those, which one couldn't you recreate in 5 straightforward lines of code?
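For instance, equality and character search over a (pointer, length) pair can be built on `memcmp` and `memchr`, which take explicit lengths and need no terminator. `span_eq` and `span_find` are made-up names for this sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Equality of two byte ranges, the counterpart of strcmp(a, b) == 0. */
int span_eq(const char *a, size_t alen, const char *b, size_t blen)
{
    return alen == blen && (alen == 0 || memcmp(a, b, alen) == 0);
}

/* Offset of the first byte equal to c, or len if absent (cf. strchr). */
size_t span_find(const char *s, size_t len, char c)
{
    assert(s != NULL || len == 0);
    const char *hit = len ? memchr(s, c, len) : NULL;
    return hit ? (size_t)(hit - s) : len;
}
```

Functions such as `printf` can also take unterminated data via the `%.*s` conversion with an `int` precision.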

pjmlp · on April 25, 2019

O(N) is not the same as O(1).

That substring won't remain valid without copying it.

jstimpfle · on April 25, 2019

> That substring won't remain valid without copying it.

Unless it's not modified? And unless, when modified, that shouldn't be its new value?

C strings (NUL-terminated) are the right approach for static storage of small static strings (like string literals in the source code) since they have low overhead, and "substrings" aren't second-class citizens.

For dynamically allocated strings that won't be modified after creation, the right approach is using a large memory chunk that is shared between many such strings, plus two indices for offset / length (or just offset if it's text that can be terminated with a sentinel).
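One way to read that, as a sketch with hypothetical names (`str_pool`, `str_ref`, `pool_add`, `pool_get`): all string bytes live in one growable buffer, and a "string" is just an offset and a length into it. Offsets are used instead of pointers because `realloc` may move the buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One growable buffer holding the bytes of many immutable strings. */
struct str_pool {
    char *bytes;
    size_t used, cap;
};

/* A string is a position inside the pool, not a separate allocation. */
struct str_ref {
    size_t off, len;
};

/* Append n bytes to the pool; returns 0 on success, -1 on failure. */
int pool_add(struct str_pool *p, const char *s, size_t n, struct str_ref *out)
{
    assert(p != NULL && out != NULL && (s != NULL || n == 0));
    if (p->used > SIZE_MAX / 2 || n > SIZE_MAX / 2 - p->used)
        return -1; /* keeps the doubling below from overflowing */
    if (n > p->cap - p->used) {
        size_t cap = p->cap ? p->cap : 64;
        while (cap - p->used < n)
            cap *= 2;
        char *grown = realloc(p->bytes, cap);
        if (grown == NULL)
            return -1;
        p->bytes = grown;
        p->cap = cap;
    }
    if (n > 0)
        memcpy(p->bytes + p->used, s, n);
    out->off = p->used;
    out->len = n;
    p->used += n;
    return 0;
}

/* Resolve a reference; the pointer is only valid until the next pool_add. */
const char *pool_get(const struct str_pool *p, struct str_ref r)
{
    assert(p != NULL && r.off + r.len <= p->used);
    return p->bytes + r.off;
}
```

Per-string overhead is the two indices, and freeing everything is a single `free(pool.bytes)`.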

Having a short string of about 10 characters allocated as a dynamic object in its own allocation is wasteful. Slow to allocate and has about 2x to 3x overhead. This approach isn't good for applications that store a large amount of data.

> O(N) is not the same as O(1).

Don't call strlen() in situations where the strings are large and you need to know the length ahead of time, and running time is paramount. Instead, store the length.
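A small illustration of storing the length, assuming a fixed 256-byte buffer; `buf` and `buf_append` are invented for this sketch. Each append writes at the remembered end instead of rescanning for the terminator the way repeated `strcat` calls would.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fixed-capacity buffer that remembers its own length. */
struct buf {
    char data[256];
    size_t len; /* bytes in use, not counting the terminator */
};

/* Append n bytes in O(n); returns -1 (and changes nothing) if they don't fit. */
int buf_append(struct buf *b, const char *s, size_t n)
{
    assert(b != NULL && b->len < sizeof b->data);
    if (n >= sizeof b->data - b->len)
        return -1; /* no room for the bytes plus the NUL */
    memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
    return 0;
}
```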

pjmlp · on April 25, 2019

The typical micro-optimization while typing without any profiler feedback, just gut feeling, as prevalent across the community.

jstimpfle · on April 25, 2019

You don't need profiler feedback for these things, just back-of-the-envelope calculations. How much data you would like to store and process is not something a profiler can answer for you.

I can tell you that in one of my projects (a SAT solver in Java) handling a few million variables, the difference between garbage-collected strings and optimized ones (actually, strings converted into unique integer handles immediately) was something like a second until completion vs. a couple of minutes before the garbage collector finally died from lack of oxygen, losing all data computed up to that point.
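The solver described above was written in Java, and its code isn't shown. Purely as an illustration of "strings converted into unique integer handles", here is the idea sketched in C with hypothetical `intern` and `name_of` functions. A linear scan stands in for the hash table a real program would want.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAMES 1024 /* arbitrary cap for this sketch */

static char *names[MAX_NAMES];
static int name_count;

/* Return the handle for s, registering it on first sight; -1 if full or OOM. */
int intern(const char *s)
{
    assert(s != NULL);
    for (int i = 0; i < name_count; i++)
        if (strcmp(names[i], s) == 0)
            return i;
    if (name_count == MAX_NAMES)
        return -1;
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy == NULL)
        return -1;
    memcpy(copy, s, n);
    names[name_count] = copy;
    return name_count++;
}

/* Map a handle back to its text, e.g. for printing results. */
const char *name_of(int handle)
{
    assert(handle >= 0 && handle < name_count);
    return names[handle];
}
```

After the input is parsed, the rest of the program compares and stores plain `int` values and touches the text only for output.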

flohofwoe · on April 25, 2019

I agree, but C (the language) doesn't even have the concept of a 'string'. It's just the convention how some C standard library functions interpret an array of bytes with a zero at the end.

At least in C it's quite obvious that strings are not trivial if you want both an intuitive way to work with strings, and high performance. The C++ std::string type is neither intuitive to work with, nor does it allow writing high-performance code.

For string processing it's really better to use another language with different trade-offs.

atilaneves · on April 25, 2019

> I agree, but C (the language) doesn't even have the concept of a 'string'

It has string literals, so yes it does.

> The C++ std::string type is neither intuitive to work with

Many would disagree.

> nor does it allow writing high-performance code

True, but only due to backwards compatibility with C - std::string operations have to add a null terminator for no other good reason.

badsectoracula · on April 25, 2019

> Many would disagree.

Many would also agree, that means nothing. Personally i dislike C++'s strings... and the rest of STL, which i view as one of the worst standard library APIs in wide use.

jstimpfle · on April 25, 2019

> True, but only due to backwards compatibility with C - std::string operations have to add a null terminator for no other good reason.

That's about the least reason why std::string is inefficient.

pjmlp · on April 25, 2019

C++ std::string is better and more secure than anything that C ever produced.

As for string processing in general, I do agree that other languages are better suited.

aap_ · on April 25, 2019

More often than not I find myself missing C-type strings in other languages. Being able to just walk through the characters and manipulate them is something I found rather ugly in Python, for instance. The NUL character is in my experience not so terrible; you typically have null pointers at the end of a linked list or whatever as well and nobody complains about that. Now I have to admit I had a bug recently that took me longer to fix than I would like to admit because I wasn't walking a string right, but usually I have very little trouble with them.

adrianN · on April 25, 2019

What's a character? UTF-8 makes that a bit difficult to answer. If you want arrays of ASCII bytes, you can have those in most programming languages.

lugg · on April 25, 2019

1 to 4 bytes.

How does utf8 make that difficult to answer?

When was the last time you iterated over a string of unicode points and said you know what would be handy right now? If these code points were split up into arbitrary and unusable bytes of memory.

adrianN · on April 25, 2019

Well, ä is a character in German. You can either write it as LATIN SMALL LETTER A WITH DIAERESIS, or you can use COMBINING DIAERESIS and a. When you iterate over the German word Mädchen as Unicode code points you might be confused. Other languages do much crazier things.
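To make that concrete in C terms, here is a sketch that counts code points in UTF-8 by skipping continuation bytes. `utf8_codepoints` is an illustrative helper and assumes valid input. The two arrays spell Mädchen with the precomposed U+00E4 and with `a` followed by U+0308.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in valid UTF-8 by skipping continuation bytes (10xxxxxx). */
size_t utf8_codepoints(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* The literals are split so the hex escapes don't swallow the following 'd'. */
const char composed[]   = "M\xC3\xA4" "dchen";  /* U+00E4 */
const char decomposed[] = "Ma\xCC\x88" "dchen"; /* 'a' + U+0308 */
```

The composed form is 8 bytes and 7 code points; the decomposed form is 9 bytes and 8 code points. Both display as the same seven-letter word, and `strcmp` reports them as different unless they are normalized first.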

lugg · on April 26, 2019

That doesn't make it hard..

olau · on April 25, 2019

It's been a long time since I wrote C, but the main problem in my recollection is that the standard library is not intuitive. Something simple like taking a couple of arbitrary strings, concatenating them and returning the result without leaking memory or causing buffer overflows is not as trivial as it should be.

I don't think it is a huge problem per se, though; you can just use a string library.

See the confusion here for an example: https://stackoverflow.com/questions/308695/how-do-i-concaten...

snprintf looks like it is the easiest way out.
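A sketch of the snprintf route, with a hypothetical `concat` helper. It relies on the C99 behavior that `snprintf(NULL, 0, ...)` returns the length the output would have, so the buffer can be sized exactly.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Return a newly allocated "a" + "b", or NULL on failure; the caller frees it. */
char *concat(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    int need = snprintf(NULL, 0, "%s%s", a, b); /* length without the NUL */
    if (need < 0)
        return NULL;
    char *out = malloc((size_t)need + 1);
    if (out == NULL)
        return NULL;
    snprintf(out, (size_t)need + 1, "%s%s", a, b);
    return out;
}
```

Ownership is still a matter of convention: nothing in the signature tells the caller that the result must be passed to `free`.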

coldtea · on April 25, 2019

>More often than not I find myself missing C-type strings in other languages. Being able to just walk through the characters and manipulate them is something I found rather ugly in Python, for instance

If what you're saying is that you want mutable strings, many languages have those and you don't need anything like "NULL" to have them (and you can use a bytearray of the string in Python, though Unicode complicates this).

>The NUL character is in my experience not so terrible, you typically have null pointers at the end of a linked list or whatever as well and nobody complains about that

That's not the same thing at all. The linked list is comprised of structs with next fields, that can be null or point to something. Your program can handle either just fine, as both are valid cases (a linked list expects to find the NULL guard at the end but also expects a non-NULL next pointer if the node is not the last one, so will handle both).

OTOH, if an incoming string doesn't have a NUL byte your program will crash/corrupt memory/worse. On top of that, you need to remember to add it and make space for it in most string manipulations. Strings are not expected NOT to end with NUL, and when they don't there's no way you can mitigate it, except to set arbitrary limits on how many characters you consider.
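The "arbitrary limit" mitigation usually looks something like the sketch below, where `bounded_len` is a made-up name. The caller must already know how many bytes are safe to read, and the search for the terminator is capped at that size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the string in buf if a NUL occurs within cap bytes, else -1. */
long bounded_len(const char *buf, size_t cap)
{
    assert(buf != NULL);
    const char *nul = memchr(buf, '\0', cap);
    return nul ? (long)(nul - buf) : -1;
}
```

This only helps when the buffer size is known, as with a fixed-size field or a received packet. Given a bare `char *` with no size, there is nothing to check against.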

leetcrew · on April 26, 2019

> OTOH, if an incoming string doesn't have a NUL byte your program will crash/corrupt memory/worse.

how are linked lists different? if the last node contains garbage for its next pointer, the outcome will be exactly the same. it's a bit more rare to encounter an "unterminated" linked list, but I've seen it happen plenty of times deserializing a linked list from disk or if the programmer just forgot to initialize the pointer. c strings basically are linked lists with an implicit next pointer.

sureaboutthis · on April 25, 2019

This is why I don't like other languages. They have no concept of functions to interpret an array of bytes with a zero at the end.

coldtea · on April 25, 2019

Actually (and I know this is a lame troll), it's trivial to make such functions in most non-C languages.

dgellow · on April 25, 2019

Why is this comment being downvoted? Not everybody here is familiar with C string manipulation; if you downvote or complain, at least give more detail than "it sucks".

@aap_, I asked a similar question some time ago and got some answers, you can check the thread here: https://news.ycombinator.com/item?id=19302581

The direct answer I got was:

> I'm guessing because an off-by-one or an extra skip might mean you miss the end of the string and go off into la-la land feeding whatever garbage happens to be in memory to your parser? That would mostly be a C issue (as it has no string abstraction at all).

coldtea · on April 25, 2019

>Why is this comment being downvoted? Not everybody here is familiar with C string manipulation; if you downvote or complain, at least give more detail than "it sucks".

Well, if someone is not familiar, why do they read a subthread on the matter?

Shouldn't they better start with a tutorial on C/C strings?

Even if people on this thread gave arguments, how would they (not familiar with C and C strings) evaluate them? They could be totally bogus.

drb91 · on April 26, 2019

I am finding it hard to imagine not seeing the difficulty here, so instead I’m just going to point out that simple operations like stripping whitespace, splitting strings on a character pattern, changing case, dealing with character encodings, and regex matching all require manually iterating over and mutating or copying strings, and in the case of regexes require compiling and auditing various libraries. The abstractions that other standard libraries, such as Rust's, provide make it much easier to express string operations at a high level and spend your time elsewhere while retaining relatively high performance. Often, string processing is not in the inner loop and does not benefit from things like combining multiple string operations into a single pass, traditionally a thing that might make C perform better, all other things being equal.
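As an illustration of the first item, here is roughly what stripping whitespace looks like when written by hand in C. `trim` is a hypothetical helper that works in place on a writable, NUL-terminated buffer and only knows about the single-byte whitespace that `isspace` recognizes.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Trim leading and trailing whitespace in place; returns the new length. */
size_t trim(char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    size_t start = 0;
    while (start < len && isspace((unsigned char)s[start]))
        start++;
    while (len > start && isspace((unsigned char)s[len - 1]))
        len--;
    memmove(s, s + start, len - start); /* ranges may overlap, so not memcpy */
    s[len - start] = '\0';
    return len - start;
}
```

In many other languages the same operation is a single standard-library call on the string.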

coldtea · on April 25, 2019

Is this a joke?

Length not known, so prone to overflows at anytime, atrocious standard library, ... (and let's not even go into the Unicode situation).

mhh__ · on April 25, 2019

You basically have to write your own high level string library just to approximate the features of just about any other language...