Giving C++ std::regex a C makeover (nullprogram.com)
98 points by signa11 24 days ago | 73 comments



For context: the author (see his other posts) is exploring the possibilities of writing C with no C runtime to avoid having to deal with it on Windows. He began to kind of treat it as a new language, with the string type, arenas and such, which help avoid memory bugs (and from my experience, are very useful).

This is a pretty cool hack. Makes me want to write a regex library again.
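For readers who have not seen the style mentioned above, here is a minimal sketch of a bump-pointer arena and a pointer-plus-length string in C. The names `Arena`, `Str`, `arena_alloc`, and `str_copy` are made up for illustration and are not taken from the author's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names: a bump allocator over a caller-supplied buffer. */
typedef struct {
    char *beg;  /* next free byte */
    char *end;  /* one past the last usable byte */
} Arena;

/* A string as pointer plus length; no terminating NUL is required. */
typedef struct {
    char     *data;
    ptrdiff_t len;
} Str;

/* Returns `size` zeroed bytes aligned to `align` (a power of two),
 * or NULL if the arena is exhausted. Nothing is freed individually. */
static void *arena_alloc(Arena *a, ptrdiff_t size, ptrdiff_t align)
{
    assert(size >= 0 && align > 0 && (align & (align - 1)) == 0);
    /* bytes needed to round `beg` up to the requested alignment */
    ptrdiff_t pad = (ptrdiff_t)(-(uintptr_t)a->beg & (uintptr_t)(align - 1));
    if (size > (a->end - a->beg) - pad) {
        return NULL;
    }
    void *p = a->beg + pad;
    a->beg += pad + size;
    return memset(p, 0, (size_t)size);
}

/* Copies a C string into the arena; data is NULL if the arena is full. */
static Str str_copy(Arena *a, const char *s)
{
    Str r = {0};
    ptrdiff_t n = (ptrdiff_t)strlen(s);
    char *p = arena_alloc(a, n, 1);
    if (p) {
        memcpy(p, s, (size_t)n);
        r.data = p;
        r.len = n;
    }
    return r;
}
```

Resetting `beg` to the start of the buffer releases every allocation at once, which is where the lifetime simplification comes from.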


TBH, most of the C stdlib is quite useless anyway because the APIs are firmly stuck in the 80's and have never been updated for the new C99 language features and more recent common OS features (like non-blocking IO) - and that's coming from a C die-hard ;)


> TBH, most of the C stdlib is quite useless anyway because the APIs are firmly stuck in the 80's (...)

This. A big reason behind Rust managing to get some traction from the outset was how Rust presented itself as an alternative to C for systems programming that offered a modern set of libraries designed with the benefit of decades of usability research.


Completely agreed. My Rust origin story wasn't about memory safety, fearless concurrency, a modern type system, or anything else like that. Not that I didn't care about those things — I did — but none of them were what convinced me to start learning Rust.

What did convince me was being able to prototype things for the C project I was working on while having access to a standard library that included basic data structures, synchronisation primitives, and I/O handling in a way that used best practices from recent decades. Everything else was just a bonus that I got to learn and use as I went.


Truth be told, I wouldn't mind crowdfunding an alternate stdlib for C for that reason. Everyone has a wishlist of patterns discovered after K&R bestowed upon us System V, that mesh perfectly well with C's philosophy and very minimal environments, and make modern programming much easier when you're dealing with lots of dependencies (e.g. bounded strings/small strings, arena allocation, error chaining/coalescing, etc.).


Many of those benefits were already present in Modula-2, Ada, Object Pascal, among others.

Unfortunately they came up when UNIX and C were the cool kids on the block, and who cares about a buffer overflow or a few miscompilations due to UB, any good coder is going to get them right anyway, no need for straitjacket programming languages.


C became big not because of Unix, but because starting in the 1980's the personal computer with small memory and impoverished operating systems quickly became the biggest market to write software for. Then the biggest player, Microsoft, was more interested in being proprietary and incompatible, which stood in the way of standard libraries. Unix became big when the web became big because of servers.


I don't think everything has to be "modernized" and "updated." When I look at software from the 80s that is still with us, I think: "This is robust, keeps working, and has withstood the test of time" not "This must be changed." I still use C and the standard C library because I know how it worked in the past, I know it works today, and I know it will work for decades to come.

(minus the known foot-guns like strcpy() that we learned long ago were not great)


In contrast, when I see software from the 80s that is not a security and performance disaster, it’s because of continued investment: most of the basics like string handling have been replaced with bespoke or third party libraries, and the IO heavy lifting is done with OS specific interfaces anyway.


This is all well and good, but just because something came from before doesn’t mean it was a good idea then, or especially now. You’re basically citing survivorship bias. Of course something still used from the 80s is well made, otherwise it would have been replaced 30 years ago.


Most of the networking stuff from the 80s at least wasn’t particularly well made, it’s just been maintained and significantly reworked to not have massive security vulnerabilities.


I don't think any of the networking stuff from the '80s dealt with security in any way. We're talking transport layer stuff.


I on the other hand, as a 1970's child, using computers since the 1980's, see people stuck in the past and old ways.


Blocking IO is usually good though. The entire Unix kernel is designed to manage complexity so you can write “if then else”.

What is grep going to do while it waits for data?


You are right in that the C stdlib is mostly useful as an SDK for writing simple UNIX command line tools. But for other things it's better to go down to OS-specific APIs or up to POSIX (if a POSIX environment is available) - which isn't a great deal to be honest. One of the greatest features of C is that it doesn't depend too much on its stdlib.


That’s not what I said. Everyone benefits from being able to read and write blocking code.


Since C11 not depending too much on its stdlib is kind of relative.


> What is grep going to do while it waits for data?

Two things:

- Search the data it’s already read in. If data is coming in fast enough, it’s better to read and search in parallel rather than alternating.

- If this is a recursive grep, then list, open, or read from additional files.

Even so, thread pools work fine for this kind of thing. An optimized grep already wants to use threads to split the CPU work into parallelizable chunks (where helpful), so using threads for syscalls too shouldn’t make much difference.

However, for recursive grep, you might need to open large numbers of small files, in which case syscall latency might be a big enough factor that something like io_uring would be significantly faster.

Disclaimer: I’m mostly thinking about tools like ripgrep that are only grep-like. I’m not aware of any actual grep implementations that use parallelism to the same extent. But there’s no reason a grep implementation couldn’t do that; it’s just that most grep implementations were written in the age of single-core processors. Also note that I don’t actually know much about ripgrep’s implementation, so this post is mostly speculative.


> Search the data it’s already read in

The kernel handles that. Your program works on the data that’s available while the pipe is filling up again.

> If this is a recursive grep

As you said. This is modeled best by multiple threads or processes, each navigating through their structure.


> It’s just that most grep implementations were written...

...with a design philosophy of composition. Rather than a hundred tools that each try to make too-clever predictions about how to parallelize your work, the idea is to have small streamlined tools that you can compose into the optimal solution for your task. If you need parallelization, you can introduce that in the ways you need to using other small, streamlined tools that provide that.

It had nothing to do with some prevalence of "single-core processors" and was simply just a different way of building things.


That just pushes the task of optimising the workload up to you, complete with opportunities to forget about it & do it badly.

I don't relish the idea of splitting a file up into N chunks and running N greps in parallel, and would much rather that kind of "smarts" be in the grep tool.


It has no choice but to read file data in chunks or exhaust memory.

If you need to do n parallel searches what better arrangement do you propose?


I propose the search tool decide how to split up the region I want searched, rather than me trying to compose simpler tools to try to achieve the same result.


You can do nonblocking IO using the C std library. Poll and select have been in there for decades. They are even in POSIX.


POSIX isn't the C stdlib though, that's mostly a confusion caused by UNIXes where the libc is the de facto operating system API (and fully implements the POSIX standard).

TBF though, I guess one can implement non-blocking IO in C11 with just the stdlib by moving blocking IO calls into threads.


Isn't threading generally handled by POSIX as well? The p in pthreads?

If you're writing C code in 2024 and your target is a system that has an OS, then it's safe to use select and poll. They're going to be there. This hand wringing over "oh no, they aren't supported on every platform" is silly because the only platforms where they don't exist are the ones where they don't make sense anyway.


> Isn't threading generally handled by POSIX as well?

C11 added threading to the stdlib (https://en.cppreference.com/w/c/thread).

MSVC is really late to the party (as always): https://devblogs.microsoft.com/cppblog/c11-threads-in-visual...

AFAIK select() and poll() are still not supported in MSVC though. IIRC at least select() is provided by 'WinSock' (the Berkeley socket API emulation on Windows), but it only works for socket handles, not for C stdlib file descriptors.

In general, if you're used to POSIX, Windows and MSVC are a world of pain. Sometimes a function under the same name exists but works differently, and sometimes a function exists with an underscore, and still works differently. It's usually better to write your own higher level wrapper functions which call into POSIX functions on UNIX-like operating systems, and into Win32 functions on Windows (i.e. ignoring the C stdlib for those feature areas).
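As a rough illustration of the wrapper approach described above, the sketch below hides a millisecond sleep behind one function. The name `sleep_ms` is invented for this example. `Sleep` is the Win32 call and `nanosleep` the POSIX one.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>

#ifdef _WIN32
#  include <windows.h>
#else
#  include <errno.h>
#  include <time.h>
#endif

/* Hypothetical wrapper: sleep for `ms` milliseconds on either platform.
 * Returns 0 on success, -1 on failure. */
static int sleep_ms(unsigned ms)
{
#ifdef _WIN32
    Sleep(ms);  /* Win32: takes milliseconds, returns nothing */
    return 0;
#else
    struct timespec ts;
    ts.tv_sec  = (time_t)(ms / 1000u);
    ts.tv_nsec = (long)(ms % 1000u) * 1000000L;
    assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);
    /* POSIX: if a signal interrupts the sleep, continue with the remainder */
    while (nanosleep(&ts, &ts) == -1) {
        if (errno != EINTR) {
            return -1;
        }
    }
    return 0;
#endif
}
```

Callers see a single signature, and the platform differences (units, return values, interruption by signals) stay inside the wrapper.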


> MSVC is really late to the party (as always)

Mostly because before Satya took over, C on Windows was considered a done deal, and everything was to be done in C++, with C-related updates only to the extent required by ISO C++ compliance.

https://herbsutter.com/2012/05/03/reader-qa-what-about-vc-an...

Eventually the change of direction in Microsoft's management made them backtrack on that decision.

Additionally, as far as C++ compliance is concerned, they are leading up to C++23, while everyone else is still missing full modules, some concepts stuff, parallel STL, ...

Although something has happened, as the VC++ team has switched away to something else, maybe due to the Rust adoption, .NET finally being AOT proper with low-level stuff in C#, or something else.

https://old.reddit.com/r/cpp/comments/1ea6gho/microsoft_when...


I guess the difference is that you need maybe 5 peeps to keep the C compiler frontend and stdlib up to date (in their spare time), but 500 full-time to do the same thing for C++, and half of those are needed just to decipher the C++ standard text ;)


That isn't something I can disagree with, now if only WG14 took security more seriously.


> If you're writing C code in 2024 and your target is a system that has an OS, then it's safe to use select and poll.

Not on Windows. My target systems include Windows. The world is much larger than just Linux/POSIX.


I mean, no. The most basic of uses on windows require #ifdef'ing as the prototypes, types, error codes and macros in wsock2 aren't exactly the POSIX ones (WSAPoll instead of poll, etc.).

Also, some software still targets Windows XP, which doesn't even have poll, only select.


poll() and select() are POSIX-isms that are not necessarily going to be present in every system's C standard library.

The only reason they happen to be available on Windows is that Microsoft, in an uncharacteristically freak stroke of respect for existing standards, decided to make WinSock (mostly) function-for-function compatible with BSD sockets.


In practice poll and select exist everywhere it makes sense. There aren't a ton of independent Unix vendors each with their own expensive and broken C implementation running around anymore.

Both are available if you use --std=c89 on gcc and clang. At this point it is safe to assume they are available unless you're doing something weird like writing C for some tiny microcontroller. Practically speaking this has been true for at least 20 years.


> Both are available if you use --std=c89 on gcc and clang.

This is irrelevant - that switch doesn’t have anything to do with controlling functionality not provided by the C standard library. Since poll and select aren’t part of it to begin with it doesn’t affect their availability.

epoll and signalfd will be available as well with that switch (on a gcc target where they are available, obviously). I don’t think that makes them C standard library functionality.

select/poll differ enough across common platforms. In MSVC, poll isn’t there. Sure, you can emulate it, but now the goalpost moving is getting ridiculous. The arguments to select are only superficially compatible. (A practical example is that POSIX select supports pipes, but this will not work on Windows outside of specific environments or 3rd party implementations.)
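To make the pipe case above concrete, here is a minimal POSIX-only sketch that waits for a file descriptor to become readable. `wait_readable` is a made-up helper name. The code assumes a POSIX system and would not port to Windows as written if `fd` is a pipe rather than a socket.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: waits up to `timeout_ms` for `fd` to become readable.
 * Returns 1 if readable (or at hangup/EOF), 0 on timeout, -1 on error. */
static int wait_readable(int fd, int timeout_ms)
{
    assert(fd >= 0);
    struct pollfd p;
    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;
    int n = poll(&p, 1, timeout_ms);
    if (n < 0) {
        return -1;  /* includes EINTR; a real caller may want to retry */
    }
    if (n == 0) {
        return 0;   /* timed out with nothing to read */
    }
    return (p.revents & (POLLIN | POLLHUP)) ? 1 : -1;
}
```

On POSIX the same helper works for pipes, FIFOs, terminals, and sockets alike, which is the property the comment above says is missing on Windows.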


Windows has poll(), it's just called WSAPoll() and, like their select(), only works with sockets.

https://learn.microsoft.com/en-us/windows/win32/api/winsock2...


"WSAPoll" because it's not a POSIX poll, even though the signature is the same and it works similarly for sockets, and I guess Microsoft thought it through better 15 years later. But the original claim is just that select/poll are "everywhere it makes sense", but this depends on what you define select/poll to be. I think something close to POSIX is what makes sense. If you can't use it on pipes and FIFOs (and even regular files without getting an error) that seems like a pretty contrived definition.

The whole moving around of definition of what is the C standard library just because of popularity seems unproductive. ("You can do nonblocking IO using the C std library." -- no, you can't) Most popular 3rd party libraries support or can easily support the most popular targets due in large part to them being popular, that means "in practice, they exist everywhere it makes sense." Does this mean all popular 3rd party libraries are part of the C standard library?

Colloquially redefining terms like what is the C standard library just sows confusion with no benefit (as was illustrated earlier in regards to threads; C11 threads are not pthreads), just say what you mean in this case.


Yes, it is remarkable what little you actually get when you strictly stick to the actual C89/C99/C11 definitions of what's in "The C Standard Library". I still get tripped up on it, and often have to double check: Surely sigaction() is part of the C Standard! NOPE it's POSIX, but signal() is! Surely strptime() is part of the standard! NOPE, but strftime() is. termios stuff? NOPE. It's a minefield out there.
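As a small illustration of staying on the ISO C side of that line, the sketch below uses only `signal()`, `raise()`, and `strftime()`, all of which are in the C standard library. The helper names `on_signal`, `install_handler`, and `format_date` are invented for this example.

```c
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static volatile sig_atomic_t got_signal = 0;

/* A handler installed with signal() can portably do little more than
 * set a volatile sig_atomic_t flag. */
static void on_signal(int sig)
{
    (void)sig;
    got_signal = 1;
}

/* Installs the handler for SIGINT using ISO C signal(), not POSIX sigaction().
 * Returns 0 on success, -1 on failure. */
static int install_handler(void)
{
    return signal(SIGINT, on_signal) == SIG_ERR ? -1 : 0;
}

/* Formats a broken-down time as YYYY-MM-DD with strftime(), which is ISO C.
 * Returns the number of characters written, or 0 if the buffer is too small. */
static size_t format_date(char *buf, size_t cap, const struct tm *tm)
{
    assert(buf != NULL && cap > 0 && tm != NULL);
    return strftime(buf, cap, "%Y-%m-%d", tm);
}
```

Going the other way, parsing a date string back into a `struct tm`, has no ISO C counterpart, which is the `strptime()` gap mentioned above.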


People should look at UNIX, as the C language runtime, and why POSIX became relevant for portable C code.


UNIX isn't the only operating system in the world though, a lot of C programming happens on Windows with MSVC (or embedded platforms, or WASM).


Yeah, and on those platforms the functions exist if they make sense. If you're talking about some tiny embedded thing with no MMU or OS then it might not, but programming on those is a specialized task anyway so it doesn't really matter.


They're talking about Windows, which doesn't have select/poll (except for sockets, kind of).


Thanks for that explanation! I have occasionally fantasized about a similar project - what could C be like, if one abandoned its ancient stdlib and replaced it with something suited to current purposes? - so I'm looking forward now to reading more of this author's writing.


Something like that would probably end up similar to GLib or the Apache Portable Runtime.

https://gitlab.gnome.org/GNOME/glib/

https://apr.apache.org/


Thank you for the context. I wouldn't have read the article without it. I mean, it's a pretty good idea for "no runtime," but when I saw the article title, I thought at first "Why????" Honestly, I'm glad I read it.


Being able to write C without the C standard library on Windows is something we have been doing for as long as Windows has existed, nothing special there.

As proven by early editions of Petzold's famous book.


NODEFAULTLIB is quite a rite of passage


What's special about Windows for a regex library?


On POSIX systems the OS (well, libc) already provides a C regex library: https://pubs.opengroup.org/onlinepubs/9699919799/functions/r...

Whether you want to use that is another question.
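For reference, a minimal sketch of the POSIX interface linked above, using `regcomp`, `regexec`, and `regfree` from `<regex.h>`. The wrapper name `ere_match` is made up for this example, and the code assumes a POSIX libc, so it will not build with plain MSVC.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` matches the POSIX extended
 * regular expression `pattern`, 0 otherwise, -1 if the pattern is invalid. */
static int ere_match(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);
    regex_t re;
    /* REG_NOSUB: only a yes/no answer is needed, no capture offsets */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        return -1;
    }
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Note that `regcomp` allocates internally and the caller has no say in how, which is one of the things the article's arena-based interface is designed to avoid.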


The article was interesting, but even more so was his link to arena allocation in C: https://www.rfleury.com/p/untangling-lifetimes-the-arena-all...

This comprehensive article goes over the problems of memory allocation, how programmers and educators have been trained to wrongly think about the problem, and how the concept of arenas solves it.

As someone who spends most of his time in garbage collected languages, this was wildly fascinating to me.


So bad is the performance of gcc std::regex that I reimplemented part of it using regex(3). Of course, I didn’t discover the problem until I’d committed to the interface, so I put mine in namespace dts, just in case one day the supplied implementation becomes useful.

As it stands, std::regex should come with a warning label. It’s fine for occasional use. As part of a parser, it’s not. Slow is better than broken, until slow is broken.


To be fair, the GNU implementation of std::regex has to conform to the API defined by ISO/IEC 14882 (The C++ Programming Language). If you don't have to provide that API purely in a header file, it gets pretty easy to write something bespoke that is faster, or smaller, or conforms to some special esoteric requirement, or does something completely different than what the C++ standard library specification requires.

The purpose of the C++ standard library is to provide well-tested, well-documented general functionality. If you have specific requirements and have an implementation or API that meets your requirements better than what the C++ standard library supplies, that's great. You're encouraged to use that instead.

If you have an implementation of std::regex that meets all the documented requirements and is provably faster under all or most circumstances than my implementation is, then submit it upstream. It's Free software and it wouldn't be the first time improved implementations of library code have been suggested and accepted by that project. Funny how no one has done that for std::regex in over a decade though, despite the complaints.


I've always heard that it's a backwards compatibility problem with ABI, not API, is that not true?


Mostly yes, and it doesn't help that the template heavy API makes it really hard to improve things internally without breaking the ABI.

See https://stackoverflow.com/questions/70583395/why-is-stdregex...


Around 30 years ago, STL introduced an allocator template parameter everywhere to let you control allocation. Here in 2024 we read about making use of the, erm, strange semantics of dynamic linking to force standard C++ code to allocate your way


I like the newest* introduction of allocators, PMR, I use it quite a lot.


I can't say that I like this very much.

Problematic macro in the header, custom string type compatible with nothing else in C, and I have no idea where the arena type comes from.

Having it magically deallocate memory is nice, but will confuse C programmers reading the caller.

Honestly, adding -lre to the linker is just much easier, and that library comes with docs too.


TFA links to what arenas are and where they come from, how some bits included here would not really be part of this library but assumed part of the project using these techniques, does explain the general point of the exercise, and how this isn't even strictly a suggestion for a library but a "potpourri of techniques".

They are fully aware of -lre and assume that everyone else is too. This isn't about just achieving regex somehow. It's about avoiding the crt and gc and c++ in general while using an environment that normally includes all that by default.

You don't redefine new just to get regex. Obviously there must be some larger point and this regex is just some zoomed-in detail example of existing and operating within that larger point.


Read his other stuff, it’s rather well thought out. The assumption is you don’t use libc or at least you use different interfaces to it.


This is fun and impressive, but I feel the author kind of misses out on explaining in the intro why it would be wrong to just ... use C's regex library [1]?

I guess the entire post could be seen as an exercise in wrapping C++ to C with nice memory-handling properties and so on, but it would also be fine to be open and upfront about that, in my opinion.

1: https://www.man7.org/linux/man-pages/man3/regex.3.html


Probably because that’s not part of the C standard library, but a POSIX offering. Author does cross-platform work including Windows.


Ah, duh. Good point. That's what I get from mostly writing stuff like that in Linux, I guess. Thanks.


Back in the old days of console game programming, most SDKs would come with something like:

my_audio_sdk_init(&arena, sizeof(arena)); // char arena[65536]; // or something like this


> The regex engine allocates everything in the arena, including all temporary working memory used while compiling, matching, etc.

I do something quite different. I design the API so any data returned by the library function is allocated by the caller. This means the caller has full control over what style of memory management works best.

For example, you can then choose to use stack allocation, RAII, malloc/free, the GC, static allocation, etc.

For a primitive example, snprintf.
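A minimal sketch of the caller-allocates pattern described above, built directly on `snprintf`. The function `format_size` is hypothetical and exists only to show the shape of such an API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical library function in the snprintf style: writes "WxH" into a
 * caller-provided buffer and returns the length the full result needs
 * (excluding the NUL), so the caller can detect truncation or size a buffer.
 * Returns a negative value on an output error. */
static int format_size(char *out, size_t cap, int width, int height)
{
    assert(out != NULL || cap == 0);
    return snprintf(out, cap, "%dx%d", width, height);
}
```

The caller can pass a stack array, a static buffer, arena memory, or a `malloc`ed block. Calling with `NULL` and a capacity of 0 first returns the required length without writing anything.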


Isn't giving the caller control over the memory exactly what this API does? The caller just passes in a block of memory that will be used for all of the internal allocations as well as the strings returned by the API.


If I misunderstood it, then you are correct.


This guy is brilliant. He tries to simplify things when so many are going the other way.


std::regex has such horrible performance that it's probably better not to use it even in C++.


Entirely. Good alternatives are CTRE (https://github.com/hanickadot/compile-time-regular-expressio...) which parses the regex and instantiates the automaton entirely at compile-time, or Google's re2 (https://github.com/google/re2) if you need to generate regular expressions at run-time.

Even Boost.Regex, on which std::regex was originally based, performs better because they can afford to break ABI.


I’d really like to see a full C implementation of this interface. Remove all the C++ complexity in the back end


Why wrap the extremely poor and slow std::regex when you have pcre2?


This seems like a bad idea, if only because std::regex (performance) is horrible.


If you read to the end of the article he actually lists the pros/cons where this is mentioned. That aside, the point of the article is maybe not so much using C++ regex but a technique to integrate C++ code into C code.



