Too often I find that the C extension projects introduced here on HN omit examples for dynamically allocated objects. In the binary tree example, apparently I can't just memcpy an existing tree into a malloc'd one, because the copied nodes would still point into the old tree. I was halfway through writing a copy function that recursively traverses the source tree.
Isn’t that a problem even in managed, garbage-collected languages? In most (procedural) languages it’s a bit involved if you want to deep-copy heap-allocated objects. If you don’t want the overhead of re-creating the tree from scratch, you can use indices instead of pointers and store the nodes in a heap-allocated contiguous array; then, when you want a deep copy of the tree, you can just memcpy the whole array in one go.
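As a hedged sketch of that layout, assuming a simple binary tree (the node shape and names here are made up for illustration):

```c
#include <stdlib.h>
#include <string.h>

#define NO_NODE (-1)

typedef struct {
    int value;
    int left;  /* index into Tree.nodes, or NO_NODE */
    int right;
} Node;

typedef struct {
    Node *nodes;
    size_t len, cap;
} Tree;

/* Deep copy is a single memcpy: indices are position-relative,
   so they stay valid in the new array with no per-node fixup. */
Tree tree_deep_copy(const Tree *src) {
    Tree dst = { malloc(src->cap * sizeof(Node)), src->len, src->cap };
    memcpy(dst.nodes, src->nodes, src->len * sizeof(Node));
    return dst;
}
```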
Still, inside that copy constructor you're going to write the exact same boilerplate for recursively copying each node, so the complexity of the code won't change much. You're also going to call malloc/free quite a lot, which can be a real performance hit if you create and destroy large trees frequently, as well as fragmenting your memory if you're not careful.
Personally I have a much easier time dealing with tree/graph data structures using indices, regardless of the language I'm using. Using indices as handles is just so much easier than using pointers in terms of memory management, and generally performs much better in terms of cache locality. Indices are also easier to serialize/deserialize, and sometimes they just "make sense" (for example, in a physics simulation, representing objects as indices is natural because you're going to use those index values inside various math equations).
The downside is that if you're continuously creating and destroying nodes in a tree, you have to be careful about reusing indices. There's a pattern in gamedev (generational indices) that solves the issue: in addition to an index, you also store a counter value, and you increment the counter every time you allocate a new node in the same slot of the array. It's a bit more bookkeeping, but there are lots of code examples for it. It's a well-known trick in gamedev (and recently in the Rust community, as a way around borrow checker issues).
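A minimal sketch of generational indices, under the usual gamedev formulation; all the names here are invented for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t index;      /* slot in the node array */
    uint32_t generation; /* which "lifetime" of that slot we refer to */
} Handle;

typedef struct {
    int value;
    uint32_t generation; /* bumped each time this slot is reused */
    bool alive;
} Slot;

/* A stale handle (its slot was freed and then reallocated) fails the
   generation check instead of silently aliasing the new node. */
bool handle_is_valid(const Slot *slots, Handle h) {
    return slots[h.index].alive &&
           slots[h.index].generation == h.generation;
}
```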
The allocation strategy is only a tangential concern here. The purpose of the example is to demonstrate the library's functionality in brief, not to walk the reader through the problems C programmers face in general.
I would suggest recursively walking the original tree, creating an equivalent tree as you go, instead of copying the original elements.
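A minimal sketch of that approach for a plain pointer-based binary tree (the node layout is assumed, not taken from the library; error handling omitted):

```c
#include <stdlib.h>

typedef struct TreeNode {
    int value;
    struct TreeNode *left, *right;
} TreeNode;

TreeNode *tree_copy(const TreeNode *src) {
    if (src == NULL)
        return NULL;
    TreeNode *dst = malloc(sizeof *dst); /* malloc failure unhandled in this sketch */
    dst->value = src->value;
    /* Recurse so the copy points at freshly allocated children,
       never back into the original tree. */
    dst->left = tree_copy(src->left);
    dst->right = tree_copy(src->right);
    return dst;
}
```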
I predict all languages will adopt some form of sum types and pattern matching in the next 10 years. They will go from esoteric to mainstream in the same way lambdas did.
I find it really interesting that such a simple concept as a tagged union, which appeared at least as far back as ALGOL 68, could be completely abandoned during the '90s and '00s when OOP reigned supreme, and only seems to be gaining popularity again now that pattern matching is all the rage.
It's just baffling to me that tagged unions could be shunned in favour of the visitor pattern. How could that ever look like progress?
Being a millennial, I don't know what programming was like in the '80s, but surely tagged unions must have been bread and butter for Pascal programmers?
Tagged unions never went away; it's just that, as you mention, C++-style OOP became very popular, and it works the other way around: instead of having a single function that matches on the tag to figure out what to do with the data, you ship data-specific functions inside the object itself (virtual methods).
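In C terms, the two directions look roughly like this (a sketch with invented names, not anyone's actual API):

```c
typedef enum { SHAPE_CIRCLE, SHAPE_SQUARE } ShapeTag;

typedef struct {
    ShapeTag tag;
    union { double radius; double side; } u;
} Shape;

/* Tagged-union style: one function matches the tag. */
double shape_area(const Shape *s) {
    switch (s->tag) {
    case SHAPE_CIRCLE: return 3.141592653589793 * s->u.radius * s->u.radius;
    case SHAPE_SQUARE: return s->u.side * s->u.side;
    }
    return 0.0;
}

/* "OOP" style: the data-specific function travels with the object,
   i.e. a hand-rolled virtual method. */
typedef struct OoShape {
    double (*area)(const struct OoShape *self);
} OoShape;
```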
In theory you can have both, of course, and C++ now has support for variants (because, as we know, C++ tries to become a superset of all programming languages in existence). It's still a bit clunky to use though, especially because of the lack of "match" support in the language.
But I think a problem is that it's easy to mess up tagged unions in low-level, non-managed languages. Rust's enums work great, but that's thanks to the very strong type system and borrow checker, which make them effectively impossible to mess up in safe code.
In C or C++ it can become messy and hard to debug. In C you don't really have a choice, so you do it anyway, but in C++ virtual methods and inheritance are usually a bit easier to manage.
Tagged unions have always been around as runtime constructs; they just had different names (e.g. "variant"). In C it's understandable why they aren't part of the language: they would be the only data type that requires a "builtin" opaque struct type to hold the tag and payload union (e.g. something like this, but hidden from the programmer):
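```c
/* A purely illustrative guess at such a hidden layout; nothing like
   this exists in the C standard. */
struct __builtin_variant {
    int tag; /* which alternative is currently live */
    union {
        int    as_int;
        double as_double;
        void  *as_ptr;
    } payload;
};
```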
That would be a pretty big break with "tradition" for the C language ;) (e.g. it would be similar to a "builtin string type")
And of course tagged unions as "generic user-defined types" are much less convenient and useful than ones directly integrated into the language syntax and type system.
Structs aren't really any more fundamental than tagged unions though. They also contain bytes which are hidden from the language (padding). If a struct can have automatically generated padding bytes, why shouldn't a tagged union have an automatically generated tag?
Mostly because C structs have a very precisely defined memory representation, and padding bytes are just about the only part of it you can - mostly - get away with not caring about.
Coincidentally, even padding bytes technically break things. Since they're undefined, you - in theory - can't ever calculate a bytewise hash value over a struct that has any padding bytes. It's still done frequently; the struct is just nulled out beforehand, and copying is bytewise anyway. That doesn't change the fact that, to the letter of the spec, it's undefined behaviour. GCC also recently added __builtin_clear_padding().
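A small sketch of the null-it-out-first workaround described above; the struct and the hash are made up for illustration:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct Key {
    char c; /* on common ABIs, 3 padding bytes follow */
    int  n;
};

/* Any bytewise hash works; FNV-1a is used here just as an example. */
uint64_t bytewise_hash(const void *p, size_t len) {
    const unsigned char *b = p;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) { h ^= b[i]; h *= 1099511628211ULL; }
    return h;
}

uint64_t key_hash(char c, int n) {
    struct Key k;
    memset(&k, 0, sizeof k); /* give the padding bytes a defined value first */
    k.c = c;
    k.n = n;
    return bytewise_hash(&k, sizeof k);
}
```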
Tagged unions built into the language can have layout optimizations; e.g. in Rust, sizeof(Option<&i32>) == sizeof(&i32), because Rust knows it can repurpose NULL as the None value.
To be fair, tagged unions are too error prone to be useful without strong types. That never stopped people from using them, but when a statically enforced alternative came along, everybody jumped ship.
"I find it really interresting that such a simple concept as a tagged union, which appeared at least as far back as ALGOL 68, could be completely abandoned during the 90's and 00's where OOP reigned surpreme"
From a modern perspective, it may be difficult to understand just how crushing the OO consensus was in the 1980s and 1990s. It still reverberates in our language design today. OO was good design was OO. If it didn't fit into OO, that was ipso facto proof that it was bad. Close the book. No further evidence required. Sum types? Kinda overlaps inheritance. Bad design. Not OO, therefore bad design. Why is Java a better language than C++? It's more OO, so QED.
OO here, by the way, is specifically that variant that emerged in the form of C++ and Java. Very important to have defined, enforced "private" keywords and deep hierarchies. Smalltalk by this definition is not an OO language. (And also, therefore, bad. Also possibly heretical for trying to steal the term OO... and I mean all the implications of the word "heretical".)
My "Software Engineering" course circa 1999 was basically "How to Design Software via OO". (It has the distinction of being the one course from my entire college career that I'm not sure I agree with a single thing I "learned" in that course. Even at the time I had done enough real work to be suspicious, and from here in 2021 I find the whole thing risible. But I got a 4.0, so....) Formal diagrams with strictly enforced diagram languages like UML, because OO was the future of Good Design and UML was how OO was going to help bring programming to the masses who would only have to draw Object Hierarchies and then {magic goes here that never was worked out} and perfect programs!
The internet hit programmers and programming earliest, and now only a remnant of the OO dogma is hanging on by its teeth in school curricula that have hardly changed in 25 years, and it lacks the power to envelop students the way it used to.
There was of course always a counterculture that didn't listen, which is where you get Smalltalk, Perl, Haskell and its ancestry, Python, etc. But you have to bear in mind that that was a counterculture...
... and you know, even today, Java and C++ are still pretty darned popular and it's not that hard to find people who still only know one of those (or maybe both) and still think OO ≡ good design ≡ OO, and the vast chaotic landscape of languages that don't even have "private" is just kid stuff for people who can't handle Real Design. Which is OO. Because OO is good, and good is OO.
Currently with D, the two people in charge both favour library solutions, but so far I'm seriously leaning towards a language one - it seems too hard to do type erasure efficiently otherwise, for example.
std::variant is implemented with variadic templates, but C++ doesn't have a "variadic switch" statement.
So std::visit cannot directly use a switch statement. Instead it's usually implemented as a recursive function with the hope that the compiler will unroll the recursion and optimize it into a switch.
This hope usually doesn't work out in practice.
AFAIK the usual implementation statically generates an array of function pointers and dispatches based on the type discriminator. That works even less well for inlining.
On my phone so I can't check, but I thought this was mostly a solved problem? Inlining/folding a jump table should be easy enough, especially since the table can be const.
edit: For a short while libc++ adopted mpark's variant visit, which AFAIK compiles down to a switch-case with some hand coding. It apparently had other problems, so they rolled back to the constexpr array of function pointers.
How do compilers check for exhaustiveness when the domain is unbounded? Or rather: how does one constrain an integer in C? C enums are glorified consts; it's still legal to write `myenum_t x = 0xFFFF` where 0xFFFF is not a defined value. This is necessary to allow for flag/bitfield enums.
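For concreteness, both of these compile without complaint (a minimal example with invented names):

```c
/* A C enum is just an integer type, so out-of-range values are legal,
   which is exactly what flag enums rely on. */
typedef enum {
    COLOR_RED  = 1 << 0,
    COLOR_BLUE = 1 << 1,
} color_t;

int main(void) {
    color_t mixed = COLOR_RED | COLOR_BLUE; /* 3 is not a named value, but legal */
    color_t bogus = 0xFFFF;                 /* also compiles without complaint */
    (void)mixed;
    (void)bogus;
    return 0;
}
```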
Yeah, for me the thing most missing in C is sum types. Poica was more of an experiment; datatype99 is meant as a practical library. I'm planning to use it in my work.
Zig has the advantage that it can add the syntax sugar right into the language instead of using macro "workarounds", but I'd guess in the end both have a similar low-level memory layout (a struct with an enum tag, followed by a payload "raw union") and similar generated code (it would be interesting to compare the compiler outputs; they should be very close).
This "advantage" is also its greatest disadvantage. I can stick these macro definitions in any existing project and start incrementally rolling them out. Migrating something — even partially — to Zig is a much bigger task.
We can generate a type ID by defining an extern variable of a sum type inside the `datatype` macro. Its address will then serve as the type ID (though I'm not sure that two such variables are guaranteed to have different addresses; I need to check the standard).
About type_name: you can deduce a type name if you know at least one tag of a sum type. For example, if a tag is `Foo`, then `FooSumT` is a typedef for the outer sum type (datatype99 generates this typedef).
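Written out by hand rather than generated by the macro, the trick I mean looks roughly like this (all names invented for illustration):

```c
/* A sketch of per-type IDs via distinguished variables; datatype99
   itself may do this differently. */
typedef struct {
    int tag;
    union { int a; double b; } payload;
} FooSumT;

/* One such variable per sum type; its address is the ID. */
extern FooSumT FooSumT_type_id_var;
FooSumT FooSumT_type_id_var;

typedef const void *TypeId;

#define TYPE_ID(T) ((TypeId)&T##_type_id_var)
```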
Two non-const variables will always have different addresses if they are of the same type. Different types can [AFAIR] per the standard theoretically be in different address spaces, though this is rare in practice.
> Different types can [AFAIR] per the standard theoretically be in different address spaces
If you can force the types to contain a member of the same type (say, an int or enum foobar_discriminant), the pointers &const_var_foo.disc and &const_var_bar.disc would have to differ, for obvious reasons.
What kind of preprocessor black magic can you do so that this code compiles? I don't dare open the file datatype99.h lest it cast a dark spell on me.
Yes, as others have already noticed, it is implemented on top of https://github.com/Hirrolot/epilepsy, a metalanguage for preprocessor metaprogramming.
LOL, I cannot get enough of this shit! Can you point to some cool implementation tricks that you used to achieve that? I guess there's quite a bit of ## concatenation to create temporary names and stuff?
The reason I created this project is that I miss sum types in C all the time, literally everywhere I need a custom error type or just any other alternative data representation. And I do think the implementation of datatype99 is straightforward -- only the names are long; all the dirty hacks are encapsulated in the underlying framework.
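To give a flavour of the kind of ## trickery involved, here is a toy example of name generation (not code from datatype99 itself):

```c
/* Two-level expansion so __LINE__ is expanded *before* pasting. */
#define CONCAT_(a, b) a##b
#define CONCAT(a, b) CONCAT_(a, b)
#define FRESH(name) CONCAT(name, __LINE__)

/* Both FRESH(tmp_) uses expand to the same identifier within one
   invocation, and to a different one on each source line. */
#define SWAP(T, x, y)        \
    do {                     \
        T FRESH(tmp_) = (x); \
        (x) = (y);           \
        (y) = FRESH(tmp_);   \
    } while (0)
```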