In a way, C++ and Rust (and possibly D?) have pretty much an infinity of fat pointers, since anything can be made to deref to something else and to be more than one pointer wide (e.g. C++'s std::string and std::vector, and Rust's String and Vec).
Just Googled it, and it seems fat pointers can at least prevent buffer overflows. I want to know more about the other benefits and costs of changing C to use fat pointers.
They can prevent some buffer overflows, assuming that the data in the fat pointer remains correct. If you want to prevent buffer overflows completely you need to go well beyond that and effectively either use a higher-level language with a runtime and GC making sure everything remains coherent or do what Rust is doing with its borrow checker.
Not that I'm against fat pointers in C; I always thought that in hindsight not having a primitive "slice" type was a mistake. In particular, using NUL-terminated strings instead of slices is both error-prone and makes implementing even basic string-splitting functions like strtok annoying or inefficient: you either need to be able to mutate the source string to insert the NULs, or you have to allocate memory and copy each component. If you had slices you could easily return a portion of the string without mutation or copying.
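For illustration, here is a minimal sketch of the kind of slice type and splitter I mean (str_slice and next_token are made-up names, not any real API):

#include <stddef.h>

/* A slice: pointer plus length, no ownership or NUL terminator implied. */
typedef struct {
    const char *ptr;
    size_t      len;
} str_slice;

/* Return the next space-separated token as a slice of the input, without
   mutating it or allocating; advances *rest past the returned token. */
str_slice next_token(str_slice *rest)
{
    size_t i = 0;
    while (i < rest->len && rest->ptr[i] == ' ')
        i++;                                    /* skip leading separators      */
    size_t start = i;
    while (i < rest->len && rest->ptr[i] != ' ')
        i++;                                    /* scan to the end of the token */
    str_slice tok = { rest->ptr + start, i - start };
    rest->ptr += i;
    rest->len -= i;
    return tok;
}

Calling next_token in a loop until it returns an empty slice splits a string like "foo bar baz" without writing to the buffer or copying any bytes.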
I mean, some fat-pointer implementations are fully compatible with the C spec, and would mean that an out-of-bounds access (undefined behavior today) results in a crash rather than a silent overflow. Slice behavior would not exist, though.
The programmer doesn't have to keep the pointer and its bounds in sync manually, so there is less chance of them blowing up, as happens regularly in C code, even code that has been reviewed.
You might be interested in this paper:
"CheriABI: enforcing valid pointer provenance and minimizing pointer privilege in the POSIX C run-time environment"
https://blog.acolyer.org/2019/05/28/cheri-abi/
I would interpret the silence of the C people as "Thanks, but no thanks."
The success of C for such a long time without fat pointers, exceptions, RAII, etc. is probably an indication that they weren't needed all that much.
Generally, when you write a C macro, you should parenthesize each use of any argument inside the macro. The exceptions would be cases like token pasting where you want the literal argument text and only the text.
I don't have a specific failure example in mind for the COUNTOF macro (maybe someone can post one?) - but here is a macro where it is easy to see the problem:
#define TIMESTWO( value ) ( value * 2 )
int ShouldBeEight = TIMESTWO( 2 + 2 ); // Oops, it's 6!
This happens because the macro call expands to:
2 + 2 * 2
Parentheses prevent this problem:
#define TIMESTWO( value ) ( (value) * 2 )
The macro also illustrates a common misconception: sizeof is not a macro or function call, it's an operator. So wrapping the argument to sizeof in parens isn't necessary unless they are needed for other reasons. (They don't hurt anything, they just aren't needed.)
So when I've written a COUNTOF macro (and it is a very handy macro in C/C++ code if your compiler doesn't already provide a similar one like _countof in MSVC), I write it like this:
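#define COUNTOF( array )  ( sizeof ( array ) / sizeof ( (array)[0] ) )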
The space after each sizeof is my attempt to indicate that the following parens are not function/macro call parens, but are there specifically to wrap the array argument safely.
You also shouldn’t repeat the argument: “while(COUNTOF(ptr++) > 0) { }” would do horrible things (if it compiles; didn’t check) for example.
There are various preprocessor tricks to avoid this stuff (eg “do{ x = (array); ... } while (0)”), but, honestly, if you want this sort of memory safety, C++ is your friend.
sizeof doesn't actually evaluate the expression, so in this case ptr++ won't get evaluated at all (nevermind twice). Good thing to be aware of in general though.
Agreed, although I don't know of a straight C way of implementing COUNTOF without repeating the argument. So it becomes a lesser-of-two-evils question: is the macro (with its flaws) better than doing the calculation explicitly every time you need it?
The _countof macro in MSVC does go to some extra work to make it safer, so it's definitely recommended to use that instead of rolling your own COUNTOF in that environment.
Ah yes, thanks for the reminder. This is also the reason why COUNTOF is valid for a zero-length array.
Even though array[0] is non-existent in this case, it's fine to pass it to sizeof to get the size of an array element as if there were one. After all, a pointer can point one element past the end of an array.
> After all, a pointer can point one element past the end of an array.
Hmm, I wouldn’t put it that way, since sizeof(array[1]) would work just as well. I like to think of sizeof running the type checker on the expression provided and then looking up the size of the type of the expression: this makes it clear that the thing inside can be completely bogus as long as it is syntactically correct.
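A tiny standalone demo of that behavior (assuming a typical platform where int is 4 bytes):

#include <stdio.h>

int main(void)
{
    int arr[4];
    int *p = arr;

    /* arr[100] is wildly out of bounds, but it is only type-checked, never
       evaluated, so this is fine and prints 4. */
    printf("%zu\n", sizeof arr / sizeof arr[100]);

    /* The increment never happens: sizeof does not evaluate its operand
       (except for variable-length arrays). */
    printf("%zu\n", sizeof (*p++));
    printf("%d\n", p == arr);   /* prints 1: p is unchanged */
    return 0;
}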
Passing subexpressions with side effects to macros is a great way to have the side effect occur repeatedly through repeated evaluation of the macro's operands. Be careful!
Also, I've seen a buggy Linux kernel driver use the above macro, then pass it a pointer. It was only caught because of -Wsizeof-pointer-div in a recent compiler.
This is analogous to spans and views in C++. I've rather enjoyed using span-lite [0], an implementation compatible with earlier C++ standards. (Roughly, all the syntactic niceness of an STL container without requiring ownership.)
> Any slice implementation that references to the original data has to be some kind of "fat pointer". Is there any other way that can be done?
Yes, add one more level of indirection.
For instance, traditional "OO" languages don't usually use fat pointers for dynamic dispatch: you have a single pointer to an instance, which holds a pointer to its vtable.
In Rust however, the "object pointer" is a fat pointer of (vtable, instance). Go's interface pointers are the same (which combined with nils leads to the dreaded "typed nil" issue).
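In C terms, the two layouts look roughly like this (a simplified sketch; the field and type names are invented, not Rust's or Go's actual internals):

/* Thin pointer, classic OO: the vtable pointer lives inside the object, so a
   one-word pointer to the instance is enough for dynamic dispatch. */
struct vtable { void (*speak)(void *self); };

struct object {
    const struct vtable *vt;   /* stored in every instance */
    /* ... instance data ... */
};
/* dispatch: obj->vt->speak(obj); */

/* Fat pointer, Rust trait object / Go interface style: the vtable pointer
   travels alongside the data pointer, so the pointee carries no header. */
struct fat_ref {
    const struct vtable *vt;
    void                *data;
};
/* dispatch: r.vt->speak(r.data); the handle itself is two words wide */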
I don't understand people's beef with "typed nil". If an interface is just a fat pointer and the data is itself a pointer, then it stands to reason that the interface can be non-nil while the data is nil. If you understand the concept of pointers (or nillable pointers, anyway), surely this should not be surprising? It would have the same states as an int**, would it not?
I think it's a combination of several things: First, you've got a good chunk of Go programmers who haven't dug down into the details of the internal representation of interfaces, because for the most part, they don't need to, and they either don't generally dig in to that level, or just haven't gotten around to it yet. (There's nothing wrong with this and this isn't a criticism. It's just that not everybody is an expert in everything they do or use.)
Secondly, I think that for a lot of people, when they have an interface that is itself nil, they mentally tag it with the label "nil", and when they have an interface that contains a typed value that is nil, they also mentally tag it with "nil", and from there the confusion is obvious. Mix in some concepts from other languages like C, where nil instances are always invalid, so that you don't have a mental model for a "legitimate instance of a type whose pointer is nil" [1], and it just gets worse.
Third, I think it's an action-at-a-distance problem. The real problem occurs at the point where you have, say, an "io.Reader" value and you put an invalid nil struct pointer into it; values that are invalid in that way should never be created at all, so "interfaceVal == nil" should be all the check that is necessary. So the code that blows up trying to use the "invalid interface" is actually the victim of an arbitrarily distant bit of code that created a value that shouldn't have been created in the first place, and that pattern generally causes problems with proper attribution of "fault" in code; the temptation to blame the place that crashed is very strong, and, I mean, that's generally a perfectly sensible default presumption, so it's not like it's a crazy idea or anything.
[1]: For those who don't know, in Go, you can have a legitimate value of a pointer which is nil, and implements methods, because the "nil" still has a type the compiler/runtime tracks. I have a memory pool-type thing, for instance (different use case than sync.Pool, and predates it) which if it has an instance does its pooling thing, but if it is nil, falls back to just using make and letting the GC pick up the pieces.
This. Nil is the one case where you can seriously screw yourself up in Go, because it's an untyped literal that gets typed on assignment. Interfaces are fat pointers, and a "true" nil is a nil interface with a nil implementation. If your function is going to return a known concrete type and you store that nil in a temporary, you now return a known interface type with a nil implementation. Then your == nil check on the interface value will fail.
I royally screwed up a deploy once because my functions were returning nil error implementations that were != nil and triggered failure paths.
I agree that nil is error prone, but I don’t think interfaces add any additional confusion so long as you think of them as a reference type and reason accordingly. Also, I think you’re mistaken about “true nil”. A “true nil” only requires the interface to be nil. The value doesn’t matter as there necessarily is no value (moreover interfaces can have values which are value-types that cannot be nil!).
Not since Java 7 update 6 (2012).
The extra ints (8 bytes) used to mark the beginning/end of the string, plus the fact that the substring kept a reference to the original string, caused extra memory footprint. So now substring just makes a copy unless the substring is exactly the same as the original.
Previously (around 2004) I used to call new String(string.substring(...)) in specific cases to prevent those leaks.
> Is there any other way that can be done?
Yes, copies.
Not numpy, but Gorgonia tensors are just pointers. The reason: it's cheaper to pass around one word than the absolute minimum "fatness" to pass around a multidimensional array.
The absolute minimum meta info required for a multidim array is shape, stride. Numpy implements them as a [HUGENUMBER] array of integers for both. Gorgonia implements them as slices. Both of these incur a lot of additional movement of memory.
Another way to implement fat pointers is to keep the pointer representation the same but have another structure like an interval tree that stores a map of pointers to the extra information on bounds. That's how I implemented Bounds Checking GCC back in the day: https://www.doc.ic.ac.uk/~phjk/BoundsChecking.html It has the advantage that pointers remain compatible with code that wasn't specially compiled to know about fat pointers (such as system libraries).
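A much-simplified sketch of that approach: the pointers themselves stay ordinary one-word pointers, and the bounds live in a separate table keyed by address. (A real implementation uses an interval/splay tree and inserts the checks automatically, not a linear scan like this.)

#include <stddef.h>

/* Out-of-band bounds table: maps allocation base addresses to their sizes. */
struct region { const char *base; size_t size; };
static struct region regions[1024];
static size_t nregions;

static void note_alloc(const void *base, size_t size)
{
    regions[nregions].base = (const char *)base;
    regions[nregions].size = size;
    nregions++;
}

/* Nonzero if the access [p, p + n) lies entirely inside a recorded region.
   An instrumented load/store calls this before dereferencing. */
static int in_bounds(const void *p, size_t n)
{
    const char *cp = (const char *)p;
    for (size_t i = 0; i < nregions; i++) {
        if (cp >= regions[i].base && cp + n <= regions[i].base + regions[i].size)
            return 1;
    }
    return 0;
}

Because the pointer representation is untouched, pointers can still be handed to uninstrumented code such as system libraries; only the checks are lost there.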
I think AddressSanitizer uses a shadow mapping structure more similar to Valgrind.
Although your point about performance still stands. Intel MPX is dead--support removed from GCC, IIRC, because of hints Intel was abandoning it. We'll never know how well it could have been optimized in practice. I think ultimately we simply need proper fat pointers, but those might not become ubiquitous (in the sense of becoming a de facto standard part of shared, C-based platform ABIs) until we see 128-bit native pointers.
Algol 68 slices are also fat pointers. Only they also carry a `stride` field, since in Algol 68 you can slice multidimensional arrays in any direction.
A couple questions about that design. Is it worth paying the cost of an extra word in every slice, if strided slicing is rare in most programs? And when you wrap OS APIs that don't understand striding, do you have to use a different pointer type?
It's a subjective question, but given that A68's slices are already burdened with carrying their lower bound wherever they go, not much is lost by adding another piece of data to them. They are already quite fat, and get fatter with each dimension. I think the ability to take a column and just apply any sequence function to it is well worth it.
I don't think you could use OS APIs without boxing and unboxing data, but I think that's the case with every language except C.
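To make the stride point concrete, here is roughly what a strided slice buys you, written as a C sketch (dslice and get are made-up names):

#include <stdio.h>

/* A strided slice: base pointer, element count, and distance between elements. */
struct dslice { const double *ptr; size_t len; size_t stride; };

static double get(struct dslice s, size_t i) { return s.ptr[i * s.stride]; }

int main(void)
{
    double m[12] = { 1, 2, 3, 4,  5, 6, 7, 8,  9, 10, 11, 12 };  /* 3x4, row-major */

    struct dslice row1 = { m + 4, 4, 1 };   /* second row:   stride 1            */
    struct dslice col2 = { m + 2, 3, 4 };   /* third column: stride = row width  */

    for (size_t i = 0; i < row1.len; i++) printf("%g ", get(row1, i));  /* 5 6 7 8 */
    printf("\n");
    for (size_t i = 0; i < col2.len; i++) printf("%g ", get(col2, i));  /* 3 7 11  */
    printf("\n");
    return 0;
}

Drop the stride field and the column case is no longer expressible without copying, which is the trade-off being discussed above.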
Obviously there are all kinds of cute games that can be played with pointers if you know arch-specific details (IIRC V8 was using alignment guarantees to steal low bits at one point, and might still be). And tagged pointers of various kinds go back at least to Symbolics, if not further.
Go interfaces are also fat pointers. Basically, the pointer to the type info/method set and the actual data are stored separately. [1]
As a result of this you can use pointers as interface types without boxing them. For example you can take an interior pointer to an array, or a C pointer, and use it as an interface type without boxing since the type info doesn't need to be stored with the data.
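Roughly, in C terms, the representation being described is a two-word pair, something like this (a simplification for illustration, not Go's actual runtime definitions):

/* Sketch of an interface value: the type/method-set descriptor and the data
   pointer are carried side by side rather than the data carrying a header. */
struct itab;   /* type descriptor + method table; details omitted */

struct iface {
    const struct itab *tab;   /* which concrete type, and its method set */
    void              *data;  /* pointer to the value itself             */
};

Since the type information rides in the handle instead of in the pointed-to memory, any pointer, including an interior pointer into an array or a pointer that came from C, can be dropped into the data word without boxing.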
The "uint8* a, usize a_len" pattern is one of the ugliest aspects of C. It's probably the source of most bugs. In its defense, it was invented at a time when we were trying to save every byte of data and security was not really a thing. Unfortunately C is still quite popular.
IMHO if C had generics then so many problems could be fixed (e.g. you could easily have a struct Slice<T> for a fat pointer). Existential generics would even give you OOP (and, in a roundabout way, closures) for free. Hmm...
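You can fake a little of that today with a type-defining macro, which is clunky but shows what a Slice<T> would buy (SLICE_DEFINE is a made-up name for this sketch):

#include <stddef.h>

/* Declare a fat-pointer "slice of T" type under the given name. */
#define SLICE_DEFINE( T, name )  typedef struct { T *ptr; size_t len; } name

SLICE_DEFINE( int,    int_slice );
SLICE_DEFINE( double, double_slice );

/* Functions then take one slice argument instead of a (pointer, length) pair. */
static int sum(int_slice s)
{
    int total = 0;
    for (size_t i = 0; i < s.len; i++)
        total += s.ptr[i];
    return total;
}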
You could put bounds-checked arrays into the 80286 segment descriptor table, but loading a segment register in protected mode was very slow and 16k entries wasn't big enough.
Interesting that this post has been in the #1 place for so long. I don't use Go but I knew this already by having randomly read some things about the language. And slices are central language concepts after all.
Is it that the Go documentation doesn't cover the underlying implementation at the machine word level?
The docs explain this, not in as much detail as the post, but they do touch on slices referencing/wrapping an array. There is also a blog post dedicated to slice internals, https://blog.golang.org/go-slices-usage-and-internals, that dives further in.
Fat pointers + threads permit torn reads/writes, leading to memory unsafety. It seems weird that Go allowed memory unsafety for this narrow use case - AFAIK only slices and interfaces are fat pointers.
> There is one important caveat: Go is not purely memory safe in the presence of concurrency. Sharing is legal and passing a pointer over a channel is idiomatic (and efficient).
> Some concurrency and functional programming experts are disappointed that Go does not take a write-once approach to value semantics in the context of concurrent computation, that Go is not more like Erlang for example. Again, the reason is largely about familiarity and suitability for the problem domain. Go's concurrent features work well in a context familiar to most programmers. Go enables simple, safe concurrent programming but does not forbid bad programming. We compensate by convention, training programmers to think about message passing as a version of ownership control. The motto is, "Don't communicate by sharing memory, share memory by communicating."
If you need thread safety you must at least use atomics anyway. Either your platform has transactional memory or double-wide CAS, or you need to fall back to locks.
I was of the impression that even regular pointers could allow for memory unsafety but I can’t recall exactly what I had heard that gave me that impression...
Every language sucks: C++ is too complex, Haskell is too wound up with GHC, Lisp is hard to find a use for outside of Emacs, Java attracts bad libraries, Rust is hard to write simple data structures in, JS was written in a week, bash is hard to read, Python has the GIL, etc, etc.
But programming is fun and it’s especially fun to move easily among all these flawed tools and accomplish a goal.
Golang is OK: the runtime is pretty weak compared to modern JVMs, tight C++ is faster, and Clojure and Haskell are denser. It's OK, but I wouldn't stop checking out other stuff.
It's all about cost/benefit tradeoffs. Programming involves making cost/benefit decisions many times a day. Running a programming shop is that multiplied by the number of people, squared. Technical debt is about unresolved cost/benefit tradeoffs from the past.
I think one's first language usually taints the way one looks at other languages. For me I learned Python 3 first. So even Python 2 looks not so great, and learning Go from Python 3 was a bit rough though I've since learned to love Golang -- ask me in a bit when the honeymoon phase ends though ;-)
I think it’s for different reasons. Go as a language is just fine, but Go as a framework for building software is just fantastic. The tooling is comprehensive and easy to use, testing is built in, and the happy path is almost always highlighted for you. No need to figure out which documentation generator, style guideline/formatter, test framework, build system, package manager, etc to use and these are easy to use (with the notable exception of package management, which has been rocky but is improving). Lisp is a neat and powerful little language but the tooling and documentation are often lacking and it’s often difficult to figure out even which lisp is right for your project.
The debugger isn't great. Most things just work, and they just work well for the most popular use cases. Documentation is just comments—no special syntax to memorize and no need to maintain infrastructure to parse docstrings and publish to a web server. It also runs the examples to make sure they work. The build system doesn't require you to learn a special (typically imperative) project configuration language or Makefile generator. It also outputs a statically-linked binary by default—no VMs or interpreters or libraries to ship separately. Although it doesn't have a comprehensive assertion suite, Go ships with a unit test framework that just works, which is more than can be said for most languages. And all of these things are standard so you can pretty much jump into any project and build it, run its test suite, and begin contributing with little more than a text editor and the Go toolchain. So yeah, Go's tooling is simply fantastic and no other mainstream language except Rust comes close. This isn't a fanboy statement; there are lots of other factors to consider, but Go's tooling is top notch.
I can relate. It is almost addictive to write in a Lisp. Clojure specifically. I cannot pinpoint exactly why that is, but I believe it has something to do with its neat syntax and the peace of mind it creates when writing in a mostly functional style. It feels like I don't need to worry about the language but rather about the problem.
> For example, currently on x86-64, only the lower 48 bits of a pointer are actually used. The other 16 bits could carefully be used for other information
Well, you can't dereference these tagged pointers directly on amd64; you have to remove the tag first.
aarch64 actually has a Top Byte Ignore flag though :)
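A sketch of the stash-and-strip dance on a typical x86-64 setup (this leans on platform-specific behavior like 48-bit user addresses and arithmetic right shift, so it is not portable C; the helper names are invented):

#include <stdint.h>
#include <stdio.h>

/* Pack a 16-bit tag into the top bits of a pointer whose meaningful part
   fits in the low 48 bits (true for current user-space x86-64 addresses). */
static uintptr_t tag_ptr(void *p, uint16_t tag)
{
    return ((uintptr_t)p & 0x0000FFFFFFFFFFFFull) | ((uintptr_t)tag << 48);
}

static uint16_t get_tag(uintptr_t tp)
{
    return (uint16_t)(tp >> 48);
}

/* The tag must be stripped (and bit 47 sign-extended upward, strictly
   speaking) before the pointer can be dereferenced. */
static void *untag_ptr(uintptr_t tp)
{
    int64_t canonical = (int64_t)(tp << 16) >> 16;   /* arithmetic shift */
    return (void *)(uintptr_t)canonical;
}

int main(void)
{
    int value = 42;
    uintptr_t tp = tag_ptr(&value, 7);
    printf("tag=%u value=%d\n", (unsigned)get_tag(tp), *(int *)untag_ptr(tp));
    return 0;
}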
I've seen this technique used in old codebases for strings. A typedef was made for a char* whose -1 offset held a count, and there was no zero terminator. This was useful since string length checks became O(1), and string manipulation can take a surprising amount of time in CPU-bound programs.
It was an interesting idea but you needed a whole new set of string functions and bridge functions. It takes a runtime (or codebase) wide commitment to make it work.
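A rough reconstruction of that kind of scheme (this particular variant stores a size_t just before the characters; lstr_new/lstr_len/lstr_free are invented names):

#include <stdlib.h>
#include <string.h>

/* A counted string: the length sits in the bytes immediately before the
   char* that callers hold, and there is no NUL terminator. */
typedef char *lstr;

static lstr lstr_new(const char *src, size_t n)
{
    char *block = malloc(sizeof(size_t) + n);
    if (!block) return NULL;
    memcpy(block, &n, sizeof n);               /* length at "index -1"       */
    memcpy(block + sizeof(size_t), src, n);    /* ...followed by the bytes   */
    return block + sizeof(size_t);             /* callers only see the bytes */
}

static size_t lstr_len(lstr s)                 /* O(1): no terminator scan */
{
    size_t n;
    memcpy(&n, s - sizeof(size_t), sizeof n);
    return n;
}

static void lstr_free(lstr s)
{
    free(s - sizeof(size_t));
}

And that is exactly the catch mentioned above: every standard str* routine expects a terminator, so you end up writing a parallel set of lstr_* functions plus bridge code at every boundary.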
That sounds like Pascal strings. Arguably saner than C strings, but lacking the main advantage of slices: multiple slices of different lengths and/or different starting points can refer to parts of the same string, so you can "slice" things without copying.
Swift arrays are the only programming language arrays that have ever baffled me. I still have no idea how to manually reconstruct them given the array struct.
Swift arrays are interesting beasts because they need to support bridging into Obj-C, which makes them rather complicated. Look at ContiguousArray instead if you want the simple version that skips bridging support.
What surprises me about C programmers is that bugs like buffer overflows have been known for several decades, but they still use raw pointers for strings and unsafe pointer arithmetic. At least in open source projects I often see this. C is really a language that doesn't move forward.
https://dlang.org/spec/arrays.html#dynamic-arrays
My suggestion for C to adopt fat pointers:
https://www.digitalmars.com/articles/b44.html