Yet in practice the stack on x64 is separate from the heap, GPU memory in general is separate from system memory, and even inside the GPU you have at least a three-level segmented memory hierarchy.
Even in C-land, you're technically not supposed to fabricate arbitrary heap pointers; you're supposed to offset something you get from malloc(). The Standard says (if memory serves) that only pointers from the beginning of a malloc()'d block to one-past-the-end have defined behavior (when it comes to the heap).
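A minimal illustration of that rule (hedged sketch; the point is that merely forming the out-of-range pointer is the problem, even without dereferencing it):

    #include <stdlib.h>

    int main(void)
    {
        int *a = malloc(8 * sizeof *a);
        if (!a) return 1;

        int *one_past = a + 8;   /* OK: may be formed and compared, just not dereferenced */
        int *too_far  = a + 9;   /* undefined behaviour just to compute this */
        (void)one_past; (void)too_far;

        free(a);
        return 0;
    }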
Of course, there are probably lots of in-practice exceptions when it comes to embedded, kernel code, mmap() shenanigans, etc.
Oh boy, welcome to the exciting world of C pointer provenance :) What you just described is what compiler people call pointer provenance, where each pointer has, in addition to the address, a second piece of info attached to it that describes all the places to which the pointer can point.
This is an extremely simplified and probably incorrect view of https://open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1434r0.... This is complicated because nobody agrees on what the correct behaviour should be, and we have mountains of legacy codebases that all rely on something slightly different when pointers get converted to ints and back.
That's what makes C programming fun: doing what the compiler/language people say you shouldn't be doing and seeing if it works anyways. At least, it works on my machine...
Rational built a truly semantic IDE, and the result of that is that if you change a line of source-code somewhere in a huge complicated project, it will know precisely what the consequences of that change are.
Ultimately, this often allows it to recompile just that single line.
The part of this that I most enjoy is that you cannot even obtain some memory from your own allocator implemented on top of mmap or on top of malloc, store ints in it, return it to the allocator, ask the allocator for memory again and receive the same memory back, and store doubles in it.
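For concreteness, a toy version of the scenario being described (the bump allocator and its names are invented; whether compilers actually punish this is exactly the contested part):

    #include <stdalign.h>
    #include <stddef.h>

    static alignas(max_align_t) unsigned char pool[1024];
    static size_t pool_used;

    static void *my_alloc(size_t n) { void *p = pool + pool_used; pool_used += n; return p; }
    static void  my_reset(void)     { pool_used = 0; }

    void demo(void)
    {
        int *ip = my_alloc(4 * sizeof *ip);
        for (int i = 0; i < 4; i++)
            ip[i] = i;                         /* the bytes now hold ints */

        my_reset();                            /* "give the memory back"  */

        double *dp = my_alloc(2 * sizeof *dp); /* same bytes handed out again */
        dp[0] = 1.0;                           /* the reuse the comment above says the
                                                  standard blesses for malloc()'d storage
                                                  but not for your own allocator */
    }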
I've implemented a pool allocator for embedded devices a couple of times, allocating memory for callers out of a block of static memory. I thought it was portable C. Which detail of the standard did I not realise I was running into?
To be clear, that metadata only (generally) exists in principle - it is not actually materialized and stored anywhere, even at compile time (definitely not at runtime). There may be cases where the compiler actually tracks it, but most often it is only tracked "best effort" - the compiler only needs to prove that two pointers can't be aliased based on any possible value of that metadata, not actually compute that metadata and use it.
Standard C provides only a few ways to obtain valid pointers. Implementations can define behaviors in cases that the standard leaves undefined, such as allowing more casts of integers to pointers than the standard defines (common in embedded-land), or functions like mmap or sbrk. So you or your chip vendor could define FOO_REG as *(uint32_t *)0x80001234 and use it as if it were a variable.
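In practice it usually looks something like this (FOO_REG and the bit position are illustrative; the address is the one from the comment above, and vendors normally add volatile so the compiler can't cache or elide the hardware access):

    #include <stdint.h>

    #define FOO_REG (*(volatile uint32_t *)0x80001234u)

    void enable_foo(void)
    {
        FOO_REG |= 1u << 3;    /* read-modify-write of the hardware register */
    }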
Olde C used to just let you use integers as struct pointers, and there was only one struct-member namespace. So code like this was valid and did an integer-size write to address 0177770. Old Unix did this for device register access; see the Lions book.
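The snippet didn't survive the copy, but the idiom being described looked roughly like this (a reconstruction for illustration, not the original; modern compilers reject it):

    struct {
        int flags;
        int count;
    };

    /* Pre-K&R C: member names were just global offsets, and any integer
       could act as a struct pointer, so this stored an int at 0177770. */
    0177770->flags = 1;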
Although you shouldn't be using sbrk on a modern system; you'd instead be implementing malloc on top of mmap. So remove sbrk, make mmap the "object allocator", and tada! You don't really need linear virtual address spaces anymore.
Yes, this is why malloc() can't actually be implemented in C. Actual implementations of it exist because of special compiler dispensation, or more commonly because the callers and the implementation live in separate libraries, so the implementation isn't visible to the caller.
> Attribute malloc indicates that a function is malloc-like, i.e., that the pointer P returned by the function cannot alias any other pointer valid when the function returns, and moreover no pointers to valid objects occur in any storage addressed by P. In addition, GCC predicts that a function with the attribute returns non-null in most cases.
But it doesn't provide any operations to do the things in this paragraph (create new pointers). The operation that does this is malloc itself.
This will typically not cause problems, but it would if LTO got so good you could include libc in it.
So what? In a bitmask, such as a hardware register or database Bloom filter, two adjacent bits may be unrelated, and their relative position entirely coincidental and insignificant.
That doesn't mean we need a memory model in which bits are separate objects, in some sense.
Real segmentation would have solved the problem described in the article. Under virtual-memory segments like those on the 80386 (and mainframes before that), you can physically relocate a segment while adjusting its descriptor so that the addressing seen by the program doesn't change.
The problem was mainly caused by having no MMU, so moving objects around in order to save space required adjusting pointers. Today, a copying garbage collector will do the same thing: rewrite all the links among the moved objects. You had similar hacks on Apple Macintoshes, with their MC68K processors and flat address space, like doubly indirected objects managed by "handles" and whatnot.
The 386 segmentation system sits above the paging portion of the MMU, so it is possible to move segments around without physically copying memory (or adjusting the segment+offset).
Here is a somewhat reasonable diagram of how it all works
It's fun to see that old article again. (I wrote the FixDS program Raymond mentions at the end.)
One pain point in 16-bit Windows programming was that you had to do two special things for any callback function in your application: EXPORT the function in your .DEF file, and call MakeProcInstance() on your callback function to get an "instance thunk".
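For anyone who never had the pleasure, the old pattern looked roughly like this (a sketch from memory of the 16-bit Windows API; the dialog and procedure names are made up):

    #include <windows.h>

    /* AboutDlgProc must also be listed under EXPORTS in the .DEF file. */
    BOOL CALLBACK AboutDlgProc(HWND, UINT, WPARAM, LPARAM);

    void ShowAbout(HINSTANCE hInst, HWND hwndOwner)
    {
        FARPROC lpfn = MakeProcInstance((FARPROC)AboutDlgProc, hInst);
        DialogBox(hInst, "ABOUTBOX", hwndOwner, (DLGPROC)lpfn);
        FreeProcInstance(lpfn);
    }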
FixDS made a simple patch to the compiled function prologs that rendered all of this unnecessary. You could just use your callback function directly, like we expect today.
I wrote this piece frustrated by what looks to me like an entire semiconductor industry exploring only a single computer storage organization, despite the fact that recent inventions like flash practically beg for innovation.
For instance, few people realize that the Flash Adaptation Layer in SSD devices means that we literally run two filesystems on top of each other, because nobody has seriously tried to get rid of the "a disk is an array of individually rewritable sectors" model, despite it literally being untrue both for modern disks and in particular for flash-based storage.
Similarly, the "flat physical/flat virtual" MMU model is a relic from the days of the IBM 360 and the VAX-11/780, and it is utterly inefficient and unsuitable for what we do in userland these days.
As Robert has shown with CHERI, there is plenty of space to innovate without breaking existing code.
And yes, C can be object oriented; all you have to do is keep your hands off the primitives which are necessary to access hardware directly.
Architecturally, GPUs are a superoptimized distraction, like the vector units on Cray and Convex computers were 40-50 years ago, but those too can function in a non-flat address space.
But even C in OO-mode, and C++, Go, Rust, PHP and for that matter LISP and SmallTalk, would benefit from an HW/MMU architecture which focused on delivering fast object service, rather than flat address-spaces which software must then convert into objects.
But to innovate, we must first realize that we are currently stuck in a box, and dare to look outside it.
Interesting, but neither (esp. the latter) was known as a racehorse.
I'd prefer to keep the hardware simple and fast and push the complexity into the software, and prove stuff.
> would benefit from an HW/MMU architecture which focused on delivering fast object service, rather than flat address-spaces which software must then convert into objects.
That conversion may not be cheap (edit: badly phrased; the object-mapping process and hardware may not be cheaper (edit again: = faster) than the mapping done by the MMU for conventional memory) - can you explain how it would be done such that it would be cheaper in time than the current mapping on the common/hot/optimistic path, and how it would not be worse than it is now on the rare/cold/pessimistic path? And how it would behave on average, between those two extremes?
And why objects everywhere would be better all-round?
Probably if you haven't heard of CHERI and can't be bothered to Google it when one of the most respected systems architects around tells you it's worth looking at, you aren't going to put in the effort needed to read the CHERI tech reports so you can have an informed opinion on the performance cost of putting this kind of protection into hardware. And if the only historical "OO" processors you can think of are Rekursiv and iAPX432, and not the Dorado or the Symbolics line or the B5000/B5500/ClearPath line, it sounds like you have a lot of reading to do to get to having an informed opinion.
The Symbolics 3600, the B5000, and the Smalltalk microcode for the Dorado all had generic operation dispatching in hardware, though they varied in how flexible it was. The iAPX432 and the Rational R1000, as far as I can tell, didn't. Generic late-bound operation dispatching is the essential core of OO.
For many years AMD64 CPUs have had hardware acceleration for this sort of thing in the form of a "branch target buffer", so in this very important sense they're more OO than the iAPX432, though they don't have the hardware bounds-checking and dynamic type checking that all of the other architectures we're discussing did.
Of these, only the Smalltalk microcode for the Dorado came close to the level of hardware support for OO that something like SiliconSqueak has.
You're just repeating buzzwords rather than taking the time to understand what they refer to.
> The Symbolics 3600, the B5000, and the Smalltalk microcode for the Dorado all had generic operation dispatching in hardware
Regarding the Symbolics, that seems highly unlikely, as Lisp is not an object-oriented language unless the MOP is draped over it (which is multi-dispatch IIRC, and that's not going into hardware). Please provide some links to show I'm wrong.
> The iAPX432 and the Rational R1000, as far as I can tell, didn't.
"9.4.2 Procedure Call and Context Objects
To transfer control to a procedure, a program executes a
CALL instruction, causing the procedure to be invoked. On exe-
cution of a CALL instruction, the hardware constructs a new
context object. The context object is the procedure invocation
record and defines the dynamic addressing environment in
which the procedure executes. All addressing of objects and
scalars occurs through the context object, and the context is
the root of all objects reachable by the procedure."
Oh give me a break, this is just branch prediction and a little caching, it's nothing to do with OO/dispatch because there is no generic dispatch involved. It's just an optimisation for normal dispatch, nothing else. If you don't understand what a branch predictor actually does... ech
> Dorado
I'm not familiar with Dorado, can you provide a link showing this and preferably a bit more information as well actually stating this clearly.
> You're just repeating buzzwords rather than taking the time to understand what they refer to.
I do get tired of HN, I come to learn, I get dragged down by clammy seaweed posts like this, just claims, no concrete anything ("as far as I can tell"), from people who know even less than me ("Generic late-bound operation dispatching .... For many years AMD64 CPUs have had hardware acceleration for this sort of thing in the form of a 'branch target buffer' OMG just stop talking). Don't lecture me until you can deliver the goods, then and only then, lecture away because then I'll be listening.
> Regarding the symbolics, that seems highly unlikely as lisp is not an object oriented language
The Symbolics Genera operating system is largely written in object-oriented style using the Flavors OO extension. Since the early machines had a micro-programmable CPU, new operating system releases also came with new CPU extensions to support new Lisp, OOP or logic-language (Prolog) features.
Beyond that: Lisp originally used 'generic operations' in a non-OO sense. For example the + operation works for all the kinds of numbers (integer + integer, float + float, integer + float, integer + complex, ... and so on). The CPU determines at runtime which operation runs. Thus there is only one generic ADD instruction and not per-type instructions.
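In C terms (a toy sketch, nothing to do with the actual microcode), the idea is one operation that dispatches on runtime tags instead of per-type instructions:

    #include <stdio.h>

    enum tag { T_INT, T_FLOAT };

    struct value {
        enum tag tag;
        union { long i; double f; } u;
    };

    /* One "generic ADD" that works for every number type it knows about. */
    static struct value generic_add(struct value a, struct value b)
    {
        if (a.tag == T_INT && b.tag == T_INT)
            return (struct value){ T_INT, { .i = a.u.i + b.u.i } };
        double x = a.tag == T_INT ? (double)a.u.i : a.u.f;
        double y = b.tag == T_INT ? (double)b.u.i : b.u.f;
        return (struct value){ T_FLOAT, { .f = x + y } };
    }

    int main(void)
    {
        struct value r = generic_add((struct value){ T_INT,   { .i = 2   } },
                                     (struct value){ T_FLOAT, { .f = 0.5 } });
        printf("%g\n", r.u.f);   /* prints 2.5 */
        return 0;
    }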
"Dynamic addressing environment" in this context refers to the stack frame in which the procedure stores its local variables (and which may contain, for example, a link to the stack frame of an enclosing procedure, as in Pascal). Lots of things can be dynamic, which is to say, determined at run-time; method dispatch is just one of them. This is a good example of you repeating buzzwords without understanding what they refer to, although in this case the buzzword is also a technical term with a precise meaning.
Intel liked to use the term "object-oriented" to describe the iAPX432 because it was fashionable, but their idea of "objects" was more like CLU "clusters" as filtered through Ada, not the Smalltalk approach the term "object-oriented" was invented to describe.
You're also confusing CLOS and Flavors with CLOS's MOP.
> If you don't understand what a branch predictor actually does...
Possibly in five or ten years if you read this conversation again you will be in a better position to learn from it; right now you seem to be too full of ego to take advantage of the opportunity. Save a bookmark and maybe put a reminder in your calendar.
> Please provide some links to show I'm wrong.
Helping you stop being wrong is not really my responsibility :)
You're treating knowledge as a repulsive medicine that needs to be forced on you, not a precious treasure that merits seeking out. The problem with this is that if you only change your mind when it's profitable for someone else to talk you out of your mistakes, you'll just end up being exploited (and continuing to parrot half-understood nonsense in technical discussions). It isn't society's responsibility to give you the cognitive tools you need to realize your potential; it's yours.
How would you suggest rephrasing a complete, sweeping dismissal of someone's comments, on the basis that they evidently have no relevant knowledge, so that it doesn't come across as rude?
I took a bad situation and made it worse. I have reasons but it shouldn't have happened. I am not happy that I clearly and unnecessarily annoyed you, and would prefer that we put out this fire and move on with a better mood for both of us, and hopefully I can do better next time.
The iAPX 432 arguably wasn't bad due to its OOP orientation, but due to a bunch of issues in how Intel went about executing the idea - to the point that if they had mismanaged a "normal" CPU this way, they would have bungled it similarly.
That way innovation can be much faster, because applications can generally move quicker than kernels.
Btw, I'm not a fan of object orientation; and I don't think our hardware design should be infected by that fad. But I think your criticism of the badly fitting abstraction of flat address spaces is still valid. I am just not sure that 'fast object service' is necessarily the remedy.
I'm not a fan of any architectural radicalism, and tend to think that there are things best done in hardware, in the kernel, in libraries and in applications :-)
That is not to say that the boundaries should be cast in stone, they should obviously be flexible enough that you do not need a complete multi-user management system in a single-service jail or container nor a full-blown journaled COW storage-manager on a small embedded system.
In other words: I am firmly for the "Software Tools" paradigm.
> The defining tragedy of the operating systems community has been the definition of an operating system as software that both multiplexes and abstracts physical resources. The view that the OS should abstract the hardware is based on the assumption that it is possible both to define abstractions that are appropriate for all areas and to implement them to perform efficiently in all situations. We believe that the fallacy of this quixotic goal is self-evident, and that the operating system problems of the last two decades (poor performance, poor reliability, poor adaptability, and inflexibility) can be traced back to it. The solution we propose is simple: complete elimination of operating system abstractions by lowering the operating system interface to the hardware level.
Basically, they say to let libraries do the abstraction.
The source code of your applications will still mostly look the same as before. It's just that the libraries will do more of the work, and the kernel will do less.
Yes, and I don't (quite) buy that argument, but I understand where it comes from.
The problem starts when you, quite sensibly, implement something like SHA256 in hardware. It is a perfect example of something hardware does better than software.
But Dennis, Ken and Brian didn't think about cryptographic hash algorithms when they created UNIX, and because UNIX no longer has a recognized architectural authority, nobody provides a timely architecture for such new features; instead we end up with all sorts of hackery, some in kernels, some in libraries and some in applications.
SHA256 should be a standard library API, and if the CPU has a HW implementation, the platform's library should spot that and use it; no need to get the kernel involved, it's just a fancy XOR on purely userland data.
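A sketch of what that could look like purely in userland (the two SHA-256 routines are placeholders standing in for real implementations; the detection uses CPUID leaf 7, EBX bit 29, which advertises the SHA extensions):

    #include <stdint.h>
    #include <stddef.h>
    #include <cpuid.h>

    static void sha256_soft(const uint8_t *msg, size_t len, uint8_t out[32])
    {
        /* portable implementation would go here (placeholder) */
        (void)msg; (void)len; (void)out;
    }

    static void sha256_shani(const uint8_t *msg, size_t len, uint8_t out[32])
    {
        /* SHA-NI (or other HW) implementation would go here (placeholder) */
        (void)msg; (void)len; (void)out;
    }

    static int cpu_has_sha(void)
    {
        unsigned a, b, c, d;
        if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
            return 0;
        return (b >> 29) & 1;            /* CPUID.(7,0):EBX bit 29 = SHA */
    }

    void sha256(const uint8_t *msg, size_t len, uint8_t out[32])
    {
        static int have_hw = -1;
        if (have_hw < 0)
            have_hw = cpu_has_sha();
        (have_hw ? sha256_shani : sha256_soft)(msg, len, out);
    }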
But SHA256 being a good example does not mean that we should throw out the baby with the bath-water.
Things like file-systems are incredibly ill-suited for userland implementations.
What they don't say in the article is that they will need monolithic "libraries" for things like filesystems, and to implement things like locking and atomicity these libraries will have to coordinate among the processes which use the filesystem, and must do so without the control and power available to the kernel.
There are ways to do that, see for instance MACH or the original MINIX. It transpires there are disadvantages.
And that's what I mean by "architectural radicalism": try to use the right tool for the job; sometimes the kernel is the right tool (filesystems) and sometimes it is not (SHA256).
Which of the disadvantages of microkernel userland filesystems do you think are most important and essential to the concept, and which do you think are a matter of bad implementations? I thought L4 and QNX had pretty reasonable filesystem stories, and even on poor old Linux I've been using FUSE with things like NTFS for years without much trouble. Is it just a matter of the cost of context switching between userland processes when you don't have enough cores?
If it's a question of performance, with enough cores and shared memory that's accessible for atomic operations, I'd think talking to a userland filesystem would *just* be a matter of pushing requests onto a lock-free request queue in shared memory from your application process and reading the responses from a lock-free response queue. Of course each application needs its own shared-memory area for talking to the filesystem to get fault isolation.
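A minimal sketch of such a request queue (single-producer/single-consumer, C11 atomics; the struct layout and names are invented, and a real system would add wakeups, fault handling, and per-client rings):

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SLOTS 256                  /* power of two */

    struct fs_req { uint64_t op, ino, off, len; };

    struct fs_ring {
        _Atomic uint32_t head;              /* advanced by the application */
        _Atomic uint32_t tail;              /* advanced by the filesystem  */
        struct fs_req slot[RING_SLOTS];
    };

    /* Application side: push a request; returns -1 if the ring is full. */
    int fs_ring_push(struct fs_ring *r, const struct fs_req *req)
    {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head - tail == RING_SLOTS)
            return -1;
        r->slot[head % RING_SLOTS] = *req;
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return 0;
    }

    /* Filesystem side: pop a request; returns -1 if the ring is empty. */
    int fs_ring_pop(struct fs_ring *r, struct fs_req *out)
    {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail == head)
            return -1;
        *out = r->slot[tail % RING_SLOTS];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return 0;
    }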
Even if it's a matter of IPC message-passing cost on a single core, I think L4 has shown how to make that cheap enough that we should regard putting the filesystem in the kernel as a dubious optimization, and one that's second-best to granting the application mappings on an NVDIMM or something, at that.
Perhaps this is stating the obvious, but I don't think you can get much fault isolation with a pure library filesystem; if all the processes participating in the filesystem are faulty then there's no way to protect the filesystem from fatal corruption from faults. You might be able to reduce the presumed-correct filesystem core process to something like a simplified Kafka: a process that grants other processes read-only access to an append-only log and accepts properly delimited and identified blocks of data from them to append to it.
If we're interested in efficiency and simplicity of mechanism, though, a library filesystem is likely faster and might be simpler than a conventional monolithic filesystem server, particularly a single-threaded one, because you can rely on blocking I/O. And the library might be able to wrap updates to the persistent store in lock-free transactions to reduce the frequency of filesystem corruption.
The Xerox Alto famously used a single-tasking library filesystem similar to MS-DOS, but each sector was sufficiently self-describing that filesystem corruption was usually minor and easy to recover from. The filesystem directory could be reconstructed from the data blocks when required. Neither the Alto nor MS-DOS had to worry about locking, though!
KeyKOS, as you know, took a lot of the ideas from the CAP machine and similar capability machines (and languages like Smalltalk), and implemented them on IBM 370 hardware using its regular MMU, with L4-like lightweight IPCs through the kernel for capability invocations. It went to the opposite extreme from having a library filesystem: each directory and each file was a "domain" of its own, which is to say a single-threaded process. Persistence was handled by a systemwide copy-on-write snapshot of the whole system state, plus a journal-sync call their database used to provide durable transactions. EUMEL and L3 took similar approaches; L4 instead takes persistence and even virtual memory out of the kernel.
I wrote some somewhat sketchy notes on how Flash performance suggests rearchitecting things the other day at https://news.ycombinator.com/item?id=31902551; I know you have a very substantial amount of experience with this as a result of Varnish and your involvement with Fastly. What do you think?
First, I have not been actively involved in Fastly, apart from telling Artur to "go for it!" :-)
With respect to Flash technology I have noted elsewhere in this discussion that today our SSD devices effectively contain a filesystem in order to pretend they are disks, and that stacking two filesystems on top of each other is ... suboptimal.
But as I also just noted, flash isn't just flash; some properties are very hard to generalize, so I tend to think that we will have to let the people who decide what to solder onto the PCB provide at least the wear-levelling.
If I were to design an OS today, I think I would stick firmly with the good ol' UNIX name-hierarchy model, but I would probably split the filesystem layer horizontally in a common and uniform "naming layer" serviced by per-mount "object stores".
If you look at FreeBSD, you will see that UFS/FFS is sorta-split that way, but I would move the cut slightly and think in terms of other primitives which take SSD and networks better into account, but see also: Plan9.
The service I would want from a SSD device is simply:
A) Write object, tell me it's name when written.
B) Read object named $bla
C) Forget object named $bla
Then I'll build my filesystem on top of that.
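As a C API sketch, that interface might look like this (every name and type here is invented for illustration):

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    typedef struct { uint8_t id[32]; } obj_name_t;   /* name chosen by the device */

    /* A) Write an object; the device reports its name when the write is done. */
    int obj_write(int dev, const void *data, size_t len, obj_name_t *name_out);

    /* B) Read the object named *name into a caller-supplied buffer. */
    ssize_t obj_read(int dev, const obj_name_t *name, void *buf, size_t buflen);

    /* C) Forget the object named *name. */
    int obj_forget(int dev, const obj_name_t *name);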
(The NVMe crew seems to be moving in the right direction, but it is my impression that some patents prevent them from DTRT, just like Sun's "Prestoserve" patent held up development.)
Keep in mind that (conventional) micro-kernels are not the same as exokernels.
FUSE is fun, I've written my own filesystems with it, but it's basically a micro-kernel idea, not an exokernel one. (L4 is also great! But I don't think it qualifies as an exokernel?)
Exokernels never caught on, at least not under that name. The closest equivalent in widespread use today are actually hypervisors for running virtual machines. (Especially if you are running a so called 'unikernel' on them.)
About filesystems: if you just want the kinds of abstractions that conventional filesystems already give you, you won't get too much out of using an exokernel. (As you mention, perhaps you can get a bit of extra performance?) From the FAQ I linked above:
> Q: In what kind of applications is an exokernel operating system preferable?

There are naturally tradeoffs with the extra flexibility provided, e.g. it is easier to make a mistake in user code.

> A: An exokernel is most attractive to an application that needs to do something that is possible with an exokernel, but not possible with other kernels. The main area in which the 1995 exokernel paper increased flexibility was virtual memory. It turns out there are a bunch of neat techniques applications can use if they have low-level access to virtual memory mappings; the Appel and Li paper (citation [5]) discusses some of them. Examples include distributed shared memory and certain garbage collection tricks. Many operating systems in 1995 didn't give enough low-level access to virtual memory mappings to implement such techniques, but the exokernel did. The exokernel authors wrote a later paper (in SOSP 1997) that describes some examples in much more depth, including a web server that uses a customized file system layout to provide very high performance.
The HN submission we are nominally discussing here is also about memory, so that might be applicable.
An example for filesystems I could envision: direct low-level hardware access to an SSD's internals for a database. Databases don't really care about files, and might also want to deal with SSD's peculiar writing processes in a way that's different from the abstractions typical file systems give you.
> Perhaps this is stating the obvious, but I don't think you can get much fault isolation with a pure library filesystem; if all the processes participating in the filesystem are faulty then there's no way to protect the filesystem from fatal corruption from faults. You might be able to reduce the presumed-correct filesystem core process to something like a simplified Kafka: a process that grants other processes read-only access to an append-only log and accepts properly delimited and identified blocks of data from them to append to it.
That might be possible, but wouldn't really be faster than letting a kernel handle it, I'd guess? (But it would perhaps be more flexible to develop, since it's all userland.) You can also take inspiration from how eBPF allows you to upload user level logic into the Linux kernel and run them securely. Instead of uploading them into the kernel, you could also upload them into your filesystem service, I guess?
Some of the original exokernel papers had some more interesting ideas sketched out.
> I know you have a very substantial amount of experience with this as a result of Varnish and your involvement with Fastly. What do you think?
I'm afraid you are mixing me up with someone else?
I agree that L4 is not an exokernel, though it does go a little further in the exokernel direction than conventional microkernels. I agree that FUSE is microkernelish rather than exokernelish, though there's nothing in the exokernel concept as I understand it that excludes the possibility of having servers for things like some or all of your filesystem functionality.
Databases are indeed an application that commonly suffers from having to run on top of a filesystem.
> That might be possible, but wouldn't really be faster than letting a kernel handle it, I'd guess?
I think reading files by invoking library calls that follow pointers around a memory-mapped filesystem might well be faster than reading files by repeatedly context-switching back and forth into even a supervisor-mode kernel, much less IPC rendezvous via a kernel with a filesystem server. This is particularly true in the brave new SSD world where context switch time is comparable to block device I/O latency, rather than being orders of magnitude smaller.
Writes to Kafka are very cheap and support extreme fan-in because the Kafka design pushes almost all the work out to the clients; the Kafka server does very little more than appending chunks of bytes, containing potentially many separate operations, to a log. It seems very plausible to me that this could be faster than handling a series of individual filesystem operations (whether in a kernel or in a microkernel-style server), at least for some applications; particularly with orders of magnitude lower penalties for nonlocality of reference than for traditional filesystems, and for applications where many writes are never read.
Running logic in the kernel or in a server using a restrictive interpreter is indeed an interesting architectural possibility, but from a certain point of view it's the opposite extreme from the Kafka approach.
> > I know you have a very substantial amount of experience with this as a result of Varnish and your involvement with Fastly. What do you think?
> I'm afraid you are mixing me up with someone else?
I hope this isn't rude, but I wrote that in response to phk's comment, so I was addressing him in it, not you, eru, although I did enjoy your comment very much as well.
> Running logic in the kernel or in a server using a restrictive interpreter is indeed an interesting architectural possibility, but from a certain point of view it's the opposite extreme from the Kafka approach.
In general, a restricted language. You interpret or compile that language, and still have similar security guarantees.
> I hope this isn't rude, but I wrote that in response to phk's comment, so I was addressing him in it, not you, eru, although I did enjoy your comment very much as well.
Oh, that's fine. I was just confused because that came in a reply to my comment.
There has been a slow trend toward hardware virtualization and moving drivers to userspace. The issue is that the required hardware support (SR-IOV, for example) is often locked behind premium SKUs and is trickling into consumer products very slowly. As such, OSs will be very slow to embrace it fully.
I am very sympathetic to this argument overall and trace the hardware industry's failure back to the spread of C, UNIX, and worse-is-better.
With Wasm I see an opportunity to virtualize away the old world of C and linear address spaces. While we designed it to be low level and sandboxed to get C/C++ and Rust on board, I and others have always had in mind a future world where Wasm has managed (GC'd) data, first-class typed functions, and more. Those features should support a wide variety of source languages.
Wasm should become the new, final ISA. It should become like Unicode; the final, most widely-supported[1] format for executable code. When all languages and programs can run easily on Wasm, then hardware can easily be swapped out.
[1] Sure, Unicode has flaws, and it doesn't support everything equally well. But it had the property that basically everyone dropped everything else in favor of it, because it gained all the momentum.
A lot of this is due to the hardware architecture itself. The software abstractions dictated/limited by the HW itself cause many of the risks!
If you designed BOTH HW and SW up to and including the OS, you _might_ have a chance to control the risks better. But by the very separation of duties and roles, papered over by abstraction itself, you create problems. ALL abstractions throw away information and eventually those abstractions bite you in the ass.
This was the case with digital logic once HW speeds rose to a critical level - suddenly the reality that digital is merely an abstraction upon analog and the very abstraction of lumped-model analog started failing which caused digital fail as well.
We definitely can have, and have had, the same failure occurring with the von Neumann architecture - there's NOTHING magical about it that immunizes it against model-abstraction failure, and it can create "intrinsic failures" that can never be fixed, thanks to Gödel's incompleteness theorem.
It’s a big thread (by virtue of being a cool piece), so maybe someone said this already, but isn’t there kind of a middle ground where we let existing software continue to run but acknowledge a changing hardware landscape by really cranking up (default) page sizes?
Userland allocators already work pretty hard to hit in the TLB [1], but huge-page tuning and whatnot is, to your point, only hitting the sweet spot on modern gear via effort/luck.
Channeling my inner Linus: defaulting to larger pages means a drastic loss in memory efficiency when mmapping lots of small files, as happens when someone compiles the Linux kernel. If you've got a 2 kB file and 4 kB pages, then half the memory you allocate when the file is paged in is wasted. For larger pages that goes way up (with 2 MB pages, that same 2 kB file wastes over 99.9% of the page).
Absolutely, but you also want a different scheduler for a low-latency server than you do for your desktop.
One size almost never fits all, as I’m sure you’ll agree as someone who cares about compiling the kernel.
With that said, the kernel is pretty good at reclaiming physical pages, so you’d most likely eat into the same disk cache you’re reading from in the scenario you’ve described.
Not to be contradictory, but I'm having a hell of a time getting a toolchain put together that can reproducibly target Linux on x86_64 without linking glibc or libstdc++ and easily accommodate open-source libraries, and it's not for want of knowing how the individual pieces work.
If you’re promoting computer architecture and OS research, have at it, needs doing.
But that’s a different game to running software on the architectures, operating systems, and tool chains we have today.
A flash adaptation layer solves the following problem: I have M filesystems, that I'd like to use on any one of N different flash technologies. I don't want to complicate each M filesystem with support for N flashes.
I don't think both layers are "filesystem" in the same sense. We don't need the lower filesystem to provide permissions, ownerships, time stamps, directories, symbolic links and such.
Re: linear
A machine address is a word of bits. A word of bits always has a linear interpretation as a binary number. For instance if we have a 16 bit segment ID, and a 32 bit offset, then we can still pretend that it's a 48 bit linear address. We can compare two pointers for inequality, for instance: p0 < p1, as 48 bit words. That space may be sparsely populated, but that doesn't make it nonlinear; the spaces you are calling linear can also be sparsely populated and have numeric conventions about regions.
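The point in code form (a trivial sketch):

    #include <stdint.h>

    /* A 16-bit segment ID and a 32-bit offset still compare as one 48-bit
       binary number, which is the sense of "linear" argued above. */
    static inline uint64_t as_linear48(uint16_t seg, uint32_t off)
    {
        return ((uint64_t)seg << 32) | off;   /* compare with <, subtract, etc. */
    }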
You say physical memories are linear, but they are also sparsely populated in the same way: such and such a range is a ROM, such and such a range is certain memory mapped registers, DRAM is over there. Generally speaking, hardware treats some part of an address as an ID that it recognizes, and then the bits below that as an offset. When there is a bus read or write, if the ID part matches that hardware device, then it selects itself for that I/O operation, and uses the offset bits to provide access to the right thing. So physical memory is arguably nonlinear; it is like a subclassed IP address space.
Physical addresses can have bits which are not used for addressing; e.g. a certain range of memory might be available as uncached if you enable a certain upper bit in the address. That looks like linear, but with aliasing between distant regions. Linear virtual memory is sparsely populated; there are mapped pages and unmapped pages. Pages can alias: the same object can be mapped in multiple places, so a write here can be read there.
If you want to split an address into an object ID and offset, you have to gamble about how many bits you need for each one. One application has hundreds of millions of tiny objects: it wants a big object ID part of the address, and a small offset. Another one has a small number of huge objects: it doesn't care for large object ID, but wants big offsets. Either you make that configurable (slow gets even slower), or else waste bits on making both spaces larger at the same time, perhaps ending up with wasteful 128 bit pointers on a system where 64 is more than enough.
All interpretations of the bits of an address above and beyond "pure binary number" add complication and overhead.
The hardware (e.g. DMA bus master) isn't going to understand objects; it will always want a simple address.
Re: C, C++, Go, Rust, PHP, Lisp, Smalltalk
No two implementations of these can agree with each other on what exactly is an object. Implementations will just carry their existing respective portable object representations into the non-linear model. Architectures without flat memory, like the JVM and WebAssembly, tend to only cause pain for Lisp and its ilk.
A Lisp is not going to want some weird object model from the operating system; it will want to do things like packing cons cells tightly together into one larger heap object. That heap object could be a segment. We had those; they are also from the 1960's. Operating systems started ignoring them, opting for just the demand paging support the VM hardware.
> A flash adaptation layer solves the following problem: I have M filesystems, that I'd like to use on any one of N different flash technologies. I don't want to complicate each M filesystem with support for N flashes.
I think you could argue that certain file system designs are better suited to different storage hardware. Maybe it's appropriate that the kernel runs different code depending on the underlying storage type?
We already have JEDEC standards for flash devices to report block layout, because it's important in microcontrollers where there is no adaptation layer. We could have an SSD/M.2 standard that reported that information, and then architecturally kernel FS stuff would probably split into a couple of layers: the 'top' that provides the filesystem features you're used to in something like ZFS, and the bottom storage layer that has a couple of different implementations for 'linear' and 'blocks'.
No, that is actually not what the Flash Adaptation Layer does.
The FAL's primary job is to make the flash array look like a disk, despite the fact that individual "sectors" are not individually rewritable.
To do this, the FAL implements what is for all practical purposes a filesystem, where the files are all a single sector long and where the filename is the sector number exposed by the "disk".
In other words: Now we have two filesystems on top of each other, one lying to the other, which does a lot of unnecessary work, because it is being lied to.
s/for all practical purposes/for the sake of rhetorical convenience in my argument/
> …the FAL implements what is for all practical purposes a filesystem, where the files are all a single sector long and where the filename is the sector number exposed by the "disk".
1. That is not a "file system" comparable to the thing above which you're also calling "file system", which means you're essentially equivocating on the term.
2. Any old magnetic hard drive exposes this same file system: it makes equal-sized sectors available under names that are indices. There is no lookup structure that is not a filesystem under your definition.
On a magnetic disk, the exposed "name" is the physical address; on an SSD device, the data is stored ${somewhere}, and the FAL has a (pretty interesting!) data structure to look up where that is when you demand a certain "name".
When you write a certain "name", the magnetic disk just overwrites what is already there, at the assigned spot.
When you write a certain "name" on a SSD, a new spot gets allocated (This may require reshuffling akin to garbage collect), and the data structure is updated for the new locations ("=$name") and old location ("unused"), and if making the old location unused means that an entire eraseblock becomes free, then erasing that is scheduled, after which it is added to the free pool.
But that is only the easy part. The hard part is "wear levelling", which essentially amounts to erasing all the blocks approximately the same amount of times, because that is what wears out the gate material in the individual cells.
Wear levelling involves taking data which the host has not asked for, copying to a different location, in order to empty out the erase-block with the least wear (= erase cycles) so that it can shoulder more of the load.
And now comes the truly hideous part: the FAL has to do this in a way where its own metadata is part of its own wear-levelling, which is why most competent FALs have a basic structure much like a classic log-structured filesystem.
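A heavily reduced sketch of the mapping side of that, just to make the "it is a filesystem" claim concrete (names, sizes and flash_program_page() are all invented; wear-levelling, garbage collection and the metadata problem are left out, and l2p[] is assumed initialised to NO_PAGE):

    #include <stdint.h>

    #define LOG_SECTORS  (1u << 16)      /* sectors the "disk" pretends to have */
    #define PHYS_PAGES   (1u << 17)      /* flash pages actually available      */
    #define NO_PAGE      0xffffffffu

    extern void flash_program_page(uint32_t page, const void *data);  /* hypothetical driver */

    struct fal {
        uint32_t l2p[LOG_SECTORS];       /* logical sector -> physical page */
        uint8_t  state[PHYS_PAGES];      /* 0 = erased, 1 = live, 2 = stale */
        uint32_t next_free;              /* naive free-page cursor          */
    };

    /* "Rewrite" of a logical sector: never in place. */
    int fal_write(struct fal *f, uint32_t lsec, const void *data)
    {
        uint32_t old = f->l2p[lsec];
        if (f->next_free >= PHYS_PAGES)
            return -1;                   /* a real FAL would garbage-collect here */
        uint32_t fresh = f->next_free++;
        flash_program_page(fresh, data);
        f->l2p[lsec] = fresh;
        f->state[fresh] = 1;
        if (old != NO_PAGE)
            f->state[old] = 2;           /* stale: erase (and wear-level) later */
        return 0;
    }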
So yes: We do have two filesystems on top of each other, and exposing a more suited model than "array of equal-sized rewritable sectors" could reduce that.
Ignoring the fact that just about all the HDs made since the 1980s, when IDE became popular, can relocate sectors, and that since the 2000s they didn't even necessarily expose the physical sector sizes, among all kinds of other abstractions.
So the idea that we shouldn't have a disk abstraction, to allow actual filesystems to focus on what matters to the user, is sorta nonsense. You probably have this idea that all flash disks are the same, and I'm here to tell you they are not, just like not all computers are 8-core big.LITTLE machines. Disks scale from tiny eMMC controllers to large shared arrays doing deduplication, replication, etc., and the media can be individual flash chips, massively parallel flash, spinning disks, arrays of spinning disks, mixes of flash and spinning disk, etc.
There have been a half dozen raw flash layers in Linux over the past ~15 years, and generally they all suck, make assumptions about the flash that don't hold for more than a couple of generations, end up slower than off-the-shelf flash with FTLs (aka what you're calling FALs), and have frequently failed to allow the kinds of filesystem layering one expects of linux/grub/etc. (and then there are the ones running at various hyperscalers that consume more RAM than most people have in their PCs).
Bad block remapping is just a trivial table lookup, the first disk drives did it with 8-bit microcontrollers.
In my experience a good FAL is about as hard to write as a good filesystem.
While you can parameterize a lot of things, there are some fundamental properties of the flash cells which make it very hard to write a single FAL which works well with all flash chips.
As a matter of principle I do not comment on issues specific to Linux.
So just an inch to the left goalpost of the True Scotsman's "filesystem" definition.
> make it very hard to write a single FAL which works well with all flash chips
Right? So the last thing you want is to foist that logic into filesystems. The layered separation is good.
Hence, someone upthread wrote: A flash adaptation layer solves the following problem: I have M filesystems, that I'd like to use on any one of N different flash technologies. I don't want to complicate each M filesystem with support for N flashes.
You seem to be more interested in proving me wrong than in actually understanding what I am saying:
Today SSDs expose a data model which makes them look like disks.
To implement that data model on a storage substrate which has radically different semantics, they have to implement what is essentially a (log-structured) filesystem.
(I happen to know this first hand, because I have worked on both filesystems and FALs.)
And that is why I say we have two filesystems stacked on each other.
Your limited understanding of filesystems does not change reality.
We have pretty comprehensive user-side documentation of the R1000, but very, very little from the system/vendor side, so lots of things we simply do not know yet.
If you are into data-archaeology and lack for challenges, we have several good outstanding questions for research. For instance the layout of the object-store/filesystem.
If you want to see an R1000 running, and experience the world's first truly semantic IDE, come to Datamuseum.dk just outside Copenhagen, because we have the world's only (approximately 1.83) running R1000 computers.
(We also just started fundraising for a new permanent building for our huge collection; we are at €113K of the €3M goal. See the top right corner of the homepage, or email.)
We know of only four surviving computers, we have two, one is privately owned in NZ and IBM donated one to CHM. The rest have been shredded because of classified mil-spec workload.
If you are local to/affiliated with CHM, and are interested/allowed, we would love to know more about their machine, and if possible, assist to get that running too.
The Mill's memory model is one of its most interesting features IMO [1] and solves some of the same problems, but by going the other way.
On the Mill the whole processor bank uses a global virtual address space. TLB and mapping to physical memory happen at the memory controller. Everything above the memory controller is in the same virtual address space, including the L1-L3+ caches.

This solves a lot of problems. If you go out to main memory you're already paying ~300 cycles of latency, so a large silicon area / data structure for translation is no longer a 1-cycle latency problem. Writes to main memory are flushed down the same memory hierarchy that reads come from, and succeed as soon as they hit L1. Since all cache lines are in the same virtual address space, you don't have to track and synchronize reads and writes across translation zones within the cache hierarchy. When you request an unallocated page you get the whole pre-zeroed page back instantly, since it doesn't need to be mapped to physical pages until writes are flushed out of L3. This means it's possible for a page to be allocated, written to, read, and deallocated without ever touching physical memory, with the whole workload served purely within the cache hierarchy.
Protection is a separate system ("PLB") and can be much smaller and more streamlined since it's not trying to do two jobs at once. The PLB allows processes to give fine-grained temporary access of a portion of its memory to another process; RW, Ro, Wo, byte-addressed ranges, for one call or longer etc. Processes get allocated available address space on start, they can't just assume they own the whole address space or start at some specific address (you should be using ASLR anyways so this should have no effect on well-formed programs, though there is a legacy fallback).
The Mill model is kind of cool, but today, many peripherals (including GPUs and NICs) have the ability to dump bytes straight into L3 cache. This improves latency in a lot of tasks, including the server-side ones that the Mill is designed for. This is possible due to the fact that MMUs are above the L3 cache.
Honestly, I'm happy waiting for 4k pages to die and be replaced by huge pages. Page tables were added to the x86 architecture in 1985, when 1MB of memory was a ton of memory to have. Having 256 pages worth of memory in your computer was weird and exotic. Fast forward to today, and the average user has several GB of memory - mainstream computers can be expanded to over 128 GB today - and we still mainly use 4k pages. That is the problem here. If we could swap to 2M pages in most applications, we would be able to reduce page table sizes by a factor of 512, and they would still be a lot larger than page tables when virtual memory was invented. And we wouldn't waste much memory!
But no, 4k pages for backwards compatibility. 4k pages forever. And while we're at it, let's add features to Linux (like TCP zero copy) that rely on having 4k pages.
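For what it's worth, applications can already opt in on Linux today; something like this asks for hugetlb pages (typically 2 MB) and falls back to a transparent-huge-page hint (a sketch with minimal error handling):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 1ul << 30;                     /* 1 GiB of scratch space */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            /* no reserved huge pages: take 4 kB pages and hint THP instead */
            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }
            madvise(p, len, MADV_HUGEPAGE);
        }
        printf("mapped %zu bytes at %p\n", len, p);
        munmap(p, len);
        return 0;
    }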
Once on an embedded system, we found 30% of our processing time was spent handling TLB misses.
We also didn't have access to the code that set up the page tables. Somehow I got the linker to emit a function hook which I used to defrag the page tables after they were written but before the MMU was enabled.
I remember thinking, "This is not what I learned about in school."
I can probably guess the OS you were running, but let's just say some OSs put a lot more effort into assuring that virtually contiguous mappings tend to also be physically contiguous, because it then becomes possible to map larger blocks of RAM contiguously.
AKA, both x86 and ARM (as do various other processors) have intermediate bits in their page directory structures that flag "instead of using another level of PTE/etc. below this level, assume we are just pointing at an equal-sized physical range", which on x86-64 can be 2M or 1G. Given that the latest Intel x86s have a multilevel TLB, with 8 1G L1 entries and 1K of L2 TLB entries, it should be fairly straightforward to fix with appropriate huge-page tweaks unless your data structure exceeds a TB. And even if it does, keeping everything in 1G pages means that the data caches should be able to keep the top-level global page directory/etc. cached, avoiding main-RAM hits and resulting in fairly quick TLB refills.
This article compares CHERI to an 80's computer, the Rational R1000 (which I'm glad to know of). It's worth noting that CHERI's main idea was explored in the 70's by the CAP computer[1]. CAP and CHERI are both projects of the University of Cambridge's Computer Lab. It's fairly clear that CAP inspired CHERI.
The original machines like that were the Burroughs 5000 (1961), and the Burroughs 5500 (1964), which was quite successful. Memory was allocated by the OS in variable length chunks. Addresses were not plain numbers; they were more like Unix paths, as in /program/function/variable/arrayindex.
That model works, but is not compatible with C and UNIX.
> That model works, but is not compatible with C and UNIX.
The Unisys Clearpath line of systems have ISO C compliant compilers, and MCP provides a fairly complete POSIX environment. I've never actually programmed them, but I have the manuals. dlopen is conspicuously missing from the list of POSIX routines. I don't know how close the ISA is to 5500, but at least for Unisys Clearpath the only real incongruity in the environment is that unsigned arithmetic requires compiler-generated instrumentation as integers are sign-magnitude, which doesn't naturally provide the unsigned overflow semantics required by the standard.
ISO C has always tried to steer away from integer/pointer conversions for precisely this reason--pointers are special and might be opaque to the runtime environment. C99 adopted intptr_t, but it was optional and had restricted semantics. There's still no way in standard C to convert a function pointer to an object pointer or to intptr_t. But even in a full POSIX system this limitation only effects dlopen.
Reading the CHERI papers, porting all of FreeBSD to CHERI turned out to be surprisingly easy. dlopen and signals were two of the biggest headaches, IIRC. Very little application code had to change. Playing games with pointers just isn't particularly common in run-of-the-mill Unix applications, or even in open source generally, where writing standards-conformant code has always been valued. (Portability is even more valued, but usually the best way to achieve the latter is through the former.) It's more common in proprietary applications, like games, HPC, embedded, etc.
> You lose the advantages of the tagged hardware when you declare a giant array of raw memory to be managed by C.
Sure, but that's not how it works. What does happen is that real-world software written in C doesn't use all those nasty tricks all that often anymore, or isn't that hard to fix. There are a few tricks that _do_ get used, and CHERI provides accommodations for those. Take a look at https://papers.freebsd.org/2019/bsdcan/davis-cheriabi/.
The architecture of the B5000 / B5500 / B6500 lives on today in the Unisys ClearPath line. I believe the OS, MCP, is one of the longest-maintained software operating systems still in active use, too.
One of the original brains behind the R1000 has mentioned that the B5500 was one of the inspirations, but my personal impression is that he was talking more about the stack-based (vs. register-based) instruction set, than the memory management.
True, the CHERI paper[1] cites the Burroughs as well. Amusingly, it cites an article similar to the OP: "The architecture of the Burroughs B5000: 20 years later and still ahead of the times?"[2] - written in 1982...
Worth noting that CHERI _is_ definitely compatible with C and Unix; the default operating system for it is CheriBSD, a FreeBSD fork, and there is a growing number of purecap packages.
Burroughs machines have 2 sorts of pointers - effectively pointers to offsets on the stack (indexed by a full display so you can do full algol scope semantics, more than just C) as well as pointers to heap memory (base and bounds, with paging for large objects)
So recursive routines get access to their own local variables
I'm not quite sure what you mean - the Burroughs machines have full sets of display registers (hardware managed pointers to each of the stack frames that were currently in scope - Algol/etc had more complex scoping that modern C-style languages) - a stack pointer contained a lexical level (an index into the display registers) and an offset into that stack frame - there's a variant that's 'stuffed' (essentially a reference to a variable in a stack frame that's out of scope - based on the base of stack)
I think it's important to understand that it's not a C machine, doesn't have a linear address space in the normal sense
you create an in-scope reference within the function - and then 'stuff' it which converts it into an offset from the base of the stack - converts an "indirect reference word" into a "stuffed indirect reference word"
it's what you use to pass by reference ..... but Algol supports pass-by-name semantics (of which pass-by-reference is an optimisable subset) - to do full pass-by-name (for a simple example, a[i] where i is a global) the compiler will create a thunk (a tiny subroutine, in scope of the place where the parameter was passed) and pass a pointer to it - the tag of '7' means a PCW - when it gets dereferenced as a pointer, the hardware does a subroutine call to the thunk in the correct scope
For a micro, see iAPX 432. It was considered over-complicated for its time but if CHERI catches on, perhaps far ahead of its time.
Also interesting to me was the idea in Darek Mihocka's blog NO EXECUTE! about using emulation to implement advanced processor features rather than forcing everything through a one size fits all hardware policy. Of course emulation will always have a performance hit, but the concept is interesting.
I'm a little confused about how the object base is looked up in these systems, and if they're sparse or dense and have any size or total object count limitations, and if that ends up having the same limitations on total count as page tables that required the current multi-level approach.
As surely you could consider page table as effectively implementing a fixed-size "object cache"? It is just a lookup for an offset into physical memory, after all, with the "object ID" just being the masked first part of the address? And if the objects are variable sized, is it possible to end up with physical address fragmentation as objects of different sizes are allocated and freed?
The claim of single-cycle lookups today would require an on-chip fixed-size (and small!) fast sram, as there's a pretty hard limit on the amount of memory you can get to read in a single clock cycle, no matter how fancy or simple the logic behind deciding to lookup. If we call this area the "TLB" haven't we got back to pagetables again?
And for the size of the SRAM holding the TLB/object-cache entries: increasing the amount of data stored in each entry means you have fewer entries in total too. A current x86_64 CPU supports 2^48 of physical address space, reduced to 36 bits if you know it's 4k aligned, and 2^57 of virtual address space as the tag, again reduced to 45 bits if we know it's 4k aligned. That means storing the tag and physical address needs a total of 81 bits of SRAM. A 64-bit object ID plus a 64-bit physical address plus a 64-bit size is 192 bits, over 2x that, so with page tables you could pack 2x the number of TLB entries into the same SRAM block. To match the capabilities of the example above, 57 bits of physical address (which cannot be reduced, as arbitrary sizes mean it's not aligned), plus a similarly reduced 48-bit object ID and size, still adds up to 153 bits, only slightly less than 2x, though I'm sure people could argue that reducing the capabilities here has merit; I don't know how many objects, or of what maximum size, such a system would need. And that's "worst case" 4k pages for the pagetable system too.
I can't see how this idea could be implemented without extreme limitations - look at the TLB size of modern processors and that's the maximum number of objects you could have while meeting the claims of speed and simplicity. There may be some advantage in making them flexible in terms of size, rather than fixed-size, but then you run into the same fragmentation issues, and need to keep that size somewhere in the extremely-tight TLB memory.
So I commented on this a bit elsewhere, but the whole object business is irrelevant to how the address translation hardware in this machine actually works. While the subfields of the address are exploited to optimize the hash function used, the hardware is otherwise agnostic to what the upper bits of the address mean. The TLB is just huge relative to the amount of memory the machine had, such that there's one entry for each physical page in the system, and it deals with collisions in the TLB by evicting pages to disk.
> As surely you could consider page table as effectively implementing a fixed-size "object cache"? It is just a lookup for an offset into physical memory, after all, with the "object ID" just being the masked first part of the address? And if the objects are variable sized, is it possible to end up with physical address fragmentation as objects of different sizes are allocated and freed?
Because that's only a base, not a limit. The right pointer arithmetic can spill over to any other object base's memory.
Maybe? If you just round up each "object" to 4k then you can implement this using the current PTE on x86_64, but this removes the (supposed) advantage of only requiring a single PTE for each object (or "object cache" lookup entry or whatever you want to call it) in the cases when an object spans multiple page-sizes of data.
Having arbitrarily sized objects would likely be possible in hardware - it's just an extra size field stored in the PTE, if you can mask out the object ID from the address (in the example in the original post, it's a whole 64-bit object ID, allowing a full 64 bits of offset within each object, but totaling a huge 128-bit effective address)
But arbitrary sizes feel like they push the issues that many current userspace allocators have to deal with today into the hardware/microcode - namely packing to cope with fragmentation and the like (only instead of virtual address space they'll have to deal with physical address space). The solutions to this today are certainly non-trivial and can still fail in many ways, so they're far from solved, let alone solved in a way simple enough to be implemented that close to the hardware.
> Why do we even have linear physical and virtual addresses in the first place, when pretty much everything today is object-oriented?
Well, GPU code is certainly not object-oriented, and I hope it never becomes that. SIMD code won't be able to jump between objects like typical CPU-oriented OOP does (unless all objects within a warp/workgroup jump to the same function pointers?)
GPU code is common in video games. DirectX needs to lay out its memory very specifically as you write out the triangles and other vertex/pixel data for the GPU to later process. This memory layout is then memcpy'd over PCIe using the linear address space mechanism, and GPUs are now coherent with this space (thanks to Shared Virtual Memory).
So today, thanks to shared virtual memory and advanced atomics, we can have atomic compare-and-swap coordinate CPU and GPU code operating over the same data (and copies of that data can be cached in CPU-ram or GPU-VRAM and transferred over automatically with PCIe memory barriers and whatnot).
----------
Similarly, shared linear address spaces operate over RDMA (remote direct memory access), a protocol that runs over InfiniBand or Ethernet (RoCE). This means that your linear memory space is mmap'd on your CPU, but actually asks for access to someone else's RAM over the network. The mmap then causes those "inefficient pointer traversals" to get turned into network packets that share RAM between CPUs.
Ultimately, when you start dealing with high-speed data-sharing between "external" compute units (ie: a GPU, or a ethernet-connected far-away CPU), rather than "just" a NUMA-node or other nearby CPU, the linear address space seems ideal.
--------
Even the most basic laptop, or even Cell Phone, these days, is a distributed system consisting of a CPU + GPU. Apple chips even have a DSP and a few other elements. Passing data between all of these things makes sense in a distributed linear address space (albeit really wonky with PCIe, mmaps, base address pointers and all sorts of complications... but they are figured out, and it does work every day)
I/O devices working directly in memory is going to only become more common. 100Gbps network connections exist in supercomputer labs, 10Gbps Ethernet is around the corner for consumers. NVMe drives are pushing I/O to such high bandwidths that'd make DDR2 RAM blush. GPUs are growing more complicated and are rumored to start turning into distributed chiplets soon. USB3.0 and beyond are high-speed links that directly drop off data into linear address spaces (or so I've been told). Etc. etc.
Thing is that I think TFA author isn't speaking of "object-oriented" in the typical sense. I believe it's more about how the VM paging system and cache and etc. is really working with abstract segment "objects" that are blobs of memory but presented to the process as a single linear address space. I don't think the author means OO as in "Smalltalk" or "Java" [methods + objects + inheritance] but more "object based" in a more generic sense of "there's blobs of stuff with abstract handles pointing to them" that then gets painted to the process as if it's a single linear sequence of bytes when in the actual machine... it's not.
Since "pretty much everything today is object-oriented" is such an important point in the article, I feel the author does his readers and himself a disservice by not defining what he means by object-oriented, especially if he uses it in a different sense than everybody else.
Maybe. But in the OS world "object" does have an established meaning.
I think the broader point is that you can build an "object-oriented" system in the sense that you're expecting overtop of the "object-based" systems he is describing and it will give you much of the base level semantics of an OO system (object allocation, tagging, encapsulation) without enforcing any particular OO language semantics (inheritance, typing, methods, etc.) and while also allowing for non-OO (blob of memory) semantics if you need it.
Think of each C allocation via malloc/free as an object. It's just in his system the pointers that point to those allocations are not describable as a series of integer offsets into a big linear memory, but really are handles or unique IDs which map directly into the VM subsystem of the OS and to some action happening in the MMU.
In essence most object-oriented language VMs/runtimes are doing this themselves, essentially building abstract handles (object references, etc.) overtop of the "linear memory" abstraction that the OS provides. If I understand his gist he's really just talking about cutting out the middleman.
I suspect, though, that there's not really a lot of efficiency gains anyways. This path has been so heavily optimized (in both hardware and software) over the decades that it likely matters little.
The security argument maybe is more compelling. There's a good argument for never being able to turn an integer into a pointer, or a pointer into an integer and back again. Except this would break probably the majority of C programs out there.
Maybe the middle road is for the OS to present a sandboxed "legacy" or "emulation" environment for programs that do pointer arithmetic and to provide enhanced compilers that flag these kinds of things and encourage/offer alternatives.
Given the prevalence of virtualization tech now, there's probably more room for experimentation in this type of thing these days without breaking compat...
“An obstack is a pool of memory containing a stack of objects. You can create any number of separate obstacks, and then allocate objects in specified obstacks. Within each obstack, the last object allocated must always be the first one freed, but distinct obstacks are independent of each other.
Aside from this one constraint of order of freeing, obstacks are totally general: an obstack can contain any number of objects of any size. They are implemented with macros, so allocation is usually very fast as long as the objects are usually small. And the only space overhead per object is the padding needed to start each object on a suitable boundary.”
I love obstacks, after a lead at a job I was at a decade ago pointed them out to me I always look for a chance to use them, but haven't really had much of a chance.
There isn't a lot of info about them though. Go hunting for performance metrics on them, or discussion of them generally and you won't find much. The glibc pages themselves don't give much justification why you would use them vs malloc, or an arena allocator, etc.
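For anyone who hasn't tried them, here's a minimal sketch of how obstacks are used (assuming glibc's <obstack.h>; wiring the chunk macros to malloc/free is the documented setup, the rest of the names here are arbitrary):

    #include <obstack.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* obstack gets its underlying chunks from these two macros. */
    #define obstack_chunk_alloc malloc
    #define obstack_chunk_free  free

    int main(void)
    {
        struct obstack pool;
        obstack_init(&pool);

        /* Allocate a couple of small objects from the pool. */
        char *name = obstack_copy0(&pool, "hello", 5);        /* copies and NUL-terminates */
        int  *nums = obstack_alloc(&pool, 16 * sizeof *nums);
        printf("%s %p\n", name, (void *)nums);

        /* Freeing 'name' also frees everything allocated after it ('nums'). */
        obstack_free(&pool, name);

        /* Passing NULL releases the whole obstack. */
        obstack_free(&pool, NULL);
        return 0;
    }

The stack-order constraint is exactly what makes the bookkeeping so cheap: freeing is mostly just resetting a pointer within the current chunk.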
GPU stuff has been "object oriented" for a long time. The ability to just arbitrarily address data as bytes from shaders is relatively new and modern and still constrained.
If you want to talk to a GPU via vulkan or d3d12 you're going to navigate a maze of descriptor sets, pipeline objects, buffers of various types, samplers, etc. These are all Objects with distinct types and data and the GPU hardware will interact with them in specific ways, even if it's also a general purpose computing device that will run some shader code you throw at it. When I was writing bare-metal graphics code for the PS4 there was still a lot of Object Orientation there too even if there weren't vtables or a garbage collector involved.
Seeing RAM as a collection of linear addresses on GPUs (especially as shared virtual memory, pinned memory, or other such GPU/CPU virtual memory sharing features) is a feature from 2010-era, be it from OpenCL, DirectCompute, or CUDA.
DirectCompute just sees the data as a collection of bytes.
> Well, GPU code is certainly not object-oriented, and I hope it never becomes that.
Jonathan Blow has been experimenting a lot with arrays of structs versus structs of arrays, particularly in making them both first class instead of assuming one and ignoring the other. This is essentially the column- versus row-oriented data question answered with a question: why not both?
I think if you had a first class language that supported both, that you wouldn't mind so much if GPU code became object oriented.
I haven’t heard much about JAI for a few years, is this idea catching on?
I don't know about "why not both" - is it maybe because one of the two layouts will almost always have worse performance for any given access pattern? I can see uses for having both for the non-critical-path parts, though maybe once you're indexing it in the awkward but perf-friendly way, you've already done the hard work and there's little point to providing the other style of indexing? (Isn't the best perf always found using whatever layout & indexing is the most difficult... there must be some eponymous law for this, right?) I guess it is possible to have multiple critical paths each favoring a different layout. Not sure how often I've ever seen that in practice; I would guess it's probably rather rare.
As far as I can tell, the 'structure of arrays' design pattern appeared in the game development world some time around 1998-2005. Columnar layouts have been known for upwards of 50 years now, as has the general notion that the way you lay out and represent your data matters and can be adjusted at will for your use cases. Many different layouts and optimisation structures are known and have been used, esp. in databases.
Blow isn't inventing structs of arrays. It's, as you say, a common gaming idiom and one he's used before. What he was doing is trying to make it a first-class citizen. Very few languages do anything for these. None of the languages I use have iterators that work for this. You have to write your own bookkeeping logic every time.
The memory layout can be better this way depending on how the fields of a struct are laid out. It also gives the data better locality.
Let's say you're iterating over position coordinates of some set of objects in a game. With an array-of-structs you can lose out on cache locality if unrelated fields are nearby. If you use a struct-of-arrays you get cache locality advantages anytime you do something like "iterate over this 1 field for each object".
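A minimal C sketch of the two layouts being discussed (the field names are made up for illustration):

    #include <stddef.h>

    /* Array of structs: one entity's fields sit next to each other, so a
       pass that only needs x/y still drags health/sprite_id through cache. */
    struct entity_aos {
        float x, y;
        float health;
        int   sprite_id;
    };

    /* Struct of arrays: each field is contiguous across all entities, so a
       position-only pass streams through memory with a small, regular stride. */
    struct entities_soa {
        float *x, *y;
        float *health;
        int   *sprite_id;
        size_t count;
    };

    void move_aos(struct entity_aos *e, size_t n, float dx, float dy)
    {
        for (size_t i = 0; i < n; i++) {   /* stride = sizeof(struct entity_aos) */
            e[i].x += dx;
            e[i].y += dy;
        }
    }

    void move_soa(struct entities_soa *e, float dx, float dy)
    {
        for (size_t i = 0; i < e->count; i++) {   /* stride = sizeof(float) */
            e->x[i] += dx;
            e->y[i] += dy;
        }
    }

The "first class" point is that today you hand-write and maintain both shapes (and the code to move data between them) yourself.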
The most obvious one to me is what if you change your mind?
What if 90% of these objects should be compactly slab allocated, but you also run small, very disjointed numbers of them through an editor or an undo list or whatever and so they need to live separate from this array?
I don't know what the answer is, but I feel like a few good candidates will come out of the conversations started by Rust. You can't replace an object with a new object while it's globally readable. But if you own the parent object, you can rewrite it however you want, or whichever ways are compatible with the other invariants you want for the system, like max time per frame or response time.
Those change the type annotations for a memory location. Also if you tell me there are no concurrency gotchas for this then I’ll tell you that you have a single threaded interpreter or you’re lying.
We are talking about retaining the type and swapping the pointers from an independent object to a set of offsets into a struct of arrays.
> if you tell me there are no concurrency gotchas for this then I’ll tell you that you have a single threaded interpreter or you’re lying
You are making a big assumption. Neither Common Lisp nor Smalltalk specifies the semantics of concurrent programs, but it is certainly possible to implement such a feature without, as you say, concurrency gotchas. Perhaps a better example is concurrent compacting GCs for Java.
In this case GPUs are actually the perfect argument in favor of the article. GPUs only speak in objects, never linear addresses. You allocate a type of object and that allocation just now is that type. eg, a texture is always a texture, never a vertex buffer. Which is a different object altogether. And you never work with addresses. You can point objects at other objects. You can point at an offset from an object. But you can never do arbitrary pointer manipulation, and you never have any idea if the address space is linear or not.
By making pointers treated the same on CPU or GPU (by allowing their 64-address spaces to be identical by sharing the same memory regions + using PCIe to keep those memory regions in sync), you can perform high-speed CPU / GPU communications of even linked data structures (linked lists, trees, graphs).
GPUs utilize many linked data-structures. Oct-trees accelerate bounds testing, BVH trees help raytracing. Etc. etc. The GPU addresses must be linear because they're synchronized to the CPU's memory space.
I think he's advocating fitting a high level language like Ada in the kernel or in the CPU(?), with one global addressing space and no separate individual process addressing space for memory protection, but rather relying on the high language to provide memory protection. That's where his bizarre hyping of "object oriented" addressing space came from.
It has been done. See Singularity project from Microsoft Research, which used C# as the language to provide memory protection, no process, all programs running in the same global memory space, and all of them running in ring-0. It was a fun research project, but never really made it out. There were other research projects like it.
Also his (object, offset) addressing space is essentially a segmented memory model. The object id is the segment id. I bet the object id is in a linear space.
It's been done way earlier than MSFT's SingularityOS: see e.g. Hansen's book "The Architecture of Concurrent Programs" (from 1977) where he writes an OS in Pascal with exactly this approach to process separation.
Heck, I'd argue everyone who writes their own toy OS should probably start with this approach and then retrofit memory protection later, because it simplifies things tremendously and you get something actually usable much earlier (context switching between processes is not conceptually hard, but it is fiddly, especially on x86).
Memory protection is nice to have because then you're likely to get crashes when you screw up pointer stuff. And crashes are easier to debug than writes to the wrong place.
Of course, you can do identity memory mapping and skip virtual vs physical questions (for x86, memory protection requires using virtual memory, because the protection information is in the page tables, and iiuc, amd64 also requires using virtual memory, but it's still a useful simplification to do identity mapping so the virtual address is the physical address). My toy OS only runs one process, so there was never going to be more than one memory map anyway.
> It has been done. See Singularity project from Microsoft Research, which used C# as the language to provide memory protection, no process, all programs running in the same global memory space, and all of them running in ring-0. It was a fun research project, but never really made it out.
The problem is basically the choice of C#. If it were done today with WASM - no process distinctions, same address space, everything in ring 0 - it would probably be a much more successful project.
"The R1000 addresses 64 bits of address space instantly in every single memory access. And before you tell me this is impossible: The computer is in the next room, built with 74xx-TTL (transistor-transistor logic) chips in the late 1980s. It worked back then, and it still works today."
That statement has to be coming with some hidden caveats. 64 bits of address space is crazy huge so it's unlikely the entire range was even present. If only a subset of the range was "instantly" available, we have that now. Turn off main memory and run right out of the L1 cache. Done.
This does seem pretty neat though. "CHERI makes pointers a different data type than integers in hardware and prevents conversion between the two types."
I'm definitely curious how the runtime loader works.
"We need to keep in mind, the DRAM ICs themselves have a hierarchy with latency trade-offs" Yes this is the thing -- I'm not a hardware engineer or hardware architecture expert, but -- it seems to me that what we have now is a set of abstractions presented by the hardware to the software based on a model of what hardware "used to" look like, mostly what it used to look like in a 1970s minicomputer, when most of the intensive key R&D in operating systems architecture was done.
One can reasonably ask, like Mr Kamp is, why we should stick to these architectural idols at this point in time. It's reasonable enough, except that the world of heterodox, alternative architectures is itself heterogeneous - new concepts that don't necessarily "play well with others." All our compiler technology, all our OS conventions, our tooling, etc. would need to be rethought under new abstractions.
And those are fun hobby or thought exercises, but in the real world of industry, they just won't happen. (Though I guess from TFA it could happen in a more specialized domain like aerospace/defence)
In the meantime, hardware engineering is doing amazing things building powerfully performing systems that give us some nice convenient consistent (if sometimes insecure and awkward) myths about how our systems work, and they're making them faster every year.
Makes me wonder if 50 years from now we'll still be stuck with the hardware equivalent of the floppy disk icon, only because retooling the universe over from scratch is too expensive.
As they say, C was designed for the PDP-11 architecture, and modern computers are forced to emulate it, because the tools to describe software (languages and OSes) which we have can't easily describe other architectures.
There were modern semi-successful attempts though, see PS3 / Cell architecture. It did not stick though.
I'd say that the modern heterodox architecture domain is GPUs, but we have one proprietary and successful interface for them (CUDA), and the open alternatives (openCL) are markedly weaker yet. And it's not even touching the OS abstractions.
> As they say, C was designed for the PDP-11 architecture
Not really though. A linear address space was not particularly specific to the PDP-11. The one point where C really was made to fit the PDP-11 was the addition of a byte datatype (aka char), but the PDP-11 wasn't unique in that regard either.
PDP-11 wasn't unique. To the contrary, it had many typical features.
- Uniform memory, cache memory is small, optional and transparent.
- A single linear address space; no pages, stack in the same RAM as data.
- A single CPU, with a single scalar ALU, and completely in-order execution; no need for memory barriers.
A typical modern machine larger than an MCU has several levels of memory hierarchy which affect performance enormously, the physical RAM is mapped all over the address space, several execution units process data in parallel and often out of strict order, there are many variants of vector (SIMD) instructions, and usually a whole vector co-processor ("graphics card"). This breaks many of the assumptions that C initially made, and the hardware tries hard to conceal the memory hierarchy (well, your OS may allow you to schedule your threads to the same NUMA domain), to conceal the parallel execution, to conceal the memory incoherence between processing nodes, etc. Well, you sort of can make the compiler infer that you mean a vectorized operation, or use an intrinsic.
In my eyes, the C's assumptions about hardware show their age, and also hold the hardware back.
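To make the "infer it or use an intrinsic" point concrete, here's a sketch of the same vector add written both ways (x86-64 with AVX enabled, e.g. -mavx; the intrinsic names are the standard Intel ones):

    #include <immintrin.h>

    /* Plain C loop: we rely on the compiler's auto-vectorizer. */
    void add_auto(float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] += b[i];
    }

    /* Explicit AVX intrinsics: 8 floats per 256-bit register, scalar tail. */
    void add_avx(float *a, const float *b, int n)
    {
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(a + i, _mm256_add_ps(va, vb));
        }
        for (; i < n; i++)
            a[i] += b[i];
    }

Neither version says anything about caches, NUMA, or out-of-order execution; all of that stays hidden behind the same flat-memory model.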
My memory of the PDP-11 is different.
However, the pdp11 is/was a weird collection of systems, from tiny little single-chip implementations to run a VAX console, up to large racks of boards to make up a CPU and FPU like the 11/70.
I mainly worked on 11/44's and 11/73's, which both shared a 16-bit paged address space (with no cache that I remember).
They had more physical memory (256k and I think 4M) than could be addressed by the instructions (64k).
The pages were 8k - so eight of them, and waving them around required an OS mapping function call.
The IO controllers were asynchronous, and the OS did preemptive multiprocessing, and the problem-space was larger than 64k and faster than the disk drive, so multi-processing and locks were required to address it.
We used C and assembler on them. C was nicer than assembler to work with.
I don't see a difference of-kind between the pdp-11 and current computers.
I do see a difference of 'know-ability' of the software stack that makes up a system.
There are so many external dependencies in the systems I have worked on since, many of them larger than the systems that loaded into that pdp-11, so being certain that there is no fault was almost always a pipe-dream. Automated tests helped - somewhat.
Often, confidence is based on the 'trajectory' of the rate of bugs discovered.
> That statement has to be coming with some hidden caveats. 64 bits of address space is crazy huge so it's unlikely the entire range was even present. If only a subset of the range was "instantly" available, we have that now. Turn off main memory and run right out of the L1 cache. Done.
So I did some digging around for documentation about this machine and it looks like it puts the upper 54-bits of the address through a hash function to select an entry in a set associative tag RAM which is then used to select a physical page. This has the possibility for collisions, but it can get away with that because RAM is just a cache for disk contents.
Certain parts of the address technically mean something, but apart from leveraging that in the design of their hash function it has no real relevance to the way the hardware works. This scheme would work with linear 64-bit addresses just fine with an appropriate hash implementation. Basically all that's happening here is that the TLB is large enough to have an entry for each physical page in the system, and a TLB miss means you have to fetch a whole page from disk rather than walking a tree of page tables in memory.
I think the other thing going on here is that the R1000 is a microcoded machine from the 80s with no cache (well, unless you're counting main RAM as cache), so it probably has a relatively leisurely memory read cycle, which makes it more straightforward to have a very large TLB relative to the size of main memory. There's no magic here and no lessons for modern machines when it comes to how virtual address translation is done.
What I mean is that the R1000 memory architecture is not fundamentally different from modern hardware in a way that seems to solve any modern design problems. The tag RAM is functionally equivalent to the TLB on a modern CPU, but it's much larger relative to the size of the RAM it's used with. The 2MB memory boards described in US Patent 4,736,287 (presumably an earlier version of the 32MB boards present in the R1000/s400) have a whopping 2K tag RAM entries. This is the same size as the 2nd level data TLB in Zen 2 which is supporting address lookup for a much larger main memory.
If you were to try and make a modern version of the R1000 architecture you're going to run into the same size vs speed tradeoffs that you see in conventional architectures. The server-oriented Rome SKUs of Zen 2 support 4 TB max RAM. Even if you bump the page size to 4MB, you would still need 1M TLB/tag RAM entries to support that with an R1000-style implementation.
Sorry, but you are simply wrong there. The TLB is just one of many hacks made necessary by the ever-deeper page-table-tree.
What the R1000 does is collapse the obj->phys lookup in the DRAM memory cycle, and if we did that today, we wouldn't need any page-tables to begin with, much less TLBs.
>Sorry, but you are simply wrong there. The TLB is just one of many hacks made necessary by the ever-deeper page-table-tree.
>
>What the R1000 does is collapse the obj->phys lookup in the DRAM memory cycle, and if we did that today, we wouldn't need any page-tables to begin with, much less TLBs.
You would need a TLB even with a completely flat page table because hitting the DRAM bus (some flavor of DDR on modern systems, but it's still fundamentally DRAM) on every access would absolutely destroy performance on a modern machine even if translation itself was "free". You need translation structures that can keep up with the various on-chip cache levels which means they need to be small and hierarchical. You can't have some huge flat translation structure like you have on the R1000 and have it be fast.
Anyway, my point is that at a mechanical level TLB and tag RAM work the same way. You take a large virtual address, hash the upper bits and use them to do a lookup in a set-associative memory (so basically a hash table with a fixed number of buckets for conflicts). In some CPUs (it's a little unclear to me how common it is for cache to be virtually or physically addressed these days) this even happens in parallel with data fetch from cache just like tag RAM lookup on the R1000 was done in parallel with data fetch from DRAM. This is not some forgotten technique, it's just moved inside the CPU die and various speed and die space constraints keep it from covering all the physical pages of a modern system.
Now, could you perhaps use a more R1000-like approach for the final layer of translation, sure. Integrating it tightly with system memory probably doesn't make sense given the need to be able to map other things like VRAM into a virtual address space, but you could have a flat hashtable like arrangement even if it's just a structure in main RAM. You can even implement such a thing on an existing CPU with a software managed TLB (MIPS, some Sparc)
I'm sure there are improvements to be made but there are pretty fundamental physical reasons why reading a random piece of data from a large pool of memory is going to take longer than reading random piece of data from a small pool of memory. Hence, as memory pools have gotten bigger since the days of the R1000 we use caches both in memory itself and in address translation.
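Mechanically, that lookup is something like this toy model (the sizes and the hash are invented; real hardware compares all the ways in parallel):

    #include <stdbool.h>
    #include <stdint.h>

    #define SETS        256
    #define WAYS        4
    #define PAGE_SHIFT  12

    struct way { uint64_t vpage; uint64_t frame; bool valid; };
    static struct way tag_ram[SETS][WAYS];

    /* Hash the upper address bits to pick a set, then compare tags. */
    static bool lookup(uint64_t vaddr, uint64_t *paddr)
    {
        uint64_t vpage = vaddr >> PAGE_SHIFT;
        unsigned set = (unsigned)((vpage ^ (vpage >> 8)) % SETS);

        for (int w = 0; w < WAYS; w++) {
            if (tag_ram[set][w].valid && tag_ram[set][w].vpage == vpage) {
                *paddr = (tag_ram[set][w].frame << PAGE_SHIFT)
                       | (vaddr & ((1u << PAGE_SHIFT) - 1));
                return true;
            }
        }
        return false;   /* miss: walk page tables (modern CPU) or hit disk (R1000) */
    }

Whether the structure sits next to the DRAM (R1000 tag RAM) or on the CPU die (TLB), the mechanism is the same; only the size and latency budgets differ.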
The entire address space was available at all times (64 address bits, of which 3 are "kind" bits separating code/data/control etc., so "only" 61 bits address bytes).
The crucial point is that the RAM wasn't laid out as a linear array, but as a page-cache.
In a flat memory space, to allocate a single page far away from all others, you will need four additional pages (the fifth level is always present) for the page-tables, and four memory accesses to look them up before you get to the data.
In the R1000, you present the address you want on the bus, the memory board (think: DIMM) looks that address up in its tag-ram to see if it is present, completes the transaction or generates a memory fault, all in one single memory cycle, always taking the same time.
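For contrast, here's what the flat-memory walk costs in dependent lookups - a minimal sketch of an x86-64-style 4 KiB-page table walk, where read_phys() is a hypothetical helper standing in for a single memory access:

    #include <stdint.h>

    #define LEVELS  5          /* PML5 -> PML4 -> PDPT -> PD -> PT */
    #define PRESENT 0x1ULL

    uint64_t read_phys(uint64_t phys_addr);   /* assumed: one DRAM/cache access */

    /* Returns the physical address for vaddr, or 0 on a page fault.
       Each level is a load that depends on the previous one. */
    uint64_t translate(uint64_t table_root, uint64_t vaddr)
    {
        uint64_t table = table_root;
        for (int level = LEVELS - 1; level >= 0; level--) {
            unsigned idx = (vaddr >> (12 + 9 * level)) & 0x1FF;   /* 9 bits per level */
            uint64_t entry = read_phys(table + idx * 8);
            if (!(entry & PRESENT))
                return 0;                                         /* fault */
            table = entry & ~0xFFFULL;                            /* next table / frame */
        }
        return table | (vaddr & 0xFFF);
    }

Every one of those dependent reads is exactly what TLBs, page-walk caches and huge pages exist to avoid.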
The problem is that there is a limit on how fast you can make the look-up in hardware. Today, given the large amounts of physical memory and the high frequency that CPUs are clocked at, single cycle lookup would be impossible. In fact today CPUs already have such lookup tables in the form of TLBs as hitting the page table every time would have very high latency; still TLBs cannot cover the whole address space and still need multi-level structures even for a subset of it.
Single Address Space OSs are an option, but it means that you are restricted to memory safe languages, it is very vulnerable to spectre-like attacks, and any bug in the runtime means game over.
"Unsafe at Any Speed" is the name of Ralph Nader's book on car manufacturers resisting car safety measures. It resulted in the creation of the United States Department of Transportation in 1966 and the predecessor agencies of the National Highway Traffic Safety Administration in 1970.
Which makes this article's reference to such an extreme situation a bit click-baity. Generally not something I'd expect of ACM, though maybe I don't read it enough to know if it's descended into that realm on a more regular basis.
Perhaps "any speed" was exaggerated but I wouldn't call the thing click bait when it legitimately called out manufacturing practices and reluctance to put resources into safety which thereby resulted in trivially preventable deaths.
The greater awareness of safety factors brought to light by the book resulted in a culture of safety around automobiles & consumer buying practices that made safety features primary marketing & selling points.
That's all a pretty high bar to reach when using "unsafe at any speed" in the context of address spaces.
So long as you kept the tires inflated. It had a strange motorcycle-like difference between rear and front tires. I don't know if you could feel the handling get bad if that wasn't preserved or if it would surprise you in extreme situations.
I have a much dimmer view of the ACM than you. This is exactly the kind of crank, throwaway op-ed I’d expect out of them (and Poul-Henning Kamp in particular — notable for loudly conflating Judaism and the Israeli government).
There is often a quite significant distance between the beautiful, elegant and efficient design that brings tears to the eyes of a designer, and being pragmatic and financially viable.
Building a new competitive processor architecture isn't feasible if you can't at least ensure compile-time compatibility with existing programs. People won't buy a processor that won't run their programs.
The Multics system was designed to have segments (for this discussion == pages) that were handled the way he described, down to the pointer handling. Not bad for the 1960s, though Unix was designed for machines with a lot fewer transistors back at the time when that mattered a lot.
Things like TLBs (not a new invention, but going back to the 1960s) really only matter to systems programmers, as he says, and judicious use simplifies and has simplified programming for a long time. I think if he really wants to go down this path he'll discover that the worst case behavior (five probes to find a page) really is worth it in the long run.
On the 645 it did, and I think in any realistic alternative it would have too as segments could be of different sizes and the working set could vary.
However for my simplified comment I said "for this discussion == pages" simply because the difference is simply an extra indirection table in memory, with the segment carrying the access control (like a process address space in unix) and the pages being indexed within it.
All of which was an attempt to say that the post to which we are all commenting is a bit overwrought in its complaint about the worst case search required to find a page. After all, that's what systems programming is all about: the grotty complex plumbing required so other people can worry about the performance of their algorithms that do actual work.
Another system that had an object-based non-linear address space I believe was the "Rekursiv" CPU developed at Linn (yes, the Swedish audio/drum machine company;
EDIT: Linn. Scottish. Not drum machine. Thanks for the corrections. In fact I even knew this at one time. Yay brain.) in the 80s.
> Linn (yes, the Swedish audio/drum machine company) in the 80s
Uhm.
Linn the audio company, known as Linn Products, are Scottish, being based a little to the south of Glasgow, and named after the park the original workshop was beside.
Linn the drum machine company, known as Linn Electronics, were American, being founded by and named after Roger Linn.
Two totally different companies, run by totally different people, not connected in any way, and neither of them Swedish.
The Linn Rekursiv was designed by the audio company, and was largely unsuccessful, and none exist any more - not even bits of them :-/
It amazes me that articles like this can talk about obscure architectures, and now CHERI, but completely fail to notice that just about every PC in use is actually a lightweight capability machine! AKA all that "cruft" everyone complains about in ia32, you know the LDT/GDT/IDT, the variable length segments referenced there, the selector+offset (aka CS, SS, ES, DS, FS, GS) format maps perfectly to the concept of a data structure and its offset. The task gates, call gates, interrupt gates etc. are all there to support a proper per-segment security model.
We have these machines, although granted over the past few decades those mostly unused operations have gotten quite slow, and the model harkens back to a time where people didn't have a lot of ram, so there aren't a lot of "caches" (aka segment registers/etc) in place to support modern computing.
Which is why I find these articles amusing, suddenly its in vogue to rediscover what most computer architects of the 60-80's were doing, until RISC and UNIX basically destroyed it all, with leaky abstractions and insecure designs.
And since the PC is just a pile of legacy garbage no one looks at it close enough to discover they have the HW sitting on their desk to try out some of these ideas.
The type descriptions contain much more than that, for instance valid range for integers, field layout for structures and so on. Also privacy information.
My guess is that it is an object oriented system. The data's type is a pointer to the address that defines the type. Which could be anywhere in the system.
This is also a security feature. If you find a way to randomly change the data's type, you're unlikely to successfully change it to another type.
I don't know if this machine supported it, but it could allow you to have a system-wide unique type for this-struct-in-this-thread-in-this-process, with strong type checking all the way through the compiler into run time. Which would be pretty cool.
is there an example of machine code that doesn't make use of a linear contiguous call stack?
what would the alternative be? compute the size of all the stack frames a priori in the compiler and then spray them all over main memory and then maintain a linear contiguous list of addresses? doesn't the linear contiguous nature of function call stacks in machine code preserve locality in order to make more efficient use of caches? or would the caches have to become smarter in order to know to preserve "nearby" stack frames when possible?
also, why not just make the addresses wider and put the pid in the high bits? they're already doing this masking stuff for the security descriptors, why not just throw the pid in there as well and be done with it?
> is there an example of machine code that doesn't make use of a linear contiguous call stack?
Early CPUs didn’t have support for a stack, and some early languages such as COBOL and Fortran didn’t need one. They didn’t allow recursive function calls, so return addresses could be stored at fixed addresses, and a return could either be an indirect jump reading from that address or a direct jump whose target address got modified when writing to that fixed address (see https://people.cs.clemson.edu/~mark/subroutines.html for the history of subroutine calls)
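A GNU C sketch of that convention (computed gotos are a GCC/Clang extension; this only illustrates the idea, not how those compilers actually emitted code): the routine has one fixed slot for its return address, which is exactly why recursion was off the table.

    #include <stdio.h>

    static void *sub_return;   /* the routine's single, fixed return-address slot */

    /* "Call": store where to come back to, then jump into the routine. */
    #define CALL_SUB(ret) do { sub_return = &&ret; goto sub_entry; ret:; } while (0)

    int main(void)
    {
        CALL_SUB(back1);
        CALL_SUB(back2);       /* a recursive call here would clobber the slot */
        return 0;

    sub_entry:
        puts("in subroutine");
        goto *sub_return;      /* "return": indirect jump through the fixed slot */
    }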
If you put the pid in the high bits of every pointer you run into problems when you want to support fork(). The child gets a new pid but keeps all the same pointers as its parent.
And that only protects you against one process trying to read/write another process's memory. We already have good support for this type of protection on current hardware/OSes. What is desired is something that prevents a process from reading/writing its own memory, through an out of bounds pointer. This would prevent buffer overflows, stash smashing, etc. and be a good defense against RCE.
Non-contiguous stacks are typically just linked lists of call frames, or groups of call frames. One of the examples that used to be regularly trotted out on comp.lang.c 15+ years ago was an IBM mainframe environment, where stacks were linked lists but still managed by hardware.
Off the top of my head, current examples of non-contiguous stacks on more familiar system with supporting code compiled directly to machine code (i.e. not language interpreters and VMs, where this is actually very common), include
1) gccgo, which compiles Go code to use split stacks.
2) gcc with -fsplit-stack, which permits compiling C (and C++?) code which uses the same stack discipline as gccgo.
3) OCaml's new concurrency work utilizes non-contiguous stacks. I'm not familiar with OCaml toolchains, but notably they do generate DWARF and related tables so both debuggers and the garbage collector can properly walk logical stacks. See https://arxiv.org/abs/2104.00250
4) There's a stack-smashing and ROP prevention technique called shadow stacks, where a single logical call stack is implemented using two contiguous stacks, one primarily for return addresses (or related metadata) and the other primarily for objects. See https://en.wikipedia.org/wiki/Shadow_stack Both GCC and clang support software-implemented shadow stack disciplines (SafeStacks), and some newer Intel and ARM ISA extensions implement hardware-enforced shadow stacks.
The linked article doesn't mention call stacks explicitly, but describes the R1000 arch was object+offset addressed in HW. So unless they restricted the call stack to fit into one object and use only offsets, then yes, they must have chained objects together for the stack.
When you have a page-based memory model, you've created the importance of address locality. If you have object-based memory model, and the working set is of objects, not pages, then address locality between objects doesn't matter.
Of course, page-based based memory models are by FAR the most common in practice.
(Note: pages ARE objects, but the objects are significant to the VM system and not to your program. So strictly, page-based models are a corner case of object-based models, where the objects are obscure.)
would be interesting to see how the actual call stack is implemented. they must either have a fixed width object as you mention or some kind of linear chaining like you're describing.
memory and disk are unified into one address space, code is represented by this "diana" structure which can be compressed text, text, ast or machine code. would be curious how procedures are represented in machine code.
When you reverse engineer "normal" code, you know the CPU can handle integers of X, 2X, 4X bit widths, you know it wants things aligned this way or that way etc.
The R1000 is bit addressed, and integers can be any width from 1 to 64 bits, structures can be "variant" (See: Ada) so reverse engineering the storage layout is ... not an accomplished feat yet.
Since the primary task for the R1000 was Ada program development, and since there were a standardized semantic representation of Ada programs ("DIANA"), one of the fundamental hardware/microcode data-types is "DianaTree", which is, I think a pretty generic CS tree of some kind.
These "DianaTree" types seem to also have been used for other stuff, like directories in the object store etc.
Fun fact: for many languages a linear memory block doesn't work - because execution context lifetimes aren't strictly LIFO. For example, continuations are a language feature that causes this (unless they're restricted, e.g. to single-shot).
The linear address space is the root reason why any language with FFI support will be inherently unsafe, unfortunately. Any errant pointer from C can accidentally (or intentionally) corrupt objects in the safe language. It's a difficult problem to solve.
Vale's "Fearless FFI" designs [0] says we could sandbox entire third-party libraries using a WebAssembly compilation step, which might work well. Sometimes I wonder what would happen if we made an entire OS using Vale, and what kind of security improvements it might bring.
The underexplored value of early segmentation was the discretionary segment level permissions enforced by hardware.
Years ago I prototyped a system that had filesystem permission support at the segment level. The idea was you could have a secure dynamic library for, say, manipulating the passwd file (you can tell how long ago that was). You could call into it if you had the execute bit set appropriately, even if you didn't have the read bit set, so you couldn't read the memory but could call into it at the allowed locations (i.e. PLT was x only).
However it was clear everyone wanted to get rid of the segment support, so that idea never went anywhere.
They made a decent go at it again in 16 and 32 bit protected mode. The GDT and LDT along with task gates were intended to be used as an hardware object capability system like the iAPX 432's.
I'd hope that anyone at Intel with said nightmares would have read this paper by now (wherein Bob Colwell, et al, argue that the 432 could have been faster with some minor fixes, and competitive with contemporary CPUs with some additional larger modifications).
Since this addressing scheme is <object, offset>, and as these pairs need to fit in 64 bits, I am curious: is the number of bits for each part fixed, and what are those fixed widths? In other words, what is the maximum possible offset within one object and the max number of objects?
Probably segment registers in x86 can be thought as object identifiers, thus allowing the same non-linear approach?(Isn't that the purpose of segments even?)
Update: BTW, another term for what the author calls "linear" is "flat".
Of course things would be faster if we did away with coarse grained virtual memory protection and instead merged everything into a single address space and guaranteed protection using fine grained permission mechanisms.
The problem with that is that a single error in the fine grained mechanism anywhere in the entire system can quite easily cause complete system compromise. To achieve any safety guarantees requires achieving perfect safety guarantees across all arbitrary code in your entire deployed system. This is astronomically harder than ensuring safety guarantees using virtual memory protection where you only need to analyze the small trusted code base establishing the linear address space and do not need to be able to analyze or even understand arbitrary code to enforce safety and separation.
For that matter, fine grained permissions are a strict superset of the prevailing virtual memory paradigm as you can trivially model the existing coarse grained protection by just making the fine grained protection more coarse. So, if you can make a safe system using fine grained permissions then you can trivially create a safe system using coarse grained virtual memory protection. And, if you can do that then you can create an unhackable operating system right now using those techniques. So where is it?
Anybody who claims to be able to solve this problem should first start by demonstrating a mathematically proven unhackable operating system as that is strictly easier than what is being proposed. Until they do that, the entire idea is a total pipedream with respect to multi-tenant systems.
You have misunderstood the article. It is not advocating for the return to single address space systems. It is advocating for potential alternatives to the linear address space model.
Here [1] is an operating system that I think fits under the description of what you were talking about
The more I research Singularity, the more I like it. I deep dived into all the design docs years ago and the amount of rethinking existing OS infrastructure is astounding.
I think that the plague of speculative execution bugs qualify as a single error in virtual memory systems that cause complete system compromise. This was not a logic error in code, but a flaw in the hardware. It isn't clear to me if CHERI would have been immune to speculative execution problems, but access issues would likely have shown up if the memory ownership tests were in the wrong place.
I have been following CHERI. I note that in order to create the first FPGA implementation they had to first define the HDL for a virtual memory system -- all of the research "processor" models that were available did not have working / complete VM implementations. CHERI doesn't replace VM, it is in addition to having VM.
I've found that memory bugs (including virtual memory ones) are difficult to debug, because the error is almost never in the place where the failures show up and there is no easy way to track back who ought to own the object or how long ago the error happened. CHERI can help with this by at least being able to identify the owner.
Virtual memory systems are usually pretty complex. Take a look at the list of issues for the design of L3 <https://pdos.csail.mit.edu/6.828/2007/lec/l3.html>. The largest section there is for creating address spaces. For the Linux kernel, in this diagram a lot of the MM code is colored green <https://i.stack.imgur.com/1dyzH.png>, it is a significant portion. More code means more bugs and much harder to formally verify.
I am not convinced by the argument that it is possible to take a fine grained system and trivially expand it to a coarse grained system. How is shared memory handled, mmap'ed dylibs, page level copy-on-write?
Speculative execution vulnerabilities largely don’t cause system compromise and instead are constrained by virtual memory. That’s why much of the focus was on the OS kernel: modern operating systems generally include all physical memory in their virtual address space.
tl;dr: conventional design bad, me smart, capability-based pointers (base+offset with provenance) can replace virtual memory, CHERI good (a real modern implementation of capability-based pointers).
The first two points are similar to other Poul-Henning Kamp articles [1]. The last two are more interesting.
I'm inclined to agree with "CHERI good". Memory safety is a huge problem. I'm a fan of improving it by software means (e.g. Rust) but CHERI seems attractive at least for the huge corpus of existing C/C++ software. The cost is doubling the size of pointers, but I think it's worth it in many cases.
I would have liked to see more explanation of how capability-based pointers replacing virtual memory would actually work on a modern system.
* Would we give up fork() and other COW sorts of tricks? Personally I'd be fine with that, but it's worth mentioning.
* What about paging/swap/mmap (to compressed memory contents, SSD/disk, the recently-discussed "transparent memory offload" [2], etc)? That seems more problematic. Or would we do a more intermediate thing like The Mill [3] where there's still a virtual address space but only one rather than per-process mappings?
* What bookkeeping is needed, and how does it compare with the status quo? My understanding with CHERI is that the hardware verifies provenance [4]. The OS would still need to handle the assignment. My best guess is the OS would maintain analogous data structures to track assignment to processes (or maybe an extent-based system rather than pages) but maybe the hardware wouldn't need them?
* How would performance compare? I'm not sure. On the one hand, double pointer size => more memory, worse cache usage. On the other hand, I've seen large systems spend >15% of their time waiting on the TLB. Huge pages have taken a chunk out of that already, so maybe the benefit isn't as much as it seemed a few years ago. Still, if this nearly eliminates that time, that may be significant, and it's something you can measure with e.g. "perf"/"pmu-tools"/"toplev" on Linux.
[4] I haven't dug into how this works when fetching pointers from RAM rather than pure register operations, but for the moment I'll just assume it works, unless it's probabilistic?
As things stand now, CHERI doesn't replace virtual memory. MMU is still there; CHERI is a layer placed on top (so it's CHERI capabilities -> linear local addresses -> hardware addresses). Which is why things generally work as usual, even though the entire FreeBSD userspace and (sometimes) the kernel are compiled as purecap binaries, using capabilities instead of pointers.
The fork(2) isn't a problem when running like this, but it does become a problem if you want to colocate processes in a single address space. It's not as much of a problem as I'd previously expect: there's vfork(2) and posix_spawn(2); fork is only a problem until subsequent execve(2); and also because many systems don't support fork(2) anyway, userspace had to adapt.
> As things stand now, CHERI doesn't replace virtual memory.
Yeah, he's proposing...something else. It's not clear to me exactly what, except sort of like this obscure historic machine he vaguely described. See e.g. this paragraph:
> The linear address space as a concept is unsafe at any speed, and it badly needs mandatory CHERI seat belts. But even better would be to get rid of linear address spaces entirely and go back to the future, as successfully implemented in the Rational R1000 computer 30-plus years ago.
> I'm inclined to agree with "CHERI good". Memory safety is a huge problem. I'm a fan of improving it by software means (e.g. Rust) but CHERI seems attractive at least for the huge corpus of existing C/C++ software
A lot of C/C++ code assumes that pointers are integers are pointers, so I dunno how big the corpus would actually be. People will cast between them but that's not the end of it, they will also make unions, and they will memcpy from one to another. It wouldn't surprise me if there is a lot of code that even assumes pointers are exactly 64-bit wide.
Note that it's not like with CHERI you can't cast a pointer to int or something. Sure you can, that's one of the main accomplishments: to demonstrate that hardware capabilities can work with real-world source code, like PostgreSQL.
So, it's not like you can't typecast; rather, there are some specific things the hardware will prevent you from doing, eg '(void *)42' - if you force clang to accept it, it will crash at runtime due to missing tag.
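A sketch of the casts being discussed (the addresses are made up; the behaviour described in the comments is for a CHERI purecap target, where it differs from a conventional machine):

    #include <stdint.h>

    int cheri_examples(void)
    {
        /* Fabricating a pointer from a bare integer compiles, but the
           resulting capability has no valid tag, so dereferencing traps. */
        volatile int *fabricated = (volatile int *)42;

        /* Round-tripping through a plain 64-bit integer also strips the
           tag, because the integer only carries the address. */
        int x = 1;
        unsigned long bits = (unsigned long)&x;
        int *stripped = (int *)bits;

        /* Round-tripping through uintptr_t is meant to keep working:
           on purecap CHERI it is a capability-sized type. */
        uintptr_t u = (uintptr_t)&x;
        int *ok = (int *)u;

        (void)fabricated; (void)stripped;
        return *ok;   /* fine; *fabricated or *stripped would trap */
    }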
Yeah, some source changes would be needed, including removing some clever optimizations. Still much easier than changing languages entirely. I rewrote a small C++ application into Rust. Only a few thousand lines iirc, and I was the sole author of both versions. Even that was a significant effort.
Doing tagged slab allocation seems like a reasonable thing for a hardware microarchitecture to consider.
The article starts by complaining about modern CPU complexity, five-level page tables to maintain the illusion of a linear, uniform memory space.
Then goes on to juxtapose that complexity with an architecture that would use that complexity to support tagged memory objects more directly, as part of the system ABI.
If we're going to spend all of our time manipulating memory objects anyway, why mess around?
Was the Intel iAPX so much of a disaster, compared to a modern x86_64 architecture, that we are forever forbidden to even consider such hardware?
The hardware and modern OSes support large 2MB pages, and huge 1GB pages. They trim 1 or 2 levels from that tree, respectively. With huge pages a single node of that tree addresses 512GB of memory, for most practical applications you’re pretty much guaranteed to translate these addresses without TLB cache misses.
There’re some limitations about them, on Windows the process needs a special security privilege, also these pages are never swapped to a page file. But still, AFAIK when programmers really need the performance, like in a database engine, they do use these things when available.
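On Linux the request looks something like this - a minimal sketch assuming 2 MiB huge pages; MAP_HUGETLB needs pages reserved in advance (vm.nr_hugepages), with madvise(MADV_HUGEPAGE) as the "transparent" fallback:

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;               /* one 2 MiB page */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            /* Fall back to a normal mapping and ask the kernel to
               promote it to huge pages if it can. */
            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) { perror("mmap"); return 1; }
            madvise(p, len, MADV_HUGEPAGE);
        }

        ((char *)p)[0] = 42;                           /* touch it so it's backed */
        munmap(p, len);
        return 0;
    }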
Huge pages cause 10-20% speedups for all sorts of applications by nearly eliminating the pain of looking at a bunch of page tables all the time. Unfortunately they are completely unusable on Windows even though it's been at least 26 years since they first shipped in consumer CPUs.
Me neither. S/360 was CISC by every definition. It was a pretty clean design and could scale from small to big systems. Over time it acquired instructions to achieve things in hardware as desired by its designers and users.
I think they were really poking fun at RISC and how complicated the instruction sets became to support real world desires. We're a long way from the MIPS R2000.
Yeah, MIPS eventually supported SIMD instructions and the MIPS16 short-length instructions. I'm sure other things were thrown in there. Other instruction sets inspired by early RISC also added lots of things.
Most have stayed true to the fundamental principle of separating memory instructions from arithmetic instructions. What makes an instruction set CISC, in my opinion, is allowing memory operands in arithmetic. Most RISC instruction sets also provide lots of registers to avoid reuse. The idea is generally that the compiler should be responsible for optimally scheduling instructions. But then, modern RISC CPUs are usually out-of-order processors anyway.
> They also made it a four-CPU system, with all CPUs operating in the same 64-bit global address space. It also needed a good 1,000 amperes at 5 volts delivered to the backplane through a dozen welding cables.
These days you just use 12v and convert right next to or on die - but we are still in that range of amps for big chips!
Take for example a 3090 at 500w @12v, the core is running at 1.056v, that’s 473 Amps!
That goes a long way to explaining the pin count on those chips! The idea of that much current running through a bunch of angelhair thickness gold wire is horrifying.
Physically, RAM chips have transistors in linear arrangments, one after the other.
Storage cell #1,334,455,224 is physically next to storage cell #1,334,455,223. Both may contain data from different "objects", or not. That does not change the physical reality.
Having a representation of reality in your system/software can be helpful in many cases. A nefarious example would be if you were attempting to write a rowhammer attack. How would you do that if the computer cannot reason about the physical location of storage cells in RAM?
>"The R1000 addresses 64 bits of address space instantly in every single memory access. And before you tell me this is impossible: The computer is in the next room, built with 74xx-TTL (transistor-transistor logic) chips in the late 1980s. It worked back then, and it still works today."
Nice! (This is nothing short of a miracle considering that this technology originated in the 1980's and was built with TTL!)
This article makes me wonder, now that Apple makes its own CPUs and owns the "whole widget" if they might do something more radical with their CPU architecture. They could build in Swift-specific hardware support for example. Apple could get an "unfair" power and performance advantage over their CPU competition.
process thinks it has a linear memory to itself, imagines it can write anywhere. writes trigger the expansion of a complex sparse tree data structure that transparently converts the virtual linear address space to non-contiguous actual locations in main memory.
the R1000, by contrast:
program can repeatedly ask for an x KiB page, is given an id, which is actually an index into an array that holds the actual RAM address. data is fetched with an id+offset, which is range/pid checked against metadata in the array. the ids a program gets back aren't guaranteed to be sequential. BUT the array is dense, which is why no tree structure is needed.
dynamic stacks have a little extra work to do, a data return pointer in addition to an instruction return pointer.
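A toy model of that id+offset access path (names and field widths are invented for illustration, not the real R1000 layout):

    #include <stddef.h>
    #include <stdint.h>

    struct object_entry {
        uintptr_t base;       /* actual RAM address of the object */
        size_t    size;       /* object length, for the range check */
        uint32_t  owner_pid;  /* which process may touch it */
    };

    /* Dense array indexed directly by id - no tree to walk. */
    static struct object_entry object_table[1 << 16];

    /* Every load goes through (id, offset); there is no raw pointer. */
    static int load_byte(uint32_t id, size_t offset, uint32_t pid, uint8_t *out)
    {
        struct object_entry *e = &object_table[id];
        if (pid != e->owner_pid || offset >= e->size)
            return -1;                                 /* memory fault */
        *out = *(const uint8_t *)(e->base + offset);
        return 0;
    }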
tl;dr page-based linear addressing induces performance loss with complicated access policies, e.g. multilevel page tables. Mr. Kamp would prefer an object model of memory access and protection. Also, CHERI (https://dl.acm.org/doi/10.5555/2665671.2665740) increases code safety by treating pointers and integers as distinct types.
Simple: we don't want some low level kernel memory management dictating what constitutes an "object".
Not everything is object-oriented - e.g. large arrays and memory-mapped files, including executables and libraries.
Linear memory sucks, but every other organization sucks more.
Segmented has been done; the benefit-to-clunk ratio was negligible.