Hacker News
Towards userspaceification of POSIX – part I: signal handling and IO (redox-os.org)
128 points by akyuu 6 months ago | 67 comments



> POSIX allows file descriptors to be shared by an arbitrary number of processes, after e.g. forks or SCM_RIGHTS transfers (even though this use case is most likely very rare, so it’s not entirely impossible for this state to be moved to userspace).

Normally it’s rare, except shell functionality crucially depends on it[1].

The way to think about this, IMO, is that while a disk file is more or less random-access, when you open it, it unavoidably turns into a strange hybrid of a random-access byte array and a (seekable) byte stream (classically, the “file description”), and that is what the resulting fd and any of its copies point at. Seekable streams made some sense in the age of tape drives, but nowadays it seems to me that random-access files and non-seekable streams have enough of a gulf between them that forcing both to appear like the same kind of object is more confusing than helpful.

(I don’t know what my ideal world would look like, given I still like to direct streams from/to files. I still dislike the “file description” business rather strongly.)

[1] https://utcc.utoronto.ca/~cks/space/blog/unix/DupSharedOffse...


Ah! Well, as a systems programmer, it's not so rare. FDs are really not a friendly interface (they have their quirks), and I'd love a better one. But as an application programmer, it's really sad we don't take advantage of FD passing to build more modular programs. For instance, on desktop operating systems you pass paths, not capabilities, to use files, and the norm is giving programs access to read large swaths of user data. This is not secure, and it's one of the issues freely passed FDs could help solve. In general, despite the challenges in making them secure, I'd strongly advocate for more modular and integrated applications.

As for the seeking abstraction, it fits well with other buffered device-driver information streams. Yes, it's a complicated and confusing interface, but the key thing is it allows you to share an OS/system/hardware-level resource between multiple programs. We want to take advantage of that. It's the abstraction we've got, sure, but it's what we do with it that counts!


My idea of operating system design is you do pass capabilities; actually, a message can only pass capabilities and byte sequences, and all I/O must use that interface. Furthermore, "proxy capabilities" are possible; i.e. a program can make up its own capabilities and send them to other capabilities it has access to, and then use those capabilities that it had made up to receive messages and do something with them (such as forward them to other capabilities; this is a simple case that can be used for logging or for revocable capabilities, but more complicated uses are possible). Actually, my operating system design does not have file paths.

This makes it more secure than how UNIX does it (if it is designed properly; the way to do this is to avoid making the interface too complicated, since a complicated interface would also increase the complexity of the more advanced uses of proxying), as well as more flexible (and allows modularity). It is then possible to allow reading only a part of a file, or to decompress (or compress) automatically without the application program knowing about the compression, or to log accesses, etc.

However, for seeking it will require a message containing a command to request to seek the file, since "seeking" is not a function known to the operating system, and there is no system call for "seeking"; the system calls are sending and receiving messages, waiting for objects, discarding capabilities, and creating new proxy capabilities.

But, the seeking command and others would be standardized and defined in the operating system specification, so that programs can use them, even though the system call interface does not directly have such a command, and proxies can handle messages in whatever way they want to (since they are just arbitrary byte sequences and/or capability passing).

(My own operating system design also allows "userspaceification of POSIX"; the kernel is not POSIX, but a compatibility layer (for at least much of POSIX, but maybe not all of it necessarily) can be made in user space if it is desired.)


Generally this aligns with what the Redox kernel is currently transitioning into, with a few limitations in order to retain compatibility for the (quite large number of) applications we "need" to support.

> My idea of operating system design is you do pass capabilities; actually, a message can only pass capabilities and byte sequences, and all I/O must use that interface.

File descriptors and capabilities are very similar, and Redox already uses file descriptors, which are handled by scheme daemons, for most interfaces in general. I'm working on a _virtual memory-based capability_ RFC, on top of which the POSIX file table can be implemented in redox-rt (userspace) without forcing (some) POSIX semantics onto capabilities.

> Actually, my operating system design does not have file paths.

We're eventually going to switch fully to the openat* family of path calls, which would generalize the open syscall into a scheme call that sends a capability reference (dirfd), a path, and returns a capability.

> However, for seeking it will require a message containing a command to request to seek the file, ...

But that requires the cursor to be stored either by the client (requiring extensive messaging for each non-absolute IO call), by the server (requiring unnecessary state, since almost all positioned IO nowadays is random-access), or, currently, in the kernel. I'd like this state to be stored in userspace, as the article mentions, but this will first require assessing whether it would be feasible to break compatibility for this, or at least "performant compatibility".

> the kernel is not POSIX, but a compatibility layer (for at least much of POSIX, but maybe not all of it necessarily) can be made in user space if it is desired.

This is exactly what Redox is transitioning to: a kernel that's not necessarily Unix-like, with the bulk of POSIX logic implemented in userspace (redox-rt).


I have read the Redox documentation; mine is in many ways very different from Redox and from other operating systems (both in the workings of the high-level code and of the low-level code), although some goals and features are similar. For example, "forcing all functionality to occur via file descriptors" is mandatory for all I/O in my system (although they are "capabilities"), but mine does not use namespaces like Redox does, and requires the capabilities a program uses to be passed to it in the initial message that it receives (which is the only way to receive a message without already having a capability; if the initial message does not include any capabilities, then the program is immediately terminated, since it cannot do any I/O). (Mine is also meant to be a specification independent of the implementation, so it would be possible to write an implementation in C or in Ada or whatever else, and different components implemented independently can be used together.)

> File descriptors and capabilities are very similar

Yes, and it is what I thought too, although my idea of these "proxy capabilities" is simplified compared with UNIX file descriptors in many ways (it is something like only having UNIX sockets, created using the socketpair function, and with SCM_RIGHTS messages as well as bytes; however, there are some significant differences too).

> We're eventually going to switch fully to the openat* family of path calls

I also thought that a POSIX interface can use openat and this is better than using open etc, although my idea of operating system design does not have that either; there are no file paths.

> But that the cursor to be stored ... I'd like this state to be stored in userspace ...

Yes, and I had also considered such things. One disadvantage of requiring the client to specify the file position is that it cannot be used with non-seekable files. However, it may be possible to have a separate POSIX and non-POSIX interface (and then the POSIX interface can be implemented in the POSIX library, which will not be needed by non-POSIX programs); the non-POSIX interface might not need to use the same interface for seekable vs streaming files (a proxy can be created (in user space) if it is necessary to use a seekable file where a streaming file is expected).


You've roughly described how fuchsia[1] is designed.

[1]: http://fuchsia.dev


I came up with my own ideas independently, although there are many similarities (as well as many differences; I describe some of them below). One difference is that mine is meant to be a specification, of which multiple independent implementations may be made.

My ideas are like an actor model in some ways, though.

Looking at [0], my idea is very similar to the Zircon kernel services. However, there are many differences from what is described in [1]. In mine, an implementation might include some additional features in the kernel, although this is just an implementation detail; user programs cannot tell whether they are kernel services or external programs (since this is how the security model of my system is designed to work).

Mine has only one type of object for IPC rather than five; it is similar to what is called a "Channel" in Fuchsia (although it is not exactly the same, it is a similar idea). It is the only kind of kernel object that user processes are able to see.

In mine, process management and hardware support services are not directly exposed by the system call interface; they are only exposed by IPC channels.

Like Fuchsia, mine has no ambient authority. However, proxy capabilities are used to provide security and many other features (e.g. logging accesses, simulating error conditions, transparent compression, network transparency, revocable capabilities, etc). A program receives an initial message when it starts, and this initial message will contain IPC channels (possibly in addition to other data).

Mine does not inherently have namespaces. Files can only be accessed through capabilities, and files can contain links to other files; there are no directory structures and no file names. There is a "root filesystem", but that is only needed for the purpose of initializing the system; most processes cannot see it and have no way of identifying it even if they could see it. However, when running POSIX programs, a POSIX-like namespace can be emulated by using a file containing key/value lists, which works similarly to Fuchsia namespaces in some ways (although such features are implemented entirely in user-mode libraries; the kernel knows nothing about them).

Also, mine does not use Unicode in any way. It also does not use JSON, XML, HTML, etc. Binary file formats are preferred; nearly everything will use binary formats. There are also many other significant differences (including UI stuff). I also consider that some other things are also no good, e.g. USB, UEFI, WWW, etc (this does not mean that it is not possible to write drivers/programs that can use them; it means that the fundamental specifications of the system deliberately avoid them, and that hardware/software designed deliberately for this system are designed to not need them).

I also would have locks and transactions, including the possibility that locks/transactions may involve multiple objects at once; this includes files, but may also need to include remote objects in some cases.

There is still i18n, l10n, a11y, etc, as well as many additional features such as "window indicators", "Common Data Format", "Command, Automation, and Query Language", etc. (The i18n does not work like Fuchsia though. For example, although language identifiers are still needed (although they are not limited to the ones included in Unicode, since it does not use Unicode), identifiers are not needed for date/time, etc (the library that deals with date/time formats can be modified to add whatever kind of calendars you want to do; the application program does not usually need the identifier for it, unless perhaps you want to reference entire months or years, but to do that requires specifying them with the data being processed by the program and is entirely separate from i18n preferences anyways).)

[0] https://fuchsia.dev/static/fuchsia-src/get-started/images/in...

[1] https://fuchsia.dev/fuchsia-src/get-started/sdk/learn/intro/...


> As for the seeking abstraction, it fits well with other buffered device driver information streams. Yes, it's a complicated and confusing interface, but the key thing is it allows you to share an OS/system/hardware level resource between multiple programs.

As someone who has only dabbled in OS-level programming but recently had a use case that the seek interface seemed to work well for (parsing a file format that heavily used offsets to reference other parts of the data), I'm super curious about what you think the "complicated and confusing" parts of the interface are. (To be clear, I'm not doubting you; I'm asking because I suspect that my understanding might be more surface-level than I thought and there are probably some pitfalls that I might not be aware of!) Offhand, the only parts that seemed potentially confusing to me are the mix of signed and unsigned integers depending on the offset type (not sure if this was specific to the Rust implementation, but it used signed integers for relative offsets and unsigned for absolute offsets, which makes sense but maybe isn't something people would expect) and the fact that it's valid to seek past the end of a file (which I didn't need for my use case), but are there other subtleties that I didn't think of?


Not the OP but the complicated part to me is just that the fd has a global cursor which makes concurrent access require synchronization. The rust std::fs::File API at least makes this clear through mutability requirements but I imagine in other languages this either can cause a lot of bugs or requires a more complicated API to surface the functionality safely.


Ah okay, good to know. I never needed to read the same file concurrently with different cursors, so that might be why the API seemed deceptively simple to me!


Rust does however implement the IO traits for `&File` as well (shared), and IIRC also implements `try_clone` which is the dup equivalent.


One of the reasons the FD interface is so complicated is that the same operations do different things depending on the underlying kernel implementation. In Linux, you have no standard way to tell what the underlying FD is. It's such a wide surface that different kernel subsystems may implement some of the many file APIs (polling especially) slightly differently. In many cases you can reuse standard tooling, but random ioctl calls mean you can't always, and nonstandard implementations of standard file calls can be dangerous if you don't know which type of FD you have.

The good thing is that FDs are standard: you use the same set of tools (file APIs) to operate on them, which makes the system interface smaller and simpler, but it lacks the fidelity for orthogonal metaprogramming over them. It sounds like an unimportant complaint that you can't tell the underlying type, but it also means most languages refer to FDs indiscriminately. You don't typically get type safety for FDs (e.g. a different class for an FD backed by shared memory versus a real exclusive file), and even if you do have such classes, you can't guarantee they are correct at runtime when you get an FD from another process (e.g. over a Unix domain socket from a child process), so languages leave FDs untyped.

About the global cursor: you can't escape it with dup, because a dup'd FD shares the underlying file description, so dup gives no new cursor. To get an independent cursor you have to re-open the file; only then does each descriptor carry its own position, since the OS can only look up that state per file description. And if an FD represents a physical seek, it might be the case that even independently opened FDs affect each other's functionality, by being linked together; that's a choice you can make in the kernel when implementing an FD.

So that's what I mean: the complexity of the real world means these objects are fundamentally different, and having the same APIs shoehorns them into something they're not. I've worked in large systems where you have very specific type-defined capabilities and messaging, and you end up with different engineers creating many types which are equivalent but have independent implementations. This is another kind of nightmare, because you need to convert between many types to use them, which ends up being the source of a lot of boilerplate. Many things about such a system are nice, but it takes a very influential and powerful architect to curb its complexity, and in the end complexity is inevitable; even if specific bouts of it spring up and get fixed, you end up with many, many APIs. So despite FDs being a very corseted API, their brazen simplicity has eliminated an entire layer of complexity from our software.

Perhaps my complaints of FD's complexity are the misplaced pangs of an idealist, forgetting the importance of the big picture. FDs are the APIs we need, but not the ones we deserve. I think their role in the structure of our programs is far more important than their exact nature. Perhaps one day they can be replaced, but whatever does replace it will certainly have learned a lot from the humble FD.


I have nothing against FD passing, and indeed agree it’s unfortunate we don’t do more of it. An (almost-)everything-is-a-string (shell) language with object-capability powers is still something I’d like to figure out someday. The Lisp-2-ish way Tcl object systems approach this feels interesting, but still a bit off from a real solution.

My reading of TFA was that it’s rare for it to be important that descriptors sharing a description thus also share a file position. And shell-like redirection use cases really are the only case I can think of where that’s important.

> As for the seeking abstraction, it fits well with other buffered device driver information streams.

I don’t think I understand what you’re getting at here. My point was that having some objects (fds, whatever) support {fstat, read/write} only and others {fstat, pread/pwrite, mmap} only would get rid of the confusing user-visible notion of “file description”. Obviously I don’t expect this to happen, but it’s still nice to dream.


> My reading of TFA was that it’s rare for it to be important that descriptors sharing a description thus also share a file position.

It reads like this was an assumption rather than an observation.

> And shell-like redirection use cases really are the only case I can think of where that’s important.

.xsession-errors , or really anything of that nature where a process tree shares an error log file on stderr.


I think an "(almost-)everything-is-a-string (shell) language" is not the way to do it; the command shell can be designed in a better way. But my intended design for the command-shell programming language of an operating system is that it would have object-capability powers, too (and it will be called "Command, Automation, and Query Language").


> but as an application programmer, it's really sad we don't take advantage of FD passing to build more modular programs.

For that, programming languages need better Unix-socket (SCM_RIGHTS) and directory-handle (openat & co.) support. And of course Windows does things differently, so getting a portable abstraction would be difficult.


On Darwin, you can wrap a file descriptor in a Mach port right. That is leveraged pretty heavily on Apple's platforms for exactly these reasons.


On desktop Unix, D-Bus provides a friendlier interface to send file descriptors than sendmsg.


Going straight to D-Bus feels excessive. Here's an 85-line file I had lying around that should cover most cases of FD passing: https://paste.rs/6FBFS.c.


There are a few existing libraries like libancillary that would do this for you and provide some level of OS compatibility.


APIs are/were designed for completeness more than friendliness. Speaking of sendmsg, the whole BSD socket API is plain horrible; it only takes a couple of uses to realize that you never want to use it directly again; you either make your own library on top of it, or a class, or whatever form of code reuse the language deems appropriate.


> the whole BSD socket API is plain horrible; it only takes a couple of uses to realize that you never want to use it directly again

That was my initial impression as well, but recently I’ve had to use it again and surprisingly did not find it as bad as I remembered. Except, indeed, for the fd-passing experience, for which see my wrapper elsewhere in the thread (also other sideband stuff, but how often do you really need SCM_CREDENTIALS?).

The syscall/kernel-ABI people seem to love it as well—I remember reading an article that praised it for remaining so stable over its lifetime. I think these are actually two sides of the same coin: BSD sockets essentially layer a second ABI on top of C function invocations. It’s a tad more specific than generic ioctl-ish (selector, payload), but not that much, and the farther away you are from the happy path of send()/recv(), the closer it is to that (and the more extension capability the kernel programmer wants, and the more misery the userland programmer feels).

The Unix approach of exposing syscalls from libc essentially directly was a nice thought, but the sockets API feels like a reductio ad absurdum of it.


SCM_CREDENTIALS may be useful for some programs on UNIX systems, but for a better designed capability-based system, SCM_CREDENTIALS is a bad idea.

SCM_RIGHTS is useful though.

I also think that there are several problems with the design and implementation of D-bus.


Only because it wraps sendmsg in a nicer API at the language level - something you could also do with raw sendmsg.


I'm aware of this example; it was discussed earlier on HN [1]. I think it would be reasonable to (try to) enforce O_APPEND in such scenarios, in which case libc might internally open a pipe (configured to be passively or actively flushed to the underlying file, non-concurrently). A pipe would also be more reliable (and secure, though programs in shell pipelines are usually trusted), since it would no longer be possible for isolated programs to change the global cursor, which could overwrite other processes' written data.

[1] https://news.ycombinator.com/item?id=38009458

Also, I feel like you are misquoting me by not including the whole sentence (judging from the subcomments). I implied shared fds with shared cursors are probably a rare use case, not shared fds in general (as you explained later on). Shared fds, and especially the ability to send fds, are obviously very fundamental for Unix-like systems, and for moving towards capability-oriented security.


Yeah, cutting the sentence that way evidently did not work out, sorry. For posterity:

> This cursor unfortunately cannot be stored in userspace without complex coordination, since POSIX allows file descriptors to be shared by an arbitrary number of processes, after e.g. forks or SCM_RIGHTS transfers (even though this use case is most likely very rare, so it’s not entirely impossible for this state to be moved to userspace).


Linux has had Plan 9-style namespaces for a long time, which you can use from userspace as well. With these you can give a process an isolated view of the filesystem, network, etc.

Unfortunately not enabled on AWS lambda runner <https://github.com/aws/aws-lambda-base-images/issues/143>


Absolutely right re: shells. There's a lot in core old Unix that depends on dup, and a lot from the '90s/'00s that depends on dup2 as well, with all the pain in the butt that comes with it (for multi-threaded code and userspace implementors).

pread/readat and friends are certainly becoming more popular in various domains, particularly as they're cheaper in highly parallel programs, avoiding the locking on seeks and the stalls on mappings (unless you can afford gigantic, up-front static mappings). At this point what we really need is thread-independent mapping solutions: a way to request that a mapping be bound to a thread, and the ability to also pass that around. Mapping will never be free, but if we can avoid whole-process stalls and giant mapping pressure, both of which are painfully common today, that's a big step forward.


How would that work in practice? Each thread has their own memory map, but with one shared branch where all the normal process-global virtual memory is mapped, and a non-shared branch for per-thread virtual memory?

Seems like there's the potential for surprise, accidental overlapping of non-shared VM mappings. But I guess this is purely a performance thing; caveat emptor, etc....

How easy would that be to hack into Linux's existing page table structures?


That is not how MMUs work, so it would go very poorly.

MMUs expect a tree. You cannot give them “Tree A, except at Node N substitute a different sub-tree”. To have a different sub-tree, you must have a different root. However, you can share sub-trees amongst different trees.

To do what you are expecting, you would actually want multiple processes with their own mappings/tree and you would share a substantial common/shared mapping/sub-tree. The difficulty there is making sure all the processes are aware of the common sub-tree and properly share and synchronize with it.


> To have a different sub-tree, you must have a different root. However, you can share sub-trees amongst different trees.

That's what I meant by each thread having their own map (i.e. root), and within each map they have a shared branch, a branch being a pointer to a subtree. In this case, what I had in mind was the shared branch being at the root; basically slicing the user land address space in half (or 1/radix size...), though, that's probably getting too detailed.

Perhaps by "branch" you thought I meant a jump operation in code. That's certainly not how most MMUs work, which walk a data structure constructed by the OS. But some MMUs actually fault to OS provided code to resolve an address, in which case "branch" in the meaning of software code isn't necessarily wrong. (But to be clear, it's not what I had in mind.)


As I explained, you are describing processes, in the Linux parlance, with a shared mapping. Threads with private mapping (shared root, different sub-tree) versus processes with shared mapping (different root, shared sub-tree) are equivalent in what can be expressed, but only the latter conforms to modern MMU hardware. The details of your scheme, and how they differ from existing code as far as I am aware, are in how to keep the various processes synchronized with respect to the shared component (synchronized copies vs references to a single shared sub-tree, keeping the TLBs synchronized, etc.).

As to software-managed TLBs, very few processors these days support such functionality, instead opting for hardware mapping table walkers in their MMUs which is the context for my comment. I am literally the author of memory mapping code for multiple architectures in a commercial operating system and even I do not need to consider such hardware.


> As I explained, you are describing processes, in the Linux parlance, with a shared mapping.

I'm not describing processes because I'm not describing anything that currently exists, at least not in Linux or any other popular OS.

I may not be a maintainer for any current VM subsystem, but I've been around long enough to know it's ridiculous to get into semantic arguments about processes, threads, light weight processes, or other similar labels. The meanings behind such terms continually evolve and are dependent on context--particular operating system, hardware, etc.

If it's not possible to implement the OP's notion, please explain why it wouldn't be possible with current hardware; why one process'/thread's/LWP's/whatever's mapping table couldn't have a subtree shared (not copied) with another mapping table which could be manipulated with the appropriate semantics. I don't know if it's possible or not; that was the gist of my question. I don't need a lesson in how "process" and "thread" are currently defined and modeled in typical systems. Obviously they don't model what the OP was suggesting. You've hinted there might be hurdles with keeping the TLB coherent, but you haven't come straight out and said that yet. It would be genuinely interesting if you asserted the claim squarely, perhaps with some context about the relationship between TLB management, root page mapping address, and scheduling contexts.

EDIT: And just to be clear, I understand that in Linux processes can share memory, but the page table entries to the shared physical pages are copies, not references to shared entries, which is why both the protections and virtual addresses can be different. In the OP's proposed scheme, having to maintain N copies of those mappings would obviously defeat the purpose; why even bother with threads if every time pages in the shared address space are mapped you'd have to update every thread's page table — dozens, if not hundreds or even thousands, of separate tables. That's facially preposterous, and it didn't even occur to me it needed to be stated explicitly, at least not in a forum like this. The whole idea clearly poses some difficult dilemmas. But because only allocations in the shared space need be globally atomic and simple, as they are now, it's not obvious that a scheme with private mappings on the side is impossible; at least, not obvious without recourse to specific knowledge of the details of how typical MMUs, TLBs, etc. operate and are managed. And I wouldn't be surprised in the least if it's not possible to achieve what the OP suggested. In fact, I've been skeptical the entire time. I just don't have enough knowledge myself to explain clearly why it's not possible, and was hoping somebody would do that.


I took deliberate care to specify the terms and the concepts they correspond to, so I am confused by your statements that I was using ambiguous or loaded terminology. I was merely pointing out that the specific tactic proposed is untenable and instead demands a different approach and primitives.

As to your actual question, the underlying concept is fairly easy to do and fully supported by the hardware if done appropriately. There are no difficulties with TLB coherence because the problems are a subset of multithreaded TLB coherence, so any kernel supporting multithreading should already have those mechanisms readily available for repurposing.

The only difficulty is if the software abstractions of the memory manager you are working with disagree with the concept. I am unfamiliar with the Linux kernel, but if the abstractions are wrong for the concept then you probably need to go look and modify code related to processes rather than threads.


Yeah, per-thread virtual memory is called a process. Within a process, the memory manager has to maintain a single virtual address space. You could do it with a red-black tree and a global lock, or you could use potentially cleverer algorithms. Skip lists are friendly to scalable concurrency. You could also do a radix tree of page tables, similar to what x86-64 processors do internally, then have each item be its own atomic long, and use some kind of optimistic transactional locking strategy when allocating mappings that span multiple pages. Personally, I think the rbtree is easiest, since it's better to leave memory scalability to malloc(). The way I like to make malloc() scale is having an arena for each CPU and then indexing them using sched_getcpu().


The Linux kernel actually doesn't know the difference between a thread and a process. The clone syscall (which starts a new thread) takes a bitmask specifying what should be shared between the parent and child: e.g. the address space, open files, or the thread group ID (what most people call a process ID).
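
The practical consequence is easy to observe from userspace. A quick sketch (POSIX-only, illustrative):

```python
import os
import threading

counter = [0]

def bump():
    counter[0] += 1

# A thread is a clone() with CLONE_VM (among other flags): it shares
# the address space, so its write is visible to the parent.
t = threading.Thread(target=bump)
t.start(); t.join()
print(counter[0])   # 1

# fork() is a clone() sharing (almost) nothing: the child writes to a
# copy-on-write copy, so the parent's value is unchanged.
pid = os.fork()
if pid == 0:
    counter[0] += 100
    os._exit(0)
os.waitpid(pid, 0)
print(counter[0])   # still 1
```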


I don't know that I see the practical difference here: a seekable stream and a random access byte array are one very thin abstraction away from each other, with the important caveat that a stream can represent a byte array which is being appended to while you're using it.


As long as only one thread of execution is using it, yes, there’s essentially no difference.

When it’s shared between several, though, seekable streams become annoying (and difficult to impossible to use correctly). When the sharing spans process boundaries, we get the well-known bane of microkernel Unices that TFA describes—you need every file position for every Unix process to be (at least potentially) tracked by a single systemwide “Unix server”. That’s a lot of pain for not a lot of win.

My point was that I can’t think of any non-niche thing that’s naturally a seekable stream—they’re usually either nonseekable or outright random-access. Tape was the natural example when the API was invented, but it is niche now.


But if the stream represents a file that can receive writes, then one way or another you need to keep track of the order in which writes and reads happen - i.e. you need locks - which means you need global state.

If the access to a file was truly read only, then trivially every reader can just have its own seekable FD tracking position locally (or permissions to access the underlying object and open new ones).


From the point of view of the application, yes, you need to coordinate. But that’s a concern for the application. From the point of view of the kernel / device server / however your microkernel OS works, requests for disk I/O arrive in some order, probably update some cache pages or whatnot, then go onto disk’s queue. That’s arguably global, but it feels like a logical extension of the fact that you only have a single physical disk. However you’re going to multiplex it, something like that is still going to happen, and it has little to do with Unix.

What is not a natural extension of the whole thing is that, even on an otherwise completely quiescent system,

  int fd1 = open("foo", O_RDWR), fd2 = dup(fd1);
behaves very differently from

  int fd1 = open("foo", O_RDWR), fd2 = open("foo", O_RDWR);
(imagine these fds are then passed out to other processes or whatnot).

Even with O_RDONLY, these are still not at all the same. Witness the epitome of CLI design that is OpenSSL /s :

  {
          openssl x509 -out "/etc/swanctl/x509/$1.pem"
          while openssl x509 -out "$t" 2>/dev/null; do
                  fp=$(openssl x509 -in "$t" -noout -md5 -fingerprint |
                       sed 's/.*=//; s/://g')
                  mv -f -- "$t" /etc/swanctl/x509ca/$fp.pem
          done
  } < "/etc/ssl/uacme/$1/cert.pem"
This is how you pick apart a PEM cert bundle using OpenSSL: the shell spawns openssl, which reads a single PEM block from stdin, does its dirty deeds with it, and leaves stdin pointing to the next one, ready to be consumed by the next instance of itself that’s yet to be spawned by the shell. You can’t do that with a per-process file position, trivially or otherwise.
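
The shared-position mechanics behind that trick are easy to reproduce in a few lines; here's a sketch (illustrative, with fork standing in for the shell spawning each openssl):

```python
import os, tempfile

# Set up a file with two known "blocks".
tmpfd, path = tempfile.mkstemp()
os.write(tmpfd, b"block-one|block-two|")
os.close(tmpfd)

fd = os.open(path, os.O_RDONLY)   # one file description...
pid = os.fork()
if pid == 0:                      # ...inherited by the forked child
    os.read(fd, 10)               # child consumes "block-one|"
    os._exit(0)
os.waitpid(pid, 0)

# The child's read advanced the shared position, so the parent picks
# up exactly where the child left off -- the openssl loop in a nutshell.
rest = os.read(fd, 10)
print(rest)                       # b'block-two|'
os.close(fd)
os.unlink(path)
```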

Returning to the application side, imagine you’re making a concurrent B-tree. If one thread wants to write out page X and the other page Y, they have presumably already used some locking to make sure that’ll leave the data structure consistent, so they’re free to just issue their pwrite()s and let them happen in whatever order. On the other hand, if all they have is write() and seek(), they have to hit a global lock even if X and Y are completely unrelated.
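
For example, the pwrite() side of that argument as a runnable sketch (the file name and 4 KiB "page" size are arbitrary):

```python
import os, tempfile, threading

tmpfd, path = tempfile.mkstemp()
os.close(tmpfd)
fd = os.open(path, os.O_RDWR)
os.ftruncate(fd, 8192)            # room for two 4 KiB "pages"

def write_page(page_no, byte):
    # pwrite() carries its own offset: no seek, no shared cursor,
    # so both threads can do this with no lock between them.
    os.pwrite(fd, bytes([byte]) * 4096, page_no * 4096)

t1 = threading.Thread(target=write_page, args=(0, ord("X")))
t2 = threading.Thread(target=write_page, args=(1, ord("Y")))
t1.start(); t2.start(); t1.join(); t2.join()

data = os.pread(fd, 8192, 0)
print(data[0:1], data[4096:4097])  # b'X' b'Y'
os.close(fd)
os.unlink(path)
```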


That's true in serial computations, but not true in concurrent ones - the abstraction now includes mutual exclusion in the stream case, and does not require it in the array case.


When you write “file description”, do you mean “file descriptor”?


Crucially, no.

“File descriptions” (or “open file descriptions”; classically, struct file[1]) are what file descriptors designate. The whole hubbub is because a file description is not just a device number, an inode number, and a set of already-checked permissions; they are all that and a position in the file, and that position gets shared along with the rest of the description when you dup(), fork(), or sendmsg() the file descriptor—but not if you open() the same file again, not even[2] via /proc/self/fd.

As I’ve said, I think the most helpful way to think about this is that a file description is a seekable stream object on top of the underlying random-access disk file, and a reference to that stream object is what you’re passing around when you’re handling (hah) file descriptors.

[1] https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/sys/sys/fi...

[2] https://blog.gnoack.org/post/proc-fd-is-not-dup/
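
A quick way to see the descriptor/description split from userspace (a sketch; any regular file works):

```python
import os, tempfile

tmpfd, path = tempfile.mkstemp()
os.write(tmpfd, b"0123456789abcdef")
os.close(tmpfd)

fd1 = os.open(path, os.O_RDONLY)
fd2 = os.dup(fd1)                  # same open file description
fd3 = os.open(path, os.O_RDONLY)   # a brand-new description of the same file

os.lseek(fd1, 10, os.SEEK_SET)         # move fd1's position...
pos2 = os.lseek(fd2, 0, os.SEEK_CUR)   # ...the dup moved with it: 10
pos3 = os.lseek(fd3, 0, os.SEEK_CUR)   # ...but the reopen did not: 0
print(pos2, pos3)                      # 10 0

for fd in (fd1, fd2, fd3):
    os.close(fd)
os.unlink(path)
```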


File descriptors and file descriptions are not the same thing. Descriptors are references to descriptions, along with some metadata, and multiple descriptors can point to the same description.


> multiple descriptors can point to the same descriptor.

You mean to the same description?

And is this the same as what is referred to as “open file description” in the open(2) man page, or are there also other “file descriptions”?


Yes, and yes.


Passing FDs around via Unix sockets is important for sandboxing in e.g. Chrome and Firefox.
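
Python 3.9+ wraps the SCM_RIGHTS sendmsg/recvmsg dance as socket.send_fds/recv_fds, which makes for a compact sketch of handing over a capability rather than a path:

```python
import os, socket

# In a real sandbox these would be the broker and the confined child
# process; a socketpair keeps the sketch in a single process.
broker, sandbox = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

r, w = os.pipe()
# Hand over the fd itself (a capability), not a path the receiver
# would need filesystem permission to open.
socket.send_fds(broker, [b"here"], [r])
msg, fds, _flags, _addr = socket.recv_fds(sandbox, 16, 1)

os.write(w, b"payload")
received = os.read(fds[0], 7)
print(msg, received)              # b'here' b'payload'
```

The receiver never learns (or needs) the name of the underlying object; it can only do what the fd permits.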


Those who understand Unix are condemned to reproduce it exactly.


Which is why the next OS might be written by younglings who know little to nothing about "modern" OSes and just design from first principles, instead of being beholden to a gigantic historical architecture debt.

We're still running on kernels written in 1990. Time for an upgrade?


We still have banks running code even older than that. Do you want banks to take a large risk and potentially impact millions of people because "old code is bad code"? There is a time when an upgrade makes sense; not everything needs to be re-developed. We discovered fire a very long time ago. I don't think we need to retire fire because it's so ancient.


I wholly believe it's impossible to get out of our "local minimum" without redeveloping from zero at some point.

So many of our technologies are just there because they solve some problem we created ourselves.


I don't think it's possible at all. Entertain the thought. If for nothing else, then for social reasons. A hard fork of the entire ecosystem is the only way. Every single piece of software we write today depends on every other. It's not a tree, as commonly thought; it's a cyclic graph.


Communicating using a protocol from the '90s, using a network stack from the '80s, running on a CPU architecture from the '70s.

Being old and being unsuitable for its purpose are orthogonal.


I disagree. While some amount of stability is a precursor to usability, it does hold us back from progress. I posit we hit a [complexity] ceiling and we need to do a clean rewrite. Maybe hardware guys in the back can't, but we software people surely can.


You are not disagreeing with me.


Yeah, sorry for not expressing this clearly: I believe software also ages with literal time. Because the real world changes, and software ultimately runs in the real world. Being old and being unfit for its purpose are pretty much linked when it comes to software.


we still use roads and wheels, which are positively antique.


Why emulate POSIX signal semantics? Those are a holdover from the single-thread UNIX era. Events should be delivered by threads or async calls. Signals should be reserved for exceptions, things you don't return to, like segfaults.


The rationale was discussed in-depth earlier on LWN (https://lwn.net/Articles/982186/).

Signals are useful as a software analogue to hardware interrupts, even though they are probably too low-level for most application use cases. They are for example used by the Golang runtime to preempt threads AFAIU, which wouldn't otherwise be possible non-cooperatively. Intel even included an ISA extension called user-IPIs, which is similar to signals.
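
As a loose userspace analogue (using Python's signal module, not the SIGURG machinery the Go runtime actually uses), a signal can wrest control away from code that never yields voluntarily:

```python
import signal

class Preempted(Exception):
    pass

def on_alarm(signum, frame):
    # Delivered asynchronously, like a hardware interrupt: the loop
    # below never yields, yet control is taken away from it.
    raise Preempted

signal.signal(signal.SIGALRM, on_alarm)
signal.setitimer(signal.ITIMER_REAL, 0.05)   # fire once, in 50 ms

preempted = False
try:
    while True:   # tight loop with no cooperative yield point
        pass
except Preempted:
    preempted = True
print(preempted)  # True
```

No polling or cooperation from the loop is required, which is exactly what makes signals (and user-IPIs) useful for non-cooperative preemption.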


Would shared pipes not have approximately the same issue wrt file positions being shared state? A process group (cron job, systemd service) sharing a pipe for error output is probably a bit more common than directly having an on-disk file.


Pipes lack a file cursor, but they still need to store the fcntl file description flags, so yes. This is emulated by the Redox kernel for old schemes, by inserting fcntl calls between the legacy read/write calls that don't allow passing the offset and flags. However, I'd argue that the fcntl flags are even more client-oriented than the cursor -- say, O_NONBLOCK. Should it really be possible for isolated processes to set each other's nonblock flag (unless using preadv2/pwritev2, which is non-POSIX)?
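
The sharing being complained about is easy to demonstrate (a sketch on a pipe, where dup stands in for the other process):

```python
import fcntl, os

r, w = os.pipe()
r_dup = os.dup(r)   # second descriptor, same file description

flags = fcntl.fcntl(r, fcntl.F_GETFL)
fcntl.fcntl(r, fcntl.F_SETFL, flags | os.O_NONBLOCK)

# The status flags live in the shared description, so the other
# descriptor (which could belong to another process entirely) now
# sees O_NONBLOCK set as well.
nb = fcntl.fcntl(r_dup, fcntl.F_GETFL) & os.O_NONBLOCK
print(bool(nb))     # True
os.close(r); os.close(w); os.close(r_dup)
```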


This is a very typical Rust project. Trying to change the old world while tightly following the old world in a very unimaginative way.

Maybe too much memory safety is bad for creativity?


It's the way of the world. Once memory safety has properly infiltrated most of the world, we can move on to (re-)discover other kinds of safety.


maybe too little memory safety means your wit overflows back to 0?


Rust overflows in the same way (granted it does provide convenient built-in wrappers to check for overflows).


I love how "secure" just translates to "memory safety" now. Apparently OpenBSD could have just been a basic Rust kernel and stopped thinking about all the other security mechanisms they have


quality post from `throwaway984393`.



