About Flat vs. Object stores I wrote this piece, frustrated by, what looks to me...

zasdffaa · on June 30, 2022

It's a rant and it's not information heavy. I've never heard of this processor (and provide a link, don't tell me to look it up).

Other OO processors: Rekursiv (https://en.wikipedia.org/wiki/Rekursiv) and that famously slooow intel iAPX 432 (https://en.wikipedia.org/wiki/Intel_iAPX_432)

Interesting but neither (esp. the latter) were known as racehorses.

I'd prefer to keep the hardware simple and fast and push the complexity into the software, and prove stuff.

> would benefit from an HW/MMU architecture which focused on delivering fast object service, rather than flat address-spaces which software must then convert into objects.

That conversion may not be cheap (edit: badly phrased, the object mapping process and hardware may not be cheaper (edit again: = faster) than that mapping done by the MMU for conventional memory) - can you exaplain how it would be done such that it would be cheaper in time than the current mapping in the common/hot/optimistic path, and how it would not be worse than it is now on the rare/cold/pessimistic path? And how it would behave on average, between those 2 extremes?

And why objects everywhere would be better all-round?

kragen · on June 30, 2022

Probably if you haven't heard of CHERI and can't be bothered to Google it when one of the most respected systems architects around tells you it's worth looking at, you aren't going to put in the effort needed to read the CHERI tech reports so you can have an informed opinion on the performance cost of putting this kind of protection into hardware. And if the only historical "OO" processors you can think of are Rekursiv and iAPX432, and not the Dorado or the Symbolics line or the B5000/B5500/ClearPath line, it sounds like you have a lot of reading to do to get to having an informed opinion.

zasdffaa · on June 30, 2022

Symbolics was a tagged arch, not OO. Ditto Burroughs. They were not OO. Dunno about dorado.

kragen · on July 1, 2022

The Symbolics 3600, the B5000, and the Smalltalk microcode for the Dorado all had generic operation dispatching in hardware, though they varied in how flexible it was. The iAPX432 and the Rational R1000, as far as I can tell, didn't. Generic late-bound operation dispatching is the essential core of OO.

For many years AMD64 CPUs have had hardware acceleration for this sort of thing in the form of a "branch target buffer", so in this very important sense they're more OO than the iAPX432, though they don't have the hardware bounds-checking and dynamic type checking that all of the other architectures we're discussing did.

Of these, only the Smalltalk microcode for the Dorado came close to the level of hardware support for OO that something like SiliconSqueak has.

You're just repeating buzzwords rather than taking the time to understand what they refer to.

zasdffaa · on July 1, 2022

> The Symbolics 3600, the B5000, and the Smalltalk microcode for the Dorado all had generic operation dispatching in hardware

Regarding the symbolics, that seems highly unlikely as lisp is not an object oriented language unless the MOP is draped over it (which is multi-dispatch IIRC and that's not going into hardware). Please provide some links to show I'm wrong.

> The iAPX432 and the Rational R1000, as far as I can tell, didn't.

"9.4.2 Procedure Call and Context Objects To transfer control to a procedure, a program executes a CALL instruction, causing the procedure to be invoked. On exe- cution of a CALL instruction, the hardware constructs a new context object. The context object is the procedure invocation record and defines the dynamic addressing environment in which the procedure executes. All addressing of objects and scalars occurs through the context object, and the context is the root of all objects reachable by the procedure."

from https://homes.cs.washington.edu/~levy/capabook/Chapter9.pdf regarding the Intel iAPX 432

Note: dynamic addressing environment. Repeat: 'dynamic'

> Branch target buffer

Oh give me a break, this is just branch prediction and a little caching, it's nothing to do with OO/dispatch because there is no generic dispatch involved. It's just an optimisation for normal dispatch, nothing else. If you don't understand what a branch predictor actually does... ech

> Dorado

I'm not familiar with Dorado, can you provide a link showing this and preferably a bit more information as well actually stating this clearly.

> You're just repeating buzzwords rather than taking the time to understand what they refer to.

I do get tired of HN, I come to learn, I get dragged down by clammy seaweed posts like this, just claims, no concrete anything ("as far as I can tell"), from people who know even less than me ("Generic late-bound operation dispatching .... For many years AMD64 CPUs have had hardware acceleration for this sort of thing in the form of a 'branch target buffer' OMG just stop talking). Don't lecture me until you can deliver the goods, then and only then, lecture away because then I'll be listening.

lispm · on July 1, 2022

> Regarding the symbolics, that seems highly unlikely as lisp is not an object oriented language

The Symbolics Genera operating system is largely written in object-oriented style using the Flavors OO-extension. Since the early machines had a micro-programmable CPU, there were with new operating system releases also new CPU extensions to support new Lisp, OOP or logic language (Prolog) features.

Beyond that: Lisp originally used 'generic operations' in a non-OO sense. For example the + operation works for all the kinds of numbers (integer + integer, float + float, integer + float, integer + complex, ... and so on). The CPU determines at runtime which operation runs. Thus there is only one generic ADD instruction and not per-type instructions.

zasdffaa · on July 1, 2022

Very informative, thank you

kragen · on July 1, 2022

> Note: dynamic addressing environment. Repeat: 'dynamic'

"Dynamic addressing environment" in this context refers to the stack frame in which the procedure stores its local variables (and which may contain, for example, a link to the stack frame of an enclosing procedure, as in Pascal). Lots of things can be dynamic, which is to say, determined at run-time; method dispatch is just one of them. This is a good example of you repeating buzzwords without understanding what they refer to, although in this case the buzzword is also a technical term with a precise meaning.

Intel liked to use the term "object-oriented" to describe the iAPX432 because it was fashionable, but their idea of "objects" was more like CLU "clusters" as filtered through Ada, not the Smalltalk approach the term "object-oriented" was invented to describe.

You're also confusing CLOS and Flavors with CLOS's MOP.

> If you don't understand what a branch predictor actually does...

Possibly in five or ten years if you read this conversation again you will be in a better position to learn from it; right now you seem to be too full of ego to take advantage of the opportunity. Save a bookmark and maybe put a reminder in your calendar.

> Please provide some links to show I'm wrong.

Helping you stop being wrong is not really my responsibility :)

You're treating knowledge as a repulsive medicine that needs to be forced on you, not a precious treasure that merits seeking out. The problem with this is that if you only change your mind when it's profitable for someone else to talk you out of your mistakes, you'll just end up being exploited (and continuing to parrot half-understood nonsense in technical discussions). It isn't society's responsibility to give you the cognitive tools you need to realize your potential; it's yours.

djmips · on July 1, 2022

Starting a sentence with `You` is a signal to revaluate that sentence. It comes across as rude.

kragen · on July 3, 2022

How would you suggest rephrasing a complete, sweeping dismissal of someone's comments, on the basis that they evidently have no relevant knowledge, so that it doesn't come across as rude?

zasdffaa · on July 3, 2022

I screwed up and I must apologise.

I took a bad situation and made it worse. I have reasons but it shouldn't have happened. I am not happy that I clearly and unnecessarily annoyed you, and would prefer that we put out this fire and move on with a better mood for both of us, and hopefully I can do better next time.

Again, I am genuinely sorry.

kragen · on July 4, 2022

No worries. I'm not injured. I hope you feel better soon, and I hope the conversation doesn't turn you off from learning about computer architectures!

zasdffaa · on July 4, 2022

Thanks! Take care.

kragen · on July 4, 2022

Likewise! I'll try to be gentler next time we interact.

klelatti · on June 30, 2022

On the subject of the iAPX 432 is anyone aware of an emulator? The only material I’ve seen is quite high level and doesn’t go down to ISA level.

p_l · on June 30, 2022

iAPX 432 arguably wasn't bad due to OOP orientation, but due to a bunch of several issues in how intel went about executing the idea. To the point that if they mismanaged a "normal" cpu this way they would have bungled it similarily.

eru · on June 30, 2022

Thanks for writing the article and explaining some background here.

Are you familiar with exokernels? They were an attempt to remove abstractions from kernel land and leave that to applications.

See https://www.classes.cs.uchicago.edu/archive/2019/winter/3310...

That way innovation can be much faster, because applications can generally move quicker than kernels.

Btw, I'm not a fan of object orientation; and I don't think our hardware design should be infected by that fad. But I think your criticism of the badly fitting abstraction of flat address spaces is still valid. I am just not sure that 'fast object service' is necessarily the remedy.

phkamp · on June 30, 2022

I'm not a fan of any archtectural radicalism, and tend to think that there are things best done in both hardware, kernel and libraries and applications :-)

That is not to say that the boundaries should be cast in stone, they should obviously be flexible enough that you do not need a complete multi-user management system in a single-service jail or container nor a full-blown journaled COW storage-manager on a small embedded system.

In other words: I am firmly for the "Software Tools" paradigm.

eru · on June 30, 2022

From the paper:

> The defining tragedy of the operating systems community has been the definition of an operating system as software that both multiplexes and abstracts physical resources. The view that the OS should abstract the hardware is based on the assumption that it is possible bath to define abstractions that are appropriate for all areas and to implement them to perform efficiently in all situations. We believe that the fallacy of this quixotic goal is self-evident, and that the operating system problems of the last two decades (poor performance, poor reliability, poor adaptability, and inflexibility) can be traced back to it. The solution we propose is simple: complete elimination of operating system abstractions by lowering the operating system interface to the hardware level.

Basically, they say to let libraries do the abstraction.

The source code of your applications will still mostly look the same as before. It's just that the libraries will do more of the work, and the kernel will do less.

phkamp · on June 30, 2022

Yes, and I dont (quite) buy that argument, but I understand where it comes from.

The problem starts when you, quite sensibly implement something like SHA256 in hardware. It is a perfect example of something hardware does better than software.

But Dennis, Ken and Brian didn't think about cryptographic hash-algorithms when they created UNIX, and because UNIX no longer have a recognized architectural authority, nobody provides a timely architecture for such new features, and instead we end up with all sorts of hackery, some in kernels, some in libraries and some in applications.

SHA256 should be a standard library API, and if the CPU has a HW implementation, the platforms library should spot that and use that, no need to get the kernel involved, it's just a fancy XOR on purely userland data.

But SHA256 being a good example does not mean that we should throw out the baby with the bath-water.

Things like file-systems are incredibly ill-suited for userland implementations.

What they dont say in the article is that they will need monolithic "libraries" for things like filesystems, and to implement things like locking, atomicity, these libraries will have to coordinate amongst the processes which use the filesystem, and must do so without the control and power available to the kernel.

There are ways to do that, see for instance MACH or the original MINIX. It transpires there are disadvantages.

And that's what I mean by "archtectural radicalism": Try to use the right tool for the job, and sometimes the kernel is the right tool (filesystems) and sometimes it is not (SHA256).

kragen · on June 30, 2022

Which of the disadvantages of microkernel userland filesystems do you think are most important and essential to the concept, and which do you think are a matter of bad implementations? I thought L4 and QNX had pretty reasonable filesystem stories, and even on poor old Linux I've been using FUSE with things like NTFS for years without much trouble. Is it just a matter of the cost of context switching between userland processes when you don't have enough cores?

If it's a question of performance, with enough cores and shared memory that's accessible for atomic operations, I'd think talking to a userland filesystem would just* be a matter of pushing requests onto a lock-free request queue in shared memory from your application process and reading the responses from a lock-free response queue. Of course each application needs its own shared-memory area for talking to the filesystem to get fault isolation.

Even if it's a matter of IPC message-passing cost on a single core, I think L4 has shown how to make that cheap enough that we should regard putting the filesystem in the kernel as a dubious optimization, and at that one that's second-best to granting the application mappings on an NVDIMM or something.

Perhaps this is stating the obvious, but I don't think you can get much fault isolation with a pure library filesystem; if all the processes participating in the filesystem are faulty then there's no way to protect the filesystem from fatal corruption from faults. You might be able to reduce the presumed-correct filesystem core process to something like a simplified Kafka: a process that grants other processes read-only access to an append-only log and accepts properly delimited and identified blocks of data from them to append to it.

If we're interested in efficiency and simplicity of mechanism, though, a library filesystem is likely faster and might be simpler than a conventional monolithic filesystem server, particularly a single-threaded one, because you can rely on blocking I/O. And the library might be able to wrap updates to the persistent store in lock-free transactions to reduce the frequency of filesystem corruption.

The Xerox Alto famously used a single-tasking library filesystem similar to MS-DOS, but each sector was sufficiently self-describing that filesystem corruption was usually minor and easy to recover from. The filesystem directory could be reconstructed from the data blocks when required. Neither the Alto nor MS-DOS had to worry about locking, though!

KeyKOS, as you know, took a lot of the ideas from the CAP machine and similar capability machines (and languages like Smalltalk), and implemented them on IBM 370 hardware using its regular MMU, with L4-like lightweight IPCs through the kernel for capability invocations. It went to the opposite extreme from having a library filesystem: each directory and each file was a "domain" of its own, which is to say a single-threaded process. Persistence was handled by a systemwide copy-on-write snapshot of the whole system state, plus a journal-sync call their database used to provide durable transactions. EUMEL and L3 took similar approaches; L4 instead takes persistence and even virtual memory out of the kernel.

I wrote some somewhat sketchy notes on how Flash performance suggests rearchitecting things the other day at https://news.ycombinator.com/item?id=31902551; I know you have a very substantial amount of experience with this as a result of Varnish and your involvement with Fastly. What do you think?

______

* "Just" may be a loaded term here.

phkamp · on July 1, 2022

Sorry, overlooked your question.

First, I have not been actively involved in Fastly, apart from telling Artur to "go for it!" :-)

With respect to Flash technology I have noted elsewhere in this discussion that today our SSD devices effectively contain a filesystem in order to pretend they are disks, and that stacking two filesystems on top of each other is ... suboptimal.

But as I also just noted, flash isn't just flash, some properties are very hard to generalize, so i tend to think that we will have to let the people who decide what to solder onto the PCB provide at least the wear-levelling.

If I were to design an OS today, I think I would stick firmly with the good ol' UNIX name-hierarchy model, but I would probably split the filesystem layer horizontally in a common and uniform "naming layer" serviced by per-mount "object stores".

If you look at FreeBSD, you will see that UFS/FFS is sorta-split that way, but I would move the cut slightly and think in terms of other primitives which take SSD and networks better into account, but see also: Plan9.

The service I would want from a SSD device is simply:

A) Write object, tell me it's name when written.

B) Read object named $bla

C) Forget object named $bla

Then I'll build my filesystem on top of that.

(The NVME crew seems to be moving in the right direction, but it is my impression that some patents prevent them from DTRT, just like Sun's "Prestoserve" patent held up development.).

kragen · on July 2, 2022

Thanks! Interesting!

eru · on July 1, 2022

Keep in mind that (conventional) micro-kernels are not the same as exokernels.

FUSE is fun, I've written my own filesystems with it, but it's basically a micro-kernel idea, not an exokernel one. (L4 is also great! But I don't think it qualifies as an exokernel?)

https://pdos.csail.mit.edu/6.828/2019/lec/faq-exokernel.txt explains a lot about exokernels that's not mentioned in the papers.

Exokernels never caught on, at least not under that name. The closest equivalent in widespread use today are actually hypervisors for running virtual machines. (Especially if you are running a so called 'unikernel' on them.)

About filesystems: if you just want the kinds of abstractions that conventional filesystems already give you, you won't get too much out of using an exokernel. (As you mention, perhaps you can get a bit of extra performance?) From the FAQ I linked above:

> Q: In what kind of applications is an exokernel operating system preferable? There are naturally tradeoffs with the extra flexibility provided e.g. it is easier to make a mistake in user code.

> A: An exokernel is most attractive to an application that needs to do something that is possible with an exokernel, but not possible with other kernels. The main area in which the 1995 exokernel paper increased flexibility was virtual memory. It turns out there are a bunch of neat techniques applications can use if they have low-level access to virtual memory mappings; the Appel and Li paper (citation [5]) discusses some of them. Examples include distributed shared memory and certain garbage collection tricks. Many operating systems in 1995 didn't give enough low-level access to virtual memory mappings to implement such techniques, but the exokernel did. The exokernel authors wrote a later paper (in SOSP 1997) that describes some examples in much more depth, including a web server that uses a customized file system layout to provide very high performance.

The HN submission we are nominally discussing here is also about memory, so that might be applicable.

An example for filesystems I could envision: direct low-level hardware access to an SSD's internals for a database. Databases don't really care about files, and might also want to deal with SSD's peculiar writing processes in a way that's different from the abstractions typical file systems give you.

> Perhaps this is stating the obvious, but I don't think you can get much fault isolation with a pure library filesystem; if all the processes participating in the filesystem are faulty then there's no way to protect the filesystem from fatal corruption from faults. You might be able to reduce the presumed-correct filesystem core process to something like a simplified Kafka: a process that grants other processes read-only access to an append-only log and accepts properly delimited and identified blocks of data from them to append to it.

That might be possible, but wouldn't really be faster than letting a kernel handle it, I'd guess? (But it would perhaps be more flexible to develop, since it's all userland.) You can also take inspiration from how eBPF allows you to upload user level logic into the Linux kernel and run them securely. Instead of uploading them into the kernel, you could also upload them into your filesystem service, I guess?

Some of the original exokernel papers had some more interesting ideas sketched out.

> I know you have a very substantial amount of experience with this as a result of Varnish and your involvement with Fastly. What do you think?

I'm afraid you are mixing me up with someone else?

kragen · on July 1, 2022

I agree that L4 is not an exokernel, though it does go a little further in the exokernel direction than conventional microkernels. I agree that FUSE is microkernelish rather than exokernelish, though there's nothing in the exokernel concept as I understand it that excludes the possibility of having servers for things like some or all of your filesystem functionality.

Databases are indeed an application that commonly suffers from having to run on top of a filesystem.

> That might be possible, but wouldn't really be faster than letting a kernel handle it, I'd guess?

I think reading files by invoking library calls that follow pointers around a memory-mapped filesystem might well be faster than reading files by repeatedly context-switching back and forth into even a supervisor-mode kernel, much less IPC rendezvous via a kernel with a filesystem server. This is particularly true in the brave new SSD world where context switch time is comparable to block device I/O latency, rather than being orders of magnitude smaller.

Writes to Kafka are very cheap and support extreme fan-in because the Kafka design pushes almost all the work out to the clients; the Kafka server does very little more than appending chunks of bytes, containing potentially many separate operations, to a log. It seems very plausible to me that this could be faster than handling a series of individual filesystem operations (whether in a kernel or in a microkernel-style server), at least for some applications; particularly with orders of magnitude lower penalties for nonlocality of reference than for traditional filesystems, and for applications where many writes are never read.

Running logic in the kernel or in a server using a restrictive interpreter is indeed an interesting architectural possibility, but from a certain point of view it's the opposite extreme from the Kafka approach.

> > I know you have a very substantial amount of experience with this as a result of Varnish and your involvement with Fastly. What do you think?

> I'm afraid you are mixing me up with someone else?

I hope this isn't rude, but I wrote that in response to phk's comment, so I was addressing him in it, not you, eru, although I did enjoy your comment very much as well.

eru · on July 1, 2022

> Running logic in the kernel or in a server using a restrictive interpreter is indeed an interesting architectural possibility, but from a certain point of view it's the opposite extreme from the Kafka approach.

In general, a restricted language. You interpret or compile that language, and still have similar security guarantees.

> I hope this isn't rude, but I wrote that in response to phk's comment, so I was addressing him in it, not you, eru, although I did enjoy your comment very much as well.

Oh, that's fine. I was just confused because that came in a reply to my comment.

SSLy · on July 1, 2022

>I'm afraid you are mixing me up with someone else?

To clear this up – they were addressing phk, their parent comment.

gpderetta · on June 30, 2022

There has been a slow trend to hardware virtualization and moving drivers to userspace. The issue is that often the required hardware support (SR-IOV for example) is locked behind premium SKUs and it is trickling into consumer producsts very slowly. As such OSs will be very slow to embrace it fully.

titzer · on June 30, 2022

I am very sympathetic to this argument overall and trace the hardware industry's failure back to the spread of C, UNIX, and worse-is-better.

With Wasm I see an opportunity to virtualize away the old world of C and linear address spaces. While we designed it to be low level and sandboxed to get C/C++ and Rust on board, I and others have always had in mind a future world where Wasm has managed (GC'd) data, first-class typed functions, and more. Those features should support a wide variety of source languages.

Wasm should become the new, final ISA. It should become like Unicode; the final, most widely-supported[1] format for executable code. When all languages and programs can run easily on Wasm, then hardware can easily be swapped out.

[1] Sure, Unicode has flaws, and it doesn't support everything equally as well. But it had the property that basically everyone dropped everything else in favor it, because it gained all the momentum.

duped · on June 30, 2022

As I understand it, wasm modules depend heavily on linear memory in growable buffers in each module.

titzer · on June 30, 2022

Yes, because linear memory is the only game in town right now :)

FunnyBadger · on June 30, 2022

This is correct.

A lot of this due to the hardware architecture itself. The software abstractions dictated/limited by the HW itself causes many of the risks!

If you designed BOTH HW and SW up to and including the OS, you _might_ have a chance to control the risks better. But by the very separation of duties and roles, papered over by abstraction itself, you create problems. ALL abstractions throw away information and eventually those abstractions bite you in the ass.

This was the case with digital logic once HW speeds rose to a critical level - suddenly the reality that digital is merely an abstraction upon analog and the very abstraction of lumped-model analog started failing which caused digital fail as well.

We definitely can have and have had the same failure occurring with von Neumann architecture - there's NOTHING magical about it that immunizing against model abstraction failure and it can creation "intrinsic failures" that can never be fixed thanks to Gödel's incompleteness theorem.

benreesman · on June 30, 2022

It’s a big thread (by virtue of being a cool piece), so maybe someone said this already, but isn’t there kind of a middle ground where we let existing software continue to run but acknowledge a changing hardware landscape by really cranking up (default) page sizes?

Userland allocators already work pretty hard to hit in the TLB [1], but huge-page tuning and whatnot is, to your point, only hitting the sweet spot on modern gear via effort/luck.

[1] https://engineering.fb.com/2011/01/03/core-data/scalable-mem...

Symmetry · on June 30, 2022

Channeling my inner Linux, defaulting to larger pages means a drastic loss in memory efficiency when mmaping lots of small files as happens when someone compiles the Linux kernel. If you've got a 2kb file and 4kb pages then half the memory you allocate when the file is paged in is wasted. For larger pages that goes way up.

benreesman · on June 30, 2022

Absolutely, but you also want a different scheduler for a low-latency server than you do for your desktop.

One size almost never fits all, as I’m sure you’ll agree as someone who cares about compiling the kernel.

With that said, the kernel is pretty good at reclaiming physical pages, so you’d most likely eat into the same disk cache you’re reading from in the scenario you’ve described.

phkamp · on June 30, 2022

I think that is a false dictomy.

CHERI has shown that this kind of fundamental architectural improvements can happen with very little impact to running code.

benreesman · on June 30, 2022

Not to be contradictory, but I’m having a hell of a time getting a toolchain put together that can reproducibly target Linux on x86_64 without liking glibc or libstdc++ and easily accommodate open-source libraries, and it’s not for want of knowing how the individual pieces work.

If you’re promoting computer architecture and OS research, have at it, needs doing.

But that’s a different game to running software on the architectures, operating systems, and tool chains we have today.

kazinator · on June 30, 2022

Re: flash.

A flash adaptation layer solves the following problem: I have M filesystems, that I'd like to use on any one of N different flash technologies. I don't want to complicate each M filesystem with support for N flashes.

I don't think both layers are "filesystem" in the same sense. We don't need the lower filesystem to provide permissions, ownerships, time stamps, directories, symbolic links and such.

Re: linear

A machine address is a word of bits. A word of bits always has a linear interpretation as a binary number. For instance if we have a 16 bit segment ID, and a 32 bit offset, then we can still pretend that it's a 48 bit linear address. We can compare two pointers for inequality, for instance: p0 < p1, as 48 bit words. That space may be sparsely populated, but that doesn't make it nonlinear; the spaces you are calling linear can also be sparsely populated and have numeric conventions about regions.

You say physical memories are linear, but they are also sparsely populated in the same way: such and such a range is a ROM, such and such a range is certain memory mapped registers, DRAM is over there. Generally speaking, hardware treats some part of an address as an ID that it recognizes, and then the bits below that as an offset. When there is a bus read or write, if the ID part matches that hardware device, then it selects itself for that I/O operation, and uses the offset bits to provide access to the right thing. So physical memory is arguably nonlinear; it is like a subclassed IP address space.

Physical addresses can have bits which are not used for addressing; e.g. a certain range of memory might be available as uncached if you enable a certain upper bit in the address. That looks like linear, but with aliasing between distant regions. Linear virtual memory is sparsely populated; there are mapped pages and unmapped pages. Pages can alias: the same object can be mapped in multiple places, so a write here can be read there.

If you want to split an address into an object ID and offset, you have to gamble about how many bits you need for each one. One application has hundreds of millions of tiny objects: it wants a big object ID part of the address, and a small offset. Another one has a small number of huge objects: it doesn't care for large object ID, but wants big offsets. Either you make that configurable (slow gets even slower), or else waste bits on making both spaces larger at the same time, perhaps ending up with wasteful 128 bit pointers on a system where 64 is more than enough.

All interpretations of the bits of an address above and beyond "pure binary number" add complication and overhead. The hardware (e.g. DMA bus master) isn't going to understand objects; it will always want a simple address.

Re: C, C++, Go, Rust, PHP, Lisp, Smalltalk

No two implementations of these can agree with each other on what exactly is an object. Implementations will just carry their existing respective portable object representations into the non-linear model. Architectures without flat memory, like the JVM and WebAssembly, tend to only cause pain for Lisp and its ilk.

A Lisp is not going to want some weird object model from the operating system; it will want to do things like packing cons cells tightly together into one larger heap object. That heap object could be a segment. We had those; they are also from the 1960's. Operating systems started ignoring them, opting for just the demand paging support the VM hardware.

rkangel · on June 30, 2022

> A flash adaptation layer solves the following problem: I have M filesystems, that I'd like to use on any one of N different flash technologies. I don't want to complicate each M filesystem with support for N flashes.

I think you could argue that certain file system designs are better suited to different storage hardware. Maybe it's appropriate that the kernel runs different code depending on the underlying storage type?

We already have JEDEC standards for flash devices to report block layout, because it's important in microcontrollers where there is no adapation layer. We could have an SSD/M2 standard that reported that information, and then architecturally kernel FS stuff would probably split into a couple of layers. The 'top' that provides the filesystem features that you're used to in something like ZFS, and the bottom storage layer that has a couple of different implementations for 'linear' and 'blocks'.

phkamp · on June 30, 2022

No, that is actually not what the Flash Adaptation Layer does.

The FAL's primary job is to make the Flash array look like a disk, despite the fact that individual "sectors" are not individually rewriteable.

To do this, the FAL implements which is for all practical purposes a filesystem, where the files are all a single sector long and where the filename is the sector number exposed by the "disk".

In other words: Now we have two filesystems on top of each other, one lying to the other, which does a lot of unnecessary work, because it is being lied to.

kazinator · on June 30, 2022

s/for all practical purposes/for the sake of rhetorical convenience in my argument/

> , the FAL implements which is for all practical purposes a filesystem, where the files are all a single sector long and where the filename is the sector number exposed by the "disk".

1. That is not a "file system" comparable to the thing above which you're also calling "file system", which means you're essentially equivocating on the term.

2. Any old magnetic hard drive exposes this same file system: it makes equal-sized sectors available under names that are indices. There is no lookup structure that is not a filesystem under your definition.

phkamp · on July 1, 2022

On a magnetic disk, the exposed "name" is the physical address, on a SSH device, the data is stored ${somewhere} and the FAL has a (pretty interesting!) datastructure to lookup were that is, when you demand a certain "name".

When you write a certain "name", the magnetic disk just overwrites what is already there, at the assigned spot.

When you write a certain "name" on a SSD, a new spot gets allocated (This may require reshuffling akin to garbage collect), and the data structure is updated for the new locations ("=$name") and old location ("unused"), and if making the old location unused means that an entire eraseblock becomes free, then erasing that is scheduled, after which it is added to the free pool.

But that is only the easy part. The hard part is "wear levelling", which essentially amounts to erasing all the blocks approximately the same amount of times, because that is what wears out the gate material in the individual cells.

Wear levelling involves taking data which the host has not asked for, copying to a different location, in order to empty out the erase-block with the least wear (= erase cycles) so that it can shoulder more of the load.

And now comes the truly hideous part: The FAL has to do this in a way where it's own metadata is part of it's own wear-levelling, this is why most competent FAL's have a basic structure much like a classic Log-structured filesystem.

So yes: We do have two filesystems on top of each other, and exposing a more suited model than "array of equal-sized rewritable sectors" could reduce that.

StillBored · on July 1, 2022

Ignoring the fact that just about all the HD's made since the 1980's when IDE became popular can relocate sectors, and since the 2000's didn't even necessarily expose the physical sector sizes and all kinds of other abstractions.

So, the idea that we shouldn't have a disk abstraction to allow actual filesystems to focus on what matters to the user is sorta nonsense. You probably have this idea that all flash disks are the same, and i'm here to tell you they are not, just like not all computers are 8 cores big.little machines. Disks scale from little tiny emmc, controllers to large shared arrays that are doing deduplication, replication, etc,etc,etc and the media can be individual flash chips, massive parallel flash, spinning disks, arrays of spinning disks, mixes of flash and spinning disk, etc, etc, etc.

There have been a half dozen raw flash layers in linux over the past ~15 years, and generally they all suck, make assumptions about the flash that doesn't hold for more than a couple generations, end up slower than off the shelf flash with FTL's (aka what your calling FAL), and have frequently failed to allow the kinds of filesystem layering that one expects of linux/grub/etc. (and then there are the ones running at various hyperscalers that consume more RAM than most people have in their PC's).

phkamp · on July 1, 2022

Bad block remapping is just a trivial table lookup, the first disk drives did it with 8-bit microcontrollers.

In my experience a good FAL is about as hard to write as a good filesystem.

While you can parameterize a lot of things, there are some fundamental properties of the flash cells which make it very hard to write a single FAL which works well with all flash chips.

As a matter of principle I do not comment on issues specific to Linux.

kazinator · on July 2, 2022

> is just a trivial table lookup

So just an inch to the left goalpost of the True Scotsman's "filesystem" definition.

> make it very hard to write a single FAL which works well with all flash chips

Right? So the last thing you want is to foist that logic into filesystems. The layered separation is good.

Hence, someone upthread wrote: A flash adaptation layer solves the following problem: I have M filesystems, that I'd like to use on any one of N different flash technologies. I don't want to complicate each M filesystem with support for N flashes.

phkamp · on July 2, 2022

You seem to be more interested on proving me wrong, than actually understanding what I am saying:

Today SSD's expose a datamodel which makes them look like disks.

To implement that datamodel, on a storage substrate which have radically different semantics, they have to implement what is essentially a (log-structured-)filesystem.

(I happen know this first hand, because I have worked on both file-systems and FALs.)

And that is why I say we have two filesystems stacked on each other.

Your limited understanding of filesystems does not change reality.

Over & out.