
I've found squashfs to be a great archive format. It preserves Linux file ownership and permissions, lets you extract individual files without parsing the entire archive (unlike tar), and it's mountable. It's also openable in 7zip.

I wonder how Pack compares to it, but its home page and GitHub repo don't tell much.




Yeah squashfs is one of the good ones right now.

For sosreports (archives with the output of lots of diagnostic commands and log files from a Linux host), I wanted to find a file format that both uses zstd compression (or something else about as fast and as compressible; sosreports currently often use xz, which is very, very slow) and lets you unpack a single file fast, with an index, ideally so you can mount it loopback or with FUSE, or otherwise just quickly unpack a single file from a many-GB archive.

You'd be surprised that this basically doesn't exist right now. There's a bunch of half-solutions, but no really good, easily available one. Some things add indexes to tar; zstd's code does support partial/offset unpacking without reading the entire archive, but basically no one uses that function, which is kind of silly. There are zip and rar tools with zstd support, but they are not all cross-compatible and mostly don't exist in the packaged Linux versions.

squashfs with zstd added mostly fits the bill.

I was really surprised not to find anything else, given that we had this in Zip and RAR files two decades ago. But nothing that would or could ship on a standard open-source system has managed to modernise that feature set so far.

(If anyone has any pointers let me know :-)


You can do that with Pack:

`pack -i ./test.pack --include=/a/file.txt`

or a couple files and folders at once:

`pack -i ./test.pack --include=/a/file.txt --include=/a/folder/`

Use `--list` to get a list of all files:

`pack -i ./test.pack --list`

Such random access using `--include` is very fast. As an example, extracting just one .c file from the whole codebase of Linux takes about 30 ms on my machine, compared to nearly 500 ms for WinRAR or 2500 ms for tar.gz. And the gap only widens once you count encryption. For now, Pack encryption is not public, but when it is, you will be able to access a file in a locked Pack file in a matter of milliseconds rather than seconds.


I haven't had a chance to use it yet, but https://github.com/mhx/dwarfs claims to be several times faster than squashfs, to compress much better, and to have full FUSE support.


Are you able to seek and selectively extract from squashfs archives using range headers if stored in object storage systems like S3?

Example: https://alexwlchan.net/2019/working-with-large-s3-objects/
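For reference, something like this (a minimal sketch assuming boto3 with configured credentials; the bucket and key names are made up):

  import boto3  # assumption: boto3 installed and AWS credentials configured

  s3 = boto3.client("s3")

  # Fetch only the first KiB of the object instead of downloading all of it.
  resp = s3.get_object(
      Bucket="my-bucket",            # hypothetical bucket
      Key="archives/big.squashfs",   # hypothetical key
      Range="bytes=0-1023",          # standard HTTP byte-range syntax
  )
  chunk = resp["Body"].read()        # 1024 bytes of the object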


Certainly, squashfs is designed to be random-access.


But S3 isn't.


It has to be if you can "seek and selectively extract from" a zip file: the ability to do that relies on the ability to read the end of the archive for the central directory, then read the offset and size you get from that to get at the file you need.

squashfs may or may not be able to do it with as few roundtrips (I don't know the details of its layout), but S3 necessarily provides the capabilities for random access otherwise you'd have to download the entire content either way and the original query would be moot.
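Concretely, the zip-style lookup over ranged reads looks roughly like this (a sketch assuming a read_range(offset, length) helper backed by something like an S3 ranged GET; it ignores ZIP64 and very long archive comments):

  import struct

  EOCD_SIG = b"PK\x05\x06"  # End Of Central Directory signature

  def locate_central_directory(read_range, file_size):
      """read_range(offset, length) -> bytes, e.g. backed by an S3 ranged GET.

      Returns (cd_offset, cd_size) from the EOCD record. Sketch only: ignores
      ZIP64 and assumes the archive comment is shorter than ~64 KiB.
      """
      tail_len = min(file_size, 65 * 1024)
      tail = read_range(file_size - tail_len, tail_len)
      idx = tail.rfind(EOCD_SIG)
      if idx < 0:
          raise ValueError("EOCD record not found")
      # EOCD: sig(4) disk(2) cd_disk(2) n_disk(2) n_total(2) cd_size(4) cd_offset(4)
      cd_size, cd_offset = struct.unpack_from("<II", tail, idx + 12)
      return cd_offset, cd_size

From there, one more ranged read of cd_size bytes at cd_offset fetches the central directory, and a further ranged read around the member's local header gets you its data, so a handful of roundtrips is enough.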


You can read sequentially through a zip file and dynamically build up a central directory yourself and do whatever desired operations.

There's the caveat that the zipfile itself may contain entries that aren't mentioned in the actual central directory of the zipfile.


> You can read sequentially through a zip file and dynamically build up a central directory yourself and do whatever desired operations.

First, zip files already have a central directory so why would you do that?

Second, you seem to be missing the subject of this subthread entirely, the point is being able to selectively access S3 content without downloading the whole archive. If you sequentially read through the entire zip file, you are in fact downloading the whole archive.


Sorry, I wasn't clear before. You don't need the central directory to process a zipfile. You don't need random access to a zipfile to process it.

A zipfile can be treated as a stream of data and processed as each individual zip entry is seen in the download/read. NO random access is required.

Just enough memory for the local directory entry and buffers for the zipped data and unzipped data. The first two items should be covered by the downloaded zipfile buffer.

If you want to process the zipfile ASAP or don't have the resources to download the entire zipfile first before processing the zipfile, then this is a valid manner to handle the zipfile. If your desired data occurs before the entire zipfile has been downloaded, you can stop the download.

A zipfile can also be treated as a randomly accessed file as you mentioned. Some operations are faster that way - like browsing each zip entry's metadata.
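As a sketch of that sequential mode (assuming stored or deflated entries whose sizes are present in the local header, i.e. no data descriptors, ZIP64, or encryption):

  import struct
  import zlib

  LOCAL_SIG = b"PK\x03\x04"

  def stream_entries(stream):
      """Yield (name, data) for entries read sequentially from a file-like stream.

      Sketch only: assumes stored or deflated entries with sizes present in the
      local header (no data descriptors), and no ZIP64 or encryption.
      """
      while True:
          sig = stream.read(4)
          if sig != LOCAL_SIG:
              break  # hit the central directory (or EOF): no more local entries
          (_, flags, method, _, _, _, comp_size, _, name_len, extra_len) = \
              struct.unpack("<HHHHHIIIHH", stream.read(26))
          name = stream.read(name_len).decode("utf-8", "replace")
          stream.read(extra_len)                 # skip the extra field
          raw = stream.read(comp_size)
          data = zlib.decompress(raw, -15) if method == 8 else raw
          yield name, data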


It is. S3 supports fetching arbitrary byte ranges from files.


> It's also openable in 7zip

If only 7zip could also create them on Windows (it apparently can create WIM, which seems like a direct Windows-native counterpart and is also mountable on Linux).


WIM is the closest thing Windows has to a full file-based capture, but I've noticed that even it doesn't capture everything, unfortunately. I forget exactly, but I think it was extended attributes that DISM wouldn't capture, despite the /EA flag. Not sure if that was a file format limitation or just a DISM bug.


Very sad. Cross-platform extended attributes are the very thing I would love. I even imagine a new archive format which would be just a key-key-value store (I mean it - two keys, i.e. a set of key-value pairs for every top-level key; this is EAs / NTFS streams) with the values compressed using a common dictionary (and possibly encrypted/signed with a common key). Needless to say, such a format would enable almost any use case, especially if the layout of the file itself is architected right. macOS wouldn't have to add its special folder (the one it adds to every ZIP) anymore; tagging files and saving any metadata about them would be possible, as would saving multiple versions of the same file, or alternative names for it (e.g. the name you received it with and the name you renamed it to).

I even dream about the day when a file's main stream would be pure data and all the metadata would go into EAs. Imagine an MP3 file where the main stream records only the sound, with no ID3; all the metadata, like the artist and song names, is handled as EAs and can be used in file-operation commands.

This can also be made tape-friendly and eliminate the need for TAR. Just make sure files' streams/EAs are written contiguously, closely related streams go right next to each other, compression is optional, and the ToC+dictionary is replicated in a number of places, like the beginning, the middle and the end.

As you might have guessed, I use all the major OSes and single-OS solutions are of little use to me. Apparently I'd just use SquashFS, but its use is limited on Windows because you can hardly make or mount one there - only unpack it with 7zip.


It's easy to forget about supporting EAs on Windows - they are extremely uncommon because you practically need to be in kernelspace to write them. Ntoskrnl.exe has one or two EAs, almost nothing else does.

(ADS are super commonplace and the closer analogue to posix xattrs.)


I didn't know this, thanks. I thought xattrs and ADS were synonymous. Do SquashFS, ext4, HFS+ and APFS have ADS then?

I am looking forward to writing my own cross-platform app which would rely on attaching additional data and metadata to my files.

"need to be in kernelspace" does not sound very scary because a user app probably doesn't need to do this very job itself - isn't there usually an OS-native command-line tool which can be invoked to do it?


But it is read-only?

I was recently trying to change a single file in a squashfs container and could not find a way to do that.


That's exactly what I'd like to avoid. I want to transfer a group of files (either to myself, friends, or website visitors), not make assumptions about the target system's permission set. For copies of my own data where permissions are relevant, I've got a restic backup.

Wake me up if a simple standard comes to pass that has no user/group IDs, mode fields, bit-packed two-second-precision timestamps, or similar silliness. Perhaps an executable bit, for systems that insist on such things for being able to use the download in the intended way.

(I self-made this before: a simple length-prefixed concatenation of filename and contents fields. The problem is that people would have to download an unpacker. That's not broadly useful unless it is, as in that one case, a software distribution which they're going to run anyway)


No, too simple.

Sometimes you want to include data and sometimes you don't, for different reasons in different contexts. It's not a data handler's job to decide what data is or isn't included; it's the sender's job to decide what not to include and the receiver's job to decide what to ignore.

The simplest example is probably just the file path. tar or zip don't try to say whether or not a file in the container includes the full absolute path, a portion of the path, or no path.

The container should ideally be able to contain anything that any filesystem might have, or else it's not a generally useful tool, it's some annoying domain-specific specialized tool that one guy just luuuuuvs for his one use-case he thinks is obviously the most rational thing for anyone.

If you don't want to include something like a uid, say for security reasons not to disclose the internal workings of something on your end, then arrange not to include it when creating the archive, the same way you wouldn't necessarily include the full path to the same files. Or outside of a security concern like that, include all the data and let the recipient simply ignore any data that it doesn't support.


Good argument, I've mostly come around to your view. The little "but" that I still see is that the current file formats don't let you omit fields you don't want to pass on, and most decoders don't let you omit fields you don't want to interpret/use while unpacking.

Even if a given decoder could, though, most users wouldn't be able to use that, and so they'd get files dated 1970 or 1980 if I don't want to pass the timestamp on and set it to zeroes; better would be if the field could be omitted entirely (as in, if the header weren't fixed-length but extensible, like an IP packet). So I'd still like a "better" archiving format than the ones we have today (though I'm not familiar with the internals of every one of them, like 7z or the mentioned squashfs, so tell me if this already exists), but I agree such a format should just support everything ~every filesystem supports.


Oh sure, I was talking in generalities about an imaginary archiver and what an archiver should have, not any particular existing one.

OS and filesystem features differ all over the place, and there will be totally new filesystems and totally new metadata tomorrow. There is practically no common denominator, not even basic ASCII for the filename, let alone any other metadata.

So there should just be metadata fields, where about the only thing actually part of the spec is the structure of a metadata field, not any particular keys or values or number or order of fields. The writer might or might not include a field for, say, creation time, and the reader might or might not care about it. If the reader doesn't recognize some strange new xattr field that only got invented yesterday, no problem, because it does know what a field is, and how to consume and discard fields it doesn't care about.

There would be a few fields that most readers and writers would all just recognize by convention, the usual basics like filename. Even the filename might not technically be a requirement, but maybe an RFC documents a short list of standard fields just to give everyone a reference point. But, for instance, it might be legal to have nothing but some UUIDs or something.
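As a rough illustration of what I mean by "the structure of a metadata field" (a hypothetical length-prefixed key/value layout, not any existing format):

  import struct

  def encode_field(key: bytes, value: bytes) -> bytes:
      # Length-prefixed key and value: a reader can skip any key it doesn't know.
      return struct.pack("<HI", len(key), len(value)) + key + value

  def decode_fields(blob: bytes):
      off = 0
      while off < len(blob):
          key_len, val_len = struct.unpack_from("<HI", blob, off)
          off += struct.calcsize("<HI")
          key, off = blob[off:off + key_len], off + key_len
          value, off = blob[off:off + val_len], off + val_len
          yield key, value

  # A writer includes whatever it knows; a reader ignores unfamiliar keys.
  record = encode_field(b"name", b"hosts") + encode_field(b"xattr.origin", b"backup-42")
  print(dict(decode_fields(record)))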

That's getting a bit weird but my main point was just that it's wrong to say an archiver shouldn't include timestamps or uids just because one use of archive files is to transfer files from a unix system to a windows system, or from a system with a "bob" user to a system with no "bob" user.


The arguments for tar are --preserve-permissions and --touch (don't extract file modified time).

For unzip, -D skips restoration of timestamps.

For unrar, -ai ignores file attributes, -ts restores the modification time.

There are similar arguments for omitting these when creating the archive, they set the field to a default or specified value, or omit it entirely, depending on the format.


Those are user controls, to allow the user on one end to decide what to put into the container, and there are others to allow the user at the other side to decide what to take out of the container, not limits of the container.

The comment I'm replying to suggested that since one use case results in metadata that is meaningless or mis-matched between sender and receiver, the format itself should not even have the ability to record that metadata.


Is "absolute path" a coherent concept when you are talking about 2 systems?


Is this question a coherent concept when it doesn't change anything when you substitute any other term like "full path" or "as much path as exists" or "any path"?


D:\etc\your.conf would like a word, they seem lost and confused.


It can be if you make assumptions about the basic structure of both systems. Some people rely on this behavior. It can be a good idea or a bad idea, depending on what you're doing.


I agree very much with this. Something that annoys me is how much information tar files leak. Like, you don't need to know the username or groupname of the person that originally owned the files. You don't need to copy around any mode bit other than "executable". You definitely don't need "last modified" timestamps, which exist only to make builds that produce archives non-hermetic.

Frankly, I don't even want any of these things on my mounted filesystem either.

> The problem is that people would have to download an unpacker.

Your archive format just needs to be an executable that runs on every platform. https://github.com/jart/cosmopolitan is something that could help with that. ("Who would execute an archive? It could do anything," I hear you scream. Well, tell that to anyone who has run "curl | bash".)


  tar --create --owner=0 --group=0 --mtime='2000-01-01 00:00:00' \
    --mode='go-rwxst' --file test.tar /bin/dash /etc/hosts

  tar --list --verbose --file test.tar
  -rwx------ root/root    125688 2000-01-01 00:00 bin/dash
  -rw------- root/root      1408 2000-01-01 00:00 etc/hosts


I know it may not seem this way, but a lot of people don't ever run "curl | bash", or if they do, they do so in a throwaway VM (or a container if the source is mostly trusted).


It's really a bad idea most of the time to have an archive that doubles as an executable. It's not possible to cover every possible platform, and in the distant future those self-extracting archives may be impossible to extract without the required host system.


In most common scenarios, curl | bash is no different from apt-add-repository && apt install. Running a completely non-curated executable is very different.


> Wake me up if a simple standard comes to pass that neither has user/group ID, mode fields, nor bit-packed two-second-precision timestamps or similar silliness. Perhaps an executable bit for systems that insist on such things for being able to use the download in the intended way

Other than having timestamps isn't this a ZIP file? No user id, no x bit, widely available implementations... Not very simple though I guess.


Zip is extremely simple, and well documented.

I wrote a ReadableStream to Zip encoder (with no compression) in 50 lines of Javascript.
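Roughly the same idea sketched in Python, in case it's useful (stored entries only, fixed 1980 timestamps, ASCII names, no ZIP64):

  import struct
  import zlib

  def make_stored_zip(entries):
      """entries: list of (name, data) pairs -> zip archive bytes.

      Sketch only: 'stored' (no compression), fixed 1980-01-01 timestamps,
      names treated as ASCII, no ZIP64.
      """
      out, central = bytearray(), bytearray()
      for name, data in entries:
          name_b = name.encode("ascii")
          crc = zlib.crc32(data) & 0xFFFFFFFF
          offset = len(out)
          # Local file header (method 0 = stored, DOS date 0x21 = 1980-01-01).
          out += struct.pack("<IHHHHHIIIHH", 0x04034B50, 20, 0, 0, 0, 0x21,
                             crc, len(data), len(data), len(name_b), 0)
          out += name_b + data
          # Central directory entry repeats the metadata plus the local offset.
          central += struct.pack("<IHHHHHHIIIHHHHHII", 0x02014B50, 20, 20, 0, 0,
                                 0, 0x21, crc, len(data), len(data), len(name_b),
                                 0, 0, 0, 0, 0, offset)
          central += name_b
      cd_offset = len(out)
      out += central
      # End of central directory record.
      out += struct.pack("<IHHHHIIH", 0x06054B50, 0, 0, len(entries), len(entries),
                         len(central), cd_offset, 0)
      return bytes(out)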


Me too but in php. I couldn't find a streaming zip encoder that you can just require() and use without further hassle, so I wrote one (it's on github somewhere).

The problem is that zip is finicky and extremely poorly documented. I had to look at what other implementations do to figure out some of the fields. About at least one field, the spec (from the early 90s or late 80s I think) says it is up to you to figure out what you want to put there! After all that, I additionally wrote my own docs in case someone coming after me needs to understand the format as well, but some things are just assumptions and "everyone does it this way"s, leading to me having only moderate confidence that I've followed the spec correctly. I haven't found incompatibilities yet, but I'd also not be surprised if an old decoder doesn't eat it or if a modern one made a different choice somewhere.

It's also not as if I haven't come across third-party zip files that the Debian command-line tool wouldn't open but the standard Debian/Cinnamon GUI utility was perfectly happy about. If it were so well documented and standard, that shouldn't be a thing. (Similarly, customers on macOS can't open our encrypted 7z pentest report files. The Finder asks for the password and then flat-out tells them "incorrect password", whereas in reality it seems to be unable to handle filename encryption. Idk if that is per the spec, but incompatibilities abound.)


The PKWare Zip file spec is reasonably detailed.

If you're not sure what the spec is trying to say, then either the PKZip binaries or the Info-ZIP zip/unzip source code is your usual source of truth.

When one unzip works but another unzip app doesn't, then you can usually point the finger at the last zip app that modified the zip file. There's some inconsistency in the zip file.

Running "unzip -t -v" on the zip file in question may yield more info about the problem.


The binaries you refer to as a source of truth are a paid product (not sure if the trial version, which requires filling out a form that's currently not loading, includes all options, or how honest it is to use that to make an alternative to their software, or if the terms even allow it) and don't seem to run on my operating system. I guess I could buy myself a Windows license and read the pkzip EULA to see if you're allowed to use it for making a clone, but I figured the two decoders (that don't always agree with each other) I had on hand would do. If they agree about a field, it's good enough (and decoders can expect that unspecced fields are garbage).


Info-ZIP is open source. Have you never used unzip?


Isn't pkzip the original? I'm not sure I've heard of info-zip but unzip is a command I use regularly on Debian. I highly doubt that's the original commercial implementation though


Here's the link to the PKWARE APPNOTE.TXT

https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT


The only special thing about the Zip file format that springs to mind as causing ambiguity is the handling of the OS-specific extra field for a Zip archive entry.

You don't have to include an OS-specific extra field unless you want the information in that specific extra field to be available to the party trying to extract the contents of the zipfile.


Wait until you add support for encryption.


- As far as I know, squashfs is a file system and not an archive format; the "FS" in the name shows the focus.

- It is read-only; Pack is not. Update and delete just aren't public yet, as I wanted people to get a taste first.

- It is clearly focused on archiving, whereas Pack wants to be a container option for people who want to pack some files/data and store or send them with no privacy dangers.

- Pack is designed to be user-friendly for most people; the CLI is very simple to work with, and future OS integration will make working with it a breeze. That is far different from a good file system focused on Linux.

- I did not compare to squashfs, but I will be happy to see any results from interested people.

My bet is on Pack, obviously, to be much faster.


- loop-mount is a thing

- being read-only is mostly a benefit for an archive. Back in the days when drives were small, I occasionally wanted to update a .rar, but in the last ~5 years I can't remember a case for it.

- it's fine, but don't think that others' use cases are invalid because of your vision

- mount is also a CLI interface


As a separate note, had I encountered the pack.ac link anywhere on the internet other than here, where a description was attached, I'd have left it immediately. The site itself lacks, for me, any info on what it is and why I should try it.


They state how squashfs is nice for archiving, and then you go and ramble about specifically Not Archiving.


I second this.



