I found squashfs to be a great archive format. It preserves Linux file ownership and permissions, you can extract individual files without parsing the entire archive (unlike tar), and it's mountable. It's also openable in 7-Zip.
I wonder how pack compares to it, but its home page and github don't tell much.
For sosreports (archives with lots of diagnostic command output and logfiles from a Linux host), I wanted to find a file format that can both use zstd compression (or maybe something else that is about as fast and compressible; they currently often use xz, which is very, very slow) -and- that lets you unpack a single file fast, with an index, ideally so you can mount it loopback or with FUSE or otherwise just quickly unpack a single file in a many-GB archive.
You'd be surprised that this basically doesn't exist right now. There's a bunch of half solutions, but no really good, easily available one. Some things add indexes to tar; zstd does support partial/offset unpacking without reading the entire archive in the code, but basically no one uses that function, which is kind of silly. There are zip and rar tools with zstd support, but they are not all cross-compatible and mostly don't exist in the packaged Linux versions.
squashfs with zstd added mostly fits the bill.
I was really surprised not to find anything else, given we had this in Zip and RAR files two decades ago. But nothing so far that would or could ship on a standard open source system has managed to modernise that feature set.
Such random access using `--include` is very fast.
As an example, if I want to extract just a .c file from the whole codebase of Linux, it can be done (on my machine) in 30 ms, compared to near 500 ms for WinRAR or 2500 ms for tar.gz. And it only gets worse once you factor in encryption. For now, Pack encryption is not public, but when it is, you can access a file in a locked Pack file in a matter of milliseconds rather than seconds.
I haven't had a chance to use it yet, but https://github.com/mhx/dwarfs claims to be many times faster than squashfs, to compress much better, and to have full FUSE support.
It has to be if you can "seek and selectively extract from" a zip file: the ability to do that relies on the ability to read the end of the archive for the central directory, then read the offset and size you get from that to get at the file you need.
squashfs may or may not be able to do it with as few roundtrips (I don't know the details of its layout), but S3 necessarily provides the capabilities for random access otherwise you'd have to download the entire content either way and the original query would be moot.
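For anyone curious what that lookup actually involves, here is a minimal sketch (plain Python against a local file instead of S3; the tail size is just the spec's maximum comment length): find the End of Central Directory record at the tail, then read the central directory offset and size out of it.

    import struct

    EOCD_SIG = b"PK\x05\x06"   # End of Central Directory signature
    EOCD_MIN = 22              # fixed-size part of the EOCD record

    def central_directory_location(path):
        """Return (offset, size, entry_count) of the zip central directory."""
        with open(path, "rb") as f:
            f.seek(0, 2)
            file_size = f.tell()
            # The EOCD sits in the last 22..22+65535 bytes (variable-length comment).
            tail_len = min(file_size, EOCD_MIN + 0xFFFF)
            f.seek(file_size - tail_len)
            tail = f.read(tail_len)
        pos = tail.rfind(EOCD_SIG)
        if pos < 0:
            raise ValueError("no EOCD record found; not a zip file?")
        (_sig, _disk, _cd_disk, _disk_entries, total_entries,
         cd_size, cd_offset, _comment_len) = struct.unpack(
            "<IHHHHIIH", tail[pos:pos + EOCD_MIN])
        return cd_offset, cd_size, total_entries

Against S3 the same idea becomes two or three ranged GETs: one for the tail, one for the central directory, and one for the entry's data.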
> You can read sequentially through a zip file and dynamically build up a central directory yourself and do whatever desired operations.
First, zip files already have a central directory so why would you do that?
Second, you seem to be missing the subject of this subthread entirely, the point is being able to selectively access S3 content without downloading the whole archive. If you sequentially read through the entire zip file, you are in fact downloading the whole archive.
Sorry, I wasn't clear before. You don't need the central directory to process a zipfile. You don't need random access to a zipfile to process it.
A zipfile can be treated as a stream of data and processed as each individual zip entry is seen in the download/read. NO random access is required.
Just enough memory for the local directory entry and buffers for the zipped data and unzipped data. The first two items should be covered by the downloaded zipfile buffer.
If you want to process the zipfile ASAP or don't have the resources to download the entire zipfile first before processing the zipfile, then this is a valid manner to handle the zipfile. If your desired data occurs before the entire zipfile has been downloaded, you can stop the download.
A zipfile can also be treated as a randomly accessed file as you mentioned. Some operations are faster that way - like browsing each zip entry's metadata.
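To make the streaming approach concrete, here is a rough sketch that walks local file headers straight off a non-seekable stream (assumptions: no ZIP64, and general-purpose flag bit 3 unset, i.e. no data descriptors, otherwise the sizes aren't known up front):

    import struct, zlib

    LFH_SIG = 0x04034b50  # local file header signature

    def iter_entries(stream):
        """Yield (name, data) for each entry as it appears in the stream."""
        while True:
            fixed = stream.read(30)
            if len(fixed) < 30:
                return
            (sig, _ver, flags, method, _mtime, _mdate, _crc, comp_size,
             _uncomp_size, name_len, extra_len) = struct.unpack("<IHHHHHIIIHH", fixed)
            if sig != LFH_SIG:
                return            # hit the central directory; local entries are done
            if flags & 0x08:
                raise ValueError("data descriptor in use; size not known up front")
            name = stream.read(name_len).decode("cp437")
            stream.read(extra_len)
            data = stream.read(comp_size)
            if method == 8:                        # DEFLATE
                data = zlib.decompress(data, -15)  # raw deflate, no zlib header
            yield name, data

As described above, you can stop reading (or downloading) as soon as the entry you wanted has been yielded.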
WIM is the closest thing Windows has to a full file-based capture, but I've noticed that even that doesn't capture everything, unfortunately. I forget exactly, but think it was extended attributes that DISM wouldn't capture, despite the /EA flag. Not sure if that was a file format limitation or just a DISM bug.
Very sad. Cross-platform extended attributes are the very thing I would love. I even imagine a new archive format which would be just a key-key-value store (I mean it - two keys, i.e. a set of key-value pairs for every top-level key - this is EA / NTFS streams) with values compressed using a common dictionary (also possibly encrypted/signed with a common key). Needless to say, such a format would enable almost any use case, especially if the layout of the file itself is architected right. macOS wouldn't have to add its special folder (the one it adds to every ZIP) anymore; tagging files and saving any metadata about them would be possible, as would saving multiple versions of the same file, or alternative names (e.g. what you received it with and what you renamed it to) for the same file.
I even dream about the day when a file's main stream would be pure data and all the metadata would go to EAs. Imagine an MP3 file where the main stream only records the sound with no ID3; all the metadata like the artist and song names are handled as EAs and can be used in file operation commands.
This can also be made tape-friendly and eliminate the need for TAR. Just make sure streams/EAs are written contiguously, closely related streams go right next to each other, compression is optional, and the ToC+dictionary is replicated in a number of places like the beginning, the middle and the end.
As you might have guessed, I use all the major OSes and single-OS solutions are of little use to me. It seems I'd just use SquashFS, but its use is limited on Windows because you can hardly make or mount one there - only unpack with 7zip.
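To make the key-key-value idea concrete, the logical model is just a two-level map; everything below (the item names, the "ea:" prefix) is invented purely for illustration:

    # archive[item_key][attribute_key] -> bytes
    archive = {
        "song.mp3": {
            "data":         b"...raw audio frames only, no ID3...",
            "ea:artist":    b"Some Artist",
            "ea:title":     b"Some Song",
            "ea:orig-name": b"track01.mp3",
        },
    }

Serialization, the shared compression dictionary, and the replicated ToC would sit underneath this model.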
It's easy to forget about supporting EAs on Windows - they are extremely uncommon because you practically need to be in kernelspace to write them. Ntoskrnl.exe has one or two EAs, almost nothing else does.
(ADS are super commonplace and the closer analogue to posix xattrs.)
I didn't know this, thanks. I thought xattrs and ADS were synonymous. Do SquashFS, ext4, HFS+ and APFS have ADS then?
I am looking forward to writing my own cross-platform app which would rely on attaching additional data and metadata to my files.
"need to be in kernelspace" does not sound very scary because a user app probably doesn't need to do this very job itself - isn't there usually an OS-native command-line tool which can be invoked to do it?
That's exactly what I'd like to avoid. I want to transfer a group of files (either to myself, friends, or website visitors), not make assumptions about the target system's permission set. For copies of my own data where permissions are relevant, I've got a restic backup
Wake me up if a simple standard comes to pass that neither has user/group ID, mode fields, nor bit-packed two-second-precision timestamps or similar silliness. Perhaps an executable bit for systems that insist on such things for being able to use the download in the intended way
(I self-made this before: a simple length-prefixed concatenation of filename and contents fields. The problem is that people would have to download an unpacker. That's not broadly useful unless it is, as in that one case, a software distribution which they're going to run anyway)
Sometimes you want to include data and sometimes you don't, for different reasons in different contexts. It's not a data handler's job to decide what data is or isn't included; it's the sender's job to decide what not to include and the receiver's job to decide what to ignore.
The simplest example is probably just the file path. tar or zip don't try to say whether or not a file in the container includes the full absolute path, a portion of the path, or no path.
The container should ideally be able to contain anything that any filesystem might have, or else it's not a generally useful tool, it's some annoying domain-specific specialized tool that one guy just luuuuuvs for his one use-case he thinks is obviously the most rational thing for anyone.
If you don't want to include something like a uid, say for security reasons not to disclose the internal workings of something on your end, then arrange not to include it when creating the archive, the same way you wouldn't necessarily include the full path to the same files. Or outside of a security concern like that, include all the data and let the recipient simply ignore any data that it doesn't support.
Good argument, I've mostly come around to your view. The little "but" that I still see is that the current file formats don't let you omit fields you don't want to pass on, and most decoders don't let you omit fields you don't want to interpret/use while unpacking.
Even if a given decoder could, though, most users wouldn't be able to use that, and so they'd get files from 1970 or 1980 if I don't want to pass that on and set it to zeroes. So it's better if the field can be omitted (as in, if the header weren't fixed length but extensible like an IP packet). So I'd still like a "better" archiving format than the ones we have today (though I'm not familiar with the internals of every one of them, like 7z or the mentioned squashfs, so tell me if this already exists), but I agree such a format should just support everything ~every filesystem supports
Oh sure, I was talking in generalities and an imaginary archiver, what should an archiver have, not any particular existing actual one.
OS and filesystem features differ all over the place, and there will be totally new filesystems and totally new metadata tomorrow. There is practically no common denominator, not even basic ASCII for the filename, let alone any other metadata.
So there should just be metadata fields where about the only thing actually part of the spec is the structure of a metadata field, not any particular keys or values or number or order of fields. The writer might or might not include a field for, say, creation time, and the reader might or might not care about that. If the reader doesn't recognize some strange new xattr field that only got invented yesterday, no problem, because it does know what a field is, and how to consume and discard fields it doesn't care about.
There would be a few fields that most readers and writers would all just recognize by convention, the usual basics like filename. Even the filename might not be technically a requirement but maybe an rfc documents a short list of standard fields just to give everyone a reference point.
But for instance it might be legal to have nothing but some uuids or something.
That's getting a bit weird but my main point was just that it's wrong to say an archiver shouldn't include timestamps or uids just because one use of archive files is to transfer files from a unix system to a windows system, or from a system with a "bob" user to a system with no "bob" user.
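A minimal sketch of the self-describing field structure described a couple of paragraphs up (the wire layout and field names here are invented for illustration, not any existing format): each field is just a name, a length, and a value, so a reader can skip anything it doesn't recognize.

    import struct

    def write_fields(out, fields):
        """fields: dict of name -> bytes; any names at all are allowed."""
        for name, value in fields.items():
            name_b = name.encode("utf-8")
            out.write(struct.pack("<HI", len(name_b), len(value)))
            out.write(name_b)
            out.write(value)

    def read_fields(inp, wanted):
        """Return only the fields this reader cares about, skipping the rest."""
        found = {}
        while True:
            header = inp.read(6)
            if len(header) < 6:
                return found
            name_len, value_len = struct.unpack("<HI", header)
            name = inp.read(name_len).decode("utf-8")
            value = inp.read(value_len)
            if name in wanted:            # e.g. "name", "mtime", "xattr.user.foo"
                found[name] = value
            # unknown fields are consumed and discarded, never an error

Whether "name" or anything else is mandatory would then be a matter of convention, exactly as described above.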
The arguments for tar are --preserve-permissions and --touch (don't extract file modified time).
For unzip, -D skips restoration of timestamps.
For unrar, -ai ignores file attributes, -ts restores the modification time.
There are similar arguments for omitting these when creating the archive, they set the field to a default or specified value, or omit it entirely, depending on the format.
Those are user controls, to allow the user on one end to decide what to put into the container, and there are others to allow the user at the other side to decide what to take out of the container, not limits of the container.
The comment I'm replying to suggested that since one use case results in metadata that is meaningless or mis-matched between sender and receiver, the format itself should not even have the ability to record that metadata.
Is this question even coherent when nothing changes if you substitute any other term like "full path" or "as much path as exists" or "any path"?
It can be if you make assumptions about the basic structure of both systems. Some people rely on this behavior. It can be a good idea or a bad idea, depending on what you're doing.
I agree very much with this. Something that annoys me is how much information tar files leak. Like, you don't need to know the username or groupname of the person that originally owned the files. You don't need to copy around any mode bit other than "executable". You definitely don't need "last modified" timestamps, which exist only to make builds that produce archives non-hermetic.
Frankly, I don't even want any of these things on my mounted filesystem either.
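For what it's worth, you can get most of the way there today with a filter when building the archive; a sketch using Python's tarfile (note the gzip wrapper still records its own timestamp unless you control that separately):

    import tarfile

    def scrub(info: tarfile.TarInfo) -> tarfile.TarInfo:
        # Drop owner names/ids and timestamps; keep only an executable bit.
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        info.mtime = 0
        info.mode = 0o755 if info.mode & 0o111 else 0o644
        return info

    with tarfile.open("out.tar.gz", "w:gz") as tar:
        tar.add("project/", filter=scrub)

GNU tar has flags to the same effect, but the point stands that the format itself still insists on carrying those fields.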
> The problem is that people would have to download an unpacker.
Your archive format just needs to be an executable that runs on every platform. https://github.com/jart/cosmopolitan is something that could help with that. ("Who would execute an archive? It could do anything," I hear you scream. Well, tell that to anyone who has run "curl | bash".)
I know it may not seem this way, but a lot of people don't ever run "curl | bash", or if they do, they do so in a throwaway VM (or a container if the source is mostly trusted)
It's really a bad idea most of the time to have an archive that doubles as an executable. It's not possible to cover every possible platform, and in the distant future those self-extracting archives may be impossible to extract without the required host system.
In most common scenarios, curl | bash is no different from apt-add-repository && apt install. Running a completely non-curated executable is very different.
> Wake me up if a simple standard comes to pass that neither has user/group ID, mode fields, nor bit-packed two-second-precision timestamps or similar silliness. Perhaps an executable bit for systems that insist on such things for being able to use the download in the intended way
Other than having timestamps isn't this a ZIP file? No user id, no x bit, widely available implementations... Not very simple though I guess.
Me too, but in PHP. I couldn't find a streaming zip encoder that you can just require() and use without further hassle, so I wrote one (it's on GitHub somewhere).
The problem is that zip is finicky and extremely poorly documented. I had to look at what other implementations do to figure out some of the fields. About at least one field, the spec (from the early 90s or late 80s I think) says it is up to you to figure out what you want to put there! After all that, I additionally wrote my own docs in case someone coming after me needs to understand the format as well, but some things are just assumptions and "everyone does it this way"s, leading to me having only moderate confidence that I've followed the spec correctly. I haven't found incompatibilities yet, but I'd also not be surprised if an old decoder doesn't eat it or if a modern one made a different choice somewhere.
It's also not as if I haven't come across third-party zip files that the Debian command line tool wouldn't open but the standard Debian/Cinnamon GUI utility was perfectly happy about. If it were so well-documented and standard, that shouldn't be a thing. (Similarly, customers on macOS can't open our encrypted 7z pentest report files. The Finder asks for the password and then flat-out tells them "incorrect password", whereas in reality it seems to be unable to handle filename encryption. Idk if that is per the spec, but incompatibilities abound.)
If you're not sure what the spec is trying to say, then either the PKZip binaries or the Info-ZIP zip/unzip source code is your usual source of truth.
When one unzip works but another unzip app doesn't, then you can usually point the finger at the last zip app that modified the zip file. There's some inconsistency in the zip file.
Running "unzip -t -v" on the zip file in question may yield more info about the problem.
The binaries you refer to as source of truth are a paid product (not sure if the trial version, which requires filling out a form that's currently not loading, includes all options, or how honest it is to use that to make an alternative to their software, or if the terms allow that) and don't seem to run on my operating system. I guess I could buy me a Windows license and read the pkzip EULA to see if you're allowed to use it for making a clone, but I figured the two decoders (that don't always agree with each other) I had on hand would do. If they agree about a field, it's good enough (and decoders can expect that unspecced fields are garbage)
Isn't pkzip the original? I'm not sure I've heard of info-zip but unzip is a command I use regularly on Debian. I highly doubt that's the original commercial implementation though
The only special thing about the Zip file format that springs to mind as causing ambiguity is the handling of the OS-specific extra field for a Zip archive entry.
You don't have to include an OS-specific extra field unless you want the information in that specific extra field to be available by the party trying to extract the contents of the zipfile.
- As far as I know, squashfs is a file system and not an archive format; the "FS" in the name shows the focus.
- It is read-only; Pack is not. Update and delete are just not public yet, as I wanted people to get a taste first.
- It is clearly focused on archiving, whereas Pack wants to be a container option for people who want to pack some files/data and store or send them with no privacy dangers.
- Pack is designed to be user-friendly for most people; the CLI is very simple to work with, and future OS integration will make working with it a breeze. That is far different from a good file system focused on Linux.
- I did not compare to squashfs, but I will be happy to see any results from interested people.
- Being read-only is mostly a benefit for an archive. Back in the days when drives were small, I occasionally wanted to update a .rar, but in the last ~5 years I can't remember a case for it.
- it's fine, but don't think that others' use cases are invalid because of your vision
As a separate note, had I encountered the pack.ac link anywhere on the internet other than here with a description attached, I'd have left it immediately. For me it just lacks any info on what it is and why I should try it.
Interesting, I've recently spent an unhealthy amount of time researching archival formats to build the same setup of using SQLite with ZStd.
My use case is extremely redundant data (specific website dumps + logs) that I want decently quick random access into,
and I was unhappy with either the access speed, quality/usability or even existence of libraries for several formats.
Glancing over the code this seems to use the following setup:
- Aggregate files
- Chunk into blocks
- Compress blocks of fixed size
- Store file to chunk and chunk to block associations
What I did not see is a deduplication step for the chunks, or an attempt to group files (and by extension, blocks) by similarity in an attempt to improve compression.
But I might have just missed that due to lack of familiarity with Pascal.
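For reference, a bare-bones version of that setup might look like the following (table and column names are invented here, the chunk size is arbitrary, and it assumes the python-zstandard package; it also skips the aggregation of many small files into shared blocks):

    import sqlite3, zstandard

    CHUNK = 8 * 1024 * 1024  # fixed block size

    db = sqlite3.connect("archive.db")
    db.executescript("""
        CREATE TABLE IF NOT EXISTS block(id INTEGER PRIMARY KEY, data BLOB);
        CREATE TABLE IF NOT EXISTS file(id INTEGER PRIMARY KEY, path TEXT);
        -- which blocks a file's bytes live in, and where inside each block
        CREATE TABLE IF NOT EXISTS file_block(
            file_id INTEGER, block_id INTEGER, block_offset INTEGER, length INTEGER);
    """)
    cctx = zstandard.ZstdCompressor(level=3)

    def add_file(path):
        file_id = db.execute("INSERT INTO file(path) VALUES (?)", (path,)).lastrowid
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                block_id = db.execute("INSERT INTO block(data) VALUES (?)",
                                      (cctx.compress(chunk),)).lastrowid
                db.execute("INSERT INTO file_block VALUES (?, ?, ?, ?)",
                           (file_id, block_id, 0, len(chunk)))
        db.commit()

Random access then means looking up a file's rows in file_block and decompressing only those blocks.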
For anyone interested in this strategy, take a look at ZPAQ [1] by Matt Mahoney, you might know him from the Hutter Prize competition [2] / Large Text Compression Benchmark. It takes 14th place with tuned parameters.
There's also a maintained fork called zpaqfranz, but I ran into some issues like incorrect disk size estimates with it.
For me the code was also sometimes hard to read due to being a mix of English and Italian. So your mileage may vary.
Thank you for the detailed check. I should thank the syrup too :)
I'm happy to see a fellow enthusiast. Your deduction is on point.
And also, Pack is smart; it skips non-compressible files like MP3 [1], so you do not need to choose a "Store" option to get a faster result, and it speeds up decompression too. Pack is the first to achieve this, being faster than Store options. Yes, it was a surprise to me too.
ZPAQ is great, and I have studied the Hutter Prize competition. Pack is on another chart, which is why I proposed CompressedSpeed [2]. The speed of getting to the compressed result needs to be accounted for. You can store anything on an atom if you try hard enough, but hard work takes time. A deduplication step may get added, but in Hard Press [3].
I am curious to see the results of Pack on your data. You can find me here or o at pack.ac.
[1] It is based on content rather than extension; any data that is determined not to be worthy of compression will be stored as is. And as a file can get chunked, some parts can get compressed and some cannot. Imagine that the subtitle part of an MKV file can get compressed, while the video part gets skipped. These features will get more updates over time, as long as they don't cost time. Pack's focus is being seamless, not being the most compressed; there are already great works in the field, such as the noted ZPAQ.
[3] You can choose --press=hard to ask for better compression. Even with Hard Press, Pack does not try to eat your hardware just to get a little more; it goes the optimized way I described.
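The content-based skip described in footnote [1] boils down to something like the following (an illustration of the general idea, assuming the python-zstandard package; it is not Pack's actual code):

    import zstandard

    def pack_chunk(chunk: bytes, threshold: float = 0.98):
        """Compress a chunk only if it actually shrinks; otherwise store it raw."""
        compressed = zstandard.ZstdCompressor(level=3).compress(chunk)
        if len(compressed) < len(chunk) * threshold:
            return compressed, True   # worth keeping the compressed form
        return chunk, False           # e.g. MP3 frames or MKV video: store as is

Storing already-compressed chunks raw is also what makes later decompression faster than a blind "compress everything" pass.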
When I read the title, I thought it was a new operating system-level containerization image format for filesystem layers and runtime config. But it looks like "container format" is a more general term for a collection of multiple files or streams into one. https://en.wikipedia.org/wiki/Container_format TIL.
OS containers could use an update too, though. They're often so big and tend to use multiple tar.gz files.
With all due respect, I find it hard to believe the author stumbled upon a trivial method of improving tarballing performance by several orders of magnitude that nobody else had considered before.
If I understand correctly, they're suggesting Pack, which both archives and compresses, is 30x faster than creating a plain tar archive. That just sounds like you used multithreading and tar didn't.
Either way, it'd be nice to see [a version of] Pack support plain archival, rather than being forced to tack on Zstd.
The tar file format is REALLY bad. It's pretty much impossible to thread because it's just writing metadata then content and repeatedly concatenating.
i.e.:
/foo.txt 21
This is the foo file
/bar.txt 21
This is the bar file
That makes it super hard to deal with as you essentially need to navigate the entire tar file before you can list the directories in a tar file. To add a file you have to wait for the previous file to be added.
Using something like sqlite solves this particular problem because you can have a table with file names and a table with file contents that can both be inserted into in parallel (though that will mean the contents aren't guaranteed to be contiguous.) Since SQLite is just a btree it's easy (well, known) how to concurrently modify the contents of the tree.
Funnily enough, tar is like 3 different formats (PaX, tar, ustar). One of the annoying parts of the tar format is that even though you scan all the metadata upfront, you have to keep the directory metadata in RAM until the end and have to wait to apply it at the end.
Or just do what zip and every other format does and stick all the metadata at the beginning - enough to list all files, and extract any single one efficiently
zip interestingly sticks the metadata at the end. That lets you add files to a zip without touching what's already been zipped. Just new metadata at the end.
Modern tape archives like LTFS do the same thing as well.
That sounds like you need to have fetched the whole zip before you can unzip it - which is not what one wants when making "virtual tarfiles" which only exist in a pipe. (i.e. you're packing files in at one end of the pipe and unpacking them at the other)
That point is always raised on every criticism of tar (that it's good at tape).
Yes! It is! But it's awful at archive files, which is what it's used for nowadays and what's being discussed right now.
Over the past 50 years some people did try to improve tar. People did develop ways to append a file table at the end of an archive file. Maintaining compatibility with tapes, all tar utilities, and piping.
Similarly, driven people did extend (pk)zip to cover all the unix-y needs. In fact the current zip utility still supports permissions and symlinks to this day.
But despite those better methods, people keep pushing og tar. Because it's good at tape archival. Sigh.
It was hard to believe for me, too. And I didn't stumble upon it; I looked for it closely, and that was a point in the note. People did not look properly for nearly three decades. Many things have changed, but we computer people are still using the same tools.
I am not saying old is not good; the current solutions are great, but what are we, if we don't look for the better?
Yes, it is that much faster, and a good part of it is because of the multi-thread design, but as a reminder, WinRAR and 7-Zip are multi-threaded too, and you can see the difference.
To satisfy your doubt, I suggest running Pack for yourself. I am looking for more data on its behaviour on different machines and data.
Can I ask why you need a version without ZSTD? If you are thinking that compression slows it down, I should say no. Pack is the first of its kind where "Store" would actually slow it down. Because its compression is smart, it will skip any non-compressible content.
On the same machine and the same Linux source code test:
My concern with Pack obliging me to compress is that compression becomes less pluggable; I'd much rather my archive format be agnostic of compression, as with tar, so that I can trivially move to a better compression format when one inevitably comes to be.
You got a point.
Although with that option comes a great cost: we will lose portability, speed and even reliability.
Portability: The receiver (or future you) needs to know what you used, and even what version.
Speed: If you want to do the archive part first (tar) and then compress (gz), you will get much lower speed (as shown in the note).
Reliability: Most people use tar with gz anyway, but if you use it with a not-so-popular algorithm and tools, you risk having a file that may or may not work in the future.
Pack's plan is to use the best of its time (Zstandard), and if an update is needed in years to come, it will add support for new algorithms. All Pack clients must only write the latest version (and read all previous versions), and that makes sure almost all use the best of their time.
Also, 4.7 seconds to read 1345 MB in 81k files is suspiciously slow. On my six-year-old low/mid-range Intel 660p with Linux 6.8, tar -c /usr/lib >/dev/null with 2.4 GiB in 49k files takes about 1.25s cold and 0.32s warm. Of course, the sales pitch has no explanation of which hardware, software, parameters, or test procedures were used. I reckon tar was tested with cold cache and pack with warm cache, and both are basically benchmarking I/O speed.
> Development machine with a two-year-old CPU and NVMe disk, using Windows with the NTFS file system. The differences are even greater on Linux using ext4. Value holds on an old HDD and one-core CPU.
> All corresponding official programs were used in an out-of-the-box configuration at the time of writing in a warm state.
My apologies, the text color is barely legible on my machine. Those details are still minimal though; what versions of software? How much RAM is installed? Why is 7-Zip set to maximum compression but zstd is not? Why is tar.zst not included for a fair comparison of the Pack-specific (SQLite) improvements on top of from the standard solution?
The machine has 32 GB of RAM, but that is far more than these tools need.
7-Zip was used like the others; I just gave it a folder to compress. No configuration.
As requested, here are some numbers on tar.zst of the Linux source code (the test subject in the note): tar.zst: 196 MB, 5420 ms (using the out-of-the-box config and -T0 to let it use all the cores; without it, it would be 7570 ms). Pack: 194 MB, 1300 ms. Slightly smaller size, and more than 4X faster. (Again, it is on my machine; you need to try it for yourself.) Honestly, Zstandard is great. Tar is slowing it down (because of its old design and being single-threaded). And it is done in two steps: first creating the tar and then compressing. Pack does all the steps (read, check, compress, and write) together, and this weaving helped achieve this speed and random access.
This sounds like a Windows problem, plus compression settings. Your wlog is 24 instead of 21, meaning decompression will use more memory. After adjusting those for a fair comparison, pack still wins slightly but not massively:
Benchmark 1: tar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
Time (mean ± σ): 2.573 s ± 0.091 s [User: 8.611 s, System: 1.981 s]
Range (min … max): 2.486 s … 2.783 s 10 runs
Benchmark 2: bsdtar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
Time (mean ± σ): 3.400 s ± 0.250 s [User: 8.436 s, System: 2.243 s]
Range (min … max): 3.171 s … 4.050 s 10 runs
Benchmark 3: busybox tar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
Time (mean ± σ): 2.535 s ± 0.125 s [User: 8.611 s, System: 1.548 s]
Range (min … max): 2.371 s … 2.814 s 10 runs
Benchmark 4: ./pack -i ./linux-6.8.2 -w
Time (mean ± σ): 1.998 s ± 0.105 s [User: 5.972 s, System: 0.834 s]
Range (min … max): 1.931 s … 2.250 s 10 runs
Summary
./pack -i ./linux-6.8.2 -w ran
1.27 ± 0.09 times faster than busybox tar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
1.29 ± 0.08 times faster than tar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
1.70 ± 0.15 times faster than bsdtar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
Another machine has similar results. I'm inclined to say that the difference is probably mainly related to tar saving attributes like creation and modification time while pack doesn't.
> it is done in two steps: first creating tar and then compression
Pipes (originally Unix, subsequently copied by MS-DOS) operate in parallel, not sequentially. This allows them to process arbitrarily large files on small memory without slow buffering.
Thank you for the new numbers.
Sure, it can be different on different machines, especially full systems. For me on Linux and ext4, Pack finishes the Linux code base at just 0.96 s.
Anyway, I do not expect an order of magnitude difference between tar.zst and Pack; after all, Pack is using Zstandard.
What makes Pack fundamentally different from tar.zst is Random Access and other important factors like user experience. I shared some numbers on it here: https://news.ycombinator.com/item?id=39803968 and you are encouraged to try them for yourself.
Also, by adding Encryption and Locking to Pack, Random Access will be even more beneficial.
HDD for testing is a pretty big caveat for modern tooling benchmarks. Maybe everything holds the same if done on a SSD, but that feels like a pretty big assumption given the wildly different performance characteristics between the two.
Ok? How are you comparing these systems to the benchmark so they might be considered relevant? Compressing "lots of small files" describes an infinite variety of workloads. To achieve anything close to the benchmark you'd need to compress only small files, in a single directory, with a small average size. And even the contents of those files would have large implications for expected performance....
If that were true, surely it would make sense to demonstrate this directly rather than with a contrived benchmark? The issue is not the preponderance of small files but rather the distribution of data shapes.
Zipping up a project directory even without git can be a big file collection. Python virtual environment or node_modules, can quickly get into thousands of small files.
Let's say the tyranny of the C children has diverted our attention. I'll make a wild ass statement: If Modula-2 (Wirth family with Pascal) had caught on, you would have had whatever you wanted from Rust 20 years ago. But the C noise dominated the narrative.
Use the language that makes you money and encourages you to write code that addresses domains requiring more than just bolting together framework pieces. AI can do a measurable chunk of that work.
20 years ago was 2004, and Borland Delphi 7 was out. It was Pascal-based, but it wasn't that different from C programs.
It had a nice unit system with separate interface & implementation sections. This was very nice. The unit files were not compatible with anything else, including previous versions of Delphi - this was not nice, especially since a lot of libraries were distributed in compiled form.
The compilation speed was amazingly fast. This is one thing that was unequivocally better than C at this time.
There were range types (type TExample = 1..1000), but they were more of a gimmick - it turns out there are very few use cases for build-time limits. There were some uses back in the DOS days when you'd have a hardcoded resolution of 640x480, but in the Windows era most variables were just Integer.
Arrays had optional range checks on access, that was also nice. We'd turn them off if we felt programs were too slow.
Otherwise, it was basically same as C with a bit of classes - custom memory allocation/deallocation, dangling pointers, NULLs, threads you start and stop, mutexes and critical sections. When I finally switched from Pascal to C, I didn't see that much difference (except compilation got much slower)
Maybe you'd say that Borland did something wrong, and Wirth's Modula-2 would be much better than Borland's Pascal, but I doubt this.
Yep, optional range checks and a variety of other compiler defines to accommodate programmers coming from a C background who preferred to disregard compile time checks in the name of speed of execution. So sure, you can still to this day make pascal act like C. You even get comment delimiters. Kind of adds credence to the influence of C that I'm suggesting.
Wirth languages are about constraints. For instance, when I started writing code in TSM2 and Stonybrook, my general impression was that they both caught 10-30% more bugs at compile time than BP did. If that's too much of a hassle for C programmers, well, ok.
Also to add, all the wordiness of Wirth languages, the block delimiters, yes I get it. But all this stuff is just another constraint for sake of correctness. M2, being case sensitive, is even worse about this than Pascal. But the point is to make you look at your code more than once, to proofread it and think about what's going on, because the syntax screams at you a little bit. Of course, with compiler defines, you can turn pascal into C and assume the responsibility for yourself. That's what runtime debuggers are for anyway, yes?
Ok, whatever, but we're missing the point that Wirth was trying to get across, which is to turn the language itself into implicit TDD, starting with the first line of code written. C/++ may give you speed, but for the average programmer, all that speed is taken back in the end due to maintenance costs. IMO, M2 was even better than Pascal at shifting maintenance costs left in the value delivery stream, compared to the C tack.
Sure, mission critical code can be written even in C/++. Most of SpaceX's code is a C++ codebase. So how did they pull that off? IMO, what they did was write C in the spirit of what Wirth was trying to accomplish. For the sake of maintenance costs, speed is now less of a metric thanks to hardware advances, and correctness is far more of an issue. Which makes sense, because all business is mission critical now and all business runs on more and more software. Would you turn off bounds checks in the compiler now? How about for the programmer who you'll never meet who is writing autonomous driver code for the car you drive?
Way too much money was wasted on the near-sighted value of C. Time to move on, according to Rust developers, who undoubtedly have an impressive background as C/++ programmers. So yeah we all have to follow this C dominated narrative even today, and my charge is, this narrative has retarded the art of programming. So I stick to my original proposition: Whatever you think is great about Rust, like ownership and borrowing, would have been available in production M2 code 20 years ago if we had just given Wirth languages a chance to advance the art in the commercial world. But that narrative would have been too wordy and constraining.
> It is written in the Pascal language, a well-stabilized standard language with compatibility promises for decades. Using the FreePascal compiler and the Lazarus IDE for free and easy development.
The code is written to read like pseudocode. Where needed, comments are added to help.
The whole thing makes sense to me and I can't see any major points of criticism in the design rationale. Some thoughts:
* There is already a "native" Sqlite3 container solution called Sqlar [0].
* Sqlite3 itself is certainly suitable as a base and I wouldn't worry about its future at all.
* Pascal is also an interesting choice, it is not the hippest language nor a new kid on the block, but offers its own advantages as being "boring" and "normal". I am thinking especially of the Lindy effect [1].
All in all a nice surprise and I am curious to see the future of Pack. After all, it can only succeed if it gets a stable, critical mass of supporters, both from the user and maintainer spectrum.
Here is the latest sqlar result on Linux source code on the same test machine in warm state:
sqlar: 268 MB, 30.5 s
Pack: 194 MB, 1.3 s
Very good result compared to tar.gz. And much better than ZIP, considering sqlar gives random access like ZIP and unlike tar.gz.
I considered sqlar a proof of concept, and it inspired me to create Pack as a full solution. I always agreed with the great drh (creator of SQLite) and his points about SQLite as a file format, and Pack is an attempt to demonstrate that.
I made Pack to give people a better life (at least behind their desks), and like you, I hope people get to use it and find it useful.
And that wasn't even the original Lindy's but an unrelated/"unauthorized" reboot after the trademark was declared abandoned. The original, for which the law was named (it predates the naming), closed in 1969: from 1964 https://web.archive.org/web/20210619015733/https://www.gwern...
Sqlite3 is universal, but now your spec is entirely reliant on Sqlite3's format and all the quirks required to support that.
If you actually care about the future, spec out your own database format and use that instead. It could even be mostly a copy of Sqlite3, but at least it would be part of the spec itself.
You're not "wrong" but Sqlite isn't your run-of-the-mill project. "The SQLite file format is stable, cross-platform, and backwards compatible and the developers pledge to keep it that way through the year 2050." [1]
No, you will only need Pack, everything is built into it. Pack is built for Windows and Linux, and more will come. You will be able to run it on almost all CPUs.
I suppose you do need bindings for every language. But sqlite is in C and is extremely popular. If you can't get bindings as one of the first third party libraries your language supports, it's probably a shitty language anyway.
How different is this to any other run of the mill project with few active developers on a single implementation, with backwards compatibility based entirely on promises?
I had to make an archiver once (commercial), so I did think about it a bit. I am not sure Pack would solve anything for me. It obviously solves the author's use cases, but tar has some tricks which I don't want to lose:
* Able to write to a pipe/socket - lets you not waste space or time by writing to disc something that you intend to transmit over a pipe or TCP socket anyhow. It's almost a "virtual" archive and it should be possible to make one that is far too big to fit into memory - because as you send each bit of it you deallocate that memory. At the receiver each bit can be written to disc or extracted and then that memory is reused for the next bit - so the archive never fully "exists" but it does the job of serialising some data. An example could be piping the output of tar to an ssh command which untars it on a remote machine.
* Metadata has to be with the file data - not stuck at the end of the file - because you need to be able to start work without waiting till the file is fully received through your pipe. You don't want to be forced to have space to store the archive and the extracted files (may be a huge archive).
* Choice of compression - lzop is super fast such that using it can sometimes give slightly better performance than writing uncompressed data. OTOH that might not be your concern and XZ might suit you by compressing much more thoroughly. Either way it's very nice to have compression that works across multiple files - which is especially helpful when compressing a lot of small files such as source code.
* Ability to encapsulate - should be able to put the packed data into any imaginable container like an encryption or data transmission protocol without insisting that the entire archive has to be fully read before members can start to be extracted/processed. This is essentially the same as the pipe/socket requirement.
I'm not saying that these things matter to everyone - I have just found them incredibly useful in a few critical situations. The world of ZIP users on Windows seems to be sort of blind to them - thinking firmly in that box.
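For context, the pipe/socket behaviour described above maps onto tar's streaming write modes; as a stand-in for `tar ... | ssh ... tar -x`, the Python equivalent looks roughly like this:

    import sys, tarfile

    # "w|gz" writes a non-seekable stream: each member is emitted as it is
    # added, so the archive can flow straight into a pipe or socket and
    # never needs to exist on disk in full.
    with tarfile.open(mode="w|gz", fileobj=sys.stdout.buffer) as tar:
        tar.add("some/directory")

The receiving side can likewise read with mode "r|gz" and start extracting members before the stream has finished arriving.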
- Piping is really easy and it will get added to Pack. It is a matter of time until these features get added, as they will be added based on popularity, and piping is not that popular for most people. But I get you, and I will add it for you.
- Metadata is not stored in Pack. I don’t want the metadata of my machine attached to a file. It’s a never-ending nightmare to match source and destination OS metadata. There will always be something missing, and Pack tends to get everything perfect or nothing. Storing metadata adds extra weight that most users don’t care about and complicates the ability to store other types of data alongside files. It may get added as an option in the future if many people need it.
- Pack uses Zstandard under the hood. Great compression speed and ratio. In my opinion, it is the leading algorithm in the field and makes it a proper choice to use instead of DEFLATE, used in ZIP or GZIP.
- At this point you are listing tar features. tar is not random access; Pack is. As an example, if I want to extract just a .c file from the whole codebase of Linux, it can be done (on my machine) in 30 ms, compared to near 500 ms for WinRAR or 2500 ms for tar.gz. And it only gets worse once you factor in encryption. For now, Pack encryption is not public, but when it is, you can access a file in a locked Pack file in a matter of milliseconds rather than seconds.
I think random access is a useful feature - e.g. if you want to compress code modules like java does so that you can still load individual modules quickly.
This is not something I've wanted yet personally but that's just random chance. When I do need it I will know which tool to use! Thanks.
It could be handy to be able to mount a pack like a filesystem.
Indeed. Benchmarking Pack as a file system was fun. It lets you iterate over all the files nearly 10X faster than what I get from NTFS (warm, with all caching on, for both solutions).
Someday, it can be used as a virtual drive. I leave it to future people.
> The world of ZIP users on Windows seems to be sort of blind to them - thinking firmly in that box.
Seems to me like this is you being blind to certain use cases, and so stuck in your streaming oriented box that you can not conceive of other use cases where a streaming format is actively detrimental.
sqlite is great as a "file format" for a particular application, but I think it's a bad interchange format.
As mediocre as zip and tar are, you can cobble together read/write support without even needing a library. With sqlite, your only real option is to bundle sqlite itself, and while it's relatively lightweight, it's far from trivial.
zip has support for zstd, and if you wanted to make it go faster, you could embed some index metadata.
I can't see any specs for their format, not even a description of the sqlite tables.
According to the `.indexes` directive, there are... no indexes. What's the point of sqlite if you're not going to index things?
All the data is stored in one big blob (the "Value" column of the "Content" table), with the metadata storing offsets into it. It looks like there's still the possibility of things being split over multiple blobs (to circumvent the 2GB blob size limit)
- Custom sqlite magic bytes makes the format incompatible with all existing sqlite tooling.
- No support for file metadata.
- There's no version field (afaict), making future format improvements difficult.
Edit: A previous version of this comment had a much longer list of complaints, but after taking a closer look, I retract them. I was looking at the MediaKit.pack file as an example, which, due to being relatively small, packed all its files into a single BLOB. I was under the mistaken impression that the same approach was taken for larger files, but after some further testing I see that they're split up into ~8MB chunks.
Though, if you have lots of small files (say, a couple of kilobytes each) then random access performance could suffer.
Hello David, and thank you for your comment, analysis, and the issues you opened. I will get to them all.
- SQLite tooling: You will not need it unless you are debugging something, then you can change the header or just use the `--activate-other-options --transform-to-sqlite3` parameter to transform a Pack file to SQLite3, and use the `--activate-other-options --transform-to-pack` to go back. This way, you get a true SQLite3 database that you can browse as you wish. For most people, mixing Pack with SQLite was just a call for problems for the SQLite team (imagine people coming and asking to fix their Pack file from the team; that would not be fair) and a harder future for Pack to update.
- Metadata is not stored in Pack. I don’t want the metadata of my machine attached to a file. It’s a never-ending nightmare to match source and destination OS metadata. There will always be something missing, and Pack tends to get everything perfect or nothing. Storing metadata adds extra weight that most users don’t care about and complicates the ability to store other types of data alongside files. It may get added as an option in the future if many people need it.
- All future versions of Pack must handle previous versions and must only write the latest version. So any files created right now (Draft 0) will be read correctly forever.
- Each Draft proposal will get its own version, and if it becomes final, it will be marked Final.
- The version is stored as two bytes after the 'Pack' header, little-endian: (1 (Draft) shl 13) + 0 (version 0) = 8192. Final would be 0, so the first Final version will be (0 shl 13) + 1 = 1, and the second will be 2. It is by design, so any Draft version gets a higher number, preventing future mix-ups (see the sketch after this list).
- 8 MB chunks are the default; Pack may choose smaller or bigger (16 MB for many small files or 32 MB for Hard Press).
- Random access is done properly: the unpacking steps take into account what you want and decompress a Content just once for many neighbouring files. But even for reading just one file, here is an example: if I want to extract just a .c file from the whole codebase of Linux, it can be done (on my machine) in 30 ms, compared to near 500 ms for WinRAR or 2500 ms for tar.gz. And it only gets worse once you factor in encryption. For now, Pack encryption is not public, but when it is, you can access a file in a locked Pack file in a matter of milliseconds rather than seconds.
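Taking the version-byte description above at face value, encoding and decoding would look like this (a sketch of the stated scheme, not official Pack code):

    import struct

    DRAFT_BIT = 1 << 13

    def encode_version(is_draft: bool, number: int) -> bytes:
        # Draft 0 -> 8192; first Final -> 1, second Final -> 2, ...
        value = (DRAFT_BIT if is_draft else 0) + number
        return struct.pack("<H", value)        # two bytes, little-endian

    def decode_version(two_bytes: bytes):
        value, = struct.unpack("<H", two_bytes)
        return bool(value & DRAFT_BIT), value & (DRAFT_BIT - 1)

    assert encode_version(True, 0) == struct.pack("<H", 8192)
    assert decode_version(encode_version(False, 1)) == (False, 1)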
Thank you for the detailed response(s). I must admit I'm warming up to the idea of Pack, it does perform well in my testing (I didn't test at first because I'm on aarch64 linux, for which there are no compatible builds).
Not including metadata is an opinionated stance, but I can certainly get behind it, especially as a default. 99% of the time I do not care about metadata when producing a file archive.
Compatibility with existing SQLite tooling is not just useful for debugging, it is extremely useful for writing alternative implementations. If you want Pack to be successful as a format and not just as a piece of software, I think you should do everything you can to make this easier.
In my experimentation, I wrote a simple python script to extract files from a Pack archive. Conveniently, sqlite is part of the python standard library, but in order to make it work with that version (as opposed to compiling my own) I had to edit the file header first, which is inconvenient and not always possible to do (e.g. if write permissions are not available).
Despite that inconvenience, it took less code than a comparably basic ZIP extractor, which is cool!
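In case it helps anyone attempting the same, the header edit boils down to restoring the standard 16-byte SQLite magic on a copy of the file before opening it (a sketch based on the description in this thread; file names are illustrative):

    import shutil, sqlite3

    SQLITE_MAGIC = b"SQLite format 3\x00"   # standard 16-byte SQLite header string

    shutil.copy("archive.pack", "archive.sqlite3")
    with open("archive.sqlite3", "r+b") as f:
        f.write(SQLITE_MAGIC)               # overwrite the custom Pack header bytes

    db = sqlite3.connect("archive.sqlite3")
    print(db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())

Working on a copy also sidesteps the write-permission problem mentioned above.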
I worry that requiring a custom VFS will make it harder for people to produce compatible software implementations.
I think your concerns about people contacting SQLite for support are overblown. I assume you've heard the `etilqs_` story[0], but in this case, you need to use a hex-editor or a utility like `file` to even see the header bytes. I think anyone capable of discovering that it's an SQLite DB will be smart enough not to contact SQLite for support with it.
The `Application ID`[1] field in the SQLite header is designed with this exact purpose in mind
> The application_id PRAGMA is used to query or set the 32-bit signed big-endian "Application ID" integer located at offset 68 into the database header. Applications that use SQLite as their application file-format should set the Application ID integer to a unique integer so that utilities such as file(1) can determine the specific file type rather than just reporting "SQLite3 Database".
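For reference, reading or setting that field from Python's sqlite3 is a one-liner each way (the value below, ASCII 'PACK', is just an illustrative choice):

    import sqlite3

    db = sqlite3.connect("example.db")
    db.execute("PRAGMA application_id = 1346454347")       # 0x5041434B, ASCII 'PACK'
    print(db.execute("PRAGMA application_id").fetchone())  # -> (1346454347,)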
I am happy to hear that, and I really appreciate your interest.
Did you compile it for yourself? I will be happy to hear about any problems or the steps you used, at o at pack.ac or on GitHub, as it is hard to follow build issues here.
As a reminder, Pack Draft 0 has Compatibility with SQLite tools; the only needed step is to change the first 16 bytes. Again, you can use `--activate-other-options --transform-to-sqlite3` with the CLI tool, and you will get a perfectly working SQLite file.
VFS is not needed; they can change the header after writing; VFS was just cleaner to me.
My first approach used application_id; after a while, it did not feel right to me, so I changed it for good. This allows easier future development, fewer problems with file type detection, a decreased chance of mistaken changes (you already saw many negative comments on using SQLite as a base), and the support reason: just yesterday I was reading a forum post about people asking a project for support because it was using SQLite. application_id seems like a great choice if you are doing a DB-related task or making a custom DB for transfer on the wire, to communicate between internal and semi-public tools. Using it for a format that could potentially reach an innumerable count of files seemed unwise.
No indexes, as they take space, and I wrote the queries considering SQLite automatic indexes. They will be created on demand, at unpacking time. All the unpacking processes are made to read and decompress content just once, so there are no worries about slowdowns.
I suggest trying Pack for yourself and seeing the speed.
Or deeper, use `--activate-other-options --transform-to-sqlite3` to transform a Pack file to SQLite3, create your own indexes, and use `--activate-other-options --transform-to-pack` to convert it to Pack and then try unpacking. You will not see any worthy difference.
Yes, Contents are like packages of raw data holding a chunk of an item, a whole item, or many items (files or data). They may be compressed if needed (with Zstandard). The ItemContent table helps to find the needed Item parts.
The Content structure circumvents any BLOB limit, but it is also made to give better compression while keeping random access.
I guess you are overestimating the "cobble together read/write support without even needing a library."
Let's imagine: You want to read a ZIP file. Will you write your own reader? I seriously doubt it, as the work, stabilising, and security (random memory access as an example) would be issues.
But let's say we are courageous. OK, we carefully read through a rather not-so-simple binary format. Now, will you write your own DEFLATE and Huffman coding? Again, an even bigger doubt.
I would argue that if someone cares enough to reimplement ZIP, it would at worst be twice as hard to write a Pack reader from scratch with no ZSTD or SQLite. And for those serious people, reading a format that lets them store better and faster would be a prize that is hard to say no to.
But I get your point, and if you are in a desert and need something to put together fast before running out of water, tar may be a good choice.
I have written my own zip, deflate, and huffman coding - although the latter two were "just for fun". But I would definitely consider writing ad-hoc zip logic in real software, if I couldn't pull in a library for whatever reason. This isn't just a hypothetical, it happens a lot - there are many independent ZIP implementations in the wild, for better or for worse.
You're right to call out security though, because the multiple implementations cause security issues where they disagree, my favorite example being https://bugzilla.mozilla.org/show_bug.cgi?id=1534483 . Although arguably this is a symptom of ZIP being a poorly thought out file format (too many ambiguous edge-cases), rather than a symptom of it being easy to implement.
You are one of the bravest. And you know that using SQLite as the base storage rules out many of the security problems we could face.
Anyone needing to reimplement Pack can do it very easily, if not more easily than implementing ZIP, IF they use SQLite and Zstandard. Maybe a day of work or less. If they want to rewrite (the reading part of) those too, it will be a couple of days of work.
Complaining that all existing tools are old, but I'm looking at the documentation and what immediately catches my eye is that it doesn't use any modern convention I've gotten used to?
Overwrite with "-w". I've never seen a tool not use "-f"
Not reserving "-h" for help text is also an interesting choice. Makes me think of the mantra "be conservative in what you send out but liberal in what you accept". Per that philosophy, both "--help" and "-h" should be accepted because neither gained a decisive majority in usage and so people might try either. It's not like you'd know what to use because it hasn't told you yet
Forcing use of a long option "--press=N" (for the zstd level setting) is also new/unique terminology for what is usually "-N" (like "-1" to "-9")
(Basic) drop-in compatibility with every other tool from gzip to zstd would also have been nice, but archiving and compression are different things and everything from zip to 7z to tar works in unique ways so this makes enough sense I guess. Still, could have been useful
It's still better than tar or ps, so if it catches on that's still a step forwards in terms of command line standards
Hmmm. I went to the doco hoping for something about the file format :( No doco for me. I guess that would be too old fashioned.
It seems to be "read the code" or nothing - which is fine until they update the code... It's great (probably should be mandatory) to have a reference program, but if they're promoting it as a container format, something along the lines of an RFC would be helpful.
The choice of parameters was done solely to be clear, not to match what people are used to. -f meaning force is not clear; -w meaning overwrite seemed like a better logical choice to me.
Nice point on -h. Yes, I did not want to go crazy. After all, almost all (CLI) people use pack as `pack ./test/`. Options are for advanced people like you. Most people will use the OS integration that will be published later on.
--press=hard is the only option there is. There may be more, but with Pack you do not need to choose a level (like 1..9 with ZIP). Just let Pack do its thing, and you will be happy. Hard Press is there for people who want to pack once and unpack many times (like publishing), and it is worth spending extra time on it. Even then, Pack goes the sane way and does not eat your computer just for a kilobyte or two.
> Just let Pack do its thing, and you will be happy
Well, no. Sometimes I want maximum compression and have a lot of CPU and wall-clock time to spend. And sometimes being fast is more important than compression ratio. Managing server utilization also matters. The level setting is there for a reason.
This is not a binary choice; an actual level of effort needs to be configurable. I've seen people fine-tune compression levels many times, in all kinds of automation scenarios.
Thank you for the notes. I am well aware of the levels, and Pack uses a custom configuration to match its inner design.
Maybe more levels will come, or maybe not. But to be clear, Pack supports any valid Zstandard content; the levels we are discussing are a Pack CLI choice made for a better user experience. Any other client can produce and store valid content at whatever level or configuration it chooses, and other clients can read it.
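A small sketch of that point using the python-zstandard bindings (not Pack itself): the level is purely a writer-side knob, and a reader decodes any resulting frame without knowing which level was used. The file name is a placeholder.

    import zstandard

    data = open("sample.bin", "rb").read()  # any payload; path is a placeholder

    # All of these are valid Zstandard frames, regardless of the level chosen.
    frames = {level: zstandard.ZstdCompressor(level=level).compress(data)
              for level in (1, 3, 19)}

    dctx = zstandard.ZstdDecompressor()
    for level, frame in frames.items():
        assert dctx.decompress(frame) == data   # the reader never needs the level
        print(level, "->", len(frame), "bytes")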
There are many different use cases, and each one has a different set of requirements. E.g.:
- For an end-user-facing CLI: support as many conventions as possible (-v/--version, -h/--help, other options similar to other compressors), sane defaults.
- For automatic tasks like making backups via cron: piping, correct exit codes, level-of-effort configuration, silent modes for reduced logging.
The second one is likely to fly first; it can be used in isolation at the company level if the file format is stable.
Summary : Convert code into runnable images
URL : https://github.com/buildpacks/pack
License : Apache-2.0 and BSD-2-Clause and BSD-3-Clause and ISC and MIT
Description : pack is a CLI implementation of the Platform Interface Specification
: for Cloud Native Buildpacks.
I went to do some testing in a sandbox system (as the compilation strategy is unclear and builds are missing for some of the artifacts).
I was initially able to construct an archive of the Linux tree (which then failed to decompress), but when I went to rebuild it, the tool repeatedly produced this output, even in an otherwise cleaned-up environment:
Runtime error 203 at $0000000100009B72
$0000000100009B72
$00000001000225B6
$0000000100023153
$000000010002319A
The first output did work; sadly, I deleted it. It was over 400 MB, approximately double the size of a zip or a tar.zst of the same files.
zstd natively compresses a tar of this set to 209 MB in 800 ms in multi-threaded mode, or 3.5 s in single-threaded mode.
I suspect that SQLite is being held incorrectly (access from multiple threads with multi-threading disabled) and that the VFS lock forwarding is broken on Windows.
Did you compile it yourself? Any problems or the steps you used, I will be happy to hear about at pack.ac or on GitHub, as it is hard to follow build issues here. I should prepare more documentation on how to build it.
I suspect that there is a problem with the custom build. Error and speed issues are not something you see in the official build.
That was the binary download from the website. You have a build.sh for the Linux binary artifacts, but no equivalent for the Windows artifacts, so I did not bother preparing a Windows build.
Windows 11 sandbox, running atop Windows 11. Binary downloaded from your webpage.
The data being packed was a copy of linux-master.zip fetched from GitHub and unpacked with the built-in Windows zip support, selecting "skip" for the files whose names collide only by case.
If I need to compress stuff, it’s either to move a folder around (to places which may not have a niche compression tool, so ZIP wins), or to archive something long-term, where it can take a while to compress. I don’t see the advantages of this, since the compression output size seems quite mediocre even if it’s supposedly fast (compared to what implementations of the other formats?)
Hell yeah, some FPC stuff showing its moves. The devs even put together an lpk to load up - bravo! Look for more of this in the future as companies look for alternatives to corporate commodity programming and databases tethered to major cloud resources. I have a major FPC effort going on right now that I hope to offer on four platforms: browser, Windows, Linux, and Mac.
I wrote a similar general-purpose packer way back in the early 90's in TopSpeed Modula-2, run from the command line. It needed to span multiple disks and self-launch. The algorithm was fast, but nowhere near the same compression ratios. I wore out Mark Nelson's classic, "The Data Compression Book", along the way.
Hello to all.
I am the author, and I just saw this post and am happy to see this exciting discussion.
Let me try to show my respect for it and answer as well as possible.
Thanks for the note, and sorry for the inconvenience. I did not expect this many users in the iOS world. The site is very new and needs custom work; it will be updated soon.
I think generally it's a mobile layout issue. On Firefox for Android, scrolling the page still triggers a click/mouseup/focus event I guess - when you let go of your finger, it toggles the state of the "Note" -> "Pack" section, so it tends to hide itself as you're reading it!
Would just remove that "accordion" functionality completely or make it always expanded on mobile breakpoints or whatever. Or just move that entire "About pack" section to be on the main page "below the fold" as the first thing people are going to want to do is find out *what it is* :)
About Pascal:
It makes me happy. It looks clean and pseudocode-like, which helps readers from around the world with different native languages understand it. I am glad that Pack made people curious about this old but great goodie.
About speed, here are some reasons:
- Pack does all the steps of packing or unpacking (read, check, compress/decompress, and write) together, and this weaving helped achieve both the speed and the random access. It is by far the fastest way I have seen to read or write random files from a file system, as fast as or faster than asynchronous read operations or OVERLAPPED on Windows. Past a point, it is limited by the file system: on NTFS, Pack can pack the Linux code base in around 1.3 s; the same job takes about 0.96 s on ext4. (A generic sketch of this read/compress/write weaving is included after this list.)
- It is built on a heavily optimized code base, standard library, and the Free Pascal compiler, which produces great binaries.
- Multi-core design: even mobiles have multi-core CPUs these days. By choosing thread counts based on the content and the machine, it does not eat your machine.
- Speed-configured SQLite. SQLite is much faster than most people think it is.
- A tuned configuration of the already rapid Zstandard.
In summary, standing on the shoulders of giants while trying hard to improve reliability, speed and user experience is a sign of respect for them.
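The sketch referenced in the first point above, for readers who want the shape of the idea rather than Pack's actual Free Pascal code: read and compress files on worker threads while a single thread writes into SQLite. The schema, level, and thread-pool details are invented for illustration only.

    import os
    import sqlite3
    import zstandard
    from concurrent.futures import ThreadPoolExecutor

    def pack_tree(root: str, archive: str) -> None:
        con = sqlite3.connect(archive)
        con.execute("CREATE TABLE IF NOT EXISTS Items(Name TEXT PRIMARY KEY, Content BLOB)")

        def load_and_compress(path: str):
            # Each worker reads one file and turns it into a Zstandard frame.
            with open(path, "rb") as f:
                data = f.read()
            return os.path.relpath(path, root), zstandard.ZstdCompressor(level=3).compress(data)

        paths = [os.path.join(d, n) for d, _, names in os.walk(root) for n in names]
        with ThreadPoolExecutor() as pool:
            # Reads and compression overlap on worker threads; the main thread
            # is the single SQLite writer, so the one-writer limit is respected.
            for name, blob in pool.map(load_and_compress, paths):
                con.execute("INSERT OR REPLACE INTO Items VALUES (?, ?)", (name, blob))
        con.commit()
        con.close()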
Thanks for the update! Somehow the 7z container overhead is +0.4 MB AND it is slower by a lot? Huh. Great numbers; the Pack format needs more exposure. Also, I suggest adding a section to the website that shows these numbers.
I did not want to focus the point on speed, or say, "Look, others are bad." They are great; my point was, "Look what we can do if we update our design and code." Pack's value comes from user experience, and speed is one part of that. I was not chasing the best speed or compression; I wanted an instantaneous feeling for most files. I wanted a better API, an easier CLI, improved OS integration (soon), and more safety and reliability. Tech people (including me) care so much about speed.
I am happy about the results, but Pack offers much more that I would like others to see.
Who is behind this? It's a new GitHub org, the committer (https://github.com/OttoCoddo) has a totally private GitHub profile, and there's no name under "Legal". Sure, one can be anonymous, but I won't download it; I don't trust it.
That is the point: if you trust a project based on "who" made it, my friend, that is the start of the big problem we are facing in the current state of tech. Just look at the code, build it yourself, and check the license.
Pack is made to be a private option; future locking and encryption options will solidify that. Trusting the author is not the correct way to verify the security and safety of such a tool.
Your best bet is a lossless transform that undoes the Huffman coding in the zip files (converting the compressed streams from effectively uncorrelated bitstreams to largely similar byte streams), then passing the result through a large-window compression algorithm (zstd?).
Similar techniques are used in ChromeOS for delta updates.
Yes, I have a bunch of zips I cannot modify. If I could uncompress them all, and zip them into one big archive, then it would be much smaller than if I put all the compressed zips into one big archive. This is because there's a lot more shared strings between uncompressed files across different zips, than there are shared strings between the compressed zips.
(At least... this is my assumption based on how I understand the formats to work. I do need to measure and verify this.)
So basically I'm wondering if there is some way to tell the "outer" zip to reuse the encoding dictionaries of each smaller zip, or to somehow intelligently merge their encoding dictionaries rather than treating each inner zip like an opaque blob.
No, the ZIP format compresses each embedded file separately, so it doesn't matter how much commonality there is between different files in the same archive.
Wow, really? I always assumed that if I zipped a directory, it would use a shared dictionary to compress all the files. But I guess what you say makes sense, because otherwise it wouldn't be possible to extract just one file.
Are there compression formats that do share an encoding dictionary across multiple files? I guess tar + gzip might do that?
Yeah, if you compress a tar file with gzip (or bzip2 or zstd or whatever) then the compressor doesn't care about the file boundaries, so it will be able to take advantage of redundancies.
However, those compressors generally only have a small context window, so they'll only be able to take advantage of relatively nearby redundancies in the archive. They won't help you if the common substrings are separated by many megabytes of unrelated data. So the order in which you pack the files matters.
In theory, you could do a two-pass approach, where you first scan the entire set of files to create a (relatively small) shared dictionary that's as useful as possible, and then use that dictionary to compress each file independently. I don't know of any archive format that does that, but you could roll your own using the zstd command line compressor.
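For instance, here is that two-pass idea sketched with the python-zstandard bindings rather than the CLI. The directory path is a placeholder and the dictionary size is an arbitrary choice; the key property is that each file is compressed on its own, yet against a dictionary trained over the whole set.

    import pathlib
    import zstandard

    paths = [p for p in pathlib.Path("my_data").rglob("*") if p.is_file()]
    samples = [p.read_bytes() for p in paths]

    # Pass 1: train a shared dictionary (here ~110 KB) over the whole file set.
    dictionary = zstandard.train_dictionary(110 * 1024, samples)

    # Pass 2: compress each file independently, but against the shared
    # dictionary, so any single file stays individually decompressible.
    cctx = zstandard.ZstdCompressor(dict_data=dictionary)
    compressed = {str(p): cctx.compress(data) for p, data in zip(paths, samples)}

    # Reading one file back only needs that file's frame plus the dictionary.
    dctx = zstandard.ZstdDecompressor(dict_data=dictionary)
    original = dctx.decompress(compressed[str(paths[0])])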
Is anyone using this or SQLite archives for anything at scale? They always seemed like a good solution for certain scientific outputs, but data integrity is obviously a concern.
As far as I know, PAQ strives to give the best compression for archival purposes. Pack, on the other hand, tries to give the best compression, while keeping speed as instantaneous as possible.
I know you don't have a duty to look around for your answer, but you too don't have a duty to say yuck to a project that has been done with a lot of effort.
I am ok with your comment, but maybe go easy on the next project.
> Pack format is based on the universal and arguably one of the most safe and stable file formats, SQLite3, and compression is based on Zstandard, the leading standard algorithm in the field.
yeah, no thanks. SQLite3 automatically means:
- Single implementation (yes, it's a nice one but still a single dependency)
- No way to write directly to pipe (SQlite requires real on-disk file)
- No standard way to read without getting the whole file first
- No guarantees in number of disk seeks required to open the file (relevant for NFS, sshfs or any other remote filesystem use)
- The archive file might be changed just by opening in read-only mode
- Damaged file recovery is very hard
- Writing is explicitly not protected against several common scenarios, like backup being taken in the middle of file write
- No parallel reads from multiple threads
Look, sqlite3 is great for its designed purpose (an embedded database). But trying to apply it to other purposes is often a bad idea.
Often new things are met with an excess of skepticism but I agree here.
I'd take this more seriously if the format was documented at all, but so far it appears to be "this implementation relies on sqlite and zstd therefore it's better", without even a specification of the sql schema, let alone anything else.
The GitHub repo contains precompiled binaries of zstd and sqlite. The sqlite builds appear to have thread support disabled, so not only will it be single-writer, it'll be single-reader too.
The schema is missing strictly typed tables, and the implementation appears to lack explicit collation handling for names and content.
The described benchmark appears to involve files with an average size of 16KB. I suspect it was executed on Windows on top of NTFS with an AV package running, which is a pathological case for single threaded use of the POSIXy IO APIs that undoubtedly most of the selected implementations use.
It's slightly odd that it appears to perform better when SQLite is being built with thread safety disabled (https://github.com/PackOrganization/Pack/blob/main/Libraries...) and yet the implementation is inserting in a thread group: https://github.com/PackOrganization/Pack/blob/main/Source/Dr.... I suspect the answer here is that because the implementation is using a thread group to read files and compress chunks, it's amortizing the slow cost of file opens in this benchmark using threading, but is heavily constrained by the sqlite locking - and the compression ratio will take a substantial hit in some cases as a result of the limited range of each compression operation. I suspect that zstd(1) with -T0 would outperform this for speed and compression ratio, and it's already installed on a lot of systems - even Windows 11 gained native support for .zst files recently.
The premise that we could do with something more portable than TAR and with less baggage is somewhat reasonable; we probably could do with a simple, safe format. There are a lot more key considerations to making such a format good, including many you outline: choices around seeking, syncing, incremental updates, compression efficiency, parallelism, etc. There is no single set of trade-offs to cover all cases, but it would be possible to make a file format that can be shared among them, while constraining the design somewhat for safety and ease of portability.
Hello, and thank you for the notes. Unfortunately, your points seem to be mostly wrong, so let me clarify them a little. Do not worry; many people misunderstand SQLite and its abilities.
- Single implementation: sure. Working with SQLite convinced me that nobody cared to reimplement it because it works so well that nobody wanted or needed to. I may write an unpacker just to prove that it is not hard at all to read the SQLite format. The complicated part is the SQL engine (and many other features that Pack does not use), and for Pack you can live without it.
- SQLite does not require a disk; it has a memory option. Pack can have piping and probably will. I did not implement it because, well, it is too new, and I do what I feel is needed first. You can subscribe to the newsletter on the site (https://pack.ac/notes) or follow GitHub.
- Of course you can read SQLite without reading the whole file. It is a database, not a tar file.
- SQLite is highly optimized to read the lowest amount of data, and it has layers of smart caching. There is a reason it is used on almost any device that has a computer on it, even smartwatches.
- Of course the archive is safe from changes during unpacking. It will be opened in read-only mode, guarded by the OS and file system, and Pack also uses code isolation, which prevents calling write on any file. (A small example of such a read-only open is sketched after this list.)
- There are a lot of tools that help repair a damaged SQLite file. Pack is also guarded with transactions. The file will not get corrupted unless the disk itself goes corrupt; the mentioned tools come in handy then. And in today's world of SSDs, that risk is shrinking rapidly.
- On unpack, Pack reads, decompresses, checks, and writes in a multithreaded fashion. So yes, parallel reading is possible and is done in Pack.
- I suggest trying Pack for yourself. It gives you the feeling you need to have to be sure.
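The read-only open mentioned above, sketched with Python's sqlite3 module. Whether Pack uses exactly these options is an assumption, but mode=ro plus immutable=1 is the standard SQLite way to guarantee a reader never writes to the file; immutable also skips journal/WAL recovery, which is what can otherwise modify a database that is only being read.

    import sqlite3

    # Open the archive strictly read-only; immutable=1 tells SQLite the file
    # cannot change, so it will neither lock it nor attempt journal recovery.
    con = sqlite3.connect("file:example.pack?mode=ro&immutable=1", uri=True)
    tables = con.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    print(tables)
    con.close()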
I am very curious how you are going to get SQLite working with piping, especially on extract... It's pretty common to do stuff like "curl ... | tar xvf -", so that you can start extraction the moment the first kilobyte of data arrives. This really saves a lot of time, as disk and network work in parallel.
A less common tar feature is packing to a stream -- stuff like "ssh remote tar cvf - ... > local-file.tar", which skips the temporary file on the remote machine and also saves a lot of time in transfer.
But for both of those, sqlite's "memory" option won't help you: memory or not, you still need the entire file before you can read it. So if you just store file contents in the SQL database, you have to fetch everything up to the last byte before you can get any data out.
Maybe you can keep the index in sqlite and append the data as-is... but where would you put that index?
If you put it in front (like squashfs), you need to produce the entire metadata before writing the first data byte... and that has to include compressed sizes too (assuming you want to support random extraction), which means you cannot stream the file out until you finish compressing all the data. Also, sometimes you will not be able to add files to the archive without rewriting the whole thing (if the index grows and you didn't leave enough padding). This might be OK, but it definitely should be mentioned.
If you put it at the end (like zip), you will be able to stream the file out during compression, but fast decompression would be impossible. Also, you'll forgo any sqlite transactional guarantees, since the database will be created in memory and only written at the very end once all the files are done.
So frankly, I don't see how you can win on a streaming front, unless you really have a custom format and "sqlite3" is just a small part of it.
(Another problem is there is not even a short spec - how is sqlite3 used, what is your schema, and so on. And I am sorry, but I am not going to read the source code just to figure this stuff out).
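To make the streaming point above concrete, this is roughly what "curl ... | tar xvf -"-style extraction looks like in code: decompression and extraction start as soon as bytes arrive, with nothing buffered to disk first. The URL is a placeholder, and the sketch uses tar + zstd, not Pack.

    import tarfile
    import urllib.request
    import zstandard

    with urllib.request.urlopen("https://example.com/big-archive.tar.zst") as resp:
        # Decompress incrementally as bytes arrive from the network...
        with zstandard.ZstdDecompressor().stream_reader(resp) as zreader:
            # ...and let tarfile consume the stream entry by entry
            # ("r|" means a non-seekable stream).
            with tarfile.open(fileobj=zreader, mode="r|") as tar:
                for member in tar:
                    tar.extract(member, path="out")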
I think that point is about it not supporting streaming compression where the output of the packer is immediately fed into something like a pipe or a TCP connection.
You can do this with both tar and ZIP. If all you have is SQLite, you need to fully create the local database (be it in a file or in memory) before you can transmit it somewhere else to be stored or unpacked.
Which is a weird gripe - in this use case, `tar` makes the most sense to use. It's not like this format claims to be a full replacement for tar or anything close to that.
That's exactly what they do - they explicitly call out this format as a replacement for tar and zip, at least:
> Most popular solutions like Zip, gzip, tar, RAR, or 7-Zip are near or more than three decades old. While holding value for such a long time (in the computer world) is testimony to their greatness, the work is far from done.
> Pack tries to continue this path as the next step and proposes the next universal choice in the field.
Yes, I am.
Pack can hold millions of files with no problem. One field in which it shines is that, aside from being fast at processing large amounts of data, it can process many small files much faster than similar tools or even many popular file systems.
About piping: it can be done, and it is on my list. I will finish features based on their popularity and whether they make sense.
As Pack has random access support, you can choose a file in a big pack, and it can stream it out to the output. It is already able to unpack partially to your file system (using --include="file path in pack"); streaming/piping it would not be a problem.
It's not safe to disable SQLite's thread safety as you do here: https://github.com/PackOrganization/Pack/blob/main/Libraries... and then do your own locking. You attempt to pass the flag at open time to enable serialized mode; however, quoting the SQLite docs for the build flag you set:
Note that when SQLite is compiled with SQLITE_THREADSAFE=0, the code to make SQLite threadsafe is omitted from the build. When this occurs, it is impossible to change the threading mode at start-time or run-time.
SQLite's APIs are often hazardous in these ways; it really should error rather than silently ignore the fullmutex flag, but alas.
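As an aside, an easy way to see how a given SQLite build was compiled is the compile_options pragma. The snippet below inspects whatever SQLite library the running process links against (here, the copy bundled with Python), not Pack's bundled build, so it only illustrates the check itself.

    import sqlite3

    con = sqlite3.connect(":memory:")
    for (opt,) in con.execute("PRAGMA compile_options"):
        if opt.startswith("THREADSAFE"):
            print(opt)  # e.g. "THREADSAFE=1" for the SQLite linked into this Python
    con.close()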
Also, if you dare to scroll up naturally after opening a container, it's interpreted as a refresh as the page redraws. It might be an awesome format, but the web design fail negates it entirely.
Had the same issue, plus I think one of my Safari extensions was hiding a cookie banner so half the page was just “dark”. A more mobile friendly view is absolutely needed.