I found squashfs to be a great archive format. It preserves Linux file ownership and permissions, you can extract individual files without parsing the entire archive (unlike tar), and it's mountable. It's also openable in 7-Zip.
I wonder how pack compares to it, but its home page and github don't tell much.
For sosreports (archives with lots of diagnostic command output and logfiles from a Linux host), I wanted to find a file format that can both use zstd compression (or maybe something else that is about as fast and compressible; they currently often use xz, which is very, very slow) -and- that lets you unpack a single file fast, with an index, ideally so you can mount it loopback or with FUSE or otherwise just quickly unpack a single file in a many-GB archive.
You'd be surprised that this basically doesn't exist right now. There's a bunch of half solutions, but no really good, easily available one. Some things add indexes to tar; zstd does support partial/offset unpacking without reading the entire archive in the code, but basically no one uses that function, which is kind of silly. There are zip and rar tools with zstd support, but they are not all cross-compatible and mostly don't exist in the packaged Linux versions.
squashfs with zstd added mostly fits the bill.
I was really surprised not to find anything else, given we had this in Zip and RAR files two decades ago. But nothing so far that would or could ship on a standard open source system has managed to modernise that feature set.
Such random access using `--include` is very fast.
As an example, if I want to extract just a .c file from the whole codebase of Linux, it can be done (on my machine) in 30 ms, compared to near 500 ms for WinRAR or 2500 ms for tar.gz. And it only gets worse once you factor in encryption. For now, Pack encryption is not public, but when it is, you can access a file in a locked Pack file in a matter of milliseconds rather than seconds.
I haven't had a chance to use it yet, but https://github.com/mhx/dwarfs claims to be many times faster than squashfs, to compress much better, and to have full FUSE support.
It has to be if you can "seek and selectively extract from" a zip file: the ability to do that relies on the ability to read the end of the archive for the central directory, then read the offset and size you get from that to get at the file you need.
squashfs may or may not be able to do it with as few roundtrips (I don't know the details of its layout), but S3 necessarily provides the capabilities for random access otherwise you'd have to download the entire content either way and the original query would be moot.
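For anyone curious what that lookup actually involves, here is a minimal sketch (plain Python against a local file instead of S3; the tail size is just the spec's maximum comment length): find the End of Central Directory record at the tail, then read the central directory offset and size out of it.

    import struct

    EOCD_SIG = b"PK\x05\x06"   # End of Central Directory signature
    EOCD_MIN = 22              # fixed-size part of the EOCD record

    def central_directory_location(path):
        """Return (offset, size, entry_count) of the zip central directory."""
        with open(path, "rb") as f:
            f.seek(0, 2)
            file_size = f.tell()
            # The EOCD sits in the last 22..22+65535 bytes (variable-length comment).
            tail_len = min(file_size, EOCD_MIN + 0xFFFF)
            f.seek(file_size - tail_len)
            tail = f.read(tail_len)
        pos = tail.rfind(EOCD_SIG)
        if pos < 0:
            raise ValueError("no EOCD record found; not a zip file?")
        (_sig, _disk, _cd_disk, _disk_entries, total_entries,
         cd_size, cd_offset, _comment_len) = struct.unpack(
            "<IHHHHIIH", tail[pos:pos + EOCD_MIN])
        return cd_offset, cd_size, total_entries

Against S3 the same idea becomes two or three ranged GETs: one for the tail, one for the central directory, and one for the entry's data.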
> You can read sequentially through a zip file and dynamically build up a central directory yourself and do whatever desired operations.
First, zip files already have a central directory so why would you do that?
Second, you seem to be missing the subject of this subthread entirely, the point is being able to selectively access S3 content without downloading the whole archive. If you sequentially read through the entire zip file, you are in fact downloading the whole archive.
Sorry, I wasn't clear before. You don't need the central directory to process a zipfile. You don't need random access to a zipfile to process it.
A zipfile can be treated as a stream of data and processed as each individual zip entry is seen in the download/read. NO random access is required.
Just enough memory for the local directory entry and buffers for the zipped data and unzipped data. The first two items should be covered by the downloaded zipfile buffer.
If you want to process the zipfile ASAP or don't have the resources to download the entire zipfile first before processing the zipfile, then this is a valid manner to handle the zipfile. If your desired data occurs before the entire zipfile has been downloaded, you can stop the download.
A zipfile can also be treated as a randomly accessed file as you mentioned. Some operations are faster that way - like browsing each zip entry's metadata.
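To make the streaming approach concrete, here is a rough sketch that walks local file headers straight off a non-seekable stream (assumptions: no ZIP64, and general-purpose flag bit 3 unset, i.e. no data descriptors, otherwise the sizes aren't known up front):

    import struct, zlib

    LFH_SIG = 0x04034b50  # local file header signature

    def iter_entries(stream):
        """Yield (name, data) for each entry as it appears in the stream."""
        while True:
            fixed = stream.read(30)
            if len(fixed) < 30:
                return
            (sig, _ver, flags, method, _mtime, _mdate, _crc, comp_size,
             _uncomp_size, name_len, extra_len) = struct.unpack("<IHHHHHIIIHH", fixed)
            if sig != LFH_SIG:
                return            # hit the central directory; local entries are done
            if flags & 0x08:
                raise ValueError("data descriptor in use; size not known up front")
            name = stream.read(name_len).decode("cp437")
            stream.read(extra_len)
            data = stream.read(comp_size)
            if method == 8:                        # DEFLATE
                data = zlib.decompress(data, -15)  # raw deflate, no zlib header
            yield name, data

As described above, you can stop reading (or downloading) as soon as the entry you wanted has been yielded.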
WIM is the closest thing Windows has to a full file-based capture, but I've noticed that even that doesn't capture everything, unfortunately. I forget exactly, but think it was extended attributes that DISM wouldn't capture, despite the /EA flag. Not sure if that was a file format limitation or just a DISM bug.
Very sad. Cross-platform extended attributes are the very thing I would love. I even imagine a new archive format which would be just a key-key-value store (I mean it - two keys, i.e. a set of key-value pairs for every top-level key - this is EA / NTFS streams) with values compressed using a common dictionary (also possibly encrypted/signed with a common key). Needless to say, such a format would enable almost any use case, especially if the layout of the file itself is architected right. macOS wouldn't have to add its special folder (the one it adds to every ZIP) anymore; tagging files and saving any metadata about them would be possible, as would saving multiple versions of the same file, or alternative names (e.g. what you received it with and what you renamed it to) for the same file.
I even dream about the day when a file's main stream would be pure data and all the metadata would go to EAs. Imagine an MP3 file where the main stream only records the sound with no ID3; all the metadata like the artist and song names are handled as EAs and can be used in file operation commands.
This can also be made tape-friendly and eliminate the need for TAR. Just make sure streams/EAs are written contiguously, closely related streams go right next to each other, compression is optional, and the ToC+dictionary is replicated in a number of places like the beginning, the middle and the end.
As you might have guessed, I use all the major OSes and single-OS solutions are of little use to me. It seems I'd just use SquashFS, but its use is limited on Windows because you can hardly make or mount one there - only unpack with 7zip.
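To make the key-key-value idea concrete, the logical model is just a two-level map; everything below (the item names, the "ea:" prefix) is invented purely for illustration:

    # archive[item_key][attribute_key] -> bytes
    archive = {
        "song.mp3": {
            "data":         b"...raw audio frames only, no ID3...",
            "ea:artist":    b"Some Artist",
            "ea:title":     b"Some Song",
            "ea:orig-name": b"track01.mp3",
        },
    }

Serialization, the shared compression dictionary, and the replicated ToC would sit underneath this model.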
It's easy to forget about supporting EAs on Windows - they are extremely uncommon because you practically need to be in kernelspace to write them. Ntoskrnl.exe has one or two EAs, almost nothing else does.
(ADS are super commonplace and the closer analogue to posix xattrs.)
I didn't know this, thanks. I thought xattrs and ADS were synonymous. Do SquashFS, ext4, HFS+ and APFS have ADS then?
I am looking forward to writing my own cross-platform app which would rely on attaching additional data and metadata to my files.
"need to be in kernelspace" does not sound very scary because a user app probably doesn't need to do this very job itself - isn't there usually an OS-native command-line tool which can be invoked to do it?
That's exactly what I'd like to avoid. I want to transfer a group of files (either to myself, friends, or website visitors), not make assumptions about the target system's permission set. For copies of my own data where permissions are relevant, I've got a restic backup
Wake me up if a simple standard comes to pass that neither has user/group ID, mode fields, nor bit-packed two-second-precision timestamps or similar silliness. Perhaps an executable bit for systems that insist on such things for being able to use the download in the intended way
(I self-made this before: a simple length-prefixed concatenation of filename and contents fields. The problem is that people would have to download an unpacker. That's not broadly useful unless it is, as in that one case, a software distribution which they're going to run anyway)
Sometimes you want to include data and sometimes you don't, for different reasons in different contexts. It's not a data handler's job to decide what data is or isn't included; it's the sender's job to decide what not to include and the receiver's job to decide what to ignore.
The simplest example is probably just the file path. tar or zip don't try to say whether or not a file in the container includes the full absolute path, a portion of the path, or no path.
The container should ideally be able to contain anything that any filesystem might have, or else it's not a generally useful tool, it's some annoying domain-specific specialized tool that one guy just luuuuuvs for his one use-case he thinks is obviously the most rational thing for anyone.
If you don't want to include something like a uid, say for security reasons not to disclose the internal workings of something on your end, then arrange not to include it when creating the archive, the same way you wouldn't necessarily include the full path to the same files. Or outside of a security concern like that, include all the data and let the recipient simply ignore any data that it doesn't support.
Good argument, I've mostly come around to your view. The little "but" that I still see is that the current file formats don't let you omit fields you don't want to pass on, and most decoders don't let you omit fields you don't want to interpret/use while unpacking.
Even if a given decoder could, though, most users wouldn't be able to use that, and so they'd get files from 1970 or 1980 if I don't want to pass that on and set it to zeroes. So it's better if the field can be omitted (as in, if the header weren't fixed length but extensible like an IP packet). So I'd still like a "better" archiving format than the ones we have today (though I'm not familiar with the internals of every one of them, like 7z or the mentioned squashfs, so tell me if this already exists), but I agree such a format should just support everything ~every filesystem supports
Oh sure, I was talking in generalities and an imaginary archiver, what should an archiver have, not any particular existing actual one.
OS and filesystem features differ all over the place, and there will be totally new filesystems and totally new metadata tomorrow. There is practically no common denominator, not even basic ASCII for the filename, let alone any other metadata.
So there should just be metadata fields where about the only thing actually part of the spec is the structure of a metadata field, not any particular keys or values or number or order of fields. The writer might or might not include a field for, say, creation time, and the reader might or might not care about that. If the reader doesn't recognize some strange new xattr field that only got invented yesterday, no problem, because it does know what a field is, and how to consume and discard fields it doesn't care about.
There would be a few fields that most readers and writers would all just recognize by convention, the usual basics like filename. Even the filename might not be technically a requirement but maybe an rfc documents a short list of standard fields just to give everyone a reference point.
But for instance it might be legal to have nothing but some uuids or something.
That's getting a bit weird but my main point was just that it's wrong to say an archiver shouldn't include timestamps or uids just because one use of archive files is to transfer files from a unix system to a windows system, or from a system with a "bob" user to a system with no "bob" user.
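A minimal sketch of the self-describing field structure described a couple of paragraphs up (the wire layout and field names here are invented for illustration, not any existing format): each field is just a name, a length, and a value, so a reader can skip anything it doesn't recognize.

    import struct

    def write_fields(out, fields):
        """fields: dict of name -> bytes; any names at all are allowed."""
        for name, value in fields.items():
            name_b = name.encode("utf-8")
            out.write(struct.pack("<HI", len(name_b), len(value)))
            out.write(name_b)
            out.write(value)

    def read_fields(inp, wanted):
        """Return only the fields this reader cares about, skipping the rest."""
        found = {}
        while True:
            header = inp.read(6)
            if len(header) < 6:
                return found
            name_len, value_len = struct.unpack("<HI", header)
            name = inp.read(name_len).decode("utf-8")
            value = inp.read(value_len)
            if name in wanted:            # e.g. "name", "mtime", "xattr.user.foo"
                found[name] = value
            # unknown fields are consumed and discarded, never an error

Whether "name" or anything else is mandatory would then be a matter of convention, exactly as described above.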
The arguments for tar are --preserve-permissions and --touch (don't extract file modified time).
For unzip, -D skips restoration of timestamps.
For unrar, -ai ignores file attributes, -ts restores the modification time.
There are similar arguments for omitting these when creating the archive, they set the field to a default or specified value, or omit it entirely, depending on the format.
Those are user controls, to allow the user on one end to decide what to put into the container, and there are others to allow the user at the other side to decide what to take out of the container, not limits of the container.
The comment I'm replying to suggested that since one use case results in metadata that is meaningless or mis-matched between sender and receiver, the format itself should not even have the ability to record that metadata.
Is this question even coherent when nothing changes if you substitute any other term like "full path" or "as much path as exists" or "any path"?
It can be if you make assumptions about the basic structure of both systems. Some people rely on this behavior. It can be a good idea or a bad idea, depending on what you're doing.
I agree very much with this. Something that annoys me is how much information tar files leak. Like, you don't need to know the username or groupname of the person that originally owned the files. You don't need to copy around any mode bit other than "executable". You definitely don't need "last modified" timestamps, which exist only to make builds that produce archives non-hermetic.
Frankly, I don't even want any of these things on my mounted filesystem either.
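For what it's worth, you can get most of the way there today with a filter when building the archive; a sketch using Python's tarfile (note the gzip wrapper still records its own timestamp unless you control that separately):

    import tarfile

    def scrub(info: tarfile.TarInfo) -> tarfile.TarInfo:
        # Drop owner names/ids and timestamps; keep only an executable bit.
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        info.mtime = 0
        info.mode = 0o755 if info.mode & 0o111 else 0o644
        return info

    with tarfile.open("out.tar.gz", "w:gz") as tar:
        tar.add("project/", filter=scrub)

GNU tar has flags to the same effect, but the point stands that the format itself still insists on carrying those fields.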
> The problem is that people would have to download an unpacker.
Your archive format just needs to be an executable that runs on every platform. https://github.com/jart/cosmopolitan is something that could help with that. ("Who would execute an archive? It could do anything," I hear you scream. Well, tell that to anyone who has run "curl | bash".)
I know it may not seem this way, but a lot of people don't ever run "curl | bash", or if they do, they do so in a throwaway VM (or a container if the source is mostly trusted)
It's really a bad idea most of the time to have an archive that doubles as an executable. It's not possible to cover every possible platform, and in the distant future those self-extracting archives may be impossible to extract without the required host system.
In most common scenarios, curl | bash is no different from apt-add-repository && apt install. Running a completely non-curated executable is very different.
> Wake me up if a simple standard comes to pass that neither has user/group ID, mode fields, nor bit-packed two-second-precision timestamps or similar silliness. Perhaps an executable bit for systems that insist on such things for being able to use the download in the intended way
Other than having timestamps isn't this a ZIP file? No user id, no x bit, widely available implementations... Not very simple though I guess.
Me too, but in PHP. I couldn't find a streaming zip encoder that you can just require() and use without further hassle, so I wrote one (it's on GitHub somewhere).
The problem is that zip is finicky and extremely poorly documented. I had to look at what other implementations do to figure out some of the fields. About at least one field, the spec (from the early 90s or late 80s I think) says it is up to you to figure out what you want to put there! After all that, I additionally wrote my own docs in case someone coming after me needs to understand the format as well, but some things are just assumptions and "everyone does it this way"s, leading to me having only moderate confidence that I've followed the spec correctly. I haven't found incompatibilities yet, but I'd also not be surprised if an old decoder doesn't eat it or if a modern one made a different choice somewhere.
It's also not as if I haven't come across third-party zip files that the Debian command line tool wouldn't open but the standard Debian/Cinnamon GUI utility was perfectly happy about. If it were so well-documented and standard, that shouldn't be a thing. (Similarly, customers on macOS can't open our encrypted 7z pentest report files. The Finder asks for the password and then flat-out tells them "incorrect password", whereas in reality it seems to be unable to handle filename encryption. Idk if that is per the spec, but incompatibilities abound.)
If you're not sure what the spec is trying to say, then either the PKZip binaries or the Info-ZIP zip/unzip source code is your usual source of truth.
When one unzip works but another unzip app doesn't, then you can usually point the finger at the last zip app that modified the zip file. There's some inconsistency in the zip file.
Running "unzip -t -v" on the zip file in question may yield more info about the problem.
The binaries you refer to as source of truth are a paid product (not sure if the trial version, which requires filling out a form that's currently not loading, includes all options, or how honest it is to use that to make an alternative to their software, or if the terms allow that) and don't seem to run on my operating system. I guess I could buy me a Windows license and read the pkzip EULA to see if you're allowed to use it for making a clone, but I figured the two decoders (that don't always agree with each other) I had on hand would do. If they agree about a field, it's good enough (and decoders can expect that unspecced fields are garbage)
Isn't pkzip the original? I'm not sure I've heard of info-zip but unzip is a command I use regularly on Debian. I highly doubt that's the original commercial implementation though
The only special thing about the Zip file format that springs to mind as causing ambiguity is the handling of the OS-specific extra field for a Zip archive entry.
You don't have to include an OS-specific extra field unless you want the information in that specific extra field to be available by the party trying to extract the contents of the zipfile.
- As far as I know, squashfs is a file system and not an archive format; the "FS" in the name shows the focus.
- It is read-only; Pack is not. Update and delete are just not public yet, as I wanted people to get a taste first.
- It is clearly focused on archiving, whereas Pack wants to be a container option for people who want to pack some files/data and store or send them with no privacy dangers.
- Pack is designed to be user-friendly for most people; the CLI is very simple to work with, and future OS integration will make working with it a breeze. That is far different from a good file system focused on Linux.
- I did not compare to squashfs, but I will be happy to see any results from interested people.
- Being read-only is mostly a benefit for an archive. Back in the days when drives were small, I occasionally wanted to update a .rar, but in the last ~5 years I can't remember a case for it.
- it's fine, but don't think that others' use cases are invalid because of your vision
As a separate note, had I encountered the pack.ac link anywhere on the internet other than here with a description attached, I'd have left it immediately. For me it just lacks any info on what it is and why I should try it.
Interesting, I've recently spent an unhealthy amount of time researching archival formats to build the same setup of using SQLite with ZStd.
My use case is extremely redundant data (specific website dumps + logs) that I want decently quick random access into,
and I was unhappy with either the access speed, quality/usability or even existence of libraries for several formats.
Glancing over the code this seems to use the following setup:
- Aggregate files
- Chunk into blocks
- Compress blocks of fixed size
- Store file to chunk and chunk to block associations
What I did not see is a deduplication step for the chunks, or an attempt to group files (and by extension, blocks) by similarity in an attempt to improve compression.
But I might have just missed that due to lack of familiarity with Pascal.
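For reference, a bare-bones version of that setup might look like the following (table and column names are invented here, the chunk size is arbitrary, and it assumes the python-zstandard package; it also skips the aggregation of many small files into shared blocks):

    import sqlite3, zstandard

    CHUNK = 8 * 1024 * 1024  # fixed block size

    db = sqlite3.connect("archive.db")
    db.executescript("""
        CREATE TABLE IF NOT EXISTS block(id INTEGER PRIMARY KEY, data BLOB);
        CREATE TABLE IF NOT EXISTS file(id INTEGER PRIMARY KEY, path TEXT);
        -- which blocks a file's bytes live in, and where inside each block
        CREATE TABLE IF NOT EXISTS file_block(
            file_id INTEGER, block_id INTEGER, block_offset INTEGER, length INTEGER);
    """)
    cctx = zstandard.ZstdCompressor(level=3)

    def add_file(path):
        file_id = db.execute("INSERT INTO file(path) VALUES (?)", (path,)).lastrowid
        with open(path, "rb") as f:
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                block_id = db.execute("INSERT INTO block(data) VALUES (?)",
                                      (cctx.compress(chunk),)).lastrowid
                db.execute("INSERT INTO file_block VALUES (?, ?, ?, ?)",
                           (file_id, block_id, 0, len(chunk)))
        db.commit()

Random access then means looking up a file's rows in file_block and decompressing only those blocks.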
For anyone interested in this strategy, take a look at ZPAQ [1] by Matt Mahoney, you might know him from the Hutter Prize competition [2] / Large Text Compression Benchmark. It takes 14th place with tuned parameters.
There's also a maintained fork called zpaqfranz, but I ran into some issues like incorrect disk size estimates with it.
For me the code was also sometimes hard to read due to being a mix of English and Italian. So your mileage may vary.
Thank you for the detailed check. I should thank the syrup too :)
I'm happy to see a fellow enthusiast. Your deduction is on point.
And also, Pack is smart; it skips non-compressible files like MP3 [1], so you do not need to choose a "Store" option to get a faster result, and it speeds up decompression too. Pack is the first to achieve this, being faster than Store options. Yes, it was a surprise to me too.
ZPAQ is great, and I have studied the Hutter Prize competition. Pack is on another chart, which is why I proposed CompressedSpeed [2]. The speed of getting to the compressed result needs to be accounted for. You can store anything on an atom if you try hard enough, but hard work takes time. A deduplication step may get added, but in Hard Press [3].
I am curious to see the results of Pack on your data. You can find me here or o at pack.ac.
[1] It is based on content rather than extension; any data that is determined not to be worthy of compression will be stored as is. And as a file can get chunked, some parts can get compressed and some cannot. Imagine that the subtitle part of an MKV file can get compressed, while the video part gets skipped. These features will get more updates over time, as long as they don't cost time. Pack's focus is being seamless, not being the most compressed; there are already great works in the field, such as the noted ZPAQ.
[3] You can choose --press=hard to ask for better compression. Even with Hard Press, Pack does not try to eat your hardware just to get a little more; it goes the optimized way I described.
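The content-based skip described in footnote [1] boils down to something like the following (an illustration of the general idea, assuming the python-zstandard package; it is not Pack's actual code):

    import zstandard

    def pack_chunk(chunk: bytes, threshold: float = 0.98):
        """Compress a chunk only if it actually shrinks; otherwise store it raw."""
        compressed = zstandard.ZstdCompressor(level=3).compress(chunk)
        if len(compressed) < len(chunk) * threshold:
            return compressed, True   # worth keeping the compressed form
        return chunk, False           # e.g. MP3 frames or MKV video: store as is

Storing already-compressed chunks raw is also what makes later decompression faster than a blind "compress everything" pass.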
When I read the title, I thought it was a new operating system-level containerization image format for filesystem layers and runtime config. But it looks like "container format" is a more general term for a collection of multiple files or streams into one. https://en.wikipedia.org/wiki/Container_format TIL.
OS containers could use an update too, though. They're often so big and tend to use multiple tar.gz files.
With all due respect, I find it hard to believe the author stumbled upon a trivial method of improving tarballing performance by several orders of magnitude that nobody else had considered before.
If I understand correctly, they're suggesting Pack, which both archives and compresses, is 30x faster than creating a plain tar archive. That just sounds like you used multithreading and tar didn't.
Either way, it'd be nice to see [a version of] Pack support plain archival, rather than being forced to tack on Zstd.
The tar file format is REALLY bad. It's pretty much impossible to thread because it's just writing metadata then content and repeatedly concatenating.
i.e.:
/foo.txt 21
This is the foo file
/bar.txt 21
This is the bar file
That makes it super hard to deal with as you essentially need to navigate the entire tar file before you can list the directories in a tar file. To add a file you have to wait for the previous file to be added.
Using something like sqlite solves this particular problem because you can have a table with file names and a table with file contents that can both be inserted into in parallel (though that will mean the contents aren't guaranteed to be contiguous.) Since SQLite is just a btree it's easy (well, known) how to concurrently modify the contents of the tree.
Funnily enough, tar is like 3 different formats (PaX, tar, ustar). One of the annoying parts of the tar format is that even though you scan all the metadata upfront, you have to keep the directory metadata in RAM until the end and have to wait to apply it at the end.
Or just do what zip and every other format does and stick all the metadata at the beginning - enough to list all files, and extract any single one efficiently
zip interestingly sticks the metadata at the end. That lets you add files to a zip without touching what's already been zipped. Just new metadata at the end.
Modern tape archives like LTFS do the same thing as well.
That sounds like you need to have fetched the whole zip before you can unzip it - which is not what one wants when making "virtual tarfiles" which only exist in a pipe. (i.e. you're packing files in at one end of the pipe and unpacking them at the other)
That point is always raised on every criticism of tar (that it's good at tape).
Yes! It is! But it's awful at archive files, which is what it's used for nowadays and what's being discussed right now.
Over the past 50 years some people did try to improve tar. People did develop ways to append a file table at the end of an archive file. Maintaining compatibility with tapes, all tar utilities, and piping.
Similarly, driven people did extend (pk)zip to cover all the unix-y needs. In fact the current zip utility still supports permissions and symlinks to this day.
But despite those better methods, people keep pushing og tar. Because it's good at tape archival. Sigh.
It was hard to believe for me, too. And I didn't stumble upon it; I looked for it closely, and that was a point in the note. People did not look properly for nearly three decades. Many things have changed, but we computer people are still using the same tools.
I am not saying old is not good; the current solutions are great, but what are we, if we don't look for the better?
Yes, it is that much faster, and a good part of it is because of the multi-thread design, but as a reminder, WinRAR and 7-Zip are multi-threaded too, and you can see the difference.
To satisfy your doubt, I suggest running Pack for yourself. I am looking for more data on its behaviour on different machines and data.
Can I ask why you need a version without ZSTD? If you are thinking that compression slows it down, I should say no. Pack is the first of its kind where "Store" would actually slow it down. Because its compression is smart, it will skip any non-compressible content.
On the same machine and the same Linux source code test:
My concern with Pack obliging me to compress is that compression becomes less pluggable; I'd much rather my archive format be agnostic of compression, as with tar, so that I can trivially move to a better compression format when one inevitably comes to be.
You got a point.
Although with that option comes a great cost: we will lose portability, speed and even reliability.
Portability: The receiver (or future you) needs to know what you used, and even what version.
Speed: If you want to do the archive part first (tar) and then compress (gz), you will get much lower speed (as shown in the note).
Reliability: Most people use tar with gz anyway, but if you use it with a not-so-popular algorithm and tools, you risk having a file that may or may not work in the future.
Pack's plan is to use the best of its time (Zstandard), and if an update is needed in years to come, it will add support for new algorithms. All Pack clients must only write the latest version (and read all previous versions), and that makes sure almost all use the best of their time.
Also, 4.7 seconds to read 1345 MB in 81k files is suspiciously slow. On my six-year-old low/mid-range Intel 660p with Linux 6.8, tar -c /usr/lib >/dev/null with 2.4 GiB in 49k files takes about 1.25s cold and 0.32s warm. Of course, the sales pitch has no explanation of which hardware, software, parameters, or test procedures were used. I reckon tar was tested with cold cache and pack with warm cache, and both are basically benchmarking I/O speed.
> Development machine with a two-year-old CPU and NVMe disk, using Windows with the NTFS file system. The differences are even greater on Linux using ext4. Value holds on an old HDD and one-core CPU.
> All corresponding official programs were used in an out-of-the-box configuration at the time of writing in a warm state.
My apologies, the text color is barely legible on my machine. Those details are still minimal though; what versions of software? How much RAM is installed? Why is 7-Zip set to maximum compression but zstd is not? Why is tar.zst not included for a fair comparison of the Pack-specific (SQLite) improvements on top of from the standard solution?
The machine has 32 GB of RAM, but that is far more than these tools need.
7-Zip was used like the others; I just gave it a folder to compress. No configuration.
As requested, here are some numbers on tar.zst of the Linux source code (the test subject in the note): tar.zst: 196 MB, 5420 ms (using the out-of-the-box config and -T0 to let it use all the cores; without it, it would be 7570 ms). Pack: 194 MB, 1300 ms. Slightly smaller size, and more than 4X faster. (Again, it is on my machine; you need to try it for yourself.) Honestly, Zstandard is great. Tar is slowing it down (because of its old design and being single-threaded). And it is done in two steps: first creating the tar and then compressing. Pack does all the steps (read, check, compress, and write) together, and this weaving helped achieve this speed and random access.
This sounds like a Windows problem, plus compression settings. Your wlog is 24 instead of 21, meaning decompression will use more memory. After adjusting those for a fair comparison, pack still wins slightly but not massively:
Benchmark 1: tar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
Time (mean ± σ): 2.573 s ± 0.091 s [User: 8.611 s, System: 1.981 s]
Range (min … max): 2.486 s … 2.783 s 10 runs
Benchmark 2: bsdtar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
Time (mean ± σ): 3.400 s ± 0.250 s [User: 8.436 s, System: 2.243 s]
Range (min … max): 3.171 s … 4.050 s 10 runs
Benchmark 3: busybox tar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
Time (mean ± σ): 2.535 s ± 0.125 s [User: 8.611 s, System: 1.548 s]
Range (min … max): 2.371 s … 2.814 s 10 runs
Benchmark 4: ./pack -i ./linux-6.8.2 -w
Time (mean ± σ): 1.998 s ± 0.105 s [User: 5.972 s, System: 0.834 s]
Range (min … max): 1.931 s … 2.250 s 10 runs
Summary
./pack -i ./linux-6.8.2 -w ran
1.27 ± 0.09 times faster than busybox tar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
1.29 ± 0.08 times faster than tar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
1.70 ± 0.15 times faster than bsdtar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
Another machine has similar results. I'm inclined to say that the difference is probably mainly related to tar saving attributes like creation and modification time while pack doesn't.
> it is done in two steps: first creating tar and then compression
Pipes (originally Unix, subsequently copied by MS-DOS) operate in parallel, not sequentially. This allows them to process arbitrarily large files on small memory without slow buffering.
Thank you for the new numbers.
Sure, it can be different on different machines, especially full systems. For me on Linux and ext4, Pack finishes the Linux code base at just 0.96 s.
Anyway, I do not expect an order of magnitude difference between tar.zst and Pack; after all, Pack is using Zstandard.
What makes Pack fundamentally different from tar.zst is Random Access and other important factors like user experience. I shared some numbers on it here: https://news.ycombinator.com/item?id=39803968 and you are encouraged to try them for yourself.
Also, by adding Encryption and Locking to Pack, Random Access will be even more beneficial.
HDD for testing is a pretty big caveat for modern tooling benchmarks. Maybe everything holds the same if done on a SSD, but that feels like a pretty big assumption given the wildly different performance characteristics between the two.
Ok? How are you comparing these systems to the benchmark so they might be considered relevant? Compressing "lots of small files" describes an infinite variety of workloads. To achieve anything close to the benchmark you'd need to compress only small files, in a single directory, with a small average size. And even the contents of those files would have large implications for expected performance....
If that were true, surely it would make sense to demonstrate this directly rather than with a contrived benchmark? The issue is not the preponderance of small files but rather the distribution of data shapes.
Zipping up a project directory even without git can be a big file collection. Python virtual environment or node_modules, can quickly get into thousands of small files.
Let's say the tyranny of the C children has diverted our attention. I'll make a wild ass statement: If Modula-2 (Wirth family with Pascal) had caught on, you would have had whatever you wanted from Rust 20 years ago. But the C noise dominated the narrative.
Use the language that makes you money and encourages you to write code that addresses domains requiring more than just bolting together framework pieces. AI can do a measurable chunk of that work.
20 years ago was 2004, and Borland Delphi 7 was out. It was Pascal-based, but it wasn't that different from C programs.
It had a nice unit system with separate interface & implementation sections. This was very nice. The unit files were not compatible with anything else, including previous versions of Delphi - this was not nice, especially since a lot of libraries were distributed in compiled form.
The compilation speed was amazingly fast. This is one thing that was unequivocally better than C at this time.
There were range types (type TExample = 1..1000), but they were more of a gimmick - it turns out there are very few use cases for build-time limits. There were some uses back in the DOS days when you'd have a hardcoded resolution of 640x480, but in the Windows era most variables were just Integer.
Arrays had optional range checks on access, that was also nice. We'd turn them off if we felt programs were too slow.
Otherwise, it was basically same as C with a bit of classes - custom memory allocation/deallocation, dangling pointers, NULLs, threads you start and stop, mutexes and critical sections. When I finally switched from Pascal to C, I didn't see that much difference (except compilation got much slower)
Maybe you'd say that Borland did something wrong, and Wirth's Modula-2 would be much better than Borland's Pascal, but I doubt this.
Yep, optional range checks and a variety of other compiler defines to accommodate programmers coming from a C background who preferred to disregard compile time checks in the name of speed of execution. So sure, you can still to this day make pascal act like C. You even get comment delimiters. Kind of adds credence to the influence of C that I'm suggesting.
Wirth languages are about constraints. For instance, when I started writing code in TSM2 and Stonybrook, my general impression was that they both caught 10-30% more bugs at compile time than BP did. If that's too much of a hassle for C programmers, well, ok.
Also to add, all the wordiness of Wirth languages, the block delimiters, yes I get it. But all this stuff is just another constraint for sake of correctness. M2, being case sensitive, is even worse about this than Pascal. But the point is to make you look at your code more than once, to proofread it and think about what's going on, because the syntax screams at you a little bit. Of course, with compiler defines, you can turn pascal into C and assume the responsibility for yourself. That's what runtime debuggers are for anyway, yes?
Ok, whatever, but we're missing the point that Wirth was trying to get across, which is to turn the language itself into implicit TDD, starting with the first line of code written. C/++ may give you speed, but for the average programmer, all that speed is taken back in the end due to maintenance costs. IMO, M2 was even better than Pascal at shifting maintenance costs left in the value delivery stream, compared to the C tack.
Sure, mission critical code can be written even in C/++. Most of SpaceX's code is a C++ codebase. So how did they pull that off? IMO, what they did was write C in the spirit of what Wirth was trying to accomplish. For the sake of maintenance costs, speed is now less of a metric thanks to hardware advances, and correctness is far more of an issue. Which makes sense, because all business is mission critical now and all business runs on more and more software. Would you turn off bounds checks in the compiler now? How about for the programmer who you'll never meet who is writing autonomous driver code for the car you drive?
Way too much money was wasted on the near-sighted value of C. Time to move on, according to Rust developers, who undoubtedly have an impressive background as C/++ programmers. So yeah we all have to follow this C dominated narrative even today, and my charge is, this narrative has retarded the art of programming. So I stick to my original proposition: Whatever you think is great about Rust, like ownership and borrowing, would have been available in production M2 code 20 years ago if we had just given Wirth languages a chance to advance the art in the commercial world. But that narrative would have been too wordy and constraining.
> It is written in the Pascal language, a well-stabilized standard language with compatibility promises for decades. Using the FreePascal compiler and the Lazarus IDE for free and easy development.
The code is written to read like pseudocode. Where needed, comments are added to help.
The whole thing makes sense to me and I can't see any major points of criticism in the design rationale. Some thoughts:
* There is already a "native" Sqlite3 container solution called Sqlar [0].
* Sqlite3 itself is certainly suitable as a base and I wouldn't worry about its future at all.
* Pascal is also an interesting choice, it is not the hippest language nor a new kid on the block, but offers its own advantages as being "boring" and "normal". I am thinking especially of the Lindy effect [1].
All in all a nice surprise and I am curious to see the future of Pack. After all, it can only succeed if it gets a stable, critical mass of supporters, both from the user and maintainer spectrum.
Here is the latest sqlar result on Linux source code on the same test machine in warm state:
sqlar: 268 MB, 30.5 s
Pack: 194 MB, 1.3 s
Very good result compared to tar.gz. And much better than ZIP, considering sqlar gives random access like ZIP and unlike tar.gz.
I considered sqlar a proof of concept, and it inspired me to create Pack as a full solution. I always agreed with the great drh (creator of SQLite) and his points about SQLite as a file format, and Pack is an attempt to demonstrate that.
I made Pack to give people a better life (at least behind their desks), and like you, I hope people get to use it and find it useful.
And that wasn't even the original Lindy's but an unrelated/"unauthorized" reboot after the trademark was declared abandoned. The original, for which the law was named (it predates the naming), closed in 1969: from 1964 https://web.archive.org/web/20210619015733/https://www.gwern...
Sqlite3 is universal, but now your spec is entirely reliant on Sqlite3's format and all the quirks required to support that.
If you actually care about the future, spec out your own database format and use that instead. It could even be mostly a copy of Sqlite3, but at least it would be part of the spec itself.
You're not "wrong" but Sqlite isn't your run-of-the-mill project. "The SQLite file format is stable, cross-platform, and backwards compatible and the developers pledge to keep it that way through the year 2050." [1]
No, you will only need Pack, everything is built into it. Pack is built for Windows and Linux, and more will come. You will be able to run it on almost all CPUs.
I suppose you do need bindings for every language. But sqlite is in C and is extremely popular. If you can't get bindings as one of the first third party libraries your language supports, it's probably a shitty language anyway.
How different is this to any other run of the mill project with few active developers on a single implementation, with backwards compatibility based entirely on promises?
I had to make an archiver once (commercial), so I did think about it a bit. I am not sure Pack would solve anything for me. It obviously solves the author's use cases, but tar has some tricks which I don't want to lose:
* Able to write to a pipe/socket - lets you not waste space or time by writing to disc something that you intend to transmit over a pipe or TCP socket anyhow. It's almost a "virtual" archive and it should be possible to make one that is far too big to fit into memory - because as you send each bit of it you deallocate that memory. At the receiver each bit can be written to disc or extracted and then that memory is reused for the next bit - so the archive never fully "exists" but it does the job of serialising some data. An example could be piping the output of tar to an ssh command which untars it on a remote machine.
* Metadata has to be with the file data - not stuck at the end of the file - because you need to be able to start work without waiting till the file is fully received through your pipe. You don't want to be forced to have space to store the archive and the extracted files (may be a huge archive).
* Choice of compression - lzop is super fast such that using it can sometimes give slightly better performance than writing uncompressed data. OTOH that might not be your concern and XZ might suit you by compressing much more thoroughly. Either way it's very nice to have compression that works across multiple files - which is especially helpful when compressing a lot of small files such as source code.
* Ability to encapsulate - should be able to put the packed data into any imaginable container like an encryption or data transmission protocol without insisting that the entire archive has to be fully read before members can start to be extracted/processed. This is essentially the same as the pipe/socket requirement.
I'm not saying that these things matter to everyone - I have just found them incredibly useful in a few critical situations. The world of ZIP users on Windows seems to be sort of blind to them - thinking firmly in that box.
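For context, the pipe/socket behaviour described above maps onto tar's streaming write modes; as a stand-in for `tar ... | ssh ... tar -x`, the Python equivalent looks roughly like this:

    import sys, tarfile

    # "w|gz" writes a non-seekable stream: each member is emitted as it is
    # added, so the archive can flow straight into a pipe or socket and
    # never needs to exist on disk in full.
    with tarfile.open(mode="w|gz", fileobj=sys.stdout.buffer) as tar:
        tar.add("some/directory")

The receiving side can likewise read with mode "r|gz" and start extracting members before the stream has finished arriving.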
- Piping is really easy and it will get added to Pack. It is a matter of time until these features get added, as they will be added based on popularity, and piping is not that popular for most people. But I get you, and I will add it for you.
- Metadata is not stored in Pack. I don’t want the metadata of my machine attached to a file. It’s a never-ending nightmare to match source and destination OS metadata. There will always be something missing, and Pack tends to get everything perfect or nothing. Storing metadata adds extra weight that most users don’t care about and complicates the ability to store other types of data alongside files. It may get added as an option in the future if many people need it.
- Pack uses Zstandard under the hood. Great compression speed and ratio. In my opinion, it is the leading algorithm in the field and makes it a proper choice to use instead of DEFLATE, used in ZIP or GZIP.
- At this point you are listing tar features. tar is not random access; Pack is. As an example, if I want to extract just a .c file from the whole codebase of Linux, it can be done (on my machine) in 30 ms, compared to near 500 ms for WinRAR or 2500 ms for tar.gz. And it only gets worse once you factor in encryption. For now, Pack encryption is not public, but when it is, you can access a file in a locked Pack file in a matter of milliseconds rather than seconds.
I think random access is a useful feature - e.g. if you want to compress code modules like java does so that you can still load individual modules quickly.
This is not something I've wanted yet personally but that's just random chance. When I do need it I will know which tool to use! Thanks.
It could be handy to be able to mount a pack like a filesystem.
Indeed. Benchmarking Pack as a file system was fun. It lets you iterate over all the files nearly 10X faster than what I get from NTFS (warm, with all caching on, for both solutions).
Someday, it can be used as a virtual drive. I leave it to future people.
> The world of ZIP users on Windows seems to be sort of blind to them - thinking firmly in that box.
Seems to me like this is you being blind to certain use cases, and so stuck in your streaming oriented box that you can not conceive of other use cases where a streaming format is actively detrimental.
sqlite is great as a "file format" for a particular application, but I think it's a bad interchange format.
As mediocre as zip and tar are, you can cobble together read/write support without even needing a library. With sqlite, your only real option is to bundle sqlite itself, and while it's relatively lightweight, it's far from trivial.
zip has support for zstd, and if you wanted to make it go faster, you could embed some index metadata.
I can't see any specs for their format, not even a description of the sqlite tables.
According to the `.indexes` directive, there are... no indexes. What's the point of sqlite if you're not going to index things?
All the data is stored in one big blob (the "Value" column of the "Content" table), with the metadata storing offsets into it. It looks like there's still the possibility of things being split over multiple blobs (to circumvent the 2GB blob size limit)
- Custom sqlite magic bytes makes the format incompatible with all existing sqlite tooling.
- No support for file metadata.
- There's no version field (afaict), making future format improvements difficult.
Edit: A previous version of this comment had a much longer list of complaints, but after taking a closer look, I retract them. I was looking at the MediaKit.pack file as an example, which, due to being relatively small, packed all its files into a single BLOB. I was under the mistaken impression that the same approach was taken for larger files, but after some further testing I see that they're split up into ~8MB chunks.
Though, if you have lots of small files (say, a couple of kilobytes each) then random access performance could suffer.
Hello David, and thank you for your comment, analysis, and the issues you opened. I will get to them all.
- SQLite tooling: You will not need it unless you are debugging something, then you can change the header or just use the `--activate-other-options --transform-to-sqlite3` parameter to transform a Pack file to SQLite3, and use the `--activate-other-options --transform-to-pack` to go back. This way, you get a true SQLite3 database that you can browse as you wish. For most people, mixing Pack with SQLite was just a call for problems for the SQLite team (imagine people coming and asking to fix their Pack file from the team; that would not be fair) and a harder future for Pack to update.
- Metadata is not stored in Pack. I don’t want the metadata of my machine attached to a file. It’s a never-ending nightmare to match source and destination OS metadata. There will always be something missing, and Pack tends to get everything perfect or nothing. Storing metadata adds extra weight that most users don’t care about and complicates the ability to store other types of data alongside files. It may get added as an option in the future if many people need it.
- All future versions of Pack must handle previous versions and must only write the latest version. So any files created right now (Draft 0) will be read correctly forever.
- Each Draft proposal will get its own version, and if it becomes final, it will be marked Final.
- The version is stored as two bytes after the 'Pack' header, little-endian: (1 (Draft) shl 13) + 0 (version 0) = 8192. Final would be 0, so the first Final version will be (0 shl 13) + 1 = 1, and the second will be 2. It is by design, so any Draft version gets a higher number, preventing future mix-ups (see the sketch after this list).
- 8 MB chunks are the default; Pack may choose smaller or bigger (16 MB for many small files or 32 MB for Hard Press).
- Random access is done properly: the unpacking steps take into account what you want and decompress a Content just once for many neighbouring files. But even for reading just one file, here is an example: if I want to extract just a .c file from the whole codebase of Linux, it can be done (on my machine) in 30 ms, compared to near 500 ms for WinRAR or 2500 ms for tar.gz. And it only gets worse once you factor in encryption. For now, Pack encryption is not public, but when it is, you can access a file in a locked Pack file in a matter of milliseconds rather than seconds.
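Taking the version-byte description above at face value, encoding and decoding would look like this (a sketch of the stated scheme, not official Pack code):

    import struct

    DRAFT_BIT = 1 << 13

    def encode_version(is_draft: bool, number: int) -> bytes:
        # Draft 0 -> 8192; first Final -> 1, second Final -> 2, ...
        value = (DRAFT_BIT if is_draft else 0) + number
        return struct.pack("<H", value)        # two bytes, little-endian

    def decode_version(two_bytes: bytes):
        value, = struct.unpack("<H", two_bytes)
        return bool(value & DRAFT_BIT), value & (DRAFT_BIT - 1)

    assert encode_version(True, 0) == struct.pack("<H", 8192)
    assert decode_version(encode_version(False, 1)) == (False, 1)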
Thank you for the detailed response(s). I must admit I'm warming up to the idea of Pack, it does perform well in my testing (I didn't test at first because I'm on aarch64 linux, for which there are no compatible builds).
Not including metadata is an opinionated stance, but I can certainly get behind it, especially as a default. 99% of the time I do not care about metadata when producing a file archive.
Compatibility with existing SQLite tooling is not just useful for debugging, it is extremely useful for writing alternative implementations. If you want Pack to be successful as a format and not just as a piece of software, I think you should do everything you can to make this easier.
In my experimentation, I wrote a simple python script to extract files from a Pack archive. Conveniently, sqlite is part of the python standard library, but in order to make it work with that version (as opposed to compiling my own) I had to edit the file header first, which is inconvenient and not always possible to do (e.g. if write permissions are not available).
Despite that inconvenience, it took less code than a comparably basic ZIP extractor, which is cool!
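In case it helps anyone attempting the same, the header edit boils down to restoring the standard 16-byte SQLite magic on a copy of the file before opening it (a sketch based on the description in this thread; file names are illustrative):

    import shutil, sqlite3

    SQLITE_MAGIC = b"SQLite format 3\x00"   # standard 16-byte SQLite header string

    shutil.copy("archive.pack", "archive.sqlite3")
    with open("archive.sqlite3", "r+b") as f:
        f.write(SQLITE_MAGIC)               # overwrite the custom Pack header bytes

    db = sqlite3.connect("archive.sqlite3")
    print(db.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())

Working on a copy also sidesteps the write-permission problem mentioned above.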
I worry that requiring a custom VFS will make it harder for people to produce compatible software implementations.
I think your concerns about people contacting SQLite for support are overblown. I assume you've heard the `etilqs_` story[0], but in this case, you need to use a hex-editor or a utility like `file` to even see the header bytes. I think anyone capable of discovering that it's an SQLite DB will be smart enough not to contact SQLite for support with it.
The `Application ID`[1] field in the SQLite header is designed with this exact purpose in mind
> The application_id PRAGMA is used to query or set the 32-bit signed big-endian "Application ID" integer located at offset 68 into the database header. Applications that use SQLite as their application file-format should set the Application ID integer to a unique integer so that utilities such as file(1) can determine the specific file type rather than just reporting "SQLite3 Database".
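For reference, reading or setting that field from Python's sqlite3 is a one-liner each way (the value below, ASCII 'PACK', is just an illustrative choice):

    import sqlite3

    db = sqlite3.connect("example.db")
    db.execute("PRAGMA application_id = 1346454347")       # 0x5041434B, ASCII 'PACK'
    print(db.execute("PRAGMA application_id").fetchone())  # -> (1346454347,)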
I am happy to hear that, and I really appreciate your interest.
Did you compile it for yourself? I will be happy to hear about any problems or the steps you used, at o at pack.ac or on GitHub, as it is hard to follow build issues here.
As a reminder, Pack Draft 0 has Compatibility with SQLite tools; the only needed step is to change the first 16 bytes. Again, you can use `--activate-other-options --transform-to-sqlite3` with the CLI tool, and you will get a perfectly working SQLite file.
VFS is not needed; they can change the header after writing; VFS was just cleaner to me.
My first approach used application_id; after a while, it did not feel right to me, so I changed it for good. This allows easier future development, fewer problems with file type detection, a decreased chance of mistaken changes (you already saw many negative comments on using SQLite as a base), and the support reason: just yesterday I was reading a forum post about people asking a project for support because it was using SQLite. application_id seems like a great choice if you are doing a DB-related task or making a custom DB for transfer on the wire, to communicate between internal and semi-public tools. Using it for a format that could potentially reach an innumerable count of files seemed unwise.
No indexes, as they take space, and I wrote the queries considering SQLite automatic indexes. They will be created on demand, at unpacking time. All the unpacking processes are made to read and decompress content just once, so there are no worries about slowdowns.
I suggest trying Pack for yourself and seeing the speed.
Or deeper, use `--activate-other-options --transform-to-sqlite3` to transform a Pack file to SQLite3, create your own indexes, and use `--activate-other-options --transform-to-pack` to convert it to Pack and then try unpacking. You will not see any worthy difference.
Yes, Contents are like packages of raw data holding a chunk of an item, a whole item, or many items (files or data). They may be compressed if needed (with Zstandard). The ItemContent table helps to find the needed Item parts.
The Content structure circumvents any BLOB limit, but it is also made to give better compression while keeping random access.
I guess you are overestimating the "cobble together read/write support without even needing a library."
Let's imagine: You want to read a ZIP file. Will you write your own reader? I seriously doubt it, as the work, stabilising, and security (random memory access as an example) would be issues.
But let's say we are courageous. OK, we carefully read through a rather not-so-simple binary format. Now, will you write your own DEFLATE and Huffman coding? Again, an even bigger doubt.
I would argue that if someone cares enough to reimplement ZIP, it would at worst be twice as hard to write a Pack reader from scratch with no ZSTD or SQLite. And for those serious people, reading a format that lets them store better and faster would be a prize that is hard to say no to.
But I get your point, and if you are in a desert and need something to put together fast before running out of water, tar may be a good choice.
I have written my own zip, deflate, and huffman coding - although the latter two were "just for fun". But I would definitely consider writing ad-hoc zip logic in real software, if I couldn't pull in a library for whatever reason. This isn't just a hypothetical, it happens a lot - there are many independent ZIP implementations in the wild, for better or for worse.
You're right to call out security though, because the multiple implementations cause security issues where they disagree, my favorite example being https://bugzilla.mozilla.org/show_bug.cgi?id=1534483 . Although arguably this is a symptom of ZIP being a poorly thought out file format (too many ambiguous edge-cases), rather than a symptom of it being easy to implement.
You are one of the bravest. And you know that using SQLite as the base storage rules out many of the security problems we could face.
Anyone needing to reimplement Pack can do it very easily, if not more easily than implementing ZIP, IF they use SQLite and Zstandard. Maybe a day of work or less. If they want to rewrite (the reading part of) those too, it will be a couple of days of work.
Complaining that all existing tools are old, but I'm looking at the documentation and what immediately catches my eye is that it doesn't use any modern convention I've gotten used to?
Overwrite with "-w". I've never seen a tool not use "-f"
Not reserving "-h" for help text is also an interesting choice. Makes me think of the mantra "be conservative in what you send out but liberal in what you accept". Per that philosophy, both "--help" and "-h" should be accepted because neither gained a decisive majority in usage and so people might try either. It's not like you'd know what to use because it hasn't told you yet
Forcing use of a long option "--press=N" (for the zstd level setting) is also new/unique terminology for what is usually "-N" (like "-1" to "-9")
(Basic) drop-in compatibility with every other tool from gzip to zstd would also have been nice, but archiving and compression are different things and everything from zip to 7z to tar works in unique ways so this makes enough sense I guess. Still, could have been useful
It's still better than tar or ps, so if it catches on that's still a step forwards in terms of command line standards
Hmmm. I went to the doco hoping for something about the file format :( No doco for me. I guess that would be too old fashioned.
It seems to be "read the code" or nothing - which is fine until they update the code... It's great (probably should be mandatory) to have a reference program, but if they're promoting it as a container format, something along the lines of an RFC would be helpful.
The choice of parameters was done solely to be clear, not to match what people are used to. -f meaning force is not clear; -w meaning overwrite seemed like a better logical choice to me.
Nice point on -h. Yes, I did not want to go crazy. After all, almost all (CLI) people use pack as `pack ./test/`. Options are for advanced people like you. Most people will use the OS integration that will be published later on.
--press=hard is the only option there is. There may be more, but with Pack you do not need to choose a level (like 1..9 with ZIP). Just let Pack do its thing, and you will be happy. Hard Press is there for people who want to pack once and unpack many times (like publishing), and it is worth spending extra time on it. Even then, Pack goes the sane way and does not eat your computer just for a kilobyte or two.
> Just let Pack do its thing, and you will be happy
Well, no. Sometimes I want maximum compression and have a lot of CPU and wall-clock time to spend. And sometimes being fast is more important than compression ratio. Managing server utilization also matters. The level setting is there for a reason.
This is not a binary choice; an actual level of effort needs to be configurable. I've seen people fine-tune compression levels many times, in all kinds of automation scenarios.
Thank you for the notes. I am well aware of the levels, and Pack uses a custom configuration to match its inner design.
Maybe more levels will come, or maybe not. But to be clear, Pack supports any valid Zstandard content; the levels we are discussing are a Pack CLI choice made for a better user experience. Any other client can produce and store valid content at whatever level or configuration it chooses, and other clients can read it.
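A small sketch of that point using the python-zstandard bindings (not Pack itself): the level is purely a writer-side knob, and a reader decodes any resulting frame without knowing which level was used. The file name is a placeholder.

    import zstandard

    data = open("sample.bin", "rb").read()  # any payload; path is a placeholder

    # All of these are valid Zstandard frames, regardless of the level chosen.
    frames = {level: zstandard.ZstdCompressor(level=level).compress(data)
              for level in (1, 3, 19)}

    dctx = zstandard.ZstdDecompressor()
    for level, frame in frames.items():
        assert dctx.decompress(frame) == data   # the reader never needs the level
        print(level, "->", len(frame), "bytes")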
There are many different use cases, and each one has a different set of requirements. E.g.:
- For an end-user-facing CLI: support as many conventions as possible (-v/--version, -h/--help, other options similar to other compressors), sane defaults.
- For automatic tasks like making backups via cron: piping, correct exit codes, level-of-effort configuration, silent modes for reduced logging.
The second one is likely to fly first; it can be used in isolation at the company level if the file format is stable.
Summary : Convert code into runnable images
URL : https://github.com/buildpacks/pack
License : Apache-2.0 and BSD-2-Clause and BSD-3-Clause and ISC and MIT
Description : pack is a CLI implementation of the Platform Interface Specification
: for Cloud Native Buildpacks.
I went to do some testing in a sandbox system (as the compilation strategy is unclear and builds are missing for some of the artifacts).
I was initially able to construct an archive of the Linux tree (which then failed to decompress), but when I went to rebuild it, the tool repeatedly produced this output, even in an otherwise cleaned-up environment:
Runtime error 203 at $0000000100009B72
$0000000100009B72
$00000001000225B6
$0000000100023153
$000000010002319A
The first output did work; sadly, I deleted it. It was over 400 MB, approximately double the size of a zip or a tar.zst of the same files.
zstd natively compresses a tar of this set to 209 MB in 800 ms in multi-threaded mode, or 3.5 s in single-threaded mode.
I suspect that SQLite is being held incorrectly (access from multiple threads with multi-threading disabled) and that the VFS lock forwarding is broken on Windows.
Did you compile it yourself? Any problems or the steps you used, I will be happy to hear about at pack.ac or on GitHub, as it is hard to follow build issues here. I should prepare more documentation on how to build it.
I suspect that there is a problem with the custom build. Error and speed issues are not something you see in the official build.
That was the binary download from the website. You have a build.sh for the Linux binary artifacts, but no equivalent for the Windows artifacts, so I did not bother preparing a Windows build.
Windows 11 sandbox, running atop Windows 11. Binary downloaded from your webpage.
The data being packed was a copy of linux-master.zip fetched from GitHub and unpacked with the built-in Windows zip support, selecting "skip" for the files whose names collide only by case.
If I need to compress stuff, it’s either to move a folder around (to places which may not have a niche compression tool, so ZIP wins), or to archive something long-term, where it can take a while to compress. I don’t see the advantages of this, since the compression output size seems quite mediocre even if it’s supposedly fast (compared to what implementations of the other formats?)
Hell yeah, some FPC stuff showing its moves. The devs even put together an lpk to load up - bravo! Look for more of this in the future as companies look for alternatives to corporate commodity programming and databases tethered to major cloud resources. I have a major FPC effort going on right now that I hope to offer on four platforms: browser, Windows, Linux, and Mac.
I wrote a similar general-purpose packer way back in the early 90's in TopSpeed Modula-2, run from the command line. It needed to span multiple disks and self-launch. The algorithm was fast, but nowhere near the same compression ratios. I wore out Mark Nelson's classic, "The Data Compression Book", along the way.
Hello to all.
I am the author, and I just saw this post and am happy to see this exciting discussion.
Let me try to show my respect for it and answer as well as possible.
Thanks for the note, and sorry for the inconvenience. I did not expect this many users in the iOS world. The site is very new and needs custom work; it will be updated soon.
I think generally it's a mobile layout issue. On Firefox for Android, scrolling the page still triggers a click/mouseup/focus event I guess - when you let go of your finger, it toggles the state of the "Note" -> "Pack" section, so it tends to hide itself as you're reading it!
Would just remove that "accordion" functionality completely or make it always expanded on mobile breakpoints or whatever. Or just move that entire "About pack" section to be on the main page "below the fold" as the first thing people are going to want to do is find out *what it is* :)
About Pascal:
It makes me happy. It looks clean and pseudocode-like, which helps readers from around the world with different native languages understand it. I am glad that Pack made people curious about this old but great goodie.
About speed, here are some reasons:
- Pack does all the steps of packing or unpacking (read, check, compress/decompress, and write) together, and this weaving helped achieve both the speed and the random access. It is by far the fastest way I have seen to read or write random files from a file system, as fast as or faster than asynchronous read operations or OVERLAPPED on Windows. Past a point, it is limited by the file system: on NTFS, Pack can pack the Linux code base in around 1.3 s; the same job takes about 0.96 s on ext4. (A generic sketch of this read/compress/write weaving is included after this list.)
- It is built on a heavily optimized code base, standard library, and the Free Pascal compiler, which produces great binaries.
- Multi-core design: even mobiles have multi-core CPUs these days. By choosing thread counts based on the content and the machine, it does not eat your machine.
- Speed-configured SQLite. SQLite is much faster than most people think it is.
- A tuned configuration of the already rapid Zstandard.
In summary, standing on the shoulders of giants while trying hard to improve reliability, speed and user experience is a sign of respect for them.
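The sketch referenced in the first point above, for readers who want the shape of the idea rather than Pack's actual Free Pascal code: read and compress files on worker threads while a single thread writes into SQLite. The schema, level, and thread-pool details are invented for illustration only.

    import os
    import sqlite3
    import zstandard
    from concurrent.futures import ThreadPoolExecutor

    def pack_tree(root: str, archive: str) -> None:
        con = sqlite3.connect(archive)
        con.execute("CREATE TABLE IF NOT EXISTS Items(Name TEXT PRIMARY KEY, Content BLOB)")

        def load_and_compress(path: str):
            # Each worker reads one file and turns it into a Zstandard frame.
            with open(path, "rb") as f:
                data = f.read()
            return os.path.relpath(path, root), zstandard.ZstdCompressor(level=3).compress(data)

        paths = [os.path.join(d, n) for d, _, names in os.walk(root) for n in names]
        with ThreadPoolExecutor() as pool:
            # Reads and compression overlap on worker threads; the main thread
            # is the single SQLite writer, so the one-writer limit is respected.
            for name, blob in pool.map(load_and_compress, paths):
                con.execute("INSERT OR REPLACE INTO Items VALUES (?, ?)", (name, blob))
        con.commit()
        con.close()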
Thanks for the update! Somehow the 7z container overhead is +0.4 MB AND it is slower by a lot? Huh. Great numbers; the Pack format needs more exposure. Also, I suggest adding a section to the website that shows these numbers.
I did not want to focus the point on speed, or say, "Look, others are bad." They are great; my point was, "Look what we can do if we update our design and code." Pack's value comes from user experience, and speed is one part of that. I was not chasing the best speed or compression; I wanted an instantaneous feeling for most files. I wanted a better API, an easier CLI, improved OS integration (soon), and more safety and reliability. Tech people (including me) care so much about speed.
I am happy about the results, but Pack offers much more that I would like others to see.
Who is behind this? It's a new GitHub org, the committer (https://github.com/OttoCoddo) has a totally private GitHub profile, and there's no name under "Legal". Sure, one can be anonymous, but I won't download it; I don't trust it.
That is the point: if you trust a project based on "who" made it, my friend, that is the start of the big problem we are facing in the current state of tech. Just look at the code, build it yourself, and check the license.
Pack is made to be a private option; future locking and encryption options will solidify that. Trusting the author is not the correct way to verify the security and safety of such a tool.
Your best bet is a lossless transform that undoes the Huffman coding in the zip files (converting the compressed streams from effectively uncorrelated bitstreams to largely similar byte streams), then passing the result through a large-window compression algorithm (zstd?).
Similar techniques are used in ChromeOS for delta updates.
Yes, I have a bunch of zips I cannot modify. If I could uncompress them all, and zip them into one big archive, then it would be much smaller than if I put all the compressed zips into one big archive. This is because there's a lot more shared strings between uncompressed files across different zips, than there are shared strings between the compressed zips.
(At least... this is my assumption based on how I understand the formats to work. I do need to measure and verify this.)
So basically I'm wondering if there is some way to tell the "outer" zip to reuse the encoding dictionaries of each smaller zip, or to somehow intelligently merge their encoding dictionaries rather than treating each inner zip like an opaque blob.
No, the ZIP format compresses each embedded file separately, so it doesn't matter how much commonality there is between different files in the same archive.
Wow, really? I always assumed that if I zipped a directory, it would use a shared dictionary to compress all the files. But I guess what you say makes sense, because otherwise it wouldn't be possible to extract just one file.
Are there compression formats that do share an encoding dictionary across multiple files? I guess tar + gzip might do that?
Yeah, if you compress a tar file with gzip (or bzip2 or zstd or whatever) then the compressor doesn't care about the file boundaries, so it will be able to take advantage of redundancies.
However, those compressors generally only have a small context window, so they'll only be able to take advantage of relatively nearby redundancies in the archive. They won't help you if the common substrings are separated by many megabytes of unrelated data. So the order in which you pack the files matters.
In theory, you could do a two-pass approach, where you first scan the entire set of files to create a (relatively small) shared dictionary that's as useful as possible, and then use that dictionary to compress each file independently. I don't know of any archive format that does that, but you could roll your own using the zstd command line compressor.
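For instance, here is that two-pass idea sketched with the python-zstandard bindings rather than the CLI. The directory path is a placeholder and the dictionary size is an arbitrary choice; the key property is that each file is compressed on its own, yet against a dictionary trained over the whole set.

    import pathlib
    import zstandard

    paths = [p for p in pathlib.Path("my_data").rglob("*") if p.is_file()]
    samples = [p.read_bytes() for p in paths]

    # Pass 1: train a shared dictionary (here ~110 KB) over the whole file set.
    dictionary = zstandard.train_dictionary(110 * 1024, samples)

    # Pass 2: compress each file independently, but against the shared
    # dictionary, so any single file stays individually decompressible.
    cctx = zstandard.ZstdCompressor(dict_data=dictionary)
    compressed = {str(p): cctx.compress(data) for p, data in zip(paths, samples)}

    # Reading one file back only needs that file's frame plus the dictionary.
    dctx = zstandard.ZstdDecompressor(dict_data=dictionary)
    original = dctx.decompress(compressed[str(paths[0])])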
Is anyone using this or SQLite archives for anything at scale? They always seemed like a good solution for certain scientific outputs, but data integrity is obviously a concern.
As far as I know, PAQ strives to give the best compression for archival purposes. Pack, on the other hand, tries to give the best compression, while keeping speed as instantaneous as possible.
I know you don't have a duty to look around for your answer, but you too don't have a duty to say yuck to a project that has been done with a lot of effort.
I am ok with your comment, but maybe go easy on the next project.
> Pack format is based on the universal and arguably one of the most safe and stable file formats, SQLite3, and compression is based on Zstandard, the leading standard algorithm in the field.
yeah, no thanks. SQLite3 automatically means:
- Single implementation (yes, it's a nice one but still a single dependency)
- No way to write directly to pipe (SQlite requires real on-disk file)
- No standard way to read without getting the whole file first
- No guarantees in number of disk seeks required to open the file (relevant for NFS, sshfs or any other remote filesystem use)
- The archive file might be changed just by opening in read-only mode
- Damaged file recovery is very hard
- Writing is explicitly not protected against several common scenarios, like backup being taken in the middle of file write
- No parallel reads from multiple threads
Look, sqlite3 is great for its designed purpose (an embedded database). But trying to apply it to other purposes is often a bad idea.
Often new things are met with an excess of skepticism but I agree here.
I'd take this more seriously if the format was documented at all, but so far it appears to be "this implementation relies on sqlite and zstd therefore it's better", without even a specification of the sql schema, let alone anything else.
The GitHub repo contains precompiled binaries of zstd and sqlite. The sqlite builds appear to have thread support disabled, so not only will it be single-writer, it'll be single-reader too.
The schema is missing strictly typed tables, and the implementation appears to lack explicit collation handling for names and content.
The described benchmark appears to involve files with an average size of 16KB. I suspect it was executed on Windows on top of NTFS with an AV package running, which is a pathological case for single threaded use of the POSIXy IO APIs that undoubtedly most of the selected implementations use.
It's slightly odd that it appears to perform better when SQLite is being built with thread safety disabled (https://github.com/PackOrganization/Pack/blob/main/Libraries...) and yet the implementation is inserting in a thread group: https://github.com/PackOrganization/Pack/blob/main/Source/Dr.... I suspect the answer here is that because the implementation is using a thread group to read files and compress chunks, it's amortizing the slow cost of file opens in this benchmark using threading, but is heavily constrained by the sqlite locking - and the compression ratio will take a substantial hit in some cases as a result of the limited range of each compression operation. I suspect that zstd(1) with -T0 would outperform this for speed and compression ratio, and it's already installed on a lot of systems - even Windows 11 gained native support for .zst files recently.
The premise that we could do with something more portable than TAR and with less baggage is somewhat reasonable; we probably could do with a simple, safe format. There are a lot more key considerations to making such a format good, including many you outline: choices around seeking, syncing, incremental updates, compression efficiency, parallelism, etc. There is no single set of trade-offs to cover all cases, but it would be possible to make a file format that can be shared among them, while constraining the design somewhat for safety and ease of portability.
Hello, and thank you for the notes. Unfortunately, your points seem to be mostly wrong, so let me clarify them a little. Do not worry; many people misunderstand SQLite and its abilities.
- Single implementation: sure. Working with SQLite convinced me that nobody cared to reimplement it because it works so well that nobody wanted or needed to. I may write an unpacker just to prove that it is not hard at all to read the SQLite format. The complicated part is the SQL engine (and many other features that Pack does not use), and for Pack you can live without it.
- SQLite does not require a disk; it has a memory option. Pack can have piping and probably will. I did not implement it because, well, it is too new, and I do what I feel is needed first. You can subscribe to the newsletter on the site (https://pack.ac/notes) or follow GitHub.
- Of course you can read SQLite without reading the whole file. It is a database, not a tar file.
- SQLite is highly optimized to read the lowest amount of data, and it has layers of smart caching. There is a reason it is used on almost any device that has a computer on it, even smartwatches.
- Of course the archive is safe from changes during unpacking. It will be opened in read-only mode, guarded by the OS and file system, and Pack also uses code isolation, which prevents calling write on any file. (A small example of such a read-only open is sketched after this list.)
- There are a lot of tools that help repair a damaged SQLite file. Pack is also guarded with transactions. The file will not get corrupted unless the disk itself goes corrupt; the mentioned tools come in handy then. And in today's world of SSDs, that risk is shrinking rapidly.
- On unpack, Pack reads, decompresses, checks, and writes in a multithreaded fashion. So yes, parallel reading is possible and is done in Pack.
- I suggest trying Pack for yourself. It gives you the feeling you need to have to be sure.
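The read-only open mentioned above, sketched with Python's sqlite3 module. Whether Pack uses exactly these options is an assumption, but mode=ro plus immutable=1 is the standard SQLite way to guarantee a reader never writes to the file; immutable also skips journal/WAL recovery, which is what can otherwise modify a database that is only being read.

    import sqlite3

    # Open the archive strictly read-only; immutable=1 tells SQLite the file
    # cannot change, so it will neither lock it nor attempt journal recovery.
    con = sqlite3.connect("file:example.pack?mode=ro&immutable=1", uri=True)
    tables = con.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    print(tables)
    con.close()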
I am very curious how you are going to get SQLite working with piping, especially on extract... It's pretty common to do stuff like "curl ... | tar xvf -", so that you can start extraction the moment the first kilobyte of data arrives. This really saves a lot of time, as disk and network work in parallel.
A less common tar feature is packing to a stream -- stuff like "ssh remote tar cvf - ... > local-file.tar", which skips the temporary file on the remote machine and also saves a lot of time in transfer.
But for both of those, sqlite's "memory" option won't help you: memory or not, you still need the entire file before you can read it. So if you just store file contents in the SQL database, you have to fetch everything up to the last byte before you can get any data out.
Maybe you can keep the index in sqlite and append the data as-is... but where would you put that index?
If you put it in front (like squashfs), you need to produce the entire metadata before writing the first data byte... and that has to include compressed sizes too (assuming you want to support random extraction), which means you cannot stream the file out until you finish compressing all the data. Also, sometimes you will not be able to add files to the archive without rewriting the whole thing (if the index grows and you didn't leave enough padding). This might be OK, but it definitely should be mentioned.
If you put it at the end (like zip), you will be able to stream the file out during compression, but fast decompression would be impossible. Also, you'll forgo any sqlite transactional guarantees, since the database will be created in memory and only written at the very end once all the files are done.
So frankly, I don't see how you can win on a streaming front, unless you really have a custom format and "sqlite3" is just a small part of it.
(Another problem is there is not even a short spec - how is sqlite3 used, what is your schema, and so on. And I am sorry, but I am not going to read the source code just to figure this stuff out).
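To make the streaming point above concrete, this is roughly what "curl ... | tar xvf -"-style extraction looks like in code: decompression and extraction start as soon as bytes arrive, with nothing buffered to disk first. The URL is a placeholder, and the sketch uses tar + zstd, not Pack.

    import tarfile
    import urllib.request
    import zstandard

    with urllib.request.urlopen("https://example.com/big-archive.tar.zst") as resp:
        # Decompress incrementally as bytes arrive from the network...
        with zstandard.ZstdDecompressor().stream_reader(resp) as zreader:
            # ...and let tarfile consume the stream entry by entry
            # ("r|" means a non-seekable stream).
            with tarfile.open(fileobj=zreader, mode="r|") as tar:
                for member in tar:
                    tar.extract(member, path="out")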
I think that point is about it not supporting streaming compression where the output of the packer is immediately fed into something like a pipe or a TCP connection.
You can do this with both tar and ZIP. If all you have is SQLite, you need to fully create the local database (be it in a file or in memory) before you can transmit it somewhere else to be stored or unpacked.
Which is a weird gripe - in this use case, `tar` makes the most sense to use. It's not like this format claims to be a full replacement for tar or anything close to that.
That's exactly what they do - they explicitly call out this format as a replacement for tar and zip, at least:
> Most popular solutions like Zip, gzip, tar, RAR, or 7-Zip are near or more than three decades old. While holding value for such a long time (in the computer world) is testimony to their greatness, the work is far from done.
> Pack tries to continue this path as the next step and proposes the next universal choice in the field.
Yes, I am.
Pack can hold millions of files with no problem. One field in which it shines is that, aside from being fast at processing large amounts of data, it can process many small files much faster than similar tools or even many popular file systems.
About piping: it can be done, and it is on my list. I will finish features based on their popularity and whether they make sense.
As Pack has random access support, you can choose a file in a big pack, and it can stream it out to the output. It is already able to unpack partially to your file system (using --include="file path in pack"); streaming/piping it would not be a problem.
It's not safe to disable SQLite's thread safety as you do here: https://github.com/PackOrganization/Pack/blob/main/Libraries... and then do your own locking. You attempt to pass the flag at open time to enable serialized mode; however, quoting the SQLite docs for the build flag you set:
Note that when SQLite is compiled with SQLITE_THREADSAFE=0, the code to make SQLite threadsafe is omitted from the build. When this occurs, it is impossible to change the threading mode at start-time or run-time.
SQLite's APIs are often hazardous in these ways; it really should error rather than silently ignore the fullmutex flag, but alas.
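As an aside, an easy way to see how a given SQLite build was compiled is the compile_options pragma. The snippet below inspects whatever SQLite library the running process links against (here, the copy bundled with Python), not Pack's bundled build, so it only illustrates the check itself.

    import sqlite3

    con = sqlite3.connect(":memory:")
    for (opt,) in con.execute("PRAGMA compile_options"):
        if opt.startswith("THREADSAFE"):
            print(opt)  # e.g. "THREADSAFE=1" for the SQLite linked into this Python
    con.close()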
Also, if you dare to scroll up naturally after opening a container, it's interpreted as a refresh as the page redraws. It might be an awesome format, but the web design fail negates it entirely.
Had the same issue, plus I think one of my Safari extensions was hiding a cookie banner so half the page was just “dark”. A more mobile friendly view is absolutely needed.