With all due respect, I find it hard to believe the author stumbled upon a trivial method of improving tarballing performance by several orders of magnitude that nobody else had considered before.

If I understand correctly, they're suggesting Pack, which both archives and compresses, is 30x faster than creating a plain tar archive. That just sounds like you used multithreading and tar didn't.

Either way, it'd be nice to see [a version of] Pack support plain archival, rather than being forced to tack on Zstd.




That’s more because plain tar is actually a really dumb way of handling files that aren’t going to tape.

Being better than that is not a hard bar.


The tar file format is REALLY bad. It's pretty much impossible to thread because it just writes metadata then content, repeatedly concatenating.

i.e.

    /foo.txt 21
    This is the foo file
    /bar.txt 21
    This is the bar file
That makes it super hard to deal with, as you essentially need to walk the entire tar file before you can even list its contents. To add a file, you have to wait for the previous file to be added.
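To make that concrete, here's a rough sketch (assuming a plain ustar archive with no pax extensions, and a hypothetical archive path) of why listing is inherently sequential: each 512-byte header only describes the entry that follows it, so you must skip over every entry's data just to reach the next header.

    def list_tar(path):
        # Walk header -> data -> header -> data...; there is no central index.
        with open(path, "rb") as f:
            while True:
                header = f.read(512)
                if len(header) < 512 or header == b"\0" * 512:  # end-of-archive marker
                    return
                name = header[0:100].rstrip(b"\0").decode()
                size = int(header[124:136].rstrip(b"\0 "), 8)   # size field is octal ASCII
                print(name, size)
                f.seek((size + 511) // 512 * 512, 1)            # data is padded to 512-byte blocks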

Using something like SQLite solves this particular problem because you can have a table with file names and a table with file contents that can both be inserted into in parallel (though that will mean the contents aren't guaranteed to be contiguous). Since SQLite is just a B-tree, it's easy (well, at least well understood) how to concurrently modify the contents of the tree.
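A minimal sketch of that layout (purely illustrative; this is not Pack's actual schema). Note that SQLite itself allows only one writer at a time, so in practice the parallelism comes from reading and compressing file contents in worker threads and funnelling the inserts through one connection:

    import sqlite3

    con = sqlite3.connect("archive.db")
    con.executescript("""
        CREATE TABLE IF NOT EXISTS files(
            id INTEGER PRIMARY KEY, path TEXT UNIQUE,
            size INTEGER, mode INTEGER, mtime INTEGER);
        CREATE TABLE IF NOT EXISTS contents(
            file_id INTEGER REFERENCES files(id), data BLOB);
    """)

    def add_file(path, data, mode, mtime):
        # Metadata row and content row are independent inserts.
        cur = con.execute("INSERT INTO files(path, size, mode, mtime) VALUES (?,?,?,?)",
                          (path, len(data), mode, mtime))
        con.execute("INSERT INTO contents(file_id, data) VALUES (?,?)",
                    (cur.lastrowid, data))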


Funnily enough, tar is really like 3 different formats (pax, tar, ustar). One of the annoying parts of the format is that even if you scan all the metadata upfront, you still have to keep the directory metadata in RAM and wait until the very end to apply it.


Or just do what zip and every other format does and put all the metadata at the beginning - enough to list all files, and extract any single one efficiently.


zip interestingly sticks the metadata at the end. That lets you add files to a zip without touching what's already been zipped. Just new metadata at the end.

Modern tape archives like LTFS do the same thing as well.
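Python's zipfile makes this easy to see; a small sketch with hypothetical file names, where append mode leaves the existing entries alone and just writes the new entry plus a fresh central directory at the end:

    import zipfile

    # "a" mode: existing compressed data is untouched; only the new entry
    # and a rewritten central directory are appended at the end.
    with zipfile.ZipFile("existing.zip", mode="a") as zf:
        zf.write("new_file.txt")

    print(zipfile.ZipFile("existing.zip").namelist())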


That sounds like you need to have fetched the whole zip before you can unzip it - which is not what one wants when making "virtual tarfiles" which only exist in a pipe. (i.e. you're packing files in at one end of the pipe and unpacking them at the other)


Just fseek to the end.

Zip format was not designed for piping.
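You can see the difference with Python's standard library (a small sketch; feed it a pipe like `tar -c somedir | python3 list_members.py`): tarfile has a pure streaming mode, while zipfile has to seek to the central directory at the end before it can do anything.

    import sys, tarfile

    # 'r|' is tarfile's streaming mode: it never seeks backwards, so a pipe is fine.
    with tarfile.open(fileobj=sys.stdin.buffer, mode="r|") as tf:
        for member in tf:
            print(member.name, member.size)

    # zipfile.ZipFile(sys.stdin.buffer) would fail here: it needs a seekable
    # file to locate the end-of-central-directory record.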


Tapes don't (? certainly didn't) operate this way. You need to read the entire tape to list the contents.

Since tar is a Tape ARchive, the way tar operates makes sense (as it was designed for both output to file and device, i.e. tape).


That point is raised in response to every criticism of tar: that it's good at tape.

Yes! It is! But it's awful at archive files, which is what it's used for nowadays and what's being discussed right now.

Over the past 50 years, some people did try to improve tar. They developed ways to append a file table at the end of an archive file while maintaining compatibility with tapes, all existing tar utilities, and piping.
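Purely as a conceptual sketch of that idea (not how any particular tool implements it): scan the archive once, record each member's name, offset and size, and append the index plus a tiny trailer after tar's end-of-archive blocks, where plain tar readers simply stop.

    import json, struct, tarfile

    def append_index(path):
        # Build (name, data offset, size) for every member, then append it as
        # JSON after the archive, followed by an 8-byte pointer to the index.
        with tarfile.open(path, "r") as tf:
            index = [(m.name, m.offset_data, m.size) for m in tf.getmembers()]
        blob = json.dumps(index).encode()
        with open(path, "ab") as f:
            start = f.tell()
            f.write(blob)
            f.write(struct.pack("<Q", start))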

Similarly, driven people did extend (pk)zip to cover all the unix-y needs. In fact the current zip utility still supports permissions and symlinks to this day.

But despite those better methods, people keep pushing og tar. Because it's good at tape archival. Sigh.


Tapes currently don't really operate like tar anymore either. Filesystems like LTFS stick the metadata all in one blob somewhere.


It's been a long time since I've operated tape, so good to know things have changed for the better.


It was hard to believe for me, too. And I didn't stumble upon it; I went looking for it deliberately, and that was one of the points of the note. People have not looked properly for nearly three decades. Many things have changed, but we computer people are still using the same tools. I am not saying old is not good; the current solutions are great, but what are we if we don't look for something better?

Yes, it is that much faster, and a good part of that comes from the multi-threaded design, but as a reminder, WinRAR and 7-Zip are multi-threaded too, and you can still see the difference. To satisfy your doubt, I suggest running Pack yourself. I am looking for more data on its behaviour on different machines and data.

Can I ask why you need a version without Zstd? If you are thinking that compression slows it down, I should say no. Pack is the first of its kind where "Store" actually slows it down: because its compression is smart, it skips any non-compressible content.

On the same machine and the same Linux source code test:

Pack: 194 MB, 1.3 s

Pack (With no Press): 1.25 GB, 1.8 s
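As an illustration of that skip-incompressible idea (a generic sketch using the zstandard Python bindings, not Pack's actual logic): trial-compress a small sample first and fall back to storing the data raw when it barely shrinks.

    import zstandard  # pip install zstandard

    def pack_blob(data: bytes) -> tuple[str, bytes]:
        cctx = zstandard.ZstdCompressor(level=3)
        sample = data[:64 * 1024]
        # If a sample barely shrinks, assume the rest is incompressible and store it raw.
        if len(cctx.compress(sample)) > 0.95 * len(sample):
            return "store", data
        return "zstd", cctx.compress(data)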


My concern with Pack obliging me to compress is that compression becomes less pluggable; I'd much rather my archive format be agnostic of compression, as with tar, so that I can trivially move to a better compression format when one inevitably comes to be.


You've got a point. Although with that option comes a great cost: we lose portability, speed and even reliability.

Portability: the receiver (or future you) needs to know which tool you used, and even which version.

Speed: if you do the archive part first (tar) and then compress (gz), you get a much lower speed (as shown in the note).

Reliability: most people use tar with gz anyway, but if you pair it with a less popular algorithm and tools, you risk ending up with a file that may or may not work in the future.

Pack's plan is to use the best of its time (Zstandard), and if an update is needed in years to come, it will add support for the new algorithm. All Pack clients must write only the latest version (and read all previous versions), which ensures almost everyone uses the best of their time.


Pure zstd (or .tar.zst) vs Pack vs patched 7z+zstd would be more interesting: how much overhead is introduced by the Pack format itself, in size and speed?


I answered this question here: https://news.ycombinator.com/item?id=39801083 If that is not enough, let me know.


tar.zst vs pack is looking great, thanks! Also there is https://github.com/mcmilk/7-Zip-zstd

.pack vs zst-7z with the same compression settings would be interesting. That would be pure container overhead.


Also, 4.7 seconds to read 1345 MB in 81k files is suspiciously slow. On my six-year-old low/mid-range Intel 660p with Linux 6.8, tar -c /usr/lib >/dev/null with 2.4 GiB in 49k files takes about 1.25s cold and 0.32s warm. Of course, the sales pitch has no explanation of which hardware, software, parameters, or test procedures were used. I reckon tar was tested with cold cache and pack with warm cache, and both are basically benchmarking I/O speed.
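For anyone reproducing this, one way to make the cold/warm split explicit on Linux (a rough sketch; dropping the page cache needs root, and the paths are just examples):

    import subprocess, time

    def drop_caches():
        subprocess.run(["sync"], check=True)
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")  # drop page cache, dentries and inodes

    def time_cmd(cmd):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        return time.perf_counter() - start

    drop_caches()
    print("cold:", time_cmd(["tar", "-c", "/usr/lib"]))
    print("warm:", time_cmd(["tar", "-c", "/usr/lib"]))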


The footnotes at the bottom say:

> Development machine with a two-year-old CPU and NVMe disk, using Windows with the NTFS file system. The differences are even greater on Linux using ext4. Value holds on an old HDD and one-core CPU.

> All corresponding official programs were used in an out-of-the-box configuration at the time of writing in a warm state.


My apologies, the text color is barely legible on my machine. Those details are still minimal, though: what versions of software? How much RAM is installed? Why is 7-Zip set to maximum compression but zstd is not? Why is tar.zst not included for a fair comparison of the Pack-specific (SQLite) improvements on top of the standard solution?


The machine has 32 GB of RAM, but that is far more than these tools need.

7-Zip was used like the others: I just gave it a folder to compress. No configuration.

As requested, here are some numbers on tar.zst for the Linux source code (the test subject in the note):

tar.zst: 196 MB, 5420 ms (using the out-of-the-box config plus -T0 to let it use all the cores; without it, it would be 7570 ms)

Pack: 194 MB, 1300 ms

Slightly smaller size, and more than 4X faster. (Again, that is on my machine; you need to try it for yourself.) Honestly, Zstd is great; tar is slowing it down (because of its old design and being single-threaded). And it is done in two steps: first creating the tar, then compressing it. Pack does all the steps (read, check, compress, and write) together, and this weaving is what achieves the speed and random access.


This sounds like a Windows problem, plus compression settings. Your wlog is 24 instead of 21, meaning decompression will use more memory. After adjusting those for a fair comparison, pack still wins slightly but not massively:

  Benchmark 1: tar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
    Time (mean ± σ):      2.573 s ±  0.091 s    [User: 8.611 s, System: 1.981 s]
    Range (min … max):    2.486 s …  2.783 s    10 runs
   
  Benchmark 2: bsdtar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
    Time (mean ± σ):      3.400 s ±  0.250 s    [User: 8.436 s, System: 2.243 s]
    Range (min … max):    3.171 s …  4.050 s    10 runs
   
  Benchmark 3: busybox tar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
    Time (mean ± σ):      2.535 s ±  0.125 s    [User: 8.611 s, System: 1.548 s]
    Range (min … max):    2.371 s …  2.814 s    10 runs
   
  Benchmark 4: ./pack -i ./linux-6.8.2 -w
    Time (mean ± σ):      1.998 s ±  0.105 s    [User: 5.972 s, System: 0.834 s]
    Range (min … max):    1.931 s …  2.250 s    10 runs
   
  Summary
    ./pack -i ./linux-6.8.2 -w ran
      1.27 ± 0.09 times faster than busybox tar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
      1.29 ± 0.08 times faster than tar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
      1.70 ± 0.15 times faster than bsdtar -c ./linux-6.8.2 | zstd -cT0 --zstd=strat=2,wlog=24,clog=16,hlog=17,slog=1,mml=5,tlen=0 > linux-6.8.2.tar.zst
Another machine has similar results. I'm inclined to say that the difference is probably mainly related to tar saving attributes like creation and modification time while pack doesn't.

> it is done in two steps: first creating tar and then compression

Pipes (originally Unix, subsequently copied by MS-DOS) operate in parallel, not sequentially. This allows them to process arbitrarily large files on small memory without slow buffering.
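To spell that out with a sketch (same shape as the benchmark commands above): both processes run at the same time, with tar feeding blocks into the pipe while zstd compresses whatever has already arrived, so nothing is staged on disk in between.

    import subprocess

    with open("linux.tar.zst", "wb") as out:
        tar = subprocess.Popen(["tar", "-c", "./linux-6.8.2"], stdout=subprocess.PIPE)
        zst = subprocess.Popen(["zstd", "-c", "-T0"], stdin=tar.stdout, stdout=out)
        tar.stdout.close()  # so zstd sees EOF when tar finishes
        zst.wait()
        tar.wait()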


Thank you for the new numbers. Sure, it can be different on different machines, especially on full systems. For me, on Linux with ext4, Pack finishes the Linux code base in just 0.96 s.

Anyway, I do not expect an order of magnitude difference between tar.zst and Pack; after all, Pack is using Zstandard. What makes Pack fundamentally different from tar.zst is Random Access and other important factors like user experience. I shared some numbers on it here: https://news.ycombinator.com/item?id=39803968 and you are encouraged to try them for yourself. Also, by adding Encryption and Locking to Pack, Random Access will be even more beneficial.


HDD for testing is a pretty big caveat for modern tooling benchmarks. Maybe everything holds the same if done on an SSD, but that feels like a pretty big assumption given the wildly different performance characteristics of the two.


gzip is really, really, really slow, so it's pretty easy to make a thing that uses gzip fast by switching to Zstandard.


Eh, it's not that hard to imagine given how rare it is to zip 81k files of around 1kb each.


Not that rare at all. Take a full disk zip/tar of any Linux/Windows filesystem and you'll encounter a lot of small files.


Ok? How are you comparing these systems to the benchmark so they might be considered relevant? Compressing "lots of small files" describes an infinite variety of workloads. To achieve anything close to the benchmark you'd need to compress only small files, in a single directory, of a small average size. And even the contents of those files would have large implications for expected performance...


My comment is not making any claims about that. It's just a correction that filesystems with "81k 1KB files" are indeed common.


If that were true, surely it would make sense to demonstrate this directly rather than with a contrived benchmark? The issue is not the preponderance of small files but rather the distribution of data shapes.


Reading many files (81K in this test) is way slower than reading just one big file. For bigger files, Pack is much faster. Here is a link to some results from a kind contributor: https://forum.lazarus.freepascal.org/index.php/topic,66281.m...

(Too long to post here)


That's basically any large source repo.


Zipping up a project directory even without git can be a big file collection. Python virtual environment or node_modules, can quickly get into thousands of small files.


It's like 3x, not 30x, but yes, same skepticism.



