sqlite is great as a "file format" for a particular application, but I think it'...

Retr0id · 2024-03-23T03:24:13 1711164253

After overwriting the "Pack" magic bytes back to the SQLite default values, I was able to open it and see the following tables

    CREATE TABLE Content(ID INTEGER PRIMARY KEY, Value BLOB);
    CREATE TABLE Item(ID INTEGER PRIMARY KEY, Parent INTEGER, Kind INTEGER, Name TEXT);
    CREATE TABLE ItemContent(ID INTEGER PRIMARY KEY, Item INTEGER, ItemPosition INTEGER, Content INTEGER, ContentPosition INTEGER, Size INTEGER);

According to the `.indexes` directive, there are... no indexes. What's the point of sqlite if you're not going to index things?

All the data is stored in one big blob (the "Value" column of the "Content" table), with the metadata storing offsets into it. It looks like there's still the possibility of things being split over multiple blobs (to circumvent the 2GB blob size limit)

Retr0id · 2024-03-23T05:13:53 1711170833

I've reverse engineered the format and written up my findings here: https://github.com/DavidBuchanan314/pack-analysis

Summary:

- Custom sqlite magic bytes makes the format incompatible with all existing sqlite tooling.

- No support for file metadata.

- There's no version field (afaict), making future format improvements difficult.

Edit: A previous version of this comment had a much longer list of complaints, but after taking a closer look, I retract them. I was looking at the MediaKit.pack file as an example, which, due to being relatively small, packed all its files into a single BLOB. I was under the mistaken impression that the same approach was taken for larger files, but after some further testing I see that they're split up into ~8MB chunks.

Though, if you have lots of small files (say, a couple of kilobytes each) then random access performance could suffer.

OttoCoddo · 2024-03-23T16:28:01 1711211281

Hello David, and thank you for your comment, analysis, and the issues you opened. I will get to them all.

- SQLite tooling: You will not need it unless you are debugging something, then you can change the header or just use the `--activate-other-options --transform-to-sqlite3` parameter to transform a Pack file to SQLite3, and use the `--activate-other-options --transform-to-pack` to go back. This way, you get a true SQLite3 database that you can browse as you wish. For most people, mixing Pack with SQLite was just a call for problems for the SQLite team (imagine people coming and asking to fix their Pack file from the team; that would not be fair) and a harder future for Pack to update.

- Metadata is not stored in Pack. I don’t want the metadata of my machine attached to a file. It’s a never-ending nightmare to match source and destination OS metadata. There will always be something missing, and Pack tends to get everything perfect or nothing. Storing metadata adds extra weight that most users don’t care about and complicates the ability to store other types of data alongside files. It may get added as an option in the future if many people need it.

- There is a version field. It is currently in Draft 0, and it is written using a custom VFS. Look here for more information: https://github.com/PackOrganization/Pack/blob/main/Source/Dr...

- All future versions of Pack must handle previous versions and must only write the latest version. So any files created right now (Draft 0) will be read correctly for ever to come.

- Each Draft proposal will get its own version, and if it gets final, it will be set to final.

- Two byte after 'Pack' header in little endian as (1 (Draft) shl 13 + 0 (version 0) = 8192). Final would be 0, so the first Final version will be 0 shl 13 + 1 = 1. and the second will be 2. It is by design, so any Draft version gets a higher number, preventing future mixups.

- 8 MB chunks are the default; Pack may choose smaller or bigger (16 MB for many small files or 32 MB for Hard Press).

- Random access is proper as unpacking steps take into account what you want and decompress a Content just once for many neighbouring files. But even for reading just one file, here is an example: if I want to extract just a .c file from the whole codebase of Linux, it can be done (on my machine) in 30 ms, compared to near 500 ms for WinRAR or 2500 ms for tar.gz. And it will just worsen when you count encryption. For now, Pack encryption is not public, but when it is, you can access a file in a locked Pack file in a matter of milliseconds rather than seconds.

Retr0id · 2024-03-23T22:09:52 1711231792

Thank you for the detailed response(s). I must admit I'm warming up to the idea of Pack, it does perform well in my testing (I didn't test at first because I'm on aarch64 linux, for which there are no compatible builds).

Not including metadata is an opinionated stance, but I can certainly get behind it, especially as a default. 99% of the time I do not care about metadata when producing a file archive.

Compatibility with existing SQLite tooling is not just useful for debugging, it is extremely useful for writing alternative implementations. If you want Pack to be successful as a format and not just as a piece of software, I think you should do everything you can to make this easier.

In my experimentation, I wrote a simple python script to extract files from a Pack archive. Conveniently, sqlite is part of the python standard library, but in order to make it work with that version (as opposed to compiling my own) I had to edit the file header first, which is inconvenient and not always possible to do (e.g. if write permissions are not available).

Despite that inconvenience, it took less code than a comparably basic ZIP extractor, which is cool!

I worry that requiring a custom VFS will make it harder for people to produce compatible software implementations.

I think your concerns about people contacting SQLite for support are overblown. I assume you've heard the `etilqs_` story[0], but in this case, you need to use a hex-editor or a utility like `file` to even see the header bytes. I think anyone capable of discovering that it's an SQLite DB will be smart enough not to contact SQLite for support with it.

The `Application ID`[1] field in the SQLite header is designed with this exact purpose in mind

> The application_id PRAGMA is used to query or set the 32-bit signed big-endian "Application ID" integer located at offset 68 into the database header. Applications that use SQLite as their application file-format should set the Application ID integer to a unique integer so that utilities such as file(1) can determine the specific file type rather than just reporting "SQLite3 Database".

It's convenient that `Pack` is 32 bits long ;)

[0] https://github.com/mackyle/sqlite/blob/18cf47156abe94255ae14...

[1] https://www.sqlite.org/pragma.html#pragma_application_id

OttoCoddo · 2024-03-23T22:44:32 1711233872

I am happy to hear that, and I really appreciate your interest.

Did you compile it for yourself? Any problem or steps you used, I will be happy to hear, o at pack.ac or GitHub, as it is hard to follow the building here.

As a reminder, Pack Draft 0 has Compatibility with SQLite tools; the only needed step is to change the first 16 bytes. Again, you can use `--activate-other-options --transform-to-sqlite3` with the CLI tool, and you will get a perfectly working SQLite file.

VFS is not needed; they can change the header after writing; VFS was just cleaner to me.

My first work was using application_id, after a while, it did not feel right to me, so I changed it for good. It allows easier future development, fewer problems for file type detection, a decreased chance of mistaken change (you already saw many negative comments on using SQLite as a base), and the support reason: just yesterday I was reading a forum post about people asking for support on software because it was using SQLite. application_id seems like a great choice if you are doing a DB-related task or making a custom DB for transfer on wire, to communicate between internal and semi-public tools. Using it for a format that could potentially get to an innumerable count seemed unwise.

me-vs-cat · 2024-03-23T19:28:04 1711222084

> - All future versions of Pack must handle previous versions and must only write the latest version.

I believe you are making a mistake by preventing Pack from writing archives that are compatible with prior versions.

OttoCoddo · 2024-03-23T16:27:51 1711211271

Thank you for the check.

No index, as they take space, and I wrote the queries considering SQLite automatic indexes. They will be created on demand, at unpacking time. All the unpacking processes are made to read and decompress content just once, so there are no worries about slowdowns.

I suggest trying Pack for yourself and seeing the speed. Or deeper, use `--activate-other-options --transform-to-sqlite3` to transform a Pack file to SQLite3, create your own indexes, and use `--activate-other-options --transform-to-pack` to convert it to Pack and then try unpacking. You will not see any worthy difference.

Yes, Contents are like packages of raw data from a chunk, a whole, or many of the items (files or data). They may be compressed if needed (With Zstandard). ItemContent table helps to find the needed Item parts.

The Content structure circumvents any BLOB limit, but it is also made to give better compression while keeping random access.

Retr0id · 2024-03-23T21:35:31 1711229731

Fair point, I can see that indexes are not really necessary.

OttoCoddo · 2024-03-23T16:27:45 1711211265

I guess you are overestimating the "cobble together read/write support without even needing a library."

Let's imagine: You want to read a ZIP file. Will you write your own reader? I seriously doubt it, as the work, stabilising, and security (random memory access as an example) would be issues. But let's think we are couraginous. OK, we read rather not so simple format and carefully read the binary. Now, will you write your own DEFLATE and Huffman coding? Again, a bigger doubt.

I would argue that if someone cares enough to reimplement ZIP, it would at worst be twice as hard to write a Pack reader from scratch with no ZSTD or SQLite. And for those serious people, reading a format that lets them store better and faster would be a prize that is hard to say no to. But I get your point, and if you are in a desert and need something to put together fast before going out of water, tar may be a good choice.

Retr0id · 2024-03-23T21:32:21 1711229541

I have written my own zip, deflate, and huffman coding - although the latter two were "just for fun". But I would definitely consider writing ad-hoc zip logic in real software, if I couldn't pull in a library for whatever reason. This isn't just a hypothetical, it happens a lot - there are many independent ZIP implementations in the wild, for better or for worse.

You're right to call out security though, because the multiple implementations cause security issues where they disagree, my favorite example being https://bugzilla.mozilla.org/show_bug.cgi?id=1534483 . Although arguably this is a symptom of ZIP being a poorly thought out file format (too many ambiguous edge-cases), rather than a symptom of it being easy to implement.

OttoCoddo · 2024-03-23T21:41:47 1711230107

You are one of the bravest. And you know that, using SQLite as the base storage, rules out many of the security problems we can face.

Anyone needing to reimplement Pack, can do it, very easily, if not easier than implementing ZIP, IF they use SQLite and Zstandard. Maybe a day of work or less. If they want to rewrite (reading part of) them too, it will be a couple of days of work.