I was working on a trick to save files in a more compact way by taking advantage...

gildas · 2024-12-21T21:55:19 1734818119

I took another approach in SingleFile by offering a way to save pages as self-extracting pages (i.e. ZIP/HTML polyglot files), see [1] for more info.

[1] https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG

Dwedit · 2024-12-21T22:05:07 1734818707

Really cool stuff there, how did you read the bytes back? Normally you get a CORS error if you try to use a network request to read back yourself.

gildas · 2024-12-21T22:08:40 1734818920

The saved page is encoded in windows-1252. It includes "consolidation data" to read the ZIP data as text from the DOM and recover the replacements of \r and \r\n occurrences (this is the only data loss and it represents approx. 1% of ZIP data), see the links below for more info.

https://gildas-lormeau.github.io/Polyglot-HTML-ZIP-PNG/en-EN...

Dwedit · 2024-12-22T21:18:57 1734902337

If "CR" is the only bad byte, that means that 255/256 of the symbols are okay to use. That beats UTF-16 embedded in a string, where only 63481/65536 of the symbols are okay to use.

My approach was to use very large integers. You can split the input file into blocks of X bits, then represent that block as X+1 bits. The output is bigger because it can't have any forbidden bytes in there.

For the case of 255 of 256 symbols, packing 1415 bits of data into 1416 bits of space is the most efficient block size (before reaching a ridiculously large size) at 0.0706215% expansion. (For an infinite block size, you'd have an expansion of 1 - (log base 256 of 255), or 0.070582%)

Encoding: Turn 1415 bits of data into a very large number. Repeatedly divide and modulo by 255, giving a range of 0-254. Then add 1 to all bytes "CR" or larger. Now you have 1416 bits of encoded data, which cannot be "CR".

Decoding: Read a byte, decode back to 0-254 by subtracting 1 if it's greater than "CR". Multiply by 255 and add to your big number. At the end, you'll have a really big number that holds 1415 bits of data. This would be 177 big multiplies, and 177 big adds.

Decoding (the faster way):

Javascript uses floats, but you can treat them as 48-bit integers. Just watch out for the bitwise operators, they will truncate results down to 32 bits. That means use actual multiplication and division instead of bit shifting.

6 bytes at a time: 48 bits can hold 6 bytes. With normal floating point math, you can multiply each byte by 255^0, 255^1, 255^2, 255^3, 255^4, 255^5, and sum them together. Then you multiply-and-add these 6-byte chunks to a big int. Then the operations afterwards use big ints. First 6 bytes get multiplied by 255^0, next 6 bytes get multiplied by 255^6, then 255^12, 255^18, etc. Whole thing is summed together. This cuts it down to 30 bigint multiply-and-adds, (30 multiplies and 30 adds)

Homemade bigint: It's an array of doubles, but used as 48-bit integers. Compared to the actual BigInt, it removes all allocations, and you can access the bits inside directly, speeding up the part where you extract bits from the number. Only mathematical operation required for decoding is the "multiply and accumulate" operation. Using the homemade bigint sped things up dramatically.

---

So then, that's a lot of math just to avoid escaping (or fixing up) your bytes, but I think that would get close to the minimum possible expansion.

jclarkcom · 2024-12-21T22:57:59 1734821879

Neat idea, how does it compare to just using base64 when gzip is in the middle?

Dwedit · 2024-12-22T01:12:21 1734829941

You won't be using gzip on compressed binary files (images, videos, audio, etc).

KolenCh · 2024-12-22T01:44:09 1734831849

But they are talking about gzipping base64 encoded data though.