Hacker News new | past | comments | ask | show | jobs | submit login

I was working on a trick to save files in a more compact way by taking advantage of UTF-16 encoded HTML files.

I have an example HTML files on my website here: https://www.dwedit.org/files/test.html

(Tip: type "codeText" into the JS console to see the second-stage loader. First stage-loader is in clear text in the source)

It's storing a 64479 byte JPEG into 64680 bytes of UTF-16 text, for around 0.2985% expansion. The decoder isn't size-optimized right now (there's even comments in the code), so it takes about 10KB more.

This is doing two things:

First stage is a basic kind of UTF-16 packing, where you use a simple Javascript string literal, and anything that needs escaping gets escaped. This keeps most 8-bit data the same size, but byte pairs that turn into a forbidden string character get escaped. Then you split your result string into bytes.

Forbidden characters (requiring escaping) in a Javascript string: 0x00, 0x0A, 0x0D, 0x22, 0x5C, 0x2028, 0x2029, 0xD800-0xDFFF

---

Second stage involves using very large integers to pack 335 bits into every 336 actual bits. This avoids all escaping completely in the JS string, as you can avoid the forbidden characters.

During decoding, my code is using a handmade 48-bit BigInt (actually stores 7 48-bit words for 336 total bits), allowing support on web browsers that predate native bigints, and it runs faster too.

Let me know if I should make an article about this.

(I also made a WASM version of the decoder to see if it would run any faster, but it didn't.)




I took another approach in SingleFile by offering a way to save pages as self-extracting pages (i.e. ZIP/HTML polyglot files), see [1] for more info.

[1] https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG


Really cool stuff there, how did you read the bytes back? Normally you get a CORS error if you try to use a network request to read back yourself.


The saved page is encoded in windows-1252. It includes "consolidation data" to read the ZIP data as text from the DOM and recover the replacements of \r and \r\n occurrences (this is the only data loss and it represents approx. 1% of ZIP data), see the links below for more info.

https://gildas-lormeau.github.io/Polyglot-HTML-ZIP-PNG/en-EN...

https://gildas-lormeau.github.io/Polyglot-HTML-ZIP-PNG/en-EN...

https://gildas-lormeau.github.io/Polyglot-HTML-ZIP-PNG/en-EN...


If "CR" is the only bad byte, that means that 255/256 of the symbols are okay to use. That beats UTF-16 embedded in a string, where only 63481/65536 of the symbols are okay to use.

My approach was to use very large integers. You can split the input file into blocks of X bits, then represent that block as X+1 bits. The output is bigger because it can't have any forbidden bytes in there.

For the case of 255 of 256 symbols, packing 1415 bits of data into 1416 bits of space is the most efficient block size (before reaching a ridiculously large size) at 0.0706215% expansion. (For an infinite block size, you'd have an expansion of 1 - (log base 256 of 255), or 0.070582%)

Encoding: Turn 1415 bits of data into a very large number. Repeatedly divide and modulo by 255, giving a range of 0-254. Then add 1 to all bytes "CR" or larger. Now you have 1416 bits of encoded data, which cannot be "CR".

Decoding: Read a byte, decode back to 0-254 by subtracting 1 if it's greater than "CR". Multiply by 255 and add to your big number. At the end, you'll have a really big number that holds 1415 bits of data. This would be 177 big multiplies, and 177 big adds.

Decoding (the faster way):

Javascript uses floats, but you can treat them as 48-bit integers. Just watch out for the bitwise operators, they will truncate results down to 32 bits. That means use actual multiplication and division instead of bit shifting.

6 bytes at a time: 48 bits can hold 6 bytes. With normal floating point math, you can multiply each byte by 255^0, 255^1, 255^2, 255^3, 255^4, 255^5, and sum them together. Then you multiply-and-add these 6-byte chunks to a big int. Then the operations afterwards use big ints. First 6 bytes get multiplied by 255^0, next 6 bytes get multiplied by 255^6, then 255^12, 255^18, etc. Whole thing is summed together. This cuts it down to 30 bigint multiply-and-adds, (30 multiplies and 30 adds)

Homemade bigint: It's an array of doubles, but used as 48-bit integers. Compared to the actual BigInt, it removes all allocations, and you can access the bits inside directly, speeding up the part where you extract bits from the number. Only mathematical operation required for decoding is the "multiply and accumulate" operation. Using the homemade bigint sped things up dramatically.

---

So then, that's a lot of math just to avoid escaping (or fixing up) your bytes, but I think that would get close to the minimum possible expansion.


Neat idea, how does it compare to just using base64 when gzip is in the middle?


You won't be using gzip on compressed binary files (images, videos, audio, etc).


But they are talking about gzipping base64 encoded data though.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: