Singlefile: A web extension to save a complete web page into a single HTML file

Dwedit · 2024-12-21T21:47:32 1734817652

I was working on a trick to save files in a more compact way by taking advantage of UTF-16 encoded HTML files.

I have an example HTML file on my website here: https://www.dwedit.org/files/test.html

(Tip: type "codeText" into the JS console to see the second-stage loader. The first-stage loader is in clear text in the source.)

It's storing a 64479-byte JPEG in 64680 bytes of UTF-16 text, for around 0.2985% expansion. The decoder isn't size-optimized right now (there are even comments in the code), so it takes about 10KB more.

This is doing two things:

First stage is a basic kind of UTF-16 packing, where you use a simple Javascript string literal, and anything that needs escaping gets escaped. This keeps most 8-bit data the same size, but byte pairs that turn into a forbidden string character get escaped. Then you split your result string into bytes.

Forbidden characters (requiring escaping) in a Javascript string: 0x00, 0x0A, 0x0D, 0x22, 0x5C, 0x2028, 0x2029, 0xD800-0xDFFF

---

Second stage involves using very large integers to pack 335 bits into every 336 actual bits. This avoids all escaping completely in the JS string, as you can avoid the forbidden characters.
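A quick sanity check of those numbers, sketched in Python for brevity (the decoder on the linked page is JavaScript and is not reproduced here). The function names are made up for illustration; the forbidden set is the one listed above.

```python
# Code units that cannot appear raw in a JavaScript string literal,
# copied from the forbidden list in the comment above.
FORBIDDEN_SINGLES = {0x00, 0x0A, 0x0D, 0x22, 0x5C, 0x2028, 0x2029}
SURROGATES = range(0xD800, 0xE000)  # 0xD800-0xDFFF inclusive


def allowed_code_units() -> int:
    """Count the 16-bit values that can be stored without escaping."""
    return 0x10000 - len(FORBIDDEN_SINGLES) - len(SURROGATES)


def payload_bits(symbols: int, alphabet: int) -> int:
    """Largest b with 2**b <= alphabet**symbols, using exact integers."""
    return (alphabet ** symbols).bit_length() - 1


allowed = allowed_code_units()
print(allowed)                    # 63481
print(payload_bits(21, allowed))  # 335 (21 code units = 336 bits of text)
```

So 21 UTF-16 code units (336 bits) drawn only from the allowed values can carry 335 bits of payload.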

During decoding, my code is using a handmade 48-bit BigInt (actually stores 7 48-bit words for 336 total bits), allowing support on web browsers that predate native bigints, and it runs faster too.

Let me know if I should make an article about this.

(I also made a WASM version of the decoder to see if it would run any faster, but it didn't.)

gildas · 2024-12-21T21:55:19 1734818119

I took another approach in SingleFile by offering a way to save pages as self-extracting pages (i.e. ZIP/HTML polyglot files), see [1] for more info.

[1] https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG

Dwedit · 2024-12-21T22:05:07 1734818707

Really cool stuff there, how did you read the bytes back? Normally you get a CORS error if you try to use a network request to read back yourself.

gildas · 2024-12-21T22:08:40 1734818920

The saved page is encoded in windows-1252. It includes "consolidation data" to read the ZIP data as text from the DOM and recover the replacements of \r and \r\n occurrences (this is the only data loss and it represents approx. 1% of ZIP data), see the links below for more info.

https://gildas-lormeau.github.io/Polyglot-HTML-ZIP-PNG/en-EN...

Dwedit · 2024-12-22T21:18:57 1734902337

If "CR" is the only bad byte, that means that 255/256 of the symbols are okay to use. That beats UTF-16 embedded in a string, where only 63481/65536 of the symbols are okay to use.

My approach was to use very large integers. You can split the input file into blocks of X bits, then represent that block as X+1 bits. The output is bigger because it can't have any forbidden bytes in there.

For the case of 255 of 256 symbols, packing 1415 bits of data into 1416 bits of space is the most efficient block size (before reaching a ridiculously large size) at 0.0706215% expansion. (For an infinite block size, you'd have an expansion of 1 - (log base 256 of 255), or 0.070582%)
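These figures can be reproduced with exact integer arithmetic. This is only a sketch in Python; `payload_bits` and `largest_one_bit_block` are hypothetical helper names, and the search limit of 1000 bytes is an arbitrary assumption.

```python
import math


def payload_bits(n_bytes: int, alphabet: int = 255) -> int:
    """Largest b with 2**b <= alphabet**n_bytes, using exact integers."""
    return (alphabet ** n_bytes).bit_length() - 1


def largest_one_bit_block(alphabet: int = 255, limit: int = 1000) -> int:
    """Largest block size in bytes (up to limit) that loses exactly one bit."""
    best = 0
    for n in range(1, limit + 1):
        if 8 * n - payload_bits(n, alphabet) == 1:
            best = n
    return best


n = largest_one_bit_block()
print(n, payload_bits(n), 8 * n)       # 177 1415 1416
print(f"{100 / (8 * n):.7f}%")         # 0.0706215%
# Limit for an infinitely large block: 1 - log base 256 of 255.
print(f"{100 * (1 - math.log(255, 256)):.6f}%")
```

Blocks longer than 177 bytes lose a second bit, which is why 1415-in-1416 is the sweet spot among reasonably sized blocks.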

Encoding: Turn 1415 bits of data into a very large number. Repeatedly divide and modulo by 255, giving a range of 0-254. Then add 1 to all bytes "CR" or larger. Now you have 1416 bits of encoded data, which cannot be "CR".

Decoding: Read a byte, decode back to 0-254 by subtracting 1 if it's greater than "CR". Multiply your big number by 255 and add the decoded value to it. At the end, you'll have a really big number that holds 1415 bits of data. This would be 177 big multiplies, and 177 big adds.
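A minimal sketch of that encode/decode pair, using Python's built-in arbitrary-precision integers instead of a JavaScript bigint. The digit order (least significant digit emitted first) and the function names are assumptions made for the example, not something the description above pins down.

```python
CR = 0x0D
BLOCK_BITS = 1415   # payload bits per block
BLOCK_BYTES = 177   # 1416 bits of encoded output


def encode_block(value: int) -> bytes:
    """Encode an integer below 2**1415 as 177 bytes, none of them CR."""
    if not 0 <= value < (1 << BLOCK_BITS):
        raise ValueError("value does not fit in one block")
    out = bytearray()
    for _ in range(BLOCK_BYTES):
        value, digit = divmod(value, 255)                 # digit is 0..254
        out.append(digit + 1 if digit >= CR else digit)   # step over CR
    return bytes(out)


def decode_block(data: bytes) -> int:
    """Invert encode_block: one big multiply and one big add per byte."""
    value = 0
    for byte in reversed(data):          # most significant digit comes last
        digit = byte - 1 if byte > CR else byte
        value = value * 255 + digit
    return value


payload = (1 << BLOCK_BITS) - 1          # worst case: all 1415 bits set
encoded = encode_block(payload)
print(len(encoded), CR in encoded)       # 177 False
print(decode_block(encoded) == payload)  # True
```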

Decoding (the faster way):

Javascript uses floats, but you can treat them as 48-bit integers. Just watch out for the bitwise operators, they will truncate results down to 32 bits. That means use actual multiplication and division instead of bit shifting.

6 bytes at a time: 48 bits can hold 6 bytes. With normal floating point math, you can multiply each byte by 255^0, 255^1, 255^2, 255^3, 255^4, 255^5, and sum them together. Then you multiply-and-add these 6-byte chunks to a big int. Then the operations afterwards use big ints. First 6 bytes get multiplied by 255^0, next 6 bytes get multiplied by 255^6, then 255^12, 255^18, etc. Whole thing is summed together. This cuts it down to 30 bigint multiply-and-adds (30 multiplies and 30 adds).
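The chunking idea can be sketched as follows. This is Python, so every number here is already an arbitrary-precision integer; the point is only to show the grouping and the step count, not the 48-bit float handling of the real JavaScript decoder. All names are hypothetical, and the least-significant-first digit order is an assumption.

```python
CR = 0x0D
CHUNK = 6                   # six base-255 digits per chunk
CHUNK_BASE = 255 ** CHUNK   # stays below 2**48, so a chunk fits in a double


def digit(byte: int) -> int:
    """Map an encoded byte (never CR) back to a base-255 digit, 0..254."""
    return byte - 1 if byte > CR else byte


def decode_simple(data: bytes) -> int:
    """Reference decoder: one big multiply and one big add per byte."""
    value = 0
    for byte in reversed(data):
        value = value * 255 + digit(byte)
    return value


def chunk_value(chunk: bytes) -> int:
    """Fold up to six digits into one small number using 255**0 .. 255**5."""
    return sum(digit(b) * 255 ** i for i, b in enumerate(chunk))


def decode_chunked(data: bytes):
    """Return (value, big_steps): one big multiply-and-add per 6-byte chunk."""
    value, scale, big_steps = 0, 1, 0
    for start in range(0, len(data), CHUNK):
        value += chunk_value(data[start:start + CHUNK]) * scale
        scale *= CHUNK_BASE   # 255**0, 255**6, 255**12, ...
        big_steps += 1
    return value, big_steps


block = bytes(range(14, 14 + 177))    # 177 arbitrary bytes, none equal to CR
value, steps = decode_chunked(block)
print(steps)                          # 30
print(value == decode_simple(block))  # True
```

In this sketch the powers of 255^6 are built up on the fly; a real decoder could just as well precompute them once.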

Homemade bigint: It's an array of doubles, but used as 48-bit integers. Compared to the actual BigInt, it removes all allocations, and you can access the bits inside directly, speeding up the part where you extract bits from the number. Only mathematical operation required for decoding is the "multiply and accumulate" operation. Using the homemade bigint sped things up dramatically.

---

So then, that's a lot of math just to avoid escaping (or fixing up) your bytes, but I think that would get close to the minimum possible expansion.

jclarkcom · 2024-12-21T22:57:59 1734821879

Neat idea, how does it compare to just using base64 when gzip is in the middle?

Dwedit · 2024-12-22T01:12:21 1734829941

You won't be using gzip on compressed binary files (images, videos, audio, etc).

KolenCh · 2024-12-22T01:44:09 1734831849

But they are talking about gzipping base64 encoded data though.
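A rough way to measure that case is to base64-encode incompressible data and gzip the result. The sketch below uses random bytes as a stand-in for an already-compressed file such as a JPEG, which is an assumption; real files and real HTTP compression settings will give somewhat different numbers.

```python
import base64
import gzip
import os


def base64_then_gzip(raw: bytes):
    """Return (base64_size, gzipped_base64_size) in bytes."""
    b64 = base64.b64encode(raw)
    return len(b64), len(gzip.compress(b64, compresslevel=9))


raw = os.urandom(64 * 1024)   # stand-in for an already-compressed file
b64_size, gz_size = base64_then_gzip(raw)
print(f"base64:        +{100 * (b64_size / len(raw) - 1):.2f}%")
print(f"base64 + gzip: +{100 * (gz_size / len(raw) - 1):.2f}%")
```

Note that this only helps in transit or in a compressed container: the file sitting on disk is still the full base64 text unless something compresses it.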

mdaniel · 2024-12-21T21:24:33 1734816273

seems there was some robust discussion on the 2022 submission https://news.ycombinator.com/item?id=30527999 but I didn't dig into it to discover "but, why?" as compared to the "save" built into the browsers

having said that, this here is an instant "nope" for me https://github.com/gildas-lormeau/SingleFile/blob/master/faq...

> By default, SingleFile removes scripts because they can alter the rendering and there is no guarantee they will work offline.

so, not complete; got it

freehorse · 2024-12-21T21:28:58 1734816538

The very next sentence says

> However, you can save them by unchecking the option "Network > blocked resources > scripts" [..]

mdaniel · 2024-12-21T21:31:49 1734816709

Yeah, understood, there seems to be a bunch of "yes, well, monkey with this thing" options, but if the default behavior is not to preserve all the resources because of some pearl-clutching paranoia or whatever, then it's not a complete web page now is it?

wakawaka28 · 2024-12-21T22:05:40 1734818740

It depends what you want... Lots of JS will not work offline and I don't think a plugin can know what will work or not. If the page is dynamically populated from an API, saving the page to rely on JS later could be basically wrong.

varelaseb · 2024-12-21T21:35:45 1734816945

This is the worst gotcha I've ever seen

lynndotpy · 2024-12-22T00:48:12 1734828492

It's not paranoia, it's a feature that I consider to be useful and desirable. When I want to download the JavaScript, I use the built-in "save" feature. When I don't, I use SingleFile.

gildas · 2024-12-21T21:37:04 1734817024

See https://github.com/gildas-lormeau/SingleFile/blob/master/faq...

maxloh · 2024-12-21T21:36:09 1734816969

An alternative approach would be saving all outbound requests as a HAR file using your browser's DevTools. This captures fetched API responses as well. HAR is a JSON-based format, making it straightforward to inspect.
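As an illustration of how easy the format is to inspect, here is a small Python sketch that lists the requests recorded in a HAR document. The `summarize_har` helper and the sample data are made up for the example; the field names (`log`, `entries`, `request`, `response`, `content`) follow the HAR 1.2 layout.

```python
import json


def summarize_har(har: dict) -> list:
    """Return (method, url, status, mime_type) for every recorded request."""
    rows = []
    for entry in har["log"]["entries"]:
        request, response = entry["request"], entry["response"]
        mime = response.get("content", {}).get("mimeType", "")
        rows.append((request["method"], request["url"], response["status"], mime))
    return rows


# A tiny hand-written HAR document; real exports carry many more fields.
sample = json.loads("""
{"log": {"version": "1.2", "entries": [
  {"request": {"method": "GET", "url": "https://example.com/api/items"},
   "response": {"status": 200,
                "content": {"mimeType": "application/json", "text": "[]"}}}
]}}
""")

for row in summarize_har(sample):
    print(row)
```

With a file exported from DevTools, the same function would be fed `json.load(open(path, encoding="utf-8"))` instead of the inline sample.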

Unfortunately, there aren't any tools currently for opening a HAR archive directly as a webpage. Perhaps someone could develop one using Electron? (Hold your downvotes – we actually need a full browser environment to render HTML and execute JavaScript, don't we?)

smittywerben · 2024-12-22T08:28:20 1734856100

Recording all of your SSL-decrypted network logs and reconstructing them later is one step further. I'm sure there's a chain of custody issue not much further.

But what I like about SingleFile is that the SingleFile page tends to print to PDF better than the actual dynamic page (where annoying modals etc. that are not normally visible show up in the print). I have a Python service running that organizes my SingleFile HTML files into folders by domain and stores them on my Google Drive.

It's like a confused mix of wget's -x directories and a postscript printer so you can avoid chrome's print to skia pdf pipeline until a brighter day.

stuffoverflow · 2024-12-22T18:43:31 1734893011

I'm pretty sure I've used some tool in the past to convert a HAR into a WARC, which can then be browsed with https://replayweb.page/

dannyobrien · 2024-12-21T21:38:28 1734817108

If today's Hacker News trend is "ways to archive web pages for permanent storage", let me also point folks to https://webrecorder.net/archivewebpage/ and the WebRecorder suite of tools.

dang · 2024-12-21T21:44:54 1734817494

What other thread(s) from today are you thinking of? (maybe https://news.ycombinator.com/item?id=42441609?)

We try to avoid repetition on the frontpage—people tend to post related/follow-up links as submissions (bad) instead of in the comments (good, as you did here!)

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

dannyobrien · 2024-12-29T21:49:57 1735508997

Yep, that was the one -- I think it was more dominated by "how to save webpages" discussions when I was looking at it.

I had no idea that posting in the comments was preferred (though it does make sense!)

crgk · 2024-12-21T22:12:17 1734819137

FWIW, the discussion you linked came to my mind when I saw this one.

glenstein · 2024-12-21T21:39:20 1734817160

Thanks for this recommendation, this is exactly the kind of tool that I had hoped might exist out there somewhere. I don't suppose there's a Firefox version?

maxloh · 2024-12-21T21:22:29 1734816149

Chromium-based browsers (Chrome, Edge, Brave, etc.) already have great MHTML support. Just right click on a page > "Save as...", and select "Webpage, Single File" in the dialog.

freehorse · 2024-12-21T21:59:35 1734818375

Because having just an html file can be simpler, and html files are more ubiquitously rendered. Personally, having a single html file with base64 encoded images embedded into it serves 99.9% of what I usually want to do when I want to have a webpage saved offline. Such a file can, e.g., be put in my Dropbox and rendered in the Dropbox web interface, which makes it easy to share with others.

I have actually made a script that looks for image tags in an html file on my computer, looks in the local directory location for them and base64 embeds them in the html file itself, so this seems to simplify the process.
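A script of that kind might look roughly like the sketch below. This is not the commenter's script, just a minimal Python illustration; the regex-based `<img>` matching is a simplifying assumption (a real HTML parser would be more robust), and `inline_images` is a hypothetical name.

```python
import base64
import mimetypes
import re
import tempfile
from pathlib import Path

# Matches the src attribute of an <img> tag, with either quote style.
IMG_SRC = re.compile(r"""(<img\b[^>]*?\bsrc\s*=\s*)(["'])(.*?)\2""",
                     re.IGNORECASE | re.DOTALL)


def inline_images(html: str, base_dir: Path) -> str:
    """Replace local <img src="..."> references with base64 data: URIs."""
    def replace(match):
        prefix, quote, src = match.groups()
        # Leave remote and already-inlined images untouched.
        if src.startswith(("data:", "http:", "https:", "//")):
            return match.group(0)
        path = base_dir / src
        if not path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{prefix}{quote}data:{mime};base64,{payload}{quote}"
    return IMG_SRC.sub(replace, html)


# Small demo with a throwaway directory and a fake image file.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "dot.png").write_bytes(b"\x89PNG fake bytes")
    result = inline_images('<p><img alt="x" src="dot.png"></p>', folder)
print(result)
```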

hahn-kev · 2024-12-22T04:25:17 1734841517

If that's all you want then couldn't you just print to PDF?

freehorse · 2024-12-22T10:57:09 1734865029

Because I do not want a page structure in the document. Images are arranged better in an html file because a pdf would have to arrange things into pages and thus create undesirable gaps or put images in weird ways. Plus html is just a text file I can further edit if I want.

eviks · 2024-12-22T05:22:23 1734844943

It's not responsive anymore

thekevan · 2024-12-21T21:29:57 1734816597

I was just going to ask why this was better than just saving an MHTML file in the browser.

maxloh · 2024-12-21T21:42:14 1734817334

It seems that Firefox doesn't support saving a page as MHTML yet.

Maybe that's the reason.

https://connect.mozilla.org/t5/ideas/add-support-of-mhtml-fo...

xeonmc · 2024-12-22T06:16:35 1734848195

Also, it doesn’t quite save all of the assets, like fonts if they are referenced in CSS files

gildas · 2024-12-21T21:39:14 1734817154

Because the MHTML format produced by Chromium can almost be considered as proprietary.

maxloh · 2024-12-21T21:44:18 1734817458

That's not correct. MHTML has been an internet standard since 1999, with support dating back to IE 5.0.

https://datatracker.ietf.org/doc/html/rfc2557

gildas · 2024-12-21T21:52:02 1734817922

I'm actually referring to the "set of modifications" described here [1] and the fact that only Chromium can open (these) MHTML files today.

[1] https://docs.google.com/document/d/1FvmYUC0S0BkdkR7wZsg0hLdK...

maxloh · 2024-12-21T22:08:35 1734818915

Good to know that!

Do you mean that these modifications cause the generated MHTML files to become non-standards-compliant?

I wouldn't consider them proprietary though. At least the primary software used to open these files, Chromium, is open-source software.

gildas · 2024-12-21T22:21:15 1734819675

I agree that it's debatable, which is why I used the word “almost” ;). The fact remains that this info isn't in RFCs and is hard to find on the web.

n8henrie · 2024-12-21T20:52:28 1734814348

If anyone wants to help me update the nix package for the CLI, I'd appreciate it. The whole Deno thing is foreign territory for me.

https://github.com/NixOS/nixpkgs/blob/81629effd3f7e0cea0c1cf...

mdaniel · 2024-12-21T21:33:21 1734816801

You linked to an invocation of grep without any further context. What, specifically, do you need help with?

n8henrie · 2024-12-22T03:45:52 1734839152

> You linked to an invocation of grep without any further context.

I linked to an invocation of grep in the package in question. Sorry, on mobile so I just copied the first link to the package from my search results.

> What, specifically, do you need help with?

As I mentioned, I don't know how deno works. To elaborate, now that single-file-cli is using deno, I don't know how to package something that uses deno for nix.

dang · 2024-12-21T21:41:03 1734817263

Related. Others?

How SingleFile Transformed My Obsidian Workflow - https://news.ycombinator.com/item?id=39147615 - Jan 2024 (1 comment)

New Feature: Self-Extracting Zip Files Added to SingleFile - https://news.ycombinator.com/item?id=37791838 - Oct 2023 (1 comment)

Show HN: SingleFile is finally available on Safari (macOS/iOS) - https://news.ycombinator.com/item?id=33643192 - Nov 2022 (50 comments)

SingleFile version compatible with Manifest V3 - https://news.ycombinator.com/item?id=33063619 - Oct 2022 (274 comments)

SingleFile: Save a complete web page into a single HTML file - https://news.ycombinator.com/item?id=30527999 - March 2022 (240 comments)

Show HN: SingleFile Lite, new version of SingleFile compatible with Manifest V3 - https://news.ycombinator.com/item?id=29331038 - Nov 2021 (2 comments)

Show HN: Save web pages as self-extracting HTML/ZIP files from the CLI - https://news.ycombinator.com/item?id=25218947 - Nov 2020 (3 comments)

Browser extension and CLI tool to save a complete webpage as single HTML file - https://news.ycombinator.com/item?id=22449602 - Feb 2020 (1 comment)

Store the proof of a webpage saved with SingleFile in Bitcoin - https://news.ycombinator.com/item?id=21970779 - Jan 2020 (77 comments)

SingleFileZ, a web extension for saving pages as HTML/ZIP hybrid files - https://news.ycombinator.com/item?id=21426056 - Nov 2019 (36 comments)

Show HN: SingleFileZ – Save a web page in a HTML file which is also a zip file - https://news.ycombinator.com/item?id=19933196 - May 2019 (18 comments)

SingleFile can now save a web page from the command line - https://news.ycombinator.com/item?id=18974357 - Jan 2019 (2 comments)

SingleFile 1.0 is out - https://news.ycombinator.com/item?id=17457943 - July 2018 (1 comment)

jftuga · 2024-12-21T23:14:06 1734822846

I use this extension to locally save my Claude chats. It does a great job preserving code blocks.

betaby · 2024-12-21T23:54:18 1734825258

For command line I can recommend `monolith` https://github.com/Y2Z/monolith

nunez · 2024-12-23T03:39:03 1734925143

This extension is awesome for writing tests while building Web scraping apps.

create-username · 2024-12-21T22:00:09 1734818409

Still no way to make a backup of xml books or BlinkLearning websites

xrd · 2024-12-21T23:36:23 1734824183

Just saying, my static blog tool does this:

https://www.npmjs.com/package/svekyll-cli

https://extrastatic.dev/svekyll/svekyll-cli

There is something really exciting about creating a communication medium that is in a single file. It feels like you can then easily publish to any web server, but also to things like IPFS.

pentagrama · 2024-12-21T22:30:02 1734820202

Tried it, and it worked perfectly. Neat!

But should I uninstall it? I noticed that this extension (at least the Firefox version) requires "Access your data for all websites" and there's no option to grant access only when needed, by clicking on the extension button. I’m not comfortable with extensions having access to all my browsing data unless it’s absolutely necessary (e.g., uBlock Origin).

gildas · 2024-12-21T22:34:33 1734820473

SingleFile respects your privacy, you can find the privacy policy here [1].

[1] https://github.com/gildas-lormeau/SingleFile/blob/master/pri...