
1/ Why not wget?

For this project I wanted a consistent file format for my entire collection.

I have a bunch of stuff I want to save which is behind paywalls/logins/clickthroughs that are tricky for wget to reach. I know I can hand wget a cookies file, but that’s mildly fiddly. I save those pages as Safari webarchive files, and then they can drop in alongside the files I’ve collected programmatically. Then I can deal with all my saved pages as a homogeneous set, rather than having them split into two formats.

Plus I couldn't find anybody who'd done this, and it was fun :D

This is only for personal stuff where I know I'll be using Safari/macOS for the foreseeable future. I don't envisage using this for anything professional, or a shared archive -- you're right that a less proprietary format would be better in those contexts. I think I'm in a bit of a niche here.

(I'm honestly surprised this is on the front page; I didn't think anybody else would be that interested.)

2/ Proprietary format: it is, but before I started I did some experiments to see what's actually inside. It's a binary plist and I can recover all the underlying HTML/CSS/JS files with Python, so I'm not totally hosed if Safari goes away.

Notes on that here: https://alexwlchan.net/til/2024/whats-inside-safari-webarchi...
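
To make that concrete, here's a minimal sketch of the kind of recovery I mean, using only the Python standard library. The key names (WebMainResource, WebSubresources, WebResourceData, WebResourceURL, WebResourceMIMEType) are the ones I found by poking at my own files, so treat them as an assumption rather than a documented API:

    # Sketch: a .webarchive is a binary plist; pull out the main HTML
    # document and list any subresources. Key names are assumptions based
    # on inspecting my own archives.
    import plistlib

    with open("example.webarchive", "rb") as f:
        archive = plistlib.load(f)

    main = archive["WebMainResource"]
    print("Main resource:", main["WebResourceURL"])

    # WebResourceData holds the raw bytes of the original response
    with open("index.html", "wb") as out:
        out.write(main["WebResourceData"])

    for res in archive.get("WebSubresources", []):
        print("Subresource:", res["WebResourceURL"], res.get("WebResourceMIMEType"))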


> I didn't think anybody else would be that interested.

'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem, especially programmatically, so the niche is probably a little roomier than you might initially suspect.


> 'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem

You may be interested in SingleFile[1]

[1] https://github.com/gildas-lormeau/SingleFile

I use it all the time to archive webpages, and I imagine it wouldn't be hard to throw together a script to use FireFox's headless mode in combination with SingleFile to selfhost a clone of the wayback machine.


This is what I was going to say as well. Somebody on HN told me about SingleFile and I use it all the time now! Really amazing extension.


Thanks, I've seen it; last time I tried it, it missed background images. But my point is that this is something browsers should support better. They kind of sort of do now, but even so it's a hassle.


I tested this just now on the blog post that this HN page points to and SingleFile handled the background image fine.


> FireFox's

It's just "Firefox".


I've enjoyed using this

https://github.com/webrecorder

It has a standardized format and acts like a recorder for what you see.


Thanks to all the JS/SPA developers who insist on putting JS all over the place. Wouldn't it be better to have everything in one .html file, with <script> and <style> just inlined? Then it's also just one file over the internet. There must be a bundler that does that, no?

It seems JS developers just want their code to be as obfuscated and unarchivable as possible unless it's served via their web server.


> using <script> <style> just inline

These SPA bundles are on the order of megabytes, not kilobytes. You want your users, for their own sake and yours, to be able to cache as much as possible instead of delivering a unique megablob payload for every page they hit.


Good point on the cache. However, things like putting a background image in CSS so the user can't right-click to download it are just stupid. Why is CSS suddenly in control of image display? It just makes archiving pages harder.


> 'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem

Is it really? I remember hacking around with JavaScript's XMLSerializer (I think) about 5 years ago, and it solved the problem for ~90% of the websites I tried to archive. It'd save the DOM as-is when executed.

Internet Archive/ArchiveTeam also worked on that particular problem for a very long time, and are mostly successful as far as I can tell.
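
For anyone curious, a minimal sketch of the XMLSerializer idea mentioned above. The original was presumably a snippet run inside the page; the Selenium/Firefox wiring here is my own assumption, added just to drive it from a script:

    # Sketch: ask the browser for the current DOM, after JavaScript has run,
    # and save the serialized result to disk.
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("https://example.com/")

    html = driver.execute_script(
        "return new XMLSerializer().serializeToString(document);"
    )

    with open("snapshot.html", "w", encoding="utf-8") as f:
        f.write(html)

    driver.quit()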


90% feels like an overestimate to me, but even that is quite poor; you wouldn't accept that rate for saving most other things. Another problem is highlighted in the piece: it's a hassle to ensure external tools handle session state and credentials. Dynamic content is poorly handled, and the default behaviours are miserable (a browser will run random JavaScript from the network, but not JavaScript you've saved, etc).

There's a lot of interest in 'digital preservation', and perhaps one sign that it's still very much early days for the field is that it's tricky to 'just save' the result of one of the most basic current computer interactions: looking at a web page.


But if you serialize the DOM as-is, you literally get what you see on the page when you archive it. Nothing about it is dynamic, and there are no sessions or credentials to handle. Granted, it's a static copy of a specific single page.

If you need more than that, then WARC is probably the best. For my measly needs of just preserving exactly what I see, serializing the DOM and saving the result seems to do just fine.


Yes, you save something that's mildly better than print-page-to-PDF. But it still misses things, and the interactive stuff is very much part of 'exactly what I see'. Take a random article with an interactive graph, for instance, like this recent HN hit: https://ciechanow.ski/airfoil/

It's not that there aren't workarounds, it's that they are clunky and 'you can't actually save the most common computery entity you deal with' is just a strange state of affairs we've somehow Stockholmed ourselves to.


> Internet Archive/ArchiveTeam also worked on that particular problem for a very long time, and are mostly successful as far as I can tell.

One category the archivers do poorly with is news articles where a pop-up renders on page load and then requires client-side JS execution to dismiss.

Sometimes it is easily circumvented by manual DOM manipulation, but that's hardly a bulletproof solution. And it feels automatable.


Print to PDF seems to be the only way to ensure you record what you saw.


Argh! I knew I was going to make a numerical mistake somewhere. Correction will be up shortly. Thanks for spotting it! :D

And thanks for the text example! This looks like what I was trying, but clearly I had a mistake somewhere.


Spotted another math mistake:

> The default unit size is 1/72 inch, so the page is 300 × 72 = 4.17 inches.
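
(For what it's worth, the intended calculation is presumably 300 × (1/72) = 300 / 72 ≈ 4.17 inches, i.e. a division by 72 rather than a multiplication; the 4.17 figure itself is fine.)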


> I think that all "photos" or "videos" are just a view of the underlying "photo or video object". If you crop a video, the full-size video will remain. Only if you export the video, it will be cropped and the smaller file size will manifest.

Yup, the Photos app keeps the unmodified original file, and then any edits/crops are stored separately. You can always revert to the original file and redo your edits. So they might be storing multiple copies of the same image, with and without edits.

Which API were you looking at for "file size"?

I was able to get the size data from Photos.app with the PhotoKit API [1]. I've only tested it with my library of ~26k items, but it was useful for getting an indicator of the biggest items. (Although I didn't think to check whether exporting a 1GB video caused my iCloud usage to drop by 1GB.)

[1]: https://alexwlchan.net/2023/finding-big-photos/


Ahh, I did consider PHAsset.fetchAssets but my understanding was that the method will download the file if not present locally - which wouldn't be acceptable for an app, I guess.

Do you know more? The introduction says "Retrieve asset metadata or request full asset content.", but I can't find clarification of when it actually accesses the full content.


Yeah, that was what I thought when I first worked with these APIs! But when you use PhotoKit, you have to explicitly opt in to downloading files from iCloud.

AFAICT, PHAsset is only metadata. When I'm downloading the full-sized images, I use PHImageManager.requestImage() and pass in the PHAsset I'm looking at [1][2]. I know there's something similar for video, but I've never used it.

You can control the behaviour by passing a PHImageRequestOptions instance [3]. This includes an isNetworkAccessAllowed bool, which controls whether Photos.app will download the file from iCloud if it's not present locally; it defaults to false.

[1]: https://developer.apple.com/documentation/photokit/loading_a...

[2]: https://developer.apple.com/documentation/photokit/phimagema...

[3]: https://developer.apple.com/documentation/photokit/phimagere...


It’s (much) more complicated than that. HEIC format allows multiple “frames” to be stored in the container, each one derived from the source (or each other). So there may literally be just one file but it has (algorithmically) the definition for generating (losslessly) the other files, too. So there isn’t/wouldn’t be even a separate “smaller copy” at all as it is generated on-the-fly.


https://alexwlchan.net/writing/

I passed 400 posts a month or so ago; been writing for about a decade. It's a mix of programming, arty stuff, digital preservation, personal thoughts – the first link describes the sort of writing I do, and examples of each.

Some favourites:

* https://alexwlchan.net/2022/screenshots/ – You should take more screenshots, a perennial darling of HN

* https://alexwlchan.net/2022/marquee-rocket/ – Launching a rocket in the worst possible way, aka abusing the <marquee> tag

* https://alexwlchan.net/2022/bure-valley/ – A day out at the Bure Valley Railway, trains!

* https://alexwlchan.net/2022/snapped-elastic/ – Finding a tricky bug in Elasticsearch 8.4.2, the sort of deep-dive debugging I don’t do often enough

(And a fairly basic post about prime factorisation with Python has been on the HN front page several times, for reasons I do not understand at all)


This is a follow-up to Writing JavaScript without a build system (https://jvns.ca/blog/2023/02/16/writing-javascript-without-a...)

Discussed here: https://news.ycombinator.com/item?id=34825676


> My secret trick: use the browser

> Whenever you run tests, you need to run code – and browsers are very good at running JavaScript!

Such a novel insight. Much wow.

Don't let the NPM/NodeJS camp find out—they're so invested in their tooling that if you told them that the browser turns out to already be as good as or better than their preferred approach for the two things that they live and die by—downloading JS from the Internet and running it—then they might actually do that: die (from maxed out Surprised Pikachu shock levels, presumably). It turns out the browser is a lot better at stability across versions and sandboxing (i.e. not granting an arbitrary tool total access to your local workstation...), too.

Related/previously:

> > avoid Node.js based applications altogether.

> I ran into this recently. Firefox has a "packager" for putting together add-ons. It uses "node.js". All it really does is apply "zip" to some files. I tried to install the "packager" on Ubuntu 18.04 LTS. It had several hundred dependencies, and wouldn't run because of some version dependency in node.js for something totally irrelevant to the task at hand. Mozilla support suggested upgrading the operating system on the development machine.

<https://news.ycombinator.com/item?id=24495646>

Moar: How to displace JS <https://www.colbyrussell.com/2019/03/06/how-to-displace-java...>

(Spoiler alert: it's by upgrading from JS to... JS: <https://kosmos.social/@colby/107383819674336646>).


Apparently so: https://en.wikipedia.org/wiki/List_of_works_rejected_by_the_...

I’m surprised by how long the list is; I thought it was much rarer and more exceptional. The only time I’d heard of them rejecting a film was when it showed unsimulated animal cruelty, but I guess there must be other reasons.


Huh, thanks for that. I had heard about requesting cuts, now that you mention it. (And this is also done to target a rating: if they want a 12A but as-is it would be a 15, for example.)

I wonder if the ones that were never released are just cases where the filmmakers didn't want to cut enough, and preferred to abandon the project instead.

> The only time I’d heard of them rejecting a film was when it showed unsimulated animal cruelty, but I guess there must be other reasons.

Porn, or rather it somehow going too far, seems to be a big one based on a lot of the titles there.


Some films are so grim that if they cut out all the unacceptable stuff there’d be no film left worth seeing. Porn gets an R18 rating - provided it’s not dangerous, violent, illegal or obscene - and can be shown at specially licensed sex cinemas.


For CI systems like Travis, people add it to the cached directories, and it's shared between runs. I know Travis, Circle and AppVeyor all have some way to cache data between runs – nominally for dependencies, but .hypothesis works too.

According to our docs (http://hypothesis.readthedocs.io/en/latest/database.html?hig...), you can check the examples DB into a VCS and it handles merges, deletes, etc. I don't know anybody who actually does this, and I've never looked at the code for handling the examples database, so I have no idea how (well) this works.

If tests do throw up a particularly interesting and unusual example, we recommend explicitly adding it to the tests with an `@example` decorator, which causes us to retest that value every time. Easier to find on a code read, and won't be lost if the database goes away.
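
As a minimal sketch of that pattern (the property and the pinned value here are my own toy example, not anything from a real test suite):

    # Pin a previously-interesting input with @example so it's re-tested on
    # every run, independent of whatever is in the example database.
    from hypothesis import example, given, strategies as st

    @given(st.text())
    @example("")  # a specific case we always want to check
    def test_utf8_round_trip(s):
        assert s.encode("utf-8").decode("utf-8") == s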

(Disclaimer: I'm a Hypothesis maintainer)


There's a paragraph in the Phase I Audit Report (published a year ago) which includes a checksum:

> The iSEC team reviewed the TrueCrypt 7.1a source code, which is publicly available as a zip archive (“truecrypt 7.1a source.zip”) at http://www.truecrypt.org/downloads2. The SHA1 hash of the reviewed zip archive is 4baa4660bf9369d6eeaeb63426768b74f77afdf2.

The Phase II report (today's release) claims to be auditing 7.1a, so I assume it's exactly the same version and ZIP file.

Last June, they published "a verified TrueCrypt v. 7.1 source and binary mirror", including file hashes, on GitHub: https://github.com/AuditProject/truecrypt-verified-mirror

I just cloned that repo and inspected the source ZIP; the SHA1 sum matches what they quote in the report.
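
If you want to repeat that check yourself, it only needs the standard library; the filename and expected hash below are the ones quoted in the report:

    # Verify the reviewed TrueCrypt source zip against the SHA1 quoted in
    # the Phase I audit report.
    import hashlib

    EXPECTED = "4baa4660bf9369d6eeaeb63426768b74f77afdf2"

    with open("truecrypt 7.1a source.zip", "rb") as f:
        actual = hashlib.sha1(f.read()).hexdigest()

    print(actual)
    print("match" if actual == EXPECTED else "MISMATCH")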


Stack of pull requests coming your way :-)

(I was having some bad dreams, and apparently proofreading healthcare documents was what I needed to shake it. Thanks!)


Even if you could raise the money, I doubt that he would do it, or that it would be comparable to his OS X articles.

If you listen to Hypercritical (his weekly 5by5 podcast), you'll have heard that he struggles just getting the OS X reviews out the door. Since Apple is trying to move to a yearly release cycle, that’s just going to get harder. When’s he going to get the time to write an Ubuntu review?

It’s also worth considering that, “He has been a Mac user since 1984” (from his Ars bio). Part of what makes his reviews so good is his deep-rooted knowledge of the Mac platform, and having watched OS X (and previous versions of Mac OS) “grow up”, so to speak. I don’t know how much experience he has with Ubuntu, but I bet it’s not as extensive as OS X.

And that’s putting aside all the arguments of whether it’s a good thing to do, or whether the Ubuntu community would want him to write such a thing.

