HTTrack Website Copier – Offline browser (httrack.com)
131 points by evo_9 on July 10, 2021 | 35 comments



Ahh, the good ole days. I used httrack nearly 20 years ago to make CD copies of the osha.gov site (e.g., [1]). Back then, for ADA and internet access compliance, government websites also had to be made available for offline use where possible.

I haven’t followed httrack since, but it seems like scrapy and similar are much better replacements.

[1] https://forum.httrack.com/readmsg/3556/index.html


Glad the project helped a few people a bit :) Unfortunately, I don't have much time to enhance the engine nowadays, and the code is dirty and broken beyond any repair. Yet I'm still puzzled to see how many people are still using the project today.

You'll probably find better approaches, and while I never tried scrapy, it seems to be using a javascript engine for hard cases, which was something I thought about (but this was way above my skills at that time).

The hard part remains, however, if you want a functional site: you need to rewrite links, or use an external proxy-like mechanism. Having a fully functional offline, file-based site is the really tricky part. Some cases will remain unsolvable, as the page's internal code logic can produce arbitrary external link resources based on randomness, time, etc.

The approach in httrack was both ugly and pragmatic: attempt to recognize link/file patterns within javascript, and fetch/replace what can be replaced with local links. Javascript producing html will typically be analyzed with really dumb - yet sometimes effective - js parsers. (parental advisory: don't look at the parser code, your eyes would melt)
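In its simplest flavor, the rewrite just means spotting absolute URLs and pointing them at the local copies. Here is a toy example with a made-up domain (nothing like the real parsers, just the general idea):

  # naive flavor of the rewrite: absolute URLs -> local relative paths
  sed -E 's|https?://www\.example\.com/|./|g' script.js > script.local.js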

And obviously this approach is not going to solve all cases, and will even break pages with tricky js.


Let me be clear: Thanks, Xavier!

httrack was extremely helpful and there really was no equal. The “modern” web requires a live JS engine, but as you point out, even the “old” web had server-side logic that couldn’t be captured.

In that light, I think httrack has stood up pretty well and nobody expects you to go rewrite it or clean it up. If someone today has a mostly static site they want to archive without writing custom code, I would still recommend httrack (it’s more controllable than wget or similar). I just assume that those sites are mostly gone :(.


I use this software on and off. It's especially useful when clients plan to redo their websites and I want an offline copy of the pages but don't have access to server backups or anything like that.

But, generally speaking, being able to preserve "the internet" by saving whole websites offline should be something we give more attention to.

Just read this recently: https://www.theatlantic.com/technology/archive/2021/06/the-i...


I still use it to mirror websites :) After all, I also witnessed its creation! Dirty code can still be super useful.


The good ole days when web pages were web pages!


And the information lived under a single domain...


I've recently been using Monolith[0] and I find its creation of a single html file much more convenient. It's also written in Rust, so I'm sure that will make the source code a bit more accessible for some.

[0] https://github.com/Y2Z/monolith
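For what it's worth, usage is about as simple as it gets, something like this (going from memory of the README, so double-check the flag; example.com is just a placeholder):

  monolith https://example.com -o example.html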


It doesn't look like it follows links though, so it's much more of an alternative to "Save As -> mhtml" than to HTTrack.


There is also a similar program called HyperFiler[0]* that bundles web pages into single HTML files, with a few more options such as a headless Chromium transport, built-in minifiers, page sanitizers, and grayscaling of the output pages. It's TypeScript-based and has a programmatic API to customize the bundling process as well.

[0] https://github.com/chowderman/hyperfiler

* disclaimer: I created HyperFiler


Cool! Thanks for sharing! I've been on the lookout for replacements for archiving recipes on cooking sites, and this tool works great.


There is also SingleFile as a browser extension: https://github.com/gildas-lormeau/SingleFile


Note that SingleFile can also crawl websites when run from the CLI (see the --crawl-* options)


I also recommend this wholeheartedly.


I used to use HTTrack pretty often to save entire sites. But I learned that wget can take care of my copying needs even more easily most of the time. Something like this usually does the trick:

  wget -E -r -k -p --span-hosts http://mycoolhomepage.com


Wow, I didn't think people still used this. I have a copy on my PC and wheel it out every so often, when I find a great small site that has quite obviously been abandoned by its owner and could vanish at any moment, and Wayback doesn't have a full copy. I could switch to something better, but I know how HTTrack works, and it works well enough for me.


You know what might be a great add-on?

A tool to crawl all of the links within a website and submit each one of them to Wayback...
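Something like this gets you most of the way there for a single page (rough sketch: web.archive.org/save/<url> is the public "Save Page Now" endpoint, example.com stands in for the target site, and a real tool would need to recurse and dedupe):

  curl -s https://example.com/ \
    | grep -Eo 'href="http[^"]+"' \
    | cut -d'"' -f2 \
    | while read -r url; do
        # politely ask the Wayback Machine to snapshot each link
        curl -s -o /dev/null "https://web.archive.org/save/$url"
        sleep 5
      done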


https://archivarix.com/ is an interesting solution to GP's issue; it works better than httrack and can also pull down multiple timestamps of an archived site.


The ArchiveTeam has tools that do that. In fact, they have a number of different tools, some of which I believe are closely related to httrack.

Disclaimer: I tried to get involved with ArchiveTeam to help get some websites that I care about properly archived and saved to the Wayback Machine, but they weren't allowing new members to sign up. They were happy to talk to me about what needed to be done, and one of their members set things up to archive everything that was available, but I wasn't able to help in that process.


HTTrack is one of my favourite pieces of software - it makes it super easy to create offline mirrors of websites and browse them later. It's sorta like wget on steroids for that use case.


Wget has the --mirror flag, which makes it much like httrack for minor scrapes. Httrack is faster because it can work in parallel.
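For reference, the usual combo looks something like this (all standard wget flags; example.com is a placeholder):

  wget --mirror --convert-links --adjust-extension --page-requisites --wait=1 https://example.com/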


I used httrack to transform the public version of my wordpress blogs into a static site. It often crashed, but as long as I had a copy of its local data(base), it was fine to just restart it.

I really like the tool. I doubt it is as helpful today, though, because of the rise of all the Javascript stuff...


I looked into it the other day, but AFAIR it easily breaks down if a site is using Cloudflare to protect itself from abuse, which I imagine could be quite a fair few sites these days.


I wonder if there’s a way to integrate it with my own browser so it can mirror a website over days/weeks by using my regular behaviour to avoid suspicion.


You can set the user agent, randomize link retrieval order and set a longer delay between requests to avoid detection.
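For the user agent and throttling part, something along these lines (flags from memory, so check httrack --help; example.com is a placeholder):

  httrack https://example.com/ -O ./mirror -F "Mozilla/5.0 (compatible)" -c1 -%c1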


I used to use HTTrack, but then I found Cyotek WebCopy[1]. It's basically the same, but with a more user-friendly GUI and some extra features useful for the modern web. For example, it can fetch resources by URLs contained in JS code or from special attributes like data-src. It's free, though only available for Windows.

[1] https://www.cyotek.com/cyotek-webcopy


The SingleFile[0] Firefox addon is handy too, if you just want to archive a webpage with all the images, styling, etc. intact, all enclosed in a single HTML file.

[0] https://addons.mozilla.org/en-US/firefox/addon/single-file/


Similar project (cross-platform command line): https://github.com/wabarc/wayback


HTTrack is a powerful package, but unless it can now handle JS-embedded links properly, it still can't take you all the way to perfect local site mirroring.


Used it in undergrad to mirror courses. Very good. Doesn't work with js links or SPAs, but a nice-to-have tool if you are into self-hosting.


How does this differ from Save As in a browser?


There's no comparison. It's way more powerful than that, but be careful what you point it at, because it will download gigabytes of everything and might annoy the website owner.


Seems to be more powerful, e.g. it can spider an entire website and save all linked pages and assets.


What makes this better than DarcyRipper?


For one, that software's website seems to be down - http://darcyripper.com/



