HTTrack Website Copier – Offline browser (httrack.com)
131 points by evo_9 on July 10, 2021 | 35 comments



Ahh, the good ole days. I used httrack nearly 20 years ago to make CD copies of the osha.gov site (e.g., [1]). Back then, for ADA and internet access compliance, government websites also had to be made available for offline use where possible.

I haven’t followed httrack since, but it seems like scrapy and similar are much better replacements.

[1] https://forum.httrack.com/readmsg/3556/index.html


Glad the project helped a few people a bit :) Unfortunately, I don't have much time to enhance the engine nowadays, and the code is dirty and broken beyond any repair. Yet I'm still puzzled to see how many people are still using the project today.

You'll probably find better approaches, and while I never tried scrapy, it seems to be using a javascript engine for hard cases, which was something I thought about (but this was way above my skills at that time).

The hard part remains, however, if you want a functional site: you need to rewrite links, or use an external proxy-like mechanism. Having a fully functional offline, file-based site is the really tricky part. Some cases will remain unsolvable, as the page's internal code logic can produce arbitrary external link resources based on randomness, time, etc.

The approach in httrack was both ugly and pragmatic: attempt to recognize link/file patterns within javascript, and fetch/replace what can be replaced with local links. Javascript producing html will typically be analyzed with really dumb - yet sometimes effective - js parsers. (parental advisory: don't look at the parser code, your eyes would melt)
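In its simplest flavor, the rewrite just means spotting absolute URLs and pointing them at the local copies. Here is a toy example with a made-up domain (nothing like the real parsers, just the general idea):

  # naive flavor of the rewrite: absolute URLs -> local relative paths
  sed -E 's|https?://www\.example\.com/|./|g' script.js > script.local.js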

And obviously this approach is not going to solve all cases, and will even break pages with tricky js.


Let me be clear: Thanks, Xavier!

httrack was extremely helpful and there really was no equal. The “modern” web requires a live JS engine, but as you point out, even the “old” web had server-side logic that couldn’t be captured.

In that light, I think httrack has stood up pretty well and nobody expects you to go rewrite it or clean it up. If someone today has a mostly static site they want to archive without writing custom code, I would still recommend httrack (it’s more controllable than wget or similar). I just assume that those sites are mostly gone :(.


I use this software on and off. It's especially useful when clients plan to redo their websites and I want an offline copy of the pages but don't have access to server backups or anything like that.

But, generally speaking, being able to preserve "the internet" by saving whole websites offline should be something we give more attention to.

Just read this recently: https://www.theatlantic.com/technology/archive/2021/06/the-i...


I still use it to mirror websites :) After all, I also witnessed its creation! Dirty code can still be super useful.


The good ole days when web pages were web pages!


And the information lived under a single domain...


I've recently been using Monolith[0] and I find its creation of a single html file much more convenient. It's also written in Rust, so I'm sure that will make the source code a bit more accessible for some.

[0] https://github.com/Y2Z/monolith
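For what it's worth, usage is about as simple as it gets, something like this (going from memory of the README, so double-check the flag; example.com is just a placeholder):

  monolith https://example.com -o example.html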


It doesn't look like it follows links though, so it's much more of an alternative to "Save As -> mhtml" than to HTTrack.


There is also a similar program called HyperFiler[0]* that bundles web pages into single HTML files, with a few more options such as a headless Chromium transport, built-in minifiers, page sanitizers, and grayscaling of the output pages. It's TypeScript-based and has a programmatic API to customize the bundling process as well.

[0] https://github.com/chowderman/hyperfiler

* disclaimer: I created HyperFiler


Cool! Thanks for sharing! I've been on the lookout for replacements for archiving recipes on cooking sites, and this tool works great.


There is also SingleFile as a browser extension: https://github.com/gildas-lormeau/SingleFile


Note that SingleFile can also crawl websites when run from the CLI (see the --crawl-* options)


I also recommend this wholeheartedly.


I used to use HTTrack pretty often to save entire sites. But I learned that wget can take care of my copying needs even more easily most of the time. Something like this usually does the trick:

  wget -E -r -k -p --span-hosts http://mycoolhomepage.com


Wow, I didn't think people still used this. I have a copy on my PC and wheel it out every so often, when I find a great small site that has quite obviously been abandoned by its owner and could vanish at any moment, and Wayback doesn't have a full copy. I could switch to something better, but I know how HTTrack works, and it works well enough for me.


You know what might be a great add-on?

A tool to crawl all of the links within a website and submit each one of them to Wayback...
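Something like this gets you most of the way there for a single page (rough sketch: web.archive.org/save/<url> is the public "Save Page Now" endpoint, example.com stands in for the target site, and a real tool would need to recurse and dedupe):

  curl -s https://example.com/ \
    | grep -Eo 'href="http[^"]+"' \
    | cut -d'"' -f2 \
    | while read -r url; do
        # politely ask the Wayback Machine to snapshot each link
        curl -s -o /dev/null "https://web.archive.org/save/$url"
        sleep 5
      done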


https://archivarix.com/ is an interesting solution to GP's issue; it works better than httrack and can also pull down multiple timestamps of an archived site.


The ArchiveTeam has tools that do that. In fact, they have a number of different tools, some of which I believe are closely related to httrack.

Disclaimer: I tried to get involved with ArchiveTeam to help get some websites that I care about properly archived and saved to the Wayback Machine, but they weren't allowing new members to sign up. They were happy to talk to me about what needed to be done, and one of their members set things up to archive everything that was available, but I wasn't able to help in that process.


HTTrack is one of my favourite pieces of software - it makes it super easy to create offline mirrors of websites and browse them later. It's sorta like wget on steroids for that use case.


Wget has the --mirror flag, which makes it much like httrack for minor scrapes. Httrack is faster because it can work in parallel.
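For reference, the usual combo looks something like this (all standard wget flags; example.com is a placeholder):

  wget --mirror --convert-links --adjust-extension --page-requisites --wait=1 https://example.com/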


I used httrack to transform the public version of my wordpress blogs into a static site. It often crashed, but as long as I had a copy of its local data(base), it was fine to just restart it.

I really like the tool. I doubt it is as helpful today, though, because of the rise of all the Javascript stuff...


I looked into it the other day, but AFAIR it easily breaks down if a site is using Cloudflare to protect itself from abuse, which I imagine could be quite a fair few sites these days.


I wonder if there’s a way to integrate it with my own browser so it can mirror a website over days/weeks by using my regular behaviour to avoid suspicion.


You can set the user agent, randomize link retrieval order and set a longer delay between requests to avoid detection.
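For the user agent and throttling part, something along these lines (flags from memory, so check httrack --help; example.com is a placeholder):

  httrack https://example.com/ -O ./mirror -F "Mozilla/5.0 (compatible)" -c1 -%c1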


I used to use HTTrack, but then I found Cyotek WebCopy[1]. It's basically the same, but with a more user-friendly GUI and some extra features useful for the modern web. For example, it can fetch resources by URLs contained in JS code or from special attributes like data-src. It's free, though only available for Windows.

[1] https://www.cyotek.com/cyotek-webcopy


The SingleFile[0] Firefox addon is handy too, if you just want to archive a webpage with all the images, styling, etc. intact, all enclosed in a single HTML file.

[0] https://addons.mozilla.org/en-US/firefox/addon/single-file/


Similar project (cross-platform command line): https://github.com/wabarc/wayback


HTTrack is a powerful package, but unless it can now handle JS-embedded links properly, it still can't take you all the way to perfect local site mirroring.


Used it in undergrad to mirror courses. Very good. Doesn't work with js links or SPAs, but a nice-to-have tool if you are into self-hosting.


How does this differ from Save As in a browser?


There's no comparison. It's way more powerful than that, but be careful what you point it at, because it will download gigabytes of everything and might annoy the website owner.


Seems to be more powerful, e.g. it can spider an entire website and save all linked pages and assets.


What makes this better than DarcyRipper?


For one, that software's website seems to be down - http://darcyripper.com/



