The IA Client – The Swiss Army Knife of Internet Archive (archive.org)
172 points by bryanrasmussen on June 6, 2019 | 21 comments



If you're looking for a convenient way to search web pages on the Internet Archive, you may also want to try my browser extension for viewing archived and cached versions of web pages. It supports 15 data sources, and page archiving is also planned.

https://github.com/dessant/view-page-archive


Great idea! And a good way to solve a repetitive task I run into often.


I was a little disappointed with the IA app. It works fairly well for the most part, but it seems to lag behind the site in features.

I wanted to download all of the Computer Chronicles, both for viewing offline and to have my own "set" of files. I even re-encoded them to HEVC (from MPEG-2) and put them up here: https://intelminer.com/torrents/TV%20SHOWS/Computer%20Chroni...
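For anyone wanting to do the same, the re-encode looks roughly like this with ffmpeg (the CRF and preset values here are placeholders to tune, not necessarily what I used):

    # MPEG-2 -> HEVC; treat this as a starting point only
    ffmpeg -i episode.mpeg -c:v libx265 -crf 23 -preset medium -c:a aac episode.mkv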

Getting them from the Archive, though, was an exercise in frustration. IA offers (and heavily recommends) using the torrent download option to ease bandwidth costs.

Unfortunately, for whatever reason, there's no way to pull down the .torrent files using this method.

In the end I had to simply pull the MPEG-2 videos down one by one over the course of several months (due to speed limiting on IA's end).


I just tried this in bash:

    for each in `ia search collection:computerchronicles --itemlist`; do ia download "$each" --glob='*.torrent'; done

I have myself a directory of torrents.
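If your version of the client supports it (I believe newer releases do, but treat that as an assumption), the search-and-download can also be collapsed into one step:

    ia download --search 'collection:computerchronicles' --glob='*.torrent'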


I'm glad it works now. At the time, it didn't seem to list .torrent as a valid option, though.


# If you are not a Python user or want to try something different (faster), this can be done with sh, sed, openssl, curl/wget/etc., plus a simple utility I wrote called "yy025" (https://news.ycombinator.com/item?id=17689152). yy025 is a more generalised "Swiss Army Knife" for making requests to any website. This solution uses a traditional method called "HTTP pipelining".

   export Connection=keep-alive;
   n=1;while true;do
   test $n -le 8||break
   echo https://archive.org/details/computerchronicles?\&sort=-downloads\&page=$n
   n=$((n+1));done \
   |yy025|openssl s_client -connect archive.org:443 -ign_eof \
   |sed '/item-ia\" [^ ]/!d;s,.*=\",,;s/\"//;s,.*,https://archive.org/download/&/&_archive.torrent,' \
   |yy025|openssl s_client -connect archive.org:443 -ign_eof|sed '/Location:/!d;s/Location: //' 
     
# Additional command-line options for openssl s_client are omitted for the sake of brevity. The above outputs the torrent URLs. Feed those to curl or wget or whatever similar program you choose, or maybe directly to a torrent client. Something like

   |while read url;do curl -O "$url";done


I thought HTTP pipelining was discouraged and that most servers disabled it by default. Is that not true?


You are probably thinking of pipelining in terms of the popular web browsers. Those programs want to do pipelining so they can load up resources (read: today, ads) from a variety of domains in order to present a web page with graphics and advertising.

That never really worked. Thus, we have HTTP/2, authored by an ad sales company. It is very important for an ad sales company that web pages contain not only what the user is requesting but also heaps of automatically followed pointers to third party resources hosted on other domains. That is, pages need to be able to contain advertising. HTTP/1.1 pipelining is of little benefit to the ad ecosystem.

However, sometimes the user is not trying to load up a graphical web page full of third party resources. Here, the HN commenter is just trying to get some HTML, extract some URLs and then download some files. The HTML is all obtained from the same domain. This is text retrieval, nothing more.

If all the resources the user wants are from the same domain, e.g., archive.org, then pipelining works great. I have been using HTTP/1.1 pipelining to do this for several decades and it has always worked flawlessly.
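To make that concrete without yy025: pipelining just means writing several requests back-to-back on one connection and reading the responses in order. A bare-bones sketch (the request paths are only examples; -quiet implies -ign_eof):

    # two pipelined requests on one TLS connection; responses come back in request order
    { printf 'GET /details/computerchronicles HTTP/1.1\r\nHost: archive.org\r\nConnection: keep-alive\r\n\r\n'
      printf 'GET /about/ HTTP/1.1\r\nHost: archive.org\r\nConnection: close\r\n\r\n'
    } | openssl s_client -connect archive.org:443 -quiet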

Typically httpd settings for any website would allow at least 100 pipelined requests per connection. As you might imagine, often the httpd settings are just unchanged defaults. Today the limits I see are often much higher, e.g., several hundred.

It is very rare in my experience to find a site that has pipelining disabled. More likely they are disabling Connection: keep-alive and forcing all requests to be Connection: close. I rarely see this.

The HTTP/1.1 specification suggests a max connection limit per browser of two. There is no suggested limit on the number of requests per connection. In terms of efficiency, the more the better. How many connections does a popular web browser make when loading an "average" web page today? It is a lot more than two! In any event, pipelining as I have shown here stays under the two-connection limit.


There's no speed limit if you log in. At least, this was the case when I also batch downloaded all the CC MPEG-2 files, and it finished overnight.


Is there a tutorial or introduction to the Internet Archive? There's a giant mass of fascinating stuff, but I've always had a hard time getting a handle on it.


I'm working on one, but there isn't one as such, no.


Now we just need to be able to access it like traversing the "real" internet, with a time-travel button.


It's a decent client, but be aware that you might want to increase the file descriptor limit: in my experience using it to upload a fairly large folder structure, the client at the moment doesn't properly close files.

A simple `ulimit -n <new limit>` in the same shell before running the upload should take care of it.
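For example (4096 is an arbitrary value; anything comfortably above the number of files being uploaded works):

    ulimit -n 4096   # raise the open-file limit for this shell session
    # ...then re-run the same ia upload command that was hitting the limit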


I think this should fix that:

https://github.com/jjjake/internetarchive/commit/1ac200cbbbe...

This change will be in the next release, v1.8.5.


Ah, that is good to see then!

(sorry it took me a bit to respond, I'm not on HN all that often.)


Meta: is "The Swiss Army Knife of $Something" the new "$Something for humans" ? I really hate these phrases as they technically can be put in front of almost anything and stay semi-accurate, but not give any additional information, besides being marketing-speak


It is not new :-)

> "Perl is the Swiss Army chainsaw of scripting languages" - https://www.perl.com/pub/2000/10/begperl1.html/ > > Doug Sheppard, 2000-10-16

(Earliest quote I could find in three minutes.)


I chose the phrasing because I deemed it accurate for the situation. This script has multiple functions and does a number of distinct things well depending on how it is invoked.


Be grateful it's not 'a pragmatic internet tool' or something.


My org has been using the client and python library for a couple of years to interact with IA. It's a fantastic tool -- Jake Johnson's a superhero in my book!


Last time I checked, you could only give the download subcommand one item at a time, so I wrote this shell script:

    #!/usr/bin/env zsh
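    # take item URLs or paths as arguments and download each one by its identifier (basename)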

    for i in "$@"
    do
         ia download $(basename $i)
    done



