Show HN: Instantly create a GitHub repository to take screenshots of a web page (simonwillison.net)
214 points by simonw on March 14, 2022 | 36 comments
I built a GitHub repository template which automates the process of configuring a new repository to take web page screenshots using GitHub Actions.

You can try this out at https://github.com/simonw/shot-scraper-template

Use the https://github.com/simonw/shot-scraper-template/generate interface to create a new repository using that template, and paste the URL that you want to take screenshots of into the "description" field.

The new repository will then configure itself using GitHub Actions, take the screenshot and save it back to the repo!




Related to this: you can also use my shot-scraper tool to scrape web pages from the command line using JavaScript:

    % pip install shot-scraper
    % shot-scraper install
    % shot-scraper javascript https://datasette.io/ "({
        title: document.title,
        tagline: document.querySelector('.tagline').innerText
    })"
    {
        "title": "Datasette: An open source multi-tool for exploring and publishing data",
        "tagline": "An open source multi-tool for exploring and publishing data"
    }
More here, including an example of using it to scrape data from Hacker News that's not available in the API: https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...

HN post about this from yesterday (which failed to get any traction): https://news.ycombinator.com/item?id=30667588


That's fantastic!

I'll definitely investigate using this. I implemented my own MacGyver version of this basic functionality on top of Selenium to grab screenshots for search.marginalia.nu/explore/random -- but that script is super sketchy and held together with bubble gum and duct tape. Yours looks a lot better.

By the way, is there a way to extract favicons as well?


No, I've not thought about favicons. That's a really interesting challenge.

I wonder if there's a way to detect favicons just using JavaScript that runs against a page? Not sure if it's easy to detect /favicon.ico vs. the various meta tags.


Would be kind of fun to write JavaScript that runs against the page that first looks for the meta tags, then tries to fetch("/favicon.ico") and returns either the URL or a base64 encoded copy of the image (since the "shot-scraper javascript" command requires you to return JSON).
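A rough sketch of the first half of that idea, as an illustration under stated assumptions rather than a tested recipe. Favicons are usually declared with `<link rel="icon">` elements, so the helper below picks the first icon link and otherwise falls back to /favicon.ico. The name `pickFaviconUrl` and its plain-object input are hypothetical, chosen so the logic can be run outside a browser. Fetching the image and base64-encoding it is left out.

```javascript
// Hypothetical helper: choose a favicon URL from a page's <link> elements.
// `links` is an array of {rel, href} objects; `pageUrl` is the page's address.
function pickFaviconUrl(links, pageUrl) {
  for (const link of links) {
    // rel is a space-separated token list, e.g. "shortcut icon"
    const rels = (link.rel || "").toLowerCase().split(/\s+/);
    const isIcon = rels.includes("icon") || rels.includes("apple-touch-icon");
    if (isIcon && link.href) {
      // Resolve relative hrefs against the page URL
      return new URL(link.href, pageUrl).href;
    }
  }
  // No explicit icon declared: fall back to the conventional location
  return new URL("/favicon.ico", pageUrl).href;
}

// In a page, the inputs could be gathered like this:
//   const links = Array.from(document.querySelectorAll("link[rel]"),
//     (l) => ({ rel: l.getAttribute("rel"), href: l.getAttribute("href") }));
//   pickFaviconUrl(links, location.href);
```

This only returns a candidate URL. It does not check that the file actually exists, and it ignores the case where a site declares several icon sizes.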


There are a lot of weird edge cases for favicons: most browsers fall back to just looking for /favicon.ico if you don't explicitly specify one in the meta tags, and if you do, there are sometimes different versions.

Yeah, maybe it's a pipe dream :-/ But even without them, it looks really useful!


I've found some fairly reliable repos in the past, or https://t1.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&... still works (but for how long)


Here's a GitHub code search which shows repos that people have created using my template: https://github.com/search?q=shot-scraper-template+-user%3Asi...

My favourite so far is this one, which is taking screenshots of a variety of French news websites: https://github.com/ggtr1138/UneJournaux


This is an awesome tool, as are all of yours!

Just to mention: The example of French news websites shows that popovers and consent forms are a serious problem for this kind of screenshotting.

Do you have any thoughts on how to best deal with them?


Yes - you can run extra JavaScript to hide those before you take the screenshot. Example here: https://github.com/palewire/news-homepages/issues/4#issuecom...
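As a minimal sketch of that approach (not the code from the linked example), JavaScript run before the screenshot could hide matching elements like this. The function name and the selectors are made up for illustration, and real consent banners vary from site to site.

```javascript
// Hypothetical helper: hide every element matching any of the selectors.
// `doc` is the page's `document`; pass whatever selectors match the overlays.
function hideElements(doc, selectors) {
  let hidden = 0;
  for (const selector of selectors) {
    for (const el of doc.querySelectorAll(selector)) {
      el.style.display = "none";
      hidden += 1;
    }
  }
  // Count of hide operations (an element matching two selectors counts twice)
  return hidden;
}

// In a browser, with placeholder selectors:
//   hideElements(document, ["#cookie-banner", ".consent-modal"]);
```

Hiding with `display: none` will not help if the site also locks scrolling or blurs the page behind the overlay, so some sites need extra handling.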


This is cool. Tried it. Curious question: wouldn't it be better to default to width rather than height as the fixed dimension in the example (your code already does the job)? Websites' heights are variable, while width is usually fixed.


If you delete the height: 800 line from the YAML you'll get the full length of the page.

I decided to set a height of 800 by default to avoid people accidentally saving 10MB+ images just because they were trying out the tool.


Great idea to use GitHub for this. I've been working on https://app.trackwebpage.com/, which also tracks changes on web pages and sends email notifications when changes happen (if you want them). It's totally free now: you can just sign up and track as many web pages as you want.


Love the project. Do you ever worry about Microsoft tightening the purse strings on these types of off-label uses for GitHub?


I do. I'm happy to pay for Actions minutes on private repos, but I do worry that they'll change their policy with regard to free minutes for public repos at some point.

I felt a lot better about my git scraping work (https://simonwillison.net/2020/Oct/9/git-scraping/) after GitHub released https://next.github.com/projects/flat-data/ which was inspired by that work, as it feels like it's now acknowledged as an OK use of their platform.

I'm hoping people don't abuse shot-scraper too much in terms of saving huge binary files to free repositories - that's why I haven't yet included tips on running scheduled scrapers in the shot-scraper-template documentation.


I saw you were poking with some image diff tools recently as well, and I'm sure you've thought about this already but I'd just like to explicitly state it: it'd be great if you could scrape a screenshot periodically and only commit it if the new one is significantly different.
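A minimal sketch of what "only commit if significantly different" could look like, assuming both screenshots have already been decoded into same-sized raw RGBA pixel arrays by some other library. The function names, the 1% threshold and the per-channel tolerance are all hypothetical choices for illustration, not anything shot-scraper or image-diff provides.

```javascript
// Fraction of pixels that differ between two same-sized RGBA buffers.
// `tolerance` is the per-channel difference (0-255) that is ignored.
function changedFraction(a, b, tolerance = 0) {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error("buffers must be RGBA data of equal length");
  }
  const pixels = a.length / 4;
  if (pixels === 0) return 0;
  let changed = 0;
  for (let i = 0; i < a.length; i += 4) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[i + c] - b[i + c]) > tolerance) {
        changed += 1;
        break; // count each pixel at most once
      }
    }
  }
  return changed / pixels;
}

// Hypothetical policy: commit only when more than `threshold` of pixels changed.
function shouldCommit(a, b, threshold = 0.01, tolerance = 8) {
  return changedFraction(a, b, tolerance) > threshold;
}
```

Full-page screenshots whose height changes between runs would fail the equal-length check, so a real workflow would need to decide how to treat a size change (probably as "different").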


Yeah that's the idea behind https://github.com/simonw/image-diff but it's not quite fit for purpose yet.


GitHub does have some very generous free tier stuff. But as these sorts of things become more common, it's destined to be lowered or more heavily limited. Nothing against the project, I think it's an awesome idea with great execution. I'll probably set it up sometime soon.


This is really cool and I love the idea of using SVGs to even add annotations, as you mentioned in your tweets. I might be "borrowing" that idea soon for some of our own work, and will try to credit you!


This, but with automated visual regression testing (e.g. testing that screenshots are the same), would be amazing. Anyone got any ideas how this might work with shot-scraper?


You could absolutely get this working with GitHub Actions with a bit of creativity.

I've been playing around with my own image-diff tool for this kind of thing, but it's not yet in a decent state: https://github.com/simonw/image-diff - there are other, better options out there such as https://github.com/mapbox/pixelmatch

Needle is an older system that did this using Selenium - updating that to work with Playwright (or Playwright via shot-scraper) would be an interesting project: https://github.com/python-needle/needle


I used pixelmatch + puppeteer at a previous job to automate this. The service would screenshot each page on the master branch and the PR branch, then comment on the PR with a link showing the Before, Diff, and After of each page that changed. I highly recommend this type of setup.


Where did you upload the screenshots?


The same service kept the screenshots in its file system, and served them along with the web pages that displayed them.


I love that idea.


Awesome Visual Regression Testing lists quite a few tools and online services: https://github.com/mojoaxel/awesome-regression-testing#tools...

"visual-regression": https://github.com/topics/visual-regression

Cypress.io can run in a CI job, does Time Travel, works with DevTools debugger, can take screenshots and [headless] video, and it looks like there's a visual regression testing thing for it: https://github.com/mjhea0/cypress-visual-regression https://docs.cypress.io/guides/overview/why-cypress#Features


I’ve built something like this and stored the diff regions in PostGIS, really easy way to build a visual diff search tool using bounding boxes.
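To make the bounding-box idea concrete, here is a hedged sketch that computes a single box around every differing pixel in two same-sized RGBA buffers. It is not a description of how the system above worked: the function name is hypothetical, and a real tool would probably split changes into several separate regions before storing them as geometries.

```javascript
// Bounding box of all pixels that differ between two RGBA buffers of the
// same dimensions. Returns null when the images are identical.
function diffBoundingBox(a, b, width, height) {
  if (a.length !== width * height * 4 || b.length !== a.length) {
    throw new Error("buffers must hold width * height RGBA pixels");
  }
  let minX = Infinity, minY = Infinity, maxX = -1, maxY = -1;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4;
      const differs =
        a[i] !== b[i] || a[i + 1] !== b[i + 1] ||
        a[i + 2] !== b[i + 2] || a[i + 3] !== b[i + 3];
      if (differs) {
        minX = Math.min(minX, x);
        minY = Math.min(minY, y);
        maxX = Math.max(maxX, x);
        maxY = Math.max(maxY, y);
      }
    }
  }
  // Coordinates are inclusive pixel positions, origin at the top-left
  return maxX < 0 ? null : { minX, minY, maxX, maxY };
}
```

The resulting corners could then be written to a spatial database as a rectangle, which is what would make "find diffs that touch this area of the page" queries possible.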


Now you've got me thinking if that would be possible using SpatiaLite.


I wish there were a way to use an iPhone or Android phone as a spare computer easily (no app). I keep mine charged all day; a way to send some JavaScript to it in order to accomplish things like this in a “serverless” fashion would be neat.


You can do that easily on Android: Termux, LineageOS, postmarketOS.


This is a very clever hack. I did not know you could do something like this by reading metadata of a repo.

I'm thinking of connecting it to Zapier so that I can send the screenshot to some places.


Thanks for sharing! It does not work, though, for pages that take time to load (at best you get an image preview, at worst only the background). Any ideas how to change that?


There is a "wait"[0] option you can set to wait for the page to load.

Edit: Sorry, I didn't see the author has replied to you.

[0] https://github.com/simonw/shot-scraper-template#configuring-...


You can add this to your YAML:

    - url: https://...
      wait: 3000


It works, thanks!


I use https://thumbnail.ws/ via curl.


What sort of limits are in place to keep people from setting up their own web archive with this?


[deleted]



