Show HN: Instantly create a GitHub repository to take screenshots of a web page

simonw · on March 14, 2022

Related to this: you can also use my shot-scraper tool to scrape web pages from the command line using JavaScript:

    % pip install shot-scraper
    % shot-scraper install
    % shot-scraper javascript https://datasette.io/ "({
        title: document.title,
        tagline: document.querySelector('.tagline').innerText
    })"
    {
        "title": "Datasette: An open source multi-tool for exploring and publishing data",
        "tagline": "An open source multi-tool for exploring and publishing data"
    }

More here, including an example of using it to scrape data from Hacker News that's not available in the API: https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...

HN post about this from yesterday (which failed to get any traction): https://news.ycombinator.com/item?id=30667588

marginalia_nu · on March 14, 2022

That's fantastic!

I'll definitely investigate using this. I implemented my own MacGyver version of this basic functionality off Selenium to grab screenshots for search.marginalia.nu/explore/random -- but that script is super sketchy and held together with bubble gum and duct tape. Yours looks a lot better.

By the way, is there a way to extract favicons as well?

simonw · on March 14, 2022

No, I've not thought about favicons. That's a really interesting challenge.

I wonder if there's a way to detect favicons just using JavaScript that runs against a page? Not sure if it's easy to detect /favicon.ico vs. the various meta tags.

simonw · on March 14, 2022

Would be kind of fun to write JavaScript that runs against the page that first looks for the meta tags, then tries to fetch("/favicon.ico") and returns either the URL or a base64 encoded copy of the image (since the "shot-scraper javascript" command requires you to return JSON).
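A rough sketch of the lookup half of that idea, under some assumptions: pages usually declare icons with `<link rel="icon">` elements, so this checks those and otherwise falls back to /favicon.ico. The helper name `pickFaviconUrl` and the `{rel, href}` input shape are made up for illustration. It does not check that the fallback file actually exists, and it skips the base64 step entirely.

```javascript
// Hypothetical helper: pick a favicon URL from a page's <link> tags,
// falling back to /favicon.ico at the site root.
// links: array of {rel, href} objects describing the page's <link> elements.
function pickFaviconUrl(links, pageUrl) {
  // rel can hold several space-separated tokens, e.g. "shortcut icon"
  const icon = links.find(
    (link) =>
      (link.rel || "").toLowerCase().split(/\s+/).includes("icon") && link.href
  );
  // Resolve relative hrefs against the page URL; default to /favicon.ico
  return new URL(icon ? icon.href : "/favicon.ico", pageUrl).href;
}

// In a browser the input could be collected roughly like this:
// pickFaviconUrl(
//   Array.from(document.querySelectorAll("link[rel]"), (l) => ({
//     rel: l.getAttribute("rel"),
//     href: l.getAttribute("href"),
//   })),
//   location.href
// );
```

Keeping the selection logic separate from the DOM query makes it easy to try outside a browser. Multiple icon sizes and formats are not handled here; it simply takes the first match.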

marginalia_nu · on March 14, 2022

There's a lot of weird edge cases for favicons, most browsers fall back to just looking for /favicon.ico if you don't explicitly specify it in the meta tags, and if you do, there's sometimes different versions.

Yeah, maybe it's a pipe dream :-/ But even without them, it looks really useful!

Freeboots · on March 15, 2022

I've found some fairly reliable repos in the past, or https://t1.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&... still works (but for how long)

simonw · on March 14, 2022

Here's a GitHub code search which shows repos that people have created using my template: https://github.com/search?q=shot-scraper-template+-user%3Asi...

My favourite so far is this one, which is taking screenshots of a variety of French news websites: https://github.com/ggtr1138/UneJournaux

uniqueuid · on March 15, 2022

This is an awesome tool, like all of yours!

Just to mention: The example of French news websites shows that popovers and consent forms are a serious problem for this kind of screenshotting.

Do you have any thoughts on how to best deal with them?

simonw · on March 15, 2022

Yes - you can run extra JavaScript to hide those before you take the screenshot. Example here: https://github.com/palewire/news-homepages/issues/4#issuecom...
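As a minimal sketch of what that extra JavaScript might look like: the selectors below are hypothetical, since every site names its consent banner differently, and `hideOverlays` is an illustrative name rather than anything shot-scraper provides. The linked comment shows how such a script is actually wired into a screenshot.

```javascript
// Hypothetical selectors: replace with whatever the target site really uses.
const OVERLAY_SELECTORS = ["#cookie-banner", ".consent-modal", ".popover"];

// Hide every element matching one of the selectors; returns how many were hidden.
function hideOverlays(doc, selectors = OVERLAY_SELECTORS) {
  let hidden = 0;
  for (const selector of selectors) {
    for (const el of doc.querySelectorAll(selector)) {
      el.style.display = "none";
      hidden += 1;
    }
  }
  return hidden;
}

// In a page this would be called as: hideOverlays(document);
```

Hiding with `display: none` rather than clicking "accept" avoids depending on each banner's button markup, though some sites also lock scrolling on the body, which this does not undo.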

Brajeshwar · on March 15, 2022

This is cool. Tried it. Curious question: wouldn't it be better to default to width and not height as the fixed dimension in the example (your code already does the job)? Websites' heights are variable while width is usually fixed.

simonw · on March 15, 2022

If you delete the height: 800 line from the YAML you'll get the full length of the page.

I decided to set a height of 800 by default to avoid people accidentally saving 10MB+ images just because they were trying out the tool.

moharoune · on March 14, 2022

Great idea to use GitHub for this. I've been working on https://app.trackwebpage.com/ which also tracks changes on web pages and sends email notifications when changes happen (if you want them). It's totally free now; you can just sign up and track as many web pages as you want.

0des · on March 14, 2022

Love the project. Do you ever worry about Microsoft tightening the purse strings on these types of off-label uses for GitHub?

simonw · on March 14, 2022

I do. I'm happy to pay for Actions minutes on private repos, but I do worry that they'll change their policy with regards to free minutes for public repos at some point.

I felt a lot better about my git scraping work (https://simonwillison.net/2020/Oct/9/git-scraping/) after GitHub released https://next.github.com/projects/flat-data/ which was inspired by that work, as it feels like it's now acknowledged as an OK use of their platform.

I'm hoping people don't abuse shot-scraper too much in terms of saving huge binary files to free repositories - that's why I haven't yet included tips on running scheduled scrapers in the shot-scraper-template documentation.

beardicus · on March 14, 2022

I saw you were poking with some image diff tools recently as well, and I'm sure you've thought about this already but I'd just like to explicitly state it: it'd be great if you could scrape a screenshot periodically and only commit it if the new one is significantly different.

simonw · on March 14, 2022

Yeah that's the idea behind https://github.com/simonw/image-diff but it's not quite fit for purpose yet.
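To illustrate the "only commit if significantly different" check, here is a naive sketch. It is not how image-diff or pixelmatch work; it just counts pixels whose channels moved by more than a tolerance. It assumes both screenshots have already been decoded to same-sized RGBA byte arrays, and the function names, the tolerance of 16, and the 1% threshold are all arbitrary choices for illustration.

```javascript
// Fraction of pixels that differ between two same-sized RGBA buffers.
// A pixel counts as changed if any channel differs by more than `tolerance`.
function changedFraction(a, b, tolerance = 16) {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error("expected two RGBA buffers of equal size");
  }
  const pixels = a.length / 4;
  if (pixels === 0) return 0;
  let changed = 0;
  for (let i = 0; i < a.length; i += 4) {
    for (let c = 0; c < 4; c++) {
      if (Math.abs(a[i + c] - b[i + c]) > tolerance) {
        changed += 1;
        break; // count each pixel at most once
      }
    }
  }
  return changed / pixels;
}

// Hypothetical policy: keep the new screenshot only if more than
// `minFraction` of the pixels changed.
function shouldCommit(a, b, minFraction = 0.01) {
  return changedFraction(a, b) > minFraction;
}
```

A per-pixel count like this is sensitive to small layout shifts and anti-aliasing, which is part of why purpose-built tools exist.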

dudus · on March 15, 2022

GitHub does have some very generous free tier stuff. But as this sort of thing becomes more common, it's destined to be lowered or more heavily limited. Nothing against the project, I think it's an awesome idea with great execution. I'll probably set it up sometime soon.

cancan · on March 14, 2022

This is really cool and I love the idea of using SVGs to even add annotations, as you mentioned in your tweets. I might be "borrowing" that idea soon for some of our own work, and will try to credit you!

lloydatkinson · on March 14, 2022

This but automated visual regression testing (eg testing screenshots are the same) would be amazing. Anyone got any ideas how this might work with shot-scraper?

simonw · on March 14, 2022

You could absolutely get this working with GitHub Actions with a bit of creativity.

I've been playing around with my own image-diff tool for this kind of thing, but it's not yet in a decent state: https://github.com/simonw/image-diff - there are other, better options out there such as https://github.com/mapbox/pixelmatch

Needle is an older system that did this using Selenium - updating that to work with Playwright (or Playwright via shot-scraper) would be an interesting project: https://github.com/python-needle/needle

trevinhofmann · on March 15, 2022

I used pixelmatch + puppeteer at a previous job to automate this. The service would screenshot each page on the master branch and the PR branch, then comment on the PR with a link showing the Before, Diff, and After of each page that changed. I highly recommend this type of setup.

lloydatkinson · on March 15, 2022

Where did you upload the screenshots?

trevinhofmann · on March 15, 2022

The same service kept the screenshots in its file system, and served them along with the web pages that displayed them.

simonw · on March 15, 2022

I love that idea.

westurner · on March 15, 2022

Awesome Visual Regression Testing > lists quite a few tools and online services: https://github.com/mojoaxel/awesome-regression-testing#tools...

"visual-regression": https://github.com/topics/visual-regression

Cypress.io can run in a CI job, does Time Travel, works with DevTools debugger, can take screenshots and [headless] video, and it looks like there's a visual regression testing thing for it: https://github.com/mjhea0/cypress-visual-regression https://docs.cypress.io/guides/overview/why-cypress#Features

dewski · on March 15, 2022

I’ve built something like this and stored the diff regions in PostGIS, really easy way to build a visual diff search tool using bounding boxes.

simonw · on March 15, 2022

Now you've got me thinking if that would be possible using SpatiaLite.

endisneigh · on March 14, 2022

I wish there were an easy way to use an iPhone or Android phone as a spare computer (no app). I keep mine charged all day; a way to send some JavaScript to it in order to accomplish things like this in a “serverless” fashion would be neat.

crickcreek · on March 14, 2022

You can do that easily on Android: Termux, LineageOS, postmarketOS.

ediardo · on March 17, 2022

This is a very clever hack. I did not know you could do something like this by reading metadata of a repo.

I'm thinking of connecting it to Zapier so that I can send the screenshot to some places.

stared · on March 15, 2022

Thanks for sharing! It does not work, though, for pages that take time to load (at best - image previous, at worst - only background). Any ideas how to change that?

max23_ · on March 15, 2022

There is a "wait"[0] option you can set to wait for the page to load.

Edit: Sorry, I didn't see that the author had already replied to you.

[0] https://github.com/simonw/shot-scraper-template#configuring-...

simonw · on March 15, 2022

You can add this to your YAML:

    - url: https://...
      wait: 3000

stared · on March 15, 2022

It works, thanks!

adriangrigore · on March 15, 2022

I use https://thumbnail.ws/ via curl.

dudus · on March 15, 2022

What sort of limits are in place to keep people from setting up their own web archive with this?
