
Related to this: you can also use my shot-scraper tool to scrape web pages from the command line using JavaScript:

    % pip install shot-scraper
    % shot-scraper install
    % shot-scraper javascript https://datasette.io/ "({
        title: document.title,
        tagline: document.querySelector('.tagline').innerText
    })"
    {
        "title": "Datasette: An open source multi-tool for exploring and publishing data",
        "tagline": "An open source multi-tool for exploring and publishing data"
    }

More here, including an example of using it to scrape data from Hacker News that's not available in the API: https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...

HN post about this from yesterday (which failed to get any traction): https://news.ycombinator.com/item?id=30667588




That's fantastic!

I'll definitely investigate using this. I implemented my own MacGyver version of this basic functionality on top of Selenium to grab screenshots for search.marginalia.nu/explore/random -- but that script is super sketchy and held together with bubble gum and duct tape. Yours looks a lot better.

By the way, is there a way to extract favicons as well?


No, I've not thought about favicons. That's a really interesting challenge.

I wonder if there's a way to detect favicons just using JavaScript that runs against a page? Not sure if it's easy to detect /favicon.ico vs. the various meta tags.


Would be kind of fun to write JavaScript that runs against the page that first looks for the meta tags, then tries to fetch("/favicon.ico") and returns either the URL or a base64 encoded copy of the image (since the "shot-scraper javascript" command requires you to return JSON).
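A rough sketch of what that could look like, not something shot-scraper ships. It checks the page's link tags with an "icon" rel value first, falls back to /favicon.ico, and only returns the URL (the base64 variant is left out). The names pickFaviconUrl and findFavicon are made up for illustration, and I'm assuming the page allows the fetch() call (a cross-origin icon may be blocked by CORS, in which case this returns null).

```javascript
// Pure helper (hypothetical name): pick a favicon URL from <link> descriptors,
// falling back to /favicon.ico on the page's origin.
// `links` is an array of {rel, href} objects; `pageUrl` is the page's base URL.
function pickFaviconUrl(links, pageUrl) {
  const icon = links.find(
    (l) => l.href && (l.rel || "").toLowerCase().split(/\s+/).includes("icon")
  );
  // new URL() resolves relative hrefs against the page URL.
  return new URL(icon ? icon.href : "/favicon.ico", pageUrl).href;
}

// Browser-only wrapper (hypothetical name): gathers the <link> tags from the
// live page, then confirms the chosen URL actually responds before returning it.
async function findFavicon() {
  const links = Array.from(document.querySelectorAll("link[rel]")).map((el) => ({
    rel: el.getAttribute("rel"),
    href: el.getAttribute("href"),
  }));
  const url = pickFaviconUrl(links, document.baseURI);
  try {
    const response = await fetch(url);
    return { favicon: response.ok ? url : null };
  } catch (err) {
    // Network or CORS failure: report that no favicon could be confirmed.
    return { favicon: null };
  }
}
```

To try it against a page you would paste both functions and then evaluate findFavicon() in the page context. This assumes the tool waits for the returned promise, which I haven't verified here.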


There are a lot of weird edge cases for favicons. Most browsers fall back to just looking for /favicon.ico if you don't explicitly specify one in the meta tags, and if you do, there are sometimes several different versions.

Yeah, maybe it's a pipe dream :-/ But even without them, it looks really useful!


I've found some fairly reliable repos for this in the past. Alternatively, https://t1.gstatic.com/faviconV2?client=SOCIAL&type=FAVICON&... still works (but for how long?).



