
Honest question: How was it discovered?

Was it reported by a pentester? An (ex-)employee? Facebook itself? How do we know it goes back to 2012?

I know that in the public sector you have to disclose such things to the ICO, but does that also apply to private companies? Who is going to hold them accountable?


Well, I'm just thinking of Concourse the same way it describes itself: "a continuous thing doer".

I want something that will run some code when something happens. In my case that "something" is a specific time of day. The code will spin up a server, connect it to Tailscale, run the 3 scraping jobs, then tear down the server and parse the data. Then another pipeline runs that loads the data and refreshes the caches.
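
Roughly, one run boils down to something like this (a Python sketch; the script names are stand-ins for the actual Concourse tasks, not the real thing):

  import subprocess

  SCRAPE_JOBS = ["supermarket_a", "supermarket_b", "supermarket_c"]  # placeholders

  def run_once():
      # Provision the box and join the tailnet (scripts are stand-ins).
      subprocess.run(["./spin_up_server.sh"], check=True)
      subprocess.run(["tailscale", "up"], check=True)
      try:
          for job in SCRAPE_JOBS:
              # Each job only writes raw responses to disk; parsing comes later.
              subprocess.run(["./scrape.sh", job], check=True)
      finally:
          # Tear the server down even if a scrape fails.
          subprocess.run(["./tear_down_server.sh"], check=True)
      subprocess.run(["./parse_and_refresh_caches.sh"], check=True)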

Of course I'm also using it for continuously deploying my app across 2 environments, for its monitoring stack, for running Terraform, etc.

Basically it runs everything for me so that I don't have to.


...or worse, if there _is_ an API call but the response is HTML instead of JSON.
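
A cheap guard for that failure mode: try to decode, and if it isn't JSON, keep the raw body around so you can inspect it later (a sketch; the file name is arbitrary):

  import json

  def parse_body(body: str):
      try:
          return json.loads(body)
      except json.JSONDecodeError:
          # Almost certainly an HTML error/login page, not data.
          with open("unexpected_response.html", "w") as f:
              f.write(body)
          raise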


Ha, you can't imagine how many times I've thought of doing just that. It's just that it's somewhat blocked by other things that need to happen before I can even attempt it.


Thanks!

> I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.

Yes, that's exactly what I've been doing, and it's saved me more times than I'd care to admit!
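
For anyone curious, the shape of it is roughly this (a sketch; the paths and names are illustrative):

  import datetime
  import pathlib

  RAW_DIR = pathlib.Path("raw")  # illustrative layout: raw/<store>/<date>.html

  def save_raw(store: str, body: str) -> pathlib.Path:
      # Scrape step: only persist the raw payload, no parsing here.
      stamp = datetime.datetime.now().strftime("%Y-%m-%d")
      path = RAW_DIR / store / f"{stamp}.html"
      path.parent.mkdir(parents=True, exist_ok=True)
      path.write_text(body)
      return path

  def reparse_all(parse):
      # Parse step: can be re-run over the whole archive after a parser fix.
      for path in sorted(RAW_DIR.rglob("*.html")):
          yield parse(path.read_text())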


Haha all we have to do is agree on the format, right?


We already did. The format supports attaching related content, such as the scraped info, to the archive itself. So you get your data along with the means to regenerate it yourself if you want something different.

https://en.m.wikipedia.org/wiki/WARC_(file_format)
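
For instance, with Python's warcio library you can store the raw capture and the extracted data side by side (a rough sketch; warcio is my assumption here, the format itself is language-agnostic):

  from io import BytesIO
  from warcio.warcwriter import WARCWriter
  from warcio.statusandheaders import StatusAndHeaders

  with open("capture.warc.gz", "wb") as fh:
      writer = WARCWriter(fh, gzip=True)
      url = "https://example.com/product/123"  # illustrative
      html = b"<html>...raw page...</html>"
      # The raw capture itself, as a WARC response record:
      response = writer.create_warc_record(
          url, "response",
          payload=BytesIO(html),
          http_headers=StatusAndHeaders(
              "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.1"),
      )
      writer.write_record(response)
      # The scraped info, attached as a metadata record for the same URL:
      meta = writer.create_warc_record(
          url, "metadata",
          payload=BytesIO(b'{"price": 1.99}'),
          warc_content_type="application/json",
      )
      writer.write_record(meta)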


Honestly I don't think that matters a lot. Even if every site were scraped in a different format, the collection would still be insanely useful.

The most important part is being able to consistently scrape every day or so for a long time. That isn't easy.


When you click on a product you get its full price history by default.

I did consider adding 3- and 6-month buttons, but for some reason I decided against it; I don't remember why. It wasn't performance, since I'm heavily caching everything, so it wouldn't have made a difference. Maybe aesthetics?


The random checks I did on a few products as I was shopping didn't show any difference.

Either I was lucky or they don't bother; who knows.


Thanks for your kind words!

I haven't thought about monetizing it, because honestly I don't see anything about it worth monetizing in its current form.


Thankfully I'm not there yet.

Since this is just a side project, if it starts demanding too much of my time too often, I'll just stop it and open up both the code and the data.

BTW, how could the network request not appear in the network tab?

For me the hardest part is correlating and comparing products across supermarkets.


If they don't populate the page via Ajax or other network requests, i.e. it's rendered server-side, then no requests for supermarket data will appear in the network tab.

See Next.js server-side rendering; I believe they mention that as a security benefit in their docs.

In terms of comparison, most names tend to be the same, so a similarity search restricted to the same category matches well enough.
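
Even the Python stdlib gets you most of the way there, e.g. (the cutoff is a guess you'd have to tune):

  from difflib import SequenceMatcher

  def best_match(name, candidates, cutoff=0.8):
      # Compare within one category only; names across chains are close enough.
      if not candidates:
          return None
      scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
                for c in candidates]
      score, match = max(scored)
      return match if score >= cutoff else None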

