
Honest question: How was it discovered?

Was it reported by a pentester? An (ex-)employee? Facebook itself? How do we know it goes back to 2012?

I know that in the public sector you have to disclose such things to the ICO, but does that also apply to private companies? Who is going to hold them accountable?


Well, I'm just thinking of Concourse the same way it describes itself: "a continuous thing doer".

I want something that will run some code when something happens. In my case that "something" is a specific time of day. The code will spin up a server, connect it to Tailscale, run the 3 scraping jobs, then tear down the server and parse the data. Then another pipeline runs that loads the data and refreshes the caches.
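
Roughly, one run boils down to something like this (a Python sketch; the script names are stand-ins for the actual Concourse tasks, not the real thing):

  import subprocess

  SCRAPE_JOBS = ["supermarket_a", "supermarket_b", "supermarket_c"]  # placeholders

  def run_once():
      # Provision the box and join the tailnet (scripts are stand-ins).
      subprocess.run(["./spin_up_server.sh"], check=True)
      subprocess.run(["tailscale", "up"], check=True)
      try:
          for job in SCRAPE_JOBS:
              # Each job only writes raw responses to disk; parsing comes later.
              subprocess.run(["./scrape.sh", job], check=True)
      finally:
          # Tear the server down even if a scrape fails.
          subprocess.run(["./tear_down_server.sh"], check=True)
      subprocess.run(["./parse_and_refresh_caches.sh"], check=True)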

Of course I'm also using it for continuously deploying my app across 2 environments, for its monitoring stack, for running Terraform, etc.

Basically it runs everything for me so that I don't have to.


...or worse, if there _is_ an API call but the response is HTML instead of JSON.
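
A cheap guard for that failure mode: try to decode, and if it isn't JSON, keep the raw body around so you can inspect it later (a sketch; the file name is arbitrary):

  import json

  def parse_body(body: str):
      try:
          return json.loads(body)
      except json.JSONDecodeError:
          # Almost certainly an HTML error/login page, not data.
          with open("unexpected_response.html", "w") as f:
              f.write(body)
          raise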


Ha, you can't imagine how many times I've thought of doing just that. It's just that it's somewhat blocked by other things that need to happen before I can even attempt it.


Thanks!

> I have found that it is best to split the task of scraping and parsing into separate processes. By saving the raw JSON or HTML, you can always go back and apply fixes to your parser.

Yes, that's exactly what I've been doing, and it's saved me more times than I'd care to admit!
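
For anyone curious, the shape of it is roughly this (a sketch; the paths and names are illustrative):

  import datetime
  import pathlib

  RAW_DIR = pathlib.Path("raw")  # illustrative layout: raw/<store>/<date>.html

  def save_raw(store: str, body: str) -> pathlib.Path:
      # Scrape step: only persist the raw payload, no parsing here.
      stamp = datetime.datetime.now().strftime("%Y-%m-%d")
      path = RAW_DIR / store / f"{stamp}.html"
      path.parent.mkdir(parents=True, exist_ok=True)
      path.write_text(body)
      return path

  def reparse_all(parse):
      # Parse step: can be re-run over the whole archive after a parser fix.
      for path in sorted(RAW_DIR.rglob("*.html")):
          yield parse(path.read_text())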


Haha all we have to do is agree on the format, right?


We already did. The format supports attaching related content, such as the scraped info, to the archive itself. So you get your data along with the means to regenerate it yourself if you want something different.

https://en.m.wikipedia.org/wiki/WARC_(file_format)
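
For instance, with Python's warcio library you can store the raw capture and the extracted data side by side (a rough sketch; warcio is my assumption here, the format itself is language-agnostic):

  from io import BytesIO
  from warcio.warcwriter import WARCWriter
  from warcio.statusandheaders import StatusAndHeaders

  with open("capture.warc.gz", "wb") as fh:
      writer = WARCWriter(fh, gzip=True)
      url = "https://example.com/product/123"  # illustrative
      html = b"<html>...raw page...</html>"
      # The raw capture itself, as a WARC response record:
      response = writer.create_warc_record(
          url, "response",
          payload=BytesIO(html),
          http_headers=StatusAndHeaders(
              "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.1"),
      )
      writer.write_record(response)
      # The scraped info, attached as a metadata record for the same URL:
      meta = writer.create_warc_record(
          url, "metadata",
          payload=BytesIO(b'{"price": 1.99}'),
          warc_content_type="application/json",
      )
      writer.write_record(meta)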


Honestly I don't think that matters a lot. Even if every site were scraped in a different format, the collection would still be insanely useful.

The most important part is being able to consistently scrape every day or so for a long time. That isn't easy.


When you click on a product you get its full price history by default.

I did consider adding 3- and 6-month buttons, but for some reason I decided against it; I don't remember why. It wasn't performance, since I'm heavily caching everything, so it wouldn't have made a difference. Maybe aesthetics?


The random checks I did on a few products as I was shopping didn't show any difference.

Either I was lucky or they don't bother; who knows.


Thanks for your kind words!

I haven't thought about monetizing it, because honestly I don't see anything about it worth monetizing in its current form.


Thankfully I'm not there yet.

Since this is just a side project, if it starts demanding too much of my time too often, I'll just stop it and open up both the code and the data.

BTW, how could the network request not appear in the network tab?

For me the hardest part is correlating and comparing products across supermarkets.


If they don't populate the page via Ajax or other network requests, i.e. it's rendered server-side, then no requests for supermarket data will appear in the network tab.

See Next.js server-side rendering; I believe they mention that as a security benefit in their docs.

In terms of comparison, most names tend to be the same, so a similarity search restricted to the same category matches well enough.
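
Even the Python stdlib gets you most of the way there, e.g. (the cutoff is a guess you'd have to tune):

  from difflib import SequenceMatcher

  def best_match(name, candidates, cutoff=0.8):
      # Compare within one category only; names across chains are close enough.
      if not candidates:
          return None
      scored = [(SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
                for c in candidates]
      score, match = max(scored)
      return match if score >= cutoff else None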

