Wonderful work! Any chance you could talk more about how you built this website? For example, where do you get your news from, how do you aggregate it, and which technology stack do you use on the server side?
Thanks :) Yup, I source news from the local news portals listed on the website. Every hour, each source is crawled in each relevant category for the latest news items, which are stored in a database. A separate job then fetches the body of each of those articles. (I use https://github.com/robfig/cron for the cron jobs.) The server and HTML templates are both in Go. As for the aggregation (grouping) algorithm, I'll just say it's straight out of the textbook: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
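For anyone curious what that two-job pipeline might look like, here's a rough Go sketch. The `Article` struct and the `crawlSources` / `fetchBodies` functions are my own placeholders, not the actual code; the real scheduling uses github.com/robfig/cron, shown in a comment:

```go
package main

import (
	"fmt"
	"time"
)

// Article is a hypothetical stand-in for a crawled news item.
type Article struct {
	Source, Category, URL, Title, Body string
	FetchedAt                          time.Time
}

// crawlSources stands in for the hourly job that polls each source in
// each relevant category for new headlines; here it returns canned items.
func crawlSources() []Article {
	return []Article{
		{Source: "portal-a", Category: "politics",
			URL: "https://example.com/1", Title: "Budget announced"},
	}
}

// fetchBodies stands in for the second job that downloads the full text
// of articles already stored in the database.
func fetchBodies(items []Article) {
	for i := range items {
		items[i].Body = "(full article text would be fetched here)"
		items[i].FetchedAt = time.Now()
	}
}

func main() {
	// With github.com/robfig/cron the scheduling would look roughly like:
	//   c := cron.New()
	//   c.AddFunc("@hourly", func() { fetchBodies(crawlSources()) })
	//   c.Start()
	// Here we just run one cycle directly.
	items := crawlSources()
	fetchBodies(items)
	fmt.Println("stored", len(items), "articles with bodies")
}
```

Splitting the crawl (cheap headline polling) from the body fetch (slower full-page downloads) keeps the hourly job fast even when a source is slow to serve article pages.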
So, in other words, you're using the MinHash algorithm together with locality-sensitive hashing (LSH)? How much volume can you process, and how quickly?
By the way, I first learned about this topic through Stanford’s “Mining of Massive Datasets” (MMDS) course that used to be free on Coursera. So it's thrilling to see someone put it to use in the real world and talk about it, too! :-)
Yup, MinHash with LSH. It's quite fast and computationally light, because the articles shown are limited by recency (e.g. the past 24 hours), so it's on the order of hundreds to thousands of articles grouped in a few seconds. Someone published an open-source LSH implementation in Go on GitHub, so no credit to me :) I probably couldn't have coded LSH myself.
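For readers who haven't seen MMDS chapter 3: the idea is to shingle each article, compress each shingle set into a short MinHash signature, then split signatures into bands so that near-duplicates collide in at least one bucket. A minimal Go sketch of that idea (my own illustration, not the library the author used; salting FNV with the hash index simulates a family of hash functions):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// shingles returns the set of word k-shingles of a document.
func shingles(text string, k int) map[string]bool {
	words := strings.Fields(strings.ToLower(text))
	set := make(map[string]bool)
	for i := 0; i+k <= len(words); i++ {
		set[strings.Join(words[i:i+k], " ")] = true
	}
	return set
}

// minhashSignature computes n MinHash values for a shingle set,
// simulating n independent hash functions by salting FNV-1a with
// the function index.
func minhashSignature(set map[string]bool, n int) []uint32 {
	sig := make([]uint32, n)
	for i := range sig {
		sig[i] = ^uint32(0) // start at max so any hash is smaller
	}
	for s := range set {
		for i := 0; i < n; i++ {
			h := fnv.New32a()
			fmt.Fprintf(h, "%d:%s", i, s)
			if v := h.Sum32(); v < sig[i] {
				sig[i] = v
			}
		}
	}
	return sig
}

// lshBuckets groups documents by banding: signatures that agree on
// every row of any one band land in the same bucket, making them
// candidate duplicates.
func lshBuckets(sigs [][]uint32, bands, rows int) map[string][]int {
	buckets := make(map[string][]int)
	for id, sig := range sigs {
		for b := 0; b < bands; b++ {
			key := fmt.Sprintf("%d:%v", b, sig[b*rows:(b+1)*rows])
			buckets[key] = append(buckets[key], id)
		}
	}
	return buckets
}

func main() {
	docs := []string{
		"the prime minister announced a new budget today",
		"the prime minister announced a new budget this morning",
		"local team wins the championship after dramatic final",
	}
	sigs := make([][]uint32, len(docs))
	for i, d := range docs {
		sigs[i] = minhashSignature(shingles(d, 2), 12)
	}
	for _, ids := range lshBuckets(sigs, 4, 3) {
		if len(ids) > 1 {
			fmt.Println("candidate duplicate group:", ids)
		}
	}
}
```

Banding is what makes this cheap: only documents sharing a bucket need an exact similarity check, so you never compare all pairs. Tuning bands × rows trades false positives against false negatives, exactly as the MMDS chapter describes.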
It would be awesome if you blogged about your entire experience setting up your news aggregator. But I guess your first priority is PageDash these days so I can keep dreaming. :-)