Hacker News | gaauch's comments

A long-term side project of mine is building a recommendation algorithm trained on HN data.

I trained a model to predict whether a given post will reach the front page, get flagged, etc. I collected over 1,000 RSS feeds and rank the RSS entries with my ranking models.

I submit the high-ranking entries to HN to test my models, and I can reach the front page consistently, sometimes with multiple entries on the front page at the same time.

I also experiment with user->content recommendation. For that I use comment data to model interactions between users and entries, which seems to work fine.
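To make the idea of modeling user/entry interactions from comment data concrete, here is a minimal sketch. It is not the poster's actual model. It assumes a comment counts as one implicit interaction and scores unseen entries by item-item cosine similarity over sets of commenters. The sample data and the names `build_index`, `cosine`, and `recommend` are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical (user, entry_id) pairs: each pair means the user commented on the entry.
comments = [
    ("alice", "e1"), ("alice", "e2"),
    ("bob", "e1"), ("bob", "e2"), ("bob", "e3"),
    ("carol", "e3"), ("carol", "e4"),
]


def build_index(pairs):
    """Index the interactions both ways: entry -> commenters and user -> entries."""
    users_by_entry = defaultdict(set)
    entries_by_user = defaultdict(set)
    for user, entry in pairs:
        users_by_entry[entry].add(user)
        entries_by_user[user].add(entry)
    return users_by_entry, entries_by_user


def cosine(a, b):
    """Cosine similarity between two sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def recommend(user, pairs, k=3):
    """Rank entries the user has not commented on by similarity to those they have."""
    users_by_entry, entries_by_user = build_index(pairs)
    seen = entries_by_user.get(user, set())
    scores = {}
    for candidate, commenters in users_by_entry.items():
        if candidate in seen:
            continue
        score = sum(cosine(commenters, users_by_entry[s]) for s in seen)
        if score > 0:
            scores[candidate] = score
    # Sort by descending score, breaking ties by entry id for determinism.
    return sorted(scores, key=lambda e: (-scores[e], e))[:k]


print(recommend("alice", comments))
```

A real system would more likely learn embeddings from a much larger interaction matrix; the set-based similarity here only illustrates how comments can stand in for user/entry interactions.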

The only problem is that I get a lot of 'out of distribution' content in my RSS feeds, which 'confuses' my ranking models. To handle this, I trained models to predict whether a given entry belongs on HN or not. On top of that, I have some tagging models trained on data I scraped from lobste.rs and hand-annotated.
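The filter-then-rank arrangement described here can be sketched as a two-stage pipeline. This is an illustration under assumptions, not the actual setup: `toy_belongs_score` and `toy_rank_score` are hypothetical keyword-based stand-ins for the trained models, and the `min_belongs` threshold is an arbitrary choice.

```python
# Hypothetical vocabulary used only by the toy stand-in scorers below.
TECH_WORDS = {"rust", "compiler", "database", "kernel", "python"}


def toy_belongs_score(title):
    """Stand-in for a 'belongs on HN' classifier: 1.0 if any tech word appears."""
    words = title.lower().split()
    return 1.0 if any(w in TECH_WORDS for w in words) else 0.0


def toy_rank_score(title):
    """Stand-in for a ranking model: number of distinct tech words in the title."""
    return float(len(set(title.lower().split()) & TECH_WORDS))


def rank_feed(entries, belongs_score, rank_score, min_belongs=0.5):
    """Drop out-of-distribution entries first, then rank only what remains."""
    kept = [e for e in entries if belongs_score(e) >= min_belongs]
    return sorted(kept, key=rank_score, reverse=True)


entries = [
    "Ten celebrity diets ranked",
    "Python tips",
    "Writing a Rust compiler for a database kernel",
]
print(rank_feed(entries, toy_belongs_score, toy_rank_score))
```

Running the filter before the ranker means the ranker never has to score entries unlike anything it saw in training, which is where its output is least trustworthy.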

I have been working on this on and off for the last 2 years or so. This account is not my main, just one I created for testing.

AMA


Could you explain more about what you mean by modeling interactions between comments and entities?


Did you find whether submitted entries are more likely to reach the front page depending on the title or on the content?

I.e., do HN users upvote more based on the title of the article or on actually reading it?


I tried having an LLM generate different titles for a given article and compared their ranking scores. The ranking scores vary a lot depending on how the title is worded. Titles that are more likely to generate 'outrage' seem to get ranked higher, but at the same time that increases the is_hn_flagged score, which tries to predict whether an entry will get flagged.
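The trade-off between ranking score and flag score can be sketched as picking the title variant with the best penalized score. This is only an illustration: the LLM generation step is not shown, `toy_rank_score` and `toy_flag_score` are hypothetical keyword-based stand-ins for the ranking and is_hn_flagged models, and `flag_penalty` is an assumed knob, not something described in the thread.

```python
# Hypothetical word list used only by the toy stand-in scorers below.
OUTRAGE_WORDS = {"destroys", "scandal", "outrage"}


def _outrage_count(title):
    """Count outrage-style words in a title (toy feature)."""
    return sum(1 for w in title.lower().split() if w in OUTRAGE_WORDS)


def toy_rank_score(title):
    """Stand-in ranking model: outrage wording raises the predicted rank."""
    return 0.5 + 0.2 * _outrage_count(title)


def toy_flag_score(title):
    """Stand-in flag model: outrage wording also raises the flag risk."""
    return 0.4 * _outrage_count(title)


def pick_title(variants, rank_score, flag_score, flag_penalty=1.0):
    """Return the variant maximizing rank score minus a weighted flag score."""
    scored = [(rank_score(t) - flag_penalty * flag_score(t), t) for t in variants]
    return max(scored)[1]


variants = [
    "Startup changes its data retention policy",
    "Startup destroys user trust with new policy",
]
print(pick_title(variants, toy_rank_score, toy_flag_score))
print(pick_title(variants, toy_rank_score, toy_flag_score, flag_penalty=0.0))
```

With the penalty at zero the outrage-worded variant wins on ranking score alone; with the penalty applied, the calmer variant wins because its flag risk is lower.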

