
This is a great use case for Bayesian filtering, and I bet you could even hook up an existing Bayes engine to this problem space, if not just write one yourself from common recipes - I've done great things with the one described in "Programming Collective Intelligence" (http://shop.oreilly.com/product/9780596529321.do). Things like "googlebot", "certain IP range", "was observed clicking 100 links in < 1 minute", and "didn't have a referrer" all go into the hopper of observable "features", which are aggregated to produce a corpus of bot/non-bot server hits. Run it over a gig of log files to train it and you'll have a decent system which you can tune to favor false negatives over false positives.
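In case it's useful, here's a minimal sketch of the idea in Python. The feature names and the train/score interface are things I made up for illustration, not production code, but it's the standard naive Bayes recipe:

    from collections import defaultdict
    import math

    class NaiveBayesBotFilter:
        def __init__(self):
            # feature_counts[label][feature] = training hits with that feature
            self.feature_counts = defaultdict(lambda: defaultdict(int))
            self.label_counts = defaultdict(int)

        def train(self, features, label):
            """features: set of feature strings; label: 'bot' or 'human'."""
            self.label_counts[label] += 1
            for f in features:
                self.feature_counts[label][f] += 1

        def score(self, features, label):
            """Unnormalized log-probability of label given the features."""
            total = sum(self.label_counts.values())
            logp = math.log(self.label_counts[label] / total)
            n = self.label_counts[label]
            for f in features:
                # Laplace smoothing so unseen features don't zero things out
                logp += math.log((self.feature_counts[label][f] + 1) / (n + 2))
            return logp

        def classify(self, features, bot_margin=0.0):
            """Raise bot_margin to favor false negatives over false positives."""
            diff = self.score(features, 'bot') - self.score(features, 'human')
            return 'bot' if diff > bot_margin else 'human'

    # Hypothetical usage with the kinds of features mentioned above:
    filt = NaiveBayesBotFilter()
    filt.train({'ua:googlebot', 'no_referrer', 'clicks_gt_100_per_min'}, 'bot')
    filt.train({'referrer_present', 'ip_range:residential'}, 'human')
    print(filt.classify({'no_referrer', 'clicks_gt_100_per_min'}))  # -> bot

The bot_margin knob is where the tuning comes in: require a bigger gap between the two scores before calling a hit a bot.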

I'm sure Facebook engineers could come up with something way more sophisticated.




Well yeah, getting a training data set, and then keeping your algo from being poisoned, seems like a challenge all around!


If I understand the term correctly, the algo can't be poisoned if you only train it manually against datasets you trust.

Each iteration of training that produces improved behavior can then be used to generate a new, more comprehensive training set. When I wrote a Bayes engine to classify content, my first training set was based on a corpus that I produced entirely heuristically (not using Bayes). I manually tuned it (about 500 pieces of content) and from that produced a new training set of about 5000 content items.
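Roughly, the bootstrapping loop looked like this (hypothetical names and record shapes, and it reuses the filter sketch from my comment above):

    def heuristic_label(hit):
        # Crude seed rules - not Bayes, just the obvious cases
        if 'googlebot' in hit.get('user_agent', '').lower():
            return 'bot'
        if hit.get('referrer'):
            return 'human'
        return None  # too ambiguous for the seed set

    def bootstrap(filt, seed_hits, unlabeled_hits, margin=2.0):
        # Round 1: train on the small, manually tuned heuristic corpus
        for hit in seed_hits:
            label = heuristic_label(hit)
            if label:
                filt.train(hit['features'], label)
        # Round 2: let the trained filter label a much larger corpus,
        # keeping only the predictions it's confident about for review
        new_set = []
        for hit in unlabeled_hits:
            diff = (filt.score(hit['features'], 'bot')
                    - filt.score(hit['features'], 'human'))
            if abs(diff) > margin:
                new_set.append((hit, 'bot' if diff > 0 else 'human'))
        return new_set  # spot check these, then retrain on the expanded set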

Eyeball it for a couple of months initially: retrain manually from time to time, spot check daily, and watch overall stats daily for any anomalous changes in the observed bot/non-bot ratio. That will work wonders.
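The stats-watching part can be as simple as a z-score on the daily bot ratio - daily_ratios here is just a hypothetical list of recent daily bot fractions:

    import statistics

    def ratio_alarm(daily_ratios, todays_ratio, z_cutoff=3.0):
        """Flag today's bot ratio if it deviates sharply from recent history."""
        mean = statistics.mean(daily_ratios)
        stdev = statistics.stdev(daily_ratios)  # needs >= 2 days of history
        if stdev == 0:
            return todays_ratio != mean
        return abs(todays_ratio - mean) / stdev > z_cutoff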





