
This is a great use case for Bayesian filtering, and I bet you could even hook up an existing Bayes engine to this problem space, if not just write one yourself from common recipes - I've done great things with the one described in "Programming Collective Intelligence" (http://shop.oreilly.com/product/9780596529321.do). Things like "googlebot", "certain IP range", "was observed clicking 100 links in < 1 minute", and "didn't have a referrer" all go into the hopper of observable "features", which are aggregated to produce a corpus of bot/non-bot server hits. Run it over a gig of log files to train it and you'll have a decent system which you can tune to favor false negatives over false positives.
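In case it's useful, here's a minimal sketch of the idea in Python. The feature names and the train/score interface are things I made up for illustration, not production code, but it's the standard naive Bayes recipe:

    from collections import defaultdict
    import math

    class NaiveBayesBotFilter:
        def __init__(self):
            # feature_counts[label][feature] = training hits with that feature
            self.feature_counts = defaultdict(lambda: defaultdict(int))
            self.label_counts = defaultdict(int)

        def train(self, features, label):
            """features: set of feature strings; label: 'bot' or 'human'."""
            self.label_counts[label] += 1
            for f in features:
                self.feature_counts[label][f] += 1

        def score(self, features, label):
            """Unnormalized log-probability of label given the features."""
            total = sum(self.label_counts.values())
            logp = math.log(self.label_counts[label] / total)
            n = self.label_counts[label]
            for f in features:
                # Laplace smoothing so unseen features don't zero things out
                logp += math.log((self.feature_counts[label][f] + 1) / (n + 2))
            return logp

        def classify(self, features, bot_margin=0.0):
            """Raise bot_margin to favor false negatives over false positives."""
            diff = self.score(features, 'bot') - self.score(features, 'human')
            return 'bot' if diff > bot_margin else 'human'

    # Hypothetical usage with the kinds of features mentioned above:
    filt = NaiveBayesBotFilter()
    filt.train({'ua:googlebot', 'no_referrer', 'clicks_gt_100_per_min'}, 'bot')
    filt.train({'referrer_present', 'ip_range:residential'}, 'human')
    print(filt.classify({'no_referrer', 'clicks_gt_100_per_min'}))  # -> bot

The bot_margin knob is where the tuning comes in: require a bigger gap between the two scores before calling a hit a bot.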

I'm sure Facebook engineers could come up with something way more sophisticated.




Well yeah, getting a training data set, and then keeping your algo from being poisoned, seems like a challenge all around!


If I understand the term correctly, the algo can't be poisoned if you only train it manually against datasets you trust.

Each iteration of training that produces improved behavior can then be used to generate a new, more comprehensive training set. When I wrote a Bayes engine to classify content, my first training set was based on a corpus that I produced entirely heuristically (not using Bayes). I manually tuned it (about 500 pieces of content) and from that produced a new training set of about 5000 content items.
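Roughly, the bootstrapping loop looked like this (hypothetical names and record shapes, and it reuses the filter sketch from my comment above):

    def heuristic_label(hit):
        # Crude seed rules - not Bayes, just the obvious cases
        if 'googlebot' in hit.get('user_agent', '').lower():
            return 'bot'
        if hit.get('referrer'):
            return 'human'
        return None  # too ambiguous for the seed set

    def bootstrap(filt, seed_hits, unlabeled_hits, margin=2.0):
        # Round 1: train on the small, manually tuned heuristic corpus
        for hit in seed_hits:
            label = heuristic_label(hit)
            if label:
                filt.train(hit['features'], label)
        # Round 2: let the trained filter label a much larger corpus,
        # keeping only the predictions it's confident about for review
        new_set = []
        for hit in unlabeled_hits:
            diff = (filt.score(hit['features'], 'bot')
                    - filt.score(hit['features'], 'human'))
            if abs(diff) > margin:
                new_set.append((hit, 'bot' if diff > 0 else 'human'))
        return new_set  # spot check these, then retrain on the expanded set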

Eyeball it for a couple of months initially: retrain manually from time to time, spot check daily, and watch overall stats daily for any anomalous changes in the observed bot/non-bot ratio. That will work wonders.
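The stats-watching part can be as simple as a z-score on the daily bot ratio - daily_ratios here is just a hypothetical list of recent daily bot fractions:

    import statistics

    def ratio_alarm(daily_ratios, todays_ratio, z_cutoff=3.0):
        """Flag today's bot ratio if it deviates sharply from recent history."""
        mean = statistics.mean(daily_ratios)
        stdev = statistics.stdev(daily_ratios)  # needs >= 2 days of history
        if stdev == 0:
            return todays_ratio != mean
        return abs(todays_ratio - mean) / stdev > z_cutoff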





