
Maybe you could start with all the comments on the Bile Blog (bileblog.org). It is a seething pit of bitterness and trolling.



You could probably also use all the comments on YouTube as a good 'stupid corpus'.


Stupid comments on HN are likely to be very different in form from stupid comments on YouTube.

They are more likely to be written by smart-ish people who are either acting belligerent or trolling.

Reddit might be a much better source of highbrow stupid.

Any site that refers lots of new traffic to HN might also be considered as a source of potentially stupid comments. A markovian discriminator could get very good at identifying, say, TechCrunch-style comments. ;-)
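
For anyone wondering what a "markovian discriminator" might look like in practice, here is a minimal sketch, assuming word-bigram Markov chains with add-one smoothing: train one chain per corpus and label a comment by whichever chain gives it the higher log-probability. The names `MarkovModel` and `classify` and the toy corpora are made up for illustration; nothing here is an existing HN filter.

```python
import math
from collections import Counter, defaultdict


class MarkovModel:
    """Word-bigram Markov chain (hypothetical helper, illustration only)."""

    def __init__(self):
        self.transitions = defaultdict(Counter)  # prev word -> Counter of next words
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        # Lowercase, split on whitespace, add start/end markers.
        return ["<s>"] + text.lower().split() + ["</s>"]

    def train(self, comments):
        for comment in comments:
            tokens = self.tokenize(comment)
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.transitions[prev][cur] += 1

    def log_prob(self, text, vocab_size):
        # Sum of add-one-smoothed log transition probabilities.
        tokens = self.tokenize(text)
        total = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            counts = self.transitions.get(prev, Counter())
            total += math.log((counts[cur] + 1) / (sum(counts.values()) + vocab_size))
        return total


def classify(text, good, bad):
    # Shared vocabulary size (+1 for unseen words) keeps the two scores comparable.
    vocab_size = len(good.vocab | bad.vocab) + 1
    if bad.log_prob(text, vocab_size) > good.log_prob(text, vocab_size):
        return "bad"
    return "good"
```

You would call `train` on each corpus (say, hand-picked HN comments versus comments scraped from a referring site) and then `classify(comment, good, bad)`. Real corpora would need better tokenization and some handling of comment length, so treat this as a starting point rather than a working filter.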


But with Reddit you have to triage the comments, because there are always some intelligent ones. I think 4chan might be the best choice. I've heard it described as "where smart people go to be stupid", and I think that's reasonably accurate. ICHC would be good too if you ran it through a LOLspeak -> English translator first.


You have to triage comments from any source, at least at first (and occasionally thereafter). But the filters learn very fast.


Ideally you'd triage them anyway, but the whole idea we're discussing here is to find sites monolithic enough that you can get away with not doing so.


No, I think you'd have to settle for mining good sources of stupid, especially the sort of stupid likely to start creeping into HN.


Not all of them, but you should only need to hand-pick a few before the classifier starts to make the job much less tedious.
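
One way to read "hand-pick a few, then let the classifier help" is uncertainty sampling: label a small seed set by hand, then only look at the comments the current model is least sure about. The sketch below assumes a bare-bones naive Bayes word model with add-one smoothing and no class priors; `train_counts`, `log_odds`, and `next_to_review` are hypothetical names, not part of any existing tool.

```python
import math
from collections import Counter


def train_counts(labeled):
    """labeled: list of (text, label) pairs, label is "stupid" or "ok"."""
    counts = {"stupid": Counter(), "ok": Counter()}
    for text, label in labeled:
        counts[label].update(text.lower().split())
    return counts


def log_odds(text, counts):
    """Positive leans 'stupid', negative leans 'ok', near zero means unsure."""
    vocab = len(set(counts["stupid"]) | set(counts["ok"])) + 1
    n_stupid = sum(counts["stupid"].values())
    n_ok = sum(counts["ok"].values())
    score = 0.0
    for word in text.lower().split():
        p_stupid = (counts["stupid"][word] + 1) / (n_stupid + vocab)
        p_ok = (counts["ok"][word] + 1) / (n_ok + vocab)
        score += math.log(p_stupid / p_ok)
    return score


def next_to_review(unlabeled, counts, k=1):
    """Return the k comments with the smallest absolute log-odds."""
    return sorted(unlabeled, key=lambda t: abs(log_odds(t, counts)))[:k]
```

After each batch of hand labels you would rebuild the counts and ask for the next few borderline comments, so the human effort goes to the cases the model cannot settle on its own.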


True, although that would conflate stupid with spam unless you took further measures.



