I've often thought about doing something like this for comments. I think it woul...

mixmax · on March 25, 2008

"The hard part is getting the initial corpora of stupid and non-stupid text"

Can't you simply use comment votes for this, and take the comments with the fewest votes? According to the users of a site, these would be the stupid ones. Maybe combine the votes with some algorithmic tool that factors in the karma of the user who wrote the comment.

The advantage of this approach is that it will work across sites where the definition of stupid might differ.

comatose_kid · on March 25, 2008

I don't know if the stupidity of the message correlates to its vote tally. For example, a comment with a negative rating may hold an unpopular but valid view.

In fact, if the mapping was that good, one wouldn't need to run a 'stupid filter' on the message body in the first place.

derefr · on March 26, 2008

The data might be more useful together--first generate a simple intelligence-scale number (between -1 and 1), then weight all of a user's votes by their intelligence, and recalculate everyone's intelligence from their weighted karma. That is to say, if a lot of stupid people hate you, you appear smarter.
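A minimal sketch of that recalculation, under some assumptions the comment does not specify: votes are +1 or -1, the seed score acts as a prior in every round, and `tanh` is the (arbitrary) squashing function that keeps scores between -1 and 1. The name `recompute_intelligence` and the fixed number of rounds are hypothetical.

```python
import math

def recompute_intelligence(votes, initial, rounds=10):
    """Iteratively re-score users from intelligence-weighted votes.

    votes:   list of (voter, author, direction), direction is +1 or -1
    initial: dict mapping user -> seed score in [-1, 1]
    """
    users = set(initial)
    for voter, author, _ in votes:
        users.update((voter, author))
    scores = {user: initial.get(user, 0.0) for user in users}

    for _ in range(rounds):
        # Start each round from the seed score (an assumed prior).
        weighted = {user: initial.get(user, 0.0) for user in users}
        for voter, author, direction in votes:
            # A downvote (-1) from a low scorer (negative weight)
            # contributes a positive amount to the author.
            weighted[author] += direction * scores[voter]
        # tanh squashes weighted karma back into (-1, 1).
        scores = {user: math.tanh(k) for user, k in weighted.items()}
    return scores

votes = [("dim1", "alice", -1), ("dim2", "alice", -1), ("dim1", "bob", +1)]
seed = {"dim1": -0.8, "dim2": -0.6}
scores = recompute_intelligence(votes, seed)
```

Here `alice`, downvoted by two low scorers, ends up positive, while `bob`, upvoted by one of them, ends up negative.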

The problem might be with trolls, where the stupid are tricked and vote down in retribution, while the intelligent notice right away but are simply amused at the skill of the execution and vote up in humor. This makes the troll appear much more intelligent than they are.

jakewolf · on March 25, 2008

Maybe something that could be incorporated into disqus.

pg · on March 25, 2008

That would be a really good idea. You could make a version 1 in a day.

tim2 · on March 25, 2008

Even the least skillful trolls mask their ill intent by blending in and maintaining a nondescript tone. This is done in order to get past the "human spam filter." If you run stupidity detection based purely on the contents of their comments, then you run the risk of simply banning controversial topics or people with bad spelling instead of controversial (trolling, roughly) behavior.

To better detect trolling behavior, I'd focus on responses:

- Content of the responses. Lots of shouting? Length.

- Number of responders. 50x more replies than you would otherwise expect from the thread?

- Depth of thread. Conversation still dragging on after all sane people have left?

- Timing of responses. Heated arguments leave no time for cooling off.

So measure the effect of the troll, and your system won't have to try to understand what he's saying.
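The four signals above could be computed roughly like this. It is only a sketch: the `Reply` record, the field names, and the `expected_replies` baseline are all assumptions made for illustration, and turning these numbers into a troll score is left open.

```python
from dataclasses import dataclass
from statistics import mean, median

@dataclass
class Reply:
    text: str
    posted_at: float  # seconds after the parent comment (assumed unit)
    depth: int        # nesting level below the parent comment

def shouting_ratio(text):
    """Fraction of letters that are uppercase."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)

def response_features(replies, expected_replies=1.0):
    """Summarize the replies to one comment without reading the comment."""
    if not replies:
        return {"shouting": 0.0, "mean_length": 0.0, "reply_ratio": 0.0,
                "max_depth": 0, "median_gap": None}
    times = sorted(r.posted_at for r in replies)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "shouting": mean(shouting_ratio(r.text) for r in replies),  # content
        "mean_length": mean(len(r.text) for r in replies),          # content
        "reply_ratio": len(replies) / expected_replies,             # number
        "max_depth": max(r.depth for r in replies),                 # depth
        "median_gap": median(gaps) if gaps else None,               # timing
    }

replies = [Reply("YOU ARE WRONG", 10, 1), Reply("no you", 40, 2), Reply("NO", 50, 3)]
features = response_features(replies, expected_replies=1.0)
```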

+ But if you were referring to "stupid", measuring based on the submitter's data may work just fine.

abstractbill · on March 25, 2008

All of the above also apply to running-joke threads, which I'm quite fond of as long as they don't take over a site completely.

henning · on March 25, 2008

Maybe you could start with all the comments on the Bile Blog (bileblog.org). It is a seething pit of bitterness and trolling.

Xichekolas · on March 25, 2008

You could probably also use all the comments on Youtube as a good 'stupid corpus'.

icky · on March 25, 2008

Stupid comments on HN are likely to be very different in form than stupid comments on Youtube.

They are more likely to be written by smart-ish people who are either acting belligerent or trolling.

Reddit might be a much better source of highbrow stupid.

Any site that refers lots of new traffic to HN might also be considered as a source of potentially stupid comments. A markovian discriminator could get very good at identifying, say, TechCrunch-style comments. ;-)

dfranke · on March 26, 2008

But with Reddit you have to triage the comments, because there are always some intelligent ones. I think 4chan might be the best choice. I've heard it described as "where smart people go to be stupid", and I think that's reasonably accurate. ICHC would be good too if you ran it through a LOLspeak -> English translator first.

icky · on March 26, 2008

You have to triage comments from any source, at least at first (and occasionally thereafter). But the filters learn very fast.

dfranke · on March 26, 2008

Ideally you'd triage them anyway, but the whole idea we're discussing here is to find sites monolithic enough that you can get away with not doing so.

icky · on March 26, 2008

No, I think you'd have to settle for mining good sources of stupid, especially the sort of stupid likely to start creeping into HN.

dreish · on March 25, 2008

Not all of them, but you should only need to hand-pick a few before the classifier starts to make the job much less tedious.

henning · on March 25, 2008

True, although that would conflate stupid with spam unless you took further measures.

nickb · on March 26, 2008

"The hard part is getting the initial corpora of stupid and non-stupid text. "

I think this could be solved very easily by scanning an email inbox + older news.yc comments for non-stupid text and just scraping youtube/digg/reddit comments as stupid text.
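Once the two corpora exist, a first classifier could look something like the sketch below. It assumes a plain word-level naive Bayes model with add-one smoothing, which is a swapped-in technique and not necessarily what the original proposal uses. The class name, the labels, and the toy training strings are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Two-class naive Bayes; needs at least one example per label."""

    def __init__(self):
        self.word_counts = {"stupid": Counter(), "ok": Counter()}
        self.doc_counts = Counter()

    def train(self, label, text):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def log_score(self, label, text):
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["stupid"]) | set(self.word_counts["ok"]))
        # Log prior from the share of training documents with this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total + vocab))
        return score

    def classify(self, text):
        return max(self.word_counts, key=lambda label: self.log_score(label, text))

model = NaiveBayes()
model.train("stupid", "lol first!!! u suck")
model.train("stupid", "omg lol this suks")
model.train("ok", "The classifier needs a labeled corpus")
model.train("ok", "Weighted votes could reduce the noise")
```

With real corpora, `train` would be called once per scraped comment or inbox message instead of on these toy strings.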

far33d · on March 25, 2008

Wow. That particular user has an exactly 100% troll:total comments ratio. Most trolls occasionally have something real to say.

derefr · on March 26, 2008

I believe the "user" is a puppet account of another user, who has mentioned before that they only use the account when they want to take on a particular tone.