Hacker News
An open-source filter software that can detect rampant stupidity in written English (stupidfilter.org)
41 points by jakewolf on March 25, 2008 | 39 comments



The XKCD folks implemented what they call "Robot9000". Robot9000 attempts to ensure that every comment being added to a site or chat channel is unique when compared against the history of the channel. It basically hashes a somewhat stripped-down version of each comment and compares it against the entire historical corpus of their chat. If the comment is found then the user is muted for an exponentially-increasing amount of time for each infraction. I believe there's a slow decay on the mute duration as well. This sort of filter won't stop stupid text, but it seems to be working for them. It's a novel approach to the problem of signal-dilution as a social network grows.

Robot9000 release announcement: http://blag.xkcd.com/2008/01/14/robot9000-and-xkcd-signal-at...

Perl source: http://media.peeron.com/tmp/ROBOT9000.html
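
For anyone who wants to play with the idea, here is a minimal Python sketch of the mechanism as described above. It is not a port of the Perl source. The normalization rules, the 2-second base mute, and the names `normalize` and `UniquenessFilter` are assumptions made for illustration, and the slow decay of the mute duration is left out.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Reduce a message to lowercase alphanumeric words joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


class UniquenessFilter:
    def __init__(self, base_mute_seconds: int = 2):
        self.seen = set()       # hashes of every message accepted so far
        self.infractions = {}   # user -> number of repeated messages so far
        self.base = base_mute_seconds

    def submit(self, user: str, text: str) -> int:
        """Return 0 if the message is new, otherwise the mute duration in seconds."""
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in self.seen:
            self.seen.add(digest)
            return 0
        count = self.infractions.get(user, 0) + 1
        self.infractions[user] = count
        # The mute doubles with each infraction: base, 2*base, 4*base, ...
        return self.base * 2 ** (count - 1)
```

Because the hash is taken over the normalized text, "Hello, world!" and "hello world" count as the same message.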


I've often thought about doing something like this for comments. I think it would work.

The hard part is getting the initial corpora of stupid and non-stupid text. Stupid writing is harder to recognize than spam. It might work to use sites as proxies.

Another related filter that might be worth trying to build would be one for recognizing trolls. It would be easy to collect the bad corpus for this filter, because the design of most forums makes it easy to see all the comments by a particular user, e.g.

http://reddit.com/user/qwe1234/


"The hard part is getting the initial corpora of stupid and non-stupid text"

Can't you simply use comment votes for this, taking the comments with the fewest votes? According to the users of a site these would be the stupid ones. Maybe combine the votes with some algorithmic weighting for the karma of the user who wrote the comment.

The advantage of this approach is that it will work across sites where the definition of stupid might differ.
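
A rough sketch of what that labeling could look like. The `Comment` fields, the thresholds, and the log-karma adjustment are all made up for illustration; a real site would have to tune them against its own vote distribution.

```python
import math
from typing import NamedTuple, Optional


class Comment(NamedTuple):
    text: str
    votes: int         # net score of the comment
    author_karma: int  # total karma of the author


def label(comment: Comment, low: float = -2.0, high: float = 5.0) -> Optional[str]:
    """Return 'stupid', 'non-stupid', or None when the vote signal is too weak."""
    # Hypothetical adjustment: nudge the score by log10 of the author's karma,
    # so a downvoted comment from a long-standing user counts as less damning.
    adjusted = comment.votes + math.log10(max(comment.author_karma, 1))
    if adjusted <= low:
        return "stupid"
    if adjusted >= high:
        return "non-stupid"
    return None
```

Comments that land between the two thresholds are left out of both corpora rather than guessed at.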


I don't know if the stupidity of the message correlates to its vote tally. For example, a comment with a negative rating may hold an unpopular but valid view.

In fact, if the mapping was that good, one wouldn't need to run a 'stupid filter' on the message body in the first place.


The data might be more useful together--first generate a simple intelligence-scale number (between -1 and 1), then weight all of a user's votes by their intelligence, and recalculate everyone's intelligence from their weighted karma. That is to say, if a lot of stupid people hate you, you appear smarter.

The problem might be with trolls, where the stupid are tricked and vote down in retribution, but the intelligent notice right away but are simply amused at the skill of the execution, and vote up in humor. This makes the troll appear much more intelligent than they are.
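
Here is a small sketch of the recalculation loop, under some assumptions of mine: the seed score comes from the text filter, each vote is a (voter, target, direction) triple, and `tanh` is used to squash the result back into the -1 to 1 range. The function name and the number of rounds are arbitrary.

```python
import math


def recalc_intelligence(seed, votes, rounds=5):
    """
    seed:  dict of user -> initial score in [-1, 1] (e.g. from the text filter)
    votes: list of (voter, target, direction), with direction +1 or -1
    """
    scores = dict(seed)
    for _ in range(rounds):
        weighted_karma = {user: 0.0 for user in seed}
        for voter, target, direction in votes:
            # A downvote from a negative-scoring voter counts in the target's favor.
            weighted_karma[target] += scores[voter] * direction
        # Assumption: add the weighted karma to the seed, then squash with tanh
        # so every score stays strictly between -1 and 1.
        scores = {u: math.tanh(seed[u] + weighted_karma[u]) for u in seed}
    return scores
```

With `seed = {"dim": -0.8, "smart": 0.8, "poster": 0.0}` and a single downvote from `dim` on `poster`, the poster ends up with a positive score, which is the "stupid people hate you" effect.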


Maybe something that could be incorporated into disqus.


That would be a really good idea. You could make a version 1 in a day.


Even the least skillful trolls mask their ill intent by blending in and maintaining a nondescript tone. This is done in order to get past the "human spam filter." If you run stupidity detection based purely on the contents of their comments, then you run the risk of simply banning controversial topics or people with bad spelling instead of controversial (trolling, roughly) behavior.

To better detect trolling behavior, I'd focus on responses:

- Content of the responses. Lots of shouting? Length.

- Number of responders. 50x more replies than you would otherwise expect from the thread?

- Depth of thread. Conversation still dragging on after all sane people have left?

- Timing of responses. Heated arguments leave no time for cooling off.

So measure the effect of the troll, and your system won't have to try to understand what he's saying.

+ But if you were referring to "stupid", measuring based on the submitter's data may work just fine.
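
The response signals listed above could be turned into features along these lines. This is only a sketch: the `Reply` fields, the `expected_replies` baseline, and the use of capital letters as a stand-in for "shouting" are assumptions, and turning the features into a troll score is left open.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class Reply:
    text: str
    depth: int            # nesting level below the suspect comment (1 = direct reply)
    delay_seconds: float  # time between this reply and the comment it answers


def response_features(replies: List[Reply], expected_replies: float = 3.0) -> Dict[str, float]:
    """Summarize how a thread reacted, without reading the suspect comment itself."""
    if not replies:
        return {"shouting": 0.0, "mean_length": 0.0, "reply_ratio": 0.0,
                "max_depth": 0.0, "median_delay": 0.0}
    text = "".join(r.text for r in replies)
    letters = sum(ch.isalpha() for ch in text)
    capitals = sum(ch.isupper() for ch in text)
    return {
        "shouting": capitals / letters if letters else 0.0,               # content
        "mean_length": sum(len(r.text) for r in replies) / len(replies),  # length
        "reply_ratio": len(replies) / expected_replies,                   # number of responders
        "max_depth": float(max(r.depth for r in replies)),                # depth of thread
        "median_delay": float(median([r.delay_seconds for r in replies])),  # timing
    }
```

A short median delay combined with a high reply ratio and deep nesting would be the "heated argument" pattern.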


All of the above also apply to running-joke threads, which I'm quite fond of as long as they don't take over a site completely.


Maybe you could start with all the comments on the Bile Blog (bileblog.org). It is a seething pit of bitterness and trolling.


You could probably also use all the comments on Youtube as a good 'stupid corpus'.


Stupid comments on HN are likely to be very different in form than stupid comments on Youtube.

They are more likely to be written by smart-ish people who are either acting belligerent or trolling.

Reddit might be a much better source of highbrow stupid.

Any site that refers lots of new traffic to HN might also be considered as a source of potentially stupid comments. A markovian discriminator could get very good at identifying, say, TechCrunch-style comments. ;-)


But with Reddit you have to triage the comments, because there are always some intelligent ones. I think 4chan might be the best choice. I've heard it described as "where smart people go to be stupid", and I think that's reasonably accurate. ICHC would be good too if you ran it through a LOLspeak -> English translator first.


You have to triage comments from any source, at least at first (and occasionally thereafter). But the filters learn very fast.


Ideally you'd triage them anyway, but the whole idea we're discussing here is to find sites monolithic enough that you can get away with not doing so.


No, I think you'd have to settle for mining good sources of stupid, especially the sort of stupid likely to start creeping into HN.


Not all of them, but you should only need to hand-pick a few before the classifier starts to make the job much less tedious.


True, although that would conflate stupid with spam unless you took further measures.


"The hard part is getting the initial corpora of stupid and non-stupid text. "

I think this could be solved very easily by scanning email inbox + older news.yc comments for non-stupid text and just scraping youtube/digg/reddit comments as stupid text.
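
Once you have the two piles of text, training is the easy part. Below is a bare-bones naive Bayes sketch with add-one smoothing. It is an illustration of the general approach, not the classifier the stupidfilter project uses, and the tokenizer and class names are assumptions.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; apostrophes are kept so "don't" stays one word."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.counts = {"stupid": Counter(), "non-stupid": Counter()}
        self.docs = {"stupid": 0, "non-stupid": 0}

    def train(self, label, text):
        self.counts[label].update(tokens(text))
        self.docs[label] += 1

    def classify(self, text):
        """Return the more likely label. Assumes both classes have been trained."""
        vocab_size = len(set(self.counts["stupid"]) | set(self.counts["non-stupid"]))
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            total_words = sum(counter.values())
            score = math.log(self.docs[label] / total_docs)  # class prior
            for word in tokens(text):
                # Add-one smoothing keeps unseen words from zeroing the product.
                score += math.log((counter[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Feed `train("non-stupid", ...)` from the inbox and old comments, `train("stupid", ...)` from the scraped ones, then call `classify` on new text.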


Wow. That particular user has an exactly 100% troll:total comments ratio. Most trolls occasionally have something real to say.


I believe the "user" is a puppet account of another user, who has mentioned before that they only use the account when they want to take on a particular tone.


I'd be more interested in the complement -- filters tuned to pick up smart or interesting writing. I'm not convinced that it's necessarily an identical problem.

It would be neat to run a battery of standard semantic analysis tools against the text of web pages ranked highly on HN, compared with pages not ranked highly.


That's actually a fascinating idea.


Input: "If I had 6 hours to chop down a tree, I'd spend the first 4 sharpening the axe."

Output: "Text is not likely to be stupid."

Input: "You wanna see my pics?"

Output: "Text is likely to be stupid."


Yes, huge flaws and easily gamed. I just liked the attempt.


OMG, it failed on 50 Cent lyrics!

"You can find me in the club / bottle full of bub / Mama, I got that X if you're in to taking drugs / So come give me a hug if you're in to getting rubbed"

Any reasonable "stupid" corpus must begin with lyrics from 50 Cent, T Pain, Limp Bizkit, Kottonmouth Kings, Insane Clown Posse, and other craptacular music popular amongst suburban teenagers.


//USER COMMENT REDACTED BY STUPIDITY FILTER//


That's a really hard task to accomplish. Is it poor English that is stupid? What about foreigners writing in English? Are they all stupid? They won't write 100% proper English. What about misspelled words?

I think irrelevancy and inaccuracy are the best ways to distinguish stupid from smart, and the key to knowing which is which probably lies in the subject of the comments. That would be a related/non-related filter, not a stupid filter.

Honestly, by the name it got I think it is more intended to get a lot of buzz than to really become a real product. Isn't Mr. Ortiz just trying to get some attention? The definition of stupid is directly related to the reader so you can't have a filter for that, it would have to be personal.


> The definition of stupid is directly related to the reader so you can't have a filter for that, it would have to be personal.

If you mean that it is subjective, Ortiz's entire experiment is to find the degree to which it is so, or rather, the degree to which one can objectively measure intelligence in writing from its characteristics.



"these sample comments are then compared to “smart” text from a body of work on sites like Project Gutenberg, an online catalog of great world literature. Mr. Ortiz says he took snippets from classics by such authors as Jules Verne and J.D. Salinger to serve as a baseline for “the edited English language."

Did anyone else laugh at the thought of setting Holden Caulfield as the paragon of English prose?


Anyone who didn't is a phony.


It is an easy mistake to believe that tools as simple as Bayesian filters can emulate intelligence. It requires intelligence to determine whether someone else is intelligent or not, not a bunch of rules and filters.

As we've seen with spam, any unintelligent system can be circumvented given enough time and ingenuity. Bayesian filters are now but a small part of e-mail analysis.


For things as simple as simple Bayesian filters, I'd agree. But I wouldn't be surprised if a sufficiently advanced filter is indistinguishable from intelligence. Otherwise, you could make the argument: "It is an easy mistake to believe that things as simple as neurons can emulate intelligence. It requires intelligence to determine whether someone else is intelligent or not, not a bunch of firing thresholds."


>>> An open-source filter software that can detect rampant stupidity in written English

Text is not likely to be stupid.

CLASSIFY succeeds; success probability: 0.5043 pR: 0.0075

Best match to file #0 (/home/sfp/code/nonstupid_cor.css) prob: 0.5043 pR: 0.0075
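
In other words, it is barely better than a coin flip. Assuming `pR` here is the base-10 log of the odds, log10(p / (1 - p)), the two numbers in the output agree with each other. That reading of `pR` is my assumption, not something the output itself states, and `log_odds_10` is just a helper name.

```python
import math


def log_odds_10(p: float) -> float:
    """Base-10 log of the odds p / (1 - p); 0 means a 50/50 call."""
    return math.log10(p / (1.0 - p))


print(round(log_odds_10(0.5043), 4))  # prints 0.0075
```

A confident classification would show a probability near 1 and a `pR` well above zero.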


Reddit should use a filter like this that a comment must pass before it is allowed to be posted (except for on the lolcat and NSFW subreddits).


not scalable, look at their "stupid" and "non-stupid" data


They give an example of filtering out lowercase text. Lowercase and stupid are very different. I sometimes write whole essays in lowercase.


++ on the lowercase. I do uppercase sentence starts, but the rest i like to leave lowercase.

German uppercases all nouns. It's really hard to get used to doing.



