Hacker News
Responding to jacquesm's challenge (jgc.org)
83 points by jgrahamc on March 19, 2010 | 49 comments



Did you look at posting times at all to exclude people who never post at certain times? (I should add that that's not my idea but something someone posted on the original thread).


No, I had a limited amount of time to get this done so there are lots of other things I could have tried.


You could include the time something is posted in the Bayesian filtering.


Hahaha, you still haven't found me out.

Oh wait..


You were #50 on my list.

If I modify the parser to take case into account you jump up to #16.


Thanks, that at least puts the heat off my name popping up again.

Maybe we subconsciously adopt the style of the person we reply to :) ?


We do. It's called neuro-linguistic programming, or the "familiarity principle". Salespeople use this for effect.


Seeing people do this in real time must be interesting. Seeing it work must be astonishing.


I used to do cold call sales for a living and from the "pick up" you can tell what kind of person is answering and modify your voice accordingly. You basically just follow their lead.

E.g.:

* Quiet people don't want shouty salesmen so go quiet, reserved and professional.

* Some guy off a council estate/trailer park is typically not going to want to talk to a suit so in that case you'll adopt a very casual and matey tone as if you've bumped into him in a pub/bar.

* If I EVER met someone who "knew the score" (e.g. someone smart, or who knows how telesales works) then I'd just drop all pretences and give it to them "straight".

Those are the "obvious" kinds of NLP, there are subtle bits and pieces including copying their tone or turns of phrase. It does increase your conversion rate.

Obviously (almost) everyone HATES cold calls, so one of my goals was to see how long I could keep "angrys" on the phone. NLP definitely helps here (although empathy, humility and diplomacy are probably more useful). I even managed to sell to one once! :D


> [...] so one of my goals was to see how long I could keep "angrys" on the phone.

Just as a challenge? Or with anything else in mind?


For the challenge, mostly. It can also result in a good call back (apparently after I left, my call-backs had good conversion rates).

As long as you combine the right amount of humility, amiability and possibly humour you can turn almost anyone around. For example if someone is getting A LOT of sales calls and yours is the straw that broke the camel's back then you can still help them. Provide information, give em the telephone preference service number (to remove them from sales calls databases), tell them how automated diallers work or how the regulations work in regards to cold calls. Be honest.

There is no reason why any call should have to end in a negative on either side. :)


> There is no reason why any call should have to end in a negative on either side. :)

Unless risking this would increase success in general. ;o)


If you are indeed the mystery user, make a post from that account here and prove it.


That wouldn't prove anything. Swombat and mystery user could collaborate to deceive us :)


I asked PG if swombat and onetimetoken had set off the HN sock puppet detector and he said that onetimetoken had used an IP address never before seen by HN.

So, swombat, time to prove that you are onetimetoken.


If I were onetimetoken, and I had gone to all this hassle to create a properly anonymous account and even laid down a challenge to you to find me out... why would I help you find me out by giving you proof?


You are the one making the claim to be onetimetoken. I'm just asking that you back that up.


Where did I make such a claim? I made a joke about such a claim...

or did I...


The comment similarity problem is a lot more interesting than the is-swombat-just-being-coy problem... could you just tell us?


This is certainly an interesting problem, but without some access to the data I don't think I could really approach it.

Perhaps I am alone in saying this, but I think data mining is interesting while web crawling is boring. Could somebody make the data available so that we don't have to write a crawler? Or is this part of the challenge?

I think this is a classic example of unsupervised learning, for which I would generally use a system like Fuzzy ART. I think that might perform better than a naive Bayesian text classifier, though I can't be sure until I try it out.


If anyone wants to use 80legs for this challenge, just drop us a line at http://www.80legs.com/contact.html. We might be able to set up some custom free plans.


Would the owner of the comment just fess up for pete's sake? We won't hurt you.

edit: there's some new text to put through the filter: http://news.ycombinator.com/user?id=onetimetoken

and please try running the previous suspects through the filter: marketer, citizenparker, martythemaniak, eru, vaksel, neilc, vanelsas, swelljoe

Also, if you're on that list and it's not you, please formally deny it.


There's a misspelled word in the profile.

http://searchyc.com/identitiy


Isn't the Naive Bayes classifier biased towards users with a large volume of text? I.e., if there are two users and one writes 99% of the content, isn't it very likely that that user will be picked as the author of almost anything? At the same time, this may be desired, since someone who contributes a lot on HN may also have wanted to have some fun.


The key calculation is: given a word w, what's the frequency with which this user uses word w? That's the number of times w occurs divided by the number of words that user has used. So it doesn't matter, as long as a user has 'enough' text to have covered a good portion of the overall dictionary of words in use.

The prior probability is based on the number of comments a user makes. In this case that prior is insignificant because the sample text is large.
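The two calculations described above can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the actual code used; the toy corpus, the tokenisation and the add-one smoothing are my own assumptions:

```python
import math
from collections import Counter

# Toy corpus: {user: list of comments}. The real data was crawled from HN.
corpus = {
    "alice": ["i think bayesian filtering is neat",
              "spam filtering works well"],
    "bob":   ["cold call sales is hard",
              "sales people use this for effect"],
}

word_counts = {u: Counter(w for c in cs for w in c.split())
               for u, cs in corpus.items()}
vocab = {w for wc in word_counts.values() for w in wc}
n_comments = {u: len(cs) for u, cs in corpus.items()}
total_comments = sum(n_comments.values())

def rank(text):
    """Rank users by log P(user) + sum of log P(word | user)."""
    scores = {}
    for u, wc in word_counts.items():
        n_words = sum(wc.values())
        # Prior: this user's share of all comments.
        s = math.log(n_comments[u] / total_comments)
        for w in text.split():
            # Add-one smoothing so an unseen word doesn't zero out a user.
            s += math.log((wc[w] + 1) / (n_words + len(vocab)))
        scores[u] = s
    return sorted(scores, key=scores.get, reverse=True)

print(rank("bayesian filtering"))  # alice should come out on top
```

Working in log space avoids underflow when multiplying many small word probabilities together.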


Training on word pairs or triples may also be worth a look, instead of going for single words only.
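For what it's worth, extracting word pairs or triples is a one-liner; a hypothetical sketch (the resulting tuples would then be counted exactly like single words in the Bayesian model):

```python
def word_ngrams(words, n):
    """Sliding window of n consecutive words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "the straw that broke the camel's back".split()
pairs = word_ngrams(tokens, 2)
# [('the', 'straw'), ('straw', 'that'), ('that', 'broke'), ...]
```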


Very cool, and appropriate that you're basically using PG's spam filtering to identify users on his site :)

I think the next step is to write a more complex filter that does not assume word probabilities are independent of each other, i.e. take unusual phrases like "entirely dissimilar" into account.


What did you use to compute P(D|Ci)?

This uses your notation from the Dr Dobbs article, so that Ci (a category) is a user and D is a document (a comment?).

Did you use something like trigram signatures?

Also, is P(Ci) equal to #comments made by Ci/total number of users?

Interesting stuff jgrahamc!


I just used whitespace separated words after stripping punctuation.


I'd have thought that capitalisation and punctuation were key elements in any textual analysis. In the subject text there is a very unusual hyphenation "pure-ad" for example.


In authorship identification punctuation can be very important, as can non-grammatical features like the distribution of the number of syllables per word.


That's a nice approach, but a naive bayes classifier doesn't seem like it would be the best method for this particular problem.

You probably want to do an N-gram analysis, like that performed by libtextcat http://software.wise-guys.nl/libtextcat/. This will perform a comparison based on common combinations of letters used (like "wo", "or", "rd"). Seems like it would be more accurate with such a relatively small sample of comments. If you had a list of 10-20 possible candidates, you could narrow it down to just a few.
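A hypothetical sketch of the core of that approach (roughly what libtextcat does internally, not its actual API): build a ranked profile of each author's most frequent character n-grams, then compare profiles with an "out-of-place" distance:

```python
from collections import Counter

def profile(text, n=2, top=300):
    """Ranked list of the most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(p1, p2):
    """Sum of rank differences; smaller means more similar styles.
    N-grams missing from p2 get the maximum penalty."""
    return sum(abs(i - p2.index(g)) if g in p2 else len(p2)
               for i, g in enumerate(p1))
```

The unknown comment's profile would be compared against each candidate's profile, and the smallest distance wins.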


Didn't somebody code up a website a while ago that looked for other HN members that were similar in commenting style to oneself?


As I was on the list there, I just want to confirm it wasn't me. But when I read the original comments left by the anonymous commenter, I saw a lot of my own syntax mannerisms - at least the algorithm isn't too bad, eh? ;-)


So basically everybody struck out. Most likely due to sample size.

There's an interesting lesson here: the coolness of the tool used has no direct relation to the usefulness of the conclusions it provides.


I think it would be more interesting if the "guesses" would actually take into consideration how successful or unsuccessful the method is with the data available. For example, how likely are each of the names he mentioned and how likely is it that it's any one of them?

Edit: If someone here has a background in intelligence, I would love to hear their take on the challenge.


Sadly, most of the statistics community continues to ignore your pithy lesson, to the great distress of all involved.


Great to see an analytical approach to the challenge :D Although reviewing your first list most of them actually seem unlikely (for a variety of reasons).


I'm dying to know if this turns out to be right, or not. I actually have a totally different list which is based on a different handling of punctuation. It suggests that the most likely person who also commented on that thread is zaveri.


This challenge is a bit flawed in that we have no way of knowing if the anonymous poster is willing to ever confirm that he/she made the post, isn't it? Not to state the obvious but it just seems that even if a great amount of technical work is put into this, you'll never know the answer unless the person in question agrees to participate.


You can test your techniques on comments (not used in training) for which you know the author. If you can achieve a high accuracy there, you can be fairly sure to be correct in the challenge, too.
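A hypothetical sketch of that kind of hold-out check (the names and the split fraction are my own invention; `classify` stands in for whatever classifier is being tested):

```python
import random

def holdout_accuracy(corpus, classify, frac=0.2, seed=0):
    """Withhold a fraction of each user's comments, then see how often
    the classifier (trained on the rest) names the true author."""
    rng = random.Random(seed)
    train, test = {}, []
    for user, comments in corpus.items():
        cs = comments[:]
        rng.shuffle(cs)
        k = max(1, int(len(cs) * frac))
        test += [(user, c) for c in cs[:k]]
        train[user] = cs[k:]
    hits = sum(classify(train, text) == user for user, text in test)
    return hits / len(test)
```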


I'm still convinced the lower-case use of google, facebook etc. (which occurred more than once in the comment) is important - especially as there is one Google at the start of a sentence - indicating it is intentional/common.

That's why I personally discounted many from your first list (plus the fact that I know a few are native English speakers).


I am aghast that people would think I'm not a native English speaker. I hope I'm numbered in the "few." Disclaimer: sometimes Windows handwriting recognition makes a real hash of my post without my noticing.


Run it through a Markovian classifier like CRM114 (or, if that's too expensive, just do it for the likely candidates identified by your naive Bayesian classifier).


Hey hey, two guesses each :)

Now I have to add my third from the list: nostrademons.


Well, for one thing, I much prefer telling jacquesm he's wrong as myself! :)


Don't rule out jacquesm himself! It'd be just the kind of thing he would do.



This is my inscrutable face.



