Hacker News
Responding to jacquesm's challenge (jgc.org)
83 points by jgrahamc on March 19, 2010 | 49 comments



Did you look at posting times at all to exclude people who never post at certain times? (I should add that that's not my idea but something someone posted on the original thread).


No, I had a limited amount of time to get this done so there are lots of other things I could have tried.


You could include the time something is posted in the Bayesian filtering.


Hahaha, you still haven't found me out.

Oh wait..


You were #50 on my list.

If I modify the parser to take case into account you jump up to #16.


Thanks, that at least puts the heat off my name popping up again.

Maybe we subconsciously adopt the style of the person we reply to :) ?


We do. It's called neuro-linguistic programming, or the "familiarity principle". Salespeople use this for effect.


Seeing people do this in real time must be interesting. Seeing it work must be astonishing.


I used to do cold call sales for a living and from the "pick up" you can tell what kind of person is answering and modify your voice accordingly. You basically just follow their lead.

E.g.:

* Quiet people don't want shouty salesmen so go quiet, reserved and professional.

* Some guy off a council estate/trailer park is typically not going to want to talk to a suit so in that case you'll adopt a very casual and matey tone as if you've bumped into him in a pub/bar.

* If I EVER met someone who "knew the score" (e.g. someone smart, or who knows how telesales works) then I'd just drop all pretences and give it to them "straight".

Those are the "obvious" kinds of NLP, there are subtle bits and pieces including copying their tone or turns of phrase. It does increase your conversion rate.

Obviously (almost) everyone HATES cold calls, so one of my goals was to see how long I could keep "angrys" on the phone. NLP definitely helps here (although empathy, humility and diplomacy are probably more useful). I even managed to sell to one once! :D


> [...] so one of my goals was to see how long I could keep "angrys" on the phone.

Just as a challenge? Or with anything else in mind?


For the challenge, mostly. It can also result in a good call back (apparently after I left, my call-backs had good conversion rates).

As long as you combine the right amount of humility, amiability and possibly humour you can turn almost anyone around. For example if someone is getting A LOT of sales calls and yours is the straw that broke the camel's back then you can still help them. Provide information, give em the telephone preference service number (to remove them from sales calls databases), tell them how automated diallers work or how the regulations work in regards to cold calls. Be honest.

There is no reason why any call should have to end in a negative on either side. :)


> There is no reason why any call should have to end in a negative on either side. :)

Unless risking this would increase success in general. ;o)


If you are indeed the mystery user, make a post from that account here and prove it.


That wouldn't prove anything. Swombat and mystery user could collaborate to deceive us :)


I asked PG if swombat and onetimetoken had set off the HN sock puppet detector and he said that onetimetoken had used an IP address never before seen by HN.

So, swombat, time to prove that you are onetimetoken.


If I were onetimetoken, and I had gone to all this hassle to create a properly anonymous account and even laid down a challenge to you to find me out... why would I help you find me out by giving you proof?


You are the one making the claim to be onetimetoken. I'm just asking that you back that up.


Where did I make such a claim? I made a joke about such a claim...

or did I...


The comment similarity problem is a lot more interesting than the is-swombat-just-being-coy problem... could you just tell us?


This is certainly an interesting problem, but without some access to the data I don't think I could really approach it.

Perhaps I am alone in saying this, but I think data mining is interesting while web crawling is boring. Could somebody make the data available so that we don't have to write a crawler? Or is this part of the challenge?

I think this is a classic example of unsupervised learning, for which I would generally use a system like Fuzzy ART. I think that might perform better than a naive Bayesian text classifier, though I can't be sure until I try it out.


If anyone wants to use 80legs for this challenge, just drop us a line at http://www.80legs.com/contact.html. We might be able to set up some custom free plans.


Would the owner of the comment just fess up for pete's sake? We won't hurt you.

edit: there's some new text to put through the filter: http://news.ycombinator.com/user?id=onetimetoken

and please try running the previous suspects through the filter: marketer, citizenparker, martythemaniak, eru, vaksel, neilc, vanelsas, swelljoe

Also, if you're on that list and it's not you, please formally deny it.


There's a misspelled word in the profile.

http://searchyc.com/identitiy


Isn't the Naive Bayes classifier biased towards users with a large volume of text? I.e., if there are two users and one writes 99% of the content, isn't it very likely that that user will be picked as the author of almost anything? At the same time, this may be desired, since someone who contributes a lot on HN may also have wanted to have some fun.


The key calculation is: given a word w, what's the frequency with which this user uses word w? That's the number of times w occurs divided by the number of words that user has used. So it doesn't matter, as long as a user has 'enough' text to have covered a good portion of the overall dictionary of words in use.

The prior probability is based on the number of comments a user makes. In this case that prior is insignificant because the sample text is large.
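The two calculations described above can be sketched in a few lines of Python. This is a hypothetical reconstruction, not the actual code used; the toy corpus, the tokenisation and the add-one smoothing are my own assumptions:

```python
import math
from collections import Counter

# Toy corpus: {user: list of comments}. The real data was crawled from HN.
corpus = {
    "alice": ["i think bayesian filtering is neat",
              "spam filtering works well"],
    "bob":   ["cold call sales is hard",
              "sales people use this for effect"],
}

word_counts = {u: Counter(w for c in cs for w in c.split())
               for u, cs in corpus.items()}
vocab = {w for wc in word_counts.values() for w in wc}
n_comments = {u: len(cs) for u, cs in corpus.items()}
total_comments = sum(n_comments.values())

def rank(text):
    """Rank users by log P(user) + sum of log P(word | user)."""
    scores = {}
    for u, wc in word_counts.items():
        n_words = sum(wc.values())
        # Prior: this user's share of all comments.
        s = math.log(n_comments[u] / total_comments)
        for w in text.split():
            # Add-one smoothing so an unseen word doesn't zero out a user.
            s += math.log((wc[w] + 1) / (n_words + len(vocab)))
        scores[u] = s
    return sorted(scores, key=scores.get, reverse=True)

print(rank("bayesian filtering"))  # alice should come out on top
```

Working in log space avoids underflow when multiplying many small word probabilities together.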


Training on word pairs or triples may also be worth a look, instead of going for single words only.
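For what it's worth, extracting word pairs or triples is a one-liner; a hypothetical sketch (the resulting tuples would then be counted exactly like single words in the Bayesian model):

```python
def word_ngrams(words, n):
    """Sliding window of n consecutive words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "the straw that broke the camel's back".split()
pairs = word_ngrams(tokens, 2)
# [('the', 'straw'), ('straw', 'that'), ('that', 'broke'), ...]
```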


Very cool, and appropriate that you're basically using PG's spam filtering to identify users on his site :)

I think the next step is to write a more complex filter that does not assume word probabilities are independent of each other, i.e. take unusual phrases like "entirely dissimilar" into account.


What did you use to compute P(D|Ci)?

This uses your notation from the Dr Dobbs article, so that Ci (a category) is a user and D is a document (a comment?).

Did you use something like trigram signatures?

Also, is P(Ci) equal to #comments made by Ci/total number of users?

Interesting stuff jgrahamc!


I just used whitespace separated words after stripping punctuation.


I'd have thought that capitalisation and punctuation were key elements in any textual analysis. In the subject text there is a very unusual hyphenation "pure-ad" for example.


In authorship identification punctuation can be very important, as can non-grammatical features like the distribution of the number of syllables per word.


That's a nice approach, but a naive bayes classifier doesn't seem like it would be the best method for this particular problem.

You probably want to do an N-gram analysis, like that performed by libtextcat http://software.wise-guys.nl/libtextcat/. This will perform a comparison based on common combinations of letters used (like "wo", "or", "rd"). Seems like it would be more accurate with such a relatively small sample of comments. If you had a list of 10-20 possible candidates, you could narrow it down to just a few.
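A hypothetical sketch of the core of that approach (roughly what libtextcat does internally, not its actual API): build a ranked profile of each author's most frequent character n-grams, then compare profiles with an "out-of-place" distance:

```python
from collections import Counter

def profile(text, n=2, top=300):
    """Ranked list of the most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(p1, p2):
    """Sum of rank differences; smaller means more similar styles.
    N-grams missing from p2 get the maximum penalty."""
    return sum(abs(i - p2.index(g)) if g in p2 else len(p2)
               for i, g in enumerate(p1))
```

The unknown comment's profile would be compared against each candidate's profile, and the smallest distance wins.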


Didn't somebody code up a website a while ago that looked for other HN members that were similar in commenting style to oneself?


As I was on the list there, I just want to confirm it wasn't me. But when I read the original comments left by the anonymous commenter, I saw a lot of my own syntax mannerisms - at least the algorithm isn't too bad, eh? ;-)


So basically everybody struck out. Most likely due to sample size.

There's an interesting lesson here: the coolness of the tool used has no direct relation to the usefulness of the conclusions it provides.


I think it would be more interesting if the "guesses" would actually take into consideration how successful or unsuccessful the method is with the data available. For example, how likely are each of the names he mentioned and how likely is it that it's any one of them?

Edit: If someone here has a background in intelligence, I would love to hear their take on the challenge.


Sadly, most of the statistics community continues to ignore your pithy lesson, to the great distress of all involved.


Great to see an analytical approach to the challenge :D Although reviewing your first list most of them actually seem unlikely (for a variety of reasons).


I'm dying to know if this turns out to be right, or not. I actually have a totally different list which is based on a different handling of punctuation. It suggests that the most likely person who also commented on that thread is zaveri.


This challenge is a bit flawed in that we have no way of knowing if the anonymous poster is willing to ever confirm that he/she made the post, isn't it? Not to state the obvious but it just seems that even if a great amount of technical work is put into this, you'll never know the answer unless the person in question agrees to participate.


You can test your techniques on comments (not used in training) for which you know the author. If you can achieve a high accuracy there, you can be fairly sure to be correct in the challenge, too.
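A hypothetical sketch of that kind of hold-out check (the names and the split fraction are my own invention; `classify` stands in for whatever classifier is being tested):

```python
import random

def holdout_accuracy(corpus, classify, frac=0.2, seed=0):
    """Withhold a fraction of each user's comments, then see how often
    the classifier (trained on the rest) names the true author."""
    rng = random.Random(seed)
    train, test = {}, []
    for user, comments in corpus.items():
        cs = comments[:]
        rng.shuffle(cs)
        k = max(1, int(len(cs) * frac))
        test += [(user, c) for c in cs[:k]]
        train[user] = cs[k:]
    hits = sum(classify(train, text) == user for user, text in test)
    return hits / len(test)
```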


I'm still convinced the lower-case use of google, facebook etc. (which occurred more than once in the comment) is important - especially as there is one Google at the start of a sentence - indicating it is intentional/common.

That's why I personally discounted many from your first list (plus the fact that I know a few are native English speakers).


I am aghast that people would think I'm not a native English speaker. I hope I'm numbered in the "few." Disclaimer: sometimes Windows handwriting recognition makes a real hash of my post without my noticing.


Run it through a Markovian classifier like CRM114 (or, if that's too expensive, just do it for the likely candidates identified by your naive Bayesian classifier).


Hey hey, two guesses each :)

Now I have to add my third from the list: nostrademons.


Well, for one thing, I much prefer telling jacquesm he's wrong as myself! :)


Don't rule out jacquesm himself! It'd be just the kind of thing he would do.



This is my inscrutable face.



