Tell HN: After six days of DupDetector, time to take stock.
91 points by DupDetector on Dec 16, 2010 | 43 comments
I've been running DupDetector for just under a week now. The results have been interesting. Many, many duplicates have been found, and many, many more items have been cross-referenced. So often the same story is reported again and again from different sources, and while this isn't a bad thing in itself, my personal belief is that the resulting divided discussions are a waste of time and effort. And wasted time is something I hate.

I come to HN for high-quality links, and even more high-quality discussion, and anything that dilutes that discussion is, to me, a bad thing.

I originally intended to run the DupDetector for a week, but after 5 1/2 days there's enough information to tell me what I want. The thing that has caught me by surprise is the way it has had such widely differing reactions. Some people have accused me of running "a novelty account", something I'd never heard of, but which appears to be associated with Reddit. I thought of it more as a robot assistant.

In particular, I thought more people would be interested in the technology and the hacking. More than anything it's the lack of a response on that level that's made me pause.

And the final factor is today, when DupDetector's karma has fallen from 27 to 12. I don't care about the karma, but it's an indication of people's feeling about the exercise.

I'll run it for just a few hours longer as I tidy things up, but basically I'm stopping, explaining, and I'll see what the response is. I thought it would be, and it would be thought to be, cool, interesting and useful.

Maybe I've misjudged my audience. I am reviewing the situation.




This is something that I seem to disagree with a lot of people on. Dupe-police have existed on reddit, digg, slashdot, here, and just about every other social bookmarking website I've ever used (including my own).

I hate them. Why? Because the entire point of social bookmarking is to find things you find interesting, not things that you find unique. That little arrow to the left of the title means "I found this link interesting. I think other people will find it interesting as well."

It absolutely does not mean "This link is unique. Nobody has seen it before." If that were the point, we could just pipe RSS feeds into the URL submitter, couldn't we?

The very fact that links are appearing on the front page means that a lot of people haven't seen them yet. It means that they got some utility out of reading them, and it means that they thought others would too.

Sometimes, dupes are good. I don't remember who said it, but regarding that Louis CK interview that gets posted every once in a while, the one where he talks about how we're surrounded by wonderful technology and yet nobody cares, someone said something to the effect of "I wouldn't care if this was stickied to the top of the page and everybody had to watch it every single day before they post. He is making an excellent point."

Now, I think this is a bit excessive, but the point stands. It isn't about being unique, it is about being good.


Referencing past discussions is useful, though. That’s where I would see the niche for this bot.


The difference between something like DupDetector and someone like MrOhHai on Reddit is that DupDetector was polite and was more like "here's the earlier discussion" as opposed to "you should not have posted this".


It’s still an understandably sensitive topic, hence my recommendation to make DupDetector much more polite and friendly so that there can be no misunderstanding about the intent of the bot.


That's an excellent idea.


When folksonomies were the current hotness, Clay Shirky made a great point about things that seem similar but aren't.

"Synonym control is not as wonderful as is often supposed, because synonyms often aren’t. Even closely related terms like movies, films, flicks, and cinema cannot be trivally collapsed into a single word without loss of meaning, and of social context. (You’d rather have a Drain-O® colonic than spend an evening with people who care about cinema.) So the question of controlled vocabularies has a lot to do with the value gained vs. lost in such a collapse. I am predicting that, as with the earlier arc of knowledge management, the question of meaningful markup is going to move away from canonical and a priori to contextual and a posteriori value."

http://many.corante.com/archives/2004/08/25/folksonomy.php

I think this can apply to anything where similarity can seem like a problem.

Just because items seem similar doesn't mean they share the context in which they are used.


This is a false analogy. We're talking about identical content (a particular website) as compared to two different linguistic terms. The point about folksonomies is interesting and valid in its own right, but should not be used as a justification for duplications on News.YC.


> The very fact that links are appearing on the front page means that a lot of people haven't seen them yet. It means that they got some utility out of reading them, and it means that they thought others would too.

Assuming the theory works. People may upvote for a number of reasons, not necessarily because they received value from the link. There remains a lot of work to do in the largely unexplored territory of social, user-powered sites like Reddit and HN. In theory, the upvote/downvote method maximizes utility for the site's users, because those who received value from the article upvoted it. Unfortunately, what maximizes utility for one set of users isn't the same for others, which is why we can have such a diversity of sites like HN, Reddit, and Digg.

HN has established itself as a high quality, technically and entrepreneurially oriented site, which is why we don't see the same content as, say, reddit.com/r/funny. Defending this position by submitting and upvoting high quality links is critical to maintaining the caliber of HN. This is part of why I don't like seeing resubmitted or duplicate content. Duplicate content (even if very high quality) still takes up a valuable front page location. If enough of the community receives new value from the piece (i.e., they haven't seen it before), then they can upvote it and the article can stay. On the other hand, if you upvote something you've already seen for the benefit of others, you aren't necessarily helping them out. Instead, you are tampering with the ranking algorithm.

If enough new users see an old article and upvote it, then the utility for the aggregate user is maximized, and I am happy with duplicate content when this happens, even if I'm not receiving any value.

Still, there is the problem of users new and old not seeing or knowing about these great quality articles from years ago. Inevitably someone will see an old repost and say "I didn't see this the first time around. I'm glad it was reposted." However, I think using a site like HN to get these old articles new life is using the wrong tool to solve the problem.


I found your bot really annoying. I don't need someone telling me there is another submission with no points and no comments.

It almost felt petty, that you were scolding the submitter for submitting content that had already been submitted. The comments section of Hacker News is one of the last places on the web that has not been hammered with noise.

I would be somewhat interested in how you solved this problem automatically though.


I think a bot which references submissions that already have comments would be incredibly useful, especially on duplicate submissions that don’t have comments but received a few upvotes. I don’t think you should reference duplicate submissions without comments; that just seems pointless. I don’t think that dupes by themselves are bad, but stumbling across a story which was already richly discussed on HN in the past without being able to find that discussion is certainly not optimal.

I also think you should work on the bot’s politeness. It’s easy to perceive as rude a bot that waltzes in, says nothing but “this submission has already ended up with these points and comments”, and provides a link. Technically, sure, that sentence is not negative, or at least not overtly so, but it is easy to perceive it that way.

I would formulate the bot’s phrases in a consciously positive way, showing that your intent is not to be smug towards submitters of dupes but that you just want to help, you just want to provide a service.


I wonder if it would have gotten a better reception if it was "RelatedStoriesBot" instead of "DupDetector".

I used it as a way to find more comments on a topic, and liked it pretty well for that. Of course, I liked it when RiderOfGiraffes was posting them as well; although I wondered how he was able to actually read any of the stories because he seemed to be everywhere with dupe reports.


This has turned into a brain dump - apologies.

I had hoped that the DupDetector would give me more time to read stories, but I found two things:

1. There were many, many more dupes being found than I expected. Checking the robot's output before confirming each posting took about as much time as doing it by hand used to, even though the manual approach found fewer dupes. Fully automating it would only take a little more work, and maybe I'd then get the time back.

2. I was reading a bit more, but I found that the additional material wasn't interesting. I was probably already reading everything I found useful, interesting, instructive, or engaging. I'm working on a script to help me with that as well.

The problem I'm finding is the sheer volume, and most of it is repeats, politics, repeats, TSA, repeats, wikileaks, repeats, Assange, repeats, etc. The proportion of material with deep technical content is much less than I remember. Consider: as I write this, anything more than 40 minutes old has already fallen off the "newest" page. At that pace it can't all be worth reading.

It's suggested that newcomers read the "news" page, and perhaps the "over" page so they get enculturated with what this site is about. Similarly, it's suggested that older hands inhabit the "newest" page so we can vote up those things that deserve it, and flag the inappropriate.

But I can't keep up with "newest" any more, and much of what I would find interesting is vanishing before I can find it. Searching deeper will find it sometimes, but the 'bots were intended to help.

So I'm working on trying to help, working on trying to find the good stuff (by some definition), and working on adding value.

So, for what it's worth, that's what I think. I hope it sparks off some interesting or useful thoughts.


I like the bot (I thought it was you) and I think it would work if the OP deletes the post. It would be up to them, as it probably should be because, as the PG Essay kerfuffle earlier showed, sometimes posts should be seen again. I hope you have time to give some details on the bot itself. In my opinion DupDetector should live on.


I saw the comments but didn't realize it was automated.

I think it would be more useful if integrated with HN at the submission phase. Seeing that something is a dupe of something else after it's already on the front page seems too late.


In a semi-related vein, let me reference my own Ask HN from last week: http://news.ycombinator.com/item?id=1975950

My claim was that HN should bake in a "dup" button that let users flag duplicate posts, not with the goal of removing dups, but with the goal of cross-referencing them (and possibly sharing karma or some other idea). I won't recap the whole thing here, but I find DupDetector an interesting complement to those ideas.


Ok, Reddit handles this really well in their submission system.

If you post a link, and it already exists, it simply flags: here are all the posts that already have this link. Do you still want to proceed?

Gives you the choice of jumping into the existing discussions or re-posting (eg, if the old subs are years out of date)


A similar thing happens here if you post an identical link. Your submission is disallowed and acts as an up-vote for the original.

But too many URLs are different, sometimes subtly, sometimes not so. I've seen submissions with just an extra hash on the end. There are the submissions with all the feedburner crap cluttering them up, and so on. This was an attempt to be more thorough about detecting duplicates, doing "more properly" a job that's already done, and therefore presumably desirable.
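
As an illustration of the kind of normalization involved (my own sketch, not DupDetector's actual code), a canonicalization pass might strip fragments and tracking parameters before comparing URLs:

    # Hypothetical sketch of URL canonicalization for dupe detection;
    # not DupDetector's real implementation.
    from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

    TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                       "utm_term", "utm_content"}

    def canonicalize(url):
        """Normalize a URL so trivially different forms compare equal."""
        p = urlparse(url.strip())
        # Keep only non-tracking query parameters, in a stable order.
        query = urlencode(sorted((k, v) for k, v in parse_qsl(p.query)
                                 if k.lower() not in TRACKING_PARAMS))
        host = p.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        # Drop the fragment (the "extra hash on the end") entirely.
        return urlunparse((p.scheme.lower(), host, p.path.rstrip("/"),
                           "", query, ""))

    # Both of these reduce to "http://example.com/story":
    print(canonicalize("http://www.example.com/story/"))
    print(canonicalize("http://example.com/story#"))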


Ah! That's good, didn't realise. The upvote is a good idea, but yeah, there's a million different ways to submit the same online content.

The function is definitely desirable, but it seems better suited to a browser plugin or such. Although it's not literally mandatory, botting it and having it contribute almost makes it mandatory for all users, versus being an optional component.

That's the root of the issue here: the functionality is great, but not all users find it desirable.


When I post a link to HN that already exists, it auto-votes the prior story for me. There must be some time threshold after which it's allowed to report the same link.


It checks submissions against the links that are cached in memory -- you can resubmit the same URL if a page containing a link to it has not been requested since the server last crashed.

You can submit links that nobody's looked at recently (modulo the consistent crashiness of Arc). As a hack I think it's clever and terrible all at once.
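
For what it's worth, here's a toy sketch of why that behavior falls out of a purely in-memory check (a guess at the shape of the mechanism, not HN's actual Arc code): the dedup test only sees URLs currently in the cache, so everything is forgotten whenever the process restarts.

    # Hypothetical sketch, not HN's Arc implementation: an in-memory link
    # cache that is wiped every time the server restarts or crashes.
    seen_urls = set()

    def note_served_link(url):
        # Called whenever a page containing this link is rendered.
        seen_urls.add(url)

    def submit(url):
        if url in seen_urls:
            return "treated as a duplicate: counts as an upvote on the original"
        seen_urls.add(url)
        return "accepted as a new submission"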


I first saw it today and I was about to thank the administration for this new feature when I realized that it was a user's (or bot's) comment.

The functionality is imho highly desirable, but a more dynamic solution (regularly updating the number of comments, for instance) would be a better implementation.


I agree with the people saying "market it better." As it is, DupDetector looks like linkspam and is thus ripe for angry downvotes.

Change the focus from strict dupe-finding to "add additional context to the article people are already looking at." Copy some data about the comments, note the age of the previous discussions, even reproduce the highest-rated comment. These things will give it a more positive/helpful image.


I didn't realise it was a bot! It would be awesome if you shared your code so we could get PG to incorporate it in the site somehow.


I had no issue with you running it. I think the negative response was people reacting out of fear of letting "novelty" accounts get going, with this as the beachhead where they had to make their stand.


Twitter is a good example of what can happen when "meta" accounts get out of hand. I think HNers find them fascinating in theory, but feel that they dilute the community in practice.


A near-simultaneous dup does divide discussion. It would be better if only one was allowed. Quite often, breaking news is divided among two or three posts.

Dups that are distant in time are more complex. One aspect is that people might not have seen the previous story - so it would be good to show it again. Another aspect is that the previous discussion is lost, leading to the same points being repeated, instead of (possibly) being built upon - so it would be better to resurrect the previous submission and therefore discussion.

One solution is to detect and combine dups, but enable them to launch the story fresh, if sufficient time has passed.

There is in fact already a discrete implementation of this: stories over a year old (I think that's the period) can be resubmitted as a new story. So I'm suggesting a continuous version of this idea, where the "newness" of a story gradually increases, until it becomes completely new after a year. "Newness" could be implemented with a factor on the story score, as sketched below. This would enable old stories (and their discussions) to return to the front page.
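
For example (my own sketch; the linear ramp and the one-year horizon are assumptions, not anything HN actually does), the factor could scale a resubmission's score from 0 right after the original posting up to 1 once a full year has passed:

    # Hypothetical sketch of a continuous "newness" factor.
    SECONDS_PER_YEAR = 365 * 24 * 3600

    def newness(seconds_since_original):
        """0.0 just after the original submission, 1.0 after a full year."""
        return min(1.0, max(0.0, seconds_since_original / SECONDS_PER_YEAR))

    def effective_score(raw_score, seconds_since_original):
        # Scale the resubmission's ranking score by how "new" it has become.
        return raw_score * newness(seconds_since_original)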


I think the idea is really interesting and is itself very HN. The thing getting you down is the way people misuse the down vote.

Disagreeing with something is no reason to down vote a comment. I'm only at ~100 karma, but I'm not waiting to get to 500 just so I can down vote people I disagree with. Disagreement spurs discussion. It'd be pretty boring if everyone agreed on everything.

Since you have to be "qualified" to use the down vote, it should be reserved for instances where the user is not being a respectful fellow hacker.

All that being said, maybe you can include a line saying "Just a bot doing research" before it lists the duplicates.


It would have been fine if you had run the experiment WITHOUT polluting the comments with useless dupe links. You could then write up a blog post or something showing what you had found.


The dupe links seemed useful to me. If something has already been discussed at length, I'd like to know about it.


That's a fair point, but still, the posts were made by a bot, which can't decide whether there is useful discussion on another page or not.


DupDetector can tell if there's a discussion by the number of comments, and it uses points to determine the "usefulness" of the discussion. You can find examples here: http://news.ycombinator.com/threads?id=DupDetector
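
Judging from the linked examples, the decision looks like a simple threshold on points and comments; a hedged guess at the logic (not the bot's actual code, and the numbers are made up) might be:

    # Hypothetical sketch of when an earlier submission is worth referencing;
    # the thresholds are invented, not DupDetector's real values.
    def worth_referencing(points, comments, min_points=5, min_comments=3):
        # Only link to an earlier submission if it attracted real discussion
        # or at least a few upvotes.
        return comments >= min_comments or points >= min_points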


I agree that there's just a slight miss re: HN style and community behavior, and also think that it's a beneficial service. Does the robot update the comments automatically?

I wonder what people would make of you using your 'real' account with a note that it's a robot posting. On some sites this would be considered karma whoring, but maybe here it would get a better response?


I was watching this with interest, actually. I think it would be great to somehow tie threads together. blhack is right when he says that social bookmarking is about finding things interesting, not unique, so dupes aren't evil in themselves. It's the dilution of discussion that's the issue. So if you can solve that issue elegantly, you win.


> there's enough information to tell me what I want

Then, what is it?

I think you have not been clear enough about what you achieved.


In 3 days, wikileaks will expose what the dupdetector really wants.


What method are you using to conclude that two stories on two different sites are about the same topic? That's always been a feature of certain sites that captured my interest, but it seems like so much can go wrong. Achieving decent accuracy must be very difficult. Have you written about this anywhere?


I actually was really excited when I saw this originally. I made a point of going through all the bot's posts to see what you were doing with it. I got the impression that you were trying to tweak the bot to the point that people would find it interesting. So for me it is sad to hear that things aren't working out. I was hoping that DupDetector would eventually be pointing to year-old discussion pages; a discoverer instead of a detector. It was only links to the articles that didn't gain any traction that bothered me. They seemed like noise.


When I first saw the dup detector working, I thought that this was an interesting problem, and one that I want to see solved. I have often posted a comment on the losing thread, only to see it surpassed by the next post.

In the sense of an interesting hack, it was fun and pragmatic. Thanks for doing it. But it seems that the community doesn't see the need for the service.


Simply stating "More comments here..." then showing the additional submissions might get a better response. HN isn't so much a discussion among a few friends (although it feels that way at times) as a large public park where people come in and out at all different times. The fact that topics repeat (at least some) isn't such a bad thing.


There were cases where the linked duplicates would have only one or two comments. It was annoying to see them listed - not because there is nothing to read, but because it's a vote of no confidence that those particular posts would ever have grown a larger discussion.


I personally really like the idea of the duplicate detector. I didn't realize it was going on. I really don't want to see conversation split across threads; I'd rather have one conversation per link.

my 0.02 c


I saw a few posts from the account, but I didn't know any more.

I've been seeing an increasing number of duplicates myself, including entries using the same URL with an octothorpe added at the end -- a pretty obvious ploy, in my mind.

When HN started up, it was about the conversation. If someone else had already posted a link, great: People just joined in the conversation there if they had something to say. From my perspective, it saved the time of having to post it. And, in that I was often learning as much or more from the conversation on HN than from the links themselves, I was happy to find that conversation focused in one thread.

I'm not sure how, but I'd like to see the site steered back in that direction, if possible. (I have a few half-baked "ideas", but PG and crew have already demonstrated themselves to be more insightful than me -- in my own mind.)

I do think identifying the HN member behind DupDetector might be a benefit, to demonstrate their investment in the community and therefore, in my mind at least, credibility. Yes, I see the email address now, and I half remember off the top of my head whose domain that is. But I might have been a bit more supportive if I knew who was behind it and that they had an established, positive history with HN.

Anyway, just my 2¢; spend them before you need a wheelbarrow full.


I really enjoyed this.



