How we got 100,000 visitors without noticing (historio.us)
80 points by stavros on Oct 3, 2010 | 51 comments



Eek, not to rain on the parade, but this is not good, historious team. Google's duplicate content filter is generally accurate, but in this case it penalized the site that originated the content and ranked yours higher, because it assumed you were the owner of the content, i.e. the page about sending free SMSes.

Google will quickly correct this error and you'll find the traffic on that page drops to nothing. You may also find other duplicate pages on your site penalized in the same way. You may also find your site penalized by Google for essentially screen-scraping sites and copying their content verbatim and in its entirety.

I'd recommend you set up a robots.txt to block Google from indexing identical pages on your site. Copying content verbatim and republishing it in its entirety is not a good SEO strategy unless you have thousands of throw-away domains and wear a black hat.
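As a sketch of what I mean, assuming the cached copies all live under a path like /cached/ on the cache subdomain (as in the example link downthread), a robots.txt at that subdomain's root would just be:

    # robots.txt served from the cache subdomain
    User-agent: *
    Disallow: /cached/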

Sorry about the negative message; I'm sure this was very exciting for you guys, but I don't think it's sustainable.

Perhaps an alternative is to have your users highlight the paragraph or snippet on each page they found interesting and archive that instead, then publish those snippets on your site. Generally I've found that republishing paragraph-size chunks of text is OK with Google and will net you decent SEO traffic. It worked for my job search engine when we republished job descriptions and limited them to the first 400 characters. You can also mash the paragraphs up into pages with multiple chunks of text, e.g. all the chunks a particular user (or set of users) found interesting, or chunks grouped by date or location. That will give you the best of both worlds: lots of content for SEO and no duplicate content penalty.

Best of luck!!


Or set the canonical address of your cached pages to the real page.


Do you mean the "base" tag? We already do set that (admittedly, after this happened and because of it).


No, he meant the rel="canonical" links.
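For the record, the two tags being discussed look like this in the <head> of the cached copy (http://example.com/free-sms/ is just a stand-in for the original URL):

    <!-- the base tag historious already sets: relative links and assets resolve against the original site -->
    <base href="http://example.com/free-sms/">

    <!-- the suggestion here: tell search engines which URL is the authoritative one -->
    <link rel="canonical" href="http://example.com/free-sms/">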


This does not work across domains, guys. (Among other reasons, it would turn injection attacks into $X million bugs for some companies.)


The Google blog says it does...


Oh, that sounds easy enough to do, we'll implement it right away (it should be live in a few minutes). Thanks for the tip, we didn't know about this duplicate content situation...

EDIT: Added!



Thank you, added!


I honestly didn't know this existed... thanks!


Ouch, we didn't know that :/ Thank you, I'll block robots from the cache right away!


For the record, something I forgot to mention: all the ads on a cached page load from the original site, and all the links lead back to it. So the only way the original site is impacted is that it doesn't have to bear the load of 50k visits. Otherwise, it gets all the ad revenue, signups, links, etc.

Essentially, we're the only ones who don't gain anything from this.


The short version, if you don't want to read the entire post, is summarized in this snippet:

"By some algorithmic oddity, the historious cache actually ranked higher than the site (which turned out to be the most popular free sms service in Brazil), so anyone who searched for "free sms" in Brazil ended up on the historious page!"

In short, Google ranked an archive page (historious) higher than the original content. Google corrected this problem, presumably automatically, 3 days after it happened.


They could sue you for copyright infringement if they wanted.

And allowing Google to index your "stolen" content pages is just outrageous. You don't own that content.


Then we could sue Google for copyright infringement for caching our pages, I guess... Why would we not allow Google to cache it? Each cached page has a great big box on top saying that this is the historious version of the cache and linking to the original site...

Example: http://cache.historious.net/cached/515865/


There's a huge difference between what Google is doing and what you're doing.

1) Google is caching pages for a specific purpose and ensuring that they aren't cached/scraped by others:

http://webcache.googleusercontent.com/robots.txt

By not excluding robots, you're opening yourself to all kinds of situations where you are responsible for draining revenue from the owner of the content, which leaves you liable to lawsuits. By contrast, the way that Google caches content and their rules surrounding it do not generally harm the copyright owner.

2) Google honors robots.txt rules, noarchive meta tags, and other indications that the author doesn't want the page to be cached. Is historious doing the same?
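(For reference, the per-page version of that opt-out is a single meta tag in the page's head; the page can still be indexed, but crawlers that honour it won't show a cached copy:)

    <meta name="robots" content="noarchive">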


1) We do exclude robots now, yes. 2) historious doesn't spider websites; we only save the pages our users give us. It's the same as a user deciding to make a backup of a webpage on their computer...


"It's the same as a user deciding to make a backup of a webpage on their computer..."

... and then publishing it on the Internet.

(This is not meant to be snarky or to imply opposition to your product at all. I think there is a meaningful difference between saving to a computer and saving to a web-accessible, apparently globally readable website.)


Isn't it a user's responsibility to obey copyright restrictions in this case, given that we never publish content unless the user does it? It's basically the same situation as hosting a website: if you upload and publish a copyrighted page, is the host responsible?


In my opinion, those two cases are not similar. I doubt that this type of automatic caching/publishing would have any protection under the DMCA safe-harbor laws unless you're making it clear to users what they're doing (I'm not a user of the service, so maybe you already are).

If I understand correctly, the users of your site are simply bookmarking pages. You are then caching them, storing them, and publishing them with world-readable URLs. There are many ways that you could provide the same experience to the user without making the cached page publicly accessible.

If you were to give users the option to make specific bookmarks world-readable - and you provided a disclaimer explaining that they should not make copyrighted material world-readable - then it might be different. But that's probably something you should discuss with an attorney.


Ah, no, our users cache pages, but if they want the cache world-readable, they need to explicitly click the "publish" link.

Thank you for the information, I'll talk to our lawyer about it just to be safe.


I wanted to let you know that I didn't mean to disappear without responding, but that sounddust expressed what I was thinking already so I don't have much to add. I did not know that the world-readable bit was opt-in; I think that's a good start, and I'm glad you're getting legal advice on this topic.


Ah, that's the nature of online commenting! Thank you for your concern, we'll talk to a lawyer to clarify this (perhaps in the ToS).

Thanks again!


I thought you were European? The DMCA safe-harbor laws don't apply to you.


I am, I'm in the UK. I know the DMCA safe-harbor laws don't apply, but neither does the DMCA. Copyright law is similar everywhere, however, so I just wanted to get an idea. I have a lawyer researching this right now, though. Thanks again!


That's not true.

Your copyright law is similar to the US or Israeli one, not as strict as the main European one, which is based on the Napoleonic code and is very, very strict.


I see, thank you. Our lawyer advised us to add a clause to the ToS and we, of course, take action against copyright infringement.

Thanks for the feedback!


If someone uses your service to republish a few dozen News Corp pages and then sends them the link, I reckon you'll be in court before sunset.

Edit: I think it's a great idea though to save bookmarked content, just not to republish it without permission.


Indeed, what Google is doing (caching a page and showing it to the user) is copyright infringement in some countries (e.g. Belgium).

There hasn't been any case against them, but theoretically someone could sue them. Who would win is a different story.


Hmm, that's interesting... Another difference is that Google is doing it by itself, whereas historious only stores pages that users specify and only publishes them when the user specifies it.

We'll have a chat with our lawyer regardless, thank you!


Google honours robots.txt.


This might be slightly off topic but if you don't mind could you share the specs of your Varnish server? Everything I have read about Varnish says that it will run extremely well on pretty much any decent hardware. But I am still unsure how much power I should be providing it.


How many hits per day are you expecting?

We serve up about 2000 requests per instance per second, 8 instances on an 8-core machine with 32 GB of RAM.


Why run separate instances on the same box?


To get over a bunch of limits.


What limits?


Specifically, the 65K file descriptor limit; I can't seem to get around that one in spite of all the limits having been raised appropriately.

So I ended up running multiple instances. I've discussed this with a few other HN'ers and we agree that it shouldn't be necessary but to date I have no solution for this.
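For anyone else hitting this, these are the knobs I mean (the numbers are illustrative, not a recommendation), plus the workaround:

    # raising the usual limits (this alone didn't solve it for me):
    ulimit -n 131072                 # per-process fd limit in the shell that launches varnishd
    sysctl -w fs.file-max=500000     # kernel-wide cap
    # plus matching 'nofile' entries in /etc/security/limits.conf

    # the workaround: several named varnishd instances, one per core, each on its
    # own port, behind whatever balancer already sits in front:
    varnishd -n cache1 -a :6081 -f /etc/varnish/default.vcl -s malloc,3G
    varnishd -n cache2 -a :6082 -f /etc/varnish/default.vcl -s malloc,3G
    # ...each instance gets its own file descriptor budget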


Thanks for the numbers, they are very helpful.

Probably <500 cache hits per second (a large portion of the content isn't cacheable).

So I should be able to serve that from a fairly low spec machine.


I should probably add that those boxes have an improbably low load. 500 hits/second out of Varnish? Do you have a spare laptop lying around? An old one would do ;)

One good way to test is to replay a day's worth of log files.
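Something like this works as a crude replay, assuming a combined-format access log and a test instance on localhost:6081 (paths and numbers are illustrative):

    # pull the request paths out of yesterday's log and aim them at the test box
    awk '{ print "http://localhost:6081" $7 }' access.log > urls.txt
    # siege: -f file of URLs, -c concurrency, -b benchmark mode (no delays)
    siege -f urls.txt -c 50 -b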


This for ww?


I wish :)

No, it's actually for a competitor; I helped put their CDN together.

WW is only about 100K uniques per day; that site is pushing very close to 2 million now.

For ww.com I still use a very old chunk of software called 'yawwws'; it takes care of the stills in the index and the index pages themselves. Hopefully we'll be able to phase it out soon for a Yii/memcached/Varnish-driven combo.


As jaquesm said, with your expected visits, you will probably see no load at all, so don't worry about it. You definitely don't need to give it more power, unless you're running it on a cellphone...


This is exactly what should happen when your low-traffic site suddenly gets 50k visits in a day: Absolutely Nothing. Kudos to the historio.us guys for building something that can actually handle a little spike in traffic without falling over.

We see so many sites come through here that are showing 500 errors by the time they get halfway up the front page that you'd think nobody knew how to do this stuff anymore.

Great job of ticking off the basics, guys. Building on top of that foundation, you shouldn't have any trouble scaling out when this sort of traffic starts becoming a daily occurrence.


Thank you! To give credit where it's due, buro9 from HN urged us to implement the Varnish caching feature for cached pages, so we had it written but disabled, because we didn't think it was going to get that much traffic. We hadn't turned it on for the first 30k visitors and the server was still doing great, but when we saw the traffic we switched it on and the load dropped to nothing.

Serious amounts of love for Varnish here.
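For the curious, the rule itself is nothing fancy; roughly this, in Varnish 2.1-era VCL (a sketch rather than our exact config, with /cached/ standing in for the public cache URLs):

    sub vcl_recv {
        if (req.url ~ "^/cached/") {
            unset req.http.Cookie;   # published caches are anonymous, cookies only bust the cache
            return (lookup);
        }
    }

    sub vcl_fetch {
        if (req.url ~ "^/cached/") {
            set beresp.ttl = 1h;     # let Varnish serve the copy for an hour before re-fetching
            return (deliver);
        }
    }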


I predict a lot of sites featuring the words 'free sms' will be hammered out within hours of this article's posting.


We didn't know one of our users had publicised it, though, and we didn't know Google would even see it...


In Portuguese?


Nice discovery. I read somewhere on blackhat forums about Google search traffic exploits. I didn't understand what they were talking about until I read your post today.


I just hit the "historify!" bookmarklet on the varnish page. :)


Your actions have placed all of us in great danger.


This makes me want to go and try out Varnish. I have heard great things, but never gotten around to setting it up.



