Come and help save Posterous from oblivion (jacquesmattheij.com)
305 points by jacquesm on March 12, 2013 | 95 comments



Whoa, instead of just lamenting the shutdown and how talent acquisitions are horrid, and how VCs won't do the right thing and companies like Twitter are killing innovation (you might believe these things to be true; that doesn't change my point), these guys (Jason Scott and the Archive Team and jacquesm) did something about it. And when Jacques couldn't do it himself he organized other people.

My favorite kind of leadership, and an example of the (double-edged) sword of instant communication: people can be rallied around time-sensitive causes, like SOPA or posterous shutting down.

I know this seems a little obvious, but it's striking how rare it actually seems to be. I'm curious why. Maybe it's just my perception and it's happening all the time. Certainly people are doing great things, but I'm curious why we haven't yet seen more specific, directed actions like this. Does it depend on relatively homogeneous communities like Reddit (SOPA) and HN (this)? If there is a proliferation of such communities, say subreddits or otherwise, could we expect this to happen more frequently? Do we want it to happen more frequently, or do we run the risk of, say, DHS running a pro-search campaign like China's 50 cent army? I'm just curious about why this seemed so striking to me.

This also fits in with an article I'd been meaning to read by Fukuyama[0] on social capital written 15 years ago.

The vice of modern democracy is to promote excessive individualism, that is, a preoccupation with one's private life and family, and an unwillingness to engage in public affairs. Americans combated this tendency towards excessive individualism by their propensity for voluntary association, which led them to form groups both trivial and important for all aspects of their lives.

Perhaps we'll see more of these spontaneous actions as voluntary associations are easier to make as the infrastructure that supports them (e.g. reddit) becomes more well known and fine-tuned.

[0] http://www.imf.org/external/pubs/ft/seminar/1999/reforms/fuk...


I find this particularly disturbing:

> I made an offer to continue to host posterous.com and all the stuff on it but never received an answer.

Has Twitter completely lost touch with the web community? Is it possible to get a response from them these days if you're not an advertiser?


As someone who spends reasonably large amounts of money on digital media buys, I've found that most companies like Twitter (though actually I have no experience with Twitter, except as a user) certainly are a lot more friendly when you have money to spend.

Fun story about Facebook: when I first wanted to start spending decent amounts with them (decent not huge - $10,000s/month) a few months ago I literally could not get in touch with a single person. Even completing contact forms I wasn't hearing back from them. A friend who used to work for an SEO/SM company told me a name of someone to contact, I used LinkedIn's InMail (a paid feature) to message him, and 24 hours later I had 3 account managers (including a technical expert, a media strategist and an overall account manager), who answer calls to their mobiles at any time of day. Now I have a nice route to get answers on any topic, not just paid advertising, thanks to my spending. (Was actually shocked about how hard it was initially to make contact and give them money, too.)


Jason Scott and the Archive Team should get all the credit here, they are the heroes.


Well I personally wouldn't have found out about this effort if not for this post of yours, so you certainly deserve some fraction of the credit in my book :)

(I have everything up and running according to your short guide and am working on the posterous project, though my contribution will probably be limited to under 100GB due to bandwidth cap considerations. Btw, how much disk space does posterous in its entirety take?)


We've currently saved about 2.1TB in total.

I can assure you that you will not use up very much space nor bandwidth :-) It's mostly text and all. Feel free to stop by the project IRC Channel which is #preposterus on EFNet for specific questions (if they aren't answered in the FAQ or HN thread already)

jacquesm's a member of the ArchiveTeam now by the way ;-) And so are you, since you're helpin'!


Archiving content is one thing, but to be able to use it again is another.

We are helping users move to tumblr & save their blogs. So far we have moved 500000+ posts (mostly small blogs). But we are not able to help all the users who have multiple images and videos in a post. Currently we support only single image & audio posts. If we can find a way to host their files separately on S3 permanently, then the move would be effortless & many would thank us.

There are so many users who don't understand what to do with their backup; for them, moving to wordpress is too complex a task.

I have reached out to Sachin Agarwal, but have yet to hear a positive reply. Tumblr also declined to host the files of Posterous blogs.

We are ready to collaborate with anyone who can host users' files permanently; if needed, users can pay you directly. We were also considering dropbox for hosting files. Moving to a new blog platform is a pain and we want to minimize it as much as possible.

Any advice is appreciated (http://justmigrate.com)


:)

Got all the answers I need here (though I might drop by the channel to say hi). It's actually quite funny to see the traffic graph spike as it makes a successful connection, then lulls for a while as certain pages keep giving 502 responses. I definitely see why you guys need IPs more than bandwidth!

Glad to help out :)


The HTTP 502 Bad Gateway errors are unfortunately due to load over at Posterous.

We do take those into consideration and back off exponentially until we hit 32 seconds - but continue retrying a set number of times.
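A minimal sketch of the kind of capped exponential backoff described above, assuming the delay starts at 1 second and doubles up to the 32-second ceiling. The function names, the starting delay, and the retry limit are illustrative assumptions, not the Warrior's actual code.

```shell
# Hypothetical helper: print the delay (in seconds) before retry number $1
# (0-based). The delay doubles each time and is capped at 32 seconds.
backoff_delay() {
    local attempt=$1
    local delay=1
    while [ "$attempt" -gt 0 ] && [ "$delay" -lt 32 ]; do
        delay=$(( delay * 2 ))
        attempt=$(( attempt - 1 ))
    done
    echo "$delay"
}

# Hypothetical wrapper: run a command until it succeeds, sleeping between
# failed attempts, and give up after $1 attempts (the limit is an assumption).
retry_with_backoff() {
    local max_tries=$1
    shift
    local attempt=0
    until "$@"; do
        attempt=$(( attempt + 1 ))
        if [ "$attempt" -ge "$max_tries" ]; then
            return 1
        fi
        sleep "$(backoff_delay $(( attempt - 1 )))"
    done
}

# Example (placeholder URL): curl --fail exits non-zero on a 502,
# so the wrapper waits 1s, 2s, 4s, ... up to 32s between tries.
# retry_with_backoff 10 curl --fail --silent --output page.html "http://example.posterous.com/"
```

Capping the delay keeps a long outage from pushing waits out indefinitely, while the doubling still eases pressure on a server that is already struggling.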

You're very welcome to join us on IRC. Thanks for being interested!


I'd say the reason it doesn't happen more is because it takes a ton of work and people are mostly self-involved. Activism is not what it once was (in the States anyway).


I wish more actions like this would happen, too. I'm glad that they at least happen. The technology to share work is there, and here's work that might be overshadowed at times, the BOINC project: http://boinc.berkeley.edu/ . Use your unused CPU cycles for the greater good.


This is a great effort. But it makes me furious that the founding team can be so disparaging to their users.

Sachin Agarwal, you used this community to enrich yourself and further your own career. In return, at the very least, you owe an explanation for why such a convoluted effort has to be made to get this content off the servers.

It's also exceptionally discourteous to ignore emails from upstanding community members, but perhaps you missed these. But I know for sure that you will read this thread, so I'd love to hear why a database dump can't be provided, or a couple of IPs whitelisted to just rip through a scrape.


How does one opt out? I've migrated my blog to my own system so if I want to make changes to an old post I can, but I won't be able to control my content that you are archiving which is a problem for me. It's actually rather bothersome to me. I figured I'd just go make sure my blog is deleted, but it could be archived by now.


Everything that goes online is kind of online for as long as the Internet Archive and Library of Congress decide it should be. But I started worrying about where my account info might end up (I should really worry about that before signing up for every dang thing...), so I just found this info on deleting one's account: http://posterous.uservoice.com/knowledgebase/articles/36544-...

Maybe you can get to it before the Wayback Machine does.


Thank you, this method worked and my account is permanently gone now. I still run my blog at another location so I can maintain my content which I prefer to have some level of control over - even if it isn't total.


That is the world wide web. You published it and anyone could save a copy of what you published (right-click, save as) at any time. This is no different surely?


This archive is taking one more step - right-click, save-as, "re-post to the web".


Sure, but it's still polite to observe robots.txt restrictions (or whatever analogous mechanism could work here)


I can also see your deleted reddit comment, but that still betrays some understanding of privacy.


And Facebook not actually deleting your photos when you click the "delete" button is ok too right? It's published forever?


I would presume that facebook photos are usually not public. There is quite a distance between the way private and public data should be handled.


It took me a bit of fiddling to get running on a spare Debian box, so I thought I'd share:

    $ sudo apt-get install virtualbox-ose
    $ wget http://archive.org/download/archiveteam-warrior/archiveteam-warrior-v2-20121008.ova
    $ tar xf archiveteam-warrior-v2-20121008.ova 
    $ VBoxManage import archiveteam-warrior-v2-20121008.ovf
    $ screen
    $ VBoxHeadless --vnc --startvm archiveteam-warrior-2
Hit Ctrl+A, D to exit screen and leave the VM running. From a non-headless box:

    $ ssh -L8001:localhost:8001 you@yourserver.com
Point a browser to http://localhost:8001

EDIT: Added 'screen' to steps. You're gonna wanna use screen.


You could use:

    VBoxManage startvm archiveteam-warrior-2 --type headless
which would replace VBoxHeadless and remove the requirement for screen. You will need another VBoxManage command to set up VNC if you desire to use that.


Thanks a lot for writing that up. I'll be sure to add it to the FAQ/wiki on http://www.archiveteam.org/index.php?title=Posterous


You're my new hero. I don't use it and I never used geocities but this is still awesome. Historians will thank you in years to come. Sociologists and such will praise what can be mined. And the lists go on...


Here's one of the best blogs about the Geocities data: http://contemporary-home-computing.org/1tb/


The screenshot archive that page mentions is fantastic: http://oneterabyteofkilobyteage.tumblr.com/

Brings back so many memories about the early days :)


Most of those pages are still alive, for instance:

http://www.reocities.com/Area51/vault/5058/

Just change the 'g' from geocities to the 'r' of reocities.
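For a whole list of old links, the same swap can be scripted. This is only a sketch: `to_reocities` is a made-up helper name, and it assumes the mirror keeps the original path unchanged, as in the example URL above.

```shell
# Hypothetical helper: rewrite a geocities.com URL to the reocities.com
# mirror by replacing the first occurrence of the domain name.
to_reocities() {
    printf '%s\n' "$1" | sed 's/geocities\.com/reocities.com/'
}

# Example:
# to_reocities "http://www.geocities.com/Area51/vault/5058/"
```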


I love that tumblr. I would also recommend http://twitter.com/wwwtxt or http://wwwtxt.tumblr.com/ -- short quotes from the early internet. Some are eerily insightful.


Oh dear lord I remember designing one of those in 8th grade computer class.


@jacquesm please send me a list of URLs to crawl (10M+), and I'll set up an 80legs job to do this. shion - at - 80legs - com.


The problem is that Posterous is hard to crawl. For one, they'll continuously and automatically ban your IPs, even if you rotate over a lot of them. Two, Posterous can't take all of the requests.

We (ArchiveTeam) have unfortunately made Posterous unresponsive multiple times. So please be careful not to bring it down completely if you're doing a solo effort.

Please also bear in mind that it's not just a matter of "chuck it into the downloader".


Also, please use a sensible format if you're crawling/archiving this.

We're using WARC (Web Archive) which is an official ISO File Format standard - which the Internet Archive's Wayback Machine can use. It's also a pretty good and nice format for archiving web pages in general.
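If you only want a WARC copy of your own blog, GNU wget (1.14 or later) can write one directly. The sketch below is an assumption about how a solo run might look, not the ArchiveTeam pipeline: the helper name, the 2-second wait, and the example URL are placeholders, and `DRY_RUN=1` only prints the command.

```shell
# Hypothetical helper: mirror one site into NAME.warc.gz using GNU wget.
# Usage: warc_mirror URL NAME
# Set DRY_RUN=1 to print the command instead of running it.
warc_mirror() {
    local url=$1
    local name=$2
    # --page-requisites pulls in images/CSS referenced by each page;
    # --wait=2 is a politeness delay (an assumed value, tune as needed);
    # --warc-cdx also writes an index of the records in the WARC.
    set -- wget --mirror --page-requisites --wait=2 \
        --warc-file="$name" --warc-cdx "$url"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example (placeholder URL):
# DRY_RUN=1 warc_mirror "http://example.posterous.com/" example
```

Images served from another hostname may still be skipped unless wget is also allowed to span hosts, so check the result before relying on it.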


Please ask on irc://efnet/#preposterus; that's where the archive team guys hang out. I don't have a list of seeds but they may be able to figure out a way in which you can put 80legs to good use.


For those of us who might want to donate cloud computing time but have weak/memory-limited laptops, is there an EC2 image that we could fire up for the cause?


There actually is an EC2 AMI available. I won't mention it here though. I'd rather you join us on #preposterus on EFnet over ol' school IRC.

I'm using caution because we don't want to sink Posterous, since it's a very fragile beast whose caches we're blowing away.

We're working on getting a FAQ section up on the project page. (http://www.archiveteam.org/index.php?title=Posterous)


The OVA image is a standard format. I'd imagine EC2 would support importing it.


Sadly, it doesn't seem to (ERROR: Unknown disk image format: OVA). I wouldn't know the first thing about how to reliably convert the format.


StackOverflow (well, ServerFault) has a little bit on it: http://serverfault.com/questions/387049/how-to-run-an-ova-ov...

Rather inconclusive, though.
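One thing that may help whoever tries the conversion: the .ova is a plain tar archive (the Debian write-up in this thread unpacks it with `tar xf`), so the disk image inside can be located without VirtualBox. A small sketch, with `ova_disks` as a made-up helper name; whether EC2 accepts the result is not something this thread establishes.

```shell
# Hypothetical helper: list the VMDK disk images packed inside an OVA.
ova_disks() {
    tar tf "$1" | grep -i '\.vmdk$'
}

# Example:
# ova_disks archiveteam-warrior-v2-20121008.ova
#
# Assumption: if qemu-img is installed, a VMDK can usually be converted
# to a raw image with something like:
# qemu-img convert -f vmdk -O raw <disk>.vmdk disk.raw
```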


In the long term, web startups will be dead because it will be more and more obvious that it's impossible to trust your data to for-profit companies that simply cannot keep your interests in mind no matter what promises they make. Everything important should be on open-source, community-run, nonprofit platforms.


Or be hosted with a company that has a sustainable business model from the get-go.


"The long run is a misleading guide to current affairs. In the long run we are all dead." -- John Maynard Keynes


Well, this is exciting: we've just now passed the halfway point!

items: 1470951 done, 1467514 to do

[url to leaderboard deleted upon request]

(NB: it does look like this thing has been underway since Feb 27)
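For anyone checking the "halfway" arithmetic from the two counters quoted above, here is a small sketch using integer math only. `percent_done` is a made-up name, and the result is only as good as the counters fed into it.

```shell
# Hypothetical helper: percentage complete, truncated to two decimals.
# Usage: percent_done DONE TODO
percent_done() {
    local done_items=$1
    local todo_items=$2
    # Work in hundredths of a percent to stay in integer arithmetic.
    local bp=$(( done_items * 10000 / (done_items + todo_items) ))
    printf '%d.%02d\n' $(( bp / 100 )) $(( bp % 100 ))
}

percent_done 1470951 1467514   # prints 50.05
```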


Please don't link to the leaderboard (the link in parent post), it is a bit loaded and fragile. That is used for all the ArchiveTeam Warriors.

Also, unfortunately, those numbers aren't totally up to scratch. Please hang on, we'll have a FAQ describing that in a moment.


Thanks for the heads up. I've removed the link.


Thanks for doing so. I'll be sure to include your thoughts from your post (re our goals) in our FAQ by the way.


SIR! (or ma'am) Permission to help scale out the leaderboard! (I'll hop on EFnet to ask)


Thanks for the interest! You're of course very welcome to help out with that. I just wanted to let you know that the source for it is available at https://github.com/ArchiveTeam/universal-tracker

You're very welcome to join our IRC channels!


Here's a link to VMWare's OVF converter tool if you use VMWare instead of VBox: http://communities.vmware.com/community/vmtn/server/vsphere/...


The VMware Player (and Workstation, I'm assuming) will import it automatically. The only issue is that you have to choose a different connection for the second virtual disk. Not sure why that's a problem but moving it works fine.


Why is it so important to save all this information? Seems like projects like archive are just contributing to more information clutter. We generate more information every single day, what makes a few million people's blog posts so damn important? We can't just keep saving shit forever, though the progression of technology I guess makes it easier and easier, but eventually we're going to hit a limit.


> We can't just keep saving shit forever, though the progression of technology I guess makes it easier and easier, but eventually we're going to hit a limit.

Yes we can. Through the 'progression of technology' as you put it, I believe we will never come close to any limit. Just look at how much data is stored in DNA, and you'll see that our current data storage technologies are relatively primitive.


How are we to know what's important and not? Surely, there's interesting content available at Posterous. Just to mention an example, CloudFlare's blog is hosted there.

Sure, there are plenty of spam accounts and crappy content - but that might prove valuable in the future. Maybe someone would study what kind of content we as a race were contributing to that kind of platform, in this day and age - maybe someone is researching the automated spam.

This is not really taking up all that much space, in this day and age. There's around 2.2 TB downloaded - it's mostly text and images. That's half a single 4TB drive. Not really storage capacity to fight about in my opinion.


Yeah I guess you're right about the storage piece, however, I don't think it's useful at all. We always live in the moment of "right now is the most important moment in history", when really most of the content we're saving is junk, and, as more and more of it compounds, more and more junk will just accumulate on the pile. I'd assume that 90% of what's in posterous is worthless, the other 10% is just people reiterating good points, but the key word is _re_iterating. Do we really need tens, then hundreds, then thousands of years of files of things people said on personal blogs in the past? Absolutely not.


Ethically this is far better behaviour than those who are shutting down the service and there is altruism rather than profit motivation BUT legally isn't this epic scale copyright infringement of millions of works created by thousands of people?


It's been discussed. Because it was published in a public forum, fair use is certainly a consideration. Is it legal to copy stuff from websites without permission? U.S. courts haven’t made a clear determination. Andy Sellars, a staff attorney at the Citizen Media Law Project, says he would argue that it counts as “fair use” under copyright law. However, he notes that the Archive Team’s torrents don’t offer a mechanism for copyright holders to demand that certain material be taken down, which could hurt its case in a court. http://www.technologyreview.com/featuredstory/426434/fire-in...


My guess is that Posterous, or whoever is ultimately responsible, should be giving the data to archive.org directly. This is the only sensible thing to do. Of course, they would have to clear copyrights first.


I love it. This is creating a benign botnet.


Benign merely denotes non-maliciousness – I would call this a benevolent botnet.


I think I saw a comment somewhere about your IP being banned after an hour... anything we should do to avoid this? I'd hate to be scanning for 15 minutes only to be banned and not be able to help anymore.


The bans nominally last 24 hours. There was a point where there were so many IPs running (from AWS servers) they overflowed the ban list and the bans were shorter!


I didn't even think about AWS... someone should put together an article on how to set them up... I'd be happy to launch some instances for the cause


We've had a few guys using Amazon Web Services and continuously rotating IPs/setting up new instances. Unfortunately, the last time we went too 'hard' on them, effectively making Posterous unavailable.

We're thinking about this right now. Feel free to hang around #preposterus on EFNet for updates.


I got banned after doing about 4.7 G, if that is done a few thousand times because of this article it would make a huge difference. And they eventually would have to lift the bans.


Everything does however count. And if you run a single instance on your home/work connection - that'll be a little.

Some is better than none :-)


Depending on your ISP, you can get a new ip by changing your mac address.


Awesome! Love the spirit of the effort, running a Warrior now for kicks. Just curious, isn't it possible for someone at Google to press a button and make this happen? :)


Most likely, yes. They probably have most of Posterous cached already.

It's however very unlikely that they'd just release that data. They probably have their own archive format and things as well..


Are you guys archiving photos / images too? When I just wget the site those don't generally come down. I don't know if they're being loaded via JavaScript or a plugin.

To preserve my own blog I just saved it as a pdf with wkhtmltopdf, but it'd be nice to have a full HTML version.

http://blog.andrewcantino.com/blog/2013/03/12/archive-a-pdf-...


This event, and many like it, only remind me that the modern Internet is about which groups you belong to, not who you are as an individual.

Websites owned and operated by individuals are now vanishingly rare, while aggregations of people -- Facebook, Twitter, et al. -- have become the norm.

Example individual website: http://arachnoid.com/


Yes, and in my mind, it's a bit unfortunate to have it all centralized to a few major portals.

Especially since nothing lasts forever, like we clearly see here. When a giant falls..


Torrent for the VirtualBox appliance:

magnet:?xt=urn:btih:b1b3df637bf9bb78f32e1667944535b60c9d37c1&dn=archiveteam-warrior-v2-20121008.ova


c8c86a1e225bb28ccdd229f6f27fe39ad65c9831

Sha1 of image, in case you want to verify.
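A quick way to run that check on a Linux box, assuming `sha1sum` from coreutils is available (on other systems the tool name may differ). `verify_sha1` is a made-up helper name; the hash and file name in the usage line are the ones posted above.

```shell
# Hypothetical helper: succeed only if FILE's SHA-1 matches EXPECTED.
# Usage: verify_sha1 FILE EXPECTED
verify_sha1() {
    local file=$1
    local expected=$2
    local actual
    # sha1sum prints "<hash>  <filename>"; keep only the hash.
    actual=$(sha1sum "$file" | cut -d' ' -f1)
    [ "$actual" = "$expected" ]
}

# Example:
# verify_sha1 archiveteam-warrior-v2-20121008.ova \
#     c8c86a1e225bb28ccdd229f6f27fe39ad65c9831 && echo "checksum OK"
```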


Something like this would be cool as a chrome app. I'm not sure how complicated the job is, but if it's just a matter of hitting apis and queueing data for upload, it should be possible with local storage. You could even use subdomains to defeat storage limits.


"I made an offer to continue to host posterous.com and all the stuff on it but never received an answer."

Have you tried making a public appeal (say, over twitter) to the Posterous owners? They may be able to talk with Twitter and arrange to capture the content directly


I will definitely make another one but it was here on HN to one of the former posterous owners.


Stoked to help with this. Will put my new BT fibre to the test.


Bandwidth in itself is not per se needed in this project. IPs, however, are very useful.

Feel free to stop by our project IRC channel at #preposterus (Note: It's not spelled pre-posterous) or our main channel at #archiveteam on EFNet


"Bandwidth in itself is not per se needed"

I guess so; I have a 200Mbit fiber connection but surprisingly little is happening.

I started the instance 10 minutes ago and it has only uploaded 20MB. I've set 'concurrent items' and 'rsync threads' to the max.


We're rate limiting how many items/users get handed out, because Posterous is very fragile and we're blowing away their front-end caches, which they rely on heavily.

Because of this, we practically hit the back end every time (as do other users of Posterous, because we blow the cache away), which makes Posterous very slow.

We ran a few hundred more threads earlier, which unfortunately made Posterous completely unavailable.


BT fibre sounds pretty sweet :) I'll throw in my crummy broadband as well ;)


That 'crummy broadband' is very much welcome, and so are you.

In this effort, bandwidth in itself is not that important. Feel free to read some of my other posts in this thread regarding bandwidth and/or come join us at IRC (#preposterus at EFNet)


The large "Run your own ArchiveTeam Warrior!" link is currently just linking to a fragment identifier. I think you'd like to change it.


Good point, thank you, fixed now.


Stop being a necromancer and resurrecting a dead service. There's got to be a good reason why they killed Posterous, so let it die. Let it go; stop holding onto the past.


Please read the intro to the blog post and think again. I've seen these comments by the boatload when we took on saving geocities and you are simply wrong.


Is there a place I can find this Geocities archive? I apparently missed something I wanted to keep when I archived my own stuff, and I'm wondering if I can find it again.



This isn't resurrecting a dead service nor is it about that. This is archiving it and making the previously public data stay public.

Why would someone do that?

Well, a lot of people have poured their hearts out and made content that lives on Posterous. They might miss the "sunsetting" (asshole term) of the service and lose their content.

Think of all the dead links that'll be around after the service has died. Wouldn't you want to be able to read something great that was linked from HN a year ago? From the Wayback Machine or similar?

There's plenty of reasons to archive the web and the content that goes up (and down).


Twitter made the decision to kill Posterous when they bought it. It wasn't the founders' decision, or the users' decision, or the visitors' decision.


I've never visited posterous.com until today, but I fail to see why Twitter would buy it and kill it.. can't be that horrific. It's like buying a house and setting it on fire. What am I missing? Are they just killing off competition?


They probably bought it for the people. So there would be no one left to work on the Posterous site if everyone is working on Twitter instead. I know they didn't buy it for the infrastructure since it's hosted on Rackspace and Twitter is an AWS shop.

https://blog.posterous.com/thanks-from-posterous


But it all seems very volatile as people tend not to stay in one place for too long. Oh well ...


In my opinion, this is a staff acquisition. Buy a company to get the staff free and on board, ditch the product/project.

Besides, I think Posterous would be a good complement to 140 char-twitter. I don't even consider them as competition to each other..


You'd be preserving a piece of history with many interesting pieces of thought on it.



