They currently have 150B posts (https://www.tumblr.com/about) so I expect that to be quite challenging. If you seriously make any progress though, it'd be cool to see.
Maybe 1 in 100 of those (being incredibly generous) are actually unique posts rather than reblogs of someone else's content, so maybe it's not such a hard task...
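To put a number on that guess, here is a quick back-of-envelope sketch. The 150B total is the figure cited above; the 1-in-100 ratio is just the guess from this comment, not a measured value, and the function name is made up for illustration.

```python
# Back-of-envelope only: the 1-in-100 ratio is a guess, not a measurement.
TOTAL_POSTS = 150_000_000_000  # 150B posts, the figure cited from tumblr.com/about
ONE_IN = 100                   # assumed: 1 in 100 posts is original content


def estimate_unique_posts(total_posts: int, one_in: int) -> int:
    """Return the estimated number of original (non-reblog) posts."""
    return total_posts // one_in


unique_posts = estimate_unique_posts(TOTAL_POSTS, ONE_IN)
print(f"{unique_posts:,} unique posts")  # 1,500,000,000 unique posts
```

So even under that generous assumption there would still be about 1.5 billion posts to fetch.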
It's fairly common for commentary to be added to a reblog via tags rather than after the quoted post, so even reblogs without commentary may need saving.
No, the comment section on an average tumblr page:
X has shared this
Y has shared this
Z has shared this
....
Actual commentary? Not so much.
On a side note, I tried going back there to see if the situation has improved, and somehow the interface is even worse now. The blog I was trying to read would only appear as a slide-in on the side and would disappear at the drop of a hat. I don't even understand how you are supposed to use it now.
Yes. The meme aligns with my own experience of tumblr, so I'm inclined to believe it over contrary anecdotes (of course a more rigorous study would be a different story)
Step 1: Hack into a supercomputing center with at least a 1Gbps line.
Step 2: Download all the sites in massively parallel fashion using clustered nodes, compress the data, and store the result in a high-performance, clustered filesystem.
Step 3: Move it off of there when traffic is low or overnight if system doesn't go offline overnight.
It's in the supercomputing center. How about I modify it so that a filtering step is run, deleting everything that doesn't match the desired image features?
> Step 3: Move it off of there when traffic is low or overnight if system doesn't go offline overnight.
Where are you moving it to? You think that even if you manage to hack into a "supercomputing center" that nobody's going to notice 1PB of storage filled with GIFs?
Having worked in a supercomputing center for a detector at the LHC, yeah, you could probably get away with storing a few hundred TBs for a short period of time without anyone noticing or caring. A whole petabyte might be pushing it.
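For scale, here is a rough sketch of how long it would take just to move that much data over the 1Gbps line from Step 1. It assumes decimal units (1 PB = 10^15 bytes), a fully saturated link, and no protocol overhead, so real numbers would be worse. The function name is hypothetical.

```python
# Rough transfer-time estimate; assumes a saturated link and zero overhead.
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15          # bytes, decimal units (assumption)
GIGABIT_PER_SEC = 10**9    # bits per second, the 1Gbps line from Step 1


def transfer_days(size_bytes: int, link_bits_per_sec: int) -> float:
    """Days needed to move size_bytes over a link of the given speed."""
    seconds = size_bytes * 8 / link_bits_per_sec
    return seconds / SECONDS_PER_DAY


days = transfer_days(PETABYTE, GIGABIT_PER_SEC)
print(f"{days:.1f} days")  # 92.6 days
```

A full petabyte over a single 1Gbps link works out to about three months of continuous transfer, so "a short period of time" would need a much fatter pipe or a much smaller dataset.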
And I learned about it from you people. All of them said security was lax. Most of them said they personally were using the supercomputer for their own stuff at some point.
People working in and around HPC centers. Obviously. :)
And no, it's both an accounting and a security issue. One guy I know, who does security for ASICs and stole HPC time in the past, did it by modifying the accounting system to not show his jobs. It was easy, as it wasn't designed to stop accounting fraud by hackers.