Hacker News

They're also forcing all adult-oriented content on Tumblr (which isn't limited to porn, although porn makes up the majority of their userbase) behind a logged-in-users-only wall, blocking it from non-Tumblr users and, in turn, from external search engine results.

https://techcrunch.com/2017/06/20/tumblr-rolls-out-new-conte...




Oi, I suppose archiving Tumblr is this weekend's project.


They currently have 150B posts (https://www.tumblr.com/about) so I expect that to be quite challenging. If you seriously make any progress though, it'd be cool to see.


Maybe 1:100 of those (being incredibly generous) are actually unique posts and not someone else reposting something from someone else, so maybe it's not such a hard task...
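For scale, here is a rough sketch of that estimate in Python. The 150B figure is the post count quoted above, the 1-in-100 ratio is the guess from this comment rather than a measured number, and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate: how many posts are left if only
# one in N posts is original rather than a reblog.
TOTAL_POSTS = 150_000_000_000  # 150B, the figure quoted in the thread
ONE_IN = 100                   # guessed ratio from the thread, not measured


def estimated_unique_posts(total_posts: int, one_in: int) -> int:
    """Return the number of unique posts if 1 in `one_in` is original."""
    return total_posts // one_in


if __name__ == "__main__":
    unique = estimated_unique_posts(TOTAL_POSTS, ONE_IN)
    print(f"{unique:,} unique posts")
```

Even under that generous guess, the result is still on the order of a billion and a half posts.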


It's fairly common for commentary to be added to a reblog via tags rather than after the quoted post, so even reblogs without commentary may need saving.


This assumes there is commentary on Tumblr worth saving.


I'm guessing you've bought into the meme that Tumblr is literally nothing but left-wing politics.


No, the comment section on an average tumblr page:

X has shared this Y has shared this Z has shared this ....

Actual commentary? Not so much.

On a side note I tried going back there to see if the situation has improved and somehow the interface is even worse now. The blog I was trying to read would only appear as a slide in on the side and would disappear at the drop of a hat. I don't even understand how you are supposed to use it now.


Are you saying that meme is inaccurate?


Are you implying it's not?


Yes. The meme aligns with my own experience of tumblr, so I'm inclined to believe it over contrary anecdotes (of course a more rigorous study would be a different story)


That and porn, and Steven Universe fans.


Step 1: Hack into a supercomputing center with at least a 1Gbps line.

Step 2: Download all the sites in massively parallel fashion using clustered nodes, compress the result, and store the data in a high-performance, clustered filesystem.

Step 3: Move it off of there when traffic is low, or overnight if the system doesn't go offline overnight.


Step 0: Acquire at least ~1PB of storage to store all the data.
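To put the 1PB and 1Gbps figures side by side, here is a quick sketch of the transfer time. It assumes decimal units (1PB = 10^15 bytes, 1Gbps = 10^9 bits per second) and a fully saturated line with no protocol overhead, which is optimistic. The function name is just for illustration.

```python
# How long does it take to move a given number of bytes over a link,
# assuming the link is fully saturated the whole time?
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Return the transfer time in days for `size_bytes` over the link."""
    seconds = size_bytes * 8 / link_bits_per_second
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    one_petabyte = 1e15  # decimal petabyte, in bytes
    one_gbps = 1e9       # bits per second
    print(f"{transfer_days(one_petabyte, one_gbps):.1f} days")
```

Under those assumptions a single 1Gbps line needs roughly three months for a petabyte, so "at least 1Gbps" would really need to be a good deal more.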


I have over 1PB of storage at my disposal.


Business or personal? It's doable but it's a lot of money to buy all those drives.


Personal.


$6k for 600T fully assembled backblaze storage pod: https://www.backuppods.com (no affiliation, just pointing out that ~$12k isn't a massive amount).


Note that hard drives are not included.

1000TB / 8TB = 125 HDDs

125 * $200 = $25k
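The same arithmetic as a small Python sketch, in case anyone wants to plug in other drive sizes or prices. The 8TB and $200 figures come from the lines above; the function names are made up for illustration, and pod chassis cost is not included.

```python
import math


def drives_needed(total_tb: float, drive_tb: float) -> int:
    """Number of drives needed to hold total_tb, rounded up."""
    return math.ceil(total_tb / drive_tb)


def drive_cost(total_tb: float, drive_tb: float, price_per_drive: float) -> float:
    """Total cost of the drives alone (no chassis, no redundancy)."""
    return drives_needed(total_tb, drive_tb) * price_per_drive


if __name__ == "__main__":
    print(drives_needed(1000, 8))    # drives for 1000TB at 8TB each
    print(drive_cost(1000, 8, 200))  # cost at $200 per drive
```

Note that this ignores redundancy, so the real drive count for a usable petabyte would be higher.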


Aww! Damn my quick posting (and too good to be true!). Thanks.


It's in the supercomputing center. How about I modify it so that a filtering step is run, deleting everything that doesn't match the desired image features?


> Step 3: Move it off of there when traffic is low or overnight if system doesn't go offline overnight.

Where are you moving it to? You think that even if you manage to hack into a "supercomputing center" that nobody's going to notice 1PB of storage filled with GIFs?


Having worked in a supercomputing center for a detector at the LHC, yeah, you could probably get away with storing a few hundred TBs for a short period of time without anyone noticing or caring. A whole petabyte might be pushing it.


And I learned about it from you people. All of them said security was lax. Most of them said they personally were using the supercomputer for their own stuff at some point.


What's this "you people", bub? :) In this case, it's not a security issue; it's a resource accounting issue.


People working in and around HPC centers. Obviously. :)

And no, it's both an accounting and a security issue. One guy I know, who does security in ASICs and stole HPC time in the past, did it by modifying the accounting system to not show his jobs. It was easy, as it wasn't designed to stop accounting fraud by hackers.



https://github.com/fake-name/xA-Scraper already supports tumblr. Their API has annoying limits, though.

Note: It's a project of mine.


I use grab-site, which is a spiritual successor to ArchiveTeam's ArchiveBot.

I have to be careful about giving my scraping methods away, as I've had digital targets attempt countermeasures.


I've had similar problems. At this point, I basically run a botnet, albeit one I pay for. I have a rolling swarm of DigitalOcean and Vultr VMs that act as RPC clients for a custom system I wrote.


We should collaborate.



Is archiving a bunch of porn really safe legally? I always had the impression archiving a large number of copyrighted images wasn't.


It goes into cold storage for a later date.


Thanks. :)


Porn makes up the majority of Tumblr's userbase?


Tumblr has a huge porn community, and always has. I don't have actual statistics, but it's huge. In general, people on tumblr might have a regular, personal blog, and then a side blog where they reblog and/or comment on/add captions to pornographic images, gifs and videos. In that sense, they can sort of curate porn that interests and appeals to them. The addition of this personal touch and intellectual component appeals to a lot of people, including a lot of women. But there are also various sex advice blogs, etc., which will be affected by this change as well, artists who do non-pornographic nude artwork, and potentially even various LGBT communities where sex can at times be a topic.

There are various levels of NSFW-flagging on Tumblr: users flagging their own site NSFW, Tumblr marking a blog NSFW and the user having no way to change that, or enough individual posts being flagged NSFW by users and AI that Tumblr decides to call the entire blog NSFW.


Maybe your vision of tumblr's userbase is biased by the kind of people you follow. Personally, I'm on the "computer science/glitch art/study/science" part of tumblr and I don't see much NSFW content (Apart from some bots who follow me which I block automatically).


The areas are kept quite separate, but that doesn't mean that there isn't still a huge porn userbase. Many people on tumblr have multiple blogs on different subjects, and deliberately keep their porn blog(s) separate from the rest.


Sounds a lot like what happened with LiveJournal. About 10-15 years ago LJ was where all the LGBT and niche erotic interests were, like fanfic. It's been bought by a Russian company and they've been turning the screws on the LGBT users for years.


Probably not. All the stats I've seen show that it's something like 20-25% of total consumption, so certainly a lot, but not a majority.

There are reports that upwards of 80% of Tumblr users have been "exposed to porn" occasionally or even accidentally, but that doesn't mean that the majority of Tumblr is porn.


It's an image site on the internet that doesn't outright ban porn, so yes.


This is a direction they headed in shortly after Yahoo acquired them, it's not a new thing with the Verizon acquisition. Blogs marked "adult" and NSFW posts already won't show up in search engines[edit: I stand corrected], by default in searches on the website, or searches on the mobile app (even when logged in).


That is not true at all. All Tumblr blogs, including NSFW blogs, have always been viewable and indexable by external search engines through their public URL (blogname.tumblr.com) unless flagged as private/hidden (a choice of the blog owner). They have only been hidden from Tumblr's search engine if the user has their account set to not see adult content. You can verify this by searching for "tumblr porn" on Google. Five of the results on the first page are porn blogs on Tumblr.


>All Tumblr blogs, including NSFW blogs, have always been viewable and indexable by external search engines through their public URL (blogname.tumblr.com) unless flagged as private/hidden (a choice of the blog owner).

There is definitely something in account settings somewhere that removes you from search results, I may have mixed that up with the NSFW flag.

>They have only been hidden from Tumblr's search engine if the user has their account set to not see adult content

As I mentioned, this is default. And "adult" blogs do not show up at all in search results on mobile.

My point stands, though. Yahoo already started making adult content harder to find (accidentally or intentionally), the momentum is already there.


There are three visibility flags for Tumblr blogs:

- Allow logged-out users to see this blog

- Allow this blog to appear in search results

- Flag this blog as adult-oriented

Turning off either of the first two would hide the blog from search engines, but the last one never has.
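As a rough sketch of that logic in Python: this reads the comment as saying a blog drops out of external search engines when either of the first two settings is disabled, and that the adult flag has no effect on its own. The class and field names are invented for illustration and are not Tumblr's actual API or settings keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlogVisibility:
    """Hypothetical model of the three flags described in the thread."""
    allow_logged_out: bool = True  # "Allow logged-out users to see this blog"
    allow_search: bool = True      # "Allow this blog to appear in search results"
    adult: bool = False            # "Flag this blog as adult-oriented"


def indexable_by_search_engines(v: BlogVisibility) -> bool:
    """Assumed rule: both 'allow' flags must be on; the adult flag is ignored."""
    return v.allow_logged_out and v.allow_search


if __name__ == "__main__":
    print(indexable_by_search_engines(BlogVisibility(adult=True)))
    print(indexable_by_search_engines(BlogVisibility(allow_search=False)))
```

The change discussed at the top of the thread would amount to making the adult flag imply that logged-out access is off.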



