JSTOR Liberator sets public domain academic articles free (arstechnica.com)
203 points by ryanwatkins on Jan 15, 2013 | 107 comments



"By running the script ... a public domain academic article is downloaded to the user’s computer, then uploaded back to ArchiveTeam in a small act of protest against JSTOR's restrictive policies." -- Ars

"At the same time, as one of the largest archives of scholarly literature in the world, we must be careful stewards of the information entrusted to us by the owners and creators of that content." -- JSTOR

The contention that JSTOR is acting as a faithful steward of public domain works -- having established any sort of restriction or barrier to prevent the public from accessing those works -- truly probes the widths, depths, and bounds of intellectual dishonesty.


Afaict JSTOR makes public-domain works available on pretty similar terms as Google Books. They both make the works available free to the public (see http://about.jstor.org/service/early-journal-content), but they: 1) request that you don't remove their watermark/attribution; and 2) block mass downloading. I'd prefer a fully open, mass-download-ok option, but Google Books doesn't allow that either, and I don't see people railing against Google Books's lack of openness.

JSTOR's non-public-domain works are a different story, but the journals are the most to blame for that; JSTOR is just a licensee. Pretty similar to Google Books in that case, too (Google will only let you see a "restricted preview" or "snippet view", depending on the work, because they don't own the copyright).


> JSTOR's non-public-domain works are a different story, but the journals are the most to blame for that; JSTOR is just a licensee.

Indeed, every time I see a publisher offering me an article from the 1920s or 1930s as a digital download for a small fee of around, oh, say, $30, my heart is filled with contempt for this attempt at milking every last cent out of every paper.


JSTOR is a public domain exploiter. They milk it for money. Once we get a copy of the papers, we could distribute them free of charge. I'd like a more altruistic host for this content.


JSTOR does distribute the public-domain papers free of charge. It's the post-1923 papers, which are copyrighted and whose copyrights JSTOR doesn't own, that are paywalled. I do think they could be more aggressively pro-access, but they're slowly moving in the right direction.

That's not to say there can't be other scanning projects that aim to do a superior job, and I'd probably volunteer if there were a way I could be useful to such a project (I've spent some time at Distributed Proofreaders). But I don't see JSTOR as exceptionally evil, at least any more than Google Books is. Both are bringing more content online, in ways that are partly good and partly flawed.


I don't remember ever landing at JSTOR and seeing a whole paper available free of charge. Every time it was asking for money.


Was it a paper old enough to be in the public domain? I just did some spot-checking, and everything I've checked that's old enough is available with no paywall. For example, looking at the American Journal of Archaeology (http://www.jstor.org/action/showPublication?journalCode=amer...), the issues from 1922 and earlier show up for me as free, while the 1923-and-later issues are paywalled.

It's absurd that copyrights are so lengthy that 1923 is the cutoff point, but that's a whole other can of worms.


Show an example please.


Digitizing public domain articles is free now? Like, scanners are free and the people who operate them work for free?

Isn't it better for a non-profit entity to digitize PD articles and make them available to the world at a reasonable price? Or would you prefer a world where JSTOR didn't exist and the only people who had access to most old PD documents were those who lived in major western cities near the biggest libraries?


JSTOR is such a bizarre target to pick.

1) Some of the articles on JSTOR are the product of complete public funding. But many are not. Many are the result of partial or complete university funding.

2) All the work of editing the underlying journals, which is very time consuming, is generally not publicly funded.

3) None of the work of digitizing those journal articles is free. When Google undertakes to digitize such content, they charge you by selling your privacy to others. JSTOR just asks you to pay a fee for their service. Who is the bad guy here?


> JSTOR just asks you to pay a fee for their service. Who is the bad guy here?

Yes, $34 for a 20-page PDF. I could get printed books with collections of 15-20 hand-picked articles for less. The obscenity lies in the huge, itemized price.

Also, if you are not a student at one of the affiliated universities (which is a rare case, probably less than 1% of the population) then you get no good options. How is that for advancing the arts and sciences?

How about we get everything that has been funded by the public back to the public, and stop this madness?

Another problem is the lack of access to the text for machine learning and NLP purposes.

In conclusion, I get they invested some money. They should just be nationalized. Pay them a compensation fee and just cut them out of the loop. They are an obstacle to progress.

It is one of those times when public good trumps individual property rights. Back when they were building railroads in the USA, it was necessary to solve a similar problem - how to route the railroad through the maze of private properties. The solution is simple - expropriation + fair compensation. Public good must be met first.


> Many are the result of partial or complete university funding.

The universities are not paid by the journals; quite the opposite: they pay handsomely for access to the work of their own researchers.


The journals don't charge for the content. They charge for the task of curating and editing those articles. It's actually a lot of work to turn the incomprehensible gibberish that researchers pass off as a first draft into something consumable by human readers.


Very few academic journals have professional editors.

In most cases the journals are edited by the same people who write for them, for free (sometimes there'll be a pittance salary, but it's not enough to quit your day job by any means).

The "incomprehensible gibberish" is, in general, fully comprehensible by their intended audience, which is other academics working in the same field.

Here's how the model works:

1) Academic writes the article, supported by taxpayer funding, or a grant from a foundation, or whatever.

2) Academic submits the article to the journal, where it is peer-reviewed and edited by other academics.

3) None of these academics get paid a cent for their work (other than the aforementioned public funding). In particular, the journal publisher doesn't pay them anything. Some of them even CHARGE the author.

4) The journal publisher then sells the result for hundreds or even thousands of dollars per year.

Nice gig for the publishers. Less so for the academics and the taxpayers.

This model made a certain degree of sense back when journals had to be printed (short print runs are expensive, especially stuff with lots of diagrams, weird equations, foreign languages, etc. as journal articles tend to be), then physically distributed by paper mail to institutions all over the world.

That's no longer the case.


Go talk to someone who edits academic journals and brace yourself for the rant that follows. Journal submissions are generally poorly organized, full of citation errors, full of grammatical and spelling errors, etc. It is not a case of researchers submitting perfect prose that the editors simply collate into a PDF and call it a day. Moreover, journals serve a valuable advertising/branding function for researchers, essentially putting their brands behind the work. The phrase "A study published in the New England Journal of Medicine" carries a very different weight than "a study posted on some professor's blog".


(rewritten to remove snideness)

I know people who edit academic journals, and have published in them myself.

Your claim that editorial costs are what drives journal prices is completely without merit.

Few journals below the level of, say, Nature, have paid staff, and few academics would tolerate major rewrites of their work by an editor anyway.


Spoiler: title is sarcastic.

    The importance of copyediting a scientific paper
    D. J. Bernstein
    2005.05.04
http://cr.yp.to/bib/20050504-copyediting.txt


It is true that journals serve as advertising/branding. That was their real value add. But article-level metrics are changing that. Now you can get stats on individual articles (e.g. # of cites, # of pageviews, listed on Faculty of 1000, Most Viewed Article), and those are increasingly influencing funding decisions.

As for the point re: editing, this is just not true. Academic journal submissions are by highly educated PhDs at the best universities; the editors on the other side just aren't as well trained. Post-submission copy editing is minimal; most of the effort by journal editors is on formatting. The reason is that in many cases the journals have failed to invest in a proper publication workflow so they have to spend time converting Word documents into their in-house style, rather than just (a) distributing a LaTeX template or (b) working on a common interchangeable format and simple rich text editor. Some of the more computational journals, like Bioinformatics or NAR in compbio do have LaTeX formats, but many like Science, Nature, or Cell do not.

So their editorial costs for their human army arise because they didn't invest in a few engineers to build a simple rich text frontend with pdf preview functionality. They are manually doing simple asserts like checking character counts in headline text rather than doing that via client-side JS. Things like that are the cost centers here. A shame.
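For illustration, a minimal sketch of the kind of client-side check I mean; the #headline field and the 90-character limit are made-up placeholders, not any journal's real rules:

    // Hypothetical client-side check: flag an over-long headline as the author types,
    // instead of having an editor count characters by hand after submission.
    const HEADLINE_LIMIT = 90; // assumed house-style limit, purely illustrative

    function checkHeadline(input: HTMLInputElement): void {
      const overBy = input.value.length - HEADLINE_LIMIT;
      // An empty string clears the error once the headline fits again.
      input.setCustomValidity(
        overBy > 0 ? `Headline is ${overBy} characters over the ${HEADLINE_LIMIT}-character limit` : ""
      );
      input.reportValidity();
    }

    document.querySelector<HTMLInputElement>("#headline")
      ?.addEventListener("input", e => checkHeadline(e.target as HTMLInputElement));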


>As for the point re: editing, this is just not true. Academic journal submissions are by highly educated PhDs at the best universities; the editors on the other side just aren't as well trained. Post-submission copy editing is minimal; most of the effort by journal editors is on formatting.

This is not true. Most editors are PhD-trained people (after all, they have to select the reviewers of a paper, which requires knowing the relevant people in a field who are capable of evaluating a paper). In some of the smaller, field-specific journals, editors are themselves volunteers from academia (go to your local university, find some lesser known journal and check the inside cover, which usually lists the editorial board of the journal. You'll see that most of them are university faculty).


I have edited journal articles, and yes, they are generally "poorly organized, full of citation errors, full of grammatical and spelling errors, etc." It's a lot of work to turn a crappy first submission into something that's readable.

But I did all of my journal editing as a grad student in college. Technically, it was my advisor's job to do it, but he was too busy with non-mundane things, so this kind of work got passed off to his students. None of us were paid for the work, of course. It was just understood that you do it because you have to (grad students need good recommendations, and advisors to those grad students need to play politics with the journal committee). As far as I can tell most editing for most journals is done for free by various university professors and (mostly) their students. More prestigious journals like Nature and Science probably have their own editors do an additional pass after this, but most don't seem to do that.

It might be different in some fields, but that's how it works in my field (mechanical engineering), and it seems to be that way in other engineering fields as well.


The fact that the journals might be edited for free by grad students and professors doesn't mean the results should be freely available. Many journals are run by universities, and those fees go back into the universities that ultimately pay those grad students and professors.


I think you've missed the key issue here, and why we need activism in this area. The point is that the journals model is outdated, and only being sustained by the self-interest of an anachronistic/parasitic publishing industry.

Let me paint you a picture of a hypothetical alternative system: a publicly funded, 100% free-to-access online system where academics can submit papers, and then other academics can peer review them. Papers and authors are ranked by an open-source algorithm that takes into account factors such as the number of citations, the quality of the papers making the citations, results of peer reviews, the quality of the peer reviewers and so on. Doesn't that sound like a better home for our shared cultural knowledge, which belongs equally to everyone, than having it locked up and monetised by a handful of private companies?
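To make that concrete, here is a minimal sketch of the sort of open ranking algorithm such a system might use: a PageRank-style pass over the citation graph, seeded with peer-review scores. The field names, damping factor, and iteration count are placeholder assumptions, not a worked-out proposal:

    // Hypothetical open ranking: a paper's score combines its peer-review score with
    // score passed along by the papers that cite it (a PageRank-like recurrence).
    interface Paper {
      id: string;
      citedBy: string[];    // ids of papers that cite this one
      reviewScore: number;  // aggregate peer-review score, normalized to 0..1
    }

    function rankPapers(papers: Paper[], damping = 0.85, iterations = 50): Map<string, number> {
      const score = new Map<string, number>();
      const citesMade = new Map<string, number>(); // how many papers each citer cites
      for (const p of papers) {
        score.set(p.id, p.reviewScore);
      }
      for (const p of papers) {
        for (const citer of p.citedBy) {
          citesMade.set(citer, (citesMade.get(citer) ?? 0) + 1);
        }
      }
      for (let i = 0; i < iterations; i++) {
        const next = new Map<string, number>();
        for (const p of papers) {
          // Base weight from peer review, plus a share of each citing paper's own score.
          let s = (1 - damping) * p.reviewScore;
          for (const citer of p.citedBy) {
            s += (damping * (score.get(citer) ?? 0)) / (citesMade.get(citer) ?? 1);
          }
          next.set(p.id, s);
        }
        for (const [id, s] of next) {
          score.set(id, s);
        }
      }
      return score;
    }

Author rankings could then be aggregated from the scores of their papers, and because the data and the algorithm are both open, anyone could audit or propose changes to the weighting.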


Algorithmic curation? Wake me up when Google doesn't get easily confused by "product reviews" made up of auto generated content. Until then, I'll take human editors, thank you very much.


I'd say by far the main input in terms of "curation" would come from fellow academics, the peer reviewers. Only the very largest and most prestigious journals (Nature etc.) have full-time paid editors.


That's the system that some of the newer Public Library of Science journals are aiming for.


What are the factors affecting the prestige of a journal other than the credentials of the reviewers and the impact of the papers previously submitted?


I don't get why you've been downvoted. It's as if people don't realize that editing, curating, hosting, and distributing these papers/articles costs money. I'd be hard pressed to find anyone in here who wouldn't complain when their hard-earned living is endangered because someone decided their "product" should be free.


It doesn't have to be "complete" public funding. I'm gonna go ahead and guess that many, many of the universities that may have partially funded the research also receive tax dollars. This last part is such a huge issue here, I think. "Well it wasn't very many tax dollars..." Yeah, well let's have our share then. One download of each document that tax dollars helped pay for seems fair to me.


Let's say a university is going to do some research. They look around for funds and decide to do a project that the government is offering some grant money for. In that case, the government's money didn't really pay for the research, it just influenced what direction it went. I think the grant money did its intended job even if the results aren't public.


Well, does the study use tax-paid resources or not? It may not seem workable to insist that no private results come from research that receives tax money in any way, but it is fair.


I guess what I mean is, sometimes the grant money is to "buy" the results of research, and sometimes it's just to encourage research along certain lines. Or anything else... Think of it this way: if the government requires all publicly-funded research results to be public, a lot fewer people are going to take them up on the deal.


> if the government requires all publicly-funded research results to be public, a lot fewer people are going to take them up on the deal

Non-sequitur, and you can bet plenty of researchers will publish publicly in order to get free money. What it will affect is private industry benefitting from publicly funded research and resources.


But if private industry profits less from the research, they will pay less for it. Pretty quickly you will see two exclusive sets appear: a little open research funded with a little public money, and lots of proprietary research funded with lots of private money.


I don't think it would be as complete as you imagine. There would be companies who would forego keeping results private to save the expense of building their own lab facilities and hiring people to run and work in them. That public resources can be used for private results is a back-door subsidy for commerce.


Overall, university research is about 50/50 split between public dollars and private dollars. Coincidentally, Chicago's CTA receives about 50% of its operating budget from taxes. Should I get on the next Metra train and refuse to pay the ticket price?


I agree with that, but I'd also say openness mandates should be added to every entity that takes taxpayer funding, including, for example, Lockheed, SpaceX, and just about every pharmaceutical company.


Works for me.


If I was going to level a complaint about anti-JSTOR activism, it would be that the demise of fee-encumbered academic publishing is inevitable, and that it will most effectively be brought about by researchers like Matt Blaze who have pledged never to publish in closed journals and never to review for closed journals. That in the meantime, efforts to overtly disrupt JSTOR are a sideshow that will ultimately detract from that effort.


Efforts like boycotting closed journals can only help for new publications. Sadly, past publications sometimes stay relevant for a long time, so access restrictions are still a problem even if everyone stopped publishing in closed journals.


You're not making any sense here. Why are the articles on paper, and why are people charging to access research produced with taxpayer money? Your logic is circular.


If you're really asking what I prefer, then I prefer making documents available gratis, perhaps after the cost of digitizing them has been recouped. Some additional charge for the ongoing cost of storage and bandwidth would need to be worked out. However, $19 per article is unreasonable (cf. Amazon S3 Requestor-Pays buckets at $0.12/GiB).


What do we know about JSTOR's cost structure? Is the $19 the result of subsidization within the system? I'd imagine there are a ton of articles on JSTOR which nobody ever reads, the digitization of which is subsidized by that of the popular articles. I'd also imagine that a la carte purchases subsidize subscriptions for universities. Also, the underlying content isn't necessarily free. Most of the popular journals charge subscription fees, and I'd imagine JSTOR doesn't get them all for free.


You've raised several points.

I grant that a $19 article may represent subsidization in their system, but argue that they have an ethical obligation derived from their position as a steward to make the components of that price transparent.

It's my understanding that universities pay significant amounts for site licenses and that general public purchases do not subsidize that access.

By perpetuating a watermark or attribution on a public domain work digitized by JSTOR, are we as a society not implicitly granting JSTOR a right on the work which it ought not have? Digitization is a one-time salvage operation. Should we conflate it with copyright through watermarks?

Regarding JSTOR, journals, and non-public-domain works -- that is a private property matter. They ought to be able to charge what the market will bear.


    > It's my understanding that universities pay significant amounts for site
    > licenses and that general public purchases do not subsidize that access.
Yeah, I am pretty baffled by how non-institutional users are asked to pay for access, both on JSTOR and other publishers' sites. Surely this is not a large source of income? arXiv (always pointed out as a shining example of something going right) has recently started requesting membership fees to be paid by the top 200 institutional users. Other than that, the public is still able to go to the site and read about dark matter and substring bananas.

You know, what probably happened was that when the content was starting to get digitized, the publishers only had users that were accessing from universities. So they charged the universities just like they previously did for subscriptions to the bound book issues. At the time, there was probably no concept of doing any sort of analytics to figure out where downloads were coming from, so they didn't think that maybe they should just allow anyone to download and then set up a deal with colleges to pay their fair share. I think the arXiv model is worth some more investigation.

(This ignores all the actual truths like, they have to maximize shareholder value, they have to make as much money as possible, their incentives are aligned differently, etc. Instead it is simply focusing on the reason why "individual purchase" is an option that doesn't seem to make sense to non-institutional members like the general public.)


JSTOR doesn't have to maximize shareholder value. It is owned by Ithaka, a non-profit whose mission is to help " the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways." http://www.ithaka.org/mission


What shareholders? JSTOR is a non-profit organization. It was originally founded by the Mellon foundation.


I used plural "publishers" and it happens to be that many publishers are commercial. I know JSTOR is not one of them.


How much does Wikipedia charge me to make a copy of an article? What is Wikipedia's annual burn rate?


Wikipedia scans and digitizes old paper articles?

In all seriousness, have you ever tried digitizing old journals? It is fairly time consuming.


Personally, I'd be fine handing over all journal publishing responsibilities to Google. The one great thing that Google has going for it is that many many many of the engineers there came from academia and understand the value of open access to information. I don't care if they run ads while they do so. It would be orders of magnitude better than the status quo.


I fail to understand how a for-profit corporation that sells your privacy to the $500 billion advertising/media/branding industry is a better steward of this information than a not for profit organization that simply charges a fee for their services.


No, let's not do this. We don't need the scientific literature fucked as badly as Google fucked the USENET archives.


Enlighten me. I'm curious.


Google made a complete hash out of the Usenet archives that they bought from DejaNews; you could literally issue the same search month after month and watch yourself getting progressively fewer and fewer results.


I'm pretty sure Google imposes the same limitations JSTOR does for public domain stuff. Google and JSTOR are both the status quo for digital copies of public domain stuff.


Sir, I've worked on/off with Project Gutenberg for over 12 years (since I was 18). I've mailed thousands of CDs/DVDs full of PG's library to people in third-world countries, and I've also reviewed hundreds of books, on my own time and for no compensation, for their Distributed Proofreaders project (http://www.pgdp.net/c/).

No, I have not tried to digitize old journals. But I HAVE helped contribute to the digitization and distribution of public domain works. I'm aware of the time and financial resource commitments involved.


> In all seriousness, have you ever tried digitizing old journals? It is fairly time consuming.

Writing an encyclopedia is time consuming, but we did that.

Walking & driving all over the world is time consuming, but the OpenStreetMap community has done that.

Scanning & digitizing old public domain books is time consuming, but Project Gutenberg has done that.

What's next?


Additionally, I am pretty sure Wikipedia's annual burn rate is pretty high... hence the donation request every year and all.


My argument is that Wikipedia has more transparency and a lower burn rate than JSTOR. Also, I would accept the Internet Archive as a feasible repository for the data in question vs JSTOR.


There's a difference between volunteer-written Wikipedia articles and JSTOR documents curated by professional scholars; e.g., the latter are acceptable for citations.


This is exactly the same argument that Bill Gates used in 1976 in "Open Letter to Hobbyists", asking: "Who can afford to do professional work for nothing? What hobbyist can put 3-man years into programming, finding all bugs, documenting his product and distribute for free?"

Here we are, nearly 40 years later, standing next to Wikipedia, OpenStreetMap and all the open source software in the world, and we can say: yes, people will do this work for free. The fundamental premise of your argument (that people won't do it for free) has been disproved.


In context, JSTOR's statement was clearly not about public domain works. It was about the articles that Swartz downloaded, most of which were presumably not in the public domain.


I don't see any evidence either way for whether or not his "keepgrabbing" python soup only grabbed public domain articles. At the very least, I imagine it touched the landing pages for each article to get the pdf link, where it would also have an opportunity to look at the copyright status. Is the copyright status even on the JSTOR page for each article anyway? Or is it only in the PDF?

Let's see..

http://www.jstor.org/discover/10.2307/1831029

The "rights and permissions" link goes out to:

https://s100.copyright.com/AppDispatchServlet?author=Fischer...

... ok copyright.com is not encouraging at all, that scam is everywhere. Let's try something older.

http://www.jstor.org/discover/10.2307/30096268

The "rights and permissions" link goes out to:

https://www.copyright.com/openurl.do?sid=pd_ITHAKA&servi...

which leads to:

http://www.copyright.com/search.do?operation=detail&item...

.. where they seem to be selling the rights??


In the US the metadata is almost certainly not copyrightable, no matter how laborious its creation was. A work must be creative to be copyrightable, and mere facts are not as a matter of law.


Is that a distinction without a difference when their TOS restricts coordinated downloads of public domain content for the purpose of evading their artificial barriers? A TOS violation that carries, in the minds of some US prosecutors, a just penalty of several years in jail?


> The contention that JSTOR is acting as a faithful steward of public domain works -- having established any sort of restriction or barrier to prevent the public from accessing those works -- truly probes the widths, depths, and bounds of intellectual dishonesty.

As I see it, they've removed the majority of the barrier (i.e. actually tracking down a copy of the periodical or whatever). So you have to log in and deal with a watermark - calling that 'intellectual dishonesty' is stretching the term to its limits.


Why are bulk downloads prevented when S3 Requestor-Pays buckets could deliver public domain content at $0.12/GiB? Why does a one-time salvage operation require a watermark - are they asserting a copyright?


> Why are bulk downloads prevented when S3 Requestor-Pays buckets could deliver public domain content at $0.12/GiB?

Because a) they need ammo against people like, well, Aaron Swartz, and b) it doesn't take many people bulk downloading to DDOS the sucker.

> Why does a one-time salvage operation require a watermark - are they asserting a copyright?

They probably don't distinguish between public domain and copyrighted in document production. TBH I doubt that most documents people access are public domain, so it probably hasn't been an issue until now. Even in fields that are really old (e.g. classics) the vast majority of work is recent and copyrighted.

Besides, if I were to scan these things in, I would want people to know I did it. Do you know how much effort goes into the process? Even when things are completely automated (which is expensive) there's a lot of manual labor in operating the machine, editing, and cleaning stuff up. What do you do with figures? How about typeset math? Glyphs not in unicode?

Not saying I agree with their tactics, but I understand them and I don't think it makes them an immoral/bad organization.


> Do you know how much effort goes into the process? Even when things are completely automated (which is expensive) there's a lot of manual labor in operating the machine, editing, and cleaning stuff up. What do you do with figures? How about typeset math? Glyphs not in unicode?

Step 1: Ask Google to do it. Step 2: There is no step two.

Google Books shows that Google is willing to do these kinds of tasks, as long as they can show the results on their site and thus keep people using Google services. I'm sure they would digitize all of the public domain papers and host them at no cost to JSTOR.


Exactly how is $0.12/GiB going to fund an operation that provides full-text searches of academic journals, which saw 74 million downloads in 2010?

Does anyone chiming in who claims this info must be available for pennies per article actually have any evidence that an operation of this scale can be funded this cheaply? The costs to provide this service are not as simple or as cheap as you think.


Ah, yeah, I think you've misunderstood the implications of what I've suggested; from your tone, I think you'll find a corrected interpretation both more and less radical.

JSTOR sends messages into the marketplace that it is a faithful steward of the public domain, but knows that a) its TOS prevents unrestricted access to the public domain, and b) a US Attorney will prosecute violations of that TOS as felonies carrying several years in prison; I argue that their speech does not match their actions and that this is dishonest. Aggravating this, the extreme negative consequence of taking them at their word (as Aaron did) is why I've chosen to say that their actions probe the depths of intellectual dishonesty.

Full-text search, like printing, is a value-added service. There is no reason to keep public domain works hostage by the threat of sending people like Aaron to jail for decades just so that they can offer FTS. If JSTOR wants to offer FTS or other services on the corpus of public domain works, let them charge for access to those services.

An existence proof: arXiv offers bulk download access to the works in their repository (~490GiB) via Amazon S3 Requester-Pays buckets [1]. The requester is paying Amazon, not arXiv; so, arXiv doesn't earn "pennies per article", it earns nothing. Incidentally, arXiv also provides full-text search [3].

Let's talk about capacity planning. Let's guess that the average size of a digitized journal article is 5 MiB; at 74 million downloads a year, that comes out to ~361 TB, or $44,400 to stream from S3. The at-rest cost of those articles is far lower because there are far fewer articles than downloads (I don't have a number, but would you argue otherwise?).
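For what it's worth, a quick back-of-the-envelope check of those numbers; the 74 million downloads figure comes from the comment above, the 5 MiB per article is my guess, and the small gap from the $44,400 quoted is just GB-vs-GiB rounding:

    // Back-of-the-envelope S3 egress estimate under the assumptions above.
    const downloadsPerYear = 74_000_000;   // 2010 download count quoted upthread
    const articleSizeMiB = 5;              // guessed average size of a scanned article
    const egressPricePerGiB = 0.12;        // USD, the S3 rate cited in this thread

    const totalGiB = (downloadsPerYear * articleSizeMiB) / 1024;  // ~361,000 GiB per year
    const egressCost = totalGiB * egressPricePerGiB;              // ~$43,000 per year

    console.log(`${Math.round(totalGiB).toLocaleString()} GiB/year, ~$${Math.round(egressCost).toLocaleString()} in transfer fees`);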

My proposal: JSTOR removes its watermarks and puts all public domain works and associated metadata into S3 Requestor-Pays buckets. They finance their operations by selling non-public-domain works and value-added services like FTS at a price the market will bear.

Earnings made by restricting bulk access to public domain works are blood money. Watermarking these documents is confusing - no one seems to have answered my questions about whether they are claiming a new copyright.

[1] http://arxiv.org/help/bulk_data_s3 [2] https://forums.aws.amazon.com/ann.jspa?annID=386 [3] http://arxiv.org/find


arXiv is funded by Cornell.

How will JSTOR be a "good faith steward" if they give all of their content away and make it available for free? They have to pay their bills to maintain the infrastructure to support all of this and pay their employees to do so. Charging for article access is how this is accomplished and how the company behind this allows the service to continue. There is a lot more money behind the operation than just copying PDFs to an S3 bucket.


There is a difference between public domain works, non-public-domain works, and value-added services. I've made this distinction clear in my arguments.


My point: They have to charge what they do to continue doing what they do.


We agree that to operate a business one needs revenues. We agree that sales of non-public-domain works and value-added services are appropriate avenues for improving revenues.

Where we differ in opinion seems to be that I do not believe that a company can take a public domain work, wave a magic terms of service over it, and then sell it back to the public.


This actually happens all of the time, see the copies of Shakespeare and Beethoven on sale at your local bookstore. The content is being repackaged and sold, but that doesn't negate the fact that the works are still available for free somewhere else because they are public domain.


I'm not doing a particularly good job of pinning you down, but look, my complaint is that, regardless of the availability of particular individual public domain works, JSTOR's TOS contains language preventing the coordinated download of the entire public domain archive, and they will block your MAC address if you try to download too much of the public's works.


Correct. Downloading their entire archive and putting it online somewhere else puts their business at risk and will prevent them from recouping costs they've invested and asked users to pay for (by legally signing up for an account and paying them). Just because it is public domain doesn't mean it can be taken and posted elsewhere.


I disagree with the premise that it can be called "their" archive if it contains none but public domain works. Can you have ownership over a collection of things that you don't own?

"Just because it is public domain doesn't mean it can be taken and posted elsewhere." No, that is precisely what it means - the collective commons owns this; ownership means having the right to use your property. These works are your property, my property, our property.


You cannot walk into Barnes and Noble, take a copy of Penguin Publishing's The Complete Works of William Shakespeare off of the shelf, and walk out with it without paying just because the work is in the public domain (Note: let's not get into the digital vs. physical depriving-anyone-of-content argument for the 10000th time).

Time and money were spent organizing and publishing the book / article, so they do have a right to charge for it. If you want wholesale free access to public domain documents, then you find a place which will provide these to you for free. You don't get to ignore a company's right to charge for something just because you disagree with the premise.


I suggested that JSTOR make its public domain works available for bulk download through AWS S3 Requestor-Pays buckets. Please don't misstate my position - I haven't endorsed Aaron's Guerrilla Open Access position.


> The costs to provide this service are not as simple or as cheap as you think.

Exactly. I can't believe bandwidth is JSTOR's limiting cost.


I am trying to collect small hacks like this that are popping up. I know everyone writes scrapers, but hardly anyone talks about it. But maybe talking about it would be helpful so that we can figure out what goes right and what goes wrong:

https://groups.google.com/group/science-liberation-front

Someone contributed a small greasemonkey script that does something similar to the JSTOR memorial liberator, except for SpringerLink previews.

https://gist.github.com/4535401


You're right, everyone writes scrapers. Heck, Google is just one huge scraper. It's what you do with the data (and in some cases, what you might intend to do with the data) that counts.


Just because something is in the public domain doesn't mean it's free. Shakespeare's works might be public domain but there is a cost to printing a book. There is a cost to scanning pages, there is a cost to hosting a website, administering a website, etc.

How hard is this logic to understand? If you don't want to use JSTOR don't use it. Don't go around saying they should let you download it for free.


You can buy a copy of Shakespeare's works and then upload a copy of the text to Project Gutenberg.

Just because someone is selling something that is public domain doesn't mean that you can't give it away for free.


Unfortunately, I can see this pounding JSTOR's site. I'm sure US Attorney Carmen Ortiz will think this is a DDoS attack because JSTOR can't handle the traffic generated by everyone going there.

I'm sure she can rationalize that as a terrorist attack on essential infrastructure, so let's round it up to 170 years in jail for the conspirators. Let's assume every one of those connections would have paid $200 for an article, so that's $50 million in damages. Right?


Sigh. It is getting very old to see HN comments degenerating into Techdirt/Reddit/Torrentfreak level hyperbole any time a perceived injustice story is in the news and when it involves the US government, the entertainment industry or any other entity that isn't 100% in the "information wants to be free" boat.


You don't think the site could get slashdotted from everyone trying to use this bookmarklet at the same time and PDFs getting generated from it?

At this point, I don't think it would be a stretch to see an overreaction, even if it's just benign traffic from people checking out the free content that melts the servers.


There's a difference between 'The site may go down' and 'This person I don't know will overreact and accuse us of terrorism'. Why don't we discuss the reality of the situation rather than some bizarre scenario you've concocted?


The site goes down because of a distributed "attack" in the wake of Aaron's death. Coordinating a distributed denial of service attack (as characterized in the worst possible light) is not a legal activity. Less crazy charges have been brought for other "hacking" activities.

Accessing "hidden" URLs has been called unauthorized access; deep linking or just linking to something like deCSS has been likened to a crime; port scanning has been legally attacked a few times.

If these same types of people think URLs are a crime, I don't think the scenario I described is that bizarre. Look up that news story about the guy in a glider near a nuclear plant (he was catching a thermal from the lake) -- there was never a no-fly zone, but officials actually considered shooting him down! Instead they held him for 24 hours while his loved ones started a search for his glider. They dropped the charges only after he agreed not to sue them!


My point isn't "It can't happen".

My point is "Why don't we discuss things as they are, not as they may or may not be some time in the future?".


Let's not be hyperbolic. Carmen Ortiz does not have it out for people; it's incredibly likely she legitimately thought Aaron WAS hacking and simply thought it was right to make an example of him. This probably doesn't even violate JSTOR's EULA - people have the right to view the article, they aren't scraping the site, and it's not like the bookmarklets are going to drive people to the site any more than people already are.

EDIT: Apparently the TOS (not EULA) is vague enough that this might actually violate the TOS - but the creators are very explicit about this and clearly tell the user that if they violate the TOS it's their own damn fault. The only thing I could think of is that the creators could be hit with DMCA violations for circumventing the 'DRM' of the site, EVEN if they never actually use it on anything copyrighted.


> it's incredibly likely she legitimately thought Aaron WAS hacking and simply thought it was right to make an example of him.

And you don't think that she will think that this is hacking if the site goes down from the load? Besides, making an example of people is having it out for them.


> And you don't think that she will think that this is hacking if the site goes down from the load?

No, I don't. Someone stupid enough to continue this shitstorm would never make it that high up in office, especially when this is even less plausibly hacking than what Aaron did, even to someone without a great understanding (it's distributed, they warned people it would violate TOS, and the intent is clearly not to burden the site but to make a political statement that is explicitly designed NOT to burden the site).

I could be wrong. But I really, really don't think so.

> Besides, making an example of people is having it out for them.

Well yea, she might have it out for 'hackers', but who doesn't? Anyway, it's a common practice to go after the 'big fish' to scare the smaller ones.


I can certainly see the site being impacted by a large number of concurrent visitors. The service embeds the current datestamp in each PDF, so every access takes more server time than bandwidth.

The response time on an example PDF for me was around 10 seconds.


I might be missing something, but how are the articles being liberated? I don't see any downloadable articles at archiveteam.org at all. Is the idea that they're going to be made available as a torrent or in some other downloadable form at a later date?


I feel kind of hollowish about this - I'd wondered at the time why Aaron chose such a direct method that was sure to piss off some people and get him burnt in some way. But everybody's got their own methods and goals, and he succeeded at the social goal of rallying people to his cause (something a purely technical solution has a very hard time doing). But now seeing an implementation of a tool that would liberate the same articles, a bit slower, but without him having to pay the ultimate price? Sigh.


I think it's entirely appropriate for those of us who have just learned of JSTOR to go visit it and check out the site. It's interesting because they allow you to download a single article for free.

Once I download the public domain article from 1886, I suppose I could do whatever I wanted with it.


Not necessarily. The old article ITSELF is public domain, the work, so to speak. The file you downloaded might have a new copyright because someone worked really hard to create it. Check the license and everything before jumping to conclusions.


Effort required to create something does not mean that something is necessarily copyrightable if there was no creative process involved. This is why databases are not copyrightable in the US.

Is scanning old journals a creative process? Maybe there is some case law that says so, but until I see that, I am inclined to say no, and everything that I can find agrees.

http://en.wikipedia.org/wiki/Database_right#United_States

http://en.wikipedia.org/wiki/Threshold_of_originality#Reprod...

http://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel_....


Oh, BTW, I looked at the JSTOR license for "Early Journal Content" and I'm confident that I am within the spirit and letter of their TOS.

http://www.jstor.org/page/info/about/policies/terms.jsp#TC2


I'll accept the concept of "new copyright" for something transformative, I do not accept it for something akin to xeroxing.

If JSTOR has a different opinion, they can see me in court.


So all it takes is a famous person killing themselves over prosecution and an insanely massive outcry from the internet. Seems easy enough. /s


You are misreading the title. The JSTOR Liberator refers to a bit of JavaScript, and the documents are freed up one by one as people run the script, apparently limited to once per browser.

So it's not as if JSTOR suddenly saw the light of day and opened up their archive for download.


The beauty is it's only the beginning.


It's a hack.



