Show HN: Gutenberg – A simple interface to the Project Gutenberg corpus (bitbucket.org/c-w)
94 points by c-w on Aug 4, 2014 | 13 comments



Hey all, OP here.

I built this because I think Project Gutenberg is a great resource for NLP (e.g. stylometry, tracking writing styles over time, authorship detection, ...). I have wanted to use the Project Gutenberg data a number of times in the past, but always ended up using another corpus because there wasn't an easy way to access it. Hopefully this library fixes that.

The project currently is "works on my machine" quality, so please do report any bugs you stumble across.

Also, if you can think of any use-cases for the Project Gutenberg data that aren't easily doable using the functionality that is currently available in the library, please let me know (e.g. by filing a ticket on the Bitbucket repo).


This is fantastic!

I just made a GitHub repo for each Gutenberg book: https://github.com/GITenberg

This will be very helpful; the XML/RDF files are a hassle.


Are the github repos intended to collect errata? Do you know of a database which has metadata for all the Gutenberg books?


There's a database of RDF files that describe the books (http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2), but it's a bit of a pain to use and doesn't link the books back to the API that should be used for crawling Project Gutenberg (http://www.gutenberg.org/robot/harvest).


I think the previous version of the metadata included a path to the ftp server. Splitting the book id (4443 -> 4/4/4/4443) works for _most_ books, but there were somewhere between 800 and 3000 books organized in a different folder structure that I still need to track down.
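For illustration, the id-splitting rule described above (4443 -> 4/4/4/4443) could be sketched roughly like this. The function name `guess_mirror_path` is made up for this example, and the handling of single-digit ids (placing them under a `0/` folder) is an assumption rather than something confirmed in this thread. As noted, some books live in a different folder structure, so treat the result as a guess.

```python
def guess_mirror_path(book_id: int) -> str:
    """Guess the mirror folder for a book id, e.g. 4443 -> '4/4/4/4443'.

    Every digit except the last becomes one parent folder, and the full id
    is the final path component. This only covers the common layout.
    """
    if book_id < 1:
        raise ValueError("book ids are expected to be positive integers")
    digits = str(book_id)
    # Assumption: single-digit ids have no leading digits, so use '0' as parent.
    parents = list(digits[:-1]) or ["0"]
    return "/".join(parents + [digits])


if __name__ == "__main__":
    print(guess_mirror_path(4443))
```

Any path produced this way would still need to be checked against the actual server, since it does not cover the books stored elsewhere.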


The GitHub repos are intended to collect issues and receive pull requests. Project Gutenberg doesn't have a public bugtracker, nor do they use version control.


Why is Git in uppercase?


To better differentiate from Project Gutenberg: GIT + Gutenberg. Case is ambiguous on GitHub, so I can change it later without breaking anyone's URLs.


Wow, this is just what I have been looking for. A few questions:

1. Is it possible to download files in HTML format? I really prefer ebooks in HTML because I can attach my preferred "readSettings.css" to them that way. *

2. Is it possible to run a custom script after the book has completed downloading?

3. I don't understand what you mean by "Making meta-data about the texts easily accessible through a database" in the description. Can you expand a bit on this?

4. Is it possible to specify other download contexts like "genre"?

Oh, thanks a lot for this :) I always wanted a command line utility for Project Gutenberg.

* I also think that HTML files are a lot easier to read on the phone because you can style them, whereas with the txt files you have no choice but to use horizontal scrolling unless you are in landscape mode.


Your use-case is not what I built the library for (natural language processing, not text consumption), but let's see what we can do...

You can download HTML E-Books using the following command:

  python -m gutenberg.download -vvv --filetypes=html --limit=5mb ./ebooks

This will download 5mb of zipped E-Books for which there exists an HTML version to the ./ebooks directory.

It seems as though the legal disclaimers and copyright notices in the HTML files are all within <pre> tags, so we can easily clean up the files with a small shell script:

  EBOOK_DIR="./ebooks"

  # Quote the glob patterns so the shell passes them to find unexpanded.
  find "${EBOOK_DIR}" -name '*.zip' -type f -exec unzip -d "${EBOOK_DIR}" {} \;
  # Delete every line from an opening <pre> tag through the closing </pre> tag (GNU sed).
  find "${EBOOK_DIR}" -name '*.html' -type f -exec sed -i '/<[pP][rR][eE]>/,/<\/[pP][rR][eE]>/d' {} \;

This will probably not work for all E-Books, but it'll give you something to work with. Note that removing the copyright notices may or may not be against the Project Gutenberg terms of service.
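For anyone who would rather stay in Python, the same <pre>-stripping idea could be sketched as below. This is not part of the library; `strip_pre_blocks` is a name invented for this example. Unlike the sed command, which deletes whole lines, this regex removes only the <pre>...</pre> block itself, and it also tolerates attributes on the opening tag. Like the shell version, it will probably not work for every E-Book.

```python
import re

# Match an opening <pre> tag (any case, optional attributes), everything up to
# the nearest closing </pre>, and the closing tag itself.
PRE_BLOCK = re.compile(r"<pre\b[^>]*>.*?</pre\s*>", re.IGNORECASE | re.DOTALL)


def strip_pre_blocks(html: str) -> str:
    """Return the HTML text with every <pre>...</pre> block removed."""
    return PRE_BLOCK.sub("", html)


if __name__ == "__main__":
    sample = "<p>Chapter 1</p>\n<PRE>\nlicense text\n</PRE>\n<p>It was a dark night.</p>"
    print(strip_pre_blocks(sample))
```

To apply it to downloaded files you would read each .html file, pass its text through `strip_pre_blocks`, and write the result back. The file encoding is something you would have to determine per book.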

Downloading E-Books via genre, author, etc. is not currently supported but is something that I wanted to implement - so watch this space.


This is a pretty small Python module at the moment. To fetch metadata, it downloads a 230mb zip from PG and parses it for a few categories. Project Gutenberg has some Subject metadata, but it is inconsistent; sometimes there is also a Library of Congress code in the metadata.
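To give a rough idea of what parsing a metadata record for a few categories can look like, here is a minimal sketch using only the standard library. It is not the module's actual code. The sample record is a simplified, hypothetical stand-in (real Project Gutenberg RDF files contain more namespaces and fields), and `read_metadata` is a name invented for this example. The `rdf` and `dcterms` namespace URIs are the standard ones.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical, simplified record used only for this illustration.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="ebooks/4443">
    <dcterms:title>An Example Title</dcterms:title>
    <dcterms:subject>
      <rdf:Description>
        <rdf:value>Example subject heading</rdf:value>
      </rdf:Description>
    </dcterms:subject>
  </rdf:Description>
</rdf:RDF>
"""


def read_metadata(rdf_text: str) -> dict:
    """Extract the title and subject values from one RDF/XML record."""
    root = ET.fromstring(rdf_text)
    title = root.findtext(".//dcterms:title", default=None, namespaces=NS)
    subjects = []
    for subject in root.iterfind(".//dcterms:subject", NS):
        for value in subject.iterfind(".//rdf:value", NS):
            if value.text:
                subjects.append(value.text.strip())
    # Subject metadata is inconsistent, so an empty list is a normal result.
    return {"title": title, "subjects": subjects}


if __name__ == "__main__":
    print(read_metadata(SAMPLE_RDF))
```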

Not all books have an html version. Most PG books are plaintext, and _some_ have a separate html variant. A handful are written in a markup language that can become html or plaintext.


+1 for bitbucket


This is really awesome.



