What it looks like to process 3.5M books in Google’s cloud (googlecloudplatform.blogspot.com)
88 points by doppp on Feb 16, 2016 | 11 comments



Would have been nice to include a cost estimate.


While not a cost estimate, one of the pages linked off of the article ("sample queries") says this:

"NOTE that the complete fulltext of all Internet Archive books over these 122 years reaches nearly half a terabyte, so a single query across the entire fulltext collection will use up nearly half your monthly Google BigQuery free quote, so work with the fulltext field SPARINGLY."


Did I miss any discussion of what the "processing" is?

Using the Stanford Part-Of-Speech tagger, my goofy project, Ashurbanipal, can tag all the words in one book in about 8 seconds on one core, or all ~25,000 books from the Project Gutenberg 2010 DVD image in about 8 hours on my 4-core (hyperthreaded) laptop with a 10GB JVM heap.
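
For anyone curious what that looks like, the tagger itself takes only a few lines of Java to drive. Here's a minimal sketch (not Ashurbanipal itself; the file path and model name are just the ones that ship with the Stanford tagger download):

    import edu.stanford.nlp.tagger.maxent.MaxentTagger;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class TagOneBook {
      public static void main(String[] args) throws Exception {
        // Model file comes with the Stanford POS tagger distribution.
        MaxentTagger tagger =
            new MaxentTagger("models/english-left3words-distsim.tagger");

        // Read one plain-text Gutenberg book and tag every token.
        String text = new String(
            Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);

        // Output is whitespace-separated word_TAG pairs, e.g. "Call_VB me_PRP ..."
        System.out.println(tagger.tagString(text));
      }
    }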


Nope, there was almost no mention of what this was actually used for. The closest I found was a mention of the final output:

"single output files, tab-delimited with data available for each year, merging in publication metadata and other information about each book"

[edit] More info in a link at the bottom of the article: http://blog.gdeltproject.org/3-5-million-books-1800-2015-gde...


Cloud providers' pricing structures don't usually shine in edge cases, which I believe this project qualifies as. I would imagine the total cost is prohibitive for the average hobby user, and that the author neglected to mention it to hide this fact, or because he received special pricing through working at Google or being closely affiliated with it.

Still a really cool project. It just doesn't sell GCE very well for the use case it embodies, big-data hobby projects (although I'm sure it could be applied similarly to business problems).


In my experience, the cloud is affordable especially for bursty edge cases. It only gets unaffordable for hobbyists when you run instances 24/7 or generate significant amounts of traffic, which is really expensive.

But since the pricing is public (https://cloud.google.com/pricing/#pricing), we can check; a rough sketch of the arithmetic follows the list below. Please double-check my calculations, I may have made a mistake somewhere.

* "single 8-core Google Compute Engine (GCE) instance with a 2TB SSD persistent disk ... downloaded the books to the instance’s local disk" Unfortunately doesn't say how long it took. A n1-standard-8 instance costs $0.4 per hour without any discounts, plus a neglegible amount for 10 GB OS disk space. A 375 GB local SSD costs $0.113 per hour, so let's assume a total of about $1 per hour. Pretty affordable if you just run it for a day or so.

* "ten 16-core High Mem (100GB RAM) GCE instances (160 cores total) to process the books, each with a 50GB persistent SSD root disk" => 500 GB of persistent SSD root disk at $0.17/GB-month would be 85 per month, so about $3 per day for the storage. The instances are about $1 per hour each, so ~$10/hour. Affordable, but can cost a pretty penny if you need to let it run for more than a day.

* "single 32-core instance with 200GB of RAM, a 10TB persistent SSD disk, and four 375GB direct-attached Local SSD Disks" $2 per hour for the highmem instance, another ~$2.36/hour for the persistent SSD, plus the local SSDs, for a total of slightly under $5/hour. Again, dirt cheap if you need it for 1-2 hours, expensive if 1-2 hours turn into 10-20.

* cloud storage... no information on how much data there was, but $0.02/GB-month for durable reduced availability storage (which seems like a reasonable choice). For 10 TB, that would be ~$7/day ($200/month). There are additional costs for writing and access: 100,000 "Class A" operations (e.g. writes) cost $1, so at roughly one write per book that'll likely be at least another $35 for writing the files. Class B operations (reads) cost 1/10th of the price.

* traffic - inbound and internal traffic is free, so likely negligible if you just want to analyze a lot of data. However, getting the full data set out would likely be very expensive. $0.12 per GB quickly adds up: 1 TB would be $120, and 10 TB would be ~$1,110 (the per-GB rate drops slightly after the first TB)! OTOH, if you just need 100 GB of results out, that's $12.
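
To make the totals concrete, here's the same arithmetic as a throwaway snippet. Every runtime in it is my guess (the article doesn't say how long each stage ran), so treat the output as an order-of-magnitude figure only:

    public class CostSketch {
      public static void main(String[] args) {
        // List prices from above; durations are pure guesses.
        double download   = 1.0 * 24;            // n1-standard-8 + SSD at ~$1/hr, ~1 day
        double processing = 10.0 * 12 + 1.5;     // ten n1-highmem-16 at ~$10/hr, ~12 hrs, plus SSD
        double merge      = 5.0 * 10;            // 32-core highmem + disks at ~$5/hr, ~10 hrs
        double storage    = 7.0 * 14 + 35.0;     // ~10 TB in GCS for two weeks, plus ~3.5M writes
        double egress     = 0.12 * 100;          // pull only ~100 GB of results out
        double total = download + processing + merge + storage + egress;
        System.out.printf("Rough total: ~$%.0f%n", total);  // lands in the few-hundred range
      }
    }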

All in all, I'd say that "a couple hundred bucks" is a pretty reasonable assumption, which is still affordable for dedicated hobbyists (I just looked up the price of Märklin model locomotives, which also cost a couple hundred). Especially considering that you get $300 of free trial quota if you sign up - if you're fast, you may even be able to run this for free.

BigQuery pricing is highly dependent on what you query, but I'd call $5 per TB of data queried affordable (i.e. for $5, you can run 10 queries over a 100 GB dataset, and only the columns you touch in your query actually count against the amount of data). And the performance is just insane.
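
And if you want to know what a query will cost before you run it, the BigQuery client libraries support a dry run that reports bytes scanned without executing or billing anything. A rough sketch with the Java client (the table and column names here are illustrative, not necessarily the real GDELT schema):

    import com.google.cloud.bigquery.*;

    public class QueryCostEstimate {
      public static void main(String[] args) {
        BigQuery bq = BigQueryOptions.getDefaultInstance().getService();

        // Dry run: nothing executes and nothing is billed; BigQuery just
        // reports how many bytes the query would scan.
        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(
                "SELECT BookMeta_Title FROM `gdelt-bq.internetarchivebooks.1920`")
            .setDryRun(true)
            .setUseQueryCache(false)
            .build();

        Job job = bq.create(JobInfo.of(config));
        JobStatistics.QueryStatistics stats = job.getStatistics();
        double tb = stats.getTotalBytesProcessed() / 1e12;
        System.out.printf("Would scan %.3f TB, ~$%.2f at $5/TB%n", tb, tb * 5.0);
      }
    }

Only the columns a query actually references show up in that byte count, which is why the note about using the fulltext field sparingly matters so much.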


Damn. I underestimated GCE. Thanks for doing the research, you cleared up my naivety on this.

The per hour pricing is especially surprising.


very nice analysis...

I wrote about this topic the other week: https://cloud.google.com/blog/big-data/2016/02/understanding...

Basically, in VM terms BigQuery lets you scale to thousands of cores in seconds, for just a few seconds, and you get per-second billing. Pretty cost-efficient :)


Great analysis, and if you are cost-sensitive, be sure to have a much smaller dataset to practice the pipeline and subsequent queries on too! Blowing $300 per failed attempt can get expensive.


Seems odd one wasn't included. At the least, I expected a grand total.


Anybody find a download for the dataset? Would prefer to take a look at it locally.



