Search-Script-Scrape: Web scraping exercises in Python 3 for data journalists (github.com/compjour)
61 points by danso on Aug 16, 2015 | 12 comments



If you are a young journalist being told "data journalism" is the must have resume bullet point for the future, here's some unsolicited advice from someone in techland who has worked with journalists.

1. Don't waste your time on this stuff if you have no interest in or aptitude for it. I see people being pressured into it when it's not the right fit. The kind of people who will have success with this are the Nate Silvers of the world, who are really domain experts dabbling in journalism.

2. Being a journalist gives you access to data and access to experts. Bring the two together whenever you can. It takes time and skill to develop that access, and in most cases it's time better spent than learning Python. Matt Taibbi is a good example of this: he was able to make sense of something complex (the 2008 meltdown) by bringing the data and the experts together. No Python necessary.


OP here: I don't necessarily disagree with what you've said here. The "Computational Journalism" class is an elective at Stanford, and while some of the students are from the journalism program, others come from more technical fields such as CS or MSE. The programming part is not a huge challenge for them...but besides the exposure to civic issues and data policy, for some of them, this is the first time they've worked with things like web scraping and public-facing APIs (as was the case for me in my computer engineering degree program, though that was years ago).

So there's a decent sized group of technically-apt students at Stanford who are interested in journalism. And my advice to them would be to at least intern as traditional reporters, as there's no better way to learn the work of developing access and sources (as well as interviewing and writing on deadline!).

That said, there are opportunities to quickly explore a domain if you're skilled at data collection and analysis. One of the best examples I can think of is this writeup by a couple of data reporters about their investigation into Florida cops:

http://ire.org/blog/on-the-road/2011/12/20/behind-story-trac...

> This was a case where the government had this wonderful, informative dataset and they weren’t using it at all except to compile the information. I remember talking to one person at an office and saying: “How could you guys not know some of this? In five minutes of (SQL) queries you know everything about these officers?” They basically said it wasn't their job. That left a huge opportunity for us.

This scenario -- in which the data is freely available but no one thinks to simply collect it into a spreadsheet -- is just the tip of the iceberg of data work that needs to be done...but I'd be lying if I said that this kind of low-hanging fruit was rare...There's plenty of information out there that's just begging for efficient examination...to paraphrase a classic adage, the problem today is not that we lack information, but that we lack ways of filtering and understanding it.
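
To make that concrete, here's a minimal sketch of the "just collect it into a spreadsheet" pattern -- the endpoint URL and field names below are hypothetical placeholders, not a real dataset:

    # Minimal sketch: pull a freely available dataset and dump it to CSV.
    # The URL and field names are hypothetical placeholders; swap in
    # whatever public dataset you're actually pulling from.
    import csv
    import requests

    records = requests.get("https://example.gov/api/officers.json").json()

    with open("officers.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "agency", "complaints"])
        for rec in records:
            writer.writerow([rec.get("name"), rec.get("agency"), rec.get("complaints")])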

I'll leave aside the debate of how worthwhile it is to try to teach programming to traditional journalists -- it's definitely not easy work...but there's a great deal of potential in teaching compsci students about civic and journalistic issues and how, specifically, to apply their skills. I turned out OK after first spending a few years as a newspaper reporter, but I think I missed some opportunities to hit bigger stories...back then, I had no concept of mixing my programming background with my journalism.


Appreciate your answer and what the course is trying to do.

All I'll add is: it's good to be aware of the contradiction all that data presents. The contradiction shows up in your post, and I have a feeling you are aware of it.

"Quickly exploring a domain" and "efficiently examining the data" are inherently contradictory. To resolve that contradiction (going back to my previous post) is to (a)get an expert involved as quickly as possible or (b)become the expert.

And it's healthy for someone starting out (be it a journo dabbling in compsci or a programmer dabbling in journalism) to keep asking themselves, based on their aptitude and motivation, which road they are taking.


Unless I'm missing something, the README doesn't mention that all the examples rely on "requests" (which is neither in the standard library nor Python 3 specific, so the title is a tad misleading): https://pypi.python.org/pypi/requests
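
For comparison, here's roughly what the same fetch looks like with the standard library versus requests (the URL is just a placeholder):

    # Fetching a page with the standard library vs. the third-party
    # requests package (pip install requests).
    from urllib.request import urlopen

    import requests

    url = "https://www.example.com"

    # Standard library: no install needed, but you decode bytes yourself.
    html_stdlib = urlopen(url).read().decode("utf-8")

    # requests: shorter, with encoding detection and friendlier errors.
    html_requests = requests.get(url).text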


I don't get it. What part of the title would imply that it has no dependencies outside of the standard library?


Maybe I am being picky, but is traversing JSON files truly "web scraping"?


No, you're correct...these exercises were deliberately kept programmatically simple -- e.g. single loops and conditional statements -- since not every student had much CS experience, never mind web scraping. In cases where JSON is being parsed, it's usually because that's the easiest way to access the data...but the "skill" in the exercise is recognizing when a website feeds from such an API...and then going directly to that source.

For example, usajobs.gov is a consumer-friendly job-search site. You could find the number of librarian jobs by manipulating the web form...or you could do a little looking around and see that there's an API:

https://data.usajobs.gov/Rest

And just as importantly, there's an official taxonomy for federal jobs: https://www.opm.gov/policy-data-oversight/classification-qua...
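
A rough sketch of what that API lookup might look like -- note that the endpoint path, parameter names, and response fields below are assumptions for illustration; check the documentation at data.usajobs.gov/Rest for the real ones:

    # Rough sketch of querying a JSON search API instead of scripting
    # the web form. The endpoint, parameters, and response fields are
    # assumptions; consult https://data.usajobs.gov/Rest for the actual API.
    import requests

    resp = requests.get(
        "https://data.usajobs.gov/api/jobs",   # assumed endpoint
        params={"Series": "1410"},             # 1410 = librarian, per the OPM taxonomy
    )
    resp.raise_for_status()
    print(resp.json().get("TotalJobs"))        # assumed response field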

So being able to look at a website and deduce what might be behind it is good enough...and is actually what I would do in a real-world situation rather than just trying to reverse engineer a site.

And there's the increasingly common situation in which the website loads data client-side, such as analytics.usa.gov...and so inspecting the network traffic and working with the JSON files is the only way to collect the data displayed on the website.
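
Once the network inspector reveals the JSON feed behind the page, the fetch itself is trivial -- the path below is from memory and may have moved, so inspect the site's traffic to find the current one:

    # Fetch the JSON feed a page loads client-side, found via the
    # browser's network inspector. The path below is an assumption;
    # inspect analytics.usa.gov's traffic for the real one.
    import requests

    url = "https://analytics.usa.gov/data/live/realtime.json"  # assumed path
    data = requests.get(url).json()
    print(data)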


One of the most important skills a web scraper can have is being able to take the easier path and use an API where it's available. APIs, after all, are just really, really nicely formatted webpages that follow an additional set of generally agreed-upon standards. If you were interviewing a web scraper for your company and they didn't know how to parse JSON, you'd probably think twice about giving them an offer. I think it's entirely appropriate -- and necessary -- to include JSON parsing in the list of exercises. (Also, I recently wrote "Web Scraping with Python" (O'Reilly), and had a whole CHAPTER on APIs.)
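
And the traversal itself is worth drilling on its own; a tiny illustration with a made-up payload:

    # Walking a nested JSON payload. The structure here is made up, but
    # the pattern -- decode, then traverse dicts and lists defensively --
    # is the skill being tested.
    import json

    payload = json.loads(
        '{"results": [{"office": {"name": "Records Division"}, "count": 12}]}'
    )

    for item in payload.get("results", []):
        office = item.get("office", {}).get("name", "unknown")
        print(office, item.get("count"))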


Many of those scripts will most likely fail within a few weeks, as their data-extraction logic is way too simplistic and based on unstable, non-semantic HTML structures (e.g. doc.cssselect('small a')[0]).


That's just the nature of web scraping.
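
Though you can at least make the breakage loud instead of silently grabbing the wrong element. A sketch of one defensive habit, using lxml (pip install lxml cssselect):

    # Fail loudly when a fragile selector stops matching, rather than
    # raising a bare IndexError or scraping the wrong element.
    import lxml.html

    def first_or_die(doc, selector):
        """Return the first match for a CSS selector, or raise a clear error."""
        matches = doc.cssselect(selector)
        if not matches:
            raise ValueError("selector %r matched nothing; did the page layout change?" % selector)
        return matches[0]

    html = "<html><body><small><a href='/x'>hi</a></small></body></html>"
    doc = lxml.html.fromstring(html)
    print(first_or_die(doc, "small a").text_content())   # the selector from the comment above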


For aspiring journalists, I should think a class like this is a godsend: those who put the work in will have, at the end of the semester, a potent set of tools for specific data gathering (e.g., which California city manager earned the most last year?). Each student forks this repo and builds their own web-crawling toolbox. Kudos to the professor who conceived and teaches this course.


Others might fail to understand that these are the tools, not the talents, needed to pursue their career.



