Hacker News
It shouldn't take 64 lines of code to do something really simple (onlinelabor.blogspot.com)
63 points by john_horton on Oct 2, 2011 | 23 comments



FWIW the greater Department of Labor (of which the BLS is a subordinate unit) has a pretty okay web API with several "SDKs" (most of which are API wrappers for particular languages). Weirdly, there's no Python wrapper yet, but I wrote one, as well as another for Clojure:

https://github.com/mattdeboard/python-usdol

https://github.com/mattdeboard/clj-usdol

edit: A couple of the BLS's databases are available through the DOL API:

http://developer.dol.gov/BLS_Numbers-DATASET.htm

"This Dataset contains historic data (last 10 years) for the most common economic indicators. More information and details about the data provided can be found at http://bls.gov

http://developer.dol.gov/DOL-BLS2010-DATASET.htm

"The Bureau of Labor Statistics produces occupational employment and wage estimates for over 450 industry classifications at the national level."

That last one is actually a pretty interesting dataset. I pulled it to tinker with at work last week.


Given the crazy complicated XML schemas people come up with, I'd say you are lucky it's just plain text, and you can parse it without needing a library.
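
For instance, a few lines of stock Python handle a tab-delimited text file with nothing but the standard library (the column layout here is an assumption for the sketch, not the exact BLS one):

    import io

    # Stand-in for a tab-delimited series file (layout is illustrative).
    data = io.StringIO("series_id\tyear\tvalue\nLNS14000000\t2011\t9.1\n")
    header = [h.strip() for h in data.readline().split("\t")]
    for line in data:
        row = dict(zip(header, (v.strip() for v in line.split("\t"))))
        print(row)  # {'series_id': 'LNS14000000', 'year': '2011', 'value': '9.1'}

Try doing that with an arbitrary XML schema and no parser library.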


Ironically, at work we buy a bunch of BLS datasets from a third-party vendor. Their format is even worse: an opaque binary database and a win32 DLL that reads it. 1000 lines of Haskell later, it sort of works...

(This is mostly because they chose to represent dates as integers, and they have four possible date types: yearly, quarterly, monthly, and daily. 1901, in their exciting world, could mean "year 1901" or it could mean "month January 1919". That's nice work, Lou.)
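
To make the ambiguity concrete, here's a toy Python decoder (the encodings are my guesses for illustration, not the vendor's actual format):

    from datetime import date

    # The same integer decodes differently depending on the date type.
    def decode(n, date_type):
        if date_type == "yearly":       # NNNN = the year itself
            return date(n, 1, 1)
        if date_type == "monthly":      # YYMM, years counted from 1900
            yy, mm = divmod(n, 100)
            return date(1900 + yy, mm, 1)
        raise ValueError(date_type)

    print(decode(1901, "yearly"))   # 1901-01-01 -> "year 1901"
    print(decode(1901, "monthly"))  # 1919-01-01 -> "January 1919"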


Sounds like Excel.

Where do you work that uses Haskell in the office?


I work for BofA, but I'm pretty sure I'm the only one that uses Haskell.

I used Haskell because I hate Windows, and Haskell lets me do all my work in a maximized Emacs session without having to install anything other than Emacs and the Haskell Platform. "When life gives you Windows, don't make lemonade. Use Haskell."


I don't doubt this guy's assertion that the data could be formatted better, but if I were working on something like this I'd just be glad that I could access the data at all.

And 64 lines of code... bfd? That's much better than the bad old days of custom binary data formats, where you had to write a thousand-line data loader in C, worrying about issues like LSB vs. MSB byte order, whether the number storage properly follows IEEE 754, etc.
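
For anyone who never had the pleasure, a quick Python sketch of the byte-order headache (struct is the standard-library module for this):

    import struct

    # The same four bytes give different integers depending on byte order,
    # the kind of thing a hand-rolled binary loader must get right per field.
    raw = b"\x00\x00\x04\xd2"
    print(struct.unpack(">I", raw)[0])  # big-endian (MSB first): 1234
    print(struct.unpack("<I", raw)[0])  # little-endian (LSB first): 3523477504

    # IEEE 754 floats raise the same question:
    print(struct.unpack(">f", b"\x3f\x80\x00\x00")[0])  # 1.0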

I guess I'm just...old, because my first response to 'I had to write 64 lines of Python code to get this data into shape' is to shake my head and think "Kids these days!".


I wish I could parse everything I've been asked to parse with 64 lines of Python.

FWIW, I find it ironic that the author is complaining about whitespace-implied structure in his source data, yet uses Python.


My point was not that this was absolutely hard, but rather that it's way too hard for the totally mundane thing I was trying to accomplish---namely, getting public data from a US-funded statistical agency into a form suitable for further statistical analysis (which is the whole reason they make this data public).

Also, I don't see how it's ironic that I used Python---it seems irrelevant. Whitespace in Python has a well-defined, commonly understood purpose; using tabs or spaces to indicate hierarchical data relationships is not at all standard, presumably because it creates messy dependencies across rows of data.


"Messy dependencies across rows" is exactly how Python whitespace works. That data file exactly Pythonic indentation in the main content section.


The reason using indentation to imply structure doesn't make sense for datasets is that in data analysis you are constantly subsetting and sorting data at the row level---something you generally don't do to the lines of a computer program.
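
A minimal sketch of what that means in practice: each row's meaning depends on the rows above it, so you have to flatten the hierarchy into explicit columns before any row-level operation is safe (the labels below are made up):

    rows = [
        "Mining",
        "\tCoal",
        "\t\tUnderground",
        "Construction",
    ]

    # Turn indentation-implied hierarchy into explicit full paths.
    parents, flat = [], []
    for row in rows:
        depth = len(row) - len(row.lstrip("\t"))
        parents[depth:] = [row.strip()]   # truncate to the current depth
        flat.append(tuple(parents))       # each row now carries its context

    print(sorted(flat))  # safe to sort or subset; the raw rows were not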


Yup, I worked on this very data problem (bad data sources, especially government ones) for a long time, but gave up as I didn't see it as the road to riches. At the end I hit upon an AI solution I'm sure would work, but I never finished it; it's still on the shelf.


Isn't this the problem that data.gov is trying to solve?


I'm guessing that's the goal, but it's only going to be as good as the inputs provided by the different agencies.

For example, if you look for the time series I was after on data.gov, the "Download CSV" button brings you to...the BLS page, with all the problems I discussed.

http://explore.data.gov/Labor-Force-Employment-and-Earnings/...



I don't understand the headline; the Python code in the article is not so terrible considering the input.


His point is precisely that the input is awful.


Yes, the article says the input is awful, but this headline implies the code is awful.


+1

I was thinking something along the lines of: look at this stupid person who needs 64 lines of code to do a simple insert in a binary tree.


Can you give some brief examples of what you might do differently? Particularly, how would you solve the problem with fewer lines of code and/or complexity?


I'm suggesting the Python code is not bad at all, so this headline is just confusing. Personally I'd just use a regex replace to convert the data rows to CSV directly, though that wouldn't be as thorough as the Python code.
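
Something in this spirit, assuming the data rows are whitespace-separated columns (the sample line is illustrative):

    import re

    line = "LNS14000000   2011   M09   9.1"
    # Collapse each run of whitespace into a comma.
    print(re.sub(r"\s+", ",", line.strip()))
    # -> LNS14000000,2011,M09,9.1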


Oh I see. Sorry, I missed the "not".


Semantic web :) Solves everything, not kidding.

Although in a few years, when there are a million new standards, I guess I can keep my job :)


Why am I downvoted? The Semantic Web does indeed solve this and many more problems, and I'm the only one to mention it. The Semantic Web has standards for data, standards for semantic mark-up, and the ability to connect multiple datasets together. It also allows for standardisation of vocabulary and for sharing and reuse.
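
As a toy illustration of the idea in Python with rdflib (the vocabulary here is made up; real deployments would use a shared one like SDMX or the RDF Data Cube):

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/stats#")
    g = Graph()
    # Each observation is a set of triples with explicit, queryable structure.
    g.add((EX.obs1, EX.series, Literal("unemployment")))
    g.add((EX.obs1, EX.year, Literal(2011)))
    g.add((EX.obs1, EX.value, Literal(9.1)))

    q = """
    PREFIX ex: <http://example.org/stats#>
    SELECT ?year ?value
    WHERE { ?o ex:series "unemployment" ; ex:year ?year ; ex:value ?value }
    """
    for year, value in g.query(q):
        print(year, value)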



