How ProPublica Illinois Uses GNU Make to Load Data (propublica.org)
191 points by danso on July 11, 2018 | 62 comments



A lot of comments in here are poking fun at how little data it is relative to a commercial data mining operation. The data they process and what they do with it is worth more to society than any number of petabytes crunched to target ads. Processing petty petabytes is not praiseworthy.

If you focus on the headline, you'll miss the point. The point is they used open source technology to process public data for reporting once the government stopped updating its own tools.


Maybe we'll just take the gigabytes per day bit out of the title so it doesn't trigger people.


Would it be possible to put the gigabytes back in the title? In general, ETL of gigabytes of data can involve complicated operations, e.g., use of statistical models. And, the utility of data is not determined by the size of data. One has to be a pretty petty person to make fun of this article for data sizes.


You're right on the facts, but that's not the way a forum like this works. Like it or not (and probably no one likes it), the way it works is that a minor irrelevant provocation in the title completely determines the discussion. The solution is to take out the minor provocation, even if in principle we shouldn't need to.


That was almost every comment before I posted. All of them seem to be grayed out now.


Hear hear! It's a good solution for this domain using reliable, free tools. And other folks without a lot of compute resources or experience setting up Spark clusters or whatever can easily adapt their approach. Hats off.


Not just "open source technology", but battle-tested, tried-and-true technologies. Articles like these should remind us that we don't need to keep rebuilding tools to solve the same set of problems -- sometimes, all it takes is some familiarity with what already exists.


I honestly didn't know Make did all that. I thought it was just a build script thing for complex software. This was enlightening.


The production databases on the project I build and support are around 100GB in size - tiny. But if they didn't work correctly, ambulances wouldn't arrive at the correct location on time, nor would fire engines or other emergency services workers.

Covering an area of 1.25 million square kilometres, supporting 40,000 first responders helping to protect 8 million people.

Of course the databases are not the only important part of an emergency services network such as ours, but they are a critical component.

I would rather work on a project like this any day than working to prop up some faceless advertising/data collection behemoth such as Facebook or Google.


As someone who has done a lot of data processing in journalism, I've found the engineering issues aren't usually about scale, but about the harder problems of data cleaning/wrangling/updating: in particular, interoperability with opaque government systems, and transformation/delivery to a variety of users, including ones with a high variation in technical skill (i.e. journalists) and an extremely low tolerance for public-facing errors in the finished product.

I started ProPublica's Dollars for Docs [0], and the initial project involved <1M rows. Gathering the data involved writing scrapers for more than a dozen independent sources (i.e. drug companies) using whatever format they felt like publishing (HTML, PDF, Flash applet). This data had to be extracted and collated without error for the public-facing website, as mistakenly listing a doctor had a very high chance of prompting legal action. The hardest part by far was distributing the data and coordinating the research among reporters and interns internally, and also with about 10 different newsrooms. I had to work with outside investigative journalists who, when emailed a CSV text file, thought I had attached a broken Excel file.

Today, the D4D has millions of records, and the government now has its own website [1] for the official dissemination of the standardized data. I have a few shell scripts that can download the official raw data -- about 30GB of text when unzipped -- and import it into a SQLite DB in about 20 minutes. The data for the first D4D investigation probably could've fit in a single Google Sheet, but it still took months to properly wrangle; the computational bottleneck was never the size of the data.
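
The download-and-import part really is that simple now. Roughly this shape, as a sketch only -- the URL variable, file name, and table name here are placeholders, not the real Open Payments paths:

    # fetch and unzip the raw data, then bulk-load one CSV into SQLite;
    # in .mode csv, .import creates the table from the header row if it doesn't exist
    curl -sL "$OPEN_PAYMENTS_ZIP_URL" -o payments.zip
    unzip -o payments.zip -d raw/
    printf '.mode csv\n.import raw/general_payments.csv general_payments\n' \
        | sqlite3 payments.db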

One of the other tricky issues is that data management isn't easy in a newsroom. Devops is not only not a traditional priority, but anyone working with data has to do it fairly fast, and they have to move on almost immediately to another project/domain when done. There's not a lot of incentive or resources to get past a collection of hacky scripts, so it's really cool (to several-years-ago me) to see a guide about how to get things started in a more proper, maintainable way.

[0] https://projects.propublica.org/docdollars/

[1] https://www.cms.gov/openpayments/

edit: for a more technically detailed example of newsroom data issues, check out Adrian Holovaty's (creator of Django) 3-part essay, "Sane Data Updates Are Harder than You Think", which details the ETL process for Chicago crime data:

https://source.opennews.org/articles/sane-data-updates-are-h...

Here's a great write-up by Jeremy Merrill, who helped overhaul the D4D project after I left. Unlike me, Jeremy was a proper engineer:

https://www.propublica.org/nerds/heart-of-nerd-darkness-why-...


I suspect that anyone who has worked in tech at more "traditional" non-tech businesses would be far more familiar with the challenges inherent in any ETL undertaking. It's usually critical business data, so there's a strong incentive to avoid errors there, too.

The trouble is, despite (or possibly because of) being cognitively difficult and requiring a certain discipline (for lack of a better word), this kind of work doesn't make for very "sexy" anecdotes.

Even if it does get shared, the part that makes it hard gets overlooked.


It isn't even about tech vs "non-tech". It's about whether you get data in a consistent format or not. Where I used to work, we would get a gigabyte-sized file of random XML without any documentation and be told to deal with it, the first step being to tell the non-technical people what we had. Another delivery would be something totally different. Saying "oh, we deal with petabytes" is missing the point. If there's nothing unexpected or unknown about the data, then it's not a challenge, because, you know, computers process stuff automatically.


I guess I was making the assumption that "tech" businesses are more likely to have data that's entirely generated by modern software (e.g. click logs) or at least pre-coerced into a consistent, if not structured, format (e.g. tweets), whereas non-tech businesses are more likely to have data that's free-form human input or comes from disparate/arbitrary machine sources (e.g. scientific instruments, the mainframe or AS/400 worlds).

I'm sure there's a spectrum, but my point was that the vast majority of what the companies we read about on this site ("tech") deal with is going to fall close to the consistent-format edge of the spectrum, hence the prejudice.


Pettybytes. I like that.


> The data they process and what they do with it is worth more to society than any number of petabytes crunched to target ads.

No it isn't. You are getting defensive for no reason. If ProPublica ceased to exist, or had never existed, it wouldn't matter a single bit to the world. You could even argue the world would be better off.

> Processing petty petabytes is not praiseworthy.

From a technical point and many other ways, it is.

I don't get why you are getting offended by people making a jab at the scant amount of data. Last I checked, hacker news is a technology oriented site. And from a technology point of view, what pro publica is doing is a joke. It's a toy amount of data.

Why not just say pro publica is not a technology company and hence people shouldn't expect technological feats of wonder?

> The point is they used open source technology to process public data for reporting once the government stopped updating its own tools.

Which is something I could have done on a lazy afternoon all by myself. It isn't anything to be impressed about. But good for them anyways.


ProPublica is not a technology company, it's a non-profit investigative journalism outlet.

They and countless other journalism/civic orgs would likely be happy for you to show them up by whipping up usable ETL scripts relevant in their respective domains. Since it all involves public open data you don't have to wait for anyone's permission.


So people can't make a point about any aspects of accomplishment that are not engineering feats?

Whether or not Propublica has produced something of value for society seems, at an absolute minimum, highly debatable.

Your comment is the one that seems defensive....


The way qubax sees the world matters about as much as qubax says ProPublica's work does.

You’re being too literal. Yes if nothing exists it doesn’t matter

Look in the mirror and realize you’re just one of thousands that could do this in an afternoon

Given the “big picture” context, your personal computer skills aren’t much to brag about either. Literally good with computers. Get in line.

Were they being defensive? Or offering a context to consider the value from?


Since they have "A Note about Security", how about locking down that Python environment?

- Add hashes to their already-pinned requirements.txt deps: https://pip.pypa.io/en/stable/reference/pip_install/#hash-ch...

- Add a Makefile entry to run `[ -d your-environment ] || ( virtualenv your-environment && . your-environment/bin/activate && ./your-environment/bin/pip install --no-deps --require-hashes -r requirements.txt )`
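
For reference, a hash-pinned entry in requirements.txt looks roughly like this (the digest below is a placeholder, not a real one; pip-compile --generate-hashes from pip-tools can produce the real values):

    requests==2.19.1 \
        --hash=sha256:<sha256 digest of the wheel/sdist goes here>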


Minor nitpick about their exit code technique [0]: The command checks if the table exists, but it does not appear to re-run if the source file has been updated. Usually with Make you expect it to re-run the database load if the source file has changed.

It's better to use empty targets [1] to track when the file was last loaded and re-run if the dependency has changed.

[0] https://github.com/propublica/ilcampaigncash/blob/master/Mak...

[1] https://www.gnu.org/software/make/manual/html_node/Empty-Tar...
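
A sketch of the empty-target version, with made-up file and script names -- the .loaded stamp file stays empty, and only its timestamp matters:

    # re-load only when the source CSV is newer than the stamp
    # (recipe lines must start with a tab)
    contributions.loaded: contributions.csv
            ./load_contributions.sh contributions.csv
            touch contributions.loaded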


> The first is that we use Aria2 to handle FTP duties. Earlier versions of the script used other FTP clients that were either slow as molasses or painful to use. After some trial and error, I found Aria2 did the job better than lftp (which is fast but fussy) or good old ftp (which is both slow and fussy). I also found some incantations that took download times from roughly an hour to less than 20 minutes.

Tangential question: is it possible to use wget for FTP duties? Though there may be additional FTP-specific functionality in `aria2c`, of course:

https://serverfault.com/questions/25199/using-wget-to-recurs...


aria2 is multi-connection (aria2c -x5 means five concurrent connections); that's the main reason for the speed bump


Does this increase speed when it's only downloading a single file at a time? It might be better off using Make's multi-process option (make -j 5) to be able to process data while still loading other data.


Each connection requests a different range within the same file and they download together.
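
e.g. something along these lines (the URL is a placeholder):

    # up to 5 connections to the server, with the file split into 5 ranges
    aria2c -x5 -s5 "ftp://example.com/pub/expenditures.txt"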


Have a look at the wget manual. There are lots of FTP-related options.


Perhaps he meant in their specific scenario, since he linked a page that shows use of wget for FTP.


Yes, I should have specified that I was interested in what aria2 provides for FTP in addition to what a more ubiquitous tool like wget seemingly has. u/rasz says aria2 allows multi-connections, so that seems sensible: https://news.ycombinator.com/item?id=17508858


Make is often brought out for "single machine ETL" data jobs, but for big, complicated (and iterative) workflows it doesn't feel good enough to me.

What do you folks use? Drake, "make for data" https://github.com/Factual/drake seems ok, but doesn't have "batch" jobs (aka "pattern rules"), where you can process every file in a directory matching a pattern.

Others have come up with different swiss army knives, but nothing ever sticks for me; it usually ends up as a single Makefile with, e.g., 3 targets that call a bunch of shell scripts.

The whole thing would be configurable to build from scratch, but not well set up to do incremental ETL on a per-file basis after I, say, delete some extraneous rows in one file, clean up a column, redownload a folder, or add files to a dataset.
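
The per-file incremental part is roughly what a GNU Make pattern rule over a directory would give me (a sketch, with made-up paths and script names):

    SRC := $(wildcard raw/*.csv)
    OUT := $(patsubst raw/%.csv,clean/%.csv,$(SRC))

    all: $(OUT)

    # only inputs newer than their outputs get reprocessed
    # (recipe lines must start with a tab)
    clean/%.csv: raw/%.csv
            ./clean_one.sh $< > $@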


I use Snakemake [1], a parallel make system for data, designed around pattern-matching rules. The rules are either shell commands or Python 3 code.

I settled on it after originally using make, getting frustrated with the crazy work-arounds I needed to implement because it doesn't understand build steps with multiple outputs, switching to Ninja where you have to construct the dependency tree yourself, and finally ending up on Snakemake which does everything I need.

[1] https://snakemake.readthedocs.io/en/stable/
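
A rule looks roughly like this (a sketch; the paths and clean.py script are made up), and Snakemake wires up the dependency graph from the wildcards:

    rule all:
        input: expand("clean/{name}.csv", name=["contributions", "expenditures"])

    # one generic rule covers every {name}, like a Make pattern rule
    rule clean:
        input: "raw/{name}.csv"
        output: "clean/{name}.csv"
        shell: "python clean.py {input} > {output}"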


Thank you for sharing this information about snakemake. I administer a cluster for a group of geneticists. I'll try to get them to use it for their publications to make their results easily reproducible by others.


Just today, I used xargs instead of spending a lot of time building a batching script in Python. I wanted to launch a bunch of processes in a queue but only execute 10 of them in parallel at any time.

Here is a skeleton of what I came up with.

    find $(pwd) -mindepth 1 -maxdepth 1 -type d -name ".zfs" -prune -o -type d -print0|xargs -0 -P 2 -I {} echo {}
where:

$(pwd) is the starting point for the directory listing

-mindepth 1 makes sure the current directory itself is not listed

-maxdepth 1 makes sure the listing does not recurse

-type d -name ".zfs" -prune makes it ignore the .zfs (snapshot) directories

-o -type d -print0 prints the remaining directories, separated by null bytes instead of newlines (plain -print prints one result per line, which breaks on names containing whitespace)

xargs -0 reads that null-separated input, so spaces and newlines in the names are handled safely

-P 2 runs two processes at once in parallel

-I {} replaces {} in the subsequent command with each item piped into xargs, so the command becomes echo dir1, then echo dir2, and so on

That's just an example to show that we can do a lot with standard Unix tools before bringing in more sophisticated external tooling for data-related tasks.


And with GNU parallel, which can take the place of xargs, you can even distribute that job across multiple machines easily (as long as they're accessible by SSH).
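
e.g. something like this (hosts and the exact command are placeholders); --trc transfers each input to the remote host, returns the named output, and cleans up afterwards:

    # 2 jobs per host, across the local machine (:) and two SSH hosts
    find data/ -name '*.csv' | \
        parallel -j2 -S :,user@host1,user@host2 --trc {}.gz gzip -k {}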


Yes, I need to look into whether and how GNU Parallel will queue up tasks if I restrict the number of parallel processes.

In my case, I was dealing with a FreeBSD server. I went the xargs route instead of installing something that is not available by default.


Someone on Twitter mentioned Luigi, which was previously developed and maintained by Spotify, as a distributed Make written with Python: https://github.com/spotify/luigi

Not sure if Spotify still uses it but it is in their Github org.


Luigi is great although I don't think it's easy to add "rerun if source file updated". Would love to be wrong on that.

http://pachyderm.io seems great but does require more engineering support (needs a kubernetes cluster)


I'm a fan of Apache Airflow for large, complicated ETL processes especially those with depth and breadth in their dependencies.


> [...] --ftp-passwd="$(ILCAMPAIGNCASH_FTP_PASSWD)" ftp://ftp.elections.il.gov/[...]

Is that using traditional (plaintext) FTP? Is it listening on port 21?

    ~ $ ftp ftp.elections.il.gov
    Connected to ftp.elections.il.gov (163.191.231.32).
    220-Microsoft FTP Service
    220 SBE
    Name (ftp.elections.il.gov): ^C
It looks like they are sending their password in plaintext. aria2 supports SFTP, so they should really talk to elections.il.gov about moving to SFTP or any other protocol that doesn't send the password in plaintext.


I imagine there would be other systems (state-owned and private) that use the FTP server, and maybe in a way that changing protocols is inexplicably full of friction. I wonder why the elections server, assuming it only contains records legal to distribute to the public, is even password protected. Maybe it was a policy when govt bandwidth was scarce. California, for example, has campaign finance data on a public webserver: https://www.californiacivicdata.org/

And the FEC has an API, but has long had the data hosted on public FTP: https://classic.fec.gov/finance/disclosure/ftp_download.shtm...


This FTP server supports TLS:

  211-Extended features supported:                     
   LANG EN*
   UTF8
   AUTH TLS;TLS-C;SSL;TLS-P;
   PBSZ
   PROT C;P;
   CCC
   HOST
   SIZE
   MDTM
   REST STREAM
I didn't check if aria2 would actually use it, but I doubt it.

Re alternatives to FTP: as far as only downloads are concerned, HTTPS should be way easier to set up than SFTP.


I did something like this a few years ago! I needed to do a bunch of transformations and measurements of data that came in on a regular basis. Make was a perfect fit - I could test the whole process with a single command, either cleaning just the result data or nuking everything to make sure it pulled stuff in properly.

I spent some time trying to write my own processing system in Python before realizing this was a familiar task...


I really like make! I use it almost every day. I like it for the structure and simplicity. I don't use it for everything. I plan to use it for the foreseeable future.

Why I like make over shell scripting (sometimes) is that it enforces structure. Shell scripts can turn into a real hairball.

When I did ruby I really enjoyed using rake.


On the whole debate revolving around gigabytes in the title, I'd like to add:

There's a well-substantiated linguistic theory revolving around "maxims of conversation". Maxims of conversation are so strongly universal among the speakers of a given language that they become part of the implied meaning of a conversational act.

For example, the maxim of cooperativity implies that when a person sitting in a cold room next to a window is told by a person sitting further from the window, "It's a bit chilly, isn't it?", they can take it to mean "Please close the window".

https://en.wikipedia.org/wiki/Implicature#Conversational_imp...

Similarly, there are certain maxims of conversation which are part of the language game inherent in the formulation of the title of a blogpost. Titles are kind of assumed to be boasting about something. So when somebody says "We figured out a way to load a gigabyte's worth of data into a database in a single day", the being-boastful-about-something maxim is violated. That's why it triggered so many people.

And pointing out that this is not something to be boastful about is a perfectly valid thing to do to keep certain facts straight.

...just saying.

But, by all means, if you get a thrill out of it, keep downvoting me.


Can someone help me understand the advantages of Make over a bash script? Isn't bash superior in almost every way?


Make has a dependency system. You can tell it how to create file A, and that it depends on file B, and tell it how to create file B. Then if you request file A, it will check whether file B exists, and create it first if it's missing or outdated.

That's very valuable for building things. If you change a file, you only need to re-build the files that could reasonably be affected by the change. And things happen in the right order without micro-managing.
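
In Makefile terms, the file A / file B example looks like this (names made up; recipe lines must start with a tab):

    # "load" needs contributions.csv, which is itself rebuilt only when
    # the raw download is newer than it
    contributions.csv: contributions_raw.txt
            ./clean.sh contributions_raw.txt > contributions.csv

    load: contributions.csv
            ./load.sh contributions.csv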


They can produce the same result but they do it in different ways and require you to express it in different ways.

Make has you describe a graph of outputs and how to produce them. It then traverses the graph to produce the requested output.

Bash is just a regular sequence of commands, with functions and loops if you wish.

If the pipeline you need to run can easily be turned into a dependency graph, I think make is a great fit. It's easy to use, comes with most of what you need built in, and has some fun extras, like -jXXX, which allows you to parallelise things, and built-in caching so you don't regenerate the same asset twice if you don't need to.

You can do all that in bash but you'll have to write it yourself, which takes time you could spend on other things.


The sibling comments give the top-level answer.

In addition, expanding on @Pissompons's note -- make gives you job-level parallelism for free with constructs like:

    make -j 24 transform
which will (if possible/allowed by the dependence structure in the Makefile) run 24 jobs at once to bring "transform" up to date.

So for instance, if "transform" depends on a bunch of targets, one for each month across a decade, you get 24-way parallelism for free. It's kind of like gnu "xargs -P", but embedded within the make job-dispatcher.
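
A sketch of that month-per-target shape (paths and the transform script are made up, and the raw per-month files are assumed to already exist):

    MONTHS  := $(shell seq -w 1 12)
    YEARS   := $(shell seq 2008 2017)
    TARGETS := $(foreach y,$(YEARS),$(foreach m,$(MONTHS),build/$(y)-$(m).csv))

    transform: $(TARGETS)

    # one pattern rule covers all 120 targets (recipe lines start with a tab)
    build/%.csv: raw/%.txt
            ./transform.sh $< > $@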


In addition to all the features the sibling comments noted, it's important to note that there's no contradiction:

bash is usually the scripting language one uses inside of a Makefile.

It's the default, although one could use any scripting language. Point being, there's no "Make" language, beyond the syntax for describing those dependency relationships and variable assignments.


I believe /bin/sh is the default, not bash. But this can be changed.


You're right that it's /bin/sh, but, since it could be (and is, in some cases) bash, it's not quite right to call it "not bash", either.

I'll grant that the distinction is important, though, in the face of the history of #!/bin/sh Linux scripts with bashisms breaking upon the Debian/Ubuntu switch to dash. Even if you're on a system where /bin/sh is bash, it's safest to set SHELL in your GNU makefiles to bash explicitly, if that's what you're writing in.
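
i.e. near the top of the makefile:

    # use bash (not /bin/sh) for all recipe lines in this makefile
    SHELL := /bin/bash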


Make is an excellent way to automatically decide whether or not to run your bash scripts (and other scripts, shell commands & executables), depending on whether the existing output is newer than the available input.


If you are creating files, and these files depend on other files, Make will do the dependency resolution for you. This is difficult to do with Bash and a huge reason to use Make.

Once you start abusing the ".PHONY" targets, the value starts to decrease.
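
(".PHONY" targets are targets that don't correspond to an actual file, so Make runs them unconditionally -- an illustrative sketch with made-up names:)

    .PHONY: clean load

    clean:
            rm -rf build/

    # when most targets look like this, Make is no longer tracking real files
    load:
            ./load_everything.sh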


This is exactly the kind of purpose I love seeing open source tools used for. Kudos to propublica for leveraging open source to improve their ability to function!


While it's nice to see Propublica use open source software, keep in mind dozens and dozens of other news organizations use the same tools.


Shoulda used kdb!




I appreciate that it's under a /nerds path:)


It's cute.


Was that supposed to say Petabytes? Gigabytes is really not that impressive.


It doesn’t seem like the size is supposed to be impressive, although I do not know why it is in the title. This is about the use of make.


...well, that was what I was trying to point out.


Tbh I would have opted for Jenkins with declarative pipelines in his situation. Then you get logging, events, cron, CI, all more or less out of the box.

IMO, this is a bad use case for make.



