
Thanks! Did not know about this site. Annoying that they convert to PDF.

Any idea how frequently this is published?

Looks like Al Jazeera is shutting down its office in San Francisco. 68 people getting the axe on Aug 5th.

> 05/07/2018 08/05/2018 05/11/2018 Al Jazeera International (USA), LLC San Francisco San Francisco Closure Permanent

Source: PDF link in parent comment.




Did a quick check on Internet Archive. According to this April 23, 2018 snapshot, the most recent file is said to cover July 1, 2017 through April 10, 2018, so 1 to 2 week delay?

http://web.archive.org/web/20180423074904/https://www.edd.ca...
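For what it's worth, the arithmetic behind that estimate, as a quick sketch. The two dates are the ones mentioned above; treating the snapshot date as a stand-in for the publication date is an assumption, so the result is only an upper bound on the delay (the file could have been posted any time before the crawl).

```python
from datetime import date

# Dates taken from the comment above.
snapshot_date = date(2018, 4, 23)   # Internet Archive snapshot
coverage_end = date(2018, 4, 10)    # last day the file is said to cover

# Upper bound on the publication delay: the file existed by the snapshot date,
# but may have been posted earlier.
max_delay = snapshot_date - coverage_end
print(max_delay.days)
```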

Over the years I've had some reason to analyze them, and I've done a half-assed job of collecting and parsing them into usable data. This repo from 2 years ago contains the PDFs as translated by ABBYY FineReader (in my experience, the best converter on the market, at least sub-$100):

https://github.com/datahoarder/ca-warn

Today I started a new repo (forgetting about my previous one). I've been wanting to create a series of repos showing how I "casually" practice programming and data analysis. That is, satisfy and iterate upon a curiosity without going all-in on best software engineering practices. It's aimed at people who've tried to learn coding themselves but don't have a job in it and don't know how to practice it in the wild, just for "fun":

https://github.com/hackbashscoop/california-warn

Not much there except a simple wget invocation to pull the latest files, and the use of Poppler's pdftotext to convert them into plaintext files. Even though it's unstructured text, I think it's regular enough to be parseable with some regular expressions. For reference's sake, I've done an ABBYY PDF-to-Excel conversion (and will write a Python script to do the remaining data wrangling), but you can do what you want with the spreadsheets as they currently are:

https://github.com/hackbashscoop/california-warn/tree/master...
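A minimal sketch of the regex idea, not what's in the repo. It assumes each record in the plaintext sits on one line and starts with three MM/DD/YYYY dates, like the Al Jazeera row quoted upthread. The names `ROW_RE` and `parse_row` are made up for illustration, and I'm deliberately not labeling the three dates or splitting the rest of the line, since the company/city/county boundaries aren't delimited in the text.

```python
import re
from datetime import date, datetime

# Assumption: a record line begins with three MM/DD/YYYY dates, then free text.
DATE = r"\d{2}/\d{2}/\d{4}"
ROW_RE = re.compile(rf"^\s*({DATE})\s+({DATE})\s+({DATE})\s+(.+?)\s*$")


def parse_row(line):
    """Return (dates, rest) for a line that starts with three dates, else None."""
    m = ROW_RE.match(line)
    if m is None:
        return None
    # strptime raises ValueError on impossible dates such as 13/45/2018.
    dates = [datetime.strptime(d, "%m/%d/%Y").date() for d in m.groups()[:3]]
    return dates, m.group(4)


# Sample row copied from the comment upthread.
sample = (
    "05/07/2018 08/05/2018 05/11/2018 Al Jazeera International (USA), LLC "
    "San Francisco San Francisco Closure Permanent"
)

dates, rest = parse_row(sample)
print([d.isoformat() for d in dates])
print(rest)
```

Lines that don't match (page headers, blank lines, wrapped company names) come back as `None`, so you can log them and see how regular the text really is before investing in a smarter parser.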


somewhat OT, but fwiw I recently stumbled upon the Fonduer project which does some interesting extraction methods beyond just OCR. https://hazyresearch.github.io/snorkel/blog/fonduer.html

They have a pdf-to-tree package which i haven't had good results from but perhaps i need to finally learn ML and try to train models for this a bit: https://github.com/HazyResearch/pdftotree



