Thanks! Did not know about this site. Annoying that they convert to PDF. Any ide...

danso · on July 19, 2018

Did a quick check on Internet Archive. According to this April 23, 2018 snapshot, the most recent file is said to cover July 1, 2017 through April 10, 2018, so 1 to 2 week delay?

http://web.archive.org/web/20180423074904/https://www.edd.ca...
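For what it's worth, the "1 to 2 week" guess is just the gap between the last date the file covers and the snapshot date. A quick sketch of that arithmetic (the variable names are mine; the snapshot only shows the file existed by then, so this is an upper bound on the lag, not the actual publishing schedule):

```python
from datetime import date

# Last date covered by the most recent file, per the archived page
coverage_end = date(2018, 4, 10)
# Date of the Internet Archive snapshot (from the 20180423... timestamp in the URL)
snapshot_date = date(2018, 4, 23)

# The file was already up when the snapshot was taken, so the real lag
# between "data through" and "published" is at most this many days.
lag_days = (snapshot_date - coverage_end).days
print(f"Upper bound on publishing lag: {lag_days} days")
```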

Over the years I've had some reason to analyze them, and I've done a half-assed job of collecting and parsing them into usable data. This repo from 2 years ago contains the PDFs as translated by ABBYY FineReader (in my experience, the best converter on the market, at least sub $100):

https://github.com/datahoarder/ca-warn

Today I started a new repo (forgetting about my previous one). I've been wanting to create a series of repos showing how I "casually" practice programming and data analysis. That is, satisfy and iterate upon a curiosity without going all-in on best software engineering practices. It's aimed at people who've tried to learn coding themselves but don't have a job in it, and don't know how to practice it in the wild and just for "fun":

https://github.com/hackbashscoop/california-warn

Not much there except a simple wget invocation to pull the latest files, and the use of Poppler's pdftotext to convert them into plaintext files. Even though it's unstructured text, I think it's regular enough to be parseable with some regular expressions. For reference's sake, I've done an ABBYY PDF-to-Excel conversion (and will write a Python script to do the remaining data wrangling), but you can do what you want with the spreadsheets as they currently are:

https://github.com/hackbashscoop/california-warn/tree/master...
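To give a rough idea of the pdftotext-plus-regex approach, here's a minimal sketch. The row layout (three dates, then company and city, then an employee count and a layoff/closure type) and the sample rows are made up for illustration, not copied from the actual WARN files, so the pattern would need adjusting against real output. `pdf_to_text` shells out to Poppler's `pdftotext` with `-layout` and isn't called in the example itself:

```python
import re
import subprocess
from pathlib import Path


def pdf_to_text(pdf_path: Path) -> str:
    """Convert a PDF to text with Poppler's pdftotext.

    '-layout' tries to preserve the column layout; '-' sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# ASSUMED row layout (hypothetical): three MM/DD/YYYY dates, free text for
# company and city, an integer employee count, then the layoff/closure type.
ROW = re.compile(
    r"^\s*(?P<notice>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<effective>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<received>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<company_and_city>.+?)\s+"
    r"(?P<employees>\d+)\s+"
    r"(?P<kind>\D+?)\s*$"
)


def parse_rows(text: str) -> list[dict]:
    """Return one dict per line that looks like a data row; skip everything else."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # headers, page footers, blank lines
        row = match.groupdict()
        row["employees"] = int(row["employees"])
        # Collapse the column padding that -layout leaves behind
        row["company_and_city"] = " ".join(row["company_and_city"].split())
        rows.append(row)
    return rows


# Made-up sample in the assumed layout, standing in for pdftotext output
SAMPLE = """\
Notice Date  Effective Date  Received Date  Company  City  Employees  Layoff/Closure
07/03/2017   09/01/2017      07/05/2017     Acme Widgets Inc   Fresno   42   Layoff Permanent
07/10/2017   09/15/2017      07/11/2017     Studio 54 Club     Oakland  118  Closure Permanent
"""

rows = parse_rows(SAMPLE)
for row in rows:
    print(row["notice"], row["company_and_city"], row["employees"], row["kind"])
```

Splitting company from city is left undone on purpose: with only whitespace between columns it's ambiguous, which is where the Excel conversion probably has the edge.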

seltzered_ · on July 19, 2018

somewhat OT, but fwiw I recently stumbled upon the Fonduer project which does some interesting extraction methods beyond just OCR. https://hazyresearch.github.io/snorkel/blog/fonduer.html

They have a pdftotree package which i haven't had good results from, but perhaps i need to finally learn ML and try to train models for this a bit: https://github.com/HazyResearch/pdftotree