Somewhat OT, but fwiw I recently stumbled upon the Fonduer project, which uses some interesting extraction methods that go beyond plain OCR: https://hazyresearch.github.io/snorkel/blog/fonduer.html

They have a pdf-to-tree package that I haven't had good results with, but perhaps I need to finally learn ML and try training models for it: https://github.com/HazyResearch/pdftotree