guiomie 5 months ago | on: Classifying all of the pdfs on the internet

Interesting and fun article! I've been experimenting with various LLM/GenAI solutions to extract tabular data from PDFs, with underwhelming results. They seem good at extracting strings of text and summarizing (e.g., what was the total price? when was this printed?), but reliably extracting a table into a CSV still has a sizable margin of error.

abhi_p 5 months ago

Disclosure: I'm an employee at Aryn.

Give the Aryn partitioning service a shot: https://www.aryn.ai/post/announcing-the-aryn-partitioning-se...

We released it recently, and we have a few examples here: https://sycamore.readthedocs.io/en/stable/aryn_cloud/get_sta... that show how to turn the tabular data from a PDF into a pandas DataFrame (which you can then write out as CSV).
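
For the last step (DataFrame to CSV), here is a minimal sketch using only pandas. It assumes the table has already been extracted into a list of row dictionaries; the sample rows and the `table_to_csv` helper are made up for illustration and are not part of the Aryn or Sycamore API.

```python
import pandas as pd

# Hypothetical rows standing in for a table already extracted from a PDF.
rows = [
    {"item": "Widget", "qty": 2, "unit_price": 9.99},
    {"item": "Gadget", "qty": 1, "unit_price": 24.50},
]


def table_to_csv(rows):
    """Build a DataFrame from row dicts and return it as CSV text."""
    df = pd.DataFrame(rows)
    # index=False keeps the pandas row index out of the CSV output.
    return df.to_csv(index=False)


csv_text = table_to_csv(rows)
print(csv_text)
```

To write a file instead of returning a string, pass a path to `to_csv`, e.g. `df.to_csv("table.csv", index=False)`.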
