I used OpenAI's function calling (via Langchain's https://python.langchain.com/v0.1/docs/modules/model_io/chat... API).
Some of the challenges I had:
1. Poor recall for some fields, even across a wide variety of input document formats.
2. Needing to experiment with the JSON schema (particularly the field descriptions) to get the best information out and to ignore superfluous information.
3. For each long document, deciding whether to send the whole document in the context or only the most relevant chunks (using traditional text search and semantic vector search).
4. Poor-quality OCR.
From the demo video, it seems like your main innovation is allowing a non-technical user to do #2 in an iterative fashion. Have I understood correctly?
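To make #2 concrete, the sketch below shows roughly what that iteration looks like in code. The field name and descriptions are invented for illustration; the point is that once the schema is built from a dict of descriptions, tightening a description is a one-line change.

```python
def extraction_tool(field_descriptions: dict[str, str]) -> dict:
    """Build an OpenAI-style function/tool schema from field descriptions."""
    return {
        "type": "function",
        "function": {
            "name": "extract_fields",
            "description": "Extract structured fields from the document.",
            "parameters": {
                "type": "object",
                "properties": {
                    name: {"type": "string", "description": desc}
                    for name, desc in field_descriptions.items()
                },
                "required": list(field_descriptions),
            },
        },
    }

# First attempt: vague description, prone to grabbing the wrong number.
v1 = extraction_tool({"invoice_total": "The total."})

# Revised: spells out what to extract and what to ignore.
v2 = extraction_tool({
    "invoice_total": "Grand total due including tax, as printed on the final "
                     "page. Ignore subtotals, line-item amounts, and totals "
                     "quoted inside terms-and-conditions text.",
})
```

Each revision gets passed as the tool schema on the next extraction run, which is exactly the loop a non-technical user would want a UI for.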
We face the same challenges you listed and handle all of the above.
1. Out-of-the-box OCR doesn't perform well on complex documents (those with tables, images, etc.), so we use a vision model to help process them.
2. Recall (for longer documents) and accuracy are also major problems. We built in validation systems and references to help users validate the results.
3. Maintaining these systems in production, integrating with data sources, and refreshing when new data comes in are all quite tedious. We manage that for the end users.
4. For non-technical users, we let them iterate on different business logic and give them one unified place to manage their data workflows.
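On the whole-document-vs-chunks question (#3 in the list above), the decision can be sketched roughly as below. This is only a stand-in: the word count is a crude token estimate, and keyword overlap substitutes for a real hybrid text + vector search.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def select_context(chunks: list[str], query: str, token_budget: int) -> list[str]:
    """Send the whole document if it fits; otherwise keep the best-scoring chunks."""
    def cost(text: str) -> int:
        return len(text.split())  # crude token estimate (assumption)

    if sum(cost(c) for c in chunks) <= token_budget:
        return chunks  # whole document fits in the context window

    # Stand-in relevance score: keyword overlap with the query.
    scored = sorted(chunks, key=lambda c: -len(tokens(query) & tokens(c)))
    picked, used = [], 0
    for chunk in scored:
        if used + cost(chunk) <= token_budget:
            picked.append(chunk)
            used += cost(chunk)
    return picked

chunks = [
    "Payment terms: net 30 days from invoice date.",
    "Our company was founded in 1987 in Springfield.",
    "Invoice total: $4,210.50 including tax.",
]
context = select_context(chunks, "invoice total amount due", token_budget=10)
```

Under the tiny budget here, only the invoice-total chunk survives; with a large budget the whole document goes through unchanged.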