I've had to do some of this recently, as a one-off, to extract the same fields from thousands of scanned documents.

I used OpenAI's function calling (via Langchain's https://python.langchain.com/v0.1/docs/modules/model_io/chat... API).

Some of the challenges I had:

1. poor recall for some fields, even with a wide variety of input document formats

2. needing to experiment with the JSON schema (particularly field descriptions) to get the best info out and ignore superfluous information (rough sketch after this list)

3. for each long document, deciding whether to send the whole document in the context, or only the most relevant chunks found with traditional text search and semantic vector search (second sketch at the end of this comment)

4. poor quality OCR
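
For #2, here's a rough sketch of the setup. The field names are invented, and this uses the newer with_structured_output API rather than the v0.1 function-calling wrapper I actually used; most of the tuning effort went into the wording of the Field descriptions.

    from pydantic import BaseModel, Field
    from langchain_openai import ChatOpenAI

    # Hypothetical fields; the real schema had many more, and the description
    # strings were what I kept iterating on to improve recall.
    class ExtractedFields(BaseModel):
        vendor_name: str = Field(description="Legal name of the issuing vendor, not the recipient")
        total_amount: float = Field(description="Grand total including tax; ignore subtotals")
        issue_date: str = Field(description="Date the document was issued, ISO 8601 if possible")

    document_text = "INVOICE  Acme Corp ... Total due: $1,234.00 ... Date: 2024-05-01"

    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    extractor = llm.with_structured_output(ExtractedFields)

    result = extractor.invoke(
        "Extract the requested fields from the document below.\n\n" + document_text
    )
    # result is an ExtractedFields instance, e.g. result.total_amount == 1234.0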

From the demo video, it seems like your main innovation is allowing a non-technical user to do #2 in an iterative fashion. Have I understood correctly?
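
And for #3, when a document was too long to send whole, the chunk selection looked roughly like this. Again just a sketch: it reuses document_text from above, the query string and chunk size are invented, and the plain keyword-search side isn't shown.

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data])

    # Naive fixed-size chunking; the real splitting was a bit smarter.
    chunks = [document_text[i:i + 2000] for i in range(0, len(document_text), 2000)]

    query = "vendor name, total amount, issue date"
    chunk_vecs, query_vec = embed(chunks), embed([query])[0]

    # OpenAI embeddings are unit length, so a dot product is cosine similarity.
    scores = chunk_vecs @ query_vec
    top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:3]]

    # This goes into the extraction prompt instead of the full document.
    context = "\n\n".join(top_chunks)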




We face the same challenges you listed and handle all of the above.

1. Out-of-the-box OCR doesn't perform well on complex documents (with tables, images, etc.), so we use a vision model to help process those documents.

2. Recall (for longer documents) and accuracy are also major problems. We built in validation systems and references to help users verify the results.

3. Maintaining these systems in production, integrating with data sources, and refreshing when new data comes in are quite annoying; we manage that for the end users.

4. For non-technical users, we let them iterate on different business logic and give them one unified place to manage data workflows.


Would recommend using the updated guide here! That link is from the v0.1 docs. https://python.langchain.com/v0.2/docs/how_to/structured_out...

OOC which OpenAI model were you using? Would recommend trying 4o as well as Anthropic's Claude 3.5 Sonnet if ya haven't played around with those yet.
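
Both drop into the same structured-output call, e.g. (a sketch; the model names are just whichever snapshots are current, and it assumes langchain-openai and langchain-anthropic are installed):

    from pydantic import BaseModel, Field
    from langchain_openai import ChatOpenAI
    from langchain_anthropic import ChatAnthropic

    class Fields(BaseModel):
        vendor_name: str = Field(description="Legal name of the issuing vendor")

    # Same schema, different backends; with_structured_output works on both.
    gpt = ChatOpenAI(model="gpt-4o").with_structured_output(Fields)
    claude = ChatAnthropic(model="claude-3-5-sonnet-20240620").with_structured_output(Fields)

    gpt.invoke("Invoice from Acme Corp, total due $42.00")
    claude.invoke("Invoice from Acme Corp, total due $42.00")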


Thanks.

I was using gpt-3.5-turbo-0125. It was before the recent pricing change.

But I have a bunch of updates to make to the JSON schema, so I'll re-run everything with gpt-4o-mini.

Sonnet seems a lot more expensive, but I'll 'upgrade' if the schema changes don't get sufficiently good results.


Nice. Could also give Haiku a try!


FWIW I've seen noticeably better results on (1) and (4) extracting JSON from images via Claude, although (2) and (3) still take effort.
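
In case it's useful, the image path looks roughly like this with the Anthropic SDK. A sketch only: the file name, prompt, and field names are made up, and you still want to schema-validate whatever JSON comes back.

    import base64, json
    import anthropic

    client = anthropic.Anthropic()

    with open("page_scan.png", "rb") as f:  # hypothetical scanned page
        img_b64 = base64.b64encode(f.read()).decode()

    msg = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": img_b64}},
                {"type": "text",
                 "text": "Extract vendor_name, total_amount and issue_date from this page. "
                         "Reply with a single JSON object and nothing else."},
            ],
        }],
    )

    fields = json.loads(msg.content[0].text)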


Thanks for sharing.

I'm curious what types of source documents you tried, and whether you ever suffered from hallucinations?



