Hacker News new | past | comments | ask | show | jobs | submit login
RAGFlow is an open-source RAG engine based on OCR and document parsing (github.com/infiniflow)
230 points by marban 6 months ago | hide | past | favorite | 53 comments



Took me some time to figure out how to run it, but the layout recogniser model hosted on huggingface is pretty good!

It correctly identifies tables that even paid models like the AWS Textract Document Analysis API fails to – for instance tables with one column which often confuse AWS even if they have a clear header and are labelled "Table" in the text.

I would however love to know broadly what kind of document it was trained on, as my results could be pure luck, hard to say without a proper benchmark

Very nice layout recognition, although I can't quite comment on the RAG performance itself – I think some of the architecture decisions are odd, it mixes a bunch of different PDF parsers for example which will all result in different quality and it's not clear to me which one it defaults to as it seems to be different in different places in the code (the simple parser defaults to pypdf2 which is not a great option)


What's the name of the layout recorgniser model? I did not have a good experience extracting layout from tables, especially those without column boundaries (space instead of lines to demarcate boundaries)


it's https://huggingface.co/InfiniFlow/deepdoc and the code for usage is in https://github.com/infiniflow/ragflow/blob/main/deepdoc/READ... – it took me a bit of trial and error to get it working

It seems to be a YOLOv8 fine-tune, I only did a couple tests but results were decent. Another model that is supposed to be fine tuned for borderless is https://huggingface.co/keremberke/yolov8m-table-extraction but I haven't had great results myself with it, but maybe worth a try for you.


Thank you very much!


Here's a quick test to run: if you have Windows and MS Office, File->Open your PDF and report the results. You might be surprised at the layout extraction quality.


This is because PDF has so many different versions. A third-party tools like pdfplumber won't fit it all. For example, using pdfplumber to parse some PDFs will cause the system to raise exceptions. Sometimes fitz works in situations where pdfplumber won't handle well. It looks a bit complicated, but RAGFlow is using multiple parsing tools to handle different types of PDFs.


It seems to be limited to certain LLM servers, on of which is OpenAI, none of which includes e.g. Mystral and popular OSS LLMs.

I wonder if that will change - eventually.

Discord channels are named in Chinese, though there are English posts.


It's trivial to run a proxy server that routes all OpenAi calls to another LLM, even local ones. See litellm-proxy.



Looks like they do but aren't really documented yet:

https://github.com/infiniflow/ragflow/pull/119


I see a `LocalLLM` chat model where it looks like you can pass a host/port (for example, ollama's)


where do you see that?



RAGFlow will support more LLMs, including locally deployed LLMs.


Document processing is getting better and better with new tools leveraging LLMs. If anyone is interested in exploring this space, try another similar tool LLMWhisperer (https://llmwhisperer.unstract.com/). It is a part of Unstract, an open-source document processing tool (https://github.com/Zipstack/unstract)


Actually we've tried almost all lof existing open source models for document processing, and none of them performs well for complex documents, especially those having complicated tables, such as tables cells without borders, cells need to be combined,...,etc. Although adopting LLMs to perform such document understanding tasks is more scalable, it requires much more data and computation power to achieve similar results. That's why we design such models start from scratch.


Do you have any recommendations for extraction from Powerpoint documents? Those seem like the worst since each of the layouts tend to be unique (unlike the AMEX Statements in the example)


Cool! This is really helpful.


I'm partly sad at the approach this and other engines take: reimplement each part (PDF parser, etc etc) in a way where they are pretty much useless except in their specific engine.

If instead we had a PDF() class that did what RAGFlow is doing (dealing with all the different trade-offs of the different python PDF engines such as pdfplumber), then we could easily adapt it and improve it, and it can be useful for other projects as well.


It is open-source though. Just rip it off and make that PDF() class.


Love this counterpoint to "OSS means I can get everybody else to do work for me for free" => "allows you to do the work yourself and share buddy" PR or it didn't happen


Each project has its own detailed requirements and scenarios, and we cannot demand that each project use same library to implement similar functions


If only they supported local LLMs out of the box. I have a very specific use case buy it needs to run locally offline only. Any suggestions/recommendations from fellow HN users are more than welcomed :-)


To be honest, RAGFlow already supports this but has not documented this local deployment process yet, as we are still working on simplifying this process, and will release this feature soon. Please keep tuned!


Looks like they do but aren't really documented yet:

https://github.com/infiniflow/ragflow/pull/119


The recognition of file layout to parse file content is indeed a very innovative idea. Compared to many open-source projects we have seen before, this provides us with new ideas for using RAG to solve problems in the future. I hope the author of this project will continue to update it, so that more people can benefit from it.


Is there a JavaScript library? Both LlamaIndex and Langchain have nice JS/TS packages on npm. Could thinly wrap a JS client around this Python API but the community aspect of having an official library is nice.

Also might be helpful to have a simple example on the README showing how to fetch a document and start querying it. I would try it!


Hi bschmidt1, This is a good feature. We do plan to support it soon. Please stay tuned. If you have further suggesions, welcome to file an issue with us.


I am curious: if I pass a pdf file such as a research report or an open source text book, what will be the result. Does it create a knowledge graph? triples? Thanks in advance for comments.


A lot of the yolo stuff from ultralytics is AGPL3 fyi. Recommend caution depending on what code or models / model lineage are used


Thanks for your nice suggestion. We train the model using YOLO, but during inference, the model is converted into ONNX and we use ONNXRuntime for the model inference. As a result, YOLO itself is not included in the software package. We will open the training code in the repo soon.


Does anyone know what software was used to create the "System Architecture" diagram?


Hello friend,

I would like to recommend the FigJam feature in Figma to you, a common tool used by high-tech companies for creating Information Architecture (IA). Our company always adheres to the tool usage standards of high-tech companies.

For guidance on how to use the FigJam feature in Figma, you can refer to this link: https://youtu.be/axDzyLEfYgU?si=V6tqO_tEUKYuLxrL (or search for FigJam on YouTube).

Here are three quick tips on how to efficiently create Information Architecture(Also called System Architecture):

Start with the Sketch method by drawing your product's Information architecture on paper. Communicate with your PM, design, and development team to validate its effectiveness.

Open FigJam in Figma to turn it into an electronic version. This step is crucial as design and development will follow the version in FigJam for collaborative work.

When creating your Information Architecture (IA), first identify the core functionalities of your product, represented by one color. Next, determine what sub-functions each core functionality can be divided into, marked by a second color. Finally, decide what detailed functionalities compose each sub-function, indicated by a third color. Continue in this manner until you complete the entire IA construction.

It’s important to note that perfecting the IA is not a linear process; it will go through multiple iterations and modifications. Every great product undergoes this process. Also, initially, you can focus on creating an IA for just one core functional module (usually the innovative feature with the highest user pain point) without defining the entire scope. In essence, flexibly establishing the IA to achieve company goals is the primary task.


This is really cool! Starred and look forward to seeing how this develops further!


Apparently "deep document understanding" refers to OCR and structured document parsing: https://github.com/infiniflow/ragflow/blob/main/deepdoc/READ...

Since "deep document understanding" is not a term of art, I would have just said "OCR and document parsing".

How well does it work? Please include benchmarks. You may be interested in

https://paperswithcode.com/sota/optical-character-recognitio...

https://paperswithcode.com/task/document-layout-analysis

The models seem to be closed source, hosted here: https://huggingface.co/InfiniFlow/deepdoc


We've used YOLOv8 as the object detection model, and use some public datasets, such as PubTable, CDLA, together with some private data to train the model. The model on Huggingface is the one trained using public dataset, and we would open this work later. We use YOLOv8 just because we want to let the document parser run without GPU, I think you could also try any other object detection models such as Detectron, and use the public datasets to train the model as well. We've not used transfomers for this task, because given limited data, it could not outperform traditional CNN based models.


A bit off topic but every time I see DTrOCR I remember that marketing is a good idea because DTrOCR[1] and Fuyu[2] are basically the same architecture[3].

[1] https://arxiv.org/pdf/2308.15996v1.pdf [2] https://www.adept.ai/blog/fuyu-8b [3] If you don't want to search for the figures I made a tiny post about it on my weblog: https://weblog.snats.xyz/posts/2024/02/16/


RAGFlow uses Yolov8 for its OCR/layout recognition/TSR(table structure recognition). And RAGFlow uses large amount private data to train these models for them to perform well in some specialized scenarios.


Ok we've taken deep document understanding out of the title above. Thanks!


I am curious about the performance of their OCR and layout and table detection. Hopefully it’s on par with Amazon, Google, or Microsoft’s tools.


I know this is just an open source project but it’s a good example of why you might want to consult a woman before naming things.


The authors are not native English speakers (and what you’re seeing in the name is not something most non native English speakers would spot).

I understand it’s funny to point and laugh “ha ha dumb dudes didn’t think about something female related because they’re dudes”, but it’s worth remembering that not everyone is from the anglosphere.

Also you might find this normal but I cannot imagine my female colleagues being amused at me calling one of them up asking to “consult” about a name solely because she’s a woman.


I didn't point it out because it was funny-ha-ha, but because it's a teachable example of a linguistic mis-step that could be easily resolved by diversifying your reviewers - in this case, since you mention it, perhaps a native English speaker?

I'm not even suggesting that this project needs a new name - although I think if they were naming a consumer-facing product or company, someone in marketing would push back on the name almost immediately.


As a native English speaker I would not dock points from this name for the gender related points as that doesn't even cross my mind, it's not relevant.

Keep gender out of engineering.


> Also you might find this normal but I cannot imagine my female colleagues being amused at me calling one of them up asking to “consult” about a name solely because she’s a woman.

I'd certainly hope it wouldn't be "solely" because she's a woman. There's no people you'd be interested in input from that happen to be women?

A project name is something you'd want to throw around to a few people (ideally with different perspectives) and make sure it conveys what you intend and has the right tone and such.


The assumption that author didn't talk to a woman sounds pretty knee-jerk to me. Can you explain what you mean?

"On the rag" is a semi-common slang for menstruation, but "rag" has way more meanings than just menstrual pads. IMO, as a marketing term, "RAGflow" sounds about as odd as "Nintendo Wii".



People said the same thing about the iPad.


Hi, buddy, I'm sorry to make you feel like we don't take women's voices into account. We came up with this name because RAG is already a consensus, as an acronym for retrieval augmented generation, RAG is used in many places, and we think it would actually be a standard for LLM oriented B-side scenarios, so that's why we adopted this name.


I don't consider this strictly open source if components it depends on (I.e, the LLM) is closed source. I've seen a lot of these Fauxpen source style projects around


It only depends on the interface. There's a lot of projects which present the openai interface to whatever you want.


Not quite certain about your meaning. Could you be more specific? RAGFlow does not have its own LLM model or souce code. RAGFlow supports API calling from third-party large language model providers, as well as local deployment of these large models. RAGFlow has open-sourced these two parts of codes already.





Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: