Show HN: Docfd: TUI multiline fuzzy document finder (github.com/darrenldl)
44 points by darrenldl 10 months ago | 17 comments
Think interactive grep for text files, PDFs, DOCXs, etc, but word/token based instead of regex and line based, so you can search across lines easily.

Docfd aims to provide good UX via integration with common text editors and PDF viewers, so you can jump directly to a search result with a single key press.

---

I originally wrote this tool to help me dig through text/markdown notes, since very often I want to search for a phrase of sorts that may span across multiple lines, but constructing the corresponding multiline regex is a bit painful (especially painful if I want to account for fuzziness), and I still have to guess the ordering of the words. Furthermore, I'd still have to sift through all the results, even the irrelevant ones.

fzf's core is kinda there for some cases, but I want the fuzziness to be constrained around each individual word, i.e. each word matching some other word within +/- N edit distance, instead of a whole line matching another whole line within +/- N edit distance or some other metric. And fzf also doesn't do multiline as far as I can tell (or I'm being silly).
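To make the "fuzziness per word" idea concrete, here is a minimal Python sketch. It is not Docfd's actual implementation; the function names, the tokenizer, and the default `max_dist` are assumptions chosen for illustration. Each query word must be within a small edit distance of some word in the text, and line breaks are treated as ordinary whitespace.

```python
import re

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def tokenize(text: str) -> list[str]:
    # Newlines are just whitespace here, so a phrase may span lines.
    return re.findall(r"\w+", text.lower())

def fuzzy_phrase_match(query: str, text: str, max_dist: int = 1) -> bool:
    """True if every query word is within max_dist edits of some word in text."""
    doc_words = tokenize(text)
    return all(
        any(edit_distance(q, w) <= max_dist for w in doc_words)
        for q in tokenize(query)
    )
```

Because the distance budget applies to each word separately, a typo in one word does not loosen the match for the others, and word order does not matter.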

Since nothing quite fit my taste, I decided to make something that has a relatively straightforward UI/UX (I gave up on many tools purely because I cannot be bothered to remember yet another set of keybindings), ranks all the results for me using a "pretty alright" heuristic, and allows me to open the file at the search results easily.

Support for PDF (via pdftotext) and DOCX etc (via pandoc) was added later on, based on suggestions I got from Reddit users, despite my initial reluctance. But after eating my own dog food for a while I have to say they were very right.




This looks great! Thank you for putting it up on GitHub and sharing

Have you tried it with log files? It would be great to have this kind of search when hunting things down on multiple large log files in a server


> This looks great! Thank you for putting it up on GitHub and sharing

Thank you!

> Have you tried it with log files? It would be great to have this kind of search when hunting things down on multiple large log files in a server

That is an interesting suggestion, and I can see myself wanting to use the Docfd search engine even for line-oriented input.

Just to make sure I didn't misunderstand, log files in your context mean things that have one entry per line, and no point in searching across lines, correct?


I guess it depends on what I’m looking for

But usually yes, most of the time the search case is:

* I have an id/uuid or some text or a piece of some id/text

* there are multiple log files, each for a different process/daemon/app which handle different parts of a workflow

* I have to grep each log file individually and then piece things together to figure out how the workflow went for some usage/user

Usually there is a lot of back-and-forth between the log files and the grep commands to find the info I need, especially if what I’m looking for is not an id/uuid


I see. Docfd right now should be fine for the id/uuid part, but the "piece of text" can be problematic as right now there is no way to ask Docfd to constrain search to line level.

I'll add a --line-oriented-exts command line argument that defaults to "log", so searching anything *.log will not cross line boundaries.
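A rough Python sketch of what an extension-based switch like this could look like. The function name `search_units` and its return shape are hypothetical, not taken from Docfd: for line-oriented extensions each line becomes its own search unit, otherwise the whole text is one unit so matches may span lines.

```python
from pathlib import Path

def search_units(path: str, text: str, line_oriented_exts=("log",)) -> list[str]:
    """Split text into the units a search is allowed to match within.

    Hypothetical helper: line-oriented files yield one unit per line,
    everything else yields a single unit covering the whole text.
    """
    ext = Path(path).suffix.lstrip(".").lower()
    if ext in line_oriented_exts:
        return text.splitlines()
    return [text]
```

With the default, `search_units("app.log", "a\nb")` gives two units, while the same text in `notes.md` stays as one.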

Are there other file extensions you're dealing with?


Usually only .log files, although sometimes the log files might not have an extension and just be in a logs/ folder and the file name might be something like a date+name of service that writes the log


Gotcha. In the latter case, would file globbing through bash/whatever suffice? Been trying to avoid opening more cans of worms than I need, but if I really do need to add file globbing then oh well : v

EDIT: Scratch that, I'll just add file globbing since I'm that close to covering most use cases anyway.


Amazing! Love the motivation!


Just added file globbing and single line search mode, will make a new release when I've added some tests and have used it myself for a week or so.


Impressive! So cool


Is interactive mode the value-add over ripgrep and ripgrep-all?

I accomplish similar with ripgrep and fzf.


> Is interactive mode the value-add over ripgrep and ripgrep-all?

Partly, yes. Though if that were the only part, then ugrep would largely have sufficed for me.

The other part is that the search algorithm of Docfd is very different from that of fzf or ripgrep, and some searches are easier in Docfd than in those two.

For instance, "(recursive function | recursion)" will match phrases like "function ... is recursive" that might be split into more than one line, but accomplishing that in ripgrep and fzf will take a lot more elbow grease, especially when the search expression gets bigger.
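As a rough illustration of why this is awkward with a line-based regex, here is a small Python sketch of an order-independent match over alternatives. It is an assumption-laden toy, not Docfd's algorithm or query grammar: the alternatives are passed as a plain list, words must match exactly, and the `window` size is an arbitrary choice.

```python
import re

def tokenize(text: str) -> list[str]:
    # Newlines count as ordinary whitespace, so lines do not limit a match.
    return re.findall(r"\w+", text.lower())

def matches_any(alternatives: list[str], text: str, window: int = 6) -> bool:
    """True if, for some alternative, all of its words occur (in any order)
    inside one run of `window` consecutive tokens of the text."""
    tokens = tokenize(text)
    for phrase in alternatives:
        words = set(tokenize(phrase))
        for start in range(len(tokens)):
            if words <= set(tokens[start:start + window]):
                return True
    return False
```

For example, `matches_any(["recursive function", "recursion"], "the function below\nis recursive")` succeeds even though the words are reversed and split across two lines.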


Wouldn't you struggle to search over docx and pdfs with ripgrep?


I will add that ripgrep-all is great for that purpose (and you can also search inside archives with it, if I recall correctly).


Are you writing temporary files to temp or /dev/shm? I would hate it otherwise. Of course, others may prefer to have RAM used by other processes...


Depends on how "temporary" we're talking.

Index caches are written to $XDG_CACHE_HOME/docfd if XDG_CACHE_HOME is defined, otherwise written to .cache/docfd (in current working directory). Docfd handles LRU eviction of cached indices for you here.
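The lookup order described here can be sketched in a few lines of Python. The function name and parameters are hypothetical, and treating an empty `XDG_CACHE_HOME` the same as an unset one is an assumption, not something stated above.

```python
import os
from pathlib import Path

def cache_dir(env=None, cwd=None) -> Path:
    """Resolve the index cache directory following the order described above."""
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    xdg = env.get("XDG_CACHE_HOME")
    if xdg:  # assumption: an empty value counts as "not defined"
        return Path(xdg) / "docfd"
    # Fallback is relative to the current working directory, not $HOME.
    return cwd / ".cache" / "docfd"
```

Passing `env` and `cwd` explicitly keeps the function easy to check without touching the real environment.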

Piped stdin is stored in /tmp since that's what was handed to me when using the temp file API. I normally think that if the piped stdin is too much to be stored in RAM, then it should probably be saved somewhere first anyway. But I am happy to discuss your use case and see if Docfd should be adjusted.
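For readers unfamiliar with the pattern, here is a minimal Python sketch of spooling a piped stream into a temp file. It uses Python's `tempfile` module rather than whatever temp file API Docfd itself calls, and the function name and chunk size are assumptions. The directory is chosen by the library (usually `$TMPDIR` or `/tmp` on Linux), which mirrors the "that's what was handed to me" behaviour.

```python
import tempfile

def spool_to_temp(stream, chunk_size: int = 65536) -> str:
    """Copy a text stream into a named temp file and return its path.

    The caller is responsible for deleting the file afterwards.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        # Read in chunks so large inputs are never held in memory at once.
        for chunk in iter(lambda: stream.read(chunk_size), ""):
            tmp.write(chunk)
        return tmp.name
```

In a CLI this would typically be called as `spool_to_temp(sys.stdin)`. Pointing `TMPDIR` at `/dev/shm` before launching would keep such a file in RAM when using Python's `tempfile`; whether Docfd honours `TMPDIR` the same way is not confirmed here.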


Great work! How to test it out quickly on Windows?


Thank you! Unfortunately the quickest way to test it on Windows right now is WSL, which is also how I use it on Windows.

I have not spent much time figuring out how to make Windows builds via GitHub CI, and have not spent time investigating how the PDF viewer, Word invocation code etc behave on Windows.



