First personal search engine prototype (rsdoiel.github.io)
147 points by kmstout 9 months ago | 9 comments



If this is interesting to you, you should check out the work that karlicoss and others have done on the "Human Programming Interface" [0] / [1].

I've been kicking this idea around for quite a few years and went through multiple iterations before finding HPI, then tossed out or adapted all my work in favor of building on theirs. Mine is a bit more service/cloud oriented in how it runs (shocker... at one point I made it into a product), while HPI is heavily local-first, for obvious good reasons.

HPI is a great platform to build your own stuff on and benefit from all the work that has already been done, because IMO building a good foundation is the hardest part. Sean Breckenridge's HPI-API is super interesting and useful and could likely be worked into this search engine concept; I'm quite sure Sean already has both newsboat and Firefox modules.
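
To give a flavor of what an HPI module looks like: it's basically just a Python file in the "my." namespace that yields plain objects. Here's a minimal sketch of a hypothetical newsboat module (the real one surely differs; the file path and field names are my assumptions):

    # my/newsboat.py -- hypothetical sketch, not the actual module
    from dataclasses import dataclass
    from pathlib import Path
    from typing import Iterator

    # Assumption: newsboat keeps one feed URL per line in ~/.newsboat/urls,
    # optionally followed by quoted tags.
    URLS_FILE = Path("~/.newsboat/urls").expanduser()

    @dataclass(frozen=True)
    class Feed:
        url: str
        tags: tuple

    def feeds() -> Iterator[Feed]:
        """Yield subscribed feeds HPI-style: plain Python objects, no framework."""
        for line in URLS_FILE.read_text().splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            url, *rest = line.split()
            yield Feed(url=url, tags=tuple(t.strip('"') for t in rest))

Consumers then just do "from my.newsboat import feeds" and iterate. That uniform "module yields plain objects" convention is what makes things like HPI-API or a search indexer easy to layer on top.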

I wrote modules of my own and made an authentication-wrapped HPI-API and a GraphQL instance, but I'm currently in the middle of an infra move, so nothing super cool to show off. At one point I had a dashboard powered by it, but we all know how these one-off internal-use-only projects end up. :)

I think the most interesting thing I've written for HPI is my ActivityWatch syncing. It's been happily churning away for years, and during my infra move I discovered it had stockpiled about 20GB of activity data: multiple years of down-to-the-second cataloguing of everything I did and saw on the computer.
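
For context on where that 20GB comes from: ActivityWatch exposes a local REST API (port 5600 by default), so a sync/export job can be as simple as paging through each bucket's events. A rough sketch with plain requests (endpoint paths are from the AW docs; double-check against your version):

    import requests

    AW = "http://localhost:5600/api/0"  # ActivityWatch's default local server

    def export_events(limit: int = 1000):
        buckets = requests.get(f"{AW}/buckets/").json()  # {bucket_id: metadata}
        for bucket_id in buckets:
            # Each event has a timestamp, duration, and watcher-specific data
            # (window title, URL, AFK status, ...), which is why it piles up fast.
            events = requests.get(
                f"{AW}/buckets/{bucket_id}/events", params={"limit": limit}
            ).json()
            yield bucket_id, events

    for bucket_id, events in export_events():
        print(bucket_id, len(events))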

Lots of interesting stuff in collecting and leveraging your own data. If any of this catches your eye, I highly encourage browsing karlicoss's exobrain [2]; there's plenty worth reading in there.

I post all of this to hopefully save someone (or many people) time because I recall how dejected I felt having spent multiple weekends chipping at a problem someone already solved better. I think this previous discussion was where I discovered it: https://news.ycombinator.com/item?id=26269832

[0]: https://github.com/karlicoss/HPI

[1]: my own stuff (not trying to step on Karli; I just wanted a three-letter org name): https://github.com/hpi

[2]: https://beepb00p.xyz/myinfra.html


Thanks for the links!


I've been trying to do something similar for a while now. In the past I tried using YaCy in private mode, scraping a few aggregators and the RSS feeds I read, plus two levels of outbound links. That was cool, but YaCy is practically dead these days and has various issues. Currently I'm trying ArchiveBox for the extraction + storage and poking around with importing the results into Verba for RAG-style search using a local Mixtral model. ArchiveBox is nice in that it can extract text from different types of media through a number of plugins. It's early days, but I think that approach has a future.
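
In case it helps anyone building the same pipeline: ArchiveBox writes each snapshot into its own directory under archive/, with extractor outputs as files inside. A rough sketch of scooping up the plain-text outputs for indexing (the exact file names vary by extractor and ArchiveBox version, so treat the paths below as assumptions):

    from pathlib import Path

    # Assumption: default ArchiveBox data dir layout, archive/<timestamp>/ per snapshot.
    ARCHIVE = Path("~/archivebox/archive").expanduser()

    def extracted_texts():
        for snapshot in ARCHIVE.iterdir():
            # readability and htmltotext are two plugins that emit plain text;
            # check your own snapshots for the exact output file names.
            for candidate in ("readability/content.txt", "htmltotext.txt"):
                f = snapshot / candidate
                if f.exists():
                    yield snapshot.name, f.read_text(errors="ignore")
                    break  # one text per snapshot is enough for indexing

    for snap_id, text in extracted_texts():
        print(snap_id, len(text))  # feed these into Verba / your vector store instead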


I am working on pretty much exactly this same thing :)

Anything you can share yet?

Here is mine: https://github.com/ydennisy/kg1


Yep, ArchiveBox here too. I'm poking at the same problem, using a variety of RAG prompts.


> stale links:

How about automagically checking whether archive.org has a snapshot of that link near the date it was marked? Or just grabbing the last snapshot there?

BTW, I have quite a few links on my site that are more than 10 years old, and 20-30% of them are now dead (and I guess more and more will go over time.. any research on that?). So archive.org to the rescue. Luckily the articles have been of big enough interest, and in the English web-sphere; YMMV with non-English or less popular sites.
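
The "near the date" check is actually easy: archive.org has an availability API that returns the snapshot closest to a given timestamp. A minimal sketch (the endpoint is the documented Wayback availability API; how you get each link's date out of your bookmarks is up to you):

    import requests

    def closest_snapshot(url: str, when: str) -> str | None:
        """Return the Wayback snapshot URL closest to `when` (YYYYMMDD), if any."""
        resp = requests.get(
            "https://archive.org/wayback/available",
            params={"url": url, "timestamp": when},
            timeout=10,
        )
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest and closest.get("available") else None

    print(closest_snapshot("http://example.com/old-post", "20140101"))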


If pages aren't fast or links are broken, I move the URL to a different file that's checked less frequently; when checking those, I move whatever fails to the next file and check that at an even lower frequency (if at all). (Sketch below.)

YaCy is also fun, albeit a tad weird.
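
The tiered scheme is basically a poor man's exponential backoff spread across files. A minimal sketch of the demotion step (the file names and the definition of "failing" are my assumptions):

    from pathlib import Path

    import requests

    # Tiers checked at decreasing frequency; a failing URL moves one tier down.
    TIERS = [Path("links-weekly.txt"), Path("links-monthly.txt"), Path("links-yearly.txt")]

    def check_tier(i: int) -> None:
        keep, demote = [], []
        for url in TIERS[i].read_text().split():
            try:
                ok = requests.head(url, timeout=5, allow_redirects=True).status_code < 400
            except requests.RequestException:
                ok = False
            (keep if ok else demote).append(url)
        if i + 1 < len(TIERS):
            TIERS[i].write_text("\n".join(keep) + "\n")
            if demote:
                with TIERS[i + 1].open("a") as f:
                    f.write("\n".join(demote) + "\n")
        else:
            # Last tier: nowhere lower to go, so failures just stay put.
            TIERS[i].write_text("\n".join(keep + demote) + "\n")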


I built https://mitta.us during Covid. I use it every day and am now working on AI pipelines for the documents saved to it, over at https://mitta.ai. It's not a traditional crawler; it only saves what you tell it to save. There's also a Chrome extension to save pages quickly while browsing, and it lets you remap Chrome's search to the console, which then forwards you where you need to go based on the content. Still very much a WIP, but getting there!


I'm still trying to understand the purpose of this. What is your personal search engine for?



