Show HN: Feep! search, an independent search engine for programmers (feep.dev)
182 points by wolfgang42 on Nov 6, 2022 | 72 comments
Hi HN! This started late last year as an afternoon project to play around with ElasticSearch, and then I kept thinking of new features I wanted to add. I still have a lot of things I want to build, but now seemed like a good time to put it out there: even if the results aren’t nearly the quality I’d like, I’ve still found it useful and I want to show it off!

I’ve been working on it since September 2021, but only in fits and starts. The entire thing runs on a computer in my living room (there’s a picture on the About page); I haven’t done any load testing so we’ll see how it holds up.




Love to see more independent indexes! It sometimes seems like there are plenty of search engines, but when grouped by the indexes they rely on, there are actually very few major ones:

- Google, StartPage

- Bing, DuckDuckGo, Ecosia, AOL, Yahoo

- Yandex (mainly Russian)

- Brave (recently started its own index but often falls back on Google's)

Love to see projects like Marginalia and now this. These projects also make meta search engines like Searx[0] that much more powerful.

Anyways, since I'm in the business of listing relevant projects: other code-centered search engines you might wanna check out are searchcode.com[1], codesearch.ai[2], symbolhound[3], and publicwww.com[4] (some of these are often down, but might still be good to learn from).

[0] https://searx.tuxcloud.net/

[1] https://searchcode.com/

[2] https://codesearch.ai/

[3] http://symbolhound.com/

[4] https://publicwww.com/




That’s not a code search engine; that seems to be a regular search engine.


To that first list you could add Kagi, who also runs their own index

EDIT: Tough crowd, did Kagi get cancelled or something while I wasn't looking?


Kagi results are suspiciously similar to Google's. I don't think it's possible they're using their own index, at least not entirely. They must just be re-ranking Google results.


We do indeed have our own index (in addition to using external indexes). Every Kagi search result page tells you the percentage of results coming from our own index (currently for logged-in users only).

For example, take this search:

https://kagi.com/search?q=steve+jobs&r=us&sh=OP2gxAxk3KEV_jM...

60% of the results you see are coming from our own index. For most queries it is 10-30%. If you use the 'Non-commercial' filter, this number may go up to 100% (because our index focuses on the non-commercial part of the web).


To the best of my knowledge, Kagi is mostly doing magic with Google results.


Also bonzamate.com.au


Is there a particular reason this site does not index the official documentation for languages and frameworks? Tried a couple different searches for the things I work on and mostly got HN and stack overflow posts that aren't really responsive to my query.

https://search.feep.dev/about/datasources

Edit: it does appear that devdocs.io has the docs I'm interested in, but they don't seem to be surfaced in at least the first several pages of results. A good example of this is searching "python datetime", which does not actually return links to the datetime docs, just a lot of HN and SO posts referencing datetime.


“python library datetime” gets the results you’re looking for—but mostly by virtue of this not being a way anyone ordinarily thinks of describing it, so it knocks the irrelevant results down in the rankings. I think there are a couple of things going on here:

- The ranking algorithm I’m using isn’t great at distinguishing pages about a topic from pages which merely mention a topic in passing.

- Because the Python docs are versioned, the PageRank they deserve is spread out over several URLs and they appear less relevant than they really are.

I have plans to fix both of these problems, but they’re pretty involved and I haven’t had the time to dig into the matter yet. For the moment, it’s definitely a gamble whether the results will be any good: sometimes they’re great, and other times they’re completely useless. (There’s a reason I put links to other search engines at the bottom of the results page!)


For a programming search engine, the official docs of languages should get special treatment. Google often surfaces outdated versions of the documentation, but they're usually at the top. If you want to improve on this, you should (a) rank official docs the highest, (b) give extra weight to docs.python.org if the query contains "python", (c) merge the same page for different versions and add a version picker.
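
A minimal sketch of what (a) and (b) could look like as an ElasticSearch function_score query (the domain field and the domain list here are invented, not Feep!'s actual schema):

    // Hypothetical sketch of suggestions (a) and (b) as an ElasticSearch
    // function_score query; the `domain` field and domain list are invented.
    const userQuery = 'python datetime';

    const query = {
      function_score: {
        query: { match: { body: userQuery } },
        functions: [
          // (a) boost known official-documentation domains
          {
            filter: { terms: { domain: ['docs.python.org', 'go.dev', 'doc.rust-lang.org'] } },
            weight: 3,
          },
          // (b) extra weight for docs.python.org when the query mentions Python
          ...(/\bpython\b/i.test(userQuery)
            ? [{ filter: { term: { domain: 'docs.python.org' } }, weight: 5 }]
            : []),
        ],
        score_mode: 'sum',
        boost_mode: 'multiply',
      },
    };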


I have long wanted a programming search engine where I could pick something like "Python" and "version 3.9" and always get the right thing. Likewise, there is documentation for software packages that is versioned the same way (e.g. "react-router 4" vs "react-router 5").


If I know I want something from docs rather than SO, I open devdocs.io instead of a search engine. It's not a perfect solution but maybe it helps you as well.


All 3 of these are on my list (plus understanding page sections, which would improve the results for things like the Python docs where there’s a bunch of topics on a single page); I just haven’t gotten round to writing the code yet.


That's exciting. I see your project as a financial index, but for search.

I would use it with the addition mentioned above. Maybe add a newsletter subscription box?


Great idea, but search is pretty bad right now.

Searching for "django signals" got unofficial search results on the first page and all the links on the second page (1) are broken.

Searching for "go cobra" gets no official docs at all.

(1) https://search.feep.dev/search?q=django%20signal&p=2

Some suggestions:

- Prioritize github, gitlab, readthedocs, go.dev, docs.rust links

- On GitHub, only parse README and wiki links. Avoid parsing links that are tied to a specific commit hash.

- Python and Rust docs have versions in the URL. Can you link them to the latest version instead?


Thanks for checking it out!

All those broken links in your “django signals” results seem to have come from a page full of mangled URLs that got picked up; unfortunately they’ve pushed the actual results all the way down to page 6! I definitely need to give a boost to official documentation.

“golang cobra” gets what appears to be the official repo as the first result; but it’s clearly not really getting what you’re going for here. This is a good example of the sort of challenges a search engine faces: both “go” and “cobra” have multiple meanings, and it needs to understand the context to figure out whether a given link is relevant for this particular search. I think something like a vector search would be useful here but I haven’t looked into setting something like that up yet.
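
For context, the core of a vector-search approach is just nearest-neighbour ranking over embedding vectors; a toy sketch (where the embeddings come from, i.e. some ML model, is the hard part and is hand-waved here):

    // Toy sketch of vector-search ranking: score documents by cosine
    // similarity between the query embedding and each document embedding.
    function cosine(a, b) {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    function rankByVector(queryVec, docs) {
      return docs
        .map((d) => ({ ...d, score: cosine(queryVec, d.vector) }))
        .sort((x, y) => y.score - x.score);
    }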

GitHub is on my list, but it’s very big and is going to require careful optimization. (Even if I only load top-level READMEs it’s still a ton of data.)

ReadTheDocs would be great, but they don’t seem to have any dump/download support, or even a list of all the documentation sites they host, so in lieu of that they’re going to have to wait until I get a general web crawler.

I have some heuristics to collapse multiple versions into a single result with a version picker, but they require some adjustments to the rest of my data processing pipeline which I haven’t gotten round to yet.


One thing you get from reading TREC conference proceedings is that most of the things that you think will improve search relevance won't.

People have almost forgotten how bad search indexes were before Google.


This is a cool idea! I would love to use this. I don't know how it works or if I'm using it right, but I tried this example: "swift ios upload picture multipart/form-data", something I was searching for in Google just yesterday.

The results are not great: the first two are links for Crystal lang, then something about Salesforce and a general REST PUT, and the rest are other things not related to Swift or iOS. I would have expected results specifically related to iOS or Swift, since those were the technologies I specified.

How should I rephrase this query to end up landing at pages like this: https://stackoverflow.com/questions/29623187/upload-image-wi...

Which is the page that Google took me to, and the one that solved my problem.


I have a page with some advice on writing searches (https://search.feep.dev/about/query), but I don’t think you did anything wrong here: sometimes my search results are just inexplicably terrible. This definitely falls into that category and is going on my list of test cases that need improvement. There’s a reason I link to Google at the bottom of the results page!

I’m currently using ElasticSearch for ranking, and made a brief effort at tuning it. The problem is that it’s very big and complicated, which makes it hard for me to understand what’s going on under the hood. If I were doing this professionally I’d dive into ES internals and figure it out, but when I can only squeeze in a few hours a week it’s hard to really sink my teeth in. I’d like to switch to something simpler to wrap my head around (possibly Lucene, or Bleve); once I’ve done that I should be able to get a better handle on how the ranking works and how to make it more reliable.


Might be wrong, but the page they provided as an example of a correct result is not even in your index. Is that correct, and if so, why? If it is in your index, what is a query that would return it as a top-ten result?


I can see it in Kibana when I request it by ID, but I can’t seem to get it via text search no matter what keywords I use, which is bizarre. (“NSMutableURLRequest image” should be pulling it up, but isn’t.) I have no idea what’s going on here, but thanks for bringing my attention to it!

This sort of thing is part of the reason I want to move off of ES: it’s a big black box and when something goes wrong I have no idea how to diagnose it. (I’m currently researching “unassigned shards” in case that’s the problem, but for all I know that could be a red herring.) Something a lot simpler would be easier for me to hold in my head and easier to figure out when it goes wrong.


Elasticsearch is distributed Lucene, no?


Yes (well, plus a lot of other features); and it’s the “distributed” part that gives me headaches. I don’t need any of that stuff, since I’m running on a single node, and it means there’s a bunch of abstractions between me and Lucene (which Elastic mostly tries to hide away as an implementation detail).


I don't have much experience with ES, but I remember trying Solr a few years ago and it was relatively simple to get running on a single VM. It is also using Lucene at its core, so it might be worth a try.


My experience with Solr is that it is much more schema-centric than ES. Which is good and bad, because ES being all "don't worry about it" is fine until you do have to worry about it, and then it's some holy hell trying to square up your version of the world with what ES thinks of the world

The Solr search API is worse, IMHO, also, although it can likely be fine if you just stick to their simple query string (for both versions of "their," ES and Solr). That said, my experience with ES is just like OP's: keeping the piece of junk alive and healthy is a time-and-a-half job. Combined with their recent license tomfoolery, I hope to never touch it again

I haven't used any of the new search upstarts in anger enough to know whether they're prime-time or not


Congratulations, I always enjoy new search engines

W.r.t. "and updated intermittently," I wanted to draw your attention to the HN realtime API: https://github.com/HackerNews/API#live-data and also that S.O. offers Atom Feeds: https://stackoverflow.com/feeds/ (I'd guess the rest do, too, but I didn't verify)

I am a huge proponent of taking advantage of any update features that a site offers, because otherwise the "how about now?" of re-crawling is wasteful to both parties.
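
A minimal polling sketch against that HN endpoint (updates.json returns the IDs of recently changed items; requires Node 18+ for built-in fetch):

    // Minimal sketch of incremental updates via the HN live-data API:
    // poll the updates endpoint, then re-fetch only the changed items.
    const BASE = 'https://hacker-news.firebaseio.com/v0';

    async function fetchChangedItems() {
      const res = await fetch(`${BASE}/updates.json`);
      const { items } = await res.json(); // IDs of recently changed items
      return Promise.all(
        items.map((id) => fetch(`${BASE}/item/${id}.json`).then((r) => r.json())),
      );
    }

    fetchChangedItems().then((items) =>
      console.log(`would re-index ${items.length} items`),
    );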


My current architecture is built around batch ingestion[1], and doesn’t (yet) have a way to do incremental updates. This is great for getting coverage—there are a lot of long-tail results in my search engine but not in Google!—but it does mean there’s more lag and the results aren’t instantly up-to-date.

[1]: e.g. for StackOverflow, I download an XML dump of the entire site once a quarter: https://search.feep.dev/blog/post/2021-09-04-stackexchange
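
A dump that size can't be slurped into memory, but it streams nicely; a sketch using the sax npm package (in the dump, each post is a self-closing <row Id="..." Title="..." Body="..."/> element):

    // Sketch: stream-parse a StackExchange Posts.xml dump without
    // loading it into memory, using the `sax` npm package.
    const fs = require('fs');
    const sax = require('sax');

    const parser = sax.createStream(true); // strict mode
    parser.on('opentag', (node) => {
      if (node.name !== 'row') return;
      const { Id, Title } = node.attributes;
      // ...hand the post off to the ingestion pipeline here...
      console.log(Id, Title);
    });

    fs.createReadStream('Posts.xml').pipe(parser);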


Hey this is pretty cool. I noticed a lot of the results are HN posts and StackExchange. Is there a definitive list somewhere of the sources, and/or maybe a way to contribute to those?

Thanks for showing this to us, I like where your head's at!

Edit: found it, it was explained on the front page. https://search.feep.dev/about/datasources


I can pretty easily add any StackExchange sites I left out, or anything that comes as a Zim file (e.g. from https://farm.openzim.org/recipes or the like). If it’d be appropriate for https://devdocs.io (official docs with suitable licensing), you can contribute a crawler to them and it’ll flow downstream to me.

I also have plans to do proper web crawling, though it’ll take me a while to get there: https://search.feep.dev/blog/post/2022-08-10-crawling-roadma...


Frankly, it doesn't look like it's ready to be useful. As an example, I tried "Braze notifications": the first result was about Brave, then two mildly relevant ones, and then a long stretch of "Who's hiring?" topics that seem to match only on "notifications".


The title mentioning “Brave” seems to be a red herring: there’s someone in the comments talking about Braze, though it looks like a typo. Similarly the “Who’s hiring” posts do actually have job listings for Braze, but you have to click through the More link at the bottom to find them. (Because I load HN directly from a data dump, the search doesn’t know about the pagination.)

I think the main problem here is that my index is relatively small: it has only (!) 30 million pages, and it looks like Braze just isn’t popular enough for me to have run into it with the right keywords yet.


For a loonnggg time I thought of developing something like this. I have an entire bookmark section of developer documentation, all of them with their special search and organization. If only there was one search (a good search engine) for all of them! Great work!


I found Feep while going through my coffee and ycombinator this morning. I find the idea of feep intriguing and I have just 2 questions.

1. What is your goal with feep? What are you trying to achieve? What is your grand vision? Do you have any time-boxed goals/targets in mind? I'm asking because it is hard to know what to do when you don't know where you want to go.

2. Finally, what challenges or problems are you facing in reaching that goal? The one big challenge or problem you have in realizing your grand vision?


This is great! I've been using https://beta.sayhello.so/ for this so far. Might give you some inspiration/ideas.


This is awesome. I really enjoyed the UI and the lack of JavaScript.

Could I ask you a question? What is your tech stack (programming language, background worker, database)? How often does the index update?

Are you planning to make it open source?


Always happy to answer questions! The code is mostly Node.js, with a lot of shell scripts to glue things together. The “background worker” is mostly me running things in tmux, though I do (ab)use GitLab CI for some scheduled tasks. The main full-text index is currently ElasticSearch (as I mention elsewhere in this thread, I’m not a fan of it); various other data in the ingestion process is stored in a combination of JSON-Lines files, SQLite, and bespoke binary formats as needed. Because I’m squeezing this into the hardware I have, the details are generally dictated by performance constraints for the particular problem at hand.
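
For the curious, the JSON-Lines pattern mentioned here is just one JSON object per line; a minimal reader with Node's built-in readline (the file and field names are invented):

    // Minimal JSON-Lines reader: one JSON object per line, streamed
    // via Node's built-in readline. The `url` field is invented.
    const fs = require('fs');
    const readline = require('readline');

    async function* readJsonl(path) {
      const rl = readline.createInterface({ input: fs.createReadStream(path) });
      for await (const line of rl) {
        if (line.trim()) yield JSON.parse(line);
      }
    }

    (async () => {
      for await (const doc of readJsonl('pages.jsonl')) {
        console.log(doc.url);
      }
    })();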

Update frequency depends on the data source, details here: https://search.feep.dev/about/datasources

No plans to open-source it at the moment; that implies a level of stewardship that I don’t have the energy for, and also some of the code is kind of tied to my specific server right now.


This is great. Would be nice if it considered symbols better.

Looked for "reject!" and it returned "reject" and "rejection" when "reject!" matches exist. Ironic given its name.


Agreed! It seems like exactly searching for special characters could be the "killer app" that programmers need which would get them to leave Google.


Even Github search is so bad. Has it improved any lately?


Yeah, yet another reason for me to switch from ElasticSearch—I need a stemmer that understands symbols (and also can distinguish English from function names and not try to inflect the latter).
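
For reference, one way to get that behaviour out of ElasticSearch itself is a custom analyzer that splits only on whitespace, so tokens like "reject!" survive intact; a sketch with invented index and field names:

    // Sketch: an index whose analyzer keeps symbols like "reject!"
    // intact by tokenizing on whitespace only (no stemming).
    // Index and field names are invented.
    const { Client } = require('@elastic/elasticsearch');

    async function createSymbolAwareIndex() {
      const client = new Client({ node: 'http://localhost:9200' });
      await client.indices.create({
        index: 'code-aware',
        body: {
          settings: {
            analysis: {
              analyzer: {
                code_friendly: {
                  type: 'custom',
                  tokenizer: 'whitespace', // keeps "reject!" as one token
                  filter: ['lowercase'],
                },
              },
            },
          },
          mappings: {
            properties: {
              body: { type: 'text', analyzer: 'code_friendly' },
            },
          },
        },
      });
    }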


Works great: got some good answers and useful links within just three searches on different variations of the same theme. (For reference, I looked up variations on 'observer pattern'.)


Seems neat, but trying to find Django documentation is not ideal: searching for "django prefetch" has GitHub repos as the first two results, and the third result is official Django documentation but for an ancient version, something that annoys me about other search engines too.

Out of curiosity, what kind of hardware are you running this on? I can imagine that you'd need a lot of storage to store the index, but the size of plain text can often be surprisingly small.


The problem is that newer versions have fewer links to them, so they seem less authoritative. I have a plan for some heuristics that will detect version numbers in URLs and collapse them into a single result with a version picker in case you want an older version.
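
A rough illustration of what such a heuristic might look like (purely a sketch; the real pipeline would need far more care):

    // Illustrative sketch: detect version segments in documentation URLs
    // and collapse variants of the same page under one canonical key.
    const VERSION_RE = /\/(\d+(?:\.\d+)*|stable|latest)\//;

    function canonicalKey(url) {
      // .../3.8/library/datetime.html and .../3.11/library/datetime.html
      // both map to .../{version}/library/datetime.html
      return url.replace(VERSION_RE, '/{version}/');
    }

    function collapseVersions(results) {
      const groups = new Map();
      for (const r of results) {
        const key = canonicalKey(r.url);
        if (!groups.has(key)) groups.set(key, { ...r, versions: [] });
        groups.get(key).versions.push(r.url); // feeds the version picker
      }
      return [...groups.values()];
    }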

The server is an HP Microserver Gen8 (purchased on eBay), with an “Intel(R) Pentium(R) CPU G2020T @ 2.50GHz” and 16GB of RAM. The production index is 70GB, and I also have a 1TB spinning rust disk that I use for scratch space and raw data.


> The problem is that newer versions have fewer links to them, so they seem less authoritative.

Ah, that makes a lot of sense actually, thanks for explaining! The heuristics idea sounds neat.

Sounds like it's running on surprisingly little hardware. Is the index stored on a separate SSD?


> Sounds like it's running on surprisingly little hardware.

Modern computers (even ones from 2013) are really fast. I implemented PageRank from the original paper: I have about the same size of index as early Google, and what Page and Brin’s server could compute in a week with carefully optimized C code, mine can do in less than 5 minutes (!) with a bit of JavaScript.
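
For illustration, the core power-iteration loop of PageRank fits in a few lines (a toy version, not the author's actual code; a real implementation streams the link graph from disk):

    // Toy power-iteration PageRank. links[i] lists the pages that
    // page i links out to. (Dangling pages leak rank mass here; a
    // real implementation redistributes it.)
    function pagerank(links, damping = 0.85, iterations = 50) {
      const n = links.length;
      let rank = new Array(n).fill(1 / n);
      for (let iter = 0; iter < iterations; iter++) {
        const next = new Array(n).fill((1 - damping) / n);
        for (let i = 0; i < n; i++) {
          if (links[i].length === 0) continue;
          const share = (damping * rank[i]) / links[i].length;
          for (const j of links[i]) next[j] += share;
        }
        rank = next;
      }
      return rank;
    }

    // Tiny three-page cycle: 0 -> 1 -> 2 -> 0 (all ranks converge to 1/3).
    console.log(pagerank([[1], [2], [0]]));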

> Is the index stored on a separate SSD?

No, it’s just on the same disk with the rest of the system. My experience has been that pretty much everything that Feep! search does is CPU-bound on this machine; in fact I suspect (though I haven’t tried it) that the index could even be on the HDD and the only difference would be a few extra milliseconds when serving search results.


This looks great!

MDN docs are pretty strong. Perhaps devdocs is a superset, but if not, I’d recommend indexing them as well.

Also, feature request: it’d be nice if the query help unfolded the instructions in-line with the current page instead of navigating to another page. That way, I would be able to see them while mucking with my query.


Glad you like it! Devdocs includes Mozilla’s docs, for JS, CSS, DOM, SVG, and others. (But my ranking algorithm doesn’t understand that “mdn” is a synonym for “developers.mozilla.org” so it’s hard to surface them explicitly.)
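
One conceivable fix is an ElasticSearch synonym token filter that expands "mdn" at analysis time; a sketch with invented analyzer and filter names:

    // Sketch: teach the analyzer that "mdn" means MDN's domain via a
    // synonym token filter. Analyzer and filter names are invented.
    const settings = {
      analysis: {
        filter: {
          site_synonyms: {
            type: 'synonym',
            synonyms: ['mdn => developers.mozilla.org'],
          },
        },
        analyzer: {
          with_synonyms: {
            type: 'custom',
            tokenizer: 'standard',
            filter: ['lowercase', 'site_synonyms'],
          },
        },
      },
    };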

Thanks for the feature request—I don’t have any frontend JS set up yet that I could easily add this to, but I can see how this could be useful and I’ll put it on my list.


Personally, I’d rather click a link to MDN docs than most other sources, so if you had some way to expose pass-through attribution from devdocs, that’d be useful at least for me.

I wonder how many engineers think about search results link origin before clicking through.


Although I’m sourcing the crawl data from devdocs, the ingestion process uses the upstream URL, so the search results link to developer.mozilla.org with the appropriate favicon.

I’ve heard enough complaints about W3Schools and other SEO-heavy but accuracy-light sources that I suspect a fair proportion of technically-minded users probably consider the domain before clicking on a result link.


Yeah w3schools is the very reason I avoid Google search for code questions.


I would be very happy if such a service worked (or if I could run it myself). It's my long-term goal to break out of dependency on the Borg.

But the results are not even promising, let alone useful, which is very sad.

(I tried "haskell gloss terminate animation normally". That was my real search a couple of days ago.)


Result quality is something of a gamble right now: sometimes the results are really excellent, but as you’ve found they can also be pretty useless. I’m planning to use all the searches I’m getting today to construct a benchmark I can use to improve things.

On that note: what were you hoping to get out of that search? I see that Gloss is a package for doing animations, but (without knowing anything about Haskell) it seems like Google/DuckDuckGo don’t really have anything useful to offer either. (In fact the only thing I found was what I assume is your post on the Gloss mailing list: https://groups.google.com/g/haskell-gloss/c/FGNxutKmm-w)


I think it looks untuned rather than somehow broken.

Fine-tuning result relevance is a pretty long and tedious process, and small problems with it can make results look very bad.


I don't know how I feel about this.



Well, with a mere 30 million pages in my index it was inevitable I’d be missing something. I’d expect this to show up eventually as I add more data sources.


It’s a smart idea. And easy to build with ES. I would use something like this over Google, which now shows spam copies of StackExchange sites above the original content.

Aside: Maybe I’m too paranoid, but I wouldn’t show a pic of my specific modem and router models.


Yeah, the StackExchange spam was one of the things that made me think, “hey, there could be something to this...”

I think I’m way ahead of you on the paranoia: both the router and the Ethernet switch you see are actually behind the NAT (the router is just serving as a WiFi AP), so you’d have to already be on the LAN to get at them. (Also, my devices all treat the network as an untrusted public one anyway, so even if someone did decide to target me specifically there’s not much extra they could do even if they did get access.)


Cool! But didn't get me many results :D

PS: when you say "for programmers" I can't help but remember Koders.com. But I also keep forgetting what happened to it. I think it got acquired..?


Do you think you could eventually reproduce the glory of mid-2000s Google? At least for some large predefined subset of the internet?


I think the biggest problem with this is not necessarily Google's algorithm changing, but the internet changing. Sites evolved to produce SEO spam for higher rankings, and Google's search, as bad as it is now, would probably be even worse if it had stayed stagnant and not evolved in response.

The "predefined subset of the internet" part can definitely be a solution, but the preceding "large" is probably where the challenge remains. However, projects like Looria[0] give me hope for a more curated search experience (i.e. without the "large" adjective).

[0] https://www.looria.com/


I don’t really remember Google of that era; I got on the Internet pretty late. But I do have high hopes for the recent rise we’ve seen in smaller, targeted search engines; a lot of the Google-scale problems of “making a search engine” go away when you focus on a small corner of the Web:

- the tech has reached a point where it’s actually pretty reasonable for someone to index a fairly large chunk of it themselves: https://search.feep.dev/blog/post/2022-07-23-write-your-own

- benefits of diversification: if one search engine isn’t helpful, you can try another instead of just being out of luck; and spammers now have to game a bunch of different algorithms rather than being able to target just one.

- having just one person, or a small group, focuses the results, and can hopefully produce a higher level of polish in a targeted area.


Other sources you may wanna crawl:

- https://www.thecodingforums.com/ (and other programming-related online communities like Lobste.rs, certain subreddits, and certain Lemmy instances/communities)

- https://pldb.com/ (might be a good way to automatically get all the docs of each programming language as well as books/videos/publications that mention a certain language)


Also needs to add AWS documentation


How do you find ES for this kind of thing? Have you looked at others, e.g. Solr or even SQLite FTS?


I spent fifteen minutes on a search for “best full text search” and Elastic looked like the best combination of popular+easy. Since I was expecting this to be the diversion of an afternoon there wasn’t any point in investigating more than that.

In hindsight, ES wasn’t the best choice for what this turned into: the problem is that it wants to be a managed cluster that does log analysis/analytics/observability/machine learning/I-don’t-know-what-all, and full text search is almost an afterthought; whereas I want a single node that does full text search and nothing else. All that extra complexity makes it hard for me to figure out how to get it to do what I want, and I don’t have the time to invest in really understanding how it works under the hood. So I’ll probably switch to something simpler when I get a chance, to have a better shot at adjusting the results to look the way I want.
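
For a sense of what "something simpler" can look like, SQLite's FTS5 gives you a single-node full-text index in a handful of lines (a sketch using the better-sqlite3 npm package; table and column names invented):

    // Sketch: a minimal full-text index with SQLite FTS5 via the
    // better-sqlite3 npm package.
    const Database = require('better-sqlite3');
    const db = new Database('index.db');

    db.exec('CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(url, title, body)');

    db.prepare('INSERT INTO docs (url, title, body) VALUES (?, ?, ?)')
      .run('https://example.com/signals', 'Django signals', 'How signals work...');

    // bm25() is FTS5's built-in relevance function (lower is better).
    const rows = db
      .prepare('SELECT url, title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)')
      .all('django signals');
    console.log(rows);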


For PHP (and other languages) you could add popular framework docs like Symfony or Laravel


I pull a lot of official documentation via https://devdocs.io; Symfony and Laravel are both included, though (as others in this thread have noted) the current ranking is a bit hit-or-miss and may not always surface them. Searches like “doctrine validator” and “illuminate auth” seem to pull them up, if you’re curious what they look like.


Searched for “React” and the official docs weren’t even in the top 10.


Hope to see dark mode someday for the leet programmers.


[deleted]



