Show HN: DeepHN – Full-text search of 30M Hacker News posts and linked webpages (deephn.org)
324 points by wolfgarbe on April 13, 2021 | 112 comments



I assume you are using SymSpell for fuzzy search and prioritizing candidate frequency over edit distance. My input "algolai" returns results for "angola" even though "algolia" was just 1 transposition away. Anyway, just dropping in to say that the stuff on your GitHub helped me figure out a lot while building a search engine from scratch. Good luck with SeekStorm!


We are combining auto-correction (SymSpell) and auto-completion (PruningRadixTrie). That is sometimes tricky. We are using a static spelling correction dictionary which does not contain the term "algolia", only "angola". SymSpell is very fast, but requires a lot of memory. Therefore we are using a static dictionary, not a dynamic dictionary derived from each customer's individual index. Auto-completion, on the other hand, is generated dynamically for each individual index/customer.
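For anyone curious how those two pieces can fit together, here is a minimal TypeScript sketch (an assumption for illustration, not SeekStorm's actual code; names like `Corrector` and `CompletionTrie` are made up): a SymSpell-style symmetric-delete lookup over a static correction dictionary, plus a plain prefix trie for per-index completions.

```typescript
// Sketch only: symmetric deletes (SymSpell-style) for correction + prefix trie for completion.

// All variants of `term` with up to `distance` characters deleted.
function deletes(term: string, distance = 1): string[] {
  const out = new Set<string>([term]);
  let frontier = [term];
  for (let d = 0; d < distance; d++) {
    const next: string[] = [];
    for (const t of frontier) {
      for (let i = 0; i < t.length; i++) {
        const v = t.slice(0, i) + t.slice(i + 1);
        if (!out.has(v)) { out.add(v); next.push(v); }
      }
    }
    frontier = next;
  }
  return [...out];
}

class Corrector {
  private index = new Map<string, string[]>(); // delete variant -> dictionary terms
  constructor(private dict: Map<string, number>) { // static dictionary: term -> frequency
    for (const term of dict.keys()) {
      for (const v of deletes(term)) {
        const bucket = this.index.get(v) ?? [];
        bucket.push(term);
        this.index.set(v, bucket);
      }
    }
  }
  correct(input: string): string | undefined {
    const candidates = new Set<string>();
    for (const v of deletes(input))
      for (const t of this.index.get(v) ?? []) candidates.add(t);
    // Rank candidates by corpus frequency: this is how "angola" can win
    // when the rarer "algolia" is simply missing from the static dictionary.
    return [...candidates].sort((a, b) => this.dict.get(b)! - this.dict.get(a)!)[0];
  }
}

class CompletionTrie {
  private children = new Map<string, CompletionTrie>();
  private terms: string[] = [];
  insert(term: string): void { // built per index/customer from the indexed terms
    let node: CompletionTrie = this;
    for (const ch of term) {
      if (!node.children.has(ch)) node.children.set(ch, new CompletionTrie());
      node = node.children.get(ch)!;
      node.terms.push(term); // keep completions at every prefix node (pruning omitted)
    }
  }
  complete(prefix: string, topK = 5): string[] {
    let node: CompletionTrie = this;
    for (const ch of prefix) {
      const next = node.children.get(ch);
      if (!next) return [];
      node = next;
    }
    return node.terms.slice(0, topK);
  }
}
```

The symmetric-delete index trades memory for speed by precomputing delete variants of every dictionary term, which matches the "very fast, but requires a lot of memory" trade-off described above.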


I am combining symmetric deletes with a trie for fuzzy autocompletion and it is working pretty well.


I'm curious how this site is compliant with copyright law. I've thought of working on similar projects, but the possibility of being sued has dissuaded me.


Search engines are generally considered fair use [0], at least in the US. So as long as HN doesn't send a cease and desist, it's in the clear. Once it does, it's all about what the claims are. Google litigated the crap out of this [1] over the years, so there is positive precedent scattered all over the place.

[0] https://www.everycrsreport.com/reports/RL33810.html

[1] https://en.wikipedia.org/wiki/Fair_use#Text_and_data_mining


They copy the whole text of web pages that are linked in the HN posts. So totally a mass breach of copyright.

If you are afraid of being sued, you certainly should not do something like this.

On the upside, there is probably not much provable damage a publisher could claim. So they would probably only be liable for the legal expenses of whoever sues them.


DeepHN does not copy the whole page, only the first 200 words or less. DeepHN does not copy the whole website, but only a fraction of a single page of that website. We always provide the original link. That should make it fair use. Of course, we would prefer clear and concise legislation, valid in all countries, that could be followed.

We mainly surface historical content, which receives additional traffic from DeepHN, instead of taking value away from the website.


I copy the whole text of web pages I visit. It's not automatically a breach of copyright.

That said, meaningful full-text search will need data. But both Bing and Google crawl pages.

It's not even immediately clear that displaying a copy of a page is materially different from caching (which HTTP does in many, many layers).

Now, if you present the text as your own - that might be a problem.

Edit: see also: archive.org.


It's probably not compliant. As with all good things, if you're in the Land of the Free (TM), eventually people will hunt you down if you try to do good things.

Alternatively, base your site out of a country which gives you the freedom to do things like this. Build it in China and you'll be applauded instead of sued. Copyright suits over petty things like this would be laughed off when it's something actually useful.


China's monopolies are far worse than in other countries.

Founders there have no choice but to sell or get copied by the platform they're building on. It's Baba or Tencent.


Yes, that's also true, but on the flip side, technological advancement of society as a whole is not stifled by idiotic IP law.

I agree, some anti-monopoly regulation is in order in China, and I'm not a fan of the Alibaba/Tencent monopolies on the ecosystem. However, I think there are other ways to go about fostering and encouraging small businesses and protecting creators and their intentions than American-style IP law, which allows you to sit on an invention or work and hinder society from having access to the fruits of science, which I'm vehemently against. IP isn't even real property, IMHO.


Yes it is. If 90% of the reason for new tech is to get richer, and all protections like copyright, patents, etc. are dropped, then people will be much less incentivized to invest and stick their necks out for new creations. I wholeheartedly disagree with your IMHO. That said, we could absolutely improve our current patent and copyright systems (especially copyright, and those parts which contribute to patent trolls).


> We're sorry but seekstorm-deephn doesn't work properly without JavaScript enabled. Please enable it to continue.

Could the folks with JS disabled get a small semi-static page with the top interesting HN stats, and a message that says something like "Here are some simple stats. For the full interactive experience, please enable JavaScript"?


For the less than 1% of such people?

I doubt that it's worth it. If somebody can't be bothered to enable JavaScript to use the site, they're not a desirable customer anyway.


Good idea!


Very cool and useful. I was trying to find all comments on HN which contain the word youtube.com (basically trying to find some recently linked videos from comments), but it returns 0 results when I set the search filter to comments(1). Any idea what I'm doing wrong?

https://deephn.org/?q=youtube.com&sort=time&in=Comments


We will look into this.


It’s fun to see Bitcoin posts from 2008-2010. I wonder how many bought and are now retired because of them. Seeing recommendations to use MtGox is pretty funny too

https://news.ycombinator.com/item?id=1998144


I like this quote:

> 21 million bitcoins seems like a low limit to set. According to [0] the exchange rate is currently 221 Bitcoins per USD, so the total value of all possible bitcoins is only 95000USD.

Today 95000USD is worth 1.51BTC


I read about Bitcoin here over a decade ago. Took the time to set up a miner and let it run for a week. I think I generated a few hundred coins worth less than a dollar total. Figured the electricity cost way more than that and deleted it all.


I admire your perseverance to continue forging ahead despite deleting $20MM worth of coins.


Also fun to see the contrarian Venus explorers are at least a decade old ;)

SpaceX aims to put man on Mars in 10-20 years

https://news.ycombinator.com/item?id=2479053


The crazy thing is that MtGox really was your best choice back then.

A long way from that mess of a site to Coinbase IPOing tomorrow for a zillion dollars.


Same for TSLA posts.


Earlier post 2009: https://news.ycombinator.com/item?id=599852

"Well this is an exceptionally cute idea, but there is absolutely no way that anyone is going to have any faith in this currency."

Written by the co-founder/CEO of Pachyderm and first employee of RethinkDB, no less.


People love to pick out the comments of the past that seem most wrong now, because—let's be honest—it makes us feel superior. But if you go back and look at that original 3-comment Bitcoin thread, what you find is that those 3 comments span an entire spectrum: elated, skeptical, and bewildered: https://news.ycombinator.com/item?id=599852.

The first cryptocurrency thread on HN—about 9 months earlier—was even more interesting: https://news.ycombinator.com/item?id=253963. It contains another skeptical comment: "The bigger problem is that everyone has a incentive to run their computers day and night cranking out solutions, which burns up lots of natural resources and processor time for a zero-sum result." (https://news.ycombinator.com/item?id=253999), which seems astonishingly prescient now, even though the objection is still hotly (no pun intended) debated.

Cherry-picking the wrongest-seeming skeptical comment from that collection of data points is a case study in survivorship bias. (I don't mean to pick on you personally! This is common of course.) The infamous Dropbox comment from 2008 is similar in that it has gotten repeated out of context in a way that is unfair to the original commenter (https://news.ycombinator.com/item?id=23229275). When we repeat these things, it says more about us than it does about the thing we're repeating.


What this also shows to me is just how difficult it is to predict the future. Here is another thought experiment: There is probably a wonderful technology / opportunity, perhaps lying in plain sight, that we will be kicking ourselves in ten years for not having gotten involved in. But what is it? "Every day brings new opportunities to become very rich, most of which we are bound to miss" [1]

[1] https://www.wsj.com/articles/missed-teslas-12-551-rise-dont-...


Not really a superiority thing for me. I just think it’s funny how much perceptions change over time and how reasonable assumptions can be proven wrong


It's just humbling. Others and I miss out on hundreds of opportunities every day; it's a good reminder of life's complexity. Things like that should be printed and posted on an office wall.

Also, from a creator's perspective, it's important to see these cases of once-in-a-generation inventions (good or bad) and the comments on them.


Oh boy, looking up Xinjiang and then reading this comment was a trip. Anyway, Dropbox is pretty great, and seeing the differences in sentiment towards TSLA stock compared to Bitcoin over the years has been interesting. Can't pick all the winners.


> People love to pick out the comments of the past that seem most wrong now, because—let's be honest—it makes us feel superior.

I like to see wrong “obvious” predictions to humble myself, even if not from me.


Ok, you get a pass. I'm generalizing from myself of course.


I still don't think many people have faith in it. The vast majority of people who own it are speculators who never actually use it as a currency.


At the time it felt astonishing that central banks allowed non-state currencies to exist on national ground, for the first time. It still feels astonishing.


Reminds me of the infamous dropbox comment https://news.ycombinator.com/item?id=9224


The effort definitely gets my upvote.

What did you use to crawl the pages and how long did it take? Curious about your experience doing it and crawler integration with Seekstorm.

Is it possible to easily expand the index with embeddings (vectors) and perform semantic search in parallel?

Your pricing indicates that hosting similar index as this demo would cost $500/month. Wondering what kind of infrastructure is supporting the demo? Thanks!

ps. Small quirk: https://deephn.org/?q=how+to+be+productive&filter=%7B%22hash...

First three tags seem not to be relevant to the post itself.


Crawling speed is between 100...1000 pages per second. We crawled about 4 million linked unique web pages.

The pricing for an index like DeepHN would be $99/month: while we are indexing 30 million Hacker News posts, for DeepHN we combine a single HN story with all its comments and its linked webpage into a single SeekStorm document, so we index just 4 million SeekStorm documents.

Yes, it would be possible to expand the index with embeddings (vectors) and perform semantic search. This would be an auxiliary step between crawling and indexing.
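To make the document model concrete, here is a rough sketch of what one combined DeepHN-style document might look like, with the optional embedding step slotted in between crawling and indexing (field names and the `embed` callback are assumptions, not the actual SeekStorm schema):

```typescript
// Hypothetical shape of one combined document: story + comments + linked page.
interface HnDocument {
  storyId: number;
  title: string;
  url: string;
  comments: string[];   // all comments of the story, flattened
  pageText: string;     // extracted text of the linked web page
  embedding?: number[]; // optional vector for semantic search
}

// The auxiliary semantic step would run after crawling and before indexing.
async function enrich(
  doc: HnDocument,
  embed: (text: string) => Promise<number[]> // placeholder for any embedding model
): Promise<HnDocument> {
  return { ...doc, embedding: await embed(`${doc.title}\n${doc.pageText}`) };
}
```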


What is being used as a crawler and is it integrated with Seekstorm?

The same article referenced here https://deephn.org/?q=how+to+be+productive&filter=%7B%22hash... contains the phrase 'well-defined'

Any idea why the article doesn't surface when searching for this?


The crawler is a part of SeekStorm.

"Well-defined": Just a guess: We are doing key text extraction, i.e. we try not to index boilerplate stuff an and menu items. As "well-defined" is within a short list item, it might be accidentally skipped.

So, that is not yet perfect, and considering the diversity in web page structure it probably never will. But we will try to improve.
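As an illustration of why that can happen, here is a toy boilerplate filter (an assumption, not SeekStorm's extractor): it keeps only blocks that look like prose, which is exactly the kind of heuristic that drops menu items - and, as a side effect, a phrase sitting in a short list item.

```typescript
// Toy key-text extraction: keep prose-like blocks, drop short fragments (menus, nav, short list items).
function keepBlock(block: string): boolean {
  const words = block.trim().split(/\s+/).filter(Boolean);
  const hasSentencePunctuation = /[.!?]/.test(block);
  return words.length >= 8 || (words.length >= 4 && hasSentencePunctuation);
}

function extractMainText(blocks: string[]): string {
  return blocks.filter(keepBlock).join("\n");
}

// extractMainText(["Home", "About", "Pricing", "A long paragraph of well-defined body text, for example."])
// drops the three menu items - and would also drop a list item containing only "well-defined".
```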


Yes, if the preview is the representation of what you have indexed, then half of that article is missing. You may have identified the weakest link in your stack - the crawler/extractor (which is notoriously hard to get right; it would be good if you provided more detail, e.g. do you use a headless browser or a simple GET request, do you crawl PDFs, etc.). There is little use for the advanced stack on top of it if the data does not end up in the index in the first place. Hope you provide an update on this in the future. I'd probably sign up for a plan.


No, the preview is NOT the representation of what we have indexed. The preview is limited to about 200 words to be compliant with fair use legislation. Indexing is limited to 1 MB per document.


> Crawling speed is between 100...1000 pages per second.

Stupid question, but you were crawling news.ycombinator.com, right?

Its robots.txt (https://news.ycombinator.com/robots.txt) contains

  Crawl-delay: 30
Why did you not follow that?

(I'm not being accusatory. I'm both curious about web crawling in general and have personally been archiving the front page and "new" every 60 seconds or so... (Obviously there's no reason for me to retrieve them more often, but my curiosity persists.))


>> you were crawling news.ycombinator.com, right?

No, for retrieving the Hacker News Posts we were using the public Hacker News API, which returns the posts in JSON format: https://github.com/HackerNews/API

The crawling speed of 100...1000 pages per second refers to crawling the external pages linked from Hacker News posts. As they are from different domains, we can achieve a high overall crawling speed while remaining a polite crawler with a low crawling rate per domain.
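A small sketch of that setup (the item endpoint is the documented Hacker News API; the per-domain delay logic and its value are assumptions for illustration, not SeekStorm's crawler):

```typescript
// Fetch one post from the official Hacker News API (https://github.com/HackerNews/API).
async function fetchItem(id: number): Promise<any> {
  const res = await fetch(`https://hacker-news.firebaseio.com/v0/item/${id}.json`);
  return res.json();
}

// Politeness sketch: many domains in flight at once, but each individual domain
// is only hit after a per-domain delay, so aggregate throughput stays high while
// the per-host rate stays low.
const lastHit = new Map<string, number>();
const PER_DOMAIN_DELAY_MS = 5000; // illustrative value, not SeekStorm's actual setting

async function politeFetch(url: string): Promise<string> {
  const host = new URL(url).host;
  const wait = (lastHit.get(host) ?? 0) + PER_DOMAIN_DELAY_MS - Date.now();
  if (wait > 0) await new Promise(resolve => setTimeout(resolve, wait));
  lastHit.set(host, Date.now());
  const res = await fetch(url);
  return res.text();
}
```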


Neat but the back button hijacking is not cool.


Thanks for pointing this out. That was not intentional, probably a side effect of the navigation inside our search results within a dynamic page (using the back button to return to previous search queries). We will fix this.


We have just fixed the back button bug (make sure to clear the browser cache).


This kind of thing is enough for me to never want to use a site. If it's a bug, I highly recommend you fix it; if it's not, I highly recommend you reconsider your position.


It is so common in SPAs that the long-press of the back button to see the list of recent sites should be muscle memory by now for everyone.

Yes, it's an annoyance. Yes, sites should fix it. But not ever using a site because of it seems silly when it's literally a half-second longer click.

(Also, for this site I don't actually see the bug. So either they fixed it very rapidly, or GP was just referring to individual searchers being in the history, which is common for any search engine.)


> It is so common in SPAs that the long-press of the back button to see the list of recent sites should be muscle memory by now for everyone.

This is not a good excuse for laziness.

> But not ever using a site because of it seems silly when it's literally a half-second longer click.

I highly disagree. With this site, at least, it is actually possible to leave. With many websites I've come across that have this issue, it is entirely impossible to leave without physically holding down the back button in the browser to get a list of history items and then clicking a site from earlier.

> (Also, for this site I don't actually see the bug. So either they fixed it very rapidly, or GP was just referring to individual searchers being in the history, which is common for any search engine.)

Looks like they've fixed it.


You are right. It IS a bug, and we will fix it asap.


I'm a bit curious about this 'back button hijacking'. Since the bug has now been fixed, I'm afraid I'm not able to test it. What was the unintended behaviour?


If you entered deephn.org in your browser's address bar, you couldn't use the browser back button to return to the previous page. That was a bug caused by the navigation code within our dynamic HTML page. It is fixed now.


Interesting. Thanks for your reply.


Really cool demo. I'm curious to know what kind of hardware this is hosted on.

Full disclosure: I work on a similar fast, typo-tolerant, fuzzy search engine called Typesense (https://github.com/typesense/typesense).


SeekStorm uses dedicated root servers with NVMe SSDs, hosted by IONOS https://www.ionos.com/servers/intel-servers


Yeah, was also interested. It would be nice to see Typesense take up the same challenge for comparison ;)


Happy to benchmark if the dataset is shared!


Seems odd to publish a benchmark against Lucene where your results are "Lucene crashes at 4 concurrent users". Since there are thousands of people using Lucene and products built on Lucene every day, I think the issue may be in your benchmarking code.


There is no secret about the benchmark, it's all open source: https://github.com/wolfgarbe/LuceneBench


It seems you tried MMapDirectory and then commented it out. https://github.com/wolfgarbe/LuceneBench/blob/master/LuceneB...

So, you may be hitting SimpleFSDirectory instead, which does have issues with too many searches.

Could you share the reasons MMapDirectory did not work for you?


"NIOFSDirectory and MMapDirectory implementations face file-channel issues in Windows and memory release problems respectively. To overcome such environment peculiarities Lucene provides the FSDirectory.open() method. When invoked, it tries to choose the best implementation depending on the environment." https://www.baeldung.com/lucene-file-search


Right. I found out that you are running on Windows after posting the comment. Makes sense, though there were some fixes in the latest Lucene, I believe.

Lucene's own benchmarks are at https://home.apache.org/~mikemccand/lucenebench/ , though I admit I don't know enough about benchmarking to form a strong opinion.

Either way, good luck with the project/service. More competition is always great. The open-source components look interesting too.


The way I understand it is that it crashes when 4 concurrent users are running the benchmark. Which could make sense if it’s a stress-test.


It is a really great tool to see what used to be very popular posts in the past.

I have one observation. Posts linking to Twitter - https://deephn.org/?sort=score&filter=%7B%22score%22%3A%7B%2... - have most of the preview occupied by navigation and other uninteresting parts. So maybe it is possible to use some hacky solution to show the content part for the most popular websites, or, if multiple pages are downloaded from the same domain, the content part could be detected automatically.


You are right. The preview is far from perfect. We just wanted to ship early. We will continue to fix and improve.


Amazingly fast! I'm used to seeing Firefox's load progress bar for most requests on my mobile, yet here the results "just show up".

More observations:

- When trying to go back after clicking the original HN link, there's no response.

- Clicking on what I understand are tags (hashtags) in a given post has no effect.

- Sorting by date is working, though it'd be nice to see the date.

Also a question: how do you generate the tags (hashtags) from the original contents?


>> When trying to go back after clicking the original HN link, there's no response.

The original HN link is opened in a new browser page/tab. The search results should remain unchanged in the previous tab. So you don't use the back button, but the previous browser tab.

>> Clicking on what I understand are tags (hashtags) in a given post has no effect.

Search for "google" and go to the first result. There is a hashtag "oracle"; click it and the search results are filtered by that hashtag. Please post an example where it doesn't work, so that we can fix it.

The date is shown in the preview panel on the right-hand side. But we are thinking about adding the time to the result list as well.

We are deriving the tags from the terms and bigrams in the title, text and parsed HTML of linked web pages. We take the most frequent terms per post, but only if the terms are within the top 65k tags per index. Stopwords are excluded.
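A rough sketch of that tagging recipe (the stopword list, tokenizer and cut-offs here are assumptions for illustration, not DeepHN's actual values):

```typescript
// Derive tags: count unigrams and bigrams, drop stopwords, keep the most frequent
// candidates that also appear in the per-index set of allowed tags (e.g. top 65k).
const STOPWORDS = new Set(["the", "a", "an", "and", "or", "of", "to", "in", "is", "for"]);

function deriveTags(text: string, allowedTags: Set<string>, topK = 5): string[] {
  const words = text.toLowerCase().match(/[a-z0-9][a-z0-9.#+-]*/g) ?? [];
  const counts = new Map<string, number>();
  const bump = (t: string) => counts.set(t, (counts.get(t) ?? 0) + 1);

  for (let i = 0; i < words.length; i++) {
    if (!STOPWORDS.has(words[i])) bump(words[i]); // unigrams
    if (i + 1 < words.length && !STOPWORDS.has(words[i]) && !STOPWORDS.has(words[i + 1]))
      bump(`${words[i]} ${words[i + 1]}`);        // bigrams
  }
  return [...counts.entries()]
    .filter(([tag]) => allowedTags.has(tag)) // only tags within the index-wide tag set
    .sort((a, b) => b[1] - a[1])             // most frequent first
    .slice(0, topK)
    .map(([tag]) => tag);
}
```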


> Please post an example where its doesn't work, so that we can fix it.

Note: I'm trying this in a mobile browser (Firefox).

- Search for a word 'Decentralized'

- pick the second result: 'A decentralized web would give power back to the people online'

- On the result's (preview?) page, click the tag 'Decentralization'

I get no response, nothing updates, no new tab opens. I'm not sure what the intended action is (I'd assume it should be a new set of results corresponding to the hashtag). Same issue with the 'google' example that you described.


If you click a tag on the result/preview page (in mobile Firefox), the result list is filtered by the tag (you get a new/shorter result list with an updated result count).

But because you are still on the preview page, you don't immediately see this, as it happens in the background. Once you close the preview page (with the big arrow at the top left of the preview page) you will see the updated result list and count.

Of course, this is a design flaw; we should automatically close the preview window once you select a tag, to immediately show the updated result list. In the desktop version this is not an issue, as the result list and preview window are always visible simultaneously.


Apologies for being nitpicky about the already-fixed back button. While viewing a specific drilled-down result, each time the back button is pressed, it seems to navigate to the previous character in the incremental search. In other words, if the search term is 'C#', navigating back seems to highlight 'C' in the result detail page.


This is most likely a timing problem. If you type 'c#' fast, then the back button brings you back to the previous query. If you pause between typing 'c' and '#', then it brings you back to the previous character only.

That is related to our instant search, where you don't need to hit the return key to search. This causes the problem that we need to identify when one query ends and the next query starts, in order to create the correct entries for the back button. We do this via timing - if the pause between key presses is too long, a new, separate entry is assumed. As DeepHN is a single dynamic HTML page, we need to implement and manage the browser navigation logic ourselves.
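For readers wondering what that looks like in practice, here is a browser-side sketch of the idea (the threshold and the `runSearch` stub are assumptions, not DeepHN's actual code): a pause between keystrokes marks the boundary between two queries, and only a "new" query pushes a history entry.

```typescript
// Instant search + manual history management: push a new back-button entry only
// when a pause indicates that a new query has started; otherwise refine the current one.
const QUERY_PAUSE_MS = 1000; // illustrative threshold
let lastKeystroke = 0;

declare function runSearch(query: string): void; // stub for the instant-search call

function onSearchInput(query: string): void {
  const now = Date.now();
  const startsNewQuery = now - lastKeystroke > QUERY_PAUSE_MS;
  lastKeystroke = now;

  const url = `/?q=${encodeURIComponent(query)}`;
  if (startsNewQuery) {
    history.pushState({ query }, "", url);    // new back-button entry
  } else {
    history.replaceState({ query }, "", url); // refine the current entry
  }
  runSearch(query);
}
```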


I was hoping it would do full-text search of the text inside the linked pages, like hndex.org but with ranking.


Yes, it does exactly that. If you check "Web" in the left sidebar and uncheck "Stories", it will find the posts where "friendship" occurs in the linked web pages. But we still show the title of the original Hacker News story in the search result and highlight the search term if it ALSO occurs in the title.

If you scroll long enough within the search results, you will eventually reach results where the search term is not in the title, but only within the linked web page.

It is more obvious if you search for queries which are less popular than "friendship".


You have to check the "Web" box for it to search in linked articles too. Without that, it highlights words in articles that are found only because of the title. And the indexing of contents isn't perfect; sometimes it seems to work for the main article content but not for the article author's name, even when it's shown.


I think it is supposed to be doing that?


If I search "friendship" all the article titles have the word friendship in them. In my experience many HN titles don't contain the topic word.


Nice to see full-text search of the linked articles, and it's really fast. Do you provide an API like Algolia does for HN?


DeepHN is powered by our SeekStorm search-as-a-service, which has a full-featured API: https://seekstorm.com/docs

SeekStorm is intended for users to index and search their own private data.

But if there is demand, we could also allow searching a central public index of web data or other public data.


How does SeekStorm perform on large documents? Say documents 1 MB-10 MB in size - are you able to search entire docs, or just paragraphs within a doc?


Currently the maximum size of a single document is limited to 1 MByte. That is an artificial limitation resulting from the fact that SeekStorm not only indexes the content of the document, but also stores the original document. Somewhere there has to be a limit.


Interesting, it doesn't seem to show the dead pages. I have been indexing dead HN pages for several years now, with the idea that eventually I'd run some statistics on it to better understand both the auto-dead filters and the people who are submitting dead content.


https://deephn.org/?q=procedural+planet&filter=%7B%22domain%...

Probably a bug - there is a huge amount of text in the title here.


Thank you. We will look into this and fix it asap.


Have you tried it against a corpus of text like email? How well does that work? I've been frustrated with the search speed (and results) of email search in general and was wondering if I could do better with an offline search.


That would definitely work. We are working on a local client that could index your local PDF and Word documents, so importing emails would be a nice addition.


I'd be interested in seeing how well it performs vs. integrated search. Outlook search (which is what I use) suffers from performance and relevance issues, yet when I tried this experiment with another full-text search provider (Azure Search) it performed worse on relevancy.

I use email as kind of an activity stream - generally I can remember roughly the people involved and the time period. Would be lovely to have search that performs well in this regard.


Love it! I would kill for a reverse order button to see oldest posts first


We can add that button, if you promise not to kill ;)


Okay :D


Done. Now you can sort by newest/oldest date.


Thank you all for your great feedback. We will fix all the bugs and add some of the suggested features and release a next iteration of DeepHN within the next few days!


That is a great site. To be really honest, those bugs are not much trouble compared with the service you provide to everyone. It is a really remarkably useful website. Thank you so much from the bottom of our hearts.


So, can you explain what is going on with 'covid' articles from before 2019 - are these cases where the original page that was linked to has now become a spam link?


Not necessarily a spam link. Linked pages from older posts have sometimes updated their content.

Sometimes this is legitimate, sometimes spam.


Curious to know how it differs from HN Search [1]. HN Search API documentation is at [2].

[1] https://hn.algolia.com

[2] https://hn.algolia.com/api


It searches the content of referenced webpages. It also has filtering by year intervals (for example 2012-2016), domains, tags, etc. And I find the usability better.

I think it is a great product.


How can we learn more about this indexing service?



I like it. I built something similar recently for indexing scientific publications on r/COVID19, based on Elasticsearch and Kibana.


Can you share a link?


Neat! I wonder if you quantified the performance of your search somehow? I guess labelling data for that would be laborious.


You mean performance in terms of latency or relevancy?


Relevancy - I'd guess it's difficult to measure, due to the large dataset, the large number of possible search terms, and the large number of possible results.


A page to bookmark for sure and an interesting product.

The UI takes getting used to, but then again so do many actually useful tools.


Can we exclude certain terms in search with a "-" e.g. "-oculus"?


The '-' operator has been fixed. 'vr -oculus' should work correctly now.


It should, but it currently doesn't in DeepHN. We will fix this asap.


Nice! How did you fetch the content of those articles?


With the crawler which is part of the SeekStorm search-as-a-service.


This is nice! How are the stories tagged?


We are deriving the tags from the terms and bigrams in the title, text and parsed HTML of linked web pages. We take the most frequent terms per post, but only if the terms are within the top 65k tags per index. Stopwords are excluded.


I like the UI. Very nice.



