Show HN: DeepHN – Full-text search of 30M Hacker News posts and linked webpages (deephn.org)
324 points by wolfgarbe on April 13, 2021 | 112 comments



I assume you are using SymSpell for fuzzy search and prioritizing candidate frequency over edit distance. My input "algolai" returns results for "angola" even though "algolia" was just 1 transposition away. Anyway, just dropping in to say that the stuff on your GitHub helped me figure out a lot while building a search engine from scratch. Good luck with SeekStorm!


We are combining auto-correction (SymSpell) and auto-completion (PruningRadixTrie). That is sometimes tricky. We are using a static spelling correction dictionary which does not contain the term "algolia", only "angola". SymSpell is very fast, but requires a lot of memory. Therefore we are using a static dictionary, not a dynamic dictionary derived from each customer's individual index. Auto-completion, on the other hand, is generated dynamically for each individual index/customer.
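For anyone curious how those two pieces can fit together, here is a minimal TypeScript sketch (an assumption for illustration, not SeekStorm's actual code; names like `Corrector` and `CompletionTrie` are made up): a SymSpell-style symmetric-delete lookup over a static correction dictionary, plus a plain prefix trie for per-index completions.

```typescript
// Sketch only: symmetric deletes (SymSpell-style) for correction + prefix trie for completion.

// All variants of `term` with up to `distance` characters deleted.
function deletes(term: string, distance = 1): string[] {
  const out = new Set<string>([term]);
  let frontier = [term];
  for (let d = 0; d < distance; d++) {
    const next: string[] = [];
    for (const t of frontier) {
      for (let i = 0; i < t.length; i++) {
        const v = t.slice(0, i) + t.slice(i + 1);
        if (!out.has(v)) { out.add(v); next.push(v); }
      }
    }
    frontier = next;
  }
  return [...out];
}

class Corrector {
  private index = new Map<string, string[]>(); // delete variant -> dictionary terms
  constructor(private dict: Map<string, number>) { // static dictionary: term -> frequency
    for (const term of dict.keys()) {
      for (const v of deletes(term)) {
        const bucket = this.index.get(v) ?? [];
        bucket.push(term);
        this.index.set(v, bucket);
      }
    }
  }
  correct(input: string): string | undefined {
    const candidates = new Set<string>();
    for (const v of deletes(input))
      for (const t of this.index.get(v) ?? []) candidates.add(t);
    // Rank candidates by corpus frequency: this is how "angola" can win
    // when the rarer "algolia" is simply missing from the static dictionary.
    return [...candidates].sort((a, b) => this.dict.get(b)! - this.dict.get(a)!)[0];
  }
}

class CompletionTrie {
  private children = new Map<string, CompletionTrie>();
  private terms: string[] = [];
  insert(term: string): void { // built per index/customer from the indexed terms
    let node: CompletionTrie = this;
    for (const ch of term) {
      if (!node.children.has(ch)) node.children.set(ch, new CompletionTrie());
      node = node.children.get(ch)!;
      node.terms.push(term); // keep completions at every prefix node (pruning omitted)
    }
  }
  complete(prefix: string, topK = 5): string[] {
    let node: CompletionTrie = this;
    for (const ch of prefix) {
      const next = node.children.get(ch);
      if (!next) return [];
      node = next;
    }
    return node.terms.slice(0, topK);
  }
}
```

The symmetric-delete index trades memory for speed by precomputing delete variants of every dictionary term, which matches the "very fast, but requires a lot of memory" trade-off described above.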


I am combining symmetric deletes with a trie for fuzzy autocompletion and it is working pretty well.


I'm curious how this site is compliant with copyright law. I've thought of working on similar projects, but the possibility of being sued has dissuaded me.


Search engines are generally considered fair use [0], at least in the US. So as long as HN doesn't send a cease and desist, it's in the clear. Once it does, it's all about what the claims are. Google litigated the crap out of this [1] over the years, so there is positive precedent scattered all over the place.

[0] https://www.everycrsreport.com/reports/RL33810.html

[1] https://en.wikipedia.org/wiki/Fair_use#Text_and_data_mining


They copy the whole text of web pages that are linked in the HN posts. So totally a mass breach of copyright.

If you are afraid of being sued, you certainly should not do something like this.

On the upside, there is probably not much provable damage a publisher could claim. So they would probably only be liable for the legal expenses of whoever sues them.


DeepHN does not copy the whole page, only the first 200 words or less. DeepHN does not copy the whole website, but only a fraction of a single page of that website. We always provide the original link. That should make it fair use. Of course, we would prefer clear and concise legislation, valid in all countries, that could be followed.

We mainly surface historical content, which receives additional traffic from DeepHN, instead of taking value away from the website.


I copy the whole text of web pages I visit. It's not automatically a breach of copyright.

That said, meaningful full-text search will need data. But both Bing and Google crawl pages.

It's not even immediately clear that displaying a copy of a page is materially different from caching (which HTTP does in many, many layers).

Now, if you present the text as your own - that might be a problem.

Edit: see also: archive.org.


It's probably not compliant. As with all good things, if you're in the Land of the Free (TM), eventually people will hunt you down if you try to do good things.

Alternatively, base your site out of a country which gives you the freedom to do things like this. Build it in China and you'll be applauded instead of sued. Copyright suits over petty things like this would be laughed off when it's something actually useful.


China's monopolies are far worse than in other countries.

Founders there have no choice but to sell or get copied by the platform they're building on. It's Baba or Tencent.


Yes, that's also true, but on the flip side, technological advancement of society as a whole is not stifled by idiotic IP law.

I agree, some anti-monopoly regulation is in order in China, and I'm not a fan of the Alibaba/Tencent monopolies on the ecosystem. However, I think there are other ways to go about fostering and encouraging small businesses and protecting creators and their intentions than American-style IP law, which allows you to sit on an invention or work and hinder society from having access to the fruits of science, which I'm vehemently against. IP isn't even real property, IMHO.


Yes it is. If 90% of the reason for new tech is to get richer, and all protections like copyright, patents, etc. are dropped, then people will be much less incentivized to invest and stick their necks out for new creations. I wholeheartedly disagree with your IMHO. That said, we could absolutely improve our current patent and copyright systems (especially copyright, and those parts which contribute to patent trolls).


> We're sorry but seekstorm-deephn doesn't work properly without JavaScript enabled. Please enable it to continue.

Could the folks with JS disabled get a small semi-static page with the top interesting HN stats, and a message that says something like "Here are some simple stats. For the full interactive experience, please enable JavaScript"?


For the less than 1% of such people?

I doubt that it's worth it. If somebody can't be bothered to enable JavaScript to use the site, they're not a desirable customer anyway.


Good idea!


Very cool and useful. I was trying to find all comments on HN which contain the word youtube.com (basically trying to find some recently linked videos from comments), but it returns 0 results when I set the search filter to comments(1). Any idea what I'm doing wrong?

https://deephn.org/?q=youtube.com&sort=time&in=Comments


We will look into this.


It’s fun to see Bitcoin posts from 2008-2010. I wonder how many bought and are now retired because of them. Seeing recommendations to use MtGox is pretty funny too

https://news.ycombinator.com/item?id=1998144


I like this quote:

> 21 million bitcoins seems like a low limit to set. According to [0] the exchange rate is currently 221 Bitcoins per USD, so the total value of all possible bitcoins is only 95000USD.

Today 95000USD is worth 1.51BTC


I read about Bitcoin here over a decade ago. Took the time to set up a miner and let it run for a week. I think I generated a few hundred coins worth less than a dollar total. Figured the electricity cost way more than that and deleted it all.


I admire your perseverance to continue forging ahead despite deleting $20MM worth of coins.


Also fun to see the contrarian Venus explorers are at least a decade old ;)

SpaceX aims to put man on Mars in 10-20 years

https://news.ycombinator.com/item?id=2479053


The crazy thing is that MtGox really was your best choice back then.

A long way from that mess of a site to Coinbase IPOing tomorrow for a zillion dollars.


Same for TSLA posts.


Earlier post 2009: https://news.ycombinator.com/item?id=599852

"Well this is an exceptionally cute idea, but there is absolutely no way that anyone is going to have any faith in this currency."

Written by the co-founder/CEO of Pachyderm and first employee of RethinkDB, no less.


People love to pick out the comments of the past that seem most wrong now, because—let's be honest—it makes us feel superior. But if you go back and look at that original 3-comment Bitcoin thread, what you find is that those 3 comments span an entire spectrum: elated, skeptical, and bewildered: https://news.ycombinator.com/item?id=599852.

The first cryptocurrency thread on HN—about 9 months earlier—was even more interesting: https://news.ycombinator.com/item?id=253963. It contains another skeptical comment: "The bigger problem is that everyone has a incentive to run their computers day and night cranking out solutions, which burns up lots of natural resources and processor time for a zero-sum result." (https://news.ycombinator.com/item?id=253999), which seems astonishingly prescient now, even though the objection is still hotly (no pun intended) debated.

Cherry-picking the wrongest-seeming skeptical comment from that collection of data points is a case study in survivorship bias. (I don't mean to pick on you personally! This is common of course.) The infamous Dropbox comment from 2008 is similar in that it has gotten repeated out of context in a way that is unfair to the original commenter (https://news.ycombinator.com/item?id=23229275). When we repeat these things, it says more about us than it does about the thing we're repeating.


What this also shows to me is just how difficult it is to predict the future. Here is another thought experiment: There is probably a wonderful technology / opportunity, perhaps lying in plain sight, that we will be kicking ourselves in ten years for not having gotten involved in. But what is it? "Every day brings new opportunities to become very rich, most of which we are bound to miss" [1]

[1] https://www.wsj.com/articles/missed-teslas-12-551-rise-dont-...


Not really a superiority thing for me. I just think it’s funny how much perceptions change over time and how reasonable assumptions can be proven wrong


It's just humbling. Others and I miss out on hundreds of opportunities every day; it's a good reminder of life's complexity. Things like that should be printed and posted on an office wall.

Also, from a creator's perspective, it's important to see these cases of once-in-a-generation inventions (good or bad) and the comments on them.


Oh boy, looking up Xinjiang and then reading this comment was a trip. Anyway, Dropbox is pretty great, and seeing the differences in sentiment towards TSLA stock compared to Bitcoin over the years has been interesting. Can't pick all the winners.


> People love to pick out the comments of the past that seem most wrong now, because—let's be honest—it makes us feel superior.

I like to see wrong “obvious” predictions to humble myself, even if not from me.


Ok, you get a pass. I'm generalizing from myself of course.


I still don't think many people have faith in it. The vast majority of people who own it are speculators who never actually use it as a currency.


At the time it felt astonishing that central banks allowed non-state currencies to exist on national ground, for the first time. It still feels astonishing.


Reminds me of the infamous dropbox comment https://news.ycombinator.com/item?id=9224


The effort definitely gets my upvote.

What did you use to crawl the pages and how long did it take? Curious about your experience doing it and crawler integration with Seekstorm.

Is it possible to easily expand the index with embeddings (vectors) and perform semantic search in parallel?

Your pricing indicates that hosting similar index as this demo would cost $500/month. Wondering what kind of infrastructure is supporting the demo? Thanks!

ps. Small quirk: https://deephn.org/?q=how+to+be+productive&filter=%7B%22hash...

First three tags seem not to be relevant to the post itself.


Crawling speed is between 100...1000 pages per second. We crawled about 4 million linked unique web pages.

The pricing for an index like DeepHN would be $99/month: while we are indexing 30 million Hacker News posts, for DeepHN we combine a single HN story with all its comments and its linked webpage into a single SeekStorm document, so we index just 4 million SeekStorm documents.

Yes, it would be possible to expand the index with embeddings (vectors) and perform semantic search. This would be an auxiliary step between crawling and indexing.
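To make the document model concrete, here is a rough sketch of what one combined DeepHN-style document might look like, with the optional embedding step slotted in between crawling and indexing (field names and the `embed` callback are assumptions, not the actual SeekStorm schema):

```typescript
// Hypothetical shape of one combined document: story + comments + linked page.
interface HnDocument {
  storyId: number;
  title: string;
  url: string;
  comments: string[];   // all comments of the story, flattened
  pageText: string;     // extracted text of the linked web page
  embedding?: number[]; // optional vector for semantic search
}

// The auxiliary semantic step would run after crawling and before indexing.
async function enrich(
  doc: HnDocument,
  embed: (text: string) => Promise<number[]> // placeholder for any embedding model
): Promise<HnDocument> {
  return { ...doc, embedding: await embed(`${doc.title}\n${doc.pageText}`) };
}
```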


What is being used as a crawler and is it integrated with Seekstorm?

The same article referenced here https://deephn.org/?q=how+to+be+productive&filter=%7B%22hash... contains the phrase 'well-defined'

Any idea why the article doesn't surface when searching for this?


The crawler is a part of SeekStorm.

"Well-defined": Just a guess: We are doing key text extraction, i.e. we try not to index boilerplate stuff an and menu items. As "well-defined" is within a short list item, it might be accidentally skipped.

So, that is not yet perfect, and considering the diversity in web page structure it probably never will. But we will try to improve.
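As an illustration of why that can happen, here is a toy boilerplate filter (an assumption, not SeekStorm's extractor): it keeps only blocks that look like prose, which is exactly the kind of heuristic that drops menu items - and, as a side effect, a phrase sitting in a short list item.

```typescript
// Toy key-text extraction: keep prose-like blocks, drop short fragments (menus, nav, short list items).
function keepBlock(block: string): boolean {
  const words = block.trim().split(/\s+/).filter(Boolean);
  const hasSentencePunctuation = /[.!?]/.test(block);
  return words.length >= 8 || (words.length >= 4 && hasSentencePunctuation);
}

function extractMainText(blocks: string[]): string {
  return blocks.filter(keepBlock).join("\n");
}

// extractMainText(["Home", "About", "Pricing", "A long paragraph of well-defined body text, for example."])
// drops the three menu items - and would also drop a list item containing only "well-defined".
```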


Yes, if the preview is the representation of what you have indexed, then half of that article is missing. You may have identified the weakest link in your stack - the crawler/extractor (which is notoriously hard to get right; it would be good if you provided more detail, e.g. do you use a headless browser or a simple GET request, do you crawl PDFs, etc.). There is little use for the advanced stack on top of it if the data does not end up in the index in the first place. Hope you provide an update on this in the future. I'd probably sign up for a plan.


No, the preview is NOT the representation of what we have indexed. The preview is limited to about 200 words to be compliant with fair use legislation. Indexing is limited to 1 MB per document.


> Crawling speed is between 100...1000 pages per second.

Stupid question, but you were crawling news.ycombinator.com, right?

Its robots.txt (https://news.ycombinator.com/robots.txt) contains

  Crawl-delay: 30
Why did you not follow that?

(I'm not being accusatory. I'm both curious about web crawling in general and have personally been archiving the front page and "new" every 60 seconds or so... (Obviously there's no reason for me to retrieve them more often, but my curiosity persists.))


>> you were crawling news.ycombinator.com, right?

No, for retrieving the Hacker News Posts we were using the public Hacker News API, which returns the posts in JSON format: https://github.com/HackerNews/API

The crawling speed of 100...1000 pages per second refers to crawling the external pages linked from Hacker News posts. As they are from different domains, we can achieve a high overall crawling speed while remaining a polite crawler with a low crawling rate per domain.
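A small sketch of that setup (the item endpoint is the documented Hacker News API; the per-domain delay logic and its value are assumptions for illustration, not SeekStorm's crawler):

```typescript
// Fetch one post from the official Hacker News API (https://github.com/HackerNews/API).
async function fetchItem(id: number): Promise<any> {
  const res = await fetch(`https://hacker-news.firebaseio.com/v0/item/${id}.json`);
  return res.json();
}

// Politeness sketch: many domains in flight at once, but each individual domain
// is only hit after a per-domain delay, so aggregate throughput stays high while
// the per-host rate stays low.
const lastHit = new Map<string, number>();
const PER_DOMAIN_DELAY_MS = 5000; // illustrative value, not SeekStorm's actual setting

async function politeFetch(url: string): Promise<string> {
  const host = new URL(url).host;
  const wait = (lastHit.get(host) ?? 0) + PER_DOMAIN_DELAY_MS - Date.now();
  if (wait > 0) await new Promise(resolve => setTimeout(resolve, wait));
  lastHit.set(host, Date.now());
  const res = await fetch(url);
  return res.text();
}
```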


Neat but the back button hijacking is not cool.


Thanks for pointing this out. That was not intentional, probably a side effect of the navigation inside our search results within a dynamic page (using the back button to return to previous search queries). We will fix this.


We have just fixed the back button bug (make sure to clear the browser cache).


This kind of thing is enough for me to never want to use a site. If it's a bug, I highly recommend you fix it; if it's not, I highly recommend you reconsider your position.


It is so common in SPAs that the long-press of the back button to see the list of recent sites should be muscle memory by now for everyone.

Yes, it's an annoyance. Yes, sites should fix it. But not ever using a site because of it seems silly when it's literally a half-second longer click.

(Also, for this site I don't actually see the bug. So either they fixed it very rapidly, or GP was just referring to individual searchers being in the history, which is common for any search engine.)


> It is so common in SPAs that the long-press of the back button to see the list of recent sites should be muscle memory by now for everyone.

This is not a good excuse for laziness.

> But not ever using a site because of it seems silly when it's literally a half-second longer click.

I highly disagree. With this site, at least, it is actually possible to leave. With many websites I've come across that have this issue, it is entirely impossible to leave without physically holding down the back button in the browser to get a list of history items and then clicking a site from earlier.

> (Also, for this site I don't actually see the bug. So either they fixed it very rapidly, or GP was just referring to individual searchers being in the history, which is common for any search engine.)

Looks like they've fixed it.


You are right. It IS a bug, and we will fix it asap.


I'm a bit curious about this 'back button hijacking'. Since the bug has now been fixed, I'm afraid I'm not able to test it. What was the unintended behaviour?


If you entered deephn.org in your browser's address bar, you couldn't use the browser back button to return to the previous page. That was a bug caused by the navigation code within our dynamic HTML page. It is fixed now.


Interesting. Thanks for your reply.


Really cool demo. I'm curious to know what kind of hardware this is hosted on.

Full disclosure: I work on a similar fast, typo-tolerant, fuzzy search engine called Typesense (https://github.com/typesense/typesense).


SeekStorm uses dedicated root servers with NVMe SSDs, hosted by IONOS https://www.ionos.com/servers/intel-servers


Yeah, was also interested. It would be nice to see Typesense take up the same challenge for comparison ;)


Happy to benchmark if the dataset is shared!


Seems odd to publish a benchmark against Lucene where your results are "Lucene crashes at 4 concurrent users". Since there are thousands of people using Lucene and products built on Lucene every day, I think the issue may be in your benchmarking code.


There is no secret about the benchmark, it's all open source: https://github.com/wolfgarbe/LuceneBench


It seems you tried MMapDirectory and then commented it out. https://github.com/wolfgarbe/LuceneBench/blob/master/LuceneB...

So, you may be hitting SimpleFSDirectory instead, which does have issues with too many searches.

Could you share the reasons MMapDirectory did not work for you?


"NIOFSDirectory and MMapDirectory implementations face file-channel issues in Windows and memory release problems respectively. To overcome such environment peculiarities Lucene provides the FSDirectory.open() method. When invoked, it tries to choose the best implementation depending on the environment." https://www.baeldung.com/lucene-file-search


Right. I found out that you are running on Windows after posting the comment. Makes sense, though there were some fixes in the latest Lucene, I believe.

Lucene's own benchmarks are at https://home.apache.org/~mikemccand/lucenebench/ , though I admit I don't know enough about benchmarking to form a strong opinion.

Either way, good luck with the project/service. More competition is always great. The open-source components look interesting too.


The way I understand it is that it crashes when 4 concurrent users are running the benchmark. Which could make sense if it’s a stress-test.


It is a really great tool to see what used to be very popular posts in the past.

I have one observation. Posts linking to Twitter - https://deephn.org/?sort=score&filter=%7B%22score%22%3A%7B%2... - have most of the preview occupied by navigation and other uninteresting parts. So maybe it is possible to use some hacky solution to show the content part for the most popular websites, or, if multiple pages are downloaded from the same domain, the content part could be detected automatically.


You are right. The preview is far from perfect. We just wanted to ship early. We will continue to fix and improve.


Amazingly fast! I'm used to seeing Firefox's load progress bar for most requests on my mobile, yet here the results "just show up".

More observations:

- When trying to go back after clicking the original HN link, there's no response.

- Clicking on what I understand are tags (hashtags) in a given post has no effect.

- Sorting by date is working, though it'd be nice to see the date.

Also a question: how do you generate the tags (hashtags) from the original contents?


>> When trying to go back after clicking the original HN link, there's no response.

The original HN link is opened in a new browser page/tab. The search results should remain unchanged in the previous tab. So you don't use the back button, but the previous browser tab.

>> Clicking on what I understand are tags (hashtags) in a given post has no effect.

Search for "google" and go to the first result. There is a hashtag "oracle"; click it and the search results are filtered by that hashtag. Please post an example where it doesn't work, so that we can fix it.

The date is shown in the preview panel on the right-hand side. But we are thinking about adding the time to the result list as well.

We are deriving the tags from the terms and bigrams in the title, text and parsed HTML of linked web pages. We take the most frequent terms per post, but only if the terms are within the top 65k tags per index. Stopwords are excluded.
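A rough sketch of that tagging recipe (the stopword list, tokenizer and cut-offs here are assumptions for illustration, not DeepHN's actual values):

```typescript
// Derive tags: count unigrams and bigrams, drop stopwords, keep the most frequent
// candidates that also appear in the per-index set of allowed tags (e.g. top 65k).
const STOPWORDS = new Set(["the", "a", "an", "and", "or", "of", "to", "in", "is", "for"]);

function deriveTags(text: string, allowedTags: Set<string>, topK = 5): string[] {
  const words = text.toLowerCase().match(/[a-z0-9][a-z0-9.#+-]*/g) ?? [];
  const counts = new Map<string, number>();
  const bump = (t: string) => counts.set(t, (counts.get(t) ?? 0) + 1);

  for (let i = 0; i < words.length; i++) {
    if (!STOPWORDS.has(words[i])) bump(words[i]); // unigrams
    if (i + 1 < words.length && !STOPWORDS.has(words[i]) && !STOPWORDS.has(words[i + 1]))
      bump(`${words[i]} ${words[i + 1]}`);        // bigrams
  }
  return [...counts.entries()]
    .filter(([tag]) => allowedTags.has(tag)) // only tags within the index-wide tag set
    .sort((a, b) => b[1] - a[1])             // most frequent first
    .slice(0, topK)
    .map(([tag]) => tag);
}
```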


> Please post an example where its doesn't work, so that we can fix it.

Note: I'm trying this in a mobile browser (Firefox).

- Search for a word 'Decentralized'

- pick the second result: 'A decentralized web would give power back to the people online'

- On the result's (preview?) page, click the tag 'Decentralization'

I get no response, nothing updates, no new tab opens. I'm not sure what the intended action is (I'd assume it should be a new set of results corresponding to the hashtag). Same issue with the 'google' example that you described.


If you click a tag on the result/preview page (in mobile Firefox), the result list is filtered by the tag (you get a new/shorter result list with an updated result count).

But because you are still on the preview page, you don't immediately see this, as it happens in the background. Once you close the preview page (with the big arrow at the top left of the preview page) you will see the updated result list and count.

Of course, this is a design flaw; we should automatically close the preview window once you select a tag, to immediately show the updated result list. In the desktop version this is not an issue, as the result list and preview window are always visible simultaneously.


Apologies for being nitpicky about the already-fixed back button. While viewing a specific drilled-down result, each time the back button is pressed, it seems to navigate to the previous character in the incremental search. In other words, if the search term is 'C#', navigating back seems to highlight 'C' in the result detail page.


This is most likely a timing problem. If you type 'c#' fast, then the back button brings you back to the previous query. If you pause between typing 'c' and '#', then it brings you back to the previous character only.

That is related to our instant search, where you don't need to hit the return key to search. This causes the problem that we need to identify when one query ends and the next query starts, in order to create the correct entries for the back button. We do this via timing - if the pause between key presses is too long, a new, separate entry is assumed. As DeepHN is a single dynamic HTML page, we need to implement and manage the browser navigation logic ourselves.
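For readers wondering what that looks like in practice, here is a browser-side sketch of the idea (the threshold and the `runSearch` stub are assumptions, not DeepHN's actual code): a pause between keystrokes marks the boundary between two queries, and only a "new" query pushes a history entry.

```typescript
// Instant search + manual history management: push a new back-button entry only
// when a pause indicates that a new query has started; otherwise refine the current one.
const QUERY_PAUSE_MS = 1000; // illustrative threshold
let lastKeystroke = 0;

declare function runSearch(query: string): void; // stub for the instant-search call

function onSearchInput(query: string): void {
  const now = Date.now();
  const startsNewQuery = now - lastKeystroke > QUERY_PAUSE_MS;
  lastKeystroke = now;

  const url = `/?q=${encodeURIComponent(query)}`;
  if (startsNewQuery) {
    history.pushState({ query }, "", url);    // new back-button entry
  } else {
    history.replaceState({ query }, "", url); // refine the current entry
  }
  runSearch(query);
}
```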


I was hoping it would do full-text search of the text inside the linked pages, like hndex.org but with ranking.


Yes, it does exactly that. If you check "Web" in the left sidebar and uncheck "Stories", it will find the posts where "friendship" occurs in the linked web pages. But we still show the title of the original Hacker News story in the search result and highlight the search term if it ALSO occurs in the title.

If you scroll long enough within the search results, you will eventually reach results where the search term is not in the title, but only within the linked web page.

It is more obvious if you search for queries which are less popular than "friendship".


You have to check the "Web" box for it to search in linked articles too. Without that, it highlights words in articles that are found only because of the title. And the indexing of contents isn't perfect; sometimes it seems to work for the main article content but not for the article author's name, even when it's shown.


I think it is supposed to be doing that?


If I search "friendship" all the article titles have the word friendship in them. In my experience many HN titles don't contain the topic word.


Nice to see full-text search of the linked articles, and it's really fast. Do you provide an API like Algolia does for HN?


DeepHN is powered by our SeekStorm search-as-a-service, which has a full-featured API: https://seekstorm.com/docs

SeekStorm is intended for users to index and search their own private data.

But if there is demand, we could also allow searching a central public index of web data or other public data.


How does SeekStorm perform on large documents? Say documents 1 MB-10 MB in size - are you able to search entire docs, or just paragraphs within a doc?


Currently the maximum size of a single document is limited to 1 MByte. That is an artificial limitation resulting from the fact that SeekStorm not only indexes the content of the document, but also stores the original document. Somewhere there has to be a limit.


Interesting, it doesn't seem to show the dead pages. I have been indexing dead HN pages for several years now, with the idea that eventually I'd run some statistics on it to better understand both the auto-dead filters and the people who are submitting dead content.


https://deephn.org/?q=procedural+planet&filter=%7B%22domain%...

Probably a bug - there is a huge amount of text in the title here.


Thank you. We will look into this and fix it asap.


Have you tried it against a corpus of text like email? How well does that work? I've been frustrated with the search speed (and results) of email search in general and was wondering if I could do better with an offline search.


That would definitely work. We are working on a local client that could index your local PDF and Word documents, so importing emails would be a nice addition.


I'd be interested in seeing how well it performs vs. integrated search. Outlook search (which is what I use) suffers from performance and relevance issues, yet when I tried this experiment with another full-text search provider (Azure Search) it performed worse on relevancy.

I use email as kind of an activity stream - generally I can remember roughly the people involved and the time period. Would be lovely to have search that performs well in this regard.


Love it! I would kill for a reverse order button to see oldest posts first


We can add that button, if you promise not to kill ;)


Okay :D


Done. Now you can sort by newest/oldest date.


Thank you all for your great feedback. We will fix all the bugs and add some of the suggested features and release a next iteration of DeepHN within the next few days!


That is a great site. To be really honest, those bugs are not much trouble compared with the service you provide to everyone. It is a really remarkably useful website. Thank you so much from the bottom of our hearts.


So, can you explain what is going on with 'covid' articles from before 2019 - are these cases where the original page that was linked to has now become a spam link?


Not necessarily a spam link. Linked pages from older posts have sometimes updated their content.

Sometimes this is legitimate, sometimes spam.


Curious to know how it differs from HN Search [1]. HN Search API documentation is at [2].

[1] https://hn.algolia.com

[2] https://hn.algolia.com/api


It searches the content of referenced webpages. It also has filtering by year intervals (for example 2012-2016), domains, tags, etc. And I find the usability better.

I think it is a great product.


How can we learn more about this indexing service?



I like it. I built something similar recently for indexing scientific publications on r/COVID19, based on Elasticsearch and Kibana.


Can you share a link?


Neat! I wonder if you quantified the performance of your search somehow? I guess labelling data for that would be laborious.


You mean performance in terms of latency or relevancy?


Relevancy - I'd guess it's difficult to measure, due to the large dataset, the large number of possible search terms, and the large number of possible results.


A page to bookmark for sure and an interesting product.

The UI takes getting used to, but then again so do many actually useful tools.


Can we exclude certain terms in search with a "-" e.g. "-oculus"?


The '-' operator has been fixed. 'vr -oculus' should work correctly now.


It should, but it currently doesn't in DeepHN. We will fix this asap.


Nice! How did you fetch the content of those articles?


With the crawler which is part of the SeekStorm search-as-a-service.


This is nice! How are the stories tagged?


We are deriving the tags from the terms and bigrams in the title, text and parsed HTML of linked web pages. We take the most frequent terms per post, but only if the terms are within the top 65k tags per index. Stopwords are excluded.


I like the UI. Very nice.



