I assume you are using SymSpell for fuzzy search and prioritize candidate frequency over edit distance. My input "algolai" returns results for "angola" even though "algolia" was just one transposition away. Anyway, just dropping in to say that the stuff on your GitHub helped me figure out a lot while building a search engine from scratch. Good luck with SeekStorm!
We are combining auto correction (SymSpell) and auto completion (PruningRadixTrie). That is sometimes tricky. We are using a static spelling correction dictionary which does not contain the term "algolia", only "angola".
SymSpell is very fast, but requires a lot of memory. Therefore we are using a static dictionary, and not a dynamic dictionary derived from each individual index of each customer. Auto completion, on the other hand, is generated dynamically for each individual index/customer.
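To make the ranking question concrete, here is a minimal TypeScript sketch of the two strategies (the Candidate shape and the numbers are made up for illustration; this is not SymSpell's actual API). Note that in this case "algolia" is missing from the static dictionary entirely, so no ranking strategy could surface it:

```typescript
// Hypothetical candidate shape: term, edit distance to the input, and corpus
// frequency. These names are illustrative, not SymSpell's API.
interface Candidate { term: string; distance: number; frequency: number; }

// Distance-first: the closest edit wins; frequency only breaks ties.
function rankByDistance(c: Candidate[]): Candidate[] {
  return [...c].sort((a, b) => a.distance - b.distance || b.frequency - a.frequency);
}

// Frequency-first: the most frequent term wins; distance only breaks ties.
function rankByFrequency(c: Candidate[]): Candidate[] {
  return [...c].sort((a, b) => b.frequency - a.frequency || a.distance - b.distance);
}

// For the input "algolai" (assuming both terms were in the dictionary):
const candidates: Candidate[] = [
  { term: "algolia", distance: 1, frequency: 5_000 },   // one transposition away
  { term: "angola", distance: 2, frequency: 500_000 },  // two edits, but far more frequent
];
console.log(rankByDistance(candidates)[0].term);  // "algolia"
console.log(rankByFrequency(candidates)[0].term); // "angola"
```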
I'm curious how this site is compliant with copyright law. I've thought of working on similar projects, but the possibility of being sued has dissuaded me.
Search engines are generally considered fair use [0], at least in the US. So until HN sends a cease and desist, it's in the clear. Once it does, then it's all about what the claims are. Google litigated the crap out of this [1] over the years, so there is a minefield of positive precedent all over the place.
They copy the whole text of web pages that are linked in the HN posts. So totally a mass breach of copyright.
If you are afraid of being sued, you certainly should not do something like this.
On the upside, there is probably not much provable damage for which a publisher could seek compensation. So they would probably only be liable for the legal expenses of whoever sues them.
DeepHN does not copy the whole page, only the first 200 words or fewer. It does not copy the whole website, but only a fraction of a single page of that site.
We always provide the original link. That should make it fair use. Of course we would prefer if there was a clear and concise legislation, valid in all countries, which could be followed.
We mainly surface historical content, which receives additional traffic from DeepHN, instead of taking value away from the website.
It's probably not compliant. As with all good things, if you're in the Land of the Free (TM), eventually people will hunt you down if you try to do good things.
Alternatively, base your site out of a country which gives you the freedom to do things like this. Build it in China and you'll be applauded instead of sued. Copyright suits for petty things like this would be laughed off when it's something actually useful.
Yes, that's also true, but on the flip side, technological advancement of society as a whole is not stifled by idiotic IP law.
I agree, some anti-monopoly regulation is in order in China, and I'm not a fan of the Alibaba/Tencent monopolies on the ecosystem. However, I think there are other ways to go about fostering and encouraging small businesses and protecting creators and their intentions than American-style IP law, which allows you to sit on an invention or work and hinder society from having access to the fruits of science, which I'm vehemently against. IP isn't even real property, IMHO.
Yes it is. If 90% of the reason for new tech is to get richer, then dropping all protections like copyright, patents, etc. will leave people much less incentivized to invest and stick their necks out for new creations. I wholeheartedly disagree with your IMHO. That said, we could absolutely improve our current patent and copyright systems (especially copyright, and those parts which contribute to patent trolls).
We're sorry but seekstorm-deephn doesn't work properly without JavaScript enabled. Please enable it to continue.
Could the folks with JS disabled get a small semi-static page with the top interesting HN stats, and a message that says something like "Here are some simple stats. For the full interactive experience, please enable JavaScript"?
Very cool and useful. I was trying to find all comments on HN which contain the word youtube.com (basically trying to find some recently linked videos from comments), but it returns 0 results when I set the search filter to comments(1). Any idea what I'm doing wrong?
It’s fun to see Bitcoin posts from 2008-2010. I wonder how many bought and are now retired because of them. Seeing recommendations to use MtGox is pretty funny too
> 21 million bitcoins seems like a low limit to set. According to [0] the exchange rate is currently 221 Bitcoins per USD, so the total value of all possible bitcoins is only 95000USD.
I read about Bitcoin here over a decade ago. Took the time to setup a miner, and let it run for a week. I think I generated a few hundred coins worth less than a dollar total. Figured the electricity cost way more than that and deleted it all.
People love to pick out the comments of the past that seem most wrong now, because—let's be honest—it makes us feel superior. But if you go back and look at that original 3-comment Bitcoin thread, what you find is that those 3 comments span an entire spectrum: elated, skeptical, and bewildered: https://news.ycombinator.com/item?id=599852.
The first cryptocurrency thread on HN—about 9 months earlier—was even more interesting: https://news.ycombinator.com/item?id=253963. It contains another skeptical comment: "The bigger problem is that everyone has a incentive to run their computers day and night cranking out solutions, which burns up lots of natural resources and processor time for a zero-sum result." (https://news.ycombinator.com/item?id=253999), which seems astonishingly prescient now, even though the objection is still hotly (no pun intended) debated.
Cherry-picking the wrongest-seeming skeptical comment from that collection of data points is a case study in survivorship bias. (I don't mean to pick on you personally! This is common of course.) The infamous Dropbox comment from 2008 is similar in that it has gotten repeated out of context in a way that is unfair to the original commenter (https://news.ycombinator.com/item?id=23229275). When we repeat these things, it says more about us than it does about the thing we're repeating.
What this also shows to me is just how difficult it is to predict the future. Here is another thought experiment: There is probably a wonderful technology / opportunity, perhaps lying in plain sight, that we will be kicking ourselves in ten years for not having gotten involved in. But what is it? "Every day brings new opportunities to become very rich, most of which we are bound to miss" [1]
Not really a superiority thing for me. I just think it’s funny how much perceptions change over time and how reasonable assumptions can be proven wrong
It's just humbling. Others and me alike, every day we miss out on hundreds of opportunities; it's a good reminder of life's complexity. Things like that should be printed and posted on an office wall.
Also, from a creator's perspective, it's important to see these cases of once-in-a-generation inventions (good or bad) and the comments on them.
Ho boy, looking up Xinjiang and then reading this comment was a trip. Anyway, Dropbox is pretty great, and seeing the differences in sentiment towards TSLA stock compared to Bitcoin over the years has been great. Can't pick all the winners.
At the time it felt astonishing that central banks allowed non-state currencies to exist on national ground, for the first time. It still feels astonishing.
What did you use to crawl the pages, and how long did it take? Curious about your experience doing it and the crawler integration with SeekStorm.
Is it possible to easily expand the index with embeddings (vectors) and perform semantic search in parallel?
Your pricing indicates that hosting an index similar to this demo would cost $500/month. Wondering what kind of infrastructure is supporting the demo?
Thanks!
Crawling speed is between 100 and 1,000 pages per second. We crawled about 4 million unique linked web pages.
The pricing for an index like DeepHN would be $99/month: while we are indexing 30 million Hacker News posts, for DeepHN we combine a single HN story with all its comments and its linked web page into a single SeekStorm document, so that we index just 4 million SeekStorm documents.
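To illustrate that combination, a hypothetical shape of one such combined document (all field names are assumptions for illustration, not SeekStorm's actual schema):

```typescript
// Hypothetical shape of one combined DeepHN document; all field names are
// assumptions, not SeekStorm's actual schema.
interface DeepHnDocument {
  storyId: number;     // HN story id
  title: string;       // HN story title
  url: string;         // the external page the story links to
  comments: string[];  // all comments of the story, flattened
  pageText: string;    // extracted text of the linked page (indexing capped at 1 MB)
  tags: string[];      // derived tags (see the tag derivation reply below)
}
```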
Yes, it would be possible to expand the index with embeddings (vectors) and perform semantic search. This would be an auxiliary step between crawling and indexing.
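A rough sketch of what such an auxiliary step could look like (embed() is a stand-in for whatever embedding model one would plug in; none of this is an existing SeekStorm API):

```typescript
// embed() stands in for any sentence-embedding model (e.g. a call out to an
// external service); stubbed here so the sketch is self-contained.
async function embed(text: string): Promise<number[]> {
  return new Array(384).fill(0); // replace with a real model call
}

// Auxiliary step between crawling and indexing: attach a vector to each
// crawled document so keyword search and semantic search can run in parallel.
async function enrichWithEmbedding<T extends { pageText: string }>(doc: T) {
  const vector = await embed(doc.pageText.slice(0, 2048)); // bound the model input
  return { ...doc, vector };
}

// Cosine similarity for the semantic side of a hybrid query.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na * nb) || 1);
}
```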
"Well-defined": Just a guess: We are doing key text extraction, i.e. we try not to index boilerplate stuff an and menu items. As "well-defined" is within a short list item, it might be accidentally skipped.
So, that is not yet perfect, and considering the diversity in web page structure it probably never will.
But we will try to improve.
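The failure mode described above can come from a heuristic as simple as this sketch (thresholds invented for illustration; not our actual extraction code):

```typescript
// Crude boilerplate filter: skip text blocks that are very short or link-dense,
// since those are usually navigation or menus. The thresholds are made up for
// illustration; a legitimate short list item like "well-defined" can fall
// through exactly this kind of net.
function isLikelyBoilerplate(text: string, linkCount: number): boolean {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  const linkDensity = linkCount / Math.max(words, 1);
  return words < 5 || linkDensity > 0.5;
}
```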
Yes, if the preview is the representation of what you have indexed, then half of that article is missing. You may have identified the weakest link in your stack, the crawler/extractor (which is notoriously hard to get right; it would be good if you provided more detail, e.g. do you use a headless browser or simple GET requests, do you crawl PDFs, etc.). There is little use for the advanced stack on top of it if the data does not end up in the index in the first place. Hope you provide an update on this in the future. I'd probably sign up for a plan.
No, the preview is NOT the representation of what we have indexed. The preview is limited to about 200 words to be compliant with fair use legislation. Indexing is limited to 1 MB per document.
(I'm not being accusatory. I'm both curious about web crawling in general and have personally been archiving the front page and "new" every 60 seconds or so... (Obviously there's no reason for me to retrieve them more often, but my curiosity persists.))
No, for retrieving the Hacker News posts we were using the public Hacker News API, which returns the posts in JSON format: https://github.com/HackerNews/API
The crawling speed of 100-1,000 pages per second refers to crawling the external pages linked from Hacker News posts. As they are from different domains, we can achieve a high overall crawling speed while remaining a polite crawler with a low crawling rate per domain.
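The idea, sketched (the 2-second per-domain delay is an assumed value, not our actual configuration):

```typescript
// Polite crawling sketch: overall throughput stays high because pages come
// from many different domains, while any single domain is only fetched at a
// low rate. The per-domain delay is an assumed value.
const PER_DOMAIN_DELAY_MS = 2000;
const nextAllowed = new Map<string, number>(); // domain -> earliest next fetch time (ms)

async function politeFetch(url: string): Promise<Response> {
  const domain = new URL(url).hostname;
  const now = Date.now();
  const wait = Math.max(0, (nextAllowed.get(domain) ?? 0) - now);
  nextAllowed.set(domain, now + wait + PER_DOMAIN_DELAY_MS); // reserve the next slot
  if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
  return fetch(url);
}
```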
Thanks for pointing this out. That was not intentional, probably a side effect of the navigation inside our search results within a dynamic page (using the back button to return to previous search queries). We will fix this.
This kind of thing is enough for me to never want to use a site. If it's a bug, I highly recommend you fix it; if it's not, I highly recommend you reconsider your position.
It is so common in SPA that the long-press of the back button to see the list of recent sites should be muscle-memory by now for everyone.
Yes, it's an annoyance. Yes, sites should fix it. But not ever using a site because of it seems silly when it's literally a half-second longer click.
(Also, for this site I don't actually see the bug. So either they fixed it very rapidly, or GP was just referring to individual searches being in the history, which is common for any search engine.)
> It is so common in SPA that the long-press of the back button to see the list of recent sites should be muscle-memory by now for everyone.
This is not a good excuse for laziness.
> But not ever using a site because of it seems silly when it's literally a half-second longer click.
I highly disagree. With this site at least it is actually possible to leave. Many websites I've come across with this issue, it is entirely impossible to leave without physically holding down the back button in the browser to get a list of history items, and then clicking a site from earlier.
> (Also, for this site I don't actually see the bug. So either they fixed it very rapidly, or GP was just referring to individual searches being in the history, which is common for any search engine.)
I'm a bit curious about this 'back button hijacking', since the bug has now been fixed I'm afraid I'm not able to test. What was the unintended behaviour?
If you were entering deephn.org in your browser address bar, you couldn't use the browser back button to return to the previous page. That was a bug caused by the navigation code within our dynamic HTML page. It is fixed now.
Seems odd to publish a benchmark against Lucene where your results are "Lucene crashes at 4 concurrent users". Since there are thousands of people using Lucene and products built on Lucene every day, I think the issue may be in your benchmarking code.
"NIOFSDirectory and MMapDirectory implementations face file-channel issues in Windows and memory release problems respectively. To overcome such environment peculiarities Lucene provides the FSDirectory.open() method. When invoked, it tries to choose the best implementation depending on the environment."
https://www.baeldung.com/lucene-file-search
It is a really great tool to see what used to be very popular posts in the past.
I have one observation. Posts linking to Twitter - https://deephn.org/?sort=score&filter=%7B%22score%22%3A%7B%2... - have most of the preview occupied by navigation and other uninteresting parts. So maybe it is possible to use some hacky solution to show the content part for the most popular websites; or, if multiple pages are downloaded from the same domain, the content part could be detected automatically.
>> When trying to go back after clicking the original HN link, there's no response.
The original HN link is opened in a new browser page/tab. The search results should remain unchanged in the previous tab. So you don't use the back button, but the previous browser tab.
>> Clicking on what I understand are tags (hashtags) in a given post has no effect.
Search for "google", go to the first result. there is a hashtag "oracle", click and the search results are filtered by that hashtags. Please post an example where its doesn't work, so that we can fix it.
The date is shown in the preview panel on the right-hand side. But we are thinking about adding the time to the result list as well.
We are deriving the tags from the terms and bigrams in the title, text, and parsed HTML of linked web pages. We keep the top frequent terms per post, but only if the terms are within the top 65k tags per index. Stopwords are excluded.
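Roughly like this sketch (the inputs are assumed; this illustrates the approach, not our production code):

```typescript
// Sketch of the tag derivation described above: count unigrams and bigrams,
// drop stopwords, and keep a post's most frequent grams, but only those that
// are within the index-wide top tags. topTags and stopwords are assumed inputs.
function deriveTags(text: string, topTags: Set<string>, stopwords: Set<string>, max = 10): string[] {
  const words = text.toLowerCase().match(/[a-z0-9#+.]+/g) ?? [];
  const counts = new Map<string, number>();
  for (let i = 0; i < words.length; i++) {
    if (stopwords.has(words[i])) continue;
    counts.set(words[i], (counts.get(words[i]) ?? 0) + 1);
    if (i + 1 < words.length && !stopwords.has(words[i + 1])) {
      const bigram = `${words[i]} ${words[i + 1]}`;
      counts.set(bigram, (counts.get(bigram) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([gram]) => topTags.has(gram)) // only grams within the top 65k tags per index
    .sort((a, b) => b[1] - a[1])           // top frequent grams per post
    .slice(0, max)
    .map(([gram]) => gram);
}
```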
> Please post an example where it doesn't work, so that we can fix it.
Note: I'm trying this in a mobile browser (Firefox).
- Search for a word 'Decentralized'
- Pick the second result: 'A decentralized web would give power back to the people online'
- On the result's (preview?) page, click the tag 'Decentralization'
I get no response: nothing updates, no new tab opens. Not sure what the intended action is (I'd assume it should be a new set of results corresponding to the hashtag). Same issue with the 'google' example that you described.
If you click a tag on the result/preview page (in mobile Firefox), the result list is filtered by the tag (you get a new/shorter result list with an updated result number).
But because you are still on the preview page you don't immediately see this as it's done in the background.
But once you close the preview page (with the big arrow on the top/left of the preview page) you will see the updated result list and number.
Of course, this is a design flaw; we should automatically close the preview window once you select a tag, to immediately show you the updated result list.
In the desktop version this is no issue, as the result list and preview window are always visible simultaneously.
Apologies for being nitpicky about the already-fixed back button. While viewing a specific drilled-down result, each time the back button is pressed, it seems to navigate to the previous character in the incremental search. In other words, if the search term is 'C#', navigating back seems to highlight 'C' in the result detail page.
This is most likely a timing problem. If you type c# fast, then the back button brings you back to the previous query.
If you pause between typing 'c' and '#', then it brings you back to the previous character only.
That is related to our instant search, where you don't need to hit the return key to search. This causes the problem that we need to identify when a query ends and the next query starts, in order to create the correct entries for the back button. We do this via timing: if the pause between key presses is too long, a new, separate entry is assumed. As DeepHN is a single dynamic HTML page, we need to implement and manage the browser navigation logic ourselves.
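As a sketch, the timing heuristic looks roughly like this (PAUSE_MS is an assumed value; our actual implementation differs in detail):

```typescript
// Keystrokes arriving within PAUSE_MS of each other are treated as one evolving
// query (the current history entry is amended); a longer pause starts a new
// query and pushes a new entry, so the back button steps between whole queries
// rather than single characters. PAUSE_MS is an assumed value.
const PAUSE_MS = 1000;
let lastKeystroke = 0;

function onQueryInput(query: string) {
  const now = Date.now();
  const sameQuery = now - lastKeystroke < PAUSE_MS;
  lastKeystroke = now;
  const url = `?q=${encodeURIComponent(query)}`;
  if (sameQuery) {
    history.replaceState({ query }, "", url); // still typing: amend the current entry
  } else {
    history.pushState({ query }, "", url);    // pause detected: new back-button entry
  }
}
```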
Yes, it does exactly that. If you check "Web" in the left sidebar and uncheck "Stories", it will find the posts where "friendship" occurs in the linked web pages. But we still show the title of the original Hacker News story in the search result and highlight the search term if it ALSO occurs in the title.
If you scroll long enough within the search results you will finally reach results where the search term is not in the title, but only within the linked web page.
It is more obvious if you search for queries which are less popular than "friendship".
You have to check the box "Web" for it to search in linked articles too. Without that, it highlights words in articles that were found only because of the title. And the indexing of contents isn't perfect; sometimes it seems to work for the main article content but not for the article author's name, even when it's shown.
Currently the maximum size of a single document is limited to 1 MByte. That is an artificial limitation resulting from the fact that SeekStorm not only indexes the content of the document, but also stores the original document.
Somewhere there has to be a limit.
Interesting, it doesn't seem to show the dead pages. I have been indexing dead HN pages for several years now, with the idea that eventually I'd run some statistics on it to better understand both the auto-dead filters and the people who are submitting dead content.
Have you tried against a corpus of text like email? How well does that work? I've been frustrated with the search speed (and results) from email search in general and was wondering if I could do better with an offline search?
That would definitely work. We are working on a local client that could index your local PDF and Word documents. So importing emails would be a nice addition.
I'd be interested in seeing how well it performs vs. integrated search. Outlook search (which is what I use) suffers from performance and relevance issues, yet when I tried this experiment with another full-text search provider (Azure Search) it performed worse on relevancy.
I use email as kind of an activity stream - generally I can remember roughly the people involved and the time period. Would be lovely to have search that performs well in this regard.
Thank you all for your great feedback. We will fix all the bugs, add some of the suggested features, and release the next iteration of DeepHN within the next few days!
That is a great site. To be really honest, those bugs are not much trouble compared with the service you provide to everyone. It is a truly remarkable, useful website. Thank you so much from the bottom of our hearts.
So, can you explain what is going on with 'covid' articles from <2019 - are these where the original page that was linked to has now become a spam link?
It searches the content of referenced web pages. It also has filtering by year intervals (for example 2012-2016), domains, tags, etc. And I find the usability better.
Relevancy: I'd guess it's difficult to measure, due to the large dataset, the large number of possible search terms, and the large number of possible results.