
Thanks for noticing! To be more specific, "what is a" are stop words and "cynic" is very common, so a lot of rows are returned (see my other comment). ts_rank takes too long to rank them, and the server times out, leaving you with the previous query's table because I didn't take the time to program a correct response to this issue. "Cecil Graham. What is a cynic?" returns Lady Windermere's Fan almost instantly.

The workarounds I've thought of would be to cache these queries (assuming I've seen them before, and after I've set up logging), buy a larger server, or pay Second Quadrant to speed up ts_rank... I'd love any suggestions from more experienced Postgres engineers!

Edit to your edit, re parsing. The subset of rows returned is selected by:

  where language = %s::regconfig and textsearchable_index_col @@ phraseto_tsquery(%s::regconfig, %s)
and relevance is determined by:

  ts_rank_cd(textsearchable_index_col, phraseto_tsquery(%s::regconfig, %s), 32)
with the %s parameters being the language and the search text, respectively.
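
You can check what the stop-word filtering leaves behind (a quick psql check, assuming the english config):

  select phraseto_tsquery('english', 'what is a cynic');
  -- 'cynic'  (the stop words are dropped, leaving one very common lexeme)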



I have a few suggestions regarding search performance.

As others have mentioned, there isn't one "right answer" but there are a few nifty tricks you can use to boost performance.

You could try a trigram index[0] as the primary search index. You'd lose stemming, but gain language-independent matching and indexed regular-expression search, which is a nice power-user feature.
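
A minimal sketch, reusing your table and column names (gin_trgm_ops ships with pg_trgm):

  create extension if not exists pg_trgm;

  create index paragraph_trgm_idx on gutenberg.paragraphs
      using gin (paragraph gin_trgm_ops);

  -- ilike and posix regex matches can now use the index
  select num, paragraph
  from gutenberg.paragraphs
  where paragraph ilike '%cynic%';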

Looking at your where clause, did you index language and textsearchable_index_col together? If not, you are giving the planner the opportunity to skip your full text index.
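
A sketch of two options (btree_gin lets a scalar column share a GIN index; if language is stored as regconfig you may need to cast it or store text):

  create extension if not exists btree_gin;

  create index textsearch_lang_paragraph_idx on gutenberg.paragraphs
      using gin (language, textsearchable_index_col);

  -- alternatively, one partial index per language, no extension needed:
  create index textsearch_paragraph_en_idx on gutenberg.paragraphs
      using gin (textsearchable_index_col)
      where language = 'english'::regconfig;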

If you order by ts_rank_cd directly, the query will sort all rows, including those that don't match. A better pattern is to compute the ranking score as a column and then sort it in a subquery; [1] from Stack Overflow has an example. (As an aside, from PostgreSQL 12, CTEs are no longer an optimization fence, so you can write the query out more cleanly with a CTE and still get the desired performance.)
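
Something like this (a sketch with a hard-coded search phrase; substitute your parameters and page size):

  select num, paragraph, rank
  from (
      select num, paragraph,
             ts_rank_cd(textsearchable_index_col, query, 32) as rank
      from gutenberg.paragraphs,
           phraseto_tsquery('english', 'what is a cynic') as query
      where textsearchable_index_col @@ query
  ) matches
  order by rank desc
  limit 20;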

You should experiment with GIN vs GiST indices. GIN indexes are larger, so on a large dataset the index might not fit in memory. You could pay for more memory, but it's worth trying a GiST index to see if that makes things faster simply because it fits in memory.
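
For example, build the GiST variant alongside the GIN one (hypothetical index name) and compare sizes:

  create index textsearch_paragraph_gist_idx on gutenberg.paragraphs
      using gist (textsearchable_index_col);

  select indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
  from pg_stat_user_indexes
  where indexrelname like 'textsearch_paragraph%';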

A final frontend comment: I'm a fan of infinite scroll for this kind of thing. You effectively have pagination set up already, so you could spoil me as a user with infinite scroll. react-virtualized[2] is a great library for that.

Hope that helps.

[0] https://www.postgresql.org/docs/9.6/pgtrgm.html

[1] https://dba.stackexchange.com/questions/4054/ts-rank-massive...

[2] https://github.com/bvaughn/react-virtualized


Thanks for the suggestions! I will try them next.

> Looking at your where clause, did you index language and textsearchable_index_col together? If not, you are giving the planner the opportunity to skip your full text index.

Here is how the column is populated and indexed:

  -- populate the tsvector column, choosing the text search config
  -- that matches each book's language
  update gutenberg.paragraphs
  set textsearchable_index_col = to_tsvector(b.cfgname::regconfig, coalesce(paragraph, ' '))
  from gutenberg.all_data a
  inner join pg_ts_config b on lower(a.language) = b.cfgname
  where gutenberg.paragraphs.num = a.num;

  -- GIN index over the tsvector column
  create index textsearch_paragraph_idx on gutenberg.paragraphs using gin (textsearchable_index_col);

> If you order by ts_rank_cd directly, the query will sort all rows, including those that don't match. A better pattern is to compute the ranking score as a column and then sort it in a subquery; [1] from Stack Overflow has an example. (As an aside, from PostgreSQL 12, CTEs are no longer an optimization fence, so you can write the query out more cleanly with a CTE and still get the desired performance.)

The rank was already computed as a column in a CTE and sorted further down:

  with paragraphs as (
                select 
                num
                , paragraph
                , ts_rank_cd(textsearchable_index_col, phraseto_tsquery(%s::regconfig, %s), 32) as rank 
                , ts_headline(%s, paragraph, phraseto_tsquery(%s::regconfig, %s), 'MaxFragments=1000, StartSel=**, StopSel=**') as highlighted_result 
                from gutenberg.paragraphs 
                where language = %s::regconfig and textsearchable_index_col @@ phraseto_tsquery(%s::regconfig, %s)
            ) 
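            -- (a guess at the elided outer query: order by rank, then paginate)
            select num, highlighted_result, rank
            from paragraphs
            order by rank desc
            limit %s offset %s;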
I did rewrite the query as per your [1], but EXPLAIN ANALYZE showed no difference.

Pagination is a tricky one. I was in a rush and hit an issue where building large Dash tables was very expensive and prone to crashing. I initially set up pagination as per the Dash examples, but that didn't play nice with Markdown, which I needed for the URLs and highlighting (giving these up for plain-text tables made it HN-fast, but it confused my trial users). So the quick and dirty solution "to ship now" was to set up custom pagination in the SQL and show the results.

I think that if there is enough interest in this project, the front end will have to move off Dash.


Full-text search is contentious; it's a rabbit hole of holy wars.

I don't believe I'm going to summon cranks with the following:

There are more specialized tools for your use case than Postgres, and you should be looking into "n-gram indexing".

Lucene-based systems such as Elasticsearch are quite popular; there's also Sphinx and Xapian, both fairly widespread.

You need to read the documentation and configure them; they are flexible and need to be tuned to this use case.

In the end, there is no "correct" way to do things. For instance, sometimes stemming words is the right way to go, but if you are, say, building a medical system where two very different medicines could be spelled similarly and stem to the same word, a mistake could lead to death, so no, stemming is very bad there.

Sometimes stop words are the way to go, but if people are looking for quotes, such as "to be or not to be", well, now you have the empty string. Splendid.
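
In Postgres terms, for instance:

  select phraseto_tsquery('english', 'to be or not to be');
  -- returns an empty tsquery: every word in the quote is a stop word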

So yeah, configure configure configure.

This may bring out the cranks:

If you want to roll your own, NoSQL systems like Redis, Mongo, or Couch seem to work really well (I've rolled my own in all three on separate occasions).

I guarantee there are advanced features of MariaDB and Postgres that aren't widely used, and some people reading this will confidently claim they are superior, but I assure you, that is a minority opinion. Most people go with the other options.

If you ever doubt it, ask the actual developers of Postgres or MariaDB on chat. They are extremely nice and way more honest about the limitations than their hardcore fans are. The databases are under constant development with advanced features, and you'll learn a lot (really, they are both super chill dev communities, impressively so).

Perhaps your solution (as mine has been) is a hybrid. You can store the existence, for instance, in one system and the offsets and chunks in another, so you get a parallelizable asynchronous worker pipeline; it's impressively fast when you horizontally scale it: <0.5 sec over multiple terabytes of text (and yes, I'm talking NVMe, hundreds of GB of RAM per node, >=10Gb networking). I've legitimately just run random queries to marvel at the speed.

I really hope I save myself from the rock throwing.


> For instance, sometimes stemming words is the right way to go, but if you are, say, building a medical system where two very different medicines could be spelled similarly and stem to the same word...

FWIW, lemmatization may be a good alternative to stemming. Stemming is algorithmic and can generate errors, as you point out; "caring", for example, might naively be stemmed to "car". Lemmatization uses a dictionary of word senses and their root words to avoid this. For common English there's Princeton's WordNet (https://wordnet.princeton.edu/). Lemmatizing technical niches, like medicine, would require an additional dictionary.
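
You can watch Postgres's snowball stemmer work directly; here it happens to keep "car" and "care" apart, but the output is mechanical, not dictionary-based:

  select to_tsvector('english', 'caring cars');
  -- 'car':2 'care':1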


> there's also Sphinx and Xapian, both fairly widespread.

Sphinx is now Manticore, and as luck would have it, a Manticore dev is in this thread, offering support: https://news.ycombinator.com/item?id=25890828


Thanks! I really appreciate the pointers. I already had planned to explore some of these and you've expanded and directed the search nicely.



