
I'd love to see this dataset used as a performance and relevance benchmark for different search engines!



That was definitely part of the original plan! I spotted two other attempts [1] [2] here using BERT and ElasticSearch respectively.

The main performance issue with the Postgres FTS approach (possibly also with the others?) is ranking. Matching rows can use the index, but ts_rank cannot.
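
Roughly the query shape in question (schema and names here are purely illustrative, not the real ones): the @@ match can be answered from the GIN index, but ts_rank has to be computed per matching row before the sort and LIMIT.

  -- Illustrative schema; actual table and column names will differ.
  CREATE TABLE sentences (
      id   bigserial PRIMARY KEY,
      body text,
      tsv  tsvector
  );
  CREATE INDEX sentences_tsv_idx ON sentences USING gin (tsv);

  -- The @@ match is answered from the GIN index, but ts_rank is
  -- evaluated row by row, so every matching row gets ranked
  -- before the LIMIT applies.
  SELECT id, body, ts_rank(tsv, query) AS rank
  FROM sentences, to_tsquery('english', 'common & words') AS query
  WHERE tsv @@ query
  ORDER BY rank DESC
  LIMIT 10;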

Most of the time, only a few results are returned and the front end gets its answer in ~300ms, including formatting the text (~20ms without).

However, a reasonably common sentence will return tens or hundreds of thousands of rows, which take a minute or more to rank. In production, this could be worked around by tracking and caching such queries if they are common enough.
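
The caching workaround could be as simple as a table of pre-ranked results keyed by the query text, filled in the first time an expensive query shows up (purely a sketch; the table and names are made up):

  -- Hypothetical cache of pre-ranked results for popular queries.
  CREATE TABLE query_cache (
      query_text text PRIMARY KEY,
      result_ids bigint[],
      cached_at  timestamptz DEFAULT now()
  );

  -- On a cache miss, run the expensive ranked query once and store the top N.
  INSERT INTO query_cache (query_text, result_ids)
  SELECT 'common & words',
         array_agg(id ORDER BY rank DESC)
  FROM (
      SELECT id, ts_rank(tsv, to_tsquery('english', 'common & words')) AS rank
      FROM sentences
      WHERE tsv @@ to_tsquery('english', 'common & words')
      ORDER BY rank DESC
      LIMIT 100
  ) top;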

I'd love to hear from anyone experienced with the other options (Lucene, Solr, ElasticSearch, etc.) whether and how they get around this.

[1] https://news.ycombinator.com/item?id=19095963

[2] https://news.ycombinator.com/item?id=6562126 (the link does not load for me)


I suggest having a look at https://github.com/postgrespro/rum if you haven't already. It solves the issue of slow ranking in PostgreSQL FTS.
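
RUM keeps the positional information needed for ranking in the index itself, so the top results can be pulled by ordering on the <=> distance operator instead of calling ts_rank per row. Roughly as in the project's README (table and column names here are illustrative):

  CREATE EXTENSION rum;

  -- Same illustrative table as above, indexed with RUM instead of GIN.
  CREATE INDEX sentences_tsv_rum_idx ON sentences
      USING rum (tsv rum_tsvector_ops);

  -- ORDER BY the <=> distance operator is satisfied by the index,
  -- so the top results come back without ranking every match.
  SELECT id, body
  FROM sentences
  WHERE tsv @@ to_tsquery('english', 'common & words')
  ORDER BY tsv <=> to_tsquery('english', 'common & words')
  LIMIT 10;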


What kind of hardware are you using to host the Postgres instance?


Same place as the app: a Start-2-M-SSD from online.net in their AMS1 DC (Amsterdam).

Subset of sudo lshw --short:

  Class          Description
  ======================================================
  processor      Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz
  memory         16GiB System Memory
  disk           256GB Micron_1100_MTFD



