Postgres full-text search is good enough (lostpropertyhq.com)
242 points by percept on Dec 8, 2014 | 57 comments



no_future (dead) says:

I've actually tried both Postgres FTS and Elasticsearch (in conjunction with a Postgres DB) and found Elasticsearch easier for search. It's much more robust - for my specific case I needed a bunch of Japanese text to be searchable alongside English text, and Postgres has very poor support for this - its API is excellent, and it's very quick and painless to set up (the documentation is not stellar, though). You don't have to worry about things like vectors or indices with ES; you can just put the data in and it will be easily searchable, and it's miles more efficient when searching across large data sets. It's Java-based, so it's a pretty big resource hog, though, so I can see the advantage of someone who just needs very lightweight search for their project using tsearch.


It points to a pathology of HN that someone discussing their experience with a specific technology is getting downvoted and flagged for describing their own experiences. This is not reddit, the downvote is not meant for enforcing groupthink.

I'm in favor of Postgres, I make my living in large part from my expertise with it; but it rubs me the wrong way when people downvote an informative comment covering someone's experience where it didn't happen to be the right tool.

no_future, I apologise to you for the poor reception your comment got from immature people who don't understand how to discuss things and who cannot handle new information. This did not use to be representative of the community here.


> It points to a pathology of HN that someone discussing their experience with a specific technology is getting downvoted and flagged for describing their own experiences.

He was shadowbanned because of his previous trollicious comment, not the one below, so your assumptions here are incorrect. Perhaps the mods are a bit quick to shadowban (or the algos they use are) - I personally think it should be reserved for persistent trolls, but have some sympathy - moderating a community like this is a hard, thankless task and there is an endless supply of toxic trolls. I'm sure if he asks nicely they'd reinstate the account.


I think it's more likely that no_future was shadow-banned for their previous comment.


What I don't like about shadow-banning is that it's a death sentence without warning or appeal for a single misstep. Drank a bit too much and typed a comment you're going to regret the next morning? Shadowban. At least, that's what it looks like from the history of the few shadowbanned people I looked at.


Counterexample: no_future has a lot of downvoted comments, and vis total karma is (slightly) negative. (I would guess that it was slightly positive until yesterday, and then ve received a lot of downvotes on a single comment and got autobanned.)

This wasn't a single misstep, it was just the first one that attracted much attention.


Thanks, I hadn't seen those pronouns before.


I believe that if you have a good history, you're much less likely to be shadowbanned. In this case, no_future actually has negative karma.


Frankly, who gives a shit?

Even a broken clock is right twice a day.

I'd much rather filter out the trolling from the useful experiences myself than have somebody shadowbanned because people disagreed with them.

Remember, downvotes signal disagreement, not civility. This is a problem.


On the project for which I wrote this blog post, we recently moved to ES, mainly because of Japanese support. It's an important point to highlight: not all of the internet is English. I was satisfied with PG for English, French, and Spanish, but the PG extension for Japanese was not looking inspiring.


Yep, this is a genuine complaint. Multi-language support requires a lot more work.


Link to previous discussion here: https://news.ycombinator.com/item?id=8381748


As much as I would like it to be 'good enough', Postgres' full text search lacks the feature set to be a reasonable solution to most search use cases.

Background: I wrote a product search engine for a company called iTrackr using Postgres full text search, and later wrote the product search for milo.com using Solr. I also designed a simple (but very effective) algorithm to boost search rankings based on historical user click data, by generalizing click data onto the feature space of the document set (using mutual information). http://www.google.com/patents/US20140067786

So, why would I recommend against using Postgres' full text search:

- It is not configurable. A search engine like Solr or ElasticSearch allows you to configure the tokenization pipeline, what is indexed, how it is indexed, etc. Postgres hard-codes its tokenization pipeline. You can turn stemming on and off, but you can't, for example, lowercase text, strip out html characters, handle special cases for tokenization (camel cased or delimited text, etc). You can write your own tokenizer, but it has to be as a C extension. Could you get it to do what you want? Mostly, but only with an order of magnitude more work than with Solr or ElasticSearch.

- Postgres has no facility to make it possible to do faceting (histograms). This is a pretty big deal breaker for most kinds of eCommerce. I've seen a few attempts at this, but they are incredibly complex and slow compared to other search engines.

- Postgres does not give you much control over ranking. It has some good generic ranking methods, but those will only get you so far. (Query Amazon for 'ipad -dkfjdkf' to see how well its generic ranking function does.) No generic ranking function can do very well compared to something tuned to the domain, and Postgres just isn't flexible enough to extend in any reasonable way.

- It is not scalable. Solr and ElasticSearch are both very very easy to scale up (and out). Postgres' built in replication is very heavy, as it has to stream all database changes to all slaves (yes, there are things like SLONY, but they are incredibly hard to deal with). In any real environment, you have to protect your database from load. It's the hardest thing to scale, and consulting it for all traffic is a surefire way to run into issues that are very hard to get back out of.
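To make the configurability gap in the first point concrete, the kind of analysis chain Solr and ElasticSearch let you assemble looks roughly like this (an illustrative Python sketch with made-up stage names, not any real API):

```python
import re

def strip_html(text):
    # naive tag stripper, for illustration only
    return re.sub(r"<[^>]+>", " ", text)

def split_camel_case(tokens):
    # break camelCased and letter/digit-mixed tokens into their parts
    out = []
    for tok in tokens:
        parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", tok)
        out.extend(parts if len(parts) > 1 else [tok])
    return out

def lowercase(tokens):
    return [t.lower() for t in tokens]

def analyze(text, char_filters, token_filters):
    # char filters transform raw text; token filters transform the token stream
    for f in char_filters:
        text = f(text)
    tokens = text.split()
    for f in token_filters:
        tokens = f(tokens)
    return tokens

tokens = analyze("<b>getUserName</b> fast",
                 char_filters=[strip_html],
                 token_filters=[split_camel_case, lowercase])
# tokens: ['get', 'user', 'name', 'fast']
```

In Solr/ES, swapping a stage in or out of a chain like this is a config change; in Postgres, the equivalent customization means a C extension or preprocessing the data yourself before indexing.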

Now, with all that said, I would recommend it for a few cases:

- You just want to do SQL queries with 'LIKE' queries, but indexed.

- You need to do arbitrary relational constraints on top of text queries. Doing a query like "give me all the tickets closed in November by Hugh where the project was ProjectX and the ticket name has the word "nested loop" in its title OR its description OR its title is less than 15 characters long" is going to be more or less impossible in any other environment. If you were building something like Jira, or really anything that needs to support relational logic, Postgres' full text search would be a great way to go.

- This is probably the most common case I would recommend it for: You already are using Postgres, and you just want to add on some basic search functionality to data that is already in there. In that case, being able to integrate with your existing queries, and not having to support a separate indexing step and corresponding index updating processes, is a huge win.


> This is probably the most common case I would recommend it: You already are using Postgres, and you just want to add on some basic search functionality to data that is already in there.

Perhaps this is true for more sites than you imagine - you started your comment saying that it is not a reasonable solution to most search use cases, but the vast majority of site search on the web is subpar and way below psql FTS. So for those sites FTS would be good enough, and a big improvement on what is there with minimal time investment (far less than installing, maintaining and learning to use something like solr).

I've replaced ILIKE type searches with psql FTS on a few sites and been very happy with it - it certainly won't be suitable if your entire value proposition is centred around searches, if you need configurable tokenisation, many languages etc, but it is more than adequate for simple searches across text columns where you don't need to use the psql ranking for example but just want a list of hits to order as you see fit. If you don't want to have a JVM dependency on your server, and don't have high traffic or specialised search requirements, FTS is a good choice. There is a huge long tail of sites with terrible search who could improve it by using FTS in postgres instead, and that's who this article is aimed at - not people who know what stemming is and why they should use it, or what faceted searches are (I didn't till I read your comment, thanks!). Imagine for example a startup who want to quickly enable a decent search of their blog posts - FTS is a good option which doesn't require external dependencies and is very quick to start with.

Thanks for the insight though on what the limitations of psql FTS are and where it starts to break down, it's interesting to hear this from someone who's tried better options and written a search service which was heavily used.


I'm not convinced that you can magically improve search on those 'tail' sites. Postgres' full text search is nontrivial to set up and use, even if you have some SQL experience. I've done substantial work on text search in the past, and I still have trouble following the docs for FTS. The interface is cumbersome at best.

Finally, if search on a site is especially bad, probably the people that made it can't do better, or don't care. I would suspect that most of them are in the "can't do better" camp. One of the not-so-surprising things I have learned about these sorts of programmers is, they don't spend any time reading up on programming related topics, so they'll never see blog posts like this!


> Postgres' full text search is nontrivial to set up and use, even if you have some SQL experience.

Seemed pretty straightforward to me.


Elasticsearch is much more cumbersome to set up. I think Postgres wins in relative ease of setting up.


No it isn't. DEB/RPM package install and you're pretty much done.


I'm glad to see that you got the main goal behind this post. ES/SOLR are great tools, but some projects may not need all of their features.


> You can turn stemming on and off, but you can't, for example, lowercase text, strip out html characters, handle special cases for tokenization (camel cased or delimited text, etc). You can write your own tokenizer, but it has to be as a C extension.

Of course this can be done without messing with C, you just massage your data on a trigger before updating the FTS index.


> Postgres has no facility to make it possible to do faceting (histograms). This is a pretty big deal breaker for most kinds of eCommerce. I've seen a few attempts at this, but they are incredibly complex and slow compared to other search engines.

why exactly GROUP BY doesn't work here?
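For reference, a facet is essentially a per-field histogram computed over the set of documents matching a query. A rough Python sketch with hypothetical data of what a search engine returns alongside the hits:

```python
from collections import Counter

# hypothetical documents matched by some text query
hits = [
    {"brand": "Acme",   "color": "red"},
    {"brand": "Acme",   "color": "blue"},
    {"brand": "Globex", "color": "red"},
]

def facets(docs, fields):
    # one histogram per field, computed in a single pass over the hits --
    # roughly what a search engine returns next to the result list
    return {f: Counter(d[f] for d in docs) for f in fields}

counts = facets(hits, ["brand", "color"])
# counts["brand"] -> Counter({'Acme': 2, 'Globex': 1})
```

GROUP BY can express one such histogram per query; the complaint above is about getting all facets plus the ranked hits efficiently in one request.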


> You can write your own tokenizer, but it has to be as a C extension.

Another option is to tokenize the data before insertion into PostgreSQL (which, ironically, I'm currently doing in one of my projects using Lucene).


Sure, but that gives away most of the benefit of using it. You have to have your tokenizer algorithm available at query time (so you can tokenize the query the same way), usually at least.

You also have to undertake some batch process to reindex things that have been added/changed. You don't have to handle deletes, which is certainly the trickiest thing to try to keep in sync, but once you have some external cronjob or whatever running your code over all the full text entries, you might as well just make that job insert the data into ElasticSearch. Most of the difficulty in setting up full text search outside of the DB is creating and managing/monitoring the process to shuttle the data out of the db and into ElasticSearch, and if you are going to do most of the work, you might as well get the benefit, rather than just shuttling the data out of the database and then back into the database.


Hi Justin, do you have a link to a technical paper on the anonymous? click algorithm. Was curious if you applied it to domains other than products/shopping.


I linked to the patent application (which has been denied).

The algorithm is actually very very simple.

1. Produce a set of 'features' for every document. This is what you will index into the inverted index. Features can be words from the document, or they can be things like 'price_between_0_100'.

2. Build a set of pairs (search_phrase, clicked_on_document_id).

3. For each search_phrase in your set of pairs, run the search_phrase through your search engine. The set of all results we will call SET_A.

4. SET_B is the set of documents for a search_phrase that have been clicked on. SET_B is a subset of SET_A. Using mutual information, it is possible to determine which features best differentiate documents from SET_B from (SET_A - SET_B) for a specific search_phrase. These are the features you will 'boost' when you see that search_phrase.

As an example, one of the features this pulled out in practice was the word 'Oceanic' for the search_phrase "lost". It basically zeroed in on the fact that the word 'Oceanic' appearing in the description of a product differentiated 'good' results from 'bad' results for the query 'lost'.
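Step 4 can be sketched in a few lines of Python (toy counts, illustrative only; this is one reading of the description above, not the production code):

```python
from math import log2

def mutual_information(n11, n10, n01, n00):
    # MI between "feature present" (x) and "document clicked" (y),
    # from a 2x2 contingency table of document counts
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for nxy, nx, ny in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if nxy:
            mi += (nxy / n) * log2(n * nxy / (nx * ny))
    return mi

# toy SET_A for the query "lost": only the clicked results contain "oceanic"
set_a = [{"oceanic"}, {"oceanic"}, {"island"}, {"plane"}]
clicked = [True, True, False, False]

def score(feature):
    n11 = sum(1 for fs, c in zip(set_a, clicked) if c and feature in fs)
    n10 = sum(1 for fs, c in zip(set_a, clicked) if not c and feature in fs)
    n01 = sum(1 for fs, c in zip(set_a, clicked) if c and feature not in fs)
    n00 = sum(1 for fs, c in zip(set_a, clicked) if not c and feature not in fs)
    return mutual_information(n11, n10, n01, n00)

# "oceanic" perfectly separates clicked from unclicked, so it scores
# highest and is the feature to boost for this search phrase
```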

Edit: Here is a hard to follow presentation on it: https://docs.google.com/presentation/d/1xzQbQ1cM3GnaS_LoCvmJ...

Edit2: Another presentation, easy to follow but without most of the technical depth: https://drive.google.com/file/d/0B_eNCTyNa0DHcy1UOGpSUk41NDA...


What about ILIKE '%KeyWords%'? I know it doesn't get rid of 'ing' suffixes, but what about performance?


This post shows the problem when people ignore theory. Elasticsearch and Lucene are simply inverted indexes. PostgreSQL full text search is also an inverted index, but there are some small architectural issues that make the difference.

Depending on your use case, you cannot replace like or ilike with full text search. You may need to use a trigram index, which you cannot do with ElasticSearch and Lucene.

PostgreSQL text search is absolutely good enough for a great many cases. Maybe not for specialized cases like yours, but for general search it is fine.


This is not a case of 'ignoring the theory', it's a case of you being pedantic.

I didn't say "replace 'like' queries with identical functionality", I said 'replace', as in 'supersede' and 'improve upon'. Except for insane cases such as the search 'ant' matching the word 'pedANTic', which is to say whenever whole tokens are intended to be matched, postgresql's full text search will replace 'like' queries for you with better functionality (such as matching non-continuous tokens in a search text) and better performance (assuming an index is created).

However, having full control of the tokenization pipeline means you can certainly satisfy any special cases your old 'like' queries would satisfy. For example, in Solr it is trivial to tokenize additionally on word/camel case/number-letter boundaries, so that the tokens generated from something like EM22C can be EM 22 C, EM22 C, EM 22C, EM22C, etc.
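A rough Python approximation of that word-delimiter behaviour (not Solr's actual filter, just the idea of emitting the parts, adjacent concatenations, and the original token):

```python
import re

def delimiter_variants(token):
    # split on letter/digit boundaries, then emit the parts, adjacent
    # concatenations, and the original token -- roughly the set of index
    # terms a word-delimiter filter can be configured to produce
    parts = re.findall(r"[A-Za-z]+|\d+", token)
    variants = set(parts)
    for i in range(len(parts) - 1):
        variants.add(parts[i] + parts[i + 1])
    variants.add(token)
    return variants

# delimiter_variants("EM22C") -> {'EM', '22', 'C', 'EM22', '22C', 'EM22C'}
```

Indexing all of these variants is what lets a search for "EM 22C" or "EM22" still match the product code EM22C.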

That is a specific example, so please don't respond that I am ignorant of the theory, I fully realize that you can index any string, not just the exact ones I mentioned there.

Speaking of being pedantic, Postgresql's full text search is not an 'inverted index'. It is not an index at all, although it can use indexing to improve performance. In addition to GIN (Generalized Inverted Index), it can also use GiST (Generalized Search Tree).


You seem to be implying that the primary purpose of like queries is to do full text search, which is of course not true. E.g. perhaps you need prefix matching, or perhaps you are implementing an autocomplete box where you want partial matching, or you are searching over filenames and want partial matching, the latter two both being "insane" cases in your world.

And yes, there are GiST indexes for text in addition to inverted indexes. I didn't bring these up because you are comparing primarily to products using inverted indexes. My point is to illustrate that these products are actually mostly the same, with differences primarily around tooling.

No one is arguing that Solr and Lucene don't have advantages over PostgreSQL's text search. But I strongly disagree with your claim that "full text search lacks the feature set to be a reasonable solution to most search use cases."

You are quick to claim that postgresql cannot use custom tokenizers, or cannot scale, or doesn't have features, and then in the same paragraph you say that, well, yes, it can do both of these things, but it's harder to work with. You also wipe all the other advantages that postgresql brings under the rug with a dismissive "oh well, these are minor situations." I guess they are minor in your products, but they are certainly not minor in others.

If you only need text search, or are using a database other than postgres, then Solr and Lucene are great choices. On the other hand, if you are using PostgreSQL already, its full text search is entirely reasonable and feature rich for many if not most search cases.


> You may need to use a trigram index, which you cannot do with ElasticSearch and Lucene

You definitely can create an analyzer to generate trigrams in Elasticsearch. Unless you mean something different by "trigram index" than indexing trigrams?
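For reference, the trigram extraction itself is simple. A Python sketch roughly mimicking pg_trgm's conventions (lowercase, pad with two leading blanks and one trailing, take every 3-character window), with a Jaccard-style similarity in the spirit of pg_trgm's:

```python
def trigrams(word):
    # pad the way pg_trgm does (two leading blanks, one trailing),
    # lowercase, then take every 3-character window
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    # shared trigrams over total distinct trigrams
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# trigrams("cat") -> {'  c', ' ca', 'cat', 'at '}
```

An ES ngram analyzer configured for min/max gram of 3 indexes essentially these windows, which is why substring-style matching works there too.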


It may be good enough but I find ElasticSearch just easier to work with. For example, ElasticSearch gives you query string syntax [1] for free and it's a HUGE plus. It's both user friendly and extremely powerful. You can see it in action over at GitHub's Code Search [2].

[1] http://www.elasticsearch.org/guide/en/elasticsearch/referenc...

[2] https://help.github.com/articles/search-syntax/


As the author, I should probably clarify the goals behind this post.

First, I'm glad that this post made the HN front page for a second time.

I encourage you to read this comment because it summarizes well a few of the missing search features in Postgres: https://news.ycombinator.com/item?id=8715624

It's not the first time that I've gotten similar feedback, and I understand that "enough" is a subjective notion; maybe I should've picked another title, like "Postgres full-text search for basic search needs".

Just to clarify, this post is not against ES or SOLR. These tools are great and really powerful, and obviously they will do better than Postgres search. ES/SOLR are built entirely around document storage and search, while search in Postgres is only one feature among many.

I think there are multiple reasons why you would want to give Postgres search a go:

- You have no experience with ElasticSearch or SOLR

- The project is still early stage and you don't want to invest time/money in a secondary datastore (e.g. an ES cluster)

- You want to quickly put a search feature (maybe not perfect) in the hands of your users

- Team knowledge: you may want to be careful before introducing a new tech to your team and ensure that people understand it. There is nothing worse than introducing a tech and being the only one who can maintain it.

The tech world is big and doesn't only consist of the startup use case. I've worked in very different environments: a financial institution, an open-source company, a startup, and a services company. One thing that I learned is that you don't always have flexibility. In some companies, getting new servers is a long process, and even if I hate it, you have to deal with this constraint. Then you have two choices in front of you: you do nothing and wait, or you move to offer your users a workaround solution without any extra dependency... That's what Postgres search can be in some use cases.

From my experience, if you are already using Postgres and you find the right documentation (this post tries to describe a start-to-end example), then I think you could get search working in one day without any extra servers or resources.

I don't have any experience with SOLR, so I will have to pick ES, but if you've never used it, then you are going to have loads of "fun":

- Sync. ElasticSearch IMHO cannot be used as a canonical datasource, so you will need to figure out a sync strategy. Some people may argue with that, but I don't think that ES started with this goal in mind. They may be changing direction now, but I still wouldn't store my data only in ES. http://aphyr.com/posts/317-call-me-maybe-elasticsearch

- Hosting. If you are not using a hosted ES service, you will need to learn to build, host, and monitor a cluster.

- Learning to use it. After reading ~50 pages of docs, you start to figure out how to use type mappings, custom analysers, and the 42 ways to do a search query.

I can only speak from my experience, but it took me ~2 weeks to get my head around all these points without any previous experience. The project for which I used PG search and wrote this blog post recently moved to ES. More or less one year later, I'm glad that the PG solution was good enough until now. In our case we needed to support more languages (Japanese, ...), and I also wanted to get rid of a bunch of triggers.

The results are good and very fast, but for English the results are quite similar to the Postgres solution. I think that's mainly because I'm mostly using only ES's basic features, but we have loads of room for improvement, which is a good position to be in.

Now, there are a few other use cases where you may want to consider PG search:

  - You are using MySQL with its built-in search; moving to Postgres can give your search a bit more power.

  - A side project using Postgres where you want to add a search feature.

  - Search on machine output, a case where the analyser doesn't really matter and you don't need stemming. (Lucene's features for tokenization and stemming are great.)

To summarize, I don't think that PG search is a silver bullet, but it may be "good enough" for your needs to get you started. Spend 20 minutes and read the post, and then maybe it's also good enough for you. On the other hand, if you already know ES/SOLR and you have the resources, then it becomes less relevant to use Postgres search. ES/SOLR (tools based on Lucene) are amazing and really feature-rich; if you have important search needs and long-term plans around search, then you should look into them.

If you are interested in PG search and want to know more about it, then I suggest you have a look at these blog posts. I think the author did an amazing job.

http://shisaa.jp/postset/postgresql-full-text-search-part-1....

http://shisaa.jp/postset/postgresql-full-text-search-part-2....

http://shisaa.jp/postset/postgresql-full-text-search-part-3....

I don't know if people would be interested, but I've thought a few times about writing a follow-up blog post, "When Postgres Search is Not Enough", describing how to integrate Postgres with ElasticSearch using the JDBC river to pull content from the DB. Let me know if it's something you would be interested in.

Sorry for the numerous grammatical mistakes, but I'm not a native English speaker and I don't want to spend too much time reviewing my comment's grammar.

Thanks again for upvoting this post to the first page of HN.


Interesting post. I work at Swiftype, where we build full-text search engines, and it's been interesting seeing the varying definitions of "good enough." It seems like Postgres full-text search is surprisingly decent for a less robust solution than something designed entirely for search, and I can see it being "good enough" if you are in the very early stages of development and don't want to add anything that requires additional knowledge to your project quite yet.

>> You want to give quickly a search feature (maybe not perfect) in the hand of your user

If this is your goal, rather than simply keeping your project contained to postgres as much as possible, Swiftype would be a better solution. Building a whole full-text search engine in a day with Postgres is faster than you could reasonably do with ES or Solr, but if your goal is simply the functionality, you can get a search engine into your app in less than an hour with Swiftype. We're built on Elasticsearch and Lucene, so you can get all their power from a team that works on it full time.

https://swiftype.com/


I gave a talk at the Elasticsearch user group about this topic and evaluated a few solutions. I think Postgres full-text falls into the interesting spot of being "just there" if you are using PG already, and being very robust. It has a visible ceiling, but if you are okay with that, go with it. (Take care that PG does not implement UTF-8 fully but UCS-2, though it does not refuse UTF-8, as I pointed out in the original post.)

If you get started with building a secondary system for searching anyways, I wouldn't recommend it.

Also, don't use MongoDB fulltext, it's just broken.


Speaking of syncing strategies, I use Django with PostgreSQL as the main datastore and Elasticsearch as a secondary data store. PostgreSQL receives all admin and staff related updates and Elasticsearch is updated automatically using signals (and celery tasks if necessary).

I created a Django app to help with this syncing - check it out at https://github.com/jaddison/django-simple-elasticsearch. I love feedback and pull requests!


Mine is also a Django project, and I like what you have done. It's a nice lib and it's intuitive. How do you handle an ES document which is a composition of multiple Django models?

It's just a personal opinion, but I try to avoid making my application responsible for data integrity, so I went the way of an ES river plugin to pull data. I used the JDBC one: https://github.com/jprante/elasticsearch-river-jdbc ... I ran into some problems, but in the end it works quite well and doesn't need logic in the application to keep the data synced.

Two other reasons that I used the river were to not make the app slower by saving data in ES, and to be able to run the application during development without ES installed by substituting the search with a stub.


Good question - django-simple-elasticsearch is definitely focused on generating a document from a single instance of a specific model. Of course, you can add supplementary data from associated models via M2M or FK models as you see fit (nested objects, etc.)

I've thought of adding in support for pushing bulk index request data to redis (for example) so that an Elasticsearch river could pull from it; this would decouple the app somewhat - but not completely as you've noted. It would likely help with throughput however, and still provide you with the ability to do pre-processing on the data as needed within your app's/project's context.

The Elasticsearch JDBC river isn't as flexible for processing data if I'm not mistaken, as it doesn't have context for the data? Please correct me if I'm wrong, but it's somewhat limited?


I think you nailed it: ElasticSearch JDBC is not flexible, and you need to be good at SQL if you have a complex model with many relations. Because JOINs create a Cartesian product, you can end up with duplicates, and you have to find a way to avoid that because duplicates affect the ranking. In my case I ended up using UNION queries.

Also, you cannot have multiple queries update an existing document (e.g. adding extra fields). The SQL query for me ended up being ~100 lines, but the model is quite complex, with multi-language support.


You just posted this today? This isn't the first time your article has been on the front page of HN. I actually still have your article "saved" in a tab in chrome from the last time it was on HN. It made quite an impression on me, I'm in the process of moving our dev environments from sqlite to postgres, so that we can start adding some of this functionality. Thanks.


To be honest, I'm quite surprised to see that somebody posted this article again. I don't know the HN rules for when you can repost something.

I'm not going to complain about it :)


It's definitely a cool article; I don't mind seeing it twice. I love the idea of having these tools with postgres and not having to reach for Thinking Sphinx or Solr quite as quickly.


I think you can repost after about a year. It could have changed, though.


I don't know when it was posted, but it definitely was less than one year ago. My main data point is that it still is living in an open tab in Chrome. I have a couple of sites which allow submitting links. I know people can get around my requirement that they are unique pretty easily. I'm sure HN is more sophisticated but I think that just having different url parameters might qualify as a different link? Sometimes a url parameter change really does point to a different page, so it might not be possible to eliminate duplicates that have the same url with just different params.


URL parameters don't qualify as a different link (or else anyone submitting a url without removing them would have unique posts). I always thought the rule had to do with comments being allowed on the post, which, I believe, are based on comment activity. My assumption was as long as an article still allows comments, resubmissions don't become new posts.


I don't know why, and I can't help you; in both cases I didn't post it. But we don't have multiple URLs for our blog. I find it odd, to be honest, as the post was posted on HN on the 29th of September (based on Google Analytics).


Thanks for the article, and I'd certainly be interested in seeing "When Postgres Search is Not Enough" -- it'd make an excellent companion piece to the one you've already written.


Thanks for the article and sharing so much of your experience.


I’m using it on http://sublevel.net/ — there’s no problem whatsoever. It does the job so well, especially if you have consistent fields on tables. I like the fact that it can search multiple fields at once without degrading the performance. You can try it on Sublevel without creating an account.


The use of a Materialized View here is clever, but it means that every time you add a new row you have to refresh the view. For an app that accepts user-generated content, this would be very frequent since you wouldn't want users to create content and then test the search and not find it.


I don't know how optimized "refresh materialized view" is, but making this efficient is easy if you do materialized views the "old school" way, with triggers (create a table instead of a materialized view; create triggers to update based on individual rows, and a function to refresh the whole view).


I've used tsearch2 on several projects and it's always been super handy. It's much easier than adding another dependency just to support search.


In case anyone ends up confused by the parent post, tsearch2 is not an additional Postgres contrib module; rather, it's an old contrib module that the built-in full text search features were built on top of. It has been obsolete forever.


Huh, one of the senior engineers at a previous employer turned me on to it. I completely missed that it's built in now; I feel kind of silly.


The Oracle docs have a nice overview of the Oracle equivalent http://docs.oracle.com/cd/B28359_01/text.111/b28303/toc.htm I need to give both of these a try sometime!


I've been teaching databases this semester, and PostgreSQL's full-text search capabilities have been wonderful for allowing my students to build search into their projects without needing to learn additional tech (we're using PostgreSQL as our database anyway).


This is an awesome post. For most web applications, installing Solr/Elasticsearch for full-text search capability in addition to a SQL database is highly time- and maintenance-intensive. Postgres for both storing and searching is really useful here.


I think Algolia is also a nice alternative if you're looking for a quick setup.



