Ask HN: Are Lucene/Solr/ES Still Used for Search?

crawdog · on July 19, 2019

Very much so. For retail/catalog search SOLR dominates. There's a lot more customization available for relevancy/ranking OOB than Elastic. Drawbacks are managing indexing - SOLR cloud is much harder to manage.

For commodity search workloads (general retrieval/faceting) Elastic does a fine job. It scales well and there is good documentation and support.

Lucene is the core engine behind both of these solutions.

For fun, lets look at the large Enterprise acquisitions over the years:

* Verity - bought by Autonomy

* Fast - bought by Microsoft (Also known as the Enron of Norway...)

* Autonomy - bought by HP (Look at the backstory on this deal!)

* Endeca - bought by Oracle

* Vivisimo - bought by Oracle

* Google - GSA (now Google Cloud Search, hosted solution)

Next, follow the path of online acquisitions:

* IndexTank - bought by LinkedIn

* Swiftype - bought by Elastic

There's a number of interesting independent players still. Coveo plays in the Enterprise space, but it's a hard market. Algolia is doing great in the commodity online search space and seems to be growing well.

This is an area I think is open to more competition. Especially with AI/ML technologies available around Document Understanding - the Enterprise market is open for a good on-prem upstart to really take off.

Ping me offline if you have additional questions - spent almost 20 years in the space and ran a search company of my own.

aeyes · on July 19, 2019

* Fast - bought by Microsoft (Also known as the Enron of Norway...)

That one was painful to live through, we got forced to migrate to Windows and everything went sideways. That was almost 10 years ago with quite a big cluster (tens of nodes).

tallanvor · on July 19, 2019

You were never forced to migrate to Windows. In fact, the last major customers on ESP were using Linux to the end.

josh2600 · on July 19, 2019

Don't forget Powerset!

bogomipz · on July 19, 2019

>"For retail/catalog search SOLR dominates."

Interesting, could you elaborate on why SOLR is dominant in that space over say Elasticsearch?

crawdog · on July 19, 2019

Definitely. Product catalog information as a whole doesn't change often. Price and availability does. With retail catalogs you often do a full reindex of the data in your master catalog and then run partials to account for price/availability if you don't do that in realtime using filters. Since the system of record is not the search index, Elastic is often not a good solution here.

Also, relevancy in retail is often influenced by other factors that cannot easily be implemented in Elastic. TFIDF/BM25 search is available in both platforms, but you may also weigh in other factors such as relationships with the vendor, stock on hand, or other ML techniques that are more easily implemented in SOLR.

crawdog · on July 19, 2019

One more point - if you can run the entire index on one machine it makes deploying and managing SOLR much easier to manage than Elastic. The complexity only grows when you have a distributed system. You can fit a lot on a big box.

gvurrdon · on July 19, 2019

Yes, it's certainly easy to manage on one system. As the production system I work on is an academic one with few concurrent users it's possible to get away with that.

bogomipz · on July 24, 2019

Thanks for the explanation and insight. I hadn't ever considered how different the catalog use case is from my own use cases. Makes good sense. Cheers.

jillesvangurp · on July 20, 2019

I think a big reason is simply that Solr has been around a few years longer and most older sites would have been using that by the time Elasticsearch came along and never saw a good reason to go through a painful process of replacing one with the other.

These days, I would say there's very little that either product does that the other product can't do though obviously there are lots of strengths and weaknesses on both sides.

softwaredoug · on July 19, 2019

There is a long term trend of search engine companies being acquired, going away, products ending, and customers left in the lurch....

shepmaster · on July 20, 2019

Vivisimo was acquired by IBM.

Source: I have the capital V from the building in my house.

(I see now atambo has already noted this typo)

hackbinary · on July 19, 2019

We have on prem Coveo, which is based on ES. It is/they are horrible. We pay a large amount in Enterprise support and maintenance, and it is next to nonexistent.

Thankfully we are moving to Azure Cloud Search services, which just work, over the next year.

Good bye manure pile, hello compost.

avodonosov · on July 19, 2019

There are no contacts in your profile (for pinging offline)

crawdog · on July 19, 2019

corrected!

127001brewer · on July 19, 2019

Please update your profile with your contact information.

crawdog · on July 19, 2019

thanks for catching. updated.

atambo · on July 19, 2019

Vivisimo was actually bought by IBM, not Oracle.

jillesvangurp · on July 19, 2019

Elasticsearch, definitely. I always recommend using it in hosted form and not running your own cluster. That allows you to focus on getting data in and out of your cluster instead of sinking time into doing devops.

While I have not used Solr recently, it has evolved along with ES as a solid product with a solid community. Nothing against it; it's a solid choice and there are probably people offering to host that as well. Either way you are using Lucene. Using Lucene directly makes no sense unless you know what you are doing and have a need to do that. If you have to ask that means it's not for you.

If search is important to your use case and things like relevance, precision, and recall have real impact on your business, you should get some specialists involved and not reinvent the wheel or make all the rookie mistakes. Somebody like me basically ;-). However, that can be expensive and if search is not that critical, just sign up for one of the search as a plug in solution type products out there and don't bother running a lot of infrastructure. E.g. Elastic offers a thing called App Search and it probably covers most of the simple needs and is stupidly easy to get started with. There are several competing products probably that I can't vouch for.

You can always upgrade to something proper later. Things like mongo and postgres also have some limited capabilities here and you can get away with doing some simplistic stuff with sql even. However, there's a point where you hit a brick wall and there's something you need that is simply hard/impossible using that or where you end up reinventing a lot of stuff that things like Elasticsearch probably do better.

busterarm · on July 19, 2019

> Elasticsearch, definitely. I always recommend using it in hosted form and not running your own cluster. That allows you to focus on getting data in and out of your cluster instead of sinking time into doing devops.

I think everyone doing Elasticsearch well has to bring it in-house eventually. AWS's hosted solution is poor, Logz.io and ElasticCloud are expensive.

There's a 7-figure/yr Elastic Cloud customer I work with who is so tired of Elastic just randomly killing their clusters out of nowhere and having to spend basically triple to deal with it that they're bringing it all in house.

jillesvangurp · on July 20, 2019

Elastic Cloud is pretty reasonable IMHO. We have a logging cluster that costs us around 250 euro per month. Yes, running it ourselves would be cheaper but doing that in Amazon would eliminate most of the cost benefits. If you then consider devops cost needed to that, there is no difference. Most companies getting started with Elasticsearch should probably not do that on day 1. You can always start doing this later if there's some reason to.

I've never had clusters being killed randomly but I did have a few self inflicted issues with cluster instability due to flooding the cluster with too much data and not having suitable data retention policies in place. With the recently added index life cycle management (an x-pack feature), this is easier to manage these days.

If you are spending seven figures a year on elastic search, you clearly are not a beginning user and there's some cost savings that you might be able to realize by taking ownership of the problem of hosting it somewhere cheaper. For that, their recently open sourced kubernetes helm charts are worth a look. Those scripts take care of a lot of things and get you a self hosted version of Elastic Cloud.

Amazon hosted clusters are indeed a bit bare-bones (as is their support for these clusters) and I would also not recommend them; you get more value for money by using Elastic Cloud.

otisg · on July 19, 2019

> so tired of Elastic just randomly killing their clusters out of nowhere and having to spend basically triple to deal with it that they're bringing it all in house.

That is because Elastic Cloud is not a fully managed Elasticsearch. People often don't get that with Elastic Cloud you are still responsible for your ES cluster. That's one of the differentiators that e.g. Sematext has (disclaimer: founder).

> AWS's hosted solution is poor, Logz.io and ElasticCloud are expensive.

Right. Have a look at https://sematext.com/logsene pricing. People say Sematext compares favorably to Logz.io and Elastic Cloud.

sandGorgon · on July 20, 2019

Why are all ELK stack based logging solutions so much more expensive than custom rolled solutions like logdna, datadog, papertrail,etc.

The per gb cost on the other ones start at 1.20 $/gb and goes till 2$/gb. While almost all hosted ELK solutions start off at 3$/gb.

Im asking because i would very much like to adopt an ELK based hosted solution..but I'm not able to justify paying double. Is it that running+resource costs for ELK are so high that the extra charge needs to happen ?

manigandham · on July 20, 2019

Yes it's mostly about resource costs. ES is a generic search that can handle logs but isn't focused on it, and indexing everything can be costly. There are major efficiencies gained in creating your own log-focused storage and access methods with object storage, columnar formats, and zone mapping, etc.

busterarm · on July 20, 2019

Datadog, etc are not cheap when you're paying per agent/per month to ship your logs in the first place. Which they want you to do, of course.

sandGorgon · on July 20, 2019

The pricing is not per agent per month. The pricing for everyone is always either per million events or per gb.

You do not need an agent. You can ship directly from syslog ( https://docs.datadoghq.com/integrations/rsyslog/?tab=datadog...)

ragle · on July 19, 2019

> AWS's hosted solution is poor

What has been poor about it in your experience?

__blockcipher__ · on July 19, 2019

Not the GP, but you aren't allowed to touch settings like the shard recovery rate.

If a configuration change (changing # of nodes, instance type, etc) goes wrong, your cluster indefinitely gets stuck in Processing due to a race condition. The only way to get unstuck is to file a ticket. The company I'm at doesn't pay for AWS support, so at one point we ended up completely tearing down our cluster and rebuilding a new one (via Terraform) after getting tired of waiting for the reps. (They advised us to cut off log flow to let the system get out of processing, which we did, but it didn't work because once it gets stuck in processing like that it's just completely stuck).

It's difficult to troubleshoot issues - you can get some logs via Cloudwatch, but they're hard to search through and I'm not entirely positive everything shows up there.

Amazon is always several releases behind Elasticsearch versions.

--

Elastic.co's offering looks much better, just by reading their excellent comparison article: https://www.elastic.co/blog/hosted-elasticsearch-services-ro...

(We haven't used Elastic.co but what they say makes sense and I imagine their service is much better)

--

Once you hit a big enough scale - for us, we're pushing about 2TB a day of logs (that's before accounting for replication of course), it doesn't make sense to stay on Amazon's hosted service.

I'm in the process of advocating for in-housing our Elasticsearch setup and just building on top of ec2. Elasticsearh seems like the perfect candidate for Kubernetes since rebalancing is automatic and the affinity rules are simple (every Elasticsearch instance needs its own node). Cluster autoscaling (i.e. node-level, not _horizontal_) just makes too much sense.

Unfortunately I haven't gotten the go-ahead to take it inhouse, but I've been gunning for the project for some time now, so I'm hopeful I'll get the opportunity.

--

BTW, totally unrelated but for anyone managing Elasticsearch, make sure you have your shard count tuned properly. When I came to this company, they had their data way oversharded with primary shards varying between hundreds of kb to a few gb; i.e. orders of magnitude difference. Switching to ~50 GB shards (done via simplifying the way we were indexing) massively improved performance.

Also i3 instances > anything with EBS.

[/ramble]

jillesvangurp · on July 20, 2019

Check out the recently open sourced kubernetes scripts: https://www.elastic.co/elasticsearch-kubernetes. There's no need to reinvent all of that.

ES is indeed super solid if you know what you are doing. However, most users getting started with this are probably going to find out a few things the hard way though; which is why I recommend hosted solutions as it removes quite a bit of non trivial devops from the equation.

boyter · on July 19, 2019

Not OP but the list I had from a year ago, didn’t support all multi az (only 2), lack of admin features meant it broke and you could work out what happened, limited choice of instance type, the backup happened once a day regardless if you needed it and it had issues under load.

The backup issue has been resolved I believe but the others...

On AWS I’d suggest elastic on ECS and you can use the leftover compute on the cluster to run other applications effectively for free.

amelius · on July 19, 2019

Also, some clients may prohibit uploading of data to third parties. In those cases you simply have no other choice than to run your own cluster.

showerst · on July 19, 2019

I think it depends massively on how much data you have. A great many companies and websites only need a few hundred megs of text data indexed, which is easier to outsource.

Once you grow larger than that though, the hosted service prices get astronomical compared to standing up a cluster, assuming you have someone who can admin it.

devonkim · on July 19, 2019

Things are not quite cut and dry operationally. Some places may have only a few hundred MB of data but they need really high availability and performance (probably picked the wrong solution IMO but still...) which is better guaranteed hosting it yourself and perhaps even outside the cloud. Most places the availability of whatever a hosted solution provides is a better value than spending engineering hours to maintain these things. Devops / SRE folks are expensive compared to most other engineers that deliver features primarily.

My point is that a ton of companies (more for cultural reasons than business requirements IME) are so freaked out by any downtime in any way that they’re going to pay for an engineer to just maintain these things themselves.

busterarm · on July 19, 2019

The only issue I have with this suggestion is that all of the situational awareness needed to _use_ Elasticsearch effectively is the same as is needed to _operate_ Elasticsearch effectively.

If you're just spinning up something's defaults and throwing data in it, that bill is going to due eventually and it's probably going to be ugly.

ageyfman · on July 19, 2019

We host a smallish (12 large node) cluster ourselves on AWS, and it's been nothing but SUPER SOLID for us. Literally 0 issues in the year. We use it for analytics, aggregations and the like, as well as domain-specific search, for which it's a great fit.

fredsted · on July 19, 2019

I agree, if ES is scaled well for your usage it just keeps going. However, a big time sink for me has been reindexing, lost data due to bad mapping, and version incompatibilities.

anaphor · on July 20, 2019

I've found https://www.npmjs.com/package/elasticdump to be VERY useful for doing things like copying indices and recovering data. Much more than the built in stuff.

arafalov · on July 19, 2019

AFAIK, Reddit, Slack, Dice, Bloomberg, IBM, Apple all use Solr. Jira and Confluence use Lucene. Others use Elasticsearch and Fusion (commercial product on top of Solr). See, for example: https://www.activate-conf.com/more-events for presentations from several past years on who and how uses Lucene/Solr.

Also, the new trend in jobs is "Relevancy Engineering", which is less about just setting up search engines and more on actually tuning them. That's where also Machine Learning and other AI techniques come in (Learning to Rank, Named Entity Recognition, sentiment analysis, etc.). Which was recognized by rebranding of the conference from Lucene/Solr revolution to Activate last year.

See also Haystack conference which focuses very specifically on relevance regardless of the specific search engine: https://haystackconf.com/

peterdm · on July 20, 2019

Until about 2014-2015, many large companies wouldn't have looked twice at Elasticsearch. The companies you list using Solr have been invested in search for 10 years or longer (predating Elasticsearch), and may have high switching costs.

arafalov · on July 23, 2019

For sure they would have high switching costs. They would be switching from an open-source (and free product) to which they contributed changes to a commercial product. License alone would be a serious discussion point. So, that's a fact.

Was there an opinion in there as well that you tried to convey?

softliving · on July 20, 2019

Bloomberg, Apple, IBM also use Elasticsearch or more Elastic stack!

aliswe · on July 19, 2019

From what I read, wikipedia uses ES.

bratao · on July 19, 2019

From my experience, yes. Lucene is a "production" state of art library and Solr/Elasticsearch is very used in many scenarios.

This expertise is very on demand.

My company personally migrated from ElasticSearch to https://vespa.ai/ and could not be happier. Faster and way easier to maintain a cluster. The "Application Packages" feature present in Vespa opened many opportunities to improve our product( Curiously we use Lucene inside our custom application for a "Search map" functionality, for something like that https://www.lexisnexis.com/en-us/products/lexis-advance/sear... ) . I highly recommend it!

atombender · on July 21, 2019

I've looked at Vespa a bit. It looks pretty good.

It's also readily apparent that it's an ancient system that's grown out of an in-house project, and that its design has accrued a lot of oddities over the years from lack of careful, co-ordinated design. It includes a bunch of esoteric features (like the "predicate" function) that have obviously grown out of Yahoo's own internal architecture.

One curiosity is Vespa's approach to schema and configuration changes. To make any kind of change, or indeed set up an index, you have to create that "application package" containing your schema and configuration in the form of files, and then use separate REST APIs to "upload", "prepare" and "activate" it. There's a CLI tool to help perform those steps, at least.

It's nice that they're more consistent and rigid about schema and config evolution than Elasticsearch. But it's not exactly operator-friendly, at least not for first-time users with no pre-existing operations based around Vespa.

The package design also makes it more cumbersome to perform programmatic updates for a schema. I once worked on a SaaS project where we indexed data in Elasticsearch — arbitrary documents where we didn't know the schema ahead of time, because we just accepted any JSON document posted by the client. With ES, we could just use its dynamic mapping support, which automatically creates field definitions when new fields arrive (using regex-based templates). Do you know how long a package update takes in Vespa, to add, say, a single field?

The Vespa documentation is also pretty terrible, in my opinion. They explain a lot of things, but it's confusingly written, uses a lot of homegrown terminology, and neglects to collect all the reference documentation in one place. For example, you can't find an overview of the entire API — fragments of it are just scattered across a dozen or so unrelated pages.

Lastly, Vespa is Java. One of the biggest challenges maintaining Elasticsearch is controlling its resource consumption. You have to give it a lot of RAM, and it's never clear how much it needs and what configuration settings and usage patterns affect its memory use. Tuning it is something of a dark art. I don't know exactly how Vespa is implemented (is it all pure Java?), but I'm worried that, being a JVM app, it has the same shortcomings.

RealJon · on July 22, 2019

Predicate fields are indeed an oddity, but not an architectural one - it's for situations where the documents need to specify criteria (predicates) for when they should match - like only match for certain users, certain times of day etc. It's probably an underused feature imho since most people don't know this can be done efficiently.

If you have dynamic fields like in your SaaS example I recommend using a single map field rather than let data not under your control drive changes to the set of fields.

> Do you know how long a package update takes in Vespa, to add, say, a single field?

A few seconds. However, rather than having operators do any of this manually, set up an automatic process which deploys on each change made to the repo (i.e do CD).

> all the reference documentation in one place

https://docs.vespa.ai/documentation/api.html

atombender · on July 22, 2019

Another thing is that Vespa doesn't seem to support indexing of nested data, either structs or arrays of structs. For example:

  {
    "location": {
      "city": "Washington",
      "state: "District of Columbia"
    },
    "friends": [
      {"firstName: "Bill", "lastName": "Clinton"}
    ]
  }

Maps aren't suitable here because they can't be used for ranking. So you have to use structs, but those aren't indexable.

An application's search module could flatten the location key (e.g. "location_city", "location_state") for simple attributes, but the same is not possible for the array, since there can be arbitrary array elements. And you can't split it to an array of strings:

  "friends_firstName_elems": ["Bill"]
  "friends_lastName_elems": ["Clinton"]

...because queries like "firstName contains 'Bill' and lastName contains 'Clinton'" could match different records ("Bill Bryson" and "George Clinton"). Never mind deeply nested arrays of objects containing arrays containing objects containing arrays.

This seems unnecessarily restrictive. A search engine should be able to index the data you already have, not force the application to contort its data to whatever shape the engine requires.

Is there no way around this?

atombender · on July 22, 2019

Thanks. I'm still learning about Vespa, and it's still not clear how map fields work.

Edit: Documentation says: "Accessing attributes in maps and arrays of struct in ranking is not possible". So maps aren't really usable.

Regarding how long it takes to update a field, the application I described would have to do this programmatically. It would have to keep track of all known fields in some kind of registry, and then if a new unknown field came in, it would have to perform an "application package" deploy just for that field, using the REST API. (Unless there's a less cumbersome way to do it?)

Reference docs: That's nice, but that's just a bunch of links. Good reference documentation has tables of contents. Bonus points for runnable examples in multiple languages. For an example of good reference API documentation, look at Stripe's [1].

[1] https://stripe.com/docs/api

andreer · on July 21, 2019

Just an FYI to your last paragraph: The core indexing/ranking/storage components of Vespa are C++, and run in a separate process (no jni).

In my own attempt to compare the two, I found the memory consumption of Vespa was easier to predict and understand (there are formulas for it in the documentation).

atombender · on July 22, 2019

Thanks, I didn't know that!

simplechris · on July 19, 2019

I am very interested in hearing about your experiences migrating to vespa. When it was opensourced we all thought it would be an absolute game-changer, but I see so few people building new products (or migrating existing products to) vespa.

bratao · on July 19, 2019

You can contact me. I'm also planning to do a blog post comparing with Solr and Elasticsearch. I think that naturally it takes some time to adopt a solution like that. And the ecosystem it still at it infancy. But, randomly, a project using Vespa appeared in my GitHub timeline today (https://github.com/rdoume/News_API). So, the adoption is increasing.

For me Vespa is a absolute game-change, in features and as someone said here, ES looks like it intentionally complicated to maintain. With nodes randomly getting unhealthy. Vespa is like Redis to me. I completely forgot about maintaining it and works great.

It makes a world of difference in our product, and I take every opportunity to evangelize it.

otisg · on July 19, 2019

That's interesting to hear. Would love to read about how Vespa compares to Solr and ES.

This may be of interest to you: https://sematext.com/opensee/report/project/trend?q=ElasticS...

Would you happen to know how Vespa compares to ES in terms of memory or CPU footprint? Have you done apples to apples comparison by any chance?

bratao · on July 19, 2019

I do not have a completely fair comparison. But a migration from Elasticsearch 5 (2016) to the Vespa 7 (2019) we reduced half of our nodes, and cut in half the average response time. Another amazing feature during the migration, is that Vespa allows you to reduce or increase the number of nodes dynamically. And it take full care about the data distribution. In ES we (used to) had to follow the limits of the pre-configured number of shards/replicas during the Index creation.

ksec · on July 20, 2019

I have been wondering why Vespa isn't getting much traction. Everyone still defaults to ES, even in new project.

lovelearning · on July 20, 2019

It should be marketed better, I feel. Some SEO for "Solr/lucene vs X" queries might help. I have been spending the last 3-4 months studying open source and commercial search systems, but it's only in this thread that I discovered Vespa.

onethumb · on July 21, 2019

This sounds very interesting! What is the size & scale of your data, if you don't mind sharing? (How many documents, total storage footprint, etc) Thanks!

lm28469 · on July 19, 2019

Afaik almost everything runs Lucene under the hood, it's 20 years old, no one is going to build something as good any time soon. I suppose some company like Google have their own in house solution but otherwise it'll always be something built on top of Lucene.

I guess you don't see much demand because for a lot of use cases the basic setups are good enough.

tombert · on July 19, 2019

About seven years ago, I got a contracting gig for a website that wanted a "search engine". I remember thinking "Solr/Lucene is old, not pure-functional, and therefore awful!" and decided to build my own. Somehow I even managed to convince the client that this was a good idea.

I ended up trying to reinvent Solr for the client, realizing after about two days of trying to reinvent stemming and indexing, that this was stupid to do on someone else's time, and called the client to tell them that I'm moving to Solr, and I got the project done before-schedule as a result.

====

I think for 99% of usecases (involving search), Lucene/Solr/ES is perfectly fine. However, I do absolutely hate that some companies have decided to make it their primary database.

EDIT: I just want to make it clear, I think it's totally valid to try and reinvent Solr for fun, or if that's something you're paid specifically to do; nothing is perfect, and I am actually a big fan of the "if it works, break it and make it better!" mentality.

aliswe · on July 19, 2019

I can chime in here that lucene- based solutions are sufficient almost always, for a purely frontend, js-based fuzzy search engine check out fusejs. https://fusejs.io/

hiram112 · on July 20, 2019

Can you explain a bit why ES isn't a good solution for storing data itself?

I inherited a legacy Mongo solution, and all the data is duplicated and indexed in ES, so I've always wondered why we're using both. Mongo has none of the SQL capabilities that would make my life easier, and the types of queries allowed by Mongo could be done with ES.

What are the negatives of ES alone?

manigandham · on July 20, 2019

It's not reliable: https://www.quora.com/Why-shouldnt-I-use-ElasticSearch-as-my...

The v7 upgrade to a new cluster protocol (zen2) has improved things but overall the system has a long history of losing or destroying data. It's better to have a primary OLTP system that's ACID and reliable while using ES as the secondary search source. You can also remove the _source field if you just need matches without the original content.

It's common to see pattern used with a relational database since, as you can see, Mongo doesn't buy you much else as another document-store.

hiram112 · on July 26, 2019

Thank you very much. It seemed like we were just duplicating a bunch of scheme-less json for no reason, but if ES can lose data, that's probably not a good idea.

manigandham · on July 20, 2019

Follow up: MongoDB is adding full-text search capabilities: https://www.youtube.com/watch?v=4QUGWnz-XaA

penagwin · on July 19, 2019

> "if it works, break it and make it better!"

Fix it 'til it breaks is what I always say :D

Needless to say my 3D printer had a lot of down-time haha.

twic · on July 19, 2019

"If it ain't broke, open it up and find out what makes it so bloody special" - i think that's wisdom from the BOFH.

eivarv · on July 19, 2019

Agreed re: using ES as primary storage (which it is NOT meant as) - as far as I can tell, it might even make you in breach of the GDPR [0].

TLDR: Lucene < 7.5 won't merge segments larger than 5GB (default) unless they accumulate 50% deletions.

Delivering a conference talk [1] later this year about it.

[0]: https://www.eivindarvesen.com/blog/2018/09/16/elasticsearch-...

[1]: https://2019.javazone.no/program/3f7cd8a7-a9ea-4874-a7dd-531...

balfirevic · on July 19, 2019

> Agreed re: using ES as primary storage (which it is NOT meant as) - as far as I can tell, it might even make you in breach of the GDPR [0].

How is GDPR compliance of having data in Elastic Search influenced by it being primary vs. secondary storage?

rickmode · on July 19, 2019

See my reply to bryanrasmussen for a full explanation.

Basically: you reindex ES periodically, so when a user is deleted from the primary, it will disappear from ES upon the next reindex. The old index is deleted at the file system level.

ipython · on July 19, 2019

At some point, though, the pedantry can get out of hand. After all, 'deleting' at the file system level is just 'unlinking' the inode from the underlying data blocks... in fact, data forensics at the file system level is probably more well-understood than recovering deleted data from a Lucene shard.

at what point would you be able to 'delete' data without being in violation of GDPR?

eivarv · on July 19, 2019

I know - it really is a matter of definition.

Though the EU has said it will consider intention etc. there's really no way of knowing for certain until when and if it's settled in a court case.

bryanrasmussen · on July 19, 2019

I think the instructor's response in the linked article is a reasonable defense, you don't really know that data is deleted all the way down to the file level. It is just marked as deleted and could be retrieved by someone clever enough to do so. At some point in the future it will be really deleted.

I don't think the GDPR regulatory agencies are operating at a technical level that they would make an argument that it was not a good enough deletion.

Finally I have to ask this part: assuming ES is not your primary database, how does this get around the GDPR issues? If someone wants their data erased you are supposed to erase it from wherever you store data, I suppose this means ES when it indexes a primary store and finds it has deleted data actually deletes it but if it is told to delete something it keeps it around?

rickmode · on July 19, 2019

When using ES for indexing and not the primary store, you can (and should) periodically fully reindex the data set. You can use a blue / green pattern — create a new index then swap from the old one to the new one. ES supports aliases, making this swapping transparent to the apps using the index. Now you have more options.

If it is easy to delete specific users from the primary database, the deleted users will naturally disappear during the next ES reindex.

Edit: The old index is deleted at the file system level.

If the reindexing occurs daily or weekly, perhaps this will satisfy GDPR.

There are other good reason to not use ES as the primary data store. First, it isn’t entirely reliable. It’s good and I’ve never seen a corruption, but ES and Lucene’s history isn’t as a reliable database. Second, if you want to change how you index, it is a bit easier to do if the source data is outside of ES.

bryanrasmussen · on July 20, 2019

thanks, I wasn't arguing that using ES as primary was good. Just don't necessarily see the GDPR argument as being a reasonable one. Although I've seen some startups using Mongo as primary and have to wonder if there would be that big a difference in using ES at that point (not a Mongo dig as I've kept away from it for various reasons)

busterarm · on July 19, 2019

There's folks working on Bleve (written in Go) and developers that I work with want to use it (we use Elasticsearch heavily), but as I've told everyone like you just did, Lucene has a 20 year head start.

Thing is, there's heavy demand for something more performant than Elasticsearch, so eventually the market will provide.

Meanwhile, Redis Enterprise is trying to grab some 'share with RediSearch, which has some severe caveats IMO that make it not a great fit for most.

bpicolo · on July 19, 2019

Tantivy is an interesting project I'd point to in this space:

https://github.com/tantivy-search/tantivy

That said - it's effectively Lucene rewritten in rust, so the main win is some performance gains. Lucene has spent a ton of time getting the details right, and it's unlikely we'll see an order of magnitude of innovation in that particular space. At the higher level for querying / query understanding it feels like there's still more technological room to grow vs the lower level details.

fulmicoton · on July 20, 2019

Tantivy main dev here. Thanks for the free marketing :)

It is not exactly a port but yeah. tantivy is strongly inspired from Lucene.

> Lucene has spent a ton of time getting the details right, and it's unlikely we'll see an order of magnitude of innovation in that particular space.

Have you checked out the perf gain in Lucene 8.0 ? Block-WAND proved you wrong.

bpicolo · on July 20, 2019

I suppose I could have phrased that better. I appreciate the correction. I mostly mean to say that the functionality is really impressive today, and serves its use case very well for the intended target of lower level search primitives.

Tantivy is a cool project, but I have to say the part I love most about it is your blog posts on it. They're a great introduction for people who are unfamiliar with the underlying tech of search engines.

fulmicoton · on July 29, 2019

> Tantivy is a cool project, but I have to say the part I love most about it is your blog posts on it. They're a great introduction for people who are unfamiliar with the underlying tech of search engines.

Thanks a lot! I am not a native speaker, and I often feel very bad at conveying engineering concepts. The positive feedback is actually very helpful :)

pornel · on July 19, 2019

Seconded Tantivy. Very easy to set up and maintain, and faast.

rbjorklin · on July 19, 2019

If you’re looking for something more performant and you’re in retail I’d recommend taking a look at Apptus eSales. Their product has displaced ES/SOLR at several retail websites in Sweden. https://www.apptus.com/

Disclaimer: I’m a previous employee but have no economic interests in this as it’s not a publicly traded company.

seesaw · on July 19, 2019

We use Solr/Lucene heavily, ingesting about 3TB a day. We had to build our own clustering since we started before the Solr cloud project. We have been very happy with the results.

ksec · on July 19, 2019

>Afaik almost everything runs Lucene under the hood, it's 20 years old, no one is going to build something as good any time soon.

Vespa [1] would like to have a word with you.

[1]https://vespa.ai

softliving · on July 20, 2019

move on! IMO donated tool, Yahoo can't maintain it, give it to community reduce the cost. If only it was a new project which wanted to address search in a different way.

ksec · on July 20, 2019

If anything Vespa has actually gotten faster in development since it Open Sourced.

jerska · on July 19, 2019

I’ll just chime in to say that Algolia runs on an home-made C++ engine. [I work at Algolia]

deathtrader666 · on July 19, 2019

Wonder what DuckDuckGo might have built for its search..

gfs · on July 19, 2019

I see their logo on the main Solr page [0]. They must be using it in some capacity still.

[0]: https://lucene.apache.org/solr/

aliswe · on July 19, 2019

Word is they use bing api for web search.

matt_heimer · on July 19, 2019

The technologies are still heavily used but there might be slightly less demand for the skill set because:

1) Cloud search service - You are less likely to deal with setting up your own instance and concerns that go with it (sharding, etc) because most Cloud providers offer either ElasticSearch as a service or some form of turnkey deployment. You still have to do some low-level work like score manipulation but you don't deal with as much administration.

2) Search as a Service - https://en.wikipedia.org/wiki/Search_as_a_service. There are several companies that provide a Search SaaS offering. Typically they provide value adds above just running your ES service. Often they will provide web crawlers so you can just point them to your domain or they might provide other datasource integrations like pulling content from a database. You get access to Solr/ES functionality if you want it but you can get search running without going to that level if desired.

Either way a Lucene based stack is still in use.

linsomniac · on July 19, 2019

About a year ago I set up an ES cluster to load Apache logs into as an experiment. Around a month later my boss asked if it was down. "Yeah, it looks like it, since you are using it let me upgrade it from an experiment to critical!" Since then we've been using it in more and more places, and are looking to see if it would be a good fit for storing some of our user data in. The big barrier right now is that we are running the version of SQLServer right before changed data capture became part of the standard edition, and that's the way we'd probably prefer to synchronize data into ES from our primary database. We have a couple home grown solutions to synchronizing secondary data sources, and we'd like to get out of that business. The SQL server is always going to be our source of truth, probably.

parthdesai · on July 19, 2019

>The SQL server is always going to be our source of truth, probably.

ES is not supposed to be your source of truth.

rhacker · on July 19, 2019

Definitely agree - ES is designed more like a search appliance - it definitely should be pushed data from other databases that are the source of truth.

https://discuss.elastic.co/t/elasticsearch-as-a-primary-data...

aliswe · on July 19, 2019

In the early days at least, cosmosdb was built on es. Dunno about now though.

ps101 · on July 19, 2019

Can you elaborate on the reasons for this?

Izkata · on July 19, 2019

Coming from experience with Solr, the answer is far simpler than the link in the sibling comment indicates: Data is processed and manipulated on import, and it's difficult - sometimes impossible, depending on the field config - to get that data back out in the format it was imported.

Any changes to the schema, such as switching fields between indexed/not indexed/stored/not stored, requires reimporting the data to populate those fields, data which you're not likely to have if it was your primary store.

manigandham · on July 20, 2019

Elasticsearch has a _source field [1] that stores the entire original document and is enabled by default. It's required to support features like highlighting in results. ES also has a reindex API that specifically makes use of this [2].

1. https://www.elastic.co/guide/en/elasticsearch/reference/curr...

2. https://www.elastic.co/guide/en/elasticsearch/reference/curr...

lemcoe9 · on July 19, 2019

Here is a very good Quora answer on why you should never use ES as a central repository for data: https://www.quora.com/Why-shouldnt-I-use-ElasticSearch-as-my...

simplechris · on July 19, 2019

We do at Vimeo: I gave a talk this week at the NY Elasticsearch meetup about a new search product we built using ES. If you're interested you can access the recording of the livestream here: https://vimeo.com/348443979 The product in question is https://vimeo.com/stock

be_erik · on July 19, 2019

Search gets less attention these days, but mostly because the current tools work so well. Scaling Elastic is still a bit of a dark art, but our company likely wouldn't exist without lucene/elastic. They take a bit to learn and use correctly, but they are incredibly powerful.

shhsshs · on July 19, 2019

What do you mean by saying scaling Elastic is a dark art? Care to expand on that?

meddlin · on July 19, 2019

I'm by no means heavily experienced in this, but my current employer has a big demand for this. (e-commerce)

First there's the Docker + Kubernetes architecture that ES lends itself to really well. Then (depending on your use-case) there are concerns like hot/warm architecture, node types, ETL/indexing processes. ES recently moved over to openJDK, so there's a couple intricacies there (i.e. JVM heap size)

Then, there's document/query structure. In no particular order:

- Do you have any parent/child relationships?

- Do you have stop-word lists developed?

- Can search templates help your queries?

- How will you interface with ES? It has REST APIs, but it's recommended to not expose ES directly to your applications.

- Some advanced querying possibilities like customizing tokenizers, normalizers, and a bit of internationalization.

- Oh, we haven't even discussed security yet.

- Also, ES isn't meant to be a primary data storage. This is more so a "cache", but not quite like Redis. So, you'll need a DB elsewhere most of the time.

All of this changes depending on if you're using it for SIEM, e-commerce, AI/ML, etc. Also, Elastic now provides their own SIEM solution, a pre-built search solution (AppSearch + Search UI), built-in security features. Check out the new ES 7.2 update; it's kinda nuts.

vosper · on July 19, 2019

> ES recently moved over to openJDK, so there's a couple intricacies there (i.e. JVM heap size)

My current employers uses ES - we're on 6.8, planning to move to 7 in a few months. Judging by the other replies here I'd say we have a reasonably large cluster (150+ i3.2xlarge instances, billions of documents), so tuning the cluster is very relevant to us. Could you expand on how things have changed with the move to OpenJDK?

I've seen some claims online that, contrary to what Elastic recommends in their docs, a few machines with huge heaps (100+ gb) is the way to go, rather than many machines with 20gb heaps.

yehosef · on July 21, 2019

>I've seen some claims online that, contrary to what Elastic recommends in their docs, a few machines with huge heaps (100+ gb) is the way to go, rather than many machines with 20gb heaps.

Usually the recommendation is less than 32GB - this link has some more discussion about it: From https://discuss.elastic.co/t/es-lucene-32gb-heap-myth-or-fac...

It seems whether it's better or worse depends on your data set . But I would love to see tests of different kinds of workloads with large or smaller heaps.

busterarm · on July 19, 2019

It's multivariate calculus.

Also, you have plan ahead and over-allocate or deal with fixed indexes/datasets. You also have to religiously monitor the garbage collection and deduce what's going on with search & indexing performance. When the situation changes you need to scale your cluster and re-index everything, which is not a trivial thing at most companies. I've seen bad situations at companies where it takes days to re-index a cluster and they're dead in the water until it's done.

And that's just the operations side. You have to make sure your data is flat (because nesting creates subindexes for Lucene and kills your search performance), that you only define in your index template the fields that you want to be searchable (and binary blob the rest), etc.

m-i-l · on July 19, 2019

Within my current organisation, two main search options are Solr and Elasticsearch, both based on Lucene.

Generalising a bit, Solr is more targetted at enterprise search and unstructured content search (e.g. bundled with most content management systems), and Elasticsearch is more targetted at data analytics and structured data search (e.g. with the ELK stack). Again simplifying a bit, Solr can be a bit more configurable and/or work better with the types of data that benefit from more configuration, and Elasticsearch can work better with the types of data that work more "out of the box".

I'd agree that there don't seem to be a huge number of openings for specialist search roles or a huge number of people specialising in search, but it is often part of another role and there are often people who have touched on search in their roles. That suggests that many people are just using it with largely default setups. Having said that, things like advanced relevancy tuning, if you need it, is a very much under appreciated skillset, and definitely needs someone with good experience or ability to learn.

garysieling · on July 19, 2019

I'm using Solr for https://www.findlectures.com, but I think Vespa looks interesting - lets you store feature vectors in the index, so you can do neat things to incorporate ML algorithms in ranking.

fogetti · on July 19, 2019

That's right! Vespa looks cool. I really appreciate that they even implemented a use case as a proof of concept.

peterdm · on July 21, 2019

Feature vectors do tend to get incorporated in relevance tuning (regardless of the engine), but from what I've heard of Vespa, features (and ML in general) are first-class citizens, whereas with Elasticsearch and Solr, text statistics are your first-class citizens, and you're adding in additional features and integrating ML at the periphery.

softwaredoug · on July 19, 2019

Yes but the search market has shifted.

We're at a point where Lucene and family are used for increasingly sophisticated use cases. The commodity end of the market used to be dominated by open source (Solr connected to Drupal for example)

Now for commodity sites there’s so many SaaS search products it doesn’t make as much sense to hook up Solr or ES to make your blog or university website or whatever searchable. A lot of basic search use cases are covered by products where you don’t want to have to hire a team to manage search.

But at the higher end apps with search, customization, especially of doing domain specific relevance at scale, is often a product differentiator (but often not so important or weird you should write your own engine). So this is where these systems thrive...

ageyfman · on July 19, 2019

100% agreed. This is what my company uses ES for, and it's exceptional at it.

nova22033 · on July 19, 2019

SOLR is great but it's a pain to manage it in the cloud. If you lose an EC2 instance, there is manual work involved when you bring up a new instance. You have to tell the new instance servers which shards they're going to replicate. If the EC2 instance hosting shard1 replica2 goes down, you can't just bring up a new instance and have it be replica2. You need to use the API(which is just a call to a bunch of URLs) to get the new instance to be part of shard1. Also, a good cloud overview UI would be nice. 8.1.1 does have some improvements.

Also, SOLR speed is almost directly proportional to disk speed. If you index is on solid state drives with high iops, you'll be fine.

Backing up a large index is a little painful too.

kostarelo · on July 19, 2019

We have been using https://www.algolia.com/ completely as a replacement for ES.

Pros:

- Managed search engine

- Great API / Developer experience

Cons:

- Cloud only makes it hard for local development

- Expensive (I guess it depends on the usage)

karterk · on July 20, 2019

I have been working on an open source alternative: https://github.com/typesense/typesense

Would love to hear your feedback :)

np_tedious · on July 19, 2019

Worth noting this is what hacker news itself uses

kayoone · on July 19, 2019

you mean https://hn.algolia.com/? thats not made by the HN team though, wouldn't be surprised if the algolia guys built it to show their tech, which is a very smart idea anyway.

Redsquare · on July 19, 2019

It is a similar cost to elastic/solr cloud options, cheaper if you have to get to feature parity with Algolia.

kostarelo · on July 22, 2019

Oh that's a good point. I was referring to Algolia vs self hosted ES and that's why I pointed out that it depends on the usage (how much power you need and how many people to maintain it).

acdha · on July 19, 2019

I use Solr & ElasticSearch heavily — they're “boring” in the sense that they do a lot of heavy lifting without many surprises and they scale easily into at least terabyte-sized indexes.

One area where this might be less true is that the full-text search in Postgres & MySQL have matured to the point where some basic applications might reasonably decide that it's not worth using a separate service.

manigandham · on July 20, 2019

Elasticsearch is very popular because it works well for generic searching and can be customized for lots of unique scenarios. There's competition on the infrastructure side using something other than Java/JVM though:

For Rust, there's Toshi: https://github.com/toshi-search/Toshi which is built on top of Tantivy: https://github.com/tantivy-search/tantivy

For C++, there's Xapiland: https://github.com/Kronuz/Xapiand

For Go, there's Blast: https://github.com/mosuka/blast built on Bleve: https://github.com/blevesearch/bleve

Gonzih · on July 19, 2019

Yes. They might not look trendy anymore, but they are still heavily used in the industry. I constantly see in the industry use cases where sold or ES would be much better choice, but those options are simply ignored because they are rarely visible at the top of tech publications.

blyry · on July 19, 2019

We use elasticsearch to power our ecommerce search, and it works pretty well, but we're considering moving to a commercial product, or solr, to get closer to personalized results based on our knowledge about the user.

We just rewrote our internal search API from a windows service indexer with lucene indices and a vb.net SOAP api in iis to a netcore service, hosted in k8s, that splits out ingest, analysis, storage and queries into separate domains, with the writes going to Azure Search Service*.

Our use case might be a bit weird -- this app is essentially an internal API that supports the search needs of our other teams and their own products for our own internal software. It probably has 30 million records across a few different indices. We made the decision to migrate from lucene because of the ease of clustering elasticsearch. We previously achieved availability by just running multiple copies of the standalone service and doing smart health checks at the load balancer level in case a lucene index got corrupted and needed to spend a day rebuilding, but that didn't scale well for rebuild times, and we have been consolidating all of our legacy tech onto netcore and kubernetes.

Raw lucene was an order of magnitude faster than azure search service, but that's probably more a function of being able to essentially query the indices directly in memory of the webservice, as opposed to a slightly underprovisioned search cluster with all the HTTP overhead. We're migrating it to our own elasticsearch cluster right now for performance, cost savings and cloud-agnosticism.

mish15 · on July 19, 2019

We have an early access product for personalized ecommerce search @Sajari if you are interested. One early access company is on track to generate $30 million in additional revenue from switching (over 10% search conversion increase). That is across millions of skus and hundreds of products updated per second also.

We are also looking at releasing this as a k8s deployable product. It's all k8s services and gRPC already...

peterdm · on July 21, 2019

For e-commerce search, personalization with Elasticsearch takes a similar level of effort as with Solr. Don't re-platform under the impression it will make personalization easier. It still takes data collection and experimentation but can be accomplished on Elasticsearch. Feel free to contact me if you have questions.

pmarreck · on July 19, 2019

Two years ago I decided to go with Postgres' built-in fulltext search instead of adding another dependency like ElasticSearch, and I believe I've profited from that in much less maintenance while still getting quite good performance/features.

ngrilly · on July 20, 2019

Do you use ts_rank? PostgreSQL FTS is very efficient until you want to rank the results according to their relevance. This is because the data necessary to the ranking are not in the GIN or GIST index. They are in the heap, and this triggers a lot of random IOs.

pmarreck · on July 22, 2019

Ah, this is good to know. My site doesn't yet need to scale, so this is definitely A Problem I Would Love To Have ;)

EDIT: This seems to help with the ranking problem: https://github.com/postgrespro/rum

ngrilly · on July 22, 2019

Yes, RUM is great. I'd hope it will be built-in one day.

aldoushuxley001 · on July 20, 2019

Any tips for scaling Postgres-only fulltext search?

ngrilly · on July 20, 2019

https://github.com/postgrespro/rum

pmarreck · on July 22, 2019

Good find, bookmarked!

pmarreck · on July 22, 2019

I wish I needed to, let's just say that!

flexer2 · on July 19, 2019

We use ES heavily. Most of our queries are basic document filtering plus some geospatial stuff. We could probably have done it with Postgres/PostGIS, but with AWS manages ES, it's all "good enough" -- we can do geospatial searches on millions of documents with response times around 100ms. The other part I like about ES is how it's easy to scale out across machines, which lets us handle quite a bit of load and tolerate failures easily. We have a cluster of 5 m4.large instances and it only runs us about $600/mo. Like others have said, tuning AWS ES sucks, but it's always been good enough for us.

We've run into some pain points like trying to index very large shapes into a geospatial index, but have workarounds for basically everything now. We also had a problem where when AWS had the outage around autoscaling groups a few months ago, we lost 3/5 of our instances and had to reindex some data from backups. That was the worst thing that's happened.

I'm sure there would be better/faster/cheaper ways of doing what we do, but for what we get out of the box for the price, it's going to take a lot for us to move away from it for now.

niklasrde · on July 19, 2019

Yep, most of the audience facing search functionality across our sites (https://www.bbc.co.uk) is powered by various Solr clusters, hosted on-prem and cloud.

quickthrower2 · on July 20, 2019

ES powers search one of my side projects https://dealscombined.com.au.

The ability to not only full text search but to do it fast, to tune the lexical behaviour (lowercase, plurals, stemming etc.) and to top it all combine geo search pretty much left any other solution in the dust. I even considered some paid solutions.

I also considered postgres which looked strong but I felt it’s be harder to set up these features and that the full text would be weaker although geo might be stronger but my geo needs are simple.

ES was easy to set up to do this, taking about 2 hours of tuning. I used AWS so I didn’t have to figure out how to install it. I admit I had a mental model of ES from ELK-ing at work.

At some point when the site gets more traffic I’ll tune the search so that rather than nearest matching, I’ll score bit the distance and the words and order by perceived relevance. Ie weigh up both how close something is with how well the words match.

ES is a pretty amazing tech and it’s the easiest way to set up a decent quality free text search for your site.

_x5md · on July 19, 2019

yeah, there aren't many alternatives unfortunately. I've used Sphinx a lot, but am now stuck with ES and it is horrible to operate, probably because we don't need a cluster solution so it is total overkill. Yeeeah for technical debt.

For small projects (for some 1000s of documents), I'd probably go with Postgresql FTS if possible. Sphinx/Solr for anything with indices smaller than a couple 100GBs After that, ES seems reasonable & worth the overhead

EDIT: my biggest issue with ES is that it seems to be specifically engineered to sell you support. So get a managed version if you can.

karterk · on July 20, 2019

This was exactly my pain point as well. For smaller projects, ES is a overkill. So I decided to do something about it! I started working on an open source, really, developer friendly search engine that just works. It's pretty stable now and quite a few people use it and like it: https://github.com/typesense/typesense

Would love to hear your feedback.

riku_iki · on July 19, 2019

> I'd probably go with Postgresql FTS if possible. Sphinx/Solr for anything with indices smaller than a couple 100GBs After that, ES seems reasonable & worth the overhead

why do you think you can't throw 1TB of data on postgresql?

aldoushuxley001 · on July 20, 2019

Could postgres FTS handle millions of documents within a reasonable timeframe?

manigandham · on July 20, 2019

Yes, it's searching against the ts_vector data types which can be indexed.

The problem with PG FTS is that it doesn't have advanced search functionalities (fuzzy matching, faceting, term distance, highlighting results) and it lacks the modern relevance scoring systems so that'll be the limiting factor instead of speed.

otisg · on July 19, 2019

At Sematext we help companies with Apache Solr and Elasticsearch. ES/ELK is definitely used mode for timeseries sort of data. Solr community puts more focus on full-text search (email search, product search, database search, etc.). Elasticsearch can do that, too, and we regularly help companies who use ES for that, but Solr seems more focused on that use case.

inertiatic · on July 21, 2019

I work at a company who's a major player in online academic publishing.

We use Solr to power our main, end user-facing search after migrating from a custom Lucene solution some years ago.

To me it seems Lucene based tools are the best for the job if the main thing you care for is having text focused search with a huge potential for extensibility.

But there are a lot of use cases where you will never need anything more than the base capabilities of this technology (so you can be served by something simpler to use or maintain nowadays) and there are probably a lot of use cases where your search will be mainly driven by vector similarity (in which case you are working around the limitations of picking a technology with another focus).

As far as jobs go, I'm not sure how in demand specialists are. After a few years of working in the field I had a look to see if I could leverage my experience to get a remote position and came up with pretty much nothing.

james-skemp · on July 19, 2019

The Sitecore CMS moved from License to Solr for search for on-prem instances. After trying to run it on Windows we were happy there was a third party provider that was easy to work with.

alasano · on July 19, 2019

For Sitecore I will shamelessly plug Coveo for Sitecore: https://www.coveo.com/en/solutions/coveo-for-sitecore

I think that we definitely have the best integrated and most full featured solution for Sitecore customers.

It's not only about querying but also having a UI framework, built-in customizable indexing, analytics tracking and access to machine learning available in one package.

Source: Am product manager for Coveo for Sitecore.

james-skemp · on July 19, 2019

Based upon community responses (and someone from the organization being in the Slack group and responding as well), we ended up going with SearchStax.

I work at a state university (with the associated purchasing ... issues :)) and we had some staffing issues, which they were very accommodating of from initial quote to subscription.

If we end up running into issues as/if we expand our usage (we're essentially only using Solr for the mandatory bits) I'll keep Coveo in mind. :)

Snoddas · on July 19, 2019

We use solr/lucene. It provides search and indexing for our CMS. It will in all likelihood be retired when we change CMS, it's scheduled for autumn next year (yeah right).

acd · on July 19, 2019

Duckduckgo was/is using Solr

http://highscalability.com/blog/2013/1/28/duckduckgo-archite...

thecatspaw · on July 19, 2019

Yes, we use it as part of an ecommerce framework to index products, categories and cms content for various customers. However we dont have a specific Solr position, as we only need to make small adaptations which an average developer usually can do/interfere from existing code.

crishoj · on July 19, 2019

ES is backing the product search at ecommerce site https://www.imusic.dk/. Even with 16M documents, a fairly intricate ranking function, and spelling suggestions, latency is in the order of 200 ms.

paulirwin · on July 19, 2019

Lucene is still great today for smaller indexes that can entirely fit in memory and can be indexed quickly on app startup. Think something like searching for a setting in Windows 10 settings, or if you had some other fixed, small data set that you wanted to allow users to do real text search without the complexity of a search service. Lucene is still helpful here because of the analyzers, stemming, etc.

But for searching data that can grow and change over time, it's hard to justify using Lucene directly anymore. Azure Search (I believe built on Lucene) is an awesome (but relatively expensive) SaaS solution that is far easier to manage than Elasticsearch.

srameshc · on July 19, 2019

Search built using Postgres is underrated. It can do a lot if used properly.

joking · on July 19, 2019

there are on different levels, a search on a sql database will have real time results, while on solr / elasticsearch it's going to have a delay (from milliseconds to minutes). That delay gives the advantage to build a series of data structures much more suited for search than the ones on a database.

I built several search systems for classified listing sites, something like solr is a life saver once you get enough traffic. Is much easier to scalate than a sql database, and you can do much more things. The easiest example is a facet, for example you make a search on a car listing site, and you want to show how many cars from each brand you are matching, with sql you have to make another query, while in solr you can get those results in the same query. Now, add the model, the color, the gas type, the transmission, the place, etc. that actually grows easily to something unescalable, while with solr you can do easily.

glintik · on July 20, 2019

ES search have no any delay, if data is really commited to index. That's also true for any general DB - Postgres etc.

scalesolved · on July 19, 2019

We use Solr heavily at https://www.helpscout.com/, it really powers a ton of our functionality.

adventured · on July 20, 2019

I'm still using an older version of Sphinx. I love it. It's fast, moderately flexible, very lightweight, easy to set up and produces good enough results. I have also found it to be highly reliable (at least the version I've been using for years). It's not useful for anything that needs hyper scale (Twitter et al), however for the next tier of scale below that it generally does well if you know how to leverage its strengths.

rchasman · on July 19, 2019

We use ElasticSearch at Lawmatics, and it powers more functionality than just our search! We use it to power our Automation targeting engine, reporting features, audience builder and pagination, filtering, sorting of data tables.

We denormalize associated records into one Index. And any record that we need to find based on user-defined queries will go through ES since it's much simpler to metaprogram queries across denormalized data (no conditional joins).

yehosef · on July 21, 2019

This is one of the undersung benefits of ES in my eyes. Relevancy results requires tuning of the indices and queries and in most cases (that non IR-experts would program) ES will give as good results and be as easy or easier to implement as Solr.

But after you've gotten over that, you realize that this new tool can do lots more things than just text search. Time series metrics, BI, predictive ML, APM, etc. with relatively little work. With Solr, you could do those non-IR tasks, but it's going to feel much more awkward, IMO.

frankjr · on July 19, 2019

We use ElasticSearch extensively at our company but we don't use it for full text search (in fact, we don't use its full text capabilities at all) but rather for its ability to match and aggregate large data sets without having to create any indexes at all prior (and it's fast, it still blows my mind a little). This allows us to offer customers a way for them to create arbitrary queries in our own little DSL.

tjpnz · on July 20, 2019

My experience is mostly Japan-centric nowadays but SOLR is very widely used here and there is demand for people with that background. A lot of work has been done with SOLR to better support the intricacies of dealing with Japanese text which differs substantially from other languages. Most of the search and NLP jobs I've seen recently outside of Google and Amazon expect some SOLR experience.

porker · on July 19, 2019

This has been a great thread, and there's some heavyweight indexes here. But what about at the other end of the scale?

Say when you've got 10k-50k contact details (name, email, phone) and you want to provide a quick, autocomplete lookup. I've used basic SQL string matching for this, but it doesn't catch mis-spellings and the rest.

Running SOLR or ES is overkill for this. Is there a tool that fits this niche?

noir_lord · on July 19, 2019

Postgres does lexemes and all that jazz natively.

You want to be looking for tsvector related stuff.

I used it a few years ago to do full text search on a smallish website and it was great, I gather it has improved further since.

karterk · on July 20, 2019

Yes! Please take a look at an open source search engine I am working on. You will definitely like it:

https://github.com/typesense/typesense

Would love to hear your feedback :)

rpedela · on July 20, 2019

What's wrong with running Solr/ES? It is trivial to run either in standalone mode, and it is a lot easier to set up autocomplete with misspelling support than messing with PG. Algolia is a good option if you have the budget.

porker · on July 20, 2019

> What's wrong with running Solr/ES?

With this small quantity of data, usually the app's running on a small VM. I'm wary of running anything Java, having had it require large amounts of RAM before.

That said, I haven't touched JVM stuff for 5+ years.

mohaine · on July 19, 2019

Lucene should work great for this use case. It has been awhile but I have successfully used it for this exact use case.

samsk · on July 19, 2019

With Postgresql you can use pg_trgm, might be not as powerful as what SOLR/ES provides, but easier to run.

peterdm · on July 21, 2019

Upvoted. This is how Postgres supports "fuzzy searching" which helps with misspellings. https://www.rdegges.com/2013/easy-fuzzy-text-searching-with-...

Completion response-time will be slower than Solr, Elasticsearch, Algolia, etc... but if you're already running Postgres, this may be the fastest to deliver for you.

softwaredoug · on July 19, 2019

This is a great niche for Algolia, instant fuzzy 'autocomplete' style search of short strings

sciurus · on July 19, 2019

While I was at Eventbrite we were using Solr and started moving to Elasticsearch. I know one of the main people I worked with on that recently left for Github, which also uses Elasticsearch.

At Mozilla I work one project with a search component (https://crash-stats.mozilla.org/), and it uses Elasticsearch.

JnBrymn · on July 19, 2019

:handwave:

sidcool · on July 19, 2019

ElasticSearch is widely used in enterprises for full text searching.

kodz4 · on July 19, 2019

Wikimedia uses ES and you can download their entire index for any of their sites wikipedia/travel/quotes etc.

truth_seeker · on July 19, 2019

Yes. Also, just recently MongoDB v4.2 added Lucene as an embedded engine for text search capabilities

yla92 · on July 19, 2019

Do you know if it is coming to the community edition or is it only for Atlas?

truth_seeker · on July 19, 2019

Atlas.

Source : https://www.youtube.com/watch?v=4QUGWnz-XaA

Jeremy1026 · on July 20, 2019

We use ES for our searching data within our application. We store about 20,000,000 rows of data in the primary table, with plenty of dependent and secondary tables. ES takes the load off our MySQL cluster for generating reports and fulfilling searches.

darkhorn · on July 19, 2019

The company I'm working is using Solr. We collect the data by either paying, for free or by using crawlers. Then we query the data from Solr if it is not a meta data. And there are other types of databases too that we use.

csixty4 · on July 19, 2019

ES is somewhat popular in the enterprise WordPress space, driven largely by 10up's ElasticPress plugin which uses it both for search and to improve the performance of database queries over MySQL.

ketanhwr · on July 19, 2019

Atilika (https://www.atilika.com/en/) uses Lucene/Solr for their NLP based search products.

samsk · on July 19, 2019

I'm using SOLR for public facing search API/engine on several projects (own and customers). ES is imho better for doing various on-demand analytics (like dev logs search)

airocker · on July 19, 2019

ES is used in code search tool dxr which we locally use as well: http://dxr.mozilla.org

termez442 · on July 21, 2019

Data discovery: this is the key concept. they are absolutely unbeatable at this and for this you can use them for a lot of things. Http://siren.io

janemanos · on July 19, 2019

I guess more and more people are turning towards search engines directly integrated into databases, like ArangoSearch of ArangoDB to combine search with other needs https://www.arangodb.com/why-arangodb/full-text-search-engin...

mosen · on July 19, 2019

I assume you work for ArangoDB?

mixedCase · on July 20, 2019

Looking at his/her post history, he/she does but rarely discloses it.

Risse · on July 19, 2019

Yeah, I work pretty much daily with Solr and related stuff (PHP, JSON). Still used quite a lot in PHP and Drupal scene.

Kimitri · on July 19, 2019

Yup. Lots of Drupal sites use Solr. There are very good contributed modules that make using Solr with Drupal a doddle.

MaddAgent · on July 19, 2019

Same here - using SOLR with Drupal and it's pretty simple and effective

lightbyte · on July 19, 2019

Yes, I work on the search team at my work and we extensively use Elasticsearch for many different search services.

alfarisi17296 · on July 19, 2019

Using SOLR heavily. I have a strong preference for SOLR when it comes to Full Text and ELK when it comes to logs.

znpy · on July 19, 2019

I've seen elastic search used as a document indexing engine and thus also as search engine.

Also solr, to although a lot less.

altendo · on July 19, 2019

Are people still using Sphinx Search (http://sphinxsearch.com) at all? It doesn't seem like it gets many releases anymore...since they unpublished the source code, it's hard to see how much activity there is.

pQd · on July 19, 2019

https://manticoresearch.com/ is the lively, open source, fork of Sphinxsearch. that's where some of the earlier developers from the project moved to. it's used as a text-search backend on craigslist.

jzawodn · on July 19, 2019

Definitely. I love me some manticore. :-)

altendo · on July 22, 2019

this is cool! Definitely will give it a look.

adventured · on July 20, 2019

I'm using a 2.x version of Sphinx and have been for many years. I refuse to upgrade the version. I haven't been able to break/crash the version I'm using under nearly any common circumstances or loads, so I'm sticking with it until I find an alternative that is dramatically better. I've learned every nuance of it over time and can make it sing & dance exactly how I want it to. I consider it a spectacular piece of software; it does a thing and does it well, reliably.

makillik · on July 19, 2019

We just migrated from Sphinx to using the full text search indexes in PostgreSQL, we had to deal with some changes in how special characters are handled, but it's worked well enough.

snikolaev · on July 20, 2019

As far as I remember few years ago they didn't have BM25 and even TF-IDF support. Have they added that? Are you experiencing any issues with full-text search quality after migrating from Sphinx (you probably used BM15(+F) which is BM25 w/o doc length).

drenvuk · on July 19, 2019

Do you have any numbers on your requests or searches per second in a real use case? I've really been wondering this as I've been considering manticore which is the major sphinx fork.

jzawodn · on July 19, 2019

If you need high query rates, I suspect manticore will stand up quite nicely.

chenster · on July 19, 2019

We use it on https://dealbert.net