I highly recommend this software if its capabilities cover what you need. It is very fast, both in indexing speed and search speed. It's relatively simple to set up and start working against, and I have found it very reliable.
It works best when you have a SQL data store you want to index against, but with the real-time index you can treat it more like Elasticsearch and other search engines. However, for that first use case of indexing a SQL store, I don't know of anything else that comes close to being as easy to use.
Simply point it at your database, give it a query to pull what you want to index, and you are done. I suspect this covers about 90% of use cases out there.
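To give a flavor, this is roughly what that looks like in manticore.conf (a minimal sketch only; the source name, credentials, and query are placeholders, and I'm assuming a MySQL backend):

    source products_src
    {
        type      = mysql
        sql_host  = localhost
        sql_user  = app
        sql_pass  = secret
        sql_db    = shop
        # the query that pulls exactly what you want indexed
        sql_query = SELECT id, title, description FROM products
    }

    index products
    {
        source = products_src
        path   = /var/lib/manticore/products
    }

After that, a run of the bundled indexer tool builds the index and you can start querying.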
If you need more than what the DB's native indexing gives you, give it a try.
Our content is in HTML format but stored inside an MSSQL database. It worked out of the box for me. I don't think that Manticore comes with a crawler; if you need one, you'd have to implement it yourself. I did just find this: https://medium.com/@s_nikolaev/60-lines-of-code-web-crawler-...
> works best when you have a SQL data store you want to index against, but with the real-time index you can treat it more like Elasticsearch and other search engines
This sounds like a bit of a killer use case, but even explicitly searching the documentation I can't find more than tease-level information about it. They seem to be hyper-focused on just presenting it as a replacement for Elasticsearch.
> They seem to be hyper-focused on just presenting it as a replacement for Elasticsearch.
Probably it's because they think that's where the money is, and they see ES as a train that can pull them forward as well.
Not long ago, ~2 years or so, I had a case where a dev team (okay, it was just two developers) asked to set up ES to offload some searches from the DB (MySQL). I asked why not use Sphinx/Manticore, and one of the reasons they gave for wanting ES was that Laravel [PHP framework] has some nice libs to work with it, while Manticore seemed like more work for them. Quite shocking reasoning from my POV, but a real story.
We didn't, hence we may have inadvertently improved the wrong areas. Although we have been receiving feedback from the community, it was never backed by concrete data, and thus we had to rely on our best guess.
This is the best appeal for OSS I’ve seen in a while.
Actively telling a project how to improve vs. passively being observed in how you use the system in aggregate: that is a choice each person should be offered when engaging with a project, along with the right to lurk in peace or fix their problems in their own way.
The data itself ought to be public, and if it can’t be then it shouldn’t be gathered. The insights from that data should come from the community, not just project leads. That data’s relevance to product development and design should be annotated on feature tickets.
I am good with having a relationship with a project that involves sharing my behavior within the system; others are not. As a user, I do want to be able to see whether I'm way off the beaten path or using a popular method when I'm deciding whether to debug, research, or work around a problem. As a maintainer team, your time is precious and the BS is already thick, so I feel providing that visibility is the least I can do to contribute to your work.
I feel very differently about this with respect to for-profits and my PII tied to behavioral records.
They are "excited"? No end-user wants telemetry. They want software that works reliably. And when it doesn't work, they want customer service. Rarely do they get either.
If end-users wanted "telemetry", then they would have asked for it in previous versions.
Even once telemetry is added, there is still no reason to enable it by default. End-users that want/need to send data to the company, e.g., to back up their claims about deficiencies in the software, can easily enable it.
Show us the contractual restrictions that limit how the data sent by the end-user can be used. Show us the guarantee or enforceable promise that this data collection will result in improvements that end-users want (versus only benefitting the company in some undisclosed way(s)).
Why would anyone want to send behavioural data to a company with no enforceable promise of a benefit and no way to monitor how the data is used?
I've been running Manticore (previously SphinxSearch) on a faceted search heavy site with a million MAUs for 15 years. I'd definitely use it again for another project.
If the data that you want to search is entirely contained in a SQL database, it's an uncomplicated and powerful solution, definitely check it out. If not, Manticore may still be a nice solution for you, but I can't speak to that.
Coincidentally, we've been using first Sphinx and then Manticore for over 15 years as well. In our case it's fed each night with XML generated by Java code from data stored in MySQL databases. We index over 294M pseudo-documents.
From Typesense's site: "Typesense is an in-memory datastore", "If your dataset size is 1GB, you'd need between 2GB - 3GB RAM to hold the whole index in memory."
Manticore is different in this regard, especially with the Manticore columnar storage, which doesn't require a significant portion of the data set to be stored in memory. This allows, for example, a 1TB data set to be served on a standard server with some 32GB of RAM.
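For the curious, creating such a table is a one-liner over Manticore's MySQL-protocol port. A minimal sketch, assuming the default SQL port 9306 with the columnar library installed; the table and column names are made up:

    # Sketch: a table backed by Manticore's columnar storage.
    # Assumes the default MySQL-protocol port 9306; names are illustrative.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", port=9306, autocommit=True)
    with conn.cursor() as cur:
        # engine='columnar' keeps attributes on disk rather than in RAM
        cur.execute("CREATE TABLE docs(title TEXT, price FLOAT) engine='columnar'")
        cur.execute("INSERT INTO docs(title, price) VALUES ('red shoes', 9.99)")
        cur.execute("SELECT id, price FROM docs WHERE MATCH('shoes')")
        print(cur.fetchall())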
In my opinion, the main difference is the history: this search engine has been used for 2 decades in various production sites.
Manticore is an open-source fork of Sphinx Search, which released its first version in 2001. The fork started after Sphinx went from open source to proprietary at the end of 2017. The engine is stable and battle-tested. IIRC, Craigslist uses Sphinx.
I have been following these two libraries (Manticore and Meilisearch) very closely. Their simplicity, portability and performance gains over Elasticsearch are impressive.
Two days ago I started creating Python bindings for the core search engine of each of these two libraries, starting with https://github.com/AlexAltea/milli-py.
The goal is extreme performance, but as an embedded/self-contained package (basically the same goals as SQLite).
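To give an idea of what I'm aiming for, here's a hypothetical usage sketch; the actual milli-py API may differ, so treat the method names as assumptions:

    # Hypothetical sketch of an embedded, SQLite-style search workflow.
    # The milli-py API shown here is an assumption, not documented fact.
    import milli

    index = milli.Index("./movies.idx")    # index lives in a local directory
    index.add_documents([
        {"id": 1, "title": "The Matrix"},
        {"id": 2, "title": "Blade Runner"},
    ])
    print(index.search("matrix"))          # everything runs in-process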
Regarding performance, I hope it's not the same as Grafana's Loki.
Grafana Loki advertises lower resource requirements, but it's just a disk storage system. Any query will read everything from disk.
Elasticsearch has big RAM requirements if you create a lot of indexes, of course. You can't have something quicker than indexes, and you can't have lower resource requirements without having fewer indexes.
Tags in Loki are things like host, application, and environment. When searching by those tags and a time interval, it will read everything from disk. So any query that filters by, e.g., SessionId or a keyword from the log like Exception will read all the logs from disk. This can take ages if you have a lot of logs and a big time frame. Compare that with Elasticsearch, which can index anything, like SessionId or the log message, and return the result in an instant, without even reading the disk.
> When searching by those tags and a time interval, it will read everything from disk
That's what I'm asking, actually. Isn't Loki's proposition that it only indexes the tags and time interval? Do you mean that even filtering by that there's still a lot of data to go through?
Because it seems like you're saying it always fetches everything from disk.
> Isn't Loki's proposition that it only indexes the tags and time interval? Do you mean that even filtering by that there's still a lot of data to go through?
Yes
> Because it seems like you're saying it always fetches everything from disk.
If you specify a tag, like environment, it will not read data for other environments from disk. But tags like environment/host/timeframe are not enough if you want to query for something like error/exception/sessionid, and you might have to wait minutes or hours for a query that covers a lot of data.
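To make that concrete, here's a sketch of the two kinds of queries against Loki's standard HTTP API (the port is the usual default; the label values are made up). The stream selector is answered from Loki's label index, while the line filter forces a scan of every chunk the selector matched:

    # Sketch: both queries go to Loki's query_range endpoint, but only the
    # stream selector is served by the label index; the |= line filter has
    # to scan every matching chunk on disk.
    import requests

    LOKI = "http://localhost:3100/loki/api/v1/query_range"

    cheap = '{environment="prod"}'                     # label index only
    expensive = '{environment="prod"} |= "Exception"'  # scans selected chunks

    for q in (cheap, expensive):
        r = requests.get(LOKI, params={"query": q, "limit": 100})
        print(q, "->", r.status_code)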
Xapian is a library but is licensed under GPL, so you can't build on it without making your whole app GPL.
You can get around that by having the search happen in a separate process or something, maybe. But this is a huge issue for something that one might want to embed.
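Something like this is what I mean. A sketch only: "xapian-worker" is a hypothetical GPL-licensed helper binary speaking a line-based protocol, and whether this actually keeps you outside the GPL's reach is a question for a lawyer:

    # Sketch: keeping a GPL-licensed search engine at arm's length in its
    # own process. "xapian-worker" is a hypothetical helper binary that
    # reads one query per line on stdin and writes JSON results on stdout.
    import json
    import subprocess

    proc = subprocess.Popen(
        ["./xapian-worker", "--db", "./index"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )

    def search(query: str) -> list:
        proc.stdin.write(query + "\n")
        proc.stdin.flush()
        return json.loads(proc.stdout.readline())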
That is a slight misunderstanding of how open-source licensing works.
The GPL's "bleed" only happens if you distribute your application, meaning you sell or give away binary packages for customers to install. If your product is a hosted API that you do not distribute, you do not trigger that clause.
Also, a lot of open-source projects handle this by licensing the core engine under a copyleft license (GPL, AGPL), while the language connectors and bindings are licensed under the less restrictive Apache license. Unless you are offering a SaaS version of the product itself, you are more likely interacting with the connectors anyway. MongoDB is a classic example of this model.
Yes, if you don't distribute it, the license doesn't matter. That is more a loophole than the intention, but it is correct.
The point is that I can use SQLite, Tantivy, RocksDB, ... in my app no problem. I can make it open core; I can make it AGPL, BSD, MIT; no problem. Because those things are meant to be embedded. But I almost certainly can't use Xapian.
Let's be honest, if I want a search solution for use in my SaaS, I will grab Elasticsearch or an equivalent, I have no need for a library. It seems to me that the only use case where Xapian could really shine is crippled by their license. That is a shame.
It can be a problem if you intended to make an embeddable search engine within applications meant to be executed by your end-users (as is the case with milli-py above).
What are people looking for in alternatives to Elasticsearch? I have been toying in Docker with a version of Elasticsearch, FSCrawler, and Workplace Search to give a company better access to their knowledge base. They have Exchange, manuals, emails, images & video, GitHub, and other stuff… do these alternatives have connectors too? Any experience with this?
Anytime I see an alternative to Elasticsearch on HN, my first thought is how much of a shame it is to use something other than Lucene for text search, because of just how powerful it really is.
Elasticsearch is a pain to tune and partition, and the JVM brings a whole set of operational issues, but what's the point of better read/write performance when the actual search performance is worse?
I guess this makes sense for use cases where you care more about speed than the quality of results.
If the search performance were worse, it might indeed make no sense. Regarding Manticore, though, we conducted relevance tests and found it to be on par with Elasticsearch. In fact, objective tests [1] showed that Manticore can even provide better relevance results than Elasticsearch when using almost default settings. You can view the relevant pull request in the BEIR information retrieval benchmark [2].
Do you have any sources for your claims that Lucene is the best search engine, that it is so much more "powerful" that it's a "shame" to use anything else, and that every other engine's "actual search performance is worse"?
This is a very strong claim, and without strong arguments, it's a ridiculous claim.
> You can now execute Elasticsearch-compatible insert and replace JSON queries, which enables the use of Manticore with tools such as Logstash and Filebeat
Looking at the docs, I could only see _create and _doc, but not _bulk endpoint support. How will that work with Logstash and Filebeat?
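For reference, Manticore's native JSON interface looks roughly like this (a sketch; the index and field names are made up, and 9308 is the default HTTP port). Whether Logstash's _bulk output maps cleanly onto the native bulk endpoint is exactly the question:

    # Sketch of Manticore's native JSON endpoints (default HTTP port 9308);
    # index and field names are illustrative.
    import requests

    # Single-document insert
    requests.post("http://localhost:9308/insert",
                  json={"index": "logs", "doc": {"message": "hello"}})

    # Native bulk endpoint: newline-delimited JSON, one action per line
    ndjson = "\n".join([
        '{"insert": {"index": "logs", "doc": {"message": "a"}}}',
        '{"insert": {"index": "logs", "doc": {"message": "b"}}}',
    ]) + "\n"
    requests.post("http://localhost:9308/bulk", data=ndjson,
                  headers={"Content-Type": "application/x-ndjson"})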
From the linked doc: "Partitioning is done manually." That seems like a pretty big limitation to me. For static data, sure, but for non-static data it seems like it would be an operational headache to manage manual partitions, compared to Elasticsearch, which partitions for you (although IIRC the shard count is fixed).
Despite years of users relying on distributed indexes and manual sharding in Manticore, I concur that automation would greatly simplify the process. The development team is working on automating sharding and orchestrating the shards. Hopefully it will be included in the next major release in a few months.
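For context, this is what the manual approach looks like today. A sketch; the shard and host names are made up, and 9306 is the default SQL port. You create and fill each shard yourself, then glue them together with a distributed table that fans queries out:

    # Sketch: manual sharding via a distributed table in Manticore.
    # Shard/host names are illustrative; assumes the default SQL port 9306.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", port=9306, autocommit=True)
    with conn.cursor() as cur:
        # each shard is an ordinary table you manage yourself...
        cur.execute("CREATE TABLE shard_0(title TEXT)")
        cur.execute("CREATE TABLE shard_1(title TEXT)")
        # ...the distributed table fans queries out to local and remote shards
        cur.execute(
            "CREATE TABLE docs type='distributed' "
            "local='shard_0' local='shard_1' "
            "agent='other-host:9312:shard_2'"
        )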
Good idea, but we'd need to check. We have little experience with Graylog, so I'm not even sure whether you can easily replace Elasticsearch/Opensearch with something else in it.
I don't see where they claim that it's faster simply because it's written in C++. They do mention that they use C++ to add low-level optimizations that make queries faster and the memory footprint smaller, but the performance claims in the readme are linked to benchmarks that back them up.
It is unlikely that it is just because of C++; however, we have conducted extensive benchmarking (which, by the way, is fully open source and can easily be reproduced if desired). You can find more information about this at https://manticoresearch.com/blog/manticore-alternative-to-el....
Also, there are architectural changes. They describe how ES can't parallelize a query unless it's spread across multiple index shards, which has its own tradeoffs. Manticore's query engine can parallelize a query on a single shard, which means it scales much more linearly with more cores without having to make those tradeoffs.
C/C++ actually make you write relatively slow code by default, too. Not to the extent of Java, but there is still HUGE room for improvement in libc and, by extension, the STL. I'm working on a slash-and-burn approach to the problem here:
https://github.com/cons-cat/libcat