I highly recommend this software if its capabilities cover what you need. It is very fast, both in indexing speed and search speed. It's relatively simple to set up and start working against, and I have found it very reliable.
It works best when you have a SQL data store you want to index against, but with the real-time index you can treat it more like Elasticsearch and other search engines. However, for that first use case of indexing a SQL store, I don't know of anything else that comes close to being as easy to use.
Simply point it at your database, give it a query to pull what you want to index, and you are done. I suspect this covers about 90% of use cases out there.
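To give a flavor, this is roughly what that looks like in manticore.conf (a minimal sketch only; the source name, credentials, and query are placeholders, and I'm assuming a MySQL backend):

    source products_src
    {
        type      = mysql
        sql_host  = localhost
        sql_user  = app
        sql_pass  = secret
        sql_db    = shop
        # the query that pulls exactly what you want indexed
        sql_query = SELECT id, title, description FROM products
    }

    index products
    {
        source = products_src
        path   = /var/lib/manticore/products
    }

After that, a run of the bundled indexer tool builds the index and you can start querying.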
If you need more than what the DB's native indexing gives you, give it a try.
Our content is in HTML format but stored inside an MSSQL database. It worked out of the box for me. I don't think that Manticore comes with a crawler; if you need one, you'd have to implement it yourself. I did just find this: https://medium.com/@s_nikolaev/60-lines-of-code-web-crawler-...
> works best when you have a SQL data store you want to index against, but with the real-time index you can treat it more like Elasticsearch and other search engines
This sounds like a bit of a killer use case, but even explicitly searching the documentation I can't find more than tease-level information about it. They seem to be hyper-focused on just presenting it as a replacement for Elasticsearch.
> They seem to be hyper-focused on just presenting it as a replacement for Elasticsearch.
Probably it's because they think that's where the money is, and they see ES as a train that can pull them forward as well.
Not long ago, ~2 years or so, I had a case where a dev team (okay, it was just two developers) asked to set up ES to offload some searches from the DB (MySQL). I asked why not use Sphinx/Manticore, and one of the reasons they gave for wanting ES was that Laravel [PHP framework] has some nice libs to work with it, while Manticore seemed like more work for them. Quite shocking reasoning from my POV, but a real story.
We didn't, hence we may have inadvertently improved the wrong areas. Although we have been receiving feedback from the community, it was never backed by concrete data, and thus we had to rely on our best guess.
This is the best appeal for OSS I’ve seen in a while.
Actively telling a project how to improve vs. passively being observed in how you use the system in aggregate: that is a choice each person should be offered when engaging with a project, along with the right to lurk in peace or fix their problems in their own way.
The data itself ought to be public, and if it can’t be then it shouldn’t be gathered. The insights from that data should come from the community, not just project leads. That data’s relevance to product development and design should be annotated on feature tickets.
I am good with having a relationship with a project that involves sharing my behavior within the system; others are not. As a user, I do want to be able to see whether I'm way off the beaten path or using a popular method when I'm deciding whether to debug, research, or work around a problem. As a maintainer team, your time is precious and the BS is already thick, so I feel providing that visibility is the least I can do to contribute to your work.
I feel very differently about this with respect to for-profits and my PII tied to behavioral records.
They are "excited"? No end-user wants telemetry. They want software that works reliably. And when it doesn't work, they want customer service. Rarely do they get either.
If end-users wanted "telemetry", then they would have asked for it in previous versions.
Even once telemetry is added, there is still no reason to enable it by default. End-users that want/need to send data to the company, e.g., to back up their claims about deficiencies in the software, can easily enable it.
Show us the contractual restrictions that limit how the data sent by the end-user can be used. Show us the guarantee or enforceable promise that this data collection will result in improvements that end-users want (versus only benefitting the company in some undisclosed way(s)).
Why would anyone want to send behavioural data to a company with no enforceable promise of a benefit and no way to monitor how the data is used?
I've been running Manticore (previously SphinxSearch) on a faceted search heavy site with a million MAUs for 15 years. I'd definitely use it again for another project.
If the data that you want to search is entirely contained in a SQL database, it's an uncomplicated and powerful solution, definitely check it out. If not, Manticore may still be a nice solution for you, but I can't speak to that.
Coincidentally, we've been using first Sphinx and then Manticore for over 15 years as well. In our case it's fed each night with XML generated by Java code from data stored in MySQL databases. We index over 294M pseudo-documents.
From Typesense's site: "Typesense is an in-memory datastore", "If your dataset size is 1GB, you'd need between 2GB - 3GB RAM to hold the whole index in memory."
Manticore is different in this regard, especially with the Manticore columnar storage, which doesn't require a significant portion of the data set to be stored in memory. This allows, for example, a 1TB data set to be served on a standard server with some 32GB of RAM.
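For the curious, creating such a table is a one-liner over Manticore's MySQL-protocol port. A minimal sketch, assuming the default SQL port 9306 with the columnar library installed; the table and column names are made up:

    # Sketch: a table backed by Manticore's columnar storage.
    # Assumes the default MySQL-protocol port 9306; names are illustrative.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", port=9306, autocommit=True)
    with conn.cursor() as cur:
        # engine='columnar' keeps attributes on disk rather than in RAM
        cur.execute("CREATE TABLE docs(title TEXT, price FLOAT) engine='columnar'")
        cur.execute("INSERT INTO docs(title, price) VALUES ('red shoes', 9.99)")
        cur.execute("SELECT id, price FROM docs WHERE MATCH('shoes')")
        print(cur.fetchall())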
In my opinion, the main difference is the history: this search engine has been used for 2 decades in various production sites.
Manticore is an open-source fork of Sphinx Search, which released its first version in 2001. The fork started after Sphinx went from open source to proprietary at the end of 2017. The engine is stable and battle-tested. IIRC, Craigslist uses Sphinx.
I have been following these two libraries (Manticore and Meilisearch) very closely. Their simplicity, portability and performance gains over Elasticsearch are impressive.
Two days ago I started creating Python bindings for the core search engine of each of these two libraries, starting with https://github.com/AlexAltea/milli-py.
The goal is extreme performance, but as an embedded/self-contained package (basically the same goals as SQLite).
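To give an idea of what I'm aiming for, here's a hypothetical usage sketch; the actual milli-py API may differ, so treat the method names as assumptions:

    # Hypothetical sketch of an embedded, SQLite-style search workflow.
    # The milli-py API shown here is an assumption, not documented fact.
    import milli

    index = milli.Index("./movies.idx")    # index lives in a local directory
    index.add_documents([
        {"id": 1, "title": "The Matrix"},
        {"id": 2, "title": "Blade Runner"},
    ])
    print(index.search("matrix"))          # everything runs in-process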
Regarding performance, I hope it's not the same as Grafana's Loki.
Grafana Loki advertises lower resource requirements, but it's just a disk storage system. Any query will read everything from disk.
Elasticsearch has big RAM requirements if you create a lot of indexes, of course. You can't have something quicker than indexes, and you can't have lower resource requirements without having fewer indexes.
Tags in Loki are things like host, application, and environment. When searching by those tags and a time interval, it will read everything from disk. So any query that filters by, e.g., SessionId or a keyword from the log like Exception will read all the logs from disk. This can take ages if you have a lot of logs and a big time frame. Compare that with Elasticsearch, which can index anything, like SessionId or the log message, and return the result in an instant, without even reading the disk.
> When searching by those tags and a time interval, it will read everything from disk
That's what I'm asking, actually. Isn't Loki's proposition that it only indexes the tags and time interval? Do you mean that even filtering by that there's still a lot of data to go through?
Because it seems like you're saying it always fetches everything from disk.
> Isn't Loki's proposition that it only indexes the tags and time interval? Do you mean that even filtering by that there's still a lot of data to go through?
Yes
> Because it seems like you're saying it always fetches everything from disk.
If you specify a tag, like environment, it will not read data for other environments from disk. But tags like environment/host/timeframe are not enough if you want to query for something like error/exception/sessionid, and you might have to wait minutes or hours for a query that covers a lot of data.
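To make that concrete, here's a sketch of the two kinds of queries against Loki's standard HTTP API (the port is the usual default; the label values are made up). The stream selector is answered from Loki's label index, while the line filter forces a scan of every chunk the selector matched:

    # Sketch: both queries go to Loki's query_range endpoint, but only the
    # stream selector is served by the label index; the |= line filter has
    # to scan every matching chunk on disk.
    import requests

    LOKI = "http://localhost:3100/loki/api/v1/query_range"

    cheap = '{environment="prod"}'                     # label index only
    expensive = '{environment="prod"} |= "Exception"'  # scans selected chunks

    for q in (cheap, expensive):
        r = requests.get(LOKI, params={"query": q, "limit": 100})
        print(q, "->", r.status_code)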
Xapian is a library but is licensed under GPL, so you can't build on it without making your whole app GPL.
You can get around that by having the search happen in a separate process or something, maybe. But this is a huge issue for something that one might want to embed.
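Something like this is what I mean. A sketch only: "xapian-worker" is a hypothetical GPL-licensed helper binary speaking a line-based protocol, and whether this actually keeps you outside the GPL's reach is a question for a lawyer:

    # Sketch: keeping a GPL-licensed search engine at arm's length in its
    # own process. "xapian-worker" is a hypothetical helper binary that
    # reads one query per line on stdin and writes JSON results on stdout.
    import json
    import subprocess

    proc = subprocess.Popen(
        ["./xapian-worker", "--db", "./index"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )

    def search(query: str) -> list:
        proc.stdin.write(query + "\n")
        proc.stdin.flush()
        return json.loads(proc.stdout.readline())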
That is a slight misunderstanding of how open-source licensing works.
The GPL's "bleed" only happens if you distribute your application, meaning you sell or give away binary packages for customers to install. If your product is a hosted API that you do not distribute, you do not trigger that clause.
Also, a lot of open-source projects handle this by licensing the core engine under a copyleft license (GPL, AGPL), while the language connectors and bindings are licensed under the less restrictive Apache license. Unless you are offering a SaaS version of the product itself, you are more likely interacting with the connectors anyway. MongoDB is a classic example of this model.
Yes, if you don't distribute it, the license doesn't matter. That is more a loophole than the intention, but it is correct.
The point is that I can use SQLite, Tantivy, RocksDB, ... in my app no problem. I can make it open core; I can make it AGPL, BSD, MIT; no problem. Because those things are meant to be embedded. But I almost certainly can't use Xapian.
Let's be honest, if I want a search solution for use in my SaaS, I will grab Elasticsearch or an equivalent, I have no need for a library. It seems to me that the only use case where Xapian could really shine is crippled by their license. That is a shame.
It can be a problem if you intended to make an embeddable search engine within applications meant to be executed by your end-users (as is the case with milli-py above).
What are people looking for in alternatives to Elasticsearch? I have been toying in Docker with a version of Elasticsearch, FSCrawler, and Workplace Search to give a company better access to their knowledge base. They have Exchange, manuals, emails, images & video, GitHub, and other stuff… do these alternatives have connectors too? Any experience with this?
Anytime I see an alternative to Elasticsearch on HN, my first thought is how much of a shame it is to use something other than Lucene for text search, because of just how powerful it really is.
Elasticsearch is a pain to tune and partition, and the JVM brings a whole set of operational issues, but what's the point of better read/write performance when the actual search performance is worse?
I guess this makes sense for use cases where you care more about speed than the quality of results.
If the search performance were worse, it might indeed make no sense. Regarding Manticore, though, we conducted relevance tests and found it to be on par with Elasticsearch. In fact, objective tests [1] showed that Manticore can even provide better relevance results than Elasticsearch when using almost default settings. You can view the relevant pull request in the BEIR information retrieval benchmark [2].
Do you have any sources for your claims that Lucene is the best search engine, that it is so much more "powerful" that it's a "shame" to use anything else, and that every other engine's "actual search performance is worse"?
This is a very strong claim, and without strong arguments, it's a ridiculous claim.
> You can now execute Elasticsearch-compatible insert and replace JSON queries, which enables the use of Manticore with tools such as Logstash and Filebeat
Looking at the docs, I could only see _create and _doc, but not _bulk endpoint support. How will that work with Logstash and Filebeat?
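For reference, Manticore's native JSON interface looks roughly like this (a sketch; the index and field names are made up, and 9308 is the default HTTP port). Whether Logstash's _bulk output maps cleanly onto the native bulk endpoint is exactly the question:

    # Sketch of Manticore's native JSON endpoints (default HTTP port 9308);
    # index and field names are illustrative.
    import requests

    # Single-document insert
    requests.post("http://localhost:9308/insert",
                  json={"index": "logs", "doc": {"message": "hello"}})

    # Native bulk endpoint: newline-delimited JSON, one action per line
    ndjson = "\n".join([
        '{"insert": {"index": "logs", "doc": {"message": "a"}}}',
        '{"insert": {"index": "logs", "doc": {"message": "b"}}}',
    ]) + "\n"
    requests.post("http://localhost:9308/bulk", data=ndjson,
                  headers={"Content-Type": "application/x-ndjson"})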
From the linked doc: "Partitioning is done manually." That seems like a pretty big limitation to me. For static data, sure, but for non-static data it seems like it would be an operational headache to manage manual partitions, compared to Elasticsearch, which partitions for you (although IIRC the shard count is fixed).
Despite years of users relying on distributed indexes and manual sharding in Manticore, I concur that automation would greatly simplify the process. The development team is working on automating sharding and orchestrating the shards. Hopefully it will be included in the next major release in a few months.
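For context, this is what the manual approach looks like today. A sketch; the shard and host names are made up, and 9306 is the default SQL port. You create and fill each shard yourself, then glue them together with a distributed table that fans queries out:

    # Sketch: manual sharding via a distributed table in Manticore.
    # Shard/host names are illustrative; assumes the default SQL port 9306.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", port=9306, autocommit=True)
    with conn.cursor() as cur:
        # each shard is an ordinary table you manage yourself...
        cur.execute("CREATE TABLE shard_0(title TEXT)")
        cur.execute("CREATE TABLE shard_1(title TEXT)")
        # ...the distributed table fans queries out to local and remote shards
        cur.execute(
            "CREATE TABLE docs type='distributed' "
            "local='shard_0' local='shard_1' "
            "agent='other-host:9312:shard_2'"
        )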
Good idea, but we'd need to check. We have little experience with Graylog, so I'm not even sure whether you can easily replace Elasticsearch/Opensearch with something else in it.
I don't see where they claim that it's faster simply because it's written in C++. They do mention that they use C++ to add low-level optimizations that make queries faster and the memory footprint smaller, but the performance claims in the readme are linked to benchmarks that back them up.
It is unlikely that it is just because of C++; however, we have conducted extensive benchmarking (which, by the way, is fully open source and can easily be reproduced if desired). You can find more information about this at https://manticoresearch.com/blog/manticore-alternative-to-el....
Also, there are architectural changes. They describe how ES can't parallelize a query unless it's spread across multiple index shards, which has its own tradeoffs. Manticore's query engine can parallelize a query on a single shard, which means it scales much more linearly with more cores without having to make those tradeoffs.
C/C++ actually make you write relatively slow code by default, too. Not to the extent of Java, but there is still HUGE room for improvement in libc and, by extension, the STL. I'm working on a slash-and-burn approach to the problem here:
https://github.com/cons-cat/libcat