Toshi: An Elasticsearch competitor written in Rust (github.com/toshi-search)
255 points by whoisnnamdi on Jan 13, 2019 | 93 comments



> What is a Toshi?

> Toshi is a three-year-old Shiba Inu. He is a very good boy and is the official mascot of this project. Toshi personally reviews all code before it is committed to this repository and is dedicated to only accepting the highest quality contributions from his human. He will, though, accept treats for easier code reviews.


Does anyone know of any other prod-ready Elasticsearch alternatives? I'm working on a logging infrastructure project for shipping syslog, and it seems no one these days just uses plain central syslog. ES is the standard, but it seems bloated.

I've been tempted to just ship straight to a DB and skip all these crazy shippers and parsers and all the other middlemen in the equation.

Also, why has no product unified monitoring and logging? AFAIK that's why Splunk is worth it if you have the budget (I don't).


Grafana's new Loki [1] looks very promising. It bills itself as "Prometheus for logging". It doesn't index the text of messages, but it does allow fast filtering predicates on labels and time.

[1] https://grafana.com/loki


ES is not "bloated"; it does require decent hardware, but so does anything else that's taking in gigabytes of data an hour and making it all searchable and indexed. Why bite off working on some weird, less-supported mechanism because of your lack of confidence in Elasticsearch?

> I've been tempted to just ship straight to a dB and skip all these crazy shippers and parsers and all the other middle men in the equation.

You need the parsers. You want to find a needle in a haystack? Good luck without having things broken down into proper fields with metadata.

You need the shippers. Elastic Beats has full backpressure support so that when your cluster is busy, it can intelligently back off. Otherwise, you'll drop logs, or overwhelm the system to the point of uselessness....

> why has no product unified monitoring and logging?

Metricbeat from Elastic, plus Grafana pointed at Elasticsearch, gets you better dashboards and alerting.

Please don't reinvent the wheel on this one. Deploying ELK + Beats + Grafana is not that hard, there's tons of documentation, and it is a very stable product.


Back when I worked at Google, the standard log processing tool was Dremel. You could get exactly the same thing by shipping your logs to BigQuery.

I haven't checked, but I bet it's cheaper than ES for data that's mostly cold, like logs. You will need a separate monitoring solution though.


If you want streaming inserts to BQ, that becomes the biggest cost. Dataflow could be used to turn inserts into batches and gather interesting metrics that you don't want to hit BQ for, but I don't think anyone's open sourced anything in this space. I've implemented streaming inserts to BQ for logs "at scale" and it was still at least an order of magnitude cheaper than Splunk. Happy to talk via email.


what was the monitoring solution at the end, where were BQ results going to?


It was generic log aggregation used mostly for incident response and forensics as well as some offline metrics. There were a bunch of metrics that were being created (like 3 different ways) on box with parsing that we were looking at moving into the log processing stream. We had a chat bot that people could use to interact with common queries as well as standard SQL interaction via UI and API auth'd by Google IAM.


We use Spark on HDFS.


IMO it is not as widespread because it isn't a good practice. Health monitoring, metrics and logging are orthogonal and really should be handled separately. Monitoring makes sure everything is working properly, metrics is about understanding how it is being used, and logging is about peering inside what is happening. Conflating them hinders their application and makes them less useful.


I like the separation of concerns. Do you have an example of conflating these that leads to issues?


When a customer complains that something's not right, and you check your logs, and the logs have been spewing millions of alarming messages for hours that you wish you had seen before the customer noticed issues, that's when you wish the programmer who wrote the log lines had used the health monitoring framework instead of the logging framework.


> health monitoring framework instead of the logging framework.

What is the difference?

Can't you just extract metrics from logs via ES? Logstash even has a prebuilt JAVASTACKTRACEPART pattern for Java exceptions.


That's user/dev error, not a fault of the system or a problem with handling it all in the same place.

Metrics can always be generated from logs, especially structured payloads.


You may take a look at https://blevesearch.com/ written in Go.


If you're looking for SaaS solution, check out DataDog. They offer monitoring and logging (and a bit more) as a packaged service.


As of 6.5, the Elastic Stack ships with a logging app and an infrastructure monitoring app out of the box in Kibana. They are both new, so expect a bunch of new features in 6.6 onward.

The docs have more info about both: https://www.elastic.co/guide/en/infrastructure/guide/current...

Disclaimer: I work on Kibana at Elastic


Clickhouse is a good (self hosted) alternative to Elasticsearch for log storage: it saves a lot of space due to better compression, it supports sql (with regex search instead of useless by-word indexing), and ingestion speed is great.


> (with regex search instead of useless by-word indexing)

Perhaps I misunderstand your situation, but I don't see any "CREATE INDEX" available in Clickhouse, and thus won't "SELECT * FROM logs WHERE match(message, '(?i)error.*database')" require a full column-scan (including, as you mentioned, decompressing it)? Versus the very idea of an indexer like ES is "give me all documents that have the token 'ERROR' and the token 'database'" which need not tablescan anything

I only learned about the project 9 minutes ago so any experiences you can share about the actual performance of those queries would be enlightening -- maybe it's so fast that my concern isn't relevant


Clickhouse is designed for full table scans. It allows one index per table, usually a compound key including the date as the leftmost part of the key. This allows it to eliminate blocks of data that don’t contain relevant time ranges. It is also a column store, so the data being read is only the columns used in the query.

If your query is conceptually linearly scalable, Clickhouse is also linearly scalable. Per-core performance is also pretty good (tens of millions of rows per second on good hardware with simple queries, which most log aggregation queries are).


Clickhouse (like any other SQL DB) would work great if you could chop your log lines up into fields and store one log type per table. Elasticsearch is great for this because you don't have to worry about schema; with ClickHouse you will, unless you use the two-arrays trick: one array for the field names and one for the field values.

If you value being able to store arbitrary log files, ClickHouse is not for you. If you want to build your system to generate tables on the fly, ClickHouse might work.

See: https://github.com/flant/loghouse


You can use materialized views in ClickHouse to simulate secondary indexes. See https://www.percona.com/blog/2019/01/14/should-you-use-click... for an example of this usage. It's about half-way through the article.

Disclaimer: I work for Altinity which is commercializing ClickHouse.


Yes. The TICK stack.

Namely InfluxDB and friends


Since when can you send logs (not just time series) to Influx? That would certainly be news to me, as a huge TICK stack user.


You can send logs to Influx no problem. Telegraf (the collecting agent from the Influx team) has a logparser input plugin which can parse logs via patterns; it understands Apache logs by default. The catch is that there is no full text search. It's still useful, as you can quickly find logs to view by searching fields that have been parsed.


That's really funny. My employer is one of the larger users of influxdb I'm aware of, so much so we had to write some software[1] (we did synthetic benchmarks of 500k metrics per second and it held up just fine) to overcome scaling limitations and I didn't know this. Thanks for the cluebat! TIL!

[1] https://jumptrading.github.io/influxdb/metrics/monitoring/de...


Try https://getseq.net/

It was built for .NET devs using Serilog but can accept structured logs/json over HTTP.


Loki from grafana



Check out Vespa: https://vespa.ai


> Also, why has no product unified monitoring and logging?

In my experience SumoLogic is excellent for this.


just ship to spark?


Neat! I hope this goes far; it'd be great to have a faster, lighter-weight Elasticsearch.

Something similar I'm really hoping to see is Tantivy in a Postgres extension, so I can stop playing the game of trying to keep my search engine and database in sync. Seeing pg-extend-rs (https://github.com/bluejekyll/pg-extend-rs) on HN the other week got me thinking about it again. Does anyone know whether this is feasible or if anyone is working on something in this vein?


Out of curiosity, have you looked at using PostgreSQL's full text search functionality to implement your search engine (e.g. [1])? If so, what do you get out of the combination of Postgres + Elasticsearch that made you choose it over just Postgres full text search?

[1] http://rachbelaid.com/postgres-full-text-search-is-good-enou...


A major problem with Postgres full-text search that those articles don't dwell on too much is that unless your documents are in one of the "chosen languages", you are more likely to find support for your language in a search engine (like Elasticsearch) than to get it in PostgreSQL.

You can convert existing dictionaries into a format Postgres understands, but this is an annoying pain point if you happen to be an open source project like a CMS or communication platform.


I don't get the hype about Elasticsearch at all. Elasticsearch is more suited to searching logs. It doesn't have powerful sort functions, doesn't allow you to use multiple sort parameters, etc.

Apache Solr is more suited to search. Lots of document filters, query filters, the index itself is highly configurable and the ability to sort on multiple parameters is great. LTR is also something too good to miss out on.


The difference is in the size of your Sales team.

Lots of open source projects would benefit if they had dedicated sales teams contacting firms day in, day out, talking about their features.

People who run IT departments in the Enterprise world or even small firms lacking resources to keep up, just pick tools and make software decisions based on who reaches out to them.

Invest in a sales team and you penetrate markets that don't spend time monitoring developments in the tech world which is really the majority of all orgs. Elastic has done that quite well and is reaping the rewards.


Nonsense, Elasticsearch does all these things (and extremely well, too). These are core Lucene features. I'm not aware of much that Solr does that Elasticsearch does not. And yes, I've used both.

Solr was there earlier. In many ways, Elasticsearch was a response to stuff Solr did not do, or did not do well. Like clustering for example. Solr has of course added this since then.


Can you explain what you mean by "multiple sort parameters"? Because it looks to me like you can do that in ES [1]. There is a well maintained LTR plugin for ES. Honestly, Solr and ES are more similar than different. There are a few things Solr has which ES doesn't, and the reverse is true too.

1. https://www.elastic.co/guide/en/elasticsearch/reference/curr...


Thanks for letting me know about Elasticsearch LTR.


Both solr and elasticsearch use lucene to index.


ES doesn't expose as many knobs last I checked.


Last time I checked, PostgreSQL doesn't support any kind of tf-idf ranking scheme because it doesn't track corpus frequencies of terms in the index. This impacts how well relevance ranking works for some workloads.
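For reference, the classic tf-idf scheme scores a term t in a document d as

    w(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is how often t appears in d, N is the total number of documents, and df(t) is how many documents contain t. It's that corpus-wide df(t) count that Postgres's index doesn't keep, so it can't weight rare terms above common ones.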


ZomboDB (https://github.com/zombodb/zombodb) offers something like this.


Didn’t know about this project, very cool.


Thanks! It’s been fun to develop it and we love when people discover and use it


I know you can go pretty far just within postgresql these days without any extensions, even for production grade search.

For my projects that's all I've been doing, in order to avoid the overhead of ensuring ES or Solr are synced with the DB.


Just released a system that does the same - Postgres FTS can get you a long way. I've used Elasticsearch before, and it's a lot of complexity to add, and isn't always necessary.


There’s a super neat plugin for Postgres called “ZomboDB” that hooks up ElasticSearch to Postgres, so when you make full text queries, it gets performed on ES. https://www.zombodb.com/

I had a brief play with it for a project at work and it was super straightforward to get running.



There are some real gems among Rust crates. I'm using tantivy[1]. It has been super easy to set up and it's faaast.

[1]: https://crates.rs/crates/tantivy
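To give a sense of how little setup it needs, the gist is roughly this (a sketch from memory, so check the tantivy docs for the exact signatures):

    use tantivy::schema::{Schema, STORED, TEXT};
    use tantivy::{doc, Index};

    fn main() -> tantivy::Result<()> {
        // Declare a schema with a single indexed + stored text field.
        let mut builder = Schema::builder();
        let title = builder.add_text_field("title", TEXT | STORED);
        let schema = builder.build();

        // Build an in-memory index and write one document into it.
        let index = Index::create_in_ram(schema);
        let mut writer = index.writer(50_000_000)?; // 50 MB writer heap
        let _ = writer.add_document(doc!(title => "Of Mice and Men"));
        writer.commit()?;
        Ok(())
    }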


Toshi is built on top of tantivy if I'm not mistaken...


Yep. Very first line of the README


Thanks! If you can share your use case, I'm interested.


I love these types of projects. I am a big fan of Elasticsearch, but sometimes it feels overly complex, bloated and memory-hungry. I hope someday a decent Rust/C++ alternative will take over. I was following the development of Apache Lucy (https://lucy.apache.org/) but the project has been retired now.


There is Manticore Search (https://github.com/manticoresoftware/manticoresearch) - it's a fork of Sphinx - written in C++, not based on Lucene library. (disclaimer: I was/am involved in both projects)


Have you tried Xapiand? (https://kronuz.io/Xapiand/). It's a Xapian-based search and storage engine written in C++! We've been using it for a while in a number of projects and it's quite simple and reliable.


> I hope someday a decent Rust/C++ alternative will take over.

I found this part of your comment interesting, given that Rust is in some ways being offered as an alternative to C++ for similar use cases.

What would you like to see different in a language that you see as the common issue with C++ and Rust?

edit: I misread the parent comment, the question should be disregarded.


I'm pretty sure he/she was saying they want to see something like Elasticsearch written in Rust/C++ rather than Java and made to be less bloated and less complex.

I would also be very interested in that. Elasticsearch usually works very well for me, especially the latest version, but it feels very heavy, and seems to only be getting worse in that regard.


Oh! I completely misread that. Thanks for clarifying.


Tantivy does not allow schema evolution, but Lucene does; this is a major blocker for dynamic indices.


So technically, it is supposed to allow you to add fields... but it is not well tested, so I try to keep that a secret.

I should probably work on a proper scenario for schema evolution.


So what's the approach in that case? Rebuild the whole DB?


It would be great if this achieved API parity with ES. Being able to swap out parts of the ELK stack would make tools like kibana even more powerful.


I hear you, and I know why you'd say that, but wow the API surface-area of ES is ginormous. Maybe the 80-20 rule goes a long way here, but I wouldn't expect API parity to be a simple matter of exposing the same REST endpoints -- it's the payload that'll be the headache

I actually strongly considered that just with Solr, which has the extreme benefit of using the same query language under the hood, but the more I scratched the more I found it would be a horrific amount of work


Plus the Elasticsearch API isn't especially nice to use. I haven't tried their new SQL, since it requires an enterprise licence or something.


Do you mean encoding the lucene queries as JSON objects into the ES endpoints, or do you mean the actual lucene syntax (as would be surfaced by kibana et al)?


I mean the Elasticsearch API. Kinda what you were referring to in the first part of your sentence, but I don't know why you'd say it like that, especially since the Elasticsearch API covers other things, such as mapping indexes and other cluster administration.


Love the idea. Saw another one in 2018 (https://www.gravwell.io/).

At least in my experience the web interface is a huge gap. Kibana is OK, but Splunk and its query language and visuals are much better. Anything that competes with Splunk is great, imo.


Hey, Corey from Gravwell here. Huuuge updates coming to the web interface this month! You can see teaser screenshots on our website and a blog post coming this week using the new UI.

Subscribe to the blog (we are non-spammy) to get the email announcement.


Hi! (tantivy main dev here.) Do you have an engineering dev blog that discloses information about your backend?


We haven't published a lot on the backend architecture, but there's some info in the docs. I think it would be interesting to chat. Can you hit me up at info@gravwell.io?


Also check out the roadmap:

> https://github.com/toshi-search/Toshi/blob/master/roadmap.md

It seems that the main added functionality of ES, namely clustering, will still take a while to be implemented.


Nice! I’m going to have to check it out.

At the same time, I’m curious on the performance differences with postgres. I’ve been able to get very quick queries from Postgres:

https://austingwalters.com/fast-full-text-search-in-postgres...

The only time it's slow enough that I'd reach for Elasticsearch (for my use cases) is something like log search (so far).


Well, that's a big use case that ES is meant for. There's also the fact that ES is easy to shard. At my previous gig we easily had over a terabyte of data written into ES indices every day, with hundreds of TBs worth of documents searchable in various indices, and none of this is even counting the logging use case. We used it extensively for calculating various aggregations (for reporting/analytics) of events.


Cool project!

But there's an attitude attached which amuses me:

> Toshi will always target stable Rust and will try our best to never make any use of unsafe Rust. While underlying libraries may make some use of unsafe, Toshi will make a concerted effort to vet these libraries in an effort to be completely free of unsafe Rust usage. The reason I chose this was because I felt that for this to actually become an attractive option for people to consider it would have to be safe, stable and consistent. This was why stable Rust was chosen because of the guarantees and safety it provides.

It's an admirable goal, though the fact that it's stated prominently as one of the first things on the front page gives off somewhat of a "doth protest too much" vibe. It's not like "safety" is rare these days. Any project written in Go, Python, Java, C#, Erlang, JS, and a myriad of others will be "safe" as far as memory access is concerned, and in many cases this safety will be easier to achieve than in Rust. As for error handling safety, so far exceptions seem to be more expressive, though the jury is still out.

Basically, if a project stays away from C and C++ and libraries written in them, it's more likely it will be hit by a hardware problem than an inherent language safety / security issue. Luckily, "safety" is for the largest part the default for modern projects.


There's an inside baseball discussion happening here with the larger Rust community that I think, maybe, you're missing. You are absolutely correct in saying that Go, Python et al offer memory safety in a way that low-level languages do not. These languages use techniques that incur some kind of runtime penalty. Erlang, for instance, copies memory like it's going out of style. Now, with Rust you get a similar kind of memory safety but, somewhat uniquely, without the same category of runtime penalty. There's still some, sometimes, and that's where the community discussion around 'unsafe' happens in Rust.

So, you're writing in Rust and you'd like to write code which has the absolute most aggressive performance possible. In Rust parlance this _maybe_ means you need to do unsafe things: turn off bounds checks, fiddle with the raw memory of a structure, allow multiple threads to access the same memory without synchronization, interact with mutable globals and so on. That's great, but you've potentially opened up your program to crashes or security issues. If you're a solo author that's a trade-off you can make based on your needs. Here's the rub: if you use similar techniques in a crate, a shared library for all to use, then you've opted everyone using your crate into the same trade-off, one which they might not have otherwise chosen for themselves. What the Toshi project is saying here is that its design preference is to avoid opting into this trade-off, preserving all the guarantees that Rust can provide at _possibly_ the expense of absolute performance.
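To make that concrete, a tiny hypothetical of the sort of trade being described:

    fn sum(xs: &[u64]) -> u64 {
        let mut total = 0;
        for i in 0..xs.len() {
            // Safe Rust bounds-checks every access:
            // total += xs[i];

            // Unsafe Rust skips the check. Sound here, since i < xs.len()
            // by construction, but one wrong index is undefined behavior,
            // and users of a crate that does this internally can't opt out.
            total += unsafe { *xs.get_unchecked(i) };
        }
        total
    }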

There's a safety-focused subset of the Rust community that takes the presence of an 'unsafe' in a body of Rust code very seriously and this project is participating in that conversation.


This is very well stated. I would clarify one thing in particular. I think (it's hard to speak for everyone) most people have settled on the idea that high-level libraries like this one should avoid unsafe, and rely on lower-level libraries for the parts that genuinely need to work around safety restrictions.

This allows the “unsafe” libraries to have fewer lines of code and more isolated testing, with greater coverage.


You can also, if they're written in C, verify the absence of everything you'd worry about using the automated and lightweight tooling available for C. Then, fuzz it to be sure. Then, port it back to equivalent Rust, maybe with C2Rust. Then, equivalence testing to make sure the two have the same output. That's my current recommendation for medium-assurance apps in Rust that have to use unsafe code. Oh yeah, gotta turn overflow checks on in the safe Rust for best effect, but that might have a performance hit.

I'm still thinking about the rest of concurrency problems and side channels that Rust's type system doesn't cover. Trying to find out what's as easy as above. For concurrency, I'm eyeballing Eiffel's SCOOP, DTHREADS, and eventually will study whatever Pony is doing.


> You can also, if they're written in C, verify the absence of everything you'd worry about using the automated and lightweight tooling available for C. Then, fuzz it to be sure. Then, port it back to equivalent rust maybe with C2Rust.

Not sure I totally understand (or if I do, totally agree). For all FFI in Rust, we have to drop down to unsafe interfaces. I like the model that's generally happening in this area where there is an auto-generated sys crate (with bindgen), then an FFI crate that does all the Rust <-> C interop. This tends to work pretty well.
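In miniature, that layering looks something like this (a hand-written illustration; in practice bindgen generates the extern block for you):

    use std::ffi::CStr;
    use std::os::raw::c_char;

    // What an auto-generated -sys crate boils down to:
    // raw, unsafe declarations of the C symbols.
    extern "C" {
        fn strlen(s: *const c_char) -> usize;
    }

    // The wrapper crate owns the `unsafe` and exposes a safe API.
    fn safe_strlen(s: &CStr) -> usize {
        // Sound because CStr guarantees a valid, NUL-terminated pointer.
        unsafe { strlen(s.as_ptr()) }
    }

    fn main() {
        let s = CStr::from_bytes_with_nul(b"hello\0").unwrap();
        assert_eq!(safe_strlen(s), 5);
    }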

A lot of C/C++ (native library) validation tools just work with Rust artifacts, so I don't personally see a lot of value for writing in C as well as Rust, unless we're talking about a rewrite from C to Rust.

> still thinking about the rest of concurrency problems and side channels that Rust's type system doesn't cover.

What things are you thinking about beyond the Send/Sync traits? I've found those to be very expressive, and appropriately restrictive.
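As a toy illustration of what those traits rule out at compile time:

    use std::rc::Rc;
    use std::sync::Arc;
    use std::thread;

    fn main() {
        let local = Rc::new(0);
        // Does not compile: Rc uses non-atomic reference counts,
        // so Rc<i32> is !Send and can't move to another thread.
        // thread::spawn(move || println!("{}", local));
        drop(local);

        // Arc is the thread-safe equivalent, and this compiles fine.
        let shared = Arc::new(0);
        thread::spawn(move || println!("{}", shared)).join().unwrap();
    }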


"A lot of C/C++ (native library) validation tools just work with Rust artifacts,"

Tools like RV-Match and Astree Analyzer can prove the absence of entire classes of errors using static analysis. Frama-C and SPARK Ada can do that with annotations, with a high amount of the proof automated. There are optional runtime checks for stuff not proven, if you don't want to or can't do it by hand. C also has lots of open-source tools for static/dynamic analysis, test generation in many forms, and so on. In my Brute-Force Assurance concept, you convert a program into C and/or Java to throw all the automated tooling you can at them, fixing whatever real issues are found. Then, the last benefit is that C has a formally-verified compiler, should someone want to eliminate compilers as an error source. Rust-to-C is still valuable for all these reasons.

So, is the Rust tooling at the level that you can do all that without converting it to C?

"What things are you thinking about beyond the Send/Sync traits? I've found those to be very expressive, and appropriately restrictive. "

I don't use Rust yet. I just know what I learn from folks like you who do. When I studied it, the docs said their type system blocked some but not all concurrency problems. I don't know where it's at currently on various types of races, deadlocks, and livelocks. Those are the main problems improved models or analyzers should try to solve.


KRust looks like the rv-match for Rust: https://news.ycombinator.com/item?id=16970050

But comparing Rust to C is difficult. On the one hand they compile down to the same thing, on the other when not using unsafe, the type system itself allows you to express strong proofs, especially with state machines.

I don’t generally need to make use of these tools, I would point you to the ring project, where they are very interested in formal proofs: https://github.com/briansmith/ring they might have some interesting options and a few of the people over there are very capable in answering these questions.

> I don't know where it's at currently on various types of races, deadlocks, and livelocks.

For dataraces, there is a very strong story. In terms of deadlocks, I'm not aware of anything here. For livelocks, I think it's generally possible to define the state of a system such that you can make sure you aren't in conflict with other threads; so it's better, but not fundamentally different from other threaded languages. In other words, if you define cross-thread state in an appropriate way, you can prove that you can't get into a livelock situation.


"KRust looks like the rv-match for Rust"

I was in that comment section bringing up RV-Match. The difference is that RV-Match is a bunch of static-analysis functionality built on a comprehensive semantics for C. KRust is a tiny subset of Rust with RV-Match-style analysis. I did bookmark it in case it could be useful for someone trying to build that.

"where they are very interested in formal proofs"

The only thing I mentioned that would be doing a formal proof was the lightweight stuff like Frama-C and SPARK Ada that don't require proof so much as annotation (e.g., like the borrow checker) run through automated provers. The rest was all push-button or no-proof-needed tools that say something has no errors, specific errors, or a mix of them and false positives. RV-Match and Astree Analyzer will straight-up tell you that specific errors don't exist, with few false positives. The test generators that work on the structure of your code get deep into lots of errors from different combinations of inputs and control flow. None of these require proof. People building these usually test them on FOSS projects, sometimes the same ones, often finding new errors in them.

"In terms of deadlocks, I’m not aware of anything here. "

Good to know. That'll be the focus area for now, since you said livelocks are in a good situation. I'll still keep an eye out on the side for anything checking that.

"they might have some interesting options and a few of the people over there are very capable in answering these questions."

Thank ya very much. I'll hit them up, too.


> Go, Python, Java, C#, Erlang, JS, and a myriad of others

But few of them have guarantees about your safety as strong as Rust's. Python, Erlang, and JS will not enforce anything and will raise recoverable errors on mismatch. Java and C# will enforce a bit of correctness and raise recoverable errors. Go is the most strict, but happy to panic at runtime if you cast to the wrong interface. And that's before we talk about things like defined integer overflow behaviour, safety under concurrent execution, etc.

Those languages do protect from accidental memory overflows, but Rust offers a lot more in safe code.
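On the integer overflow point specifically, the behaviour is defined and you can choose it explicitly in safe code:

    fn main() {
        let x: u8 = 250;

        // `x + 10` panics in debug builds and wraps in release builds;
        // a defined choice either way, not UB as in C.

        // The explicit, always-defined alternatives:
        assert_eq!(x.checked_add(10), None);       // overflow reported as None
        assert_eq!(x.wrapping_add(10), 4);         // two's-complement wrap
        assert_eq!(x.saturating_add(10), u8::MAX); // clamp at the maximum
    }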


Let's not go too far; Rust has plenty of panics at runtime.


Do you mean something that's not validated by the compiler, or explicit unwrap/panic calls?


Explicit unwrap/panic calls, as well as some numerical operations (in debug mode) that will panic on overflow. Indexing into arrays/slices outside bounds will also panic.

This was in response to the point that Go panics, and I didn't think it was perfectly fair to use that as an example (I have others that I would use) since Rust also panics in similar situations.


Go does panic on a wrong interface cast, which doesn't have an equivalent outside unsafe mode in Rust, so I think it's a valid comparison.

> as well as some numerical operations (in debug mode) that will panic on overflow

That's a good thing. You normally want to find them in debug mode.

> Indexing into arrays/slices outside bounds will also panic.

Kind of. There's .get which returns an Option to do it safely, but you're right that basic indexing can fail.
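Concretely:

    fn main() {
        let xs = [10, 20, 30];
        assert_eq!(xs.get(1), Some(&20)); // safe: returns an Option
        assert_eq!(xs.get(9), None);      // out of bounds, no panic
        // let v = xs[9];                 // this would panic at runtime
    }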


The standard library provides safe downcasts on trait objects, which aren't all that dissimilar to Go interfaces.
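For example, with std::any::Any a failed downcast hands back None instead of panicking:

    use std::any::Any;

    fn describe(value: &dyn Any) {
        // Checked downcast: None on mismatch, unlike a bare Go
        // type assertion, which panics.
        match value.downcast_ref::<String>() {
            Some(s) => println!("a String: {}", s),
            None => println!("something else"),
        }
    }

    fn main() {
        describe(&String::from("hello"));
        describe(&42i32);
    }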


Yes, generally there are alternative interfaces for every one that panics that can be used to almost ensure that your code won't panic in Rust. As of today, that's by convention, not enforced by the compiler.

There are situations where panicking is something you want to try and guard against, so I hope something like this gains traction and is able to be used more generally: https://crates.io/crates/no-panic
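Usage is a single attribute, and the crate turns any reachable panic into a link-time error. A sketch based on my reading of its README, so check the crate docs:

    use no_panic::no_panic;

    #[no_panic]
    fn first(xs: &[u32]) -> Option<u32> {
        // Compiles: .get() has no panic path. Writing `Some(xs[0])`
        // instead would fail at link time, since indexing can panic.
        xs.get(0).copied()
    }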


Cute!


dang?


Yet another great piece of work by the Rewrite-It-In-Rust Task Force (a sister organization to the Rust Evangelism Strike Force)!



