Liftbridge: Lightweight, fault-tolerant message streams (github.com/liftbridge-io)
149 points by tepidandroid on July 29, 2019 | 26 comments



If anyone wants to hear some experience using NATS in production (Liftbridge is based on NATS):

* standalone server is mature and very stable, crunching messages with incredible speed

* server-side error handling is OK, no problems here

* very simple, text-based protocol, even simpler than STOMP (see the transcript after this list)

* the C, Rust, Java, and Python client libraries are weak; the C library even contains state-transition errors where it can lock up on reconnect ( https://github.com/nats-io/nats.c/issues/217 ). The (unofficial) Rust library is unusable. Some network errors will not make nats.c reconnect, which can be a surprise in prod.

* arbitrary hardcoded limit on message size: 1 MB
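
To give a flavour of how simple the protocol is, here is a hand-written illustration of a raw session (subject and payload invented; real lines are CRLF-terminated). One client subscribes, another publishes, and the server delivers:

    SUB greetings 1       client subscribes, picks subscription id (sid) 1
    PUB greetings 5       client publishes, 5-byte payload follows
    hello
    MSG greetings 1 5     server delivers to sid 1, 5-byte payload follows
    hello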


NATS is a solid piece of engineering. I rank it alongside Postgres and Memcached in terms of reliability and convenience. It just works.

I have an app in production that publishes real-time events via NATS at high volume, and it has never failed. You do have to understand its design -- NATS can drop messages, and your application has to deal with this.

One lesser-known feature of NATS is that it's embeddable -- if you're writing a Go app, you can use it as a library. You end up with a single statically linked binary that can do everything out of the box.
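
To illustrate, a minimal sketch of embedding the server (not the parent's code; it assumes the nats-server v2 and nats.go modules, with error handling elided to panics):

    package main

    import (
        "time"

        "github.com/nats-io/nats-server/v2/server"
        "github.com/nats-io/nats.go"
    )

    func main() {
        // Start NATS in-process on a random free port.
        ns, err := server.NewServer(&server.Options{Port: server.RANDOM_PORT})
        if err != nil {
            panic(err)
        }
        go ns.Start()
        if !ns.ReadyForConnections(5 * time.Second) {
            panic("embedded NATS server did not start in time")
        }
        defer ns.Shutdown()

        // The same binary now connects to itself like any other client.
        nc, err := nats.Connect(ns.ClientURL())
        if err != nil {
            panic(err)
        }
        defer nc.Close()
    }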

Liftbridge looks like a neat way to add a layer of statefulness to NATS. I have an app that will need a log abstraction, and have been hoping to avoid integrating any oversized guns like Kafka (whose dependency on ZooKeeper makes it operationally more complex) and Pulsar for what is a fairly simple system.


We've been using NATS in prod for a very specific use case: request/reply.

We'd got tired of trying to hack around SQS. After a boatload of testing, we settled on NATS. It was our big "innovation token" for that project. It has paid off wonderfully.

The server clusters very well, and is very simple to peer. We've joined two clusters across two regions, and we've not had any issues.

If you need a fast, low-latency, non-persistent, non-guaranteed message delivery system, NATS is for you.

If you like to store messages in your queues for later consumption, or need _every_ message, NATS isn't for you.


What problems did you have with SQS?


Nothing really. We were asking it to do something it's not really supposed to do.

Basically we needed request/reply semantics. (Forgive the explanation.) It's where an instance of the front end sends a request to any of the back-end servers, and a single backend replies directly to the same front-end instance.

Basically we needed something with semantics similar to a load balancer, but in messaging form.

SQS is only a 1-n queue system. So we could deliver a message to a single backend in a cluster, but we couldn't get the reply back to the producer without building a message-routing system. We used DynamoDB and repeated polling, but that wasn't cheap to scale.
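
For reference, that reply routing is what NATS gives you out of the box. A rough sketch with the Go client (subject and queue-group names are invented):

    package main

    import (
        "fmt"
        "time"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            panic(err)
        }
        defer nc.Close()

        // Backend: a queue subscription means NATS hands each request
        // to exactly one member of the "backends" group -- a load
        // balancer in messaging form.
        nc.QueueSubscribe("work.requests", "backends", func(m *nats.Msg) {
            nc.Publish(m.Reply, []byte("done"))
        })

        // Frontend: Request publishes with a private inbox subject as
        // the reply-to, so the answer comes straight back to this
        // exact instance with no routing layer of our own.
        resp, err := nc.Request("work.requests", []byte("job"), 2*time.Second)
        if err != nil {
            panic(err)
        }
        fmt.Println(string(resp.Data))
    }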

We might have been able to use SNS as a broadcast bus of some sort, but that seemed like a backwards step.

We still use SQS for virtually everything else. NATS is kept for that one use case.


> Basically we needed something with semantics similar to a load balancer, but in messaging form.

Why not stick with a load balancer? I've been struggling to find a good use case for request-response over message queues so I'm curious.


Let me try and draw a diagram:

Api-gateway -> Lambda -> NATS -> docker backend on GPU

The backend is C++ and needs to be shielded as much as possible from the web. It also takes a fair wedge of time to deal with requests (1-5 seconds). The rationale is that the Lambda front end validates and routes, and the backend replies.

Before we moved to NATS, we had the added bonus that using SQS meant there was no possible link between front end and backend apart from SQS. (Both Lambda and SQS are outside the VPC by default.)

The backend has a strict schema, and to keep moving parts down it seemed wise to avoid having to plumb in a webserver as well. Webservers really are badly suited to this sort of thing, and they add a boatload of extra latency.

We have a large number of machines, all of which deal with certain areas. This allows us to intelligently cache hotter areas closer to clients.

Request/reply is a reasonable pattern for interacting with backends that are constantly changing capacity based on demand. It can also, if done correctly, be a way to optimise for latency over capacity (depending on how you do it).

With NATS you can ask for more than one response, which might also be useful. Some other systems use a "fastest reply wins" approach, which guarantees a fast response at the expense of capacity.
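
A rough sketch of that multiple-response variant (it assumes an existing connection nc and the fmt/time/nats.go imports from the sketch further up; subject names invented): instead of Request, subscribe to your own inbox and gather whatever arrives before a deadline.

    // Publish with an explicit inbox as the reply subject, then
    // collect replies from it ourselves instead of taking just one.
    inbox := nats.NewInbox()
    sub, err := nc.SubscribeSync(inbox)
    if err != nil {
        panic(err)
    }
    nc.PublishRequest("work.requests", inbox, []byte("job"))

    // Gather up to 3 replies; a "fastest reply wins" scheme would
    // take the first and ignore the rest.
    for i := 0; i < 3; i++ {
        msg, err := sub.NextMsg(500 * time.Millisecond)
        if err != nil {
            break // timed out waiting for more replies
        }
        fmt.Println(string(msg.Data))
    }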


Regarding the NATS C client and the issue reported here: this is being worked on and should be addressed in v2.1.0.


We are always interested in feedback on how we can do better. Feel free to jump on our Slack channel to join the community. We are actively working on the C client, and we'd be interested in more details on the Python and Java issues as well.


It looks like Liftbridge occupies a nice little niche between full-blown, durable Kafka and lightweight, ephemeral NATS.

Additional reading: https://bravenewgeek.com/introducing-liftbridge-lightweight-...


While searching for comparable solutions to understand what this is about, I found this website, which has a very good summary of queue systems and the like: http://queues.io


This is based on NATS. NATS just had their 2.0 release: https://nats-io.github.io/docs/whats_new/whats_new_20.html


Author here. Happy to answer any questions.


Hi, what are the message delivery guarantees with Liftbridge? I would like to replace a PostgreSQL-based hand-made queue system, but I need the transactional safety of a database. Should I use Liftbridge?

Thank you very much for your attention!


It has at-least-once delivery guarantees currently. I have plans for some lightweight transactional-style publishing with an idempotent producer though.


What are the principal use cases?


Basically the same types of use cases as Kafka and similar systems: event sourcing, CQRS, stream processing (e.g. processing click-stream events), data pipelines (e.g. log or metric aggregation used to feed multiple backends), or as a replicated transaction commit log (e.g. use it to create materialized views, populate caches or indexes, etc.). Log compaction means it can do this efficiently by just retaining the current state and discarding older transactions.

https://kafka.apache.org/uses


I believe NATS streaming server also fills this niche: https://github.com/nats-io/nats-streaming-server

Can anyone speak to the current state of each project, pros & cons?


NATS is a pub/sub server written in Go. It's designed for simplicity and performance with features like request/reply and 1-hop fanout but no persistence.

NATS Streaming (aka STAN) was an answer to persistence, built by layering acks and storage on top of the base NATS protocol, resulting in another protocol on top with messages stored as files or backed by a database. STAN has limitations on topics/subscriptions and is designed as a single master with multiple failovers, but it works well if you need a simpler self-contained alternative to Kafka. A Raft-based clustering option was added later, but it isn't a great architecture and has a major performance impact; the docs recommend you use the failover model instead.

Liftbridge offers a better take that uses the standard NATS protocol: it basically functions as an invisible subscriber to the topics and just logs all the messages, using multiple Raft groups, which lets you scale out by using more topics. NATS itself has recently been revamped with the 2.0 release, which adds some major new features, and they're working on a better answer for the persisted/distributed log offering.

If you need fast pub/sub for ephemeral messaging, then NATS is great. If you need persistence and a single server gives enough throughput, then use NATS Streaming (with failover if you need it). If you need more persistence throughput, then use Liftbridge. Although once you start getting to these scales, I would recommend looking at Kafka (which has also gotten much better with v2.0+) or something like Apache Pulsar, which is more advanced and scalable than both. There are also commercial options like Solace, with a free tier, that are worth checking out.


Just to clarify, Liftbridge relies on a single Raft group used purely for metadata replication, i.e. control plane, not data plane. Streams themselves are replicated using a protocol very similar to Kafka (ISR-based with followers fetching messages from the leader's log).


The intro blog [1] mentioned in my other comment gives a detailed run-down of differences between the two. Not sure it can be summarized any more succinctly so I'll just paste an excerpt here:

"NATS Streaming provides a similar log-based messaging solution. However, it is an entirely separate protocol built on top of NATS. NATS is an implementation detail—the transport—for NATS Streaming. This means the two systems have separate messaging namespaces—messages published to NATS are not accessible from NATS Streaming and vice versa. Of course, it’s a bit more nuanced than this because, in reality, NATS Streaming is using NATS subjects underneath; technically messages can be accessed, but they are serialized protobufs. These nuances often get confounded by first–time users as it’s not always clear that NATS and NATS Streaming are completely separate systems. NATS Streaming also does not support wildcard subscriptions, which sometimes surprises users since it’s a major feature of NATS.

As a result, Liftbridge was built to augment NATS with durability rather than providing a completely separate system. To be clear, it’s still a separate server, but it merely acts as a write-ahead log for NATS subjects. NATS Streaming provides a broader set of features such as durable subscriptions, queue groups, pluggable storage backends, and multiple fault-tolerance modes. Liftbridge aims to have a relatively small API surface area.

The key features that differentiate Liftbridge are the shared message namespace, wildcards, log compaction, and horizontal scalability. NATS Streaming replicates channels to the entire cluster through a single Raft group, so adding servers does not help with scalability and actually creates a head-of-line bottleneck since everything is replicated through a single consensus group (n.b. NATS Streaming does have a partitioning mechanism, but it cannot be used in conjunction with clustering). Liftbridge allows replicating to a subset of the cluster, and each stream is replicated independently in parallel. This allows the cluster to scale horizontally and partition workloads more easily within a single, multi-tenant cluster."

[1] https://bravenewgeek.com/introducing-liftbridge-lightweight-...


Good explanation, but I feel like it needs to be "bottom line up front"; the last two sentences are the most important:

"Liftbridge allows replicating to a subset of the cluster, and each stream is replicated independently in parallel. This allows the cluster to scale horizontally and partition workloads more easily within a single, multi-tenant cluster."


Say I need to add messaging with persistence (at-least-once delivery) to my stack (multiple services, often but not always on the same machine), and I want the stack to have no external dependencies, so it's easy to install. Does Liftbridge fill this niche? Would NATS Streaming be a better solution, or maybe Akka with the Persistence module?


I can't speak for NATS Streaming or Akka, but it does look like Liftbridge can fit that niche very well, depending on what features and semantics you need. Your services that publish can be assured that a message has been persisted to a quorum of replicas (strictly, the in-sync replicas) before receiving an acknowledgement, so you can ensure that all onward processing by your downstream services occurs at least once for all acknowledged messages.

If you require fanout to your stream consumers, this is supported simply by having multiple consumers subscribe to each stream. A competing-consumers model isn't directly supported, but load-balance groups let you achieve something similar with a static configuration: your consumer population can't resize dynamically, but you can distribute messages across a fixed set of streams, each with its own consumer.

Liftbridge doesn't yet support durable consumer-offset checkpointing, although this is advertised as being on the roadmap. In the meantime, if you want consumer offsets to survive consumer restarts, the consumer itself needs to take care of persisting its current offset.
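
To illustrate the pattern, a generic Go sketch (the client API is deliberately left out; only the checkpointing itself is shown): record the next offset to consume in a local file after each message is processed, and read it back on startup.

    package checkpoint

    import (
        "encoding/binary"
        "os"
    )

    // loadOffset returns the last checkpointed offset, or 0 if none.
    func loadOffset(path string) int64 {
        b, err := os.ReadFile(path)
        if err != nil || len(b) != 8 {
            return 0
        }
        return int64(binary.BigEndian.Uint64(b))
    }

    // saveOffset atomically records the next offset to consume.
    func saveOffset(path string, off int64) error {
        var b [8]byte
        binary.BigEndian.PutUint64(b[:], uint64(off))
        tmp := path + ".tmp"
        if err := os.WriteFile(tmp, b[:], 0o644); err != nil {
            return err
        }
        return os.Rename(tmp, path) // atomic replace on POSIX
    }

A subscriber would call loadOffset on startup to pick its resume point, and saveOffset(path, offset+1) after processing each message. Since delivery is at least once, processing must tolerate the replays that happen if the consumer crashes between processing and checkpointing.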


This is correct. Also, I'm planning to introduce a mode that runs NATS "embedded" within the Liftbridge server, such that NATS basically becomes an implementation detail of Liftbridge.

Re: consumer-offset checkpointing, this will be implemented in the form of consumer groups, similar to how Kafka works where a consumer group checkpoints its position in the log by writing it to the log itself.


I'm not familiar with Akka, but both NATS Streaming and Liftbridge sound like they'd fit your purpose.

NATS on its own won't cut it, since it has no persistence and is entirely real-time (new clients receive messages published after they connect, not old messages).

I've not used Liftbridge yet, but plan to. While NATS Streaming is solid, its distributed design seems a bit ramshackle, relying on a combination of Raft and shared state (a distributed file system or an SQL database) for failover. Liftbridge seems like a better design to me.



