Tiered storage won't fix Kafka (warpstream.com)
65 points by itunpredictable 8 months ago | 31 comments



> But if you could rebuild streaming from the ground up for the cloud, you could achieve something a lot better than fewer disks – zero disks. The difference between some disks and zero disks is night and day. Zero disks, with everything running directly through object storage with no intermediary disks, would be better.

That's still a trade-off. Object storage, simply because of the overhead of HTTP + SSL, has higher latency than EFS, which has higher latency than EBS, which has higher latency than local SSD. So in the end your service (whether it's Kafka or anything else) has _higher_ latency if you also want consistency (i.e. resilience against "everything goes dark in an instant"), since all writes on all machines in the pool have to be committed to storage.

The only way a "zero disk" anything makes sense is if you have enough machines in enough diverse locations with enough RAM to cover the entire workload, and then you pray there's never an event that takes the entire cloud provider offline.


(WarpStream co-founder here)

We're not talking about no disks as in no storage, just nothing other than object storage. This does have a latency trade-off, but with the advent of S3 Express One Zone and Azure's equivalent high-performance tier (with GCP surely not far behind), a system designed purely around object storage can now trade cost for latency where it makes sense. WarpStream already has support for writing to a quorum of S3 Express One Zone buckets to provide regional availability, so there's not an availability trade-off here either.
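
Roughly, the pattern is: write the same batch to several zone-local buckets concurrently and acknowledge once a majority succeed. A minimal sketch of that pattern (the bucket names, the 2-of-3 quorum, and the blocking thread pool are illustrative, not our actual implementation):

    # Illustrative sketch: acknowledge a produce batch once a majority of
    # zone-local S3 Express One Zone directory buckets have accepted it.
    import concurrent.futures
    import boto3

    s3 = boto3.client("s3")

    # Example directory buckets, one per AZ (names are placeholders).
    EXPRESS_BUCKETS = [
        "landing--use1-az4--x-s3",
        "landing--use1-az5--x-s3",
        "landing--use1-az6--x-s3",
    ]
    QUORUM = 2  # majority of 3

    def write_with_quorum(key, payload):
        with concurrent.futures.ThreadPoolExecutor(len(EXPRESS_BUCKETS)) as pool:
            futures = [
                pool.submit(s3.put_object, Bucket=b, Key=key, Body=payload)
                for b in EXPRESS_BUCKETS
            ]
            acks = 0
            for f in concurrent.futures.as_completed(futures):
                if f.exception() is None:
                    acks += 1
                if acks >= QUORUM:
                    return True  # durable in a majority of zones; safe to ack
        return False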


> This does have a latency trade-off

There are no silver bullets. Traditional S3, with the durability guarantees that S3 provides, has a latency trade-off because the data needs to be copied to additional availability zones before acknowledging the write. Once you collapse everything to a single availability zone (i.e. S3 Express One Zone), you have little reason not to use Kafka, which scales costs within a single AZ without a problem. At $0.16/GB, S3EOZ is about 7x more expensive than normal S3 ($0.023/GB) for fewer copies of the data/lower integrity guarantees, or about 60% more expensive than MSK or Kinesis Data Streams ($0.10/GB). If you write to a quorum of S3EOZ, then you're tripling your S3EOZ storage costs, to 0.16 * 3 = $0.48/GB. And this doesn't include the cost of compute!
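
Restating that storage arithmetic (USD per GB-month, storage only, ignoring request and compute costs):

    # Storage list prices from the figures above.
    s3_standard = 0.023
    s3_express  = 0.16
    msk_or_kds  = 0.10

    print(s3_express / s3_standard)      # ~7x the price of S3 standard
    print(s3_express / msk_or_kds - 1)   # 0.6 -> 60% more than MSK/Kinesis
    print(s3_express * 3)                # $0.48/GB for a 3-bucket quorum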

Where's the value above just running Kafka within a single AZ, with no latency trade-off?


(WarpStream co-founder here)

You don't have to keep the data stored in S3 Express One Zone forever; you can just land it there and then immediately compact it to S3 standard. You still pay the higher fee to write to S3EOZ, but not the higher storage fee.

WarpStream does this, data gets compacted out within seconds usually. Of course this is now... tiered storage. But implemented over two "infinitely scalable" remote storage systems so it gets rid of all the operational and scaling problems you have with a typical tiered storage Kafka setup that uses local volumes as the landing zone.
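
The compaction step is conceptually simple: read the small objects that just landed in the express bucket, write them out as one larger object in standard S3, and delete the landing copies. A sketch of that shape (bucket names and object layout are illustrative only, not our actual format):

    # Hypothetical sketch of landing-zone compaction: concatenate small
    # recently-landed objects from S3 Express One Zone into one larger
    # object in S3 standard, then delete the originals.
    import boto3

    s3 = boto3.client("s3")
    EXPRESS_BUCKET = "landing--use1-az4--x-s3"   # placeholder landing bucket
    STANDARD_BUCKET = "compacted-long-term"      # placeholder long-term bucket

    def compact(prefix, out_key):
        listed = s3.list_objects_v2(Bucket=EXPRESS_BUCKET, Prefix=prefix)
        keys = [o["Key"] for o in listed.get("Contents", [])]

        # Read and concatenate the small landed segments in order.
        merged = b"".join(
            s3.get_object(Bucket=EXPRESS_BUCKET, Key=k)["Body"].read()
            for k in sorted(keys)
        )
        # One larger object in cheap S3 standard storage.
        s3.put_object(Bucket=STANDARD_BUCKET, Key=out_key, Body=merged)

        # The express copies only existed for seconds; drop them now.
        for k in keys:
            s3.delete_object(Bucket=EXPRESS_BUCKET, Key=k)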


> so it gets rid of all the operational and scaling problems you have with a typical tiered storage Kafka setup

Do these operational and scaling problems include AWS's managed services? MSK, Kinesis Data Streams?

At small scale, why wouldn't someone go with one of those? And at large scale, where's the Total Cost of Ownership comparison to show that it's worth it to ditch Kafka's local disks for a model built on object storage?


Short answer is that MSK has (almost) all the same problems as OSS Kafka. Kinesis streams is a different beast that would require its own blog post.

RE: numbers: https://www.warpstream.com/blog/warpstream-benchmarks-and-tc...


I talk about this more here: https://www.warpstream.com/blog/s3-express-is-all-you-need

RE: comparing to a single-zone Kafka cluster. A lot of people really dislike operating Kafka. Some people don't mind it, and that's cool too, but it's not the majority in my experience.


In addition to the high cost of S3 Express, using WarpStream to write three replicas to S3 Express and later compact them to S3 Standard could result in quadruple the network/outbound traffic. With two consumer groups involved, this could rise to six times the network/outbound traffic.

Considering a c5.4xlarge instance with 16 cores and 32GB of memory, which offers a baseline network bandwidth of only ~5 Gbps, you're limited to a maximum produce throughput of roughly 100 MiB/s.
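
Spelling out that estimate (assuming the amplification is 3 express replicas + 1 compaction write + 2 consumer group reads, all leaving the instance):

    baseline_bps = 5e9                     # c5.4xlarge baseline network, ~5 Gbps
    wire_mib_s = baseline_bps / 8 / 2**20  # ~596 MiB/s on the wire
    amplification = 3 + 1 + 2              # replicas + compaction + 2 consumer groups
    print(wire_mib_s / amplification)      # ~99 MiB/s of produce throughput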

Therefore, I have reservations about the cost-effectiveness of your low-latency solution, given these potential expenses.


I guess we’ll have to wait for a full write-up of this, but it does seem like having multiple categories of object storage is, under the hood, tiered storage!

…rebranded with a different name, again.

Again complex, again no obvious way to query storage directly, again unclear performance characteristics, and again no obvious reason why the networking costs don't make the savings from it largely meaningless.

You have to admit it’s a bit of a hard sell without any comeback after literally just saying that people were just inventing new names for minor variations on tiered storage…


I agree with your viewpoint. The crux of the matter is not whether to use tiered storage, but what trade-offs a specific storage architecture makes and what benefits it gains. Here (https://github.com/AutoMQ/automq?tab=readme-ov-file#-automq-...) is a qualitative comparison chart of streaming systems including Kafka/Confluent/Redpanda/WarpStream/AutoMQ. The chart doesn't have specific numbers; it's purely based on their trade-offs at the storage level, and I think it will be of some use to you.


We're still drafting our next post in this series, but the answer is actually very simple: two tiers of object storage do not have the same drawbacks as a combination of object storage and local disk. We wanted to explain that in this post too, but it would've been unreasonably long.

We've designed WarpStream to work extremely well on the slower, harder-to-use one first, and that is how 95+% of our workloads run in production. The tiered storage solutions from other streaming vendors do the opposite, where they were first designed for local SSDs and then bolted on object storage later.

The equivalent would be if we were pitching our support for an even slower, cheaper tier of object storage like AWS S3 Glacier.


Buddy, you've hit the nail on the head. Everything is a trade-off. For a stream processing system, I believe it's entirely possible to balance cost, ease of use, and latency. AutoMQ (https://github.com/AutoMQ/automq) is also a streaming system built on top of S3. Its storage scheme introduces a very small EBS volume as a persistent write cache and then asynchronously compacts that data to S3, addressing latency while retaining the advantages of WarpStream. Tiered storage is just a form; how you implement it is up to you.
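
For readers unfamiliar with the pattern, the general shape is: fsync the record to a small EBS-backed write-ahead log and ack immediately (so you pay EBS latency, not S3 latency), then a background task ships accumulated data to S3. A rough sketch of the idea, not our actual code:

    # Hypothetical sketch of a small EBS write-ahead log used as a durable
    # landing zone, with asynchronous upload to S3 in the background.
    import os
    import threading
    import boto3

    s3 = boto3.client("s3")
    WAL_PATH = "/mnt/ebs/wal.log"    # placeholder: small EBS volume
    BUCKET = "long-term-segments"    # placeholder S3 bucket

    lock = threading.Lock()

    def produce(record):
        # Durable once the fsync returns: latency is EBS latency, not S3.
        with lock, open(WAL_PATH, "ab") as wal:
            wal.write(record)
            wal.flush()
            os.fsync(wal.fileno())
        # Acknowledge the producer here.

    def flush_to_s3(segment_key):
        # Background task: ship the accumulated WAL to S3 and truncate it.
        with lock:
            with open(WAL_PATH, "rb") as wal:
                data = wal.read()
            if data:
                s3.put_object(Bucket=BUCKET, Key=segment_key, Body=data)
                open(WAL_PATH, "wb").close()  # truncate after a successful upload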


In fact, EBS is entirely a cloud storage solution and operates as shared storage. It is not a local disk system.


I agree but I hate it when at the end of an article I realize it was just an ad.

The conflict of interest should be disclaimed in the very first sentence of the post.


Do you honestly think that every article posted on a product's blog should include a disclaimer that the post may contain information highlighting the usefulness of their product?


This doesn't do that. It shits on the competition and then says their approach is better but doesn't detail that (wait for the next blog post...).


To be fair, it seems from the article that it's their interactions with the community using the feature that shits on it. Or do you have specific counterpoints to their arguments made against tiered storage in kafka? As far as I can tell there are two points being made against tiered storage: operational burden is increased and networking costs (which if you've read any of their other posts or visited their homepage, is a primary motivation for the existence of their product) are not reduced.


> I agree but I hate it when at the end of an article I realize it was just an ad.

If you wanted to be reductive: if the domain ends in .com, you can assume the submission is an ad. Some ads are better quality and worth discussing and some aren't.


More and more blog posts that make it to the HN front page are content marketing.

Nothing nefarious about it -- it's just a deliberate strategy on the part of companies like Retool or Supabase or Fly or whatever to market their services to this target market.

I have no idea how many actual sales conversations it leads to, but it sure is effective at convincing many HN readers that those are cool companies selling cool things...


In this case TFA was lame and a waste of time. Advertising by posting blogs on HN is fine, but the blogs had better be interesting.


You imagine communication exists to serve you. You are wrong.

You should spend time examining the author instead of just blindly vacuuming up "content".


> taken to its logical conclusion, tiered storage could turn Kafka into [...]

A message broker sitting in front of an RDBMS. I mean, if we're now basically 'tailing' streaming data and saving it to another storage system, we might as well use RabbitMQ.


"tailing" is a pretty cool mechanism. An MQ is possibly too general.

For "tailing" you need to be able to resume, and that requires that you have a stable "offset" identifier, which in HTTP-speak would be: {URI, weak ETag, byte-range offset}, and then you can use conditional requests and local-part naming conventions to deal with things like:

- detection of rollover (which is where you can move older content to tiered storage),

- recovery from that (resumption), and

- detection of lost content (e.g., you resumed a tail a month later and [surprise!] you can't catch up, so you need to recover in some other way).

In fact, I've written an HTTP server I call tailfhttpd that supports all of that and, if you GET w/ `Range: bytes=${offset}-` (for some offset) then the GET doesn't complete until the file is unlinked or renamed away. Poor-person's-Kafka. This is generally quite useful since you can use it to tail both structured and unstructured log files, as long as you only ever append to them.
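
The client half of the pattern is small. Roughly, against any server that honors Range, ETag, and If-Range (this is the generic pattern, not tailfhttpd's exact behavior):

    # Sketch of a resumable HTTP "tail -f": remember a byte offset, ask for
    # Range: bytes=<offset>-, and detect rollover via ETag changes.
    import time
    import requests

    def tail(url, offset=0, etag=None):
        while True:
            headers = {"Range": "bytes=%d-" % offset}
            if etag:
                headers["If-Range"] = etag  # only resume if it's still the same entity
            r = requests.get(url, headers=headers, timeout=60)

            if r.status_code == 206:        # partial content: new bytes past our offset
                offset += len(r.content)
                etag = r.headers.get("ETag", etag)
                yield r.content
            elif r.status_code == 200:      # full body came back: rollover detected
                offset = len(r.content)
                etag = r.headers.get("ETag")
                yield r.content
            elif r.status_code == 416:      # nothing new yet (or the file shrank)
                time.sleep(1)
            else:
                raise RuntimeError("lost the tail: %d" % r.status_code)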


Kind of difficult if all you know is a cloud Kafka and have no idea that RabbitMQ exists or how to use it.

I went to a lunch and learn where a young engineer was demoing some Kafka-based solution that was spun up in AWS.

I asked about some of the functionality in comparison with RabbitMQ. Deer in the headlights.


The article doesn't mention which EBS volume type was used, but since Provisioned IOPS are mentioned, I assume it's gp3 or io2. One pattern that is especially often used in Time Series databases, but could work for Kafka too, is not tiering down to S3, but changing older volumes to a slower volume type, such as sc1 ($0.015/GiB-Mo). This can be done completely transparently to the application.
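
For reference, that in-place tier change is a single ModifyVolume call and happens while the volume stays attached (the volume id below is a placeholder):

    # Move an older, colder EBS volume to the cheap sc1 tier in place.
    import boto3

    ec2 = boto3.client("ec2")

    ec2.modify_volume(
        VolumeId="vol-0123456789abcdef0",  # placeholder
        VolumeType="sc1",                  # cold HDD, ~$0.015/GiB-month
    )
    # The modification is applied online; the application keeps using the
    # same block device and never notices the tier change.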

Another thing worth looking into is S3 Mountpoint with or without read caching, which offers a POSIX-like interface for S3 to applications that don't natively support S3.


This strategy will not work well for Apache Kafka because it is extremely IOPS-hungry if you have more than a few partitions, and a replay of a large topic requires a lot of IO bandwidth. It would work well for, e.g., a columnar database, where a query targeting old data may only require reading a small fraction of the volume, but Kafka is effectively a row-oriented storage system, so the IO pattern is different.


Have you seen AutoMQ's approach (https://github.com/AutoMQ/automq)? It is hard to believe that users can tolerate produce latencies of hundreds of milliseconds with WarpStream. At AutoMQ, where I'm a co-founder and CEO, we have engaged with hundreds of users, many of whom are seeking both speed and reliability. So we set out to build a stateless broker solution that is fully compatible with Apache Kafka while also excelling in low latency and cost effectiveness on cloud infrastructure.


> It is hard to believe that users can tolerate a produce message latency of hundreds of milliseconds with WarpStream

I think that the critical axes of competition (for your company and for WarpStream) here are latency distribution versus cost and durability. Many users might be willing to tolerate produce latency in the 100s of milliseconds at the p99, provided the p50 is fast and the messaging system is cheap to operate. The same tradeoff applies for durability: if the p999 stays fast in exchange for replicated in-memory buffers being the sole residence of data before batched shipment to S3-alike, some users might be less (or more!) interested in that messaging product.


And everything is an abstract layer of CSV over SFTP /soff


Tiered storage won't save Kafka but we built the exact same thing while thinking differently so we're better. Got it!


My understanding of WarpStream's approach is that it's fundamentally different from tiering: at a basic level, it's a system that accumulates and persists (to S3) batches of messages before sending produce-acknowledged responses to clients. This means that there's no "tier" of storage before S3.

On top of that, the WarpStream folks have layered extensive mitigations for the worst-case latency costs of the producer-blocking batch-and-ship approach, as well as a fairly sophisticated system to make consumers consistently quick via online continuous storage rewriting, prefetching, and data movement between broker nodes (if you squint at it, this system looks a lot like the familiar index+buffer cache design).
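
Put differently, the produce path being described is roughly: buffer incoming requests for a short window, write the whole batch as one object, and only then acknowledge every producer in it. A sketch of that general pattern (not WarpStream internals, which aren't public):

    # Sketch: accumulate produce requests briefly, persist them as a single
    # S3 object, and acknowledge producers only after the PUT succeeds.
    import time
    import uuid
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "stream-segments"   # placeholder bucket
    FLUSH_INTERVAL = 0.25        # seconds; the batching window drives p50 latency

    pending = []                 # list of (payload, ack_callback)

    def produce(payload, ack):
        pending.append((payload, ack))

    def flush_loop():
        while True:
            time.sleep(FLUSH_INTERVAL)
            if not pending:
                continue
            batch = pending[:]
            del pending[:]
            body = b"".join(p for p, _ in batch)
            s3.put_object(Bucket=BUCKET, Key="segments/%s" % uuid.uuid4(), Body=body)
            for _, ack in batch:
                ack()            # durable in object storage: now it's safe to ack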

Unaffiliated, just an expert in the space who likes reading WarpStream's blog.



