Well Sidekiq is free to use. It's only the pro version that he charges and the f...

phamilton · on April 14, 2023

I have no problem paying for the Pro version, but one of its marketing pitches is "enhanced reliability", which is a wild marketing spin on "the free version will lose jobs in fairly common scenarios".

In Sidekiq without super_fetch (a paid feature), any jobs in progress when a worker crashes are lost forever. If a worker merely encounters an exception, the job will be put back on the queue and retried, but a crash means the job is lost.
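A toy model of that difference, written in Python purely for illustration. It is not Sidekiq's code: `HardCrash`, `run_worker`, and `demo` are hypothetical names, and the "push back on exception" step is a simplification of how retries are scheduled.

```python
from collections import deque


class HardCrash(BaseException):
    """Stands in for a SIGKILL or OOM kill: no rescue code gets to run."""


def run_worker(queue, handler):
    """Pop-then-process loop with no record of in-progress jobs (simplified model)."""
    while queue:
        job = queue.popleft()      # destructive pop: the queue forgets the job here
        try:
            handler(job)
        except Exception:
            queue.append(job)      # ordinary exception: job goes back for a retry
        # HardCrash is deliberately not caught, like a killed process.


def demo():
    attempts = {}

    def handler(job):
        attempts[job] = attempts.get(job, 0) + 1
        if job == "flaky" and attempts[job] == 1:
            raise RuntimeError("transient error")
        if job == "oom":
            raise HardCrash()

    queue = deque(["flaky", "oom", "ok"])
    try:
        run_worker(queue, handler)
    except HardCrash:
        pass  # the worker is gone; a replacement only sees what is left in the queue
    return list(queue), attempts
```

After `demo()` runs, "flaky" is still in the queue awaiting its retry, while "oom" appears nowhere: it was popped before the crash and nothing recorded that it was in flight.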

Again, no problem paying for Pro, but I would prefer a little more transparency on how big a gap that is.

arkasan · on April 14, 2023

I wish this was prominently documented. Most people new to Sidekiq have no idea that the job will be lost forever if you simply hard kill the worker. I have seen a couple of instances where the team had Sidekiq Pro, but they had not enabled reliable fetch because they were unaware of this problem.

mperham · on April 14, 2023

The free version acts exactly like Resque, the previous market leader in Ruby background jobs. If it was good enough reliability for GitHub and Shopify to use for years, it was good enough for Sidekiq OSS too.

Here's Resque literally using `lpop` which is destructive and will lose jobs.

https://github.com/resque/resque/blob/7623b8dfbdd0a07eb04b19...

phamilton · on April 14, 2023

> If it was good enough reliability for GitHub and Shopify to use for years, it was good enough for Sidekiq OSS too.

Great point, and thanks for chiming in. I wonder if containerization has made this more painful (due to cgroups and OOMs). The comments here are basically some people saying it's never been a problem for them and some people saying they encounter it a lot (in containerized environments) and have had to add mitigations.

Either way, my observation is a lot of people not paying for Sidekiq Pro should. I hope you can agree with that.

aqme28 · on April 14, 2023

When we used Sidekiq in production, not only did I never see crashes that lost us jobs, but there are also ways to protect yourself from that. I highly recommend writing your jobs to be idempotent.

phamilton · on April 14, 2023

Idempotence doesn't solve this problem. The jobs are all idempotent. The problem is that jobs will never be retried if a crash occurs.

This doesn't happen at a high rate, but it happens more than zero times per week for us. We pay for Sidekiq Pro and have superfetch enabled so we are protected. If we didn't do so we'd need to create some additional infra to detect jobs that were never properly run and re-run them.

_sojh · on April 14, 2023

Or install an open source gem[1] that recreates the functionality using the same Redis rpoplpush[2] command.

[1] https://gitlab.com/gitlab-org/ruby/gems/sidekiq-reliable-fet...

[2] https://redis.io/commands/rpoplpush/#pattern-reliable-queue
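For readers unfamiliar with the reliable-queue pattern in [2], here is a rough in-memory sketch in Python. It only models the idea (move the job to a per-worker processing list instead of deleting it); it does not talk to Redis and is not the gem's implementation. `ReliableQueue` and its method names are hypothetical.

```python
from collections import deque


class ReliableQueue:
    """In-memory model of the pop-and-push pattern; not a Redis client."""

    def __init__(self, jobs):
        self.pending = deque(jobs)
        self.processing = {}  # worker_id -> jobs that worker has fetched but not acked

    def fetch(self, worker_id):
        """Move one job from pending to the worker's processing list in one step."""
        if not self.pending:
            return None
        job = self.pending.pop()  # take from the tail, as RPOPLPUSH does
        self.processing.setdefault(worker_id, []).append(job)
        return job

    def ack(self, worker_id, job):
        """Called after the job finishes; only now is the job really gone."""
        self.processing[worker_id].remove(job)

    def recover(self, dead_worker_id):
        """Requeue whatever a dead worker left in its processing list."""
        orphans = self.processing.pop(dead_worker_id, [])
        self.pending.extend(orphans)
        return orphans
```

If worker "w1" fetches a job and dies before calling `ack`, a later `recover("w1")` puts that job back in `pending`. How a real deployment decides that a worker is dead (heartbeats, timeouts) is outside this sketch.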

aqme28 · on April 14, 2023

Fair enough about idempotence.

I'm still confused about what you're saying though. You're saying that the language of "enhanced reliability" doesn't reflect losing 2 jobs over about 50*7 million (from your other comment)?

And that if you didn't pay for the service, you'd have to add some checks to make up for this?

That all seems incredibly reasonable to me.

danenania · on April 14, 2023

Crashes are under your control though. They’re not caused by sidekiq. And you could always add your own crash recovery logic, as you say. To me that makes it a reasonable candidate for a pro feature.

It’s hard to get this right though. No matter where the line gets drawn, free users will complain that they don’t get everything for free.

Mavvie · on April 14, 2023

How are crashes under your control? Again they aren't talking about uncaught exceptions, but crashes. So maybe the server gets unplugged, the network disconnects, etc.

danenania · on April 14, 2023

To me 'crash' means any unexpected termination, whether it's caused by an uncaught exception, OOM, or hardware/network issues.

I guess you can say that hardware issues on your host aren't under your control, but it's under your control to find a host that doesn't have these issues. And not even a full-on ACID database is going to be 100% reliable if you yank the power cord at the wrong moment.

Mavvie · on April 14, 2023

I hope my tone doesn't come across as rude or too argumentative, but I think your understanding is a bit inaccurate.

> it's under your control to find a host that doesn't have these issues

All hosts will have these issues, the only question is how often. If you need 100% consistency, then you can't use the free Sidekiq. Personally, I've never needed Sidekiq pro (as these kinds of crashes are extremely rare). But this will depend on your scale and use case.

> And not even a full-on ACID database is going to be 100% reliable if you yank the power cord at the wrong moment

This is only true if there are bugs in the DB, or some underlying disk corruption happens. The whole point of an ACID database is that they're atomic, durable, and consistent, even in the worst case scenario. If a power failure corrupted my SQL database I would feel very betrayed by the database.

danenania · on April 14, 2023

It wouldn’t be corrupted, but in-flight transactions could fail to commit, just like queued jobs can be lost with sidekiq. The failure modes are similar.

I take your point that at a certain scale, hardware failure is inevitable, but if you’re running that many servers, you can afford sidekiq’s enterprise plan. It’s not something that will realistically happen if you’re just running like 20 instances on AWS. It’s perfectly reasonable to charge extra for something only large organizations with huge infrastructure budgets need.

Mavvie · on April 14, 2023

For sure, I agree with you.

I would say that queued jobs being lost is different from an in-flight transaction being auto-rolled-back, but it's not a super important distinction. Like others have said, I think Sidekiq really nailed the free vs premium features and its success is evidence of that.

sorentwo · on April 14, 2023

Jobs may crash due to VM issues or OOM problems. The more common cause of "orphans" is when the VM restarts and jobs can't finish during the shutdown period.

durkie · on April 14, 2023

how often do your workers crash? i rely heavily on sidekiq and don't think I see this very often, if ever.

phamilton · on April 14, 2023

We process around 50M sidekiq jobs a day across a few hundred workers on a heavily autoscaled infrastructure.

Over the past week there were 2 jobs that would have been lost if not for superfetch.

It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.

Edit for additional color: One of the most common crashes we'll see is OutOfMemory. We run in a containerized environment and if a rogue job uses too much memory (or a deploy drastically changes our memory footprint) the container will be killed. In that scenario, the job is not placed back into the queue. SuperFetch is able to recover them, albeit with really loose guarantees around "when".

ZephyrBlu · on April 14, 2023

Let me get this straight, you're complaining about eight 9s of reliability?

50,000,000 * 7 = 350,000,000

2 / 350,000,000 = 0.000000005714286

1 - (2 / 350,000,000) = 0.999999994285714 = 99.999999%

> It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.

If your system isn't resilient to 2 in 350,000,000 jobs failing I think there is something wrong with your system.
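The arithmetic above can be checked with a few lines of Python. The inputs are the figures quoted in the thread (50M jobs a day, 7 days, 2 lost jobs); counting "nines" as the floor of the negative base-10 logarithm of the loss rate is one common convention, assumed here.

```python
import math

jobs_per_day = 50_000_000
days = 7
lost_jobs = 2

total_jobs = jobs_per_day * days       # jobs processed over the week
loss_rate = lost_jobs / total_jobs     # fraction of jobs that would have been lost
success_rate = 1 - loss_rate

# Number of leading nines in the success rate, e.g. 0.999 -> 3
nines = math.floor(-math.log10(loss_rate))
```

With these inputs `total_jobs` is 350,000,000 and `nines` comes out to 8, matching the "eight 9s" figure.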

phamilton · on April 14, 2023

This isn't about 2 in 350,000,000 jobs failing. It's about 2 jobs disappearing entirely.

It's not reliability we're talking about, it's about durability. For reference, S3 has eleven 9s of durability.

Every major queuing system solves this problem. RabbitMQ uses unacknowledged messages which are pinned to a TCP connection, so when that connection drops before acknowledging them they get picked up by another worker. SQS uses visibility timeouts, where if the message hasn't been successfully processed within a time frame it's made available to other workers. Sidekiq free edition chooses not to solve it. And that's a fine stance for a free product, but just one I wish was made clearer.
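To make the visibility-timeout idea concrete, a minimal in-memory sketch in Python follows. It is not the SQS API: `VisibilityTimeoutQueue`, `receive`, and `delete` are hypothetical names, and the clock is passed in explicitly as `now` so the behaviour is easy to follow.

```python
class VisibilityTimeoutQueue:
    """Toy model: a received job is hidden, not removed, until it is deleted."""

    def __init__(self, jobs, timeout):
        self.visible = list(jobs)
        self.in_flight = {}   # job -> time at which it becomes visible again
        self.timeout = timeout

    def receive(self, now):
        # Jobs whose timeout has expired were never deleted, so their worker
        # presumably died; make them available to other workers again.
        for job, deadline in list(self.in_flight.items()):
            if deadline <= now:
                del self.in_flight[job]
                self.visible.append(job)
        if not self.visible:
            return None
        job = self.visible.pop(0)
        self.in_flight[job] = now + self.timeout
        return job

    def delete(self, job):
        """Called by a worker only after it has finished the job."""
        self.in_flight.pop(job, None)
```

A worker that crashes simply never calls `delete`, so the job reappears once the timeout passes. The trade-off is that a slow but healthy worker can see its job handed to someone else, which is why jobs still need to be idempotent.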

ZephyrBlu · on April 15, 2023

If you want to focus on durability then I think your complaint makes even less sense. Somehow I doubt S3 is primarily backed by Redis.

I think it's fair to assume that something backed by Redis is not durable by default because that's not what Redis is known for, whereas the other options you listed are known for their resiliency and durability. I wouldn't view Sidekiq as a similar product to RabbitMQ and SQS.

Also, Sidekiq Pro uses more advanced Redis features to enable super_fetch, which lends weight to the assumption that Redis is not durable by default: https://www.bigbinary.com/blog/increase-reliability-of-backg....

Colex · on April 14, 2023

It’s not uncommon to lose jobs in Sidekiq if you heavily rely on it and have a lot of jobs running. If using the free version for mission critical jobs, I usually run that task as a cron job to ensure that it will retry if the job is lost.

I have in the past monitored how many jobs were lost and, although a small percentage, it was still a recurring thing.
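One way such a cron-style safety net could look, sketched in Python under stated assumptions: the application keeps its own record of each job with an `enqueued_at` time and a `done_at` time (`None` until the job finishes). The record layout and the `find_stale_jobs` helper are hypothetical, not part of Sidekiq.

```python
def find_stale_jobs(records, now, max_age):
    """Return ids of jobs that were enqueued long ago but never marked done."""
    return [
        job_id
        for job_id, rec in records.items()
        if rec["done_at"] is None and now - rec["enqueued_at"] > max_age
    ]


# Example bookkeeping, with times in seconds for simplicity.
records = {
    "job-1": {"enqueued_at": 0, "done_at": 12},      # finished normally
    "job-2": {"enqueued_at": 5, "done_at": None},    # never finished: likely lost
    "job-3": {"enqueued_at": 590, "done_at": None},  # still recent, leave it alone
}
stale = find_stale_jobs(records, now=600, max_age=300)
```

Here `stale` contains only "job-2", which the periodic task would then enqueue again. This also doubles as the monitoring mentioned above, since the length of the list is the number of jobs that went missing.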

kubectl_h · on April 14, 2023

In containerized environments it may happen more often, either because of OOM kills or because you use autoscalers and have long-running Sidekiq jobs whose runtime exceeds the configured grace period for shutting down a container during a downscale, so the process is eventually terminated without prejudice.

OOM kills are particularly pernicious as they can get into a vicious cycle of retry-killed-retry loops. The individual job causing the OOM isn't that important (we will identify it, log it and noop it), it's the blast radius effect on other sidekiq threads (we use up to 20 threads on some of our workers), so you want to be able to recover and re-run any jobs that are innocent victims of a misbehaving job.

tebbers · on April 14, 2023

Exactly why we refuse to use Sidekiq. “Hey, you have to pay to guarantee your jobs won’t just vanish”.

No thanks.