I don't know what issues Yelp was facing; I have upgraded my Kafka clusters several times in the past and never really experienced any problems. The upgrade procedure is normally well documented (e.g. https://kafka.apache.org/31/documentation.html#upgrade) and a regular rolling upgrade involves no downtime.
Beyond that, operating Kafka never required much effort, apart from when we needed to re-balance partitions across brokers. Earlier versions of Kafka required external tools to handle that without causing network congestion, but I think that's a thing of the past now.
On the other hand, Kafka still needs to be used carefully: in particular you need to plan topics/partitions/replication/retention, but that really depends on the application's needs.
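For what it's worth, partition count, replication factor, and retention are all set per topic, so the planning mostly happens at topic-creation time. A minimal sketch with the Java AdminClient, assuming a local broker and made-up topic name and values:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // hypothetical topic: 6 partitions, replication factor 3, 7-day retention
                NewTopic topic = new NewTopic("orders", 6, (short) 3)
                        .configs(Map.of("retention.ms", "604800000"));
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }

The reason the planning matters is that retention can be changed later, but partitions can only be added (never removed), and adding them changes which partition a given key maps to.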
I used to work on a project where every other rolling upgrade of Amazon's managed Kafka offering would crash all the Streams threads in our Tomcat-based production app, causing downtime because we had to restart it.
The crash happens 10–15 minutes into the downtime window of the first broker. Absolutely no one has been able to figure out why, or even to reproduce the issue.
Running out of things to try, we resorted to randomly changing all sorts of combinations of consumer group timeouts, which are imho poorly documented, so no one really understands which one controls what anyway. Of course all that tweaking didn't help either (shotgun debugging never does).
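For reference, the timeouts we kept shuffling were roughly the ones below (the config names are the real consumer settings, the values are purely illustrative; for Kafka Streams they get passed through with the consumer prefix):

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.streams.StreamsConfig;
    import java.util.Properties;

    public class StreamsTimeoutConfig {
        public static Properties timeouts() {
            Properties props = new Properties();
            // how long the group coordinator waits for a heartbeat before evicting the member
            props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), "45000");
            // how often the client sends heartbeats; keep well below the session timeout
            props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), "15000");
            // max gap between poll() calls before the member is considered dead and a rebalance starts
            props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), "300000");
            // how long a single broker request may take before the client times it out and retries
            props.put(StreamsConfig.consumerPrefix(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG), "30000");
            return props;
        }
    }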
This has been going on for the last two years. As far as I know, the issue still persists. Everyone in that project is dreading Amazon’s monthly patch event.
Check the errors coming back from your poll/commit. The Kafka stack should tell you when you can retry items. If it is in the middle of something it does not always fail nicely, but you can retry and it is usually fine.

Usually I see that sort of behavior if the whole cluster just 'goes away' (reboots, upgrades, etc.). It will yeet out a network error and then just stop doing anything. You have to watch for it and recreate your Kafka object (sometimes; sometimes a retry is fine). If they are bouncing the whole cluster on you, each broker can take a decent amount of time before it is alive again. So if you have 3 brokers and all 3 are restarted in quick succession, you will see some nasty behavior out of the Kafka stack.

You can fiddle with your retries and timeouts, but if those are shorter than the time the cluster takes to come back, you can end up with what looks like a busted Kafka stream. I have seen a single broker take anywhere from 3-10 minutes to restart (other times it is more like 10 seconds), so depending on the upgrade/patch script that can be a decent outage. It goes really sideways if the cluster has a lot of volume to replicate between topics on each broker (your replication factor).
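To make the "watch for it and recreate your Kafka object" part concrete, here is a rough sketch with the plain Java consumer (topic, group id and broker address are made up, and the retry-vs-recreate split is just one way to slice it, not an official recipe):

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.errors.RetriableException;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class ResilientConsumerLoop {
        static KafkaConsumer<String, String> newConsumer() {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
            KafkaConsumer<String, String> c = new KafkaConsumer<>(props);
            c.subscribe(List.of("example-topic"));
            return c;
        }

        public static void main(String[] args) {
            KafkaConsumer<String, String> consumer = newConsumer();
            while (true) {
                try {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    records.forEach(r -> System.out.println(r.value()));
                    consumer.commitSync(); // may throw while the coordinator is mid-restart
                } catch (RetriableException e) {
                    // transient broker/network problem: back off and poll again with the same consumer
                    System.err.println("retriable error, will poll again: " + e.getMessage());
                } catch (KafkaException e) {
                    // anything else Kafka-related: throw the consumer away and build a fresh one
                    System.err.println("recreating consumer after: " + e.getMessage());
                    try { consumer.close(Duration.ofSeconds(5)); } catch (Exception ignored) { }
                    consumer = newConsumer();
                }
            }
        }
    }

RetriableException covers the transient network/coordinator errors the client knows it can retry; treating everything else as "throw it away and start over" is basically what you end up doing by hand during a full cluster bounce anyway.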