Hunting down a C memory leak in a Go program (medium.com/zendesk-engineering)
129 points by xerxes901 on Oct 16, 2021 | 46 comments



This article goes on a very exciting journey. But the leak has a notable property that should make you reach for a particular tool early on, just in case: it is enormous. The program's leak is much larger than the program itself and is in fact triggering the OOM killer. So my first thought (on Linux) would be to reach for my:

https://github.com/tialaramex/leakdice (or there's a Rust rewrite https://github.com/tialaramex/leakdice-rust because I was learning Rust)

leakdice is not a clever, sophisticated tool like Valgrind or eBPF programming, but that's fine: this isn't a subtle problem - it's very blatant - and running leakdice takes seconds, so if it isn't helpful you've lost very little time.

Here's what leakdice does: It picks a random heap page of a running process, which you suspect is leaking, and it displays that page as ASCII + hex.

That's all, and that might seem completely useless, unless you either read Raymond Chen's "The Old New Thing" or you paid attention in statistics class.

Because your program is leaking so badly, the vast majority of heap pages (leakdice counts any pages which are writable and anonymous) are leaked. Any random heap page, therefore, is probably leaked. Now, if that page is full of zero bytes you don't learn very much - it's just leaking blank pages, which is hard to diagnose. But most often you're leaking (as was happening here) something with structure, and very often the sort of engineer assigned to investigate a leak can look at a 4 kbyte page of structure and go "Oh, I know what that is" from staring at the output in hex + ASCII.
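
In case it helps to visualize, here's a stripped-down sketch of the same idea in C (not the actual leakdice source; the mapping filter and error handling are simplified): pick a random writable page out of /proc/<pid>/maps, read it from /proc/<pid>/mem, and dump it as hex + ASCII.

  /* Sketch of the leakdice idea (not the real tool): pick a random writable
   * page from a target process and dump it as hex + ASCII. Reading
   * /proc/<pid>/mem needs permission to ptrace the target (e.g. root). */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define PAGE 4096
  #define MAXPAGES 65536

  int main(int argc, char **argv)
  {
      if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

      char path[64], line[512];
      static unsigned long pages[MAXPAGES];
      size_t npages = 0;

      /* Collect candidate pages from writable mappings in /proc/<pid>/maps.
       * (The real tool is pickier: it only counts anonymous mappings.) */
      snprintf(path, sizeof path, "/proc/%s/maps", argv[1]);
      FILE *maps = fopen(path, "r");
      if (!maps) { perror("maps"); return 1; }
      while (fgets(line, sizeof line, maps) && npages < MAXPAGES) {
          unsigned long start, end;
          char perms[8];
          if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) != 3) continue;
          if (perms[0] != 'r' || perms[1] != 'w') continue;
          for (unsigned long a = start; a < end && npages < MAXPAGES; a += PAGE)
              pages[npages++] = a;
      }
      fclose(maps);
      if (npages == 0) { fprintf(stderr, "no writable pages\n"); return 1; }

      /* Pick one page at random and read it via /proc/<pid>/mem. */
      srand(time(NULL));
      unsigned long addr = pages[rand() % npages];
      snprintf(path, sizeof path, "/proc/%s/mem", argv[1]);
      FILE *mem = fopen(path, "r");
      if (!mem) { perror("mem"); return 1; }
      unsigned char buf[PAGE];
      fseek(mem, (long)addr, SEEK_SET);
      if (fread(buf, 1, PAGE, mem) != PAGE) { fprintf(stderr, "short read\n"); return 1; }
      fclose(mem);

      /* 16 bytes per line: address, hex, then printable ASCII. */
      for (size_t i = 0; i < PAGE; i += 16) {
          printf("%012lx  ", addr + i);
          for (size_t j = 0; j < 16; j++) printf("%02x ", buf[i + j]);
          printf(" |");
          for (size_t j = 0; j < 16; j++)
              putchar(buf[i + j] >= 0x20 && buf[i + j] < 0x7f ? buf[i + j] : '.');
          puts("|");
      }
      return 0;
  }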

This isn't a silver bullet, but it's very easy and you can try it in like an hour (not days or a week), including writing up something like "Alas, the leaked pages are empty", which isn't a solution but certainly clarifies future results.


This looks quite interesting and I learned something new! Is there any sample output of it? Adding it to the README would also be great.


I didn't provide sample output, but I think I should; this is a tool many people would use only once, so it can't hurt to set expectations properly. I will think about how best to do that. There's a README for the original C version, and I should add one to the Rust code.

Note that you can play with it on programs that aren't leaking; it's just that obviously it isn't diagnosing a leak, you're just seeing whatever happened to be in a random page of that process, with no reason to expect that page to be representative of anything.


I once had to intercept every call to malloc/free/realloc and log it to find a leak. I wound up turning that into an immensely useful tool.


It would also stomp on all memory returned by malloc, and all memory that free free'd. This uncovered amazing numbers of bugs.
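
For anyone who hasn't seen this trick, the rough shape of it (a sketch below, not the actual tool, which instrumented the source) is to send allocations through wrappers that log every call and fill both freshly allocated and freshly freed memory with poison bytes, so reads of uninitialized or stale memory blow up quickly:

  /* Sketch of the "log and stomp" idea: wrap malloc/free/realloc, log every
   * call, and fill new/freed memory with poison bytes so reads of
   * uninitialized or freed memory show up fast. A real version also has to
   * track allocation sizes so free can poison the right number of bytes. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define NEW_POISON  0xCD  /* freshly allocated, never written */
  #define FREE_POISON 0xDD  /* freed, should never be read again */

  void *dbg_malloc(size_t n, const char *file, int line)
  {
      void *p = malloc(n);
      if (p) memset(p, NEW_POISON, n);
      fprintf(stderr, "malloc(%zu) = %p at %s:%d\n", n, p, file, line);
      return p;
  }

  void dbg_free(void *p, size_t n, const char *file, int line)
  {
      fprintf(stderr, "free(%p) at %s:%d\n", p, file, line);
      if (p) memset(p, FREE_POISON, n);  /* stomp before releasing */
      free(p);
  }

  /* Callers go through macros so the log records where each call came from. */
  #define MALLOC(n)    dbg_malloc((n), __FILE__, __LINE__)
  #define FREE(p, n)   dbg_free((p), (n), __FILE__, __LINE__)

  int main(void)
  {
      char *s = MALLOC(32);
      strcpy(s, "hello");
      FREE(s, 32);
      /* Leaks show up as malloc lines in the log with no matching free. */
      return 0;
  }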

Over time, though, it got less and less effective, as I got better and better at avoiding writing such bugs in the first place.

Most bugs I create these days are due to misunderstanding the problem I'm solving, rather than memory misuse.


If you are the same Walter Bright who invented the D programming language then this comment should win a fun comment award

also there should be an HN comments award show or something


The one and only Walter Bright.


There can be only one.


valgrind?


I wish. Mine instrumented the source code, not the executable.


This was my first guess, but according to Wikipedia[0], the author of Valgrind is Julian Seward[1].

  [0] - https://en.wikipedia.org/wiki/Valgrind
  [1] - https://en.wikipedia.org/wiki/Julian_Seward



Maybe WalterBright is his pseudonym on HN?


Fairly sure this is the Walter Bright that wrote the D language, a C++ compiler, etc.


It was my evil twin. Of course, he claims the evil one is me. Don't listen to him!


No, no, he's the bad one!


Nice write-up! Using BPF to trace malloc/free is a good example of the tool's power. Unfortunately, IME, this approach doesn't scale to very high load services. Once you're calling malloc/free hundreds of thousands of times a second, the overhead of jumping into the kernel every time cripples performance.

It would be great if one could configure the uprobes for malloc/free to trigger one in N times, but when I last looked they were unconditional. It didn't help to have the BPF probe just return early, either — the cost is in getting into the kernel to start with.

However, jemalloc itself has great support for producing heap profiles with low overhead. Allocations are sampled and the stacks leading to them are recorded in much the same way as the linked BPF approach:

https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Heap-Pro...
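
For anyone who wants to try that route, the rough shape of it (a sketch under the assumption that jemalloc was built with --enable-prof; symbol names may carry a je_ prefix depending on the build) is to start the process with sampling enabled via MALLOC_CONF and then dump profiles for jeprof to inspect, either automatically or on demand:

  /* Sketch of on-demand heap-profile dumps with jemalloc. Requires a
   * jemalloc built with --enable-prof and the process started with
   * something like MALLOC_CONF="prof:true,lg_prof_sample:19" so that
   * allocations are sampled. Dumps can then be inspected with jeprof. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <jemalloc/jemalloc.h>

  static void dump_heap_profile(const char *filename)
  {
      /* "prof.dump" writes a heap profile; passing a filename overrides
       * the auto-generated, prof_prefix-based name. */
      if (mallctl("prof.dump", NULL, NULL, &filename, sizeof(filename)) != 0)
          fprintf(stderr, "prof.dump failed (is profiling enabled?)\n");
  }

  int main(void)
  {
      /* Simulate a leak: allocate and never free. */
      for (int i = 0; i < 100000; i++)
          (void)malloc(1024);

      dump_heap_profile("leak.heap");
      return 0;
  }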


> Once you’re calling malloc/free hundreds of thousands of times a second, the overhead of jumping into the kernel every time cripples performance.

Shameless plug in case you (or anyone else) is interested, I wrote a memory profiler for exactly this usecase:

https://github.com/koute/bytehound

It's definitely not perfect, but it's relatively fast, has an okay-ish GUI, and it's even scriptable: https://koute.github.io/bytehound/memory_leak_analysis.html


Interesting! What is the overhead of this? We found that jemalloc's heap profiling had a small (perhaps 1-2%? It's been a while) CPU penalty and, depending on the complexity of the code being profiled and the sample rate, potentially a few hundred MB of extra RAM use on very large, complex binaries. I'd assume the RAM cost is similar given the data is the same (i.e. backtraces).


It depends a lot on the program's allocation patterns. It works by continuously tracking allocations, and by default it emits every allocation as it happens, with a full stack trace attached.

It certainly adds some overhead, but usually it's not significant enough to be a dealbreaker (especially if you compile the current master and configure it to automatically strip out all short-lived temporary allocations). I've used it to profile a bunch of time-sensitive software, e.g. telecom software and blockchain software, where other profilers were too slow to be usable.

Not sure how it compares to jemalloc's profiler, but it should be orders of magnitude faster than Valgrind while at the same time gathering a lot more data.


> our application was not actually handling events from that queue, so the size of that queue grew without bound

While new tools are great and I appreciate this nice write-up of how you can use BPF to find memory leaks, I wonder if they could have just guessed the above issue the minute after realizing that Valgrind did not report relevant issues. After all, the program just kept creating objects that were never used. With more context about the offending program, such design issues could be found by the responsible programmers by means of "thinking it through". What I mean is: sometimes complicated tooling distracts you so much from the actual problem that you miss the obvious.


Believe me, before reaching for this stuff there were several passes of staring at the code trying to find something by "thinking it through", as you say. But at the end of the day, we're human, and fixing bugs by inspection is just not always a realistic strategy.


I think it depends on how much context you have. For apps where I have written most of the code, it's usually easy to think it through and review some code paths. However, the situation is vastly different if you run 90% unfamiliar code, as often happens in larger organizations where everyone only contributed a tiny part of things. When you run millions of lines of code in production, it's not reasonable to review everything in order to be able to fix it - you need to use hints and tools to narrow down the source of the issue. That can be logs, metrics, debugging tools, profilers, etc.


Segment learned quite some time ago that confluent-kafka-go has problems like these (and doesn’t support Contexts either), so they wrote a pure Go replacement instead. https://github.com/segmentio/kafka-go


So, in the interests of full transparency - we at Zendesk are actually running a fork of confluent-kafka-go, which I forked to add, amongst other things, context support: https://github.com/confluentinc/confluent-kafka-go/pull/626

This bug actually happened because I mis-merged upstream into our fork and missed an important call to rd_kafka_poll_set_consumer: https://github.com/zendesk/confluent-kafka-go/commit/6e2d889...
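
For anyone who hasn't touched librdkafka's C API, here's roughly where that call sits in a bare-bones consumer (a minimal sketch, not our actual wrapper; error handling mostly omitted). Without rd_kafka_poll_set_consumer (and without a separate rd_kafka_poll loop), events destined for the main queue just accumulate, which is essentially what leaked:

  /* Minimal sketch of a librdkafka consumer in plain C, showing where
   * rd_kafka_poll_set_consumer() fits. It redirects the main event queue
   * to the consumer queue, so a single rd_kafka_consumer_poll() loop
   * serves both. */
  #include <stdio.h>
  #include <librdkafka/rdkafka.h>

  int main(void)
  {
      char errstr[512];
      rd_kafka_conf_t *conf = rd_kafka_conf_new();
      rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                        errstr, sizeof(errstr));
      rd_kafka_conf_set(conf, "group.id", "example-group",
                        errstr, sizeof(errstr));

      rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                    errstr, sizeof(errstr));
      if (!rk) { fprintf(stderr, "rd_kafka_new: %s\n", errstr); return 1; }

      /* The important call: route main-queue events to the consumer queue. */
      rd_kafka_poll_set_consumer(rk);

      rd_kafka_topic_partition_list_t *topics =
          rd_kafka_topic_partition_list_new(1);
      rd_kafka_topic_partition_list_add(topics, "example-topic",
                                        RD_KAFKA_PARTITION_UA);
      rd_kafka_subscribe(rk, topics);
      rd_kafka_topic_partition_list_destroy(topics);

      for (;;) {
          rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 1000);
          if (!msg) continue;              /* poll timeout */
          if (!msg->err)
              printf("got %zu bytes\n", msg->len);
          rd_kafka_message_destroy(msg);
      }
      /* (unreached) rd_kafka_consumer_close(rk); rd_kafka_destroy(rk); */
  }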


Did you consider Segment’s pure Go library instead? Curious as to why you might have rejected it. (I should also note that Go C bindings impose performance penalties in many cases due to the need to copy memory back and forth.)


We had previously made extensive use of Sarama, another pure-Go implementation of the Kafka protocol. After the third or fourth time I had to use Wireshark to debug a protocol-level bug, we figured it would probably be better to use the librdkafka-based library to get better protocol support.

Plus, the team running our brokers found that the Sarama clients were a drag on their ability to enable broker features like zstd compression (although I understand Sarama has zstd now). This felt like it was going to be the kind of problem that would keep cropping up.


You’ll notice Segment rejected Sarama, too. :-) Honestly I think only Segment got the pure Go library right. They care a lot about correctness and performance.


I spent quite some time writing github.com/twmb/franz-go due to wanting more from both Sarama and from Segment’s library. Pure Go as well.

I’d recommend evaluating it as an alternative, even to confluent-kafka-go, due to its comprehensive feature support and (afaict) even higher performance than the aforementioned libraries.


Oh hey that looks really interesting. I'll certainly give it a look!

Lesson learned from the Sarama -> librdkafka migration: We actually wrapped Sarama in a home-grown abstraction (to handle some Zendesk-specific concerns). When we made v2 of this wrapper to support confluent-kafka-go, we made sure that no confluent-kafka-go implementation details/types leaked out of the interface. So, in theory, we can change underlying libraries again "easily" (well, with less rework in the apps at least).


Nice! That’s definitely a good practice :).

I’ve tried to make the kgo package easier to use, but I think some libraries suffer from over-abstracting the underlying protocol, which ultimately locks them out of some higher-efficiency code or some correctness. So, I’ve tried to make the library simple yet not hide all of the details. It seems to have panned out successfully for at least some people. Happy to answer any questions about it, and thanks for taking a look!


Nice!


I strongly believe that kafka-go is going to eat Sarama's lunch in a few years. In terms of GitHub stars, it's Sarama, kafka-go, then Confluent's lib.

Incidentally, I'm always surprised at how little Confluent supports librdkafka considering that their Go, Python and C# clients (that I'm aware of) are just wrappers around it.


The author uses jemalloc to confirm that malloc allocations are going unfreed, then later speculates the allocations might be something not visible to Valgrind, e.g. mmap.

I’ve often made similar mistakes, where data from step 1 already rules out a hypothesis for step 2 - but I’m too sleep-deprived and desperate to realize it. Debugging production issues is the worst.


It's insane how much ad hoc engineering and how many random details like compiler flags were required to pin down the location where the unfreed memory was allocated. It's likely that an experienced team was on it for several days (unless they already had experience with all the tools used).

It's also crazy how the bug could be tied back to an unbounded queue that was backing up. It seems like the wrapper library should be designed in a way where not handling the queue events is hard to do; meanwhile, the experts walked right into that.


I suspect Valgrind's massif would have helped (massifly). It shows memory usage over time, but also what fraction of memory was allocated where.


My thought as well. I was able to fix a years-old memory leak at a prior company using massif within an hour. It’s a really great suite of tools!


I wonder if statistics provided by librdkafka (available also with confluent-kafka-go) could have been used to solve the issue with less effort.

https://github.com/edenhill/librdkafka/blob/master/STATISTIC...
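
For reference, wiring those statistics up in plain C is roughly: set statistics.interval.ms and register a stats callback, which is handed the JSON document described in that STATISTICS.md (a minimal sketch; as the parent notes, the Go binding surfaces the same JSON):

  /* Sketch of pulling librdkafka's statistics in plain C: enable
   * statistics.interval.ms and register a stats callback, which receives a
   * JSON document. The callback fires from whichever poll call serves the
   * main/consumer queue. */
  #include <stdio.h>
  #include <librdkafka/rdkafka.h>

  static int stats_cb(rd_kafka_t *rk, char *json, size_t json_len, void *opaque)
  {
      (void)rk; (void)opaque;
      /* A real application would parse the JSON and export e.g. queue
       * lengths to its metrics system; here we just print the size. */
      fprintf(stderr, "stats: %zu bytes of JSON\n", json_len);
      return 0;  /* 0 means librdkafka retains ownership and frees `json` */
  }

  int main(void)
  {
      char errstr[512];
      rd_kafka_conf_t *conf = rd_kafka_conf_new();
      rd_kafka_conf_set(conf, "statistics.interval.ms", "5000",
                        errstr, sizeof(errstr));
      rd_kafka_conf_set_stats_cb(conf, stats_cb);

      rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_PRODUCER, conf,
                                    errstr, sizeof(errstr));
      if (!rk) { fprintf(stderr, "rd_kafka_new: %s\n", errstr); return 1; }

      for (int i = 0; i < 10; i++)
          rd_kafka_poll(rk, 1000);  /* serves callbacks, including stats_cb */

      rd_kafka_destroy(rk);
      return 0;
  }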


On an unrelated note, I am a Zendesk customer and absolutely love the app. Zendesk makes customer support fun!


The author describes using eBPF to trace malloc/free as a way around the fact that the program properly frees all heap objects before exiting, which was enlightening to me. Would it have been possible to issue a kill -9 to the program in the middle of execution while using Valgrind, to see this info as well? Or is it more to the point that eBPF is cleaner and allows you to see many more snapshots of memory allocations while the program is still running?


Off the top of my head I don't think this would work, because Valgrind needs the atexit hooks to run?

Some other signal that has a default action of terminating the program, and that our app isn't handling, might've worked though.


Fun side note: I once had to debug a GC stuttering issue in Crystal, and was delighted to find that the language was so damn open that I could just monkey-patch the actual allocator to print debug information whenever an allocation was made.


So the root cause was... not reading the librdkafka documentation?


The root cause was bad library design, either incorrect defaults or a bad interface.

You should not have to read the documentation of features you do not need just to discover that they’ll take down your system if you forget to configure them.

Either the feature should default to a safe innocuous state, or you should not be able to skip its configuration.


> You should not have to read the documentation of features you do not need just to discover that they’ll take down your system if you forget to configure them.

Yes, you should totally read the documentation before using a library and then blaming it for a memory leak bug.

At the very least you could read it before launching such an expensive troubleshooting operation.


Yeah, it's also a little unfair to blame C for a bug in the Go program.



