I'm actually looking into a segfault issue deep in the bowels of a C++ addon we have in node.js (anyone in #node.js will have seen me asking about it over the past few weeks), but reading this makes me realize how woefully underequipped I am to hunt for problems of this nature.
My problem is likely in one of our addons, but this kind of debugging, this whole genre of problem solving is entirely beyond me. How do I get to this level? What do I need to learn? To study?
It's just a little depressing to read something like this and see how far the road ahead goes, despite how far I've already traveled...
- Use valgrind (or gdb)! Your segfault should be simpler to find than a memory leak, because you know what line the segfault happens on.
- If you have a value that's getting mangled (pointer getting overwritten by a write to another address) and you can't figure out why, use watchpoints to see when that address is getting touched. http://sourceware.org/gdb/onlinedocs/gdb/Set-Watchpoints.htm...
- Find a minimal program to reproduce the problem. It's gross, but I used to actually just take a copy of the code and cut things out until the bug stopped, then look at the last thing I cut. You can do this as a binary search - only run the first half, check for the bug, only run the second half, check for the bug, repeat on the buggy half.
As I said, segfaults are a lot easier than this kind of problem (not that they're easy when you start out). Don't be discouraged! I would help out too, but you'd need to send everything to reproduce the bug (client code, server code, server platform, etc.)
Debugging severe memory corruption or memory leaks is annoying, and can occasionally take a lot of time, but it's not necessarily that bad. Here are some pointers that may be helpful.
Tools: valgrind and gdb are obvious. But don't forget your compiler! Crank up the warnings, and look through LLVM-clang's -fsanitize=<foo> and warning options. (Also, if you're already on OpenBSD, check out the "S" flag to malloc; if you're on Solaris, check out, well, the blog post.) Finally, Boehm's conservative garbage collector has a "find memory leaks" mode, which looks useful for those cases where you can't get valgrind working. If all else fails, shovel through the memory dump looking for repeated patterns.
Testing: try to reproduce the problem; the first iteration may look something like "it runs out of memory after 36 hours". Then simplify: for instance, the author of the article could have asked "does this still happen if the server closes the connection immediately, without sending any data?" and would have found the bug very quickly. (Of course, you're likely to ask a lot of wrong questions before hitting on the right one; experience and full knowledge of the system you're working on are useful but not sufficient.) Questions like "does this happen more quickly if we ping 100 times per second instead of once every ten minutes?" are often useful as well. (Finally, just printing memory usage every N seconds is helpful.)
Coding: be careful when writing code. The usual ways of improving code quality (e.g. code reviews) work to reduce memory leaks, too. Try to run a multiple-hour soak test every so often during development (preferably on a CI server); it's a lot easier to debug "hey, we suddenly run out of memory after yesterday's commits" than "well, something goes wrong in production". If you're doing new development, consider alternatives to malloc() - arena/pool allocation (e.g. libtalloc) is convenient and very fast if your memory use is tree-like (e.g. a connection owns a request owns some memory to sort the data before returning it). In C, goto a single chunk of cleanup-and-return code rather than duplicating the cleanup at every place where you exit from the function.
You need determination and experience, and some knowledge of how code is compiled at a low level.
Tools like those described in the article are handy, but aren't absolutely necessary. They save a lot of time, but the same effects can usually be gotten by more laborious means.
You have a segfault. You should already know where in the code it's occurring; it's either an access to bad memory with the instruction pointer (IP) at the point of access, or an attempt to execute code with the IP pointing at the bad memory, in which case the top of the stack (or, depending on calling convention, one of the registers) normally contains the place it came from (necessarily, since that's where the code expected to return).
There are ways to turn an instruction pointer into a line number offset when you have appropriate debug info, if you can't get the program running under a debugger.
Given the line number, segfaults can typically be split into three categories: plain bad logic, use after free, and memory corruption. The last is the hardest to find IME; it's most easily done using a debugger and hardware breakpoints on memory address modifications, but you need a stable repro and a consistent memory allocator that gives predictable addresses on every rerun.
If any of the above is meaningless to you, it should give you some clues as to where you need to research.
Yeah I got this far, but the stack trace doesn't have the symbols. The guy I'm working with said it might be because the addons use dynamically linked libraries instead of statically linked libraries...
By the way, I think a jenandre used to work at my company.
What library is missing the symbols? You may be able to tell from the stack trace. You should at least get symbols for some frames before the trace is lost. Examine that frame to see what it's doing. Linking a non-debug library w/ a debug executable just means that gdb won't display the symbols when it enters code for that library. But if your addons are built with -g you'll get the symbols for the addon before it starts calling the other library it is using.
Btw, people often write javascript wrappers that manually refer to the build/Release/.node version instead of the debug version (which will get added to build/Debug/.node). Check that first.
Even if you are using dynamically linked libraries (like the zmq addon does), you can always build debug versions of those to get all of the symbols (try CFLAGS=-g when ./configure).
We never make it to symbol-land in the bt, so I'm assuming it's because we never get into a debug library at all. I will make sure the js is referring to the correct version of the addon though.
My next step is probably going to be to try and figure out how to get every library to build with symbols. Thanks for the help. :D
I can't claim to be at the level of the joyent guys presented here, but I think taking an Operating Systems class and a Computer Architecture class, or reading the respective textbooks, helps. At the same time you have to be familiar with the particular OS you happen to use, probably up to the point of reading and having a basic understanding of the source code of the most important subsystems (virtual memory, process scheduling, filesystem handling, TCP/IP stack) and understanding what the system calls are and what they do. Then you need to know the wide range of tools the given OS offers for examining things, so that you don't get hopelessly stuck in the face of an emergency; you often have to investigate a crash while it happens to even be able to reproduce it, so you need to know how to examine a running process etc. For Linux this means knowing stuff like:
There is a big bunch of tools in the OS that very few developers know. Sysadmins know more of them, but they often don't understand the OS and use the tools without understanding their output too well.
Some confirmation of what I have written here is the fact that Joyent forked OpenSolaris to create an OS precisely to make it easier to do things of this kind:
In 2005, Sun Microsystems open sourced Solaris, its renowned Unix operating system, eventually to be released as a distribution called OpenSolaris. Among the earliest adopters and most effective advocates of OpenSolaris was Ben Rockwood, who wrote The Cuddletech Guide to Building OpenSolaris in June, 2005 – the first of his many important contributions to the nascent OpenSolaris community. Meanwhile, Joyent's CTO Jason Hoffman was frustrated by the inability of most operating systems to answer seemingly-simple questions like: "Why is the server down? When will it be back up? ... Now that it's back up, why is my database still slow?"
Jason knew that these questions would be a lot easier to answer on Solaris-based systems, and recognized Sun's open-sourcing initiative as a huge opportunity.
I looked at node.js for a system I'm involved with creating, but ultimately we went with Erlang just because it's been around a lot longer and is more stable in terms of things like this. We're working on a semi-embedded system that will not always be on-line or accessible for debugging. We also considered Go, which probably would have been more familiar to C++ guys, but it was also deemed a bit immature even if it seems like a very pleasant language to work with.
Of the three options you've considered, Erlang is clearly the best choice, but why haven't you even considered Java (or any other JVM language)? When it comes to monitoring, profiling or debugging a long-running application, nothing comes close to the JVM. And, needless to mention, it's extremely mature and stable.
A Java memory leak can be solved in a matter of minutes, or – if it's especially complex – in a couple of hours tops. You can take a heap dump and analyze it with Eclipse Memory Analyzer, and if you need allocation stack-traces, you instrument your code with VisualVM. All of this can be done remotely and without stopping the app.
Flight Recorder, which has recently been added to the HotSpot VM, even gives you instrumentation with hardly any performance penalty (though it requires a commercial license if used in production).
Java lets you trade memory for performance. The bigger the heap, the less frequent the young-gen GCs, and the less frequent the young-gen GCs, the less garbage you need to collect (because more young objects get to die). So most times when you see large Java heaps, that's by choice, as many people prefer paying with memory (which is cheap on servers) for performance. So Java is a "memory hog" only when you want it to run at full speed (and that's true for all generational GC environments). But nothing is stopping you from running "standard" Java with a really small heap; you'd still get better performance than Node or Erlang, and enjoy all of the excellent tools.
Not knowing it well is a good reason not to use it, though.
I know that for a similar project I did, I ruled out Java as I (and my team at the time) were not very productive in it. It mattered in that instance, but might not in others. Depends on your team's skill set, I guess!
I've seen two sources of memory leaks in Erlang based systems: 1) unbounded process message queues, and 2) passing binaries across process (pid) boundaries.
Many beginning erlangers run into these, and they're relatively easy to identify and correct. With a little practice, these become easy patterns to recognize and avoid.
As far as httpc, I'm unaware of that bug -- but I can say that I recently worked on a commercial product that leveraged httpc as a core component of the service, and it worked fine.
> As far as httpc, I'm unaware of that bug -- but I can say that I recently worked on a commercial product that leveraged httpc as a core component of the service, and it worked fine.
I'd like to read more about how we can prevent this class of error going forward. Could stronger typing or RAII or some other feature or trick have made the bug apparent at compile time?
I made a very basic Node.js module in C++ with V8, and it was surprisingly difficult to make a good (idiomatic JS behaviour, believably bug-free) wrapper for a straightforward class and factory method. I say this coming from Boost.Python and Luabind, where there are some tricky parts to binding complex classes, but simple ones are easy enough and, once written, obviously correct.
I've been running an extremely simple Node application on 0.10.18 for a while now and it has a very gradual memory leak. My code is just a few dozen lines, and it all seems pretty innocent. I am also using Hapi, so I thought maybe Hapi has a leak in it somewhere. Now I wonder if I have the same leak as Walmart here. I just now upgraded to 0.10.22 and am curious to see where I end up. If the leak goes away then hot damn, I got lucky :)
The office photocopier broke down, so the manager called in a repairman. The repairman took one look at the machine, drew an 'X' at the problem part, and handed the manager a bill for $500. The manager was shocked at the price and demanded an itemized bill. The repairman simply wrote:
Marking the 'X' - $1
Knowing where to put the 'X' - $499
I started Googling the Picasso "principle" about it being a lifetime to know how to do it, but it turned into Googling this one instead. Found a snippet, "Karl Steinmetz (German-born, U.S citizen), the well known electrical engineer who worked out many details of a.c. theory and was responsible largely for the adoption of a.c. for commercial use, was once called in by the General Electric Company to examine a poorly performing transformer. After a few minutes, Steinmetz marked an x on the transformer core and said, “It will work if you take off the turns from this x to the end.” The prescription worked well, and Steinmetz later sent G.E. a bill for his service of $10,000. The company official thought the bill excessive and asked for the itemization. Steinmetz then sent them a more detailed bill: For putting x on transformer core : $1; for knowing where to put the x: $9999." It's funny that in today's world, both Picasso and Steinmetz take "minutes" to do this, but in perhaps earlier tellings, it took hours for Picasso to do his work and days for Steinmetz: http://edisontechcenter.org/CharlesProteusSteinmetz.html
I had no idea that this quote has such a delightful and well-documented origin. I'd only heard the story told about Picasso (and various mechanics and engineers). A great example of how these things morph over time.
The story is delightful because it pitted two great Victorian aesthetes against one another. Ruskin had said this about Whistler:
I have seen, and heard, much of Cockney impudence before now;
but never expected to hear a coxcomb ask two hundred guineas
for flinging a pot of paint in the public's face.
So Whistler sued for defamation and was examined by Ruskin's lawyer:
Holker: Did it take you much time to paint the Nocturne in Black and Gold?
How soon did you knock it off?
Whistler: Oh, I 'knock one off' possibly in a couple of days – one day
to do the work and another to finish it.
Holker: The labour of two days is that for which you ask two hundred guineas?
Whistler: No, I ask it for the knowledge I have gained in the work of a lifetime.
The insinuation in the lawyer's question ("how soon did you knock it off?") is hilarious!
Whistler, by the way, was a great wit and had a famous skirmish with Oscar Wilde:
Did I read the same page you did? It seemed quite generic and gave the following rating, which means it hasn't been explicitly disproven. In fact, I'd suggest that these things might indeed have happened if the skilled worker, needing to prove his worth, used such a glib line because he'd heard it somewhere else. But I'm skeptical too; that's why I was Googling. You'd think someone, somewhere would know how long each took and would be more consistent in the retelling. Or that they would have kept the itemized receipt for the joke.
LEGEND: Hollow yellow bullets are the ones most commonly associated with "pure" urban legends — entries that describe plausible events so general that they could have happened to someone, somewhere, at some time, and are therefore essentially unprovable. Some legends that describe events known to have occurred in real life are also put into this category if there is no evidence that the events occurred before the origination of the legends.
Wow, thanks for the link! I would never have thought that the joke was based on an actual incident.
For those who are wondering about the Picasso principle, it's based on a story about a woman who asked Picasso why he charged $5000 for a painting, when it only took him seconds to paint it. He replied, "Madam, it took my entire life!".
Wonderful blog post; major props for the engineering time expenditure. But, why do you have an Olark chat widget that says "Contact Sales". I don't want to have anything to do with those schlubs! If anything, I want to talk to serious engineers like you!
Ironically, this page hangs Chrome indefinitely when I try to load it. Luckily it only hangs the tab so I can still close it. I guess I'll fire up Firefox to see if I can actually read the article.
Edit: Actually, it loads fine in a private browsing tab, so it must be a bad interaction with some extension. Oh well.
Yeah, that's pretty much it. I was trying to read about a problem with a javascript engine, and I was prevented from doing so by what could quite likely have been another problem with the same javascript engine.
I don't think "ironic" is entirely the wrong word. It can mean a state of affairs or an event that seems deliberately contrary to what one expects and is often amusing as a result.
In this case it's more accidental or incidental than deliberate....
Thank you for indulging my silly word fetish. What is the state of affairs that could be identified as deliberately or merely accidentally/incidentally contrary in this situation:
* There is a discussion of a problem in V8.
* While reading the discussion a user stumbles across a problem in Chrome.
* V8 is a component of Chrome.
* The Chrome bug may or may not be related to V8.
For the sake of the discussion we can pretend that "deliberate" is not a key component of the definition you give[1] and substitute incidental/accidental. What is the contrary result? Given the memory leak in V8 we know that V8 is not the first software product to be bug free. So it should not be surprising or unexpected that there may be another problem in V8 or Chrome.
One of the things that I find interesting about the way irony is tossed around these days is that it signals no information. As a result it is never an "entirely wrong word." Because it does not convey anything to the reader it is merely superfluous. Ironically and randomly have become linguistic NOP sleds.
[1] I have no problem with your definition and I realize you were not endorsing OP's conception of irony.
You're right, I have no idea why the scenario in question could in any way be considered the opposite of what the user expected, and that---perhaps making a leap in logic in conflating the causes---he found it somewhat humorous.
I would put it this way: I was expecting to read about a bug, not encounter one myself. So I guess the two outcomes are not exactly opposite, and maybe I'm using "irony" incorrectly here. I'm not sure.
"Ironically" is a word that you see and hear frequently in everyday English. I think it's interesting how varied people's concept of irony is for such a common word. From what I can tell, most people's definition is somewhere between "serendipity's evil twin" and "partially related." It seems in this case OP thinks the definition is the latter. I think that sooner or later "ironically" is going to meet the same fate as "randomly" ("It is so random we ran into you, we were just talking about you."), which I think has zero meaning in conversational English.
To be honest, I think most people's definition of irony is largely shaped by Alanis Morissette's terribly misinformed but catchy song and the hipster d-bag who says he has an "ironic mustache."
I think it's the evolution of language that is interesting. It seems we have a case where the more a word is used, the less precise its definition becomes, until it carries no meaning at all. There has to be some linguistic jargon for this type of situation.
Before I posted that I googled "define irony" and was surprised to see the first definition as something that sounded like sarcasm. The definition appears to already be changing, making hipsters retroactively correct, which is just another reason we can all hate them. ;)
I assume that they can restart the server at intervals or use load balancing. A few months of developer time for something like this seems excessive unless he was working on something else as well.
As a former software engineer at Walmart I can tell you that a few months for something like that is nothing to them. They employ several thousand devs at the home office. Having one of them focus on a bug like this isn't an issue in terms of time or money. In their minds it's worth it given the scale of the enterprise.
I think there are still quite a few C and C++ programmers out there. To me this is a great example of why it is better software engineering to write a server in something like Node.js. Because rather than having a million code bases with potential memory leaks like this one, there is just the Node code. In ordinary JavaScript code it's impossible to cause a problem like this one.
It is fairly easy to create a long running server in a GC'd language that will continually consume more memory. Some don't like to call it a memory leak, which is why I put it the way I did, but the effect is the same.
At the end of the day, the more that you think this is impossible the more likely your programs will experience it. So please don't think that your program is immune to this because you use Javascript.
A good example might be a server process which never releases memory, so the longer it runs, the more memory it "consumes": that is, it holds the maximum memory required to handle any previous request.
This might be a well known solved problem, but I have heard it mentioned before.
I have an apache box that runs a bunch of PHP and flat HTML sites. I have to set it to only use 10 processes, and to kill them every half hour, because they all gradually swell up to 35MB each (which I imagine is where they've loaded pretty much all the PHP on my server, independently of each other).
Without the number limit, or the kill policy, the server runs out of RAM and crashes. (it's only a cheap one, with 512MB RAM.) Luckily it's a very low traffic set of sites, so these limits don't break the experience. I'm glad I didn't have to solve this problem any deeper!
Have you enabled the GC? Some default configs disable the PHP GC because it's SOP to run temp startups (eg mod_php) or restart them on a regular basis (like you).
I haven't had to log in for a while (about a year), so I've been happy to leave it as it is. When I migrate over to DigitalOcean (which I've been intending to do for ages now) I'll look into that instead!
It's all a matter of what you're trying to do. If you have a nice test framework with good coverage and automated build tests running stuff by valgrind, you mitigate the risk of having memory leaks.
Is it worth all the extra effort, when you could just go with a language that does GC at runtime? Sometimes it is, it depends on the use case and it depends on the people.
I've done a project rewrite from C to Java where the Java implementation performed a lot better and consumed less memory than the C one. Some of the performance gain was because I chose better algorithms and limited DB interaction, but some of the gain came from having immutability guarantees, whereas the C code would just copy a lot of data structures where immutability was not guaranteed. A lot of time in the C-based project was spent doing mallocs and frees and memcpys for nothing. This is poor project design, but poor design happens, and Java has some protection against that due to promoting encapsulation to a greater extent than C does by default.
I am 100% certain that if the original project would've been better designed and managed, it would've kicked ass because having it in C would have allowed us to have a smaller memory footprint which would've meant a greater monetary profit in the end for this particular project over time due to system constraints.
What it comes down to is that if you have a good team that understands the required dev process of a mid-sized C project, and who are proficient enough to implement such a project without doing too much "quick fixin'", it can be worth it. If you're limited by the size and/or competence of the team (everyone can't be a rockstar; I certainly am not), or limited in turnaround time for the product, choosing C will probably not be in your best interest. But if you have the right people around, and there's a monetary gain in doing things efficiently with the hardware you have, then C is still an awesome tool to have in your toolbox.
Most of the time, in the world of SaaS and web based solutions, using C doesn't make a lot of sense except for some bits of core functionality. That's why I like languages with good C bindings. Knowing e.g., Python and C, you really can get the best of both worlds.
Excellent details on the sleuthing that went on to find this error. It's great that such powerful tools are available to debug errors like this, and your write-up helps me learn more about how to properly debug my Node apps.
cool writeup. while not a node.js user, i love these sorts of tours of system internals - i always learn a lot, both specific tools and also processes of using them.
thanks for the details, very articulate and useful stuff.