I'd love to see data on the average on-call incidents for an application written...

rdtsc · on March 6, 2019

> I'd love to see data on the average on-call incidents

I don't have any hard data to compare, but I have been involved in debugging running Erlang systems. It's very nice to be able to restart separate supervisors while the rest of the processes keep handling requests, and to hot-load code to, say, fix bugs or add extra logging. And my all-time favorite: live tracing after connecting to a VM's remote shell. You can just pick any function, args, and process and say "trace these for a few seconds if a specific condition happens". None of those individually are earth-shattering, but taken together they are just so pleasant to use. I wouldn't enjoy going back to anything that didn't have those capabilities.

And yes, that restarting of sub-systems (supervision trees) happens automatically as well. There were a number of cases where it turned a potential "wake up at 4am and fix this now, cause everything crashed" into a "meh, it's fine until I get to it next week" kind of problem.

brightball · on March 6, 2019

Is there a good write up of how to do that somewhere?

rdtsc · on March 6, 2019

Which part, or just ops with Erlang in general?

Overall I would say this book is a good start https://www.erlang-in-anger.com/

Supervisors are just a general pattern in Erlang. Any book will have something about it. I like this one: https://learnyousomeerlang.com/supervisors

Restart frequency and limits are just parameters you specify on the supervisor, so you don't need to do anything fancy or special there.
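As a rough sketch of what those parameters look like: the restart limit is the `intensity`/`period` pair in the supervisor flags. The module names `demo_sup` and `demo_worker` are made up for illustration, and the numbers are arbitrary.

```erlang
-module(demo_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Allow at most 3 restarts within 10 seconds. Past that limit the
    %% supervisor itself terminates and its own parent decides what to do.
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 10},
    %% demo_worker is a hypothetical module that exports start_link/0.
    Child = #{id => demo_worker,
              start => {demo_worker, start_link, []},
              restart => permanent,
              shutdown => 5000,
              type => worker},
    {ok, {SupFlags, [Child]}}.
```

With `one_for_one`, only the crashed child is restarted; its siblings keep running.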

Hot code loading might not be as obvious: http://erlang.org/doc/reference_manual/code_loading.html but it is essentially just compiling the module on the same VM version (or close by, no more than 2 versions away) and copying it to the server in the same path as the original. The original can be saved to a backup file. Then run `l(modulename)` in the shell to load it.
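Roughly, the remote shell session for that could look like the sketch below. `my_mod` is a placeholder module name; the build and copy steps happen outside the VM and are only shown as comments.

```erlang
%% On the build machine (outside the VM):
%%   erlc my_mod.erl          % produces my_mod.beam
%% Back up the old my_mod.beam on the server, then copy the new one
%% into the same directory. In the node's remote shell:
code:which(my_mod).   % shows the .beam path the VM loads this module from
l(my_mod).            % shell shortcut: purge the old code, load the new .beam
%% Outside the shell, the same thing spelled out:
code:purge(my_mod), code:load_file(my_mod).
```

One thing to keep in mind: purging kills any process still executing the old version of the module, so long-running loops should make fully qualified calls (`my_mod:loop(State)`) to pick up new code.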

For tracing I recommend http://ferd.github.io/recon/. The Erlang in Anger book also has examples of tracing. http://erlang.org/doc/man/dbg.html has some nice shortcuts too, but be careful using it in production as it doesn't have any overload protection. So if you accidentally trace all the messages on all the processes, you might crash your service :-)
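To give a flavor of the recon side, here is a minimal sketch of `recon_trace:calls` from a remote shell. `my_mod:handle/2` is a made-up function, and `Pid` is assumed to be already bound to the process you care about.

```erlang
%% Print at most 10 calls to my_mod:handle/2, then stop tracing.
recon_trace:calls({my_mod, handle, 2}, 10).

%% Rate limit instead: at most 5 trace messages per 1000 ms.
recon_trace:calls({my_mod, handle, '_'}, {5, 1000}).

%% Only trace calls whose first argument is the atom error, only in one
%% process, and also print the return value.
recon_trace:calls({my_mod, handle, fun([error, _]) -> return_trace() end},
                  10, [{pid, Pid}]).

%% Turn all recon tracing off.
recon_trace:clear().
```

The count or rate limit in the second argument is what gives recon the overload protection that plain `dbg` lacks.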

brightball · on March 6, 2019

Tracing is mainly what I was going for. I'm very familiar with the various patterns and the run time, but I haven't seen the tracing aspects referenced in as much detail.

Thanks!

kureikain · on March 5, 2019

I don't have data, but I can tell you from my experience with Ruby, Go, Node and Elixir.

I had zero on-call incidents for Go. I had very few for Elixir, and those bugs were in logic code. Same with Ruby.

But it's a disaster with Node. We used TypeScript, so it caught a lot of type issues. However, the Node runtime is weird. We ran into DNS issues (like you have to bump the libuv thread pool size and cache DNS), JSON parsing issues, blocking the event loop, max memory, etc.

optimusclimb · on March 6, 2019

This would be too heavily influenced by confounding factors.

For instance:

* Are the teams that use certain languages comprised of more experienced people?

* How mature is the company and project? I.e., a faster moving startup cutting more corners, where time was decided to be of the essence (rightly or wrongly), will likely produce more on-call incidents than a slower, more established company that can take its time