Hacker News: kevg123's comments

1. A classic book is The Mythical Man-Month. It discusses a surgical team approach that I think is interesting: there are lead surgeons, and the rest of the team is there to support them.

2. Programmer Anarchy by Fred George on YouTube is an interesting idea.


MMM is a good book, but I like to suggest Weinberg's The Psychology of Computer Programming as another contemporary (to it) take that discusses both how to examine what's happening in teams (effective or not) and what was observed in some teams (effective and not). Like MMM, it received an update later on. Weinberg left his original text essentially intact (I think light changes to address errors in the printing, not changes to what was written though) and added commentary to each chapter instead of editorializing them directly.


> What are the key assets you monitor beyond the basics like CPU, RAM, and disk usage?

* Network is another basic that should be there

* Average disk service time

* Memory is tricky (even MemAvailable can miss important anonymous memory pageouts with a mistuned vm.swappiness), so also monitor swap page out rates

* TCP retransmits as a warning sign of network/hardware issues

* UDP & TCP connection counts by state (for TCP: established, time_wait, etc.) broken down by incoming and outgoing

* Per-CPU utilization

* Rates of operating system warnings and errors in the kernel log

* Application average/max response time

* Application throughput (both total and broken down by the error rate, e.g. HTTP response code >= 400)

* Application thread pool utilization

* Rates of application warnings and errors in the application log

* Application up/down with heartbeat

* Per-application & per-thread CPU utilization

* Periodic on-CPU sampling for a short window, rendered as a flame graph

* DNS lookup response times/errors

> Do you also keep tabs on network performance, processes, services, or other metrics?

Per-process and over time, yes, which are useful for post-mortem analysis
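
Two of the items in the list above (swap page-out rates and TCP retransmits) come from cumulative kernel counters on Linux, so they have to be sampled twice and turned into rates. Here is a minimal Python sketch, assuming the `pswpout` field of /proc/vmstat and the `RetransSegs` field of /proc/net/snmp. The function names are made up for illustration and are not part of any monitoring tool:

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def parse_tcp_snmp(text):
    """Parse the two 'Tcp:' lines (header, then values) of /proc/net/snmp."""
    rows = [line.split()[1:] for line in text.splitlines() if line.startswith("Tcp:")]
    if len(rows) != 2:
        raise ValueError("expected one Tcp: header line and one Tcp: value line")
    names, values = rows
    return {name: int(value) for name, value in zip(names, values)}


def rate_per_second(before, after, key, interval_seconds):
    """Per-second rate of a cumulative counter between two samples."""
    return (after[key] - before[key]) / interval_seconds


def sample_rates(interval_seconds=10.0):
    """Linux only: read both files twice and return per-second rates."""
    def read(path):
        with open(path) as handle:
            return handle.read()

    def snapshot():
        return parse_vmstat(read("/proc/vmstat")), parse_tcp_snmp(read("/proc/net/snmp"))

    vm_before, tcp_before = snapshot()
    time.sleep(interval_seconds)
    vm_after, tcp_after = snapshot()
    return {
        "swap_pages_out_per_s": rate_per_second(vm_before, vm_after, "pswpout", interval_seconds),
        "tcp_retrans_segs_per_s": rate_per_second(tcp_before, tcp_after, "RetransSegs", interval_seconds),
    }
```

Calling `sample_rates()` on a Linux host returns both rates for one interval. What counts as "too high" depends on the workload, so alert thresholds are left out here.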


Those are some great ideas for Prometheus alert rules, if they aren't already included here: https://samber.github.io/awesome-prometheus-alerts/


IO wait time for disks is a great one too for catching IO load. `glances` and `atop` do a good job of surfacing it when it's an issue.


With all that, might want some good automatic anomaly detection. While at IBM's Watson lab, I worked out something new, gave an invited talk on the work at the NASDAQ server farm, and published it.

With a lot of monitoring someone might be interested.


How do you identify a mistuned vm.swappiness?


I rely on a heuristic approach which is to track the rate of change in key metrics like swap usage, disk I/O, and memory pressure over time. The idea is to calculate these rates at regular intervals and use moving averages to smooth out short-term fluctuations.

By observing trends rather than just a static value (a data point at a specific time), you can get a better sense of whether your system is underutilizing or overutilizing swap space. For instance, if swap usage rates are consistently low but memory is under pressure, you might have vm.swappiness set too low. Conversely, if swap I/O is consistently high, it could indicate that swappiness is too high.

This is a poor man’s approach, and there are definitely more sophisticated ways to handle this task, but it’s a quick solution if you just need to get some basic insights without too much work.
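
The heuristic described above could be sketched in Python roughly as follows. This is an illustration, not a tested tuning tool: `SmoothedRate`, `swappiness_hint`, and both default thresholds are hypothetical placeholders, and how "memory is under pressure" gets decided is left to the caller.

```python
from collections import deque


class SmoothedRate:
    """Turn samples of a cumulative counter into a moving average of its rate."""

    def __init__(self, window=5):
        self._rates = deque(maxlen=window)  # most recent per-second rates
        self._last = None  # (timestamp, value) of the previous sample

    def add(self, timestamp, value):
        """Record a sample; return the moving average, or None before two samples."""
        if self._last is not None:
            elapsed = timestamp - self._last[0]
            if elapsed > 0:
                self._rates.append((value - self._last[1]) / elapsed)
        self._last = (timestamp, value)
        if not self._rates:
            return None
        return sum(self._rates) / len(self._rates)


def swappiness_hint(swap_rate, memory_pressured, low=1.0, high=1000.0):
    """Map a smoothed swap rate to a rough hint.

    `low` and `high` are placeholder thresholds (pages per second),
    not recommended values; calibrate them against your own baseline.
    """
    if swap_rate >= high:
        return "swappiness may be too high"
    if memory_pressured and swap_rate <= low:
        return "swappiness may be too low"
    return "no clear signal"
```

Feed `add()` one (timestamp, counter value) pair per interval, for example a swap page-out counter, and pass the smoothed result to `swappiness_hint` only after several intervals so a single spike does not drive the conclusion.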


That is a good list, now just need to prioritize (after finding the ICP).


Before you start adding all of that, make sure you have customers like the parent poster.

For example I monitor disk space, RAM, CPU and that’s it for external tooling.

If any of that goes above thresholds, someone will log into the server and use Windows or Linux tooling to check what is going on.

I mostly monitor the services' health check endpoints, i.e. HTTP calls to our own services. That shows whether the network is down or the services have shoddy response times.

So all in all, not much monitoring of the servers themselves.
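
A health check poll of this kind can be sketched in a few lines of Python using only the standard library. The function names, the 1 second "slow" threshold, and the status labels are assumptions for illustration. The comment above does not say which tooling is actually used.

```python
import time
import urllib.error
import urllib.request


def classify(status, elapsed_seconds, max_seconds=1.0):
    """Map an HTTP status (None if unreachable) and latency to a label."""
    if status is None:
        return "down"
    if status >= 400:
        return "error"
    if elapsed_seconds > max_seconds:
        return "slow"
    return "ok"


def check(url, timeout=5.0, max_seconds=1.0):
    """Call a health check endpoint once and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        status = exc.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        status = None
    return classify(status, time.monotonic() - start, max_seconds)
```

Running `check()` against each service's health URL on a schedule and alerting on anything other than "ok" covers both the "network is down" and "shoddy response times" cases.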


I think the hardest part is deciding which gems to use. It's not uncommon to end up with over 50 gems in your Gemfile.

For example, built-in capabilities for authentication are limited: https://github.com/rails/rails/issues/50446

So then do you go with has_secure_password/etc., Devise, rodauth, authentication-zero, or something else? These are big decisions that then might affect other things like authorization, OAuth, PassKey, etc.

And that's authentication & authorization, which is a relatively well-understood and well-maintained area. Other areas might have totally unmaintained gems that have issues with recent versions of Rails, native module compilation issues with more recent versions of operating systems, etc.

A lot of Rails guidance on blog posts and StackOverflow might be outdated.

This problem is not unique to Rails, and I still think Rails is great and relatively vibrant. Nevertheless, I suggest being very wary of Rails guides, blog posts, and StackOverflow answers that are more than 1 year old. Before deciding to use a gem, do a careful study and inventory of the candidates and review their recent usage and activity.


Number of hours per week would be nice. I think we'll see a lot more demand and supply of 30 and 20 hour per week jobs.


The more common recommendation is to catch what you need to handle in a special way (if any) and then have a catch-all (or re-throw) for the rest, and if you don't need to catch anything specific, then just catch (or re-throw) all.

Think of exceptions like error codes. Most often one just checks if there is an error or not. Sometimes, one checks for specific error codes in addition to the general check. It would be rare to check every single error code, though possible.

By this analogy, I think the recommendation to check each type of exception is very uncommon.

Most importantly, make sure you do always catch exceptions at some level and handle them somehow (even if it's just logging), and also make sure no exception/error information is lost (e.g. blank catch block, not logging all exception details, not re-throwing with an inner exception so the original stack is lost, etc.).
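
As a concrete illustration of the pattern above, here is a small Python sketch: one specific exception gets special handling, everything else hits a catch-all that logs and re-raises, and the wrapped exception keeps its original cause. `ConfigError`, `parse_port`, `run`, and the 8080 default are hypothetical names and values invented for the example.

```python
import logging

logger = logging.getLogger(__name__)


class ConfigError(Exception):
    """Hypothetical application-level error used for this example."""


def parse_port(raw):
    try:
        return int(raw)
    except ValueError as exc:
        # "from exc" keeps the original exception and its stack as __cause__.
        raise ConfigError(f"invalid port: {raw!r}") from exc


def run(task):
    try:
        return task()
    except ConfigError:
        # The one case handled in a special way: log it and fall back.
        logger.warning("bad configuration, using default port", exc_info=True)
        return 8080
    except Exception:
        # Catch-all: log full details, then re-throw so nothing is lost.
        logger.exception("unexpected failure")
        raise
```

`logger.exception` records the full traceback, and the bare `raise` re-throws the same exception, so a caller higher up still sees the original error.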

