I fear this does too much.

All an application should care about is an API to do what it wants, so the only thing you need to decide on is a messaging protocol, which is probably going to be gRPC. (Why gRPC? I picked it out of a hat. JSON is very brittle when service definitions change, so you want an IDL. Feel free to pick one and then never care about it again; it doesn't really matter.) Then if you want publish/subscribe, you write a publish/subscribe service and make API calls to it: SendMessage, WaitForMessage, etc.

Service discovery and load balancing are already solved problems. Use Envoy sidecars, Istio, Linkerd, etc. for load balancing, tracing, TLS injection, all that stuff. For service discovery, use whatever your "job runner" provides (think k8s services, but feel free not to use k8s; it's just an example).

The tools you really need for success with microservices:

1) A way to quickly run the subset of services you need.

For unit tests, I prefer "fake" implementations of services. Often your app doesn't need the full API surface of an upstream service. If you have a StoreKey / RetrieveKey service, an implementation like "map[key]value" is good enough for tests. Make it super simple so you test your app, not the upstream app, which already has tests. (Do feel free to write some integration tests as a sanity check for CI, but keep the code/save/test loop fast and focused!)

For the "try it out in the browser" case, I'm pretty unhappy with the available tools. You want something like docker-compose without requiring docker containers to be built. I ended up writing my own thing to do this at my last job. Each service's directory has a YAML file describing how to run the service and what ports it needs. The tool can then start up the services, with Envoy as a go-between for them. That way you get HTTP/2, TLS (important for web apps because some HTML features are only available from localhost or if served over HTTPS, and your phone is never going to be retrieving your app's content from localhost), tracing, metrics, a single stream of logs, etc. I got it optimized to the point where you can just type "my-thing ." and have your web app working almost like production in under a second. It was great. I wish I had open-sourced it.

2) Observability. You need to know what's going on with every request. What's failing, what's slow, what's a surprising dependency?

2a) Monitoring. With a fleet of applications, it's unlikely that you'll be seeking out failures. Rather, they just happen and you don't know how often or why. So every application needs to export metrics, and these metrics need to feed alerts so that you can be informed that something is wrong. (The alert tells you something is abnormal; the dashboard with all the metrics lets you think of some likely causes to investigate.) Just use Prometheus and Grafana. They're pretty great.

2b) Distributed tracing. You don't have an application you can set a breakpoint in to pick apart a failing request. So you need to collect this information as requests happen and keep it around for a while, so that when something does break, everything you would have gathered by hand is already there and you can dive in and start investigating. Just use Jaeger. It's pretty great. (Jaeger will also give you a service dependency graph based on traces. Great for checking every once in a while to avoid things like "why is the staging server talking to the production database?". We don't know why, but at least we know that it's happening before someone deletes production.)

2c) Distributed logging. You will inevitably produce a lot of interesting logs that will be like gold when you're debugging a problem that you've been alerted to. These all need to be in one place, and need to be tagged so that you can look at all the logs for one request at once. The approach I've taken is to use Elasticsearch / Fluentd / Kibana for this, with the applications emitting structured logs (bunyan for Node.js, zap for Go; but there are many, many frameworks like this). I then instructed my frontend proxy (Envoy) to generate a UUID for each request and propagate it in the HTTP headers to the backend applications, and wrote a wrapper around my logging framework to extract that from the request context and log it with every log message. (You can also use the OpenTracing machinery for this; I personally logged the request ID and the trace ID; that way I could easily go from looking at Jaeger to looking at logs, but traces that weren't sampled would still have a grouping key.)

The more deeply logs integrate into your infrastructure, the better. As an example, something I did was to include a JWT signed by the frontend SSO server with every request. Then my logging machinery could just log the (internal) username. When someone came to my desk and said "I'm trying to foo, but I get 'upstream connect error or disconnect/reset before headers'", I could just look for logs by their username. Much easier than trying to figure out what service that was, or what URL they were visiting.

Anyway, sorry for the long post. My TL;DR is that you must invest in good tooling no matter what architecture you use. You will be completely unsuccessful if you attempt microservices without the right infrastructure. But all this is great for monoliths too. Less debugging, more relaxing!