Option one:
Write a CloudFormation/Terraform template that involves O(1) machines, and deploy 2000 identical copies.
Option two:
Write a template that deploys O(N = 2000) interdependent services across roughly 3-10x as many machines, and deploy one copy.
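To make the contrast concrete, here is a minimal Terraform sketch of option 1; the variable, resource name, AMI, and instance type are placeholders of my own, not anything from the original system:

    # Option 1: define one machine, stamp it out 2000 times.
    # var.app_image and the instance type are hypothetical placeholders.
    variable "app_image" {
      type        = string
      description = "AMI ID of the single, identical service image"
    }

    resource "aws_instance" "app" {
      count         = 2000
      ami           = var.app_image
      instance_type = "m5.large"
    }

    # Option 2, by contrast, cannot use a single counted resource:
    # it needs ~2000 distinct resources/modules, each with its own
    # configuration and depends_on wiring, so the template itself
    # grows with N instead of staying O(1).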
From what I can tell, you are arguing for option 2. It is strictly worse than option 1. In addition to being more complex, it has a few nines less reliability and costs 3-10x more for the hardware. The dev and CI hardware budgets are going to be 10x more because you can't test it on one machine, and it has bugs that only manifest at scale.
Source: I do this for a living, and have been on both sides of this fence. Option 1 typically has 5-6 nines (measured as the chance that a given customer sees a 10-second outage); option 2 never gets past 3-4 nines (measured as the fraction of time in which at least N% of customers see no outage).
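For a back-of-the-envelope sense of that gap: 5 nines (99.999%) works out to roughly 5 minutes of customer-visible outage per year, while 3 nines (99.9%) is nearly 9 hours per year, about two orders of magnitude more downtime.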
The modern-vs-old technology debate has nothing to do with this tradeoff. If you want, you can build option 2 with EJB + CORBA on an IBM mainframe, and option 1 with Rust and JSON on an exokernel FaaS.
I'd argue for Option 3, which is to try to understand the workloads placed on the original system and then design the new system based on this. I think having 2K independent database servers would not normally be optimal for 2M users, but it is possible.
If the old system is exceeding uptime SLAs, meeting all business needs, and coming in under the budget for such an investigation (it sounds like the total operations budget was less than 10% of one engineer's time), then why bother?
I don't know the situation; not touching it may have been optimal. I'm suggesting that if it was going to get rewritten, I would at least study the basic parameters of the problem by reviewing the workload of the current system.