This won't stop BigCorp from spending weeks setting up a big-ass data analytics pipeline to process a few hundred MB from their "Data Lake" via Spark.
And this isn’t even wrong, because what they need is a long-term maintainable method that scales up IF needed (rarely), is documented, and survives the loss of institutional knowledge three layoffs down the line.
Scaling _if_ needed has been the death knell of many companies. Every engineer wants to assume they will need to scale to millions of QPS; most of the time that's incorrect, and when it isn't, the requirements have changed and the thing needs to be rebuilt anyway.
I think it completely matters - yes, these orgs are a lot more wasteful, but there is still an opportunity to save money here, especially in this economy, if only for the internal politics wins.
I’ve spent time in some of the largest distributed computing deployments and cost was always a constant factor we had to account for. The easiest promos were always “I saved X hundred million” because it was hard to argue against saving money. And these happened way more than you would guess.
> I’ve spent time in some of the largest distributed computing deployments
Yeah, obviously if you run hundreds or thousands of servers then efficiency matters a lot, but then there isn't really the option of using a single machine with a lot of RAM instead, is there?
I'm talking about the typical BigCorp whose core business is something other than IT - insurance, construction, mining, retail, whatever. Saving a single AKS cluster just doesn't move the needle.
Yeah, I see your point that it just doesn't matter - and, coming back to the original point, it may not be at scale now, but you don't want to go through the budget / approval process when you do need it, etc.
I think my original point was more in the “engineers want to do cool, scalable stuff” realm - and so any solution has to support scaling out to the nth degree.
Organisational factors add a whole new dimension to this.
I mean yeah, definitely - it blows my mind how much tolerance for needless complexity the average engineer has. The principal/agent mismatch applies universally, and beyond that it's also a coordination problem - when every engineer plays by the "resume-driven development" rules, opting out may not be the best move, individually.
The long term maintainability is an important point that most comments here ignore. If you need to run the command once or twice every now and then in an ad hoc way then sure hack together a command line script. But "email Jeff and ask him to run his script" isn't scalable if you need to run the command at a regular interval for years and years and have it work long after Jeff quits.
Sometimes the killer feature of that data analytics pipeline isn't scalability but robustness, reproducibility, and consistency.
> "email Jeff and ask him to run his script" isn't scalable
Sure, it's not.
But the alternative to that doesn't have to be some monster cluster for processing a few gigabytes.
You can write a good script (instead of hacking one together), put it in source control, pull it from there automatically onto the production server, and run it regularly from cron. Now you have your robustness, reproducibility and consistency, as well as much higher performance, for about one ten-thousandth of the cost.
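To make that concrete, here's a minimal sketch of what such a script could look like - one Python file living in the repo, deployed with a git pull and run by cron, as described above. Everything specific below (the paths, the "region"/"revenue" columns, the schedule in the cron line) is made up for illustration; the point is just that the whole "pipeline" is a single reviewable file.

    #!/usr/bin/env python3
    """Nightly report: aggregates a few hundred MB of CSV exports.

    Deployed by pulling this repo onto the production server and running
    it from cron, e.g. (paths and schedule are hypothetical):

        15 2 * * *  cd /opt/reports && git pull -q && ./nightly_report.py
    """
    import csv
    import sys
    from collections import defaultdict
    from pathlib import Path

    # Hypothetical input/output locations; adjust to the real environment.
    INPUT_DIR = Path("/data/exports")
    OUTPUT_FILE = Path("/data/reports/revenue_by_region.csv")

    def main() -> int:
        totals = defaultdict(float)
        # Stream every export file row by row; nothing is held in memory
        # beyond the running totals, so a few hundred MB is trivial.
        for export in sorted(INPUT_DIR.glob("*.csv")):
            with export.open(newline="") as f:
                for row in csv.DictReader(f):
                    # Assumed columns: "region" and "revenue".
                    totals[row["region"]] += float(row["revenue"])

        OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
        with OUTPUT_FILE.open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["region", "revenue"])
            for region, revenue in sorted(totals.items()):
                writer.writerow([region, f"{revenue:.2f}"])
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Source control gives you history and review, cron gives you the schedule, and there's nothing to babysit between runs - no cluster, no job scheduler, no "ask Jeff".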