It's the difference between being a plumber and being someone who makes custom jewelry. Both are artisans of a kind: one is doing production work and trying to 'get the job done', while the other is making one-offs with a vastly inefficient time-to-product ratio, where much of the value created is in the eye of the beholder. Both are valid paths.
It's more like someone making custom jewelry using modern tools vs. somebody making custom jewelry using only methods available in ancient Rome. You might not see any obvious difference looking at the result, but once you know how they're made, one is definitely more impressive than the other. I'm also sure that one is more "tricky" and "tedious" than the other, which is what I was addressing.
If you code something for artistic or "competitive" reasons then of course it makes complete sense. Like people making 4KB demos, speedrunners, or people folding thousands of paper cranes. There are no invalid paths if you're an artist.
On the other hand, if you consider it from a practical engineering perspective, there are few use cases where I'd go with ASM nowadays: well-optimized C code will be easier to write, probably nearly as fast, way easier to modify and maintain, and much more portable. Some paths are wildly superior to others if you're an engineer.
Modern CPUs are insanely complicated, with thousands of different instructions, countless layers of cruft and bewildering performance variations across architecture generations - what's optimal in one generation can be a worst-case scenario two generations down the road. For any modern CPU I'd never attack a problem ASM-first and, most probably, would never touch ASM at all.
But we are talking about the 6502 inside the NES, running at less than 2 MHz, with one accumulator, two index registers, a handful of status flags, and an 8-bit stack pointer into a fixed 256-byte stack. It's a simple machine, for simpler problems.
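To put "simple" in perspective, the entire programmer-visible state of a 6502 is seven bytes. A rough sketch in C (my own illustration, names are mine):

    #include <stdint.h>

    /* Illustration only: everything a 6502 programmer can touch. */
    struct cpu6502 {
        uint8_t  a;   /* accumulator                               */
        uint8_t  x;   /* index register X                          */
        uint8_t  y;   /* index register Y                          */
        uint8_t  s;   /* stack pointer, stack fixed at $0100-$01FF */
        uint8_t  p;   /* status flags: N V - B D I Z C             */
        uint16_t pc;  /* program counter                           */
    };

You can hold the whole machine in your head, which is exactly why hand-written ASM was still tractable on it.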
And yet, guys like Paul Lutus were doing real-time 3D wireframes in FORTH, with fast 8-bit scaled trigonometric functions (no floats involved). It was badassery of a magnitude not seen since.
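I don't know exactly how Lutus did it, but the standard no-float trick looks roughly like this (a sketch in C rather than FORTH, names are mine): a 256-entry sine table scaled to ±127, so angles fit in one byte and a rotation is just table lookups, small multiplies and shifts.

    #include <stdint.h>
    #include <math.h>

    /* Sketch of the usual 8-bit scaled-trig trick, not Lutus's actual code.
       Angles are 0..255 (one byte = full circle), sine is scaled to -127..127. */
    static int8_t sin_tab[256];

    static void init_sin_tab(void) {
        for (int i = 0; i < 256; i++)
            sin_tab[i] = (int8_t)lround(127.0 * sin(i * 6.283185307179586 / 256.0));
    }

    static int8_t sin8(uint8_t a) { return sin_tab[a]; }
    static int8_t cos8(uint8_t a) { return sin_tab[(uint8_t)(a + 64)]; } /* cos(a) = sin(a + 90 deg) */

    /* Rotate a 2D point about the origin; >> 7 undoes the *127 scale
       (assumes arithmetic right shift, as virtually all compilers provide). */
    static void rotate2d(int8_t x, int8_t y, uint8_t angle, int16_t *rx, int16_t *ry) {
        *rx = (x * cos8(angle) - y * sin8(angle)) >> 7;
        *ry = (x * sin8(angle) + y * cos8(angle)) >> 7;
    }

On a 6502 the table lookups and shifts each map onto a handful of instructions, which is how you get wireframes spinning in real time on a ~1 MHz machine.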
My own contribution was a windowing library for the Apple II that fit in 1024 bytes and used self-modifying code to display overlapping windows.