It's about information. Gradient-free methods integrate little or no information about the problem; they're a blind watchmaker. This works, but it's slow, and it gets slower as your problem grows (the curse of dimensionality).
Gradients integrate some limited, local information about the problem. This lets you find solutions much faster, and neural networks are structured specifically to be easy to optimize with gradients. In practice, local minima don't seem to be a problem for them.
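To make the contrast concrete, here's a minimal sketch on a toy quadratic: one gradient step uses the local slope to move directly downhill, while a gradient-free method would need many function evaluations just to infer that direction. (The objective and step size are illustrative choices, not anything from a specific paper.)

```python
import numpy as np

def loss(x):
    """Toy quadratic objective with its minimum at the origin."""
    return float(np.sum(x ** 2))

def grad(x):
    """Analytic gradient of the quadratic: d/dx sum(x^2) = 2x."""
    return 2 * x

x = np.array([3.0, -2.0])
for _ in range(50):
    x -= 0.1 * grad(x)  # plain gradient descent: step against the slope

print(loss(x))  # after 50 steps the loss is essentially zero
```

Each step here costs one gradient evaluation; a blind search in the same space would need a whole population of evaluations per step just to guess which way is down, and that gap widens with dimension.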
The future is probably even smarter optimizers that integrate more information about the problem and learn to make good assumptions. This is the goal of learned optimizers, like VeLO (https://arxiv.org/abs/2211.09760).
You could start by reading about CMA-ES, which is something like a particle filter on the model parameters. With 100 "particles", you keep 100 resampled copies of the model, evaluate each one, and combine the results into something like a "synthetic" gradient that updates a distribution over the model parameters.
But it doesn't solve the problem of local minima, and it still needs minibatches.
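The sample-evaluate-update loop above can be sketched in a few lines. Note this is a heavily simplified evolution strategy, not real CMA-ES: actual CMA-ES also adapts a full covariance matrix and step size, whereas this sketch keeps only an isotropic Gaussian with a crude decay. The objective, population size, and elite fraction are all illustrative choices.

```python
import numpy as np

def sphere(x):
    """Toy objective standing in for a model's loss; minimum at 0."""
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
dim, pop_size, elite = 5, 100, 20
mean = rng.normal(size=dim)  # mean of the distribution over parameters
sigma = 1.0                  # shared step size (isotropic, unlike CMA-ES)

for _ in range(200):
    # Sample 100 "particles": perturbed copies of the current parameters.
    samples = mean + sigma * rng.normal(size=(pop_size, dim))
    losses = np.array([sphere(s) for s in samples])
    # Keep the best samples; their average acts like a "synthetic"
    # gradient step, pulling the distribution toward lower loss.
    best = samples[np.argsort(losses)[:elite]]
    mean = best.mean(axis=0)
    sigma *= 0.97  # crude step-size decay; CMA-ES adapts this properly

print(sphere(mean))
```

Notice the cost: every update needs `pop_size` full evaluations of the model, which is why these methods scale poorly compared to a single backward pass.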