This is an idea that's been around forever, and from a hardware perspective it's super easy. The problem is finding applications for it. You need something difficult enough to require at least dozens of instructions (the FPGA isn't going to run at the CPU clock speed, so it needs to be a decent chunk of work to justify the trip), done often enough to dedicate silicon to (the FPGA can't be dark 99.999% of the time), but not so often that a full custom implementation would be justified instead. Then you need to define a set of custom instructions that map to this logic and build support for those instructions into the compiler - accounting for the fact that you don't just need to dispatch data to the FPGA, you need to reprogram it each time you change instructions, and that takes forever in CPU terms.
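To make the "custom instructions plus compiler support" part concrete, here's a minimal sketch of what invoking an FPGA-backed operation from C might look like, assuming a RISC-V target whose core exposes the fabric through the custom-0 opcode space. The encoding (opcode 0x0b, the funct3/funct7 values) and the operation itself are invented for illustration; a real toolchain would wrap this in an intrinsic or builtin rather than raw inline assembly:

```c
/* Hypothetical sketch: calling an FPGA-backed custom instruction on a
 * RISC-V core. The opcode/funct values below are assumptions, not any
 * real core's encoding; this only shows where the compiler/toolchain
 * work has to happen. */
#include <stdint.h>

static inline uint64_t fpga_op(uint64_t a, uint64_t b)
{
    uint64_t result;
    /* .insn r <opcode>, <funct3>, <funct7>, rd, rs1, rs2
     * Emits a raw R-type instruction in the custom-0 opcode space. */
    __asm__ volatile(".insn r 0x0b, 0x0, 0x00, %0, %1, %2"
                     : "=r"(result)
                     : "r"(a), "r"(b));
    return result;
}
```

And even this is the easy half - before the instruction means anything, the fabric has to have been loaded with the matching bitstream, which is the reprogramming cost mentioned above.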
It's not impossible; it's just very hard, it impacts every single part of the stack, and it's difficult to justify.
Correct, but at the same time that overstates things: very often the choice between a full custom IC and an FPGA is dictated by the expected product volume, not just by functional considerations.
Where either solution is functionally workable, it is typically cheaper to ship FPGAs in released products when the volume is small, while full custom may be cheaper when the volume is in the millions to hundreds of millions.
That includes amortizing the non-recurring engineering (NRE) cost over the total units; NRE is typically much higher for full custom than for FPGA -- although sometimes they are actually in the same ballpark.
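A back-of-the-envelope sketch of that break-even, with all figures invented purely to show the shape of the tradeoff (higher NRE but lower unit cost for full custom, the reverse for FPGA):

```c
/* Toy break-even calculation: every number here is made up for
 * illustration only. The crossover volume is roughly
 * (NRE_asic - NRE_fpga) / (unit_fpga - unit_asic). */
#include <stdio.h>

int main(void)
{
    double nre_asic  = 2000000.0;  /* hypothetical full-custom NRE       */
    double nre_fpga  =  200000.0;  /* hypothetical FPGA-product NRE      */
    double unit_asic =       5.0;  /* hypothetical per-unit ASIC cost    */
    double unit_fpga =      50.0;  /* hypothetical per-unit FPGA cost    */

    /* Below this volume the FPGA product is cheaper overall; above it,
     * the extra NRE of full custom is paid back by the cheaper units. */
    double break_even = (nre_asic - nre_fpga) / (unit_fpga - unit_asic);
    printf("break-even volume: ~%.0f units\n", break_even);
    return 0;
}
```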
Aside from that you are correct: people sometimes imagine that almost any application can be significantly accelerated with FPGAs, but even where fine-grained parallelism exists to be exploited (well known not to be the case in all application areas), the space of viable FPGA solutions is further reduced by the cases where full custom makes better engineering and financial sense.
Maybe they're targeting a different arch layer? Is there any mileage in pushing that sort of tech into on-chip routing? As you get more and more cores, obviously interconnect area becomes more of a problem (and bus bandwidth more constrained). Is there much to be gained from a compiler being able to say "This next bit of code wants as much uncontended bandwidth as you can muster between 5 cores and L1"? That way, actually reconfiguring anything would be a bunch of microcode, rather than something the compiler took direct control over.
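Purely as a thought experiment, such a hint might look like the pragma below - no real toolchain exposes this, and the name and parameters are entirely made up; the point is only that the compiler would express intent, and the hardware would apply whatever interconnect configuration it can via microcode:

```c
/* Hypothetical compiler hint for interconnect reservation; the pragma
 * is invented for illustration and would be ignored (with a warning)
 * by current compilers. */
#include <stddef.h>

void scatter_gather(double *dst, const double *src, const size_t *idx, size_t n)
{
    /* Hypothetical: request an uncontended path between the cores
     * running this region and their shared cache slice. */
    #pragma fabric reserve_bandwidth(cores: 5, target: "L1", policy: "exclusive")
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}
```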
There's some (perhaps a lot of) potential for this in in-memory analytic processing (e.g. in an analytics-focused DBMS).
Also, a specific potential use of FPGAs is pattern matching over large amounts of text/data: if you do it at all, you're likely to do it often, and it can't be a fixed full-custom implementation since the circuit depends on the specific pattern.
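The software analogue of that point: the matcher's state machine is derived from the pattern itself, which is exactly why a fixed circuit can't cover it - on an FPGA, that per-pattern structure is what would get synthesized into the fabric. Here's a small sketch using a KMP failure table for a literal string (an illustration of "the machine depends on the pattern", not a proposed FPGA design):

```c
/* Count occurrences of a literal pattern using KMP. The fail[] table is
 * built from the pattern alone; in an FPGA matcher, that per-pattern
 * structure is what would be baked into the configured logic. */
#include <stdio.h>
#include <string.h>

int kmp_count(const char *text, const char *pat)
{
    int n = (int)strlen(text), m = (int)strlen(pat);
    if (m == 0 || m > 256) return 0;

    /* fail[i]: length of the longest proper prefix of pat[0..i] that is
     * also a suffix of it. */
    int fail[256];
    fail[0] = 0;
    for (int i = 1, k = 0; i < m; i++) {
        while (k > 0 && pat[i] != pat[k]) k = fail[k - 1];
        if (pat[i] == pat[k]) k++;
        fail[i] = k;
    }

    /* Scan the text once, advancing the state machine per character. */
    int count = 0;
    for (int i = 0, k = 0; i < n; i++) {
        while (k > 0 && text[i] != pat[k]) k = fail[k - 1];
        if (text[i] == pat[k]) k++;
        if (k == m) { count++; k = fail[k - 1]; }
    }
    return count;
}

int main(void)
{
    printf("%d\n", kmp_count("abababca", "abab"));  /* prints 2 */
    return 0;
}
```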