I am well aware of the technical details and that I used a liberal definition of "microprocessor". My wording was vague on purpose (I didn't want to delve into the details). I didn't mean to imply that each "microprocessor" had its own instruction-decoding block (they don't).
An AMD Radeon R9 290X has 2816 stream processors (44 compute units of 64 stream processors each), in AMD's terminology. There is only one instruction decoder per compute unit, so a stream processor cannot branch off completely independently, but it can still follow a unique code path via branch predication. This is roughly comparable to an Nvidia GPU having 44 "streaming multiprocessors".
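To make the predication point concrete, here's a minimal CUDA-flavored sketch (the kernel is hypothetical, purely for illustration): lanes that share one instruction stream still produce per-lane results for a data-dependent branch, because the hardware masks or serializes the two paths rather than letting each lane branch independently.

    // Hypothetical kernel, just to illustrate predication: every lane in a
    // warp/wavefront shares one instruction decoder, yet each lane gets its
    // own result from the data-dependent branch below. The hardware steps
    // through both sides and masks out the inactive lanes.
    __global__ void predicated_abs(const int *in, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int v = in[i];
        if (v < 0)          // divergent branch: some lanes take each side
            out[i] = -v;    // written under a per-lane predicate/mask
        else
            out[i] = v;
    }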
But whether you call this 44 or 2816 processors is irrelevant to my main point: a processor that has to decode/execute 44 or 2816 instructions in a single cycle while supporting complex features like caching, branching, etc., is going to be less efficient than an FPGA with hard-wired logic (edit: "hard-wired" from the viewpoint of "once the logic has been configured").
gchadwick also said integer workloads were "not power efficient" on GPUs, but that's also false. Most single-precision floating-point and integer instructions on GPUs are pipelined to issue at a throughput of one per clock per lane, so they are equally optimized. And of course integer logic needs fewer transistors than floating-point logic, so an integer operation is going to consume less power than the corresponding floating-point operation.
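As a rough illustration of that point (kernel names are mine, and exact issue rates vary by architecture), the integer and single-precision versions of the same elementwise add each map to a single full-rate ALU instruction on typical GPUs:

    // Hypothetical kernels for illustration: on most GPUs both of these
    // issue at full rate (one op per clock per lane), so the integer
    // version is at no throughput disadvantage. Rates for other ops
    // (e.g. 32-bit integer multiply) do vary by architecture.
    __global__ void add_int(const int *a, const int *b, int *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];   // integer ALU add
    }

    __global__ void add_float(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];   // single-precision FP add
    }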
FPGAs don't actually have "hard-wired logic" though - they have a configurable routing fabric that takes up a substantial proportion of the die area and has much worse propagation delays than actual hard-wired logic, leading to lower clocks than chips like GPUs. Being able to connect logic together into arbitrary designs at runtime is pretty expensive.
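To see why the routing fabric hurts clock speed, a back-of-the-envelope example with made-up numbers: the maximum clock is set by the critical path, f_max = 1 / (t_logic + t_routing). If a pipeline stage has 1 ns of logic delay, a hard-wired chip can run it at 1 GHz; an FPGA that adds another 2 ns of programmable-interconnect delay on the same path tops out around 1 / 3 ns ≈ 333 MHz.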