If you want some nightmares, though... It is not really all that bad. The hardest part of pulling it all out and reassembling it into what you want is dealing with the shared strings.
It is really quite distinct from the memory dump of the pre-XML .doc and .xls files.
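For anyone who has not run into the shared strings indirection, here is a minimal sketch in Python using only the standard library. A string cell in an .xlsx sheet stores an index into `xl/sharedStrings.xml` rather than the text itself. The function names and the default sheet path `xl/worksheets/sheet1.xml` are assumptions for illustration; a real workbook's sheet paths should be looked up through `xl/workbook.xml` and its relationships file.

```python
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace used by .xlsx parts.
SS_NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def load_shared_strings(zf):
    """Return the shared string table as a list, indexed as the cells expect."""
    try:
        root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
    except KeyError:
        return []  # workbooks without string cells may omit the part
    table = []
    for si in root.findall("m:si", SS_NS):
        # An <si> holds either a single <t> or rich-text runs <r><t>..</t></r>.
        # Phonetic runs (<rPh>) are deliberately skipped.
        pieces = si.findall("m:t", SS_NS) + si.findall("m:r/m:t", SS_NS)
        table.append("".join(t.text or "" for t in pieces))
    return table


def read_cells(zf, sheet_path="xl/worksheets/sheet1.xml"):
    """Map cell references (e.g. 'A1') to text, resolving shared strings.

    sheet_path is an assumed default, not something guaranteed by the format.
    """
    strings = load_shared_strings(zf)
    root = ET.fromstring(zf.read(sheet_path))
    cells = {}
    for c in root.iterfind(".//m:c", SS_NS):
        v = c.find("m:v", SS_NS)
        if v is None or v.text is None:
            continue
        if c.get("t") == "s":
            # t="s" means the value is an index into the shared string table.
            cells[c.get("r")] = strings[int(v.text)]
        else:
            cells[c.get("r")] = v.text
    return cells
```

Usage would be something like `with zipfile.ZipFile("book.xlsx") as zf: print(read_cells(zf))`. Numbers, dates, and inline strings need more handling than this sketch gives them.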
For real. On Palladium/NGSCB there was a strong effort to maintain correspondence between documentation and code, to the point of trying to generate header files directly from the specs, which were Word docs. The biggest practical challenge was that the extractor had to instantiate Word just to read the text content of paragraphs with the specified style. This is not something that you want in your build pipeline if you can avoid it.
Hahaha, me too, but the old Word format really is as bad as everyone says, and we were trying to build a system with formal correspondence from spec to code (and sometimes from spec to proof to code), so having something "cobbled together" really didn't fit the model.
The real problem was using Word as our documentation format, but at Microsoft in the early oughts there really weren't many alternatives.
To be successful, the cobbled-together part would only have to pull the text out reliably. The format is well documented, and OpenOffice/LibreOffice can process it as well. The code to do the extraction might be ugly, but so long as it reliably produced the text in a CI/CD environment, that would be OK.
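As a rough illustration of how small that extractor can be for the XML-based .docx format (not the old binary .doc), here is a sketch in Python with only the standard library. It pulls the text of every paragraph that carries a given paragraph style, without starting Word. The function name and the style ID `CodeDecl` in the usage note are made up for the example, and it assumes the main document part lives at the conventional `word/document.xml` path.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by .docx parts.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_NS = {"w": W}


def paragraphs_with_style(docx, style_id):
    """Return the text of each paragraph whose w:pStyle matches style_id.

    docx may be a path or a binary file-like object. style_id is the
    style's ID as stored in the file (e.g. "Heading1"), which can differ
    from the display name shown in the Word UI ("Heading 1").
    """
    with zipfile.ZipFile(docx) as zf:
        # Assumption: the main part is at the conventional location.
        root = ET.fromstring(zf.read("word/document.xml"))

    results = []
    for p in root.iter(f"{{{W}}}p"):
        style = p.find("w:pPr/w:pStyle", W_NS)
        if style is None or style.get(f"{{{W}}}val") != style_id:
            continue
        # Concatenate the text runs; tabs and line breaks are ignored here.
        results.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return results
```

In a build step this might be called as `paragraphs_with_style("spec.docx", "CodeDecl")` and the result written out as a header file. Paragraphs that only inherit the style indirectly, and text in tables nested inside other constructs, would need more care than this sketch takes.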