I would recommend creating a simplified JSON schema for the slides (say, a presentation is an array of slides; each slide has a title, body, optional image, and optional diagram; each diagram is one of pie, table, etc.).
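A minimal sketch of what that schema could look like, expressed as Python types (the field names here are illustrative, not any standard):

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

# Hypothetical schema sketch: field names are illustrative only.
@dataclass
class Diagram:
    kind: Literal["pie", "table", "bar"]      # one diagram type per slide
    data: dict = field(default_factory=dict)  # whatever the diagram needs

@dataclass
class Slide:
    title: str
    bullets: list[str] = field(default_factory=list)
    image_path: Optional[str] = None          # optional image
    diagram: Optional[Diagram] = None         # optional diagram

@dataclass
class Deck:
    slides: list[Slide] = field(default_factory=list)
```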
Then use a library to generate the pptx file from that structured content.
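For example, with python-pptx (a rough sketch assuming the Deck/Slide classes above and the default template's "Title and Content" layout; layout indices and placeholder handling depend on your template):

```python
from pptx import Presentation
from pptx.util import Inches

def build_pptx(deck, out_path="deck.pptx"):
    # Sketch only: maps the hypothetical Deck structure onto python-pptx.
    prs = Presentation()
    layout = prs.slide_layouts[1]            # "Title and Content" layout
    for s in deck.slides:
        slide = prs.slides.add_slide(layout)
        slide.shapes.title.text = s.title
        body = slide.placeholders[1].text_frame
        for i, bullet in enumerate(s.bullets):
            p = body.paragraphs[0] if i == 0 else body.add_paragraph()
            p.text = bullet
        if s.image_path:
            slide.shapes.add_picture(s.image_path, Inches(5), Inches(2),
                                     width=Inches(4))
    prs.save(out_path)
```

You'd parse the model's JSON output (json.loads), validate it against the schema, map it into these objects, and call build_pptx.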
It seems to me that a Transformer should excel at Transforming, say, text into pptx or pdf or HTML with CSS etc.
Why don't they train it on that, so I don't have to sit there with manually written libraries? It can easily transform HTML to XML or into text bullet points, so why not other formats?
I don't think the name "Transformer" is meant in the sense of "transforming between file formats".
My intuition is that LLMs tend to be good at things human brains are good at (e.g. reasoning), and bad at things human brains are bad at (e.g. math, writing pptx binary files from scratch, ...).
Eventually, we might get LLMs that can open PowerPoint and quickly design the whole presentation using a virtual mouse and keyboard, but we're not there yet.