Not sure why this is popping up now, but people should know there's been quite a bit of followup work. One is a draft paper (not yet polished for publication), on arXiv[1]. I have an issue on my blog[2], and one of these days I intend to get back to it. It's also in use in Vello for the clip stack, in particular computing the intersection of bounding boxes of all clip paths in the spine of the tree from the leaf draw object to the root (clips can be nested to considerable depth). In any case, happy to talk about it.
Do succint trees like https://arxiv.org/pdf/1601.06939.pdf solve the same problems that stack monoids solve? I'll take some more time to fully digest the contents of this family of problems, but on first glance they seem closely related, with the input Balanced Parentheses being precisely the "BP" succint tree representation. I'm not sure of it, but from my understanding it seems probable that building the succint tree can be parallelised.
They are related, thanks for the pointer! The representation of the tree is very similar, as are the techniques to get better than O(log n) - the work efficiency stuff. I'll add a reference to that if/when I do a published version. That said, I am unaware of any publications on parallel implementations of succinct trees.
The stack monoid is related to the bicyclic semigroup (which got its name before "monoid" was more common, though I do see both names), but I am drawing a distinction: the bicyclic semigroup counts parentheses, while the stack monoid actually contains elements of some type (which is free). So the primary difference is how much storage is required for the representation; the bicyclic semigroup is very conveniently represented as a pair of natural numbers, while the stack monoid requires space proportional to the maximum stack depth. That's why it's so challenging to implement on GPU.
[1]: https://arxiv.org/abs/2205.11659
[2]: https://github.com/raphlinus/raphlinus.github.io/issues/66