Somewhat tangential question: have you looked for or found any audio models or tools that can automatically separate individual voices onto separate audio tracks? Perhaps this is already possible with existing tools that I am uninitiated in.
I haven't tested this with multiple voices, and it sounds like you want something more specific, but it's produced 10/10 results with the couple dozen audio files I've thrown at it. Might be of use: https://vocalremover.org/
iZotope RX Pro, software for cleaning and refining audio in music and audio post-production, includes 'Multiple Speaker Detection', which analyzes the different voices in a recording and lets you process them independently.
I can't speak to its effectiveness because I don't have any need for it. RX 10 Advanced is also commercial software and pretty expensive for a casual user, but the feature seems to be on the horizon for other apps.