RVC is voice conversion (audio to audio), and it's typically finetuned.
This is zero shot TTS. Samples create vector encodings that serve as input to inference. There's no retraining the model unless you want it to generalize or perform better.
It isn't though, people need to read the paper and the comments from the author they aren't actually doing the voice generation they pass the text off to VITS, and then they're sauce is that they are doing tone mapping on that VITS output, so if anything they're a competitor to RVC, it's just that the version they published includes VITS also