Abstract: Current voice cloning models are only trained by approximating the generated speech to match the reference speech, without directly supervising the speaker timbre information during training ...