Provable Speech Attributes Conversion
via Latent Independence

Bar-Ilan University

Model

Independence-based Voice Conversion (IVC) consists of an encoder that predicts a content-like latent variable Ŝ and a decoder conditioned on speaker, emotion, or loudness embeddings C. The model is trained with a discrepancy loss R and an independence loss I.
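As a rough illustration of how a discrepancy term and an independence term might be combined, the sketch below uses a mean-squared error for R and the biased empirical HSIC (Hilbert-Schmidt Independence Criterion) with Gaussian kernels for I. Both choices, the weight `lam`, and all function names are assumptions made for this example; the page does not specify the actual losses.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gaussian (RBF) Gram matrix for the rows of x, shape (n, d)."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def hsic(s_hat, c, sigma=1.0):
    """Biased empirical HSIC between latents s_hat (n, ds) and conditions c (n, dc).

    HSIC is used here only as a stand-in independence measure (an assumption).
    """
    n = s_hat.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    k = gaussian_gram(s_hat, sigma)
    l = gaussian_gram(c, sigma)
    return float(np.trace(k @ h @ l @ h)) / (n - 1) ** 2

def combined_objective(x, x_rec, s_hat, c, lam=1.0):
    """Hypothetical objective: MSE discrepancy R plus lam times independence term I."""
    r = float(np.mean((x - x_rec) ** 2))         # stand-in for R
    i = hsic(s_hat, c)                           # stand-in for I
    return r + lam * i

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c = rng.normal(size=(200, 2))                        # conditioning embeddings
    leaky = c + 0.1 * rng.normal(size=(200, 2))          # latent that leaks C
    clean = rng.normal(size=(200, 2))                    # latent independent of C
    print("HSIC (leaky latent):", hsic(leaky, c))
    print("HSIC (independent latent):", hsic(clean, c))
```

A latent that leaks the conditioning variable should score a noticeably larger penalty than an independent one, which is the behaviour an independence loss is meant to discourage.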

Highlights

Abstract

While signal conversion and disentangled representation learning have shown promise for manipulating data attributes across domains such as audio, image, and multimodal generation, existing approaches, especially for speech style conversion, are largely empirical and lack rigorous theoretical foundations to guarantee reliable and interpretable control. In this work, we propose a general framework for speech attribute conversion, accompanied by theoretical analysis and guarantees under reasonable assumptions. Our framework builds on a non-probabilistic autoencoder architecture with an independence constraint between the predicted latent variable and the target controllable variable. This design ensures consistent signal transformation conditioned on an observed style variable, preserving the original content while modifying the desired attribute. We further demonstrate the versatility of our method by evaluating it across a range of speech styles beyond speaker identity, including emotion, loudness, and pitch contour. Quantitative evaluations confirm the effectiveness and generality of the proposed approach.
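One way to write down the conversion described above is sketched here. The symbols E (encoder), D (decoder), and C' (target attribute value) are notation assumed for this illustration; only Ŝ and C appear in the model description on this page.

```latex
\begin{align}
\hat{S} &= E(X) && \text{(predicted content-like latent)} \\
\hat{S} &\perp\!\!\!\perp C && \text{(independence constraint)} \\
\hat{X}_{C'} &= D(\hat{S}, C') && \text{(conversion to a target attribute } C'\text{)}
\end{align}
```

Under this reading, the latent carries no information about the controllable attribute, so swapping C for C' at the decoder changes only that attribute.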

Speaker Conversion

The conversion is performed for unseen source and target speakers from the LibriSpeech test-clean subset.

Source | Target (reference) | Converted
Source 5105 | Speaker 0672 (ref) | Converted to 0672
Source 5105 | Speaker 1320 (ref) | Converted to 1320
Source 5105 | Speaker 2830 (ref) | Converted to 2830
Source 5105 | Speaker 4446 (ref) | Converted to 4446
Source 5105 | Speaker 8555 (ref) | Converted to 8555
Source 5105 | Speaker 7127 (ref) | Converted to 7127
Source 5683 | Speaker 0672 (ref) | Converted to 0672
Source 5683 | Speaker 1320 (ref) | Converted to 1320
Source 5683 | Speaker 2830 (ref) | Converted to 2830
Source 5683 | Speaker 4446 (ref) | Converted to 4446
Source 5683 | Speaker 8555 (ref) | Converted to 8555
Source 5683 | Speaker 7127 (ref) | Converted to 7127
Source 7729 | Speaker 0672 (ref) | Converted to 0672
Source 7729 | Speaker 1320 (ref) | Converted to 1320
Source 7729 | Speaker 2830 (ref) | Converted to 2830
Source 7729 | Speaker 4446 (ref) | Converted to 4446
Source 7729 | Speaker 8555 (ref) | Converted to 8555
Source 7729 | Speaker 7127 (ref) | Converted to 7127
Source 3729 | Speaker 0672 (ref) | Converted to 0672
Source 3729 | Speaker 1320 (ref) | Converted to 1320
Source 3729 | Speaker 2830 (ref) | Converted to 2830
Source 3729 | Speaker 4446 (ref) | Converted to 4446
Source 3729 | Speaker 8555 (ref) | Converted to 8555
Source 3729 | Speaker 7127 (ref) | Converted to 7127
Source 7021 | Speaker 0672 (ref) | Converted to 0672
Source 7021 | Speaker 1320 (ref) | Converted to 1320
Source 7021 | Speaker 2830 (ref) | Converted to 2830
Source 7021 | Speaker 4446 (ref) | Converted to 4446
Source 7021 | Speaker 8555 (ref) | Converted to 8555
Source 7021 | Speaker 7127 (ref) | Converted to 7127
Source 4507 | Speaker 0672 (ref) | Converted to 0672
Source 4507 | Speaker 1320 (ref) | Converted to 1320
Source 4507 | Speaker 2830 (ref) | Converted to 2830
Source 4507 | Speaker 4446 (ref) | Converted to 4446
Source 4507 | Speaker 8555 (ref) | Converted to 8555
Source 4507 | Speaker 7127 (ref) | Converted to 7127

Emotion conversion

Trained on VCTK + ESD. The conversion is performed for speaker 0013 in the Emotional Speech Dataset (ESD).

Sample ID | Original | Converted
Sample 000301 | Emotion: Neutral | Emotion: Angry
Sample 000302 | Emotion: Neutral | Emotion: Angry
Sample 000303 | Emotion: Neutral | Emotion: Sad
Sample 001352 | Emotion: Sad | Emotion: Angry
Sample 001355 | Emotion: Sad | Emotion: Surprise

Emotion and Speaker conversion

Trained on VCTK, ESD, and LibriSpeech (train-clean-100, libritts-train-clean-360). The conversion is performed for the unseen speakers 0011, 0012, 0015, and 0017 in the Emotional Speech Dataset.

Source: speaker 0011, emotion Neutral, utterance 000137
Source | Reference | Converted
0011 / Neutral | 0011 / Angry | 0011 / Angry
0011 / Neutral | 0011 / Happy | 0011 / Happy
0011 / Neutral | 0012 / Angry | 0012 / Angry
0011 / Neutral | 0012 / Sad | 0012 / Sad
0011 / Neutral | 0015 / Angry | 0015 / Angry
0011 / Neutral | 0015 / Sad | 0015 / Sad
0011 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0011, emotion Neutral, utterance 000252
Source | Reference | Converted
0011 / Neutral | 0011 / Angry | 0011 / Angry
0011 / Neutral | 0011 / Happy | 0011 / Happy
0011 / Neutral | 0012 / Angry | 0012 / Angry
0011 / Neutral | 0012 / Happy | 0012 / Happy
0011 / Neutral | 0012 / Sad | 0012 / Sad
0011 / Neutral | 0015 / Angry | 0015 / Angry
0011 / Neutral | 0015 / Happy | 0015 / Happy
0011 / Neutral | 0015 / Sad | 0015 / Sad

Source: speaker 0011, emotion Neutral, utterance 000265
Source | Reference | Converted
0011 / Neutral | 0011 / Angry | 0011 / Angry
0011 / Neutral | 0011 / Happy | 0011 / Happy
0011 / Neutral | 0012 / Angry | 0012 / Angry
0011 / Neutral | 0012 / Sad | 0012 / Sad
0011 / Neutral | 0015 / Angry | 0015 / Angry
0011 / Neutral | 0015 / Sad | 0015 / Sad
0011 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0011, emotion Neutral, utterance 000298
Source | Reference | Converted
0011 / Neutral | 0011 / Angry | 0011 / Angry
0011 / Neutral | 0011 / Happy | 0011 / Happy
0011 / Neutral | 0012 / Angry | 0012 / Angry
0011 / Neutral | 0012 / Happy | 0012 / Happy
0011 / Neutral | 0012 / Sad | 0012 / Sad
0011 / Neutral | 0015 / Happy | 0015 / Happy
0011 / Neutral | 0015 / Sad | 0015 / Sad

Source: speaker 0011, emotion Neutral, utterance 000307
Source | Reference | Converted
0011 / Neutral | 0011 / Angry | 0011 / Angry
0011 / Neutral | 0012 / Angry | 0012 / Angry
0011 / Neutral | 0012 / Happy | 0012 / Happy
0011 / Neutral | 0012 / Sad | 0012 / Sad
0011 / Neutral | 0015 / Angry | 0015 / Angry
0011 / Neutral | 0015 / Sad | 0015 / Sad
0011 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0011, emotion Neutral, utterance 000342
Source | Reference | Converted
0011 / Neutral | 0011 / Angry | 0011 / Angry
0011 / Neutral | 0012 / Angry | 0012 / Angry
0011 / Neutral | 0012 / Sad | 0012 / Sad
0011 / Neutral | 0015 / Sad | 0015 / Sad
0011 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0012, emotion Neutral, utterance 000042
Source | Reference | Converted
0012 / Neutral | 0011 / Angry | 0011 / Angry
0012 / Neutral | 0011 / Happy | 0011 / Happy
0012 / Neutral | 0012 / Angry | 0012 / Angry
0012 / Neutral | 0012 / Happy | 0012 / Happy
0012 / Neutral | 0012 / Sad | 0012 / Sad
0012 / Neutral | 0015 / Sad | 0015 / Sad
0012 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0012, emotion Neutral, utterance 000098
Source | Reference | Converted
0012 / Neutral | 0011 / Angry | 0011 / Angry
0012 / Neutral | 0012 / Angry | 0012 / Angry
0012 / Neutral | 0015 / Angry | 0015 / Angry
0012 / Neutral | 0017 / Angry | 0017 / Angry
0012 / Neutral | 0017 / Happy | 0017 / Happy
0012 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0012, emotion Neutral, utterance 000104
Source | Reference | Converted
0012 / Neutral | 0011 / Angry | 0011 / Angry
0012 / Neutral | 0011 / Happy | 0011 / Happy
0012 / Neutral | 0012 / Angry | 0012 / Angry
0012 / Neutral | 0012 / Happy | 0012 / Happy
0012 / Neutral | 0012 / Sad | 0012 / Sad
0012 / Neutral | 0015 / Sad | 0015 / Sad
0012 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0012, emotion Neutral, utterance 000117
Source | Reference | Converted
0012 / Neutral | 0011 / Angry | 0011 / Angry
0012 / Neutral | 0012 / Angry | 0012 / Angry
0012 / Neutral | 0012 / Sad | 0012 / Sad
0012 / Neutral | 0015 / Angry | 0015 / Angry
0012 / Neutral | 0015 / Sad | 0015 / Sad

Source: speaker 0012, emotion Neutral, utterance 000132
Source | Reference | Converted
0012 / Neutral | 0011 / Angry | 0011 / Angry
0012 / Neutral | 0012 / Angry | 0012 / Angry
0012 / Neutral | 0012 / Sad | 0012 / Sad
0012 / Neutral | 0015 / Angry | 0015 / Angry

Source: speaker 0012, emotion Neutral, utterance 000237
Source | Reference | Converted
0012 / Neutral | 0012 / Sad | 0012 / Sad

Source: speaker 0012, emotion Neutral, utterance 000303
Source | Reference | Converted
0012 / Neutral | 0011 / Angry | 0011 / Angry
0012 / Neutral | 0011 / Happy | 0011 / Happy
0012 / Neutral | 0012 / Angry | 0012 / Angry
0012 / Neutral | 0012 / Happy | 0012 / Happy
0012 / Neutral | 0012 / Sad | 0012 / Sad
0012 / Neutral | 0015 / Angry | 0015 / Angry
0012 / Neutral | 0015 / Sad | 0015 / Sad
0012 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0012, emotion Neutral, utterance 000312
Source | Reference | Converted
0012 / Neutral | 0011 / Angry | 0011 / Angry
0012 / Neutral | 0012 / Angry | 0012 / Angry
0012 / Neutral | 0012 / Sad | 0012 / Sad

Source: speaker 0012, emotion Neutral, utterance 000327
Source | Reference | Converted
0012 / Neutral | 0011 / Angry | 0011 / Angry
0012 / Neutral | 0012 / Angry | 0012 / Angry
0012 / Neutral | 0012 / Sad | 0012 / Sad
0012 / Neutral | 0015 / Angry | 0015 / Angry
0012 / Neutral | 0015 / Happy | 0015 / Happy

Source: speaker 0012, emotion Neutral, utterance 000342
Source | Reference | Converted
0012 / Neutral | 0011 / Angry | 0011 / Angry
0012 / Neutral | 0012 / Angry | 0012 / Angry
0012 / Neutral | 0012 / Sad | 0012 / Sad
0012 / Neutral | 0015 / Angry | 0015 / Angry
0012 / Neutral | 0015 / Sad | 0015 / Sad

Source: speaker 0015, emotion Neutral, utterance 000109
Source | Reference | Converted
0015 / Neutral | 0011 / Angry | 0011 / Angry
0015 / Neutral | 0012 / Angry | 0012 / Angry
0015 / Neutral | 0012 / Sad | 0012 / Sad
0015 / Neutral | 0015 / Angry | 0015 / Angry
0015 / Neutral | 0015 / Sad | 0015 / Sad
0015 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0015, emotion Neutral, utterance 000133
Source | Reference | Converted
0015 / Neutral | 0011 / Angry | 0011 / Angry
0015 / Neutral | 0012 / Angry | 0012 / Angry
0015 / Neutral | 0012 / Sad | 0012 / Sad
0015 / Neutral | 0015 / Angry | 0015 / Angry

Source: speaker 0015, emotion Neutral, utterance 000173
Source | Reference | Converted
0015 / Neutral | 0011 / Happy | 0011 / Happy
0015 / Neutral | 0012 / Angry | 0012 / Angry
0015 / Neutral | 0012 / Happy | 0012 / Happy
0015 / Neutral | 0012 / Sad | 0012 / Sad
0015 / Neutral | 0015 / Angry | 0015 / Angry
0015 / Neutral | 0015 / Sad | 0015 / Sad
0015 / Neutral | 0017 / Happy | 0017 / Happy
0015 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0015, emotion Neutral, utterance 000194
Source | Reference | Converted
0015 / Neutral | 0011 / Angry | 0011 / Angry
0015 / Neutral | 0012 / Angry | 0012 / Angry
0015 / Neutral | 0012 / Sad | 0012 / Sad
0015 / Neutral | 0015 / Angry | 0015 / Angry

Source: speaker 0015, emotion Neutral, utterance 000201
Source | Reference | Converted
0015 / Neutral | 0011 / Angry | 0011 / Angry
0015 / Neutral | 0011 / Happy | 0011 / Happy
0015 / Neutral | 0011 / Sad | 0011 / Sad
0015 / Neutral | 0012 / Angry | 0012 / Angry
0015 / Neutral | 0012 / Sad | 0012 / Sad
0015 / Neutral | 0015 / Angry | 0015 / Angry
0015 / Neutral | 0015 / Sad | 0015 / Sad
0015 / Neutral | 0017 / Happy | 0017 / Happy
0015 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0015, emotion Neutral, utterance 000205
Source | Reference | Converted
0015 / Neutral | 0011 / Angry | 0011 / Angry
0015 / Neutral | 0011 / Happy | 0011 / Happy
0015 / Neutral | 0012 / Angry | 0012 / Angry
0015 / Neutral | 0012 / Happy | 0012 / Happy
0015 / Neutral | 0012 / Sad | 0012 / Sad
0015 / Neutral | 0017 / Happy | 0017 / Happy

Source: speaker 0015, emotion Neutral, utterance 000249
Source | Reference | Converted
0015 / Neutral | 0011 / Angry | 0011 / Angry
0015 / Neutral | 0012 / Angry | 0012 / Angry
0015 / Neutral | 0012 / Happy | 0012 / Happy
0015 / Neutral | 0012 / Sad | 0012 / Sad
0015 / Neutral | 0015 / Angry | 0015 / Angry
0015 / Neutral | 0015 / Sad | 0015 / Sad
0015 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0015, emotion Neutral, utterance 000253
Source | Reference | Converted
0015 / Neutral | 0011 / Angry | 0011 / Angry
0015 / Neutral | 0011 / Happy | 0011 / Happy
0015 / Neutral | 0012 / Angry | 0012 / Angry
0015 / Neutral | 0012 / Happy | 0012 / Happy
0015 / Neutral | 0012 / Sad | 0012 / Sad
0015 / Neutral | 0015 / Angry | 0015 / Angry
0015 / Neutral | 0015 / Happy | 0015 / Happy
0015 / Neutral | 0015 / Sad | 0015 / Sad
0015 / Neutral | 0017 / Happy | 0017 / Happy
0015 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0015, emotion Neutral, utterance 000304
Source | Reference | Converted
0015 / Neutral | 0012 / Angry | 0012 / Angry
0015 / Neutral | 0012 / Sad | 0012 / Sad
0015 / Neutral | 0015 / Angry | 0015 / Angry
0015 / Neutral | 0015 / Sad | 0015 / Sad
0015 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0015, emotion Neutral, utterance 000327
Source | Reference | Converted
0015 / Neutral | 0011 / Angry | 0011 / Angry
0015 / Neutral | 0012 / Angry | 0012 / Angry
0015 / Neutral | 0012 / Sad | 0012 / Sad
0015 / Neutral | 0015 / Angry | 0015 / Angry

Source: speaker 0017, emotion Neutral, utterance 000055
Source | Reference | Converted
0017 / Neutral | 0011 / Angry | 0011 / Angry
0017 / Neutral | 0012 / Angry | 0012 / Angry
0017 / Neutral | 0012 / Sad | 0012 / Sad
0017 / Neutral | 0015 / Sad | 0015 / Sad
0017 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0017, emotion Neutral, utterance 000067
Source | Reference | Converted
0017 / Neutral | 0011 / Angry | 0011 / Angry
0017 / Neutral | 0012 / Angry | 0012 / Angry
0017 / Neutral | 0012 / Sad | 0012 / Sad
0017 / Neutral | 0015 / Angry | 0015 / Angry
0017 / Neutral | 0015 / Sad | 0015 / Sad
0017 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0017, emotion Neutral, utterance 000096
Source | Reference | Converted
0017 / Neutral | 0011 / Angry | 0011 / Angry
0017 / Neutral | 0012 / Angry | 0012 / Angry
0017 / Neutral | 0012 / Sad | 0012 / Sad
0017 / Neutral | 0015 / Angry | 0015 / Angry
0017 / Neutral | 0015 / Sad | 0015 / Sad
0017 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0017, emotion Neutral, utterance 000124
Source | Reference | Converted
0017 / Neutral | 0011 / Angry | 0011 / Angry
0017 / Neutral | 0011 / Happy | 0011 / Happy
0017 / Neutral | 0012 / Angry | 0012 / Angry
0017 / Neutral | 0012 / Happy | 0012 / Happy
0017 / Neutral | 0012 / Sad | 0012 / Sad
0017 / Neutral | 0015 / Angry | 0015 / Angry
0017 / Neutral | 0015 / Sad | 0015 / Sad
0017 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0017, emotion Neutral, utterance 000130
Source | Reference | Converted
0017 / Neutral | 0012 / Angry | 0012 / Angry
0017 / Neutral | 0015 / Sad | 0015 / Sad

Source: speaker 0017, emotion Neutral, utterance 000208
Source | Reference | Converted
0017 / Neutral | 0011 / Sad | 0011 / Sad
0017 / Neutral | 0012 / Sad | 0012 / Sad
0017 / Neutral | 0015 / Sad | 0015 / Sad
0017 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0017, emotion Neutral, utterance 000219
Source | Reference | Converted
0017 / Neutral | 0011 / Angry | 0011 / Angry
0017 / Neutral | 0011 / Happy | 0011 / Happy
0017 / Neutral | 0012 / Happy | 0012 / Happy
0017 / Neutral | 0015 / Sad | 0015 / Sad

Source: speaker 0017, emotion Neutral, utterance 000249
Source | Reference | Converted
0017 / Neutral | 0011 / Angry | 0011 / Angry
0017 / Neutral | 0012 / Angry | 0012 / Angry
0017 / Neutral | 0012 / Sad | 0012 / Sad
0017 / Neutral | 0015 / Sad | 0015 / Sad
0017 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0017, emotion Neutral, utterance 000304
Source | Reference | Converted
0017 / Neutral | 0011 / Angry | 0011 / Angry
0017 / Neutral | 0012 / Angry | 0012 / Angry
0017 / Neutral | 0012 / Sad | 0012 / Sad
0017 / Neutral | 0015 / Angry | 0015 / Angry
0017 / Neutral | 0015 / Sad | 0015 / Sad
0017 / Neutral | 0017 / Sad | 0017 / Sad

Source: speaker 0017, emotion Neutral, utterance 000309
Source | Reference | Converted
0017 / Neutral | 0012 / Angry | 0012 / Angry

Loudness conversion

We present conversion examples corresponding to audio pairs where the mean loudness difference between the source and reference signals exceeds 10 dB, ensuring that the selected samples represent substantial loudness variation. In the proposed framework, loudness serves as an additional conditioning variable for the decoder, alongside speaker identity and emotion, enabling fine-grained control over expressive intensity during synthesis.
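The pair-selection criterion above can be sketched as follows. This is only an illustration: it assumes loudness is measured as frame-wise RMS level in dB, and the frame size, hop size, and function names are hypothetical, since the page does not state how loudness was computed.

```python
import numpy as np

def mean_loudness_db(wave, frame=1024, hop=256, eps=1e-10):
    """Mean frame-wise RMS level in dB (assumed loudness proxy, not the authors' definition)."""
    wave = np.asarray(wave, dtype=float)
    n_frames = 1 + max(0, (len(wave) - frame) // hop)
    levels = []
    for i in range(n_frames):
        chunk = wave[i * hop : i * hop + frame]
        rms = np.sqrt(np.mean(chunk ** 2))
        levels.append(20.0 * np.log10(rms + eps))   # eps avoids log of zero on silence
    return float(np.mean(levels))

def exceeds_threshold(source, reference, threshold_db=10.0):
    """True when the mean loudness of the two signals differs by more than threshold_db."""
    diff = abs(mean_loudness_db(source) - mean_loudness_db(reference))
    return diff > threshold_db

if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    loud = np.sin(2.0 * np.pi * 220.0 * t)
    quiet = 0.1 * loud                     # 20 dB below the loud signal
    print(exceeds_threshold(loud, quiet))
    print(exceeds_threshold(loud, 0.5 * loud))
```

Scaling a signal by 0.1 lowers its level by 20 dB, so the first pair passes the 10 dB criterion, while halving the amplitude (about 6 dB) does not.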

Source | Converted
0017 / Angry | 0017→0015 / Angry→Sad
0011 / Angry | 0011→0011 / Angry→Angry
0011 / Angry | 0011→0017 / Angry→Angry
0011 / Angry | 0011→0011 / Angry→Neutral
0011 / Angry | 0011→0012 / Angry→Sad
0011 / Angry | 0011→0017 / Angry→Angry
0011 / Happy | 0011→0011 / Happy→Angry
0012 / Sad | 0012→0011 / Sad→Happy
0015 / Sad | 0015→0017 / Sad→Angry

License

MIT License. Feel free to use any of the material in your own work, as long as you give us appropriate credit by mentioning the title and author list of our paper.