Demo page for "Multi-speaker Emotional Text-to-speech Synthesizer"

Brief information

Title: Multi-speaker Emotional Text-to-speech Synthesizer
Authors: Sungjae Cho, Soo-Young Lee
Conference: INTERSPEECH 2021
Session: Show and Tell demonstration
Presentation time: 16:00-18:00 (GMT+2) & 23:00-01:00 (GMT+9), September 1, 2021
Compulsory 3-minute video: [Video] [Slides]
Paper: [PDF] [OfficialPage]
- Sungjae Cho, and Soo-Young Lee (2021) Multi-Speaker Emotional Text-to-Speech Synthesizer. Proceedings of Interspeech 2021, 2337-2338
- Cho, S., & Lee, S.-Y. (2021) Multi-Speaker Emotional Text-to-Speech Synthesizer. Proc. Interspeech 2021, 2337-2338
Paper Abstract: We present a methodology to train our multi-speaker emotional text-to-speech synthesizer that can express speech for 10 speakers' 7 different emotions. All silences from audio samples are removed prior to learning. This results in fast learning by our model. Curriculum learning is applied to train our model efficiently. Our model is first trained with a large single-speaker neutral dataset, and then trained with neutral speech from all speakers. Finally, our model is trained using datasets of emotional speech from all speakers. In each stage, training samples of each speaker-emotion pair have equal probability to appear in mini-batches. Through this procedure, our model can synthesize speech for all targeted speakers and emotions. Our synthesized audio sets are available on our web page.
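The abstract's statement that each speaker-emotion pair has equal probability to appear in mini-batches can be illustrated with a small sketch. This is only one possible reading, not the authors' implementation: it first picks a speaker-emotion pair uniformly at random and then picks an utterance uniformly within that pair. The function name `make_balanced_sampler`, the `(speaker, emotion, utterance_id)` tuple layout, and the `seed` parameter are assumptions made for illustration.

```python
import random
from collections import defaultdict


def make_balanced_sampler(samples, seed=0):
    """Return a function that draws mini-batches balanced over speaker-emotion pairs.

    `samples` is a list of (speaker, emotion, utterance_id) tuples.
    This layout is hypothetical and chosen only for the sketch.
    """
    rng = random.Random(seed)

    # Group utterances by their (speaker, emotion) pair.
    by_pair = defaultdict(list)
    for speaker, emotion, utterance_id in samples:
        by_pair[(speaker, emotion)].append(utterance_id)
    pairs = sorted(by_pair)

    def sample_batch(batch_size):
        batch = []
        for _ in range(batch_size):
            # Every pair is equally likely, regardless of how many
            # utterances it contains.
            speaker, emotion = rng.choice(pairs)
            # Within the chosen pair, every utterance is equally likely.
            utterance_id = rng.choice(by_pair[(speaker, emotion)])
            batch.append((speaker, emotion, utterance_id))
        return batch

    return sample_batch
```

With a dataset in which one pair has far more utterances than another, calling `make_balanced_sampler(data)(batch_size)` should still return batches in which both pairs appear at roughly the same rate, because the pair is drawn before the utterance.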

Brief introduction to demo

The following demonstration presents audio synthesized by our multi-speaker emotional text-to-speech synthesizer. Our synthesizer can synthesize speech for 7 emotions (neutral, anger, disgust, fear, happiness, sadness, and surprise) across 10 speakers (5 female, 5 male).

We humans recognize emotion in speech partly from its textual content. Thus, for each emotion, we present 3 sentences whose content conveys that emotion. For the neutral sentences, we synthesized all 3 sentences for every possible emotion-speaker pair. For the sentences of each other emotion, we synthesized all 3 sentences in the neutral emotion and in the emotion matching the text, across all speakers.

The first demonstration presents synthesized utterances of neutral sentences. Please pay attention to how the audio samples vary across emotions and speakers. We recommend listening to a neutral sample first and then to an emotional one. Next, check what varies across speakers. Note that the blue boxes mark audio samples that were synthesized even without training supervision. Those expressions were transferred from other speakers' expressions.

You can move to another page to listen to audio samples for sentences of another emotion through the following hyperlinks.

[Neutral sentences (home)] [Angry sentences] [Disgusted sentences] [Fearful sentences] [Happy sentences] [Sad sentences] [Surprised sentences]

[One page view]

Neutral sentences

Neutral sentence 1

Speaker	Emotion
Sentence	이 음성합성기는 열 명의 화자와 일곱 개의 감정을 합성할 수 있습니다.
Pronunciation	i eumseonghabseong-gineun yeol myeong-ui hwajawa ilgob gaeui gamjeong-eul habseonghal su issseubnida.
Meaning	This speech synthesizer can synthesize for ten speakers and seven emotions.
Speaker	Neutral	Anger	Disgust	Fear	Happiness	Sadness	Surprise
ketts-30f
ketts-30m
ketts2-20m
ketts2-30f
ketts2-40m
ketts2-50f
ketts2-50m
ketts2-60f
ketts3-f
ketts3-m

Neutral sentence 2

Speaker	Emotion
Sentence	카이스트는 대한민국의 이공계 연구중심대학이다.
Pronunciation	kaiseuteuneun daehanmingug-ui igong-gye yeongujungsimdaehag-ida.
Meaning	KAIST is a research-oriented science and engineering university in South Korea.
Speaker	Neutral	Anger	Disgust	Fear	Happiness	Sadness	Surprise
ketts-30f
ketts-30m
ketts2-20m
ketts2-30f
ketts2-40m
ketts2-50f
ketts2-50m
ketts2-60f
ketts3-f
ketts3-m

Neutral sentence 3

Speaker	Emotion
Sentence	이 학회는 음성처리 분야에서 저명하다.
Pronunciation	i haghoeneun eumseongcheoli bun-ya-eseo jeomyeonghada.
Meaning	This conference is prominent in the field of speech processing.
Speaker	Neutral	Anger	Disgust	Fear	Happiness	Sadness	Surprise
ketts-30f
ketts-30m
ketts2-20m
ketts2-30f
ketts2-40m
ketts2-50f
ketts2-50m
ketts2-60f
ketts3-f
ketts3-m

Acknowledgement

This work was supported by the Ministry of Culture, Sports and Tourism and the Korea Creative Content Agency [R2019020013, R2020040298].