Background#
Conceptual notes on the processing pipeline that generates the control parameter (cp) trajectories, splices out the single words, and runs the segment-based synthesis.
Goal#
The goal of the pipeline is to generate a corpus for vocal tract synthesis with VocalTractLab (VTL) from the Mozilla Common Voice corpus. The corpus is specifically tailored to the PAULE model, but it can be used for other purposes as well.
Output Format#
As its final output, create_vtl_corpus generates a pandas.DataFrame with the following columns:
file_name: name of the mp3 file in the Common Voice corpus
mfa_word: word type, i.e. the type of the word in terms of its graphemic transcription
lexical_word: word as it was written in the transcript
word_position: position of the word type in the sentence
sentence: transcription of the full sentence
wav_recording: spliced-out audio as a mono audio signal
sr_recording: sampling rate of the recording
sampa_phones: list of phones in SAMPA notation
mfa_phones: list of phones in MFA notation
phone_durations: list of durations of the phones
vector: fastText vector embedding for the word type
cp_norm: cp-trajectories of the segment-based synthesis
client_id: client_ids of multiple workers in multiprocessing
The following columns are added for convenience, even though they can be derived from the entries listed above:
wav_synthesized: waveform as mono audio from the segment-based synthesis
sr_synthesized: sampling rate of the mono audio from the segment-based synthesis
melspec_norm_recorded: acoustic representation of the human recording from the Common Voice corpus (log-mel spectrogram)
melspec_norm_synthesized: acoustic representation of the segment-based synthesis (log-mel spectrogram)
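To make the column layout easier to check, here is a minimal sketch of a helper that reports which of the documented columns are missing from a table. The column names are taken from the lists above; the names `EXPECTED_COLUMNS` and `missing_columns` are hypothetical and not part of create_vtl_corpus.

```python
# Column names as documented above. The helper is a hypothetical
# illustration and is not provided by create_vtl_corpus.
EXPECTED_COLUMNS = [
    "file_name", "mfa_word", "lexical_word", "word_position", "sentence",
    "wav_recording", "sr_recording", "sampa_phones", "mfa_phones",
    "phone_durations", "vector", "cp_norm", "client_id",
    # convenience columns
    "wav_synthesized", "sr_synthesized",
    "melspec_norm_recorded", "melspec_norm_synthesized",
]


def missing_columns(columns):
    """Return the documented column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]
```

With a loaded corpus `df` (a pandas.DataFrame), `missing_columns(df.columns)` should return an empty list if the table follows the format described here.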
Pipeline#
The idea of the processing pipeline is:
align the audio corpus and transcriptions with the MFA
extract the word types and splice out the audio
extract the phones and phone durations from the alignment
convert stereo audio to mono
extract the pitch of the audio signal
generate gestural scores with the segment-based approach in VTL
fit the pitch with the targetoptimizer
merge the f0 gesture of the targetoptimizer to the gestural scores of the segment-based approach
synthesize cp-trajectories from the patched gestural scores
synthesize audio (wav_segment, sr_segment) from the cp-trajectories
retrieve the fastText embedding vector for the word type
calculate the acoustic representation (log-mel spectrogram) for wav_recording and wav_segment
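Two of the steps above, converting stereo audio to mono and splicing a word out of the recording, can be sketched with NumPy as follows. This is only an illustration under stated assumptions, not the code used in create_vtl_corpus: the function names `to_mono` and `splice_word` are hypothetical, the audio is assumed to be an array of shape (n_samples, n_channels), and the word boundaries are assumed to be given in seconds, as read from the alignment.

```python
import numpy as np


def to_mono(signal):
    """Average the channels of a (n_samples, n_channels) array.

    A signal that is already one-dimensional is returned unchanged.
    The channel layout is an assumption of this sketch.
    """
    signal = np.asarray(signal, dtype=float)
    if signal.ndim == 1:
        return signal
    return signal.mean(axis=1)


def splice_word(signal, sr, start_s, end_s):
    """Cut the samples between start_s and end_s (in seconds) out of a mono signal."""
    start = int(round(start_s * sr))
    end = int(round(end_s * sr))
    return signal[start:end]


# One second of dummy stereo audio: left channel all ones, right channel all zeros.
sr = 16000
stereo = np.stack([np.ones(sr), np.zeros(sr)], axis=1)
mono = to_mono(stereo)

# Hypothetical word boundaries from the alignment, in seconds.
word = splice_word(mono, sr, 0.25, 0.75)
```

Averaging the channels is one common way to obtain a mono signal; the pipeline itself may use a different method.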
Notes#
Some random notes to keep in mind.
pauses between words are dropped
we use the MFA (IPA-like) phonemes, not the ARPA ones, and then convert them to the SAMPA-like phonemes needed for VTL
Resources#
The following resources are used:
VocalTractLab (use the version included in create_vtl_corpus)
targetoptimizer (use the version included in create_vtl_corpus)
Phonemes#
The phonemes are converted from the MFA phonemes to the SAMPA phonemes. The following table shows the conversion:
| MFA | SAMPA |
|---|---|
| a | a |
| aj | aI |
| aw | aU |
| aː | a: |
| b | b |
| bʲ | b |
| c | k |
| cʰ | s |
| d | d |
| dʒ | dZ |
| dʲ | d |
| e | e |
| ej | I |
| f | f |
| fʲ | f |
| h | h |
| i | i |
| iː | i: |
| j | j |
| k | k |
| kʰ | k |
| l | l |
| m | m |
| mʲ | m |
| m̩ | m |
| n | n |
| n̩ | n |
| o | o |
| ow | aU |
| p | p |
| pʰ | p |
| pʲ | p |
| s | s |
| t | t |
| tʃ | tS |
| tʰ | t |
| tʲ | t |
| u | u |
| uː | u: |
| v | v |
| vʲ | v |
| w | U |
| z | z |
| æ | a |
| ç | C |
| ð | D |
| ŋ | N |
| ɐ | 6 |
| ɑ | o |
| ɑː | o: |
| ɒ | O |
| ɒː | O |
| ɔ | O |
| ɔj | OY |
| ə | @ |
| əw | aU |
| ɚ | @ |
| ɛ | E |
| ɛː | E: |
| ɜ | 2 |
| ɜː | 2: |
| ɝ | 2 |
| ɟ | dZ |
| ɡ | g |
| ɪ | I |
| ɫ | l |
| ɫ̩ | l |
| ɱ | m |
| ɲ | n |
| ɹ | r |
| ɾ | r |
| ʃ | S |
| ʉ | u |
| ʉː | u: |
| ʊ | U |
| ʎ | l |
| ʒ | Z |
| ʔ | ? |
| θ | T |
| ʁ | R |
| eː | e: |
| x | x |
| ts | ts |
| ɔʏ | OY |
| oː | o: |
| œ | 9 |
| yː | y: |
| ʏ | Y |
| øː | 2: |
| ø | 2 |
| pf | pf |
| l̩ | l |
| t̪ | T |
| ʈʲ | T |
| ʈ | t |
| ʋ | v |
| d̪ | d |
| kʷ | k |
| cʷ | C |
| ɖ | d |
| tʷ | t |
| ɟʷ | dZ |
| ʈʷ | T |
| ɡʷ | g |
| pʷ | p |
Some phonemes may not be converted perfectly, since VTL does not accept every phoneme of the SAMPA notation and the MFA phonemes do not always correspond exactly to SAMPA phonemes. If VTL accepts more phonemes in the future, the conversion can be improved; please contact the author if you have suggestions. The conversion should be good enough for the purpose of corpus generation. A German accent is noticeable in the synthesized English pronunciation. If other languages are added, the conversion table must be extended to cover their new phonemes.
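The conversion amounts to a dictionary lookup per phone. The following minimal sketch shows the idea with a small subset of the table; the names `MFA_TO_SAMPA` and `mfa_to_sampa` are hypothetical and do not necessarily match the implementation in create_vtl_corpus. Raising an error on unknown phones is a design choice of this sketch, so that new phonemes from additional languages are noticed rather than silently dropped.

```python
# Small subset of the conversion table above, for illustration only.
MFA_TO_SAMPA = {
    "kʰ": "k",
    "æ": "a",
    "t": "t",
    "tʃ": "tS",
    "iː": "i:",
    "z": "z",
    "ð": "D",
    "ə": "@",
}


def mfa_to_sampa(mfa_phones):
    """Convert a list of MFA phones to the SAMPA-like phones used for VTL.

    Raises a KeyError that names the offending phone if it is not in the table.
    """
    sampa_phones = []
    for phone in mfa_phones:
        if phone not in MFA_TO_SAMPA:
            raise KeyError(f"no SAMPA conversion defined for MFA phone {phone!r}")
        sampa_phones.append(MFA_TO_SAMPA[phone])
    return sampa_phones


# An illustrative MFA phone sequence for the word "cat".
cat_sampa = mfa_to_sampa(["kʰ", "æ", "t"])
```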