Background#

Some conceptual notes on the processing pipeline that generates the control parameter (cp) trajectories, splices out the single words, and runs the segment-based synthesis.

Goal#

The goal of the pipeline is to generate a corpus for vocal tract synthesis with VocalTractLab (VTL) from the Mozilla Common Voice corpus. The corpus is tailored to the PAULE model, but it can be used for other purposes as well.

Output Format#

As its final output, create_vtl_corpus generates a pandas.DataFrame with the following columns:

  • file_name : name of the mp3 file in the common voice corpus

  • mfa_word : word type, i.e. the type of the word in terms of its graphemic transcription

  • lexical_word : word as it was written in the transcript

  • word_position : position of the word type in the sentence

  • sentence : transcription of the full sentence

  • wav_recording : spliced out audio as mono audio signal

  • sr_recording : sampling rate of the recording

  • sampa_phones : list of phones in sampa notation

  • mfa_phones : list of phones in MFA notation

  • phone_durations : list of durations of the phones

  • vector : fastText vector embedding for the word_type

  • cp_norm : cp-trajectories of the segment-based synthesis

  • client_id : client_ids of multiple workers in multiprocessing

For convenience, the following columns are added as well, even though they can be generated from the entries above:

  • wav_synthesized : wave form as mono audio from the segment-based synthesis

  • sr_synthesized : sampling rate for the mono audio from the segment-based synthesis

  • melspec_norm_recorded : acoustic representation of human recording of the common voice corpus (log-mel spectrogram)

  • melspec_norm_synthesized : acoustic representation of the segment-based approach (log-mel spectrogram)
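
The column layout above can be checked programmatically before the corpus is used downstream. The following is a minimal sketch, not part of create_vtl_corpus: the helper names `missing_columns` and `phone_lists_aligned` are hypothetical, and the assumption that the three per-phone lists of a row have the same length is ours and is not stated in the text.

```python
# Column names as listed in this section; dtypes are not specified there.
CORE_COLUMNS = [
    "file_name", "mfa_word", "lexical_word", "word_position", "sentence",
    "wav_recording", "sr_recording", "sampa_phones", "mfa_phones",
    "phone_durations", "vector", "cp_norm", "client_id",
]
CONVENIENCE_COLUMNS = [
    "wav_synthesized", "sr_synthesized",
    "melspec_norm_recorded", "melspec_norm_synthesized",
]


def missing_columns(columns):
    """Return the documented column names absent from `columns`, in documented order."""
    present = set(columns)
    return [name for name in CORE_COLUMNS + CONVENIENCE_COLUMNS if name not in present]


def phone_lists_aligned(row):
    """Check that the per-phone lists of one row have equal length.

    Assumption (not stated in the text): sampa_phones, mfa_phones and
    phone_durations hold one entry per phone. `row` can be a dict or a
    pandas.Series, since only item access by column name is used.
    """
    n_phones = len(row["sampa_phones"])
    return n_phones == len(row["mfa_phones"]) == len(row["phone_durations"])
```

With a loaded corpus `df`, the checks could be run as `missing_columns(df.columns)` and `df.apply(phone_lists_aligned, axis=1).all()`.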

Pipeline#

The idea of the processing pipeline is:

  1. align the audio corpus and transcriptions with the MFA

  2. extract the word types and splice out the audio

  3. extract the phones and phone durations from the alignment

  4. convert stereo audio to mono

  5. extract the pitch of the audio signal

  6. generate gestural scores with the segment-based approach in VTL

  7. fit the pitch with the targetoptimizer

  8. merge the f0 gesture of the targetoptimizer to the gestural scores of the segment-based approach

  9. synthesize cp-trajectories from the patched gestural scores

  10. synthesize audio (wav_segment, sr_segment) from the cp-trajectories

  11. retrieve the fastText embedding vector for the word type

  12. calculate the acoustic representation (log-mel spectrogram) for wav_recording and wav_segment
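
Steps 2 to 4 are plain bookkeeping on the alignment and can be sketched in a few lines. The sketch below is illustrative only and does not reproduce the create_vtl_corpus code: the `Interval` container and all function names are hypothetical, and it assumes that the alignment has already been parsed into word and phone intervals with start and end times in seconds.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """One aligned interval with times in seconds (hypothetical container)."""
    start: float
    end: float
    label: str


def splice(signal, sr, start, end):
    """Return the samples between `start` and `end` (seconds) of a signal."""
    return signal[int(round(start * sr)):int(round(end * sr))]


def to_mono(frames):
    """Average the channels of each frame: [(left, right), ...] -> [mono, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def phones_in_word(word, phones):
    """Select the phone intervals that lie inside the word interval."""
    return [p for p in phones if p.start >= word.start and p.end <= word.end]


def phone_durations(phones):
    """Duration of each phone in seconds."""
    return [p.end - p.start for p in phones]


# Toy data: 2 s of stereo audio at 8 Hz and one aligned word with its phones.
sr = 8
frames = [(float(i), float(i) + 2.0) for i in range(2 * sr)]
word = Interval(0.5, 1.25, "this")
phones = [
    Interval(0.5, 0.75, "ð"),
    Interval(0.75, 1.0, "ɪ"),
    Interval(1.0, 1.25, "s"),
    Interval(1.5, 1.75, "ɪ"),  # belongs to the next word
]

word_audio = to_mono(splice(frames, sr, word.start, word.end))  # steps 2 and 4
word_phones = phones_in_word(word, phones)                      # step 3
durations = phone_durations(word_phones)                        # step 3
```

Real recordings would normally be held in NumPy arrays rather than lists; the slicing in `splice` works the same way there, while `to_mono` would become a mean over the channel axis.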

Notes#

Some random notes to keep in mind.

  • pauses between words are dropped

  • we use the MFA (IPA-like) phonemes rather than the ARPA ones and convert them to the SAMPA-like phonemes that VTL needs

Resources#

The following resources are used:

Phonemes#

The phonemes are converted from the MFA phonemes to the SAMPA phonemes. The following table shows the conversion:

Phoneme Conversion Table#

a

a

aj

aI

aw

aU

a:

b

b

b

c

k

s

d

d

dZ

d

e

e

ej

I

f

f

f

h

h

i

i

i:

j

j

k

k

k

l

l

m

m

m

m

n

n

n

o

o

ow

aU

p

p

p

p

s

s

t

t

tS

t

t

u

u

u:

v

v

v

w

U

z

z

æ    a
ç    C
ð    D
ŋ    N
ɐ    6
ɑ    o
ɑː   o:
ɒ    O
ɒː   O
ɔ    O
ɔj   OY
ə    @
əw   aU
ɚ    @
ɛ    E
ɛː   E:
ɜ    2
ɜː   2:
ɝ    2
ɟ    dZ
ɡ    g
ɪ    I
ɫ    l
ɫ̩    l
ɱ    m
ɲ    n
ɹ    r
ɾ    r
ʃ    S
ʉ    u
ʉː   u:
ʊ    U
ʎ    l
ʒ    Z
ʔ    ?
θ    T
ʁ    R

e:

x

x

ts

ts

ɔʏ

OY

o:

œ

9

y:

ʏ

Y

øː

2:

ø

2

pf

pf

l

T

ʈʲ

T

ʈ

t

ʋ

v

d

k

C

ɖ

d

t

ɟʷ

dZ

ʈʷ

T

ɡʷ

g

p

Some phonemes may not be converted perfectly, because VTL does not accept all phonemes of the SAMPA notation and because the MFA phonemes do not always map one-to-one onto SAMPA phonemes. If VTL accepts more phonemes in the future, the conversion can be improved. Please contact the author if you have suggestions. The conversion should be good enough for the purpose of the corpus generation. A German accent is noticeable in the synthesized English pronunciation. If other languages are added, the conversion table must be extended to cover their phonemes.
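
The conversion itself is a table lookup. The sketch below shows one way to apply it, using a handful of pairs taken from the table in this section. The names `MFA_TO_SAMPA` and `mfa_to_sampa` are hypothetical and are not the names used in create_vtl_corpus, and raising an error for unmapped phones is a design choice made here for illustration, not documented behavior.

```python
# Excerpt of the conversion table in this section (MFA phone -> SAMPA phone).
MFA_TO_SAMPA = {
    "æ": "a",
    "ç": "C",
    "ð": "D",
    "ŋ": "N",
    "ə": "@",
    "ɛ": "E",
    "ɪ": "I",
    "ʃ": "S",
    "ʒ": "Z",
    "θ": "T",
    "ɔj": "OY",
}


def mfa_to_sampa(mfa_phones, table=MFA_TO_SAMPA):
    """Convert a list of MFA phones to SAMPA phones.

    Raises KeyError listing every phone without a mapping, so that missing
    table entries show up early, for example when a new language is added.
    """
    unknown = sorted({phone for phone in mfa_phones if phone not in table})
    if unknown:
        raise KeyError(f"no SAMPA mapping for MFA phones: {unknown}")
    return [table[phone] for phone in mfa_phones]
```

For example, `mfa_to_sampa(["ð", "ɪ", "ʃ"])` returns `["D", "I", "S"]`. Failing loudly on unknown phones makes gaps in the table visible as soon as a new language introduces phones that have no entry yet.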