How to use?#

create_vtl_corpus automatically creates control parameter trajectories for the VocalTractLab (VTL) articulatory synthesizer from a large corpus of speech audio files and their transcriptions. It includes the code to align the transcriptions to the audio with the Montreal Forced Aligner (MFA). create_vtl_corpus is mainly used and tested with data from the Mozilla Common Voice project.

To align and synthesize the first 100 words that appear at least 4 times in a German speech corpus at the path CORPUS, and to save the results as a pandas DataFrame under the name SAVE_DF_NAME, run the following command:

python -m create_vtl_corpus.create_corpus --corpus CORPUS --language de --needs_aligner --use_mp --min_word_count 4 --word_amount 100 --save_df_name SAVE_DF_NAME --num_cores 4

Use --help or -h to get a full list of the command line options.
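
If the command is started from a script, the argument list can be assembled in Python first. The following is a minimal sketch that only uses the flags shown above; the helper name build_create_corpus_command is made up for this example and is not part of the library.

```python
import shlex
import sys


def build_create_corpus_command(corpus, save_df_name, *, language="de",
                                min_word_count=4, word_amount=100, num_cores=4):
    """Return the argument list for the command line call shown above.

    Hypothetical helper for illustration; only documented flags are used.
    """
    return [
        sys.executable, "-m", "create_vtl_corpus.create_corpus",
        "--corpus", str(corpus),
        "--language", language,
        "--needs_aligner",
        "--use_mp",
        "--min_word_count", str(min_word_count),
        "--word_amount", str(word_amount),
        "--save_df_name", save_df_name,
        "--num_cores", str(num_cores),
    ]


command = build_create_corpus_command("CORPUS", "SAVE_DF_NAME")
# Print the arguments for inspection. To actually start the run, pass
# `command` to subprocess.run(command, check=True); this requires
# create_vtl_corpus and MFA to be installed.
print(shlex.join(command[1:]))
```

Building the list instead of a single string avoids quoting problems when the corpus path contains spaces.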

Furthermore, you can use it within Python. Here the create_vtl_corpus.create_corpus.CreateCorpus class is a good starting point.
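
A rough sketch of how the documented methods might be chained from Python is given here. The method names and signatures are taken from the class reference further down; the order of the calls and the helper run_pipeline are assumptions made for this example, not a description of what the command line tool does internally. The RecordingStub class is only a stand-in so that the sketch runs without the library.

```python
def run_pipeline(corpus, *, word_amount=100, min_word_count=4,
                 needs_aligner=True, mfa_workers=6, batch_size=5000,
                 use_mp=True, num_cores=4):
    """Call the documented CreateCorpus methods in an ASSUMED order."""
    # check_structure returns the clip names and the matching sentences.
    clip_list, sentence_list = corpus.check_structure(word_amount, min_word_count)
    if needs_aligner:
        corpus.run_aligner(mfa_workers, batch_size)
    if use_mp and num_cores > 1:
        return corpus.create_data_frame_mp(clip_list, sentence_list, num_cores)
    return corpus.create_data_frame(clip_list, sentence_list)


class RecordingStub:
    """Stand-in that only records the calls; NOT the real CreateCorpus class."""

    def __init__(self):
        self.calls = []

    def check_structure(self, word_amount, min_word_count):
        self.calls.append("check_structure")
        return ["clip_0.mp3"], ["ein Satz"]

    def run_aligner(self, mfa_workers, batch_size):
        self.calls.append("run_aligner")

    def create_data_frame(self, clip_list, sentence_list):
        self.calls.append("create_data_frame")
        return list(zip(clip_list, sentence_list))

    def create_data_frame_mp(self, clip_list, sentence_list, num_cores):
        self.calls.append("create_data_frame_mp")
        return list(zip(clip_list, sentence_list))


stub = RecordingStub()
result = run_pipeline(stub)
print(stub.calls)

# With the library installed, the stub would be replaced by (not executed here):
#   from create_vtl_corpus.create_corpus import CreateCorpus
#   corpus = CreateCorpus("CORPUS", language="de")
#   df = run_pipeline(corpus)
```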

Use cases#

create_vtl_corpus can be used to:

  1. Create the VTL corpus from scratch, but only from part of the files, e.g. only 10_000 samples.

  2. Add new data to an already existing corpus, e.g. add 10_000 new samples.

  3. Filter by words, e.g. create 1000 more samples of the word type “post”.

Flags#

The following flags can be used to modify the behaviour of the library.

Converts a corpus to the vocaltract lab format

usage: fancytool [-h] --corpus CORPUS [--language LANGUAGE]
                 [--mfa_workers MFA_WORKERS] [--needs_aligner] [--use_mp]
                 [--append_to_df APPEND_TO_DF]
                 [--min_word_count MIN_WORD_COUNT] [--word_amount WORD_AMOUNT]
                 [--aligner_batch_size ALIGNER_BATCH_SIZE]
                 [--num_cores NUM_CORES] [--save_df_name SAVE_DF_NAME]
                 [--df_save_path DF_SAVE_PATH] [--debug]
                 [--epoch_size EPOCH_SIZE] [--start_epoch START_EPOCH]
                 [--end_epoch END_EPOCH] [--error_factor ERROR_FACTOR]
                 [--add_melspec]

Named Arguments#

--corpus

The path to the corpus which should be converted to the vocaltract lab format

--language

The language of the corpus as an abbreviation

Default: 'de'

--mfa_workers

The number of mfa workers to use

Default: 6

--needs_aligner

If the aligner should be run

Default: False

--use_mp

if multiprocessing should be used in the creation of the dataframe

Default: False

--append_to_df

If an already created dataframe should be searched for and then used instead of creating a new one

--min_word_count

The minimum number of times a word must occur to count toward the word amount argument. This is based on an approximate lexical word count

Default: 0

--word_amount

0 if the whole corpus shall be processed, a positive integer if the number of words is limited. Since processing is sentence based, additional words (with a lower word count) will also be synthesized

Default: 0

--aligner_batch_size

How many text files the aligner should process in one batch

Default: 5000

--num_cores

The number of jobs multiprocessing should use; uses the maximum by default. If the number is 1 or lower, no multiprocessing is used

Default: 1

--save_df_name

The name to save the dataframe under, language will be added automatically

Default: 'corpus_as_df_mp'

--df_save_path

The path to save the dataframe to

Default: '/mnt/Restricted/Corpora/CommonVoiceVTL/'

--debug

If debug mode should be used

Default: False

--epoch_size

The number of clips processed in one epoch, after which the dataframe is saved

Default: 5000

--start_epoch

The epoch to start with (inclusive)

Default: 0

--end_epoch

The epoch to end with (inclusive)

Default: 1000

--error_factor

How likely you estimate a word to occur less than your min word count in the corpus

--add_melspec

If mel spectrograms should be added to the dataframe if you use multiprocessing. If you don’t use multiprocessing, mel spectrograms are always added

Default: False
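
The interplay of --epoch_size, --start_epoch and --end_epoch can be pictured as slicing the list of clips into consecutive chunks. The sketch below is one plausible reading of the flag descriptions (both epoch bounds inclusive); the function epoch_slices is hypothetical and not taken from the library.

```python
def epoch_slices(n_clips, epoch_size=5000, start_epoch=0, end_epoch=1000):
    """Yield (epoch, start_index, stop_index) for every epoch that has clips.

    Assumption: epoch k covers clips [k * epoch_size, (k + 1) * epoch_size).
    """
    for epoch in range(start_epoch, end_epoch + 1):  # both bounds inclusive
        start = epoch * epoch_size
        if start >= n_clips:
            break  # no clips left for this or any later epoch
        yield epoch, start, min(start + epoch_size, n_clips)


# 12_000 clips with the default epoch size give three epochs.
print(list(epoch_slices(12_000)))
# Restarting an interrupted run at epoch 1 and stopping after it:
print(list(epoch_slices(12_000, start_epoch=1, end_epoch=1)))
```

Under this reading, a run that stopped after some epoch can be resumed by setting --start_epoch to the next epoch number.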

Multiprocessing#

The library supports multiprocessing, which speeds up processing and is practically necessary for large corpora. It is not enabled by default; to enable it use the --use_mp flag. With multiprocessing, mel spectrograms are not created in the parallel step. If they are requested with --add_melspec, they are added to the dataframe afterwards without multiprocessing. This can take some time, so only do it if you need the kind of mel spectrogram this library produces. Mel spectrograms can also be generated later from the information available in the dataframe.
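
Judging from the column listings of create_data_frame and create_data_frame_mp in the class reference, the multiprocessing variant lacks the two mel spectrogram columns. The following small sketch checks which columns are missing from a result; the column names are copied from those listings, while the helper missing_columns is made up for this example.

```python
# Columns documented for the dataframe created without multiprocessing.
COLUMNS_SEQUENTIAL = [
    "file_name", "mfa_word", "lexical_word", "word_position", "sentence",
    "wav_recording", "sr_recording", "sr_synthesized", "sampa_phones",
    "mfa_phones", "phone_durations_lists", "cp_norm",
    "melspec_norm_recorded", "melspec_norm_synthesized",
    "vector", "client_id",
]

# Columns documented for the multiprocessing variant.
COLUMNS_MP = [
    "file_name", "mfa_word", "lexical_word", "word_position", "sentence",
    "wav_recording", "sr_recording", "sr_synthesized", "sampa_phones",
    "mfa_phones", "phone_durations_lists", "cp_norm",
    "vector", "client_id",
]


def missing_columns(present, expected=COLUMNS_SEQUENTIAL):
    """Return the expected column names that are absent, in documented order."""
    present = set(present)
    return [name for name in expected if name not in present]


print(missing_columns(COLUMNS_MP))
```

With a pandas DataFrame df, the same check would be missing_columns(df.columns).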

CreateCorpus Class#

class create_vtl_corpus.create_corpus.CreateCorpus(path_to_corpus: str, *, language: str)#

This class generates the VocalTractLab trajectories for a corpus. It is assumed that the corpus has the following layout, as is common for Mozilla's Common Voice corpus when using MFA:

corpus/
├── validated.tsv         # a file where the transcripts are stored
├── clips/
│   └── *.mp3             # audio files (mp3)
└── files_not_relevant_to_this_project
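
Before starting a long run it can be useful to verify that a folder matches this layout. The check below is a minimal sketch using only the file names shown in the tree; check_corpus_layout is a hypothetical helper and not the library's check_structure method.

```python
import tempfile
from pathlib import Path


def check_corpus_layout(path_to_corpus):
    """Return a list of problems found in the corpus folder (empty if none)."""
    corpus = Path(path_to_corpus)
    problems = []
    if not (corpus / "validated.tsv").is_file():
        problems.append("validated.tsv is missing")
    clips = corpus / "clips"
    if not clips.is_dir():
        problems.append("clips/ folder is missing")
    elif not any(clips.glob("*.mp3")):
        problems.append("clips/ contains no mp3 files")
    return problems


# Build a tiny throwaway corpus to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    print(check_corpus_layout(root))   # reports both missing entries
    (root / "validated.tsv").touch()
    (root / "clips").mkdir()
    (root / "clips" / "sample.mp3").touch()
    print(check_corpus_layout(root))   # no problems left
```
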

str path_to_corpus

The path to the corpus

str language

The language of the corpus as an abbreviation

frozenset word_set

A frozen set of the words that should be used in the corpus

fasttext.FastText._FastText fast_text_model

The loaded fasttext model

dict mfa_to_sampa_dict

A dictionary that maps the MFA phonemes to the SAMPA phonemes

int word_amount

The amount of words that should be used, if 0 all clips are used

int min_word_count

The minimum number of times a word must occur to count toward the word amount argument

format_corpus():

Takes the path to the corpus and formats it to the fitting form

run_aligner():

Runs the Montreal Forced Aligner on the corpus

check_structure():

Checks if the corpus has the right format and if not corrects this

create_dataframe():

Extracts the MFA phonemes from the aligned corpus and creates a dataframe with synthesized data

create_dataframe_mp():

Extracts the MFA phonemes from the aligned corpus and creates a dataframe with synthesized data using multiprocessing, no mel spectrograms are created here

setup():

Downloads the fasttext model for the given language or calls load_fasttext_model() if it is already downloaded

load_fasttext_model():

Loads the fasttext model for the given language, from the resources folder

create_frozen_set():

Creates a frozen set of the words that should be used in the corpus and saves it as a class attribute

check_structure(word_amount, min_word_count)#

Checks if the corpus has the right format and if not corrects this

Parameters:
  • min_word_count (int) – The minimum number of times a word must occur to count toward the word amount argument

  • word_amount (int) – How many words should be processed, if 0 all words are processed, inclusion is based on the min_word_count argument

Returns:

  • list clipnames – A list of the clip names

  • list Sentence_list – A list of the transcribed sentences in the same order as the clips.

create_data_frame(clip_list: list, sentence_list: list)#

Creates Dataframe with Vocaltract Lab data and other data

Parameters:
  • path_to_corpus (str) – The path to the corpus

  • clip_list (list) – A list of the clip names present in the corpus

  • sentence_list (list) – A list of the sentences present in the corpus in the same order as the clip_list , so they fit together

Returns:

label                       description
'file_name'                 name of the clip
'mfa_word'                  the spoken word as it is in the aligned textgrid
'lexical_word'              the word as it is in the dictionary
'word_position'             the position of the word in the sentence
'sentence'                  the sentence the word is part of
'wav_recording'             spliced out audio as mono audio signal
'sr_recording'              sampling rate of the recording
'sr_synthesized'            sampling rate of the synthesized audio
'sampa_phones'              the SAMPA(-like) phonemes of the word
'mfa_phones'                the phonemes as output by the aligner
'phone_durations_lists'     the duration of each phone in the word as a list
'cp_norm'                   normalized cp-trajectories
'melspec_norm_recorded'     normalized mel spectrogram of the audio clip
'melspec_norm_synthesized'  normalized mel spectrogram synthesized from the cp-trajectories
'vector'                    embedding vector of the word, based on fastText embeddings
'client_id'                 id of the client

Return type:

create_data_frame_mp(clip_list: list, sentence_list: list, num_cores)#

Creates Dataframe with Vocaltract Lab data and other data with multiprocessing

Parameters:
  • clip_list (list) – A list of the clip names present in the corpus

  • sentence_list (list) – A list of the sentences present in the corpus in the same order as the clip_list , so they fit together

  • num_cores (int) – The maximum number of cores to use

Returns:

label                       description
'file_name'                 name of the clip
'mfa_word'                  the spoken word as it is in the aligned textgrid
'lexical_word'              the word as it is in the dictionary
'word_position'             the position of the word in the sentence
'sentence'                  the sentence the word is part of
'wav_recording'             spliced out audio as mono audio signal
'sr_recording'              sampling rate of the recording
'sr_synthesized'            sampling rate of the synthesized audio
'sampa_phones'              the SAMPA(-like) phonemes of the word
'mfa_phones'                the phonemes as output by the aligner
'phone_durations_lists'     the duration of each phone in the word as a list
'cp_norm'                   normalized cp-trajectories
'vector'                    embedding vector of the word, based on fastText embeddings
'client_id'                 id of the client

Return type:

create_frozen_set(validated_sentences, word_amount, min_word_count)#

Creates a frozen set of the words that should be used in the corpus and saves it as a class attribute

Parameters:
  • validated_sentences (pd.Series) – The sentences from the validated.tsv file

  • word_amount (int) – The amount of words that should be used, if 0 all clips are used

  • min_word_count (int) – The minimum number of times a word must occur to count toward the word amount argument

Return type:

None
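
How word_amount and min_word_count could interact when the word set is built is sketched below. This is an illustration under stated assumptions, not the library's implementation: words are split on whitespace and lowercased, counted over all sentences, and the first word_amount words (in order of first appearance) that reach min_word_count are kept. The real method works on a pandas Series and an approximate lexical word count; the function select_words is hypothetical.

```python
from collections import Counter


def select_words(sentences, word_amount=0, min_word_count=0):
    """Return a frozenset of words selected by count and amount.

    Assumed behaviour: word_amount == 0 means no limit on the number of words.
    """
    # Counter keeps keys in order of first appearance.
    counts = Counter(word for s in sentences for word in s.lower().split())
    selected = [word for word, n in counts.items() if n >= min_word_count]
    if word_amount > 0:
        selected = selected[:word_amount]
    return frozenset(selected)


sentences = ["das Haus ist alt", "das Auto ist neu", "das Haus"]
# "das" occurs 3 times, "haus" and "ist" twice, all other words once.
print(sorted(select_words(sentences, word_amount=2, min_word_count=2)))
print(sorted(select_words(sentences)))
```

A frozenset fits here because the selection is fixed once it is built and membership tests during processing stay cheap.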

format_corpus(word_amount, min_word_count)#

Takes the path to the corpus and formats it to the fitting form

Parameters:

word_amount (int) – The amount of words that should be used, if 0 all clips are used

Return type:

None

load_fasttext_model(language: str)#

Loads the fasttext model for the given language

Parameters:

language (str) – The language of the model

Returns:

The loaded fasttext model

Return type:

fasttext.FastText._FastText

run_aligner(mfa_workers: int, batch_size: int)#

Runs the Montreal Forced Aligner on the corpus

Parameters:
  • mfa_workers (int) – The number of workers to use

  • batch_size (int) – The size of the batches

Return type:

None (only side effects)
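
What the batch_size parameter implies can be illustrated with a generic chunking helper. This is only a sketch of splitting a list of text files into batches; how run_aligner actually forms and passes its batches to MFA is not documented here, and the name batches is made up for this example.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Seven hypothetical transcript files split into batches of three.
text_files = [f"clip_{i}.txt" for i in range(7)]
for number, batch in enumerate(batches(text_files, 3)):
    print(number, batch)
```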