How to use?#
create_vtl_corpus automatically creates control parameter
trajectories for the VocalTractLab (VTL) articulatory synthesizer from a large
corpus of speech audio files and their transcriptions. It includes the
code to align the transcriptions to the audio with the Montreal Forced Aligner
(MFA). create_vtl_corpus is mainly used and tested with data from the
Mozilla Common Voice project.
To align and synthesize the first 100 words that appear at least 4
times in a German speech corpus at the path CORPUS, and to save the results as
a pandas DataFrame under the name SAVE_DF_NAME, run the following command:
python -m create_vtl_corpus.create_corpus --corpus CORPUS --language de --needs_aligner --use_mp --min_word_count 4 --word_amount 100 --save_df_name SAVE_DF_NAME --num_cores 4
Use --help or -h to get a full list of the command line options.
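If the command above should be launched from a script, the argument list can be assembled in Python first. The following is a minimal sketch, not part of create_vtl_corpus: the helper name `build_command` and the example paths are assumptions, and only flags that appear in the command above are used.

```python
import shlex
import sys


def build_command(corpus, save_df_name, *, language="de",
                  min_word_count=4, word_amount=100, num_cores=4):
    """Return the argument list for the create_corpus command shown above.

    Hypothetical helper: it only builds the list and does not run anything.
    """
    return [
        sys.executable, "-m", "create_vtl_corpus.create_corpus",
        "--corpus", str(corpus),
        "--language", language,
        "--needs_aligner",
        "--use_mp",
        "--min_word_count", str(min_word_count),
        "--word_amount", str(word_amount),
        "--save_df_name", save_df_name,
        "--num_cores", str(num_cores),
    ]


if __name__ == "__main__":
    # Placeholder path and name; replace them with your own.
    cmd = build_command("/path/to/corpus", "my_corpus")
    print(shlex.join(cmd))
    # To actually run it (requires create_vtl_corpus and MFA to be installed):
    # import subprocess
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the corpus path contains spaces.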
Furthermore, you can use it within Python. Here the
create_vtl_corpus.create_corpus.CreateCorpus class is a good starting
point.
Use cases#
create_vtl_corpus can be used to:
Create the VTL corpus without any pre-existing data, but only from part of the files, e.g. only 10_000 samples.
Add new data to an already existing corpus, e.g. add 10_000 new samples.
Filter by words? Create 1000 more “post” word types.
Flags#
The following flags can be used to modify the behaviour of the library.
Converts a corpus to the vocaltract lab format
usage: fancytool [-h] --corpus CORPUS [--language LANGUAGE]
[--mfa_workers MFA_WORKERS] [--needs_aligner] [--use_mp]
[--append_to_df APPEND_TO_DF]
[--min_word_count MIN_WORD_COUNT] [--word_amount WORD_AMOUNT]
[--aligner_batch_size ALIGNER_BATCH_SIZE]
[--num_cores NUM_CORES] [--save_df_name SAVE_DF_NAME]
[--df_save_path DF_SAVE_PATH] [--debug]
[--epoch_size EPOCH_SIZE] [--start_epoch START_EPOCH]
[--end_epoch END_EPOCH] [--error_factor ERROR_FACTOR]
[--add_melspec]
Named Arguments#
- --corpus
The path to the corpus which should be converted to the VocalTractLab format
- --language
The language of the corpus as an abbreviation
Default: 'de'
- --mfa_workers
The number of MFA workers to use
Default: 6
- --needs_aligner
Whether the aligner should be run
Default: False
- --use_mp
Whether multiprocessing should be used in the creation of the dataframe
Default: False
- --append_to_df
Whether an already created dataframe should be searched for and then used instead of creating a new one
- --min_word_count
The minimum number of times a word must occur to count towards the word amount argument. This is based on an approximate lexical word count
Default: 0
- --word_amount
0 if the whole corpus should be processed, a positive integer if the number of words is limited. Since processing is sentence based, additional words (with a lower word count) will also be synthesized
Default: 0
- --aligner_batch_size
How many text files the aligner should process in one batch
Default: 5000
- --num_cores
The number of jobs the multiprocessing should use; uses the maximum by default. If the number is 1 or lower, no multiprocessing is used
Default: 1
- --save_df_name
The name to save the dataframe under; the language will be added automatically
Default: 'corpus_as_df_mp'
- --df_save_path
The path to save the dataframe to
Default: '/mnt/Restricted/Corpora/CommonVoiceVTL/'
- --debug
Whether debug mode should be used
Default: False
- --epoch_size
The number of clips processed in one epoch, after which the dataframe is saved
Default: 5000
- --start_epoch
The epoch to start with (inclusive)
Default: 0
- --end_epoch
The epoch to end with (inclusive)
Default: 1000
- --error_factor
How likely you estimate a word to occur less often than your min word count in the corpus
- --add_melspec
Whether mel spectrograms should be added to the dataframe when multiprocessing is used. Without multiprocessing, mel spectrograms are always added
Default: False
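The interplay of --epoch_size, --start_epoch and --end_epoch can be pictured as slicing the list of clips into consecutive chunks. The sketch below is an illustration under assumptions, not the library's code: the function name `epoch_slices` is hypothetical, and it assumes that epoch `e` covers the clips from index `e * epoch_size` up to, but not including, `(e + 1) * epoch_size`.

```python
def epoch_slices(n_clips, epoch_size=5000, start_epoch=0, end_epoch=1000):
    """Yield (epoch, first_index, stop_index) for every non-empty epoch.

    start_epoch and end_epoch are both inclusive, as in the flag
    descriptions; stop_index is exclusive. Assumed chunking scheme.
    """
    for epoch in range(start_epoch, end_epoch + 1):
        first = epoch * epoch_size
        if first >= n_clips:
            break  # no clips left for this or any later epoch
        yield epoch, first, min(first + epoch_size, n_clips)


if __name__ == "__main__":
    for epoch, first, stop in epoch_slices(12_000):
        print(f"epoch {epoch}: clips[{first}:{stop}]")
```

Under this assumption, restarting an interrupted run with a higher --start_epoch would skip the clips that were already processed and saved.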
Multiprocessing#
The library supports multiprocessing, which speeds up processing and is
absolutely necessary for large corpora. It is not enabled by default; to
enable it, use the --use_mp flag. When multiprocessing is used, mel spectrograms are added to the dataframe afterwards, without multiprocessing, and only if --add_melspec is set. This can take some time, so only do it
if necessary and if you need our kind of mel spectrograms.
Mel spectrograms can also be generated later from the information available in the
dataframe.
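The general pattern of spreading per-clip work over several processes, with a serial fallback when one core or fewer is requested, can be sketched with the standard library. This is a generic illustration, not the implementation used by create_vtl_corpus: `process_clip` is a placeholder for the real per-clip work, and `process_clips` is a hypothetical name.

```python
from multiprocessing import Pool


def process_clip(clip_name):
    """Placeholder for the per-clip work (alignment lookup, synthesis, ...)."""
    return clip_name, len(clip_name)


def process_clips(clip_names, num_cores=1):
    """Process all clips, in parallel only if num_cores is greater than 1."""
    if num_cores <= 1:
        # Serial fallback: no worker processes are started.
        return [process_clip(name) for name in clip_names]
    with Pool(processes=num_cores) as pool:
        # Pool.map returns the results in the order of the inputs.
        return pool.map(process_clip, clip_names)


if __name__ == "__main__":
    print(process_clips(["a.mp3", "bb.mp3", "ccc.mp3"], num_cores=2))
```

The worker function has to be defined at module level so that it can be pickled and sent to the worker processes.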
CreateCorpus Class#
- class create_vtl_corpus.create_corpus.CreateCorpus(path_to_corpus: str, *, language: str)#
This class generates the VocalTractLab trajectories for a corpus. It is assumed that the corpus has the following layout, as is common for Mozilla's Common Voice corpus when using MFA:
corpus/
├── validated.tsv          # a file where the transcripts are stored
├── clips/
│   └── *.mp3              # audio files (mp3)
└── files_not_relevant_to_this_project
- str path_to_corpus
The path to the corpus
- str language
The language of the corpus as an abbreviation
- frozenset word_set
A frozen set of the words that should be used in the corpus
- fasttext.FastText._FastText fast_text_model
The loaded fasttext model
- dict mfa_to_sampa_dict
A dictionary that maps the MFA phonemes to the SAMPA phonemes
- int word_amount
The amount of words that should be used, if 0 all clips are used
- int min_word_count
The minimum number of times a word must occur to count towards the word amount argument
- format_corpus():
Takes the path to the corpus and formats it to the fitting form
- run_aligner():
Runs the Montreal Forced Aligner on the corpus
- check_structure():
Checks if the corpus has the right format and if not corrects this
- create_data_frame():
Extracts the MFA phonemes from the aligned corpus and creates a dataframe with synthesized data
- create_data_frame_mp():
Extracts the MFA phonemes from the aligned corpus and creates a dataframe with synthesized data using multiprocessing; no mel spectrograms are created here
- setup():
Downloads the fasttext model for the given language, or calls load_fasttext_model() if it is already downloaded
- load_fasttext_model():
Loads the fasttext model for the given language, from the resources folder
- create_frozen_set():
Creates a frozen set of the words that should be used in the corpus and saves it as a class attribute
- check_structure(word_amount, min_word_count)#
Checks if the corpus has the right format and, if not, corrects this
- Parameters:
min_word_count (int) – The minimum number of times a word must occur to count towards the word amount argument
word_amount (int) – How many words should be processed, if 0 all words are processed, inclusion is based on the min_word_count argument
- Returns:
list clipnames – A list of the clip names
list Sentence_list – A list of the transcribed sentences in the same order as the clips.
- create_data_frame(clip_list: list, sentence_list: list)#
Creates a dataframe with VocalTractLab data and other data
- Parameters:
path_to_corpus (str) – The path to the corpus
clip_list (list) – A list of the clip names present in the corpus
sentence_list (list) – A list of the sentences present in the corpus in the same order as the clip_list , so they fit together
- Returns:
A dataframe with the following columns (label – description):
'file_name' – name of the clip
'mfa_word' – the spoken word as it is in the aligned textgrid
'lexical_word' – the word as it is in the dictionary
'word_position' – the position of the word in the sentence
'sentence' – the sentence the word is part of
'wav_recording' – spliced out audio as mono audio signal
'sr_recording' – sampling rate of the recording
'sr_synthesized' – sampling rate of the synthesized audio
'sampa_phones' – the SAMPA(-like) phonemes of the word
'mfa_phones' – the phonemes as output by the aligner
'phone_durations_lists' – the duration of each phone in the word as a list
'cp_norm' – normalized cp-trajectories
'melspec_norm_recorded' – normalized mel spectrogram of the audio clip
'melspec_norm_synthesized' – normalized mel spectrogram synthesized from the cp-trajectories
'vector' – embedding vector of the word, based on fastText embeddings
'client_id' – id of the client
- Return type:
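As an illustration of how the per-word columns fit together, the phones of one row can be paired with their durations to obtain the time span of every phone within the word. This is a sketch under assumptions: it treats 'sampa_phones' as a list with one entry per phone and 'phone_durations_lists' as a list of numbers of the same length, and the helper name `phone_intervals` is hypothetical. The time unit is whatever unit the durations are stored in.

```python
def phone_intervals(sampa_phones, phone_durations):
    """Return (phone, start, end) tuples relative to the start of the word.

    Assumes one duration per phone; raises ValueError otherwise.
    """
    if len(sampa_phones) != len(phone_durations):
        raise ValueError("expected exactly one duration per phone")
    intervals = []
    start = 0.0
    for phone, duration in zip(sampa_phones, phone_durations):
        intervals.append((phone, start, start + duration))
        start += duration  # the next phone begins where this one ends
    return intervals


if __name__ == "__main__":
    # Made-up values for a single row, for illustration only.
    print(phone_intervals(["p", "O", "s", "t"], [0.5, 0.25, 0.25, 0.5]))
```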
- create_data_frame_mp(clip_list: list, sentence_list: list, num_cores)#
Creates a dataframe with VocalTractLab data and other data, using multiprocessing
- Parameters:
clip_list (list) – A list of the clip names present in the corpus
sentence_list (list) – A list of the sentences present in the corpus in the same order as the clip_list , so they fit together
num_cores (int) – The maximum number of cores to use
- Returns:
A dataframe with the following columns (label – description):
'file_name' – name of the clip
'mfa_word' – the spoken word as it is in the aligned textgrid
'lexical_word' – the word as it is in the dictionary
'word_position' – the position of the word in the sentence
'sentence' – the sentence the word is part of
'wav_recording' – spliced out audio as mono audio signal
'sr_recording' – sampling rate of the recording
'sr_synthesized' – sampling rate of the synthesized audio
'sampa_phones' – the SAMPA(-like) phonemes of the word
'mfa_phones' – the phonemes as output by the aligner
'phone_durations_lists' – the duration of each phone in the word as a list
'cp_norm' – normalized cp-trajectories
'vector' – embedding vector of the word, based on fastText embeddings
'client_id' – id of the client
- Return type:
- create_frozen_set(validated_sentences, word_amount, min_word_count)#
Creates a frozen set of the words that should be used in the corpus and saves it as a class attribute
- Parameters:
validated_sentences (pd.Series) – The sentences from the validated.tsv file
word_amount (int) – The amount of words that should be used, if 0 all clips are used
min_word_count (int) – The minimum number of times a word must occur to count towards the word amount argument
- Return type:
None
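The idea behind the word selection can be sketched with a plain word counter. This is a simplified illustration and not the library's implementation: the real method works on an approximate lexical word count and stores the result as a class attribute, whereas the hypothetical `select_words` below counts whitespace-separated, lower-cased tokens, prefers the most frequent words (ties broken alphabetically), and returns the set.

```python
from collections import Counter


def select_words(sentences, word_amount=0, min_word_count=0):
    """Return a frozenset of words occurring at least min_word_count times.

    If word_amount is greater than 0, only the word_amount most frequent
    eligible words are kept; 0 means no limit. Assumed selection rule.
    """
    counts = Counter(
        word
        for sentence in sentences
        for word in sentence.lower().split()
    )
    eligible = [word for word, count in counts.items() if count >= min_word_count]
    # Most frequent first; alphabetical order makes ties deterministic.
    eligible.sort(key=lambda word: (-counts[word], word))
    if word_amount > 0:
        eligible = eligible[:word_amount]
    return frozenset(eligible)


if __name__ == "__main__":
    sentences = ["das ist gut", "das ist das", "gut so"]
    print(sorted(select_words(sentences, word_amount=2, min_word_count=2)))
```

A frozenset fits this purpose because membership tests are fast and the selection cannot be modified by accident while the clips are processed.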
- format_corpus(word_amount, min_word_count)#
Takes the path to the corpus and formats it to the fitting form
- Parameters:
word_amount (int) – The amount of words that should be used, if 0 all clips are used
- Return type:
None
- load_fasttext_model(language: str)#
Loads the fasttext model for the given language
- Parameters:
language (str) – The language of the model
- Returns:
The loaded fasttext model
- Return type:
fasttext.FastText._FastText
- run_aligner(mfa_workers: int, batch_size: int)#
Runs the Montreal Forced Aligner on the corpus
- Parameters:
mfa_workers (int) – The number of workers to use
batch_size (int) – The size of the batches
- Return type:
None (only side effects)
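Put together, the documented methods suggest a workflow along the following lines. This is a sketch under stated assumptions and not the code behind the command line tool: the order of the calls is inferred from the method descriptions, the helper name `build_corpus` is hypothetical, and additional calls such as setup() or create_frozen_set() may be required by the real class. The class is passed in as an argument so that the sketch does not depend on the package being importable.

```python
def build_corpus(corpus_cls, path_to_corpus, *, language="de",
                 word_amount=100, min_word_count=4,
                 mfa_workers=6, batch_size=5000,
                 needs_aligner=True, num_cores=1):
    """Run an assumed sequence of CreateCorpus methods and return the result."""
    corpus = corpus_cls(path_to_corpus, language=language)
    if needs_aligner:
        # Assumption: formatting is only needed before the aligner runs.
        corpus.format_corpus(word_amount, min_word_count)
        corpus.run_aligner(mfa_workers, batch_size)
    clip_list, sentence_list = corpus.check_structure(word_amount, min_word_count)
    if num_cores > 1:
        return corpus.create_data_frame_mp(clip_list, sentence_list, num_cores)
    return corpus.create_data_frame(clip_list, sentence_list)


# Possible usage, assuming create_vtl_corpus and MFA are installed:
# from create_vtl_corpus.create_corpus import CreateCorpus
# df = build_corpus(CreateCorpus, "/path/to/corpus", language="de")
```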