A forced aligner is a tool used to provide automatic segmentation of spoken audio files. Typically, you start with an audio file and some type of transcribed text file and the forced aligner forces the audio to be aligned with the text.
The Montreal Forced Aligner (MFA) is a forced alignment tool created by researchers at McGill University in Montreal.
McAuliffe, Michael, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger (2017). Montreal Forced Aligner: trainable text-speech alignment using Kaldi. In Proceedings of the 18th Conference of the International Speech Communication Association.
Forced alignment requires both an acoustic model, which contains information about how the phones of a language are pronounced (both in isolation and when surrounded by other phones), and a pronouncing dictionary, which gives a phone{m/t/?}ic transcription of words in the language. The MFA is nice because it comes with many pre-trained acoustic models for a wide variety of languages, as well as corresponding pronouncing dictionaries. Additionally, it has pre-trained grapheme-to-phoneme models that convert orthographic to phonetic transcription. Finally, it also allows users to train their own acoustic models and grapheme-to-phoneme models for a) potentially better alignment performance and b) the ability to do forced alignment on languages for which they do not have models.
This workshop assumes that you have Praat downloaded on your computer. If you do not, you can install it by going to praat.org and clicking on the link in the top left corner corresponding to your operating system. This will bring you to a page that will guide you through the installation process.
Instructions for installation can be found here. There are two general steps that need to be taken.
conda create -n aligner -c conda-forge montreal-forced-aligner
To activate the MFA environment, you can run the command `conda activate aligner`. To deactivate it, you can run `conda deactivate`.
There are two ways in which we can organize our files for alignment. Each one requires a different directory structure, so we’ll look at them both individually.
The first method uses `.lab` or `.txt` files containing tab-delimited orthographic transcription for each word. I have always just used `.txt` files. If you’d like to learn more about `.lab` files, you can do so here.
This method assumes that you may have multiple files for multiple speakers. You should therefore make a new directory `<corpus-name>` with sub-folders `speaker1`, `speaker2`, …, `speakern`.
Activity 1: During this workshop, we will align data from 4 speakers. So let’s create our corpus directory and 4 speaker sub-directories now. Once you’ve done this, download these files, unzip them, and move them into the correct folders on your machine.
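One way to set up the layout for Activity 1 from the command line is sketched below. The corpus name `my_corpus` is a placeholder chosen here, not a name the workshop prescribes; the speaker folder names follow the pattern described above.

```shell
# Hypothetical corpus name; replace it with your own <corpus-name>.
corpus="my_corpus"

# Create the corpus directory and one sub-folder per speaker.
for i in 1 2 3 4; do
    mkdir -p "$corpus/speaker$i"
done

# List the result to confirm the structure.
ls "$corpus"
```

Each speaker's audio and transcript files would then go into the matching sub-folder.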
MFA only runs on audio files in a specific format. Fortunately, as long as you are using `.wav` files, it takes care of things automatically. That said, it is recommended not to use files with a sampling rate lower than 16 kHz. If you have recordings in another format (e.g. `.mp3`), you should have `sox` or `ffmpeg` installed, both of which provide ways to convert files into the `.wav` format. We won’t go over that here, but it is worth being aware of.
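For reference, a conversion along these lines might look like the following sketch. It assumes `ffmpeg` is installed and that the `.mp3` files sit in the current directory. The helper name `convert_to_wav` is made up for this example, and the 16 kHz mono output is a choice made here to match the recommendation above, not a requirement stated by MFA.

```shell
# Convert one .mp3 file to a 16 kHz mono .wav file with ffmpeg.
convert_to_wav() {
    src="$1"
    dst="${src%.mp3}.wav"   # same name, .wav extension
    # -ar sets the sampling rate, -ac the number of channels.
    ffmpeg -i "$src" -ar 16000 -ac 1 "$dst"
}

# Convert every .mp3 in the current directory, if any exist.
for f in *.mp3; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
    convert_to_wav "$f"
done
```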
The second method uses Praat TextGrid files and generally can handle longer audio and multiple speakers within a single file. For this reason, you only need to make a corpus directory with no further sub-folders. The TextGrid file should have a unique tier for each speaker. On each tier, you can then make boundaries for chunks around 5-6 seconds. You can then label each interval with the orthographic transcription of what is being said.
Activity 2: Create a second corpus directory (with a different name from the one above). Once you’ve done this, download these files, unzip them, and move them into the correct folders on your machine.
MFA has acoustic models and/or pronouncing dictionaries for the following languages (as of 11/5/2022). Note that some languages have multiple dictionaries and acoustic models to account for different dialects, and some are more expansive than others, so it’s important to read about each model and dictionary before using it. It’s fine to use general models for more specific data sets, but it’s crucial to be aware of the limitations and the social/ethical concerns that may arise from doing so.
Language | Acoustic Model | Pronouncing Dictionary |
---|---|---|
Abkhaz | TRUE | TRUE |
Armenian | TRUE | TRUE |
Bashkir | TRUE | TRUE |
Basque | TRUE | TRUE |
Belarusian | TRUE | TRUE |
Bulgarian | TRUE | TRUE |
Chuvash | TRUE | TRUE |
Croatian | TRUE | TRUE |
Czech | TRUE | TRUE |
Dutch | TRUE | TRUE |
English | TRUE | TRUE |
French | TRUE | TRUE |
Georgian | TRUE | TRUE |
German | TRUE | TRUE |
Greek | TRUE | TRUE |
Guarani | TRUE | TRUE |
Hausa | TRUE | TRUE |
Hungarian | TRUE | TRUE |
Italian | TRUE | FALSE |
Japanese | FALSE | TRUE |
Kazakh | TRUE | TRUE |
Korean | TRUE | TRUE |
Kurmanji | TRUE | TRUE |
Kyrgyz | TRUE | TRUE |
Maltese | FALSE | TRUE |
Mandarin | TRUE | TRUE |
Polish | TRUE | TRUE |
Portuguese | TRUE | TRUE |
Punjabi | FALSE | TRUE |
Romanian | TRUE | TRUE |
Russian | TRUE | TRUE |
Sorbian | TRUE | TRUE |
Spanish | TRUE | TRUE |
Swahili | TRUE | TRUE |
Swedish | TRUE | TRUE |
Tamil | TRUE | TRUE |
Tatar | TRUE | TRUE |
Thai | TRUE | TRUE |
Turkish | TRUE | TRUE |
Ukrainian | TRUE | TRUE |
Urdu | FALSE | TRUE |
Uyghur | TRUE | TRUE |
Uzbek | TRUE | TRUE |
Vietnamese | TRUE | TRUE |
To download an acoustic model to your local computer, first make sure that you have activated the `aligner` environment. Then run the command `mfa model download acoustic <name of acoustic model>`.
Likewise, to download a pronouncing dictionary, run the command `mfa model download dictionary <name of pronouncing dictionary>`.
Activity 3: Since our audio files are for American English, we will need to download one of the American English acoustic models and one of the American English pronouncing dictionaries. Both are named `english_us_arpa`. Use the commands described above to download both of them to your local machine. They should end up in `Documents/MFA/pretrained_models`. In the `dictionary/` sub-folder you should see the file `english_us_arpa.dict`. Open this file in a text editor, because we will edit it later and it’s good to have an idea of what it looks like. If you are unfamiliar with ARPABET transcription, you can read more about it here. These transcriptions use the 2-letter version.
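For Activity 3, the two download commands could be run together as in the sketch below. The check for `mfa` on the `PATH` is an addition made here for convenience and is not part of MFA itself.

```shell
# Name given above for both the acoustic model and the dictionary.
model="english_us_arpa"

if command -v mfa >/dev/null 2>&1; then
    mfa model download acoustic "$model"
    mfa model download dictionary "$model"
else
    # mfa is only available once the conda environment is active.
    echo "mfa not found: run 'conda activate aligner' first"
fi
```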
Now that we have our files in the correct format, along with our acoustic model and pronouncing dictionary, we can get to the main event: doing the forced alignment! Before we do that, though, we can run a validation command that checks that everything is set up correctly. The command for this is `mfa validate <path to corpus> <name of pronouncing dictionary> <name of acoustic model>`.
Activity 4: Run the `validate` command on our first corpus. It takes ~40 seconds to run on my computer. It will give you some information in the console. One thing it will tell you is that there are some OOV words. OOV stands for out of vocabulary. This means that the word it is trying to align does not have a pronunciation in the dictionary. We will ignore this for now, but later I will show you how to add new words to the pronouncing dictionaries.
The command to run the aligner is quite similar. The only thing we need to do is add an output directory where the output files (Praat TextGrids) will go. The command to do alignment is `mfa align`, which takes the same arguments as `mfa validate` followed by the path to the output directory.
Activity 5: Run the `align` command on our first corpus. It should once again take ~40 seconds. Once you are finished, navigate to your corpus and output directories, and open the corresponding `.wav` and `.TextGrid` files in Praat. In one of the files for each speaker, you should notice that one word was not transcribed. This is unsurprising since we already knew we had an OOV item. Navigate to the `english_us_arpa.dict` file and open it in a text editor. We can add words by providing the orthographic form, a tab, then an ARPABET transcription with spaces between each segment. In this file, you will also see 4 numbers. We can ignore these. They refer to various probabilities and will be given default values if we don’t provide anything. Add the line `skornash S K OW1 R N AA0 SH` to the dictionary file and then save it. Run the `align` command again, but this time add the `--clean` flag to the end. This will ensure that the previous alignment files are deleted. If they don’t get deleted, it will not run properly. If you re-open the `.wav` and `.TextGrid` files, you should see that there is now a transcription and alignment for the problematic word.
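The dictionary edit in Activity 5 can also be scripted. The sketch below defines a hypothetical helper, `add_word`, that appends a tab-separated entry to a dictionary file unless the word is already listed. The helper name and the duplicate check are assumptions made for illustration, not MFA features, and the path in the example call assumes that `Documents` sits in your home directory.

```shell
# add_word DICT WORD PRONUNCIATION
# Append "WORD<TAB>PRONUNCIATION" to DICT unless WORD already has an entry.
add_word() {
    dict="$1"
    word="$2"
    pron="$3"
    # Look for a line that starts with the word followed by a tab.
    if grep -q "^${word}$(printf '\t')" "$dict" 2>/dev/null; then
        echo "$word is already in $dict"
    else
        printf '%s\t%s\n' "$word" "$pron" >> "$dict"
    fi
}

# Example call (adjust the path to your machine):
# add_word "$HOME/Documents/MFA/pretrained_models/dictionary/english_us_arpa.dict" \
#     skornash "S K OW1 R N AA0 SH"
```

After adding the entry, the alignment still needs to be rerun with `--clean` so that the earlier output is discarded.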
Activity 6: Run the `align` command on the second corpus. Note that here you provide a `.TextGrid` file in the input and get a `.TextGrid` file in your output, so make sure you open the correct file when looking at the finished product.
It is important to remember that using forced aligners is not a replacement for segmenting by hand. It’s best used as a first pass that you should then look over to make sure everything is in the right place. That being said, it is much easier to move Praat TextGrid boundaries around than it is to create them. So using a forced aligner still saves a lot of time. But remember…never assume that you don’t have to check the output. You can, and certainly will, get burned.