The assumption is that you have a project called emu2021 and that it contains the directories testsample and emu_databases. If not, please see 1. Preliminaries. Start up R in the project you are using for this course.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(emuR)
##
## Attaching package: 'emuR'
## The following object is masked from 'package:base':
##
## norm
library(wrassp)
In R, store the path to the directory testsample as sourceDir in exactly the following way:
sourceDir = "./testsample"
Also store in R the path to emu_databases as targetDir:
targetDir = "./emu_databases"
The directory testsample/german on your computer contains .wav files and .txt files. Define the path to this database in R and check that you can see these files with the list.files() function:
path.german = file.path(sourceDir, "german")
list.files(path.german)
## [1] "K01BE001.txt" "K01BE001.wav" "K01BE002.txt" "K01BE002.wav"
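As a quick sanity check (a sketch using only base R; not part of the original instructions), you can verify that every .wav file has a matching .txt file before converting the collection:

```r
# Sketch: check the .wav/.txt pairing in path.german (defined above).
# tools::file_path_sans_ext() strips the extension so the file stems can be compared.
wav = tools::file_path_sans_ext(list.files(path.german, pattern = "\\.wav$"))
txt = tools::file_path_sans_ext(list.files(path.german, pattern = "\\.txt$"))
setequal(wav, txt)   # TRUE if every .wav has a .txt and vice versa
```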
The above is an example of a text collection because it contains matching .wav and .txt files in the same directory such that, for each .wav file, the .txt file contains the corresponding orthography. The following command converts the text collection into an Emu database with name ger2 and stores it as ger2_emuDB in targetDir:
# only execute once!
convert_txtCollection(dbName = "ger2",
sourceDir = path.german,
targetDir = targetDir)
## INFO: Parsing plain text collection containing 2 file pair(s)...
## INFO: Copying 2 media files to EMU database...
## INFO: Rewriting 2 _annot.json files to file system...
Load the database in R.
ger2_DB = load_emuDB(file.path(targetDir, "ger2_emuDB"))
## INFO: Loading EMU database from ./emu_databases/ger2_emuDB... (2 bundles found)
summary(ger2_DB)
Look at the database and switch to hierarchy view. The words are a single item in the attribute transcription of the bundle level:
serve(ger2_DB, useViewer = F)
That the words are stored in this way is evident when querying this database. Note that calcTimes=F is needed because the tier transcription is of type ITEM and is not linked to a time-based (SEGMENT or EVENT) tier.
# segment list
t.s = query(ger2_DB, "transcription =~ .*", calcTimes=F)
# labels
t.s$labels
## [1] "heute ist schönes Frühlingswetter" "die Sonne lacht"
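Each label is the whole utterance rather than a single word. A small (hypothetical) check with base R's strsplit() makes this concrete by counting the words in each transcription:

```r
# The labels returned by the query above, copied here so the snippet is self-contained
labs = c("heute ist schönes Frühlingswetter", "die Sonne lacht")
# Split each transcription at spaces and count the words per utterance
sapply(strsplit(labs, " "), length)
## [1] 4 3
```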
We are now going to run the Munich Automatic Segmentation (MAUS) system over the database. The language selected is German, deu-DE. See the information on the available languages and search for LANGUAGE. At the time of writing, the available languages are:
LANGUAGE: [cat, deu, eng, fin, hat, hun, ita, mlt, nld, nze, pol, aus-AU, afr-ZA, sqi-AL, eus-ES, eus-FR, cat-ES, nld-NL-GN, nld-NL, nld-NL-OH, nld-NL-PR, eng-US, eng-AU, eng-GB, eng-GB-OH, eng-GB-OHFAST, eng-GB-LE, eng-SC, eng-NZ, eng-CA, eng-GH, eng-IN, eng-IE, eng-KE, eng-NG, eng-PH, eng-ZA, eng-TZ, ekk-EE, kat-GE, fin-FI, fra-FR, deu-DE, gsw-CH-BE, gsw-CH-BS, gsw-CH-GR, gsw-CH-SG, gsw-CH-ZH, gsw-CH, hat-HT, hun-HU, isl-IS, ita-IT, jpn-JP, gup-AU, sampa, ltz-LU, mlt-MT, nor-NO, fas-IR, pol-PL, ron-RO, rus-RU, slk-SK, spa-ES, spa-AR, spa-BO, spa-CL, spa-CO, spa-CR, spa-DO, spa-EC, spa-SV, spa-GT, spa-HN, spa-MX, spa-NI, spa-PA, spa-PY, spa-PE, spa-PR, spa-US, spa-UY, spa-VE, swe-SE, tha-TH, guf-AU]
# only execute once!
# you must have an active web connection for this to work!
runBASwebservice_all(ger2_DB,
transcriptionAttributeDefinitionName = "transcription",
language = "deu-DE",
runMINNI = F)
## INFO: Preparing temporary database. This may take a while...
## INFO: Checking if cache needs update for 1 sessions and 2 bundles ...
## INFO: Performing precheck and calculating checksums (== MD5 sums) for _annot.json files ...
## INFO: Nothing to update!
## INFO: Sending ping to webservices provider.
## INFO: Running G2P tokenizer on emuDB containing 2 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running G2P on emuDB containing 2 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running MAUS on emuDB containing 2 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running Pho2Syl (canonical) on emuDB containing 2 bundle(s)...
## INFO: Sending ping to webservices provider.
## INFO: Running Pho2Syl (segmental) on emuDB containing 2 bundle(s)...
## INFO: Autobuilding syllable -> segment links from time information
## INFO: Rewriting 2 _annot.json files to file system...
Now look at the information stored in the database and note the extra tiers that have been created (KAN, KAS, MAU, MAS):
summary(ger2_DB)
serve(ger2_DB, useViewer = F)
Look at the hierarchy view. Identify the levels, links and attributes.
More complex queries are now possible, e.g. finding the word-initial MAU segments of all polysyllabic words:
mau.s = query(ger2_DB,
              "[[MAU =~.* & Start(ORT, MAU)=1] ^ Num(ORT, MAS) > 1]")
mau.s
## # A tibble: 4 × 16
## labels start end db_uuid session bundle start_item_id end_item_id level
## <chr> <dbl> <dbl> <chr> <chr> <chr> <int> <int> <chr>
## 1 h 117. 167. a638e68d-90… 0000 K01BE… 7 7 MAU
## 2 S 597. 717. a638e68d-90… 0000 K01BE… 13 13 MAU
## 3 f 977. 1027. a638e68d-90… 0000 K01BE… 18 18 MAU
## 4 z 967. 1027. a638e68d-90… 0000 K01BE… 8 8 MAU
## # … with 7 more variables: attribute <chr>, start_item_seq_idx <int>,
## # end_item_seq_idx <int>, type <chr>, sample_start <int>, sample_end <int>,
## # sample_rate <int>
Check their word labels:
requery_hier(ger2_DB, mau.s, "ORT")$labels
## [1] "heute" "schönes" "Frühlingswetter" "Sonne"
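Since tidyverse is loaded, segment durations can be computed directly from the start and end times in a segment list. The following is a sketch in which a toy tibble stands in for mau.s, with values taken from the rounded output above:

```r
library(tidyverse)

# Toy stand-in for the segment list mau.s queried above
seg = tibble(labels = c("h", "S", "f", "z"),
             start  = c(117, 597, 977, 967),
             end    = c(167, 717, 1027, 1027))

# Duration in ms of each word-initial segment: 50, 120, 50, 60
seg %>% mutate(duration = end - start)
```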
The task is to try out forced alignment for a different language and also to show how forced alignment can be done from a canonical phonemic transcription instead of from text. The database is in the albanian directory of testsample.
path.albanian = file.path(sourceDir, "albanian")
Begin by converting the text collection into an Emu database.
# only execute once!
convert_txtCollection(dbName = "alb",
sourceDir = path.albanian,
targetDir = targetDir)
## INFO: Parsing plain text collection containing 4 file pair(s)...
## INFO: Copying 4 media files to EMU database...
## INFO: Rewriting 4 _annot.json files to file system...
alb_DB = load_emuDB(file.path(targetDir, "alb_emuDB"))
## INFO: Loading EMU database from ./emu_databases/alb_emuDB... (4 bundles found)
summary(alb_DB)
Look at the database, switch to hierarchy view, and verify that the words are stored at bundle -> transcription, as for the German database above.
serve(alb_DB, useViewer = F)
Now run MAUS, just as before. The language here is sqi-AL for Albanian. (Note that this will take a bit longer than for German: possibly a couple of minutes at least.)
# only execute once!
runBASwebservice_all(alb_DB,
transcriptionAttributeDefinitionName = "transcription",
language = "sqi-AL",
runMINNI = F)
summary(alb_DB)
Look at the database and verify that the same kind of information has been automatically derived, as for the German database earlier.
serve(alb_DB, useViewer = F)
MAUS also allows an automatic segmentation to be derived directly from the canonical level. This can be useful when the canonical representation provided by MAUS deviates considerably from what was actually said. For one of the words in 0001BF_1syll_1, the canonical representation has J E when what was actually said was closer to n J E.
First switch in hierarchy view from ORT -> KAN and then change the node J E of the ORT:KAN tier to n j E for file 0001BF_1syll_1, in the manner of Figure 4.1. See section 9.2.2 of the Emu SDMS manual for how to handle hierarchical annotations manually.
Figure 4.1: A fragment of a hierarchy view
In order to run MAUS on this more appropriate pronunciation, first change it as in Figure 4.1 above. (Don't forget to save the annotation after editing.)
Now run MAUS directly on this canonical level. Store the MAU segmentations in a new tier MAU2 (to differentiate it from the already created MAU tier).
# only execute once!
runBASwebservice_maus(alb_DB,
canoAttributeDefinitionName = "KAN",
mausAttributeDefinitionName = "MAU2",
language = "sqi-AL")
Inspect the database again. There should now be a new tier MAU2:
summary(alb_DB)
Verify that there is now a visible tier with the added segment /n/:
serve(alb_DB, useViewer = F)
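As a final check (a sketch, assuming the MAU2 tier was created by runBASwebservice_maus() above), the new tier can be queried like any other SEGMENT tier, and the added /n/ should appear among its labels:

```r
# Sketch: query all segments on the new MAU2 tier of the Albanian database
mau2.s = query(alb_DB, "MAU2 =~ .*")

# Tabulate the labels; /n/ should be among them after the edit above
table(mau2.s$labels)
```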