BAVARIAN ARCHIVE FOR SPEECH SIGNALS
University of Munich, Institute of Phonetics
Schellingstr. 3/II, 80799 Munich, Germany
bas@phonetik.uni-muenchen.de

COPYRIGHT

Florian Schiel, University of Munich 1998. All rights reserved.
This corpus and software may not be disseminated further - not even
in part - without written permission of the copyright holders.

Additional Copyright Holders

----------------------------------------------------------------------
Munich AUtomatic Segmentation (MAUS)
BAS Distribution Package MAUS 19.08.03 / 21.04.17
----------------------------------------------------------------------

GENERAL REMARKS

The script maus reads a string of phonemic symbols as defined in the
parameter file KANINVENTAR, reads a signal from the file signal.nis
and performs a MAUS segmentation according to these inputs. The
resulting segmentation is written either into a BPF MAU tier file,
into a Praat compatible TextGrid file, or into an Emu compatible
structure.

SYNOPSIS and OPTIONS

Basic maus script: Please refer to the initial comment block in the
maus script. Simply call 'maus | less' to read it.

To process a complete corpus use the script maus.corpus. Simply call
'maus.corpus | less' to read the usage message.

To adapt the HMM to a speech corpus (e.g. the speech of one speaker,
of a group of speakers, or of a new language) use the script
maus.iter.

CONSTRAINTS

There are a number of constraints on how to use this script; please
read the following carefully:

- never run several maus.corpus processes in parallel on the same
  speech data set, even if you define different output directories.
  maus handles different processes gracefully, but maus.corpus and
  maus.iter do not.

- always check the output produced by the scripts for the key word
  'ERROR'. If it occurs, the results are usually not correct.
  A good way to use maus is to pipe stdout and stderr into a log
  file. The key word 'WARNING' indicates that a maus script has found
  a situation where things may go wrong, or that it wants to alert
  the user to a default mechanism. In most cases this does not cause
  the output to be formally incorrect, but the segmentation may not
  be the one you intended.

- parameter sets for languages other than German provided in this
  package are often produced by users of maus and sent in for
  distribution. In some cases these parameter sets use existing
  acoustical models and map them to a phoneme set in a different
  language. For many European languages this works surprisingly well
  (although we don't have any hard data about this). Also, in some
  cases the non-German parameter sets contain the German-based
  statistical and phonological rule sets (see remarks in the READMEs
  within the parameter sets). Since these rules usually do not fit
  other languages, it is recommended to use the option MODUS=align to
  ignore these rule sets and perform a simple forced alignment.

PHONEME SYMBOLS IN INPUT

The string of phonemic input symbols must not contain any symbols
other than those defined in PARAM.<LANG>/KANINVENTAR. You may alter
KANINVENTAR, but then you also have to take care of a number of other
resource files that depend on KANINVENTAR (not recommended).

The symbol '#' may be used between words to indicate optional pauses
between the words, but only in KANSTR. This is highly recommended.
When reading from a BPF file (option BPF) these optional pauses are
inserted automatically.

Human noise can be modelled by the symbol '<usb>' anywhere in the
input string; '<nib>' can be used for other noise.

SILENCE INTERVALS

Automatic modelling: Maus will automatically insert optional silence
models (HMM '#') between words (see option MINPAUSLEN) and output
these as 'detached' silence segments '<p:>' (with word number -1) if
they exceed MINPAUSLEN times 10msec.
The same is true for utterance initial/final silence ('<' and '>'),
which used to be non-optional HMMs (before maus 3.33); the option
NOINITIALFINALSILENCE=true suppresses even these if a user wants to
be sure that no silence interval is recognized at the beginning of a
recording (e.g. to avoid confusions with initial plosives).

Manual modelling: Intra-word silence intervals can be modelled by
inserting the symbols '<p:>' (optional silence) or '<p>' (enforced
silence, minimum length is 30msec) in the canonical input string ('#'
in the phonological input will be ignored because in some
phonological forms it marks a compound boundary! This is not the case
for option KANSTR, though!); e.g. /ba:n<p:>hof/ will model an
optional silence interval between /n/ and /h/; in the MAUS output
these models appear as '<p:>' segments (IPA [(...)]) or do not appear
at all. Intra-word silence intervals are always linked to the word
number in which they appear.

If an optional '<p:>' is the only symbol within a word, it will be
modelled by a non-optional silence model (HMM '<') because HTK cannot
model words that consist only of a tee-model; it will appear as a
single segment '<p:>' linked to that 'silence word'.

It is allowed to model a 'silence word' as /<p>/ or /<...>/ (where
'...' is an arbitrary string without blanks, but not one of 'usb' or
'nib') in the KAN input tier; both will model a non-optional silence
model and both will produce a '<p:>' in the phonetic output (IPA
[(...)]) that has a word link, and the 'word' appears as a numbered
word in the ORT/KAN tiers (see TAGS PASSING below).

To summarize ('#' symbolizes word boundaries here, '<' '>' utterance
begin/end):

KAN input      MODEL             ORT/KAN OUTPUT  MAU OUTPUT
#<nib>#        non-human noise   '<nib>'         segment /<nib>/ (IPA [(.)]) with word number
#<usb>#        human noise       '<usb>'         segment /<usb>/ (IPA [(..)]) with word number
#<...>#        silence word      '<...>'         segment /<p:>/ with word number
#<p>#          silence word      '<p>'           segment /<p:>/ with word number
#...<nib>...#  non-human noise   '...<nib>...'   segment /...<nib>.../ (IPA [(.)]) with word number
#...<usb>...#  human noise       '...<usb>...'   segment /...<usb>.../ (IPA [(..)]) with word number
#...<p>...#    non-optional sil  '...<p>...'     segment /...<p:>.../ (IPA [(...)]) with word number
#...<p:>...#   optional sil      '...<p:>...'    segment /...<p:>.../ (IPA [(...)]) with word number or deleted
#              (word boundary)   -               segment /<p:>/ (IPA [(...)]) with word number -1 or deleted
<              (initial sil)     -               segment /<p:>/ (IPA [(...)]) with word number -1
>              (final sil)       -               segment /<p:>/ (IPA [(...)]) with word number -1

(the last three lines are not possible inputs, but are modelled
automatically!)

TAGS PASSING from KAN tier to MAUS OUTPUT

A '<...>' used as a word in the phonological input (see preceding
paragraph) can be used to pass 'tags' from the transcript to the
output of MAUS, because such 'words' will appear in the MAUS output
ORT/KAN levels. The drawback is that a small silence interval (30msec
minimum) must be modelled for this 'tag word' in the phonetic level.

ADAPT MAUS TO OTHER LANGUAGES

To adapt MAUS to another language, several parameter files and
programs must be adapted: the set of phonemic symbols used in the
input, the MAUS internal symbol set, the mapping functions between
them, the Hidden Markov Models used for the search, the mapping from
MAUS internal symbols to HMM, and the rule set.

If a new language set PARAM.<LANG> is defined, do the following:

- copy the standard German set dir PARAM to PARAM.<LANG>

- within the new set dir adapt the following files:

  KANINVENTAR   : define here the set of phonemes used in the
                  canonical input and MAUS output. KANINVENTAR must
                  be sorted by descending string length!

  GRAPHINVENTAR : this is usually just a copy of KANINVENTAR with all
                  symbols starting with a number replaced by a masked
                  symbol string, plus the extra symbol '#' used for
                  internal word boundary modelling (a '#' in
                  KANINVENTAR does not hurt, but will be ignored in
                  the KAN input). Typical examples are:

                    r\ -> r-
                    6  -> P6
                    9  -> P9
                    3  -> P3
                    2: -> P2:
                    etc.

- Store the acoustical models for the new language in MMF.mmf; this
  can be either new HMMs trained on a segmented speech corpus (if
  available) or a set of standard HMMs (e.g.
  the SUPERHMM set in subdir HMM).

- store the list of HMM names (~h "..." entries in MMF.mmf) in the
  file HMMINVENTAR (see example); note that the HMM names need not
  match the phoneme names in GRAPHINVENTAR.

- define a mapping of the phoneme names (1st column) to the HMM names
  (2nd column) in the file DICT; be sure to use the phoneme names as
  listed in GRAPHINVENTAR, not as in KANINVENTAR. For example the
  entry

    T s

  will cause MAUS to acoustically model an English voiceless 'th'
  (/T/) like a /s/.

- If you can provide a (non-statistical) pronunciation rule set for
  the new language, put it into the file <name>.nrul (see an example
  in regeln9.nrul). Then call maus with the option
  RULESET=.../<name>.nrul

  The synopsis for a phonological rule (one line of the RULESET) is:

    (leftcontext)-(pattern)-(rightcontext)>(leftcontext)-(replacement)-(rightcontext)

  All (...) can be comma-separated strings of SAM-PA symbols
  (including the utterance-initial symbol //!) or even the empty
  string (meaning all contexts), e.g.

    P2:-C-s,t>P2:-k-s,t  -> a /C/ in context 2: ... st can be
                            replaced by a /k/
    -a-#>-A-#            -> all word final /a/ can be uttered as /A/

  See more examples in PARAM/regeln9.nrul.

- If you can provide a statistical pronunciation rule set for the new
  language, put it into the file <name>.rul (see an example in
  rml-0.95.rul). Then call maus with the option
  RULESET=.../<name>.rul

  The synopsis for a statistical rule (one line of the RULESET) is:

    (leftcontext),(pattern),(rightcontext)>(leftcontext),(replacement),(rightcontext) ln(P(r|l,p,r)) 0.000000

  (leftcontext)/(rightcontext) must be single phoneme symbols that
  match on both sides of the rule (including the utterance-initial
  symbol //!); the pattern/replacement can be comma-separated strings
  of SAM-PA symbols (including the utterance-medial word boundary
  /#/, but NOT the utterance-initial/end symbols /<>/!) or the empty
  string; ln(P(r|l,p,r)) is the (natural) log conditional probability
  for a replacement r given the context and pattern l,p,r.
  The last column is always '0.000000', e.g.

    t,E,t>t,e:,t -0.916293 0.000000
      -> an /E/ is replaced by an /e:/ in the context t ... t with
         40% probability (exp(-0.916293) = 0.40).
    g,@,n,t>g,N,t -1.292769 0.000000
      -> reduces /@n/ to /N/ in context g ... t
    I,n,#>I,# -5.190177 0.000000
      -> deletes word-final /n/ after /I/

  See more examples in PARAM/rml-0.95.rul.

Remark regarding MAUS rule sets:

A context or pattern string within a MAUS rule as discussed above is
parsed in a somewhat sloppy (but robust) way: a sequence of
characters that should encode a single phonemic symbol (i.e. enclosed
by '-...-' or ',...,' respectively) such as ',abcd,' is first checked
as a whole: does the complete sequence 'abcd' represent a symbol in
GRAPHINVENTAR? If not, MAUS tries to parse the character sequence as
a sequence of valid phonemic symbols. E.g. if 'ab' and 'cd' are
included in GRAPHINVENTAR but 'abcd' is not, the sequence is
interpreted as ',ab,cd,'. Parsing is applied left-to-right and by
maximum local string length. For instance the above example will not
lead to ',a,bcd,' even if 'a' and 'bcd' are valid symbols, because
'ab' and 'cd' are also valid symbols. Only if the character sequence
cannot be parsed into a sequence of valid phonemic symbols will MAUS
issue a warning message such as:

  File: CrlkontWordVar.cc, Line 46
  error in rule 1 (#sa>), discarding

and ignore this rule for the further processing (rules are counted
starting with 0).

This peculiar way of parsing rules may lead to unexpected results if

- the order of symbols in GRAPHINVENTAR is not by descending string
  length

- the rules contain combinations of phonemic symbols that are valid
  but not intended.

SIGNAL PROPERTIES

This script is intended to work for mono NIST and WAV sound files
with 16 kHz sampling rate and 16 bit linear (FIXRATE), because the
HMM are trained on this type of data. MAUS will automatically
resample the signal / convert it to mono using sox; to prevent this
automatic conversion set the option 'allowresamp' to 'no'.
The script will complain if you try to use other sampling rates or
HMM trained with other sampling rates. Note that ALL kinds of
re-sampling degrade the signals!

If you use WAV signal files as input, the tool sox must be installed
on your computer. sox will then always produce a suitable input file
for maus regardless of what you give as an input file.

MAUS CACHE

Maus can check the cache $TEMP for existing *.htk files with the same
name and take these instead of performing the frontend processing
anew (this is done to save time on larger corpora). Use the option
CLEAN=0 (default is 1) to get this effect, but keep in mind that your
signals must not be altered in the meantime.

INTER-WORD SILENCE

The silence model '#' in the HMM set must be a tee-model. HVite will
always complain about the 'words' '#' or '&' that are tee-words. It
is safe to ignore these warnings.

EXAMPLES

Simply calling maus will issue a long and detailed usage message:

  % ./maus
  # usage: maus SIGNAL=signal.nis|wav BPF=signal.par [OUT=maustier.mau][OUTFORMAT=mau|TextGrid][CLEAN=1][PARAM=parameter-dir][CANONLY=no][allowresamp=no][WEIGHT=weight][INSPROB=insprob][STARTWORD=0][ENDWORD=999999]
  # usage: maus SIGNAL=signal.nis|wav KANSTR="a: b e: t s e:" [OUT=maustier.mau][OUTFORMAT=mau|TextGrid][CLEAN=1][PARAM=parameter-dir][CANONLY=no][allowresamp=no][WEIGHT=weight][INSPROB=insprob]
  # General remarks:
  ...
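
As recommended under CONSTRAINTS, stdout and stderr of any maus call
can be piped into a log file and scanned for the key words 'ERROR'
and 'WARNING'. The following is only a minimal sketch of such a
check; the function name check_maus_log and the log file name
maus.log are made up for illustration and are not part of the MAUS
package.

```shell
#!/bin/sh
# Hypothetical helper, not part of MAUS: count the lines of a log file
# that contain the key words ERROR and WARNING.
#
# A log could be written like this (paths taken from the examples):
#   ./maus v=1 SIGNAL=EXAMPLES/GERMAN/g001acn1_000_AAJ.nis \
#          BPF=EXAMPLES/GERMAN/g001acn1_000_AAJ.par > maus.log 2>&1
check_maus_log() {
    log=$1
    if [ ! -r "$log" ]; then
        echo "cannot read log file: $log" >&2
        return 2
    fi
    # grep -c prints the number of matching lines (0 if there are none);
    # it exits non-zero without a match, so that status is ignored.
    errors=$(grep -c 'ERROR' "$log" || true)
    warnings=$(grep -c 'WARNING' "$log" || true)
    echo "errors=$errors warnings=$warnings"
    # Succeed only if no line mentions ERROR.
    [ "$errors" -eq 0 ]
}
```

Calling 'check_maus_log maus.log' after a run prints both counts and
returns a non-zero status as soon as one 'ERROR' line is present, so
it can be used in a shell 'if' or in a loop over many log files.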

The following call will read the canonical pronunciation from a BPF
file and segment the signal in EXAMPLES/GERMAN/g001acn1_000_AAJ.nis
using classical MAUS into the file
EXAMPLES/GERMAN/g001acn1_000_AAJ.mau:

  % ./maus v=1 SIGNAL=EXAMPLES/GERMAN/g001acn1_000_AAJ.nis \
           BPF=EXAMPLES/GERMAN/g001acn1_000_AAJ.par

The following call will do the same but write the resulting MAU tier
into the file 'Result.mau':

  % ./maus v=1 OUT=Result.mau SIGNAL=EXAMPLES/GERMAN/g001acn1_000_AAJ.nis \
           BPF=EXAMPLES/GERMAN/g001acn1_000_AAJ.par

The following call will do the same but instead of a BPF tier it will
create a Praat compatible TextGrid file 'Result.TextGrid':

  % ./maus v=1 OUT=Result.TextGrid OUTFORMAT=TextGrid \
           SIGNAL=EXAMPLES/GERMAN/g001acn1_000_AAJ.nis \
           BPF=EXAMPLES/GERMAN/g001acn1_000_AAJ.par

The next call will write two additional tiers into the TextGrid
output with a word segmentation and a canonical transcript of the
words:

  % ./maus v=1 OUT=Result.TextGrid OUTFORMAT=TextGrid \
           SIGNAL=EXAMPLES/GERMAN/g046acn1_037_AFI.wav INSORTTEXTGRID=yes \
           BPF=EXAMPLES/GERMAN/g046acn1_037_AFI.par INSKANTEXTGRID=yes

The following call will do the same but instead of a TextGrid it will
create Emu compatible files 'Result.hlb' and 'Result.phonetic':

  % ./maus v=1 OUT=Result OUTFORMAT=emu \
           SIGNAL=EXAMPLES/GERMAN/g001acn1_000_AAJ.nis \
           BPF=EXAMPLES/GERMAN/g001acn1_000_AAJ.par

If you want the output files to have the same name and location as
the signal file, simply omit the option OUT=...

The next call will use a TRN tier in the input BPF to restrict the
search to a segment given there; by doing this, long initial or final
silence intervals are ignored by maus; this can also be used to
selectively segment only parts of a longer recording; note however
that the timing of the results is always based on the total signal.

  % ./maus v=1 OUT=Result.TextGrid OUTFORMAT=TextGrid \
           SIGNAL=EXAMPLES/GERMAN/g046acn1_037_AFI.wav USETRN=yes \
           BPF=EXAMPLES/GERMAN/g046acn1_037_AFI.par

The following call will do the same but the canonical string that
MAUS uses will start with the 5th word and end with the 9th word of
the BPF file (counting starts with 0):

  % ./maus v=1 OUT=Result.TextGrid OUTFORMAT=TextGrid \
           SIGNAL=EXAMPLES/GERMAN/g001acn1_000_AAJ.nis \
           BPF=EXAMPLES/GERMAN/g001acn1_000_AAJ.par STARTWORD=4 ENDWORD=8

The following call will read the canonical pronunciation from the
command line instead of from a KAN BPF tier; please note the usage of
blanks and quotes!

  % ./maus v=1 OUT=Result.TextGrid OUTFORMAT=TextGrid \
           SIGNAL=EXAMPLES/GERMAN/g001acn2_075_AAK.nis \
           KANSTR="f i: r Q U n t t s v a n t s I C s t 6 # Q aU f # f Y n f Q U n t t s v a n t s I C s t @ n # j u: n i: # d i: n s t a: k # Q aU f # m I t v O x"

Note that the symbol '#' may be used to indicate possible pauses
between words. This might improve the quality of your MAUS output.
Optional pauses are inserted automatically when reading from a BPF
file instead of from the command line (see option BPF). Initial and
final pauses are also inserted automatically.

The next call will use a WAV sound file as input instead of SPHERE
NIST. Maus will automatically recognize this but it will only work if
the WAV sound file contains a mono signal with 16 kHz sampling rate
and 16 bit resolution:

  % ./maus v=1 OUT=Result.TextGrid OUTFORMAT=TextGrid \
           SIGNAL=EXAMPLES/GERMAN/g046acn1_037_AFI.wav \
           BPF=EXAMPLES/GERMAN/g046acn1_037_AFI.par

The next call uses a WAV input with a different sampling rate; maus
will detect this and re-sample the signal; note that a mau output
file will still be based on the original sampling rate.

  % ./maus v=1 OUT=Result.TextGrid OUTFORMAT=TextGrid \
           SIGNAL=EXAMPLES/GERMAN/g046acn1_037_AFI.wav allowresamp=yes \
           BPF=EXAMPLES/GERMAN/g046acn1_037_AFI.par

The next call will not clean the $TEMP area after processing; the
preprocessed signal file (*.htk) and all intermediate files up to the
result of the Viterbi alignment (*.rec) will remain. In a later
identical call the *.htk file will be recycled, thus saving
processing time.

  % ./maus v=1 OUT=Result.TextGrid OUTFORMAT=TextGrid CLEAN=0 \
           SIGNAL=EXAMPLES/GERMAN/g046acn1_037_AFI.wav \
           BPF=EXAMPLES/GERMAN/g046acn1_037_AFI.par

The next call will not use the MAUS method but do a forced alignment
to the given canonical pronunciation. Note that when using
CANONLY=yes the maus script will not require the C++ program
word_var-2.0, which might be useful on platforms where this program
does not compile at installation:

  % ./maus v=1 OUT=Result.TextGrid OUTFORMAT=TextGrid CLEAN=1 \
           SIGNAL=EXAMPLES/GERMAN/g046acn1_037_AFI.wav \
           BPF=EXAMPLES/GERMAN/g046acn1_037_AFI.par CANONLY=yes

Finally, the last example uses a different parameter set than
classical MAUS; be very careful when designing such a parameter set:

  % ./maus v=1 OUT=Result.TextGrid OUTFORMAT=TextGrid CLEAN=1 \
           SIGNAL=EXAMPLES/GERMAN/g046acn1_037_AFI.wav \
           BPF=EXAMPLES/GERMAN/g046acn1_037_AFI.par PARAM=MyParamDir

PARAMETERS

See the file PARAM/README for details about the parameter files that
maus needs and about their somewhat complicated relationship to each
other.

In this package there are the following parameter sets:

  PARAM        : classical MAUS with statistically learned rule set
  PARAM.<LANG> : MAUS adapted for language <LANG>
  PARAM.MAN    : phonological MAUS with a hand-crafted phonological
                 rule set without statistics
  PARAM.sampa  : language independent parameter set (forced alignment
                 only!)

Default is 'PARAM'; to use other parameter sets use the option
'PARAM' to define the directory where the parameter files reside.
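
When designing a parameter set of your own, remember that KANINVENTAR
must be sorted by descending string length (see ADAPT MAUS TO OTHER
LANGUAGES). The following is a minimal sketch of how such an
inventory could be sorted and checked. It assumes that the inventory
file holds one symbol per line; the function names sort_inventory and
inventory_is_sorted are made up for illustration and are not part of
the MAUS package.

```shell
#!/bin/sh
# Hypothetical helpers, not part of MAUS.
# Assumption: the inventory file contains one symbol per line.

# Print the symbols of an inventory file, longest first. Symbols of
# equal length keep their original order (sort -s is a stable sort;
# it is supported by GNU and BSD sort but is not required by POSIX).
sort_inventory() {
    awk '{ print length($0) "\t" $0 }' "$1" | sort -s -k1,1nr | cut -f2-
}

# Return 0 if no symbol is longer than the symbol on the line before
# it, i.e. the file is already in descending length order.
inventory_is_sorted() {
    awk 'NR > 1 && length($0) > prev { bad = 1 }
         { prev = length($0) }
         END { exit bad + 0 }' "$1"
}
```

A possible use is 'inventory_is_sorted MyParamDir/KANINVENTAR ||
sort_inventory MyParamDir/KANINVENTAR > KANINVENTAR.sorted'; the
result should be inspected by hand before it replaces the original
file.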

All parameter sets use a set of context-free phoneme HMMs stored in
the file MMF.mmf. It always contains a special model for articulated
noise (<usb>), for instance for non-understandable words, coughs,
throat clearing, laughs or hesitations, for non-articulated noise
(<nib>), for optional silence (#) and for non-optional silence
(<p:>). The models are plain left-to-right HMMs with 3-5 states and 5
Gaussian mixtures per state (diagonal covariance matrices). They were
trained on the manually segmented and labelled part of the Verbmobil
data set.

A 3-state HMM without leap transitions implies a minimum duration of
3 x 10msec = 30msec of the corresponding phonetic segment. This may
lead to 'ceiling effects' in duration analysis when using the
forced-alignment mode of MAUS, since the Viterbi search is then
forced to model a segment of at least 30msec. We do not apply shorter
minimum durations in our HMM sets (e.g. by introducing leap
transitions) because the decision whether a phonemic segment is there
or not should be modelled in the pronunciation models, not in the
acoustic model. However, for certain investigations you might
consider replacing the standard HMM set of a language with a
customized HMM set with shorter minimum durations. See also:

  Katarina Bartkova, Denis Jouvet. Impact of frame rate on automatic
  speech-text alignment for corpus-based phonetic studies. ICPhS'2015
  - 18th International Congress of Phonetic Sciences, Aug 2015,
  Glasgow, United Kingdom. Proceedings ICPhS 2015.

for a discussion of minimum duration modelling in automatic phonetic
segmentation systems.

HISTORY

See file HISTORY in this dir.

EXIT CODES

  0 : everything seems ok (but we never know, do we?)
  1 : serious error
  2 : probably just a signal file with the wrong coding
  3 : the BPF contains no KAN tier - doin' nothin'

POSSIBLE PROBLEMS

Check out the section 'KNOWN BUGS' in file HISTORY.

PROCESSING A CORPUS WITH MAUS

Use the wrapper script maus.corpus.

This script reads a list of signal files from a file SLIST, searches
for the corresponding BAS Partitur Format (BPF) file of each signal
file and performs a MAUS segmentation. Please refer to the remarks in
the header of maus.corpus for detailed usage.

Very useful is the option OUTDIR='#APPEND#': resulting MAU tiers are
automatically inserted into the input BPF files.

Option CREATETRN=yes or CREATETRN=force will call the speech detector
wav2trn to create a TRN tier in the BPF input file (force will
overwrite existing TRN tiers!). The maus script is then called with
option USETRN=yes and segments only speech within the detected
boundaries.

You may use the simple script txt2par in this package to create BPF
files from simple two-column TXT files:

- create TXT files with the same name as the sound files, with one
  word per line, orthography in the 1st column and transcript
  (SAM-PA) in the 2nd column.

- call txt2par in the dir (creates BPF files in the same dir)

- make list

    ls *.wav > SLIST.txt

- call maus.corpus

    maus.corpus SLIST=SLIST.txt BPFDIR=...

USING ITERATIVE MAUS

Use the wrapper script maus.iter (also see HISTORY.ITERATIVE for
details).

Iterative MAUS denotes a variant of maus.corpus where the acoustical
HMM of maus are iteratively adapted to the MAUS segmentation of the
target material. You will need at least 20 min of target material,
preferably of one single speaker or of a speaker group with common
features (e.g. a dialect). See the remarks in the header of maus.iter
for usage. To build a seed model set for maus.iter you may use the
tools in subdir HMM (see the README there).

USING THE VISUALIZING TOOL GRAPHVIS

graphvis is a Motif binary that should plot a lattice file *.slf on
screen. The lattice file contains the pronunciation graph used by
MAUS: the nodes contain phonemic symbols while the arcs contain
probabilities.

Usage:

  graphvis if=file.slf iv=inventar

The lattice file file.slf can be obtained by running maus with option
CLEAN=0 and then looking for the last *.slf file in the cache
MAUSTEMP. As inventar use the file PARAM/KANINVENTAR.
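
Looking for the last *.slf file in the cache can be scripted. The
following is a minimal sketch which assumes that 'last' means 'most
recently modified' and that the cache directory is passed in as an
argument; the function name latest_slf and the use of a shell
variable MAUSTEMP to hold the cache path are assumptions made for
illustration, not something the package defines in this README.

```shell
#!/bin/sh
# Hypothetical helper, not part of MAUS: print the most recently
# modified *.slf file in the given cache directory, or nothing if the
# directory contains no *.slf file.
# Assumption: file names in the cache contain no newline characters.
latest_slf() {
    cache_dir=$1
    # ls -t lists the newest file first; head keeps only that one.
    ls -t "$cache_dir"/*.slf 2>/dev/null | head -n 1
}

# Possible use after a maus run with CLEAN=0 (MAUSTEMP is assumed to
# hold the path of the cache directory):
#   graphvis if="$(latest_slf "$MAUSTEMP")" iv=PARAM/KANINVENTAR
```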