MAUS Parameter Set

See also the file USAGE in the top directory of the package.

This is the German language parameter set of MAUS (default).
Acoustic models were trained on the
manually segmented part of the Kiel Corpus of Read Speech.
The (default) statistical pronunciation model rml-0.95.rul was trained on the
same corpus by Andreas Kipp [1]; the 'phonological' pronunciation model
(= rules without posteriors) regeln9.nrul is based on B. Wesenick's meta-analysis
of a large set of publications on German pronunciation variation [2]
(5545 re-write rules).

In general we found that the statistical model outperforms the phonological
model. The statistical model contains 90 rules (out of 1213) that represent
pronunciation variants that were always observed in the training data, and therefore
carry a posterior probability of almost 1, e.g.

g,@,n,#,?,i:>g,N,#,i: -0.000010 0.000000

'The word-final syllable /g@n/ is reduced to /gN/ if the following
word starts with /?i:/. At the same time the glottal stop /?/ is deleted.'

All of these 90 rules leave a tiny (discounted) probability (p=0.00001) that
the original phonological form is retained. If you do not want these
(corpus-dependent) rules, you may delete them from a copy of rml-0.95.rul and
use this altered rule set by means of the MAUS option RULSET=<copyOfRml>
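
The deletion described above can be sketched in Python. This is only an
illustration, not part of MAUS: the function names are hypothetical, and it
ASSUMES that the first numeric field of a rule line is a log probability, so
that values very close to 0 mark the 'always observed' rules. The meaning of
the second numeric field is not documented here and is left untouched. Check
both assumptions against your copy of the rule set before using it.

```python
def parse_rule(line):
    """Split one rule line into (lhs, rhs, first_value, second_value).

    Assumes the layout seen in the example above:
    '<lhs symbols>><rhs symbols> <number> <number>', symbols comma-separated.
    """
    pattern, first, second = line.split()
    # Assumes '>' occurs only as the separator between the two sides.
    lhs, _, rhs = pattern.partition(">")
    return lhs.split(","), rhs.split(","), float(first), float(second)


def drop_near_certain_rules(lines, threshold=-0.0001):
    """Keep only rules whose first numeric field lies below the threshold.

    ASSUMPTION: the first field is a log probability, so a value such as
    -0.000010 corresponds to a posterior of almost 1.
    """
    kept = []
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        _, _, first, _ = parse_rule(line)
        if first < threshold:
            kept.append(line)
    return kept


rules = [
    "g,@,n,#,?,i:>g,N,#,i: -0.000010 0.000000",
    "#,d,a>#,s,a -1.203973 0.000000",  # numeric values made up for this sketch
]
filtered = drop_near_certain_rules(rules)
```

Writing 'filtered' to a new file and passing that file via RULSET=<copyOfRml>
would then apply the reduced rule set.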


KANINVENTAR : list of all phonemic symbols (denoted as 'symbols'
	henceforth) that are recognized in the input of MAUS.
	Symbols must be ordered by descending string length (/aI/
	before /a/ etc.). For backward compatibility the symbol /Q/
	is allowed in the input to denote the glottal stop (this might
	occur in older German SAM-PA transcripts).
	The set contains additional symbols beyond the official set,
	e.g. English and French symbols.
	The script kan2mlf maps KANINVENTAR symbols to the symbol set in
	GRAPHINVENTAR, and the script rec2mau maps them back.
KANINVENTAR.inv : extended version of KANINVENTAR used for reference:
        MAUS:           SAMPA symbols as supported by MAUS (KAN tier input)
        SAMPA:          Original SAMPA symbol
        IPA:            IPA symbol (if applicable, coded UTF-8)
        PHONETICS:      phonetic description (if applicable)
        EXAMPLES:       orthographic examples (if applicable)
        ISO639-3:       ISO 639-3 code of the SAMPA set
                        'xxx' = used in multiple languages
GRAPHINVENTAR : list of all symbols as used in the statistical
        rule set and HMM set; numerical symbols must be preceded
	by a 'P' so that the rule generator and the visualization program
	graphvis work on the graph files; also, HTK does not tolerate
	HMM names with a leading numeral. GRAPHINVENTAR may be a superset
	of the set of symbols that actually occur in the rule set. The use
	of a symbol not modelled in the rule set might prevent the
	application of a rule.
	Such symbols are simply passed through by the program
	word_var-2.0, which builds the graph.
	For example:
	GRAPHINVENTAR contains the symbol /a~/ but the rule set knows only
	about /a/. Then the rule #,d,a>#,s,a will not be applied to the input
	..,#,d,a~,... because the context does not match.
HMMINVENTAR : list of all symbols for which MAUS has an HMM available.
        Again, numerical symbols must be preceded by a 'P'. For each symbol
	listed here there must be a corresponding HMM definition in the file
	MMF.mmf.
	Also, the right-hand side of the table DICT must not contain any
	symbol that is not listed here.
DICT :  HTK lexicon; contains as 'words' the symbols of GRAPHINVENTAR and
        as 'pronunciation' the symbols of HMMINVENTAR.
MMF.mmf : points to the acoustical HMM models (must match HMMINVENTAR)
regeln*
rml*    : rule sets
rml-0.95.rul : points to the rule set in use
PRECONFIGNIST : HTK config file for front-end processing (HCopy)
HVITECONF : HTK config file for Viterbi processing (HVite)
LAT.bigram : phonotactic bigram for MINNI service
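
Two of the conventions above can be illustrated with a short Python sketch:
the length ordering of KANINVENTAR and the 'P' prefix for numerical symbols.
This is not the code of kan2mlf; the function names are hypothetical, and it
ASSUMES that the ordering exists so that an unspaced symbol string can be
split by greedy longest match (so /aI/ is found before /a/).

```python
def sort_inventory(symbols):
    """Order symbols by descending string length (/aI/ before /a/)."""
    return sorted(symbols, key=len, reverse=True)


def tokenize(transcript, inventory):
    """Greedy longest-match split of an unspaced symbol string.

    ASSUMPTION: this mirrors why the inventory must be length-ordered;
    the actual MAUS scripts may work differently.
    """
    ordered = sort_inventory(inventory)
    tokens, pos = [], 0
    while pos < len(transcript):
        for symbol in ordered:
            if transcript.startswith(symbol, pos):
                tokens.append(symbol)
                pos += len(symbol)
                break
        else:
            # No inventory symbol matches at this position.
            raise ValueError(f"unknown symbol at position {pos}: {transcript[pos:]!r}")
    return tokens


def to_graph_symbol(symbol):
    """Prefix a symbol that starts with a digit with 'P', e.g. /6/ -> /P6/."""
    return "P" + symbol if symbol[:1].isdigit() else symbol


inventory = ["a", "aI", "h", "s", "I", "6"]   # tiny made-up inventory
tokens = tokenize("haIs", inventory)
graph_symbols = [to_graph_symbol(s) for s in ["v", "a", "s", "6"]]
```

With an inventory ordered the other way round, "haIs" would be split into
/h/ /a/ /I/ /s/ instead of /h/ /aI/ /s/, which is why the ordering matters
under this assumption.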

Virtual Symbols

If you want MAUS to segment symbols that are not contained in the 
acoustical model (that is they are not contained in MMF.mmf and not listed in 
HMMINVENTAR) but are very similar to an existing symbol, you may define them 
as 'virtual symbols' for MAUS. 
For example:
The HMM set does not contain nasalized vowels, but you want to process /E~/.
You have to
- insert 'E~' into KANINVENTAR and GRAPHINVENTAR
- add the following line to DICT:
  E~ E
This causes MAUS to pass through all /E~/ and use the acoustical model /E/ 
for it. Note that the application of rules where /E/ is in the context of the 
rule will not take place here!
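
After adding a virtual symbol it can be useful to check that the inventories
and DICT still agree. The following Python sketch is not part of MAUS; the
function names are hypothetical, and it ASSUMES that each DICT line has the
form '<GRAPHINVENTAR symbol> <HMM symbol> ...' as in the 'E~ E' example and
that every GRAPHINVENTAR symbol needs a DICT entry.

```python
def parse_dict(lines):
    """Map each DICT 'word' (first field) to its list of HMM symbols."""
    entries = {}
    for line in lines:
        fields = line.split()
        if fields:
            entries[fields[0]] = fields[1:]
    return entries


def check_parameter_set(graph_inventory, hmm_inventory, dict_lines):
    """Return a list of problems found; the list is empty if none are found."""
    entries = parse_dict(dict_lines)
    hmm = set(hmm_inventory)
    problems = []
    # Every GRAPHINVENTAR symbol should appear as a 'word' in DICT.
    for symbol in graph_inventory:
        if symbol not in entries:
            problems.append(f"{symbol}: in GRAPHINVENTAR but has no DICT entry")
    # The right-hand side of DICT may only use HMMINVENTAR symbols.
    for word, pronunciation in entries.items():
        for symbol in pronunciation:
            if symbol not in hmm:
                problems.append(f"{word}: DICT maps to {symbol}, which is not in HMMINVENTAR")
    return problems


graph = ["E", "E~", "a"]          # made-up miniature inventories
hmm = ["E", "a"]
before = check_parameter_set(graph, hmm, ["E E", "a a"])
after = check_parameter_set(graph, hmm, ["E E", "a a", "E~ E"])
```

Here 'before' reports the missing DICT line for /E~/, and 'after' is empty
once 'E~ E' has been added.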

History
 
 11.03.03 : GRAPHINV = KANINV, except that numerical symbols are prefixed
            with a 'P', e.g. /6/ -> /P6/ (57 symbols).
          HMMINV is a subset of GRAPHINV (48 symbols).
          The rule set rml-0.95.rul contains a subset (rml-0.95.set, 43 symbols)
          of GRAPHINV.
 14.08.03 : /y/ added as a permitted symbol to KANINV and GRAPHINV.
            DICT maps /y/ to /y:/.
          Reason: new conventions for the canonical pronunciation also allow
          /y/; it therefore occurs in the lexicon.
 30.01.04 : /a:~/ added as a permitted symbol to KANINV and GRAPHINV.
            DICT maps /a:~/ to /a:/.
          Reason: new conventions for the canonical pronunciation also allow
          /a:~/; it therefore occurs in the lexicon.
 08.04.04 : parameter set files KANINVENTAR, GRAPHINVENTAR and DICT
            extended by 'foreign' phonemes that might appear in German when
            foreign words are uttered. These phonemes are mapped to their
            nearest German symbols for HMM modelling but passed as is to the
            segmentation output. Therefore a /T/ in the input will be
            modelled internally by /s/ but shows up in the output as label /T/ again.
          Note that the use of symbols that are not contained in the
          statistical rule set will prevent the application of certain rules
          if the symbol appears in the context of the rule.

