Here we briefly describe the requirements for an annotated speech corpus 
to be used in MAUS training (e.g. for a new language or dialect):

MAUS uses two statistical models that are language dependent: the
acoustic model (AM) in the form of phonetic HMMs, and the pronunciation
model (PM) in the form of statistically weighted rewrite rules. Depending on
the amount of annotated speech you have for the new language, you can
train the AM, or the PM, or both. As a general rule speech data for MAUS
training should be annotated in a known format (e.g. praat *.TextGrid,
ELAN *.eaf, BPF *.par or Transcriber), and in a known phonetic alphabet
(e.g. IPA or SAMPA). If the corpus consists of long recordings (e.g.
interviews or dialogues), the transcription should contain a 'chunk
segmentation', i.e. the beginning and end of each chunk of transcribed
speech is marked on the time line.
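As an illustration, a chunk segmentation can be thought of as a list of
(begin, end, transcript) triples on a common time line. The following minimal
Python sketch (the data and function name are hypothetical, not a MAUS
format) checks that such chunks are well-formed and non-overlapping:

```python
# Minimal sketch, not a MAUS format: a chunk segmentation as a list of
# (begin_sec, end_sec, transcript) triples on a common time line.
chunks = [
    (0.00, 2.35, "hello world"),
    (3.10, 5.80, "how are you"),
]

def chunks_consistent(chunks):
    """True if all chunks have positive duration and do not overlap."""
    if any(begin >= end for begin, end, _ in chunks):
        return False
    ordered = sorted(chunks)
    return all(e1 <= b2 for (_, e1, _), (b2, _, _) in zip(ordered, ordered[1:]))

print(chunks_consistent(chunks))  # True
```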

To train the AM you'll need manually segmented and labelled (in some
phonetic alphabet) speech from approx. 100 speakers of both genders,
preferably spontaneous speech (as in a map task recording). Second best
is read speech, but each speaker should then have read different
sentences, so that the covered vocabulary is at least 1000 words. Third
best is a speech corpus without phonetic segmentation but with a phonetic
transcription per recording. Fourth best is a speech corpus with just the
orthographic transcript per recording (or chunk within a recording).
If none of these options is available, as a last resort you can map
the phoneme symbols of the new language to existing phoneme AMs of other
languages, e.g. if your new language requires a long open /a:/ you may use
that of the German AM.
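Such a mapping amounts to a simple lookup table from new-language phoneme
symbols to (language, phoneme) pairs of already trained AMs. The sketch
below is purely illustrative: the language codes, table entries and function
name are assumptions, not the actual MAUS configuration format.

```python
# Hypothetical lookup table: SAMPA symbols of a new language mapped to
# already trained AMs of other MAUS languages (codes are illustrative).
PHONE_TO_FOREIGN_AM = {
    "a:": ("deu", "a:"),   # long open /a:/ borrowed from the German AM
    "E":  ("deu", "E"),
    "T":  ("eng", "T"),    # dental fricative borrowed from an English AM
}

def resolve_am(phone):
    """Return the (language, phone) pair of the AM to use for a phone;
    phones not in the table fall back to the new language itself."""
    return PHONE_TO_FOREIGN_AM.get(phone, ("new", phone))

print(resolve_am("a:"))   # ('deu', 'a:')
```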

To train the PM you'll need basically the same data as for the AM
training, but additionally each segmented/transcribed phone must be
linked hierarchically to a word token. For instance, if a recording
contains the words 'hello world' transcribed as the phone segments /h E l O v
2: 6 l d/, we need a link from the symbols /h E l O/ to the word token
'hello' and from /v 2: 6 l d/ to the token 'world'. Such a hierarchical
linking of phones to words can be encoded either directly in a format that
allows hierarchical annotation (e.g. Emu, BPF, annotation graphs) or
indirectly by using two time-synchronized annotation layers, where the
begin of a word segment exactly matches the begin of the first phone
within that word, and the end of the word segment exactly matches the end of
the last phone segment of that word (e.g. praat TextGrid,
Transcriber).
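The indirect encoding via two time-synchronized layers can be checked
mechanically. The following Python sketch (the tier representation, the
invented time stamps and the tolerance value are assumptions, not a TextGrid
parser) verifies, for the 'hello world' example, that each word segment's
boundaries coincide with the begin of its first phone and the end of its
last phone:

```python
# Minimal sketch: annotation tiers as lists of (begin_sec, end_sec, label)
# triples. Labels follow the 'hello world' example; times are invented.
word_tier = [(0.00, 0.42, "hello"), (0.42, 1.05, "world")]
phone_tier = [(0.00, 0.10, "h"), (0.10, 0.20, "E"), (0.20, 0.30, "l"),
              (0.30, 0.42, "O"), (0.42, 0.55, "v"), (0.55, 0.75, "2:"),
              (0.75, 0.85, "6"), (0.85, 0.95, "l"), (0.95, 1.05, "d")]

def tiers_aligned(words, phones, tol=0.001):
    """True if every word's begin/end matches the begin of its first
    phone and the end of its last phone (within a small tolerance)."""
    for w_begin, w_end, _ in words:
        inside = [p for p in phones
                  if p[0] >= w_begin - tol and p[1] <= w_end + tol]
        if not inside:
            return False
        if abs(inside[0][0] - w_begin) > tol or abs(inside[-1][1] - w_end) > tol:
            return False
    return True

print(tiers_aligned(word_tier, phone_tier))  # True
```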

The training and setup of a new language in MAUS is not a trivial
routine task. We therefore offer to do the MAUS extension for you if you
can provide us with the necessary speech data (as outlined above). Please
do not hesitate to contact us in case you need help
(bas@bas.uni-muenchen.de).

