BAVARIAN ARCHIVE FOR SPEECH SIGNALS
University of Munich, Institute of Phonetics
Schellingstr. 3/II, 80799 Munich, Germany
bas@phonetik.uni-muenchen.de


Information on SpeechDatII Data Sets
====================================

Version 1.0
K. Proell
13.09.2004

This document contains information regarding the usage of the German
SpeechDat(II) speech corpus for ASR or other experiments where a defined
distinction between training, development and test set is necessary.

-----------------------------------------------------------------------

Division and basic numbers
--------------------------

The official SpeechDat(II) training and test sessions of the fixed
network database are used; the official test sessions are split further
into a development set and a test set. The subsets of the mobile data
are defined here using the official SpeechDat algorithm.

Basic numbers of the basic subsets:

SET          WORDS    TURNS    LEX      SPEAK
---------------------------------------------
TRAIN_FIX    843384   150867   23246    3500
DEV_FIX       37886     6421     802*    250
DEV_MOBIL     32168     7027     947*    250
TEST_FIX      34086     6403     807**   250
TEST_MOBIL    32070     7085     921**   250
---------------------------------------------
*  here combined DEV_FIXMOBIL lexicon with 1179 words
** here combined TEST_FIXMOBIL lexicon with 1179 words

The training set includes all utterance types:

TYPE                               CORPUS CODE
----------------------------------------------
isolated digit items               I
digit/number strings               B,C
natural number(s)                  N
money amounts                      M
yes/no questions                   Q
dates                              D
times                              T
application keywords/keyphrases    A
word spotting phrase               E
directory assistance names         O
spellings                          L
phonetically rich words            W
phonetically rich sentences        S
partner specific material*         Y
----------------------------------------------
* speaker gender question, birthdate request, speaker region question,
  today's date

Utterance types O, W and S are excluded from the development and test sets.
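
The exclusion rule above can be sketched in a few lines of Python. This is
only an illustration, not the script used to build the sets: it assumes
each item is identified by a short item code whose first letter is the
corpus code from the table (for example "O2" for a directory assistance
name). The names EXCLUDED_CODES, keep_for_dev_test and filter_items are
hypothetical.

```python
# Corpus codes excluded from the development and test sets (O, W, S),
# as stated in the text above.
EXCLUDED_CODES = {"O", "W", "S"}


def keep_for_dev_test(item_code: str) -> bool:
    """Return True if an item may appear in a development or test set.

    Assumption: item_code starts with the corpus code letter,
    e.g. "A1" (application keyword) or "O3" (directory assistance name).
    """
    return item_code[:1].upper() not in EXCLUDED_CODES


def filter_items(item_codes):
    """Keep only the items allowed in the development and test sets."""
    return [code for code in item_codes if keep_for_dev_test(code)]


if __name__ == "__main__":
    example = ["I1", "B1", "O2", "W3", "S9", "Q1"]
    print(filter_items(example))  # ['I1', 'B1', 'Q1']
```

The training set needs no such filter, since it includes all utterance
types.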

Example ASR Results
-------------------

Using the subsets defined above, we currently (Sept 2004) obtain the
following word accuracies (WA) with an HTK recognizer and a bigram
language model trained solely on the training corpus:

Trained on TRAIN_FIX; tested on the DEV_FIX and DEV_MOBIL sets with
lexicon DEV_FIXMOBIL.lex (total: 1179 lexical entries):

DEV_FIX:      WA = 68.61%
DEV_MOBIL:    WA = 48.14%

Test "HOME and PUBLIC environments" on mobile network data
(lexicon DEV_FIXMOBIL.lex):
The calls from the mobile development set are divided into two parts,
depending on the environment of the call
("HOME": home; "PUBLIC": public, street, vehicle).

DEV_MOBIL_HOME   (15462 words):    WA = 58.45%
DEV_MOBIL_PUBLIC (16705 words):    WA = 38.59%

Test "Noiseless" on fixed and mobile network data
(lexicon DEV_FIXMOBIL.lex):
Utterances with mispronunciations, unintelligible speech or truncations,
stationary noise [sta] or intermittent noise [int] are excluded from
the development sets.

DEV_FIX_noiseless   (15462 words):    WA = 70.32%
DEV_MOBIL_noiseless (13163 words):    WA = 55.79%

Some more details for those who are interested:

12 standard MFCC + energy + velocity + acceleration (39 features)
Diagonal covariance matrices
3-5 states per phoneme
40 phoneme classes (extended German SAMPA) + garbage + voice garbage
+ silence (43 models)
Models initialized using the flat start procedure
Re-estimation and mixture splitting after 6 iterations on the total
TRAIN set; testing after every two iterations on DEV_FIX (61 iterations)
Optimal performance with 256 mixtures per state
Language model weight fixed to 6.5 (option -s); word end penalty -15
(option -p); beam search width 100.0
No testing on TEST until now.
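
As a plausibility check on the figures above, the HOME and PUBLIC results
can be pooled into a single word-count-weighted accuracy and compared
with the DEV_MOBIL result. The Python sketch below does only that; it
assumes WA is a per-word rate, so that subset accuracies combine in
proportion to their word counts, and it works from the rounded
percentages given in this document. The function name
pooled_word_accuracy is hypothetical.

```python
def pooled_word_accuracy(subsets):
    """Word-count-weighted average of per-subset word accuracies.

    subsets: iterable of (number_of_words, word_accuracy_percent) pairs.
    Assumption: WA is a per-word rate, so subsets pool by word count.
    """
    subsets = list(subsets)
    total_words = sum(words for words, _ in subsets)
    if total_words == 0:
        raise ValueError("no words to pool over")
    return sum(words * wa for words, wa in subsets) / total_words


if __name__ == "__main__":
    home = (15462, 58.45)     # DEV_MOBIL_HOME
    public = (16705, 38.59)   # DEV_MOBIL_PUBLIC
    print(f"{pooled_word_accuracy([home, public]):.2f}")  # 48.14
```

The pooled value agrees with the 48.14% reported for DEV_MOBIL to two
decimal places, which suggests that the two environment subsets together
cover the mobile development set.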