
                     _/_/_/_/         _/_/         _/_/_/_/
                    _/      _/       _/ _/        _/      _/
                   _/      _/       _/  _/       _/
                  _/      _/       _/   _/       _/
                 _/_/_/_/         _/_/_/_/        _/_/_/
                _/      _/       _/     _/             _/
               _/      _/       _/      _/             _/
              _/      _/       _/       _/    _/      _/
             _/_/_/_/         _/        _/     _/_/_/_/


                   BAVARIAN ARCHIVE FOR SPEECH SIGNALS

               University of Munich, Institute of Phonetics
               Schellingstr. 3/II, 80799 Munich, Germany
                      bas@phonetik.uni-muenchen.de




Information on the VM Data Sets
===============================

Version 	2.3

F. Schiel 28.10.2003 / 02.03.2004

This document contains information on using the German VM speech
corpus for ASR or other experiments that require a defined
distinction between training, development and test sets.
The subsets defined here are not of any official nature; they are
merely given as guidance for future experiments and to allow
authors to refer to defined subsets of the German VM corpus.

The sets given here were newly defined in Nov 2003. Please take note
of this if you have already worked with an earlier set definition.
The older set definition can still be downloaded from
ftp://ftp.bas.uni-muenchen.de/pub/BAS/VM/SETS.20031112


-----------------------------------------------------------------------

Division and basic numbers
--------------------------

The VM corpus is divided into VM1 (recordings before 1997) and VM2
(recordings after 1996). The two sets differ in recording conditions
and tasks (see the general documentation of the VM corpora).

The training (TRAIN), development (DEV) and test (TEST) sets currently
used in our experiments on the VM corpus are a compromise between the
following constraints:
- each speaker may appear in only one set (hard constraint)
- for each speaker there must be at least one complete dialogue
  sequence of turns (to allow speaker adaptation algorithms to be
  applied; hard constraint)
- speakers should be distributed equally across sexes in all sets (soft constraint)
- recordings should be distributed equally across recording sites in all sets
  (to cover possible accent preferences at a site; soft constraint)
- the number of words in the DEV and TEST sets should be around 14000 (to
  allow significant differences of 0.5% in the range of 95% word accuracy, p=0.005)
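
The figure of about 14000 words in the last constraint can be made
plausible with a short calculation. The following Python sketch is one
possible reading, not necessarily the computation used for this
document: it assumes that each word is an independent trial, uses the
normal approximation to the binomial distribution, and interprets
p=0.005 as a one-sided error probability. The function name
accuracy_margin is chosen here for illustration only.

```python
import math
from statistics import NormalDist


def accuracy_margin(n_words: int, accuracy: float, p_one_sided: float) -> float:
    """Half-width of the normal-approximation interval around a word accuracy.

    Assumption: every word is an independent Bernoulli trial, so the
    standard error of the accuracy is sqrt(a * (1 - a) / n).
    """
    z = NormalDist().inv_cdf(1.0 - p_one_sided)  # one-sided critical value
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_words)
    return z * std_err


# Numbers taken from the constraint above: 14000 words, 95% accuracy, p=0.005.
margin = accuracy_margin(14000, 0.95, 0.005)
print(f"margin: {100 * margin:.2f}% absolute")
```

Under these assumptions the margin comes out slightly below 0.5%
absolute, which is in line with the stated constraint. Since word
errors within a turn are not really independent, the true margin is
likely to be somewhat larger.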
  
Basic numbers of the subsets of the total corpus:

SET     WORDS   TURNS   DIALOGS  LEX    SPEAK  TAKEN FROM VOLUMES
---------------------------------------------------------------------
DEV     26989   1222    125      2218   48     14.1 15.1 20.1 21.1
                                               22.1 24.1 29.1 30.1
TEST    24470   1223    126      1946   46     14.1 15.1 20.1 21.1
                                               22.1 24.1 29.1 39.1 48.1
TRAIN   438718  24435   962      9045   748    1.1 2.1 3.1 4.1 5.1 7.1
                                               12.1 14.1 15.1 20.1 21.1
                                               22.1 24.1 29.1 30.1 38.1
                                               39.1 48.1 49.1 53.1

Basic numbers for the subsets in VM1 and VM2:

VM1:

SET	WORDS	TURNS	LEX	SPEAK
-------------------------------------
DEV  	15084	630	1537	35
TEST 	14615	631	1342	33
TRAIN 	285280	12600	6472	629

VM2:

SET	WORDS	TURNS	LEX	SPEAK
-------------------------------------
DEV 	11905	592	1397	13
TEST 	9855	592	1264	13
TRAIN 	153438	11835	5238	119

Turn listings
-------------

Turn listings for all 6 subsets are stored in:

VM<#>_<SET>
                  with:   #   : 1|2  (VM1, VM2)
                          SET : TRAIN, TEST, DEV

Format:
<TURN-ID> TAB <VOL-ID>

You may obtain the 3 sets for the total corpus by concatenating the
corresponding subset lists.
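
The concatenation described above can be sketched in Python as
follows. This is a minimal sketch: the function names read_turn_list
and total_set are illustrative, and the latin-1 encoding as well as
the tolerance for blank lines are assumptions, since the listing
files are only specified as "<TURN-ID> TAB <VOL-ID>".

```python
from pathlib import Path


def read_turn_list(path):
    """Read one listing; each non-empty line is '<TURN-ID> TAB <VOL-ID>'."""
    turns = []
    # Assumption: latin-1 decodes any byte, so odd characters cannot abort the read.
    with open(path, encoding="latin-1") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            turn_id, vol_id = line.split("\t", 1)
            turns.append((turn_id.strip(), vol_id.strip()))
    return turns


def total_set(directory, set_name):
    """Concatenate the VM1 and VM2 listings of one set (TRAIN, TEST or DEV)."""
    turns = []
    for corpus in ("1", "2"):
        turns.extend(read_turn_list(Path(directory) / f"VM{corpus}_{set_name}"))
    return turns
```

For example, total_set(".", "DEV") returns the turns of VM1_DEV
followed by those of VM2_DEV as (turn ID, volume ID) pairs.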

Lexica
------

Lexica and word listings for all 6 subsets:

VM<#>_<SET>.list
VM<#>_<SET>.lex

Again you may obtain lexica for the total corpus by concatenating and
sorting the corresponding subsets.
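
A possible way to merge lexica is sketched below in Python. The format
of the .lex and .list files is not described here, so the sketch
assumes one entry per line and treats identical lines as the same
entry; the name merge_lexica and the latin-1 encoding are likewise
assumptions. Duplicates are removed because the subsets share words:
the LEX count of the total DEV set (2218) is smaller than the sum of
the VM1 and VM2 counts (1537 + 1397).

```python
def merge_lexica(paths, encoding="latin-1"):
    """Merge several lexicon files into one sorted list of unique entries.

    Assumption: one entry per line; byte-identical lines are duplicates.
    """
    entries = set()
    for path in paths:
        with open(path, encoding=encoding) as handle:
            for line in handle:
                line = line.rstrip("\r\n")
                if line.strip():  # ignore blank lines
                    entries.add(line)
    # Plain code-point order; the original files may use a different collation.
    return sorted(entries)
```

For example, merge_lexica(["VM1_DEV.lex", "VM2_DEV.lex"]) yields the
entries of a combined DEV lexicon. If the same word occurs with
different pronunciations in two files, both lines are kept.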


Remarks
-------

- The DEV and TEST sets of VM1 are taken solely from the volume VM14.1.
  This has historical reasons: VM14.1 was the last official evaluation
  data set in the VM1 project.
- The low number of speakers in the DEV and TEST sets is a compromise:
  testing speaker adaptation techniques requires that these sets
  contain enough data from each single speaker. Therefore the total
  number of speakers is rather low.

Example ASR Results
-------------------

With the subsets defined above we currently (Jan 2004) obtain the
following word accuracies (WA) with an HTK recognizer and a bigram
language model trained solely on the training corpus:

Trained on VM1_TRAIN + VM2_TRAIN;
Tested on VM1_DEV + VM2_DEV set with lexicon VM1_DEV.lex + VM2_DEV.lex +
VM1_TEST.lex + VM2_TEST.lex (total: 2944 lexical entries):

Monophones:   WA = 64.27%  (52 iterations of HERest)
(512 mixtures per state)

Cross-word triphones: WA = 64.51%  (37 iterations of HERest)
(8 mixtures per state, same number of parameters as monophone system)

Some more details for those who are interested:
12 Standard MFCC + Energy + velocity + acceleration (39)
Diagonal covariance matrices
3-5 states per phoneme
43 phoneme classes (extended German SAMPA) + garbage + voice garbage +
  silence + laugh + breath (48)
Models initialized using the MAU tier of the BPF from one third of TRAIN
Re-estimation and splitting mixtures after 6 iterations on total TRAIN;
  testing after every two iterations on DEV
Weight of language model fixed to 6.5 (option -s); beam search width 100.0
No testing on TEST so far.
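
For readers who want to compare their own results with the WA figures
above, the following Python sketch computes a word accuracy as
(N - S - D - I) / N, where N is the number of reference words and
S, D, I are substitutions, deletions and insertions. This is the
definition of accuracy reported by HTK's HResults; that the WA values
in this document follow it is an assumption. The sketch aligns with
unit costs for all three error types, whereas a scoring tool may use
different alignment penalties, so the counts can differ slightly.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy in percent: 100 * (N - S - D - I) / N.

    S + D + I is obtained as the minimum word-level edit distance
    (unit cost per substitution, deletion and insertion).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j]: edits needed to turn the reference prefix seen so far
    # into the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    errors = previous[-1]
    return 100.0 * (len(ref) - errors) / len(ref)
```

For a whole test set, N and the error counts should be summed over
all turns before dividing, rather than averaging per-turn accuracies.
Note that insertions can make the value negative.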

History
-------

21.12.03 : Version 2.1 : Pronunciation corrections in vm_ger.lex
                         transferred to the lex and list sets
29.01.04 : Version 2.2 : Evaluation of monophones on the new sets
02.03.04 : Version 2.3 : Evaluation of cross-word triphones added
