===========================================================================

VERBMOBIL-language-model version 2.4       (15-01-2000)

(C) Philips GmbH Forschungslaboratorien, Aachen, 1992-2000

Without written permission, copying is permitted only for VERBMOBIL
partners and only for purposes of the VERBMOBIL project.

Send error reports and comments to:
{peters,klakow}@pfa.research.philips.com

===========================================================================

The command probabilities can be tuned only with additional software
that includes the routine LMSetCmdWeight (24.6.99, in EXCHANGE /
MOD_VM-99-II-LM-Cmd-Software_PETERS / VM-99-II-LM.Cmd.Software.tgz).

If the language model is only applied to text that contains no commands,
	VM-99-II-LM.Software.tgz
can be used.

===========================================================================

The perplexity of the new trigram model VM2-2.4.M3.lm on the
official test set was 65.50 (for details see below).

===========================================================================

Changes relative to language model version 2.3:

Word list:      The entry Verbmobile was removed.
                4 first names were replaced by new ones.

Word classes:   There is one new class UNK:Female.
                The classes Verbmobile and $U-$S-$A were removed.

Training data:  increased by 32000 words (approx. 4%).

===========================================================================

Files
-----

README
VM2-2.4.lm.wl		(word list)
VM2-2.4.lm.cs		(class sizes)
VM2-2.4.lm.map		(map [word->class])
VM2-2.4.M3.lm.gz	(trigram LM)

Word list
---------

The language model is based on the word list vmII-whg.wl.5.1.

- Additionally, the symbols <UNK> for unknown words,
  @ for the end of a turn, and h"as for the hesitation
  <h"as> of the transliterations were included.

- The entry #PAUSE# has been removed.

- The 4 first names Christian, Holger, Jochen and Ulf were replaced
  by Elke, Berta, Emil and Hermann.

This word list is distributed as VM2-2.4.lm.wl together with the language
models.

Class size:
----------

Both the simpler LMWordInit routine and the more general LMInit routine
need the number of elements per class. These numbers are given in two
columns in the file VM2-2.4.lm.cs.

The more general LMInit routine must be given the left column as an array
*ClassList[] and the right column as an array *ClassSizes.

The simpler LMWordInit routine only needs the file name.
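
For callers of LMInit, splitting the two columns could look like the
following Python sketch. It assumes whitespace-separated columns with the
class name on the left and an integer size on the right; this layout is
an assumption, and the function name read_class_sizes is hypothetical and
not part of the distributed software.

```python
from typing import Iterable, List, Tuple


def read_class_sizes(lines: Iterable[str]) -> Tuple[List[str], List[int]]:
    """Split two-column lines into class names and class sizes.

    Assumed format (not confirmed by the distribution): one class per
    line, name and integer size separated by whitespace.
    """
    class_list: List[str] = []
    class_sizes: List[int] = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected two columns, got: {line!r}")
        class_list.append(fields[0])       # left column
        class_sizes.append(int(fields[1]))  # right column
    return class_list, class_sizes


# Usage with the distributed file:
# with open("VM2-2.4.lm.cs") as f:
#     class_list, class_sizes = read_class_sizes(f)
```

The two returned lists keep the order of the file, so entry i of the
sizes belongs to entry i of the class names.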

Word classes:
-------------

Both the simpler LMWordInit routine and the "self-management" approach
with the more general LMInit routine require an assignment of all words
to their classes. It is based on the UnkTags of the LexDb.Integrated.10.0.

This assignment can be found in the file VM2-2.4.lm.map <Word class>.

The LMWordInit routine only needs the file name.

Training data
-------------

The VERBMOBIL-language-model version 2.4 was trained on the German data of
VERBMOBIL-I (CD1, CD2, CD3, CD4, CD5, CD7, CD12, CD14) and the
VERBMOBIL-II data of CD15, CD20, CD21, CD22, CD24, CD30, CD32,
CD38, CD39, CD48 and CD49, as well as the WOZ dialogues prepared in
Hamburg.
Of CD29, all dialogues were used except those included in the test set
and the cross validation set; the files
CD29/g{372,373,374,386,392,393,394,395,400,412,413,414,415}ac.trl
were used. All data were filtered with the new trl filter trl-Filter,
with the flags "--wortkat --awortdef --tger --mger --pros --mling" set.
Non-German turns were removed, discontinuous words were concatenated, and
truncated words were removed. After that, the symbols in < > were changed
to the symbols in the word list.

The German data of the Call-Home-Corpus were included. 

--> VM1: 342788 words
    VM2: 268057 words
    c_h: 220562 words

All training data were used to train word-based and class-based models,
which were afterwards linearly interpolated. The optimisation was done on
the cross validation set.
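
Linear interpolation and the tuning of its weight on held-out data can
be sketched in Python as follows. This is a minimal illustration, not the
procedure used to build VM2-2.4.M3.lm: the single weight lam, the grid
search, and all function names are assumptions, and the two component
models are represented only by the probabilities they assign to each
held-out word.

```python
import math
from typing import Sequence


def interpolate(p_word: float, p_class: float, lam: float) -> float:
    """Linear interpolation of a word-based and a class-based probability."""
    return lam * p_word + (1.0 - lam) * p_class


def perplexity(p_word: Sequence[float], p_class: Sequence[float],
               lam: float) -> float:
    """Perplexity of the interpolated model on a held-out word sequence."""
    log_sum = sum(math.log(interpolate(pw, pc, lam))
                  for pw, pc in zip(p_word, p_class))
    return math.exp(-log_sum / len(p_word))


def tune_lambda(p_word: Sequence[float], p_class: Sequence[float],
                steps: int = 100) -> float:
    """Grid search for the weight with the lowest held-out perplexity.

    The endpoints 0 and 1 are excluded so that the mixture stays positive
    whenever at least one component assigns a non-zero probability.
    """
    candidates = [i / steps for i in range(1, steps)]
    return min(candidates, key=lambda lam: perplexity(p_word, p_class, lam))
```

Here p_word[i] and p_class[i] stand for the probabilities the two models
assign to the i-th word of the cross validation text given its history.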

Cross validation set
--------------------

CD15/g009ac.trl
CD15/g010ac.trl
CD15/g011ac.trl
CD15/g012ac.trl
CD15/g017ac.trl
CD20/g018ac.trl
CD15/g040ac.trl
CD15/g041ac.trl
CD15/g042ac.trl
CD20/g043ac.trl
CD21/g203ac.trl
CD15/g204ac.trl
CD20/g205ac.trl
CD20/g206ac.trl
CD20/g219ac.trl
CD24/g220ac.trl
CD24/g221ac.trl
CD21/g222ac.trl
CD30/g598bc.trl
CD30/g599bc.trl
CD30/g600bc.trl
CD30/g601bc.trl
CD22/g592bc.trl
CD24/g593bc.trl
CD22/g588bc.trl
CD24/g589bc.trl

--> 18081 words. The OOV rate on the word list VM2-2.4.lm.wl is
    1.0% (183 words).

Test data
---------

CD29/g380ac.trl
CD29/g381ac.trl
CD29/g382ac.trl
CD29/g383ac.trl
CD29/g388ac.trl
CD29/g389ac.trl
CD29/g390ac.trl
CD29/g391ac.trl
CD29/g594ac.trl
CD29/g595ac.trl
CD22/g584bc.trl
CD24/g585bc.trl
CD29/g596bc.trl
CD29/g597bc.trl

--> 8693 words. The OOV rate on the word list VM2-2.4.lm.wl is
    1.4% (118 words).
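
The OOV rates quoted for the cross validation and test sets are the
fraction of running words not contained in the word list (183 / 18081
and 118 / 8693). A minimal Python sketch of this count follows; the
function name oov_rate is hypothetical, and the text is assumed to be
tokenised already.

```python
from typing import Iterable, Sequence, Tuple


def oov_rate(tokens: Sequence[str],
             vocabulary: Iterable[str]) -> Tuple[int, float]:
    """Return the number of out-of-vocabulary tokens and their fraction."""
    if not tokens:
        raise ValueError("cannot compute an OOV rate on an empty text")
    vocab = set(vocabulary)
    oov = sum(1 for token in tokens if token not in vocab)
    return oov, oov / len(tokens)


# Checking the figures quoted in this README:
print(f"{183 / 18081:.1%}")  # cross validation set
print(f"{118 / 8693:.1%}")   # test set
```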

Results
-------

During the perplexity (PP) measurement, the OOV words were _NOT_ mapped
to <UNK>; instead, after each OOV word the calculation restarted with a
unigram. This corresponds to the newly implemented perp.c.

--> On these test data the new trigram (VM2-2.4.M3.lm) achieved a
    perplexity of 65.50.

--> On the cross validation data used for the optimisation of the
    parameters the PP was 58.06. 
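
The OOV handling described above can be sketched in Python as follows.
This illustrates the described behaviour and is not a transcription of
perp.c: the callback prob(word, history), the function name, and the
decision to leave OOV words out of the word count are assumptions.

```python
import math
from typing import Callable, Iterable, Set, Tuple

History = Tuple[str, ...]


def perplexity_skip_oov(tokens: Iterable[str],
                        vocabulary: Set[str],
                        prob: Callable[[str, History], float]) -> float:
    """Trigram perplexity in which OOV words are skipped, not mapped to <UNK>.

    prob(word, history) is a hypothetical callback returning the model
    probability of word given up to two preceding words.
    """
    history: History = ()
    log_sum = 0.0
    scored = 0
    for word in tokens:
        if word not in vocabulary:
            history = ()  # restart: the next word is scored as a unigram
            continue      # assumption: the OOV word itself is not counted
        log_sum += math.log(prob(word, history))
        scored += 1
        history = (history + (word,))[-2:]  # keep at most two words
    if scored == 0:
        raise ValueError("no in-vocabulary words to score")
    return math.exp(-log_sum / scored)
```

After an OOV word the history grows again from empty, so the following
words are scored with a unigram, then a bigram, then a trigram.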

===========================================================================
