Revalidation Report for the SI100 Database

Authors
Florian Schiel, Angela Baumann  
Affiliation  
BAS Bayerisches Archiv für Sprachsignale
Institut für Phonetik
Universität München
Postal address
Schellingstr. 3
D 80799 München
E-Mail
schiel@phonetik.uni-muenchen.de
bas@phonetik.uni-muenchen.de
Telephone
+49-89-2180-2758
Fax
+49-89-2800362
Corpus Version
3.0
Date
03/07/2003
Status
final
Comment
The following validation results show a lack of essential information in documentation and annotation.
Validation Guidelines
Florian Schiel: The Validation of Speech Corpora, Bastard Verlag München, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook 

Validation Results of the SI100 Corpus

Summary

The SI100 speech corpus has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. Missing data reduces the usability of the corpus, and problems may occur when the corpus is used for other applications.

Introduction and Corpus Description

This document summarises the results of an in-house validation of the speech corpus SI100, carried out in 2003 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The recordings took place at the same institute in 1995 and were closely monitored by a supervisor in a quiet studio environment. The language of the corpus is German. The corpus contains read speech of 101 different speakers (50 female, 50 male, 1 unknown). Each speaker read approx. 100 sentences from either the SZ sub-corpus or the CeBit sub-corpus. The sub-corpus SZ contains 544 sentences from newspaper articles ("Sueddeutsche Zeitung"). The sub-corpus CeBit contains 483 sentences from newspaper articles about the CeBit 1995. Each sub-corpus is divided into 5 parts of approx. 100 utterances each. Every speaker read only one part of one sub-corpus (with some exceptions), resulting in a total of approx. 10,100 recorded utterances (31.5 h of speech).

I.) Validation of Documentation

The General Documentation directory contains the following documentation files for the SI100 corpus which can be found under: doc/

README: general documentation
SI100_id.lis: list of speakers for the total corpus
SI100_#.lis: list of speaker ids for the single volume (# = 1,...,7)
SI100_ce.txt: texts of sub-corpus CeBit
SI100_sz.txt: texts of sub-corpus SZ
SI100_wo.txt: list of spoken words
SI100.lex: pronunciation lexicon
partitur/: BAS Partitur files
pardoc/: documentation of the BAS Partitur files

The following required contents of the documentation have been checked:

Status documentation: not acceptable, but repairable.

II.) Automatic Validation

The following list contains all validation steps with the methodology and results.

Difference between the README file and the "/doc/SI100_id.lis" file: (repairable)

Differences between the "/doc/SI100_id.lis" file, the "/doc/SI100_*.lis" files and the structure of the file system. It seems that the speaker ids were at some point changed to 4 characters. In the process, some parts of the corpus and the documentation seem to have been overlooked: (repairable)

The following speaker ids were only partially changed:

Some of the links to these files in the directory "SI100_total" point nowhere.

Differences between the spoken parts of the sub-corpora as listed in the "/doc/SI100_id.lis" file and the parts actually spoken:

The documentation did not properly explain the purpose and the meaning of the following three different orthographic representations:

The following corrections were made in the headers and the partitur files but not in the files SI100_sz.txt and SI100_ce.txt :

Errors in the NIST header were detected for the following files:

ERROR: /bmnt/BAS/SI100_1/alsc/alsc283c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_1/chkr/chkr480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_1/erai/erai480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_1/itrc/itrc480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_1/jume/jume480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_2/rten/rten480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_2/sole/sole480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_4/anwi/anwi480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_4/bawe/bawe480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_6/bija/bija480c.nis : wrong size calculation
ERROR: /bmnt/BAS/SI100_6/flsc/flsc480c.nis : wrong size calculation

The first file is one byte too long, but the size of the header seems correct.
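A size check of this kind can be sketched as follows. This is a minimal illustration, not the script used in the validation: it assumes the usual NIST SPHERE layout (magic string "NIST_1A", header size on the second line, integer fields marked "-i", closing "end_head") and the field names sample_count, channel_count and sample_n_bytes. The function names and the fallback defaults are hypothetical.

```python
def expected_sphere_size(data: bytes) -> int:
    """Return the file size implied by a NIST SPHERE header.

    Assumed layout: line 1 is "NIST_1A", line 2 is the header size in
    bytes, followed by "name -type value" fields and "end_head".
    """
    first_lines = data[:64].split(b"\n", 2)
    if len(first_lines) < 2 or first_lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(first_lines[1].strip())

    # Collect the integer fields ("-i") up to the end_head marker.
    fields = {}
    for raw in data[:header_size].decode("latin-1").splitlines()[2:]:
        parts = raw.split(None, 2)
        if parts and parts[0] == "end_head":
            break
        if len(parts) == 3 and parts[1] == "-i":
            fields[parts[0]] = int(parts[2])

    n_samples = fields["sample_count"]
    # Assumption: fall back to mono, 16-bit samples if a field is absent.
    n_channels = fields.get("channel_count", 1)
    n_bytes = fields.get("sample_n_bytes", 2)
    return header_size + n_samples * n_channels * n_bytes


def size_mismatch(data: bytes) -> int:
    """Actual size minus expected size; 0 means header and data agree."""
    return len(data) - expected_sphere_size(data)
```

Applied to the raw bytes of a signal file, a non-zero result from size_mismatch would flag the file; a result of 1 would correspond to a file that is one byte too long.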

The word list SI100_wo.txt (6187 lines) and the lexicon SI100.lex (6186 lines) differ in length and, because of the different Umlaut codings (LaTeX and ISO 8859-1 German), in sort order. SI100_wo.txt has German Umlauts and includes the words "<drittens" and "a", which are not included in SI100.lex. The word "drittens" was found only in SI100.lex.

Apart from the Umlaut coding and some spaces, the ORT tier in the partitur file and the orthography tags in the NIST header are identical.

The word list SI100_wo.txt was compared with a word list created from the ORT tiers of the partitur files. Interestingly, the file SI100_wo.txt contains more words than the word list created from the annotations.

The second tier of SI100.lex, which contains the pronunciations, was compared to a list compiled from the KAN tier. Again, there are more entries in SI100.lex. Some pronunciations differ. In the partitur files the sound "R" is used, while in SI100.lex only "r" occurs. It was not checked, whether s

The tiers DBN, LHD, NCH, REP, SAM, SBF, SNB and SSB of the partitur files were checked and no error was found. The tier LBD has no entry. For the files where the speaker ids were changed (bija, kipp, miol, peko, step, waba, ziul and zuen), the tiers SPN and SRC differ from the actual file and path name. It is quite surprising that some partitur files include a MAU tier (bija/bija383c.par) and some do not (e.g. adhe329.par).
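The tier checks described above can be illustrated with a short sketch. It assumes that every partitur line starts with a three-character tier name followed by a colon, and that the SPN tier holds the speaker id; the function names, the default list of required tiers and the example ids in the usage note are hypothetical.

```python
from collections import defaultdict


def read_tiers(lines):
    """Group the entries of a partitur file by their tier name."""
    tiers = defaultdict(list)
    for line in lines:
        key, sep, value = line.partition(":")
        # Assumption: tier names are exactly three alphanumeric characters.
        if sep and len(key) == 3 and key.isalnum():
            tiers[key].append(value.strip())
    return tiers


def check_partitur(lines, speaker_id, required=("ORT", "KAN", "MAU")):
    """Return a list of problems: missing tiers and a mismatching SPN."""
    tiers = read_tiers(lines)
    problems = [f"missing tier {name}" for name in required
                if name not in tiers]
    spn = tiers.get("SPN", [""])[0]
    if spn != speaker_id:
        problems.append(f"SPN is {spn!r}, expected {speaker_id!r}")
    return problems
```

Called with the lines of one partitur file and the speaker id taken from its directory name, the function returns an empty list when all required tiers are present and SPN matches. A file without a MAU tier whose SPN still carries an old id such as "abcd" instead of "wxyz" would yield two entries.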

III.) Manual Validation

In a rough comparison of the KAN tiers in some partitur files and the respective sound files (by listening) the following errors were found:

In SI100.lex, two distinct pronunciations are specified for "-" and "\-":

In the file /bmnt/BAS/SI100_1/alsc/alsc283c.nis the speaker uttered "bInd@strIC", as specified in the prompt file. In the NIST header and the partitur file, "-" and "g@daNk@nstrIC" are annotated, which is not correct.

IV.) Other Relevant Observations

We found an inconsistent policy concerning white-space characters. This is a source of errors for automatic processing. We suggest using a single white-space character between words and no spaces at the end of a line (as found in SI100_3.lis).
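A check for the suggested white-space policy could look like the following sketch. It treats any tab or any run of two or more white-space characters inside a line as irregular, which is a stricter reading than the corpus files may have intended; the function name and the message texts are hypothetical.

```python
import re


def whitespace_problems(lines):
    """Return (line number, message) pairs for white-space irregularities."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if text != text.rstrip():
            problems.append((number, "trailing white-space"))
        # Inside the line, allow only single blanks between words.
        if re.search(r"\s{2,}|\t", text.strip()):
            problems.append((number, "irregular white-space between words"))
    return problems
```

Run over the lines of a list file, the function reports the line numbers that would need to be cleaned up before the file is processed automatically.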

The sub-corpora were divided into different parts. In the case of CeBit, the boundaries for part 5 are wrong in the README file: they should read "5 = 378-483", not "5 = 388-483".

The CeBit recordings 201-273 of speaker brda are not in part 2 as indicated in the file SI100_id.lis, but in part 3.

The speaker "jore" occurs twice in the file SI100_id.lis; the only difference between the two entries is the part read (part 1 and part 5). This is impractical for automatic processing that is based on the speaker id. Dividing the corpus into different parts is not really necessary. For reasons of simplicity, we suggest removing the "part" information and the second entry from the speaker-id file.
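Duplicate entries of this kind are easy to detect automatically. The sketch below assumes that the speaker id is the first white-space separated field of each line in the list file; the actual layout of SI100_id.lis is not described here, so the example lines in the usage note are purely illustrative.

```python
from collections import Counter


def duplicate_speaker_ids(lines):
    """Return the speaker ids that occur on more than one line."""
    # Assumption: the id is the first field of every non-empty line.
    ids = [line.split()[0] for line in lines if line.strip()]
    return sorted(sid for sid, count in Counter(ids).items() if count > 1)
```

For a list with the hypothetical lines "jore sz 1", "abcd ce 2" and "jore sz 5", the function returns the id "jore".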

The orthography entry in the NIST header allows German Umlauts. The evaluator is not entirely sure whether this conforms to the specifications by NIST.

The URL "http://www.icp.grenet.fr/Relator/standnist.html", which points to the documentation of the NIST header, no longer exists.

The file pardoc/PARSAMPA.HTM is not an HTML file but a simple ASCII text file. The appropriate name would be PARSAMPA.TXT.

There are no recordings for sentence 7 of the SZ corpus and sentence 303 of the CeBit corpus.

The documentation of the software is incomplete.

V.) Comments for Improvement

The use of "-" and "\-" to distinguish the pronunciations "g@daNk@nstrIC" and "bInd@strIC" is quite confusing. We suggest using more intuitive tags such as "<Bindestrich>" and "<Gedankenstrich>", at least in the partitur files.

There are three orthographic representations in the corpus. It makes sense to keep the actual prompts separate from a pseudo transcription. It is, however, rather impractical for error correction and maintenance to keep the transcription in both the NIST header and the partitur file. We suggest removing one of them, probably the one in the NIST header.

The report of the previous validation is a Word document; we suggest making an HTML version of this document available as well.

The directory structure on the CDs should be modified so that the directories containing the data files are all in one directory. This data directory, the partitur directory and the doc directory should all be on the same level.

The meaning of the SI100_WO.TXT file should be explained in the README file.

The labels contained in the NIST headers should be documented in the README file.

VI.) Result

It looks as if the SI100 corpus has gone through many changes, not all of which have improved its quality. We suggest that all redundancies be removed from the corpus. This would make error correction, maintenance and documentation much easier. After the correction of the errors described in this report and a revision of the documentation, the corpus will be a valuable speech resource again.

In this evaluation the script par2ags.pl was used to test whether the partitur files were formatted according to the partitur file conventions. It might be useful for further evaluations to have a proper partitur parser at hand that tests all dependencies within the file.