Revalidation report for the WEBCOMMAND Database

Authors: Florian Schiel, Katerina Louka
Affiliation: BAS Bayerisches Archiv für Sprachsignale
             Institut für Phonetik
             Universität München
Postal address: Schellingstr. 3, D-80799 München
E-mail: schiel@phonetik.uni-muenchen.de
        bas@phonetik.uni-muenchen.de
Telephone: +49-89-2180-2758
Fax: +49-89-2800362
Corpus Version: 2.0
Date: 18.06.2003
Status: final
Comment:
Validation Guidelines: Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation results of the WEBCOMMAND Corpus:

Summary

The WEBCOMMAND speech corpus has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. Missing data reduces the usability of the corpus, and problems may occur when the corpus is used for other applications.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus WEBCOMMAND carried out in the year 2004 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The speech corpus contains recording sessions of 49 native speakers from France and Great Britain, recorded in two different quiet office rooms.
In each session one speaker reads a list of 130 prompts from a screen. There are two prompt lists of 130 items each per language;
therefore most of the speakers read 260 items in two different rooms.
Speakers are recorded with two microphones: a headset and a microphone fixed to a 'webpad' held on the lap.
The corpus contains a total of 15600 two-channel recordings in 120 sessions.
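
The total can be cross-checked with simple arithmetic; the following lines (Python) only restate the design figures quoted above and assume nothing about the actual directory layout:

    sessions         = 120    # recording sessions in the corpus
    prompts_per_list = 130    # items read per session
    print(sessions * prompts_per_list)   # 15600 expected two-channel recordings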

I.) Validation of Documentation

The general documentation directory doc/ contains the following documentation files for the WEBCOMMAND corpus:

README            general documentation
SPEAKER.TBL       list of speakers for the total corpus
SESSION.TBL       list of sessions, place and date of the recording, microphone types, channel
PROMPT_FR_P.TBL   prompt table French P
PROMPT_FR_S.TBL   prompt table French S
PROMPT_EN_P.TBL   prompt table English P
PROMPT_EN_S.TBL   prompt table English S
SUMMARY.TXT       summary of the recordings
SAMEXPORT.TBL     summary of all SAM label files
PICS/             pictures of the recording setup
PRON_FR.LEX       pronunciation dictionary, SAM-PA, French
PRON_EN.LEX       pronunciation dictionary, SAM-PA, English
TRANSCRP.PDF      description of the rules and conventions of the SpeechDat transcription, in German
TRANSCRP_EN.PDF   description of the rules and conventions of the SpeechDat transcription, in English

Distribution of speakers over sex, age, dialect regions: given in SPEAKER.TBL, ok
Description/definition of dialect regions: no information

Status of documentation: acceptable

II.) Automatic Validation

The following list contains all validation steps with the methodology and results.

The data of the English speaker 1047 is missing from the file SPEAKER.TBL.

Completeness of signal files: ok

Completeness of meta data files: ok

Completeness of annotation files: Q14001000.par is superfluous; its orthography and canonical annotation are missing.

Correctness of file names: ok.

Empty files: none

Status of signal, annotation and meta data files: ok
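
The formal file checks listed above (completeness, correctness of file names, empty files) can be reproduced along the following lines. This is only a sketch: the corpus root, the signal file extension and the exact naming scheme (here the pattern Q followed by eight digits, as in the file names quoted in this report) are assumptions.

    import os, re, sys

    CORPUS_ROOT = sys.argv[1]                 # path to the corpus root (assumption)
    SIGNAL_EXT  = ".wav"                      # signal file extension (assumption)
    STEM_OK     = re.compile(r"^Q\d{8}$")     # naming pattern as seen in this report

    signals, annotations, bad_names, empty_files = set(), set(), [], []

    for dirpath, _, filenames in os.walk(CORPUS_ROOT):
        for name in filenames:
            stem, ext = os.path.splitext(name)
            ext = ext.lower()
            if ext not in (SIGNAL_EXT, ".par"):
                continue                                 # documentation etc. is not checked here
            path = os.path.join(dirpath, name)
            if not STEM_OK.match(stem.upper()):
                bad_names.append(path)                   # correctness of file names
            if os.path.getsize(path) == 0:
                empty_files.append(path)                 # empty files
            (signals if ext == SIGNAL_EXT else annotations).add(stem.upper())

    # completeness: every signal file needs an annotation file and vice versa
    print("signals without annotation:", sorted(signals - annotations))
    print("annotations without signal:", sorted(annotations - signals))
    print("ill-formed names:", bad_names)
    print("empty files:", empty_files)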

In the following annotation files some of the canonical annotation is labeled as unknown (a sketch for locating such entries follows the list):
Q18002034.PAR, Q18002110.PAR, Q18002126.PAR, Q18008128.PAR
Q18010005.PAR, Q18025065.PAR, Q18027079.PAR, Q18028018.PAR
Q18028077.PAR, Q18028104.PAR, Q18030090.PAR, Q18002083.PAR
Q18002083.PAR, Q18002083.PAR, Q18003083.PAR, Q18003083.PAR
Q18003083.PAR, Q18008083.PAR, Q18008083.PAR, Q18008083.PAR
Q18013083.PAR, Q18013083.PAR, Q18013083.PAR, Q18016083.PAR
Q18016083.PAR, Q18016083.PAR, Q18017083.PAR, Q18017083.PAR
Q18017083.PAR, Q18024083.PAR, Q18024083.PAR, Q18024083.PAR
Q18025083.PAR, Q18025083.PAR, Q18025083.PAR, Q18031083.PAR
Q18031083.PAR, Q18031083.PAR
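
Such entries can be located automatically. The sketch below assumes that the annotation files are in the BAS Partitur Format with the canonical pronunciation on the KAN tier, that the files are encoded in ISO 8859-1, and that unknown entries carry a fixed marker string; the marker used here is only a placeholder.

    import glob

    UNKNOWN_MARK = "<unknown>"     # placeholder; the actual marker string is an assumption

    def has_unknown_canonical(par_file, marker=UNKNOWN_MARK):
        # report a file if any KAN (canonical pronunciation) line carries the marker
        with open(par_file, encoding="latin-1") as fh:
            return any(line.startswith("KAN:") and marker in line for line in fh)

    par_files = glob.glob("**/*.par", recursive=True) + glob.glob("**/*.PAR", recursive=True)
    for f in sorted(filter(has_unknown_canonical, par_files)):
        print(f)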

Cross checks of meta information: ok

Cross checks of summary listings: ok

Annotation and lexicon contents: in the lexicon PRON_EN.LEX the following entries are missing (a sketch of this coverage check follows the list):
    - M
    - P
    - three
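
A lexicon coverage check of this kind can be sketched as follows. Assumptions: PRON_EN.LEX holds one entry per line with the orthographic form in the first whitespace-separated column, the orthography of the annotation files is carried on the ORT tier of the BAS Partitur Format, the files are ISO 8859-1 encoded, and the restriction to English sessions (e.g. via SESSION.TBL) is left out for brevity.

    import glob

    def load_lexicon(path):
        # orthographic forms: first column of each non-empty lexicon line (format is an assumption)
        with open(path, encoding="latin-1") as fh:
            return {line.split()[0] for line in fh if line.strip()}

    def ort_tokens(par_file):
        # words on the ORT tier: "ORT: <index> <word>"
        with open(par_file, encoding="latin-1") as fh:
            for line in fh:
                if line.startswith("ORT:"):
                    parts = line.split()
                    if len(parts) >= 3:
                        yield parts[2]

    lexicon = load_lexicon("doc/PRON_EN.LEX")
    missing = set()
    for par in glob.glob("**/*.par", recursive=True) + glob.glob("**/*.PAR", recursive=True):
        missing.update(tok for tok in ort_tokens(par) if tok not in lexicon)
    print(sorted(missing))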


III.) Manual Validation

5% of the data and annotation files were checked in comparison. 15.84% of the checked data contained errors. Most of the errors (103 out of 119) were found in
the tagging of noises; the remaining errors were found in the SAM-PA annotations.
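
The 5% subsample for such a manual check can be drawn reproducibly, for example along the following lines; the fixed seed is arbitrary, and the assumption that one annotation file counts as one item is illustrative.

    import glob, random

    par_files = sorted(glob.glob("**/*.par", recursive=True) +
                       glob.glob("**/*.PAR", recursive=True))

    random.seed(42)                              # arbitrary fixed seed, for reproducibility
    sample = random.sample(par_files, k=max(1, round(0.05 * len(par_files))))
    for f in sample:
        print(f)                                 # files selected for manual inspection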

IV.) Other Relevant Observations

Noise markers were not used consistently: in some files general noise before or after the spoken prompt was tagged, but not in others.

V.) Comments for Improvement

The revalidation was able to repair some of the data (lexicon, speaker file, annotation files, README). The errors found in the manual validation
could not be repaired; the SAM-PA annotations and the noise markers should be revised.

VI.) Result

The corpus is ok. No data or documentation files are missing, and the most important annotation errors have been repaired.