Revalidation report for the WEBCOMMAND Database

Authors	Florian Schiel, Katerina Louka
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-mail	schiel@phonetik.uni-muenchen.de bas@phonetik.uni-muenchen.de
Telephone	+49-89-2180-2758
Fax	+49-89-2800362
Corpus Version	2.0
Date	18.06.2003
Status	final
Comment
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation results of the WEBCOMMAND Corpus:

Summary

The WEBCOMMAND speech corpus has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. Missing data reduces the usability of the corpus, and problems may occur when the corpus is used for other applications.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the WEBCOMMAND speech corpus, carried out in the year 2004 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The speech corpus contains recording sessions of 49 native speakers from France and Great Britain in two different quiet office rooms.
In each session one speaker reads a list of 130 prompts from a screen. There are two prompt lists of 130 items each for each language;
therefore most of the speakers have read 260 items in two different rooms.
Speakers are recorded with two microphones: a headset and a microphone fixed to a 'webpad' held on the lap.
The corpus contains a total of 15600 two-channel recordings in 120 sessions.
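
The stated total can be cross-checked from the figures above. The following is a minimal sketch that uses only numbers given in this description; the function and variable names are illustrative and not part of the corpus tools.

```python
# Figures taken from the corpus description above.
SPEAKERS = 49
SESSIONS = 120
PROMPTS_PER_SESSION = 130


def expected_recordings(sessions: int, prompts_per_session: int) -> int:
    """Recordings expected if every session covers one complete prompt list."""
    return sessions * prompts_per_session


total = expected_recordings(SESSIONS, PROMPTS_PER_SESSION)
mean_sessions = SESSIONS / SPEAKERS

print(total)                    # 15600
print(f"{mean_sessions:.2f}")   # 2.45
```

The product matches the reported 15600 recordings. The mean of about 2.45 sessions per speaker indicates that the number of sessions is not the same for every speaker.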

I.) Validation of Documentation

The General Documentation directory contains the following documentation files for the WEBCOMMAND corpus which can be found under: doc/

README	general documentation
SPEAKER.TBL	list of speakers for the total corpus
SESSION.TBL	list of sessions, place and date of the recording, microphone types, channel
PROMPT_FR_P.TBL	prompt table French P
PROMPT_FR_S.TBL	prompt table French S
PROMPT_EN_P.TBL	prompt table English P
PROMPT_EN_S.TBL	prompt table English S
SUMMARY.TXT	summary of the recordings
SAMEXPORT.TBL	summary of all SAM label files
PICS/	pictures of the recording setup
PRON_FR.LEX	pronunciation dictionary, SAM-PA, French
PRON_EN.LEX	pronunciation dictionary, SAM-PA, English
TRANSCRP.PDF	description of rules and conventions of SpeechDat transcription in German
TRANSCRP_EN.PDF	description of rules and conventions of SpeechDat transcription in English

Administrative Information:

Validating person: n. a.

Date of validation: n. a.

Contact for requests regarding the corpus: ok?

Number and type of media: 3 DVD ok

Content of each medium: total size 8,4 GByte. ok

Copyright statement and intellectual property rights (IPR): ok

Technical information:

Layout of media: Information about file system type and directory structure:
DVD-5 ok

File nomenclature: Explanation of used codes (no white space in file names!):
Q1<session number><prompt number> ok
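
A file name following this scheme can be split into its parts with a regular expression. The sketch below is hedged: the report states only the pattern Q1<session number><prompt number>, so the field widths (four digits for the session, three for the prompt) are an assumption inferred from names such as Q18002034.PAR, and `parse_name` and `CorpusName` are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional

# ASSUMPTION: 4-digit session number followed by a 3-digit prompt number.
# The documentation gives only "Q1<session number><prompt number>"; the
# widths are inferred from file names quoted in this report and may be wrong.
NAME_RE = re.compile(
    r"^Q1(?P<session>\d{4})(?P<prompt>\d{3})\.(?P<ext>[A-Za-z0-9]+)$"
)


class CorpusName(NamedTuple):
    session: str
    prompt: str
    ext: str


def parse_name(filename: str) -> Optional[CorpusName]:
    """Split a corpus file name into its fields, or return None if it does not fit."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return CorpusName(match["session"], match["prompt"], match["ext"].upper())


print(parse_name("Q18002034.PAR"))   # CorpusName(session='8002', prompt='034', ext='PAR')
print(parse_name("Q1 8002034.PAR"))  # None (white space is not allowed)
```

Names that do not match, for example names containing white space, are rejected, which corresponds to the "Correctness of file names" check of the automatic validation.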

Formats of signals and annotation files: If non standard formats are used it is common to give a full description or to convert into a standard format:
The signal file format is WAV stereo. ok

Coding: PCM linear ok

Compression: Just widely supported compressions like zip or gzip should be used.
n. a.

Sampling rate: 22050 Hz ok

Valid bits per sample: (others than 8, 16, 24, should be reported): 16 bits/samp ok

Used bytes per sample: 2 bytes/samp ok

Multiplexed signals: (exact de-multiplexing algorithm; tools)
n. a.
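
The signal parameters listed above (stereo, 22050 Hz, 2 bytes per sample, linear PCM) can be compared with the header of a signal file. This is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM only; `check_wav_header` is a hypothetical helper and not a tool used in this validation.

```python
import wave
from typing import Dict, List

# Values reported in this section: stereo, 22050 Hz, 2 bytes (16 bits) per sample.
EXPECTED: Dict[str, int] = {
    "channels": 2,
    "sample_rate_hz": 22050,
    "bytes_per_sample": 2,
}


def check_wav_header(path: str) -> List[str]:
    """Return readable mismatches between a WAV header and EXPECTED."""
    with wave.open(path, "rb") as wav:
        actual = {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "bytes_per_sample": wav.getsampwidth(),
        }
    return [
        f"{key}: expected {EXPECTED[key]}, found {actual[key]}"
        for key in EXPECTED
        if actual[key] != EXPECTED[key]
    ]
```

An empty list means that the header agrees with the documented format; each mismatch is reported as one line of text.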

Database contents:

Clearly stated purpose of the recordings:
No information, not important for this corpus

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok

Instruction to speakers in full copy: a short instruction at the beginning of each session, ok

Linguistic contents of prompted speech:

Specifications of the individual text items: ok

Specification for the prompt sheet design or specification of the design of the speech prompts: not given, not ok

Example prompt sheet or example sound file from the speech prompting: not given, not ok

Linguistic contents of non-prompted speech:

Multi-party:(number of speakers, topics discussed, type of setting - formal/informal) n. a.

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n. a.

Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

Speaker information:

Speaker recruitment strategies: No information, not important for this corpus

Number of speakers: 49 (27 English speakers - 15 female, 12 male; 22 French speakers - 12 female, 10 male) ok

Distribution of speakers over sex, age, dialect regions: given in the SPEAKER.TBL ok

Description/definition of dialect regions: No information

Recording platform and recording conditions:

Recording platform: ok

Position and type of microphones:
- Company name and type id: Beyerdynamic NEM192, Beyerdynamic MCE 10
- Electret, dynamic, condenser: no information
- Directional properties: no information
- Mounting: the speaker wears the mic NEM192, the mic MCE10 is mounted on the upper left corner of a dummy laptop case

Position of speakers: (distance to microphone) No information

Bandwidth: (if other than zero to half of sampling rate) ok

Number of channels and channel separation: 2, ok

Acoustical environment: quiet office environment, ok

Annotation (orthographic transcription):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): list of the words in the files PROMPT_FR_P.TBL, PROMPT_EN_P.TBL, PROMPT_FR_S.TBL, PROMPT_EN_S.TBL

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: n.a.

Any other language dependent information as abbreviations etc: given in the TRANSCRP_EN.PDF and TRANSCRP.PDF

Annotation manual, guidelines, instructions: ok

Description of quality assurance procedures: not given

Selection of annotators: not given

Training of annotators: not given

Annotation tools used: not given

Annotation (BAS Partitur Format Files):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): given PRON_FR.LEX, PRON_EN.LEX

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: ok

Any other language dependent information as abbreviations etc: given

Annotation manual, guidelines, instructions: ok - PRON_FR.LEX, PRON_EN.LEX (http://www.bas.uni-muenchen.de/Bas/BasFormatseng.html)

Description of quality assurance procedures: not given

Selection of annotators: not given

Training of annotators: not given

Annotation tools used: given

Lexicon:

Format: ok

Text-to-phoneme procedure: ok

Explanation or reference to the phoneme set: an indirect reference in the file TRANSCRP_EN.PDF

Phonological or higher order phenomena accounted in the phonemic transcriptions: ok

Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: n.a.

Others:

Any other essential language-dependent information or convention: given.

Indication of how many files were double-checked by the producer together with percentage of detected errors: not given

Status of documentation: acceptable

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

The data of the English speaker 1047 is missing from the file SPEAKER.TBL.

Completeness of signal files: ok

Completeness of meta data files: ok

Completeness of annotation files: Q14001000.par is superfluous; the orthography and the canonical annotation are missing.

Correctness of file names: ok.

Empty files: none

Status of signal, annotation and meta data files: ok
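
A completeness check of this kind can be sketched as a comparison of two file listings. The code below is illustrative only and is not the procedure used for this report; the listings are hypothetical, and the extension .WAV for signal files is an assumption based on the documented signal format.

```python
from pathlib import PurePath
from typing import Iterable, Set, Tuple


def stems(names: Iterable[str]) -> Set[str]:
    """File names without extension, upper-cased so .par and .PAR compare equal."""
    return {PurePath(name).stem.upper() for name in names}


def compare_listings(
    signals: Iterable[str], annotations: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (signals without annotation, annotations without signal)."""
    sig, ann = stems(signals), stems(annotations)
    return sig - ann, ann - sig


# Hypothetical directory listings, for illustration only.
signals = ["Q18002034.WAV", "Q18002110.WAV"]
annotations = ["Q18002034.PAR", "Q18002110.PAR", "Q14001000.par"]

missing, superfluous = compare_listings(signals, annotations)
print(missing)       # set()
print(superfluous)   # {'Q14001000'}
```

An annotation file without a matching signal file appears in the second set, which is how a superfluous file such as Q14001000.par would be reported.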

In the following annotation files, parts of the canonical annotation are labeled as unknown:
Q18002034.PAR, Q18002110.PAR, Q18002126.PAR, Q18008128.PAR
Q18010005.PAR, Q18025065.PAR, Q18027079.PAR, Q18028018.PAR
Q18028077.PAR, Q18028104.PAR, Q18030090.PAR, Q18002083.PAR
Q18002083.PAR, Q18002083.PAR, Q18003083.PAR, Q18003083.PAR
Q18003083.PAR, Q18008083.PAR, Q18008083.PAR, Q18008083.PAR
Q18013083.PAR, Q18013083.PAR, Q18013083.PAR, Q18016083.PAR
Q18016083.PAR, Q18016083.PAR, Q18017083.PAR, Q18017083.PAR
Q18017083.PAR, Q18024083.PAR, Q18024083.PAR, Q18024083.PAR
Q18025083.PAR, Q18025083.PAR, Q18025083.PAR, Q18031083.PAR
Q18031083.PAR, Q18031083.PAR

Cross checks of meta information: ok

Cross checks of summary listings: ok

Annotation and lexicon contents: The following entries are missing from the lexicon PRON_EN.LEX:
    - M
    - P
    - three
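
The lexicon check amounts to a set difference between the words used in the annotation and the lexicon entries. The following is a minimal sketch, assuming both are already available as lists of strings; the actual file formats of the annotation and lexicon files are not modeled here, and the sample data is hypothetical apart from the three entries reported above.

```python
from typing import Iterable, List


def missing_from_lexicon(words: Iterable[str], lexicon: Iterable[str]) -> List[str]:
    """Words used in the annotation that have no lexicon entry (case-sensitive)."""
    return sorted(set(words) - set(lexicon))


# Hypothetical sample data; only "M", "P" and "three" come from the report.
annotation_words = ["M", "P", "three", "two", "two"]
lexicon_entries = ["one", "two"]

print(missing_from_lexicon(annotation_words, lexicon_entries))  # ['M', 'P', 'three']
```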

III.) Manual Validation

5% of the data and annotation files were checked against each other. 15,84% of the data contained errors. Most of the errors (103 out of 119) were found in
the tagging of noises. The remaining errors were found in the SAM-PA annotations.
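
The split of the errors follows directly from the two counts given above. This short sketch uses only those counts; the variable names are illustrative.

```python
# Error counts reported for the manually checked subsample.
TOTAL_ERRORS = 119
NOISE_TAG_ERRORS = 103

sampa_errors = TOTAL_ERRORS - NOISE_TAG_ERRORS
noise_share = NOISE_TAG_ERRORS / TOTAL_ERRORS
sampa_share = sampa_errors / TOTAL_ERRORS

print(sampa_errors)            # 16
print(f"{noise_share:.1%}")    # 86.6%
print(f"{sampa_share:.1%}")    # 13.4%
```

Noise tagging therefore accounts for roughly 87% of the detected errors, and the remaining 16 errors concern the SAM-PA annotations.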

IV.) Other Relevant Observations

Noise markers were not used consistently. In some files general noise before or after the spoken prompt was tagged, but in others not.

V.) Comments for Improvement

The revalidation was able to repair some data (lexicon, speaker file, annotation files, README). The errors found in the manual validation
could not be repaired. The SAM-PA annotations and the noise markers should be revised.

VI.) Result

The corpus is ok. No data or documentation files are missing, and the most important annotation errors have been repaired.