Revalidation report for the SC2 Database

Authors	Florian Schiel, Katerina Louka
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-mail	schiel@phonetik.uni-muenchen.de bas@phonetik.uni-muenchen.de
Telephone	+49-89-2180-2758
Fax	+49-89-2800362
Corpus Version	2.1
Date	02.08.2004
Status	final
Comment
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation results of the SC2 Corpus:

Summary

The speech corpus SC2 has been validated against general principles of good practice. The validation covered completeness checks, formal checks and manual checks of selected subsamples. Note that missing data reduces the usability of a corpus and can cause problems when the corpus is used in further applications.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus SC2, carried out in 2004 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The speech corpus was created at the same institute in 1997 in collaboration with Siemens, Munich.

The corpus contains read speech from 10 different speakers, who were prompted on screen with 'automobile diagnosis phrases' and recorded under real conditions in two different car maintenance halls. The language is German. All speakers are male native Germans who had never participated in such a task before. They are all experts in the field of car diagnosis.

I.) Validation of Documentation

The General Documentation directory contains the following documentation files for the SC2 corpus which can be found under /doc:

README	file describing the database structure
sc2_ses.txt	session information
sc2_ort_iso.txt	text prompt corpus (100) in ISO 8859-1
sc2_ort_tex.txt	text prompt corpus (100) in LaTeX
sc2_pro_iso.txt	text prompts in order of prompting (400) in ISO 8859-1
sc2_pro_tex.txt	text prompts in order of prompting (400) in LaTeX
sc2_wordlist_tex.txt	list of all spoken words (LaTeX)
sc2_lex_tex.txt	SAM-PA dictionary of canonical pronunciation (LaTeX)
sampa.txt	Extended German SAM-PA
transkonv_engl	documentation of the BAS guidelines for proper pronunciation coding in German SAM-PA

· Administrative Information:

Validating person: n. a.

Date of validation: n. a.

Contact for requests regarding the corpus: ok

Number and type of media: 1 CD-ROM ok

Content of each medium: total size 520MB

· Technical information:

Layout of media: Information about file system type and directory structure:

1 CDROM with ISO 9660 format

File nomenclature: Explanation of used codes (no white space in file names!):
data/<session>/<speaker><hall><mic><utt_id>.nis ok
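
The nomenclature rule above can be checked mechanically. The following is a minimal sketch, assuming the template data/<session>/<speaker><hall><mic><utt_id>.nis; the report does not state the widths of the individual codes, so the pattern only checks the overall shape and the no-white-space rule. The function name and the example paths are hypothetical.

```python
import re

# Assumed reading of the template data/<session>/<speaker><hall><mic><utt_id>.nis.
# The widths of the individual codes are not given in this report, so only the
# overall shape and the absence of white space are checked.
NAME_PATTERN = re.compile(r"data/[^/\s]+/[^/\s]+\.nis")


def is_valid_signal_path(path: str) -> bool:
    """Return True if the path has the documented shape and no white space."""
    return NAME_PATTERN.fullmatch(path) is not None


if __name__ == "__main__":
    # Hypothetical example paths, not taken from the corpus.
    for candidate in ["data/ses01/abc123.nis", "data/ses01/abc 123.nis"]:
        print(candidate, is_valid_signal_path(candidate))
```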

Formats of signal and annotation files: If non-standard formats are used, it is common to give a full description or to convert into a standard format: ok, raw signal file

Coding: NIST

Compression: n. a.

Sampling rate: 8 kHz ok

Valid bits per sample: (values other than 8, 16, 24 should be reported): 16 bit, ok.

Used bytes per sample: 2 ok

Multiplexed signals: (exact de-multiplexing algorithm; tools) n.a.
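
The signal format values listed above (NIST coding, 8 kHz, 16 valid bits, 2 bytes per sample) could be verified against the file headers roughly as follows. This is a sketch only: it assumes the .nis files carry a standard NIST SPHERE ASCII header with the usual field names (sample_rate, sample_n_bytes, sample_sig_bits, channel_count), which this report does not confirm. The sample header in the code is constructed for illustration and is not taken from the corpus.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header of a NIST SPHERE file into a dict."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {}
    for line in lines[2:]:  # the second line holds the header size in bytes
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # field name, type flag, value
        if len(parts) != 3:
            continue
        name, kind, value = parts
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        else:  # -sN: string of N characters
            fields[name] = value
    return fields


def check_signal_format(fields: dict) -> list:
    """Compare header fields with the values stated in this report."""
    expected = {
        "sample_rate": 8000,     # 8 kHz
        "sample_sig_bits": 16,   # valid bits per sample
        "sample_n_bytes": 2,     # used bytes per sample
        "channel_count": 1,      # 1 channel
    }
    return [
        f"{name}: expected {want}, found {fields.get(name)}"
        for name, want in expected.items()
        if fields.get(name) != want
    ]


# Illustrative header only; real files would supply their first header block.
SAMPLE_HEADER = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 8000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_sig_bits -i 16\n"
    b"end_head\n"
)

if __name__ == "__main__":
    print(check_signal_format(parse_sphere_header(SAMPLE_HEADER)))
```

For a real file, the bytes passed to parse_sphere_header would be the header block read from the start of the file.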

· Database contents:

Clearly stated purpose of the recordings: ok. (README)

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok

Instruction to speakers in full copy: ok

· Linguistic contents of prompted speech:

Specifications of the individual text items: ok (sc2_pro_*.txt)

Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.

Example prompt sheet or example sound file from the speech prompting: n.a.

· Linguistic contents of non-prompted speech:

Multi-party: (number of speakers, topics discussed, type of setting - formal/informal) n.a.

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n.a.

Human-machine dialogues: (domain(s), topic(s), dialogue strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

· Speaker information:

Speaker recruitment strategies: indirect information in the README file

Number of speakers: 10 ok

Distribution of speakers over sex, age, dialect regions: ok (each data directory contains a speaker file: vp.txt)
Description/definition of dialect regions: indirectly given through the place of origin (vp.txt)

· Recording platform and recording conditions:

Recording platform: ok

Position and type of microphones: The data was transmitted via a wireless phone (Siemens Gigaset) to an analog phone set connected to the base station of the wireless phone. A laptop was connected to this analog phone set via standard I/O technique.
- Company name and type id: ok
- Electret, dynamic, condenser: n.a.
- Directional properties: n.a.
- Mounting: n.a.

Position of speakers: (distance to microphone) ok

Bandwidth: (if other than zero to half of sampling rate) no information

Number of channels and channel separation: 1 channel

Acoustical environment: car maintenance hall (car diagnosis)

· Annotation (BAS Partitur Format (BPF) files):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): given

Distinction of homographs that are not homophones: n.a.

Character set used in annotations: ok

Any other language dependent information as abbreviations etc: given

Annotation manual, guidelines, instructions: given

Description of quality assurance procedures: not given

Selection of annotators: not given

Training of annotators: not given

Annotation tools used: n.a.

· Lexicon:

Format: ok

Text-to-phoneme procedure: ok

Explanation or reference to the phoneme set: ok. (sampa.txt)

Phonological or higher order phenomena accounted for in the phonemic transcriptions: ok

· Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: n.a.

· Others:

Any other essential language-dependent information or convention: n.a.

Indication of how many files were double-checked by the producer together with percentage of detected errors: not given

Status of documentation: acceptable

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal files: ok

Completeness of meta data files: ok

Completeness of annotation files: ok.

Correctness of file names: ok.

Empty files: none

Status of signal, annotation and meta data files: ok

Cross checks of meta information: ok

Cross checks of summary listings: ok

Annotation and lexicon contents: ok
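
The completeness and empty-file checks listed above could be automated along these lines. This is a minimal sketch and not the tool used for this validation: it assumes that every .nis signal file should have an annotation file with the same base name and the extension .par, and it takes a plain mapping of path to file size (both the function name and the example paths are hypothetical).

```python
from pathlib import PurePosixPath


def completeness_report(file_sizes: dict, signal_ext: str = ".nis",
                        annot_ext: str = ".par") -> dict:
    """Report unpaired signal/annotation files and empty files.

    file_sizes maps a relative path to its size in bytes.
    """
    def stems(ext: str) -> set:
        # Path without extension, e.g. data/s1/a.nis -> data/s1/a
        return {
            str(PurePosixPath(p).with_suffix(""))
            for p in file_sizes
            if p.endswith(ext)
        }

    signals = stems(signal_ext)
    annotations = stems(annot_ext)
    return {
        "missing_annotation": sorted(signals - annotations),
        "missing_signal": sorted(annotations - signals),
        "empty": sorted(p for p, size in file_sizes.items() if size == 0),
    }


if __name__ == "__main__":
    # Hypothetical listing, not taken from the corpus.
    listing = {"data/s1/a.nis": 100, "data/s1/a.par": 10, "data/s1/b.nis": 0}
    print(completeness_report(listing))
```

On a corpus that passes the checks, all three lists in the report would be empty.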

III.) Manual Validation

For 5% of the 'usable' data, the BAS files and the SAM annotation files were compared against each other. 10.5% of the data contained errors (42 errors out of 400). In some par files the various noise classes were not marked.
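
The error rate quoted above follows directly from the two counts. A short worked check, with a hypothetical helper name:

```python
def error_rate_percent(errors: int, checked: int) -> float:
    """Percentage of checked items that contained errors."""
    if checked <= 0:
        raise ValueError("checked must be positive")
    return 100 * errors / checked


if __name__ == "__main__":
    # Counts taken from the manual validation result above.
    print(f"{error_rate_percent(42, 400):.1f}%")
```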

IV.) Other Relevant Observations

none

V.) Comments for Improvement

The revalidation was able to repair some data (par files, lexicon). The errors found in the manual validation could not be repaired.

VI.) Result

The corpus is ok. The corpus is well documented and no data or documentation files are missing.