
Revalidation report for the RVG1 Database

Authors	Florian Schiel, Katerina Louka
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-mail	schiel@phonetik.uni-muenchen.de bas@phonetik.uni-muenchen.de
Telephone	+49-89-2180-2758
Fax	+49-89-2800362
Corpus Version	2.0
Date	26.11.2004
Status	final
Comment
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation results of the RVG1 Corpus:

Summary

The RVG1 speech corpus has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. Missing data reduce the usability of the corpus, and problems may occur when it is used for other applications.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus RVG1, carried out in 2004 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The acronym RVG1 stands for Regional Variants of German 1.

The corpus consists of single digits, connected digits, phone numbers, phonetically balanced sentences, computer command phrases and spontaneous speech.

I.) Validation of Documentation

The General Documentation directory (doc/) contains the following documentation files for the RVG1 corpus:

README	general documentation
SPRK_ATT.TXT	speaker information
PROMPTS	directory with prompt lists
TRLMAN.PS	description of Verbmobil II transliteration format
VMTRLEX2D.HTML	description of Verbmobil II transliteration format (German)
RVG1_TRL.LIST	word list spontaneous speech
RVG1_TRL.LEX	canonic pronunciation in SAM-PA spontaneous speech
RVG1_READ.LIST	word list read speech
RVG1_READ.LEX	canonic pronunciation in SAM-PA read speech
PARDOC	documentation of the BAS Partitur Format
SPRK_DIR.TXT	information about the location of the data of each speaker

The main directory of each volume contains the following files and directories:

README	general documentation
A-I	speech data of dialect regions a-i
DOC	General Documentation directory
SOFTW	BAS standard software package
TRL	Transliteration Files to spontaneous monologues
PAR	BAS Partitur Files for all utterances (BPF) (only on volumes RVG1_LQ_19 and RVG1_HQ_12)
LABELS	original validation files (only on volumes RVG1_LQ_20 and RVG1_HQ_12)

Administrative Information:

Validating person: n. a.

Date of validation: n. a.

Contact for requests regarding the corpus: ok?

Number and type of media: 32 CDs or 5 DVDs ok

Content of each medium: ~ 600 - 650 MB ok

Copyright statement and intellectual property rights (IPR): ok

Technical information:

Layout of media: Information about file system type and directory structure:
32 ISO 9660 volumes or DVD-5

File nomenclature: Explanation of used codes (no white space in file names!):
<prompt class><type of file><item reference number> .NIS ok

Formats of signals and annotation files: If non-standard formats are used, it is common to give a full description or to convert into a standard format:

The signal file format is NIST SPHERE. ok

Compression: Only widely supported compression formats such as zip or gzip should be used.
n. a.

Sampling rate: 22050 Hz (001 to 036 were recorded at 11025 Hz) ok

Valid bits per sample: (values other than 8, 16 or 24 should be reported): 16 bits/samp ok

Used bytes per sample: 2 bytes/samp ok

Multiplexed signals: (exact de-multiplexing algorithm; tools)
n. a.
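
The technical values listed above can be checked automatically against the header of each signal file. The following is a minimal sketch, not the procedure used in this validation. It assumes that the files carry a standard NIST SPHERE header (magic line NIST_1A, header size on the second line, then "name type value" fields terminated by end_head) and that the standard field names sample_rate, channel_count, sample_n_bytes and sample_sig_bits are present. The function names and the synthetic demo header are hypothetical.

```python
import io

# Expected values taken from the technical information above.
# For the recordings made at 11025 Hz, pass a different dict.
EXPECTED = {
    "sample_rate": 22050,
    "channel_count": 1,
    "sample_n_bytes": 2,
    "sample_sig_bits": 16,
}


def read_sphere_header(stream):
    """Parse the ASCII header of a NIST SPHERE file into a dict."""
    magic = stream.readline().strip()
    if magic != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(stream.readline().strip())
    stream.seek(0)
    raw = stream.read(header_size).decode("ascii", errors="replace")
    fields = {}
    for line in raw.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip padding and comment lines
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # string fields such as "-s3 pcm"
            fields[name] = value
    return fields


def check_header(fields, expected=EXPECTED):
    """Return a list of (name, expected, found) for every mismatch."""
    return [(k, v, fields.get(k)) for k, v in expected.items()
            if fields.get(k) != v]


# Synthetic header for demonstration only (not taken from the corpus).
demo = (b"NIST_1A\n   1024\n"
        b"channel_count -i 1\nsample_rate -i 22050\n"
        b"sample_n_bytes -i 2\nsample_sig_bits -i 16\n"
        b"sample_coding -s3 pcm\nend_head\n").ljust(1024, b" ")
fields = read_sphere_header(io.BytesIO(demo))
print(check_header(fields))
```

For a real file, the stream would be opened in binary mode, e.g. open(path, "rb"). An empty list means that all checked header fields match the documented values.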

Database contents:

Clearly stated purpose of the recordings:
No information, not important for this corpus

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok

Instruction to speakers in full copy: no information, not ok

Linguistic contents of prompted speech:

Specifications of the individual text items: ok

Specification for the prompt sheet design or specification of the design of the speech prompts: not given, not ok

Example prompt sheet or example sound file from the speech prompting: not given, not ok

Linguistic contents of non-prompted speech:

Multi-party: (number of speakers, topics discussed, type of setting - formal/informal) n. a.

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n. a.

Human-machine dialogues: (domain(s), topic(s), dialogue strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n. a.

Speaker information:

Speaker recruitment strategies: No information, not important for this corpus

Number of speakers: 501 (282 male and 219 female speakers) ok

Coding: PCM linear ok

Distribution of speakers over sex, age, dialect regions: given in the file "sprk_att.txt" and in the file "dialects.txt" ok
Description/definition of dialect regions: given in the README file

Recording platform and recording conditions:

Recording platform: ok

Position and type of microphones:
- Company name and type id: Sennheiser HD 410, Sennheiser MD 441 U, Telex (Soundblaster), Talk Back (AT&T).
- Electret, dynamic, condenser: no information
- Directional properties:
Sennheiser HD 410 <=> microphone 1 inch to the left and 1 inch down from left mouth corner
The desktop microphones Sennheiser MD 441 U, Telex (Soundblaster) and Talk Back (AT&T) are placed as follows:
                 /
                /
               +    left loudspeaker
Speaker           __
           ___   | |
          | K | |__|            ____
          | E |                 |    \_____
-         | Y |                 |          |
)_\       | B | #---O          | DISPLAY |
) /       | O |                 |          |
-         | A |                 |     _____|
          | R |   __            |____/
          |_D_| |* |
                 |__|

                    right loudspeaker

          * Talk Back     # Telex       + Sennheiser
            Microphone      Microphone    MD 441 U

- Mounting: Information given in the README file

Position of speakers: (distance to microphone) 2-4 feet to the screen, 1-2 feet from the desktop microphones.

Bandwidth: (if other than zero to half of sampling rate) ok

Number of channels and channel separation: mono signal file, ok

Acoustical environment: usual noise in office environment, ok

Annotation (Transliteration files):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): information given in the files: RVG1_trl.lex, RVG1_trl.list

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: n.a.

Any other language dependent information as abbreviations etc: ok, see the files above

Annotation manual, guidelines, instructions: transliteration of spontaneous speech according to the Verbmobil II format (http://www.phonetik.uni-muenchen.de/VMtrlex2d.html, or TRLMAN.ps) ok

Description of quality assurance procedures: not given

Selection of annotators: not given

Training of annotators: not given

Annotation tools used: not given

Annotation (BAS Partitur Format Files):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: ok

List of non-standard spellings (dialectal variation, names etc.): ok

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: ok

Any other language dependent information as abbreviations etc: given

Annotation manual, guidelines, instructions: ok - information given under: /pardoc/

Description of quality assurance procedures: not given

Selection of annotators: not given

Training of annotators: not given

Annotation tools used: given

Lexicon:

Format: ok

Text-to-phoneme procedure: ok

Explanation or reference to the phoneme set: an indirect reference in the file RVG1_read.lex

Phonological or higher order phenomena accounted in the phonemic transcriptions: ok

Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: n.a.

Others:

Any other essential language-dependent information or convention: given.

Indication of how many files were double-checked by the producer together with percentage of detected errors: given, information in the README file

Status of documentation: acceptable

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal files:

Spontaneous files of the following speakers are missing:

           - 011
           - 146
           - 162
           - 283
           - 316

Two phone-number-with-area-code files of speaker 352 are missing.

Completeness of meta data files: ok

Completeness of annotation files:

The transliteration files of the following speakers are missing:

          - 145
          - 146
          - 316

The par files of the following speakers are missing or incomplete:

          - 145: No spontaneous speech par file
          - 146: No spontaneous speech par file
          - 316: Some phone number with area code par files are missing, no spontaneous speech par file
          - 330: Only the spontaneous speech par file exists
          - 331: Only the spontaneous speech par file exists
          - 332: Only the spontaneous speech par file exists
          - 352: Phone number with area code files are missing (2)

The following validation files are missing:

          - There are no validation files for the speakers 330, 331 and 332
          - 352: 2 phone number with area code validation files are missing
          - 146: No spontaneous speech validation file
          - 316: 2 phone number with area code validation files are missing, no spontaneous speech validation file
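
A completeness check of this kind amounts to a set difference between the file types expected for each speaker and the file types actually found. The following is a minimal sketch under stated assumptions: the inventory is assumed to have been collected beforehand (for example by walking the volume directories), and the function name, the file type labels and the speaker IDs in the example are invented placeholders, not corpus data.

```python
def missing_by_speaker(expected_speakers, required_kinds, inventory):
    """Return {speaker: sorted list of missing file kinds}.

    inventory maps a speaker ID to the set of file kinds found for it.
    Speakers with a complete set of files do not appear in the result.
    """
    report = {}
    for spk in sorted(expected_speakers):
        missing = set(required_kinds) - inventory.get(spk, set())
        if missing:
            report[spk] = sorted(missing)
    return report


# Toy inventory with placeholder speaker IDs (not actual RVG1 data).
inventory = {
    "s01": {"signal", "trl", "par"},
    "s02": {"signal"},
}
report = missing_by_speaker(
    ["s01", "s02", "s03"], ["signal", "trl", "par"], inventory
)
for spk, kinds in report.items():
    print(spk, "missing:", ", ".join(kinds))
```

Speakers that are absent from the inventory altogether are reported with every required file type missing.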

Correctness of file names: ok.

Empty files: none

Status of signal, annotation and meta data files: ok

Cross checks of meta information: ok

Cross checks of summary listings: ok

Annotation and lexicon contents: The following words are annotated differently in the par files than in the lexicon files:

QasthmanfEl@

zi:m@ns$a$g

gnU$tskOmpi:l6s

fo:to$ts$ts

vIndaUs$n$t

Q'aInts

Su:l@b@Ru:fR@zotsializi:RUN$@$f
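
A cross check between par files and lexicon can be sketched as follows. This is an illustration only, not the tool used for this report. It assumes that the par files contain ORT and KAN tiers with lines of the form "TIER: word-index value" and that the lexicon has already been loaded into a dictionary from orthography to canonical pronunciation; the actual layout of the RVG1 lexicon files is not described here, so loading it is left out. Function names and the example entries are hypothetical.

```python
def parse_ort_kan(bpf_text):
    """Collect (orthography, canonical pronunciation) pairs from BPF text."""
    ort, kan = {}, {}
    for line in bpf_text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) != 3:
            continue
        tier, idx, value = parts
        if tier == "ORT:":
            ort[int(idx)] = value.strip()
        elif tier == "KAN:":
            kan[int(idx)] = value.strip()
    # Pair the tiers by word index; skip words lacking a KAN entry.
    return [(ort[i], kan[i]) for i in sorted(ort) if i in kan]


def lexicon_mismatches(pairs, lexicon):
    """Return (word, par pronunciation, lexicon pronunciation) for each difference."""
    return [(w, p, lexicon[w]) for w, p in pairs
            if w in lexicon and lexicon[w] != p]


# Invented example entries (not taken from the corpus).
bpf_text = """LHD: Partitur 1.2
ORT: 0 hallo
ORT: 1 da
KAN: 0 halo:
KAN: 1 da:
"""
lexicon = {"hallo": "ha'lo:", "da": "da:"}
mismatches = lexicon_mismatches(parse_ort_kan(bpf_text), lexicon)
print(mismatches)
```

Words that occur in the par files but not in the lexicon are ignored by this sketch; a full check would report them separately.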

III.) Manual Validation

5% of the data and annotation files were checked against each other. 2.65% of the data contained errors. Most errors were found in the annotation of the spontaneous speech files.
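
How the 5% subsample was drawn is not documented. The following is a minimal sketch of one possible procedure, assuming a simple random sample with a fixed seed so that the selection can be reproduced, and an error rate computed as the share of checked items with at least one error. The function names, file names and counts are placeholders.

```python
import random


def draw_sample(items, fraction=0.05, seed=0):
    """Draw a reproducible simple random sample of the given fraction."""
    size = max(1, round(len(items) * fraction))
    return random.Random(seed).sample(sorted(items), size)


def error_rate(n_errors, n_checked):
    """Percentage of checked items that contained errors."""
    return 100.0 * n_errors / n_checked


# Placeholder file names, not actual corpus files.
files = [f"file{i:04d}" for i in range(200)]
sample = draw_sample(files)
print(len(sample), "files selected for manual checking")
print(error_rate(2, len(sample)), "% of the checked files contain errors")
```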

IV.) Other Relevant Observations

none

V.) Comments for Improvement

The revalidation was able to repair some data and to add further information about the speakers' data. The errors found in the manual validation could not be repaired.

VI.) Result

The corpus is ok. No documentation files are missing and the corpus is well documented.