Authors | Florian Schiel, Katerina Louka |
Affiliation | BAS Bayerisches Archiv für Sprachsignale, Institut für Phonetik, Universität München |
Postal address | Schellingstr. 3, D 80799 München |
Email | schiel@phonetik.uni-muenchen.de, bas@phonetik.uni-muenchen.de |
Telephone | +49-89-2180-2758 |
Fax | +49-89-2800362 |
Corpus Version | 2.0 |
Date | 26.11.2004 |
Status | final |
Comment | |
Validation Guidelines | Florian Schiel: The Validation of Speech
Corpora, Bastard Verlag, 2003,
www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook |
The speech corpus RVG1 has been validated against general
principles of good practice. The validation covered completeness,
formal checks and manual checks of selected subsamples. Missing
data reduces the usability of the corpus and may cause problems
when the corpus is used for other applications.
This document summarizes the results of an in-house validation of
the speech corpus RVG1, carried out in 2004 within the
project 'BITS' by the Institute of Phonetics of the
Ludwig-Maximilians-University Munich. The acronym RVG1 stands
for Regional Variants of German 1.
The corpus consists of single digits, connected digits, phone
numbers, phonetically balanced sentences, computer command phrases and
spontaneous speech.
The General Documentation directory contains the following
documentation files for the RVG1 corpus which can be found under:
doc/
README | general documentation |
SPRK_ATT.TXT | speaker information |
PROMPTS | directory with prompt lists |
TRLMAN.PS | description of Verbmobil II transliteration format |
VMTRLEX2D.HTML | description of Verbmobil II transliteration format (German) |
RVG1_TRL.LIST | word list spontaneous speech |
RVG1_TRL.LEX | canonic pronunciation in SAM-PA, spontaneous speech |
RVG1_READ.LIST | word list read speech |
RVG1_READ.LEX | canonic pronunciation in SAM-PA, read speech |
PARDOC | documentation of the BAS Partitur Format |
SPRK_DIR.TXT | information about the location of the data of each speaker |
README | general documentation |
A-I | speech data of dialect regions a-i |
DOC | General Documentation directory |
SOFTW | BAS standard software package |
TRL | Transliteration Files to spontaneous monologues |
PAR | BAS Partitur Files to all utterances (BPF) (only on volumes RVG1_LQ_19 and RVG1_HQ_12) |
LABELS | original validation files (only on volumes RVG1_LQ_20 and RVG1_HQ_12) |
Administrative Information:
Validating person: n. a.
Date of validation: n. a.
Contact for requests regarding the corpus: ok?
Number and type of media: 32 CDs or 5 DVDs ok
Content of each medium: ~ 600 - 650 MB ok
Copyright statement and intellectual property rights (IPR): ok
Technical information:
Layout of media: Information about file system type and directory structure: 32 ISO 9660 volumes or DVD-5
File nomenclature: Explanation of used codes (no white space in file names!): <prompt class><type of file><item reference number>.NIS ok
Signal file format: NIST SPHERE ok
Compression: (only widely supported compression such as zip or gzip should be used) n. a.
Sampling rate: 22050 Hz (001 to 036 were recorded with 11025 Hz) ok
Valid bits per sample: (others than 8, 16, 24 should be reported) 16 bits/sample ok
Used bytes per sample: 2 bytes/sample ok
Multiplexed signals: (exact de-multiplexing algorithm; tools) n. a.
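The formal check of the signal parameters listed above can be sketched as follows. This is only an illustration, not the procedure used in the validation: it assumes the standard NIST SPHERE ASCII header layout (magic line `NIST_1A`, header size line, `name -type value` fields, terminated by `end_head`) and the standard field names `sample_rate`, `sample_n_bytes`, `sample_sig_bits` and `channel_count`. The function names `parse_sphere_header` and `check_header` are hypothetical.

```python
# Sketch: parse a NIST SPHERE header and compare it with the values
# reported for RVG1 (22050 Hz or 11025 Hz, 16 bits, 2 bytes, mono).
# Function names are illustrative, not part of any BAS tool.

ALLOWED_RATES = (22050, 11025)


def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header block of a SPHERE file into a dict."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {"header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # malformed field line, ignored in this sketch
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # string fields are typed as -sN
            fields[name] = value
    return fields


def check_header(fields: dict) -> list:
    """Return a list of deviations from the documented signal format."""
    problems = []
    if fields.get("sample_rate") not in ALLOWED_RATES:
        problems.append("unexpected sample_rate: %r" % fields.get("sample_rate"))
    if fields.get("sample_n_bytes") != 2:
        problems.append("unexpected sample_n_bytes: %r" % fields.get("sample_n_bytes"))
    if fields.get("sample_sig_bits", 16) != 16:
        problems.append("unexpected sample_sig_bits: %r" % fields.get("sample_sig_bits"))
    if fields.get("channel_count") != 1:
        problems.append("unexpected channel_count: %r" % fields.get("channel_count"))
    return problems
```

In use, the first header block of each .NIS file would be read in binary mode and passed to `parse_sphere_header`; an empty list from `check_header` means the file agrees with the documented format.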
Database contents:
Clearly stated purpose of the recordings: no information, not important for this corpus
Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok
Instruction to speakers in full copy: no information, not ok
Linguistic contents of prompted speech:
Specifications of the individual text items: ok
Specification for the prompt sheet design or specification of the design of the speech prompts: not given, not ok
Example prompt sheet or example sound file from the speech prompting: not given, not ok
Linguistic contents of non-prompted speech:
Multi-party: (number of speakers, topics discussed, type of setting - formal/informal) n. a.
Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n. a.
Human-machine dialogues: (domain(s), topic(s), dialogue strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n. a.
Speaker information:
Speaker recruitment strategies: no information, not important for this corpus
Number of speakers: 501 (282 male and 219 female) ok
Distribution of speakers over sex, age, dialect regions: given in the file "sprk_att.txt" and in the file "dialects.txt" ok
Description/definition of dialect regions: given in the README file
Recording platform and recording conditions:
Recording platform: ok
Position and type of microphones:
- Company name and type id: Sennheiser HD 410, Sennheiser MD 441 U, Telex (Soundblaster), Talk Back (AT&T).
- Electret, dynamic, condenser: no information
- Directional properties: Sennheiser HD 410 <=> microphone 1 inch to the left and 1 inch down from left mouth corner
- Mounting: information given in the README file
Position of speakers: (distance to microphone) 2-4 feet to the screen, 1-2 feet from the desktop microphones.
Bandwidth: (if other than zero to half of sampling rate) ok
Number of channels and channel separation: mono signal file, ok
Acoustical environment: usual noise in office environment, ok
Annotation (Transliteration files):
Unambiguous spelling standard used in annotations: ok
Labeling symbols: ok
List of non-standard spellings (dialectal variation, names etc.): information given in the files RVG1_trl.lex and RVG1_trl.list
Distinction of homographs which are no homophones: n.a.
Character set used in annotations: n.a.
Any other language dependent information as abbreviations etc: ok, see the files above
Annotation manual, guidelines, instructions: transliteration of spontaneous speech according to the Verbmobil II format (http://www.phonetik.uni-muenchen.de/VMtrlex2d.html, or TRLMAN.ps) ok
Description of quality assurance procedures: not given
Selection of annotators: not given
Training of annotators: not given
Annotation tools used: not given
Annotation (BAS Partitur Format Files):
Unambiguous spelling standard used in annotations: ok
Labeling symbols: ok
List of non-standard spellings (dialectal variation, names etc.): ok
Distinction of homographs which are no homophones: n.a.
Character set used in annotations: ok
Any other language dependent information as abbreviations etc: given
Annotation manual, guidelines, instructions: ok - information given under: /pardoc/
Description of quality assurance procedures: not given
Selection of annotators: not given
Training of annotators: not given
Annotation tools used: given
Lexicon:
Format: ok
Text-to-phoneme procedure: ok
Explanation or reference to the phoneme set: an indirect reference in the file RVG1_read.lex
Phonological or higher order phenomena accounted in the phonemic transcriptions: ok
Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.
Word frequency table: n.a.
Others:
Any other essential language-dependent information or convention: given.
Indication of how many files were double-checked by the producer together with percentage of detected errors: given - information in the README file
Status of documentation: acceptable
The following list contains all validation steps with the
methodology and results.
Completeness of signal files:
Completeness of meta data files: ok
Completeness of annotation files:
Correctness of file names: ok.
Empty files: none
Status of signal, annotation and meta data files: ok
Cross checks of meta information: ok
Cross checks of summary listings: ok
Annotation and lexicon contents: The following words are annotated differently in the PAR files than in the lexicon files:
The revalidation repaired some data and added further
information about the speakers. The errors found in the manual
validation could not be repaired.
The corpus is ok. No documentation files are missing and the
corpus is well documented.
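Two of the formal checks listed among the validation steps (empty files, no white space in file names) are simple enough to sketch. The code below is a minimal illustration under the assumption that a corpus volume is mounted as an ordinary directory tree; it is not the tool used for this validation, and the function names are hypothetical.

```python
import os


def find_empty_files(root):
    """Return the sorted paths of all zero-length files below root."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)


def find_names_with_whitespace(root):
    """Return the sorted paths of all files whose name contains white space."""
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if any(ch.isspace() for ch in name):
                offenders.append(os.path.join(dirpath, name))
    return sorted(offenders)
```

Both functions return an empty list for a volume that passes the check, which corresponds to the results "Empty files: none" and "Correctness of file names: ok" reported above.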