Authors | Florian Schiel, Katerina Louka |
Affiliation | BAS Bayerisches Archiv für Sprachsignale |
Postal address | Schellingstr. 3 |
E-mail | schiel@phonetik.uni-muenchen.de |
Telephone | +49-89-2180-2758 |
Fax | +49-89-2800362 |
Corpus Version | 2.2 |
Date | 08.09.2004 |
Status | final |
Comment | |
Validation Guidelines | Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook |
The speech corpus SI1000P has been validated
against general principles of good practice. The validation covered
completeness, formal checks and manual checks of selected
subsamples. Note that missing data reduces the usability of the corpus
and may cause problems when the corpus is used in
further applications.
This document
summarizes the results of an in-house validation of the speech corpus
SI1000P made in the year 2004 within the project 'BITS' by the
The SI1000P
recordings were made to provide material for high-quality concatenative
speech synthesis. The corpus contains 993 newspaper sentences
read by two professional broadcasting announcers in studio quality,
together with the laryngographic signal and the glottal pulse stream.
Parts of the corpus were labeled and segmented phonemically (SAM-PA)
and prosodically (boundaries and accents).
The general
documentation directory doc/ contains the following documentation files for
the SI1000P corpus:
README | File describing the organisation of the documentation directory and the speech data |
SI1000P.TXT | Spoken texts |
SI1000P.PROMPTS | Original prompts as presented to the speakers |
SI1000P.LEX | Pronunciation dictionary (Extended German SAM-PA) |
EXT_SAM.TXT | Description of Extended German SAM-PA |
PHONDAT.DOC | Prosodic boundary segmentation and accent labeling of speaker AI |
PARDOC | Documentation of BAS Partitur Format (BPF) files |
· Administrative Information:
Validating person: n.a.
Date of validation: n.a.
Contact for requests regarding the corpus: ok
Number and type of media: 4 CD-ROMs ok
Content of each medium: no information
Copyright statement and intellectual property rights (IPR): ok
· Technical information:
Layout of media (information about file system type and directory structure): no information
File nomenclature (explanation of used codes; no white space in file names!): <speaker-id><session-number><.nis/.dat> ok
Formats of signals and annotation files (if non-standard formats are used, it is common to give a full description or to convert into a standard format): ok
Coding: NIST SPHERE
Compression: n.a.
Sampling rate: 16 kHz ok
Valid bits per sample (values other than 8, 16, 24 should be reported): 16 bit ok
Used bytes per sample: no information
Multiplexed signals (exact de-multiplexing algorithm; tools): n.a.
The speech signals and the laryngograph signals were filtered and
downsampled to 16 kHz.
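The formal checks on coding, sampling rate and valid bits per sample can be partly automated by reading the ASCII header of each NIST SPHERE file. The following is a minimal sketch, not the procedure used for this validation. It assumes that the signal files carry the usual SPHERE header fields `sample_rate` and `sample_sig_bits`; the function names `parse_sphere_header` and `check_signal_header` and the synthetic demo header are hypothetical and only serve as illustration.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header of a NIST SPHERE file into a dict.

    A SPHERE header starts with the line 'NIST_1A', followed by a line
    giving the header size in bytes, followed by 'name -type value'
    lines, and is terminated by the line 'end_head'.
    """
    first_lines = data.split(b"\n", 2)
    if len(first_lines) < 3 or first_lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(first_lines[1].strip())
    header = data[:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in header.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # string field, e.g. -s3
            fields[name] = value
    return fields


def check_signal_header(fields: dict, rate: int = 16000, bits: int = 16) -> list:
    """Return a list of deviations from the documented signal format."""
    problems = []
    if fields.get("sample_rate") != rate:
        problems.append(f"sample_rate is {fields.get('sample_rate')}, expected {rate}")
    if fields.get("sample_sig_bits") != bits:
        problems.append(f"sample_sig_bits is {fields.get('sample_sig_bits')}, expected {bits}")
    return problems


# Synthetic header for demonstration only (not taken from the corpus).
demo = (b"NIST_1A\n   1024\n"
        b"sample_rate -i 16000\n"
        b"sample_sig_bits -i 16\n"
        b"sample_n_bytes -i 2\n"
        b"end_head\n").ljust(1024)

print(check_signal_header(parse_sphere_header(demo)))
```

On a real file only the first bytes need to be read, for example with `open(path, "rb").read(1024)`, provided the header size stated in the second line does not exceed that amount. The field `sample_n_bytes`, where present, would also supply the used bytes per sample, which the documentation does not state.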
Database contents:
Clearly stated purpose of the recordings: ok (README file)
Speech type(s) (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.): ok
Instruction to speakers in full copy: indirect information is given in the README file
Linguistic contents of prompted speech:
Specifications of the individual text items: ok ("SI1000P.prompts" file)
Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.
Example prompt sheet or example sound file from the speech prompting: n.a.
Linguistic contents of non-prompted speech:
Multi-party (number of speakers, topics discussed, type of setting - formal/informal): n.a.
Human-human dialogues (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios): n.a.
Human-machine dialogues (domain(s), topic(s), dialogue strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz): n.a.
Speaker information:
Speaker recruitment strategies: n.a.
Number of speakers: 2 male speakers ok
Distribution of speakers over sex, age, dialect regions: no information
Description/definition of dialect regions: n.a.
Recording platform and recording conditions:
Recording platform: echo cancelling studio - ok
Position and type of microphones:
- Company name and type id: Sennheiser MKH20
- Electret, dynamic, condenser: n.a.
- Directional properties: omnidirectional
- Mounting: 30 cm from mouth
Position of speakers (distance to microphone): 30 cm from mouth ok
Bandwidth (if other than zero to half of sampling rate): ok
Number of channels and channel separation: 4 channels
Acoustical environment: echo cancelling studio
Orthographic Transcriptions:
Unambiguous spelling standard used in annotations: ok
Labeling symbols: n.a.
List of non-standard spellings (dialectal variation, names etc.): given (README)
Distinction of homographs that are not homophones: n.a.
Character set used in annotations: ok (German umlauts are coded in LaTeX)
Any other language-dependent information such as abbreviations etc.: given (README)
Annotation manual, guidelines, instructions: ok (EXT_SAM.TXT, PHONDAT.DOC, PARDOC)
Description of quality assurance procedures: not given
Selection of annotators: not given
Training of annotators: not given
Annotation tools used: not given
Lexicon:
Format: ok
Text-to-phoneme procedure: ok
Explanation or reference to the phoneme set: ok (EXT_SAM.TXT)
Phonological or higher order phenomena accounted for in the phonemic transcriptions: n.a.
Statistical information:
Frequency of sub-word units (phonemes, diphones, triphones, syllables, ...): n.a.
Word frequency table: n.a.
Others:
Any other essential language-dependent information or convention: n.a.
Indication of how many files were double-checked by the producer, together with the percentage of detected errors: no information
Status of documentation: acceptable
The following list contains all validation steps
with the methodology and results.
Completeness of signal files: not ok
(error mentioned in the README file)
The following utterance numbers are
missing for both speakers:
These items are not listed in the files SI1000P.txt and SI1000P.prompts.
Completeness of meta data files: ok
Completeness of annotation files: not ok
For both speakers, the following .par files exist although there are no
corresponding signal files:
Correctness of file names: ok
Empty files: none
Status of signal, annotation and meta data files: ok
Cross checks of meta information: ok
Cross checks of summary listings: ok
Annotation and lexicon contents: ok
10%
of the data (100 files of the SI1000P corpus) were checked against the
corresponding partitur files. The error rate in this sample was 4%
(4 errors out of 100 files). More files of the "ai"
speaker were checked so that the prosodic annotation could
also be included in the comparison.
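The completeness checks listed above, in particular the finding that .par files exist without signal files, amount to comparing the base names of signal and annotation files. The following is a minimal sketch of such a cross check and not the tool used for this validation. The function name `cross_check` and the file names in the demo are hypothetical; the sketch assumes only that a signal file and its annotation file share the same base name and differ in the extension.

```python
from pathlib import PurePath


def cross_check(signal_files, annotation_files):
    """Compare signal and annotation files by base name (without extension)."""
    signal_stems = {PurePath(name).stem for name in signal_files}
    annotation_stems = {PurePath(name).stem for name in annotation_files}
    return {
        # annotation files that have no signal file (reported as an error above)
        "annotation_without_signal": sorted(annotation_stems - signal_stems),
        # signal files that have no annotation file
        "signal_without_annotation": sorted(signal_stems - annotation_stems),
    }


# Hypothetical file names for demonstration only (not taken from the corpus).
signals = ["ai0001.nis", "ai0002.nis", "xx0001.nis"]
annotations = ["ai0001.par", "ai0003.par"]

report = cross_check(signals, annotations)
for category, stems in report.items():
    print(category, stems)
```

Since only parts of the corpus were labeled and segmented, entries under `signal_without_annotation` are not necessarily errors, whereas entries under `annotation_without_signal` correspond to the kind of inconsistency reported for the .par files.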
The errors in the partitur files should be
repaired. More information about the
speakers would be useful.
The corpus is ok.