Authors |
Florian Schiel, Katerina Louka |
Affiliation |
BAS Bayerisches Archiv für Sprachsignale |
Postal address |
Schellingstr. 3 |
|
schiel@phonetik.uni-muenchen.de |
Telephone |
+49-89-2180-2758 |
Fax |
+49-89-2800362 |
Corpus Version |
1.0 |
Date |
22.07.2004 |
Status |
final |
Comment |
|
Validation Guidelines |
Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003,
www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook |
The speech corpus ERBA has been validated against
general principles of good practice. The validation covered
completeness, formal checks and manual checks of selected
subsamples. Note that missing data reduces the usability of a corpus
and may cause problems when the corpus is used in further
applications.
This document
summarizes the results of an in-house validation of the speech corpus
ERBA made in the year 2004 within the project 'BITS' by the
This corpus is the
re-distribution of the original ERBA corpus that was first distributed
in 1992. It contains a collection of over 10,000 different utterances
in a specific task domain (train inquiries) from over 100 speakers. All
recordings were made in a quiet office room using a close-talking
microphone.
The corpus is
divided into two parts: the part called 'ERBA' contains the bulk
of the data. The part called 'TEST'
contains 100 utterances disjoint from 'ERBA' but from the same
task domain.
The General
Documentation directory contains the following documentation files for
the ERBA corpus which can be found under docu/:
README |
file describing the organisation
of the docu-directory and the speech data |
INFO |
general information on ERBA |
RECORD |
recording information |
SPEAKERS |
speaker information |
SESSIONS |
session information |
SIZE |
size info |
STATS |
statistical info |
README | information about the media and
the validation of ERBA and a short description of the corpus |
BASEVAL |
information on the validation
of the corpus |
README |
short overview of the structure
of TEST |
RECORD |
recording information |
SPEAKERS |
speaker information |
SESSIONS |
session information |
SIZE |
size info |
· Administrative Information:
Validating person: n. a.
Date of validation: n. a.
Contact for
requests regarding the corpus: ok
Number and type of
media: 4 CD-ROMs ok
Content of each
medium: total size 1818 MB
Copyright statement
and intellectual property rights (IPR): ok
· Technical information:
Layout of media: Information about file system type and
directory structure:
4 CD-ROMs with ISO 9660 format
File nomenclature: Explanation of used codes (no white space in
file names!):
<speaker-id><session-number><sentence> ok
Formats of signals
and annotation files: If non standard
formats are used it is common to give a full description or to convert
into a standard format: ok, raw signal file
Coding: raw
Compression: n. a.
Sampling rate: 16 kHz ok
Valid bits per
sample: (values other than 8, 16, 24
should be reported): 16 bit, ok.
Used bytes per
sample: no information
Multiplexed
signals: (exact
de-multiplexing algorithm;
tools) n.a.
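Because the signal files are headerless raw data, their duration can only be derived from the file size and the format parameters listed above. The following is a minimal sketch, not part of the original validation: it assumes 2 bytes per sample (the report gives no information on used bytes per sample) and a single channel, as stated under the recording conditions. The function name `raw_duration_seconds` is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # sampling rate stated in this report
BYTES_PER_SAMPLE = 2     # ASSUMPTION: 16 valid bits stored in 2 bytes
CHANNELS = 1             # one channel, per the recording conditions


def raw_duration_seconds(num_bytes,
                         sample_rate=SAMPLE_RATE_HZ,
                         bytes_per_sample=BYTES_PER_SAMPLE,
                         channels=CHANNELS):
    """Return the duration in seconds of a headerless raw signal file.

    num_bytes is the file size in bytes, e.g. from os.path.getsize().
    """
    frame_size = bytes_per_sample * channels
    if num_bytes % frame_size != 0:
        # A size that is not a whole number of frames hints at a header,
        # a truncated file, or a wrong format assumption.
        raise ValueError("file size is not a multiple of the frame size")
    return num_bytes / frame_size / sample_rate
```

A file size that raises the error above would be a candidate for a closer formal check, since it contradicts the assumed format.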
Database contents:
Clearly stated
purpose of the recordings: ok. ("info"
file)
Speech type(s): (multi-party conversations, human-human dialogues,
read sentences, connected and/or isolated digits, isolated words etc.)
ok
Instruction to
speakers in full copy: no
information
Linguistic contents of prompted speech:
Specifications of
the individual text items: ok
("info" file)
Specification for
the prompt sheet design or specification of the design of the speech
prompts: n.a.
Example prompt
sheet or example sound file from the speech prompting: n.a.
Linguistic contents of non-prompted speech:
Multi-party:(number of speakers, topics discussed, type of setting -
formal/informal) n.a.
Human-human
dialogues: (type of dialogues, e.g.
problem solving, information seeking, chat etc., relation between
speakers, topic(s) discussed, type of setting, scenarios) n.a.
Human-machine
dialogues: (domain(s), topic(s),
dialogues strategy followed by the machine, e.g. system driven, mixed
initiative, type of system, e.g. test, operational service,
Wizard-of-Oz) n.a.
Speaker information:
Speaker recruitment
strategies: information in
the "speakers" file
Number of speakers: 101 (40 female and 61 male speakers)
ok
Distribution of speakers over sex, age, dialect regions: ok,
("speakers" file)
Description/definition of
dialect regions: ok ("speakers"
file)
Recording platform and recording conditions:
Recording platform: ok
Position and type
of microphones: a close talking microphone
- Company name and type id: Shure SM 10A
- Electret, dynamic, condenser: n.a.
- Directional properties: n.a.
- Mounting: n.a.
Position of speakers: (distance to microphone) ok
Bandwidth: (if other than zero to half of sampling rate)
100-6756 Hz
Number of channels
and channel separation: 1 channel
Acoustical
environment:
ordinary quiet office surrounding
Orthographic Transcriptions:
Unambiguous
spelling standard used in annotations: ok
Labeling symbols: n.a.
List of
non-standard spellings (dialectal variation, names etc.): given (README)
Distinction of
homographs which are no homophones: n.a.
Character set used
in annotations: ok, ASCII
Any other language
dependent information as abbreviations etc:
given (README)
Annotation manual,
guidelines, instructions: ok (README)
Description of
quality assurance procedures: not given
Selection of
annotators: not given
Training of
annotators: not given
Annotation
tools used:
not given
Lexicon:
Format: ok
Text-to-phoneme procedure: ok
Explanation or
reference to the phoneme set: not given
Phonological or
higher order phenomena accounted for in the phonemic transcriptions: n.a.
Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones,
syllables,...): ok ("stats" file)
Word frequency
table: n.a.
Others:
Any other essential
language-dependent information or convention: n.a.
Indication of how
many files were double-checked by the producer together with percentage
of detected errors: ok (BASEVAL)
Status
of documentation: acceptable
The following list contains all validation steps
with the methodology and results.
In the "sessions" file the female speakers are
tagged both with "f" and with "w". The same error also occurred in the
files "/test/docu/speakers" and "/test/docu/sessions".
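Users who count speakers by sex from these files may want to map both tags onto one value first. The sketch below is an illustration only: it assumes that "w" stands for German "weiblich" (female) and that male speakers are tagged "m". It works on a plain list of tags because the column layout of the files is not described here. The names `normalize_sex_tag` and `count_sexes` are hypothetical.

```python
from collections import Counter

FEMALE_TAGS = {"f", "w"}  # ASSUMPTION: "w" is German "weiblich" (female)
MALE_TAGS = {"m"}         # ASSUMPTION: male speakers are tagged "m"


def normalize_sex_tag(tag):
    """Map a raw sex tag onto "f" or "m"; reject anything unexpected."""
    cleaned = tag.strip().lower()
    if cleaned in FEMALE_TAGS:
        return "f"
    if cleaned in MALE_TAGS:
        return "m"
    raise ValueError(f"unexpected sex tag: {tag!r}")


def count_sexes(tags):
    """Count speakers per normalized sex tag."""
    return Counter(normalize_sex_tag(tag) for tag in tags)
```

Raising an error on unknown tags, instead of silently skipping them, keeps any further inconsistency in the meta data visible.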
Completeness of signal files: ok
Completeness of meta data files: ok
Completeness of annotation files: ok.
Correctness of file names: ok.
Empty files:
none
Status of signal, annotation and meta data
files: ok
Cross checks of meta
information: ok
Cross checks of summary
listings: ok
Annotation and lexicon contents: ok
5%
of the data (500 files of the ERBA corpus and 50 files of the TEST
corpus) were checked by comparing the audio files with their
orthographic transcriptions. 1.8% of the checked ERBA files contained
errors (9 errors out of 500). An error is a mismatch between an audio
file and its orthographic transcription. Peculiar background noises (for
example birdsong), buzzing, blowing into the microphone and
microphone touches are not labelled in this corpus. The TEST
corpus did not contain any errors, although the
blowing into the microphone in some files or the background
noise could be irritating. Some utterances
are recorded at a much lower level than others.
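The error rate reported above comes from a sample, so it carries some uncertainty when applied to the whole corpus. The following sketch recomputes the rate from the reported counts (9 errors in 500 checked ERBA files) and adds a Wilson score interval. The interval is not part of the original validation and is given only as an illustration. It assumes the checked files were a random sample, which this report does not state. The function names are hypothetical.

```python
import math


def error_rate(errors, checked):
    """Fraction of checked files whose transcription did not match."""
    if checked <= 0:
        raise ValueError("checked must be positive")
    return errors / checked


def wilson_interval(errors, checked, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is about 95%)."""
    p = error_rate(errors, checked)
    z2 = z * z
    denom = 1 + z2 / checked
    center = (p + z2 / (2 * checked)) / denom
    half = z * math.sqrt(p * (1 - p) / checked
                         + z2 / (4 * checked * checked)) / denom
    return center - half, center + half


rate = error_rate(9, 500)           # counts taken from this report
low, high = wilson_interval(9, 500)
print(f"error rate {rate:.1%}, interval {low:.1%} to {high:.1%}")
```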
The revalidation repaired some of the
data (orthographic transcription files, speakers file).
The corpus is ok: it is well documented
and no data or documentation files are missing.