Authors |
Florian Schiel, Katerina Louka |
Affiliation |
BAS Bayerisches Archiv für Sprachsignale |
Postal address |
Schellingstr. 3 |
|
schiel@phonetik.uni-muenchen.de |
Telephone |
+49-89-2180-2758 |
Fax |
+49-89-2800362 |
Corpus Version |
2.1 |
Date |
02.08.2004 |
Status |
final |
Comment |
|
Validation Guidelines |
Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook |
The speech corpus SC2 has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. Note that missing data reduces the usability of a corpus and may cause problems when the corpus is used in further applications.
This document summarizes the results of an in-house validation of the speech corpus SC2 made in the year 2004 within the project 'BITS' by the
The General Documentation directory contains the following documentation files for the SC2 corpus, which can be found under /doc:
README |
file describing the database structure |
sc2_ses.txt |
session information |
sc2_ort_iso.txt |
text prompt corpus (100) in ISO 8859-1 |
sc2_ort_tex.txt |
text prompt corpus (100) in LaTeX |
sc2_pro_iso.txt |
text prompts in order of prompting (400) in ISO 8859-1 |
sc2_pro_tex.txt |
text prompts in order of prompting (400) in LaTeX |
sc2_wordlist_tex.txt |
list of all spoken words (LaTeX) |
sc2_lex_tex.txt |
SAM-PA dictionary of canonical pronunciation (LaTeX) |
sampa.txt |
Extended German SAM-PA |
transkonv_engl |
documentation of the BAS guidelines for proper pronunciation coding in German SAM-PA |
· Administrative Information:
Validating person: n. a.
Date of validation: n. a.
Contact for requests regarding the corpus: ok
Number and type of media: 1 CD-ROM ok
Content of each medium: total size 520 MB
Copyright statement and intellectual property rights (IPR): ok
· Technical information:
Layout of media: Information about file system type and directory structure: 1 CD-ROM in ISO 9660 format
File nomenclature: Explanation of used codes (no white space in file names!): data/<session>/<speaker><hall><mic><utt_id>.nis ok
Formats of signals and annotation files: If non-standard formats are used, it is common to give a full description or to convert into a standard format: ok, raw signal file
Coding: NIST
Compression: n. a.
Sampling rate: 8 kHz ok
Valid bits per sample: (values other than 8, 16 or 24 should be reported): 16 bit, ok
Used bytes per sample: 2 ok
Multiplexed signals: (exact de-multiplexing algorithm; tools) n.a.
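The documented signal parameters (8 kHz, 2 bytes per sample, 1 channel) can be cross-checked against the file headers. The following is a minimal sketch, assuming the .nis files carry a standard NIST SPHERE ASCII header with the usual field names (sample_rate, sample_n_bytes, channel_count); the actual header fields of the SC2 files are not listed in this report, so these names are assumptions. The demo runs on a synthetic header, not on corpus data.

```python
import io

def read_sphere_header(stream):
    """Parse the ASCII header of a NIST SPHERE file into a dict."""
    magic = stream.readline().decode("ascii", errors="replace").strip()
    if magic != "NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    # The second line holds the total header size in bytes (commonly 1024).
    header_size = int(stream.readline().decode("ascii").strip())
    stream.seek(0)
    header = stream.read(header_size).decode("ascii", errors="replace")
    fields = {}
    for line in header.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag, value
        if len(parts) != 3:
            continue
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # string fields are flagged as -sN
            fields[name] = value
    return fields

def check_documented_format(fields):
    """Compare header fields with the values stated in this report."""
    # Field names are assumed; the expected values come from the report.
    expected = {"sample_rate": 8000, "sample_n_bytes": 2, "channel_count": 1}
    return [
        f"{name}: expected {want}, found {fields.get(name)}"
        for name, want in expected.items()
        if fields.get(name) != want
    ]

# Synthetic header for illustration only (not taken from the corpus).
demo_header = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 8000\n"
    b"sample_n_bytes -i 2\n"
    b"channel_count -i 1\n"
    b"end_head\n"
).ljust(1024, b" ")
fields = read_sphere_header(io.BytesIO(demo_header + b"\x00\x00"))
mismatches = check_documented_format(fields)
```

For a real file, the stream would be opened in binary mode, e.g. `open(path, "rb")`; an empty list from `check_documented_format` means the header agrees with the documentation.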
· Database contents:
Clearly stated purpose of the recordings: ok (README)
Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok
Instruction to speakers in full copy: ok
· Linguistic contents of prompted speech:
Specifications of the individual text items: ok (sc2_pro_*.txt)
Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.
Example prompt sheet or example sound file from the speech prompting: n.a.
· Linguistic contents of non-prompted speech:
Multi-party: (number of speakers, topics discussed, type of setting - formal/informal) n.a.
Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n.a.
Human-machine dialogues: (domain(s), topic(s), dialogue strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.
· Speaker information:
Speaker recruitment strategies: indirect information in the README file
Number of speakers: 10 ok
Distribution of speakers over sex, age, dialect regions: ok (each data directory contains a speaker file: vp.txt)
Description/definition of dialect regions: indirectly given through the place of origin (vp.txt)
· Recording platform and recording conditions:
Recording platform: ok
Position and type of microphones: The data was transmitted via a wireless phone (Siemens Gigaset) to an analog phone set connected to the base station of the wireless phone. A laptop was connected to this analog phone set via standard I/O technique.
- Company name and type id: ok
- Electret, dynamic, condenser: n.a.
- Directional properties: n.a.
- Mounting: n.a.
Position of speakers: (distance to microphone) ok
Bandwidth: (if other than zero to half of sampling rate) no information
Number of channels and channel separation: 1 channel
Acoustical environment: a hall used for car diagnosis and maintenance
· Annotation (BAS Partitur Format (BPF) files):
Unambiguous spelling standard used in annotations: ok
Labeling symbols: ok
List of non-standard spellings (dialectal variation, names etc.): given
Distinction of homographs which are not homophones: n.a.
Character set used in annotations: ok
Any other language-dependent information such as abbreviations etc.: given
Annotation manual, guidelines, instructions: given
Description of quality assurance procedures: not given
Selection of annotators: not given
Training of annotators: not given
Annotation tools used: n.a.
· Lexicon:
Format: ok
Text-to-phoneme procedure: ok
Explanation or reference to the phoneme set: ok (SAMPA.TXT)
Phonological or higher order phenomena accounted for in the phonemic transcriptions: ok
· Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables, ...): n.a.
Word frequency table: n.a.
· Others:
Any other essential language-dependent information or convention: n.a.
Indication of how many files were double-checked by the producer together with percentage of detected errors: not given
Status of documentation: acceptable
The following list contains all validation steps with the methodology and results.
Completeness of signal files: ok
Completeness of meta data files: ok
Completeness of annotation files: ok
Correctness of file names: ok
Empty files: none
Status of signal, annotation and meta data files: ok
Cross checks of meta information: ok
Cross checks of summary listings: ok
Annotation and lexicon contents: ok
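Several of the formal checks listed above (white space in file names, empty files, completeness of annotation files) lend themselves to automation. The following is a minimal sketch, not the procedure used in this validation. It assumes that each signal file `<name>.nis` has an annotation file `<name>.par` in the same directory; this pairing, the function name, and the default extensions are illustrative assumptions.

```python
from pathlib import Path

def formal_checks(data_dir, signal_ext=".nis", annot_ext=".par"):
    """Return a list of problems found below data_dir (empty list = ok)."""
    problems = []
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        # The report requires file names without white space.
        if any(ch.isspace() for ch in path.name):
            problems.append(f"white space in file name: {path}")
        # Empty files indicate a failed recording or a copy error.
        if path.stat().st_size == 0:
            problems.append(f"empty file: {path}")
        # Assumed pairing: same base name, annotation extension.
        if path.suffix == signal_ext and not path.with_suffix(annot_ext).exists():
            problems.append(f"no annotation file for: {path}")
    return problems
```

Calling `formal_checks("data")` on the corpus root would return an empty list if all three checks pass under these assumptions.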
For 5% of the 'usable' data, the BAS files and the SAM annotation files were checked against each other. 10.50% of the data contained errors (42 errors out of 400). In some par files the various noise classes were not marked.
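The reported error rate follows directly from the two counts given above. A small sketch of the arithmetic (the function name is illustrative):

```python
def error_rate_percent(errors, checked):
    """Percentage of checked items that contained an error."""
    if checked <= 0:
        raise ValueError("number of checked items must be positive")
    return 100.0 * errors / checked

# Counts taken from the report: 42 errors out of 400.
rate = error_rate_percent(42, 400)
print(f"{rate:.2f}%")  # 10.50%
```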
none
The revalidation was able to repair some data (par files, lexicon). The errors found by the manual validation could not be repaired.
The corpus is ok. The corpus is well documented and no data or documentation files are missing.