Revalidation Report for the SPINA Database
Authors: Florian Schiel, Angela Baumann, Katerina Louka
Affiliation: BAS Bayerisches Archiv für Sprachsignale, Institut für Phonetik, Universität München
Postal address: Schellingstr. 3, D 80799 München
E-Mail: schiel@phonetik.uni-muenchen.de, bas@phonetik.uni-muenchen.de
Telephone: +49-89-2180-2758
Fax: +49-89-2800362
Corpus Version: 2.3
Date: 27.05.2004
Status: final
Comment:
Validation Guidelines: Florian Schiel: The Validation of Speech Corpora, <Verlag>, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook
Validation Results of the SPINA Corpus
Summary
The speech corpus of SPINA has been validated against general
principles of good practice. The validation covered completeness,
formal checks and manual checks of selected subsamples.
The missing data reduces the usability of the corpus and may cause
problems when the corpus is used for other applications.
Introduction and Corpus Description
This document summarizes the results of an in-house validation of the
speech corpus SPINA, carried out in 2003 within the project 'BITS' by
the Institute of Phonetics of the Ludwig-Maximilians-University Munich.
The corpus contains read speech of 22 different speakers (6 male, 16
female).
The corpus consists of 10 robot command sentences and 62 robot command
words.
Each speaker read the whole corpus 5 times, except speaker 03 who read
the
sentence corpus 16 times and the word corpus 51 times.
The speakers were recorded at two different sites in Germany
(University of Goettingen: speakers 25, 26, 27, 28, 29, 30, 75, 76, 77,
78, 79, 80, 81, 82; University of Bochum: speakers 01, 02, 03, 04, 05,
50, 51, 52). The language is German.
The corpus contains a total of 10810 recorded utterances.
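As a rough cross-check of the figures above, the nominal number of utterances implied by the recording design can be computed directly. The sketch below assumes that every speaker other than 03 read all 72 items exactly five times and that no recordings are missing or excluded; the constant and function names are illustrative only.

```python
# Figures taken from the corpus description above.
SPEAKERS = 22
SENTENCES = 10
WORDS = 62
REPETITIONS = 5               # readings per regular speaker
SPK03_SENTENCE_READS = 16     # speaker 03, sentence corpus
SPK03_WORD_READS = 51         # speaker 03, word corpus
REPORTED_TOTAL = 10810        # total stated in the report


def nominal_utterances() -> int:
    """Utterance count implied by a literal reading of the design."""
    regular = (SPEAKERS - 1) * REPETITIONS * (SENTENCES + WORDS)
    speaker_03 = SPK03_SENTENCE_READS * SENTENCES + SPK03_WORD_READS * WORDS
    return regular + speaker_03


if __name__ == "__main__":
    nominal = nominal_utterances()
    print(nominal, nominal - REPORTED_TOTAL)
```

Under these assumptions the nominal count is 10882, which is 72 utterances (the size of one full reading of all items) more than the reported 10810. The report does not state a reason for the difference.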
I.) Validation of Documentation
The General Documentation directory contains the following
documentation files for the SPINA corpus, which can be found
under: /doc/...
README: general documentation
BASEVAL: BAS validation protocol of the first validation (1997)
SPI_SPRK.TXT: speaker information
SPI_SENT.TXT: orthography of the sentences
SPI_WORD.TXT: orthography of the words
SAMPA.TXT: table containing the SAM-PA for German
CAN_LEXICON.TXT: canonical SAM-PA annotations of the words
The following required contents of the documentation have been checked:
- Administrative Information:
Validating person: n.a.
Date of validation: n.a.
Contact for requests regarding the corpus: ok.
Number and type of media: 1 volume on CDROM. ok.
Content of each medium: a total of
10810 recorded utterances, total size 266361 KB. ok.
Copyright statement and intellectual property rights (IPR): ok.
- Technical information:
Layout of media: information about file system type and
directory structure:
The volume is stored on a photo CDROM with ISO 9660 format. ok.
Root directory structure not given: repairable.
File nomenclature: explanation of used codes (no white space
in file names!):
<utterance key><speaker id><# of repetition>.<ext>. ok.
Formats of signals and annotation files: if non-standard
formats are used, it is common to give a full description or to
convert to a standard format: 'although the corpus does not contain any
PhonDat header files, you may use HEAR_RAW to play the signal files,
RAW2PHONDAT to create PhonDat compatible files from the raw signal
files and PHO2NIST to create NIST SPHERE compatible files'. No further
information about the signal file format given. Not ok.
The files for the word segmentation have the extension .WLB, the files
for the phonological segmentation have the extension .PLB. ok.
Coding: Not given. Not ok.
Compression: only widely supported compression schemes such as zip or
gzip should be used: uncompressed data. ok.
Sampling rate: 16 kHz. The speech data were digitally filtered
to 8 kHz cutoff
frequency and then downsampled to 16 kHz. ok.
Valid bits per sample: 16 bit (?). ok.
Used bytes per sample: 2. ok.
Multiplexed signals: n.a.
- Database contents:
Clearly stated purpose of the recordings: robot commands. ok.
Speech type(s): read sentences. Prompting method not given:
repairable.
Instruction to speakers in full copy: only a verbal
instruction: 'read carefully but fluently'. ok.
- Linguistic contents of prompted speech:
Specifications of the individual text
items: (see PD2_ort.txt). This file contains all text
prompts that have been spoken together with the utterance id. Umlaut
coding not given: repairable.
Specification for the prompt sheet design or specification of the
design of the speech prompts:
screen-prompted, read from a train query task. No specification of
prompt design given. Not ok.
Example prompt sheet or example sound file from the speech prompting:
n.a.
- Speaker information:
Speaker recruitment strategies: no information. Not ok.
Number of speakers: 16 speakers (6f/10m). The number of male and
female speakers is only indirectly given in PD2_sprk.txt: repairable.
Distribution of speakers over sex, age, dialect regions: no
distribution over age and dialects.
Only a simple classification into 'old' (A) and 'young' (J) and the
recording site (A=Kiel, N=Bonn, D=Munich) is given (see PD2_sprk.txt).
Not ok.
Description/definition of dialect regions: not given. Not ok.
- Recording platform and recording conditions:
Recording platform: not given. Not ok.
Position and type of microphones:
various Sennheiser microphones, e.g. MKH 20 P48.
No further specification given. Not ok.
- Electret, dynamic, condenser: not given. Not ok.
- Directional properties: not given. Not ok.
- Mounting: not given. Not ok.
Position of speakers: distance to microphone not given. Not
ok.
Bandwidth of microphones: not given. Not ok.
Number of channels and channel separation: n.a.
Acoustical environment: quiet studio conditions, but no further
information about the individual conditions in the 3 recording rooms.
Not ok.
- Annotation 1 ('Manual phonological segmentation', *.plb files):
Labeling symbols: SAM-PA. ok.
Annotation manual, guidelines, instructions: not given. Not ok.
Description of quality assurance procedures: not given. Not ok.
Selection of annotators: not given. Not ok.
Training of annotators: not given. Not ok.
Annotation tools used: not given (probably XWaves+). Not ok.
- Annotation 2 ('Word segmentation', *.wlb files):
Unambiguous spelling standard used in
annotations: not given, assuming Duden standard. acceptable
Labeling symbols: simple orthography. ok.
List of non-standard spellings: not given. Not ok.
Character set used in all annotations: not given. Not ok.
Any other language-dependent information such as abbreviations etc.:
not given.
Annotation manual, guidelines, instructions: not given. Not ok.
Description of quality assurance procedures: not given. Not ok.
Selection of annotators: not given. Not ok.
Annotation tools used: not given. Not ok.
- Lexicon: not provided; 'can_lexicon.txt' added for this validation.
Format: simple two-column list, 7-bit ASCII, Umlauts in LaTeX, phonetic coding in SAM-PA as used in the phonological segmentation (no glottal stops!).
Text-to-phoneme procedure: manual by experienced phonetician. ok.
Explanation or reference to the phoneme set: SAM-PA. ok.
Phonological or higher order phenomena accounted for in the phonemic
transcriptions: none. ok.
- Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones,
syllables,...): not given. ok.
Word frequency table: word counts in lexicon. Acceptable.
- Others:
Any other essential language-dependent information or convention:
not given. ok.
Indication of how many files were double-checked by the producer
together with percentage of detected errors: not given. Not ok.
Status documentation: not acceptable.
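The file naming scheme checked in the technical information above, <utterance key><speaker id><# of repetition>.<ext>, can be illustrated with a small parser. This is only a sketch: the field widths (4 characters for the key, 2 digits each for speaker and repetition) are an assumption inferred from file names quoted in this report, such as arm_0301.plb, and are not taken from the corpus README. The names SpinaName and parse_name are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class SpinaName(NamedTuple):
    key: str         # assumed 4-character utterance key, e.g. "arm_"
    speaker: str     # assumed 2-digit speaker id, e.g. "03"
    repetition: str  # assumed 2-digit repetition number, e.g. "01"
    ext: str         # extension without the dot, lower-cased


# Assumed layout: 4 key characters + 2 speaker digits + 2 repetition digits.
# White space is not allowed anywhere in the name.
_NAME_RE = re.compile(r"([a-z0-9_]{4})(\d{2})(\d{2})\.([a-z0-9]+)", re.IGNORECASE)


def parse_name(filename: str) -> Optional[SpinaName]:
    """Split a file name into its fields, or return None if it does not fit."""
    match = _NAME_RE.fullmatch(filename)
    if match is None:
        return None
    key, speaker, repetition, ext = match.groups()
    return SpinaName(key.lower(), speaker, repetition, ext.lower())
```

A name such as arm_0301.plb would be split into key "arm_", speaker "03" and repetition "01"; a name containing white space is rejected.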
II.) Automatic Validation
The following list contains all validation
steps with the methodology and results.
Completeness of files:
- Signal files: ok
- Annotation files: one file missing (Subdirectory: 50wlb ->
s0035001.wlb)
- Meta data files: ok
Correctness of file names: ok
2 errors in the file names in SPI_word.txt (grei instead of gref,
zw"o instead of zwoe)
No mismatches between signal and annotation files.
Empty files: no
Signal Format: ok
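The completeness and mismatch checks listed above amount to comparing the base names of signal and annotation files. The following is a minimal sketch of such a comparison, not the tool actually used for this validation; the function names and the ".raw" extension in the usage example are hypothetical, since the report does not name the signal file extension.

```python
from pathlib import PurePath


def stems(names):
    """Return the lower-cased base names (without extension) of the given files."""
    return {PurePath(name).stem.lower() for name in names}


def unmatched(signal_names, annotation_names):
    """Return (signals without annotation, annotations without signal)."""
    signals = stems(signal_names)
    annotations = stems(annotation_names)
    return sorted(signals - annotations), sorted(annotations - signals)


if __name__ == "__main__":
    # ".raw" is a placeholder extension used only for this illustration.
    missing, orphaned = unmatched(
        ["s0035001.raw", "s0015100.raw"],
        ["s0015100.wlb"],
    )
    print(missing, orphaned)
```

In the usage example the first list contains the signal files that lack an annotation file and the second list the annotation files that lack a signal file.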
In some plb files the last segment exceeds the total length
of the file. However, the remaining segments were
correct. Many of the exceeding segments are labelled as "-" (pause). In
the following files even the last non-pause segment exceeds
the total length:
arm_0301.plb, bewe0301.plb, brin0301.plb, das_0301.plb, den_0301.plb, die_0301.plb, dreh0301.plb, drei0301.plb,
eins0301.plb, elf_0301.plb, entf0301.plb, fahr0301.plb, flas0301.plb, fuen0301.plb, gege0301.plb, glas0301.plb,
grad0301.plb, gref0301.plb, grei0301.plb, hale0301.plb, halt0301.plb, hand0301.plb, hebe0301.plb, hebn0301.plb,
hint0301.plb, in__0301.plb, jetz0301.plb, komm0301.plb, lang0301.plb, link0301.plb, losl0301.plb, nach0301.plb,
neun0301.plb, null0301.plb, oben0301.plb, ober0301.plb, obje0301.plb, rech0301.plb, schn0301.plb, sech0301.plb,
senk0301.plb, tass0301.plb, teil0301.plb, um__0301.plb, und_0301.plb, unte0301.plb, vier0301.plb, vor_0301.plb,
vorn0301.plb, vorw0301.plb, wart0301.plb, weit0301.plb, wend0301.plb, wink0301.plb, zehn0301.plb, zur_0301.plb,
zuru0301.plb, zwei0301.plb, zwoe0301.plb
The file s0015100.wlb has a last segment that exceeds the total length
of the file.
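A check of this kind can be sketched as follows, assuming headerless single-channel signal files with 2 bytes per sample (as stated in the documentation check) and segment boundaries already available as sample indices in time order. Parsing the .plb and .wlb files themselves is left out because their format is not described in this report; the function names and the (label, start, end) tuple layout are hypothetical.

```python
BYTES_PER_SAMPLE = 2  # from the report: 2 used bytes per sample


def signal_length_samples(size_bytes, header_bytes=0):
    """Number of samples in a raw signal file of the given size in bytes."""
    payload = size_bytes - header_bytes
    if payload < 0 or payload % BYTES_PER_SAMPLE:
        raise ValueError("file size does not fit whole 16-bit samples")
    return payload // BYTES_PER_SAMPLE


def overrunning_segments(segments, total_samples):
    """Segments (label, start, end) whose end lies beyond the signal."""
    return [seg for seg in segments if seg[2] > total_samples]


def last_speech_segment_overruns(segments, total_samples, pause_label="-"):
    """True if the last non-pause segment ends beyond the signal."""
    speech = [seg for seg in segments if seg[0] != pause_label]
    return bool(speech) and speech[-1][2] > total_samples
```

The second function corresponds to the general case reported above (any overrunning segment, often a final pause), the third to the stricter case in which even the last non-pause segment exceeds the file length.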
III.) Manual Validation
not applicable
IV.) Other Relevant Observations
V.) Comments for Improvement
The revalidation was able to repair some data, but the most important
data, such as the specification of the microphones used and the
recording platform, could not be found and therefore could not be
repaired. To avoid such serious omissions in future, the producers of
speech corpora should first consider which information they tend to
take for granted and are therefore likely to forget to report. Such
unreported information sometimes makes it impossible for users to work
with the corpus for the desired application and in the long run makes
the corpus nearly worthless for other applications. Before producing
speech corpora, a guideline such as the book 'The Production and
Validation of Speech Corpora' can help producers to think about all
the information that has to be documented.
VI.) Result
The results show a lack of essential information in documentation and
annotation. Some data were repairable, but the most relevant data,
such as the specification of the microphones used, the recording
platform and the exact acoustical environment, could not be
repaired. These gaps reduce the usability of the corpus for other
applications.