Revalidation Report for the SPINA Database

Authors	Florian Schiel, Angela Baumann, Katerina Louka
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-Mail	schiel@phonetik.uni-muenchen.de bas@phonetik.uni-muenchen.de
Telephone	+49-89-2180-2758
Fax	+49-89-2800362
Corpus Version	2.3
Date	27.05.2004
Status	final
Comment
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, <Verlag>, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation Results of the SPINA Corpus

Summary

The SPINA speech corpus has been validated against general principles of good practice. The
validation covered completeness checks, formal checks and manual checks of selected subsamples.
The missing data reduce the usability of the corpus and may cause problems when the corpus is
used for other applications.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus SPINA
carried out in 2003 within the 'BITS' project by the Institute of Phonetics of the
Ludwig-Maximilians-University Munich.

The corpus contains read speech of 22 different speakers (6 male, 16 female).
The corpus consists of 10 robot command sentences and 62 robot command words.
Each speaker read the whole corpus 5 times, except speaker 03 who read the
sentence corpus 16 times and the word corpus 51 times.
The speakers were recorded at two different sites in Germany (University of
Goettingen: speakers 25, 26, 27, 28, 29, 30, 75, 76, 77, 78, 79, 80, 81, 82; University of
Bochum: speakers 01, 02, 03, 04, 05, 50, 51, 52). The language is German.
The corpus contains a total of 10810 recorded utterances.
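
The figures above can be cross-checked with a short calculation. The sketch below
assumes that every speaker completed every repetition exactly as described; the
constant and function names are illustrative and not part of the corpus. Under that
assumption the nominal design yields 10882 utterances, 72 more than the reported
total of 10810. The documentation does not explain the difference, so the nominal
figure should be read as an upper bound on the design, not as a correction.

```python
# Nominal corpus design as described in the corpus description.
# Assumption: every speaker completed every repetition in full.
SENTENCES = 10
WORDS = 62
SPEAKERS = 22
REGULAR_REPETITIONS = 5
REPORTED_TOTAL = 10810  # figure stated in the documentation


def nominal_utterances() -> int:
    """Return the utterance count implied by the nominal recording design."""
    # 21 speakers read all 72 items five times each.
    regular = (SPEAKERS - 1) * REGULAR_REPETITIONS * (SENTENCES + WORDS)
    # Speaker 03 read the sentences 16 times and the words 51 times.
    speaker_03 = 16 * SENTENCES + 51 * WORDS
    return regular + speaker_03


if __name__ == "__main__":
    nominal = nominal_utterances()
    print("nominal:", nominal)
    print("difference to reported total:", nominal - REPORTED_TOTAL)
```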

I) Validation of Documentation

The General Documentation directory contains the following documentation files for the SPINA
corpus which can be found under: /doc/...

README: general documentation
BASEVAL: BAS validation protocol of first validation 1997
SPI_SPRK.TXT: speaker information
SPI_SENT.TXT: orthography of the sentences
SPI_WORD.TXT: orthography of the words
SAMPA.TXT: table containing the SAM-PA for German
CAN_LEXICON.TXT: canonical SAM-PA annotations of the words

The following required contents of the documentation have been checked:

- Administrative Information:

Validating person: n.a.

Date of validation: n.a.

Contact for requests regarding the corpus: ok.

Number and type of media: 1 volume on CDROM. ok.

Content of each medium: a total of 10810 recorded utterances, total size 266361 KB. ok.

Copyright statement and intellectual property rights (IPR): ok.

- Technical information:

Layout of media: information about file system type and directory structure:
The volume is stored on a photo CDROM with ISO 9660 format. ok.
Root directory structure not given: repairable.

File nomenclature: explanation of used codes (no white space in file names!):
<utterance key><speaker id><# of repetition>.<ext>. ok.

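The naming scheme can be checked mechanically. The following is a minimal sketch
that assumes fixed field widths of 4 characters for the utterance key, 2 digits for
the speaker id and 2 digits for the repetition; these widths are inferred from file
names quoted in this report (e.g. arm_0301.plb, s0035001.wlb) and are not stated in
the corpus documentation. The names UtteranceName and parse_name are illustrative.

```python
import re
from typing import NamedTuple


class UtteranceName(NamedTuple):
    key: str         # utterance key, e.g. "arm_" or "s003"
    speaker: str     # two-digit speaker id, e.g. "03"
    repetition: str  # two-digit repetition number, e.g. "01"
    ext: str         # file extension without the dot


# Assumed layout: 4-character key, 2-digit speaker, 2-digit repetition, extension.
_NAME = re.compile(
    r"^(?P<key>[A-Za-z0-9_]{4})"
    r"(?P<speaker>\d{2})"
    r"(?P<repetition>\d{2})"
    r"\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_name(filename: str) -> UtteranceName:
    """Split a file name into its fields; raise ValueError if it does not fit."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"file name does not follow the scheme: {filename!r}")
    return UtteranceName(**match.groupdict())


if __name__ == "__main__":
    print(parse_name("arm_0301.plb"))
```

A name containing white space or a field of the wrong width is rejected with a
ValueError, which makes the sketch usable as a simple file name check.
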
Formats of signals and annotation files: if non-standard formats are used, it is common to give
a full description or to convert into a standard format: 'although the corpus does not contain any
PhonDat header files, you may use HEAR_RAW to play the signal files, RAW2PHONDAT
to create PhonDat compatible files from the raw signal files and PHO2NIST to
create NIST SPHERE compatible files'. No further information about the signal
file format given. Not ok.
The files for the word segmentation have the extension .WLB, the files for the phonological
segmentation have the extension .PLB. ok.


Coding: Not given. Not ok.

Compression: just widely supported compressions like zip or gzip should be used: uncompressed data. ok.

Sampling rate: 16 kHz. The speech data were digitally filtered to 8 kHz cutoff
frequency and then downsampled to 16 kHz. ok.

Valid bits per sample: 16 bit???. ok.

Used bytes per sample: 2. ok.

Multiplexed signals: n.a.
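
Given the sampling rate and the bytes per sample listed above, the duration of a
signal file follows directly from its size. The sketch below assumes headerless,
single-channel raw files; the documentation only states that no PhonDat headers are
present, so this is an assumption. The function names are illustrative.

```python
SAMPLE_RATE_HZ = 16_000  # sampling rate stated in the documentation
BYTES_PER_SAMPLE = 2     # used bytes per sample stated in the documentation


def samples_in_file(size_bytes: int) -> int:
    """Number of samples in a headerless, single-channel raw file (assumed)."""
    if size_bytes < 0 or size_bytes % BYTES_PER_SAMPLE:
        raise ValueError("size is not a whole number of 2-byte samples")
    return size_bytes // BYTES_PER_SAMPLE


def duration_seconds(size_bytes: int) -> float:
    """Duration of the recording in seconds."""
    return samples_in_file(size_bytes) / SAMPLE_RATE_HZ


if __name__ == "__main__":
    # A 64000-byte file holds 32000 samples, i.e. 2 seconds of speech.
    print(duration_seconds(64_000))
```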

- Database contents:

Clearly stated purpose of the recordings: robot commands. ok.

Speech type(s): read sentences. Prompting method not given: repairable

Instruction to speakers in full copy: just a verbal instruction: 'read carefully but fluently'. ok.

- Linguistic contents of prompted speech:

Specifications of the individual text items: (see PD2_ort.txt). This file contains all text
prompts that have been spoken together with the utterance id. Umlauts coding not given: repairable

Specification for the prompt sheet design or specification of the design of the speech prompts:
screen-prompted, read from a train query task. No specification of prompt design given. Not ok.
Example prompt sheet or example sound file from the speech prompting: n.a.

- Speaker information:

Speaker recruitment strategies: no information. Not ok.

Number of speakers: 16 speakers (6f/10m). Number of male and female speakers just indirectly given
in PD2_sprk.txt: repairable

Distribution of speakers over sex, age, dialect regions: no distribution over age and dialects.
Just a simple classification in 'old' (A) and 'young' (J) and the recording site (A=Kiel, N=Bonn,
D=Munich) given (see PD2_sprk.txt). Not ok.

Description/definition of dialect regions: not given. Not ok.

- Recording platform and recording conditions:

Recording platform: not given. Not ok.

Position and type of microphones: various Sennheiser Microphones, e.g. MKH 20 P48.
No further specification given. Not ok.

   - Electret, dynamic, condenser: not given. Not ok.
   - Directional properties: not given. Not ok.
   - Mounting: not given. Not ok.

Position of speakers: distance to microphone not given. Not ok.

Bandwidth of microphones: not given. Not ok.

Number of channels and channel separation: n.a.

Acoustical environment: quiet studio conditions. But no further information about individual
conditions in the 3 recording rooms. Not ok.

- Annotation 1 ('Manual phonological segmentation', *.plb files):

Labeling symbols: SAM-PA. ok.

Annotation manual, guidelines, instructions: not given. Not ok.

Description of quality assurance procedures: not given. Not ok.

Selection of annotators: not given. Not ok.

Training of annotators: not given. Not ok.

Annotation tools used: not given (probably XWaves+). Not ok.

- Annotation 2 ('Word segmentation', *.wlb files):

Unambiguous spelling standard used in annotations: not given, assuming Duden standard. acceptable

Labeling symbols: simple orthography. ok.

List of non-standard spellings: not given. Not ok.

Character set used in all annotations: not given. Not ok.

Any other language-dependent information such as abbreviations etc.: not given.

Annotation manual, guidelines, instructions: not given. Not ok.

Description of quality assurance procedures: not given. Not ok.

Selection of annotators: not given. Not ok.

Annotation tools used: not given. Not ok.

- Lexicon: not provided; 'can_lexicon.txt' was added for this validation

Format: simple two-column list, 7-bit-ASCII, Umlauts in LaTeX, phonetic coding in SAM-PA as used in the phonological segmentation (no glottal stops!)

Text-to-phoneme procedure: manual by experienced phonetician. ok.

Explanation or reference to the phoneme set: SAM-PA. ok.

Phonological or higher order phenomena accounted in the phonemic transcriptions: none. ok.

- Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): not given. ok.

Word frequency table: word counts in lexicon. Acceptable.

- Others:

Any other essential language-dependent information or convention: not given. ok.

Indication of how many files were double-checked by the producer together with percentage of detected errors: not given. Not ok.

Status documentation: not acceptable.

II.) Automatic Validation

The following list contains all validation steps with the methodology and results.

Completeness of files:
- Signal files: ok
- Annotation files: one file missing (Subdirectory: 50wlb -> s0035001.wlb)
- Meta data files: ok

Correctness of file names: ok
2 errors in the file names in SPI_word.txt (grei instead of gref, zw"o instead of zwoe)
No mismatches between signal and annotation files.
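
A completeness check of this kind can be sketched as a comparison of file name
stems. The example is not the tool used in this validation; the function name and
the '.raw' signal extension in the usage example are hypothetical. Stems are
compared case-insensitively because the documentation writes the annotation
extensions in upper case (.WLB) while the file names in this report use lower case.

```python
from pathlib import PurePath
from typing import Iterable, List


def missing_annotations(signal_names: Iterable[str],
                        annotation_names: Iterable[str]) -> List[str]:
    """Return the stems of signal files that have no annotation file."""
    annotated = {PurePath(name).stem.lower() for name in annotation_names}
    signals = {PurePath(name).stem.lower() for name in signal_names}
    return sorted(signals - annotated)


if __name__ == "__main__":
    # Hypothetical listing: one of the two signal files lacks its .wlb file.
    signals = ["s0035001.raw", "s0035002.raw"]
    annotations = ["S0035002.WLB"]
    print(missing_annotations(signals, annotations))
```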

Empty files: no

Signal Format: ok

In some plb files the duration of the last segment exceeds the total length of the file; the remaining segments,
however, are correct. Many of the overrunning segments are labelled as "-" (pause). In the following files even the
last non-pause segment exceeds the total length:

arm_0301.plb, bewe0301.plb, brin0301.plb, das_0301.plb, den_0301.plb, die_0301.plb, dreh0301.plb, drei0301.plb
eins0301.plb, elf_0301.plb, entf0301.plb, fahr0301.plb, flas0301.plb, fuen0301.plb, gege0301.plb, glas0301.plb
grad0301.plb, gref0301.plb, grei0301.plb, hale0301.plb, halt0301.plb, hand0301.plb, hebe0301.plb, hebn0301.plb
hint0301.plb, in__0301.plb, jetz0301.plb, komm0301.plb, lang0301.plb, link0301.plb, losl0301.plb, nach0301.plb
neun0301.plb, null0301.plb, oben0301.plb, ober0301.plb, obje0301.plb, rech0301.plb, schn0301.plb, sech0301.plb
senk0301.plb, tass0301.plb, teil0301.plb, um__0301.plb, und_0301.plb, unte0301.plb, vier0301.plb, vor_0301.plb
vorn0301.plb, vorw0301.plb, wart0301.plb, weit0301.plb, wend0301.plb, wink0301.plb, zehn0301.plb, zur_0301.plb
zuru0301.plb, zwei0301.plb, zwoe0301.plb

The file s0015100.wlb has a last segment that exceeds the total length of the file.
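
The segment check described above can be sketched as follows. Since the plb and wlb
file formats are not documented, the example works on segments that have already
been parsed into start and length in samples; this representation, the Segment
type and the function names are assumptions made for illustration. Segments are
expected in time order.

```python
from typing import List, NamedTuple, Sequence


class Segment(NamedTuple):
    start: int   # index of the first sample of the segment
    length: int  # segment length in samples
    label: str   # segment label; "-" marks a pause


def overrunning_segments(segments: Sequence[Segment],
                         total_samples: int) -> List[Segment]:
    """Return all segments that end beyond the end of the signal file."""
    return [s for s in segments if s.start + s.length > total_samples]


def last_speech_segment_overruns(segments: Sequence[Segment],
                                 total_samples: int,
                                 pause_label: str = "-") -> bool:
    """True if the last non-pause segment ends beyond the end of the file."""
    speech = [s for s in segments if s.label != pause_label]
    if not speech:
        return False
    last = speech[-1]
    return last.start + last.length > total_samples


if __name__ == "__main__":
    segments = [Segment(0, 8000, "a"), Segment(8000, 8000, "-")]
    # Only the final pause overruns a file of 15000 samples.
    print(overrunning_segments(segments, 15000))
    print(last_speech_segment_overruns(segments, 15000))
```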

III.) Manual Validation

not applicable

IV.) Other Relevant Observations

V.) Comments for Improvement

The revalidation was able to repair some of the data, but the most important information, such as
the specification of the microphones used and the recording platform, could not be found and
therefore could not be repaired. To avoid such serious omissions in future, producers of speech
corpora should first consider which information they tend to take for granted and are therefore
likely to leave unreported. Such unreported information can make it impossible for users to work
with the corpus for the desired application and, in the long run, makes the corpus nearly
worthless for other applications. Before producing a speech corpus, a guideline such as the
book 'The Production and Validation of Speech Corpora' can help in thinking through all the
information that needs to be documented.

VI.) Result

The results show a lack of essential information in documentation and annotation. Some data could
be repaired, but the most relevant information, such as the specification of the microphones used,
the recording platform and the exact acoustical environment, could not be repaired. These omissions
reduce the usability of the corpus for other applications.