Revalidation Report for the PhonDat1 Database

Florian Schiel, Angela Baumann, Katerina Louka
BAS Bayerisches Archiv für Sprachsignale
Institut für Phonetik
Universität München
Postal address
Schellingstr. 3
D 80799 München
Corpus Version
The following validation results show a lack of essential information in the documentation and annotation. Likely reasons are that no standardized validation principles existed when the corpus was produced, and that there was little experience in producing speech corpora at the time.
Validation Guidelines
Florian Schiel: The Validation of Speech Corpora, <Verlag>, 2003, 

Validation Results of the PhonDat1 Corpus


The PhonDat1 speech corpus has been validated against general principles of good practice. The
validation covered completeness, formal checks and manual checks of selected subsamples.
The missing data reduce the usability of the corpus, and problems may arise when using the
corpus for the intended application or for other applications.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus PhonDat1
carried out in 2003 within the project 'BITS' by the Institute of Phonetics of the
Ludwig-Maximilians-University Munich. The corpus was produced in 1992 (exact date not given
in the documentation) at four sites in Germany, namely the University of Kiel, the University
of Bonn, the University of Bochum and the University of Munich, acting as contractors to
Siemens, Daimler Benz, Philips and SEL-Alcatel. The language of the
corpus is German. The aim was to record diphone combinations of the German language under
carefully controlled conditions in a quiet studio environment. The texts were
prompted on screen and recorded with a variety of microphones. The corpus contains read
speech of 201 different speakers. Each speaker read a subcorpus of 450 different sentence equivalents
(including alphanumericals and two shorter passages of prose text); 8 speakers read the whole sentence
corpus; 40 speakers read the subcorpora BR and MR; 112 speakers read 70 utterances of the remaining
corpus, including the alphabet, the numbers 0 to 12 and stories.

I.) Validation of Documentation

The General Documentation directory contains the following documentation files for the PD1 corpus,
which can be found under /docu/...

README: documentation
PD1_sprk.txt: speaker information
PD1_ort.txt: orthography of the corpus
PD1_can.txt: canonical forms of the corpus
phondat.doc: documentation of the PhonDat file format
seg_conv.txt: handbook for hand segmentation (in German only)
ext_sampa.txt: table of the extended SAM-PA for German

The following required contents of the documentation have been checked:

- Administrative Information:

Validating person: not given. Not acceptable.

Date of validation: The original README document file is dated 1995. No exact date given. Not ok.

Contact for requests regarding the corpus: ok.

Number and type of media: 4 volumes on CDROM or DVD5. ok.

Content of each medium: 21681 recorded utterances (CD1: 4889, CD2: 6467, CD3: 6067,
CD4: 4258). No information is given on which CD each individual speaker can be found. Not ok.
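
The per-volume counts can be checked against the stated total with a few lines of Python. The figures are those given above; the variable names are illustrative only.

```python
# Utterance counts per volume, as stated in the documentation.
utterances_per_volume = {"CD1": 4889, "CD2": 6467, "CD3": 6067, "CD4": 4258}
stated_total = 21681

# Sum the per-volume counts and compare with the stated total.
computed_total = sum(utterances_per_volume.values())
print(computed_total == stated_total)  # prints True
```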

Copyright statement and intellectual property rights (IPR): ok.

- Technical information:

Layout of media: Information about file system type and directory structure: ISO 9660
or UFS. ok. Root directory structure not given: repairable.

File nomenclature: Explanation of the codes used (no white space in file names!):
<speaker_id><recording site id><sentence #><# of repetition>.<ext>. ok.

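A formal check of the file names could be sketched as follows. The documentation states only the order of the fields and that file names contain no white space; the field widths and character classes in the pattern below are hypothetical and would have to be adapted to the actual corpus.

```python
import re

# HYPOTHETICAL field widths and character classes: the documentation gives
# only the order of the fields, not their lengths.
NAME_PATTERN = re.compile(
    r"^(?P<speaker>[a-z0-9]{3})"   # speaker id (assumed: 3 characters)
    r"(?P<site>[a-z])"             # recording site id (assumed: 1 letter)
    r"(?P<sentence>\d{3})"         # sentence number (assumed: 3 digits)
    r"(?P<repetition>\d)"          # repetition number (assumed: 1 digit)
    r"\.(?P<ext>[a-z0-9]+)$"       # file extension
)


def check_file_name(name: str) -> list[str]:
    """Return the problems found in one file name (empty list if none)."""
    problems = []
    if any(ch.isspace() for ch in name):
        problems.append("contains white space")
    if NAME_PATTERN.match(name) is None:
        problems.append("does not match the assumed pattern")
    return problems
```

Only the white-space test follows directly from the documentation; the pattern test is meaningful only once the real field widths are known.
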
Formats of signal and annotation files: If non-standard formats are used, it is common to give
a full description or to convert to a standard format: the signal file format is NIST, the
annotation formats are s1 and s2 (see README) and Partitur files (see subdirectory
/doc/pardoc/PARMAIN.HTM). ok.

Coding: PCM. ok.

Compression: Only widely supported compression formats such as zip or gzip should be used: gzip
(used only on volume 4: s2_tar.gz). ok.

Sampling rate: 16 kHz. ok.

Valid bits per sample: 16. ok.

Used bytes per sample: 2. ok.

Multiplexed signals: n.a.
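
The signal parameters listed above could be verified automatically against the signal file headers. The sketch below assumes that the NIST files carry a standard SPHERE ASCII header with the usual fields sample_rate, sample_sig_bits, sample_n_bytes and channel_count; whether every PhonDat1 signal file does so has not been verified here. The function names are illustrative.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header of a NIST SPHERE file into a dict."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {}
    for line in lines[2:]:            # lines[1] holds the header size in bytes
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)   # field name, type flag, value
        if len(parts) != 3:
            continue
        name, type_flag, value = parts
        fields[name] = int(value) if type_flag == "-i" else value
    return fields


# Values reported in this validation (assumed to map onto these field names).
EXPECTED = {
    "sample_rate": 16000,
    "sample_sig_bits": 16,
    "sample_n_bytes": 2,
    "channel_count": 1,
}


def check_signal_header(raw: bytes) -> list[str]:
    """Return one message per field that deviates from the expected value."""
    fields = parse_sphere_header(raw)
    return [
        f"{name}: expected {value}, found {fields.get(name)!r}"
        for name, value in EXPECTED.items()
        if fields.get(name) != value
    ]
```

In practice the first 1024 bytes of each signal file would be passed to check_signal_header. With 2 bytes per sample at 16 kHz, one channel amounts to 32000 bytes per second of speech.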

- Database contents:

Clearly stated purpose of the recordings: one heading of the README file ('Diphone Material') hints
that the corpus was created to collect diphone material of the German language, but the purpose
of the recordings is not stated explicitly. Not acceptable.

Speech type(s): sentences read from screen. ok.

Instruction to speakers in full copy: only a verbal instruction, 'read carefully but fluently'. ok.

- Linguistic contents of prompted speech:

Specifications of the individual text items: (see PD1_ort.txt). This file contains all text prompts that
have been spoken together with the utterance id, but the second column of this file is not explained
in the documentation: repairable.

Specification for the prompt sheet design or specification of the design of the speech prompts:
screen-prompted, no specification of prompt design given. Not ok.

Example prompt sheet or example sound file from the speech prompting: n.a.

- Speaker information:

Speaker recruitment strategies: no information. Not ok.

Number of speakers: 201 speakers (100m/101f). ok.

Distribution of speakers over sex, age, dialect regions: distribution over age and dialects not
given. Not acceptable.

Description/definition of dialect regions: not given. Not ok.

- Recording platform and recording conditions:

Recording platform: not given. Not acceptable.

Position and type of microphones: various Sennheiser microphones, e.g. MKH 20 P48.
No further specification given. Not acceptable.

   - Electret, dynamic, condenser: not given. Not ok.
   - Directional properties: not given. Not ok.
   - Mounting: not given. Not ok.

Position of speakers: distance to microphone not given. Not ok.

Bandwidth of microphones: not given. Not ok.

Number of channels and channel separation: 1. ok.

Acoustical environment: quiet studio conditions, but no further information about the individual
conditions in the 4 recording rooms. Not ok.

- Annotation 1 (s1-files 'Phonological hand segmentation'):

Unambiguous spelling standard used in annotations: not given.

Labeling symbols: (see seg_conv.txt) ok.

List of non-standard spellings: not given.

Distinction of homographs which are no homophones: not given.

Character set used in annotations: not given: repairable.

Any other language-dependent information such as abbreviations etc.: not given.

Annotation manual, guidelines, instructions: instruction and guideline for annotators (see
seg_conv.txt). ok.

Description of quality assurance procedures: not given.

Selection of annotators: not given.

Training of annotators: not given.

Annotation tools used: not given. Not acceptable.

- Annotation 2 (s2-files 'Automatic time alignment'):

Description of quality assurance procedures: not given. Not ok.

Annotation tools used: not given. Not ok.


Format: no description of format: repairable

Text-to-phoneme procedure: in PD1_lex.txt, only in German: repairable

Explanation or reference to the phoneme set: SAM-PA. ok.

Phonological or higher order phenomena accounted for in the phonemic transcriptions: in PD1_lex.txt,
only in German: repairable

- Statistical information:

Frequency of sub-word units: phonemes: not given. Acceptable.

Word frequency table: word counts in lexicon. Acceptable.

Status documentation: not acceptable

II.) Automatic Validation

The following contains all validation steps together with the methodology and results.

Mountability check: mountable on three operating systems (Windows, Macintosh and Linux): ok

Completeness of signal, annotation and metadata files:

The S2 file collection will be omitted in the next edition of the PD1 corpus.

Correctness of file names: ok

Empty files: none

Annotation and lexicon contents: not ok / repairable
Mismatches between the canonical file and the Partitur files
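
The completeness and empty-file checks listed above can be sketched as follows. This is a minimal illustration, not the tool used for this validation; the function names are illustrative, and the file extensions passed in are hypothetical placeholders for the actual signal and annotation extensions of the corpus.

```python
from pathlib import Path


def find_empty_files(root: Path) -> list[Path]:
    """Return all zero-byte files below root, sorted by path."""
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.stat().st_size == 0
    )


def find_incomplete_items(root: Path, required_exts: set[str]) -> dict[str, set[str]]:
    """Map each file stem to the required extensions that are missing for it.

    Files are grouped by stem only, so identical stems in different
    directories are treated as the same item (a simplifying assumption).
    """
    found: dict[str, set[str]] = {}
    for p in root.rglob("*"):
        if p.is_file() and p.suffix in required_exts:
            found.setdefault(p.stem, set()).add(p.suffix)
    return {
        stem: required_exts - exts
        for stem, exts in found.items()
        if exts != required_exts
    }
```

A call such as find_incomplete_items(Path("/path/to/volume"), {".sig", ".s1"}) would then list every recording that lacks either its signal or its annotation file; both the path and the two extensions are placeholders.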

III.) Manual Validation 

IV.) Other Relevant Observations

V.) Comments for Improvement

The revalidation was able to repair some data, but the most important data, such as the specification
of the microphones used and of the recording platform, could not be found and therefore could not be
repaired. To avoid such serious omissions in future, the producers of speech corpora should
first consider what information they tend to take for granted and are therefore likely to forget to
report. Such unreported information sometimes makes it impossible for the
user to work with the corpus for the desired application and, in the long
run, makes the corpus nearly worthless for other applications. Before producing a speech corpus, a
guideline such as the book 'The Production and Validation of Speech Corpora' can help to identify
all the information that should be documented.

VI.) Result

The results show a lack of essential information in the documentation and annotation.
The missing data reduce the usability of the corpus, and problems may arise when using the
corpus for the intended application or for other applications. Some missing data, such as the
specification of the purpose of the corpus and some additions regarding the lexicon, have been
repaired in the README file.