Revalidation report for the SI1000P Database

Authors

Florian Schiel, Katerina Louka

Affiliation  

BAS Bayerisches Archiv für Sprachsignale
Institut für Phonetik
Universität München

Postal address

Schellingstr. 3
D 80799 München

E-mail

schiel@phonetik.uni-muenchen.de
bas@phonetik.uni-muenchen.de

Telephone

+49-89-2180-2758

Fax

+49-89-2800362

Corpus Version

2.2

Date

08.09.2004

Status

final

Comment


Validation Guidelines

Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook 

Validation results of the SI1000P Corpus:

Summary

The speech corpus SI1000P has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. Note that missing data reduces the usability of the corpus and may cause problems when the corpus is used in further applications.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus SI1000P, carried out in 2004 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich.

The SI1000P recordings were made to provide material for high-quality concatenative speech synthesis. The corpus contains 993 newspaper sentences
read by two professional broadcasting announcers in studio quality, together with the laryngographic signal and the glottal pulse stream.
Parts of the corpus were labeled and segmented phonemically (SAM-PA) and prosodically (boundaries + accents).

I.) Validation of Documentation

The General Documentation directory contains the following documentation files for the SI1000P corpus which can be found under doc/:

README
File describing the organisation of the documentation directory and the speech data
SI1000P.TXT
Spoken texts
SI1000P.PROMPTS
original prompts as presented to the speakers
SI1000P.LEX
Pronunciation dictionary (Extended German SAM-PA)
EXT_SAM.TXT
Description of Extended German SAM-PA
PHONDAT.DOC
Prosodic boundary segmentation and accent labeling of speaker AI
PARDOC
Documentation of BAS Partitur Format (BPF) files


·         Administrative Information:

Validating person: n. a.

Date of validation: n. a.

Contact for requests regarding the corpus: ok

Number and type of media: 4 CD-ROMs ok

Content of each medium:  no information

Copyright statement and intellectual property rights (IPR): ok


·  Technical information:

Layout of media: Information about file system type and directory structure: no information

File nomenclature: Explanation of used codes (no white space in file names!):
<speaker-id><session-number><.nis/.dat> ok

Formats of signals and annotation files: If non-standard formats are used, it is common to give a full description or to convert into a standard format: ok

Coding:   NIST SPHERE

Compression: n. a.

Sampling rate: 16 kHz ok

Valid bits per sample: (values other than 8, 16, 24 should be reported): 16 bit ok

Used bytes per sample:  no information

Multiplexed signals: (exact de-multiplexing algorithm; tools) n.a.

     The speech signals and the laryngograph signals were filtered and downsampled to 16 kHz.
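The documented coding parameters can be checked against the signal file headers automatically. The following is a minimal sketch, not the procedure used in this validation: it assumes the standard NIST SPHERE ASCII header layout (a "NIST_1A" line, a header size line, "name -type value" fields, and a closing "end_head" line). The function names and the example header are hypothetical and not taken from the corpus.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header of a NIST SPHERE file into a dict of fields."""
    lines = raw.decode("ascii", errors="replace").splitlines()
    if not lines or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {}
    for line in lines[2:]:  # lines[1] holds the header size in bytes
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # -sN: string field of N characters
            fields[name] = value
    return fields


def check_documented_format(fields: dict, rate: int = 16000, bits: int = 16) -> list:
    """Return a list of mismatches against the documented rate and bit depth."""
    problems = []
    if fields.get("sample_rate") != rate:
        problems.append(f"sample_rate is {fields.get('sample_rate')}, expected {rate}")
    # sample_sig_bits is optional; fall back to the container size in bits.
    sig_bits = fields.get("sample_sig_bits")
    if sig_bits is None and "sample_n_bytes" in fields:
        sig_bits = fields["sample_n_bytes"] * 8
    if sig_bits != bits:
        problems.append(f"bits per sample is {sig_bits}, expected {bits}")
    return problems


# Hypothetical header for illustration only (not copied from the corpus).
EXAMPLE_HEADER = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"sample_n_bytes -i 2\n"
    b"sample_byte_format -s2 01\n"
    b"end_head\n"
)

if __name__ == "__main__":
    print(check_documented_format(parse_sphere_header(EXAMPLE_HEADER)))
```

For a real file, the header bytes would be read from the start of the signal file before being passed to parse_sphere_header. The sample_n_bytes field, where present, would also supply the "used bytes per sample" value that the documentation does not state.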

Database contents:

Clearly stated purpose of the recordings: ok. (README file)

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok

Instruction to speakers in full copy:  indirect information is given in the README file

    
Linguistic contents of prompted speech:

Specifications of the individual text items: ok  ("SI1000P.prompts" file)

Specification for the prompt sheet design or specification of the design of the speech prompts:  n.a.

Example prompt sheet or example sound file from the speech prompting: n.a.

      Linguistic contents of non-prompted speech:

Multi-party:(number of speakers, topics discussed, type of setting - formal/informal)  n.a.

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n.a.

Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

     Speaker information:

Speaker recruitment strategies:   n.a.

Number of speakers: 2 male speakers ok

           Distribution of speakers over sex, age, dialect regions: no information

           Description/definition of dialect regions:  n.a.

     Recording platform and recording conditions:

Recording platform: echo cancelling studio - ok

Position and type of microphones:
- Company name and type id: Sennheiser MKH20
- Electret, dynamic, condenser: n.a.
- Directional properties: omnidirectional
- Mounting:  30cm from mouth

Position of speakers: (distance to microphone)  30 cm from mouth ok

Bandwidth: (if other than zero to half of sampling rate)  ok

Number of channels and channel separation: 4 channels

            Acoustical environment:  echo cancelling studio

 Laryngograph signal:
- Company name and type id: LxProc of Laryngograph Ltd. London

      Orthographic Transcriptions:

Unambiguous spelling standard used in annotations: ok

Labeling symbols:  n.a.

List of non-standard spellings (dialectal variation, names etc.):  given (README)

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: ok (German umlauts are coded in LaTeX)

Any other language dependent information as abbreviations etc: given  (README)

Annotation manual, guidelines, instructions: ok (EXT_SAM.TXT, PHONDAT.DOC, PARDOC)

Description of quality assurance procedures: not given

Selection of annotators:  not given

Training of annotators:   not given

Annotation tools used: not given

     Lexicon:

Format: ok

Text-to-phoneme procedure: ok

Explanation or reference to the phoneme set: ok (EXT_SAM.TXT)

Phonological or higher order phenomena accounted for in the phonemic transcriptions: n.a.

     Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: n.a.

      Others:

Any other essential language-dependent information or convention: n.a.

Indication of how many files were double-checked by the producer together with percentage of detected errors:  no information

            Status of documentation: acceptable

II.) Automatic validation

The following list contains all validation steps with the methodology and results.


Completeness of signal files: not ok (error mentioned in the README file).
The following utterance numbers are missing for both speakers:

These items are not listed in the files SI1000P.txt and SI1000P.prompts

Completeness of meta data files: ok

Completeness of annotation files: not ok.
For both speakers, the following .par files exist although there are no corresponding signal files:

Correctness of file names: ok.

Empty files: none

Status of signal, annotation and meta data files: ok

Cross checks of meta information: ok

Cross checks of summary listings: ok

Annotation and lexicon contents: ok
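
The completeness checks above amount to comparing the base names of signal and annotation files. The following is a minimal sketch of such a cross-check, not the tool used in this validation. It assumes that a .par file and its signal file share the same base name and that both .nis and .dat count as signal files. The function name and the example file names are hypothetical.

```python
from pathlib import PurePath


def cross_check(file_names, signal_exts=(".nis", ".dat"), annotation_ext=".par"):
    """Compare base names of signal and annotation files.

    Returns the base names that have an annotation file but no signal
    file, and those that have a signal file but no annotation file.
    """
    signals = set()
    annotations = set()
    for name in file_names:
        path = PurePath(name)
        ext = path.suffix.lower()
        if ext in signal_exts:
            signals.add(path.stem)
        elif ext == annotation_ext:
            annotations.add(path.stem)
    return {
        "annotation_without_signal": sorted(annotations - signals),
        "signal_without_annotation": sorted(signals - annotations),
    }


if __name__ == "__main__":
    # Hypothetical file names for illustration only.
    listing = ["ai001.nis", "ai001.par", "ai002.par", "ai003.nis"]
    print(cross_check(listing))
```

Since only parts of the corpus were labeled, signal files without annotation are expected. Annotation files without a signal file are the case reported as an error above.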


III.) Manual Validation

10% of the data (100 files of the SI1000P corpus) were checked against the corresponding partitur files. 4% of the checked files contained errors (4 errors out of 100). More files of the "ai" speaker were checked, so that the prosodic annotation could also be included in the comparison.

IV.) Other Relevant Observations

none

V.) Comments for Improvement

The errors in the partitur files should be repaired. Some more information about the speakers would be useful.

VI.) Result

The corpus is ok.