Validation report for the CI_1 Database

Authors

Susanne Beinrucker, Florian Schiel

Affiliation

BAS Bayerisches Archiv für Sprachsignale
Institut für Phonetik
Universität München

Postal address

Schellingstr. 3
D 80799 München

E-mail

schiel@phonetik.uni-muenchen.de
bas@phonetik.uni-muenchen.de

Telephone

+49-89-2180-2758

Fax

+49-89-2180-5790

Corpus Version

1.1

Date

2014-06-13

Status

In progress

Comment


Validation Guidelines

Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation results of the CI_1 Corpus:

Summary

The speech corpus CI_1 (thesis data of Veronika Neumeyer: CI Articulation) has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. The corpus and its documentation are partially incomplete, but with the suggested improvements and corrections they will be in good condition.

Aside from some missing information, the corpus is in moderate to good shape for scientific usage and will be equally suitable for technical usage once the sampling rate has been corrected.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus CI_1 carried out in 2014 by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The speech corpus was created in 2009 by a graduate student for her final thesis at the same institute.

CI_1 contains recordings of 29 native German speakers, both normal-hearing speakers and CI-users. The speech items are recordings of written prompts. The recordings of 23 speakers were phonetically analysed. The EMU analysis comprises formant (fms), fundamental frequency (f0), energy (rms) and zero-crossing rate (zcr) analyses.



I.) Validation of Documentation

The following documentation files and directories of the CI_1 corpus can be found under:

1.) the main directory

README.deu

File describing the content of the subdirectories in German

README.eng

File describing the content of the subdirectories in English

rename

Csh-script for renaming (removed)

BAS_Validation.HTML

This file


2.) the directory /annot:

README.deu

File describing the content of this subdirectory in German

README.eng

File describing the content of this subdirectory in English

hlb

Directory containing the EMU hierarchical label files

KAN

Directory containing the canonical annotation files

MAU

Directory containing the phonetically segmented annotation files

ORT

Directory containing the orthographic annotation files

textgrid

Directory containing the corrected TextGrid-files


3.) the directory /data:

README.deu

File describing the content of this subdirectory in German

README.eng

File describing the content of this subdirectory in English

f0

Directory containing EMU analysing/output files of fundamental frequency (f0)

fms

Directory containing EMU analysing/output files of formants (fms)

rms

Directory containing EMU analysing/output files of energy (rms)

wav

Directory containing audio files

zcr

Directory containing EMU analysing/output files of zero-crossing rate (zcr)


4.) the directory /doc:

README.deu

File describing the database and its structure in German

README.eng

File describing the database and its structure in English

Magisterarbeit.PDF

Final thesis of Veronika Neumeyer in German


5.) the directory /rawdata:

0003 … 0027

Directories containing raw audio files

RDI0002 … RDI0005

Directories containing raw audio files


6.) the directory /table:

ci-sprecher.tbl

EMU template

PROMPTS.TBL

List of prompts

SPEAEXT.TBL

Speaker information file





·         Administrative Information:

Validating person: n.a.

Date of validation: n.a.

Contact for requests regarding the corpus: missing

Number and type of media: n.a.

Content of each medium: missing

Copyright statement and intellectual property rights (IPR): missing



·         Technical information:

Layout of media: Information about file system type and directory structure: n.a.

File nomenclature: Explanation of used codes (no white space in file names!):
<SS (speaker code)><PP (prompt number)><_WW (repetition: 01 … 05)>< .wav | .f0 | .par | .TextGrid | .rms | .fms | .zcr> ok

Formats of signals and annotation files: If non-standard formats are used it is common to give a full description or to convert into a standard format: ok, raw signal files in .wav and raw annotation files in .TextGrid and .par (converted into EMU label files .hlb, .KAN, .ORT, .MAU and EMU output files .fms, .f0, .rms, .zcr)

Coding: Signed Integer PCM (signal files), US-ASCII (annotation files and one README) and ISO-8859-1 (the other READMEs), see more in V.

Compression: no compression

Sampling rate: not ok, signal files are sampled at 22050 Hz (speaker codes without R) and 44100 Hz (speaker codes with R)

Valid bits per sample: (others than 8, 16, 24, should be reported): 16 bit, ok

Used bytes per sample: 2 bytes/sample, ok

Multiplexed signals: (exact de-multiplexing algorithm; tools) single channel, ok
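The signal format items above (sampling rate, bytes per sample, number of channels) can be read directly from the WAV headers. The following is a minimal sketch using Python's standard `wave` module; the function names and the `expected_rate` parameter are hypothetical and not part of the corpus or of the validation tools actually used. The report does not state which of the two sampling rates is the target, so the expected rate has to be passed in explicitly.

```python
import wave


def read_wav_format(path):
    """Return the header values that the technical checks above refer to."""
    with wave.open(str(path), "rb") as wav_file:
        return {
            "sample_rate": wav_file.getframerate(),
            "bytes_per_sample": wav_file.getsampwidth(),
            "channels": wav_file.getnchannels(),
        }


def check_wav_format(path, expected_rate):
    """List deviations from 16-bit, single-channel audio at expected_rate.

    expected_rate is an assumption supplied by the caller; the report only
    notes that 22050 Hz and 44100 Hz files are mixed in the corpus.
    """
    fmt = read_wav_format(path)
    problems = []
    if fmt["sample_rate"] != expected_rate:
        problems.append(f"sampling rate {fmt['sample_rate']} Hz")
    if fmt["bytes_per_sample"] != 2:
        problems.append(f"{fmt['bytes_per_sample']} bytes per sample")
    if fmt["channels"] != 1:
        problems.append(f"{fmt['channels']} channels")
    return problems
```

Running `check_wav_format` over all files in the audio directory with one fixed `expected_rate` would list the files recorded at the other rate.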

·         Database contents:

Clearly stated purpose of the recordings: ok

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok, read sentences and letters

Instruction to speakers in full copy: screenshots of the surface of SpeechRecorder and questionnaire for meta data are available, but no complete instruction is given

·         Linguistic contents of prompted speech:

Specifications of the individual text items: ok (in PROMPTS.TBL)

Specification for the prompt sheet design or specification of the design of the speech prompts: ok (in Magisterarbeit.PDF)

Example prompt sheet or example sound file from the speech prompting: ok (in Magisterarbeit.PDF)

·         Linguistic contents of non-prompted speech:

Multi-party:(number of speakers, topics discussed, type of setting - formal/informal) n.a.

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n.a.

Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

·         Speaker information:

Speaker recruitment strategies: ok (information in the Magisterarbeit.PDF)

Number of speakers: ok, 29 speakers (11 normal hearing, 18 CI-users)

Distribution of speakers over sex, age, dialect regions: ok, for CI-users and control group

Description/definition of dialect regions: n.a.

·         Recording platform and recording conditions: (information in Magisterarbeit.PDF)

Recording platform: Speech Recorder software

Position and type of microphones: ok, headset
- Company name and type id: Sennheiser USB 36 Headset
- Electret, dynamic, condenser: n.a.
- Directional properties: n.a.
- Mounting: n.a.

Position of speakers: (distance to microphone) no information

Bandwidth: (if other than zero to half of sampling rate) n.a.

Number of channels and channel separation: 1 channel ok

Acoustical environment: home/office, ok

·         Annotation (TextGrid label file): (information in Magisterarbeit.PDF)

Unambiguous spelling standard used in annotations: not given

Labeling symbols: ok, SAMPA

List of non-standard spellings (dialectal variation, names etc.): n.a.

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: ok, SAMPA and US-ASCII

Any other language dependent information as abbreviations etc: n.a.

Annotation manual, guidelines, instructions: automatic annotation with MAUS and manual corrections at the boundaries of the target word and target vowels

Description of quality assurance procedures: n.a.

Selection of annotators: not given

Training of annotators: not given

Annotation tools used: Praat, MAUS, EMU ok

·         Lexicon:

Format: n.a.

Text-to-phoneme procedure: n.a.

Explanation or reference to the phoneme set: n.a.

Phonological or higher order phenomena accounted in the phonemic transcriptions: n.a.

·         Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: n.a.

·         Others:

Any other essential language-dependent information or convention: n.a.

Indication of how many files were double-checked by the producer together with percentage of detected errors: n.a., no double checks

Status of documentation: partially incomplete

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal files: ok (./rawdata/0004: the file set is complete, but the signal files are misnamed: they carry the speaker code 24 instead of 20, although their content differs from the signal files of speaker 24; ./rawdata/0020: 22 signal files are missing, but this is documented (the corresponding files for speaker 82 are also missing in ./data/...))

Completeness of meta data files: ok

Completeness of annotation files: ok (./rawdata/0020: 22 annotation files are missing, but this is documented (the corresponding files for speaker 82 are also missing in ./annot/...))

Correctness of file names: ok

Empty files: two empty files (./data/f0/R513_03.f0 and ./data/f0/R502_04.f0)

Cross checks of meta information: not ok: wrong naming in ./rawdata/0004 (see Completeness of signal files)

Cross checks of summary listings: n.a.

Annotation and lexicon contents: no lexicon, annotation content ok
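The file name and empty file checks in the list above can be sketched in a few lines of Python. This is an illustration only, not the script used for this validation. The regular expression is an assumed reading of the documented nomenclature: a two-character speaker code (two digits, or R plus one digit), a two-digit prompt number, an underscore with a repetition from 01 to 05, and one of the listed extensions. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed interpretation of <SS><PP><_WW><extension>; the field widths are
# inferred from the placeholders and from names such as R513_03.f0.
NAME_PATTERN = re.compile(
    r"^(?P<speaker>R\d|\d{2})"
    r"(?P<prompt>\d{2})"
    r"_(?P<repetition>0[1-5])"
    r"\.(?P<extension>wav|f0|par|TextGrid|rms|fms|zcr)$"
)


def parse_name(file_name):
    """Split a file name into its coded fields, or return None if it does not fit."""
    match = NAME_PATTERN.match(file_name)
    return match.groupdict() if match else None


def find_empty_files(root):
    """Return all zero-byte files below root, sorted by path."""
    return sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_size == 0
    )
```

Comparing the `speaker` field returned by `parse_name` with the speaker directory a file is stored in is one way a misnaming such as the one in ./rawdata/0004 could be detected, provided a mapping between directory names and speaker codes is available.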



III.) Manual Validation

In the manual validation, only the TextGrid annotation files were checked for the specified labeling of the target words and vowels. About two percent of these files (1.86 %, i.e. 42 annotation files) were checked manually, and a few errors were found: incorrect annotation (wrong symbols etc.) and incorrect boundaries, especially of the target vowels. In most cases these were minor errors.
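The report does not say how the subsample of annotation files was chosen. As an illustration only, a reproducible random subsample of roughly two percent could be drawn as follows; the function name, the `fraction` default and the fixed seed are assumptions, not the procedure that was actually applied.

```python
import random


def draw_subsample(file_names, fraction=0.02, seed=0):
    """Draw a reproducible random subsample of about `fraction` of the files.

    Sorting first makes the result independent of directory listing order;
    the fixed seed makes the selection repeatable (both are assumptions).
    """
    ordered = sorted(file_names)
    sample_size = max(1, round(len(ordered) * fraction))
    return random.Random(seed).sample(ordered, sample_size)
```

Recording the seed together with the list of selected files would let a later validation pass re-check exactly the same files.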

IV.) Other Relevant Observations

There are many README files with identical names that can only be distinguished by their directory location. In addition, the English versions of the README files were missing and a few typing errors were found (the English versions were supplied and the errors corrected by the validator).

In the README (main directory) a wrong number of analysed speakers is given: 20 + 1 + 1, whereas 23 speakers were actually analysed.

V.) Comments for Improvement

No copyright statement available.

The DVD-structure of the corpus was not given. The selection of the data for analysis is only documented in Magisterarbeit.PDF, but some information about this selection process should be put into the README.

The encoding of the files was not documented, and the README files in particular were not consistently encoded: most README files are encoded in ISO-8859-1, but one (the README file in the main directory) is encoded in US-ASCII.

During the automatic validation some problems appeared: wrong naming of the signal files in ./rawdata/0004/; two empty files (./data/f0/R513_03.f0 and ./data/f0/R502_04.f0).
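The inconsistent README encodings noted above could be listed with a small heuristic like the one below. It is a sketch under stated assumptions, not the check used by the validator: ISO-8859-1 cannot be detected positively because every byte sequence decodes under it, so any file that is neither pure US-ASCII nor valid UTF-8 is only assumed to be ISO-8859-1. The function names are hypothetical.

```python
from pathlib import Path


def guess_encoding(data):
    """Classify raw bytes as US-ASCII, UTF-8 or (by assumption) ISO-8859-1."""
    try:
        data.decode("ascii")
        return "US-ASCII"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Fallback only: any 8-bit byte sequence is decodable as ISO-8859-1.
        return "ISO-8859-1 (assumed)"


def readme_encodings(root):
    """Map every README* file below root to its guessed encoding."""
    return {
        str(path): guess_encoding(path.read_bytes())
        for path in sorted(Path(root).rglob("README*"))
        if path.is_file()
    }
```

A German README consisting only of unaccented characters would be reported as US-ASCII, so a US-ASCII result does not by itself show that a different encoding was intended.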

VI.) Result

The speech corpus CI_1 has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples. The corpus and its documentation are partially incomplete, but with the suggested improvements and corrections they will be in good condition.

Aside from some missing information, the corpus is in moderate to good shape for scientific usage and will be equally suitable for technical usage once the sampling rate has been corrected.