Revalidation report for the ERBA Database

Authors

Florian Schiel, Katerina Louka

Affiliation  

BAS Bayerisches Archiv für Sprachsignale
Institut für Phonetik
Universität München

Postal address

Schellingstr. 3
D-80799 München

E-mail

schiel@phonetik.uni-muenchen.de
bas@phonetik.uni-muenchen.de

Telephone

+49-89-2180-2758

Fax

+49-89-2800362

Corpus Version

1.0

Date

22.07.2004

Status

final

Comment


Validation Guidelines

Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook 

Validation results of the ERBA Corpus:

Summary

The speech corpus ERBA has been validated against general principles of good practice. The validation covered completeness checks, formal checks and manual checks of selected sub-samples. Note that any missing data reduces the usability of the corpus and may cause problems when the corpus is used for further applications.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus ERBA carried out in 2004 within the project 'BITS' by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. ERBA is an acronym for "Erlanger Bahn Anfragen", which means Erlangen train inquiries. The speech data was collected at four institutes: Philips Research Labs, Daimler-Benz Research Labs, University of Bielefeld and University of Erlangen.

This corpus is a re-distribution of the original ERBA corpus that was first distributed in 1992. It contains a collection of over 10,000 different utterances in a specific task domain (train inquiries) from over 100 speakers. All recordings were made in a quiet office room using a close-talking microphone.

The corpus is divided into two parts: the part called 'ERBA' contains the bulk of the data. The part called 'TEST' contains 100 utterances disjoint from 'ERBA' but from the same task domain.

I.) Validation of Documentation

The General Documentation directory contains the following documentation files for the ERBA corpus which can be found under docu/:

README
file describing the organisation of the docu-directory and the speech data
INFO
general information on ERBA
RECORD
recording information
SPEAKERS
speaker information
SESSIONS
session information
SIZE
size info
STATS
statistical info

Further documentation files can be found in the main directory:
README
information about the media and the validation of ERBA, and a short description of the corpus
BASEVAL
information on the validation of the corpus

Documentation for the part of the corpus named "TEST" can be found under /test/docu/:
README
short overview of the structure of TEST
RECORD
recording information
SPEAKERS
speaker information
SESSIONS
session information
SIZE
size info

·      Administrative Information:

Validating person: n. a.

Date of validation: n. a.

Contact for requests regarding the corpus: ok

Number and type of media: 4 CD-ROMs ok

Content of each medium:  total size 1818 MB

Copyright statement and intellectual property rights (IPR): ok


·         Technical information:

Layout of media: information about file system type and directory structure:

  4 CD-ROMs with ISO 9660 format

File nomenclature: explanation of used codes (no white space in file names!):
<speaker-id><session-number><sentence> ok
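The naming scheme above can be checked mechanically. The field widths are not stated in this report, so the widths in the sketch below (3-character speaker id, 2-digit session number, 3-digit sentence number) are an illustrative assumption only:

```python
import re

# Hypothetical pattern for the documented scheme
# <speaker-id><session-number><sentence>; the field widths are assumed.
NAME_RE = re.compile(r"^(?P<speaker>\w{3})(?P<session>\d{2})(?P<sentence>\d{3})$")

def parse_name(stem):
    """Split a file-name stem into its three code fields, or return None."""
    m = NAME_RE.match(stem)
    return m.groupdict() if m else None

print(parse_name("s0101042"))
# -> {'speaker': 's01', 'session': '01', 'sentence': '042'}
```

A validation pass would apply this to every signal file name and report any stem that does not match.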

Formats of signals and annotation files (if non-standard formats are used, a full description should be given or the data converted into a standard format): ok, raw signal files

Coding:   raw

Compression: n. a.

Sampling rate: 16 kHz ok

Valid bits per sample (other than 8, 16, 24 should be reported): 16 bit, ok.

Used bytes per sample:  no information

Multiplexed signals: (exact de-multiplexing algorithm; tools) n.a.
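Because the signals are headerless raw files, a reader has to supply the documented parameters itself. A minimal sketch, assuming little-endian byte order (the byte order is not documented in this report):

```python
import struct

SAMPLE_RATE = 16000  # documented: 16 kHz
SAMPLE_WIDTH = 2     # documented: 16 bit per sample

def read_raw(path, byteorder="<"):
    """Read a headerless 16-bit PCM file; return (samples, duration in s)."""
    # Byte order is an assumption ("<" = little-endian); try ">" if the
    # decoded signal looks like noise.
    with open(path, "rb") as f:
        data = f.read()
    n = len(data) // SAMPLE_WIDTH
    samples = struct.unpack(f"{byteorder}{n}h", data[: n * SAMPLE_WIDTH])
    return samples, n / SAMPLE_RATE
```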

     Database contents:

Clearly stated purpose of the recordings: ok. ("info" file)

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) ok

Instruction to speakers in full copy:  no information

     Linguistic contents of prompted speech:

Specifications of the individual text items: ok  ("info" file)

Specification for the prompt sheet design or specification of the design of the speech prompts:  n.a.

Example prompt sheet or example sound file from the speech prompting: n.a.

     Linguistic contents of non-prompted speech:

Multi-party:(number of speakers, topics discussed, type of setting - formal/informal)  n.a.

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n.a.

Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

     Speaker information:

Speaker recruitment strategies:   information in the "speakers" file

Number of speakers: 101 (40 female and 61 male speakers) ok

Distribution of speakers over sex, age, dialect regions: ok ("speakers" file)

Description/definition of dialect regions: ok ("speakers" file)

    Recording platform and recording conditions:

Recording platform: ok

Position and type of microphones: a close talking microphone
- Company name and type id: Shure SM 10A
- Electret, dynamic, condenser: n.a.
- Directional properties:  n.a.
- Mounting:  n.a.

Position of speakers: (distance to microphone)  ok

Bandwidth: (if other than zero to half of sampling rate)  100-6756 Hz

Number of channels and channel separation: 1 channel

Acoustical environment:  ordinary quiet office surrounding

 

    Orthographic Transcriptions:

Unambiguous spelling standard used in annotations: ok

Labeling symbols:  n.a.

List of non-standard spellings (dialectal variation, names etc.):  given (README)

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: ok, ASCII

Any other language dependent information as abbreviations etc: given  (README)

Annotation manual, guidelines, instructions: ok (README)

Description of quality assurance procedures: not given

Selection of annotators:  not given

Training of annotators:   not given

Annotation tools used: not given

     Lexicon:

Format: ok

Text-to-phoneme procedure: ok

Explanation or reference to the phoneme set: not given

Phonological or higher order phenomena accounted in the phonemic transcriptions:  n.a.

     Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): ok ("stats" file)

Word frequency table: n.a.

     Others:

Any other essential language-dependent information or convention: n.a.

Indication of how many files were double-checked by the producer together with percentage of detected errors:  ok (BASEVAL)

          Status of documentation: acceptable

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

In the "sessions" file the female speakers are tagged both with "f" and with "w". The same error also occurred in the files "/test/docu/speakers" and "/test/docu/sessions".
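This inconsistency is easy to repair: both "f" and "w" (German "weiblich") denote female speakers, so the tag can be normalized before any speaker statistics are computed. A minimal sketch, assuming the sex tag is an isolated field of each record:

```python
def normalize_sex_tag(field):
    """Map the inconsistent female tags "f"/"w" to the single code "f"."""
    return "f" if field.strip().lower() in ("f", "w") else field

print(normalize_sex_tag("w"))  # -> f
print(normalize_sex_tag("m"))  # -> m
```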

Completeness of signal files:  ok

Completeness of meta data files: ok

Completeness of annotation files: ok.

Correctness of file names: ok.

Empty files: none

Status of signal, annotation and meta data files: ok

Cross checks of meta information: ok

Cross checks of summary listings: ok

Annotation and lexicon contents: ok


III.) Manual Validation

5% of the data (500 files of the ERBA corpus and 50 files of the TEST corpus) was checked against the orthographic transcription. 1.8% of the checked ERBA data contained errors (9 errors out of 500 files); an error is a mismatch between the audio file and the orthographic transcription. Peculiar background noises (for example birdsong), buzzing, blowing into the microphone or microphone contact are not labelled. The TEST corpus did not contain any errors, although the blowing into the microphone or the background noise in some files could be irritating.
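The reported error rate for the ERBA part follows directly from the counts above:

```python
# 9 mismatches found in 500 manually checked ERBA files:
errors, checked = 9, 500
rate = 100 * errors / checked
print(f"{rate:.1f}%")  # -> 1.8%
```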

IV.) Other Relevant Observations

Background noises, buzzing, blowing into the microphone and microphone contact are not labelled in this corpus. Some utterances are recorded at a much lower level than others.

V.) Comments for Improvement

The revalidation was able to repair some data (orthographic transcription files, the speakers file).

VI.) Result

The corpus is in good condition: it is well documented and no data or documentation files are missing.