Authors: Nina Pörner
Affiliation: BAS Bayerisches Archiv für Sprachsignale
Postal address: Schellingstr. 3
E-mail: npoerner@phonetik.uni-muenchen.de
Telephone: -
Fax: -
Corpus Version: 02.00.00
Date: 08.04.15
Status: Corpus revalidated and updated. Status ok.
Comment: -
Validation Guidelines: Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook
The speech corpus SIEMENS HOERGERAETE SPRACHKORPUS has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of a subsample.
This document summarizes the results of an in-house validation of the speech corpus HOESI carried out in 2015. The speech corpus was created in 2008.
HOESI is a collection of 144 informal two-party conversations (corpus version 1.00: 156 conversations) in a number of settings. The corpus is subdivided into 12 sessions (version 1.00: 13 sessions) containing 12 conversations each. Within a single session, the speakers are the same. Settings include:
In a car, at three different velocities
In a canteen, with and without background noise
In a studio without noise
In a studio with Lombard noise at three different amplitudes
Conversations are informal, mainly about topics of personal interest to the participants.
The general documentation files for the HOESI corpus can be found under /vdata/BAS/HOESI in the directory /DOC:
README.eng: text document providing documentation in English
README.deu: text document providing documentation in German
directory LOMBARD: contains wav files of car sounds at 80, 120 and 160 km/h, without speakers
directory EICHSIGNAL: contains wav files of calibration signals (only Eichsignal-2.wav is relevant, as the HDO microphone signals are not included in the corpus)
Validation_HOESI.html: validation report as HTML file
· Administrative Information:
Validating person: Nina Pörner
Date of validation: 08/04/15
Contact for requests regarding the corpus:
BAS Bayerisches Archiv für Sprachsignale
Institut für Phonetik
Universität München
Schellingstr. 3
D 80799 München
Number and type of media: 5 folders (potentially 5 CDs)
Content of each medium: directories DATA, DOC, ANNOT, META
Copyright statement and intellectual property rights (IPR):
This edition has been authorised by the copyright holder to be distributed to interested third parties for scientific and commercial usage. The respective terms of usage as posted on the BAS web site apply. Re-distribution of this corpus, parts of it or of transformed data is prohibited. In contrast to the original corpus, this edition does not contain the recorded input signals from the hearing aids (channels 3 to 8) but only the signals recorded from the headsets (channels 1 and 2). Consequently, some parts of the following documentation may refer to recording channels that are not contained in this edition.
· Technical information:
Layout of media: Information about file system type and directory structure:
HOESI_1 containing:
- DATA directory containing:
- - session directories 001*, 002 and 003
- ANNOT directory containing:
- - session directories 001*, 002 and 003
- DOC directory containing:
- - README.deu
- - README.eng
- - EICHSIGNAL directory
- - LOMBARD directory
- META directory containing:
- - SPEAEXT.TBL
- - SESSEXT.TBL
- - LOMBARD.TBL
- - STAT.TBL
HOESI_2 containing:
- DATA directory containing:
- - session directories 004, 005 and 006
- ANNOT directory containing:
- - session directories 004, 005 and 006
- DOC and META directories as in HOESI_1
HOESI_3 containing:
- DATA directory containing:
- - session directories 007, 008 and 009
- ANNOT directory containing:
- - session directories 007, 008 and 009
- DOC and META directories as in HOESI_1
HOESI_4 containing:
- DATA directory containing:
- - session directories 010 and 011
- ANNOT directory containing:
- - session directories 010 and 011
- DOC and META directories as in HOESI_1
HOESI_5 containing:
- DATA directory containing:
- - session directories 012 and 013
- ANNOT directory containing:
- - session directories 012 and 013
- DOC and META directories as in HOESI_1
* Files marked with an asterisk are not included in the present version of the corpus.
File nomenclature: Explanation of used codes (no white space in file names!):
<type>_<noise>_<lombard>_<session>-<channel>.wav
<type>_<noise>_<lombard>_<session>.TextGrid
type: A = training, E = test
noise (recording setting):
U0 = studio
U1 = quiet canteen
U2 = loud canteen
U4 = car, 80 km/h
U5 = car, 120 km/h
U6 = car, max velocity
lombard: L0 (no lombard noise), L1, L2, L3
session: 001 - 013
channel: 1 or 2
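The nomenclature above can be expressed as a regular expression. The following is a minimal sketch and not the script used for this validation; the function and class names are hypothetical, and the example file names in the usage note are constructed from the code lists above rather than taken from the corpus.

```python
import re
from typing import NamedTuple, Optional

# Pattern built from the code lists above:
#   <type>_<noise>_<lombard>_<session>-<channel>.wav
#   <type>_<noise>_<lombard>_<session>.TextGrid
NAME_RE = re.compile(
    r"(?P<rec_type>[AE])_(?P<noise>U[012456])_(?P<lombard>L[0-3])_"
    r"(?P<session>\d{3})(?:-(?P<channel>[12])\.wav|\.TextGrid)"
)


class RecordingName(NamedTuple):
    rec_type: str           # A = training, E = test
    noise: str              # U0, U1, U2, U4, U5 or U6
    lombard: str            # L0 to L3
    session: str            # 001 to 013
    channel: Optional[str]  # "1" or "2"; None for TextGrid files


def parse_name(file_name: str) -> Optional[RecordingName]:
    """Return the parsed fields, or None if the name violates the nomenclature."""
    m = NAME_RE.fullmatch(file_name)
    if m is None:
        return None
    if not 1 <= int(m["session"]) <= 13:
        return None
    return RecordingName(
        m["rec_type"], m["noise"], m["lombard"], m["session"], m["channel"]
    )
```

With this sketch, a constructed name such as A_U0_L0_002-1.wav is accepted, while A_U3_L0_002-1.wav (undefined setting code) or any name containing white space is rejected.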
Formats of signals and annotation files: (if non-standard formats are used, it is common to give a full description or to convert into a standard format) audio files: wav, annotation files: TextGrid, ok
Coding: LPCM, ok
Compression: not compressed, ok
Sampling rate: 48000, ok
Valid bits per sample: (values other than 8, 16, 24 should be reported) 16, ok
Multiplexed signals: (exact de-multiplexing algorithm; tools) n.a.
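The technical properties listed above (uncompressed LPCM, 48000 Hz, 16 bit) can be checked with Python's standard wave module. This is a sketch under assumptions, not the procedure used for this validation: the function name is hypothetical, and getsampwidth() reports the container size in bytes, which is taken here as a proxy for the valid bits per sample.

```python
import wave


def check_wav_format(path: str) -> list[str]:
    """Return a list of deviations from the documented signal format."""
    problems = []
    # wave.open raises wave.Error for files it cannot read as PCM WAV.
    with wave.open(path, "rb") as wav:
        if wav.getcomptype() != "NONE":
            problems.append(f"compressed: {wav.getcomptype()}")
        if wav.getframerate() != 48000:
            problems.append(f"sampling rate {wav.getframerate()} Hz, expected 48000")
        if wav.getsampwidth() != 2:  # 2 bytes = 16 bit
            problems.append(f"{8 * wav.getsampwidth()} bit samples, expected 16")
    return problems
```

An empty list means that the file matches the documented format.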
· Database contents:
Clearly stated purpose of the recordings: not provided
Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) Human-human dialogues, ok
Instruction to speakers in full copy: not provided
· Linguistic contents of prompted speech:
Specifications of the individual text items: spontaneous conversational speech, ok
Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.
Example prompt sheet or example sound file from the speech prompting: n.a.
· Linguistic contents of non-prompted speech:
Multi-party: (number of speakers, topics discussed, type of setting – formal/informal) Two speakers per conversation, informal, various topics (e.g. hearing aids, cinema, family, football, gardening), ok
Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) informal chat, ok
Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.
· Speaker information:
Speaker recruitment strategies: no information provided
Number of speakers: 25 (indices 01-26). VP 01 appears in two sessions; there is no speaker with index 20.
Distribution of speakers over sex, age, dialect regions: Sex distribution 7 (female sessions) to 6 (male sessions). Age range 46 – 74. Various dialectal regions. ok
Description/definition of dialect regions: Definition by state (Bundesland) or Austria, ok
· Recording platform and recording conditions:
Recording platform: MAUDIO 1814 Multi, 8-channel, described in README, ok
Position and type of microphones:
- Company name and type id: headset microphone Beyerdynamic Opus 54, described in README, ok
- Electret, dynamic, condenser: no information provided
- Directional properties: described in README, ok
- Mounting: described in README, ok
Position of speakers: (distance to microphone) described in README, ok
Bandwidth: (if other than zero to half of sampling rate) ok
Number of channels and channel separation: 1 channel per wav file, ok
Acoustical environment: different environments, described in README, ok
· Annotation (TextGrid):
Unambiguous spelling standard used in annotations: n.a.
Labeling symbols: Annotation “1” for turns by speaker 1, “2” for turns by speaker 2, “N” for noise. No transcription or other annotation provided, ok
List of non-standard spellings (dialectal variation, names etc.): n.a.
Distinction of homographs which are no homophones: n.a.
Character set used in annotations: ok
Any other language dependent information as abbreviations etc: n.a.
Annotation manual, guidelines, instructions: Provided in README, ok. Speaker turns are labelled “1” and “2” respectively, noise is labelled “N”. Backchannelling, laughter etc. are not labelled. Pauses of more than 2 sec are cut out, turns of more than 30 sec are divided. Guidelines were not always followed (see manual validation below).
Description of quality assurance procedures: not provided
Selection of annotators: no information provided
Training of annotators: no information provided
Annotation tools used: no information provided
· Lexicon:
Format: n.a.
Text-to-phoneme procedure: n.a.
Explanation or reference to the phoneme set: n.a.
Phonological or higher order phenomena accounted in the phonemic transcriptions: n.a.
· Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.
Word frequency table: n.a.
· Others:
Any other essential language-dependent information or convention: n.a.
Indication of how many files were double-checked by the producer together with percentage of detected errors: not provided
Status of documentation: README files: spelling mistakes corrected, annotation guidelines and corpus history updated. Files in LOMBARD and EICHSIGNAL directories complete. ok.
The following list contains all validation steps with the methodology and results.
Completeness of signal files: checked using a script, all expected files are present, ok
Completeness of meta data files: checked manually, all expected information is present, ok
Completeness of annotation files: checked using a script, all expected files are present, ok
Correctness of file names: checked using a script, all file names conform to nomenclature, ok
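The completeness checks above could be scripted along the following lines. The actual validation script is not documented here, so this is only a sketch under two assumptions: every recording should come with two channel files (-1.wav, -2.wav) and one TextGrid, and every session should contain 12 recordings. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def find_incomplete(wav_names, textgrid_names):
    """Map each recording id to the files that are missing for it."""
    channels = defaultdict(set)
    for name in wav_names:
        stem = name[: -len(".wav")]
        recording, _, channel = stem.rpartition("-")
        channels[recording].add(channel)
    grids = {name[: -len(".TextGrid")] for name in textgrid_names}

    problems = {}
    for rec in sorted(set(channels) | grids):
        missing = [
            f"{rec}-{ch}.wav"
            for ch in ("1", "2")
            if ch not in channels.get(rec, set())
        ]
        if rec not in grids:
            missing.append(f"{rec}.TextGrid")
        if missing:
            problems[rec] = missing
    return problems


def sessions_with_wrong_count(recordings, expected=12):
    """Return sessions whose number of recordings differs from `expected`."""
    # The session code is the last underscore-separated field of the id.
    counts = Counter(rec.rsplit("_", 1)[1] for rec in recordings)
    return {session: n for session, n in counts.items() if n != expected}
```

An empty result from both functions corresponds to the "all expected files are present" outcome reported above.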
Empty files: none
Status of signal, annotation and meta data files: In SPEAEXT.TBL, information about profession partly in German, partly in English. Fixed to German, ok.
Signal durations: According to README, type A recordings should be 7 minutes, type E recordings 11 minutes long. Checked using a script with 10% tolerance. Four recordings in session 001 were too short and a number of recordings from other sessions were too long. Fixed: the README now reads “at least X minutes”.
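A duration check of this kind can be sketched as follows, using the nominal durations (7 and 11 minutes) and the 10% tolerance stated above. The function names are hypothetical and the original script may have worked differently.

```python
import wave

NOMINAL_MINUTES = {"A": 7, "E": 11}  # per README: type A and type E recordings


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM wav file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def classify_duration(rec_type: str, seconds: float, tolerance: float = 0.10) -> str:
    """Compare a duration against the nominal value with a relative tolerance."""
    nominal = NOMINAL_MINUTES[rec_type] * 60
    if seconds < nominal * (1 - tolerance):
        return "too short"
    if seconds > nominal * (1 + tolerance):
        return "too long"
    return "ok"
```

With 10% tolerance, a type A recording passes between 378 and 462 seconds and a type E recording between 594 and 726 seconds. Under the revised README wording ("at least X minutes"), only the "too short" outcome would count as a violation.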
Duration cross checks: checked using a script, all signals/TextGrids belonging to the same recording are of the same length, ok.
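The cross check between signal and annotation durations can be sketched as below. It assumes that the TextGrids are stored in Praat's long text format, where the first "xmax = ..." line gives the end time of the whole file, and that the TextGrid starts at time 0. The 1 ms tolerance and the function names are assumptions made for this sketch.

```python
import re

# First "xmax = <number>" line of a long-format TextGrid: end time of the file.
XMAX_RE = re.compile(r"^\s*xmax\s*=\s*([0-9.eE+-]+)", re.MULTILINE)


def textgrid_end_time(textgrid_text: str) -> float:
    """Return the file-level end time (in seconds) of a long-format TextGrid."""
    m = XMAX_RE.search(textgrid_text)
    if m is None:
        raise ValueError("no 'xmax = ...' line found; not a long-format TextGrid?")
    return float(m.group(1))


def durations_match(wav_seconds: float, textgrid_text: str,
                    tolerance: float = 0.001) -> bool:
    """True if signal and annotation durations agree within `tolerance` seconds."""
    return abs(wav_seconds - textgrid_end_time(textgrid_text)) <= tolerance
```

The signal duration passed as wav_seconds can be obtained from the wav header, for instance as number of frames divided by sampling rate.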
Cross checks of meta information: Checked using a script. Speaker 02 from session 001 is reported to be from Bayern in SPEAEXT.TBL and from Rheinland-Pfalz in SESSEXT.TBL. The recording sounds more like a Rheinland-Pfalz accent, but this is hard to tell. Apart from that, ok.
Cross checks of summary listings: n.a.
Annotation contents: Checked using a script. All TextGrids from session 001 are missing the “NOISE” tier. Apart from that, ok.
Annotation tier nomenclature: Tier names should be speaker ID or “NOISE”. Checked using a script. A number of mistakes in sessions 007, 008 and 010 (missing zeroes in IDs, superfluous tabulator chars). Fixed, ok.
Annotation texts: Annotation texts checked using script, all were either “1”, “2”, “N” or empty strings, as expected according to README. Ok
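The checks of tier names and annotation texts can be sketched as follows. The sketch assumes long-format TextGrids, in which tier names appear as name = "..." and interval labels as text = "...". The exact form of the speaker IDs used as tier names is not specified in this report, so the set of valid IDs is passed in by the caller; the function names are hypothetical.

```python
import re

# Quoted values in Praat's long text format; a doubled quote ("") inside a
# value stands for a literal quote character.
TIER_NAME_RE = re.compile(r'^\s*name\s*=\s*"((?:[^"]|"")*)"', re.MULTILINE)
LABEL_RE = re.compile(r'^\s*text\s*=\s*"((?:[^"]|"")*)"', re.MULTILINE)

ALLOWED_LABELS = {"1", "2", "N", ""}  # as specified in the README


def bad_tier_names(textgrid_text, speaker_ids):
    """Return tier names that are neither a known speaker ID nor "NOISE"."""
    allowed = set(speaker_ids) | {"NOISE"}
    return [n for n in TIER_NAME_RE.findall(textgrid_text) if n not in allowed]


def bad_labels(textgrid_text):
    """Return interval labels other than "1", "2", "N" or the empty string."""
    return [t for t in LABEL_RE.findall(textgrid_text) if t not in ALLOWED_LABELS]
```

Because names are compared verbatim, a tier name with a missing leading zero or a trailing tabulator character is reported as invalid.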
For one randomly chosen recording per session (~8 % of the data), the wav and TextGrid files were manually compared using PRAAT. 4 major errors were found in 3 files (~23 % of the sample), such as non-annotated turns or substantial duration differences between turns and their annotation. More generally, the specifications were not always followed with respect to maximal turn durations: a number of segments exceeded the maximal duration of 30 sec, especially in sessions 001 (34 segments exceeding 30 sec), 004 (24) and 009 (25).
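Segments exceeding the maximal duration of 30 seconds can be found automatically. The following sketch assumes long-format TextGrids and counts only intervals labelled "1" or "2" (speaker turns); whether the counts reported above were obtained this way is not documented, and the function name is hypothetical.

```python
import re

# One interval of a long-format TextGrid: xmin, xmax and the quoted label.
INTERVAL_RE = re.compile(
    r'xmin\s*=\s*([0-9.eE+-]+)\s*'
    r'xmax\s*=\s*([0-9.eE+-]+)\s*'
    r'text\s*=\s*"((?:[^"]|"")*)"'
)


def long_segments(textgrid_text, max_seconds=30.0, labels=("1", "2")):
    """Return (start, end, label) for labelled segments longer than max_seconds."""
    result = []
    for xmin, xmax, label in INTERVAL_RE.findall(textgrid_text):
        start, end = float(xmin), float(xmax)
        if label in labels and end - start > max_seconds:
            result.append((start, end, label))
    return result
```

File-level and tier-level xmin/xmax pairs are skipped, because they are not directly followed by a text entry.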
-
The non-adherence to annotation specifications is not a big problem in this corpus. Annotations were made for the purpose of detecting speaker turns, which is perfectly possible even when segments surpass 30 seconds. In order to avoid confusion, annotation specifications in the READMEs should be changed accordingly.
One option is to exclude session 001 from the release. This would resolve a number of issues:
insufficient recording durations
missing NOISE tiers in TextGrids
VP 02 (the speaker with the unclear state of origin) only appears in session 001
the double appearance of VP 01 in sessions 001 and 013
the sex distribution would return to 50/50 (as was originally intended)
session 001 contains excessive noise stemming from the recording device, which is not present in the other sessions
Session 001 was excluded and the annotation specifications were updated. The updated corpus is ok.