Revalidation report for the HOESI Database

Authors	Nina Pörner
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-mail	npoerner@phonetik.uni-muenchen.de
Telephone	-
Fax	-
Corpus Version	02.00.00
Date	08.04.15
Status	Corpus revalidated and updated. Status ok.
Comment	-
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook

Validation results of the HOESI Corpus:

Summary

The speech corpus SIEMENS HOERGERAETE SPRACHKORPUS has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of a subsample.

Introduction and Corpus Description

This document summarizes the results of an in-house validation of the speech corpus HOESI carried out in 2015. The speech corpus was created in 2008.

HOESI is a collection of 144 informal two-party conversations (corpus version 1.00: 156 conversations) in a number of settings. The corpus is subdivided into 12 sessions (version 1.00: 13 sessions) containing 12 conversations each. Within a single session, the speakers are the same. Settings include:

In a car, at three different velocities

In a canteen, with and without background noise

In a studio without noise

In a studio with Lombard noise at three different amplitudes

Conversations are informal, mainly about topics of personal interest to the participants.

I.) Validation of Documentation

The General Documentation directory contains the following documentation files for the HOESI corpus which can be found under:

/vdata/BAS/HOESI

1.) the directory /DOC

README.eng	text document providing documentation in English
README.deu	text document providing documentation in German
directory LOMBARD	contains wav files of car sounds at 80, 120 and 160 km/h, without speakers

directory EICHSIGNAL	contains wav files of calibration signals (only Eichsignal-2.wav is relevant, as the HDO microphone signals are not included in the corpus)
Validation_HOESI.html	validation report as HTML file

· Administrative Information:

Validating person: Nina Pörner

Date of validation: 08/04/15

Contact for requests regarding the corpus:

BAS Bayerisches Archiv für Sprachsignale
Institut für Phonetik
Universität München

Schellingstr. 3
D 80799 München

Number and type of media: 5 folders (potentially 5 CDs)

Content of each medium: directories DATA, DOC, ANNOT, META

This edition has been authorised by the copyright holder to be distributed to interested third parties for scientific and commercial usage. The respective terms of usage as posted on the web site of BAS apply. Re-distribution of this corpus, parts of it or of transformed data is prohibited. In contrast to the original corpus, this edition does not contain the recorded input signals from the hearing aids (channels 3 to 8) but only the signals recorded from the headsets (channels 1 and 2). As a result, some parts of the following documentation may refer to recording channels that are not contained in this edition.

· Technical information:

Layout of media: Information about file system type and directory structure:

HOESI_1 containing:

- DATA directory containing:

- - session directories 001*, 002 and 003

- ANNOT directory containing:

- - session directories 001*, 002 and 003

- DOC directory containing:

- - README.deu

- - README.eng

- - EICHSIGNAL directory

- - LOMBARD directory

- META directory containing:

- - SPEAEXT.TBL

- - SESSEXT.TBL

- - LOMBARD.TBL

- - STAT.TBL

HOESI_2 containing:

- DATA directory containing:

- - session directories 004, 005 and 006

- ANNOT directory containing:

- - session directories 004, 005 and 006

- DOC and META directories as in HOESI_1

HOESI_3 containing:

- DATA directory containing:

- - session directories 007, 008 and 009

- ANNOT directory containing:

- - session directories 007, 008 and 009

- DOC and META directories as in HOESI_1

HOESI_4 containing:

- DATA directory containing:

- - session directories 010 and 011

- ANNOT directory containing:

- - session directories 010 and 011

- DOC and META directories as in HOESI_1

HOESI_5 containing:

- DATA directory containing:

- - session directories 012 and 013

- ANNOT directory containing:

- - session directories 012 and 013

- DOC and META directories as in HOESI_1

* Directories marked with an asterisk are not included in the present version of the corpus.

File nomenclature: Explanation of used codes (no white space in file names!):
<type>_<noise>_<lombard>_<session>-<channel>.wav

<type>_<noise>_<lombard>_<session>.TextGrid

type: A = training, E = test

noise (recording setting):

U0 = studio

U1 = quiet canteen

U2 = loud canteen

U4 = car, 80 km/h

U5 = car, 120 km/h

U6 = car, max velocity

lombard: L0 (no lombard noise), L1, L2, L3

session: 001 - 013

channel: 1 or 2
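
The nomenclature above can be checked mechanically. The following is a minimal sketch, not the script used for this validation: the regular expression and the function name parse_name are assumptions, built only from the codes listed above (types A and E, settings U0, U1, U2, U4, U5 and U6, Lombard levels L0 to L3, sessions 001 to 013, channels 1 and 2). The file names in the usage lines are hypothetical examples.

```python
import re

# Pattern assembled from the documented codes; the exact expression is an assumption.
NAME_RE = re.compile(
    r"(?P<type>[AE])_(?P<setting>U[012456])_(?P<lombard>L[0-3])_"
    r"(?P<session>\d{3})(?:-(?P<channel>[12])\.wav|\.TextGrid)"
)


def parse_name(filename):
    """Return the fields of a file name, or None if it does not conform."""
    match = NAME_RE.fullmatch(filename)
    if match is None:
        return None
    fields = match.groupdict()
    # Sessions are documented as 001 to 013.
    if not 1 <= int(fields["session"]) <= 13:
        return None
    return fields


# Hypothetical file names, for illustration only.
print(parse_name("A_U0_L0_002-1.wav"))
print(parse_name("E_U5_L0_013.TextGrid"))
print(parse_name("A_U3_L0_002-1.wav"))
```

For a TextGrid name the channel field is None, since annotation files carry no channel suffix.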

Formats of signals and annotation files: (if non-standard formats are used, it is common to give a full description or to convert into a standard format) audio files: wav, annotation files: TextGrid, ok

Coding: LPCM, ok

Compression: not compressed, ok

Sampling rate: 48000, ok

Valid bits per sample: (values other than 8, 16, 24 should be reported) 16, ok

Multiplexed signals: (exact de-multiplexing algorithm; tools) n.a.

· Database contents:

Clearly stated purpose of the recordings: not provided

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) Human-human dialogues, ok

Instruction to speakers in full copy: not provided

· Linguistic contents of prompted speech:

Specifications of the individual text items: spontaneous conversational speech, ok

Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.

Example prompt sheet or example sound file from the speech prompting: n.a.

· Linguistic contents of non-prompted speech:

Multi-party:(number of speakers, topics discussed, type of setting – formal/informal) Two speakers per conversation, informal, various topics (e.g. hearing aids, cinema, family, football, gardening), ok

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) informal chat, ok

Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.

· Speaker information:

Speaker recruitment strategies: no information provided

Number of speakers: 25 (indices 01-26). VP 01 appears in two sessions, and there is no speaker with index 20.

Distribution of speakers over sex, age, dialect regions: Sex distribution 7 (female sessions) to 6 (male sessions). Age range 46 – 74. Various dialectal regions. ok

Description/definition of dialect regions: Definition by state (Bundesland) or Austria, ok

· Recording platform and recording conditions:

Recording platform: MAUDIO 1814 Multi, 8-channel, described in README, ok

Position and type of microphones:
- Company name and type id: headset microphone Beyerdynamic Opus 54, described in README, ok
- Electret, dynamic, condenser: no information provided
- Directional properties: described in README, ok
- Mounting: described in README, ok

Position of speakers: (distance to microphone) described in README, ok

Bandwidth: (if other than zero to half of sampling rate) ok

Number of channels and channel separation: 1 channel per wav file, ok

Acoustical environment: different environments, described in README, ok

· Annotation (TextGrid):

Unambiguous spelling standard used in annotations: n.a.

Labeling symbols: Annotation “1” for turns by speaker 1, “2” for turns by speaker 2, “N” for noise. No transcription or other annotation provided, ok

List of non-standard spellings (dialectal variation, names etc.): n.a.

Distinction of homographs which are no homophones: n.a.

Character set used in annotations: ok

Any other language dependent information as abbreviations etc: n.a.

Annotation manual, guidelines, instructions: Provided in README, ok. Speaker turns are labelled “1” and “2” respectively, noise is labelled “N”. Backchannelling, laughter etc. are not labelled. Pauses of more than 2 sec are cut out, turns of more than 30 sec are divided. The guidelines were not always followed (see manual validation below).

Description of quality assurance procedures: not provided

Selection of annotators: no information provided

Training of annotators: no information provided

Annotation tools used: no information provided

· Lexicon:

Format: n.a.

Text-to-phoneme procedure: n.a.

Explanation or reference to the phoneme set: n.a.

Phonological or higher order phenomena accounted in the phonemic transcriptions: n.a.

· Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: n.a.

· Others:

Any other essential language-dependent information or convention: n.a.

Indication of how many files were double-checked by the producer together with percentage of detected errors: not provided

Status of documentation: README files: spelling mistakes corrected, annotation guidelines and corpus history updated. Files in LOMBARD and EICHSIGNAL directories complete. ok.

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal files: checked using a script, all expected files are present, ok

Completeness of meta data files: checked manually, all expected information is present, ok

Completeness of annotation files: checked using a script, all expected files are present, ok

Correctness of file names: checked using a script, all file names conform to nomenclature, ok

Empty files: none

Status of signal, annotation and meta data files: In SPEAEXT.TBL, the information about profession was given partly in German, partly in English. Changed to German throughout, ok.

Signal durations: According to README, type A recordings should be 7 minutes and type E recordings 11 minutes long. Checked using a script with 10% tolerance. Four recordings in session 001 were too short, and a number of recordings from other sessions were too long. Fixed: README now reads "at least X minutes".

Duration cross checks: checked using a script, all signals/TextGrids belonging to the same recording are of the same length, ok.

Cross checks of meta information: Checked using a script. Speaker 02 from session 001 is reported to be from Bayern in SPEAEXT.TBL and from Rheinland-Pfalz in SESSEXT.TBL. The recording sounds more like a Rheinland-Pfalz accent, but this is hard to judge. Apart from that, ok.

Cross checks of summary listings: n.a.

Annotation contents: Checked using a script. All TextGrids from session 001 are missing the “NOISE” tier. Apart from that, ok.

Annotation tier nomenclature: Tier names should be the speaker ID or “NOISE”. Checked using a script. A number of mistakes were found in sessions 007, 008 and 010 (missing zeroes in IDs, superfluous tab characters). Fixed, ok.

Annotation texts: Annotation texts checked using script, all were either “1”, “2”, “N” or empty strings, as expected according to README. Ok
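
The signal duration check listed above (nominal durations of 7 and 11 minutes with 10% tolerance) could look roughly as follows. This is a sketch under stated assumptions, not the script used for this validation: the function names are hypothetical, and the wav files are assumed to be readable with Python's standard wave module, which holds for uncompressed LPCM files.

```python
import wave

# Nominal durations in minutes as stated in the README (type A: 7, type E: 11).
NOMINAL_MINUTES = {"A": 7, "E": 11}


def wav_duration_seconds(path):
    """Duration of an LPCM wav file, from frame count and sampling rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def classify_duration(seconds, rec_type, tolerance=0.10):
    """Return 'too short', 'too long' or 'ok' relative to the nominal duration."""
    nominal = NOMINAL_MINUTES[rec_type] * 60
    if seconds < nominal * (1 - tolerance):
        return "too short"
    if seconds > nominal * (1 + tolerance):
        return "too long"
    return "ok"


# Usage with made-up durations in seconds (no corpus file is read here).
print(classify_duration(420.0, "A"))
print(classify_duration(350.0, "A"))
print(classify_duration(800.0, "E"))
```

Under the updated README wording ("at least X minutes"), only the "too short" branch would count as a violation.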

III.) Manual Validation

For one randomly chosen recording per session (~8 % of the data), the wav and TextGrid files were manually compared using PRAAT. Four major errors were found in 3 files (~23 % of the sample), such as non-annotated turns or substantial duration differences between turns and the respective annotation. More generally, it appears that the annotation specifications were not always followed with respect to maximal turn durations: a number of segments surpassed the maximal duration of 30 sec, esp. in sessions 001 (34 segments exceeding 30 sec), 004 (24) and 009 (25).
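
Counting segments that exceed the 30 sec limit can be sketched as below. This is an illustration only: it assumes the TextGrid intervals have already been read into (start, end, label) tuples by some other means, and it assumes that only speaker turns (labels "1" and "2") count towards the limit. The function name overlong_turns and the example intervals are hypothetical.

```python
def overlong_turns(intervals, max_seconds=30.0, turn_labels=("1", "2")):
    """Return the speaker-turn intervals that are longer than max_seconds."""
    return [
        (start, end, label)
        for start, end, label in intervals
        if label in turn_labels and end - start > max_seconds
    ]


# Made-up intervals in seconds: (start, end, label).
example = [
    (0.0, 12.5, "1"),
    (12.5, 47.0, "2"),   # 34.5 sec, exceeds the limit
    (47.0, 80.0, "N"),   # noise, ignored under the assumption above
    (80.0, 110.0, "1"),  # exactly 30 sec, not counted
]
print(len(overlong_turns(example)))
```

Whether noise segments or empty intervals should also be counted depends on how the limit in the README is read; the turn_labels parameter leaves that choice open.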

IV.) Other Relevant Observations

V.) Comments for Improvement

The non-adherence to annotation specifications is not a serious problem in this corpus. Annotations were made for the purpose of detecting speaker turns, which is perfectly possible even when segments surpass 30 seconds. In order to avoid confusion, the annotation specifications in the READMEs should be changed accordingly.
Session 001 could be excluded from the release. This would resolve a number of issues:
- insufficient recording durations
- missing NOISE tiers in TextGrids
- VP 02 (the speaker with the unclear state of origin) only appears in session 001
- the double appearance of VP 01 in sessions 001 and 013
- the sex distribution would go back to 50/50 (as was originally intended)
- session 001 contains excessive noise stemming from the recording device, which is not present in the other sessions

VI.) Result

Session 001 was excluded and the annotation specifications were updated. The updated corpus is ok.