Validation report for the aGender Database

Authors	Florian Schiel, Christian Heinrich
Affiliation	BAS Bayerisches Archiv für Sprachsignale Institut für Phonetik Universität München
Postal address	Schellingstr. 3 D 80799 München
E-mail	schiel@phonetik.uni-muenchen.de heinrich@phonetik.uni-muenchen.de
Telephone	+49-89-2180-2758
Fax	+49-89-2180-5790
Corpus Version	1.1
Date	2013-11-27
Status	Finished
Comment	Most of the flaws listed in this revalidation have been resolved in aGender version 2.0. Exceptions that cannot be corrected are marked in red. The corpus is suitable for publication according to BAS and CLARIN standards.
Validation Guidelines	Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Bas/BITS_Cookbook_TP1.pdf

Validation results:

Summary

The speech corpus aGender version 1.1 has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples.

Introduction and Corpus Description

German native speakers call an automated voice portal, read text and answer some open questions. The aim of this corpus is to provide material for speaker age and gender recognition over the telephone. The corpus contains the voices of 945 German speakers (at least 100 speakers per class), each taking part in up to seven sessions covering up to 18 items each. The audio was recorded over cell phones and landline connections in 8000 Hz, 8-bit A-law format, then converted to 8 kHz, 16-bit PCM (13 bits valid). Two working sets were created from that data: a training+development set (770 speakers, 81.5%) and an evaluation set (175 speakers, 25 per class, 18.5%), with no speaker appearing in both sets. The speakers are classified into 7 age and gender classes; the data of the test set is masked, i.e. no classification is given.
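The "13 bits valid" figure is what standard A-law expansion would produce. The following is a minimal sketch, assuming the conversion followed the usual ITU-T G.711 A-law expansion scaled to 16 bits; the tool actually used by the corpus producers is not documented, and the function name `alaw_to_pcm16` is illustrative only.

```python
def alaw_to_pcm16(code: int) -> int:
    """Expand one 8-bit A-law code to a 16-bit linear PCM sample (G.711)."""
    code ^= 0x55                      # undo the bit inversion applied by the encoder
    mantissa = (code & 0x0F) << 4     # 4-bit mantissa, moved into position
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    return value if code & 0x80 else -value   # sign bit set means positive


# Expand all 256 possible A-law codes.
samples = [alaw_to_pcm16(c) for c in range(256)]

# Every expanded value is a multiple of 8, so the lowest 3 of the 16 bits
# carry no information: 16 - 3 = 13 valid bits.
all_multiples_of_8 = all(s % 8 == 0 for s in samples)
peak = max(samples)
```

Under this assumption the largest sample magnitude is 32256 rather than 32767, which is a quick way to recognise A-law-derived PCM in a raw file.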

I.) Validation of Documentation

The General directory contains the following documentation files for the aGender corpus which can be found under:

1.) main directory

transcripts_test.txt	Transcriptions for test set
transcripts_traindev.txt	Transcriptions for Training and Development Set
trainSampleList_train.txt	List of filenames, size in bytes, speakerID, age and gender Training Set
trainSampleList_devel.txt	List of filenames, size in bytes, speakerID, age and gender Development Set

2.) directory /DOCU

README	File describing the database; age/gender classification test example
TRANSRULES.TXT	Simple description of the transcription rules
SPEAKERHIST.JPG	Age / number of speakers histogram

3.) directory /DOCU/PAPERS

LREG.PDF	Paper describing the corpus

Administrative Information:

Validating person: C.H.

Date of validation: November 2013

Contact for requests regarding the corpus: ok.

Number and type of media: not given

Content of each medium: not given

Copyright statement and intellectual property rights (IPR): not given

Technical information:

Layout of media: Information about file system type and directory structure: not given

File nomenclature: Explanation of used codes - no white space in file names!: a1<speaker id><turn id>.raw ok

Formats of signals and annotation files: If non-standard formats are used, it is common to give a full description or convert into a standard format: raw signal files, no annotation files: o.k.

Coding: PCM

Compression: none

Sampling rate: 8 kHz

Valid bits per sample: 13 bit

Used bytes per sample: 2

Multiplexed signals: n.a.
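The file nomenclature can be checked mechanically. The following is a minimal sketch, assuming the layout `<speaker id>/<session>/a1<speaker id>s<item>.raw` that the file paths cited in this report (e.g. 3547/4/a13547s11.raw) appear to follow. The `s` separator and the digits-only IDs are inferred from those examples, not taken from the corpus documentation, and `parse_recording_path` is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional


class Recording(NamedTuple):
    speaker: str
    session: int
    item: int


# Assumed pattern, inferred from paths such as 3547/4/a13547s11.raw
_PATH_RE = re.compile(r"^/?(\d+)/(\d+)/a1(\d+)s(\d+)\.raw$")


def parse_recording_path(path: str) -> Optional[Recording]:
    """Return the parsed fields, or None if the path violates the assumed scheme."""
    match = _PATH_RE.match(path)
    if match is None:
        return None
    dir_speaker, session, file_speaker, item = match.groups()
    if dir_speaker != file_speaker:
        # speaker ID in the directory and in the file name must agree
        return None
    return Recording(dir_speaker, int(session), int(item))
```

A path that contains white space or whose directory and file name disagree on the speaker ID yields `None`, which makes such files easy to list during a formal check.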

Database contents:

Clearly stated purpose of the recordings: ok. (LREG.pdf)

Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) fixed and free text ok

Instruction to speakers in full copy: not given

Linguistic contents of prompted speech:

Specifications of the individual text items: fixed and free text

Specification for the prompt sheet design or specification of the design of the speech prompts: not given

Example prompt sheet or example sound file from the speech prompting: not given

Linguistic contents of non-prompted speech:

Multi-party: (number of speakers, topics discussed, type of setting – formal/informal) n.a.

Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n.a.

Human-machine dialogues: (domain(s), topic(s), dialogue strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) free text typical for automated call centers, numbers, names, time, date, etc., system driven ok

Speaker information:

Speaker recruitment strategies: not given

Number of speakers: 945

Distribution of speakers over sex, age, dialect regions: sex and age for the Training and Development Set in trainSampleList_train.txt and trainSampleList_devel.txt ok

Description/definition of dialect regions: not given

Recording platform and recording conditions:

Recording platform: LREG.pdf, Genesys Voice Platform recording feature ok

Position and type of microphones: n.a.

- Company name and type id: n.a.
- Electret, dynamic, condenser: n.a.
- Directional properties: n.a.
- Mounting: n.a.

Position of speakers: (distance to microphone) n.a.

Bandwidth: (if other than zero to half of sampling rate) not given

Number of channels and channel separation: 1 ok

Acoustical environment: indoor (landline telephone), indoor and outdoor (mobile phone) ok

Plus for telephone recordings:

Recording hardware, telephone link (analog, digital): mobile network or ISDN and PBX, application server hosting the recording application and a VoiceXML telephony server (Genesys Voice Platform, GVP) (LREG.pdf) ok

Network from where the call originated: not given

Type of handset: not given

Annotation (txt):

Unambiguous spelling standard used in annotations: ok

Labeling symbols: TRANSRULES.txt ok

List of non-standard spellings (dialectal variation, names etc.): not given

Distinction of homographs which are no homophones: not given

Character set used in annotations: ISO 8859, not given

Any other language-dependent information such as abbreviations etc.: not given

Annotation manual, guidelines, instructions: TRANSRULES.txt

Description of quality assurance procedures: not given

Selection of annotators: not given

Training of annotators: not given

Annotation tools used: not given

Lexicon:

Format: the resource contains no pronunciation lexicon; since the transcription contains non-explicit numerals (e.g. '... der 16.10....') we cannot provide an automatically created pronunciation dictionary. not given

Text-to-phoneme procedure: n.a.

Explanation or reference to the phoneme set: n.a.

Phonological or higher order phenomena accounted in the phonemic transcriptions: n.a.

Statistical information:

Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.

Word frequency table: not given

Others:

Any other essential language-dependent information or convention: not given

Indication of how many files were double-checked by the producer together with percentage of detected errors: not given

Status documentation: acceptable?: no

II.) Automatic Validation

The following contains all validation steps with the methodology and results.

Completeness of signal files:

Training and Development Set:

Number of speakers : 770

Number of recordings: 53076

Number of speakers with complete set of recordings (6x18) : 72

empty files : 0

1 session for 30 speakers

2 sessions for 29 speakers

3 sessions for 75 speakers

4 sessions for 135 speakers

5 sessions for 245 speakers

6 sessions for 245 speakers

7 sessions for 11 speakers

8 sessions for 0 speakers

1 recording in 80 sessions

2 recordings in 62 sessions

3 recordings in 56 sessions

4 recordings in 79 sessions

5 recordings in 89 sessions

6 recordings in 86 sessions

7 recordings in 69 sessions

8 recordings in 123 sessions

9 recordings in 67 sessions

10 recordings in 67 sessions

11 recordings in 69 sessions

12 recordings in 95 sessions

13 recordings in 82 sessions

14 recordings in 118 sessions

15 recordings in 89 sessions

16 recordings in 82 sessions

17 recordings in 193 sessions

18 recordings in 2119 sessions

19 recordings in 0 sessions
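The two distributions above can be cross-checked against each other and against the reported totals. The following is a small sketch that only re-enters the figures listed in this section; the variable names are illustrative.

```python
# Speakers per number of sessions, as listed above.
speakers_by_sessions = {1: 30, 2: 29, 3: 75, 4: 135, 5: 245, 6: 245, 7: 11}

# Sessions per number of recordings, as listed above.
sessions_by_recordings = {
    1: 80, 2: 62, 3: 56, 4: 79, 5: 89, 6: 86,
    7: 69, 8: 123, 9: 67, 10: 67, 11: 69, 12: 95,
    13: 82, 14: 118, 15: 89, 16: 82, 17: 193, 18: 2119,
}

# Total speakers, to compare with the reported 770.
total_speakers = sum(speakers_by_sessions.values())

# Total sessions, derived independently from both distributions.
sessions_from_speakers = sum(n * count for n, count in speakers_by_sessions.items())
sessions_from_recordings = sum(sessions_by_recordings.values())

# Total recordings, to compare with the reported 53076.
total_recordings = sum(n * count for n, count in sessions_by_recordings.items())

print(total_speakers, sessions_from_speakers, sessions_from_recordings, total_recordings)
# prints: 770 3625 3625 53076
```

Both distributions imply the same 3625 sessions, and the totals match the reported 770 speakers and 53076 recordings, so the listed figures are internally consistent.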

Completeness of meta data files: only transcription lists with some speaker metadata ok

Completeness of annotation files: 3 missing entries in transcripts_traindev.txt (53073 entries); 3547/4/a13547s11.raw, 4139/2/a14139s17.raw, 6723/6/a16723s18.raw

Correctness of file names: ok

Empty files: none

Status of signal, annotation and meta data files: signals ok, annotation incomplete, no annotation files, no meta data files

Test Set:

17332 files vs. 12165 entries in transcripts_test.txt; duplicate entries for several files. Files that are not listed in transcripts_test.txt can be deleted.
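A comparison of this kind between signal files and transcript entries can be sketched as follows. This is an illustration under assumptions, not the procedure used for this validation: the layout of the transcript files is not described here, so the function takes plain lists of file names, and `compare_listing` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, List


def compare_listing(signal_files: Iterable[str],
                    transcript_names: Iterable[str]) -> Dict[str, List[str]]:
    """Cross-check signal files against the names listed in a transcript file."""
    counts = Counter(transcript_names)
    listed = set(counts)
    on_disk = set(signal_files)
    return {
        # signal files with no transcript entry
        "unlisted_files": sorted(on_disk - listed),
        # transcript entries whose signal file is absent
        "entries_without_file": sorted(listed - on_disk),
        # names that occur more than once in the transcript list
        "repeated_entries": sorted(n for n, c in counts.items() if c > 1),
    }


# Invented toy data, not taken from the corpus.
files = ["a10001s01.raw", "a10001s02.raw", "a10001s03.raw"]
entries = ["a10001s01.raw", "a10001s01.raw", "a10001s02.raw"]
report = compare_listing(files, entries)
```

For the toy data, `report["unlisted_files"]` contains the one file without a transcript entry and `report["repeated_entries"]` the one name listed twice.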

III.) Manual Validation

A random comparison of audio and transcription reveals transcription errors in approx. 10 % of the transcriptions.

For example /1084/4/a11084s16.raw transcription “Grell” – spoken “Prell”

/1194/4/a11194s10.raw transcription “97” – spoken “87”

IV.) Other Relevant Observations

Numbers are not all spelled out in the transcription files.

Encoding ISO 8859 should be changed to UTF-8.

Transcription is one-level, 1 person, no quality control, no inter-labeller agreement.

There exist no written permissions of the speakers to use their recordings for scientific purposes. However, the speakers were paid to deliver their recordings. We can therefore infer the speakers' consent to their data being used for the scientific development of speaker classification.

Because of the erroneous transcriptions and the numerals that are not spelled out, the corpus is not suitable for speech recognition. However, it is suitable for its original purpose, namely speaker classification.

The corpus is very inhomogeneous: only a few speakers provide a full set of 6 sessions (72 of 770).

V.) Comments for Improvement

The corpus documentation has to be revised. Encoding ISO 8859 should be changed to UTF-8.

Numbers should be all spelled out in the transcription files (free text items!).

Transcriptions should be re-labelled (at least partially to measure inter-labeller agreement).

A word list, word statistics and a pronunciation dictionary should be added.

Duplicate utterances in the test set should be filtered out.

Standardized metadata files and SpeechDat-conformant metadata tables for speaker and recording information should be added to the corpus.
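The encoding change recommended above could be sketched as follows. The documentation does not state which part of ISO 8859 the transcription files use; ISO 8859-1 is assumed here as a plausible choice for German text, and `latin_to_utf8` is a hypothetical helper name.

```python
def latin_to_utf8(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode transcript bytes from an ISO 8859 variant to UTF-8.

    Assumption: the source is ISO 8859-1; pass e.g. "iso-8859-15" if the
    files turn out to use a different part of the standard.
    """
    return raw.decode(source_encoding).encode("utf-8")


# "München" as ISO 8859-1 bytes: the umlaut is the single byte 0xFC.
converted = latin_to_utf8(b"M\xfcnchen")
```

In UTF-8 the umlaut becomes the two-byte sequence 0xC3 0xBC, so converted files grow slightly; checking a few lines with umlauts and "ß" after conversion is a simple way to confirm that the assumed source encoding was right.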

VI.) Result

The corpus is usable for its original purpose once the corrections not marked in red in this validation report have been implemented.

To make the corpus usable for other tasks (such as speech recognition), some manual improvements have to be implemented.

The corpus is suitable for publication according to BAS and CLARIN standards.