Authors |
Florian Schiel, Christian Heinrich |
Affiliation |
BAS Bayerisches Archiv für Sprachsignale |
Postal address |
Schellingstr. 3 |
|
schiel@phonetik.uni-muenchen.de |
Telephone |
+49-89-2180-2758 |
Fax |
+49-89-2180-5790 |
Corpus Version |
1.1 |
Date |
2013-11-27 |
Status |
Finished |
Comment |
Most of the flaws listed in this revalidation have been solved in aGender version 2.0. Exceptions that cannot be corrected are marked in red. The corpus is suitable for publication according to BAS and CLARIN standards. |
Validation Guidelines |
Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, |
The speech corpus aGender version 1.1 has been validated against general principles of good practice. The validation covered completeness, formal checks and manual checks of selected subsamples.
German native speakers call an automated voice portal, read text and answer some open questions. The aim of this corpus is to provide material for speaker age and gender recognition over the telephone. The corpus contains the voices of 945 German speakers (at least 100 speakers per class), each taking part in up to seven sessions covering up to 18 items each. The audio was recorded over cell phone and landline connections in 8000 Hz, 8 bit A-law format, then converted to 8 kHz, 16 bit PCM (13 bits valid). Two working sets were created from that data: a training+development set (770 speakers, 81.5%) and an evaluation set (175 speakers, 25 per class, 18.5%), each with non-overlapping speakers. The speakers are classified into 7 age and gender classes; the data of the test set is masked, i.e. no classification is given.
The General directory contains the following documentation files for the aGender corpus which can be found under:
1.) main directory
transcripts_test.txt |
Transcriptions for test set |
transcripts_traindev.txt |
Transcriptions for Training and Development Set |
trainSampleList_train.txt |
List of file names, size in bytes, speaker ID, age and gender for the Training Set |
trainSampleList_devel.txt |
List of file names, size in bytes, speaker ID, age and gender for the Development Set |
2.) directory /DOCU
README |
File describing the database; age/gender classification test example |
TRANSRULES.TXT |
Simple description of the transcription rules |
SPEAKERHIST.JPG |
Age / number of speakers histogram |
3.) directory /DOCU/PAPERS
LREG.PDF |
Paper describing the corpus |
Administrative Information:
Validating person: C.H.
Date of validation: November 2013
Contact for requests regarding the corpus: ok.
Number and type of media: not given
Content of each medium: not given
Copyright statement and intellectual property rights (IPR): not given
Technical information:
Layout of media: Information about file system type and directory structure: not given
File nomenclature: Explanation of used codes - no white space in file names!: a1<speaker id><turn id>.raw ok
Formats of signal and annotation files: If non-standard formats are used, it is common to give a full description or convert into a standard format: raw signal files, no annotation files: ok
Coding: PCM
Compression: none
Sampling rate: 8 kHz
Valid bits per sample: 13 bit
Used bytes per sample: 2
Multiplexed signals: n.a.
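The signal parameters above (headerless PCM, 8 kHz, 2 bytes per sample, 13 valid bits) can be checked on an individual file with a few lines of Python. The following is a minimal sketch, not part of the corpus tools. It assumes little-endian byte order and that the unused bits are the low-order bits of each 16 bit sample; neither is stated in the corpus documentation. The function names are hypothetical.

```python
import array
import sys

SAMPLE_RATE_HZ = 8000   # sampling rate as stated in this report
BYTES_PER_SAMPLE = 2    # used bytes per sample as stated in this report


def load_samples(raw_bytes, little_endian=True):
    """Decode headerless 16 bit PCM bytes into a list of signed integers.

    The byte order is an assumption; the report does not specify it.
    """
    samples = array.array("h")
    samples.frombytes(raw_bytes)
    # Swap bytes if the assumed file byte order differs from the host.
    if little_endian != (sys.byteorder == "little"):
        samples.byteswap()
    return samples.tolist()


def duration_seconds(raw_bytes):
    """Duration of a raw file, derived from its size in bytes."""
    return len(raw_bytes) / BYTES_PER_SAMPLE / SAMPLE_RATE_HZ


def valid_bits(samples):
    """Estimate the number of used bits per 16 bit sample.

    Assumes the padding sits in the low-order bits: the low-order bits
    that are zero in every sample are counted as unused.
    """
    combined = 0
    for sample in samples:
        combined |= sample & 0xFFFF
    if combined == 0:
        return 0
    trailing_zeros = (combined & -combined).bit_length() - 1
    return 16 - trailing_zeros
```

For a file read with `open(path, "rb").read()`, `valid_bits(load_samples(data))` should not exceed 13 if the signal matches the description above.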
Database contents:
Clearly stated purpose of the recordings: ok. (LREG.pdf)
Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) fixed and free text ok
Instruction to speakers in full copy: not given
Linguistic contents of prompted speech:
Specifications of the individual text items: fixed and free text
Specification for the prompt sheet design or specification of the design of the speech prompts: not given
Example prompt sheet or example sound file from the speech prompting: not given
Linguistic contents of non-prompted speech:
Multi-party: (number of speakers, topics discussed, type of setting – formal/informal) n.a.
Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) n.a.
Human-machine dialogues: (domain(s), topic(s), dialogue strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) free text typical for automated call centers, numbers, names, time, date, etc., system driven ok
Speaker information:
Speaker recruitment strategies: not given
Number of speakers: 945
Distribution of speakers over sex, age, dialect regions: sex, age for Training and Development Set in trainSampleList_train.txt and trainSampleList_devel.txt ok
Description/definition of dialect regions: not given
Recording platform and recording conditions:
Recording platform: LREG.pdf, Genesys Voice Platform recording feature ok
Position and type of microphones: n.a.
- Company name and type id: n.a.
- Electret, dynamic, condenser: n.a.
- Directional properties: n.a.
- Mounting: n.a.
Position of speakers: (distance to microphone) n.a.
Bandwidth: (if other than zero to half of sampling rate) not given
Number of channels and channel separation: 1 ok
Acoustical environment: indoor (landline telephone), indoor and outdoor (mobile phone) ok
Plus for telephone recordings:
Recording hardware, telephone link (analog, digital): mobile network or ISDN and PBX, application server hosting the recording application and a VoiceXML telephony server (Genesys Voice Platform, GVP) (LREG.pdf) ok
Network from where the call originated: not given
Type of handset: not given
Annotation (txt):
Unambiguous spelling standard used in annotations: ok
Labeling symbols: TRANSRULES.txt ok
List of non-standard spellings (dialectal variation, names etc.): not given
Distinction of homographs which are not homophones: not given
Character set used in annotations: ISO 8859, not given
Any other language-dependent information such as abbreviations etc.: not given
Annotation manual, guidelines, instructions: TRANSRULES.txt
Description of quality assurance procedures: not given
Selection of annotators: not given
Training of annotators: not given
Annotation tools used: not given
Lexicon:
Format: the resource contains no pronunciation lexicon; since the transcription contains non-explicit numerals (e.g. '... der 16.10....') we cannot provide an automatically created pronunciation dictionary. not given
Text-to-phoneme procedure: n.a.
Explanation or reference to the phoneme set: n.a.
Phonological or higher order phenomena accounted for in the phonemic transcriptions: n.a.
Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.
Word frequency table: not given
Others:
Any other essential language-dependent information or convention: not given
Indication of how many files were double-checked by the producer together with percentage of detected errors: not given
Status documentation: acceptable?: no
The following contains all validation steps with the methodology and results.
Completeness of signal files:
Training and Development Set:
Number of speakers : 770
Number of recordings: 53076
Number of speakers with complete set of recordings (6x18) : 72
empty files : 0
1 session for 30 speakers
2 sessions for 29 speakers
3 sessions for 75 speakers
4 sessions for 135 speakers
5 sessions for 245 speakers
6 sessions for 245 speakers
7 sessions for 11 speakers
8 sessions for 0 speakers
1 recording in 80 sessions
2 recordings in 62 sessions
3 recordings in 56 sessions
4 recordings in 79 sessions
5 recordings in 89 sessions
6 recordings in 86 sessions
7 recordings in 69 sessions
8 recordings in 123 sessions
9 recordings in 67 sessions
10 recordings in 67 sessions
11 recordings in 69 sessions
12 recordings in 95 sessions
13 recordings in 82 sessions
14 recordings in 118 sessions
15 recordings in 89 sessions
16 recordings in 82 sessions
17 recordings in 193 sessions
18 recordings in 2119 sessions
19 recordings in 0 sessions
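The two histograms above can be cross-checked against each other and against the reported totals. The short sketch below recomputes the number of speakers, the number of sessions (from both histograms) and the number of recordings. All figures are copied from this report; the variable names are only illustrative.

```python
# Number of speakers with a given number of sessions (from this report).
speakers_by_sessions = {1: 30, 2: 29, 3: 75, 4: 135, 5: 245, 6: 245, 7: 11}

# Number of sessions with a given number of recordings (from this report).
sessions_by_recordings = {
    1: 80, 2: 62, 3: 56, 4: 79, 5: 89, 6: 86,
    7: 69, 8: 123, 9: 67, 10: 67, 11: 69, 12: 95,
    13: 82, 14: 118, 15: 89, 16: 82, 17: 193, 18: 2119,
}

total_speakers = sum(speakers_by_sessions.values())
sessions_from_speakers = sum(k * n for k, n in speakers_by_sessions.items())
sessions_from_histogram = sum(sessions_by_recordings.values())
total_recordings = sum(k * n for k, n in sessions_by_recordings.items())

print(total_speakers)           # 770
print(sessions_from_speakers)   # 3625
print(sessions_from_histogram)  # 3625
print(total_recordings)         # 53076
```

Both histograms yield the same session count (3625), and the totals agree with the 770 speakers and 53076 recordings stated above.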
Completeness of meta data files: only transcription lists with some speaker metadata ok
Completeness of annotation files: 3 missing entries in transcripts_traindev.txt (53073 entries); 3547/4/a13547s11.raw, 4139/2/a14139s17.raw, 6723/6/a16723s18.raw
Correctness of file names: ok
Empty files: none
Status of signal, annotation and meta data files: signals ok, annotation incomplete, no annotation files, no meta data files
Test Set:
17332 files vs. 12165 entries in transcripts_test.txt, double entries of several files. Those that are not listed in transcripts_test.txt can be deleted!
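A comparison of signal files against transcription entries, as performed for both sets above, can be sketched as follows. This is an illustration under assumptions, not the procedure used for this validation: the layout of the transcription files is not described in the corpus documentation, so the function takes two plain lists of file names (one from the file system, one already extracted from the transcription list). The function name and the keys of the result are hypothetical.

```python
from collections import Counter


def compare_files_and_entries(signal_files, transcript_names):
    """Compare signal file names with the names listed in a transcript file.

    Both arguments are iterables of file names in the same notation,
    e.g. '3547/4/a13547s11.raw'.
    """
    files = set(signal_files)
    counts = Counter(transcript_names)
    listed = set(counts)
    return {
        # Signal files that have no transcription entry.
        "files_without_entry": sorted(files - listed),
        # Transcription entries that point to no existing file.
        "entries_without_file": sorted(listed - files),
        # Names that are listed more than once in the transcript file.
        "duplicate_entries": sorted(n for n, c in counts.items() if c > 1),
    }
```

The list of signal files could be collected with `pathlib.Path.rglob("*.raw")`; the names then have to be brought into the same relative notation as the transcription list before the comparison.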
A random comparison of audio and transcription reveals transcription errors in approx. 10% of the transcriptions.
For example /1084/4/a11084s16.raw transcription “Grell” – spoken “Prell”
/1194/4/a11194s10.raw transcription “97” – spoken “87”
Numbers are not all spelled out in the transcription files.
Encoding ISO 8859 should be changed to UTF-8.
Transcription is one-level, 1 person, no quality control, no inter-labeller agreement.
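Transcriptions that still contain numerals written as digits can be located automatically, which gives a work list for spelling them out. The sketch below assumes the transcriptions have already been read into a mapping from file name to transcription text, because the layout of the transcription files is not described in the corpus documentation. The function name is hypothetical. The check only finds the entries; it does not spell the numbers out.

```python
import re

# Any decimal digit marks a numeral that is not spelled out.
DIGIT_RE = re.compile(r"\d")


def entries_with_digits(transcripts):
    """Return the file names whose transcription contains at least one digit.

    transcripts: dict mapping file name -> transcription text.
    """
    return [name for name, text in transcripts.items() if DIGIT_RE.search(text)]
```

Applied to the two examples quoted above, only the entry transcribed as “97” is reported; the entry transcribed as “Grell” is not.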
No written permission from the speakers to use their recordings for scientific purposes exists. However, the speakers were paid to deliver their recordings. We can therefore infer the speakers' consent to their data being used for the scientific development of speaker classification.
Because of the erroneous transcriptions and the numerals that are not spelled out, the corpus is not suitable for speech recognition. However, it is suitable for its original purpose, namely speaker classification.
The corpus is very inhomogeneous: only a few speakers provide a full set of 6 sessions (72 of 770).
The corpus documentation has to be renewed. Encoding ISO 8859 should be changed to UTF-8.
Numbers should be all spelled out in the transcription files (free text items!).
Transcriptions should be re-labelled (at least partially to measure inter-labeller agreement).
A word list, word statistic and pronunciation dictionary should be added.
Duplicate utterances in the test set should be filtered out.
Standardized metadata files and SpeechDat-conformant MD tables for speaker and recording information should be added to the corpus.
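The recommended change of encoding from ISO 8859 to UTF-8 is a mechanical step. The following sketch shows one way to do it in Python. The report only names "ISO 8859" without the part number, so ISO 8859-1 is an assumption here; if the files were written in another part (for example ISO 8859-15), the `source_encoding` argument has to be changed. The function names are hypothetical.

```python
def reencode_to_utf8(data, source_encoding="iso-8859-1"):
    """Convert bytes from the assumed source encoding to UTF-8 bytes."""
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src_path, dst_path, source_encoding="iso-8859-1"):
    """Read a text file in the source encoding and write it as UTF-8."""
    with open(src_path, "rb") as src:
        converted = reencode_to_utf8(src.read(), source_encoding)
    with open(dst_path, "wb") as dst:
        dst.write(converted)
```

Writing to a separate destination path keeps the original file untouched, so the result can be checked (for example for correctly converted umlauts) before the original is replaced.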
The corpus is usable for its original purpose after the corrections not marked in red in this validation report have been implemented.
To make the corpus usable for other tasks (such as speech recognition), some manual improvements have to be implemented.
The corpus is suitable for publication according to BAS and CLARIN standards.