Authors: Susanne Beinrucker, Florian Schiel
Affiliation: BAS Bayerisches Archiv für Sprachsignale
Postal address: Schellingstr. 3
E-mail: schiel@phonetik.uni-muenchen.de
Telephone: +49-89-2180-2758
Fax: +49-89-2180-5790
Corpus Version: 1.1
Date: 2014-05-09
Status: Finished
Comment: none
Validation Guidelines: Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003, www.bas.uni-muenchen.de/Forschung/BITS/TP2/Cookbook
The speech corpus AsiCa has been validated against general principles of good practice. The validation covered completeness and formal checks of the selected subsamples. The corpus has some good basic approaches, but the annotation and its documentation in particular are in need of improvement.
Because of this and other missing information, the corpus is in moderate to poor condition for scientific and technical usage.
This document summarizes the results of an in-house validation of the speech corpus AsiCa carried out in 2014 by the Institute of Phonetics of the Ludwig-Maximilians-University Munich. The speech corpus was created in 2004/2005 by the Department of Romance Philology of the Ludwig-Maximilians-University Munich.
AsiCa is a collection of 331 speech items in Calabrian and Italian. Speakers were interviewed in three ways: 1) with a prompt sheet of 54 stimuli, 2) in a spontaneous and informal talk with the interviewer, 3) in a talk for further information about the Calabrian language. Not all three types were recorded with each speaker.
The AsiCa speech database consists of 331 audio files recorded with a digital recording microphone in a compressed format and converted to .WAV afterwards.
The general documentation of the AsiCa corpus consists of the following files, which can be found in:
1.) the main directory
README: File describing the content of the corpus
DISK.ID: Directory ID for OS
COPYRIGH.TXT: Copyright text
2.) the directory /doc:
README.DEU: File describing the database and its structure in German
README.ENG: File describing the database and its structure in English
RECORDING_EQUIPMENT.JPG: Picture of the recording equipment
3.) the directory /table:
LOCATIONS.TBL: Location information file
PROMPTS.TBL: List of prompts
SPEAKER.TBL: Speaker information file
SESSION.TBL: Session information file
· Administrative Information:
Validating person: n. a.
Date of validation: n. a.
Contact for requests regarding the corpus: ok
Number and type of media: n.a.
Content of each medium: missing
Copyright statement and intellectual property rights (IPR): ok
· Technical information:
Layout of media: Information about file system type and directory structure: n.a.
File nomenclature: Explanation of used codes (no white space in file names!): <three-figure code of point of origin><[1|2]><[m|w]><[I|D]><[Q|D|I]><one-digit index><.wav | .Textgrid> ok
Formats of signals and annotation files: If non-standard formats are used, it is common to give a full description or to convert into a standard format: raw signal file converted to .mp3 and .wav
Coding: Signed Integer PCM, ok
Compression: yes, Sony Minidisc, de-compressed to .wav
Sampling rate: 44100 Hz, ok
Valid bits per sample: (others than 8, 16, 24, should be reported): 16 bit, ok
Used bytes per sample: 2 bytes/sample, ok
Multiplexed signals: (exact de-multiplexing algorithm; tools) single channel, ok
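The signal parameters listed above (single channel, 2 bytes per sample, 44100 Hz) can be checked automatically against the header of each signal file. The following is a minimal sketch using Python's standard `wave` module; the function names and the `EXPECTED` table are illustrative assumptions and not part of the validation tools actually used for this report.

```python
import wave

# Expected values as listed in the technical information above.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 44100}


def read_wav_format(path):
    """Return the header parameters of a PCM WAV file as a dict."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def format_deviations(path, expected=EXPECTED):
    """Return {parameter: (found, expected)} for every mismatching parameter."""
    found = read_wav_format(path)
    return {
        key: (found[key], want)
        for key, want in expected.items()
        if found[key] != want
    }
```

An empty result from `format_deviations` means that the file header agrees with the documented format; it does not say anything about the quality of the signal itself.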
· Database contents:
Clearly stated purpose of the recordings: ok
Speech type(s): (multi-party conversations, human-human dialogues, read sentences, connected and/or isolated digits, isolated words etc.) 3 types of dialogues: answering prompt sheet, informal interview, informative conversation -> semi-spontaneous speech ok
Instruction to speakers in full copy: no
· Linguistic contents of prompted speech:
Specifications of the individual text items: ok (in PROMPTS.TBL)
Specification for the prompt sheet design or specification of the design of the speech prompts: n.a.
Example prompt sheet or example sound file from the speech prompting: n.a.
· Linguistic contents of non-prompted speech:
Multi-party: (number of speakers, topics discussed, type of setting - formal/informal) ok
Human-human dialogues: (type of dialogues, e.g. problem solving, information seeking, chat etc., relation between speakers, topic(s) discussed, type of setting, scenarios) ok
Human-machine dialogues: (domain(s), topic(s), dialogues strategy followed by the machine, e.g. system driven, mixed initiative, type of system, e.g. test, operational service, Wizard-of-Oz) n.a.
· Speaker information:
Speaker recruitment strategies: ok (detailed information in the README-file)
Number of speakers: ok, 68 speakers
Distribution of speakers over sex, age, dialect regions: ok, but only for 6 regions
Description/definition of dialect regions: ok
· Recording platform and recording conditions:
Recording platform: Sony Minidisc-Player: Sony Digital Megabass MZ-R55, digital recording Sony microphone ECM MS907
Position and type of microphones: on desk, ok
- Company name and type id: Sony microphone ECM MS907
- Electret, dynamic, condenser: condenser
- Directional properties: n.a.
- Mounting: n.a.
Position of speakers: (distance to microphone) no information
Bandwidth: (if other than zero to half of sampling rate) 44100Hz ok
Number of channels and channel separation: 1 channel ok
Acoustical environment: home, ok
· Annotation (Textgrid label file):
Unambiguous spelling standard used in annotations: not given
Labeling symbols: mixed orthographic and phonetic, no definition given, not okay, probably a phonetic transcript using Praat internal phonetic coding
List of non-standard spellings (dialectal variation, names etc.): n.a.
Distinction of homographs which are no homophones: n.a.
Character set used in annotations: not defined, probably ASCII
Any other language dependent information as abbreviations etc: n.a.
Annotation manual, guidelines, instructions: no information, not okay
Description of quality assurance procedures: not given
Selection of annotators: not given
Training of annotators: not given
Annotation tools used: Praat, ok
· Lexicon:
Format: n.a.
Text-to-phoneme procedure: n.a.
Explanation or reference to the phoneme set: n.a.
Phonological or higher order phenomena accounted in the phonemic transcriptions: n.a.
· Statistical information:
Frequency of sub-word units: phonemes (diphones, triphones, syllables,...): n.a.
Word frequency table: n.a.
· Others:
Any other essential language-dependent information or convention: n.a.
Indication of how many files were double-checked by the producer together with percentage of detected errors: not given
Status of documentation: incomplete
The following list contains all validation steps with the methodology and results.
Completeness of signal files: ok (but not all speakers have the same number of recordings)
Completeness of meta data files: ok
Completeness of annotation files: not all signal files were annotated (204 of 331 signal files were not annotated)
Correctness of file names: ok
Empty files: none
Status of signal, annotation and meta data files: n.a.
Cross checks of meta information: ok ?
Cross checks of summary listings: n.a.
Annotation and lexicon contents: no lexicon, n.a.
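The checks of file names and of annotation completeness listed above can be sketched as follows. The pattern follows the documented file nomenclature; that the three-figure code of the point of origin consists of digits only is an assumption, and the function name `check_corpus` and the file names in the usage note are hypothetical.

```python
import re
from pathlib import Path

# Documented scheme:
# <3-figure origin code><1|2><m|w><I|D><Q|D|I><one-digit index>
# Assumption: the "three-figure code" consists of digits only.
NAME_PATTERN = re.compile(r"\d{3}[12][mw][ID][QDI]\d")


def check_corpus(file_names):
    """Return (non-conforming file names, signal files without annotation)."""
    signals, annotations, nonconforming = set(), set(), []
    for name in file_names:
        path = Path(name)
        if not NAME_PATTERN.fullmatch(path.stem):
            nonconforming.append(name)
        suffix = path.suffix.lower()
        if suffix == ".wav":
            signals.add(path.stem)
        elif suffix == ".textgrid":
            annotations.add(path.stem)
    # A signal file counts as annotated if a TextGrid with the same stem exists.
    unannotated = sorted(signals - annotations)
    return nonconforming, unannotated
```

Called with the hypothetical names `0011mIQ1.wav`, `0011mIQ1.Textgrid` and `0022wDD1.wav`, the function reports no non-conforming names and lists `0022wDD1` as unannotated. A name with an additional letter after the index would be reported as non-conforming, because the documented scheme does not define such a suffix.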
Because the annotation in AsiCa has not been properly documented, a manual validation of the annotation, i.e. a manual check of a random sample of annotations against the speech signal, is not possible at present.
The status of copyright was not clear and seemed to be incorrect.
An Italian-language website was given in the README for information.
The DVD structure of the corpus was not given, and the English version of the README file was missing (it was supplied by the validator).
The indices in the file names are not clear: the numbers are self-explanatory, but 'a' is not defined. The automatic validation revealed some further ambiguities: which age range is defined for each of the two age/generation groups? For example, the age 37 can be found in both group 1 and group 2. The naming of the speakers' education is also inconsistent (both Italian and German terms are found).
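The ambiguity of the age/generation groups can be made visible with a simple overlap check on the speaker meta data. The sketch below assumes that each speaker is available as a pair of group and age; how these values are read from SPEAKER.TBL is not shown, since the table layout is not described here, and the function names are illustrative.

```python
from collections import defaultdict


def age_ranges(speakers):
    """Map each group to the (min, max) age observed among its speakers."""
    ages = defaultdict(list)
    for group, age in speakers:
        ages[group].append(age)
    return {group: (min(a), max(a)) for group, a in ages.items()}


def overlapping_groups(ranges):
    """Return pairs of groups whose observed age ranges overlap."""
    groups = sorted(ranges)
    return [
        (g1, g2)
        for i, g1 in enumerate(groups)
        for g2 in groups[i + 1:]
        if ranges[g1][0] <= ranges[g2][1] and ranges[g2][0] <= ranges[g1][1]
    ]
```

If a 37-year-old speaker occurs in both groups, as observed in the automatic validation, `overlapping_groups` returns the pair `(1, 2)`. With well-defined, disjoint age ranges it would return an empty list.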
The data structure of the corpus should be supplied by the creators, and the copyright status has to be clarified and added. The information about the annotation should be completed and given in more detail. The file naming, in particular the indices, should be adjusted. The definition of the age groups and the consistency of the meta data should be clarified (see IV. above).