Validation report for the database "Regional Variants of German - Junior" (RVG-J, Bavaria)

Authors Karl Weilhammer
Affiliation   BAS Bayerisches Archiv für Sprachsignale
Institut für Phonetik
Universität München
Postal address Schellingstr. 3
D 80799 München
Telephone +49-89-2180-2758
Fax +49-89-2800362
Corpus Version 1.0
Date 22.12.2003
Status pre-final
Comment Validation of pre-final corpus
Validation Guidelines Florian Schiel: The Validation of Speech Corpora, Bastard Verlag, 2003

Validation results of the corpus "Regional Variants of German - Junior" (RVG-J, Bavaria):


The speech corpus "Regional Variants of German - Junior" (RVG-J, Bavaria) is in good order. However, some details should be improved to obtain a sound database.

Introduction and Corpus Description

The RVG-J Corpus (Regional Variants of German - Junior) was recorded in 2001 at the Institute of Phonetics and Speech Communication at the University of Munich, Germany.

The corpus contains both read and non-scripted German utterances. It comprises the original RVG prompts (telephone numbers, sentences, commands, digits, etc.) plus spellings, date and time expressions, and free-form responses to questions, e.g. "What are you wearing?", "How did you get here?", etc.

The speakers were adolescents between 13 and 20 years of age, recruited in public schools in Munich and the suburbs. More than 95% of the speakers have German as their native language, and almost all of them attended school in Bavaria. Of the speakers, 89 were male and 93 female. Speakers younger than 18 years were required to provide a waiver signed by their parents stating that they were allowed to participate in the recordings. The corpus can be used for the training of speech recognizers or for analyses of adolescent speech.

This document summarizes the results of an in-house validation of the speech corpus "Regional Variants of German - Junior".

I.) Validation of Documentation

The General Documentation directory contains the following files. They can be found under: doc/...

HANDBOOK.PDF: Explanation of transliteration procedure.
ISO8859-1.PDF: Character set
PROMPTS: Directory with files containing the prompt texts.
SAMPA.TXT: Transcription symbols used in RVG-J.
SAMPSTAT.TXT: Basic signal statistics for every recording file.
SUMMARY.TXT: Automatically generated SpeechDat-conformant summary of recordings.

Status of documentation: acceptable.

II.) Automatic validation

The following list contains all validation steps with the methodology and results.

Completeness of signal, annotation and meta data files: ok

Correctness of file names: ok

Empty files: none

Status of signal, annotation and meta data files: ok

- Annotation files: The signal byte order should be "lohi", but the SAM files specified "HILO". This has been corrected.

Signal Format: ok

Meta information in signal files, meta files and annotation files:  ok

Cross Checks of summary listings: ok
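
Two of the checks listed above, completeness and empty files, can be sketched as follows. This is a minimal illustration in Python, not the tool used for this validation. The function name check_corpus is hypothetical, the file extensions are passed in by the caller because this report does not specify them, and pairing signal and annotation files by their base name is an assumption.

```python
from pathlib import Path


def check_corpus(root, signal_ext, annot_ext):
    """Report empty files and signal/annotation files without a partner.

    Assumption: a signal file and its annotation file share the same
    base name and differ only in their extension.
    """
    files = [p for p in Path(root).rglob("*") if p.is_file()]

    # Empty files: any file with a size of zero bytes.
    empty = sorted(p.name for p in files if p.stat().st_size == 0)

    # Completeness: compare the base names of both file types.
    signals = {p.stem for p in files if p.suffix.lower() == signal_ext}
    annots = {p.stem for p in files if p.suffix.lower() == annot_ext}

    return {
        "empty": empty,
        "missing_annotation": sorted(signals - annots),
        "missing_signal": sorted(annots - signals),
    }
```

A call such as check_corpus("corpus_root", ".wav", ".ann") returns three lists, all of which are empty for a complete corpus. Both the directory name and the ".ann" extension are placeholders.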

III.) Manual Validation

Manual Validation was not carried out.

IV.) Other Relevant Observations

Prompts and annotations contain similar information. It would therefore be consistent to place the PROMPT/ directory on the same directory level as the ANNOT/ directory.

V.) Comments for Improvement

The channels of the recording files should be split into two separate files. The WAV header should be replaced by a NIST header.
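
The channel split could be done along the following lines. This is a minimal sketch using the wave module of the Python standard library, assuming two-channel, uncompressed PCM WAV files. The function name split_stereo and the output paths are hypothetical. The sketch writes WAV files again, so the replacement of the header by a NIST header is not covered.

```python
import wave


def split_stereo(src_path, left_path, right_path):
    """Write each channel of a two-channel WAV file to its own mono file."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a two-channel file")
        width = src.getsampwidth()   # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Samples are interleaved: one frame holds the first channel's
    # sample followed by the second channel's sample.
    frame_size = 2 * width
    left = bytearray()
    right = bytearray()
    for i in range(0, len(frames), frame_size):
        left += frames[i:i + width]
        right += frames[i + width:i + frame_size]

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(width)
            out.setframerate(rate)
            out.writeframes(bytes(data))
```

Copying the sample bytes unchanged keeps the sketch independent of the sample width and leaves the signal data untouched.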

The lexicon should be completed (canonical pronunciation is missing).

The format error on page 5/6 in the HANDBOOK.PDF file should be corrected.

The prompt directory should be moved out of the DOC-directory to the root level.

It might be a good idea to put the general README file into the DOC-directory and provide a short README on the root level that briefly explains the file structure.

The abbreviations BE, BW, BY, NN and NW should be specified in the documentation.

VI.) Result

The corpus is ok, although some details should be fixed before publication.