
Documentation

Validating the documentation is easy for an external validator but hard for an internal one, because the internal validator knows too much about the corpus. If you act as an internal validator, try to 'erase' everything you know about the project and pretend that you have just received the corpus and want to start working with it.

To validate the documentation of the corpus, run through the following check list and summarize all missing items and deviations from the specifications in the validation report. If possible, list the missing items or numbers in the report to simplify the correction process.





Administrative Information

$\bigcirc$ Contact for requests regarding the corpus
$\bigcirc$ Number and type of media
$\bigcirc$ Content of each medium
$\bigcirc$ Copyright statement and intellectual property rights (IPR)
$\bigcirc$ Validation date(s)
$\bigcirc$ Validation person(s)/institution(s)

Technical Information

$\bigcirc$ Layout of media: file system type and directory structure
$\bigcirc$ File nomenclature: explanation of codes used; no white space in file names
$\bigcirc$ Formats of signal and annotation files: if non-standard formats are used, a full description is required, together with tools to convert them into a standard format
$\bigcirc$ Coding: PCM linear, Mu-Law or A-Law; if other codings must be used, they must be fully described
$\bigcirc$ Compression: only widely supported compressions (e.g. zip, gzip) should be used
$\bigcirc$ Sampling rate: rates other than 8000, 11025, 16000, 22050, 32000, 44100 and 48000 Hz should be reported
$\bigcirc$ Valid bits per sample: values other than 8, 16 and 24 should be reported
$\bigcirc$ Used bytes per sample
$\bigcirc$ Multiplexed signals: exact de-multiplexing algorithm; tools
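
Some of the technical items above can be cross-checked mechanically against the documentation. The following is a minimal Python sketch and not part of the original check list: the helper names `check_file_name` and `check_signal_parameters` are hypothetical, the standard values are copied from the items above, and the sampling rate and bit depth are assumed to have been read from the file header or the documentation by other means.

```python
import re

# Values named as standard in the check list; anything else is reported.
STANDARD_SAMPLING_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000}
STANDARD_VALID_BITS = {8, 16, 24}


def check_file_name(name):
    """Return a list of deviations for one file name (hypothetical helper)."""
    deviations = []
    if re.search(r"\s", name):
        deviations.append(f"{name}: file name contains white space")
    return deviations


def check_signal_parameters(name, sampling_rate, valid_bits):
    """Return a list of deviations for one signal file (hypothetical helper)."""
    deviations = []
    if sampling_rate not in STANDARD_SAMPLING_RATES:
        deviations.append(f"{name}: non-standard sampling rate {sampling_rate} Hz")
    if valid_bits not in STANDARD_VALID_BITS:
        deviations.append(f"{name}: non-standard valid bits per sample {valid_bits}")
    return deviations


# Toy input for illustration only.
report = []
report += check_file_name("session 01.wav")
report += check_signal_parameters("a0001.wav", 12000, 16)
for line in report:
    print(line)
```

Returning a list of deviations rather than raising an error makes it easy to collect all findings and copy them into the validation report in one pass.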

Database Contents

$\bigcirc$ Clearly stated purpose of the recordings
$\bigcirc$ Speech type(s): multi-party conversations, human-human dialogues, human-machine dialogues, read sentences, connected and/or isolated digits, isolated words etc.
$\bigcirc$ Instructions to speakers (full copy)

Linguistic Contents of Prompted Speech

$\bigcirc$ Specification of the individual text items
$\bigcirc$ Specification for the prompt sheet design or
$\bigcirc$ Specification of the design of the speech prompts
$\bigcirc$ Example prompt sheet or
$\bigcirc$ Example sound file from the speech prompting

Linguistic Contents of Non-Prompted Speech

$\bigcirc$ Multi-party: number of speakers, topics discussed, type of setting (formal/informal)
$\bigcirc$ Human-human dialogues: type of dialogue (problem solving, information seeking, chat etc.), relation between speakers, topic(s) discussed, type of setting, scenarios
$\bigcirc$ Human-machine dialogues: domain(s), topic(s), dialogue strategy followed by the machine (system driven, mixed initiative), type of system (test, operational service, Wizard-of-Oz)

Speaker Information

$\bigcirc$ Speaker recruitment strategies
$\bigcirc$ Number of speakers
$\bigcirc$ Distribution of speakers over sex, age and dialect region
$\bigcirc$ Description/definition of dialect regions

Recording Platform and Recording Conditions

$\bigcirc$ Recording platform
$\bigcirc$ Position and type of microphone(s)
    $\bigcirc$ Company name and type ID
    $\bigcirc$ Microphone type: electret, dynamic or condenser
    $\bigcirc$ Directional properties
    $\bigcirc$ Mounting
$\bigcirc$ Position of speaker(s) (distance to microphone)
$\bigcirc$ Bandwidth (if other than zero to half of sampling rate)
$\bigcirc$ Number of channels and channel separation
$\bigcirc$ Acoustical environment

... plus for telephone recordings

$\bigcirc$ Recording hardware, telephone link (analog, digital)
$\bigcirc$ Network from where the call originated
$\bigcirc$ Type of handset

... plus for recording in the automobile environment

$\bigcirc$ Recording hardware
$\bigcirc$ Type of vehicle
$\bigcirc$ Average speed of vehicle
$\bigcirc$ Status of windows (open/closed)
$\bigcirc$ Type of pavement
$\bigcirc$ Audio equipment playing during the recording

Annotation (for each of the contained annotations)

$\bigcirc$ Unambiguous spelling standard used in annotations
$\bigcirc$ Labeling symbols
$\bigcirc$ List of non-standard spellings (dialectal variation, names etc.)
$\bigcirc$ Distinction of homographs which are not homophones
$\bigcirc$ Character set used in annotations
$\bigcirc$ Any other language-dependent information (such as abbreviations)
$\bigcirc$ Annotation manual, guidelines, instructions
$\bigcirc$ Description of quality assurance procedures
$\bigcirc$ Selection of annotators
$\bigcirc$ Training of annotators
$\bigcirc$ Annotation tools used

Lexicon

$\bigcirc$ Format
$\bigcirc$ Text-to-phoneme procedure
$\bigcirc$ Explanation or reference to the phoneme set
$\bigcirc$ Phonological or higher-order phenomena accounted for in the phonemic transcriptions

Statistical Information

$\bigcirc$ Frequency of sub-word units: phonemes (diphones, triphones, syllables, ...)
$\bigcirc$ Word frequency table
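
A word frequency table of the kind listed above can be recomputed from the orthographic annotation and compared with the table delivered in the documentation. The following is a minimal sketch, assuming the transcriptions are available as plain strings with whitespace-separated words. The function name `word_frequencies` and the lower-casing are illustrative choices, not requirements of this check list.

```python
from collections import Counter


def word_frequencies(transcriptions):
    """Count whitespace-separated word tokens over an iterable of strings."""
    counts = Counter()
    for text in transcriptions:
        # Lower-casing is an assumption; drop it if case is distinctive
        # in the spelling standard of the corpus.
        counts.update(text.lower().split())
    return counts


# Toy data for illustration only.
table = word_frequencies(["one two three", "Two three", "three"])
for word, count in table.most_common():
    print(f"{word}\t{count}")
```

The same counting approach would apply to sub-word units if the input strings were phonemic transcriptions with one symbol per token.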

Others

$\bigcirc$ Any other essential language-dependent information or convention
$\bigcirc$ Number of files double-checked by the producer, together with the percentage of errors detected


Angela Baumann 2004-06-03