To validate the documentation of the corpus, run through the following steps and summarize all missing items and deviations from the specifications in the validation report. If possible, list the missing items or numbers in the report to simplify the correction process.
Do not accept documentation in proprietary formats such as Word, WordPerfect or StarOffice; documentation should never be delivered in such formats. Note in the protocol that the producer should convert these files into standard formats and re-supply them.
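The check for proprietary formats can be partly automated. The following is a minimal sketch, assuming that the documentation files can be recognized by their file name extension; the extension list and the function name find_proprietary_docs are illustrative assumptions, not part of the specification, and should be adapted to the corpus at hand.

```python
from pathlib import Path

# Assumed, non-exhaustive list of extensions used by proprietary
# word processors (Word, WordPerfect, StarOffice Writer).
PROPRIETARY_EXTENSIONS = {".doc", ".docx", ".wpd", ".sdw"}


def find_proprietary_docs(file_names):
    """Return the file names whose extension marks a proprietary format."""
    return [
        name
        for name in file_names
        if Path(name).suffix.lower() in PROPRIETARY_EXTENSIONS
    ]


if __name__ == "__main__":
    delivered = ["README.txt", "doc/design.DOC", "doc/spec.pdf", "notes.sdw"]
    for name in find_proprietary_docs(delivered):
        print(f"convert and re-supply: {name}")
```

Every file name reported this way goes into the protocol as an item the producer has to convert and re-supply.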
Administrative Information
Contact for requests regarding the corpus
Number and type of media
Content of each medium
Copyright statement and intellectual property rights (IPR)
Validation date(s)
Validation person(s)/institution(s)
Technical Information
Layout of media: file system type and directory structure
File nomenclature: explanation of codes used; no 'white spaces' in file names
Formats of signal and annotation files: if non-standard formats are used, a full description and tools to convert them into a standard format are required
Coding: PCM linear, Mu-Law or A-Law; if other codings must be used, they must be fully described
Compression: only widely supported compressions (e.g. zip, gzip) should be used
Sampling rate: rates other than 8000, 11025, 16000, 22050, 32000, 44100 and 48000 should be reported
Valid bits per sample: values other than 8, 16 and 24 should be reported
Used bytes per sample
Multiplexed signals: exact de-multiplexing algorithm; tools
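Some of the technical items above lend themselves to an automatic check. The sketch below assumes that the signal files are RIFF WAVE files readable by Python's standard wave module; the function names are hypothetical. Note that the WAVE header read here yields the used bytes per sample, so the derived bit count is only an upper bound on the valid bits per sample.

```python
import wave

# Values taken from the checklist above; anything else is reported.
STANDARD_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000}
STANDARD_BITS = {8, 16, 24}


def file_name_is_valid(name):
    """Return True if the file name contains no white space."""
    return not any(ch.isspace() for ch in name)


def signal_deviations(sampling_rate, bits_per_sample):
    """Return a list of deviations to put into the validation report."""
    deviations = []
    if sampling_rate not in STANDARD_RATES:
        deviations.append(f"non-standard sampling rate: {sampling_rate} Hz")
    if bits_per_sample not in STANDARD_BITS:
        deviations.append(f"non-standard bits per sample: {bits_per_sample}")
    return deviations


def wav_file_deviations(path):
    """Read the header of a WAVE file and check rate and sample width."""
    with wave.open(path, "rb") as wav:
        # getsampwidth() returns the used bytes per sample.
        return signal_deviations(wav.getframerate(), wav.getsampwidth() * 8)
```

Files in other formats, for instance headerless raw files, need their parameters taken from the documentation instead, which is exactly why those parameters have to be documented.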
Database Contents
Clearly stated purpose of the recordings
Speech type(s): multi-party conversations, human-human dialogues, human-machine dialogues, read sentences, connected and/or isolated digits, isolated words etc.
Instruction to speakers (full copy)
Linguistic Contents of Prompted Speech
Specification of the individual text items
Specification for the prompt sheet design or
Specification of the design of the speech prompts
Example prompt sheet or
Example sound file from the speech prompting
Linguistic Contents of Non-Prompted Speech
Multi-party: number of speakers, topics discussed, type of setting (formal/informal)
Human-human dialogues: type of dialogue (problem solving, information seeking, chat etc.), relation between speakers, topic(s) discussed, type of setting, scenarios
Human-machine dialogues: domain(s), topic(s), dialogue strategy followed by the machine (system driven, mixed initiative), type of system (test, operational service, Wizard-of-Oz)
Speaker Information
Speaker recruitment strategies
Number of speakers
Distribution of speakers over sex, age, dialect regions
Description/definition of dialect regions
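If the corpus comes with a machine-readable speaker table, the documented distribution of speakers can be cross-checked against it. The following is a minimal sketch, assuming each speaker is available as a dictionary with the hypothetical keys "sex" and "dialect"; the actual field names and codes depend on the corpus.

```python
from collections import Counter


def speaker_distribution(speakers, key):
    """Count the speakers per value of one attribute, e.g. sex or dialect."""
    return Counter(speaker[key] for speaker in speakers)


if __name__ == "__main__":
    # Invented example records, for illustration only.
    speakers = [
        {"id": "0001", "sex": "f", "dialect": "north"},
        {"id": "0002", "sex": "m", "dialect": "north"},
        {"id": "0003", "sex": "f", "dialect": "south"},
    ]
    print(speaker_distribution(speakers, "sex"))
    print(speaker_distribution(speakers, "dialect"))
```

Any difference between these counts and the numbers given in the documentation is a deviation to list in the report.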
Recording platform and recording conditions
Recording platform
Position and type of microphone(s): company name and type id; electret, dynamic or condenser; directional properties; mounting
Position of speaker(s) (distance to microphone)
Bandwidth (if other than zero to half of sampling rate)
Number of channels and channel separation
Acoustical environment
... plus for telephone recordings
Recording hardware, telephone link (analog, digital)
Network from where the call originated
Type of handset
... plus for recording in the automobile environment
Recording hardware
Type of vehicle
Average speed of vehicle
Status of windows (open/closed)
Type of pavement
Audio equipment playing during the recording
Annotation (for each of the contained annotations)
Unambiguous spelling standard used in annotations
Labeling symbols
List of non-standard spellings (dialectal variation, names etc.)
Distinction of homographs which are not homophones
Character set used in annotations
Any other language dependent information (such as abbreviations etc.)
Annotation manual, guidelines, instructions
Description of quality assurance procedures
Selection of annotators
Training of annotators
Annotation tools used
Lexicon
Format
Text-to-phoneme procedure
Explanation or reference to the phoneme set
Phonological or higher order phenomena accounted for in the phonemic transcriptions
Statistical Information
Frequency of sub-word units: phonemes (diphones, triphones, syllables, ...)
Word frequency table
Others
Any other essential language-dependent information or convention
Indication of how many files were double-checked by the producer together with percentage of detected errors
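The statistical items can be spot-checked by recomputing them. The sketch below shows one possible way to build a word frequency table and to compute the double-check percentages; it assumes orthographic transcripts with white-space separated words, and both function names are hypothetical.

```python
from collections import Counter


def word_frequency_table(transcripts):
    """Count white-space separated word tokens, most frequent first."""
    counts = Counter()
    for text in transcripts:
        counts.update(text.split())
    return counts.most_common()


def double_check_summary(total_files, checked_files, files_with_errors):
    """Return (percentage of files checked, percentage of those with errors)."""
    if total_files <= 0 or checked_files <= 0:
        raise ValueError("file counts must be positive")
    checked_pct = 100.0 * checked_files / total_files
    error_pct = 100.0 * files_with_errors / checked_files
    return checked_pct, error_pct


if __name__ == "__main__":
    print(word_frequency_table(["one two", "two three two"]))
    print(double_check_summary(1000, 100, 5))
```

For 1000 files of which 100 were double-checked and 5 found faulty, the summary is 10 percent checked and 5 percent errors among the checked files.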