
Documentation

Validation of the documentation is simple for an external validator and hard for an internal validator, because the latter knows too much about the corpus. If you act as an internal validator, try to 'erase' everything you know about the project and pretend that you have just received the corpus and want to start working with it.

To validate the documentation of the corpus, run through the following steps and summarize all missing items and/or deviations from the specifications in the validation report. If possible, list the missing items or numbers in the report to simplify the correction process.

Check the reference (see chapter 3) for any specifications regarding the documentation.^4.1 If you find any, check whether they have been fulfilled.
Identify all files that belong to the documentation and try to read them on different operating systems (in most cases Windows, Macintosh and Linux will suffice) with standard software (such as a text editor or Acrobat). If you find documentation files in formats other than plain ASCII, HTML or Portable Document Format (PDF), report this as not acceptable. If you find documentation files rendered in HTML, try to read them with three different browsers (e.g. Internet Explorer, Mozilla (Netscape) and Opera or lynx) on varying platforms (Windows, Linux, Macintosh). You will be surprised how many pages do not work, especially frame-based pages.
Don't even consider proprietary formats like Word or WordPerfect or StarOffice etc. Documentation should never be delivered in such formats. Put a note in the protocol that the producer should convert them into standard formats and re-supply them.
Go through the 'survey list' you created in the previous chapter and check whether all items appear in the documentation.
Finally, to check for a minimum standard documentation as expected for a speech corpus, go through the first chapter of [2]. A summary of this list of requirements is given in the following check list.
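
As an aside to the file format step above, a first pass over the documentation files can be automated. The following Python sketch is only an illustration and not part of the original procedure: it sorts the files by extension into those in an accepted format and those that need manual review. The extension set and the function name `classify_doc_files` are assumptions made for this example, and an extension check does not replace actually opening the files on several platforms.

```python
from pathlib import Path

# Assumed extensions for plain text, HTML and PDF; adjust as needed.
ACCEPTED = {".txt", ".html", ".htm", ".pdf"}


def classify_doc_files(doc_dir):
    """Return (accepted, review) lists of file paths found under doc_dir.

    'review' holds every file whose extension is not in ACCEPTED,
    including files without an extension, which need a manual look.
    """
    accepted, review = [], []
    for path in sorted(Path(doc_dir).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() in ACCEPTED:
            accepted.append(path)
        else:
            review.append(path)
    return accepted, review
```

Files that end up in the second list are candidates for the note to the producer; a file without an extension may still be plain ASCII and has to be inspected by hand.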

Administrative Information

$\bigcirc$ Contact for requests regarding the corpus
$\bigcirc$ Number and type of media
$\bigcirc$ Content of each medium
$\bigcirc$ Copyright statement and intellectual property rights (IPR)
$\bigcirc$ Validation date(s)^4.2
$\bigcirc$ Validation person(s)/institution(s)^4.2

Technical Information

$\bigcirc$ Layout of media: file system type and directory structure
$\bigcirc$ File nomenclature: explanation of codes used; no whitespace in file names
$\bigcirc$ Formats of signal and annotation files: if non-standard formats are used, a full description and tools to convert them into a standard format are required
$\bigcirc$ Coding: PCM linear, Mu-Law or A-Law; if other codings must be used, they must be fully described
$\bigcirc$ Compression: only widely supported compressions (e.g. zip, gzip) should be used
$\bigcirc$ Sampling rate: rates other than 8000, 11025, 16000, 22050, 32000, 44100 and 48000 Hz should be reported
$\bigcirc$ Valid bits per sample: values other than 8, 16 and 24 should be reported
$\bigcirc$ Used bytes per sample^4.3
$\bigcirc$ Multiplexed signals: exact de-multiplexing algorithm; tools
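
Some of the technical items above can be cross-checked against the signal files themselves. The sketch below is an illustration only. It assumes that the signals are stored as WAV files readable by Python's standard `wave` module; the names `has_whitespace` and `describe_wav` are hypothetical helpers introduced for this example. It flags file names that contain whitespace and reads the sampling rate, the bytes per sample and the number of channels from the file header so that they can be compared with the documented values.

```python
import wave

# Sampling rates listed in the check list above (in Hz).
STANDARD_RATES = {8000, 11025, 16000, 22050, 32000, 44100, 48000}


def has_whitespace(file_name):
    """Return True if the file name contains any whitespace character."""
    return any(ch.isspace() for ch in file_name)


def describe_wav(path):
    """Read the header of one WAV file and return its basic parameters."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()  # bytes used per sample, not valid bits
        channels = wav.getnchannels()
    return {
        "sampling_rate": rate,
        "rate_is_standard": rate in STANDARD_RATES,
        "bytes_per_sample": width,
        "channels": channels,
    }
```

Note that the `wave` module reads linear PCM only, so Mu-Law or A-Law files raise `wave.Error` and need a different reader. The header also gives only the bytes used per sample; the number of valid bits still has to be taken from the documentation.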

Database Contents

$\bigcirc$ Clearly stated purpose of the recordings
$\bigcirc$ Speech type(s): multi-party conversations, human-human dialogues, human-machine dialogues, read sentences, connected and/or isolated digits, isolated words etc.
$\bigcirc$ Instruction to speakers (full copy)^4.2

Linguistic Contents of Prompted Speech

$\bigcirc$ Specification of the individual text items
$\bigcirc$ Specification for the prompt sheet design or
$\bigcirc$ Specification of the design of the speech prompts
$\bigcirc$ Example prompt sheet or
$\bigcirc$ Example sound file from the speech prompting^4.2

Linguistic Contents of Non-Prompted Speech

$\bigcirc$ Multi-party: number of speakers, topics discussed, type of setting (formal/informal)
$\bigcirc$ Human-human dialogues: type of dialogue (problem solving, information seeking, chat etc.), relation between speakers, topic(s) discussed, type of setting, scenarios
$\bigcirc$ Human-machine dialogues: domain(s), topic(s), dialogue strategy followed by the machine (system driven, mixed initiative), type of system (test, operational service, Wizard-of-Oz^4.2)

Speaker Information

$\bigcirc$ Speaker recruitment strategies
$\bigcirc$ Number of speakers
$\bigcirc$ Distribution of speakers over sex, age, dialect regions
$\bigcirc$ Description/definition of dialect regions

Recording platform and recording conditions

$\bigcirc$ Recording platform
$\bigcirc$ Position and type of microphone(s)

  $\bigcirc$ Company name and type id
  $\bigcirc$ Electret, dynamic, condenser
  $\bigcirc$ Directional properties
  $\bigcirc$ Mounting

$\bigcirc$ Position of speaker(s) (distance to microphone)
$\bigcirc$ Bandwidth (if other than zero to half of sampling rate)
$\bigcirc$ Number of channels and channel separation
$\bigcirc$ Acoustical environment^4.2

... plus for telephone recordings

$\bigcirc$ Recording hardware, telephone link (analog, digital)
$\bigcirc$ Network from where the call originated
$\bigcirc$ Type of handset

... plus for recording in the automobile environment

$\bigcirc$ Recording hardware^4.2
$\bigcirc$ Type of vehicle
$\bigcirc$ Average speed of vehicle
$\bigcirc$ Status of windows (open/closed)
$\bigcirc$ Type of pavement
$\bigcirc$ Audio equipment playing during the recording

Annotation (for each of the contained annotations)

$\bigcirc$ Unambiguous spelling standard used in annotations
$\bigcirc$ Labeling symbols
$\bigcirc$ List of non-standard spellings (dialectal variation, names etc.)
$\bigcirc$ Distinction of homographs which are not homophones
$\bigcirc$ Character set used in annotations
$\bigcirc$ Any other language dependent information (such as abbreviations etc.)
$\bigcirc$ Annotation manual, guidelines, instructions
$\bigcirc$ Description of quality assurance procedures
$\bigcirc$ Selection of annotators
$\bigcirc$ Training of annotators
$\bigcirc$ Annotation tools used

Lexicon

$\bigcirc$ Format
$\bigcirc$ Text-to-phoneme procedure
$\bigcirc$ Explanation or reference to the phoneme set
$\bigcirc$ Phonological or higher order phenomena accounted for in the phonemic transcriptions

Statistical Information

$\bigcirc$ Frequency of sub-word units: phonemes (diphones, triphones, syllables, ...)
$\bigcirc$ Word frequency table
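
A word frequency table of the kind listed above can be derived from the orthographic annotations. The following sketch is a hedged illustration: it assumes that the transcriptions are already available as plain strings with whitespace-separated words and no annotation markers, which real annotation formats rarely satisfy without preprocessing. The name `word_frequencies` is a hypothetical helper.

```python
from collections import Counter


def word_frequencies(transcripts):
    """Count whitespace-separated tokens over an iterable of transcript lines.

    Tokens are counted exactly as written, so spelling variants and
    differences in capitalization are kept apart.
    """
    counts = Counter()
    for line in transcripts:
        counts.update(line.split())
    return counts
```

Calling `word_frequencies(lines).most_common()` returns the words sorted by descending frequency, which can be compared with the table supplied by the producer.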

Others

$\bigcirc$ Any other essential language-dependent information or convention
$\bigcirc$ Indication of how many files were double-checked by the producer together with percentage of detected errors


Angela Baumann 2004-06-03