Structure

As mentioned before, it is a good idea to keep signal data files and annotation data separate. Very often, users will need access only to the symbolic data of your speech corpus. Furthermore, the annotation part is much more likely to be updated than the signal data. Keeping the two apart therefore makes the corpus easier to maintain.

Small corpora typically have the following structure in the root directory of the distribution medium:

DATA : contains all signal files
ANNOT : contains all annotation files
META : contains all meta data files
DOC : contains the documentation
LEX : contains the lexica (if any)
TOOLS : contains software to access signal, annotation and lexicon data
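
The layout above can be set up with a few lines of script. The following is only a minimal sketch in Python: the function name `create_corpus_skeleton` and the dictionary `CORPUS_DIRS` are hypothetical names chosen for illustration and are not part of any corpus tool; only the six directory names are taken from the list above.

```python
from pathlib import Path

# Top-level directories and their purpose, as listed in the text.
# The dictionary and function names are illustrative assumptions.
CORPUS_DIRS = {
    "DATA": "signal files",
    "ANNOT": "annotation files",
    "META": "meta data files",
    "DOC": "documentation",
    "LEX": "lexica (if any)",
    "TOOLS": "software to access signal, annotation and lexicon data",
}


def create_corpus_skeleton(root):
    """Create the top-level corpus directories under root and return their paths."""
    root = Path(root)
    created = []
    for name in CORPUS_DIRS:
        path = root / name
        # exist_ok=True makes the call safe to repeat on an existing corpus root.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_corpus_skeleton("my_corpus")` would create `my_corpus/DATA`, `my_corpus/ANNOT` and so on, leaving any directories that already exist untouched.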

Larger corpora might distribute the DATA part on other media, but the basic structure remains the same.

Within the DATA and ANNOT directories, organize the files so as to avoid very large (approx. ) numbers of directory entries, and try to provide an order that feels natural to the prospective user. Depending on the aims of your speech corpus, the subdirectories may be ordered by:

male / female
recording sessions
speakers
different acoustical environments
languages
dialect classes
speech types (read, non-prompted, ...)
technical setups (telephone, on-site, ...)
in ANNOT: different annotation types
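
Whether a chosen subdirectory scheme really keeps the number of entries per directory small can be checked automatically. The sketch below is one possible way to do this in Python; the function name `find_crowded_dirs` is hypothetical, and the limit `max_entries` is left to the caller as an assumption, since no specific threshold is fixed here.

```python
import os


def find_crowded_dirs(root, max_entries):
    """Return sorted (path, entry_count) pairs for directories holding more than max_entries entries.

    max_entries is an assumed, caller-chosen limit.
    """
    crowded = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Count files and subdirectories together: both are directory entries.
        count = len(dirnames) + len(filenames)
        if count > max_entries:
            crowded.append((dirpath, count))
    return sorted(crowded)
```

Running this on the DATA and ANNOT directories before distribution gives a list of the directories that might be worth splitting further, for example by speaker or recording session.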

BITS Projekt-Account 2004-06-01