
Structure

As mentioned before, it is a good idea to keep signal data files and annotation data separate. Very often users need access only to the symbolic data of your speech corpus. Furthermore, the annotation part is much more likely to be updated than the signal data. Keeping the two apart therefore makes the corpus easier to maintain.

Small corpora typically have the following structure in the root directory of the distribution medium:

Larger corpora might distribute the DATA part on other media but the basic structure remains the same.

Within the DATA and ANNOT directories, organize the files so that no directory holds a very large number (approx. $>256$) of entries, and try to give the prospective user a natural order. Depending on the aims of your speech corpus, this order of subdirectories may be:


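The limit on directory entries can be checked automatically before a corpus is distributed. The following is only a minimal sketch in Python, not part of any corpus tooling: the function name find_crowded_dirs and the constant MAX_ENTRIES are hypothetical, and the threshold of 256 is simply the approximate value given above.

```python
import os

# Approximate limit taken from the text above; adjust to your needs.
MAX_ENTRIES = 256


def find_crowded_dirs(root, max_entries=MAX_ENTRIES):
    """Return a sorted list of (path, entry_count) for every directory
    under root (including root itself) that holds more than
    max_entries files and subdirectories."""
    crowded = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Count both files and subdirectories as directory entries.
        count = len(dirnames) + len(filenames)
        if count > max_entries:
            crowded.append((dirpath, count))
    return sorted(crowded)
```

Calling find_crowded_dirs("DATA") and find_crowded_dirs("ANNOT") from the corpus root would list the directories that should be split into further subdirectories. An empty list means no directory exceeds the limit.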
BITS Projekt-Account 2004-06-01