
File Naming Conventions

The file naming conventions (or nomenclature) define the allowed file and directory names within your speech corpus. A very common approach is to use content-based file names; an alternative is to use generated file names.

Content-based file names are constructed from features of the corpus, e.g. language code, speaker gender, type of speech, etc. Content-based names allow access to specific fragments of the corpus simply by filtering file names. Of course, the information encoded in the file name must be meaningful and easy to extract. One problem with content-based file names is the platform- or medium-dependent restriction on file name length, e.g. 8.3 for ISO 9660 CDs. Another problem is that there is often no natural hierarchical structure in a speech corpus: is it better to organize the recordings by recording location and then by gender, or the other way round?
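The length restriction mentioned above can be checked automatically before a corpus is mastered. The following is a minimal sketch, assuming that "8.3" means a base name of up to 8 characters and an optional extension of up to 3, drawn from letters, digits and the underscore; the function name fits_8_3 is made up for this illustration.

```python
import re

# Assumption: "8.3" = 1-8 characters, optionally followed by a dot and
# 1-3 characters, using only A-Z, 0-9 and "_" (case is ignored here).
_NAME_8_3 = re.compile(
    r"[A-Z0-9_]{1,8}(\.[A-Z0-9_]{1,3})?",
    re.IGNORECASE | re.ASCII,
)

def fits_8_3(file_name: str) -> bool:
    """Return True if file_name fits the assumed 8.3 pattern."""
    return _NAME_8_3.fullmatch(file_name) is not None
```

A 16-character content-based name with an extension would fail this check, which illustrates why long content-based names and restrictive media do not mix well.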

Generated file names are usually created automatically, e.g. as sequence numbers. Generated file names can easily be organized in hierarchies, e.g. BLOCKxx/SESxxyy with xx and yy numbers from 00 to 99. To retrieve fragments of a corpus, a separate document is necessary that lists the contents of each signal file.
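As an illustration, generated names of the form BLOCKxx/SESxxyy can be derived from a running sequence number. This sketch assumes that xx is the block number (repeated in the session name) and yy the session number within the block; the helper names and the tab-separated index format are hypothetical.

```python
def generated_path(index: int) -> str:
    """Map a sequence number 0..9999 to a name of the form BLOCKxx/SESxxyy."""
    if not 0 <= index <= 9999:
        raise ValueError("index must be between 0 and 9999")
    # Assumption: xx = index // 100 (block), yy = index % 100 (session).
    block, session = divmod(index, 100)
    return f"BLOCK{block:02d}/SES{block:02d}{session:02d}"

def index_lines(descriptions):
    """Build the lines of the separate index document: path, tab, content."""
    return [f"{generated_path(i)}\t{text}" for i, text in enumerate(descriptions)]
```

The index document is what makes the corpus searchable, since the names themselves carry no information about the recordings.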

Some operating systems and programming languages are case sensitive, some are not; some apply their own rules for capitalization, others do not. Sometimes case changes when data is copied to another medium, sometimes it does not. The lesson here is: do not define a nomenclature that is case sensitive.
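One way to verify that a nomenclature does not rely on case is to look for names that differ only in capitalization. The following is a small sketch; the function name case_collisions is an assumption made for this example.

```python
from collections import defaultdict

def case_collisions(names):
    """Return groups of names that become identical when case is ignored."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    # Only groups with more than one member are a problem.
    return [group for group in groups.values() if len(group) > 1]
```

An empty result means that the set of names survives a copy to a case-insensitive medium without two files collapsing into one.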

Here is an example from the German Verbmobil II corpus:

Dialog names are coded as follows:

1st character:
      <lang> [g,e,j,m,n] recorded language
      g(erman), e(nglish), j(apanese), m(ultilingual), n(oise)

2nd to 4th character:
      dialog number, e.g. 001

5th character:
      a(main), b(information desk), c(remote maintenance),
      d(VM1), n(noise)

Turn names consist of the dialog name (characters 1-5) and the following:

6th character:
      technical definition of recording
      c(lose), r(oom), t(elephone)

7th character:
      detailed description of recording means (microphone)
      m(obile), p(hone, analog), w(ireless), d(ect)
      h(eadset), n(eckband microphone), c(lip microphone)

8th character:
      channel coding

9th character: '_'

10th - 12th character:
      turn number starting with '000'

13th character: '_'

14th - 16th character:
      <sp_id> speaker ID
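A scheme like this is easy to extract with a regular expression. The sketch below is not part of the Verbmobil tools. It assumes that the channel code is a single letter or digit and that the speaker ID consists of three letters or digits, since the listing does not specify their values. The group name dialog_class is a placeholder for the 5th character, which the listing leaves unnamed, and the example name used afterwards is invented.

```python
import re

TURN_NAME = re.compile(
    r"(?P<lang>[gejmn])"          # 1st character: recorded language
    r"(?P<number>[0-9]{3})"       # 2nd to 4th character: dialog number
    r"(?P<dialog_class>[abcdn])"  # 5th character (placeholder group name)
    r"(?P<recording>[crt])"       # 6th character: close, room, telephone
    r"(?P<microphone>[mpwdhnc])"  # 7th character: recording means
    r"(?P<channel>[0-9a-z])"      # 8th character: channel coding (assumed set)
    r"_(?P<turn>[0-9]{3})"        # 10th to 12th character: turn number
    r"_(?P<speaker>[0-9a-z]{3})"  # 14th to 16th character: speaker ID (assumed set)
)

def parse_turn_name(name: str) -> dict:
    """Split a turn name into its fields; case is ignored on purpose."""
    lowered = name.lower()
    match = TURN_NAME.fullmatch(lowered)
    if match is None:
        raise ValueError(f"not a valid turn name: {name!r}")
    fields = match.groupdict()
    fields["dialog"] = lowered[:5]  # the dialog name is characters 1-5
    return fields
```

For the invented name g001acn1_000_abc, the parser yields the dialog name g001a, turn number 000 and speaker ID abc. Selecting a fragment of the corpus, e.g. all telephone recordings, then reduces to filtering on the recording field.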

The extensions code the contents of the file:
       .nis  NIST file
       .trl  transliteration
       .spr  speaker protocol file
       .rpr  recording session protocol
       .par  symbolic information in "partitur" format

Each recording consists of a set of files like the following:

Type                            Name                    Location

signals                         <turn>.nis              data/<dialog>/
recording session protocol      <dialog>.rpr            data/<dialog>/
speaker protocol file           <lang>_<sp_id>          spr/
transliteration                 <dialog>.trl            trl/
BAS Partitur Files (BPF)        <turn>.par              par/<dialog>/

In the above example, the dialog name is used as a structural element to sort files into groups, while the turn name is the prefix to signal and annotation files.
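The table can be turned into a small helper that lists the expected files for one turn. This is a sketch only: it assumes that <lang> and <sp_id> in the speaker protocol file name are the values encoded in the turn name, and it reproduces that name exactly as given in the table, without adding an extension. The function name recording_files is hypothetical.

```python
def recording_files(turn: str) -> dict:
    """Return the expected relative paths for one turn, following the table."""
    dialog = turn[:5]     # characters 1-5: dialog name
    lang = turn[0]        # 1st character: <lang>
    sp_id = turn[13:16]   # 14th to 16th character: <sp_id>
    return {
        "signals": f"data/{dialog}/{turn}.nis",
        "recording session protocol": f"data/{dialog}/{dialog}.rpr",
        # Name taken literally from the table, which shows no extension.
        "speaker protocol file": f"spr/{lang}_{sp_id}",
        "transliteration": f"trl/{dialog}.trl",
        "BAS Partitur file": f"par/{dialog}/{turn}.par",
    }
```

Such a helper can be used to check that a distribution is complete, by testing whether every listed path exists for every turn.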

BITS Projekt-Account 2004-06-01