Next: Lexical Encoding Up: Pronunciation Dictionary Previous: File Format   Contents

Pronunciation Encoding

As mentioned earlier, a number of coding schemes are available for phonetic or phonological units. Probably the most universal and widely used coding is the SAM Phonetic Alphabet (SAMPA and XSAMPA) as defined by Wells (footnote 9.1). SAMPA and XSAMPA provide phonological codings in 7-bit ASCII for a large number of European and other languages.
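Because SAMPA and XSAMPA are restricted to 7-bit ASCII, a pronunciation field can be checked mechanically for stray non-ASCII symbols (for example IPA characters pasted in by mistake). The following is a minimal sketch in Python. The function name and the example words are illustrative assumptions and not part of any standard tool.

```python
def is_seven_bit_ascii(transcription: str) -> bool:
    """Return True if every character has a code point below 128."""
    return all(ord(ch) < 128 for ch in transcription)


# Illustrative SAMPA transcriptions (assumed examples):
# English "thing" -> "TIN", German "ich" -> "IC"
sampa_examples = {"thing": "TIN", "ich": "IC"}

for word, pron in sampa_examples.items():
    # Both SAMPA strings pass the 7-bit check.
    print(word, pron, is_seven_bit_ascii(pron))

# The same word written with IPA symbols is not 7-bit ASCII.
print(is_seven_bit_ascii("θɪŋ"))
```

Such a check only confirms the character range. It does not verify that a string is a valid symbol sequence for a particular language's SAMPA inventory.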

Apart from the actual coding scheme, you also have to decide on the content. The minimum content as described in section [*] is a simple table containing a consistent orthographic representation and a most likely or canonical pronunciation. Since "canonical pronunciation" is not a well-defined term for most languages, make sure that you come up with a working definition that can actually be applied during the creation of the dictionary. In some cases you may refer to a standard dictionary (footnote 9.2) or, even better, to a standardized set of pronunciation rules (footnote 9.3). If this is not possible, provide a minimal rule set for problematic cases, to be used by your staff while working on the pronunciations. Include this rule set in your documentation of the corpus.
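The minimal table described above can be checked for consistency automatically. The sketch below assumes, purely for illustration, one entry per line with the orthographic form and the canonical pronunciation separated by a tab character. This layout, the function name and the sample entries are assumptions and not the file format prescribed by this document.

```python
def load_minimal_dictionary(lines):
    """Read 'orthography<TAB>pronunciation' lines into a dict.

    Raises ValueError on malformed lines or when one orthographic
    form is given two different canonical pronunciations.
    """
    entries = {}
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2 or not fields[0] or not fields[1]:
            raise ValueError(f"line {number}: expected two non-empty fields")
        orth, pron = fields
        if orth in entries and entries[orth] != pron:
            raise ValueError(
                f"line {number}: conflicting pronunciations for {orth!r}"
            )
        entries[orth] = pron
    return entries


# Assumed sample entries in German SAMPA.
sample = ["Haus\thaUs", "ich\tIC"]
dictionary = load_minimal_dictionary(sample)
print(dictionary)
```

Rejecting conflicting entries enforces the "one canonical pronunciation per orthographic form" reading of the minimum content. A dictionary that deliberately lists pronunciation variants would need a different data structure.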

If you are working on a German speech corpus, you may use the BAS rule set for manual transcription as given in appendix [*].


BITS Projekt-Account 2004-06-01