%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% How to create a g2p mapping table %%%%%%%%%%%
%% for the BAS webservices           %%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Uwe Reichel, BAS

% INFO: Currently used mapping tables are stored in
% /usr/local/repo_pl/data/prondict//_map.txt

%%%% basic format
% two columns separated by a semicolon
% gra;pho
% example: map letter <e> to sound /E/:

e;E

% phonemes must be separated by a blank,
% letters must not be:

x;k s
sch;S

%%%% no comments allowed
% no comment lines are allowed in a mapping table, since any symbol
% could be meaningful in some SAMPA variant

%%%% word boundaries
% marked by '#'
% examples:
% map <d> at end of word to /t/
% map <e> at beginning of word to /e:/

d#;t
#e;e:

%%%% greedy match
% in application, gra-strings are sorted by length,
% and mapping starts with the longest string.
% Given the mapping entries:

e;E
ee;e:

% a letter sequence <ee> will thus be mapped to
% /e:/ and not to /EE/

%%%% warning: robust processing
% if a rule is syntactically wrong (e.g. 'u:U' or 'u U') or
% the rule set lacks a rule for a grapheme (e.g. a punctuation mark),
% the service will *not* issue an ERROR but silently ignore these entries;
% this can lead to missing sounds in the G2P output! Always
% test your mapping with an input that covers all
% graphemes that should be translated to sound!

%%%% abstract mappings
% define sets for more abstract operations:
%    set mySetName gra;pho
% gra;pho here specifies the mapping that applies if gra is addressed
% by its set name mySetName (see below)
% Example (word-final devoicing):

set WORDFINALOBSTR b;p
set WORDFINALOBSTR d;t

% to address all members of set mySetName
% in a specific mapping use $mySetName$:

$WORDFINALOBSTR$#;.

% this means: at word ends (#), each member of WORDFINALOBSTR is
% mapped to the phoneme specified in its 'set ...' line; the '.'
% on the right-hand side stands for that phoneme.
% <b> will thus be mapped to /p/, <d> to /t/, and so forth.
% Another example:
% palatalize Russian consonants if the next letter is <и>

set CONS б;b
set CONS в;v
set CONS г;g
set CONS д;d
...
$CONS$и;.' i

% <би> will thus be mapped to /b'i/, <ви> to /v'i/, and so forth.

%%%% phoneme-phoneme mappings
% after grapheme-phoneme mapping (G2P), a subsequent phoneme-phoneme
% mapping (P2P) can be activated, e.g. to account for assimilations.
% To account for phonological chain processes, P2P is repeated
% until the transcription no longer changes.
% example: map phoneme sequence /s d/ to /z d/

pho sd;z d

% replacing phonemes must be separated by a blank,
% phonemes to be replaced must NOT be blank-separated
% as with G2P, abstract mappings are also supported for P2P

phoset ASSIM s;z
...
pho $ASSIM$d;. d

% transforms /sd/ to /zd/

%%%% differences between G2P and P2P
%% 1. lower- vs uppercase
% For G2P the word input is by default converted to lowercase (for the
% currently supported alphabetic systems). Mappings defined for
% uppercase graphemes are thus ignored unless the user calls g2p.pl
% with the option '-lowercase no'.
% For P2P the strings are not converted to lowercase, since
% upper- and lowercase letters denote different phonemes.
%% 2. treatment of non-matching substrings
% In G2P non-matching word substrings are REMOVED, so that no
% letters remain in the output transcription. The G2P mapping should thus
% be defined for all letter types of the input text.
% In P2P non-matching transcription substrings are KEPT, so that
% the user does not need to provide an exhaustive (often self-) mapping of
% all phonemes.

%%%% P2P only
% If the mapping table contains pho* lines only, the G2P step is skipped
% (implying that the input text is not removed as described above)
% and only P2P is carried out.
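
%%%% illustration: greedy G2P and repeated P2P
% The following Python sketch illustrates the rules described above:
% set expansion, word boundaries, greedy longest match, removal of
% unmatched letters in G2P, and P2P repeated until nothing changes.
% It is NOT the implementation used by g2p.pl. All function names,
% the max_rounds guard, and the choice to match P2P on whole phonemes
% are assumptions made for this illustration only.

```python
import re

def parse_table(text):
    """Return (g2p, p2p): dicts mapping a left-hand string to a phoneme list."""
    rules = {"": {}, "pho": {}}        # plain G2P rules vs. 'pho' rules
    sets = {"set": {}, "phoset": {}}   # set name -> [(member, phoneme), ...]
    for line in text.splitlines():
        left, sep, right = line.strip().partition(";")
        head = left.split()
        if not sep or not head or not right.strip():
            continue                   # malformed lines are silently ignored
        if len(head) == 3 and head[0] in sets:
            sets[head[0]].setdefault(head[1], []).append((head[2], right))
        elif len(head) == 2 and head[0] == "pho":
            rules["pho"][head[1]] = right
        elif len(head) == 1:
            rules[""][head[0]] = right
    expanded = []
    for kind, set_kind in (("", "set"), ("pho", "phoset")):
        table = {}
        for left, right in rules[kind].items():
            ref = re.search(r"\$(\w+)\$", left)
            if ref is None:
                table[left] = right.split()
                continue
            # Assumption: every '.' on the right stands for the member's phoneme.
            for member, pho in sets[set_kind].get(ref.group(1), []):
                key = left.replace(ref.group(0), member)
                table[key] = right.replace(".", pho).split()
        expanded.append(table)
    return expanded[0], expanded[1]

def g2p(word, table, lowercase=True):
    """Greedy longest-match G2P; letters without a rule are dropped."""
    if lowercase:
        word = word.lower()
    symbols = "#" + word + "#"         # mark the word boundaries
    keys = sorted(table, key=len, reverse=True)
    phones, i = [], 0
    while i < len(symbols):
        for key in keys:
            if symbols.startswith(key, i):
                phones.extend(table[key])
                i += len(key)
                break
        else:
            i += 1                     # no rule matches: remove the symbol
    return phones

def p2p(phones, table, max_rounds=100):
    """Repeat P2P until the transcription is stable; unmatched phonemes stay."""
    for _ in range(max_rounds):        # guard against cyclic rule sets
        out, i = [], 0
        while i < len(phones):
            for j in range(len(phones), i, -1):   # longest span first
                key = "".join(phones[i:j])
                if key in table:
                    out.extend(table[key])
                    i = j
                    break
            else:
                out.append(phones[i])  # no rule matches: keep the phoneme
                i += 1
        if out == phones:
            break
        phones = out
    return phones

TABLE = """\
a;a
b;b
d;d
e;E
ee;e:
s;s
x;k s
set WORDFINALOBSTR b;p
set WORDFINALOBSTR d;t
$WORDFINALOBSTR$#;.
pho sd;z d
"""

G2P_RULES, P2P_RULES = parse_table(TABLE)
print(g2p("bad", G2P_RULES))                  # ['b', 'a', 't']
print(g2p("see", G2P_RULES))                  # ['s', 'e:']
print(g2p("Box", G2P_RULES))                  # ['b', 'k', 's']
print(p2p(["a", "s", "d", "a"], P2P_RULES))   # ['a', 'z', 'd', 'a']
```

% In the third call the letter <o> has no rule and disappears from the
% output, which is the behaviour the warning above cautions about.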

%%%% Symbol inventory constraints
% Since the application of g2p.pl contains at least a basic text normalization
% (punctuation identification, tokenization), punctuation marks cannot be
% used on the left-hand side of the mapping table.
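
%%%% illustration: checking a table before use
% Because the service ignores faulty rules without an error message,
% it can help to check a table locally first. The Python sketch below
% flags lines that are not of the form gra;pho and G2P lines with
% punctuation on the left-hand side. It is a hypothetical helper, not
% part of g2p.pl. Which characters count as punctuation is an
% assumption (Unicode categories starting with 'P', except '#'), and
% so are skipping blank lines and exempting pho/phoset lines.

```python
import re
import unicodedata

def lint_table(table_text):
    """Return (line_number, message) pairs for suspicious table lines."""
    findings = []
    for number, line in enumerate(table_text.splitlines(), start=1):
        if not line.strip():
            continue                   # assumption: blank lines are harmless
        left, sep, right = line.partition(";")
        fields = left.split()
        if not sep or not fields or not right.strip():
            findings.append((number, "not of the form gra;pho"))
            continue
        if fields[0] in ("pho", "phoset") and len(fields) > 1:
            continue                   # phoneme symbols may contain punctuation
        # Ignore $setName$ references, then inspect the grapheme string.
        gra = re.sub(r"\$\w+\$", "", fields[-1])
        marks = [c for c in gra
                 if unicodedata.category(c).startswith("P") and c != "#"]
        if marks:
            findings.append(
                (number, "punctuation on left-hand side: " + " ".join(marks)))
    return findings

sample = "e;E\nu:U\nu U\nd#;t\n,;k\nset S b;p\n$S$#;.\npho sd;z d\n"
for number, message in lint_table(sample):
    print(number, message)
```

% For the sample table this reports lines 2 and 3 (wrong separator)
% and line 5 (comma on the left-hand side).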