%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% How to create a g2p mapping table %%%%%%%%%%%
%% for the BAS webservices           %%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Uwe Reichel, BAS

% INFO: Currently used mapping tables are stored in 
% /usr/local/repo_pl/data/prondict/<lng>/<lng>_map.txt

%%%% basic format
% two columns separated by semicolon
%    gra;pho
% example: map letter <e> to sound /E/:

e;E

% phonemes must be separated by a blank;
% letters must not be separated:

x;k s
sch;S
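
% The two-column format can be read with a few lines of code. The
% following is a minimal Python sketch, not the parser used by g2p.pl;
% the function name parse_entry is hypothetical and only serves
% illustration.

```python
def parse_entry(line):
    """Split one 'gra;pho' line into (grapheme string, list of phonemes).

    Returns None for a line that does not fit the two-column format.
    """
    # Everything before the first semicolon is the grapheme string.
    gra, sep, pho = line.rstrip("\n").partition(";")
    if not sep or not gra or not pho.split():
        return None
    # Phonemes are blank-separated; letters are not.
    return gra, pho.split()


print(parse_entry("x;k s"))   # ('x', ['k', 's'])
print(parse_entry("sch;S"))   # ('sch', ['S'])
```

% Returning None for a malformed line is a design choice of this
% sketch only.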

%%%% no comments allowed
% a mapping table must not contain comment lines, since any symbol
% (including the '%' used in this document) could be meaningful
% in some SAMPA variant

%%%% word boundaries
% marked by '#'
% examples:
% map <d> at end of word to /t/
% map <e> at beginning of word to /e:/

d#;t
#e;e:

%%%% greedy match
% in application, gra-strings will be sorted by length,
% and mapping starts with the longest string.
% Given the mapping entries:

e;E
ee;e:

% a letter sequence <ee> will thus be mapped to
% /e:/ and not to /E E/
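
% The longest-match principle can be illustrated as follows. This is a
% minimal Python sketch, not the g2p.pl implementation: it assumes a
% left-to-right scan over the word (the actual matching order within a
% word is not specified here), and the names apply_greedy and table
% are hypothetical.

```python
def apply_greedy(word, table):
    """Map a word to phonemes, trying the longest grapheme string first."""
    # Sort grapheme strings by length, longest first.
    keys = sorted(table, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for gra in keys:
            if word.startswith(gra, i):
                out.extend(table[gra])
                i += len(gra)
                break
        else:
            # No entry matches at this position: the letter is dropped.
            i += 1
    return " ".join(out)


table = {"e": ["E"], "ee": ["e:"]}
print(apply_greedy("ee", table))  # e:
```

% Because <ee> is tried before <e>, the shorter entry only applies
% where the longer one does not match.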

%%%% warning:  robust processing
% if a rule is syntactically wrong (e.g. 'u:U' or 'u U') or
% the rule set lacks a rule for a grapheme (e.g. a punctuation mark), the
% service will *not* issue an ERROR but silently skip the faulty rule or
% the unmapped grapheme; this can lead to missing sounds in the G2P
% output! Always test your mapping with an input that covers all
% graphemes that should be translated to sounds!
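
% Such a coverage test can be prepared offline. The following minimal
% Python sketch lists the letters of a test word that no grapheme
% string of the table would match. It is an approximation only: it
% assumes left-to-right longest-match scanning and ignores '#' and set
% entries; the name unmatched_letters is hypothetical.

```python
def unmatched_letters(word, graphemes):
    """Return the letters of word that no grapheme string matches."""
    keys = sorted(graphemes, key=len, reverse=True)
    missing, i = [], 0
    while i < len(word):
        # First (i.e. longest) grapheme string matching at position i.
        hit = next((g for g in keys if word.startswith(g, i)), None)
        if hit is None:
            missing.append(word[i])
            i += 1
        else:
            i += len(hit)
    return missing


print(unmatched_letters("schexy", ["sch", "e", "x"]))  # ['y']
```

% A non-empty result points to letters that would be missing from the
% transcription.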

%%%% abstract mappings
% define sets for more abstract operations:
%    set mySetName gra;pho
% gra;pho here specifies the mapping if gra is addressed
% by its set name mySetName (see below)
% Example (word-final devoicing): 

set WORDFINALOBSTR b;p
set WORDFINALOBSTR d;t

% to address all members of set mySetName 
% for a specific mapping use $mySetName$:

$WORDFINALOBSTR$#;.

% this means: at word ends (#), every member of WORDFINALOBSTR is
% mapped to the phoneme given in its 'set ...' line; the '.' on the
% right-hand side stands for that phoneme.
% <b#> will thus be mapped to /p/, <d#> to /t/, and so forth.

% Another example:
% palatalize Russian consonants, if next letter is <и>

set CONS б;b
set CONS в;v
set CONS г;g
set CONS д;d
...
$CONS$и;.' i

% <би> will thus be mapped to /b'i/, <ви> to /v'i/, and so forth.
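
% One way to picture abstract mappings is as a shorthand that expands
% into one concrete rule per set member. The following minimal Python
% sketch does this expansion. Whether g2p.pl expands sets in this way
% is an assumption, as is the requirement that 'set' lines precede
% their use; the name expand_sets is hypothetical.

```python
def expand_sets(lines):
    """Expand '$NAME$' rules into one concrete 'gra;pho' rule per set member."""
    sets, rules = {}, {}
    for line in lines:
        if line.startswith("set "):
            # 'set NAME gra;pho' adds one member to set NAME.
            _, name, entry = line.split(" ", 2)
            gra, pho = entry.split(";", 1)
            sets.setdefault(name, []).append((gra, pho))
            continue
        lhs, rhs = line.split(";", 1)
        ref = next((n for n in sets if f"${n}$" in lhs), None)
        if ref is None:
            rules[lhs] = rhs
            continue
        for gra, pho in sets[ref]:
            # '.' on the right-hand side stands for the member's phoneme.
            rules[lhs.replace(f"${ref}$", gra)] = rhs.replace(".", pho, 1)
    return rules


table = ["set CONS б;b", "set CONS в;v", "$CONS$и;.' i"]
print(expand_sets(table))  # {'би': "b' i", 'ви': "v' i"}
```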

%%%% phoneme-phoneme mappings

% after grapheme-phoneme mapping, a subsequent phoneme-phoneme (P2P)
% mapping can be activated, e.g. to account for assimilations.
% To account for phonological chain processes, P2P is repeated
% until the transcription no longer changes.
% example: map phoneme sequence /s d/ to /z d/

pho sd;z d

% replacing phonemes must be separated by a blank,
% phonemes to be replaced must NOT be blank-separated

% as with G2P, abstract mappings are also supported for P2P

phoset ASSIM s;z
...
pho $ASSIM$d;. d 

% transforms /s d/ to /z d/
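
% The repetition of P2P until the transcription is stable can be
% sketched as a fixed-point loop. This is a minimal Python sketch, not
% the g2p.pl implementation. Assumptions: the transcription is held as
% a list of phonemes, a left-hand side such as 'sd' is matched against
% the concatenation of adjacent phonemes, and max_rounds is a safety
% limit added here against rule cycles. The names p2p_once and p2p and
% the second rule are hypothetical.

```python
def p2p_once(phones, rules):
    """Apply the P2P rules once, left to right, longest left-hand side first."""
    keys = sorted(rules, key=len, reverse=True)
    out, i = [], 0
    while i < len(phones):
        for lhs in keys:
            # Concatenate adjacent phonemes until the length of lhs is reached.
            joined, j = "", i
            while j < len(phones) and len(joined) < len(lhs):
                joined += phones[j]
                j += 1
            if joined == lhs:
                out.extend(rules[lhs])
                i = j
                break
        else:
            # Non-matching phonemes are kept unchanged.
            out.append(phones[i])
            i += 1
    return out


def p2p(phones, rules, max_rounds=50):
    """Repeat p2p_once until the transcription no longer changes."""
    for _ in range(max_rounds):
        new = p2p_once(phones, rules)
        if new == phones:
            break
        phones = new
    return phones


rules = {"sd": ["z", "d"], "tz": ["d", "z"]}
print(p2p(["t", "s", "d"], rules))  # ['d', 'z', 'd']
```

% In the usage example, /s d/ first becomes /z d/, which in the next
% round feeds the rule for /t z/: a chain process.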

%%%% differences between G2P and P2P

%% 1. lower- vs uppercase
% For G2P, the input words are by default converted to lowercase (for the
% currently supported alphabetic systems). Mappings defined for
% uppercase graphemes are thus ignored unless the user calls g2p.pl
% with the option '-lowercase no'.
% For P2P, strings are not converted to lowercase, since
% upper- and lowercase letters denote different phonemes.

%% 2. treatment of non-matching substrings
% In G2P, non-matching word substrings are REMOVED, so that no
% letters remain in the output transcription. The G2P mapping should thus
% be defined for all letter types of the input text.
% In P2P, non-matching transcription substrings are KEPT, so that
% the user does not need to provide an exhaustive mapping (often of
% phonemes onto themselves) for all phonemes.

%%%% P2P only
% If the mapping table contains pho* lines only, the G2P step is skipped
% and only P2P is carried out (so the input is not removed
% as described above).

%%%% Symbol inventory constraints

% Since g2p.pl always applies at least a basic text normalization
% (punctuation identification, tokenization), punctuation marks cannot be
% used on the left-hand side of a mapping entry.