BAVARIAN ARCHIVE FOR SPEECH SIGNALS
University of Munich, Institute of Phonetics
Schellingstr. 3/II, 80799 Munich, Germany
bas@phonetik.uni-muenchen.de

COPYRIGHT University of Munich 1995. All rights reserved.
This corpus and software may not be disseminated further - not even
partly - without written permission of the copyright holders.

Additional Copyright Holders
Deutsches Forschungszentrum fuer kuenstliche Intelligenz,
Saarbruecken, Germany (DFKI).
Universitaet Leipzig, Leipzig, Germany (UL).

----------------------------------------------------------------------

PHONOLEX 4.0          01.09.96 / 19.09.13          F. Schiel

This is a crude excerpt from the WWW documentation. Please refer to
http://www.phonetik.uni-muenchen.de/Bas/BasPHONOLEXeng.html
for an up-to-date description of the project.

Contents

README          : this file
phonolex        : main PHONOLEX list
phonolex_core   : three-column list of manually verified entries
phonolex_list   : three-column list of all entries
phonolex_xml    : XML version of phonolex
phonolex.dtd    : DTD for phonolex_xml
DocGerman.html  : copy of the German Web documentation
DocEnglish.html : copy of the English Web documentation
trans           : rules for proper German transcription
                  (copy of the Web documentation; start with index.html)
SourceTable.pdf : table of different source characteristics

Overview

PHONOLEX is the result of an internal BAS project with contributions
from the DFKI Saarbruecken, Computational Linguistics Lab, the
University of Leipzig and the University of Bonn.
It comprises a simple list of word forms (inflected words) with the
following entries:

* Orthography
  Features:
  - ASCII or UTF-8; German 'Umlauts' can be in LaTeX format
  - Capitalized nouns
  - Mostly old German spelling rules (depending on source; modern
    sources are: alc, sw_ger)
  - Only single words - no phrases
  - Spelled letters marked with a preceding '$' (optional) or written
    as single capital letters
  - Non-spelled letters marked with '/./' (optional)
  - Parts of words, if derived from empirically based corpora

* Linguistic and other information (optional)
  - word class (CL)
  - origin of entry (OR) (= corpus abbreviation)
  - grammatical gender of entry (GE)
  - method of phonological transcription, text-to-phoneme method (TP)
  - word ID in corpus (ID, only if contained in source lexicon)
  - occurrence count in corpus (CT, only if contained in source lexicon)

* Pronunciation
  Features:
  - Extended German SAM-PA as used in Verbmobil, PhonDat, SmartKom,
    SmartWeb, BITS etc.
  - Citation form including word accent ('), secondary word accent ("),
    morph boundary (#) (optional)
  - Zero or more empirically detected pronunciation variants with
    corpus, count and detection method.

The citation form is generated either by TTP (P-TRA), by BALLOON or by
manual transcription. The program P-TRA was provided by the University
of Bonn, Dr. Stock. P-TRA was ported to UNIX at BAS and modified for
use within the PHONOLEX project. BALLOON was provided by Uwe Reichel,
BAS.

PHONOLEX is a collective resource. That means that we did not rank
entries with an identical orthographic key stemming from different
sources. Instead, every entry carries an 'origin tag' (OR:) that names
the sources of that particular entry. The reason for this is that
pronunciation, even the so-called citation form, is disputable for
many German words. Users of pronunciation lists are advised to apply a
filter that checks whether a pronunciation occurs more than once in
this list and, if so, takes the entry from the most reliable origin.
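
The recommended filter can be sketched in a few lines of Python. This
is only an illustration, not the filter shipped with the corpus. It
assumes that the entries have already been read into dictionaries with
the hypothetical keys 'orth', 'pron', 'tp' and 'origin', that entries
are grouped by orthography, and that reliability is judged by the TP
tag in the order manu_veri, manu, ptra. The sample entries are made up.

```python
# Assumed reliability order of TP values, most reliable first.
# Unknown TP values are ranked after all known ones.
TP_RANK = {"manu_veri": 0, "manu": 1, "ptra": 2}


def pick_most_reliable(entries):
    """Return a dict orthography -> entry, keeping the best-ranked entry.

    `entries` is an iterable of dicts with the hypothetical keys
    'orth', 'pron', 'tp' and 'origin'. On ties the first entry wins.
    """
    best = {}
    for entry in entries:
        rank = TP_RANK.get(entry["tp"], len(TP_RANK))
        current = best.get(entry["orth"])
        if current is None or rank < current[0]:
            best[entry["orth"]] = (rank, entry)
    return {orth: entry for orth, (_rank, entry) in best.items()}


# Made-up sample entries, for illustration only.
sample_entries = [
    {"orth": "Haus", "pron": "haUs", "tp": "ptra", "origin": "lg"},
    {"orth": "Haus", "pron": "h'aUs", "tp": "manu_veri", "origin": "vm"},
    {"orth": "Maus", "pron": "maUs", "tp": "manu", "origin": "pd1"},
]

selected = pick_most_reliable(sample_entries)
for orth, entry in sorted(selected.items()):
    print(orth, entry["pron"], entry["tp"], entry["origin"])
```

Ranking by origin (OR) instead of TP works the same way once a
reliability order for the origins has been decided on.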

You may also decide by checking the TP: tag (mode of creation).
Manually created or checked pronunciations (manu) tend to be more
reliable than automatically generated pronunciations (ptra/basttp).
The 'best' or most consistent entries are marked with TP:manu_veri;
these are manually verified entries that were produced according to
the BAS Guideline for German Canonical Pronunciation (see
trans/index.html in this directory for a copy of the BAS guidelines).
The file 'scripts/filter' on this volume contains a simple example of
how to extract reliable entries.

However, to be consistent across the various lexica we apply some
adaptations to the original corpus lexica before integrating them into
PHONOLEX:

- The orthography is adapted to LaTeX notation; some corpus lexica use
  ISO 8859-1 to encode German Umlauts; to conform with other sources
  we map these to LaTeX
- The usage of /?/ for the glottal stop is not allowed in PHONOLEX;
  therefore we map this symbol to /Q/

Structure

PHONOLEX is currently built as a simple ASCII file (file phonolex).
The entries are sorted in ASCII order by orthography ('NL' = new line).
(See also XML section below for an XML formatted version)

file -> item 'NL' [ item 'NL' ] ...
item -> orthography 'NL'
        linguistic 'NL'
        pronunciation_list 'NL'
        '*'

orthography -> German orthography with LaTeX Umlauts

linguistic -> TAB-separated list of keys:string
    CL:adj|nom|prop|verb|baseform|det|adv|prep|par|num
                             word class
    GE:f|m                   word genus
    OR:                      origin
    CT:                      corpus frequency of word
    ID:                      word ID in corpus
    TP:ptra|manu|manu_veri   transcription method
        ptra      : automatic rule-based TTP P-TRA
        manu      : manual transcription, unverified
        manu_veri : manual transcription verified against the
                    standard method described in
                    www.phonetik.uni-muenchen.de/forschung/Bas/BasGermanPronunciation

pronunciation_list -> canonic_word_form 'NL'
                      [ list_of_alternate_pronunciation ]

list_of_alternate_pronunciation ->
        pronunciation count corpus transcription-method 'NL'
        [ pronunciation count corpus transcription-method 'NL' ] ...

canonic_word_form -> string of extended German SAM-PA
pronunciation     -> string of extended German SAM-PA

Pronunciation in extended German SAM-PA

See the file doc/trans/index.html or
www.bas.uni-muenchen.de/forschung/Bas/BasSAMPA
for a description of the SAM-PA symbols used and
www.phonetik.uni-muenchen.de/forschung/Bas/BasGermanPronunciation/index.html
for the standard transcription conventions (TP:manu_veri).

Some additional remarks:
- denotes a 'phoneme' to model articulatory noise that is not speech
  e.g. non-understandable items
- denotes a 'phoneme' for background noise

The package contains three other forms of PHONOLEX:
- phonolex_list : simple three-column list with orthography,
                  pronunciation, source
- phonolex_core : ditto, but only verified entries (TP:manu_veri)
- phonolex_xml  : XML formatted version

Known Bugs

Plenty (and hopefully decreasing). See our German WWW documentation
for that.

Coverage

The PHONOLEX list covers most of the German inflected word forms,
because the UL part is derived from print media, while the SB part
guarantees to cover all inflected forms of the most prominent German
words.
Aside from that, the list is guaranteed to cover a variety of German
spoken language corpora as follows:

    Verbmobil German I
    Verbmobil German II
    PhonDat 1 (PD1)
    PhonDat 2 (PD2)
    SI100
    SI1000
    SI1000P
    SmartKom
    RVG1_read
    RVG1_trl
    Speechdat (SPEECHDAT(M), FIXED1DE, MOBIL1DE, VEHIC1DE, VERIF1DE,
               ORIENTEL)
    HEMPEL
    RVG-J
    ZIPTEL
    ALC
    SmartWeb

(see BAS Web documentation for details about these corpora:
www.bas.uni-muenchen.de/Bas)

For specific remarks and features of the different PHONOLEX sources
please refer to the table in doc/SourceTable.pdf

History

Dec 95 : Working group DFKI - BAS founded
Aug 96 : Version 1.0 : First Edition - 665,893 entries
Dec 96 : Version 1.1 : Improved P-TRA, exception lists, 666,237 entries
Dec 96 : Version 1.2 : Improved usage of glottal stops, geminates in
         pronunciation deleted
Jan 97 : Version 1.3 : Improved rule set, benchmark from 62 to 67 %
Feb 97 : Version 1.4 : Bug removed: in some contexts a superfluous
         /S/ was appended to words.
Jun 98 : University of Leipzig joins Working group
Sep 98 : Extended word list to 1,600,000
Nov 98 : Version 2.0 : Changed format of info line to 'Key:Text',
         inserted ORIGIN marker,
         improved rule set for P-TRA (benchmark to 80%).
Mar 99 : Version 2.1 : Bug caused some items of origin 'lg' not to be
         marked with text-to-phoneme method 'ptra' ('TP:ptra')
         All items from origin 'lg' had an empty class tag
         Improved canonical pronunciation for items with morph
         boundaries (benchmark to 90%)
May 99 : Version 2.2 : Improved rule sets for the pronunciation
         (benchmark: with morph boundaries: 93%,
         without morph boundaries: 83%)
Jun 99 : Version 2.3 : Added new class of noun baseforms ('baseform')
         that are NOT compounds of German
Jul 99 : Version 2.4 : Extended empirical pronunciation from VM corpus
Aug 99 : Version 2.5 : 48 entries contained an 8-bit char in the
         pronunciation denoting /O~/. Fixed.
Jul 01 : Version 2.6 : New empirically analysed pronunciations added
         from the Verbmobil projects.
         This analysis covers all German volumes of VM I and VM II
         (with multilingual data!).
         New sources PhonDat1, Phondat2, SI100, SI1000, RVG1 added
         (OR:pd1, OR:pd2, OR:si100, OR:si1000)
Jan 03 : Version 2.7 : New source SmartKom German (OR:sk_ger) added.
Jan 03 : Version 2.8 : New sources RVG1_read and RVG1_trl added
         (OR:rvg1_read, OR:rvg1_trl).
Jan 03 : Version 2.9 : New sources speechdat (OR:fixed1de,
         OR:mobil1de, OR:vehic1de, OR:verif1de, OR:orientel) added.
Feb 03 : Version 2.10 : New corrected version of Verbmobil (OR:vm)
         source, SourceTable.pdf added with specific description and
         features of sources.
Apr 03 : Version 3.0 : Added a rule set for proper transcription in
         German SAM-PA. The following resources were re-transcribed to
         meet the requirements of the new standard:
         Verbmobil I + II (OR:vm)
         SmartKom (OR:sk_ger)
Jul 03 : Version 3.1 : Added filter that prevents /R/ (instead of /r/)
         in the rule-based pronunciation output.
         Rebuilt phonolex.
Aug 03 : Version 3.2 : Added 'TP:manu_veri' descriptor, which denotes a
         manually verified canonical pronunciation according to the
         'Transcription Conventions for Canonical German' as published
         on the BAS Web site.
         Re-calculated transcription of the German VM corpus and
         updated the empirical word forms in phonolex accordingly.
         Rebuilt phonolex.
Sep 03 : Version 3.3 : Extended the makefile for the generation of
         phonolex_core, a list of all phonolex entries that have been
         manually checked for accuracy and tagged with "manu_veri"
Jan 04 : Version 3.4 : Fixed a bug in the creation of phonolex_core.
         The bug caused the first column to have multiple identical
         entries with different pronunciations
Jan 04 : Version 3.5 : Fixed some bugs in sk_ger.lex, RVG1_trl
         R-substitution did not work because of a faulty script for
         RVG1 lexica.
Feb 04 : Version 3.6 : Updated documentation; mapped orthography of
         SpeechDat lexica to LaTeX
         Added hempel
Feb 04 : Version 3.7 : Mapped glottal stops /?/ in sd1 lexica to /Q/
         Added rvg-j; phonolex_core now at 22086 entries
Mar 04 : Version 3.8 : Re-calculated OR:vm entries after bug fix in
         volume 4.1 signals.
Apr 04 : Version 3.9 : Bugfix in source RVG-J: this bug caused about
         100 entries from RVG-J to be misaligned. Fixed.
May 04 : Version 3.10: Approx. 120 typos fixed, mainly in source hempel
         Changed /R/ to /r/ in all speechdat sources
Jun 04 : Version 3.11: Re-calculated MAUS segmentations of VM corpora;
         included new empirical word forms (OR:vm)
         Fixed approx. 230 typos from several sources
Oct 04 : Version 3.12: Fixed /R/ -> /r/ in HEMPEL source
         Fixed errors in RVG1 source
Dec 04 : Version 3.13: Added speechdat_m section
         Added third column to phonolex_core output, derived from
         key OR:...
         Added phonolex_list output with a simple three-column list
         (as phonolex_core) with all phonolex entries, where each
         orthographic entry comes only once (the first one, if there
         are multiple of equal quality)
Feb 05 : Version 3.14: Replaced original PD1 lexicon by BAS standard
         list (PD1_bas.lex)
Apr 05 : Version 3.15: Replaced FIXED1DE by a manually verified version
         Replaced MOBIL1DE by a manually verified version
         Replaced VEHIC1DE by a manually verified version
May 05 : Version 3.16: Replaced VERIF1DE by a manually verified version
         Replaced ORIENTEL by a manually verified version
         Added ZIPTEL, manually verified
Jun 05 : Version 3.17: Replaced RVG1_TRL by a manually verified version
         Replaced RVG1_READ by a manually verified version
Sep 05 : Version 3.18: Replaced SI100 by a manually verified version
         Replaced SI1000 by a manually verified version
         Multiple pronunciation error fixes in the following source
         lexica: HEMPEL, ORIENTEL, PD1, RVG-J, SmartKom (sk_ger),
         FIXED1DE, MOBIL1DE, VEHIC1DE, VERIF1DE, ZIPTEL
         User notification on 28th of Sept 2005
Sep 05 : Version 3.19: Multiple
         pronunciation error fixes in the following source lexica:
         PD2, vm_ger (German Verbmobil)
         New calculation of pronunciation variants of vm_ger
Oct 05 : Version 3.20: Added XML version of PHONOLEX: phonolex_xml
         see DTD phonolex.dtd for details
Nov 08 : Version 3.21: Fixed bugs in speechdat.fixed1de, sk_ger.lex,
         vm_ger.lex: a single-phoneme pronunciation is not treated as
         a spelling
         Added SmartWeb (SW), manually verified
Apr 11 : Version 3.22: Fixed 98 errors in lg portion of phonolex
Jul 11 : Version 3.23: Changed phonolex_list so that it contains ALL
         entries of phonolex, not just unique (and arbitrarily chosen)
         orthographic entries.
         Added alc section (alc), manually verified
Aug 11 : Version 3.24: Multiple pronunciation error fixes in FIXED1DE,
         ZIPTEL, SMARTKOM, ALC.
Sep 13 : Version 4.0: Re-coding of orthographic string. Until now the
         coding of the orthographic string depended on the coding of
         the source. This led to mixed-coding files. From version 4.x
         on, the coding must be either LaTeX or UTF-8, resulting in
         true UTF-8 coded files: ziptel, hempel, rvg-j are recoded
         on-the-fly from ISO 8859 to UTF-8 (sources still ISO 8859!)
         Bug fix: corrupt entry at the beginning of
         phonolex_core/list: 'OR:si100...'
         Content fix: several wrong pronunciations fixed in sources

Still to do:
    Insert BITS, SC1, SC10, SC2, SPINA, TAXI
    Text-to-phoneme method: switch from PTRA to BALLOON

Contacts:
    Florian Schiel
    info@bas-services.de

Availability:

A copy of the current PHONOLEX version may be obtained at BAS. The
purchase of a user licence is necessary. The user licence entitles the
holder to use the PHONOLEX list for scientific, commercial and
educational purposes. Furthermore, the owner of a user licence will
receive upgrades to higher versions of PHONOLEX free of charge.
The user licence does not entitle the user to redistribute the list to
third parties in any form, neither in part nor modified or extended.
Furthermore, the user agrees to report any errors found in the list to
the BAS.
This way we hope to achieve improvements in the future.
All rights remain with DFKI, UL and BAS.
By purchasing the user licence the user accepts the above conditions.

Please send orders to the following address:
bas@bas-services.de
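
Example

As an illustration of the item grammar given under Structure, the main
list could be read roughly as follows. This is a minimal sketch, not
software that is part of the corpus. It assumes that every item is
closed by a line holding only '*', that the fields of an alternate
pronunciation line are separated by whitespace, and that SAM-PA strings
contain no blanks. The function names and the sample item are made up.

```python
def parse_item(block):
    """Turn the lines of one item (without the closing '*') into a dict."""
    tags = {}
    for field in block[1].split("\t"):      # linguistic line: TAB-separated
        key, sep, value = field.partition(":")
        if sep:                             # skip fields without a colon
            tags[key.strip()] = value.strip()
    variants = []
    for line in block[3:]:                  # optional alternate pronunciations
        parts = line.split()
        if len(parts) == 4:                 # pronunciation count corpus method
            variants.append(
                dict(zip(("pron", "count", "corpus", "method"), parts))
            )
    return {
        "orth": block[0],
        "tags": tags,
        "canonical": block[2],
        "variants": variants,
    }


def parse_phonolex(lines):
    """Yield one dict per item from an iterable of text lines."""
    block = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "*":             # assumed item terminator
            while block and not block[-1].strip():
                block.pop()                 # drop trailing blank lines
            if len(block) >= 3:             # orthography, tags, canonical form
                yield parse_item(block)
            block = []
        elif line.strip() or block:         # ignore blank lines between items
            block.append(line)


# Made-up sample item, for illustration only.
sample_lines = [
    "Haus\n",
    "CL:nom\tOR:vm\tTP:manu_veri\n",
    "h'aUs\n",
    "haUs 12 vm maus\n",
    "*\n",
]

for item in parse_phonolex(sample_lines):
    print(item["orth"], item["tags"].get("TP"), item["canonical"])
```

For the real file the lines would come from an opened file object
instead of the sample list; the encoding to use depends on the
PHONOLEX version at hand.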