BAVARIAN ARCHIVE FOR SPEECH SIGNALS
University of Munich, Institute of Phonetics
Schellingstr. 3/II, 80799 Munich, Germany
bas@phonetik.uni-muenchen.de

BITS UNIT SELECTION CORPUS
DVD-ROM Database

Copyright (C) 2005 by Bavarian Archive for Speech Signals
University of Munich, Germany

Version 1.7
Short Name: BAS BITS-US Corpus
Date: 2006/06/08
Modification Date: 2006/08/25

The BITS synthesis corpus consists of two parts: a set of logatome recordings for controlled diphone synthesis and a set of sentence recordings for unit selection techniques. BITS stands for "BAS Infrastructures for Technical Speech Processing" and was funded by the German Ministry of Education and Science during 2003-2005.

This README file describes the BITS Unit Selection Corpus, which consists of 6732 recordings stored on 4 DVDs. Each DVD contains the recordings, the annotation files and the meta data files of one of the four professional speakers, as well as the documentation of the entire corpus.

Note that all documentation files are coded in Unicode UTF-8 unless stated otherwise.

Table of contents
-----------------
1.)   Introduction
2.)   Speakers
2.1.) Recruitment of Speakers
2.2.) Speakers Profile
3.)   Recording
4.)   Recording Procedure
5.)   Annotation Files
5.1.) Phonemic Annotation
5.2.) Prosodic Annotation
6.)   File Nomenclature
7.)   Structure of each DVD
8.)   Other Documentation / Meta Data Files
9.)   Contact
10.)  History

Introduction
------------
The BITS Unit Selection Corpus consists of a set of sentence recordings for concatenative 'unit selection' speech synthesis. Speech synthesis using concatenative techniques has matured to a point where standard procedures are being implemented in a variety of products.
However, because of the considerable costs involved, most small and medium-sized companies as well as university labs cannot afford to produce the required speech resources on their own. Although some public domain German diphone voices are available for research purposes (e.g. MBROLA), there is a clear lack of publicly available synthesis resources. The BITS synthesis corpus, recorded and produced by BAS, fills this gap. The work was funded by the German Ministry of Education and Science (grant no. 01 IV B01).

Speakers
--------

Recruitment of Speakers:
------------------------
45 speakers were invited to a casting. They were asked to read 90 logatomes containing a subset of our diphone set, so that three target sentences covering nearly all German phonemes could be synthesised. Based on a ranking by naturalness and pleasantness, 10 speakers were selected as nominees. After an overall evaluation by specialists in speech synthesis and by the BITS group, the best four speakers (two male and two female) were chosen for the final recordings.

More information about the recruitment of the speakers can be found in:
/DOC/HTML/Specification_Unit_Selection_corpus.pdf

Speakers Profile:
-----------------
Four professional speakers aged between 40 and 45 were recorded. All speakers were of German nationality and had at least some competence in English as a foreign language.

More information about the speakers can be found in the table /DOC/SRPK.TBL, which lists the following information about each speaker.
The list has the following columns (separated by tabs):

  ID          : speaker id (SES200[1-4])
  Sex         : M = male, W = female
  Age         : age of the speaker at the time of the recordings
  Name        : full name
  Nationality : nationality of the speaker
  Size        : body height in cm
  Weight      : weight in kg
  ACC         : accent of the speaker, determined by the federal state in which the speaker started school
  Edu         : education of the speaker
  PoL         : current place of residence
  Prof        : current occupation
  FL          : foreign languages (ENG - English, FR - French, I - Italian, EL - Greek)
  Smk         : smoker (y = yes, n = no, cas = casually)

Recording:
----------
The speech signal was recorded in three channels (0 : headset microphone, 1 : laryngograph signal, 2 : room microphone). The sampling rate is 48 kHz with 16 bit quantization. All signals were recorded via a Yamaha 02R digital sound mixer directly to hard disk using the multi-channel recording software SpeechRecorder (www.phonetik.uni-muenchen.de/Bas/software/speechrecorder/).

- Channel 0 : close-talk microphone (Beyerdynamic NEM 192) positioned 7 cm to the right of the mid-sagittal plane at the height of the upper lip.
- Channel 1 : laryngograph signal (LaryngoGraph PCLX)
- Channel 2 : large membrane condenser microphone (Neumann Type TLM 103) 60 cm from the mouth.

The channels were separated into standard WAV format files; no further processing was performed, to avoid any undesired degradation of the signals.

Recording Procedure:
--------------------
The speaker was seated in an insulated room with low reverberation. The positions of the chair and the room microphone were marked on the floor. Before the recordings, the speaker was asked to put on the headset microphone and the laryngograph electrodes. During the session, the speech prompts were displayed in a window of the program SpeechRecorder. Three supervisors monitored the recording, and a prompt was repeated until all three supervisors agreed on its quality.
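The recording format stated above (48 kHz, 16 bit, channels separated into WAV files) can be checked for a given file with Python's standard `wave` module. This is a minimal sketch, assuming the files are uncompressed PCM WAV files that `wave` can open and, given the channel suffix in the file names, that each file holds exactly one channel; the helper name `check_bits_wav` is hypothetical and not part of the corpus distribution.

```python
import wave

# Format stated in the "Recording" section of this README.
EXPECTED_RATE = 48000   # sampling rate in Hz
EXPECTED_SAMPWIDTH = 2  # 16 bit quantization = 2 bytes per sample
EXPECTED_CHANNELS = 1   # assumption: one recording channel (0, 1 or 2) per WAV file


def check_bits_wav(path_or_file):
    """Check a WAV header against the documented format (hypothetical helper).

    Accepts a file name or a binary file-like object and returns a tuple
    (ok, duration_in_seconds). Only the header is inspected; the signal
    itself is not validated.
    """
    with wave.open(path_or_file, "rb") as w:
        ok = (
            w.getframerate() == EXPECTED_RATE
            and w.getsampwidth() == EXPECTED_SAMPWIDTH
            and w.getnchannels() == EXPECTED_CHANNELS
        )
        duration = w.getnframes() / w.getframerate()
    return ok, duration
```

At 48 kHz with 2 bytes per sample, one channel accounts for 96,000 bytes of sample data per second of speech, so the duration returned is simply the number of frames divided by 48,000.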
More information about the recording procedure can be found in:
/DOC/HTML/Preparation_and_Execution_of_Recordings_8_2.pdf

Annotation Files
----------------
The results of the manual annotation are stored in the directory ANNOT/SES#### (#### = speaker number) in three different file formats:

- BAS Partitur Format (BPF, *.par) with the following tiers:
    ORT : orthographic representation of the prompted sentence
    KAN : canonical pronunciation (SAM-PA) of the sentence
    SAP : segmentation of phonemes in augmented German SAM-PA (see details below)
    PRM : prosodic labelling in GToBI 'light' (see below)
  Note that the label for lip smack (§) is replaced by '\S' in the BPF files (see DOC/README.PAR for more details).
- Annotation Graphs (XML, *.ags)
  Basically the same information in XML form. See ag.dtd and metadata.dtd in directory DOC as well as http://agtk.sourceforge.net/ for details.
- TextGrid (Praat, *.TextGrid)
  The original segmentation results in Praat format. The two tiers contain the phonemic and the prosodic labelling.

Phonemic Annotation:
--------------------
For the phonemic annotation, all sentences were first segmented automatically into German SAM-PA with MAUS (more about the SAM-PA encoding used for the annotation in /DOC/HTML/Conventions_for_segmentation_8_5e.pdf). In a second pass, a group of ten to twelve trained phoneticians manually corrected the pre-segmented sentences. In a third pass, three phoneticians whose labelling was consistent with each other corrected the segmentations again. In a final step, all segmentations were reviewed by the team supervisor.

The phonemic annotations are stored both in the original Praat TextGrid files and in the SAP tier of the BAS Partitur Format (BPF) files.

The following annotation rules were applied:
- boundary placement is primarily based on auditory judgement.
- segment boundaries are always placed at positive zero-crossings of the oscillogram.
- boundary placement is checked against the sonagram and the oscillogram.
- in transitions where both adjacent phonemes can be heard, the boundary is placed in the middle of the transition (50% rule).
- voiced (periodic) elements start with the first clearly identifiable glottal pulse.
- the boundaries of low-intensity segments (e.g. /h/, aspiration) are set where the signal can be clearly distinguished from the background noise. Breathing noises, if clearly recognisable, are cut off from the friction or aspiration.

Special labels in addition to standard German SAM-PA:
  Q : glottal stop (SAM-PA: ?)
  ~ : preceding vowel is nasalized
  § : preceding phoneme contains an audible lip smack (replaced by '\S' in the BPF annotation files)
  q : preceding vowel is glottalized

More information about the annotation of the corpus can be found in:
/DOC/HTML/Conventions_for_segmentation_8_5e.pdf

Prosodic Annotation:
--------------------
For the prosodic annotation of the BITS-US corpus, a reduced subset of the GToBI tag set was used (GToBI Light). The motivation was the experience of colleagues at IMS Stuttgart regarding the reproducibility of the GToBI tag set across different human labellers. The reduced set includes the following tags:

Boundaries:
  -     intermediate phrase boundary (ip)
  %     general phrase boundary or intonation phrase boundary (IP)
  H%    high boundary tone (IP)
  -?    uncertain: ip present?
  %?    uncertain: - or %?
  H%?   uncertain: % or H%?

Accents:
  H*L   fall
  L*H   rise
  H*    high target on accented syllable
  ..L   low trail tone
  *L    low target on accented syllable
  ..H   high trail tone
  *?    uncertain: accentuation?
  x?
        uncertain about label x

More information about the label types, placement rules, the annotation procedure and certain problem types can be found in:
/DOC/HTML/Prosodic_annotation_8_4e.pdf

File Nomenclature:
------------------
The names of both audio and annotation files have the following structure:

  US####%%%%_$

with
  #### : speaker id 1001 - 1004 (corresponds to the ids 2001 to 2004 in the BITS logatome corpus)
  %%%% : sentence id (see table /DOC/BITS-US.TBL)
  $    : channel 0 - 2

File name extensions:
  .TextGrid : Praat label file with interval tiers
  .wav      : audio file
  .ags      : Annotation Graph file (XML)
  .txt      : text file
  .html     : HTML file
  .par      : BAS Partitur Format file

Structure of each DVD
---------------------
Each DVD contains the following:
  README : this file
  DATA/  : the recordings of one of the four speakers
  ANNOT/ : the annotation files of the same speaker: Praat annotation files (*.TextGrid), BPF files (*.par), Annotation Graph files (*.ags)
  DOC/   : documentation files

The DOC/ directory contains the following:
  README.PAR     : brief documentation of the BAS Partitur Format (BPF)
  SPRK.TBL       : speaker profiles (see above)
  KNOWN-ERRORS   : list of known errors that cannot be fixed
  BITS-US.TBL    : sentence corpus list (table)
  BITS-US.XML    : sentence corpus list (XML)
  README.BITS-US : documentation of the sentence lists
  LEXICON.TBL    : pronunciation dictionary (see below)
  MAPPING.TBL    : mapping of orthographic strings between prompted and 'spelled out' orthography (see below)
  PHONEME.TBL    : list of phonemic SAM-PA symbols (see below)
  DOC.HTML       : start page of the main documentation
  HTML/          : main documentation files
  PUBLICATIONS/  : publications

Other Documentation / Meta Data Files
-------------------------------------
BITS-US.TBL - sentence corpus list

This file contains a 4-column table describing the recorded sentences of the corpus; the columns are separated by ';':
  column 1 : sentence id 0001-1683
  column 2 : SAM-PA transcript of the sentence; words are separated by a blank; ' denotes the lexical accent
  column 3 : prompt text as displayed to the speakers
  column 4 : orthographic form of the sentence; in contrast to column 3, no punctuation is used, and numbers and abbreviations are spelled out.
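The file name pattern US####%%%%_$ and the column layout of BITS-US.TBL together allow looking up which sentence a given recording contains. The Python sketch below is illustrative only: all helper names are hypothetical, and it assumes that (a) only the prompt text in column 3 can itself contain a ';', and (b) annotation files follow the same US####%%%%_$ pattern as the audio files, as stated under File Nomenclature.

```python
import re
from dataclasses import dataclass
from typing import List

# File name pattern from "File Nomenclature": US####%%%%_$
# (speaker id, sentence id, channel), optionally followed by an extension.
FILE_NAME_RE = re.compile(r"^US(\d{4})(\d{4})_([0-2])(?:\.[A-Za-z]+)?$")


@dataclass
class Sentence:
    sid: str                # column 1: sentence id 0001-1683
    sampa_words: List[str]  # column 2: SAM-PA transcript, split at blanks
    prompt: str             # column 3: prompt text shown to the speaker
    ortho: str              # column 4: spelled-out orthography


def parse_file_name(name):
    """Return (speaker_id, sentence_id, channel), or None if no match."""
    m = FILE_NAME_RE.match(name)
    if m is None:
        return None
    return m.group(1), m.group(2), int(m.group(3))


def parse_tbl_line(line):
    """Parse one line of BITS-US.TBL into a Sentence.

    Raises ValueError for lines with fewer than four columns.
    """
    line = line.rstrip("\r\n")
    # Assumption: columns 1 and 2 contain no ';', so split them off from the left ...
    sid, sampa, rest = line.split(";", 2)
    # ... and column 4 has no punctuation, so split it off from the right;
    # a ';' inside the prompt text (column 3) therefore stays intact.
    prompt, ortho = rest.rsplit(";", 1)
    return Sentence(sid.strip(), sampa.split(), prompt, ortho.strip())


def load_tbl(path):
    """Map sentence id -> Sentence for a UTF-8 coded BITS-US.TBL."""
    with open(path, encoding="utf-8") as f:
        sentences = (parse_tbl_line(line) for line in f if line.strip())
        return {s.sid: s for s in sentences}
```

Splitting the first two columns from the left and the last column from the right is a deliberate choice: a plain `line.split(";")` would break on a prompt that contains a semicolon, whereas this variant keeps such a prompt intact. A lookup then reads `load_tbl("DOC/BITS-US.TBL")[sentence_id]`, using the sentence id returned by `parse_file_name`.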
The file MAPPING.TBL contains a list of mappings between the prompt text and the spelled-out orthographic form.

BITS-US.XML - sentence corpus list

This file contains the sentence corpus together with additional information such as part-of-speech tags and prosodic hypothesis labels. See also README.BITS-US.

LEXICON.TBL - pronunciation dictionary

This file contains a two-column list of orthographic words and their canonical pronunciations coded in extended German SAM-PA. Please note:
- the orthographic string is coded as in the prompt text displayed to the speakers; to obtain a spelled-out version of the orthographic string, use the mapping table in MAPPING.TBL.
- the SAM-PA coding follows the extended German SAM-PA coding described in /DOC/HTML/Conventions_for_segmentation_8_5e.pdf. Phonemes are separated by '|'; lexical accents in non-function words are coded by a preceding '.

PHONEME.TBL - list of phonemic SAM-PA symbols

Contact
-------
For questions, remarks, bug reports etc. please contact:
  Florian Schiel
  schiel@bas-services.de
  +49-89-2180-5751

History
-------
15.03.06 : Version 1.0
27.04.06 : Version 1.1 : documentation re-worked
                         BAS Partitur Format files added
03.05.06 : Version 1.3 : changed all plain text files to UTF-8 coding
                         documentation re-worked
                         meta files removed
08.06.06 : Version 1.4 : bug fix in LEXICON.TBL:
                           /pfuI/ -> /pfUI/
                           /be:'Qa:t@/ -> /be:Q'a:t@/
                         added phoneme list DOC/PHONEME.TBL
                         changed the canonical SAM-PA transcripts in
                           - LEXICON.TBL
                           - KAN tier of BPF files (*.par)
                           - BITS-US.TBL
                           - BITS-US.XML
                         so that the phonemic units are separated by blanks,
                         e.g. /bo:tSaft6/ -> /b o: t S a f t 6/
12.07.06 : Version 1.5 : inserted phoneme separator '|' in
                           - BPF files (tier KAN)
                           - LEXICON.TBL
                           - BITS-US.TBL
                           - BITS-US.XML
09.08.06 : Version 1.6 : bug fix: the BPF files contained the 8-bit character '§'
                         denoting the lip smack. This did not conform to BPF and
                         was replaced by the corresponding LaTeX code '\S'.
25.08.06 : Version 1.7 : bug fix in AGS file creation; all *.ags files regenerated