BAVARIAN ARCHIVE FOR SPEECH SIGNALS
University of Munich, Institute of Phonetics
Schellingstr. 3/II, 80799 Munich, Germany
bas@bas.uni-muenchen.de

COPYRIGHT University of Munich 2001. All rights reserved.
This corpus and software may not be disseminated further, not even
in part, without written permission of the copyright holders.

Additional Copyright Holders
Deutsches Forschungszentrum fuer kuenstliche Intelligenz (DFKI),
Saarbruecken, Germany

----------------------------------------------------------------------
TAXI - multilingual telephone dialog database
27.06.01 / 23.05.13 / 24.03.17   Version 2.5
----------------------------------------------------------------------

This is the documentation for the TAXI dialog database created in
June 2001 in collaboration with the DFKI, Saarbruecken.

TAXI contains 86 recorded dialogues between a cab dispatcher and a
client, recorded over public phone lines (network and GSM). The
dispatcher always spoke German, while the clients always spoke
English. Total recorded speech: 74 min.

------------------- Contents of this file ------------------------

CD directory structure
Recording situation
Naming conventions
Signal file formats
Nist-Header field definitions for TAXI
Transcription and error markers
Translation
Annotation format
BAS Partitur Files
emuDB
Word lists and dictionaries
Known errors
History

----------------- CD directory structure --------------------------

Dialogues are situated in the "data" directory. The directory names
of the dialogues consist of the letters 'SES' followed by a four
digit number representing the order of recordings, for example
'SES0037'. The corpus contains 86 recordings.
Each recording contains a pre-dialogue part, turns by the dispatcher
and the client, and a hang-up part that runs from the last turn of
the speakers until both parties have hung up.
Note that the existence of a recording does not imply that the
recording went without any errors. Refer to the section
'Transcription and error markers' for details on how single
erroneous turns were handled.
Caution: Some dialogues do not contain a single valid turn
(see section 'Transcription and error markers').

The directory "doc" contains additional documents about the corpus
recording and annotation as well as pronunciation dictionaries for
the German spoken part. If not stated otherwise, text files are
coded either in 7-bit ASCII or UTF-8:

WORDS_DE.TXT       : Word list, German part
WORDS_EN.TXT       : Word list, English part
PRON_DE.LEX        : Pronunciation dictionary, extended German SAM-PA,
                     German part (spoken sentences AND translations)
SAMPA.txt          : Table of extended German SAMPA
PRON_EN.LEX        : Pronunciation dictionary, English SAM-PA,
                     English part (spoken sentences AND translations)
TRANS.TXT          : Results of validation, transcription and
                     translation (see section 'Annotation format')
TRANSCRP.PDF       : Description of rules and conventions of
                     SpeechDat transcription (German)
TRANSCRP_EN.PDF    : Description of rules and conventions of
                     SpeechDat transcription (English)
BasFormatseng.html : Description of the BAS Partitur Format
                     (see www.bas.uni-muenchen.de/Bas/BasFormatseng.html
                     for an updated version of this document)

The directory "softw" contains some tools that are distributed with
the BAS corpora. They might be useful to test or modify data stored
on this volume.

The directory "par" contains BAS Partitur Format files for each
recording (not for the 'PD' and 'HP' items!)

----------------- Recording Situation --------------------------

All recordings consist of a dialogue between one of two German
speaking cab dispatchers in the German city of Hannover and varying
clients calling for a ride who always speak English.
As a third party, a recording server listens to both channels and
records according to a system of DTMF tones. Both parties had to
follow certain conventions during the dialogue: to prevent overlap
and to allow automatic segmentation by the recording server, each
party had to press a button on his phone to signal to the other
party that his turn was over.

The overall structure of each dialog is as follows:

- Both parties call in and prepare for the recording. This part is
  recorded in turn '00' and marked with 'PD' ('Pre-Dialogue').
- The dispatcher presses button '0' to signal the start of the
  recording.
- The dispatcher speaks the first turn as if he had just answered
  the call and finishes his turn by pressing button '1'. His turns
  are marked with 'DP' ('Dispatcher').
- The client answers and finishes his turn by pressing button '2'.
  His turns are marked with 'CL' ('Clients').
- Both parties continue their dialog; there is no convention as to
  who has the last turn.
- The part between the last button press ('1' or '2', at the end of
  the dialogue) and the hang-up of both parties is recorded into the
  last turn and marked with 'HP' ('Hang-Up').

DTMF tones are not included in the recordings.
Individual speakers or dispatchers are not marked.

----------------------- Naming conventions ----------------------

Dialog names are coded as follows:

  'SES' + four digit recording number, e.g. 'SES0037'

Turn names consist of the four digit recording number followed by a
two character marker for the speaker or dialog situation followed by
a two digit turn number:

  '0037' + 'DP' + '01'
     |       |      |
     |       |      ------> Turn number starting with '00'
     |       ------------> Speaker marker :

       PD : 'Pre-Dialogue' - part of the recording before the actual
            dialog starts
       HP : 'Hang-Up' - part of the recording after the last valid
            turn until the hang-up of both parties
            (can be discarded in most cases!)
       DP : 'Dispatcher' - German speaking dispatcher; two speakers
            throughout the corpus
       CL : 'Clients' - English speaking clients calling the
            dispatcher and asking for a fare

The extension codes the format of the file:

  .al   NIST file, ALAW coding
  .pcm  NIST file, PCM coding
  .par  BAS Partitur File (BPF)

------------------------- Signal file formats ----------------------

a. Physical signal characteristics

Signal files contain the digital signal as provided by the phone
company (8 bit, 8 kHz, ALAW coding) as well as an expansion of ALAW
to PCM (16 bit, 8 kHz, signed, Intel byte order).
The recording channels were not separated; that is, if an overlap
occurs in a dialogue, the signals of both speakers are superimposed
in the signal file.

b. Logical signal characteristics

Each signal file contains one turn of a dialogue session of one
speaker.

----------------- Nist-Header field definitions for TAXI --------------

The signal files begin with a header following the NIST conventions.
It has a minimum size of 1024 bytes and consists of ASCII characters.
The format is as follows:

key                 type    description         (possible) value(s)
------------------------------------------------------------------------
database_id         string  database            TAXI
database_version    string  version             1.0
scenario_language   string  recorded language   [german|english|japanese|
                                                multi_english_japanese|
                                                multi_german_japanese|
                                                multi_english_german|
                                                multi_german|
                                                multi_japanese|
                                                multi_english|noise]
recording_site      string  site                BAS
recording_medium    string  rec. medium         telephone
sample_coding       string  coding              alaw
sample_n_bytes      int     bytes/sample        1
channel_count       int     # of channels       1
sample_count        int     # of samples
sample_byte_format  string  little/big endian,  1
                            one byte
sample_rate         int     samp. freq          8000
sample_sig_bits     int     number of valid
                            bits within a word
scenario_date       string  logical date of     YYMMDD, 980101
                            recording

The remaining bytes are filled with spaces.
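
The header fields listed above can be read with a few lines of code.
The following Python sketch is not part of the tools in "softw"; it
assumes the usual NIST SPHERE layout (first line 'NIST_1A', second
line the header size in bytes, then one 'key -type value' line per
field up to 'end_head'). The name read_nist_header and the dummy
sample bytes are illustrative only; the field lines are taken from
the example header of this documentation.

```python
import io
import re

# One header line: key, type flag (-s string, -i integer, -r real), rest.
FIELD = re.compile(r"(\S+)\s+-([sir])(.*)")

def read_nist_header(stream):
    """Read a NIST SPHERE header from a binary stream and return a dict."""
    if stream.readline().strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(stream.readline().decode("ascii").strip())
    # Re-read the whole header so the stream is left at the first sample.
    stream.seek(0)
    text = stream.read(header_size).decode("ascii")
    fields = {}
    for line in text.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        match = FIELD.fullmatch(line)
        if match is None:
            continue  # padding or a line this sketch does not understand
        key, kind, rest = match.groups()
        if kind == "s":
            # "-s10 TAXI" and "-s 20 multi_english_german" both appear in
            # the example header; the declared length is split off and ignored.
            _length, value = rest.split(None, 1)
            fields[key] = value
        elif kind == "i":
            fields[key] = int(rest)
        else:
            fields[key] = float(rest)
    return fields

# Assumed test data: a shortened header padded to 1024 bytes, 4 dummy samples.
example = "\n".join([
    "NIST_1A", "   1024",
    "database_id -s10 TAXI",
    "scenario_language -s 20 multi_english_german",
    "sample_coding -s3 pcm",
    "sample_n_bytes -i 2",
    "sample_count -i 124798",
    "sample_rate -i 8000",
    "end_head",
]) + "\n"
raw = example.encode("ascii").ljust(1024, b" ") + b"\x00\x00" * 4
stream = io.BytesIO(raw)
header = read_nist_header(stream)
print(header["database_id"], header["sample_rate"], stream.tell())
```

After the call, the stream is positioned at the first sample, so the
remaining bytes can be interpreted according to sample_coding and
sample_n_bytes.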

Example header:

  NIST_1A
     1024
  database_id -s10 TAXI
  database_version -s3 1.0
  scenario_language -s 20 multi_english_german
  recording_site -s3 BAS
  recording_medium -s9 telephone
  sample_coding -s3 pcm
  sample_n_bytes -i 2
  channel_count -i 1
  sample_count -i 124798
  sample_byte_format -s2 01
  sample_rate -i 8000
  sample_sig_bits -i 12
  scenario_date -s6 010616
  end_head

If necessary, software for extracting information from the header,
editing header information etc. can be obtained from the NIST
ftp-server under the address: ftp://jaguar.ncsl.nist.gov/pub/ .
The source package (for Unix) is called "sphere_2.6a.tar.Z".

-------------------------------------------------------------------------
Transcription and error markers

All recordings were annotated according to SpeechDat conventions.
See the document doc/TRANSCRP.PDF for details.
Every transcript was validated once by a different transcriber.
Furthermore, the transcriber pressed a button to give each turn an
overall evaluation of 'usable' or 'garbage'.
Turns were marked as 'garbage' if they contained

- overlapping speech
- Meta-Talk
- empty turns
- turns with noise only
- use of the wrong language

This validation resulted in 645 usable and 372 garbage turns.

-------------------------------------------------------------------------
Translation

All usable turns were crosswise translated into the other dialogue
language. The translation was done in a way to yield naturally
formed sentences (no 'close' translation).
Only turns marked 'usable' were translated.
Unlike the transcript, the translation does not contain SpeechDat
markers, but it follows the SpeechDat conventions otherwise (no
capital letters except for names; no punctuation; etc.)

-------------------------------------------------------------------------
Annotation format

The results of annotation, validation and translation were
summarized in the file doc/TRANS.TXT.
Only recordings marked 'usable' were annotated and translated.
Recordings from the 'PD' and 'HP' parts are not listed here, because
it is unclear which language is used or which speaker speaks.
The format of this file is as follows:

  Turn name ; Validation ; Transcript ; Translation

where:

  Turn name   = body name of signal file, e.g. 0083HP02
  Validation  = usable|garbage
  Transcript  = transcript according to SpeechDat conventions
  Translation = translation into the other dialogue language

e.g.:

  0040DP05;usable;[spk] [fil] leider akzeptieren wir nur Barzahlung;unfortunately we only accept cash

-------------------------------------------------------------------------
BAS Partitur Format file (BPF)

In the subdirectory "par" you will find BAS Partitur Format files
for every recording file that was validated 'usable'.
The BPF format is documented in
www.bas.uni-muenchen.de/Bas/BasFormatseng.html. A copy of this
document at the time of CDROM production is in
doc/BasFormatseng.html.

The BPF files of TAXI contain the following tiers:

- ORT : Lexical access (annotation as in SpeechDat, with '[fil]',
        without '[spk]', '[int]' and '[sta]'). German Umlauts are
        coded in UTF-8.
- KAN : Canonical pronunciation in extended German SAMPA or English
        SAMPA respectively. Unintelligible words ('**') are coded
        with the garbage model /usb/. Filled pause is coded as /QE:/
        in German and /V/ in English.
- NOI : Noise marker
- TLN : Translation. German Umlauts are coded in UTF-8.

-------------------------------------------------------------------------
emuDB

Starting from version 2.5, the corpus is also distributed as an emuR
compatible emuDB. The emuDB is in the TAXI_emuDB directory.

To load and view in R:

  install.packages("emuR") # if necessary
  library(emuR)
  handle = load_emuDB("/path/to/TAXI_emuDB")
  serve(handle)

Note that bundles that do not have a BPF (e.g. because they are
empty or unusable) have empty annotation files in the emuDB. The
sound files are present nonetheless.

Unlike the previous version, version 2.5 contains audio files in
WAVE format.
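
For work outside R, the semicolon-separated lines of doc/TRANS.TXT
described under 'Annotation format' can also be read directly. The
following Python sketch splits one line into its four fields and
breaks the turn name into recording number, speaker marker and turn
number as defined under 'Naming conventions'. The names TransEntry
and parse_trans_line are illustrative. The sketch assumes that
transcripts and translations contain no ';' themselves and that
missing trailing fields (for example in garbage turns) should be
read as empty strings.

```python
import re
from typing import NamedTuple

class TransEntry(NamedTuple):
    recording: str    # four digit recording number, e.g. "0040"
    speaker: str      # PD, HP, DP or CL
    turn: int         # turn number, counted from 0
    usable: bool      # True for 'usable', False for 'garbage'
    transcript: str
    translation: str

# Turn name: recording number + speaker marker + two digit turn number.
TURN_NAME = re.compile(r"(\d{4})(PD|HP|DP|CL)(\d{2})")

def parse_trans_line(line):
    """Parse one 'Turn name;Validation;Transcript;Translation' line."""
    parts = [part.strip() for part in line.rstrip("\n").split(";")]
    parts += [""] * (4 - len(parts))  # pad missing trailing fields
    name, validation, transcript, translation = parts[:4]
    match = TURN_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected turn name: {name!r}")
    if validation not in ("usable", "garbage"):
        raise ValueError(f"unexpected validation value: {validation!r}")
    return TransEntry(
        recording=match.group(1),
        speaker=match.group(2),
        turn=int(match.group(3)),
        usable=(validation == "usable"),
        transcript=transcript,
        translation=translation,
    )

# The example line from the 'Annotation format' section.
entry = parse_trans_line(
    "0040DP05;usable;[spk] [fil] leider akzeptieren wir nur Barzahlung;"
    "unfortunately we only accept cash"
)
print(entry.speaker, entry.turn, entry.translation)
```

When reading the whole file, opening it with encoding="utf-8" matches
the text file coding stated for the "doc" directory.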

-------------------------------------------------------------------------
Word lists and dictionaries

The word lists WORDS_DE.TXT and WORDS_EN.TXT contain all words
occurring in the spoken text as well as in the translation for the
two languages. Note that ALL items are included in the list: words,
incomplete words (~), badly recognizable words (*) and the filled
pause '[fil]'. Not included are the noise markers [sta], [int] and
[spk].
The corresponding pronunciation dictionaries PRON_DE.LEX and
PRON_EN.LEX contain the extended German SAMPA and English SAMPA
transcriptions for every item on the list (filled pause is coded as
/QE:/ in German and /V/ in English).
Dictionaries cover the ORT tier of the BAS Partitur Format files
(*.par) completely, but contain 353 additional entries.

-------------------------------------------------------------------------
Known errors

The dialogues 49, 53, 75, 83, 92, 94, 99 and 127 do not contain a
single turn marked 'usable'. (removed in Version 2.3)

The first version 1.0 does not contain BPF files.

Dictionaries cover the ORT tier of the BAS Partitur Format files
(*.par) completely, but contain 353 additional entries not occurring
in the transcriptions.

-------------------------------------------------------------------------
History

16.06.01 : recording date
26.06.01 : validation and delivery date
18.06.03 : Version 2.2 : some repairs in documentation and lexica
           after in-house validation
           (see doc/Revalidation_TAXI.html)
28.08.12 : Version 2.3 : removed corrupt sessions 49, 53, 75, 83,
           92, 94, 99 and 127
23.05.13 : Version 2.4 : changed encoding from ISO8859 to UTF-8 in
           - BPF files (*.par)
           - docu files
           Checked consistency of dictionaries and ORT tier -> ok
24.03.17 : Version 2.5 : converted BPF collection into an emuDB.
           For this purpose, the NIST SPHERE files were converted
           into WAVE files. The old NIST SPHERE files continue to be
           available in the data/ directory; the new WAVE and
           *_annot.json files are contained in the emuDB in
           TAXI_emuDB.