                     _/_/_/_/         _/_/         _/_/_/_/
                    _/      _/       _/ _/        _/      _/
                   _/      _/       _/  _/       _/
                  _/      _/       _/   _/       _/
                 _/_/_/_/         _/_/_/_/        _/_/_/
                _/      _/       _/     _/             _/
               _/      _/       _/      _/             _/
              _/      _/       _/       _/    _/      _/
             _/_/_/_/         _/        _/     _/_/_/_/


                   BAVARIAN ARCHIVE FOR SPEECH SIGNALS 

               University of Munich, Institute of Phonetics
               Schellingstr. 3/II, 80799 Munich, Germany
                      bas@bas.uni-muenchen.de


         COPYRIGHT University of Munich 2001. All rights reserved.   
    This corpus and software may not be disseminated further - not even
      partly - without a written permission of the copyright holders.  

                      Additional Copyright Holders
             Deutsches Forschungszentrum fuer kuenstliche
              Intelligenz (DFKI), Saarbruecken, Germany

----------------------------------------------------------------------

TAXI - multilingual telephone dialog database

27.06.01 / 23.05.13 / 24.03.17		Version 2.5

----------------------------------------------------------------------

This is the documentation for the TAXI dialog database created in 
June 2001 in collaboration with the DFKI, Saarbruecken.

TAXI contains 86 recorded dialogues between a cab dispatcher and
a client recorded over public phone lines (network and GSM).
The dispatcher always spoke German, while the clients always spoke 
English. Total recorded speech: 74 min.


------------------- Contents of this file ------------------------

   CD directory structure
   Recording situation
   Naming conventions
   Signal file formats
   Nist-Header field definitions for TAXI
   Transcription and error markers
   Translation
   Annotation format
   BAS Partitur Files
   emuDB
   Word lists and dictionaries
   Known errors
   History


----------------- CD directory structure --------------------------


Dialogues are situated in the "data" directory.

The directory names of the dialogues consist of the letters
'SES' followed by a four digit number representing the order 
of recordings. For example 'SES0037'. 
The corpus contains 86 recordings. Each recording contains a pre-dialogue
part, the turns of the dispatcher and the client, and a hang-up part that
runs from the last turn of the speakers until both parties have hung up.
Note that the presence of a recording does not imply that it is free of
errors. Refer to the section 'Transcription and error markers' for details
on how individual erroneous turns were handled. Caution: some dialogues may
not contain a single valid turn (see also section 'Known errors').

The directory "doc" contains additional documents about the 
corpus recording and annotation as well as pronunciation dictionaries
for the German spoken part. If not stated otherwise, text files are coded 
either in 7-bit ASCII or UTF-8:

WORDS_DE.TXT	: Word list, German part
WORDS_EN.TXT	: Word list, English part
PRON_DE.LEX	: Pronunciation dictionary, extended German SAMPA, German part
                  (spoken sentences AND translations)
SAMPA.txt 	: table of extended German SAMPA
PRON_EN.LEX	: Pronunciation dictionary, English SAMPA, English part
                  (spoken sentences AND translations)
TRANS.TXT	: Results of validation, transcription and translation
		  (see section below)
TRANSCRP.PDF	: description of rules and conventions of SpeechDat
                  transcription (German)
TRANSCRP_EN.PDF	: description of rules and conventions of SpeechDat
                  transcription (English)
BasFormatseng.html	: Description of the BAS Partitur Format 
			  (see www.bas.uni-muenchen.de/Bas/BasFormatseng.html
			   for an updated version of this document)
		  
The directory "softw" contains some tools that are distributed with the
BAS corpora. They might be useful to test or modify data stored on this
volume.		  

The directory "par" contains BAS Partitur Format files for each recording
(not for the 'PD' and 'HP' items!)

----------------- Recording Situation  --------------------------

All recordings consist of a dialogue between one of two German speaking
cab dispatchers in the German city of Hannover and varying clients 
calling for a ride who always speak English.
As a third party, a recording server listened to both channels and recorded
according to a system of DTMF tones.
Both parties had to follow certain conventions during the dialogue:
to prevent overlap and to allow automatic segmentation by the recording
server, each party had to press a button on the phone to signal to the other
party that the turn was over.

The overall structure of each dialog is as follows:

- both parties call in and prepare for the recording. This part is
  recorded in turn '00' and marked with 'PD' ('Pre-Dialogue').
- the dispatcher presses button '0' to signal the start of the recording.
- the dispatcher speaks the first turn as if he had just answered the call
  and finishes his turn by pressing button '1'.
  His turns are marked with 'DP'.
- the client answers and finishes his turn by pressing the button '2'.
  His turns are marked by 'CL' ('Clients')
- Both parties continue their dialog; there is no convention who has the
  last turn.
- the part between the last button press ('1' or '2') at the end of the
  dialogue and the hang-up of both parties is recorded into the last turn
  and marked by 'HP' ('Hang-Up')

DTMF tones are not included in the recordings. Individual speakers or
dispatchers are not marked. 
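
The turn structure described above can be illustrated with a small Python
sketch. It is not the software of the recording server: the function name
label_turns and the representation of button presses as a list of characters
are assumptions made for illustration, and the continuous turn numbering
across all markers is inferred from turn names such as '0040DP05'.

```python
def label_turns(session: str, buttons: list[str]) -> list[str]:
    """Derive turn names from the button presses of one call (hypothetical).

    Button '0' ends the pre-dialogue, '1' ends a dispatcher turn and '2'
    ends a client turn; whatever follows the last press is the hang-up part.
    """
    markers = {"0": "PD", "1": "DP", "2": "CL"}
    names = []
    for number, button in enumerate(buttons):
        # Assumption: turns are numbered continuously, starting with '00'.
        names.append(f"{session}{markers[button]}{number:02d}")
    # The audio after the last button press is stored as the final turn.
    names.append(f"{session}HP{len(buttons):02d}")
    return names


example = label_turns("0037", ["0", "1", "2", "1"])
```

Under these assumptions a call with one pre-dialogue, two dispatcher turns
and one client turn yields five signal files, the last one being the
hang-up part.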

----------------------- Naming conventions ----------------------


Dialog names are coded as follows:

'SES' + four digit recording number

e.g. 'SES0037'


Turn names consist of the four digit recording number followed by
a two character marker for the speaker or dialog situation followed by a 
two digit turn number:

'0037' + 'DP' + '01'
          |      |
          |      |
          |      ------>  Turn number starting with '00'
          |
           ------------>  Speaker marker :  PD : 'Pre-Dialogue' - part of
                                                 the recording before the
                                                 actual dialog starts
                                            HP : 'Hang-Up' - part of the
                                                 recording after the last
                                                 valid turn until the
                                                 hang-up of both parties
                                                 (can be discarded in most
                                                 cases!)
                                            DP : 'Dispatcher' - German
                                                 speaking dispatcher; two
                                                 speakers throughout the
                                                 corpus
                                            CL : 'Clients' - English
                                                 speaking clients calling
                                                 the dispatcher and
                                                 asking for a fare


The file extension indicates the format of the file:

       .al   NIST file ALAW coding
       .pcm  NIST file PCM coding
       .par  BAS Partitur File (BPF)
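
The naming scheme above lends itself to automatic parsing. The following
Python sketch splits a file name into its parts; the names TURN_FILE,
TurnName and parse_turn_name are hypothetical, and the sketch assumes that
markers are upper case and extensions lower case on disk, which should be
checked against the actual volume.

```python
import re
from typing import NamedTuple, Optional

# Four digit recording number, two character marker, two digit turn number
# and an optional extension as listed above.
TURN_FILE = re.compile(
    r"^(?P<session>\d{4})(?P<marker>PD|HP|DP|CL)(?P<turn>\d{2})"
    r"(?:\.(?P<ext>al|pcm|par))?$"
)


class TurnName(NamedTuple):
    session: str              # e.g. '0037'; the directory is 'SES' + session
    marker: str               # 'PD', 'HP', 'DP' or 'CL'
    turn: int                 # turn number, starting with 0
    extension: Optional[str]  # 'al', 'pcm', 'par' or None


def parse_turn_name(name: str) -> TurnName:
    """Split a turn name such as '0037DP01.pcm' into its components."""
    match = TURN_FILE.match(name)
    if match is None:
        raise ValueError(f"not a TAXI turn name: {name!r}")
    return TurnName(match["session"], match["marker"],
                    int(match["turn"]), match["ext"])


parsed = parse_turn_name("0037DP01.pcm")
directory = "SES" + parsed.session
```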


------------------------- Signal file formats ----------------------

 a. Physical signal characteristics

Signal files contain the digital signal as provided by the phone
company (8 bit, 8 kHz, ALAW coding) as well as an expansion of ALAW to
PCM (16 bit, 8 kHz, signed, Intel byte order, i.e. little-endian).
The recording channels were not separated; that is, if an overlap occurs in
a dialogue, the signals of both speakers are superimposed in the signal
file.
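
For readers who want to reproduce the expansion from ALAW to PCM, the
following Python sketch decodes ALAW bytes with the widely used ITU-T G.711
reference procedure. It is an illustration only, not the tool that produced
the PCM files of this corpus: the function names are hypothetical, and the
scaling of the distributed PCM files may differ (the example header in the
next section states 12 significant bits). The input is the sample data that
follows the NIST header.

```python
import struct


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code to a signed linear sample (G.711)."""
    code ^= 0x55                      # undo the even-bit inversion
    mantissa = (code & 0x0F) << 4
    segment = (code & 0x70) >> 4
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    # After the inversion, a set top bit marks a positive sample.
    return value if code & 0x80 else -value


def alaw_bytes_to_pcm16le(data: bytes) -> bytes:
    """Convert A-law sample bytes to 16 bit signed little-endian PCM."""
    samples = [alaw_to_linear(byte) for byte in data]
    return struct.pack(f"<{len(samples)}h", *samples)


pcm = alaw_bytes_to_pcm16le(bytes([0xD5, 0x55]))
```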

 b. Logical signal characteristics

Each signal file contains one turn of a dialogue 
session of one speaker.


----------------- Nist-Header field definitions for TAXI --------------


The signal files begin with a header following the NIST conventions.
It has a minimum size of 1024 bytes and consists of ASCII characters.
The format is as follows:


key                     type    description          (possible) value(s)
------------------------------------------------------------------------
database_id             string  database             TAXI
database_version        string  version              1.0
scenario_language       string  recorded language    [german|english|japanese|
                                                     multi_english_japanese|
                                                     multi_german_japanese|
                                                     multi_english_german|
                                                     multi_german|
                                                     multi_japanese|
                                                     multi_english|noise]
recording_site          string  site                 BAS
recording_medium        string  rec. medium          telephone
sample_coding           string  coding               alaw
sample_n_bytes          int     bytes/sample         1
channel_count           int     # of channels        1
sample_count            int     # of samples
sample_byte_format      string  little/big           1
                                endian, one byte
sample_rate             int     samp. freq           8000
sample_sig_bits         int     number of valid
                                bits within a word
scenario_date           string  logical date of      YYMMDD, e.g. 980101
                                recording


The remaining bytes are filled with spaces.

Example header: 

NIST_1A
   1024
database_id -s10 TAXI
database_version -s3 1.0
scenario_language -s 20 multi_english_german
recording_site -s3 BAS
recording_medium -s9 telephone
sample_coding -s3 pcm
sample_n_bytes -i 2
channel_count -i 1
sample_count -i 124798
sample_byte_format -s2 01
sample_rate -i 8000
sample_sig_bits -i 12
scenario_date -s6 010616
end_head


If necessary, software for extracting information from the header, 
editing header information etc. can be obtained from the NIST 
ftp-server under the address: 
        ftp://jaguar.ncsl.nist.gov/pub/  .

The source package (for Unix) is called "sphere_2.6a.tar.Z".
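
As an alternative to the NIST tools, the header fields can be read with a
few lines of Python. The sketch below is a minimal reader written for this
documentation, not part of the NIST software: the names FIELD and
parse_nist_header as well as the result key "header_bytes" are hypothetical.
It assumes the layout shown in the example header and tolerates the blank
between "-s" and the length that occurs there.

```python
import re

# One field per line: "<key> -<type> <value>", where the type is "sN"
# (string of length N), "i" (integer) or "r" (real number).
FIELD = re.compile(r"^(?P<key>\S+)\s+-(?P<type>s\s*\d+|i|r)\s+(?P<value>.*)$")


def parse_nist_header(raw: bytes) -> dict:
    """Parse the ASCII header found at the start of a signal file."""
    lines = raw.decode("ascii", errors="replace").splitlines()
    if len(lines) < 2 or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST_1A header")
    # "header_bytes" is a name chosen for this sketch, not a NIST field.
    fields = {"header_bytes": int(lines[1])}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        match = FIELD.match(line)
        if match is None:
            continue  # skip anything that is not a field line
        value = match["value"].strip()
        kind = match["type"][0]
        if kind == "i":
            value = int(value)
        elif kind == "r":
            value = float(value)
        fields[match["key"]] = value
    return fields


# A shortened copy of the example header, padded with spaces to 1024 bytes.
example = (
    "NIST_1A\n   1024\n"
    "scenario_language -s 20 multi_english_german\n"
    "sample_coding -s3 pcm\n"
    "sample_n_bytes -i 2\n"
    "sample_count -i 124798\n"
    "sample_rate -i 8000\n"
    "end_head\n"
).ljust(1024).encode("ascii")
header = parse_nist_header(example)
duration_s = header["sample_count"] / header["sample_rate"]
```

For a real file, the first 1024 bytes can be read in binary mode and passed
to the function; the sample data starts at the offset given in the second
header line.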
 

-------------------------------------------------------------------------

Transcription and error markers

All recordings were annotated according to SpeechDat conventions.
See the document doc/TRANSCRP.PDF for details. Every transcript was
validated once by a different transcriber.
Furthermore, the transcriber pressed a button to give an overall evaluation
'usable/garbage' for each turn.
A turn was marked as 'garbage' in any of the following cases:
- overlapping speech
- meta-talk
- empty turns
- turns with noise only
- use of the wrong language

This validation resulted in 645 usable and 372 garbage turns.

-------------------------------------------------------------------------

Translation

All usable turns were translated crosswise into the other dialogue
language. The translations were written as naturally formed sentences
(no 'close' translation). Turns marked 'garbage' were not translated.
Unlike the transcript, the translation does not contain SpeechDat markers,
but it otherwise follows the SpeechDat conventions (no capital letters
except in names; no punctuation; etc.).

-------------------------------------------------------------------------

Annotation format

The results of annotation, validation and translation were summarized in
the file doc/TRANS.TXT.
Only recordings marked 'usable' were annotated and translated.
Recordings from the 'PD' and 'HP' parts are not listed here, because 
it is unclear which language is used or which speaker is speaking.

The format of this file is as follows:

Turn name ; Validation ; Transcript ; Translation

      where: Turn name   = body name of signal file,
                           e.g. 0040DP05
             Validation  = usable|garbage
             Transcript  = transcript according to SpeechDat conventions
             Translation = translation into the other dialogue language

e.g.:

0040DP05;usable;[spk] [fil] leider akzeptieren wir nur Barzahlung;unfortunately we only accept cash
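
Lines of this form can be split with a few lines of Python. The sketch
below assumes that the four fields are separated by semicolons exactly as
in the example and that no semicolon occurs inside the transcript; the
names TransEntry and parse_trans_line are hypothetical. Missing trailing
fields (which may occur for turns that were not transcribed) are returned
as empty strings.

```python
from typing import NamedTuple


class TransEntry(NamedTuple):
    turn: str         # body name of the signal file
    validation: str   # 'usable' or 'garbage'
    transcript: str   # SpeechDat transcript, may be empty
    translation: str  # translation, may be empty


def parse_trans_line(line: str) -> TransEntry:
    """Split one line of doc/TRANS.TXT into its four fields."""
    parts = [part.strip() for part in line.rstrip("\r\n").split(";", 3)]
    if len(parts) < 2 or not parts[0]:
        raise ValueError(f"unexpected line: {line!r}")
    parts += [""] * (4 - len(parts))  # pad missing trailing fields
    return TransEntry(*parts)


entry = parse_trans_line(
    "0040DP05;usable;[spk] [fil] leider akzeptieren wir nur Barzahlung;"
    "unfortunately we only accept cash"
)
```

When reading the whole file, it should be opened with UTF-8 encoding (see
the entry for version 2.4 in the section 'History').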

-------------------------------------------------------------------------

BAS Partitur Format file (BPF)

In the subdirectory "par" you will find BAS Partitur Format files for every
recording file that was validated as 'usable'. The BPF format is documented
at www.bas.uni-muenchen.de/Bas/BasFormatseng.html.
A copy of this document as of the time of CD-ROM production is in 
doc/BasFormatseng.html.

The BPF files of TAXI contain the following tiers:
- ORT : Lexical access (annotation as in SpeechDat, with '[fil]', without
        '[spk]', '[int]' and '[sta]'). German umlauts are coded in
        UTF-8.
- KAN : Canonical pronunciation in extended German SAMPA or English SAMPA
        respectively. Unintelligible words ('**') are coded with
        the garbage model /usb/. Filled pauses are coded as /QE:/ in German
        and /V/ in English.
- NOI : Noise marker
- TLN : Translation. German Umlauts are coded in UTF-8.
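
The word-level tiers can be collected with a short Python sketch. It
assumes the usual BPF layout in which the header ends with a line starting
with 'LBD:' and each ORT or KAN line consists of the tier name, the word
index and the content, separated by white space; doc/BasFormatseng.html
remains the authoritative description. The function name read_word_tiers
is hypothetical, and the sample lines are invented for illustration and not
taken from the corpus.

```python
def read_word_tiers(lines, tiers=("ORT", "KAN")):
    """Collect the given tiers per word index from the lines of a BPF file."""
    words = {}
    in_body = False
    for line in lines:
        if not in_body:
            # Everything up to and including the 'LBD:' line is header.
            in_body = line.startswith("LBD:")
            continue
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        tier = parts[0].rstrip(":")
        if tier in tiers:
            words.setdefault(int(parts[1]), {})[tier] = parts[2].strip()
    return words


# Invented sample content (not from the corpus).
sample = [
    "LHD: Partitur 1.2",
    "SAM: 8000",
    "LBD:",
    "ORT:\t0\t[fil]",
    "ORT:\t1\tleider",
    "KAN:\t0\tQE:",
    "KAN:\t1\tlaId6",
]
tiers = read_word_tiers(sample)
```

For a real file the lines would be read with UTF-8 encoding. Other tiers
such as NOI and TLN are skipped by default because their line layout is not
described in this file.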


-------------------------------------------------------------------------

emuDB

Starting from version 2.5, the corpus is also distributed as an emuR compatible
emuDB. The emuDB is in the TAXI_emuDB directory. To load and view in R:

install.packages("emuR") # if necessary
library(emuR)
handle = load_emuDB("/path/to/TAXI_emuDB")
serve(handle)

Note that bundles that do not have a BPF (e.g. because they are empty or unusable)
have empty annotation files in the emuDB. The sound files are present nonetheless.

Unlike previous versions, version 2.5 also contains audio files in WAVE format.

-------------------------------------------------------------------------

Word lists and dictionaries

The word lists WORDS_DE.TXT and WORDS_EN.TXT contain all words 
occurring in the spoken text as well as in the translation for the 
two languages. Note that ALL items are included in the lists:
words, incomplete words (~), badly recognizable words (*) and 
the filled pause '[fil]'. Not included are the noise markers [sta], 
[int] and [spk].
The corresponding pronunciation dictionaries PRON_DE.LEX and PRON_EN.LEX 
contain the extended German SAMPA and English SAMPA transcriptions of 
every item on the lists (filled pause is coded as /QE:/ in German and 
/V/ in English).
The dictionaries cover the ORT tier of the BAS Partitur Format files (*.par) 
completely, but contain 353 additional entries.

-------------------------------------------------------------------------

Known errors

The dialogues 49, 53, 75, 83, 92, 94, 99 and 127 do not contain a single
turn marked 'usable' (removed in Version 2.3).
The first version 1.0 does not contain BPF files.
The dictionaries cover the ORT tier of the BAS Partitur Format files (*.par) 
completely, but contain 353 additional entries not occurring in the 
transcriptions.

-------------------------------------------------------------------------

History

16.06.01 : recording date
26.06.01 : validation and delivery date
18.06.03 : Version 2.2 : some repairs in documentation and lexica 
           after inhouse validation
           (see doc/Revalidation_TAXI.html)
28.08.12 : Version 2.3 : removed corrupt sessions 49, 53, 75, 83, 92, 94, 99 and 127	   
23.05.13 : Version 2.4 : changed encoding from ISO8859 to UTF-8 in 
           - BPF files (*.par)
	   - docu files
	   Checked consistency of dictionaries and ORT tier -> ok
24.03.17 : Version 2.5 : converted BPF collection into an emuDB. For this
           purpose, the NIST SPHERE files were converted into WAVE files.
           The old NIST SPHERE files continue to be available in the data/
           directory; the new WAVE and *_annot.json files are contained in
           the emuDB in TAXI_emuDB.