BAVARIAN ARCHIVE FOR SPEECH SIGNALS
University of Munich, Institute of Phonetics
Schellingstr. 3/II, 80799 Munich, Germany
bas@phonetik.uni-muenchen.de

COPYRIGHT University of Munich, University of Erlangen 2005,2006,2007.
All rights reserved. This corpus and software may not be disseminated
further - not even partly - without written permission of the copyright
holders.

Additional Copyright Holders

-----------------------------------------------------------------------------
SMARTWEB UMTS Database 3.7 (Public Edition)          17.05.2021 (DD.MM.YYYY)
-----------------------------------------------------------------------------

The SMARTWEB UMTS data collection was created within the publicly funded
German SmartWeb project in the years 2004 - 2006. It comprises a collection
of user queries to a naturally spoken Web interface with the main focus on
the soccer world series in 2006. The recordings include field recordings
using a hand-held UMTS device (one person, SmartWeb Handheld Corpus SHC),
field recordings with video capture of the primary speaker and a secondary
speaker (SmartWeb Video Corpus SVC) as well as mobile recordings performed
on a BMW motorbike (one speaker, SmartWeb Motorbike Corpus SMC).

An addendum DVD-R (dvd-fau, vol 24) contains additional data derived from
the basic SVC corpus data provided by FAU Erlangen (see readme.fau).

------------------- Contents of this directory ------------------------------

clip-rates         : table summarizing the clip rates of each recording
                     (number of clipped samples divided by total number
                     of samples)
dtd/               : document type definitions for recording protocols,
                     speaker profiles and session scripts
german-sampa.txt   : definition of extended German SAM-PA as used in most
                     German speech resources
instructions/      : (German) instruction sheets for users
papers/            : reports and publications
pardoc/            : copy of the BAS Partitur File definition
                     (HTML: start with "BasFormatseng.html")
prompt/            : contains all prompts used in orthographic form and
                     as audio (alaw) files
readme             : this file
readme.mar         : format description of the turn segment files *.mar
readme.par         : format description of the BAS Partitur Format (BPF)
                     files *.par
readme.rpr         : description of recording protocols
readme.rsc         : description of recording scripts
readme.spr         : description of speaker protocols
sw_ger.lex         : pronunciation dictionary (SAM-PA) for all SW data:
                     simple list of all spoken words in the corpus in the
                     first column and the canonical pronunciation in
                     extended German SAM-PA in the second (see table
                     german-sampa.txt).
                     For a detailed description of the conventions for
                     canonical German pronunciation see:
                     www.bas.uni-muenchen.de/Bas/BasGermanPronunciation/
                     Note that all special characters of the SW
                     transliteration are deleted. Therefore the
                     orthographic representation in the dictionary is
                     exactly the same as in the ORT tier of the BAS
                     Partitur Format (BPF).
techdocs/          : project reports in German
readme.trl         : format description of the transliteration files
                     *.trl. This file basically lists the elements of the
                     SmartKom TRL conventions (see trl-coding) that are
                     used in the SW transliteration
session-statistics : 22 column table summarizing the most prominent
                     parameters of each recording session. A header line
                     (marked with a leading '#') describes the 22 field
                     values.
trl-coding/        : a copy of the English version of the conventions of
                     transliteration in SmartKom
                     (HTML: start with "index.html")
validation-reports/: validation reports in HTML format

------------------- Contents of this file -----------------------------------

General information
DVD directory structure and session naming
Signal files contained in this edition
Naming conventions
Annotation and segmentation files contained in this edition
emuDB
Recruitment
Additional interesting documentation
General known errors across all recordings
History

------------------- General Information -------------------------------------

The SmartWeb (SW) UMTS speech corpus was produced in the years 2004 - 2006
at the Bavarian Archive for Speech Signals (BAS) located at the University
of Munich (LMU, www.bas.uni-muenchen.de/Bas). The corpus was 100% funded by
the German Ministry for Education and Science and is therefore freely
available for all kinds of usage except re-distribution to third parties.

The primary aim of the corpus was the empirical study of human - computer
interaction (HCI) using a natural language interface to the Internet.

The recordings were not performed in the classic Wizard-of-Oz setting, where
the machine is simulated by a human operator, because this would require
more personnel and technical effort. Instead, the SW corpus was collected
using a fully automated prompt system where the user was instructed to act
in an artificial context prior to each recording.

Example:

  Instructor: Imagine you are at Duesseldorf main station and would like
              to know how to get to the city center by bus. Ask SmartWeb
              for instructions.
  User:       How do I get to the city center by bus?
  SmartWeb:   Bus number 17 goes from the main station to the city center.
  Instructor: Now you would like to know when buses are leaving.
  User:       When does the next bus number 17 leave at main station?
  SmartWeb:   The bus number 17 leaves every full and half hour.

Users were carefully instructed to act as naturally as possible and to act
as if using a fully functional SW system. After the instruction the user was
left alone with the system:

- in the data collection using the handheld UMTS device (SHC), the user was
  allowed to move freely in the public space (street, station, park,
  restaurants, lobby, etc.);
- in the multi-party situation with video capture (SVC), the users were
  asked not to move in order not to further complicate the triad
  communication setting;
- in the data collection on motorbike (SMC), the user was instructed to
  follow a certain route where UMTS coverage was verified.

The recordings with the handheld device were performed in two different
technical setups. In the classic handheld corpus (SHC) a single user
delivers all queries to the system. In the video handheld corpus (SVC) a
pair of users is recorded, where the primary user acts as the speaker to the
SmartWeb system while the second user interrupts the primary user during the
recording to provoke OnView/OffView as well as OnTalk/OffTalk situations.
The latter recordings contain a video of the face of the primary user
captured by the built-in camera of the handheld device.

After the recording, the different sound channels (see section 'signal
files' below) were synchronized and annotated using the SW annotation
conventions (see readme.trl).

Each volume of the SW corpus edition contains exclusively data of either the
handheld (SHC), the motorbike (SMC) or the video (SVC) corpus collection.

  Volume 1 - 15  : SHC
  Volume 16 - 18 : SMC
  Volume 19 - 23 : SVC
  Volume 24      : dvd-fau : addendum DVD-R with additional data derived
                   from SVC (see readme.fau)

------------------- DVD directory structure and session naming --------------

The root directory of each DVD contains the following:

  data  : subdirectories for each session on this DVD with all signal files
  doc   : all documentation; the directory of this file
  annot : subdirectories for each annotation type that contain all
          annotations in the SW corpus
  meta  : speaker profiles, recording scripts and recording protocols for
          all SW recordings

This means that while a given distribution may contain signal files from
only one corpus, it will always contain documentation, annotation files and
meta data for all three corpora.

------------------- Signal files contained in this edition ------------------

SHC handheld device recordings (type u)

156 speakers (86 female, 6 adolescents); 10966 recordings totalling 30.6h.
The following three signal types were recorded in a SHC recording:

  1. UMTS channel
     The speech signal is captured by a close distance microphone (via
     Bluetooth), transferred via a standard UMTS transmission line (WCDMA)
     to a speech server connected via a standard Euro-ISDN line. The result
     of this transmission line is a signal coded in 8bit, 8kHz, ALAW.
     These recordings are available in two forms: one total recording of
     the whole session (including all system prompts) and one recording per
     prompt (that is the recording as would be done by the real SmartWeb
     system).

  2. Harddisc recording left channel
     The speech signal is captured by the identical microphone as in the
     UMTS channel but recorded directly to a harddisc recorder before any
     UMTS channel transmission. Therefore this channel contains the pure
     microphone signal as received by a standard Bluetooth receiver
     (Bluetooth 1.1). The result of this recording is coded in 16bit,
     44,1kHz, PCM.

  3. Harddisc recording right channel
     The speech signal is captured by an additional collar attached
     microphone and recorded directly to a harddisc recorder before any
     UMTS channel transmission. The result of this recording is coded in
     16bit, 44,1kHz, PCM.

Remarks: Since the specification required a variety of microphones, please
refer to the individual recording protocols (/meta/rpr/) for the exact
specification of the microphones used in each session.

SMC motorbike recordings (type m)

36 speakers (2 female, 0 adolescents); 2315 recordings totalling 6.3h.
The following three signal types were recorded in a motorbike recording:

  1. UMTS channel
     The speech signal is captured by a Bluetooth connected motorbike
     helmet using a microphone array and internal DSP for signal
     enhancement, transferred via a standard UMTS transmission line (WCDMA)
     to a speech server connected via a standard Euro-ISDN line. The result
     of this transmission line is a signal coded in 8bit, 8kHz, ALAW.
     These recordings are available in two forms: one total recording of
     the whole session (including all system prompts) and one recording per
     prompt (that is the recording as would be done by the real SmartWeb
     system).

  2. Harddisc recording left channel
     The speech signal is captured by the identical microphone as in the
     UMTS channel but recorded directly to a harddisc recorder before any
     UMTS channel transmission. Therefore this channel contains the pure
     microphone signal as received by a standard Bluetooth receiver
     (Bluetooth 1.1). The result of this recording is coded in 16bit,
     44,1kHz, PCM.

  3. Harddisc recording right channel
     The speech signal is captured by an additional laryngeal neck
     microphone and recorded directly to a harddisc recorder before any
     UMTS channel transmission. The result of this recording is coded in
     16bit, 44,1kHz, PCM.

Remarks: Due to user requirements a variety of different motorbike helmets
were used. Please refer to the individual recording protocols (/meta/rpr/)
for the exact specification of the helmet used.

SVC video recordings (type i)

99 speakers (63 female, 2 adolescents); 2218 recordings totalling 16.2h.
The following four signal types were recorded in a video recording:

  1. UMTS channel
     The speech signal is captured by a close distance microphone (via
     Bluetooth), transferred via a standard UMTS transmission line (WCDMA)
     to a speech server connected via a standard Euro-ISDN line. The result
     of this transmission line is a signal coded in 8bit, 8kHz, ALAW.
     These recordings are available in two forms: one total recording of
     the whole session (including all system prompts) and one recording per
     prompt (that is the recording as would be done by the real SmartWeb
     system).

  2. Video capture of built-in camera (format 3GPP) over the total session
     without audio channel.

  3. Video capture of built-in camera (format MPG) over the total session
     with synchronised audio channel from UMTS channel.

  4. Harddisc recording left channel
     The speech signal is captured by an additional collar attached
     microphone and recorded directly to a harddisc recorder before any
     UMTS channel transmission. The result of this recording is coded in
     16bit, 44,1kHz, PCM.

Note that additional data files for SVC provided by FAU Erlangen are stored
on the addendum DVD-R 'dvd-fau':

  - automatic face detection results (rectangles)
  - manually segmented faces (rectangles)
  - videos as JPG sequences + classification into OnView/OffView

The time structure of all types of recordings is as follows:

- Total session in UMTS quality
  The UMTS signal sent by the handheld device was recorded throughout the
  length of the recording session. Please note that this recording is not
  present in all handheld (u) SW recordings; this is due to the fact that
  the recording technique was introduced during the recording phase.
- Each recorded prompt in UMTS quality
  Starting with the prompt beep and ending either when the silence
  detection triggered or when the maximum recording length (12sec) for one
  query was reached. This recording MUST be present in a distributed
  session.

- Total session in harddisc left channel

- Total session in harddisc right channel

- Only SVC recordings:
  An additional manual turn segmentation was performed on the SVC
  recordings, yielding a second (better suited) segmentation into user
  queries (files *_man.mar). This was necessary because the automatic
  server-based segmentation of user input was rather chaotic, since the
  user was constantly distracted by the second speaker. The transliteration
  of the SVC recordings (*.trl files) as well as all derived annotations
  are therefore based on this manual segmentation. Consequently there are
  two versions of turn segmentation files (*.mar and *_man.mar) as well as
  two versions of UMTS quality prompt recordings based on these turn
  segmentations. To distinguish the manually segmented files from the
  standard segmentation of the server, these files carry the string 'man'
  instead of the 'abc' string (see below) in the file name. E.g.

    i067_ifp-0020rec-010.al    (traditional seg.)
    i067_man-0000rec-010.al    (manual seg.)

Please note that due to technical malfunctions the harddisc recordings may
be missing in some sessions.

Since the harddisc recording usually started before the server call was
initiated, the harddisc recording may contain a large portion of time before
the actual recording starts. Since the voices of subject(s) and experimenter
were recorded during this time, these portions were blanked out for personal
rights protection. It is therefore not unusual to have a long period of
silence in a harddisc recording before the actual session starts.
Please do not remove these leading silence portions unless you are sure that
you will not exploit the information in the marker files, which contain an
alignment of harddisc recording to UMTS server recordings (see below).

------------------- Naming conventions --------------------------------------

A session (<session>) is named by a 4-character string: a###

  a   : recording type
          g : video test
          i : video
          m : motor bike
          n : motor bike test
          t : hand held test
          u : hand held
          y : pre study
  ### : session number within recording type starting with 001

Remarks: Session numbers start with 001 within each type. There might be
missing numbers.

A recording session is made of a succession of recording blocks which
contain query stimuli of the same topic.

A recording block (<block>) is named by an 8-character string: abc-###%

  a   : SHC,SMC : topic
          f : soccer
          o : open
          m : soccer world series 2006
          n : navigation
          t : tourist info
          v : public transport
          c : soccer community (where to watch, where to get tickets etc.)
  a   : SVC     : always 'i'
  b   : SHC,SMC : type of prompt
          p : standard
          i : individualized
          s : scripted
  b   : SVC     : topic
          c : soccer community (where to watch, where to get tickets etc.)
          f : soccer
          o : open
          t : tourist info
  c   : SHC,SMC : contents about soccer world series 2006
          m : block contains queries about teams
          w : block contains queries about locations
          p : none of w or m
  c   : SVC     : type of prompt
          p : standard
          i : individualized
          s : scripted
  abc : xxx,yxx : training prompts (first 3 of each session)
  ### : block number (fixed alignment to server set)
  %   : block version number starting with 0

A textual form of all used recording blocks can be found in the text files
in meta/prompt/abc-###%.txt.
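
The session and block naming scheme above can be sketched as a small parser.
This is only an illustration and not part of the corpus tools: the function
names parse_session and parse_block and the returned dictionaries are
assumptions made here; only the letter codes and field widths are taken from
the tables above. The three block code letters are returned undecoded,
because their meaning differs between SHC/SMC and SVC.

```python
import re

# Recording type letters, copied from the session naming table above.
SESSION_TYPES = {
    "g": "video test",
    "i": "video",
    "m": "motor bike",
    "n": "motor bike test",
    "t": "hand held test",
    "u": "hand held",
    "y": "pre study",
}

# Session name: one type letter + 3-digit session number, e.g. "m066".
SESSION_RE = re.compile(r"([gimntuy])(\d{3})")
# Block name: 3 code letters, "-", 3-digit block number, 1-digit version.
BLOCK_RE = re.compile(r"([a-z]{3})-(\d{3})(\d)")


def parse_session(name):
    """Split a session name into recording type and session number."""
    match = SESSION_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a session name: {name!r}")
    letter, number = match.groups()
    return {"type": SESSION_TYPES[letter], "number": int(number)}


def parse_block(name):
    """Split a block name into code letters, block number and version."""
    match = BLOCK_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a block name: {name!r}")
    code, number, version = match.groups()
    return {"code": code, "number": int(number), "version": int(version)}


print(parse_session("m066"))     # {'type': 'motor bike', 'number': 66}
print(parse_block("ifp-0020"))   # {'code': 'ifp', 'number': 2, 'version': 0}
```

Names that do not follow the scheme raise a ValueError instead of being
silently accepted, which may help when checking a partial distribution.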

A single turn UMTS recording is named by a combination of recording session
(<session>), block name (<block>, the recording prompt block to which this
recording belongs) and the prompt number within this block:

  <session>_<block>rec-##%.al

  ## : prompt number within the block starting with 01
  %  : version number of recording
  al : extension indicating 8bit, 8kHz ALAW

The corresponding BAS Partitur Format file is named like this:

  <session>_<block>rec-##%.par

In case of a SVC recording there will be two single turn UMTS recording
files: one based on the server segmentation named as above and a second one
based on the manual turn segmentation named

  <session>_man-0000rec-##%.al

Examples:

A SHC or SMC session recording (e.g. 'm066') may contain a subset of files
of the following:

  Type                              Name                       Location
  Total server recording            m066_umts.al               /data/m066/
  Total harddisc recording left     m066_hdrl.wav              /data/m066/
  Total harddisc recording right    m066_hdrr.wav              /data/m066/
  Single turn server recording    * m066_<block>rec-###.al     /data/m066/
  Recording session protocol      * m066.rpr                   /meta/rpr/
  Speaker protocol file           * AABB.spr                   /meta/spr/
  Session script file             * m066.rsc                   /meta/rsc/
  Turn Segmentation               * m066.mar                   /annot/mar/
  Transliteration                 * m066.trl                   /annot/trl/
  Bas Partitur Files (BPF)        * m066_<block>rec-###.par    /annot/par/m066

A SVC session recording (e.g. 'i067') may contain a subset of files of the
following:

  Type                              Name                       Location
  Total server recording            i067_umts.al               /data/i067/
  Total harddisc recording left     i067_hdrl.wav              /data/i067/
  Single turn server recording    * i067_<block>rec-###.al     /data/i067/
  Single turn manual recording    * i067_man-0000rec-###.al    /data/i067/
  Recording session protocol      * i067.rpr                   /meta/rpr/
  Speaker protocol file           * AMPT.spr                   /meta/spr/
  Session script file             * i067.rsc                   /meta/rsc/
  Turn Segmentation Server        * i067.mar                   /annot/mar/
  Turn Segmentation Manual        * i067_man.mar               /annot/mar/
  Transliteration                 * i067.trl                   /annot/trl/
  Bas Partitur Files (BPF)        * i067_man-0000rec-###.par   /annot/par/i067
  Total video capture               i067.3gp                   /data/i067/
  Total video synchronized          i067.mpg                   /data/i067/

Remarks:
- the obligatory files are marked with a '*'.
- the file "session-statistics" in this directory gives a listing of all
  channels and annotations that are present per session (as well as some of
  the more important recording and speaker features).

------------------- Annotation and segmentation files -----------------------

Annotation and segmentation files for all SmartWeb sessions are summarized
in subdirs of /annot. Depending on the type of edition the following data
may be found there:

  /annot/trl : transliteration files according to SmartWeb annotation
               (see /doc/readme.trl and /doc/trl-coding)
  /annot/mar : marker files containing the segmentation of the harddisc
               recordings according to the single UMTS recordings. Files
               are in SmartKom format (see /doc/readme.mar)
  /annot/par : BAS Partitur Format files. BAS standard representation of
               annotations and segmentations. The SmartWeb BPF files
               contain the following tiers: TRW, ORT, KAN, TRN
               See /doc/readme.par for a short description of SW BPF
               files; see /doc/pardoc for a detailed description of the
               BPF.
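
As an illustration of how the single turn file names relate to the
annotation directories, the following sketch splits a turn recording name
into its parts and derives the location of the matching BPF file. It assumes
that names have the form session, '_', block, 'rec-', prompt number and
version, as in the examples i067_ifp-0020rec-010.al and
i067_man-0000rec-010.al given in this file. The function names
parse_turn_file and bpf_path are hypothetical and not part of any BAS
software. Note that for SVC sessions the tables above list BPF files only
for the manual segmentation.

```python
import re
from pathlib import PurePosixPath

# Assumed pattern: session (letter + 3 digits), "_", block (8 characters),
# "rec-", 2-digit prompt number, 1-digit recording version, ".al".
TURN_RE = re.compile(r"([a-z]\d{3})_([a-z]{3}-\d{4})rec-(\d{2})(\d)\.al")


def parse_turn_file(filename):
    """Split a single turn recording name into its naming components."""
    match = TURN_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a single turn recording name: {filename!r}")
    session, block, prompt, version = match.groups()
    return {
        "session": session,
        "block": block,
        "prompt": int(prompt),
        "version": int(version),
        # Manually segmented SVC turns carry 'man-0000' as block name.
        "manual": block == "man-0000",
    }


def bpf_path(filename):
    """Return the /annot/par location of the BPF file for a turn recording."""
    info = parse_turn_file(filename)
    par_name = PurePosixPath(filename).with_suffix(".par").name
    return PurePosixPath("/annot/par") / info["session"] / par_name


print(parse_turn_file("i067_ifp-0020rec-010.al"))
print(bpf_path("i067_man-0000rec-010.al"))
# /annot/par/i067/i067_man-0000rec-010.par
```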

Note that video segmentations and labeling provided by FAU Erlangen are
stored on the addendum DVD-R dvd-fau (see readme.fau).

------------------------------- emuDB ---------------------------------------

Starting from version 3.6, the SmartWeb databases (SMC, SVC, SHC) are also
distributed as emuR compatible emu databases (SMC_emuDB, SVC_emuDB,
SHC_emuDB). Each emu database contains the annotation from the BAS Partitur
files (tiers "TRW", "KAN", "ORT" as well as bundle headers) in _annot.json
format. It contains one bundle for every annotated server recording. This
excludes the *hdrl.wav / *hdrr.wav harddisc recordings and the full server
recordings (*umts.al).

In order to comply with the requirements of emuR, recordings were converted
into PCM 8kHz, WAVE RIFF format, i.e.:

  m032_mpp-0020rec-060.al -> m032_mpp-0020rec-060.wav

To load and view the database, start an R session and run:

  install.packages("emuR") # if necessary
  library(emuR)
  handle = load_emuDB("/path/to/emuDB")
  serve(handle)

The emuDB files are additional material. None of the original files from the
corpus were replaced.

------------------- Recruitment ---------------------------------------------

Any speaker could participate in the SHC and SVC recordings if she/he had
used a mobile phone at least once. Speakers of the SMC recordings were
exclusively recruited from BMW Munich.

Speakers with accent (dialect or foreign language speaker) were recorded,
but not specifically recruited. Non-students and persons over 35 years were
systematically recruited, to form a balance with the students that typically
volunteer for such recordings. It turned out that it is virtually impossible
to recruit female users for the motorbike recordings.

------------------- File formats --------------------------------------------

Most of the file formats used in this corpus are widely used standards.
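
As an example of working with one of these formats, the sketch below decodes
raw 8bit, 8kHz ALAW data (the *.al files) into a 16bit PCM WAVE file,
similar in spirit to the .al -> .wav conversion mentioned in the emuDB
section. It is not the tool that was used to build the emuDB. It assumes
that the .al files are headerless ITU-T G.711 A-law with the usual 0x55 bit
inversion; the function names alaw_to_linear and alaw_to_wav are
hypothetical.

```python
import struct
import wave


def alaw_to_linear(byte):
    """Decode one G.711 A-law byte into a signed 16bit linear sample."""
    byte ^= 0x55                     # undo the even-bit inversion
    value = (byte & 0x0F) << 4       # 4 mantissa bits
    segment = (byte & 0x70) >> 4     # 3 exponent (segment) bits
    if segment == 0:
        value += 8
    else:
        value = (value + 0x108) << (segment - 1)
    # After the inversion a set top bit marks a positive sample.
    return value if byte & 0x80 else -value


def alaw_to_wav(alaw_bytes, wav_file, rate=8000):
    """Write raw A-law bytes as mono 16bit PCM WAVE (path or file object)."""
    samples = [alaw_to_linear(b) for b in alaw_bytes]
    with wave.open(wav_file, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)          # 2 bytes = 16bit samples
        out.setframerate(rate)
        out.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

A call might look like alaw_to_wav(open("some_turn.al", "rb").read(),
"some_turn.wav"), where both file names are placeholders. Zero byte
recordings simply yield a WAVE file without samples.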
The following file extensions are used in this corpus:

  .par : BAS Partitur Format file (BPF)
  .ags : BPF represented as an annotation graph (XML)
  .rpr : recording session protocol (XML)
  .spr : speaker protocol file (XML)
  .rsc : recording session script (XML)
  .trl : transliteration
  .mar : turn segmentation with regard to the single server files
  .wav : RIFF audio file: 16bit, 44,1kHz, PCM
  .al  : RAW audio file: 8bit, 8kHz, ALAW
  .3gp : 3GPP video file; original video file captured by PDA
  .mpg : MPG video file; video file with synchronised audio from the
         server recording
  .jpg : JPG picture from a picture sequence
  .pgm : PGM picture with selected face picture

-------------------------------- Numbers ------------------------------------

                         Total      SHC      SMC  SVC                   FAU
  Total sessions           291      156       36  99
  Total speakers           291      156       36  99
  Total interactions     15499    10966     2315  2218 (manual seg)
  Total running words   145085   102759    17175  25151
  Vocabulary size         7326     5624     1687  1643
  Total speech time    2420min  1835min   377min  208min (manual seg)
  Number of DVD-R           24       15        3  5                     1
  Lexicon size            7326

------------------- Additional interesting documentation --------------------

A (German) version of the instructions to the users for all recording types
can be found in the dir /doc/instructions.

The subdir /doc/techdocs contains all relevant internal reports of the
SmartWeb project (including detailed descriptions of the recording and
prompting techniques).

The subdir /doc/papers contains copies of all relevant publications of the
SmartWeb group at the University of Munich.

------------ General known errors/remarks across all recordings -------------

Single turn server recordings may have zero byte length. The reason for this
is that the speech server created all potential recording files at start,
but some of these were never written to if the session was interrupted for
some reason. We include these zero byte files in the distribution in order
to keep the structure of the recording session script intact (for instance
for a script based replay).
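
Since such zero byte files can break audio processing scripts, it may be
useful to locate them before working with a volume. The following is a
minimal sketch; it assumes the layout described in this file (one
subdirectory per session below the data directory, single turn recordings
with extension .al). The function name sessions_with_empty_turns is
hypothetical.

```python
from pathlib import Path


def sessions_with_empty_turns(data_root):
    """Return the sorted names of sessions that hold zero byte *.al files."""
    found = set()
    # Expected layout: <data_root>/<session>/<recording>.al
    for al_file in Path(data_root).glob("*/*.al"):
        if al_file.is_file() and al_file.stat().st_size == 0:
            found.add(al_file.parent.name)
    return sorted(found)
```

For example, sessions_with_empty_turns("/path/to/dvd/data") would return a
list of session names; the path is a placeholder for the data directory of
a mounted volume.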
Sessions that contain zero byte files are:

  u026, u045, u064, u071

Note that zero length recordings also appear in the transliteration (*.trl)
as well as in the BAS Partitur Format files (*.par).

Total server recordings were implemented during the run time of the handheld
collection. Therefore this channel is missing in type u sessions up to
number u092 (except for u001-u003, which were recorded later than u093).

Only 5 SMC motorbike recordings contain the right channel harddisc recording
(neck microphone) because this microphone was introduced later during the
recording phase:

  m025,m026,m027,m032,m033

13 SMC motorbike recordings do not contain the left channel reference
recording of the Bluetooth channel because of malfunctions of the recording
device:

  m006,m007,m015,m017,m018,m029,m040,m041,m044,m045,m046,m047,m048

The left channel harddisc recording of session m023 is cut off in the middle
of the session because of a malfunction of the recorder.

Other known errors:

  Session u025 : harddisc channels are missing
  Session u094 : harddisc channels are missing
  Session u170 : harddisc channels are missing
  Session u108 : u108_hdrl.wav is missing
  Session u180 : u180_hdrl.wav is missing
  Session u181 : u181_hdrl.wav is missing
  Session i003 : left harddisc recording is missing
  Session i016 : left harddisc recording is missing
  Session i034 : left harddisc recording is missing
  Session i045 : left harddisc recording is missing
  Session i046 : left harddisc recording is missing
  Session i004 : video recording stops after 250 sec
  Session i010 : video recording stops after 5 sec
  Session i086 : video recording with bad lighting conditions; video and
                 audio synchronisation (*mpg-file) is not optimal

------------------- History -------------------------------------------------

Main history (only events that concern all SW releases)

12.10.2005 : Release UMTS Handheld 1.1 with 155 sessions of type u.
             This release contains no TRL files.
23.02.2006 : Release UMTS Handheld 1.2 with 154 sessions of type u.
             This release contains 109 TRL files.
             6 sessions were added to the release.
             7 sessions were removed from the release because of heavy
             dialect on the speaker side.
             BAS Partitur Format files were added for all sessions.
28.06.2006 : Release 2.0 with 154 SHC and 36 SMC sessions.
04.07.2006 : 2.1 file naming for single server recording changed:
                 the string '_srs' was deleted from the file name to
                 achieve full compatibility with the trl and par files.
                 This was also changed within the mar files.
                 session 'u160' was added to dvd-015
             2.2 u091,u092 : total server recording was missing -> fixed
             2.3 several hd recordings in motorbike sessions missing
                 -> fixed
02.08.2006 : Release 2.4
             new rpr format, several minor bugs fixed,
             session-statistics added, stable release.
09.01.2007 : Release 2.9 with 156 SHC, 36 SMC and 99 SVC sessions
             several minor bugs fixed
             Final release within the SmartWeb project
29.10.2007 : Release 3.0 with new edition of SVC (DVDs 19 - 23)
             extended by an additional video file per session (MPG)
             containing a synchronised version of the captured video
             (audio stems from the server recording)
21.05.2008 : Release 3.1
             dvd-fau addendum added to corpus: contribution of FAU
             Erlangen to SVC with
             - automatic face detection results (rectangles)
             - manually segmented faces (rectangles)
             - videos as JPG sequences + classification into
               OnView/OffView
03.07.2008 : Release 3.2
             removed right channel harddisc recording from SMC sessions
             m023,m030,m039 because the signal is too weak.
             Fixed corrupt right channel harddisc recording m027
14.08.2008 : Release 3.2
             re-ordered the turn sequence in the trl transcription files
             (annot/trl/*.trl) for handheld (u) and motorbike (m)
             recordings so that the new order reflects the chronological
             order of recordings (previously the order was alphabetical
             according to the recording key (turn marker)). Note that
             video recording transliterations were already in
             chronological order.
13.10.2012 : Release 3.3
             fixed some documentation errors
10.06.2013 : Release 3.4
             changed label file names in dvd-fau, so that multiple file
             names assigned to the same session cannot have identical
             names: e.g.
             i002_0572.txt -> i002_0572_auto.txt, i002_0572_hiwi.txt
29.05.2015 : Release 3.5
             Added table listing clipping ratios per recording (for sound
             quality concerns). Several bugfixes.
26.06.2017 : Release 3.6
             Added emuDB derived from the turn recording BPF files (see
             section 'emuDB' above)
17.05.2021 : 3.7 : added number of speakers and total durations to the
             three sub-corpora SHC, SMC and SVC.