*This page is available in English only. Sorry.*

*This page/data was last updated 2015-11-23*

This page provides a number of empirically based analysis results derived from various German speech corpora. We consider only databases that contain spontaneous, conversational speech. Note that some corpora contain conversations between humans and machines.

The statistics are stored in the form of simple
ASCII lists, tables and matrices
that may be downloaded for easy further processing.
The coding is 7-bit ASCII to avoid the typical problems of different
encodings such as ISO 8859 or Unicode; characters that are not 7-bit ASCII
are coded in LaTeX; phonetic symbols are coded in extended German SAM-PA.

We provide the simple counts, the basic probability of a single entity,
the conditional probability that the entity occurs in certain contexts,
statistics of durations under varying conditions,
as well as the conditional probabilities for a certain entity occurring right
after another certain entity (bigram).

Schiel F (2010): **BAStat: New statistical resources at the Bavarian Archive for Speech Signals**. In: Proc. of LREC 2010, Valletta, Malta, paper 277.

- Phone Monogram Statistics
- Phone Bigram Statistics
- Statistics of Phone Strings/Contexts
- Syllable Raw Data
- Syllable Monogram Statistics
- Syllable Bigram Statistics
- Pronunciation Statistics
- Word Monogram Statistics
- Word Bigram Statistics

Note that the *German SAM-PA* used here deviates from the official German SAM-PA
in that the glottal stop is labelled as /Q/ instead of /?/, since the question mark
often leads to errors in UNIX based processing software. Furthermore, it
contains the un-lengthened tense vowels /e/, /i/, /o/, /y/, /2/ and /u/, the nasalised
vowels /a~/, /E~/, /O~/ and /9~/, and no affricates (the component phonemes
of affricates are labelled and segmented separately).

Para-linguistic segment markers often used in technical speech processing, such as articulatory noise (cough, throat clear etc.), laughing or silence intervals, are not considered. That is, we provide no priors for these events, but we do provide bigram statistics to and from silence intervals and word boundaries (see section Phone Bigram Statistics below). The reason is that these para-linguistic events are marked quite differently and inconsistently across different corpora.

All phone segments belonging to a filled pause (hesitation) are excluded from the analysis, because these phones behave in a different way than phones embedded in spoken words. (For instance they may be exceedingly long.)

Since German also has triphthongs and additional diphthongs that all end in /6/, and it is not clear whether they should be treated as entities or not, we calculate two sets of first order statistics (priors): one with the basic phone list (52) and one extended by the additional 24 /6/-combinations (76).

Basic German Phone List (52)

German Phone List extended by /6/ combinations (76)

Since large corpora of manually segmented spontaneous German are not available, most of the following results are based on the automatic MAUS transcript.

- phone label,
- absolute count,
- probability (this column adds up to 1),
- conditional probability P(WI|Phon) that the phone is word-initial,
- conditional probability P(WF|Phon) that the phone is word-final,
- conditional probability P(WM|Phon) that the phone is word-internal,

- mean of duration all word positions (Mean)
- standard deviation of duration all word positions (SD)
- 25% quantile of duration all word positions (QQ25)
- 50% quantile of duration all word positions (QQ50 = median)
- 75% quantile of duration all word positions (QQ75)
- mean of duration initial word positions (WI)
- standard deviation of duration initial word positions
- 25% quantile of duration initial word positions
- 50% quantile of duration initial word positions (= median)
- 75% quantile of duration initial word positions
- mean of duration final word positions (WF)
- standard deviation of duration final word positions
- 25% quantile of duration final word positions
- 50% quantile of duration final word positions (= median)
- 75% quantile of duration final word positions
- mean of duration middle word positions (WM)
- standard deviation of duration middle word positions
- 25% quantile of duration middle word positions
- 50% quantile of duration middle word positions (= median)
- 75% quantile of duration middle word positions
- mean of duration of single-phone words (WS)
- standard deviation of duration of single-phone words
- 25% quantile of duration of single-phone words
- 50% quantile of duration of single-phone words
- 75% quantile of duration of single-phone words

Database TOTAL

Words 557561

total 2124420

Phon Count P(Phon) P(WI|Phon) P(WF|Phon) P(WM|Phon) Mean(Phon) SD(Phon)

aI 47890 2.254262e-02 1.493005e-01 1.631238e-01 6.859261e-01 1.289775e-01 8.806708e-02

OY 4884 2.298980e-03 0.000000e+00 3.521704e-02 9.647830e-01 1.235842e-01 4.567128e-02

aU 19964 9.397388e-03 4.264676e-01 1.361451e-01 4.156482e-01 1.365879e-01 6.872048e-02

...

The probability in the third column is the estimate for P(phone); consequently the third column sums to 1.

Statistical duration values are given in secs. Since some phone segments are exceedingly long, because of very long hesitations or possibly because of errors in the automatic segmentation algorithm, segment durations above 2000 msec are replaced by the average phone duration of 90 msec before the statistical values are calculated.
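The outlier handling described above can be sketched as follows. This is a minimal illustration and not the script used to build the tables: the function name and the sample durations are hypothetical, and the use of the sample standard deviation and of `statistics.quantiles` with its default method are assumptions, since the page does not state how these values were computed.

```python
import statistics

MAX_DURATION = 2.0   # seconds; longer segments are treated as outliers
REPLACEMENT = 0.09   # seconds; average phone duration used as substitute


def duration_stats(durations):
    """Return mean, SD and quartiles after replacing outlier durations."""
    cleaned = [REPLACEMENT if d > MAX_DURATION else d for d in durations]
    q25, q50, q75 = statistics.quantiles(cleaned, n=4)
    return {
        "mean": statistics.mean(cleaned),
        "sd": statistics.stdev(cleaned),   # sample SD (assumption)
        "q25": q25,
        "q50": q50,
        "q75": q75,
    }


# Hypothetical segment durations in seconds; 3.5 s exceeds the threshold.
stats = duration_stats([0.05, 0.08, 0.12, 0.07, 3.5])
print(stats)
```

Replacing an outlier instead of dropping it keeps the token count unchanged, which matches the description above.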

The conditional probability that a given phone forms a single-phone word is 1 minus the sum of the 4th, 5th and 6th columns.
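As a small worked example, the remaining probability mass for single-phone words can be derived from one table row. The function name is hypothetical; the three input values are those of /aI/ in the sample rows shown above.

```python
def p_single_phone_word(p_wi, p_wf, p_wm):
    """P(WS|Phon): mass left after word-initial, -final and -internal."""
    return 1.0 - (p_wi + p_wf + p_wm)


# Columns 4, 5 and 6 of the /aI/ row in the sample table.
p_ws = p_single_phone_word(1.493005e-01, 1.631238e-01, 6.859261e-01)
print(f"P(WS|aI) = {p_ws:.3e}")
```

For /aI/ this yields a value of about 0.0016, i.e. the phone rarely forms a word on its own.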

- Database: Verbmobil 1 - Training Set (12600 turns, 285280 words, 1115582 phones)
  *(Note that this statistic is based on the training set only; the test and the development sets are not included and may therefore be used for independent testing.)*
- Type of speech : dialogue, two speakers separated by sound-proof glass, non-overlapped speech
- Speaker : 885 speakers (each participating in multiple recordings) recorded in Munich, Bonn, Karlsruhe, Kiel, Germany.
- Domain, content : appointment scheduling
- Segmentation method : MAUS

German Phone Statistics extended by /6/ combinations

- Database: Verbmobil 2 - Training Set (11835 turns, 153438 words, 575924 phones)
  *(Note that this statistic is based on the training set only; the test and the development sets are not included and may therefore be used for independent testing.)*
- Type of speech : dialogue, two speakers facing each other in same room, overlapped speech
- Speaker : 259 different speakers (each participating in multiple recordings) recorded in Munich and Bonn, Germany.
- Domain, content : appointment scheduling, travel and leisure planning
- Segmentation method : MAUS

German Phone Statistics extended by /6/ combinations

- Database: Verbmobil 1 + 2 - Training Sets (24435 turns, 438718 words, 1691506 phones)
- Type of speech : see above
- Speaker : 1144 different speakers (each participating in multiple recordings) recorded in Munich, Kiel, Karlsruhe and Bonn, Germany.
- Domain, content : appointment scheduling, travel and leisure planning
- Segmentation method : MAUS

German Phone Statistics extended by /6/ combinations

- Database: SmartKom Audio - complete database (55681 words, 221498 phones)
- Type of speech : multimodal man-machine interaction (only speech of human actor considered)
- Speaker : 224 different speakers (each participating in 2 recordings) recorded in a Wizard-of-Oz setting in Munich, Germany.
- Domain, content : cinema, tourism planning, TV guide, restaurant, navigation, VCR programming, music jukebox, phone, fax
- Segmentation method : MAUS

German Phone Statistics extended by /6/ combinations

- Database: RVG1 HQ - only spontaneous speech (recordings sp1) (63162 words, 242832 phones)
- Type of speech : spontaneous monologue (1 minute of monologue per speaker)
- Speaker : 419 different speakers recorded from all dialect regions of Austria, Switzerland and Germany.
- Domain, content : The answer to 'Please tell me what you did this week.'
- Segmentation method : MAUS (v2.7)

German Phone Statistics extended by /6/ combinations

(Test sets of VM1 and VM2 are still excluded, though!)

Basic German Phone Statistics

German Phone Statistics extended by /6/ combinations

If n = number of entities, then the matrix is (n+3) cols x (n+2) rows since the first column contains an index to the entities and the !ENTER and !EXIT pseudo entities are added to the data to model entry and exit bigram probabilities of utterances. (Note that the latter only make sense if the corpus is segmented into utterances! This feature is marked for each corpus individually below!)

The rows define the predecessor phone phon1 (indexed in the first
column), while
the columns define the successor phone phon2 (not indexed but in the same
order as the rows), one single element contains the linear
conditional probability P(col=phon2|row=phon1). Consequently, the elements
of each row sum up to 1 (because *per definitionem* after each given phone
must follow another, and hence the probabilities over all possible successors sum
up to 1).

For technical reasons the last line indexed by phon1 = !EXIT also
contains equally distributed
probabilities summing up to 1 although !EXIT has no successor.

Since the 2nd column contains the probabilities for phon2 = !ENTER but
!ENTER has no predecessor, all values in this column are set to zero to
avoid a distortion of the remaining values in the rows.

If entries other than those in the !ENTER column are zero, the bigram combination was not seen in the input corpus/corpora. You may apply standard discounting techniques to obtain non-zero probabilities for these cases.
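One simple way to obtain non-zero values for unseen bigrams is linear interpolation of each row with a fallback distribution (Jelinek-Mercer style smoothing). The sketch below is only one option among several and is not prescribed by this page; the function name, the interpolation weight and the toy numbers are assumptions.

```python
def smooth_row(row, prior, weight=0.01):
    """Mix a bigram row with a prior so that unseen successors get mass.

    row    : conditional probabilities P(col|row) for one predecessor
    prior  : fallback distribution over the same columns (sums to 1)
    weight : share of probability mass moved to the prior (assumed value)
    """
    if len(row) != len(prior):
        raise ValueError("row and prior must have the same length")
    return [(1.0 - weight) * p + weight * q for p, q in zip(row, prior)]


# Hypothetical row over the columns !ENTER, a, b, !EXIT; 'b' was never seen.
row = [0.0, 0.7, 0.0, 0.3]
# The prior keeps the !ENTER column at zero, as in the matrix files.
prior = [0.0, 0.4, 0.4, 0.2]
smoothed = smooth_row(row, prior)
print(smoothed)
```

Because both inputs sum to 1, the smoothed row still sums to 1, and the !ENTER column stays at zero as long as the prior is zero there.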

Identical values following each other are indicated by an optional counter added to the probability, e.g. 8.968610e-03*2 = 8.968610e-03 8.968610e-03

Each bigram matrix file contains a three-line, two-column header listing the name of the database, the total number of words and the total number of phonemes followed by the matrix as described above.
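The file layout described above (three header lines, an index column, optional repeat counters such as `*2`) can be read with a few lines of Python. This is a sketch under assumptions: fields are taken to be separated by whitespace, and the toy file content, including its header keys and the values in the !EXIT row, is hypothetical and only mimics the description.

```python
def expand_tokens(tokens):
    """Expand run-length tokens such as '8.968610e-03*2' into single values."""
    values = []
    for token in tokens:
        if "*" in token:
            value, repeat = token.split("*")
            values.extend([float(value)] * int(repeat))
        else:
            values.append(float(token))
    return values


def parse_bigram_matrix(text, header_lines=3):
    """Return (labels, rows); rows[i][j] is P(col=labels[j] | row=labels[i])."""
    lines = [line for line in text.splitlines() if line.strip()]
    labels, rows = [], []
    for line in lines[header_lines:]:
        fields = line.split()
        labels.append(fields[0])          # first column: predecessor index
        rows.append(expand_tokens(fields[1:]))
    return labels, rows


# Hypothetical toy file with n = 2 entities ('a' and 'b').
TOY_FILE = """\
Database TOY
Words 10
Phones 30
!ENTER 0.000000e+00 5.000000e-01*2 0.000000e+00
a 0.000000e+00 2.500000e-01*2 5.000000e-01
b 0.000000e+00 5.000000e-01 0.000000e+00 5.000000e-01
!EXIT 0.000000e+00 3.333333e-01*3
"""

labels, rows = parse_bigram_matrix(TOY_FILE)
p_a_exit = rows[labels.index("a")][labels.index("!EXIT")]
print(labels, p_a_exit)
```

Since the columns are not labelled in the file, the lookup reuses the row labels as column labels, relying on the statement that both follow the same order.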

- Database: Verbmobil 1 - Training Set (12600 turns, 285280 words, 1115582 phones)
  *(Note that this statistic is based on the training set only; the test and the development sets are not included and may therefore be used for independent testing.)*
- Type of speech : dialogue, two speakers separated by sound-proof glass, non-overlapped speech
- Speaker : 885 speakers (each participating in multiple recordings) recorded in Munich, Bonn, Karlsruhe, Kiel, Germany.
- Domain, content : appointment scheduling
- Segmentation method : MAUS
- Entry and exit transition probabilities are valid in the sense of a *turn initial and turn final transition*, since the utterances in Verbmobil are not segmented into sentences but rather into dialogue turns. Therefore one turn may contain several utterances of which the entry and exit transitions are not counted here.

Bigram Matrix extended by /6/-combinations

- Database: Verbmobil 2 - Training Set (11835 turns, 153438 words, 575924 phones)

- Type of speech : dialogue, two speakers facing each other in same room, overlapped speech
- Speaker : 259 different speakers (each participating in multiple recordings) recorded in Munich and Bonn, Germany.
- Domain, content : appointment scheduling, travel and leisure planning
- Segmentation method : MAUS
- Entry and exit transition probabilities are valid in the sense of a *turn initial and turn final transition*, since the utterances in Verbmobil are not segmented into sentences but rather into dialogue turns. Therefore one turn may contain several utterances of which the entry and exit transitions are not counted here.

Bigram Matrix extended by /6/-combinations

- Database: Verbmobil 1 + 2 - Training Sets (24435 turns, 438718 words, 1691506 phones)
- Type of speech : see above
- Speaker : 1144 different speakers (each participating in multiple recordings) recorded in Munich, Kiel, Karlsruhe and Bonn, Germany.
- Domain, content : appointment scheduling, travel and leisure planning
- Segmentation method : MAUS
- Entry and exit transition probabilities are valid in the sense of a *turn initial and turn final transition*, since the utterances in Verbmobil are not segmented into sentences but rather into dialogue turns. Therefore one turn may contain several utterances of which the entry and exit transitions are not counted here.

Bigram Matrix extended by /6/-combinations

- Database: SmartKom Audio - complete database (55681 words, 221498 phones)
- Type of speech : multimodal man-machine interaction (only speech of human actor considered)
- Speaker : 224 different speakers (each participating in 2 recordings) recorded in a Wizard-of-Oz setting in Munich, Germany.
- Domain, content : cinema, tourism planning, TV guide, restaurant, navigation, VCR programming, music jukebox, phone, fax
- Segmentation method : MAUS

Bigram Matrix extended by /6/-combinations

- Database: RVG1 HQ - only spontaneous speech (recordings sp1) (63162 words, 242832 phones)
- Type of speech : spontaneous monologue (1 minute of monologue per speaker)
- Speaker : 419 different speakers recorded from all dialect regions of Austria, Switzerland and Germany.
- Domain, content : The answer to 'Please tell me what you did this week.'
- Segmentation method : MAUS (v2.7)

Bigram Matrix extended by /6/-combinations

Bigram Matrix based on basic phone set
(Excel version,
CSV version)

Bigram Matrix extended by /6/-combinations

Classical questions often asked are:

*'What is the probability for phone /x/ to occur
with the left-context phone /y/?'*

or:

*'What is the probability for phone /x/ to occur
with right-context phone /z/?'*

or:

*'What is the probability for phone /x/ to occur
with left-context phone /y/ and right-context phone /z/?'*

These can be answered by calculating the probability of the cooccurrence
of an ordered pair of phones (y,x) or (x,z) or an ordered triplet (y,x,z)
by combining the first and second order statistics and ignoring higher order
statistics.

First re-formulate
the question into the form 'y is followed by x' or 'x is followed by z'
or 'y is followed by x followed by z'
respectively and then apply the following formulae:

*P(yx) = P(col=x|row=y) * P(y)*

or:

*P(xz) = P(col=z|row=x) * P(x)*

or:

*P(yxz) = P(col=z|row=x) * P(col=x|row=y) * P(y)*

The latter is merely an estimate since the statistical dependencies
between the left and the right context (between y and z) are being
ignored here.

Following the same scheme the probability of larger phone strings
(for instance syllables) may be estimated.

(Please note that here *P(yx) is not equal to P(xy)* since the cooccurrence
is ordered! The well-known Bayes formula P(A|B) * P(B) = P(B|A) * P(A)
is still valid for this case but you have to use a different matrix for the
lookup of P(B|A) since the meaning of P(A|B) is not 'the probability of
A while B is given' but rather 'the probability of A *occurring
after* B given that B occurs' and P(B|A) would then be 'the probability
of B occurring before A given that A occurs', which is not given in our
bigram matrix.)
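The three formulae above translate directly into table lookups. The following sketch assumes that the monogram probabilities and the bigram matrix have already been loaded into dictionaries; the function and variable names are hypothetical, while the numbers are taken from the examples on this page.

```python
def p_pair(monogram, cond, first, second):
    """P(first followed by second) = P(col=second|row=first) * P(first)."""
    return cond[(first, second)] * monogram[first]


def p_triple(monogram, cond, y, x, z):
    """Estimate P(yxz) = P(col=z|row=x) * P(col=x|row=y) * P(y).

    The dependency between y and z is ignored, so this is only an estimate.
    """
    return cond[(x, z)] * cond[(y, x)] * monogram[y]


# P(phone) from the monogram table, keyed by phone label.
monogram = {"E": 2.415072e-02, "g": 1.853434e-02}
# P(col=successor|row=predecessor), keyed by (predecessor, successor).
cond = {
    ("E", "n"): 9.874496e-02,
    ("g", "@"): 3.030752e-01,
    ("@", "n"): 2.708625e-02,
}

p_En = p_pair(monogram, cond, "E", "n")
p_gen = p_triple(monogram, cond, "g", "@", "n")
# Restrict to word-final position by multiplying with P(word-final|n).
p_gen_final = p_gen * 5.013115e-01
print(p_En, p_gen_final)
```

Keying the conditional table by (predecessor, successor) mirrors the row/column layout of the matrix and makes the ordered nature of the pair explicit.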

Examples:

What is the probability of /n/ with left-context /E/ in any word position
(e.g. 'M**en**sch', '**en**tzückt', 'mond**än**'):

P(col=n|row=E) * P(E) = 9.874496e-02 * 2.415072e-02 = 0.00238476

What is the probability of a word-initial /p/ with right-context /r/ (e.g. '**pr**üfen'):

P(col=r|row=p) * P(p) * P(word-initial|p) = 2.249895e-01 * 1.968239e-02 * 2.466251e-01 = 0.00109213

The estimate for the probability of the high-frequency word-**final** syllable /g@n/ (e.g. 'le**gen**'):

P(col=n|row=@) * P(col=@|row=g) * P(g) * P(word-final|n) =
2.708625e-02 * 3.030752e-01 * 1.853434e-02 * 5.013115e-01 = 7.6275e-5

The estimate for the probability of the high-frequency word-**internal** syllable /g@n/ (e.g. 'le**gen**där'):

P(col=n|row=@) * P(col=@|row=g) * P(g) * (1 - P(word-initial|n) - P(word-final|n)) * (1 - P(word-initial|g) - P(word-final|g)) =
2.542894e-02 * 3.029079e-01 * 1.880843e-02 * (1 - 1.224357e-01 - 4.933440e-01 ) * (1 - 6.239958e-01 - 1.148735e-02) = 2.026100e-05

The estimate for the probability of the low-frequency word-final syllable /vOYs/ (e.g. 'Kon**vois**'):

P(col=OY|row=v) * P(col=s|row=OY) * P(v) * P(word-final|s) = 9.688200e-05 * 1.885138e-02 * 2.265478e-03 * 4.365535e-01 = 1.806273e-09

**Caution 1:**

Since these are merely estimates, take care not to treat them as absolute values. For instance it is
probably not correct to state: *"The probability for the word final syllable /g@n/ is 7.6275e-5!"* but we can
say with some confidence that
*"The probability for the syllable /g@n/ is higher in word-final than in word-internal position."*

(This is actually confirmed in this case by looking up the syllable monogram statistics below, which predicts a zero probability for a word-internal syllable /g@n/ and a non-zero probability for the word-final position.)

**Caution 2:**

Since our statistics are based on finite data sets and the number of observed items differs considerably
between phone, syllable and word statistics, it is not possible to combine estimates from different
statistic types (phones, syllables, words) into one estimate!

Since the number of empirical syllable types in our database is quite high, we do not provide the statistics of the individual corpora but rather one total statistic covering all corpora described above. This also includes the test and development test sets of the Verbmobil corpora; therefore the number of words in the following statistics is slightly larger than the ones listed in the phone statistics above.

In case you need the syllable statistics for the Verbmobil training sets only, please contact me and I can provide you with the data.

SYLLABLE DURATION WORD CANONICAL SYLPOS REFERENCE WORDNR WORDDURATION

where REFERENCE is a file identifier of our internal database
that allows us to find the corresponding recording.

WORDDURATION is the duration of the word from which this syllable was taken
given in secs (this comes in handy if you want to normalize the syllable duration
against different speaking rates).

WORDNR points to the word within this recording from which
this syllable was taken (starting with '0').

SYLPOS is of the form (Pos,Max), e.g. (2,5) is the second
syllable in a 5 syllable word.

CANONICAL is the citation form pronunciation of the word
coded in German SAM-PA (with
/Q/ instead of /?/ for the glottal stop).

WORD is the orthographic form of the word token as described in
the word section of this page.

DURATION is the duration of the syllable given in secs.

SYLLABLE contains the German SAM-PA coding of the syllable but also
a leading ' if the syllable was marked as lexically accented (derived from
CANONICAL) and a trailing '+' if the syllable is part of a function word (a
non-content word, derived from the manual tagging).

Both latter markers (',+) are not 100% correct, since

- the mapping from CANONICAL to phonetic transcript only works in words with the same syllable count in canonical and actual pronunciation (except reductions 2 -> 1 where we also assigned a marker if the 2 syllable canonical word carries a marker)
- the definition of 'function word' used by the human labellers was far from crystal clear, namely "the opposite of content word".
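A record of the raw syllable table could be parsed as sketched below. The sketch assumes that the eight fields are separated by whitespace and that no field contains blanks; the sample line, the class name and the attribute names are hypothetical and only follow the field descriptions above.

```python
from dataclasses import dataclass


@dataclass
class SyllableRecord:
    syllable: str         # SAM-PA string without the ' and + markers
    accented: bool        # leading ' marker was present
    function_word: bool   # trailing + marker was present
    duration: float       # syllable duration in secs
    word: str
    canonical: str
    position: int         # Pos in SYLPOS (Pos,Max)
    syllable_count: int   # Max in SYLPOS (Pos,Max)
    reference: str
    word_number: int      # counted from 0
    word_duration: float  # word duration in secs

    @property
    def relative_duration(self):
        """Syllable duration normalized by the duration of its word."""
        return self.duration / self.word_duration


def parse_syllable_line(line):
    syl, dur, word, canon, sylpos, ref, wordnr, worddur = line.split()
    pos, total = sylpos.strip("()").split(",")
    return SyllableRecord(
        syllable=syl.lstrip("'").rstrip("+"),
        accented=syl.startswith("'"),
        function_word=syl.endswith("+"),
        duration=float(dur),
        word=word,
        canonical=canon,
        position=int(pos),
        syllable_count=int(total),
        reference=ref,
        word_number=int(wordnr),
        word_duration=float(worddur),
    )


# Hypothetical record: accented first syllable of a two-syllable content word.
rec = parse_syllable_line("'le: 0.182 legen le:g@n (1,2) ref0001 3 0.351")
print(rec.syllable, rec.accented, rec.function_word, rec.relative_duration)
```

Dividing by WORDDURATION is one simple way to normalize against speaking rate, as suggested in the field description.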

Since syllable analysis can be tricky, we provide these raw data
as well as the following statistics derived from it. __Please feel
free to use these data for your own studies.__

RANK SYLLABLE COUNT PROBABILITY ACCUMULATED-PROBABILITY

Please note that here lexical accentuation and word class (content vs. function) is not considered. That is, the probability given here includes all taggings for a given syllable. For a more detailed analysis see the monogram and duration statistics in the next section.

As can be seen from this table, the 1000 top ranked syllables cover 94.37% of the speech in the analysed corpora. Also, it is interesting to note that the German syllable /ja:/ (the German affirmative) is the most frequent syllable in conversational speech, followed by /IC/ (1st person singular pronoun); so it seems we mostly talk affirmatively about ourselves ;-)
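The ranking table and the coverage of the top ranked syllables can be derived from plain counts as sketched here. The counts are hypothetical toy values and the function names are assumptions; only the column layout (rank, syllable, count, probability, accumulated probability) follows the table described above.

```python
def rank_table(counts):
    """Build rows of (rank, syllable, count, probability, accumulated)."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows, accumulated = [], 0.0
    for rank, (syllable, count) in enumerate(ranked, start=1):
        probability = count / total
        accumulated += probability
        rows.append((rank, syllable, count, probability, accumulated))
    return rows


def coverage(rows, top_n):
    """Accumulated probability of the top_n ranked entries."""
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    return rows[min(top_n, len(rows)) - 1][4]


# Hypothetical syllable counts.
counts = {"ja:": 50, "IC": 30, "das": 15, "g@": 5}
rows = rank_table(counts)
print(coverage(rows, 2))
```

With real data, `coverage(rows, 1000)` would correspond to the coverage figure quoted above.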

The following figures plot the accumulated probability (the mass) across the ranking of syllables.

The number of syllable types in this table is higher than the
number of entries in the syllable ranking table, because here an accented syllable and
an unaccented syllable are two different types.

Some of the counts do not add up; this is caused by inconsistent
tagging of function words: some word types are tagged as content
and as function word.

For instance the following figure shows the number of distinct word and syllable types found in the data sorted by different filters (left side or top) and the ratio of content (grey) vs. function words (black) across the syllable ranks (right side or bottom):

where:

tot = total number of types

con = from content words only

fun = from function words only

acc = marked as carrying a lexical accent

con acc = marked as carrying a lexical accent in content words

c/a >1 = marked as carrying a lexical accent in content words with more than one syllable

1syl = one-syllable words

2syl = two-syllable words

3syl = three-syllable words

4syl = four-syllable words

5syl = five-syllable words

>5syl = more than five-syllable words

Although the number of syllable types found in function words is very small compared to content words
(left figure, bars 'con' and 'fun'), the ratio between syllable tokens
taken from content and function words
(right or bottom figure) shows that function word syllables are not only found in the top ranking (in the
syllables with the highest probabilities) and must be very repetitive (since
the token ratios are much higher than the type ratios).

The top 100 rank bin (left-most bar in right or bottom figure) shows that
more than 50% of the top 100 syllables belong to function words.

- Rank
- Syllable Syl
- Total Count
- Probability P(Syl)
- Conditional probability being from a content word P(Con|Syl)
- Conditional probability being from a function word P(Fun|Syl)
- Conditional probability being lexically accented P(LA|Syl)
- Conditional probability being word-initial P(WI|Syl)
- Conditional probability being word-final P(WF|Syl)
- Conditional probability being word-internal P(WM|Syl)
- Mean duration Mean(Syl)
- Standard deviation SD(Syl)
- Quarter quantile 25 Q25(Syl)
- Quarter quantile 50 Q50(Syl) (= median)
- Quarter quantile 75 Q75(Syl)
- Mean duration in content words Mean(Con Syl)
- Standard deviation SD(Con Syl)
- Quarter quantile 25 Q25(Con Syl)
- Quarter quantile 50 Q50(Con Syl) (= median)
- Quarter quantile 75 Q75(Con Syl)
- Mean duration in function words Mean(Fun Syl)
- Standard deviation SD(Fun Syl)
- Quarter quantile 25 Q25(Fun Syl)
- Quarter quantile 50 Q50(Fun Syl) (= median)
- Quarter quantile 75 Q75(Fun Syl)
- Mean duration lexically accented Mean(LA Syl)
- Standard deviation SD(LA Syl)
- Quarter quantile 25 Q25(LA Syl)
- Quarter quantile 50 Q50(LA Syl) (= median)
- Quarter quantile 75 Q75(LA Syl)
- Mean duration word-initial Mean(WI Syl)
- Standard deviation SD(WI Syl)
- Quarter quantile 25 Q25(WI Syl)
- Quarter quantile 50 Q50(WI Syl) (= median)
- Quarter quantile 75 Q75(WI Syl)
- Mean duration word-final Mean(WF Syl)
- Standard deviation SD(WF Syl)
- Quarter quantile 25 Q25(WF Syl)
- Quarter quantile 50 Q50(WF Syl) (= median)
- Quarter quantile 75 Q75(WF Syl)
- Mean duration word-internal Mean(WM Syl)
- Standard deviation SD(WM Syl)
- Quarter quantile 25 Q25(WM Syl)
- Quarter quantile 50 Q50(WM Syl) (= median)
- Quarter quantile 75 Q75(WM Syl)
- Mean duration one-syllable words Mean(WS Syl)
- Standard deviation SD(WS Syl)
- Quarter quantile 25 Q25(WS Syl)
- Quarter quantile 50 Q50(WS Syl) (= median)
- Quarter quantile 75 Q75(WS Syl)

Syllable Monogram and Duration Statistics TOTAL

The following figures show some example data derived from this table.

These boxplots (left or top) show the distribution of the medians of the duration of
each syllable type. That is, each syllable type is represented by one
data point in this distribution and the probability of the syllable type
is not considered here. Contrary to expectation the distribution of lexically
accented syllables in words with more than one syllable ('Lex.accented')
**does not deviate** from the distribution over all syllable types
('Duration'). However, as expected the distribution of syllables
derived from function words shows smaller durations than the syllables
derived from content words (which is only slightly elevated from the overall
distribution).

The scatter plot (bottom or right) shows the medians per syllable type across ranking order. As expected there is an inverse correlation between rank and duration, but with a Pearson correlation of r = 0.31 it is very weak.

The (left or top) histogram shows the distribution of the syllable durations
after filtering syllables belonging to hesitations (often exceedingly long) and
syllables only containing the garbage sound (<usb>). The dashed line marks the
arithmetic mean (0.21sec) while the dotted lines mark the quarter-quantiles
25% (0.12sec), 50% (median, 0.17sec) and 75% (0.25sec).

The histogram pretty much ends around a syllable length of
1.0 secs (after that only outliers occur, caused either by excessive sound lengthening
or, more likely, by segmentation errors).
Hence, for the duration statistic
presented above we filter out all syllables that have a duration greater than
1.0 secs (= 0.657% of all syllables).

See section Phone Bigram Statistics for details about the file format.

Bigram Matrix based on raw syllable type list

We provide statistics of pronunciation variants, the basic monogram statistics, average word duration, canonical pronunciation and syllable count, as well as the bigram probabilities.

The results are stored in a four-column table with a three line header:

Database

Pronunciation types

Pronunciation tokens

The columns of the following table are:

ORTHOGRAPHY PRONUNCIATION(SAM-PA) COUNT P(P|W)

Word Pronunciation Statistics TOTAL

The word label '<%>' denotes an unintelligible word; likewise the phonetic label '<usb>' is the phonetic garbage symbol. '$' denotes a spelling.
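The last column of the pronunciation table, P(P|W), is the count of a variant divided by the total count of its word. A minimal sketch, with hypothetical words, variants and counts and an assumed function name:

```python
from collections import defaultdict


def pronunciation_probabilities(counts):
    """Map {(word, pronunciation): count} to {(word, pronunciation): P(P|W)}."""
    totals = defaultdict(int)
    for (word, _pronunciation), count in counts.items():
        totals[word] += count
    return {key: count / totals[key[0]] for key, count in counts.items()}


# Hypothetical variant counts for two word types.
counts = {
    ("haben", "ha:b@n"): 2,
    ("haben", "ha:bm"): 5,
    ("haben", "ham"): 3,
    ("und", "QUnt"): 4,
}
probs = pronunciation_probabilities(counts)
print(probs[("haben", "ha:bm")])
```

By construction the values of all variants of one word sum to 1.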

The following first and second order statistics are based on a list of raw word types coded in 7-bit ASCII LaTeX (if a token starts with an Umlaut, the leading " is quoted with a backslash, e.g. '\"Arger'):

This list contains all transcribed items (including some meta tags; see below) that
can be found in the analysed corpora described in the Phoneme Statistics
sections.

Please note that these raw word types are not necessarily valid word forms. Since
the database is conversational speech the list may contain 'non-words'
such as word breaks, neologisms and dialectal variants.

The following 'meta tags' are also included in the basic word list,
since their statistics might be useful:

- hesitations : <"ah>, <"ahm>, <hm> (German), <uhm>, <uh> (English), <hes>, <h"as> (others)
- non-understandable word : <%> (garbage model for speech)
- silence interval : !SIL
- spelling : spelled characters are capitalized and preceded by a '$', e.g. '$A'. Note that Umlauts can be spelled as well, e.g. '$"A'.
- background noise : !NOISE (garbage model for non-speech)
- articulatory noise (cough, throat clear etc.) : !ANOISE
- laughing : !LAUGH
- audible breathing : !BREATH

Proper names of persons, institutions or locations which consist of several words are in some cases concatenated into one string without blanks, e.g. the movie title 'American Hero X' is listed as 'AmericanHeroX'.

- word type or meta tags (see above) encoded in 7-bit ASCII LaTeX
- absolute word count
- word probability
- average word duration in secs.
  Note: the duration values of meta tags (!....) are not valid!
- canonical pronunciation according to the Transcription Conventions for Canonical German.
- syllable count

See section Phone Bigram Statistics for details about the file format.

Bigram Matrix based on raw word type list

The German part of CELEX contains no empirically based phonetic information about phones and syllables. However, it contains phonological data for phonemes and syllables based on large collections of word types (derived from the archives of the 'Institut der Deutschen Sprache', Mannheim, Germany). We therefore expect that the ratios between phone and syllable tokens and types and hence the statistics will differ significantly.

| | CELEX | BAStat |
|---|---|---|
| Word tokens | 5002442 | 689966 |
| Word types | 84173 | 16426 |
| Syllable tokens | 9062607 | 1030588 |
| Syllable types | 7030 | 9210 |

Although the number of word and syllable tokens is about one
order of magnitude higher in CELEX than in BAStat, we find a comparable
number of syllable types in both resources.
The ratio of word types to word tokens
is lower in CELEX (1.7%) than in BAStat (2.4%); this
is probably caused by the insufficient number of word tokens in
BAStat: while the number of word types in CELEX has probably nearly
converged, in BAStat the number of word types will probably still
grow with increasing corpus size.
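The two type/token ratios quoted above follow directly from the word counts in the comparison table. A short check (variable and function names are hypothetical):

```python
# Word counts copied from the CELEX / BAStat comparison table.
resources = {
    "CELEX": {"word_tokens": 5002442, "word_types": 84173},
    "BAStat": {"word_tokens": 689966, "word_types": 16426},
}


def type_token_ratio(types, tokens):
    """Number of distinct types per running token."""
    return types / tokens


ratios = {
    name: type_token_ratio(data["word_types"], data["word_tokens"])
    for name, data in resources.items()
}
for name, ratio in ratios.items():
    print(f"{name}: {100 * ratio:.1f}%")
```

This prints 1.7% for CELEX and 2.4% for BAStat.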

From the smaller number of word types in BAStat we would
also expect a proportionally smaller number of syllable types, but this
is not the case: the number of syllable types in BAStat exceeds the
number in CELEX. The reason is probably that the phonetic variation
of syllables produces more syllable forms than the phonological
paradigm of CELEX, where each word token always has the same syllables.

The statistic of syllable types also differs considerably: in the following
we plot the first 20 highest ranking syllables from CELEX and BAStat in
descending ranking order.
The few overlaps in both ranking sets are printed in bold face. Likely
overlaps between phonological and phonetic syllable forms are underlined.

(The CELEX phonologic coding was mapped to German SAM-PA here and
word initial glottal stops were inserted (e.g. 'und' /Unt/ -> /QUnt/))

CELEX | **di:** | __de:r__ | **g@** | **t@** | QUnt | QIn | b@ | __t@n__ | tsu: | **das** | QaI | fEr | g@n | **n@** | d@n | de:n | Qan | __n@n__ | __b@r__ | t@r |

BAStat | ja: | IC | **das** | n | dan | **g@** | tn | @ | da: | **di:** | **t@** | s | d6 | vi:6 | vi: | zi: | @n | b6 | ta:k | **n@** |

If we look at the 1000 top ranked syllables (which cover 94.37% of the spoken language in BAStat) in both resources, we find an overlap of merely 47.5%.

Of course this comparison is not entirely valid
since in the case of CELEX the syllabification was done phonologically
with regard to citation word forms while in BAStat the syllabification
is based on the phonetic transcript (which may contain errors). For instance a
syllabic nasal /n/ is very highly ranked in the syllable ranking of
BAStat but does not even appear in the CELEX syllable type list.

Nevertheless, these examples show that it is not plain sailing to use
phone or syllable statistics from a lexically based resource in experimental
setups dealing with spoken language.

- 2008/11/12 : first edition: Verbmobil VM1TRAIN and VM2TRAIN
- 2008/11/25 : added corpus SmartKom (SK), added bigram statistics
- 2009/03/07 : added TOTAL statistics, VM12TRAIN, changed float format percentage in monogram statistics to scientific format (1.00000e+00) probability (to conform with bigram statistics)
- 2009/08/20 : added Statistics of Phoneme Strings/Contexts
- 2009/08/24 : extended basic phoneme list to 'extended German SAM-PA', extended first order statistics to average duration and posteriors of word-initial or word-final occurrence, added second order statistics (bigrams) for /6/-diphthong combinations and silence intervals
- 2009/08/25 : added word bigram statistics
- 2009/08/27 : added word monogram statistics
- 2009/09/23 : extended phone monogram duration statistics
- 2009/10/01 : fixed several errors in RVG1 database
- 2009/10/20 : added syllable collection, syllable ranking
- 2009/10/21 : added syllable monogram/duration statistics
- 2009/10/22 : extended syllable collection by file and word reference
- 2009/10/25 : extended phoneme monogram statistic by duration values for single-phoneme words
- 2009/10/31 : changes word statistics: single word occurrences are now considered in the monogram and bigram to be consistent with syllable statistics
- 2009/11/16 : extended raw syllable table by word duration
- 2009/11/17 : added comparison to lexical resource CELEX
- 2015/11/23 : revised the page to clarify the usage of the terms 'phones' and 'phonemes'.

- test the number of syllable types for convergence by randomly growing the corpus and calculating the number of syllable types at each step. If the number of syllable types has already converged, this should result in a converging figure.
- comparison to manually segmented resource: Kiel Corpus
- the same statistics for read speech
- add TEST and DEV set of the Verbmobil corpora into the part TOTAL of the phoneme statistics
- distinguish phonemes that are part of a lexically accented syllable or not
- word statistics based on large (> 10 Mio) conversational transcripts (unfortunately it appears that large enough transcripts do not exist for German.)
- add conversational recordings of the ALC corpus (150 speakers x 15min = 2250min speech!)
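The convergence test named in the first item of this list could look like the following sketch. It is not an existing BAStat tool: the function name, the fixed random seed, the step size and the toy token list are all assumptions made for illustration.

```python
import random


def type_growth_curve(tokens, step, seed=0):
    """Count distinct types while a randomly ordered corpus grows.

    Returns a list of (tokens seen so far, types seen so far) pairs.
    """
    shuffled = list(tokens)
    random.Random(seed).shuffle(shuffled)   # random growth order
    seen, curve = set(), []
    for i, token in enumerate(shuffled, start=1):
        seen.add(token)
        if i % step == 0 or i == len(shuffled):
            curve.append((i, len(seen)))
    return curve


# Hypothetical syllable tokens: a few frequent types and one rare type.
tokens = ["ja:"] * 40 + ["IC"] * 30 + ["das"] * 20 + ["g@"] * 9 + ["vOYs"]
curve = type_growth_curve(tokens, step=10)
print(curve)
```

If the type counts at the end of the curve no longer increase as more tokens are added, the inventory can be regarded as converged for that corpus size.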

All rights stay with BAS, Ludwig-Maximilians-Universität München.

This page and all other pages with the initial 'BAS' or 'Bas' in the filename may be copied, printed and distributed to other parties, under the condition that the pages are distributed as shown here. Parts of pages or extended pages may not be distributed further without permission of the BAS.

Florian Schiel