The aim of this chapter is to extend some of the techniques presented in Chapter 3 for the analysis of formant frequencies and of the way that formants change in time. The discussion is centred predominantly around vowels and the type of acoustic information that is available for distinguishing between them. Sections 6.1 – 6.3 are for the most part concerned with representing vowels in terms of their first two formant frequencies extracted at the vowel targets. A technique known as kmeans clustering for assessing the influence of context is briefly reviewed, as well as some methods for locating vowel targets automatically from vowel formant data. Outliers that can arise as a result of formant tracking errors are discussed, as well as methods for removing them. As is well known, the formants of the same phonetic vowel vary not only because of context, but also because of speaker differences, and in 6.4 some techniques of vowel normalization are applied to vowel data in order to determine how far they reduce the different formant characteristics of male and female vowels. The final sections of this chapter deal with vowel reduction, undershoot and coarticulatory influences. In 6.5, some metrics based on the Euclidean distance are introduced and applied to determining the expansion of the vowel space relative to its centre: this method is especially relevant for modelling the relationship between vowel positions and vowel hyperarticulation (see e.g., Moon & Lindblom, 1994; Wright, 2003). But Euclidean distance measurements can also be used to assess how close one vowel space is to another, and this has an application in quantifying sound change that is relevant for sociolinguistic investigations. Whereas all the techniques in 6.1 – 6.5 are static, in the sense that they rely on applying analyses to formants extracted at a single point in time, in section 6.6 the focus is on the shape of the entire formant movement as a function of time.
In this section, the coefficients of the parabola fitted to a formant are used both for quantifying vowel undershoot and for smoothing formant frequencies. Finally, the concern in section 6.7 is with the second formant frequency transition as a cue to the place of articulation of consonants and the way that so-called locus equations can be used to measure the coarticulatory influence of a vowel on a preceding or following consonant.
There is extensive evidence going back to the 19th and early 20th centuries that vowel quality distinctions depend on the first two, or first three, resonances of the vocal tract (see e.g., Ladefoged, 1967 and Traunmüller & Lacerda, 1987). Since the first formant frequency is negatively correlated with phonetic vowel height, and since F2 is correlated with vowel backness, a shape resembling the vowel quadrilateral emerges by plotting vowels in the (decreasing) F1 x F2 plane. Essner (1947) and Joos (1948) were amongst the first to demonstrate this relationship, and since then, many different kinds of experimental studies have shown that this space is important for making judgements of vowel quality (see e.g., Harrington & Cassidy, 1999, pp. 60-78). The first task will be to examine some formant data from a male speaker of Standard German.
#we still have to install the newest version of emuR, which is not yet published on CRAN
install.packages("devtools")
devtools::install_github("IPS-LMU/emuR",force=TRUE,build_vignettes=TRUE)
library(emuR)
#change this to your path
path2kielread = "/Users/reubold/kielread/kielread_emuDB/"
# load emuDB into current R session
kielread = load_emuDB(path2kielread, verbose = FALSE)
summary(kielread)
## Name: kielread
## UUID: 08b534b5-916a-4877-93aa-4f6b641995a1
## Directory: /Users/reubold/kielread/kielread_emuDB
## Session count: 1
## Bundle count: 200
## Annotation item count: 12294
## Label count: 32978
## Link count: 11271
##
## Database configuration:
##
## SSFF track definitions:
## name columnName fileExtension
## 1 FORMANTS fm fms
## 2 wrasspFms fm fms_new
## 3 praatFms fm praatFms
##
## Level definitions:
## name type nrOfAttrDefs attrDefNames
## 1 Word ITEM 2 Word; Func;
## 2 Syllable ITEM 1 Syllable;
## 3 Kanonic ITEM 2 Kanonic; SinfoKan;
## 4 Phonetic SEGMENT 4 Phonetic; Autoseg; SinfoPhon; LexAccent;
##
## Link definitions:
## type superlevelName sublevelName
## 1 ONE_TO_MANY Word Syllable
## 2 ONE_TO_MANY Syllable Kanonic
## 3 MANY_TO_MANY Kanonic Phonetic
vowlaxn = query(emuDBhandle = kielread,
query = "Kanonic== a | E | I | O")
vowlaxn.fdat = get_trackdata(kielread,
seglist = vowlaxn,
ssffTrackName = "FORMANTS",
resultType="emuRtrackdata",
verbose = FALSE)
vowlaxn.l = label(vowlaxn)
#bring all tracks to the same length (here: 21 samples, or any other odd number of samples)
vowlaxn.fdat_norm = normalize_length(vowlaxn.fdat, N = 21)
#extract the relative time point 0.5
vowlaxn.fdat_norm.5 = vowlaxn.fdat_norm[vowlaxn.fdat_norm$times_norm==0.5,]
#label of left context
vowlaxn.fdat_norm.5$leftlabels = label(requery_seq(kielread,vowlaxn,offset=-1,calcTimes = FALSE))
#label of right context
vowlaxn.fdat_norm.5$rightlabels = label(requery_seq(kielread,vowlaxn,offset=1,calcTimes = FALSE))
#extract speaker labels (=extract the second and third label of $bundle)
vowlaxn.fdat_norm.5$spkr = substr(vowlaxn$bundle,2,3)
# get labels at the level "Word"
vowlaxn.fdat_norm.5$word = label(requery_hier(kielread,seglist = vowlaxn,level = "Word",calcTimes = FALSE))
#add these columns also to vowlaxn.fdat:
vowlaxn.fdat$leftlabels = vowlaxn.fdat$rightlabels = vowlaxn.fdat$spkr = vowlaxn.fdat$word = "something"
for (i in vowlaxn.fdat_norm.5$sl_rowIdx){
vowlaxn.fdat[vowlaxn.fdat$sl_rowIdx==i,]$leftlabels=vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$sl_rowIdx==i,]$leftlabels
vowlaxn.fdat[vowlaxn.fdat$sl_rowIdx==i,]$rightlabels=vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$sl_rowIdx==i,]$rightlabels
vowlaxn.fdat[vowlaxn.fdat$sl_rowIdx==i,]$spkr=vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$sl_rowIdx==i,]$spkr
vowlaxn.fdat[vowlaxn.fdat$sl_rowIdx==i,]$word=vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$sl_rowIdx==i,]$word
}
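The idea behind normalize_length(), namely resampling each track onto a common normalized time axis, can be sketched with stats::approx() on a toy track. The F2 samples below are hypothetical values for illustration only; the emuR function performs this resampling for every segment of a trackdata object and takes care of the associated bookkeeping:

```r
# Linear time normalization of a single hypothetical F2 track to 21 samples
track <- c(2100, 2000, 1900, 1850, 1830, 1825, 1840)
norm <- approx(x = seq(0, 1, length.out = length(track)), y = track,
               xout = seq(0, 1, length.out = 21))
length(norm$y)   # 21 samples
norm$x[11]       # normalized time 0.5, the temporal midpoint
```

With every track resampled to the same 21 normalized time points, samples from different-length vowels can be compared at the same relative time, which is what makes the midpoint extraction at times_norm == 0.5 above possible.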
The dataset includes two speakers, one male (speaker 67) and one female (speaker 68), and the vowels I, E, O, a (corresponding to [ɪ, ɛ, ɔ, a]) that were extracted from the same 100 sentences read by each of these speakers from the kielread
database. In the following, a logical vector is used to extract the data from the above for the male speaker:
# use logical vectors - TRUE for speaker 67
m.fdat = vowlaxn.fdat[substr(vowlaxn.fdat$bundle,2,3)=="67",] # formant data
m.fdat_norm.5 = vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr=="67",]
In plotting vowels in the F1 x F2
plane, a decision has to be made about the time point from which the data are to be extracted. Usually, the extraction point should be at or near the vowel target, which can be considered to be the point in the vowel at which the formants are least influenced by context and/or where the formants change minimally in time (Chapter 3, Fig. 3.2). Some issues to do with the vowel target are discussed in 6.3. For the present, the target is taken to be at the temporal midpoint of the vowel, on the assumption that this is usually the time point nearest to which the target occurs (Fig. 6.1):
#pre-calculate centroids (we will need these to plot the means of the distributions for F1 and F2):
centr = aggregate(cbind(T1,T2)~labels,data=m.fdat_norm.5,FUN=mean)
library(ggplot2)
p=ggplot(m.fdat_norm.5) +
aes(y = T1, x = T2, linetype = labels, label=labels) +
stat_ellipse(type="norm") +
geom_text(data = centr) +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2(Hz)", y = "F1(Hz)") +
theme(legend.position="none")
p
Fig. 6.1: 95% ellipse contours for F1 x F2 data extracted from the temporal midpoint of four German lax vowels produced by one speaker.
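The 95% ellipse contours in Fig. 6.1 correspond, under a bivariate normal model, to a radius of just under 2.45 standard deviations: the squared radius of a contour enclosing 95% of the probability mass is the 0.95 quantile of a chi-squared distribution with two degrees of freedom. This can be checked directly (a small verification, not part of the plotting code above):

```r
# Radius, in standard deviations, of the bivariate normal contour
# that encloses 95% of the probability mass
r <- sqrt(qchisq(0.95, df = 2))
r
## [1] 2.447747
```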
The means of the distributions using the corresponding character label are calculated by using the function aggregate()
(which applies the function mean()
); and the scale_[y|x]_reverse()
reverses the axes so that the x-axis has decreasing F2 and the y-axis decreasing F1, as a result of which the vowels are positioned analogously to the phonetic backness and height axes of the vowel quadrilateral. As discussed in more detail in connection with probabilistic classification in Chapter 9, an ellipse is a contour of equal probability. In the default implementation of the stat_ellipse(type="norm")
function, each ellipse includes about 95% of the data points, corresponding to just under 2.45 ellipse standard deviations. Researchers who do not often look at data from continuous speech may find the extent of overlap quite alarming, because in laboratory speech of isolated words, the ellipses of one speaker are usually quite well separated. The overlap arises, in part, because vowel targets in continuous speech are affected by different contexts and prosodic factors. It might be helpful then to look at [ɪ] in further detail according to the left context (Fig. 6.2):
temp = m.fdat_norm.5$labels=="I"
library(gridExtra)
p2 = ggplot(m.fdat_norm.5[temp,]) +
aes(y = T1, x = T2, linetype = labels, label=leftlabels) +
stat_ellipse(type="norm") +
geom_text() +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2(Hz)", y = "F1(Hz)") +
theme(legend.position="none")
grid.arrange(p2,ncol=2)
There is no immediately obvious pattern to the data in Fig. 6.2, nor can one reasonably be expected, given that the plot does not take account of some other variables, especially the right context. Nevertheless, when the preceding context is alveolar, [ɪ] does seem mostly to be positioned in the top left of the display with low F1 and high F2. There are also a number of Q labels with a high F2: these denote vowels that are preceded by a glottal stop, i.e., syllable- or word-initial [ʔɪ] (vowels in a domain-initial position in German are usually glottalised).
The technique of kmeans clustering can be applied to the data to give an indication of whether the variability is affected by different types of context. This technique partitions the data into k different clusters in such a way that the distance from the data points to the centroids (means) of the derived clusters of which they are members is minimised. An example of how this algorithm works is shown for 10 data points (those in bridge[1:10,1:2]
) which are divided into two clusters. Initially, a guess is made of two means, shown by X1 and Y1 in the left panel of Fig. 6.3. Then the straight-line (Euclidean) distance is calculated from each point to each of these two means (i.e., two distance calculations per point) and each point is classified depending on which of these two distances is shortest. The results of this initial classification are shown in the central panel of the same figure: thus the four values at the bottom of the central panel are labelled x because their distance to X1 is less than to Y1. Then the centroid (mean value on both dimensions) is calculated separately for the points labelled x and those labelled y: these are shown as X2 and Y2 in the middle panel. The same step as before is repeated, in which two distances are calculated from each point to X2 and Y2, and the points are then reclassified depending on which of the two distances is the shortest. The results of this reclassification (right panel, Fig. 6.3) show that two additional points are labelled x because these are nearer to X2 than to Y2. The means of these new classes are X3 and Y3 and, since there is no further shift in the derived means by making the same calculations again, these are the final means and final classifications of the points. They are also the ones that are given by kmeans(bridge[1:10,1:2], 2). When kmeans clustering is applied to the [ɪ] data shown in the left panel of Fig. 6.2, the result is a split of the data into two classes, as the right panel of the same figure shows. This figure was produced with the following commands:
temp = m.fdat_norm.5$labels=="I"
k = kmeans(m.fdat_norm.5[temp,c("T1","T2")], 2)
m.fdat_norm.5$cluster=0
m.fdat_norm.5$cluster[temp] = k$cluster
pkm = ggplot(m.fdat_norm.5[temp,]) +
aes(y = T1, x = T2, linetype = labels, label=cluster) +
stat_ellipse(type="norm") +
geom_text() +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2(Hz)", y = "F1(Hz)") +
theme(legend.position="none")
grid.arrange(p2,pkm,ncol=2)
Fig. 6.2. Left. The [ɪ] ellipse from Fig. 6.1 with the left-context labels superimposed on the data points. Right: the same data partitioned into two clusters using kmeans-clustering.
As is apparent from the right panel of Fig. 6.2, the algorithm has split up the data according to whether F2 is less, or greater, than roughly 1800 Hz. We can see whether this also partitions the data along the lines of the left context as follows:
temp = m.fdat_norm.5$labels=="I"
# Left context preceding [ɪ]
m.left.I = m.fdat_norm.5$leftlabels[temp]
# Left context preceding [ɪ] in cluster 1 (the circles in Fig. 6.2, right panel).
table(m.left.I[k$cluster==1])
##
## b d k l m n Q r s t z
## 3 6 3 10 1 10 10 6 2 4 1
# Left context preceding [ɪ] in cluster 2
table(m.left.I[k$cluster!=1])
##
## b f g l m n Q r s v
## 4 3 1 3 2 1 2 9 1 3
Fig. 6.3. An illustration of the steps in kmeans clustering for 10 points in two dimensions. X1 and Y1 in the left panel are the initial guesses of the means of the two classes. x and y in the middle panel are the same points classified based on whichever Euclidean distance to X1 and Y1 is least. In the middle panel, X2 and Y2 are the means (centroids) of the points classified as x and y. In the right panel, x and y are derived by re-classifying the points based on the shortest Euclidean distance to X2 and Y2. The means of these two classes in the right panel are X3 and Y3.
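The iterate-until-the-means-stop-moving procedure illustrated in Fig. 6.3 can be sketched in a few lines of R. The eight points and the two initial guesses below are hypothetical toy values (not the data used for Fig. 6.3), chosen so that the algorithm converges quickly:

```r
# Toy data: two well-separated clouds of four points each
pts <- rbind(c(1,1), c(1,2), c(2,1), c(2,2),
             c(8,8), c(8,9), c(9,8), c(9,9))
means <- rbind(c(0,0), c(10,10))   # initial guesses, like X1 and Y1 in Fig. 6.3
repeat {
  # Euclidean distance of every point to each of the two current means
  d <- cbind(sqrt(rowSums((pts - matrix(means[1,], 8, 2, byrow=TRUE))^2)),
             sqrt(rowSums((pts - matrix(means[2,], 8, 2, byrow=TRUE))^2)))
  cl <- apply(d, 1, which.min)          # classify by the shorter distance
  newmeans <- rbind(colMeans(pts[cl==1,,drop=FALSE]),
                    colMeans(pts[cl==2,,drop=FALSE]))
  if (all(newmeans == means)) break     # converged: the means no longer shift
  means <- newmeans
}
means   # final means: (1.5, 1.5) and (8.5, 8.5)
```

The same final means are returned by kmeans(pts, centers = rbind(c(0,0), c(10,10))), since kmeans() started from the same initial guesses performs this iteration internally.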
So, as these results and the right panel of Fig. 6.2 show, cluster 1 (the circles in Fig. 6.2) includes all of [d, t], 10/11 of [n] and 10/12 of [ʔ] (“Q”). Cluster 2 tends to include more contexts like [ʁ] (“r”) and labials (there are more [b, f, m, v] in cluster 2), which, for reasons to do with their low F2 locus, are likely to have a lowering effect on F2. Thus the left context clearly has an effect on F2 at the temporal midpoint of [ɪ]. Just how much of an effect can be seen by plotting the entire F2 trajectory between the vowel onset and the vowel midpoint for two left contexts that fall predominantly in cluster 1 and cluster 2 respectively. Here is such a plot comparing the left context [ʔ] on the one hand with the labiodentals [f, v] together on the other (Fig. 6.4):
# Logical vector that is true when the left context of [ɪ] is one of [ʔ, f, v]
temp = m.fdat$labels == "I" & m.fdat$leftlabels %in% c("Q", "f", "v")
# The next three lines relabel "f" and "v" to a single category "LAB"
m.fdat$labial = "something"
m.fdat$labial[temp] = m.fdat$leftlabels[temp]
m.fdat$labial[m.fdat$labial %in% c("f", "v")] = "LAB"
ggplot(m.fdat[temp,]) +
aes(x=times_rel,y=T2,col=labial,group=sl_rowIdx) +
geom_line() +
labs(x = "vowel duration (ms)", y = "F2 (Hz)")
Fig. 6.4: F2 trajectories for [ʔɪ] (blue) and for [fɪ] and [vɪ] together (red) synchronised at the temporal onset.
Apart from two [ʔɪ] trajectories, there is a separation in F2 throughout the vowel depending on whether the left context is [ʔ] or a labiodental fricative. But before these very clear F2 differences are attributed just to left context, the word label (and hence the right context) should also be checked. For example, for [ʔ]:
table(vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$leftlabels=="Q" & vowlaxn.fdat_norm.5$labels=="I",]$word)
##
## ich Ich In Inge Iss isst ist
## 4 4 2 4 2 2 6
So the words that begin with [ʔɪ] almost all have a right context which is likely to contribute to the high F2, i.e., [ç] in [ɪç] (I) or [s] in [ɪs], [ɪst] (eat, is): that is, the high F2 in [ʔɪ] is unlikely to be due to the left context alone.
When a formant tracker is run over speech data in the manner described in Chapter 3, there will inevitably be formant tracking errors, especially in large speech corpora. Such errors are particularly common at the boundaries between voiceless and voiced segments and whenever two formants are close together in frequency, such as F1 and F2 in back vowels. If such errors occur, then it is likely that they will show up as outliers in ellipse plots of the kind examined so far. If the outliers are far from the ellipse’s centre, then they can have quite a dramatic effect on the ellipse orientation.
Fig. 6.5 shows two outliers from the [ɪ] vowels of the male speaker for the data extracted at the onset of the vowel:
# Speaker 67's [ɪ] vowels
temp = vowlaxn.fdat$spkr=="67" & vowlaxn.fdat$labels=="I" & vowlaxn.fdat$times_rel==0
m.Ion = vowlaxn.fdat[temp,]
# Ellipse plot with outliers
p65left <- ggplot(m.Ion) +
aes(y = T1, x = T2, label=labels) +
stat_ellipse(type="norm") +
geom_text() +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2(Hz)", y = "F1(Hz)") +
theme(legend.position="none") +
ylim(c(500,150)) +
xlim(c(2500,0))
As the left panel of Fig. 6.5 shows, there are two (distinct) outliers: one of these is where F2 = 0 Hz at the right of the plot and is almost certainly due to a formant tracking error. The other outlier has a very low F1, but it is not possible to conclude without looking at the spectrogram whether this is a formant error or the result of a context effect. The first of these outliers can, and should, be removed by identifying all values that have F2 less than a certain value, say 50 Hz. This is done with the logical expression m.Ion$T2 > 50 in the first line of the following code, which keeps only the tokens with F2 greater than 50 Hz and produces the right panel of Fig. 6.5 without the outlier:
p65right <- ggplot(m.Ion[m.Ion$T2>50,]) +
aes(y = T1, x = T2, label=labels) +
stat_ellipse(type="norm") +
geom_text() +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2(Hz)", y = "F1(Hz)") +
theme(legend.position="none") +
ylim(c(500,150)) +
xlim(c(2500,0))
grid.arrange(p65left,p65right,ncol=2)
Fig. 6.5: 95% ellipse contours for [ɪ] vowels of speaker 67 at segment onset in the F1 x F2 plane before (left) and after (right) removing the outlier at F2 = 0 Hz.
Because the outlier was a long way from the centre of the distribution, its removal has shrunk the ellipse size and also changed its orientation slightly.
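Why a single distant outlier has such an effect can be illustrated with toy numbers (hypothetical F1 and F2 values, not taken from the kielread data): five points lying on a falling line correlate perfectly negatively, but adding one tracking error at F2 = 0 Hz is enough to flip the sign of the correlation, and with it the orientation of a fitted ellipse:

```r
# Five hypothetical (F2, F1) points with a perfect negative correlation
F2 <- c(2200, 2100, 2000, 1900, 1800)
F1 <- c(300, 320, 340, 360, 380)
cor(F2, F1)                    # -1: a fitted ellipse is negatively oriented
# One tracking error at F2 = 0 Hz reverses the sign of the correlation
cor(c(F2, 0), c(F1, 287))      # positive
```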
It is a nuisance to have to remove outliers with logical vectors every time, so a better solution than the one in Fig. 6.5 is to locate the utterance in which the outlier occurs and to redraw the formant track by hand in the database from which these data were extracted (following the procedure in 3.1). The utterance in which the outlier occurs, as well as its time stamp, can be found by combining a logical vector with the emuRtrackdata
object:
temp = vowlaxn.fdat$spkr=="67" & vowlaxn.fdat$labels=="I" & vowlaxn.fdat$times_rel==0 & vowlaxn.fdat$T2 < 50
vowlaxn.fdat[temp,]
## sl_rowIdx labels start end session level type times_orig
## 2662 194 I 911.5312 964.5938 0000 Kanonic ITEM 912.5
## times_rel times_norm T1 T2 T3 T4
## 2662 0 0 287 0 2189 3232
##
## NOTE: to reduce the verboseness of the output not all colums of an emuRtrackdata object are printed. Use print.data.frame() to print all columns.
serve(kielread,seglist = vowlaxn.fdat[temp,])
That is, the outlier occurs somewhere between 911 ms and 964 ms in the utterance K67MR096
of the kielread
database. The corresponding spectrogram shows that the outlier is very clearly a formant tracking error that can be manually corrected as in Fig. 6.6.
Fig. 6.6. Spectrogram (0 - 3000 Hz) of part of the utterance K67MR096 showing an [ɪ] vowel with superimposed F1-F3 tracks; the miscalculated F2 values are redrawn on the right. The redrawn values can be saved – which has the effect of permanently overwriting the signal file data containing the formant track for that utterance. See Chapter 3 for further details.
Manually correcting outliers when they are obviously due to formant tracking errors as in Fig. 6.6 is necessary. But this method should definitely be used sparingly: the more manual intervention that is used, the greater the risk that the researcher might unwittingly bias the experimental data.
(THIS SUB-CHAPTER IS STILL UNDER CONSTRUCTION)
As discussed earlier, the vowel target can be considered to be the section of the vowel that is least influenced by consonantal context and most similar to a citation-form production of the same vowel. It is also sometimes defined as the most steady-state part of the vowel: that is, the section of the vowel during which the formants (and hence the phonetic quality) change minimally (see e.g., Broad & Wakita, 1977; Schouten & Pols, 1979). It is however not always the case that the vowel target is at the temporal midpoint. For example, in most accents of Australian English, in particular the broad variety, the targets of the long high vowels in heed and who’d occur late in the vowel and are preceded by a long onglide (Cox & Palethorpe, 2007). Apart from factors such as these, the vowel target could shift proportionally because of coarticulation. For example, Harrington, Fletcher and Roberts (1995) present articulatory data showing that in prosodically unaccented vowels (that is those produced without sentence stress), the final consonant is timed to occur earlier in the vowel than in prosodically accented vowels (with sentence stress). If the difference between the production of accented and unaccented vowels has an influence on the final transition in the vowel, then the effect would be that the vowel target occurs proportionally somewhat later in unaccented than in accented vowels. For reasons such as these, the vowel target cannot always be assumed to be at the temporal midpoint. At the same time, some studies have found that different strategies for locating the vowel target have not made a great deal of difference to the analysis of vowel formant data (see e.g., van Son & Pols, 1990 who compared three different methods for vowel target identification in Dutch).
One method that is sometimes used for vowel target identification is to find the time point at which the F1 is at a maximum. This is based on the idea that vowels reach their targets when the oral tract is maximally open which often coincides with an F1-maximum, at least in non-high vowels (Lindblom & Sundberg, 1971).
# only /a/, only between normalized times 0.25-0.75
a.fdat = vowlaxn.fdat[vowlaxn.fdat$labels=="a" & vowlaxn.fdat$times_norm >= 0.25 & vowlaxn.fdat$times_norm <= 0.75,]
#find the maximal F1 value
a.max = aggregate(T1~sl_rowIdx,FUN=max,data=a.fdat)
#add $times_norm to a.max
a.max$times_norm=0
for (i in 1:nrow(a.max)){
sl = a.max[i,]$sl_rowIdx
T1max = a.max[i,]$T1
a.max[i,]$times_norm=a.fdat[a.fdat$sl_rowIdx==sl & a.fdat$T1==T1max,][1,]$times_norm
}
#add $times_moved to a.fdat
a.fdat$times_moved=0
for (i in 1:nrow(a.max)){
sl = a.max[i,]$sl_rowIdx
time_max = a.max[i,]$times_norm
a.fdat[a.fdat$sl_rowIdx==sl,]$times_moved = a.fdat[a.fdat$sl_rowIdx==sl,]$times_norm - time_max
}
# plot
p1=ggplot(a.fdat) +
aes(x=times_moved,y=T1,col=labels,group=sl_rowIdx) +
geom_line() +
geom_vline(xintercept = 0) +
labs(x = "normalized time (centred at the F1 maximum)", y = "F1 (Hz)")
p1
Fig. 6.7: F1 of [a] for a male speaker of standard German synchronised at time t = 0, the time of the F1 maximum.
Much acoustic variation comes about because speakers have differently sized and shaped vocal organs. This type of variation was demonstrated spectrographically in Peterson & Barney’s (1952) classic study of vowels produced by men, women, and children; the extensive overlap between the formants of children and men was further investigated in Peterson (1961) and also in Ladefoged (1967) and Pols et al. (1973) (see also Adank et al., 2004 for a more recent investigation). The differences in the distribution of the vowels for the male and the female speaker can be examined once again with ellipse plots. These are shown together with the data points in Fig. 6.9 and they were created with the following commands:
ggplot(vowlaxn.fdat_norm.5) +
facet_grid(~spkr) +
aes(y = T1, x = T2, color = labels, label=labels) +
stat_ellipse(type="norm") +
geom_text() +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2(Hz)", y = "F1(Hz)") +
theme(legend.position="none")
Fig. 6.9: German lax monophthongs produced by a male (left) and a female (right) speaker in the F2 x F1 plane. Data extracted at the temporal midpoint of the vowel.
The differences are quite substantial, especially considering that these are speakers of the same standard North German variety producing the same read sentences! The figure shows, in general, how the formants of the female speaker are higher in frequency than those of the male, which is to be expected because female vocal tracts are on average shorter. But as argued in Fant (1966), because the ratio of the mouth cavity to the pharyngeal cavity lengths differs between male and female speakers, the changes to the vowels due to gender are non-uniform: this means that the male-female differences are greater in some vowels than in others. Fig. 6.9 shows that, whereas the differences between the speakers in [ɪ] are not that substantial, those in [a] on F1 and F2 are quite considerable. Also, the F2 differences are much more marked than those in F1, as a comparison between the two speakers in the relative positions of [ɪ, ɛ] shows. Finally, the differences need not be entirely the result of anatomical and physiological differences between the speakers. Some differences may be a result of speaking style: indeed, for the female speaker there is a greater separation between [ɪ, ɛ] on the one hand and [ɔ, a] on the other than for the male speaker, and this may be because there is greater vowel hyperarticulation for this speaker – an issue that will be taken up in more detail in the analysis of Euclidean distances in 6.5. An overview of the main male-female vowel differences can be obtained by plotting a polygon that connects the means (centroids) of each vowel category for the separate speakers on the same axes. The first step is to get the speaker means by using the aggregate()
function. So the F1 and F2 category means for speaker 67 are:
centr = aggregate(cbind(T1,T2)~labels+spkr,FUN=mean,data=vowlaxn.fdat_norm.5)
centr[centr$spkr==67,]
## labels spkr T1 T2
## 1 a 67 636.7698 1346.444
## 2 E 67 523.2927 1640.451
## 3 I 67 368.2294 1779.112
## 4 O 67 548.4688 1126.000
We can now use the geom_polygon(fill="transparent")
alongside geom_text()
to plot the following:
ggplot(centr) +
aes(y = T1, x = T2, color = spkr, label=labels) +
geom_polygon(fill="transparent") +
geom_text() +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2(Hz)", y = "F1(Hz)")
Fig. 6.10: Mean F2 x F1 values for the male (‘67’) and female (‘68’) data from Fig. 6.9.
Strategies for vowel normalization are designed to reduce the extent of these divergences due to the speaker and they fall into two categories: speaker-dependent and speaker-independent. In the first of these, normalization can only be carried out using statistical data from the speaker beyond the vowel that is to be normalized (for this reason, speaker-dependent strategies are also called extrinsic, because information for normalization is extrinsic to the vowel that is to be normalized). In a speaker-independent strategy by contrast, all the information needed for normalising a vowel is within the vowel itself, i.e., intrinsic to the vowel.
The idea that normalization might be extrinsic can be traced back to Joos (1948) who suggested that listeners judge the phonetic quality of a vowel in relation to a speaker’s point vowels [i, a, u]. Some evidence in favour of extrinsic normalization is provided in Ladefoged & Broadbent (1957) who found that the listeners’ perceptions of the vowel in the same test word shifted when the formant frequencies in a preceding carrier phrase were manipulated. On the other hand, there were also various perception experiments in the 1970s and 1980s showing that listeners’ identifications of a speaker’s vowels were not substantially improved if they were initially exposed to the same speaker’s point vowels (Assmann et al, 1982; Verbrugge et al., 1976).
Whatever the arguments from studies of speech perception for or against extrinsic normalization (Johnson, 2005), there is evidence to show that when extrinsic normalization is applied to acoustic vowel data, the differences due to speakers can often be quite substantially reduced (see e.g. Disner, 1980 for an evaluation of some extrinsic vowel normalization procedures). A very basic and effective extrinsic normalization technique is to transform the data to z-scores by subtracting out the speaker’s mean and dividing by the speaker’s standard deviation for each parameter separately. The technique was first used by Lobanov (1971) for vowels and is therefore sometimes called Lobanov-normalization. The transformation has the effect of centering each speaker’s vowel space at coordinates of zero (the mean); the axes are then the number of standard deviations away from the speaker’s mean. So for a vector of values:
vec = c(-4, -9, -4, 7, 5, -7, 0, 3, 2, -3)
their Lobanov-normalized equivalents are:
(vec - mean(vec)) / sd(vec)
## [1] -0.5715006 -1.5240015 -0.5715006 1.5240015 1.1430011 -1.1430011
## [7] 0.1905002 0.7620008 0.5715006 -0.3810004
A function that will carry out Lobanov-normalization when applied to a vector is as follows:
lob <- function(x)
{
# transform x to z-scores (Lobanov normalization); x is a vector
(x - mean(x))/sd(x)
}
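By construction, the output of lob() always has a mean of 0 and a standard deviation of 1, which provides a quick sanity check that the function does what it should (using the example vector from above):

```r
lob <- function(x) (x - mean(x)) / sd(x)   # as defined above
vec <- c(-4, -9, -4, 7, 5, -7, 0, 3, 2, -3)
round(mean(lob(vec)), 10)   # 0
sd(lob(vec))                # 1 (up to floating-point precision)
```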
The function has to be applied by speaker. The Lobanov-normalized F1 and F2 for the male speaker are e.g.:
vowlaxn.fdat_norm.5$T1_lobnorm=vowlaxn.fdat_norm.5$T2_lobnorm=0
vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr=="67",]$T1_lobnorm = lob(vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr=="67",]$T1)
vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr=="67",]$T2_lobnorm = lob(vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr=="67",]$T2)
In order to check the result of this normalization, we plot:
p1n=ggplot(vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr=="67",]) +
aes(y = T1_lobnorm, x = T2_lobnorm, color = labels, label=labels) +
stat_ellipse(type="norm") +
geom_text() +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2 (z-scores)", y = "F1 (z-scores)") +
theme(legend.position="none") +
ylim(c(3,-3)) +
xlim(c(3,-3))
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
So the centre of this vowel space is now at (0, 0).
However, we want to normalize the data not just of one speaker, but of all speakers. Again, we can achieve this by using a for
-loop:
vowlaxn.fdat_norm.5$T1_lobnorm=0
vowlaxn.fdat_norm.5$T2_lobnorm=0
for (i in unique(vowlaxn.fdat_norm.5$spkr)){
vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr==i,]$T1_lobnorm = lob(vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr==i,]$T1)
vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr==i,]$T2_lobnorm = lob(vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr==i,]$T2)
}
We can now redo Fig. 6.9, but with the normalized formant values.
ggplot(vowlaxn.fdat_norm.5) +
aes(y = T1_lobnorm, x = T2_lobnorm, color = labels, label=labels) +
facet_grid(~spkr) +
stat_ellipse(type="norm") +
geom_text() +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2 (z-scores)", y = "F1 (z-scores)") +
theme(legend.position="none")
Fig. 6.11: Lobanov-normalised F1 and F2 of the data in Fig. 6.9. The axes are numbers of standard deviations from the mean.
The point [0,0] is the mean or centroid across all the data points per speaker and the axes are the numbers of standard deviations away from [0,0]. Compared with the raw data in Fig. 6.9, it is clear that there is a much closer alignment between the vowel categories of the male and female speakers in these normalized data. For larger studies, the mean and standard deviation should be based not just on a handful of lax vowels, but on a much wider selection of vowel categories.
Another extrinsic normalization technique is due to Nearey (see e.g., Assmann et al., 1982; Nearey, 1989). The version demonstrated here is the one in which normalization is accomplished by subtracting a speaker-dependent constant from the logarithm of the formants. This speaker-dependent constant is obtained by working out (a) the mean of the logarithm of F1 (across all tokens for a given speaker) and (b) the mean of the logarithm of F2 (across the same tokens) and then averaging (a) and (b).
This speaker-dependent constant must now be subtracted from the logarithm of the raw formant values separately for each speaker. This can be done with a single-line function:
nearey <- function(x)
{
# Function for extrinsic normalization according to Nearey
# x is a two-columned matrix
log(x) - mean(apply(log(x), 2, mean))
}
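To see that nearey() really does subtract a single speaker-dependent constant, we can check on some made-up formant values (invented purely for illustration) that log(x) minus the normalized values is the same number in every cell; the definition of nearey() is repeated here so that the snippet is self-contained:

```r
nearey <- function(x)
{
# Function for extrinsic normalization according to Nearey
# x is a two-columned matrix of F1 and F2 values
log(x) - mean(apply(log(x), 2, mean))
}

# Made-up F1 and F2 values (Hz) for one hypothetical speaker
x = cbind(T1 = c(300, 500, 700), T2 = c(1000, 1500, 2200))
const = log(x) - nearey(x)
# const holds the same speaker-dependent constant in every cell
```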
Thus nearey(vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr=="67",c("T1","T2")])
gives the Nearey-normalized formant (F1 and F2) data for the male speaker. The same methodology as for Lobanov-normalization above can be used to obtain Nearey-normalized data that is parallel to all the other objects – thereby allowing ellipse and polygon plots to be drawn in the F2 x F1 plane in the manner described earlier.
vowlaxn.fdat_norm.5$T1_neareynorm=0
vowlaxn.fdat_norm.5$T2_neareynorm=0
for (i in unique(vowlaxn.fdat_norm.5$spkr)){
vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr==i,]$T1_neareynorm = nearey(vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr==i,c("T1","T2")])$T1
vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr==i,]$T2_neareynorm = nearey(vowlaxn.fdat_norm.5[vowlaxn.fdat_norm.5$spkr==i,c("T1","T2")])$T2
}
ggplot(vowlaxn.fdat_norm.5) +
aes(y = T1_neareynorm, x = T2_neareynorm, color = labels, label=labels) +
facet_grid(~spkr) +
stat_ellipse(type="norm") +
geom_text() +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2 (normalized)", y = "F1 (normalized)") +
theme(legend.position="none")
Fig. 6.12: Nearey-normalised F1 and F2 of the data in Fig. 6.9.
Lobanov- and Nearey-normalization are, then, two examples of speaker-dependent strategies that require data beyond the vowel that is to be normalized. In speaker-independent strategies all the information for normalization is supposed to be in the vowel itself. Earlier speaker-independent strategies made use of formant ratios (Peterson, 1961; Potter & Steinberg, 1950; see also Miller, 1989) and they often apply to the acoustic data some aspects of the auditory transformations that are known to take place in the ear (Bladon et al, 1984). These types of speaker-independent auditory transformations are based on the idea that two equivalent vowels, even if they are produced by different speakers, result in a similar pattern of motion along the basilar membrane, even if the actual position of the pattern varies (Potter & Steinberg, 1950; see also Chiba & Kajiyama, 1941). Since there is a direct correspondence between basilar membrane motion and a sound’s frequency on a scale known as the Bark scale, a transformation to an auditory scale like the Bark scale (or ERB scale: see e.g., Glasberg & Moore, 1990) is usually the starting point for speaker-independent normalization. Independently of these normalization issues, many researchers transform formant values from Hertz to Bark before applying any further analysis, on the grounds that an analogous translation is presumed to be carried out in the ear.
There is a function bark(x)
in the emuR library to carry out such a transformation, where x is a vector, matrix, trackdata object or emuRtrackdata column of Hertz values. (The same function with the inv=T argument converts Bark back into Hertz). The formulae for these transformations are given in Traunmüller (1990) and they are based on analyses by Zwicker (1961).
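The Traunmüller (1990) conversion can be sketched as two plain R functions; hz2bark() and bark2hz() are hypothetical names, and emuR's bark() may additionally apply Traunmüller's corrections at the extremes of the scale, so values can differ slightly there:

```r
# Bark transformation after Traunmüller (1990), without the corrections
# applied at the very low and very high ends of the scale
hz2bark <- function(f) 26.81 * f / (1960 + f) - 0.53

# The inverse transformation from Bark back to Hz
bark2hz <- function(z) 1960 * (z + 0.53) / (26.28 - z)

# e.g. hz2bark(1000) is approximately 8.53 Bark
```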
A graph of the relationship between the two scales (up to 10 kHz) with horizontal lines at intervals of 1 Bark superimposed on the Hz axis can be drawn as follows (Fig. 6.13):
plot(0:10000, bark(0:10000), type="l", xlab="Frequency (Hz)", ylab="Frequency (Bark)")
abline(v=bark(1:22, inv=T), lty=2)
Fig. 6.13: Relationship between the Hertz and Bark scales. The vertical lines mark interval widths of 1 Bark.
Fig. 6.13 shows that the interval corresponding to 1 Bark becomes progressively wider for values higher up the Hertz scale (the Bark scale, like the Mel scale (Fant, 1968), is roughly linear up to 1 kHz and quasi-logarithmic thereafter). Ellipse plots analogous to those in Fig. 6.9 can be created by converting the formant values into Bark.
ggplot(vowlaxn.fdat_norm.5) +
facet_grid(~spkr) +
aes(y = bark(T1), x = bark(T2), color = labels, label=labels) +
stat_ellipse(type="norm") +
geom_text() +
scale_y_reverse() + scale_x_reverse() +
labs(x = "F2(Bark)", y = "F1(Bark)") +
theme(legend.position="none")
Fig. 6.14: German lax monophthongs produced by a male (left) and a female (right) speaker in the F2 x F1 plane. Data extracted at the temporal midpoint of the vowel and converted to the Bark scale.
A detailed exploration of Bark-scaled, vowel formant data is given in Syrdal & Gopal (1986).
Note that there is a more convenient way to calculate Lobanov- or Nearey-normalized values, using the function norm():
lobnormed = norm(data=cbind(vowlaxn.fdat_norm.5$T1,vowlaxn.fdat_norm.5$T2),speakerlabs=vowlaxn.fdat_norm.5$spkr,type="lob")
vowlaxn.fdat_norm.5$T1_lobnorm2 = lobnormed[,1]
vowlaxn.fdat_norm.5$T2_lobnorm2 = lobnormed[,2]
# Test whether these differ from the values calculated above
any(vowlaxn.fdat_norm.5$T1_lobnorm2 != vowlaxn.fdat_norm.5$T1_lobnorm)
## [1] FALSE
any(vowlaxn.fdat_norm.5$T2_lobnorm2 != vowlaxn.fdat_norm.5$T2_lobnorm)
## [1] FALSE
In order to calculate a Nearey-type normalization, use type='nearey' instead of type='lob' (or simply omit the type parameter, as type defaults to the Nearey method).
The calculation of the Euclidean distance to the centre of the vowel space discussed in 6.5.1 is one of the possible methods for measuring vowel undershoot, a term first used by Lindblom (1963) to refer to the way in which vowels fail to reach their targets due to contextual influences such as the flanking consonants and stress. But in such calculations, the extent of vowel undershoot (or expansion) is being measured only at a single time point. The technique to be discussed in this section is based on a parameterisation of the entire formant trajectory. These parameterisations involve reducing an entire formant trajectory to a set of coefficients – this can be thought of as the analysis mode. A by-product of this reduction is that if the formant trajectories are reconstructed from the coefficients that were obtained in the analysis mode, then a smoothed formant contour can be derived – this is the synthesis mode and it is discussed more fully at the end of this section. The type of coefficients to be considered are due to van Bergem (1993) and involve fitting a parabola, that is, an equation of the form \[F = c0 + c1*t + c2*t^2\], where \(F\) is a formant from the start to the end of a vowel that changes as a function of time \(t\). As the equation shows, there are three coefficients \(c0\), \(c1\), and \(c2\) that have to be calculated for each vowel separately from the formant’s trajectory. The shape of a parabola is necessarily curved in an arc, either U-shaped if \(c2\) is positive, or ∩-shaped if \(c2\) is negative. The principle that lies behind fitting such an equation is as follows. The shape of a formant trajectory over the extent of a vowel, and in particular that of F2, is influenced both by the vowel and by the immediately preceding and following sounds: that is, by the left and right contexts.
At the vowel target, which for most monophthongs is nearest the vowel’s temporal midpoint, the shape is predominantly determined by the phonetic quality of the vowel, but it is reasonable to assume that the influence from the context increases progressively nearer the vowel onset and offset (e.g. Broad & Fertig, 1970). Consider for example the case in which the vowel has no target at all. Just this hypothesis has been suggested for schwa vowels by Browman & Goldstein (1992b) in an articulatory analysis and by van Bergem (1994) using acoustic data. In such a situation, a formant trajectory might approximately follow a straight line between its values at the vowel onset and vowel offset: this would happen if the vowel target has no influence so that the trajectory’s shape is entirely determined by the left and right contexts. On the other hand, if a vowel has a prominent target, as is often the case when it is emphasised or prosodically accented (Pierrehumbert & Talkin, 1990), then it is likely to deviate considerably from a straight line joining its onset and offset. Since a formant trajectory often follows a parabolic trajectory reasonably well (Lindblom, 1963), one way to measure the extent of deviation from the straight line, and hence to estimate how much the formant is undershot, is to fit a parabola to the formant and then measure the parabola’s curvature. If the formant is heavily undershot and follows more or less a straight line path between its endpoints, then the curvature will be almost zero; while the more prominent the target, the greater the deviation from the straight line, and the greater the magnitude of the curvature, in either a positive or a negative direction. The way that the parabola is fitted in van Bergem (1993) is essentially to rescale the time axis of a trajectory linearly between t = -1 and t = 1. This rescaled time-axis can be obtained using the seq() function, if the length of the trajectory in data points is known.
As an example, the length of the F2-trajectory for the first segment in the lax vowel data is given by:
N = length(vowlaxn.fdat[vowlaxn.fdat$sl_rowIdx==1,]$T2)
N
## [1] 18
So the linearly rescaled time axis between t = ±1 is given by:
times = seq(-1, 1, length=N)
Since a precise estimate of the formant will need to be made at time \(t = 0\), the number of data points that supports the trajectory could be increased using linear interpolation with the approx()
function. The shape of the trajectory stays exactly the same, but the interval along the time axis becomes more fine grained (this procedure is sometimes known as linear time normalization). For example, the F2-trajectory for the first segment in the lax vowel dataset could be given 101 rather than 18 points as in Fig. 6.17 which was created as follows (an odd number of points is chosen here, because this makes sure that there will be a value at \(t = 0\)):
N = 101
F2int = approx(vowlaxn.fdat[vowlaxn.fdat$sl_rowIdx==1,]$T2, n=N)
times = seq(-1, 1, length=N)
par(mfrow=c(1,2));
plot(times, F2int$y, type="b", xlab="Normalized time", ylab="F2 (Hz)")
# The original F2 for this segment
plot(T2~times_orig, data = vowlaxn.fdat[vowlaxn.fdat$sl_rowIdx==1,], type="b", xlab="Time (ms)", ylab="")
Fig. 6.17: An F2 trajectory (right) and its linearly time normalised equivalent using 101 data points between t = ± 1.
There are three unknown coefficients to be found in the parabola \(F = c0 + c1*t + c2*t^2\) that is to be fitted to the data of the left panel in Fig. 6.17 and this requires inserting three sets of data points into this equation. It can be shown (van Bergem, 1993) that when the data is rescaled on the time axis between t = ± 1, the coefficients have the following values:
# c0 is the value at t = 0.
c0 = F2int$y[times==0]
# c1 is half of the difference between the first and last data points.
c1 <- 0.5 * (F2int$y[N] - F2int$y[1])
# c2 is half of the sum of the first and last data points minus c0
c2 <- 0.5 * (F2int$y[N] + F2int$y[1]) - c0
c0
## [1] 1774
c1
## [1] -84
c2
## [1] 30
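These coefficient formulae follow from evaluating the parabola at the three time points \(t = -1\), \(t = 0\) and \(t = 1\):

\[F(0) = c0, \qquad F(1) = c0 + c1 + c2, \qquad F(-1) = c0 - c1 + c2\]

Subtracting and adding the last two equations then gives

\[c1 = \frac{F(1) - F(-1)}{2}, \qquad c2 = \frac{F(1) + F(-1)}{2} - c0\]

which is exactly what is computed above from the first, middle, and last points of the interpolated trajectory.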
If you follow through the example in R, you will get values of 1774, -84, and 30 for \(c0\), \(c1\), and \(c2\) respectively. Given these coefficients, the parabola over the entire trajectory can be calculated by inserting them into the equation \(c0 + c1*t + c2*t^2\). So for this segment, the values of the parabola are:
c0 + c1 * times + c2 * (times^2)
## [1] 1888.000 1885.132 1882.288 1879.468 1876.672 1873.900 1871.152
## [8] 1868.428 1865.728 1863.052 1860.400 1857.772 1855.168 1852.588
## [15] 1850.032 1847.500 1844.992 1842.508 1840.048 1837.612 1835.200
## [22] 1832.812 1830.448 1828.108 1825.792 1823.500 1821.232 1818.988
## [29] 1816.768 1814.572 1812.400 1810.252 1808.128 1806.028 1803.952
## [36] 1801.900 1799.872 1797.868 1795.888 1793.932 1792.000 1790.092
## [43] 1788.208 1786.348 1784.512 1782.700 1780.912 1779.148 1777.408
## [50] 1775.692 1774.000 1772.332 1770.688 1769.068 1767.472 1765.900
## [57] 1764.352 1762.828 1761.328 1759.852 1758.400 1756.972 1755.568
## [64] 1754.188 1752.832 1751.500 1750.192 1748.908 1747.648 1746.412
## [71] 1745.200 1744.012 1742.848 1741.708 1740.592 1739.500 1738.432
## [78] 1737.388 1736.368 1735.372 1734.400 1733.452 1732.528 1731.628
## [85] 1730.752 1729.900 1729.072 1728.268 1727.488 1726.732 1726.000
## [92] 1725.292 1724.608 1723.948 1723.312 1722.700 1722.112 1721.548
## [99] 1721.008 1720.492 1720.000
These values could be plotted as a function of time to obtain the fitted curve. However, there is a function in the Emu-R library plafit()
that does all these steps. So for the present data, the coefficients are:
plafit(vowlaxn.fdat[vowlaxn.fdat$sl_rowIdx==1,]$T2)
## c0 c1 c2
## 1774 -84 30
Moreover, the additional argument fit=T
returns the formant values of the fitted parabola linearly time-normalized back to the same length as the original data to which the plafit()
function was applied. So a superimposed plot of the raw and parabolically-smoothed F2-track for this first segment is:
# Calculate the values of the parabola
F2par = plafit(vowlaxn.fdat[vowlaxn.fdat$sl_rowIdx==1,]$T2, fit=T)
ylim = range(c(F2par, vowlaxn.fdat[vowlaxn.fdat$sl_rowIdx==1,]$T2))
xlab="Time (ms)"; ylab="F2 (Hz)"
par(mfrow=c(1,1))
# Plot the raw values
plot(T2~times_orig, data = vowlaxn.fdat[vowlaxn.fdat$sl_rowIdx==1,], type="b", ylim=ylim, xlab=xlab, ylab=ylab)
# Superimpose the smoothed values
par(new=T)
plot(vowlaxn.fdat[vowlaxn.fdat$sl_rowIdx==1,]$times_orig, F2par, type="l", ylim=ylim, xlab=xlab, ylab=ylab, lwd=2)
Fig. 6.18: A raw F2 trajectory (gray) and a fitted parabola.
The fitted parabola (Fig. 6.18) always passes through the first and last points of the trajectory and through whichever point is closest to the temporal midpoint. The coefficient \(c0\) is the y-axis value at the temporal midpoint. The coefficient \(c1\), being half the difference between the last and first values, is negative for falling trajectories and positive for rising trajectories. As already mentioned, \(c2\) measures the trajectory’s curvature: positive values on \(c2\) mean that the parabola has a U-shape, as in Fig. 6.18, negative values that it is ∩-shaped. Notice that these coefficients encode the trajectory’s shape independently of time. So the above trajectory extends over a duration of about 80 ms; however, even if the duration were 1/10th or 100 times as great, the coefficients would all be the same, if the trajectory’s shape were unchanged. So it would be wrong to say that very much can be inferred about the rate of change of the formant (in Hz/s) from \(c1\) (to do so, \(c1\) would have to be divided by the formant’s duration). The task now is to explore a worked example of measuring formant curvatures in a larger sample of speech. The analysis of the vowel spaces in the F2 x F1 plane, as well as the Euclidean distance measurements in 6.5, have suggested that the female speaker produces more distinctive vowel targets, or in terms of Lindblom’s (1990) H&H theory, her vowels show greater evidence of hyperarticulation and less of a tendency to be undershot. Is this also reflected in a difference in the extent of formant curvature? In order to address this question, the two speakers’ [ɛ] vowels will be compared. Before applying an algorithm for quantifying the data, it is always helpful to look at a few plots first, if this is possible (if only because a gross inconsistency between what is seen and what is obtained numerically often indicates that there is a mistake in the calculation!).
A plot of all of the F2 trajectories lined up at the temporal midpoint and shown separately for the two speakers does not seem to be especially revealing, perhaps in part because the trajectories have different durations and, as was mentioned earlier, in order to compare whether one trajectory is more curved than another, they need to be time normalized to the same length. The function normalize_length() does just this by linearly stretching and compressing the trajectories so that they extend in time between 0 and 1 (see above, when we created vowlaxn.fdat_norm). There is some suggestion from the time-normalized data (Fig. 6.19 left) that the female’s F2 trajectories are more curved than the male speaker’s. This emerges especially clearly when the linearly time-normalized male and female F2-trajectories are separately averaged (Fig. 6.19, right). However, the average is just that: at best a trend, and one that we will now seek to quantify by calculating the \(c2\) coefficients of the fitted parabola. Before doing this, here are the instructions for producing Fig. 6.19 (using the above-mentioned time-normalized dataframe vowlaxn.fdat_norm):
vowlaxn.fdat_norm = normalize_length(vowlaxn.fdat, N = 21)
p1 = ggplot(vowlaxn.fdat_norm[vowlaxn.fdat_norm$labels=="E",]) +
aes(x=times_norm,y=T2,col=spkr,group=sl_rowIdx) +
geom_line() +
labs(x = "Normalized time", y = "F2 (Hz)") +
ylim(c(1000,2800))
p2 = ggplot(aggregate(T2~times_norm+labels+spkr, data = vowlaxn.fdat_norm[vowlaxn.fdat_norm$labels=="E",],FUN=mean)) +
aes(x=times_norm,y=T2,col=spkr) +
geom_line() +
labs(x = "Normalized time", y = "F2 (Hz)") +
ylim(c(1000,2800))
grid.arrange(p1,p2,ncol=2)
## Warning: Removed 3 rows containing missing values (geom_path).
Fig. 6.19: F2 of [ɛ] for the male (black) and female (gray) speakers linearly time normalised (left), and linearly time normalised and averaged (right).
The plafit() function can be applied to any vector of values, just like the euclid() function created earlier. For example, this instruction finds the three coefficients of a parabola fitted to 10 random numbers:
r = runif(10)
plafit(r)
## c0 c1 c2
## 0.6814636 0.2255234 -0.1282457
Since plafit() evidently works on frames of speech data (see the instructions for creating Fig. 6.18), then for all the reasons given in 5.5.1 of the preceding Chapter, it can also be used inside aggregate(). Moreover, since the function returns the same number of elements per segment (3 in this case), one ‘column’ of the resulting dataframe will be a matrix, and not a vector:
E.fdat_norm = vowlaxn.fdat_norm[vowlaxn.fdat_norm$labels=="E",]
T2Eplafit = aggregate(T2~sl_rowIdx+spkr, data=E.fdat_norm,FUN=plafit)
names(T2Eplafit)
## [1] "sl_rowIdx" "spkr" "T2"
#be careful: $T2 is a matrix with three columns
is(T2Eplafit$T2)
## [1] "matrix" "array" "structure" "vector"
dim(T2Eplafit$T2)
## [1] 82 3
T2Eplafit$T2
has 3 columns (one per coefficient) and the same number of rows as there are [ɛ] segments. Fig. 6.20 compares the F2-curvatures of the male and female speakers using a boxplot. There are two outliers (both for the female speaker) with values less than -500 (as T2Eplafit$T2[,3] < -500
shows) and these have been excluded from the plot by setting the y-axis limits:
# to plot the third column (= the c2 curvature coefficient)
ggplot(T2Eplafit) +
aes(y=T2[,3],x=spkr) +
geom_boxplot() +
ylim(c(-550,150)) +
ylab("c2") +
xlab("Speaker")
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
Fig. 6.20: Boxplots of the c2 coefficient of a parabola fitted to F2 of [ɛ] for the male (left) and female (right) speaker.
Fig. 6.20 shows greater negative values on \(c2\) for the female speaker, which is consistent with the view that there is indeed greater curvature in the female speaker’s F2 of [ɛ] than in the male speaker’s. As foreshadowed at various stages in this section, the fit=T argument applies the function plafit() in synthesis mode: it works out the corresponding fitted formant parabola as a function of time. In order to smooth all of the [ɛ] F2 trajectories at once, do:
T2Eplafitted = aggregate(T2~sl_rowIdx+spkr, data=E.fdat_norm,FUN=plafit, fit=T)
#unfortunately, we need a for-loop to add these values to E.fdat_norm
#first of all, create E.fdat_norm$T2plafitted with 0s only
E.fdat_norm$T2plafitted = 0
#then
for (i in unique(E.fdat_norm$sl_rowIdx) ){
E.fdat_norm[E.fdat_norm$sl_rowIdx==i,]$T2plafitted =
as.vector(T2Eplafitted[T2Eplafitted$sl_rowIdx==i,]$T2)
}
The smoothed and raw F2 data can be superimposed on each other for any segment as in Fig. 6.21 and in the manner described below. Although fitting a parabola is an effective method of data reduction that is especially useful for measuring formant curvature and hence undershoot, there are two disadvantages as far as obtaining a smoothed contour are concerned:
not every formant trajectory has a parabolic shape;
parabolic fitting forces a fit at the segment onset, offset, and midpoint, as is evident from Fig. 6.21.
One way around both of these problems is to use the discrete cosine transformation (see e.g., Watson & Harrington, 1999; Harrington, 2006; Harrington et al., 2008) which will be discussed more fully in Chapter 8. This transformation decomposes a trajectory into a set of coefficients (this is the analysis mode) that are the amplitudes of half-cycle cosine waves of increasing frequency. The number of coefficients derived from the discrete cosine transformation (DCT) is the same as the length of the trajectory. If, in synthesis mode, all of these cosine waves are summed, then the original, raw trajectory is exactly reconstructed. However, if only the first few lowest frequency cosine waves are summed, then a smoothed trajectory is derived. Moreover, the fewer the cosine waves that are summed, the greater the degree of smoothing. Therefore, an advantage of this type of smoothing over that of fitting parabolas is that it is possible to control the degree of smoothing. Another advantage is that the DCT does not necessarily force the smoothed trajectory to pass through the values at the onset, offset, and midpoint and so is not as prone to produce a wildly inaccurate contour if formant onsets and offsets were inaccurately tracked (which is often the case especially if there is a preceding or following voiceless segment). There is a function in the Emu-R library for computing the DCT coefficients, dct()
which, just like plafit()
and euclid()
takes a vector of values as its main argument. The function can be used in an exactly analogous way to the plafit()
function in synthesis mode for obtaining smoothed trajectories from the coefficients that are calculated in analysis mode. In the example in Fig. 6.21, a smoothed F2 trajectory is calculated from the first five DCT coefficients (to obtain these, we have to use m = 4
as a parameter to be used in dct()
). The DCT coefficients were calculated as follows:
T2Edctfitted = aggregate(T2~sl_rowIdx+spkr, data=E.fdat_norm,FUN=dct, m=4, fit=T)
#unfortunately, we need a for-loop to add these values to E.fdat_norm
#first of all, create E.fdat_norm$T2dctfitted with 0s only
E.fdat_norm$T2dctfitted = 0
#then
for (i in unique(E.fdat_norm$sl_rowIdx) ){
E.fdat_norm[E.fdat_norm$sl_rowIdx==i,]$T2dctfitted =
as.vector(T2Edctfitted[T2Edctfitted$sl_rowIdx==i,]$T2)
}
Fig. 6.21 containing the raw and two types of smoothed trajectories for the first [ɛ] was produced with the following commands:
ggplot(E.fdat_norm[E.fdat_norm$sl_rowIdx==1,]) +
aes(x=times_norm) +
geom_line(aes(y = T2, linetype = "1. raw")) +
geom_line(aes(y = T2plafitted, linetype = "2. fitted with plafit()")) +
geom_line(aes(y = T2dctfitted, linetype = "3. dct-fitted (5 coefficients)")) +
scale_linetype_manual(values=c("solid", "dotted","dashed")) +
labs(linetype="") +
ylab("Frequency (Hz)") +
xlab("Normalized time")
Fig. 6.21a: A raw F2 formant trajectory of one of the vowels [ɛ] (solid) produced by a male speaker of Standard German and two smoothed contours of the raw signal based on fitting a parabola following van Bergem (1993) (dotted) and the first 5 coefficients of the discrete cosine transformation (dashed).
The fewer dct() coefficients we use, the more the resulting trajectory is smoothed; let us assume we had used only three coefficients (i.e. m=2):
T2Edctfitted2 = aggregate(T2~sl_rowIdx+spkr, data=E.fdat_norm,FUN=dct, m=2, fit=T)
#unfortunately, we need a for-loop to add these values to E.fdat_norm
#first of all, create E.fdat_norm$T2dctfitted2 with 0s only
E.fdat_norm$T2dctfitted2 = 0
#then
for (i in unique(E.fdat_norm$sl_rowIdx) ){
E.fdat_norm[E.fdat_norm$sl_rowIdx==i,]$T2dctfitted2 =
as.vector(T2Edctfitted2[T2Edctfitted2$sl_rowIdx==i,]$T2)
}
ggplot(E.fdat_norm[E.fdat_norm$sl_rowIdx==1,]) +
aes(x=times_norm) +
geom_line(aes(y = T2, linetype = "1. raw")) +
geom_line(aes(y = T2dctfitted, linetype = "2. dct-fitted (5 coefficients)")) +
geom_line(aes(y = T2dctfitted2, linetype = "3. dct-fitted (3 coefficients)")) +
scale_linetype_manual(values=c("solid", "dashed","dotted")) +
labs(linetype="") +
ylab("Frequency (Hz)") +
xlab("Normalized time")
Fig. 6.21b: A raw F2 formant trajectory of one of the vowels [ɛ] (solid) produced by a male speaker of Standard German and two smoothed contours of the raw signal based on the first 5 (dashed) or the first 3 (dotted) coefficients of the discrete cosine transformation.
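To make the analysis and synthesis modes of the DCT concrete, here is a minimal self-contained sketch in plain R; dct_coef() and dct_smooth() are hypothetical names, and the coefficient scaling conventions of emuR's dct() may differ:

```r
# Analysis mode: decompose a trajectory into cosine coefficients (DCT-II,
# scaled so that coefficient k is the amplitude of the k-th half-cycle cosine)
dct_coef <- function(x) {
  N <- length(x)
  n <- 0:(N - 1)
  sapply(0:(N - 1), function(k) (2 / N) * sum(x * cos(pi * k * (n + 0.5) / N)))
}

# Synthesis mode: sum only the first m+1 cosine waves;
# the smaller m, the smoother the reconstructed trajectory
dct_smooth <- function(coef, m = length(coef) - 1) {
  N <- length(coef)
  n <- 0:(N - 1)
  smooth <- rep(coef[1] / 2, N)
  if (m >= 1) {
    for (k in 1:m) {
      smooth <- smooth + coef[k + 1] * cos(pi * k * (n + 0.5) / N)
    }
  }
  smooth
}
```

With the default m all cosine waves are summed and the raw trajectory is reconstructed exactly; with m = 4 only the five lowest-frequency cosines are summed, analogous to the dct-fitted contours in Figs. 6.21a and 6.21b.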
In the production of an oral stop, the vocal tract is initially sealed during which time air-pressure builds up and is then released. At the point of release, the shape of the vocal tract has a marked influence on the acoustic signal and this influence extends at least to the beginning of the formant transitions, if the following segment is a vowel. Since the shape of the vocal tract is quite different for labial, alveolar, and velar places of articulation, the way in which the acoustic signal is influenced following the release differs correspondingly. If the influence extends to the onset of periodicity for the following vowel, then the formant onset frequencies of the same phonetic vowel should be different when it occurs after consonants at different places of articulation. As a study by Potter, Kopp and Green (1947) had shown, different places of articulation have their greatest influence on the onset of the second formant frequency. But it was the famous perception experiments using hand-painted spectrograms at the Haskins Laboratories that gave rise to the concept of an F2-locus. In these experiments, Liberman and colleagues (Delattre, Liberman & Cooper, 1955; Liberman, Delattre, Cooper & Gerstman, 1954; Liberman et al, 1958) showed that the origin of the second formant frequency influenced the perception of the place of articulation of the following consonant. More specifically, if the F2-onset was low, at around 700 Hz, listeners would predominantly hear /b/; if it was roughly in the 1800 Hz region, they would hear /d/; while if the F2-onset began near 3000 Hz, listeners would perceive /g/ before front vowels. The locus frequency was not at the acoustic vowel onset itself, but at some point prior to it during the consonantal closure.
More specifically, in synthesising a CV transition, formants would be painted from the F2-locus somewhere in the consonant closure to the F2-vowel target and then the first part of the transition up to where the voicing for the vowel began would be erased. With these experiments, Liberman and colleagues also demonstrated that the perception of place of articulation was categorical and this finding was interpreted in favour of the famous motor theory of speech perception (see e.g. Hawkins, 1999 for a thorough review of these issues).
In the years following the Haskins Laboratories experiments, various large-scale acoustic studies were concerned with finding evidence for an F2-locus (including e.g., Fant, 1973; Kewley-Port, 1982; Lehiste & Peterson, 1961; Öhman, 1966). These studies showed that the most stable F2-locus was for alveolars (i.e., alveolars exhibit the least F2-onset variation of the three places of articulation) whereas velars, which exhibit a great deal of variability depending on the backness of the following vowel, showed no real evidence from acoustic data of a single F2-locus. From the early 1990s, Sussman and colleagues applied the concept of a locus equation to suggest that F2 transitions might in some form provide invariant cues to the place of articulation (see e.g., Sussman et al, 1991; Sussman et al, 1995), a position that has not been without its critics (e.g., Brancazio & Fowler, 1998; Löfqvist, 1999).
Locus equations were first investigated systematically by Krull (1988, 1989) and they provide a numerical index of the extent to which vowels exert a coarticulatory influence on a consonant’s place of articulation. Thus, whereas in studies of vowel undershoot, the concern is with the way in which context influences vowel targets, here it is the other way round: that is, the aim is to find out the extent to which vowel targets exert a coarticulatory influence on the place of articulation of flanking consonants.
Fig. 6.22. Hypothetical F2-trajectories of [bɛb] (solid) and [bob] (dashed) when there is no V-on-C coarticulation (left) and when V-on-C coarticulation is maximal (right). First row: the trajectories as a function of time. Second row: A plot of the F2 values in the plane of the vowel target x vowel onset for the data in the first row. The dotted line bottom left is the line F2Target = F2Onset that can be used to estimate the locus frequency. From Harrington (2009).
The basic idea behind a locus equation is illustrated with some made-up data in Fig. 6.22 showing two hypothetical F2-trajectories for [bɛb] and [bob]. In the trajectories on the left, the extent of influence of the vowel on the consonant is nil. This is because, in spite of very different F2 vowel targets, the F2 frequencies of [ɛ] and of [o] both converge to exactly the same F2-locus at 700 Hz for the bilabial – that is, the F2-onset frequency is determined entirely by the consonant with no influence from the vowel. The other (equally unlikely) extreme is shown on the right. In this case, the influence of the vowel on the consonant’s place of articulation is maximal so that there is absolutely no convergence to any locus and therefore no transition.
A locus equation is computed by transforming F2 frequencies (top row, Fig. 6.22) as a function of time into the plane of F2-target x F2-onset (bottom row, Fig. 6.22) in order to estimate the extent of coarticulatory influence of the vowel on the consonant. Firstly, when there is no vowel-on-consonant coarticulation (left), the locus equation, which is the line that connects [bɛb] and [bob] is horizontal: this is because they both converge to the same locus frequency and so F2-onset is 700 Hz in both cases. Secondly, when vowel-on-consonant coarticulation is at a maximum (right), then the locus equation is a diagonal in this plane because the locus frequency is equal to the target frequency.
A fundamental difference between these two extreme cases of coarticulation is in the slope of the locus equation. On the left, the slope in this plane is zero (because the locus equation is horizontal) and on the right it is one (because F2Target = F2Onset). Therefore, for real speech data, the slope of the locus equation can be used as a measure of the extent of vowel-on-consonant coarticulation: the closer the slope is to one, the more the consonant’s place of articulation is influenced by the vowel (and the slope must always lie between 0 and 1 since these are the two extreme cases of zero and maximal coarticulation). Finally, it can be shown (with some algebraic manipulation: see Harrington & Cassidy 1999, p. 128-130) that the locus frequency itself, that is the frequency towards which transitions tend to converge, can be estimated by establishing where the locus equation and the line F2Target = F2Onset intersect. This is shown for the case of zero vowel-on-consonant coarticulation on the left in Fig. 6.22: this line intersects the locus equation at the frequency at which the transitions in the top left panel of Fig. 6.22 converge, i.e. at 700 Hz which is the locus frequency. On the right, the line F2Target = F2Onset cannot intersect the locus equation, because it is the same as the locus equation. Because these lines do not intersect, there is no locus frequency, as is of course obvious from the ‘transitions’ in Fig. 6.22 top right that never converge.
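In practice the slope and the locus frequency are estimated by a linear regression of F2-onset on F2-target. Here is a sketch with invented values (the numbers are made up purely for illustration):

```r
# Made-up F2 targets and onsets (Hz): partial convergence towards a locus
f2_target <- c(600, 900, 1200, 1500, 1800, 2100)
f2_onset <- c(1650, 1700, 1750, 1800, 1850, 1900)

# The locus equation is the regression line F2onset = intercept + slope * F2target
fit <- lm(f2_onset ~ f2_target)
slope <- coef(fit)[2]
intercept <- coef(fit)[1]

# The locus frequency is where this line meets F2Onset = F2Target,
# i.e. the solution of f = intercept + slope * f
locus <- intercept / (1 - slope)
# For these invented values: slope = 1/6, locus = 1860 Hz
```

A slope of 1/6 would indicate rather weak vowel-on-consonant coarticulation, i.e. strong convergence of the transitions towards the estimated locus.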
This theory can be applied to some simple isolated word data produced by the first author of Clark et al. (2007). In 1991, John Clark produced a number of isolated /dVd/ words in which V is one of the 13 possible monophthongs of Australian English. The relevant dataset, which is part of the Emu-R library, includes a segment list of the vowel in these words (isol), a vector of vowel labels (isol.l) and a trackdata object of the first four formant frequencies between the vowel onset and offset (isol.fdat).
The task is to investigate whether the coarticulatory influence of the vowel is greater on the final than on the initial consonant. There are various reasons for expecting this to be so. Foremost are the arguments presented in Ohala (1990) and Ohala & Kawasaki (1984) that initial CV (consonant-vowel) transitions tend to be a good deal more salient than VC transitions: compatibly, there are many more sound changes in which the vowel and syllable-final consonant merge, resulting in consonant loss (such as the nasalization of vowels and associated final nasal consonant deletion in French), than is the case for initial CV syllables. This suggests that synchronically a consonant and vowel are more sharply delineated from each other in CV than in VC syllables, and there are numerous aerodynamic and acoustic experiments to support this view. Secondly, there are various investigations (Butcher, 1989; Hoole et al., 1990) which show that in V1CV2 sequences where C is an alveolar, the perseverative influence of V1 on V2 is greater than the anticipatory influence of V2 on V1. This suggests that the alveolar consonant resists the coarticulatory influences of the following V2 (so the alveolar has a blocking effect on the coarticulatory influences of a following vowel) but is more transparent to the coarticulatory influences of the preceding V1 (so the alveolar does not block the coarticulatory influences of a preceding vowel to the same extent). Therefore, the vowel-on-consonant coarticulatory influences can be expected to be weaker when the vowel follows the alveolar in /dV/ than when it precedes it in /Vd/.
Before proceeding to the details of quantifying the data, it will be helpful, as always, to plot the data first. Fig. 6.23 shows the second formant frequency trajectories synchronised at the vowel onset on the left and at the vowel offset on the right. The trajectories are of F2 of the separate vowel categories, and there is only one token per vowel category, as table(isol.l) will show. The plots can be obtained by setting the offset argument of the dplot() function to 0 (for alignment at the segment onset) and to 1 (for alignment at the segment offset).
par(mfrow=c(1,2))
dplot(isol.fdat[,2], offset=0, ylab = "F2 (Hz)", xlab="Time (ms)")
dplot(isol.fdat[,2], offset=1, xlab="Time (ms)")
Fig. 6.23: F2 trajectories in isolated /dVd/ syllables produced by a male speaker of Australian English for a number of different vowel categories synchronised at the vowel onset (left) and at the vowel offset (right).
It seems clear enough from Fig. 6.23 that there is greater convergence of the F2-transitions at the alignment point on the left (segment onset) than on the right (segment offset). So far, the hypothesis seems to be supported.
The task now is to switch to the planes of F2-target x F2-onset and F2-target x F2-offset and to calculate locus equations in each of them. The vowel target is taken to be at the temporal midpoint of the vowel. The following three vectors are F2 at the vowel onset, target, and offset respectively:
f2onset = dcut(isol.fdat[,2], 0, prop=T)
f2targ = dcut(isol.fdat[,2], .5, prop=T)
f2offset = dcut(isol.fdat[,2], 1, prop=T)
The next step is to plot the data and then to calculate a straight line of best fit, known as a regression line, through the scatter of points: that is, there will not be just two points that lie on the same line as in the theoretical example in Fig. 6.22, but several points that lie close to a line of best fit. The technique of linear regression, which for speech analysis in R is described in detail in Johnson (2008), is to calculate such a line, defined as the line that minimises the sum of the squared (vertical) distances of the points to it. The function lm() is used to do this. Thus for the vowel onset data, the second and third commands below are used to calculate and draw the straight line of best fit:
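The commands in question can be reconstructed from the surrounding text (the object name regr is the one whose coefficients are inspected just below): the first plots the scatter, the second fits the regression line with lm(), and the third superimposes it on the plot.

```r
# scatter of F2-onset against F2-target
plot(f2targ, f2onset, xlab="F2-target (Hz)", ylab="F2-onset (Hz)")
# calculate the straight line of best fit (regression line)
regr = lm(f2onset ~ f2targ)
# draw the regression line through the scatter
abline(regr)
```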
The information about the slope is stored in $coeff which also gives the intercept (the value at which the regression line cuts the y-axis, i.e. the F2-onset axis). The slope of the line is just over 0.27:
regr$coeff
## (Intercept) f2targ
## 1217.8046955 0.2720668
Recall that the best estimate of the locus is where the line F2Target = F2Onset cuts this regression line. Such a line can be superimposed on the plot as follows:
abline(0, 1, lty=2)
There is a function in the Emu-R library, locus(), that carries out all of these operations. The first two arguments are for the data to be plotted on the x- and y-axes (in this case the F2-target and the F2-onset respectively) and there is a third optional argument for superimposing a parallel set of labels. In this example, the vowel labels are superimposed on the scatter and the x- and y-ranges are set to be identical (Fig. 6.24):
xlim = c(500, 2500); ylim = xlim; par(mfrow=c(1,2))
xlab = "F2-target (Hz)"; ylab = "F2-onset (Hz)"
stats.on = locus(f2targ, f2onset, isol.l, xlim=xlim, ylim=ylim, xlab=xlab, ylab=ylab)
stats.off = locus(f2targ, f2offset, isol.l, xlim=xlim, ylim=ylim, xlab=xlab, ylab="F2-offset (Hz)")
Fig. 6.24. Locus equations (solid lines) for /dVd/ words produced by a male speaker of Australian English for F2-onset (left) and F2-offset (right) as a function of F2-target. The dotted line is the line y = x and is used to estimate the locus frequency at the point of intersection with the locus equation.
Fig. 6.24 shows that the regression line (locus equation) is not as steep for the F2-onset compared with the F2-offset data. The following two commands show the actual values of the slopes (and intercepts) firstly in the F2-target x F2-onset plane and secondly in the F2-target x F2-offset plane:
stats.on$coeff
## (Intercept) target
## 1217.8046955 0.2720668
stats.off$coeff
## (Intercept) target
## 850.4935279 0.4447689
The locus() function also calculates the point at which the regression line and the lines F2Target = F2Onset (Fig. 6.24, left) and F2Target = F2Offset (Fig. 6.24, right) intersect, thus giving a best estimate of the locus frequency. These estimates are in $locus and the calculated values for the data in the left and right panels of Fig. 6.24 respectively are 1673 Hz and 1532 Hz. In fact, the locus frequency can also be derived from:
\[L = c/(1 - \alpha)\]
where L is the locus frequency and c and α are the intercept and slope of the locus equation respectively (Harrington, 2009). The formula follows from solving for the intersection with the line F2Onset = F2Target: substituting L for both F2-onset and F2-target in the locus equation gives L = c + αL, and hence L = c/(1 - α). Thus the estimated locus frequency for the CV syllables is equivalently given by:
1217.8046955/(1- 0.2720668)
## [1] 1672.962
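The same formula can be applied to the intercept and slope of the F2-offset locus equation to verify the other value stored in $locus:

```r
# locus frequency for final /d/ from its locus equation's intercept and slope
850.4935279/(1 - 0.4447689)
## [1] 1531.783
```

which rounds to the 1532 Hz reported above.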
The reason for this 141 Hz difference in the calculated locus frequencies for initial as opposed to final /d/ is not immediately clear; nevertheless, as Fig. 6.23 shows, these are reasonable estimates of the point at which the F2 transitions tend, on average, to converge. Finally, statistical diagnostics on the extent to which the regression line fits the points are given by the summary function:
summary(stats.on)
##
## Call:
## lm(formula = onset ~ target)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.558 -34.733 5.164 41.402 66.701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.218e+03 5.134e+01 23.721 8.51e-11 ***
## target 2.721e-01 3.308e-02 8.224 5.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.12 on 11 degrees of freedom
## Multiple R-squared: 0.8601, Adjusted R-squared: 0.8474
## F-statistic: 67.64 on 1 and 11 DF, p-value: 5.018e-06
The significance of the regression line slope is given by the results of a t-test in the line beginning target. The null hypothesis is that there is no relationship at all between F2-onset and F2-target (which would mean that the slope of the regression line was zero): the t-test computes the likelihood that this could be the case. This probability is shown to be 5.02e-06, or 0.00000502, i.e. very small, so the null hypothesis is rejected. The other important diagnostic is given under adjusted R-squared, which is a measure of how much of the variance is explained by the regression line. It shows that for these data, just under 85% of the variance is explained by the regression line, and the probability that this value is different from zero is given by the results of an F-test in the next line beginning ‘F-statistic’. The analysis of these (very limited) data lends some support to the view that the coarticulatory influence of the vowel on an alveolar stop is greater when the consonant is in final than when it is in initial position. However, data from continuous speech, and above all from other consonant categories, will not always give such clear results, so locus equation calculations must be used with care. In particular, an initial plot of the data to check the formant transitions for any outliers due to formant tracking errors is essential. If the slope of the line is negative or greater than 1, then either the data could not be reliably modelled by any kind of regression line and/or no sensible locus frequency could be computed (i.e., the formant transitions do not converge, and the resulting locus frequency calculation is likely to be absurd).
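As a precaution, the locus-frequency formula can be wrapped in a small helper function (hypothetical, not part of the Emu-R library) that warns when the fitted slope falls outside these bounds:

```r
# hypothetical helper: estimate the locus frequency from a fitted locus equation,
# warning when the slope is outside [0, 1] and the estimate is therefore unreliable
locusfreq <- function(regr) {
  alpha = regr$coeff[2]
  if (alpha < 0 | alpha > 1)
    warning("slope outside [0, 1]: no sensible locus frequency")
  as.numeric(regr$coeff[1]/(1 - alpha))    # L = c/(1 - alpha)
}
locusfreq(stats.on)    # about 1673 Hz, as computed by locus()
```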
This chapter has been published as Chapter 6, ‘Analysis of formants and formant transitions’, in Harrington (2010).↩