1 Introduction
Some landmark research in North American dialectology and sociolinguistics of vowels has relied on single point measurements of F1 and F2 to characterize the entire vowel (Labov 2001; Labov et al. 2006). However, there is a wealth of work that makes it clear that full formant dynamics, or Vowel Intrinsic Spectral Change (Nearey and Assmann 1986), have important dialectal and sociolinguistic properties (e.g. Fox and Jacewicz 2009; Risdal and Kohn 2014; Docherty et al. 2015; Tanner et al. 2022). Additionally, vowel formant extraction tools such as the FAVE suite (Rosenfelder et al. 2024), fasttrack (Barreda 2021a; Fruehwald and Barreda 2024), and new-fave (Fruehwald 2025) make full formant tracks available to researchers, and curve fitting techniques are becoming more widely adopted (Sóskuthy 2021).
One open question about the use of full vowel formant tracks is how normalization for differences in vocal tract length between speakers ought to be carried out. There is an extensive literature on the topic in the use of point measurements (see Adank et al. 2004), but how these methods should (or could) be applied to full formant tracks has not been addressed as fully.
In this paper, I will be exploring the use the Discrete Cosine Transform (DCT) to smooth and normalize formant tracks. The DCT was chosen, in part, because it is used in the fasttrack method to optimize the Linear Predictive Coding parameters in formant extraction, and the fasttrack python implementation allows users to save the DCT coefficients. I will demonstrate that it is possible to carry out speaker normalization directly on the DCT coefficients, which can then themselves be used for quantitative analysis.
The goal of this paper is not to compare normalization techniques and evaluate their benefits and drawbacks. Rather, it is to demonstrate that if we were to construe any given formant normalization technique as a function \(f(x)\) that is applied to measured formant tracks, there is an approximately equivalent function \(g(y)\) that can applied to the DCT coefficients of formant tracks. The specific procedure for defining \(g(y)\) given \(f(x)\) is as follows:
- Any non-linear transformations (log, exponentiation, Mel, Bark, etc) must be applied to formant track data before the DCT is applied.
- For location parameters (values added or subtracted in the normalization):
- They should be estimated using the 0th DCT parameter.
- They should be applied to just the 0th DCT parameter.
- For scale parameters (values divided or multiplied by in the normalization):
- They should be estimated using the 0th DCT parameter and multiplied by \(\sqrt{2}\).1
- They should be applied to all DCT parameters.
1.1 Notation conventions
There are a number of mathematical formulae in this paper and mathematical notation used throughout. I have chosen to follow the following notation conventions.
- \(F\), \(x\)
-
A formant track. \(F\) is used when discussing formants specifically, and \(x\) when discussing a signal in general.
- \(\bar{F}_i\)
-
For a single vowel token, the average across its entire formant track.
- \(y\)
-
A vector of Discrete Cosine Transform coefficients (described below).
- \(i\)
-
A variable for formant numbers. For example \(F_1\) is the first formant, and \(F_i\) is the \(i^{th}\) formant.
- \(j\)
-
A variable for an observation number. For example, \(x_3\) would be the third value from a signal \(x\). This will sometimes be used interchangeably to refer to the \(j^{th}\) value in a signal, and sometimes to the \(j^{th}\) vowel token in a data set.
- \(k\)
-
A variable for a Discrete Cosine Transform coefficient. For example \(y_0\) will be the 0th coefficient, and \(y_k\) will be the \(k^{th}\) coefficient.
2 The Discrete Cosine Transform
The Discrete Cosine Transform (DCT) is a general purpose curve fitting method (it is the basis of jpeg image compression (Wallace 1992) ). It is sometimes used directly on audio spectra (e.g. Zahorian and Jagharghi 1991; Zahorian and Jagharghi 1993; Guzik and Harrington 2007; Jannedy and Weirich 2017) and has also been used to model and classify vowel formant trajectories (Watson and Harrington 1999; Hillenbrand et al. 2001; Morrison 2009; Williams and Escudero 2014; Williams et al. 2015; Cox and Palethorpe 2019). Versions of the DCT are available in the fftw R package (Mersmann 2024) (which itself calls the fftw C library) and in the scipy python package (Virtanen et al. 2020). It has also been implemented in the emuR R package (Jochim et al. 2024) and the fasttrackpy python package (Fruehwald and Barreda 2024).
The DCT re–describes a continuous function in terms of weights on a bank of cosine functions oscillating at increasing frequencies. There are several definitions of the DCT; for the purpose of this paper, we’ll be using the definition of the DCT as used in the python scipy library (Virtanen et al. 2020) both because this is a standard and replicable method, and because this is how the fasttrackpy library implements and saves these parameters (see Section 8 for more information).
The mathematical definition for the \(k\)th DCT is given in Equation 1.
\[ y_k = \frac{1}{oN} \sum_{j=0}^{N-1} x_j\cos\left(\frac{\pi k(2j + 1)}{2N} \right) \tag{1}\]
\[ o = \begin{cases} \sqrt{2}& \text{for }k=0\\ 1 & \text{for }k>0 \end{cases} \]
For the purpose of this paper, it’s not necessary to understand every component of Equation 1 in detail. Figure 1 annotates the formula with the important pieces that will come into play. A signal \(x\) (in our case, a formant track) is multiplied by a cosine function of the same length, then summed. Then, this sum is normalized (divided by the length of the signal), resulting in the DCT coefficient. The cosine function changes depending on the number of the DCT coefficient.
Figure 2 plots the 0th through 2nd cosine functions that the coefficients calculated in Equation 1 weight.
The inverse DCT takes the coefficients generated by Equation 1 and returns the original signal. The Inverse DCT equation is given in Equation 2.
\[ x_j = \sqrt{2}y_0 + 2\sum_{k=1}^{N-1}y_k\cos\left(\frac{\pi k(2j+1)}{2N}\right) \tag{2}\]
To get the rate of change of a signal from its DCT coefficients, we need the first derivative of Equation 2, given in Equation 3
\[ \frac{\delta x_j}{\delta j} = -2\sum_{k=1}^{N-1}y_k\frac{\pi k}{N}\sin\left(\frac{\pi k(2j+1)}{2N}\right) \tag{3}\]
To get the acceleration in a signal from its DCT coefficients, we need the second derivative of Equation 2, given in
\[ \frac{\delta^2 x_j}{\delta j^2} = -2\sum_{k=1}^{N-1}y_k\left(\frac{\pi k}{N}\right)^2\cos\left(\frac{\pi k(2j+1)}{2N}\right) \tag{4}\]
3 Benefits of the DCT
There are a number of general benefits to using DCT coefficients for analyzing vowel formant tracks which I will outline here. It’s worth noting at this point that several of the benefits only really hold for within-speaker analyses. In order to apply some of these methods to multi-speaker analyses, the DCT coefficients will need to be speaker normalized (the subject of the remainder of this paper).
3.1 Smoothing by truncation
The DCT returns the same number of coefficients as there are data points in the original signal. With all of these DCT coefficients, the original signal can be fully reconstructed by applying the inverse DCT. For example, the F1 formant track in Figure 3 consists of 63 measurement points (the plotted points), so the application of the DCT returns 63 DCT coefficients. When applying the inverse DCT to all 63 coefficients, the resulting curve (plotted in the red line) recreates the original signal.
The first 3 DCT coefficients have relatively interpretable meanings which can be seen in their shapes in Figure 2.
0th: The overall level of the curve
1st: The amount of fall to the curve
2nd: The amount of fall-rise to the curve
In fact, several papers have been successful classifying vowel phonemes using just these first three coefficients (Watson and Harrington 1999; Williams et al. 2015). Higher DCT coefficients capture higher frequency oscillations of a signal. For example, the cosine function that the 9th DCT coefficient weights is plotted in Figure 4.
While these higher frequency oscillations are important for perfectly recreating the original signal, they are probably not linguistically meaningful with respect to a single vowel’s formant track. By truncating the number of DCT coefficients used to describe a formant track to the first 3 (as done in prior work) or 5 (as done in fasttrackpy), we get back a smoothed version of the formant track when applying the inverse DCT.
The benefit here is that the degree of desired smoothing can be arbitrarily specified after the DCT has been applied by simply using just the first \(n\) coefficients. This is in contrast with some other commonly used curve fitting methods for vowel formants, like cubic regression splines (Sóskuthy 2017, 2021), where the number of basis functions (and therefore coefficients) must be specified before the curve is fit, and cannot be altered after the fact.
3.2 Data Compression
While this may be a diminishing concern given increasing computational power, by using just the first 5 DCT coefficients for analysis, instead of full formant track data, the volume of data that needs to be stored or loaded into memory for quantitative analysis is dramatically decreased. As shown in Table 1, the size of the data in the Philadelphia Neighborhood Corpus (PNC) (Labov and Rosenfelder 2011) is about an order of magnitude smaller when using DCT coefficients than when using the raw formant track data.
lines | size | |
---|---|---|
tracks | 46,973,521 | 17 GB |
dct coefs | 4,943,165 | 999 MB |
ratio | 9.5 | 16.5 |
3.3 Averaging Formant Tracks
Averaging over formant tracks of different durations is not straightforward, as they will have a different number of sampled values. Figure 6 illustrates this problem with three example formant tracks for the vowel /ay/ from one speaker in the Philadelphia Neighborhood Corpus. It would perhaps be inappropriate to average the final sample from a short token with the aligned midpoint sample from a longer token. Aligning the beginnings and ends tokens in, for example, proportional time, creates a new problem that the samples in between are no longer aligned, making point-wise averaging impossible.
Instead of point-wise averaging, or fitting a smoothing model, we can instead average over the DCT coefficients for these three tokens (Table 2), then use these averages as DCT coefficients themselves to get the inverse DCT. The result of this averaging process is plotted in Figure 7.
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
token | |||||
1 | 471.3 | 26.0 | −17.8 | 5.6 | −0.3 |
2 | 428.3 | 72.6 | −14.1 | 2.3 | −4.5 |
3 | 473.6 | 51.0 | −13.6 | −3.2 | 7.3 |
average | 457.7 | 49.9 | −15.2 | 1.5 | 0.8 |
3.4 Modelling with DCT coefficients
Similarly to how we can calculate averages more easily with DCT coefficients, we can also fit regression models more easily. Let’s say, for example, we wanted to model the effect of following voicing on the entire F1 trajectory of /ay/ (plotted in Figure 8). To do this with the formant tracks themselves, we would need to fit a generalized additive model with an interaction between the smoothing term and voicing context, taking into account autocorrelation and, perhaps, random smooths by token (Sóskuthy 2021).
With DCT coefficients, on the other hand, we could fit a linear model, using each individual DCT coefficient as the outcome, and the voicing effect as the predictor. This is a form of functional regression (Ramsay and Silverman 2006; Gubian et al. 2015). Listing 1 shows the R code involved for fitting such a model, with the model results summarized in Table 3.
<- lm(
ay_model cbind(y0, y1, y2, y3, y4) ~ context,
data = ay_data
)
y0 | y1 | y2 | y3 | y4 | |
---|---|---|---|---|---|
(Intercept) | 523.6 | 39.0 | −21.6 | 1.5 | −3.9 |
pre-voiceless | −72.5 | 5.1 | 13.1 | −2.2 | 1.9 |
In Table 3, the intercept terms for each DCT coefficient correspond to the average of that coefficient across the reference level (the elsewhere context), and the pre-voiceless terms correspond to the difference of the pre-voiceless context from the reference level for each coefficient.
Interestingly, if we apply the inverse DCT to these difference terms, the result is the difference curve between the the two voicing contexts, plotted in Figure 9.
More complex models exploring the difference in formant dynamics between these two allophones could be specified, but by using DCT coefficients as the outcome variables, we don’t need to include time across the formant track in our predictors.
3.5 Rate of change analysis
Given the definition of the first derivative of the signal in Equation 3, it’s also possible to analyze the rate of change in formant dynamics. Figure 10 plots the predicted rate of change for these two allophones of /ay/ based on the model fit above. These rate of change curves were calculated directly from the DCT coefficients by plugging them into Equation 3. This kind of analysis could be useful in research on hypo/hyper-articulation, how coarticulation influences sound change, etc.
3.6 Multi-speaker analysis
The ability to regress over DCT coefficients could be invaluable for research into, for example, diachronic changes in formant dynamics. However this modeling approach, along with the other benefits of DCT coefficients discussed above, is really only well defined for within-speaker analyses. For example, Figure 11 plots formant trajectories for the vowels /iy/, /ey/, /ay/, /uw/, /ow/ and /aw/ from two speakers in the PNC. Specifically, DCT coefficients from each vowel class for each speaker were averaged, and then the inverse DCT applied to result in the plotted trajectories.
The consistent differences in vowel space size and location between these two speakers are also reflected in their DCT coefficients, such that any analysis combining their data will face the same problems as using unnormalized formant point measurements. For example, Speaker B’s /iy/ and /ey/ appear to be backer than Speaker A’s, but it’s unclear how much of this is due to their overall smaller vowel space size versus having a different vowel space target. In order to conduct multi-speaker analyses using DCT coefficients, the DCT coefficients themselves will need to be normalized.
4 Transforming DCT Coefficients
Nearly all speaker normalization methods involve:
Shifting the location of formant values by a constant \(l\).
Scaling formant values by a constant \(s\).
\[ F' = s(F+l) \tag{5}\]
If there are coefficients \(y_k\) that when the inverse DCT is applied to them return \(F\), then there are coefficients \(y'_k\) that when the inverse DCT is applied to them return \(F'\). The goal of this section is to identify whether and how it is possible to directly transform \(y_k\) to \(y'_k\). To achieve this, we’ll begin by identifying how shifting and scaling the input signal by our desired \(l\) and \(s\) affect the resulting DCT coefficients.
4.1 Useful Equivalencies
In order to simplify the notation in the following section, we’ll define some equivalencies. 2 The cosine term in Equation 1 will be replaced with \(\psi_j\), a function over the signal index \(j\) and the DCT coefficient index \(k\).
\[ \psi_j(k) = \cos\left(\frac{\pi k(2j + 1)}{2N}\right) \tag{6}\]
The normalizing constant will be replaced with \(c\).
\[ c = \frac{1}{oN} \tag{7}\]
With these definitions, the DCT can be expressed as
\[ y_k = c\sum_{j=0}^{N-1}x_j\psi_j(k) \tag{8}\]
There are some additional important equivalencies we’ll leverage below. The first is for when \(k=0\).
\[ 1 = \psi_j(0) \tag{9}\]
Therefore, summing \(\psi_j(0)\) over \(j\) will equal \(N\).
\[ N = \sum_{j=0}^{N-1}\psi_j(0) \tag{10}\]
Second, when \(k>0\), summing \(\psi_j(k)\) over \(j\) will equal 0. \[ 0 = \sum_{j=0}^{N-1}\psi_j(k);~ k > 0 \tag{11}\]
4.2 Scaling the input
Let’s say we have a signal \(x\) that we applied the DCT to to get \(y_k\).
\[ y_k = c\sum_{j=0}^{N-1}x_j\psi_j(k) \tag{12}\]
If our normalization method called for scaling \(x\) by \(s\), our input to the DCT would now be \(sx_j\). The new DCT coefficients \(y'_k\) would be
\[ y'_k = c\sum_{j=0}^{N-1}(sx_j)\psi_j(k) \tag{13}\]
As a constant, \(s\) can be brought outside the summation.
\[ y'_k = sc\sum_{j=0}^{N-1}x_j\psi_j(k) \]
Now, everything to the right of \(s\) is the same as Equation 12, so we can substitute in \(y_k\),
\[ y'_k = sy_k \tag{14}\]
4.2.1 Scaling conclusion
The result here is that any scaling term applied to an input signal results in the same scaling term being applied to all of its DCT coefficients.
4.3 Shifting the input
Again, let’s say we have a signal \(x\) that we applied the DCT to to get \(y_k\). If our normalization method called for adding the constant \(l\) to \(x\), our input to the DCT would now be \((x_j + l)\). The new DCT coefficients \(y'_k\) would now be
\[ y'_k = c\sum_{j=0}^{N-1}(x_j+l)\psi_j(k) \tag{15}\]
The summation term can be rewritten as the addition of two summation terms.
\[ y'_k = c(\sum_{j=0}^{N-1}x_j\psi_j(k) +\sum_{j=0}^{N-1}l\psi_j(k)) \tag{16}\]
Then, we can distribute the normalizing term \(c\) and move the constant \(l\) outside of its summation.
\[ y'_k = c\sum_{j=0}^{N-1}x_j\psi_j(k) + lc\sum_{j=0}^{N-1}\psi_j(k)) \tag{17}\]
Everything to the left of the addition is the same as Equation 12, so we can substitute in \(y_k\).
\[ y'_k = y_k + lc\sum_{j=0}^{N-1}\psi_j(k) \tag{18}\]
We can now use the equivalencies in Equation 10 and Equation 11 to evaluate Equation 18 when \(k=0\) and when \(k>0\).
For \(k=0\), we’ll substitute both the sum of \(\psi_j(0)\) (Equation 10) and the normalizing constant \(c\) (Equation 7) to arrive at
\[ y'_0 = y_0 + l\frac{1}{\sqrt{2}N}N = y_0 + \frac{l}{\sqrt{2}} \tag{19}\]
When \(k>0\), the sum of \(\psi_j(k)\) is 0 (Equation 11), which eliminates the second term entirely.
\[ y'_{k>0} = y_{k>0} + 0lc = y_{k>0} \tag{20}\]
4.3.1 Shifting conclusion
If an input signal is shifted by \(l\), its 0th DCT coefficient will be shifted by \(\frac{l}{\sqrt{2}}\), and its remaining DCT coefficients will remain the same.
4.4 Non-linear transforms of the input
Some speaker normalization procedures involve transformations of the signal that are non-linear. The most common one is taking the logarithm of formant values. If we logged an input signal \(x\), its DCT coefficients would be given as
\[ y'_k = c\sum_{j=0}^{N-1}\log(x_j)\psi_j(k) \tag{21}\]
There are no additional equivalences we can leverage here to re-express \(y'_k\) as a transformation of \(y_k\). In order to calculate \(y'_k\), we must directly transform the original signal, then apply the DCT.
4.5 DCT Coefficient Transformation Summary
In summary, if we want to shift our formant values by some value \(l\) and scale them by some value \(s\):
We need to add \(\frac{l}{\sqrt{2}}\) to the 0th DCT coefficient (Equation 19).
We don’t add anything to the remaining DCT coefficients (Equation 20).
We need to scale all DCT coefficients by \(s\) (Equation 14).
Additionally, for non-linear transformations of formant frequencies, there is no way to transform the DCT coefficients of the original values directly into DCT coefficients of the transformed values. Instead, the entire DCT needs to be applied to the transformed formants directly.
5 Estimating normalization parameters
Now that we understand how to shift DCT coefficients by \(l\) and scale them by \(s\), we need to establish how to get these \(l\) and \(s\) normalization parameters from DCT coefficients. The two most common normalization parameters are the mean and standard deviation over vowels, so we will focus on those.
5.1 Calculating means
Let’s begin with \(F\) as a formant track for a single vowel token, and \(y_k\) as the coefficients we get from applying the DCT to \(F\). As we already established in Equation 9, \(\psi_j(0) =1\), which means the 0th DCT coefficient can be given as
\[ y_0 = \frac{1}{\sqrt{2}}\frac{1}{N}\sum_{j=0}^{N-1}F_j \tag{22}\]
The sum of all values in a vector divided by the number of values in the vector is the mean of that vector, which we can represent with \(\bar{F}\). So, the 0th coefficient is equal to the mean of \(F\), divided by \(\sqrt{2}\).
\[ y_0 = \frac{1}{\sqrt{2}}\bar{F} \]
And therefore
\[ \bar{F} = \sqrt{2}y_0 \tag{23}\]
The \(\sqrt{2}\) constant will carry through any functions that are weighted sums. So if the function \(f(x)\) is a weighted sum,
\[ f(\bar{F}) = \sqrt{2}f(y_0) \tag{24}\]
The mean is a weighted sum, so
\[ \text{mean}(\bar{F}) = \sqrt{2}~\text{mean}(y_0) \tag{25}\]
5.1.1 Shifting by the mean
Let’s use \(\text{mean}(\bar{F})\) as our desired location parameter \(l\) by which to shift all vowel tokens. We’ve already established in Equation 19 and Equation 20 that this will involve only changes to the 0th DCT coefficient for the \(j^{th}\) token. If we substitute Equation 25 into Equation 19, we get
\[ y'_{0j} = y_{0j} - \frac{\sqrt{2}~\text{mean}(y_0)}{\sqrt{2}} = y_{0j}-\text{mean}(y_0) \tag{26}\]
Because \(\sqrt{2}\) cancels out in Equation 26, to center all DCT coefficients on the mean across tokens, we just need to center \(y_0\) on the mean across \(y_0\).
5.2 Calculating standard deviations
The standard deviation over \(\bar{F}\) is not a weighted sum, but as it happens for the standard deviation specifically, there is still a straightforward relationship between formant values and \(y_0\).
Firstly, the standard deviation across \(\bar{F}\) can be given as
\[ \text{sd}(\bar{F}) = \sqrt{ \frac{ \sum(\bar{F}_j - \text{mean}(\bar{F}))^2 }{ N } } \tag{27}\]
Substituting in Equation 23 and Equation 25, we have
\[ \text{sd}(\bar{F}) = \sqrt{ \frac{ \sum(\sqrt{2}y_{0j} - \sqrt{2}~\text{mean}(y_0))^2 }{ N } } \tag{28}\]
This can be simplified in a few steps.
\[ \text{sd}(\bar{F}) = \sqrt{ \frac{ \sum\sqrt{2}^2(y_{0j} - \text{mean}(y_0))^2 }{ N } } \tag{29}\]
\[ \text{sd}(\bar{F}) = \sqrt{ \frac{ \sum2(y_{0j} - \text{mean}(y_0))^2 }{ N } } \]
\[ \text{sd}(\bar{F}) = \sqrt{ 2\frac{ \sum(y_{0j} - \text{mean}(y_0))^2 }{ N } } \]
\[ \text{sd}(\bar{F}) = \sqrt{2}\sqrt{ \frac{ \sum(y_{0j} - \text{mean}(y_0)) }{ N } } \]
Resulting in
\[ \text{sd}(\bar{F}) = \sqrt{2}~\text{sd}(y_0) \tag{30}\]
5.2.1 Scaling by the standard deviation
Let’s use \(\frac{1}{\text{sd}(\bar{F})}\) as our scaling parameter \(s\) by which to scale all vowel tokens. By substituting Equation 30 into Equation 14, we get
\[ y'_{kj} = \frac{1}{\sqrt{2}~\text{sd}(y_0)}y_{kj} \tag{31}\]
So, when scaling by the standard deviation, or by a linear function over \(y_0\), we need to multiply the result by \(\sqrt{2}\), and then use it as a scaling parameter over all DCT coefficients.
6 Example Application
In this section, we’ll take three normalization methods that were defined over formant point measurements, and redefine DCT normalization methods as outlined above. These three methods were chosen because they cover all three possibilities of shifting and scaling.
Method | \(l\) (shift) | \(s\) (scale) |
---|---|---|
\(\Delta\)F (Johnson 2020) | \(0\) | \(\frac{1}{\delta(F)}\) |
Nearey (Nearey 1978) | \(-\text{mean}(\log F)\) | \(1\) |
Lobanov (Lobanov 1971) | \(-\text{mean}(F_i)\) | \(\frac{1}{\text{sd}(F_i)}\) |
We’ll be using Equation 5 (repeated here) as the generalized normalization equation that all three methods have in common.
\[ F' = s(F+l) \tag{32}\]
6.1 \(\Delta\)F
Johnson (2020) proposes a vocal tract length normalization method that scales all formants, but does not shift them. In the original paper, a quantity \(\Delta\)F is defined, but for the purpose of this paper, we’ll define it as a function \(\delta(F)\).
\[ \delta(F) =\sum_{i=1}^{M}\sum_{j=1}^{N} \frac{F_{ij}}{i-0.5} \tag{33}\]
Where \(i\) is the formant number. The scaling and shifting parameters \(s\) and \(l\) for this method are \[ s = \frac{1}{\delta(F)}; ~~l=0 \tag{34}\]
Plugging these parameters in to Equation 32, we get
\[
F'_{ij} = \frac{1}{\delta(F)}F_{ij}
\tag{35}\]
The \(\delta(F)\) function is a weighted sum, so by Equation 24 we get
\[ \delta(F) = \sqrt{2}\delta(y_0) \tag{36}\]
Since this method involves only scaling, we can plug Equation 36 into Equation 14 to get the DCT coefficients of the normalized formants as
\[ y'_{kij} = \frac{1}{\sqrt{2}\delta(y_{0})}y_{kij} \tag{37}\]
I used Equation 37 to normalize the DCT coefficients for F1 through F3 for two speakers from the PNC. Figure 12 compares the resulting inverse DCT on the unnormalized vs \(\Delta\)F normalized formant tracks, focusing on just the long vowels.
6.2 Nearey
Nearey normalization (Nearey 1978) involves a single shift in location, and has been recently argued to more closely reflect listeners perceptual categorization (Barreda 2021b).
Importantly, the first step of Nearey normalization involves taking the logarithm of formant values. As discussed in Section 4.4, there is no way to directly transform DCT coefficients of formants in Hz to DCT coefficients in log(Hz). Instead, the formants must first be logged, and then have the DCT applied. For the sake of notational brevity, for this section assume that any \(F\) refers to a log transformed formant, and any \(y_k\) refers to the DCT coefficients of that log transformed formant.
The scaling (\(s\)) and shifting (\(l\)) parameters for Nearey normalization are:
\[ s = 1; ~~l = -\text{mean}(F) \tag{38}\]
Plugging this into Equation 32, we get
\[ F'_{ij} = F_{ij} - \text{mean}(F) \tag{39}\]
Since this involves shifting by the mean, we can use Equation 26 to define the DCT normalization like so:
\[ y'_{0ij} = y_{0ij} - \text{mean}(y_{0}) \tag{40}\]
\[ y'_{k>0ij} = y_{k>0ij} \]
The 0th DCT parameter of every token will be shifted by the mean across \(y_0\), and all other DCT parameters will stay the same.
Using Equation 40, I Nearey normalized the DCT coefficients for F1 through F3 for the same two speakers from the PNC as before. Figure 13 compares the resulting inverse DCT on the unnormalized vs normalized formant tracks.
6.3 Lobanov
When applied to other domains of data analysis, Lobanov normalization (Lobanov 1971) is called z-scoring. The scaling (\(s\)) and shifting (\(l\)) parameters for Lobanov normalization are:
\[ s = \frac{1}{\text{sd}(F_i)}; ~~ l=-\text{mean}(F_i) \tag{41}\]
Plugging these parameters into Equation 32, we get the following definition of Lobanov normalization.
\[ F'_{ij} = \frac{1}{\text{sd}(F_i)}(F_{ij}-\text{mean}(F_i)) \tag{42}\]
The shifting parameter will only affect the 0th coefficient, and the scaling parameter will affect all coefficients. Plugging the equivalencies from Equation 26 and Equation 30 into Equation 42, we get the following definition for Lobanov normalizing the DCT coefficients.
\[ y_{0ij} = \frac{1}{\sqrt{2}~\text{sd}(y_{0i})}(y_{0ij} - \text{mean}(y_{0i})) \tag{43}\]
\[ y_{k>0;ij} = \frac{1}{\sqrt{2}~\text{sd}(y_{0i})}y_{k>0;ij} \]
Again, using Equation 43, I Lobanov normalized the DCT coefficients for F1 and F2 for the same two speakers from the Philadelphia neighborhood corpus. Figure 14 compares the resulting average formant tracks when applying the inverse DCT to the normalized coefficients.
6.4 Comparison to point-based normalization
It’s visually clear that none of the normalization methods resulted in complete overlap of the formant trajectories for these two speakers. However, the goal of speaker normalization is to eliminate (socio)linguistically irrelevant differences, while preserving meaningful differences. While different approaches have been taken to evaluating how well any given normalization method achieves this goal (Adank et al. 2004; Barreda 2021b) the goal of this paper is just to define formant track normalization based on formant point normalization methods. As such, we’ll briefly compare the formant track results above to the same point-based methods.
For this comparison, the same two speakers’ formant track data was used, and formant values at 50% of the duration taken as the point measurement. These point measures were then normalized using the \(\Delta\)F, Nearey, and Lobanov methods, as defined above. Then, for the same set of vowels, the average F1 and F2 was estimated. Figure 15 plots the result of all three methods as well as the unnormalized data.
Broadly similar patterns emerge in the normalized point values as did in the normalized formant tracks (e.g. Speaker A’s fronter /uw/ and/ ow/ in all normalizations, and speaker B’s slightly backer /iy/ and /ey/ in the delta F and Nearey normalizations).
6.5 Discussion
In this section, we’ve seen that normalization methods originally defined for formant point measurements can be directly translated into DCT coefficient normalization methods, and that the results on formant tracks are broadly similar to the results on formant points.
To briefly reiterate and refine the process of defining a DCT normalization procedure:
- Any non-linear transformations must be applied to the formant tracks before the DCT is applied.
- Any normalization parameters based on weighted sums (like the mean) or the standard deviation can be estimated using the 0th DCT coefficient.
- Location (\(l\)) parameters can be directly added or subtracted from the 0th DCT coefficient only.
- Scale parameters (\(s\)) need to be multiplied by \(\sqrt{2}\), and applied to all DCT coefficients.
7 Conclusion
The Discrete Cosine Transform is a formant track smoothing method that has already found usage in phonetic research on vowel intrinsic spectral change. The use of DCT coefficients comes with a number of concrete benefits to research on vowel formant dynamics. The large body of research on speaker normalization can be drawn upon to directly normalize DCT coefficients, as approximately equivalent normalization functions can be defined for DCT coefficients based on the definition of the formant normalization functions.
Appendices
8 The scipy implementation of the DCT
The scipy documentation for the DCT describes three ways the DCT can be “normalized”, and two ways the DCT can be “orthogonalized” or “non-orthogonalized”. All of these options on the DCT alter the terms to the left of the sum in the DCT formula. Let’s define and simplify these components.
I’ll use \(S\) to indicate the sum function, which is defined as
\[ S_k(x) = \sum_{j=0}^{N-1} x_j \cos\left(\frac{\pi k(2j + 1)}{2N} \right) \]
This term is unaltered by any of the different options scipy offers. Any given DCT implementation can be given as
\[ y_k = ocS_k(x) \] Where \(o\) is the orthogonalization term and \(c\) is the normalization constant.
The orthogonalization term is the easiest to define.
\[ o = \begin{cases} 1 & \text{if orth = False}\\ \frac{1}{\sqrt{2}} & \text{if orth = True} \end{cases} \]
The scipy documentation provides the mathematical definition for “backward” normalization constant only, but the “forward” normalization can be inferred from its output.
\[ c = \begin{cases} 2 & \text{if norm = backward} \\ \frac{1}{N} & \text{if norm = forward} \end{cases} \]
As a demonstration by example, we can define a python function for just the sum function, then apply it to the formant track in Figure 16.
import numpy as np
from scipy.fft import dct
def S(x, k):
= np.arange(len(x))
j = len(x)
N
= sum(
result * np.cos(
x * k * ((2*j)+1))/
(np.pi 2*N)
(
)
)
return result
We can get the result of the sum function for the 0th and 1st DCT coefficients to then examine the outcome of the different normalizations.
= S(f1, 0)
s_0 = S(f1, 1) s_1
At this point, we can also get the 0th and 1st DCT coefficients from the scipy implementation.
= dct(f1, norm = "backward")
dct_backward = dct(f1, norm = "forward") dct_forward
The normalizing constant for norm = "backward"
is documented to be 2, so multiplying s_0
and s_1
by 2 should be equal to the 0th and 1st coefficients in dct_backward
.
np.array([2 * np.array([s_0, s_1]),
0:2]
dct_backward[ ])
array([[87057.43, 11509.54],
[87057.43, 11509.54]])
If the normalizing constant for norm="forward"
is \(\frac{1}{N}\), dividing s_0
and s_1
by the length of the input vector should be equal to the 0th and 1st coefficients in dct_forward
.
= len(f1)
N
np.array([/ N,
np.array([s_0, s_1]) 0:2]
dct_forward[ ])
array([[690.93, 91.35],
[690.93, 91.35]])
Admittedly, it would be more ideal to be able to reference the actual forward normalization constant from the scipy documentation, but it is not provided.
9 The DCT Basis
While the formula in Equation 1 can be used to calculate the DCT coefficients, the formula to calculate the DCT basis functions in Figure 2 is different. If \(B\) is a matrix of the basis functions, the \(k^{th}\) basis function will be in its columns. To get \(B\), we apply the DCT with backward normalization to an identity matrix \(I\) (that is, a matrix with 1s along the diagonal, and 0s elsewhere). The orthogonalization term \(o\) is included in Equation 44.
\[ B_{jk} = 2o \sum_{j=0}^{N-1} I_{jk} \cos\left(\frac{\pi k(2j + 1)}{2N} \right) \tag{44}\]
This can be quickly implemented using the scipy DCT implementation like so:
= len(f1)
N = dct(np.eye(N), norm = "backward", orthogonalize = True) dct_basis
10 The choice of orthogonalization
The choice of “orthogonalizing” the DCT coefficients, that is, dividing the 0th coefficient by \(\sqrt{2}\), does introduce some awkwardness into the normalization procedures described here. Using the orthogonalized DCT was a design decision within fasttrackpy due to its reliance on regression-based DCT coefficients.
As a practical issue, formant tracking sometimes returns missing, or NA values for some, but not all, time points along a formant track. With missing values, the DCT cannot be directly applied. However, the DCT coefficients can be approximated by linear regression, using the DCT basis as the “predictors”.
= np.linalg.lstsq(dct_basis, f1, rcond=None)[0]
dct_reg = dct(f1, norm = "forward", orthogonalize = True)
dct_direct
np.array([0:3],
dct_reg[0:3]
dct_direct[ ])
array([[488.56, 91.35, -23.14],
[488.56, 91.35, -23.14]])
Orthogonalizing the first coefficient was the only option that resulted in the same coefficients for both regression and direct DCT within the scipy implementation. Without orthogonalizing the first coefficient, the 0th coefficient is not equal between the regression based DCT and direct DCT.
= dct(np.eye(N), norm = "backward", orthogonalize = False)
dct_nonorth_basis
= np.linalg.lstsq(dct_nonorth_basis, f1, rcond=None)[0]
dct_nonorth_reg = dct(f1, norm="forward", orthogonalize=False)
dct_nonorth_direct
np.array([0:3],
dct_nonorth_reg[0:3]
dct_nonorth_direct[ ])
array([[345.47, 91.35, -23.14],
[690.93, 91.35, -23.14]])
Since the design decision to orthogonalize the DCT coefficients was made within fasttrackpy, which was the tool used to arrive at these DCT coefficients in this paper, this was also the version of the DCT used here.
Acknowledgements
I would like to thank Santiago Barreda, Kevin McGowan, Dan Villarreal, Jack Rechsteiner, the journal’s anonymous reviewers, and the audience at NWAV 52 for feedback on this work.