The DiTME Project
Interdisciplinary research in music technology
2.2 Frequency-domain techniques (sinusoidal modelling and the phase vocoder)
The second technique adapted is that of sinusoidal modelling, which operates on the principle that an audio signal can be modelled by the sum of a number of quasi-sinusoidal waveforms that change slowly in both amplitude and frequency over time. The number of sinusoidal waveforms (or sinusoidal tracks) required to represent a particular sound accurately depends on the type of sound being analysed. For example, the steady-state portion of a flute note could be well represented by only three or four tracks, whilst a timbrally rich piano would require many more. Figure 2 illustrates how an 11 ms segment of a flute waveform can be modelled by four sinusoidal tracks. Although the illustration shows a single-pitched example, a sinusoidal model can also represent more complex sound signals.
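The summation of tracks can be sketched in a few lines of Python. This is only an illustrative sketch: the function name synthesize_tracks, the linear interpolation of amplitude and frequency, and the four harmonic tracks at 440 Hz are assumptions made for the example, not values taken from the project or from Figure 2.

```python
import math

def synthesize_tracks(tracks, duration, sample_rate=44100):
    """Sum quasi-sinusoidal tracks into one waveform.

    Each track is a tuple (amp_start, amp_end, freq_start_hz, freq_end_hz,
    initial_phase). Amplitude and frequency are interpolated linearly over
    the segment (an assumption made for this sketch).
    """
    n = int(round(duration * sample_rate))
    out = [0.0] * n
    for a0, a1, f0, f1, phi0 in tracks:
        phase = phi0
        for i in range(n):
            frac = i / max(n - 1, 1)
            amp = a0 + (a1 - a0) * frac
            freq = f0 + (f1 - f0) * frac
            out[i] += amp * math.cos(phase)
            # Advance the phase by the instantaneous frequency.
            phase += 2.0 * math.pi * freq / sample_rate
    return out

# Hypothetical steady-state tone: a fundamental and three harmonics.
flute_like = [
    (0.50, 0.50, 440.0, 440.0, 0.0),
    (0.25, 0.25, 880.0, 880.0, 0.0),
    (0.12, 0.12, 1320.0, 1320.0, 0.0),
    (0.06, 0.06, 1760.0, 1760.0, 0.0),
]
segment = synthesize_tracks(flute_like, 0.011)  # an 11 ms segment
```

A richer sound would simply use a longer list of tracks, each with its own amplitude and frequency trajectory.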
The benefit of representing a complex sound through sinusoids is that the sinusoidal tracks can easily be expressed as mathematical functions and manipulated accordingly. Time-scaling via sinusoidal modelling then becomes the process of extending or compressing each individual sinusoidal track prior to summation. This could be achieved through the use of the time-domain techniques described above, but is generally achieved through mathematical synthesis of sinusoidal magnitude and phase values. Because the sinusoidal model is capable of representing complex multi-pitch sounds, it can also be used to time-scale these types of sounds, and it therefore overcomes the limitations of time-domain algorithms.
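As a rough illustration of extending a single track by synthesis, the sketch below stretches the spacing between analysis frames by a factor alpha and re-integrates the phase from the interpolated frequency, so that duration changes while pitch does not. The frame-based representation (one amplitude and one frequency per frame), the function name synthesize_track and the parameter values are assumptions for the example.

```python
import math

def synthesize_track(amps, freqs, hop, sample_rate, alpha=1.0):
    """Resynthesise one sinusoidal track with its frame spacing scaled by alpha.

    amps[k] and freqs[k] are the amplitude and frequency (Hz) measured at
    analysis frame k; frames are `hop` samples apart in the original signal.
    alpha > 1 lengthens the output, alpha < 1 shortens it.
    """
    out_hop = max(1, int(round(hop * alpha)))
    out = []
    phase = 0.0
    for k in range(len(amps) - 1):
        for i in range(out_hop):
            frac = i / out_hop
            amp = amps[k] + (amps[k + 1] - amps[k]) * frac
            freq = freqs[k] + (freqs[k + 1] - freqs[k]) * frac
            out.append(amp * math.cos(phase))
            # The phase is accumulated from frequency, so pitch is preserved.
            phase += 2.0 * math.pi * freq / sample_rate
    return out

# Hypothetical steady 100 Hz track, five frames, 100 samples apart at 8 kHz.
amps = [1.0] * 5
freqs = [100.0] * 5
original = synthesize_track(amps, freqs, hop=100, sample_rate=8000)
stretched = synthesize_track(amps, freqs, hop=100, sample_rate=8000, alpha=2.0)
```

Here the stretched track is twice as long as the original but still completes one cycle every 80 samples. A full system would apply this to every track and sum the results.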
The principal difficulty with sinusoidal modelling techniques is obtaining an accurate sinusoidal representation of the signal in the first place, which remains an active area of interest within the research community. In general, a reasonable representation can be obtained using short-time Fourier analysis. This can yield a perceptually accurate representation if no modifications are applied, but it can introduce objectionable artefacts when time-scaling is applied. The primary cause of these artefacts is a loss of phase coherence between sinusoidal tracks, which is perceived as a reverberant effect in the time-scaled signal. Phase coherence is lost because of slight inaccuracies in determining the exact frequency of each sinusoidal track at each instant in time. These inaccuracies will always be present because of the time-frequency uncertainty principle (analogous to Heisenberg's uncertainty principle in quantum mechanics).
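The frequency inaccuracy can be made concrete with a small sketch. Assuming a Hann-windowed frame of N samples analysed with a plain discrete Fourier transform, picking the largest bin can only locate a sinusoid to the nearest multiple of sample_rate / N. The function name dft_peak_frequency and the test signal are assumptions for the example; practical systems refine this estimate (for instance by interpolating between bins), but some residual error remains.

```python
import cmath
import math

def dft_peak_frequency(frame, sample_rate):
    """Return the centre frequency (Hz) of the largest DFT bin of a frame."""
    n = len(frame)
    # Hann window to reduce leakage between neighbouring bins.
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):
        s = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                for i in range(n))
        if abs(s) > best_mag:
            best_bin, best_mag = k, abs(s)
    return best_bin * sample_rate / n

sample_rate = 8000
n = 256
true_freq = 1010.0  # hypothetical test tone, deliberately off a bin centre
frame = [math.cos(2.0 * math.pi * true_freq * i / sample_rate) for i in range(n)]

estimate = dft_peak_frequency(frame, sample_rate)
bin_spacing = sample_rate / n  # 31.25 Hz for these values
```

A longer frame narrows the bin spacing but averages over more time, so frequency resolution is gained at the expense of time resolution. Small frequency errors of this kind accumulate as phase errors when each track is extended.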
A related method, similar to sinusoidal modelling, is the phase vocoder. While the sinusoidal model attempts to extract a relatively small number of perceptually dominant sinusoidal tracks from a sound, the phase vocoder extracts a relatively large, fixed number of sinusoids from a sound via a filterbank. The principle of extending or compressing each sinusoidal term in order to time-scale remains the same for both techniques. The advantage of the phase vocoder is that it is more robust than the sinusoidal model, since it does not require any rules to track or extract sinusoidal components. However, the filtering process employed by the phase vocoder introduces interference terms that can be problematic. The last ten years have seen a merging of the two techniques to resolve these issues (see Laroche and Dolson 1999a).
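The per-channel phase update at the heart of a standard phase vocoder can be sketched as follows. This shows the textbook formulation rather than any specific implementation referred to here; the function names and the parameters n_fft, hop_analysis and hop_synthesis are assumptions for the example. Each filterbank channel k estimates its instantaneous frequency from the phase advance between two analysis frames, then advances its synthesis phase by that frequency over the (different) synthesis hop.

```python
import math

def wrap(x):
    """Wrap an angle in radians into the range [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi

def phase_vocoder_step(prev_phase, cur_phase, prev_synth_phase,
                       k, n_fft, hop_analysis, hop_synthesis):
    """Update one channel of a phase vocoder.

    Returns (instantaneous frequency in rad/sample, new synthesis phase).
    """
    omega_k = 2.0 * math.pi * k / n_fft  # channel centre frequency
    # Deviation of the measured phase advance from the expected advance.
    deviation = wrap(cur_phase - prev_phase - omega_k * hop_analysis)
    inst_freq = omega_k + deviation / hop_analysis
    synth_phase = prev_synth_phase + inst_freq * hop_synthesis
    return inst_freq, synth_phase

# Hypothetical sinusoid at 0.6 rad/sample, observed in channel 6 of 64.
true_freq = 0.6
hop_a, hop_s = 16, 32  # synthesis hop twice the analysis hop: 2x stretch
prev_phase = 0.3
cur_phase = wrap(prev_phase + true_freq * hop_a)

inst_freq, synth_phase = phase_vocoder_step(
    prev_phase, cur_phase, 0.3, k=6, n_fft=64,
    hop_analysis=hop_a, hop_synthesis=hop_s)
```

Because every channel is updated independently, no tracking rules are needed. The same independence means that a sinusoid spread across several neighbouring channels can drift out of phase with itself.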
2.3 Hybrid technique
From the previous two sections, it can be appreciated that time-domain techniques are efficient but rely on the presence of a strong periodic element within the waveform being time-scaled in order to produce high-quality results. Frequency-domain techniques are more robust, in that they can be applied to more general signals, but they are less computationally efficient and introduce an objectionable reverberant artefact into the time-scaled output. A hybrid approach, developed by David Dorran (2005), attempts to combine the benefits of the time-domain and frequency-domain approaches in order to improve the quality of the output and reduce computational demands.
The hybrid technique takes advantage of a degree of flexibility that exists in the choice of phase used during synthesis of each sinusoidal track within frequency-domain approaches. A thorough mathematical analysis shows that deviating from the mathematically 'ideal' phase values results in amplitude and frequency modulations entering each sinusoidal component. However, an empirical psycho-acoustic analysis (Zwicker and Fastl 1999) has shown that the human auditory system is insensitive to slight modulations in both amplitude and phase. Using these results, the maximum phase deviation (or tolerance) that can be applied without introducing audible artefacts has been established. This phase tolerance can then be used to 'push or pull' the sinusoidal tracks back into a phase-coherent state, thereby removing the reverberant artefact associated with frequency-domain techniques. The set of target or 'coherent' phases is taken from the original signal, since these phases are guaranteed to preserve the phase relationship between sinusoids without the introduction of reverberation. The choice of target phases is extremely important: a 'good' set of target phases reduces the time taken for the sinusoidal tracks to move from being out of phase to being back in perfect phase coherence, and a shorter transition time reduces the amount of reverberation introduced. The technique used to identify the best set of target phases is based upon correlation, which is also used within time-domain techniques to identify the local pitch period.
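The 'push or pull' idea can be illustrated with a toy sketch for a single track. It is not the published algorithm: the function names, the per-frame clamping rule and the tolerance of 0.25 radians are all assumptions chosen for the example, and the actual tolerance derived from the psycho-acoustic results is not stated here. The sketch only shows why a target phase that is already close to the current phase shortens the transition back to coherence.

```python
import math

def wrap(x):
    """Wrap an angle in radians into the range [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi

def pull_phase(current, target, tolerance):
    """Move `current` toward `target` by at most `tolerance` radians."""
    diff = wrap(target - current)
    step = max(-tolerance, min(tolerance, diff))
    return current + step

def frames_to_lock(current, target, tolerance, max_frames=1000):
    """Count the frames needed before the track reaches the target phase."""
    frames = 0
    while abs(wrap(target - current)) > 1e-9 and frames < max_frames:
        current = pull_phase(current, target, tolerance)
        frames += 1
    return frames

tolerance = 0.25  # hypothetical inaudible phase deviation per frame (radians)
far_target = frames_to_lock(0.0, 1.0, tolerance)   # poorly matched target
near_target = frames_to_lock(0.0, 0.5, tolerance)  # well matched target
```

In this toy setting the well-matched target is reached in half as many frames, which is the motivation for searching the original signal for the set of phases that best matches the current synthesis state.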
The current implementation of the hybrid system is particularly efficient for relatively small time-scale factors. Figure 3 illustrates its computational advantage when compared to an improved phase vocoder (Laroche and Dolson 1999b), an implementation of the phase vocoder that draws on sinusoidal modelling techniques.
Subjective listening tests have also shown that the hybrid approach produces a higher quality of output than frequency-domain techniques for speech signals. No significant improvement was observed for music signals. This was attributed to the fact that music generally contains more reverberation than speech, so the introduction or reduction of a relatively small amount of reverberation is not objectionable. Tables 1 and 2 present the results obtained from 14 subjective listening tests. It can be seen that the algorithm is both robust and efficient, and that it produces high-quality results for both speech and a wide range of polyphonic audio. These attributes make it particularly suitable for the time-scale modification of general audio where no prior knowledge of the input signal exists, for example during the time-scale modification of movies or television and radio adverts, in which speech, music, or both are typically present.