The DiTME Project
Interdisciplinary research in music technology
2. Audio time-scale modification
Audio time-scale modification (TSM) is an audio effect that speeds up or slows down an audio signal, i.e. alters its duration, without affecting its perceived local pitch and timbral characteristics. In other words, the duration of the original signal is increased or decreased, but the perceptually important features of the original signal remain unchanged. In the case of speech, the time-scaled signal sounds as if the original speaker had spoken at a quicker or slower rate. In the case of music, the time-scaled signal sounds as if the musicians had played at a different tempo. Transforming audio to an alternative time-scale is a popular and useful digital audio effect that has become a standard tool within many audio multi-processing applications.
In addition to music teaching and learning, TSM has numerous applications, including:
- accelerated aural reading for the blind
- music composition
- audio data compression
- text-to-speech synthesis
- audio watermarking
- fast browsing of speech material for digital libraries and distance learning.
Time-scale modification algorithms fall into two broad categories: time-domain and frequency-domain. Time-domain techniques are computationally efficient and produce high-quality results for single-pitched signals such as speech and monophonic music, but do not cope well with more complex signals such as polyphonic music. Frequency-domain techniques are less computationally efficient, but they have proven more robust, producing high-quality results for a wide variety of signals. Their main drawback is that they can introduce a reverberant or 'phasy' artefact into the output signal.
In his Ph.D. research on audio time-scale modification, David Dorran focused on incorporating aspects of time-domain techniques into frequency-domain techniques in an attempt to reduce the reverberant artefact and lower the computational demands.
2.1 Time-domain techniques
In basic terms, time-domain techniques operate by discarding or repeating suitable segments of the input waveform. This process is illustrated in Figure 1, in which a quasi-periodic waveform is time-scale compressed (reduced in duration) by discarding four periods of the original waveform. Time-scale expansion can be achieved in a similar manner through repetition of short segments of the original waveform (Figure 1). This example may appear somewhat trivial, as it applies only to a very short sound (the original is an oboe sound of approximately 100 ms duration) that has strong periodic characteristics; however, a significant number of everyday sounds change relatively slowly over time and can therefore be considered quasi-periodic over any 50 ms stretch of the waveform. One query that often arises with regard to the periodicity of sounds concerns noise-like elements of a waveform, such as the ‘s’ and ‘ch’ parts of the word ‘speech’ or the onset of a note on a particular instrument. It is often argued that such sounds do not contain a distinct period and that the discard/repeat process is therefore not appropriate for them; however, they can be considered periodic in the sense that the noise-like sound exists for a significant duration of time and can be viewed as the repetition of a very short noise segment over that duration. Discarding or repeating short segments of these sounds will therefore also result in time-scale expansion or compression, even though they are not periodic in the strictest sense of the word.
Given the assumption of quasi-periodicity, the problem of time-scaling in the time-domain then falls into two areas: firstly, the identification of the local pitch period and secondly, identification of which segments of the original waveform to discard/repeat.
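The discard principle can be sketched in a few lines. The following Python sketch (the function and parameter names are hypothetical, not from the original work) compresses a pure tone by discarding one whole pitch period at regular intervals; a practical algorithm would also cross-fade at each join and track the local period rather than fixing it:

```python
import numpy as np

def compress_by_discard(x, period, keep_len):
    """Time-scale compress a quasi-periodic signal by keeping keep_len
    samples, then skipping one whole pitch period, repeatedly."""
    out, pos = [], 0
    while pos + keep_len + period <= len(x):
        out.append(x[pos:pos + keep_len])   # keep this stretch
        pos += keep_len + period            # then jump over one period
    out.append(x[pos:])                     # keep any remaining tail
    return np.concatenate(out)

sr = 8000
t = np.arange(sr) / sr                      # one second of audio
x = np.sin(2 * np.pi * 100 * t)             # 100 Hz tone: period = 80 samples
y = compress_by_discard(x, period=80, keep_len=320)
# one period removed per 400 input samples -> output is 80% of the original length
```

Because whole periods are removed, each join is phase-continuous for a strictly periodic tone, so no discontinuity (and hence no audible click) is introduced; for real quasi-periodic signals the joins are only approximately continuous, which is why practical implementations cross-fade.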
Identification of the local pitch period has received a significant amount of interest within the research community, since it also forms an important part of a number of other applications such as speaker recognition and music transcription (Kim et al. 2004; Plumbley et al. 2002). Pitch period detection algorithms are also used in other disciplines, including biomedical signal analysis for the detection of heart rate. Existing pitch period detection algorithms tend to suffer from what are referred to as ‘octave errors’: if the actual pitch period were 3 ms, for example, the algorithm might inadvertently detect a period of 6 ms, 9 ms or 12 ms, i.e. an integer multiple of the true period. However, this particular problem does not affect the quality produced by time-scaling algorithms, since the quality of the output is unaffected regardless of whether one, two or three periods of the waveform are discarded. The number of periods discarded/repeated does, however, affect the next location for discarding/repeating the ensuing waveform segment.
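The text does not commit to a particular pitch detector, but autocorrelation is one common approach and makes the octave-error behaviour easy to see: the autocorrelation of a periodic frame peaks at every integer multiple of the period, and a detector can lock onto any of those peaks. A minimal sketch, with illustrative lag bounds:

```python
import numpy as np

def estimate_period(frame, min_lag, max_lag):
    """Estimate the local pitch period (in samples) of a frame by
    picking the largest autocorrelation peak in [min_lag, max_lag]."""
    frame = frame - frame.mean()                       # remove DC offset
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    return min_lag + int(np.argmax(ac[min_lag:max_lag + 1]))

sr = 8000
t = np.arange(int(0.05 * sr)) / sr                     # a 50 ms analysis frame
frame = np.sin(2 * np.pi * 250 * t)                    # 250 Hz -> period of 32 samples
p = estimate_period(frame, min_lag=20, max_lag=160)
# the detector may return 32 or, with a different search range, an integer
# multiple of it (an 'octave error') -- either suffices for discard/repeat TSM
```

As the surrounding text notes, an octave error here is harmless for time-scaling: discarding a 64-sample segment of this tone removes two whole periods and still leaves the joins continuous.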
The location of the discard/repeat segments depends principally upon the desired time-scale factor and upon the duration of the segment that can be discarded/repeated. For speech, Portnoff (1981) notes that the length of each discarded/repeated segment should be longer than one pitch period (typically 4 to 20 ms) but shorter than the length of a phoneme (approximately 40 ms); these values have also been found to produce good results for music. If the duration of every segment discarded/repeated were the same, for example 10 ms, the time-scaling procedure would be very straightforward: to time-scale expand by 25 per cent, one 10 ms segment would be repeated every 40 ms; to time-scale compress by 10 per cent, one 10 ms segment would be discarded every 100 ms. In practice, since the duration of the segment being discarded or repeated must vary with the local pitch period, a slightly more complicated procedure is employed. The exact method varies from algorithm to algorithm, but all effectively keep track of the duration of the segments previously discarded/repeated. If, for example, a large segment (say 16 ms) has been discarded in a particular iteration of the algorithm, the segment discarded in the next iteration might be limited to 4 ms, thereby preserving the overall time-scaling at a global level; small variations in time-scale at a local level are not generally perceived to be objectionable.
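The bookkeeping described above can be sketched as a running deficit: at each candidate position, compare how much signal *should* have been removed so far against how much actually has been, and discard a period only when the schedule has fallen behind. The helper below is hypothetical and fixes the period for simplicity, whereas a real algorithm would re-estimate it locally at each step:

```python
def discard_positions(num_samples, period, scale):
    """Return input positions at which one pitch period should be
    discarded so the output length tracks scale * num_samples
    (scale < 1 compresses). Illustrative bookkeeping only."""
    positions, removed = [], 0
    for pos in range(0, num_samples - period, period):
        target_removed = (1.0 - scale) * pos   # samples due for removal by here
        if removed < target_removed:           # schedule has fallen behind
            positions.append(pos)
            removed += period                  # one whole period discarded
    return positions

# compress one second at 8 kHz (period = 80 samples) to 80% duration:
# 20 discards of 80 samples each, spaced roughly every 400 input samples
pts = discard_positions(8000, period=80, scale=0.8)
```

Because the comparison uses the accumulated total rather than a fixed interval, the same loop keeps the global scale correct even when the discarded segments vary in length, which is the point made in the paragraph above.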
The procedure outlined in the previous paragraph works well for signals that do not contain strong transient components, and is also extremely efficient in terms of computational demands. Additional care is required when transients, such as drum sounds, occur. The reason for this special treatment is that transients, by definition, exist for very short periods of time, i.e. less than 5 ms. If a transient segment is discarded or repeated, the result is extremely objectionable: consider the effect of removing the start of a snare drum – it would no longer sound like a snare. For this reason, time-scaling algorithms typically include a transient detection component that ensures this problem does not arise.
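The text does not specify a transient detector, but a crude energy-based one illustrates the idea: flag any analysis frame whose short-time energy jumps sharply relative to the previous frame, and exclude flagged frames from the discard/repeat process. The frame size and threshold below are illustrative, not taken from the original work:

```python
import numpy as np

def find_transient_frames(x, frame=64, ratio=4.0):
    """Flag frames whose short-time energy exceeds `ratio` times the
    previous frame's energy -- a crude onset/transient detector."""
    n = len(x) // frame
    energy = np.array([np.sum(x[i * frame:(i + 1) * frame] ** 2)
                       for i in range(n)])
    # small epsilon avoids division issues when the previous frame is silent
    return [i for i in range(1, n) if energy[i] > ratio * (energy[i - 1] + 1e-12)]

# a silent signal with a single click: only the frame containing the
# click (frame index 8, samples 512-575) should be flagged
x = np.zeros(1000)
x[512:576] = 1.0
hits = find_transient_frames(x)
```

A time-scaling algorithm would then simply refuse to place a discard/repeat boundary inside a flagged frame, so that transients such as drum onsets pass through exactly once and unaltered.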