The DiTME Project
Interdisciplinary research in music technology
3.2 Azimuth Discrimination and Resynthesis
In the following section, we present a novel sound source separation algorithm called ADRess (Azimuth Discrimination and Resynthesis), which was developed at DIT in 2003 (Barry et al. 2004a and 2004b). The algorithm, which requires no prior knowledge or learning, performs separation based purely on the lateral displacement of a source within the stereo field; in other words, the position of the sound source between the left and right speakers. The algorithm exploits the use of the ‘pan pot’ as a means of achieving image localisation within stereophonic recordings. As such, only an interaural intensity difference exists between the left and right channels for a single source. Gain scaling and phase cancellation techniques are used to expose frequency-dependent nulls across the azimuth domain, from which source separation and resynthesis are carried out.
Since the advent of multi-channel recording systems in the early 1960s, most musical recordings have been made in the same fashion: N sources are recorded individually, then summed and distributed across two channels using a mixing console. Image localisation, referring to the apparent position of a particular instrument in the stereo field, is achieved by using a panoramic potentiometer. This device allows a single sound source to be divided into two channels with continuously variable intensity ratios (Eargle 1969). By virtue of this, a single source may be virtually positioned at any point between the speakers. Localisation in this case is therefore achieved by creating an interaural intensity difference (IID), a well-known phenomenon (Rayleigh 1875). The pan pot was devised to simulate IIDs by attenuating the source signal fed to one reproduction channel, causing it to be localised more in the opposite channel. This means that, for any single source in such a recording, the phase of the source is coherent between left and right, and only its intensity differs. It is precisely this feature that enables us to perform separation. Figure 4 shows a typical scenario for panning multiple sources in popular music.
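The intensity-only panning described above can be illustrated with a minimal sketch. It is not taken from the ADRess implementation: the function names, the 44.1 kHz sample rate, the test tones and the gain values are all assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 44100  # assumed sample rate, not stated in the text


def tone(freq_hz, n_samples):
    """Generate a hypothetical test tone (a plain sine wave)."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]


def pan(source, left_gain, right_gain):
    """Split a mono source into (left, right) using fixed intensity gains.

    Both channels carry the same waveform, so the phase is coherent and
    only the intensity differs, as with a pan pot.
    """
    left = [left_gain * s for s in source]
    right = [right_gain * s for s in source]
    return left, right


def mix(panned_sources):
    """Sum several (left, right) pairs sample by sample into one stereo mix."""
    n = len(panned_sources[0][0])
    left = [sum(p[0][i] for p in panned_sources) for i in range(n)]
    right = [sum(p[1][i] for p in panned_sources) for i in range(n)]
    return left, right


# Two hypothetical sources panned to different positions.
bass = pan(tone(110.0, 256), 1.0, 0.42)   # mostly in the left channel
sax = pan(tone(440.0, 256), 0.3, 1.0)     # mostly in the right channel
left_mix, right_mix = mix([bass, sax])
```

For the `bass` source alone, every right-channel sample is 0.42 times the corresponding left-channel sample, which is the fixed intensity ratio that the separation relies on.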
3.2.2 Method used in ADRess
A stereo recording contains two channels only (typically left and right), but any number of sources can be virtually positioned between the left and right speakers by varying the relative amplitude in each channel for a particular source. The problem is then to recover an arbitrary number of sources from only two mixtures. In order to achieve source separation in ADRess, a raised cosine window is applied to a frame of 4,096 samples of audio in each channel. A Fast Fourier Transform (FFT) is then performed, taking us into the complex frequency domain. This yields 2,048 linearly spaced discrete frequency bands of width 10.71 Hz. For each band, iterative gain scaling is applied to one channel so that a source’s intensity becomes equal in both left and right channels. Subtracting the complex values of the two channels in each band at this point will cause that source to approach a local minimum due to phase cancellation. The cancelled source is then recovered by creating a ‘frequency-azimuth’ plane, which is analysed for local minima along the azimuth axis. These local minima represent points at which some gain scalar caused phase cancellation. It is observed that, at the point where an instrument cancels, only the frequency components which it contains will show a local minimum. The magnitude and phase of these minima are then estimated, and an IFFT in conjunction with an overlap-add scheme is used to resynthesise the cancelled instrument. This process is carried out on every frame of audio, independently for the left and right channels, for all time. Figure 5 shows this process in action for a single frequency band centred on K = 110 Hz. In this example, the left channel is scaled from 1 down to 0 in discrete steps of 0.01. At each iteration, the scaled complex value of the Kth band in the left channel is subtracted from the complex value of the same band in the right channel. The modulus of the result is then taken, as shown in the plot below.
At some point, this value approaches a minimum, in this case when the gain scalar is 0.42. This signifies that a source is present at this location in stereo space. The magnitude of the component for that source is calculated as A = Kmax - Kmin, where Kmax and Kmin are the largest and smallest values of this curve for band K. This is repeated for all bands, as shown in Figure 5.
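The gain sweep for a single band can be sketched as follows. This is an illustrative reading of the description above rather than the authors' code: the function names and the example bin value are hypothetical, and Kmax and Kmin are assumed to be the maximum and minimum of the swept curve.

```python
import cmath


def azimuth_sweep(left_bin, right_bin, steps=100):
    """Return |right_bin - g * left_bin| for g = 0, 1/steps, ..., 1."""
    return [abs(right_bin - (i / steps) * left_bin)
            for i in range(steps + 1)]


def find_null(curve):
    """Return (index of the minimum, magnitude estimate A = max - min)."""
    k_min = min(curve)
    k_max = max(curve)
    return curve.index(k_min), k_max - k_min


# Hypothetical complex FFT value of one source in band K.
source_bin = cmath.rect(2.0, 0.7)   # magnitude 2.0, phase 0.7 rad (assumed)

# The source is panned so that the right channel is 0.42 times the left.
left_bin = 1.0 * source_bin
right_bin = 0.42 * source_bin

curve = azimuth_sweep(left_bin, right_bin)
null_index, magnitude = find_null(curve)
null_gain = null_index / 100
```

Here the curve reaches its minimum at a gain scalar of 0.42, matching the panning ratio used to build the example, because scaling the left bin by that factor makes it identical to the right bin and the subtraction cancels it.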
In order to show how frequency components belonging to a single source are clustered on the azimuth axis, two sources were synthesised, each containing five non-overlapping partials. Each source was panned to a unique location left of centre in the stereo field. Figure 6 shows the frequency-azimuth plane created by ADRess to recover these sources. Frequency is depicted along the Y axis and azimuth along the X axis, with amplitude represented by colour intensity.
It can be seen that the five frequency components from each source have their minima clustered along the azimuth axis. The frequency-azimuth spectrogram shows the location of sources according to the cancellation points along the azimuth axis but, in order to resynthesise, we need to invert these nulls, since the amount of energy lost through cancellation is proportional to the actual energy contributed by the source. When the nulls are inverted, we get a more intuitive representation of each individual source, as demonstrated in Figure 7.
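A small sketch of building the frequency-azimuth plane for one frame and inverting its nulls is given below. The function names, the four example bins and the panning ratios are hypothetical, and the inversion rule (keep only the minimum of each band and replace it with max minus min) is one plausible reading of the text, not a confirmed detail of ADRess.

```python
def frequency_azimuth_plane(left_bins, right_bins, steps=100):
    """One row per frequency band, one column per gain step g = i / steps."""
    return [
        [abs(r - (i / steps) * l) for i in range(steps + 1)]
        for l, r in zip(left_bins, right_bins)
    ]


def invert_nulls(plane):
    """Keep only each row's null, replacing it with (row max - row min)."""
    inverted = []
    for row in plane:
        k_min, k_max = min(row), max(row)
        null_at = row.index(k_min)
        inverted.append(
            [k_max - k_min if i == null_at else 0.0 for i in range(len(row))]
        )
    return inverted


# Hypothetical frame: four bands, two sources with non-overlapping partials.
# Source A occupies bands 0 and 2 (right = 0.25 * left);
# source B occupies bands 1 and 3 (right = 0.75 * left).
left_bins = [1 + 1j, 2 - 1j, 0.5 + 2j, -1 + 0.5j]
ratios = [0.25, 0.75, 0.25, 0.75]
right_bins = [g * l for g, l in zip(ratios, left_bins)]

plane = frequency_azimuth_plane(left_bins, right_bins)
inverted = invert_nulls(plane)

# Azimuth column holding the inverted null of each band.
null_columns = [row.index(max(row)) for row in inverted]
```

In this example the bands belonging to source A place their inverted nulls in column 25 and those of source B in column 75, so partials from the same source line up along the azimuth axis.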
Figure 5 illustrates how ADRess decomposes the left channel mixture in order to reveal the frequency composition of the latent sources. It should be borne in mind that the plots in Figure 6 and Figure 7 represent the decomposition of a single frame of audio data; as each consecutive frame is processed, the composition of each source will change in both frequency and amplitude, but in the majority of cases the source position (azimuth) in the stereo field will not. It is for this very reason that azimuth is used as the cue to identify each source. By summing the energy at all frequencies for each point along the azimuth axis, an energy distribution plot emerges, and by doing this for all time frames, a time-azimuth plot is obtained, as shown in Figure 8. Figure 8 shows source activity in the stereo field with respect to time. A similar two-dimensional visualisation, updated in real time, is presented to the user in the real-time application in order to indicate source positions.
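The time-azimuth summation can be sketched as below. The data are made up, with only five azimuth positions per frame for brevity, and taking energy as the squared magnitude is an assumption, since the text does not say how energy is computed.

```python
def azimuth_energy(inverted_plane):
    """Sum squared magnitudes over all frequency bands per azimuth position."""
    n_positions = len(inverted_plane[0])
    return [sum(row[i] ** 2 for row in inverted_plane)
            for i in range(n_positions)]


def time_azimuth(frames):
    """Stack the azimuth energy distribution of every frame, in time order."""
    return [azimuth_energy(frame) for frame in frames]


# Two hypothetical frames of inverted nulls: rows are frequency bands,
# columns are azimuth positions. The spectral content changes between
# frames, but the sources stay at azimuth positions 1 and 3.
frame_0 = [
    [0.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0, 0.0],
]
frame_1 = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 0.0],
]

plot = time_azimuth([frame_0, frame_1])

# Azimuth positions showing any activity in each frame.
active = [[i for i, e in enumerate(frame) if e > 0.0] for frame in plot]
```

Although the energy at each position differs from frame to frame, the active positions stay the same, which is the property that makes azimuth usable as the identifying cue.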
The algorithm has been shown to work for a wide variety of musical recordings, some examples of which can be found at http://eleceng.dit.ie/dbarry/audio.html. The time domain plots in Figure 9 show the separation results achieved for a jazz recording containing saxophone, bass, drums and piano.