The DiTME Project
Interdisciplinary research in music technology
3. Sound source separation
Sound source separation refers to the task of extracting individual sound sources from some number of mixtures of those sound sources. As an example, consider the task of listening in humans. We have two ears: this means that our auditory cortex receives two sound mixtures, one from each ear. Through complex neural processing, the brain is able to decompose these mixtures into perceptually separate auditory streams. A well-known phenomenon, the ‘Cocktail Party Effect’ (Cherry 1953), illustrates this process in action. In the presence of many speakers, humans are able to attend to a single speaker despite the surrounding environmental noise. In the case of music audition, we are able to identify the pitch, timbre and temporal characteristics of individual sound sources within an ensemble music recording. This ability varies greatly from person to person and can be improved with practice, but it is present to some degree in most people. Even young children singing along to a song on the radio are carrying out some form of sound source separation in order to discern which elements of the music correspond to a singing voice and which do not.
The same problem exists in engineering. A signal is observed which is known to be a mixture of several other signals. The goal is to separate this observed signal into the individual signals of which it is composed. This is the goal of our research. In particular, our research is concerned with separating individual musical sound sources from ensemble music recordings for the purposes of audition, analysis, and transcription. Observing only the mixture (or mixtures) of these instruments, i.e. ‘the song’, we aim to recover each individual sound source present in the song. The applications of source separation include the following.
- Music education: A common problem for amateur musicians is that of identifying exactly which instrument is playing which note or notes in polyphonic music. A sound source separation facility would allow the user to take a standard musical recording such as a song on a compact disc, and extract an individual instrument part.
- Music transcription: Transcription is the process of transforming some set of audio events into some form of notation. In the case of music, it involves creating a musical score from audio. This task is usually carried out by humans and is both expensive and laborious. Computerised music transcription tools do exist, but they are limited to monophonic transcription and are still far from perfect. Sound source separation allows a polyphonic mixture to be decomposed into several monophonic signals, thus allowing current transcription techniques to be applied.
- Audio analysis: In many real-world scenarios, audio recordings are corrupted by unwanted noise from sound sources close to the source of interest. Forensic audio analysis is one such example. Source separation can facilitate the isolation of particular sounds of interest within badly corrupted recordings.
- Remixing and upmixing: Multi-channel audio formats such as Dolby 5.1 and DTS surround sound, which have become standards in the film industry and are gaining ground in the music industry too, are becoming increasingly popular. Upmixing is the process of generating several reproduction channels out of only one or two mixtures. Old films and music for which the multi-track recordings are unavailable could be remastered for modern formats.
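The mixing process described before the list of applications can be made concrete with a small numerical sketch. It assumes instantaneous linear mixing, in which each channel is simply a weighted sum of the sources; real recordings may also involve delays, reverberation and studio effects, which are ignored here. The function name `mix_sources`, the tone frequencies and the pan gains are illustrative assumptions and are not taken from the project.

```python
import numpy as np

def mix_sources(sources, pan_gains):
    """Mix J mono sources into a stereo pair.

    sources:   array of shape (J, N), one row per source.
    pan_gains: J pairs of (left_gain, right_gain), assumed constant over time.
    Returns an array of shape (2, N): row 0 is the left channel, row 1 the right.
    """
    sources = np.asarray(sources, dtype=float)
    gains = np.asarray(pan_gains, dtype=float)   # shape (J, 2)
    return gains.T @ sources                     # weighted sum per channel

# Three hypothetical "instruments", each a pure tone lasting one second.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
sources = np.stack([
    np.sin(2 * np.pi * 220 * t),
    np.sin(2 * np.pi * 330 * t),
    np.sin(2 * np.pi * 440 * t),
])

# Hypothetical pan positions: mostly left, centre, mostly right.
pan_gains = [(0.9, 0.1), (0.5, 0.5), (0.2, 0.8)]

# This is all that a separation algorithm gets to observe.
stereo = mix_sources(sources, pan_gains)
```

A separation algorithm sees only `stereo` and must estimate the three rows of `sources`. With three unknowns and two observations per sample, the system cannot be solved by linear algebra alone, so additional assumptions about the sources are needed.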
3.1 Existing approaches
Currently, the most prevalent approaches to this problem fall into one of two categories: Independent Component Analysis (ICA) (see Hyvarinen 1999 and Casey 2000), and Computational Auditory Scene Analysis (CASA) (see Rosenthal and Okuno 1998). ICA is a statistical source separation method which operates under the assumption that the latent sources are mutually statistically independent and non-Gaussian. In addition, ICA assumes that there are at least as many observation mixtures as there are independent sources. Since we are concerned with musical recordings, we will have at most two observation mixtures, the left and right channels. This makes pure ICA unsuitable for the problem where more than two sources exist. One solution to the degenerate case (where sources outnumber mixtures) is the DUET algorithm (Jourjine et al. 2000; Rickard et al. 2001). This approach assumes that latent sources are disjoint orthogonal in the time-frequency domain, that is, that no two sources have significant energy in the same time-frequency region. This assumption holds reasonably well for speech signals but not for musical signals, since Western classical music is based on harmony, which implies a significant amount of time-frequency overlap. CASA methods, on the other hand, attempt to decompose a sound mixture into auditory events which are then grouped according to perceptually motivated heuristics (Bregman 1990), such as common onset and offset of harmonically related components, or frequency and amplitude co-modulation of components.
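The two limitations discussed above can be illustrated with a minimal sketch. It is not an implementation of ICA or DUET: the mixing matrix is assumed to be known, whereas both methods must estimate it blindly; the masking step uses a single FFT frame and amplitude ratios only, whereas DUET works on a time-frequency representation and also uses inter-channel delay. The names `tone`, `linear_unmix`, `mask_unmix` and `max_error`, the pan gains and the tone frequencies are all illustrative assumptions.

```python
import numpy as np

def tone(freq_hz, n_samples=8000, sample_rate=8000):
    """Sine tone whose frequency falls exactly on an FFT bin (1 Hz spacing here)."""
    return np.sin(2 * np.pi * freq_hz * np.arange(n_samples) / sample_rate)

def linear_unmix(stereo, mixing):
    """Best linear source estimate given a KNOWN mixing matrix of shape (2, J)."""
    return np.linalg.pinv(mixing) @ stereo

def mask_unmix(stereo, mixing, eps=1e-12):
    """Give each frequency bin to the source whose pan position matches it best.

    Simplified binary masking: one FFT frame, amplitude ratio only, and the
    pan gains in `mixing` are assumed known rather than estimated.
    """
    left, right = np.fft.rfft(stereo[0]), np.fft.rfft(stereo[1])
    # Pan position in [0, 1]: 0 is hard left, 1 is hard right.
    source_pos = mixing[1] / mixing.sum(axis=0)
    bin_pos = np.abs(right) / (np.abs(left) + np.abs(right) + eps)
    owner = np.argmin(np.abs(bin_pos[None, :] - source_pos[:, None]), axis=0)
    estimates = []
    for j in range(mixing.shape[1]):
        # Read the source from the channel where it is louder and undo its gain.
        if mixing[0, j] >= mixing[1, j]:
            channel, gain = left, mixing[0, j]
        else:
            channel, gain = right, mixing[1, j]
        spectrum = np.where(owner == j, channel, 0) / gain
        estimates.append(np.fft.irfft(spectrum, n=stereo.shape[1]))
    return np.stack(estimates)

def max_error(estimate, truth):
    """Largest absolute sample error between estimated and true sources."""
    return float(np.max(np.abs(estimate - truth)))

# Columns are hypothetical (left, right) pan gains for three sources.
mixing = np.array([[0.9, 0.5, 0.2],
                   [0.1, 0.5, 0.8]])

# Sources with no shared frequencies, and sources where the first one has a
# harmonic at 440 Hz that collides with the third source.
disjoint = np.stack([tone(220), tone(330), tone(440)])
overlapping = np.stack([tone(220) + 0.5 * tone(440), tone(330), tone(440)])

# Two sources, two mixtures: the 2x2 mixing matrix is invertible.
two_source_error = max_error(
    linear_unmix(mixing[:, :2] @ disjoint[:2], mixing[:, :2]), disjoint[:2])

# Three sources, two mixtures: no linear unmixing can recover all three.
three_source_error = max_error(linear_unmix(mixing @ disjoint, mixing), disjoint)

# Masking handles the degenerate case only while the sources stay disjoint.
mask_disjoint_error = max_error(mask_unmix(mixing @ disjoint, mixing), disjoint)
mask_overlap_error = max_error(mask_unmix(mixing @ overlapping, mixing), overlapping)
```

In this toy setting, `two_source_error` and `mask_disjoint_error` are at the level of rounding error, while `three_source_error` and `mask_overlap_error` are large. The last case fails because the shared 440 Hz bin has a pan position that matches neither contributing source, which is the kind of harmonic overlap that is common in music.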