Annotated bibliography for ISMIR 2002 tutorial "Music Information Retrieval for Audio Signals"

This annotated bibliography was written to support a tutorial presented at the International Conference on Music Information Retrieval (ISMIR) in Paris, 2002. The purpose of the tutorial is to provide a current overview of and introduction to the main problems, challenges, and solutions for the retrieval and analysis of musical signals in audio format.

The selection, classification and annotation of the papers was done by the author for the purposes of the tutorial presentation and reflects his personal opinions and preferences. The annotations attempt to describe the main ideas and results of each paper, its significance for audio MIR, and in some cases the potential for future research. Where appropriate, sentences from the original abstract were copied.

Although the goal of collecting this bibliography was to be as complete as possible, it is inevitable that some omissions and errors exist. The author maintains an online version of the bibliography at http://www.cs.princeton.edu/~gtzan/caudition.html that will be regularly updated.

I apologise to any authors that I failed to include in this bibliography or misrepresented in my short annotations. Please do not hesitate to send corrections and additions by email to gtzan@cs.cmu.edu and I will do my best to include them in the online version of this document.

I hope that this work will inform and inspire researchers interested in the exciting and interesting area of MIR for audio signals.

George Tzanetakis
Computer Science Department
Carnegie Mellon University,
July 2002

[1] Proc. Int. Symposium on Music Information Retrieval (ISMIR), Plymouth, MA, 2000.

The proceedings of the first International Symposium on Music Information Retrieval, the only conference that focuses exclusively on Music Information Retrieval. In addition to papers dealing with audio MIR, many papers are about symbolic MIR, where the music data is stored in symbolic form such as MIDI files.
[2] Proc. Int. Symposium on Music Information Retrieval (ISMIR), Bloomington, IN, 2001.

The proceedings of the second International Symposium on Music Information Retrieval, the only conference that focuses exclusively on Music Information Retrieval. In addition to papers dealing with audio MIR, many papers are about symbolic MIR, where the music data is stored in symbolic form such as MIDI files.
[3] Masoud Alghoniemy and Ahmed Tewfik. Rhythm and Periodicity Detection in Polyphonic Music. In Proc. 3rd Workshop on Multimedia Signal Processing, pages 185-190, Denmark, September 1999.

The extraction of beat/rhythm based on periodicity detection in polyphonic music in audio format is the topic of this paper. The authors describe an algorithm that detects beats using a narrow band-pass filter (50-200 Hz) in the very low frequency band, where bass drums are located. The resulting beat sequence (essentially the filtered band) is then analysed for repeating patterns at large and small scales. Thresholding (at 0.7 of the maximum pulse amplitude) yields the beat pulse locations, and the difference pattern is calculated by taking the distances between pulses. When there are no long rest durations a binary parsing tree (similar to Lempel-Ziv coding) is used to detect patterns, whereas when there are long rests a trellis is used. A small-scale music genre classification experiment (essentially classification by culture) was also conducted, using the average kurtosis of small segments (1.5 seconds) of the beat sequence.
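To make the pulse-extraction step concrete, here is a rough Python sketch (mine, not the paper's) that band-pass filters the signal, thresholds the resulting envelope at 0.7 of its maximum, and returns the pulse locations and the difference pattern between them; the filter order, smoothing window and function names are illustrative assumptions.

    import numpy as np
    from scipy.signal import butter, lfilter

    def beat_pulses(x, sr, low=50.0, high=200.0, thresh=0.7):
        # Band-pass filter to keep the very low frequency band where bass drums live.
        b, a = butter(4, [low / (sr / 2), high / (sr / 2)], btype="band")
        band = lfilter(b, a, x)
        # Full-wave rectify and smooth to obtain an amplitude envelope.
        env = np.convolve(np.abs(band), np.hanning(int(0.05 * sr)), mode="same")
        # Threshold at a fraction of the maximum pulse amplitude (0.7 in the paper).
        above = env > thresh * env.max()
        # Rising edges of the thresholded envelope are taken as beat pulse locations.
        pulses = np.flatnonzero(np.diff(above.astype(int)) == 1)
        # The difference pattern is the sequence of distances between pulses.
        return pulses / sr, np.diff(pulses) / sr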
[4] Masoud Alghoniemy and Ahmed Tewfik. Personalized Music Distribution. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, Istanbul, Turkey, June 2000. IEEE.

Playlist generation refers to the process of automatically creating a list of audio signals from a collection that satisfies some criteria. In this work, the authors assume that binary attributes (for example male/female or fast/slow) are attached to each file of a collection (either manually or automatically) and describe an algorithm that creates a playlist satisfying certain constraints, such as 2 slow and 5 fast songs. The authors formulate the problem using vector space concepts as a constrained optimization problem and provide an approximation scheme for solving it. The algorithm is evaluated using synthetically generated binary attributes.
[5] Masoud Alghoniemy and Ahmed Tewfik. A network flow model for playlist generation. In Proc. Int. Conf. on Multimedia and Expo (ICME), Japan, August 2001. IEEE.

A follow-up to the authors' earlier work on playlist generation, the process of automatically creating a list of audio signals from a collection that satisfies some criteria. As before, binary attributes (for example male/female or fast/slow) are assumed to be attached to each file of the collection (either manually or automatically), and the goal is a playlist satisfying constraints such as 2 slow and 5 fast songs; in this paper, as the title indicates, the problem is formulated as a network flow model.
[6] Eric Allamanche, Jurgen Herre, Oliver Hellmuth, Bernhard Froba, Thorsten Kastner, and Markus Cremer. Content-based Identification of Audio Material using MPEG-7 Low Level Description. In Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2001.

The problem of audio fingerprinting or content-based identification is becoming a hot topic in digital music distribution as it can be used for copyright detection and for linking music metadata based on the content of files rather than their unreliable names. The technique described in this paper is based on clustering and nearest-neighbor search using features based on the MPEG-7 low level description. The audio fingerprinting is evaluated in the context of various signal distortions such as: time shift, cropping, volume change, perceptual audio coding and others. The description of the problem, the challenges, evaluation and the proposed applications are a good introduction to audio fingerprinting in general.
[7] A. Oppenheim and A. Willsky. Signals and Systems. Prentice Hall, 1983.

Standard reference textbook for Signal Processing. Contains mathematical explanations of basic concepts, possibly more detailed than necessary for audio MIR purposes. For a more readable coverage of digital signal processing fundamentals, look at A Digital Signal Processing Primer by Ken Steiglitz.
[8] Barry Arons. SpeechSkimmer: a system for interactively skimming recorded speech. ACM Transactions on Computer-Human Interaction, 4:3-38, 1997. http://www.media.mit.edu/people/barons/papers/ToCHI97.ps.

The goal of SpeechSkimmer is to provide an interface that allows users to quickly skim through recordings of speech signals. Using techniques such as time stretching, pitch shifting, and pause/noise reduction/removal, the user is able to hear the recorded speech at several times its original speed without loss of intelligibility. Although this interface was designed mainly for speech signals, many of its ideas, such as browsing by semantic regions rather than arbitrary blocks, are relevant to the area of user interfaces for audio IR.
[9] Jean-Julien Aucouturier and Mark Sandler. Segmentation of Musical Signals Using Hidden Markov Models. In Proc. 110th Audio Engineering Society Convention, Amsterdam, The Netherlands, May 2001. Audio Engineering Society AES.

A method for the segmentation of audio musical signals using Hidden Markov Models is described in this paper. Three feature front ends are compared for this task: Mel Frequency Cepstrum, Linear Prediction and Discrete Cepstrum.
[10] Dana H. Ballard and Christopher M. Brown. Computer Vision. Prentice Hall, 1982.

A classic reference textbook for Computer Vision. Although not directly related to audio MIR many of the techniques described can provide ideas and inspiration for the development of corresponding ideas in audio analysis. For example algorithms for edge detection and video scene segmentation employ similar principles to audio segmentation.
[11] Mark A. Bartsch and Gregory H. Wakefield. To Catch a Chorus: Using Chroma-Based Representation for Audio Thumbnailing. In Proc. Int. Workshop on applications of Signal Processing to Audio and Acoustics, pages 15-19, Mohonk, NY, 2001. IEEE.

Audio thumbnailing or music summarization is an important component of audio MIR systems. The method described in this paper attempts to identify the chorus or refrain of a song by identifying repeated sections of the audio waveform. A chroma-based representation where the spectrum information is mapped to chroma (pitch class) values is used. A similarity correlation matrix (similar to Foote's similarity visualization) between each pair of feature vectors is used to detect repeating structures. The algorithm is evaluated by comparing the results with hand-selected refrains or representative sections. The algorithm is compared with the same algorithm using MFCC and random selection and is shown to perform better than both. This is probably due to the fact that the chroma representation captures harmonic progression information better than the MFCC representation. This is an indication that features developed for other types of audio signals might be inferior to features that are designed to take advantage of the specific properties of musical signals.
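As an illustration of the chroma idea, here is a small Python sketch (mine, not the authors') that folds the magnitude spectrum of one analysis frame into 12 pitch classes; the reference tuning, the windowing and the normalisation are illustrative assumptions.

    import numpy as np

    def chroma(frame, sr, fref=261.63):  # fref: assumed reference frequency (middle C)
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
        c = np.zeros(12)
        for f, mag in zip(freqs[1:], spectrum[1:]):      # skip the DC bin
            pitch_class = int(round(12 * np.log2(f / fref))) % 12
            c[pitch_class] += mag                        # fold all octaves together
        return c / (np.linalg.norm(c) + 1e-12)           # unit-norm chroma vector

A correlation matrix between such per-frame chroma vectors (as in the Foote-style visualizations discussed below) then exposes repeated harmonic progressions such as the chorus.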
[12] Serge Belongie, Chad Carson, Hayit Greenspan, and Jitendra Malik. Blobworld: A system for region-based image indexing and retrieval. In Proc. 6th Int. Conf. on Computer Vision, January 1998.

Although MIR for audio signals is a very new research field, content-based approaches to image and video retrieval have a relatively longer history. The Blobworld system is an example of such a content-based system for retrieval, and some of the ideas described in the paper can inform research in audio MIR.
[13] Adam L. Berenzweig and Daniel P.W. Ellis. Locating singing voice segments within musical signals. In Proc. Int. Workshop on Applications of Signal Processing to Audio and Acoustics WASPAA, pages 119-123, Mohonk, NY, 2001. IEEE.

One of the most important pieces of information used by humans when identifying and classifying music is the singer's voice. Therefore it is important to reliably locate the portions of a musical track during which vocals are present. The results could be used for browsing, as a signature for the piece, and as a precursor to automatic recognition of lyrics. In this paper, an acoustic classifier developed for speech recognition is used as a detector for speech-like sounds. An HMM is used to find a best labeling sequence. On a test set of forty 15-second excerpts of randomly selected music, the developed classifier achieves around 80 percent classification accuracy at the frame level.
[14] John Biles. GenJam: A Genetic Algorithm for Generating Jazz Solos. In Proc. Int. Computer Music Conf. (ICMC), pages 131-137, Aarhus, Denmark, September 1994.

Although not directly related to audio MIR this paper is representative of techniques in automatic music generation. It is possible that such techniques will be used in future audio MIR interfaces to specify queries as an alternative to the current query-by-humming and query-by-example paradigms. In this paper, genetic algorithms are used to evolve convincing jazz solos.
[15] J. Boreczky and Lynn Wilcox. A hidden markov model framework for video segmentation using audio and image features. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, volume 6, pages 3741-3744. IEEE, 1998.

Video segmentation refers to the problem of breaking a video signal into regions in time based on shots, shot boundaries, and camera movement. The authors describe a system for video segmentation that combines acoustic and image features to detect the segmentation boundaries. More specifically, for the audio features cepstral vectors are computed every 20ms. A two-second interval of 100 cepstral vectors is then used to calculate an audio distance measure between successive intervals. Essentially the parameters of a Gaussian distribution for each interval are computed and then a likelihood ratio between the two distributions is used as the distance measure. By defining HMM states for events such as shot, fade, and dissolve, and using the combined image and audio vectors, a segmentation of the video signal is achieved. This method works better than manual thresholding and is purely data-driven. Similar methods have been employed for audio segmentation where there are specific classes/states of interest.
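The likelihood-ratio distance between successive feature intervals can be sketched as follows; this is my own simplified version, using diagonal-covariance Gaussians rather than the full-covariance models of the paper.

    import numpy as np

    def gauss_loglik(X, mean, var):
        # Total log-likelihood of the rows of X under a diagonal Gaussian.
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mean) ** 2 / var)

    def likelihood_ratio_distance(X1, X2):
        # X1, X2: consecutive windows of cepstral vectors (rows are frames).
        X = np.vstack([X1, X2])
        separate = (gauss_loglik(X1, X1.mean(0), X1.var(0) + 1e-8) +
                    gauss_loglik(X2, X2.mean(0), X2.var(0) + 1e-8))
        pooled = gauss_loglik(X, X.mean(0), X.var(0) + 1e-8)
        # Large values mean the two windows are better explained by separate
        # models, i.e. a likely segmentation boundary between them.
        return separate - pooled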
[16] Albert Bregman. Auditory Scene Analysis. MIT Press, Cambridge, 1990.

A classic psychology book that has been very influential in the development of ideas in computational auditory scene analysis and audio MIR. The author describes in detail a number of theories and hypotheses about how our auditory system works and provides extensive evidence from user studies. A great book for anyone interested in audio analysis and audition in general. Specifically, the author describes the concept of auditory streams as perceptual entities that humans form while listening to sounds. Streams are formed by simultaneous and sequential integration of sounds based on various properties described in the book. The general empirical approach described in the book (forming a theory based on experiments, designing experiments to validate it, conducting the experiments, and repeating the cycle) is applicable not only to cognitive psychology experiments but to the design of computer audition systems as well.
[17] Judith Brown. Computer identification of musical instruments. Journal of the Acoustical Society of America, 105(3):1933-1941, 1999.

The computer identification of isolated musical instrument tones has been a topic of active research for some years. In this paper, the author describes a method for the identification of musical instruments that unlike previous approaches that only work with isolated steady-state instrument tones works with monophonic real recordings of the specified instruments. As sound source separation and music transcription systems improve, instrument identification will become an important part of audio MIR systems.
[18] Chris Chafe, Bernard Mont-Reynaud, and L. Rush. Toward an Intelligent Editor of Digital Audio: Recognition of Musical Constructs. Computer Music Journal, 6(1):30-41, 1982.

One of the goals of audio MIR is to develop tools that have an understanding of audio going beyond just an unstructured collection of samples. For example, in a Jazz tune we would like to know the structure, such as ABA, or a specific part like the saxophone solo. In this early paper the authors describe their efforts towards building an intelligent audio editor that recognizes musical constructs. Although we are still far from this goal, except for data in symbolic form, recent advances in computer audition will probably make this goal of an intelligent audio editor realizable in the near future. For example, the Marsyas audio editor and browser developed at Princeton University supports automatic segmentation and annotation based on classification.
[19] Perry Cook, editor. Music, Cognition, and Computerised Sound. MIT Press, 2001.

Introductory coverage of multiple topics about how the brain processes sounds entering the ear (psychoacoustics). The book has a particular emphasis on music and the use of computers to generate stimuli and conduct experiments in psychoacoustics that would not be feasible otherwise. The accompanying CD-ROM includes many sound and source code examples to help explicate the text. Specific topics of interest to the audio MIR community that are covered include: how the ear works, cognitive music psychology, pitch perception, loudness, timbre, stream segregation, consonance and scales, tonal structure and scales, memory for musical attributes, and experimental design in psychoacoustic research.
[20] Roger Dannenberg. An on-line algorithm for real-time accompaniment. In Proc. Int. Computer Music Conf., pages 187-191, Paris, France, 1984.

In this classic early paper the author describes probably one of the first, if not the first, algorithms for real-time accompaniment, where the computer is provided with a score for the solo instrument and the accompaniment part, and the task is to follow in real time a performer playing an acoustic instrument, compensating for pitch and rhythm errors. Many of the ideas proposed in this paper are relevant for pitch detection and tracking as well as beat detection and tracking.
[21] Roger Dannenberg, Belinda Thom, and David Watson. A machine learning approach to musical style recognition. In Proc. Int. Computer Music Conference (ICMC), pages 344-347, 1997.

This paper describes a machine learning approach to recognizing musical genre in music data in symbolic form (MIDI). Although music retrieval algorithms operating on symbolic data typically utilize string matching and text information retrieval techniques, this paper follows a machine learning approach typical of audio MIR. Thirteen low-level features, such as averages and standard deviations of MIDI key note number, duration, duty factor and others, are used to train a naive Bayesian classifier, a linear classifier and a neural network. The styles used for classification consist of a range of performance intentions such as frantic, lyrical, pointillistic, syncopated, high, low, quote, blues. A real-time version of the system using a circular buffer of 5 seconds has been implemented. As multiple pitch extraction and transcription algorithms become better it will be interesting to see similar techniques applied directly to audio data.
[22] Steven Davis and Paul Mermelstein. Experiments in syllable-based recognition of continuous speech. IEEE Transactions on Acoustics, Speech and Signal Processing, 28:357-366, August 1980.

One of the first papers that described the use of Mel Frequency Cepstral Coefficients for speech modeling. Typically cited as the original reference for MFCCs.
[23] Hrishikesh Deshpande, Rohit Singh, and Unjung Nam. Classification of Musical Signals in the Visual Domain. In Proc. COST G-6 Conf. on Digital Audio Effects (DAFX), Limerick, Ireland, December 2001.

It is well known that musical genres can be characterized by the statistical characteristics of their spectral texture over time. Visual metaphors have been used to illustrate these ideas, but the authors of this paper take the idea even further: they perform genre classification directly in the visual domain using texture modeling methods from image processing. MFCCs are used as features and a spectrogram-like representation based on them is computed. This image is then modeled using a recursive texture-of-textures approach. For classification three methods are compared: K-Nearest Neighbor, Gaussian and Support Vector Machines. For evaluation three genres (classical, jazz, rock) were used. 17 (randomly selected) songs for each category were used for training and the remaining 106 songs for validation and testing. The authors report 75 percent accuracy using the KNN classifier. Based on empirical observation and some statistical analysis the authors also report that jazz and rock are less clustered than classical.
[24] Simon Dixon. A Lightweight Multi-agent Musical Beat Tracking System. In Proc. Pacific Rim Int. Conf. on Artificial Intelligence, pages 778-788, 2000.

Automatic beat tracking algorithms are an important component of audio MIR systems. In this paper, the author describes a software system for beat tracking that works directly on audio signals without making any assumptions about the style, time signature or approximate tempo. The system works by detecting salient acoustic events and then clustering the time intervals between events to obtain hypotheses about the current tempo. Multiple agents are used to track these hypotheses over time. The output of each agent is a sequence of beat locations, which is evaluated for its closeness of fit to the data.
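The inter-onset-interval clustering step can be illustrated with the rough sketch below; the clustering tolerance and the ranking of hypotheses are illustrative assumptions rather than the paper's exact values.

    import numpy as np

    def tempo_hypotheses(onsets, max_interval=2.5, tol=0.025):
        # onsets: sorted detected event times in seconds.
        iois = [t2 - t1 for i, t1 in enumerate(onsets)
                for t2 in onsets[i + 1:] if t2 - t1 <= max_interval]
        clusters = []                                   # each cluster: list of intervals
        for ioi in sorted(iois):
            for c in clusters:
                if abs(ioi - np.mean(c)) <= tol:        # join an existing cluster
                    c.append(ioi)
                    break
            else:
                clusters.append([ioi])                  # otherwise start a new one
        # Rank hypotheses by how many intervals support them; convert to BPM.
        clusters.sort(key=len, reverse=True)
        return [(60.0 / np.mean(c), len(c)) for c in clusters[:5]]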
[25] Simon Dixon. An Interactive Beat Tracking and Visualization System. In Proc. Int. Computer Music Conf. (ICMC), pages 215-218, Habana, Cuba, 2002. ICMA.

An improved version of the beat tracking system developed by the author that contains additional visualization and graphical user interface components. The system works by detecting salient audio events and then clustering the time intervals between the events to obtain hypotheses about the current tempo.
[26] Richard Duda, Peter Hart, and David Stork. Pattern classification. John Wiley & Sons, New York, 2000.

Pattern classification and machine learning techniques have always been used in audio MIR and are becoming increasingly important in symbolic MIR as well. This well-cited standard textbook on pattern classification describes in detail most of the important techniques and results in this field and provides a good introduction and background for researchers interested in learning more about pattern classification.
[27] Dan Ellis. Prediction-driven computational auditory scene analysis. PhD thesis, MIT Media Lab, 1996.
[ http://sound.media.mit.edu/~dpwe ]

Most computational auditory scene analysis systems follow a bottom-up approach where the signal is analyzed at increasingly higher levels until the desired information is obtained. However, this process can be greatly assisted if there is some prior knowledge about the sources that are in the sound. This approach is followed in this thesis, where a perceptually-based CASA system that takes advantage of prior knowledge for prediction is proposed. The system is able to correctly separate and identify noisy environmental sounds from complex mixtures.
[28] Antti Eronen and Anssi Klapuri. Musical Instrument Recognition using Cepstral Features and Temporal Features. In Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, Istanbul Turkey, 2000. IEEE.

There is a large literature on the problem of automatically identifying musical instruments and instrument families from isolated tones. Although a variety of features have been proposed, few papers provide any comparative results. In this paper, the authors describe a detailed evaluation of many different cepstral and temporal features and give suggestions about which ones are best to use depending on the desired computational load and accuracy.
[29] Mikael Fernstrom and Caolan McNamara. After Direct Manipulation - Direct Sonification. In Proc. Int. Conf. on Auditory Display, ICAD, Glasgow, Scotland, 1998.

One of the important problems in audio MIR is the presentation of multiple sound files for browsing. Unlike images, sounds are typically presented sequentially, resulting in longer times when browsing multiple sounds. One possibility is playing multiple sounds at the same time, spatialized to enhance separation. In this paper, the effectiveness of providing multiple-stream audio to support audio browsing is investigated through the iterative development and evaluation of a series of Sonic Browser prototypes. A user study with ten subjects confirmed that interactive multiple-stream audio made the browsing task faster than single-stream audio support.
[30] Mikael Fernstrom and Eoin Brazil. Sonic Browsing: an auditory tool for multimedia asset management. In Proc. Int. Conf. on Auditory Display (ICAD), Espoo, Finland, July 2001.

Most graphical user interfaces for sound editing and visualization are based on the idea of processing a single audio file at a time. Obviously this is not enough for audio MIR purposes. The Sonic Browser is a graphical tool for managing collections of many sound files. The files are represented as visual objects on a two-dimensional visual and aural grid. Multiple stream stereo is used to provide the aural feedback. The user can explore the space by moving a cursor/aura. The sounds that are inside the aura are not only visually highlighted but played simultaneously spatialized in audio space according to their visual spatial position. In this paper, the Sonic Browser is evaluated in the context of browsing of everyday sounds.
[31] Myron Flickner et al. Query by image and video content: the QBIC system. IEEE Computer, 28(9):23-32, September 1995.

One of the most relevant areas to audio MIR is the area of content-based image retrieval as many of the ideas, motivation, and concepts are similar. The QBIC system is a well known and cited example of a content-based system for querying image and video.
[32] Jonathan Foote. Content-based retrieval of music and audio. In Proc. of SPIE Multimedia Storage and Archiving Systems II, pages 138-147, 1997.

The author describes a system for content-based retrieval of audio signals and evaluates it using the Musclefish database. The similarity measure is based on statistics derived from a supervised vector quantizer. The audio signal is first parametrized using Mel-Frequency Cepstral Coefficients. The resulting feature vectors are then quantized using a tree-structured quantizer. A histogram count of the probabilities of each leaf is then used as the representation for similarity retrieval. Euclidean and cosine distance metrics are used to compare histogram vectors. A detailed evaluation and comparison with the Musclefish algorithm is also presented in the paper.
[33] Jonathan Foote. An overview of audio information retrieval. ACM Multimedia Systems, 7:2-10, 1999.

An early overview of audio information retrieval. In addition to audio MIR, it covers speech and symbolic MIR as well. Although somewhat outdated, this overview is a good introduction to these areas and identifies several key papers in each area that can serve as starting points to explore the literature. The basic issues and motivations for making audio less opaque and more structured are also described.
[34] Jonathan Foote. Visualizing music and audio using self-similarity. In ACM Multimedia, 1999.

Visualization takes advantage of the strong pattern recognition properties of the human visual system to directly express various properties of the data that is analyzed. In music visualization the task is to come up with interesting ways that musical information can be represented visually. In this paper, a method for visualizing the time structure and self-similarity of musical audio signals is described. The audio signal is first parametrized into Mel-frequency cepstral coefficients (MFCCs) plus an energy term. For visualization, a window of vectors is chosen, and a similarity measure based on sequence autocorrelation is calculated for all window combinations. The image is constructed so that each pixel at location i,j is given a grayscale value proportional to the similarity measure. Regions of high audio similarity appear as bright squares on the diagonal. Repeated figures appear as bright off-diagonal rectangles. Several visualization examples of audio files from different musical genres are provided in the paper.
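The core of the visualization is easy to sketch. The fragment below (illustrative, not the author's code) builds a cosine self-similarity matrix from a matrix of per-frame feature vectors and displays it as an image, using random numbers as a stand-in for real MFCC frames.

    import numpy as np
    import matplotlib.pyplot as plt

    def self_similarity(features):
        # features: (num_frames, num_coeffs) matrix of per-frame feature vectors.
        unit = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
        return unit @ unit.T                 # cosine similarity for every frame pair

    S = self_similarity(np.random.randn(200, 13))   # stand-in for real MFCC frames
    plt.imshow(S, cmap="gray", origin="lower")       # bright squares = similar regions
    plt.xlabel("frame"); plt.ylabel("frame"); plt.show()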
[35] Jonathan Foote. Arthur: Retrieving orchestral music by long-term structure. In Proc. International Symposium on Music Information Retrieval (ISMIR), 2000.

In this paper, the author describes a system for detecting different performances of the same symphonic pieces that operates directly in the audio domain. This retrieval is done by utilizing the long term structure of music. More specifically the envelope of audio energy versus time is calculated for each file of interest. Similarity between energy profiles is calculated using dynamic programming. Experimental results from a modest corpus indicate that the system works for retrieving different performances of the same orchestral work, given an example performance or a short excerpt as a query.
[36] Jonathan Foote. Automatic Audio Segmentation using a Measure of Audio Novelty. In Proc. Int. Conf. on Multimedia and Expo (ICME), volume 1, pages 452-455. IEEE, 2000.

Audio segmentation refers to the process of automatically detecting where there are significant changes in music. In this paper, the author proposes a method for audio segmentation based on analyzing local self-similarity. More specifically, a self-similarity correlation matrix is constructed. By correlating this similarity matrix with appropriate image processing filters (kernels) along its diagonal, a one-dimensional novelty score is calculated. This novelty score is then used to extract segment boundaries.
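A minimal sketch of the novelty computation, assuming a self-similarity matrix S like the one in the previous fragment, is given below; the kernel size and the plain loop are illustrative choices, not the paper's implementation.

    import numpy as np

    def checkerboard_kernel(size):
        half = size // 2
        K = np.ones((size, size))
        K[:half, half:] = -1      # cross-quadrant similarity counts against novelty
        K[half:, :half] = -1
        return K

    def novelty(S, size=32):
        K = checkerboard_kernel(size)
        half = size // 2
        score = np.zeros(S.shape[0])
        for i in range(half, S.shape[0] - half):
            patch = S[i - half:i + half, i - half:i + half]
            score[i] = np.sum(patch * K)     # correlate the kernel along the diagonal
        return score                          # peaks suggest segment boundaries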
[37] Jonathan Foote and Shingo Uchihashi. The Beat Spectrum: a new approach to rhythmic analysis. In Int. Conf. on Multimedia & Expo (ICME). IEEE, 2001.

The beat spectrum is a global representation that characterizes the rhythm and tempo of music and audio. It is a measure of acoustic self-similarity as a function of time lag. Highly structured and repetitive music will have strong beat spectrum peaks at the repetition times. A small scale evaluation showed that tempo estimated using the beat spectrum is usually correct across a variety of musical genres.
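One simple way to realise the idea is to average the diagonals of a self-similarity matrix, as in the illustrative fragment below (the paper also describes an autocorrelation-based formulation).

    import numpy as np

    def beat_spectrum(S, max_lag=None):
        # S: frame-by-frame self-similarity matrix; returns mean similarity per lag.
        n = S.shape[0]
        max_lag = max_lag or n // 2
        return np.array([np.mean(np.diagonal(S, offset=lag))
                         for lag in range(1, max_lag)])   # peaks at repetition periods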
[38] Ichiro Fujinaga. Machine recognition of timbre using steady-state tone of acoustic instruments. In Proc. Int. Computer Music Conf. (ICMC), pages 207-210, Ann Arbor, Michigan, 1998. ICMA.

In this paper, the author describes a system for the recognition of musical instruments for isolated steady-state tones. It is based on extracting spectral moments and then using a nearest neighbor classification system combined with a genetic algorithm for feature weighting.
[39] Ichiro Fujinaga. Realtime recognition of orchestral instruments. In Proc. Int. Computer Music Conf. (ICMC), pages 141-143. ICMA, 2000.

In this paper, the author extends his previous work in musical instrument classification so that it can be performed in real-time and doesn't make the assumption that only the steady state portion of the sound is provided.
[40] Keinosuke Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, 1972.

Pattern Recognition and Machine Learning techniques are necessary to deal with the inescapable uncertainty in complex data analysis tasks such as audio MIR. Although a little outdated, this standard textbook provides an overview of the basic terminology and techniques in Statistical Pattern Recognition. For a more recent treatment, check the Duda, Hart, Stork textbook.
[41] Asif Ghias, Jonathan Logan, David Chamberlin, and Brian Smith. Query by Humming: Musical Information Retrieval in an Audio Database. ACM Multimedia, pages 213-236, 1995.

One of the first papers to define the term query-by-humming and describe a system for retrieval from symbolic data (MIDI). The approach described in the paper (an acoustic query is converted to some kind of pitch and rhythmic contour, which is then used to retrieve from a database of stored MIDI files) has been used in a variety of later systems and underlies the majority of existing systems for symbolic MIR. The pitch contour is represented with three values: U (up), D (down), S (same). A small database (183 songs) was used for evaluation. The authors compare autocorrelation, maximum likelihood and cepstrum analysis for pitch detection and decide to use autocorrelation (taking computation time into account). For matching, an approximate string matching algorithm is used to deal with transposition, dropout and duplication errors.
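The contour-plus-approximate-matching idea can be sketched as follows; the toy database, the semitone tolerance and the plain edit distance are illustrative simplifications of the matching described in the paper.

    def contour(pitches, tol=0.5):
        # pitches: per-note pitch estimates in semitones (e.g. MIDI numbers).
        s = ""
        for prev, cur in zip(pitches, pitches[1:]):
            s += "S" if abs(cur - prev) <= tol else ("U" if cur > prev else "D")
        return s

    def edit_distance(a, b):
        d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
             for i in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
        return d[len(a)][len(b)]

    database = {"Song A": "USDUUSD", "Song B": "DDSUUDS"}   # toy stored contours
    query = contour([60, 62, 62, 59, 64])                    # hummed query -> "USDU"
    ranked = sorted(database, key=lambda t: edit_distance(query, database[t]))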
[42] Masataka Goto and Yoichi Muraoka. Real-time rhythm tracking for drumless audio signals - chord change detection for musical decisions. In Proc. Int. Joint. Conf. in Artificial Intelligence: Workshop on Computational Auditory Scene Analysis, 1997.

This paper presents a beat tracking system for musical audio signals that do not contain drum sounds. The system tracks beats at the quarter-note level as well as the half-note and measure levels. A method for chord change detection that does not require chord identification is proposed. This is achieved by tracking the dominant frequency trajectories and detecting where they change simultaneously. The system was tested with synthetic chord progressions as well as musical signals sampled from the web and compact discs, showing good beat tracking performance in both cases.
[43] Masataka Goto and Yoichi Muraoka. Music Understanding at the Beat Level: Real-time Beat Tracking of Audio Signals. In David Rosenthal and Hiroshi Okuno, editors, Computational Auditory Scene Analysis, pages 157-176. Lawrence Erlbaum Associates, 1998.

A classic paper in the area of beat tracking of audio signals. The authors describe a real-time beat tracking system that works on polyphonic multitimbral audio signals. The system is able to handle beat ambiguities by managing multiple agents that maintain multiple hypotheses of beat locations. The system also uses musical knowledge represented in drum patterns. The system correctly tracked the beat in 40 out of 42 popular songs in which drums maintain the beat.
[44] Fabien Gouyon, Francois Pachet, and Olivier Delerue. On the use of zero-crossing rate for an application of classification of percussive sounds. In Proc. COST-G6 Conf. on Digital Audio Effects (DAFX), Verona, Italy, December 2000.

In this paper, an approach to automatic beat detection based on extracting time indexes of occurrences of different percussive timbres in an audio signal is proposed. In order to follow this approach, it is important to be able to classify different percussive sounds, and this classification problem is the main focus of the paper. A variety of features are used and analyzed. The system is trained both with synthetic samples and real-world examples. Unlike typical classification tasks where a large data set of training data is collected, in this approach unsupervised agglomerative clustering is used, calibrated by analyzing extracted features of percussive sounds. It is shown that the zero-crossing decay can be used effectively for separating two important classes of percussive sounds: bass drum sounds and snare drum sounds.
[45] John M. Grey. An Exploration of Musical Timbre. PhD thesis, Dept. of Psychology, Stanford Univ., 1975.

In this classic thesis, the author explores the perception of the timbre of isolated musical instrument tones using a Multidimensional Scaling (MDS) paradigm and conducting user studies. Based on the studies, a three-dimensional space of instruments is constructed with the basic dimensions roughly corresponding to spectral centroid, flux and attack time. The idea of constructing a timbre space with dimensions derived from MDS is important and has also been used for sound effects and other sounds. It would be interesting to see similar user experiments in audio MIR. Similar timbre spaces can also be constructed automatically from analysis of audio features. An interesting problem is how to map these spaces from the perceptual/user study domain to the automatic/feature extraction domain. Finally, visualization of timbre spaces can lead to insights and validation of the analysis behind the space.
[46] Alexander Hauptmann and Michael Witbrock. Informedia: News-on-demand multimedia information acquisition and retrieval. In Intelligent Multimedia Information Retrieval, chapter 10, pages 215-240. MIT Press, Cambridge, 1997.

Audio MIR is similar in many ways to work in content-based video indexing and retrieval. The Informedia system developed at Carnegie Mellon University is an example of such a system, mainly focusing on the news-on-demand application area. It combines speech recognition, image and video processing as well as text information retrieval in order to effectively browse and search large amounts of recorded news data.
[47] Walter B. Hewlett and Eleanor Selfridge-Field, editors. Melodic Similarity: Concepts, Procedures and Applications, volume 11. Computing in Musicology, 1998.

Although not directly related to audio MIR this collection of articles is the most well-known reference for computing melodic similarity using symbolic data. The techniques described in this collection will become increasingly relevant to audio MIR as gradually more detailed pitch information will be extracted from audio signals.
[48] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.

Principal Component Analysis (PCA) is a standard technique used for visualization and dimensionality reduction. It has been applied in a variety of different contexts including audio MIR (see for example the Timbregram visualization proposed by Tzanetakis and Cook). This textbook defines PCA and explains the main concepts behind it.
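For readers who want a concrete starting point, PCA can be computed in a few lines via the singular value decomposition of the centred data matrix; the sketch below is generic and not tied to any particular MIR system.

    import numpy as np

    def pca(X, k=2):
        Xc = X - X.mean(axis=0)                        # centre each feature dimension
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T, Vt[:k]                   # projected data, top-k components

    coords, components = pca(np.random.randn(100, 13), k=2)   # toy feature matrix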
[49] Joemon M. Jose, Jonathan Furner, and David J. Harper. Spatial querying for image retrieval: a user-oriented evaluation. In Proc. SIGIR Conf. on research and development in Information Retrieval, Melbourne, Australia, 1998. ACM.

The use of spatial layout based on content is shown to improve retrieval performance for images. This idea has been used as inspiration for spatial audio displays such as the Sonic Browser by the University of Limerick or the Marsyas3D audio visualization system at Princeton University. However, the most important part of the paper for audio MIR research is the user-centered, task-oriented, comparative evaluation of the system, which can serve as a guide for designing similar experiments in the evaluation of graphical user interfaces for audio MIR; until now there has been very little comparative evaluation of such interfaces.
[50] Thomas Kemp, Michael Schmidt, Martin Westphal, and Alex Waibel. Strategies for automatic segmentation of audio data. In Proc. Int. Conf. on Acoustics Speech and Signal Processing (ICASSP), volume 3, pages 1423-1426. IEEE, 2000.

A large number of audio segmentation algorithms have been proposed in the literature. In this paper, a detailed comparison of different strategies for automatic audio segmentation is evaluated in the context of TV news shows. This paper is a good introduction to the topic of audio segmentation as well as to how segmentation can be evaluated. The three approaches compared are: model-based, metric-based and energy-based. It is shown that model-based and metric-based approaches outperform the simpler energy-based algorithms. Based on this observation a hybrid approach is proposed where a metric-based segmentation is used as a base for constructing the model for the final model-based segmenter run.
[51] Don Kimber and Lynn Wilcox. Acoustic segmentation for audio browsers. In Proc. Interface Conference, Sydney, Australia, July 1996.

The authors describe a user interface for browsing spoken documents that is based on their automatic speaker segmentation algorithm. This algorithm, described in more detail in other papers, uses agglomerative clustering and hidden Markov models to perform acoustic segmentation. The developed interface is one of the first proposed interfaces that use intelligent navigation through audio signals. Similar interfaces are being developed for music browsing and retrieval.
[52] Richard Kronland-Martinet, J. Morlet, and A. Grossman. Analysis of sound patterns through wavelet transforms. Int. Journal of Pattern Recognition and Artificial Intelligence, 1(2):237-301, 1987.

There is a large number of textbooks and articles describing Wavelets from various viewpoints. Typically these fall into three categories: mathematics, signal processing and image processing/compression. Therefore it is not directly evident how Wavelets can be applied to audio signals. This paper introduces Wavelets and describes how they can be used to model acoustical signals.
[53] Jean Laroche. Estimating Tempo, Swing and Beat Locations in Audio Recordings. In Proc. Int. Workshop on applications of Signal Processing to Audio and Acoustics WASPAA, pages 135-139, Mohonk, NY, 2001. IEEE.

This article presents techniques for estimating the tempo, swing and beat locations in audio recordings. It makes the assumption that the tempo is constant, something which is true for a large percentage of modern popular music. The algorithm works by detecting transients such as note onsets/offsets, percussion hits and other time-localized events. This step is followed by a maximum-likelihood probabilistic estimation of the tempo, swing and beat locations. Some suggestions to minimize the computational load of the method are provided.
[54] Fred Lerdahl and Ray Jackendoff. A Generative Theory of Tonal Music. MIT Press, 1983.

Formal grammars have had remarkable success in modeling human language. In this classic book a generative grammar for tonal music is proposed and described in detail. Some of the ideas described have been implemented and used for symbolic MIR purposes. As techniques in symbolic MIR and audio MIR become more sophisticated, formal music theory will become increasingly important for these areas.
[55] Guohui Li and Ashfaq Khokar. Content-based indexing and retrieval of audio data using wavelets. In Int. Conf. on Multimedia and Expo (II), pages 885-888. IEEE, 2000.

Wavelets are a relatively new Time-Frequency analysis technique that overcomes some of the resolution problems of the Short-Time Fourier Transform (STFT). This paper shows how audio features for indexing and retrieval can be derived from Wavelet analysis. The mean, standard deviation and zero-crossing rate of each subband are used as features, and the algorithm is evaluated using the Musclefish database of 418 sound effects and isolated sounds, yielding a high recall ratio (higher than 70 percent).
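To illustrate the kind of features involved, the sketch below uses a hand-rolled multi-level Haar decomposition (rather than the wavelet family used in the paper) and keeps, for each subband, the mean, standard deviation and zero-crossing rate of the coefficients.

    import numpy as np

    def haar_subbands(x, levels=4):
        bands, approx = [], np.asarray(x, dtype=float)
        for _ in range(levels):
            if len(approx) % 2:                        # pad to even length
                approx = np.append(approx, approx[-1])
            detail = (approx[0::2] - approx[1::2]) / np.sqrt(2)
            approx = (approx[0::2] + approx[1::2]) / np.sqrt(2)
            bands.append(detail)
        bands.append(approx)
        return bands

    def subband_features(x, levels=4):
        feats = []
        for band in haar_subbands(x, levels):
            zcr = np.mean(np.abs(np.diff(np.sign(band))) > 0)
            feats.extend([band.mean(), band.std(), zcr])
        return np.array(feats)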
[56] Stan Li. Content-based classification and retrieval of audio using the nearest feature line method. IEEE Transactions on Speech and Audio Processing, 8(5):619-625, September 2000.

In this paper a pattern classification scheme called the Nearest Feature Line (NFL) is used for audio classification and retrieval. In the NFL, information provided by multiple prototypes per class is explored. For the feature representation, standard perceptual and cepstral features and their combinations are considered. The system is evaluated using the Musclefish dataset and it is shown that the NFL-based method produces consistently better results than the NN-based and other methods.
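The geometric idea behind the NFL distance is easy to sketch: the query is projected onto the line through every pair of prototypes of a class, and the smallest residual distance is kept. The fragment below is an illustrative implementation of that idea, not the authors' code.

    import numpy as np
    from itertools import combinations

    def feature_line_distance(q, p1, p2):
        # Distance from query q to the line passing through prototypes p1 and p2.
        d = p2 - p1
        t = np.dot(q - p1, d) / (np.dot(d, d) + 1e-12)
        return np.linalg.norm(q - (p1 + t * d))

    def nfl_distance(q, class_prototypes):
        # Smallest distance from q to any feature line of the class.
        return min(feature_line_distance(q, p1, p2)
                   for p1, p2 in combinations(class_prototypes, 2))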
[57] Z. Liu, J. Huang, Y. Wang, and T. Chen. Audio feature extraction & analysis for scene classification. IEEE Signal Processing Society 1997 Workshop on Multimedia Signal Processing (Electronic Proceedings), 1997.

In this paper the authors describe a system for video scene classification based on audio features. Features used are volume dynamic range, pitch contour, frequency features such as centroid and bandwidth, and subband energy ratios. A neural network classifier is trained using backpropagation for each class. The five classes used are: advertisement, basketball, football, news, weather. The paper shows a confusion classification matrix with promising results.
[58] Beth Logan. Mel Frequency Cepstral Coefficients for Music Modeling. In Proc. Int. Symposium on Music Information Retrieval (ISMIR), 2000.

Mel-Frequency Cepstral Coefficients (MFCC) are arguably the most common feature representation used today for the modeling of speech signals. This paper provides a good introduction to the motivation and calculation of the MFCC. The author shows empirically how well the DCT used for decorrelation in the MFCC calculation approximates the Karhunen-Loeve transform (KL) (or Principal Component Analysis) which by definition is the correct way to decorrelate components. This comparison is done both for speech and music signals providing evidence that it is appropriate to use MFCC for music modeling.
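For reference, a compact sketch of the standard MFCC computation (power spectrum, triangular mel filterbank, log compression, DCT) is given below; the filter count, coefficient count and normalisation details vary between implementations and are assumptions here.

    import numpy as np
    from scipy.fft import dct

    def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
    def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mfcc_frame(frame, sr, n_filters=26, n_coeffs=13):
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
        freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
        edges = imel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
        energies = np.zeros(n_filters)
        for i in range(n_filters):              # triangular filters on the mel scale
            lo, cen, hi = edges[i], edges[i + 1], edges[i + 2]
            rise = np.clip((freqs - lo) / (cen - lo + 1e-12), 0, 1)
            fall = np.clip((hi - freqs) / (hi - cen + 1e-12), 0, 1)
            energies[i] = np.sum(spectrum * np.minimum(rise, fall))
        # The DCT approximately decorrelates the log filterbank outputs, which is
        # exactly the point Logan compares against the Karhunen-Loeve transform.
        return dct(np.log(energies + 1e-12), norm="ortho")[:n_coeffs]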
[59] Beth Logan. Music summarization using key phrases. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP. IEEE, 2000.

Audio thumbnailing or music summarization is an exciting new area for audio MIR. The purpose of summarization is to provide a short thumbnail-summary that captures the essential characteristics of a musical piece. Applications include: presentation of retrieval results, quick browsing, and recommendation services. In this paper, the author makes the assumption that the key phrase is the most repeated section of the song. To the best of my knowledge, this is the first paper dealing with this problem in the audio domain (similar algorithms have been used for video summarization). The features used in this approach are Mel-Frequency Cepstral Coefficients. For discovering the song structure two methods (clustering and hidden Markov models (HMMs)) are proposed and evaluated. After the structure is discovered, heuristics are used to choose the key phrase. Results for summaries of 18 Beatles songs evaluated by ten users show that the technique based on clustering is superior to the HMM approach and to choosing the key phrase at random.
[60] Lie Lu, Jiang Hao, and Zhang HongJiang. A robust audio classification and segmentation method. In Proc. ACM Multimedia, Ottawa, Canada, 2001.

In this paper the authors propose an algorithm for classification and segmentation of audio signals that is able to discriminate speech, music, environmental sound and silence in a one-second window. The first step is music/speech discrimination, where a novel algorithm based on KNN and LSP VQ is used. Some new features such as the noise frame ratio and band periodicity are also introduced. Detailed evaluation results are provided from data gathered from the MPEG-7 data set CD1 as well as news, movies and audio clips from the internet.
[61] John Makhoul. Linear prediction: A tutorial overview. Proceedings of the IEEE, 63:561-580, April 1975.

Linear prediction is a well-known technique used in Speech and Audio processing for the analysis of discrete signals. This tutorial overview provides a good introduction and description of the main ideas and algorithms used to perform linear prediction. In LP, the signal is modelled as a linear combination of its past values, with the residual signal being the difference between the model and the original. In the frequency domain, this is equivalent to modeling the signal spectrum by a pole-zero spectrum. The model parameters can be obtained by using least-squares analysis in the time domain. The Linear Prediction Coefficients (LPC) and derived values have been used as features for audio analysis and are especially applicable to speech signals.
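A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion is given below; it is a textbook formulation rather than anything specific to this tutorial paper.

    import numpy as np

    def lpc(x, order):
        # Autocorrelation of the (windowed) frame up to the required lag.
        r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
        a = np.zeros(order + 1); a[0] = 1.0
        err = r[0]
        for i in range(1, order + 1):
            # Reflection coefficient from the current prediction error.
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
            a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
            err *= (1.0 - k * k)
        return a, err   # a[1:] are the LPC coefficients; err is the residual energy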
[62] Stephane G. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.

Wavelets are a relatively new Time-Frequency analysis technique that overcomes some of the resolution problems of the Short-Time Fourier Transform (STFT). This textbook provides a good introduction to Wavelets from a Signal Processing perspective. Although not directly concerned with audio it can provide many interesting ideas for the use of Wavelets in audio. Although Wavelets in audio compression have not been as successful as they have been in image compression they have a lot of potential in analysis applications such as audio MIR.
[63] Keith Martin. Sound-Source Recognition: A Theory and Computational Model. PhD thesis, MIT Media Lab, 1999.
[ http://sound.media.mit.edu/~kdm ]

The problem of musical instrument identification has a relatively long history compared to other problems in audio MIR. This thesis is probably the most complete coverage of the topic, with a detailed analysis of the existing literature and a description of a complete system for musical-instrument identification. One interesting aspect of this work is the use of user experiments to evaluate how good humans are at the musical instrument identification task and whether they make errors similar to those of the automated system.
[64] Keith Martin, Eric Scheirer, and Barry Vercoe. Musical content analysis through models of audition. In Proc. Multimedia Workshop on Content-based Processing of Music, Bristol, UK, 1998. ACM.

In this position paper, the authors convincingly address the limitations of conventional approaches to music audio analysis by questioning their underlying principles. More specifically, they argue for a non-transcriptive approach to the problem of audio analysis and demonstrate examples in the extraction of rhythm, timbre, harmony and structure from complex audio signals using an approach that is based on a realistic view of human listening abilities. The authors argue that researchers should try to model what average human listeners are able to do rather than building elaborate music listening models based on music theory that are applicable only to experienced musicians.
[65] Todd K. Moon. The Expectation-Maximization Algorithm. IEEE Signal Processing Magazine, 13(6):47-60, November 1996.

The expectation-maximization algorithm is an iterative technique used to probabilistically estimate missing values from multidimensional data. Probably its most well-known application is in the training stage of Gaussian Mixture Model classifiers (GMMs) and Hidden Markov Models (HMMs). This tutorial paper describes the basic ideas and calculation steps of the EM algorithm in detail.
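As a concrete toy example, the fragment below runs EM on a two-component one-dimensional Gaussian mixture; the initialisation and the fixed number of iterations are arbitrary illustrative choices.

    import numpy as np

    def em_gmm(x, iters=50):
        mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])   # crude init
        var = np.array([x.var(), x.var()])
        w = np.array([0.5, 0.5])
        for _ in range(iters):
            # E-step: responsibility of each component for each sample.
            lik = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
            resp = lik / lik.sum(axis=1, keepdims=True)
            # M-step: re-estimate weights, means and variances from responsibilities.
            n_k = resp.sum(axis=0)
            w = n_k / len(x)
            mu = (resp * x[:, None]).sum(axis=0) / n_k
            var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k + 1e-6
        return w, mu, var

    x = np.concatenate([np.random.normal(0, 1, 300), np.random.normal(5, 1, 300)])
    print(em_gmm(x))   # weights, means and variances of the two components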
[66] F.R. Moore. Elements of Computer Music. Prentice Hall, 1990.

Computer Music has been an active area of research for much longer than audio MIR. Many of the techniques developed including ideas about programming structures such as unit generators and filters are directly or indirectly applicable to audio MIR. The Elements of Computer Music is a good introduction to computer music and contains a lot of code examples in C that can be used to better understand the main concepts.
[67] James Moorer. On the Segmentation and Analysis of Continuous Musical Sound by Digital Computer. PhD thesis, Dept. of Music, Stanford University, 1975.

One of the first attempts to transcribe music automatically, the system described in this thesis contained many ideas that appear again and again in later systems. Although limited (it was only able to track monophonic music or very simple duets), the performance of the system was very impressive for its time.
[68] James Moorer. The Lucasfilm Audio Signal Processor. Computer Music Journal, 6(3):30-41, 1982. (also in ICASSP 82).

One of the important areas of current research in audio MIR is the creation of intelligent audio editors and browsers that have some understanding of the structure of music. This early paper describes some similar ideas, in an embryonic stage, in the design of the Lucasfilm Audio Signal Processor. Although many of the hardware implementation details are irrelevant today, the paper is still an interesting read.
[69] MPEG-7. Context and Objectives. Technical report, ISO/IEC JTC1/SC29/WG11 MPEG98, 1998.

The goal of MPEG-7, the so-called Multimedia Content Description Interface, is to provide standard sets of descriptors that can be used to describe various types of multimedia information, their relationships and how they are linked to the actual content. In addition, it standardizes ways to define new descriptors. The descriptors included are both low level (usually computed automatically) and high level (requiring manual annotation). This technical report describes the motivation behind MPEG-7 and the main design goals. Obviously the descriptors for audio signals are important for researchers in audio MIR.
[70] MpegAudio1. Information technology-coding of moving pictures and associated audio for digital storage media at up to about 1.5 mbit/s-is 11172 (part 3, audio). Technical report, ISO/IEC JTC1/SC29, 1992.
[71] MpegAudio2. Information technology-generic coding of moving pictures and associated audio information-is 13818 (part 3, audio). Technical report, ISO/IEC JTC1/SC29, 1994.
[72] Peter Noll. MPEG digital audio coding. IEEE Signal Processing Magazine, pages 59-81, September 1997.

MPEG audio compression (the well-known mp3 files) is currently the most widely used compression standard for digital music distribution. This tutorial article explains the main concepts behind the design of the perceptual audio compression method used in MPEG audio compression and provides an overview and explanation of the architecture of the encoder/decoder.
[73] A. Oppenheim and R. Schafer. Discrete-Time Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1989.

Signal processing techniques are of fundamental importance for audio MIR, as they are necessary to analyze and extract information from complex audio signals. This standard textbook provides an introduction to the main ideas, concepts and algorithms used in digital signal processing.
[74] Laura Ottaviani and Davide Rocchesso. Separation of Speech Signal from Complex Auditory Scenes. In Proc. COST G-6 Conf. on Digital Audio Effects (DAFX), Limerick, Ireland, December 2001.

Source separation, and subsequently polyphonic pitch extraction, is one of the main goals of audio MIR. In this paper, the authors concentrate on the problem of separating speech signals from complex auditory scenes. The system consists of a module for pitch analysis and a module for resynthesis. A cochlear model followed by summary autocorrelation is used to perform the pitch tracking.
[75] Francois Pachet and Daniel Cazaly. A Taxonomy of Musical Genres. In Proc. RIAO Content-based Multimedia Information Access Conf., Paris, March 2000.

In this paper the authors stress the inconsistencies found in existing taxonomies of musical genre as used in the music industry. In order to develop a system for similarity retrieval based on metadata, and in particular a genre descriptor, they propose a novel music genre taxonomy based on a few guiding principles, and report on the process of building this taxonomy with the intention of using it in the future behind a similarity retrieval engine.
[76] Francois Pachet, Pierre Roy, and Daniel Cazaly. A combinatorial approach to content-based music selection. IEEE Multimedia, 2000.

To the best of my knowledge, this paper is the first to address the issue of automatic playlist generation (which the authors name content-based music selection) with specific properties from annotated attributes. The problem is posed as a constraint-satisfaction problem over the attributes. As an example, the user can ask for a playlist with 5 songs, no slow tempos, half of them with female voice, and one song by Billie Holiday. The attributes are separated into technical (name, artist, album, etc.) and content (jazz, female voice, brass). For the evaluation the attributes are entered manually. Similarity relations are provided between the attributes. Evaluation on a small dataset of 200 titles showed that the automatically generated playlists compare well to those manually generated by experts and contain unexpected items that human experts would not have thought of.
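The flavour of the constraint-satisfaction formulation can be conveyed with a toy brute-force sketch; the songs, attributes and constraints below are invented for the example, and real systems (including the one in the paper) use far more efficient solvers and richer constraints.

    from itertools import combinations

    songs = {
        "song1": {"tempo": "fast", "voice": "female"},
        "song2": {"tempo": "slow", "voice": "male"},
        "song3": {"tempo": "fast", "voice": "female"},
        "song4": {"tempo": "fast", "voice": "male"},
        "song5": {"tempo": "fast", "voice": "female"},
    }

    def satisfies(playlist):
        # Constraints: no slow tempos and at least two female vocals
        # (the playlist length is fixed to 3 by the combinations call below).
        return (all(songs[s]["tempo"] != "slow" for s in playlist) and
                sum(songs[s]["voice"] == "female" for s in playlist) >= 2)

    playlists = [p for p in combinations(songs, 3) if satisfies(p)]
    print(playlists[0] if playlists else "no playlist satisfies the constraints")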
[77] Davis Pan. A Tutorial on MPEG/Audio Compression. IEEE Multimedia, 1995.

MPEG audio compression (the well-known mp3 file) is by far the most popular audio compression standard and started the digital music distribution revolution. Technically it is a lossy perceptual coder that takes advantage of the properties of the human auditory system to conceal the errors resulting from compression by making them perceptually inaudible. This tutorial provides a good overview and description of the main components of MPEG audio encoding and decoding.
[78] Alex Pentland, Rosalind Picard, and Stanley Sclaroff. Photobook: Tools for Content-Based Manipulation of Image Databases. IEEE Multimedia, pages 73-75, July 1994.

Content-based audio IR has many similarities with content-based image and video retrieval. One of the most well-known examples of a content-based image system is Photobook, which combines content-based manipulation and browsing of images. The idea of combining visual browsing with content-based analysis tools has been used recently to develop novel graphical user interfaces for manipulating collections of audio signals.
[79] D. Perrot and Robert Gjerdingen. Scanning the dial: An exploration of factors in identification of musical style. In Proc. Society for Music Perception and Cognition, page 88, 1999. (abstract).

Musical genres are categorical labels created and used by humans to organize and structure the large space of musical pieces. The boundaries between them are fuzzy, as there is no strict definition of what constitutes a musical genre. A study of genre classification by humans was conducted in this work. Using a 10-genre forced-choice paradigm, college students were able to judge genre accurately (53% correct) after listening to only 250-millisecond samples, and (70% correct) after listening to three seconds (chance would be 10%). Listening to more than 3 seconds did not improve their performance. The subjects were trained using representative samples from each genre. This study is of special importance as a benchmark for automatic musical genre classification.
[80] Silvia Pfeiffer. Pause concepts for audio segmentation at different semantic levels. In Proc. ACM Multimedia, Ottawa, Canada, 2001.

One of the important cues for the segmentation of sound is the relative amplitude level. This source of information is explored in detail in this paper, where a perceptual loudness measure is the only feature used for segmentation. The segmentation is done at multiple levels using adaptive thresholding. An additional contribution of this paper is the description of a segmentation evaluation methodology.
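
The general idea can be sketched in a few lines (assuming a simple frame RMS measure and a median-based adaptive threshold, rather than the perceptual loudness measure of the paper):

    import numpy as np

    def pause_segments(signal, sr, frame=1024, hop=512, factor=0.2):
        frames = [signal[i:i + frame] for i in range(0, len(signal) - frame, hop)]
        rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
        threshold = factor * np.median(rms)          # adaptive threshold
        is_pause = rms < threshold
        # Collect runs of low-loudness frames as (start_time, end_time) pairs.
        times, start = [], None
        for i, p in enumerate(is_pause):
            if p and start is None:
                start = i
            elif not p and start is not None:
                times.append((start * hop / sr, i * hop / sr))
                start = None
        if start is not None:
            times.append((start * hop / sr, len(is_pause) * hop / sr))
        return times

    sr = 8000
    x = np.concatenate([np.random.randn(sr), np.zeros(sr // 2), np.random.randn(sr)])
    print(pause_segments(x, sr))     # should report a pause around 1.0-1.5 s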
[81] Silvia Pfeiffer, Stephan Fischer, and Wolfgang Effelsberg. Automatic Audio Content Analysis. ACM Multimedia, pages 21-30, November 1996.

In this paper the authors describe the basic theoretical framework and applications of audio content analysis. More specifically, content-based segmentation of audio, music analysis and violence detection in video signals are discussed. Onset and offset information as well as fundamental frequencies and transitions are calculated using a gammatone filterbank. The fundamental frequency trajectory is then used to produce a characteristic signature of a musical piece which is used for retrieval. This method is shown to be more effective than just using amplitude statistics. By computing audio statistics a method for violence detection is also described.
[82] Robi Polikar. The Wavelet Tutorial, 1999.
[ http://engineering.rowan.edu/~polikar/WAVELETS/WTtutorial.html ]

A good introductory tutorial to Wavelets and the motivation behind their development and usage. For more detail, the book A Wavelet Tour of Signal Processing by Stephane Mallat offers a more complete treatment.
[83] David Pye. Content-based methods for the management of digital music. In Proc. Int. Conf on Acoustics, Speech and Signal processing ICASSP. IEEE, 2000.

The derivation of features directly from MP3-compressed data and their use for classification and retrieval is the topic of this paper. A new feature parametrization, termed MP3CEP, similar to MFCC and based on a partial decompression of MPEG Layer III audio, is proposed. This parametrization is much faster to compute than decompressing the audio and analyzing it again, as it takes advantage of the analysis performed for compression. For classification, Gaussian Mixture Modeling as well as tree-based vector quantization were used, with good results in classifying 6 musical genres (blues, easy listening, classical, opera, dance (techno) and indie rock). The system was also evaluated for retrieval using same-artist and same-genre relevance judgements, and average precision results are provided.
[84] L. Rabiner, M. Cheng, A. Rosenberg, and C. McGonegal. A comparative performance study of several pitch detection algorithms. IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-24:399-417, October 1976.

One of the fundamental building blocks of any audio and music analysis system is automatic pitch detection. There is a huge literature on this topic and, despite continuous research for more than thirty years, new algorithms are still being proposed. This early paper provides a detailed comparison and evaluation of several pitch detection algorithms. Many of these algorithms, or variants of them, are still being used either in isolation or as part of more complex systems, therefore this paper remains very relevant despite its age.
[85] Lawrence Rabiner and Biing Huang Juang. Fundamentals of Speech Recognition. Prentice-Hall, 1993.

Many of the techniques and ideas used in audio MIR have their origins in speech recognition. This standard reference textbook provides a good introduction to the fundamentals of speech recognition and can be used as a reference for understanding many of the concepts and algorithms used in audio MIR systems.
[86] Curtis Roads. Computer Music Tutorial. MIT Press, 1996.

Computer Music has been an active area of research for much longer than audio MIR. Many of the techniques developed, including ideas about programming structures such as unit generators and filters, are directly or indirectly applicable to audio MIR. This huge book contains a high-level overview of almost every area of Computer Music and is an excellent resource for finding references for these areas.
[87] David F. Rosenthal and Hiroshi G. Okuno, editors. Computational Auditory Scene Analysis. Lawrence Erlbaum, 1998.

Auditory Scene Analysis is the process of separating a continuous sound signal into components corresponding to its separate physical sources. This collection of articles describes a variety of different approaches to building computational systems that attempt to mimic various aspects of the human hearing process related to Auditory Scene Analysis. Of special interest to researchers in audio MIR are the following articles: Psychological Data and Computational ASA by Albert Bregman, A Critique of Pure Audition by Malcolm Slaney, Application of the Bayesian Probability Network to Music Scene Analysis by Kashino et al., Musical Understanding at the Beat Level: Real-time Beat Tracking for Audio Signals by Masataka Goto and Yoichi Muraoka, Analysis and Synthesis of Sound Textures by N. Saint-Arnaud and K. Popat, and Using Musical Knowledge to Extract Expressive Performance Information from Audio Recordings by Eric Scheirer. Some of these articles appear as separate entries in this annotated bibliography.
[88] Stephane Rossignol, Xavier Rodet, et al. Features extraction and temporal segmentation of acoustic signals. In Proc. Int. Computer Music Conf. (ICMC), pages 199-202. ICMA, 1998.

The authors describe a feature extraction system that is subsequently used for a three-stage segmentation. At the first stage the signal is segmented into speech, singing voice and instrumental parts. The second stage suppresses vibrato in singing and the third stage segments into notes or into phones. Evaluation results are given only for the first stage.
[89] John Saunders. Real time discrimination of broadcast speech/music. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 993-996. IEEE, 1996.

Possibly the first paper on music/speech discrimination. The features and classification algorithm used are relatively simple (thresholding on zero-crossing statistics) but the problem is clearly defined and relatively good results were achieved in this early paper.
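
To illustrate this style of approach (with an illustrative threshold rather than Saunders' actual features and values): speech alternates between voiced and unvoiced sounds, so the variability of the zero-crossing rate is a simple discriminating statistic.

    import numpy as np

    def zcr(frame):
        # Fraction of adjacent sample pairs that change sign.
        return np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0

    def speech_or_music(signal, frame=1024, zcr_std_threshold=0.05):
        frames = [signal[i:i + frame] for i in range(0, len(signal) - frame, frame)]
        rates = np.array([zcr(f) for f in frames])
        # Speech shows large frame-to-frame ZCR variation; most music does not.
        return "speech" if np.std(rates) > zcr_std_threshold else "music"

On real data the threshold would be learned from labeled examples rather than fixed by hand.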
[90] Gary Scavone, Steve Lakatos, Perry Cook, and C.H Harbke. Perceptual spaces for sound effects obtained with an interactive similarity rating program. In Proc. Int. Symposium on Musical Acoustics, Perugia, Italy, September 2001.

One of the important ideas in timbre and music research is the concept of a perceptual space that models the perceived distances/similarities between auditory objects. In this paper, the authors describe the construction of such a perceptual space for collections of sound effects using an interactive similarity rating program.
[91] Robert Schalkoff. Pattern Recognition. Statistical, Structural and Neural Approaches. John Wiley & Sons, 1992.

Pattern Recognition and Machine Learning techniques are important in audio MIR research in order to deal with the inherent errors and uncertainty of audio analysis algorithms. This textbook provides a good introduction to the field, covering structural and neural approaches in addition to statistical pattern recognition. Although structural approaches have not been used extensively in audio MIR, they hold a lot of promise for future research as music exhibits very regular structures at multiple levels.
[92] Eric Scheirer. Bregman's chimerae: Music perception as auditory scene analysis. In Proc. Int. Conf. on Music Perception and Cognition, Montreal, 1996.

In this paper the author argues against the predominant transcriptive note-level approach used in music audition systems. More specifically, the author argues that when listening to music we group multiple sounds into coherent events (chimeric objects) without necessarily separating each individual note, and that this approach could also be followed in music audition systems that operate directly on audio signals.
[93] Eric Scheirer. The MPEG-4 structured audio standard. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1998.

The main idea behind structured audio (SA) is that instead of storing audio samples, algorithms and control/mixing parameters for the production of sound are stored. The MPEG-4 standard has a part dealing with structured audio and this paper provides an overview of the main ideas behind this concept. The wide adoption of SA will open many new interesting directions for audio MIR as in essence more detailed information about the audio structure will be available directly.
[94] Eric Scheirer. Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, 103(1):588-601, January 1998.

One of the first and most cited papers in automatic beat tracking directly on audio signals. In this system, a filterbank is coupled with a network of comb filters that track the signal periodicities to provide an estimate of the main beat and its strength. The results of a short validation experiment demonstrate that the performance of the algorithm is similar to the performance of human listeners in a variety of musical situations. In addition this paper provides a good introduction to the topic of beat tracking in general.
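
The core resonance idea can be sketched in a few lines of Python (a single onset-envelope channel with a brute-force search over tempi, rather than the full filterbank of the paper):

    import numpy as np

    def estimate_tempo(onset_env, env_sr, bpm_range=(60, 180), alpha=0.9):
        best_bpm, best_energy = None, -1.0
        for bpm in range(bpm_range[0], bpm_range[1] + 1):
            delay = int(round(env_sr * 60.0 / bpm))   # comb-filter lag in samples
            y = np.zeros_like(onset_env)
            for n in range(len(onset_env)):
                feedback = y[n - delay] if n >= delay else 0.0
                y[n] = (1 - alpha) * onset_env[n] + alpha * feedback
            energy = np.sum(y ** 2)                   # how strongly this tempo resonates
            if energy > best_energy:
                best_bpm, best_energy = bpm, energy
        return best_bpm

    # Synthetic onset envelope with pulses at 120 BPM (envelope sampled at 100 Hz).
    env_sr = 100
    env = np.zeros(1000)
    env[::50] = 1.0                                   # a pulse every 0.5 s
    print(estimate_tempo(env, env_sr))                # expected to be near 120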
[95] Eric Scheirer. Music-Listening Systems. PhD thesis, MIT, 2000.
[ http://sound.media.mit.edu/~eds ]

The author of this thesis has been one of the key figures in the audio MIR community. In this thesis, a variety of automatic music listening systems are designed, described and evaluated. All of them follow a non-transcriptive approach and are evaluated through user studies. Probably, one of the best starting points for someone wanting to enter the literature on audio MIR.
[96] Eric Scheirer and Malcolm Slaney. Construction and evaluation of a robust multifeature speech/music discriminator. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, pages 1331-1334. IEEE, 1997.

Arguably the most complete and well-cited reference on the topic of music/speech discrimination. Thirteen features are evaluated for a music/speech discrimination task, achieving a classification accuracy of 94.2% on a frame basis and 98.6% over long (2.4 second) segments. The approach used in this paper (audio features, training of statistical pattern recognition classifiers, and evaluation using cross-validation) has been used in a variety of other papers in audio MIR. The paper is a good example of the thorough evaluation that is sometimes lacking in the still-maturing field of audio MIR.
[97] Jarno Seppanen. Quantum Grid Analysis of Musical Signals. In Proc. Int. Workshop on applications of Signal Processing to Audio and Acoustics WASPAA, pages 131-135, Mohonk, NY, 2001. IEEE.

In this paper, an algorithm for analyzing the rhythmic content of polyphonic and multitimbral music signals is presented. The analysis consists of detecting sound onsets, computing an inter-onset interval (IOI) histogram, and estimating the duration of the shortest notes, i.e., the tatum period, from the histogram. In order to accommodate tempo changes, a short-term memory is used for the tatum grid estimation. The proposed algorithm works causally and a real-time software implementation is available online. The performance of the system is evaluated using 50 musical excerpts, and the algorithm is capable of finding the tatum grid in music with a regular rhythm.
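
A minimal, non-causal sketch of the histogram step (with assumed bin width and tolerance values, not the paper's exact procedure), given a list of detected onset times:

    import numpy as np

    def tatum_from_onsets(onset_times, resolution=0.01, tolerance=0.02):
        onsets = np.sort(np.asarray(onset_times))
        iois = np.diff(onsets)
        # Histogram of inter-onset intervals with 10 ms bins.
        edges = np.arange(0.0, iois.max() + resolution, resolution)
        counts, edges = np.histogram(iois, bins=edges)
        centres = (edges[:-1] + edges[1:]) / 2.0
        # Score each populated bin by how many IOIs lie close to one of its multiples;
        # the best-supported (and smallest) period is taken as the tatum.
        best_period, best_score = None, -1
        for period in centres[counts > 0]:
            multiples = np.round(iois / period) * period
            score = np.sum(np.abs(iois - multiples) < tolerance)
            if score > best_score or (score == best_score and period < best_period):
                best_period, best_score = period, score
        return best_period

    # Onsets of straight eighth notes at 120 BPM (0.25 s tatum), with one note missing.
    onsets = [0.0, 0.25, 0.5, 1.0, 1.25, 1.5, 1.75]
    print(tatum_from_onsets(onsets))   # roughly 0.25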
[98] Roger. N. Shepard. Circularity in Judgements of Relative Pitch. Journal of the Acoustical Society of America, 35:2346-2353, 1964.

In this classic paper, two distinct attributes of pitch perception are identified: tone height and chroma. Tone height describes the general increase in the pitch of a sound as its frequency increases. Chroma on the other hand, is cyclic in nature with octave periodicity and closely corresponds to the concept of pitch class familiar to musicians. Under this formulation, two tones differing by an integral number of octaves have the same value of chroma. Chroma-based representations have been used in audio MIR to capture pitch content and harmonic relations.
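
A minimal sketch of a chroma computation (with an assumed reference frequency and window), folding the magnitude spectrum of one frame into 12 pitch classes and discarding tone height:

    import numpy as np

    def chroma(frame, sr, ref_freq=261.63):           # ref_freq = C4
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        vector = np.zeros(12)
        for f, mag in zip(freqs[1:], spectrum[1:]):    # skip the DC bin
            pitch_class = int(np.round(12 * np.log2(f / ref_freq))) % 12
            vector[pitch_class] += mag
        return vector / (vector.sum() + 1e-12)

    sr = 22050
    t = np.arange(2048) / sr
    frame = np.sin(2 * np.pi * 440.0 * t)              # A4
    print(np.argmax(chroma(frame, sr)))                # pitch class 9 (A)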
[99] Ben Shneiderman. Designing the User Interface: Strategies for Effective Human-Computer Interaction. Addison-Wesley, 3rd ed. edition, 1998.

A frequently overlooked area in developing computer systems, especially in new areas such as audio MIR, is the disciplined design and development of user interfaces. This standard textbook describes the motivation behind designing good interfaces and provides several real-world examples and guidelines about how to design, develop and evaluate user interfaces. Hopefully, as audio MIR matures as a field, user interface issues will become increasingly important and formal design and evaluation methods will be adopted.
[100] Malcolm Slaney. A critique of Pure Audition, chapter 3. Lawrence Erlbaum Associates, Mahwah, NJ, 1997.

This position paper argues for the addition of top-down information flow in existing sound-separation systems. Most sound-separation systems based on perception assume a bottom-up or Marr-like view of the world. In this paper, the author argues that auditory information flow, although predominantly bottom-up, also proceeds in the top-down direction. Evidence of top-down processing is provided in both the auditory and visual domains.
[101] Malcolm Slaney. Semantic-Audio Retrieval. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). IEEE, May 2002.

In this paper a method for connecting sound to words and words to sounds is described. This is achieved by linking multidimensional vector spaces. The acoustic space is represented using anchor models and partitioned using agglomerative clustering. The semantic space is modeled by a hierarchical multinomial clustering model. Using the linked models, users can retrieve sounds with natural language, and the system can describe new sounds with words. For illustration, a set of acoustic and semantic documents about animal sounds obtained from the BBC Sound Effects Library is used.
[102] Malcolm Slaney and Richard Lyon. A perceptual pitch detector. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, pages 357-360, Albuquerque, NM, 1990. IEEE.

Unlike signal processing approaches that work directly on mathematical representations of audio signals, perceptual approaches use the properties of the human auditory system as guidelines for their design. In this paper, the authors describe a perceptual pitch detection algorithm that is based on Licklider's Duplex Theory of pitch perception. The algorithm works by combining a cochlear model with a bank of autocorrelators. The information in the correlogram is filtered, non-linearly enhanced and summed. It is shown that the algorithm correctly identifies the pitch of complex harmonic and inharmonic stimuli, and that it is robust to noise and phase changes. Perceptually-based algorithms have proved their advantages in several areas of audio processing such as audio compression (MPEG audio compression - mp3 files) and speech recognition (Mel-Frequency Cepstral Coefficients, MFCC). As computers become faster and computer audition algorithms more sophisticated, perceptually-based audio analysis algorithms will become more widely used for audio MIR.
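
A much-simplified, single-channel relative of the correlogram idea (no cochlear model, assumed pitch range) can be sketched as follows:

    import numpy as np

    def autocorr_pitch(frame, sr, fmin=60.0, fmax=500.0):
        frame = frame - np.mean(frame)
        # Autocorrelation for non-negative lags.
        corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lag_min = int(sr / fmax)
        lag_max = int(sr / fmin)
        # Pitch is the lag of the strongest peak within the plausible range.
        lag = lag_min + np.argmax(corr[lag_min:lag_max])
        return sr / lag

    sr = 16000
    t = np.arange(1024) / sr
    tone = np.sin(2 * np.pi * 220.0 * t) + 0.5 * np.sin(2 * np.pi * 440.0 * t)
    print(autocorr_pitch(tone, sr))                   # close to 220 Hz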
[103] Malcolm Slaney and Richard Lyon. On the importance of time-a temporal representation of sound. In M Cooke, B Beet, and M Crawford, editors, Visual Representations of Speech Signals, pages 95-116. John Wiley & Sons Ltd, 1993.

Perceptually-motivated algorithms are becoming increasingly important in audio processing and analysis. Examples include audio compression (MPEG audio compression - mp3 files) and speech recognition (Mel-Frequency Cepstral Coefficients, MFCC). In this overview paper, the authors describe in detail what we know about the cochlear mechanics and neural processing performed in the human ear. They show how to build computational models that mimic the human auditory system and describe how these can be used to represent and interpret the temporal information in an acoustic signal. Two different cochlear models are described and compared.
[104] Paris Smaragdis. Redundancy Reduction for Computational Audition, a Unifying Approach. PhD thesis, MIT Media Lab, 2001.

Ideas from Computational Auditory Scene Analysis (CASA) form the basis of some existing audio MIR systems and will become increasingly important in the future. In this thesis, a formal approach is taken in order to unify the objectives of lower-level listening functions. More specifically, techniques for redundancy reduction such as Independent Component Analysis (ICA) are applied to three perceptual tasks (preprocessing, grouping, and scene analysis) and shown to give satisfactory results for complex audio signals.
[105] Leigh Smith. A Multiresolution Time-Frequency Analysis And Interpretation Of Musical Rhythm. PhD thesis, University of Western Australia, July 1999.

This thesis describes in detail an approach to representing musical rhythm using multiresolution analysis and wavelets. It also provides a good introduction to the problem of rhythmic representation. Rhythm occurs at multiple hierarchical levels and this structure is well modeled using wavelets.
[106] Robert Spence. Information Visualization. Addison Wesley ACM Press, 2001.

As the field of audio MIR matures, increasing emphasis is placed on visualization and graphical user interface aspects. This book provides a good introduction to the topic of information visualization and can provide inspiration for ideas in visualizing audio signals and collections.
[107] Ken Steiglitz. A Digital Signal Processing Primer. Addison Wesley, 1996.

Typically, signal processing textbooks tend to be dry and full of mathematical details that hinder a more intuitive understanding of the field. This primer offers a good introduction to the fundamentals of signal processing and provides a lot of intuition behind the main ideas without compromising the presentation of the subject. An excellent introduction to the exciting area of Signal Processing.
[108] S.R. Subramanya and A. Youssef. Wavelet-based indexing of audio data in audio/multimedia databases. In Proc. Int. Workshop on Multimedia Database Management IW-MMDBMS, pages 46-53, 1998.

This paper presents an efficient indexing scheme for audio databases using wavelets. The performance of the scheme is experimentally evaluated and is seen to be more resilient to noise than indexing schemes using signal-level statistics, and to give better retrieval performance than DCT-based indexing. The index is based on organizing the wavelet coefficients at the various levels in an indexing structure so they can be used for query-by-example. The paper also provides a good introduction to the calculation of the Discrete Wavelet Transform (DWT).
[109] Tero Tolonen and Matti Karjalainen. A Computationally Efficient Multipitch Analysis Model. IEEE Trans. on Speech and Audio Processing, 8(6):708-716, November 2000.

Multiple pitch detection is a fundamental building block of music transcription and analysis systems. In this paper, a computationally efficient multiple pitch analysis model is proposed. The model is shown to have comparable performance with more elaborate and computationally intensive models based on cochlear models and multiple channel analysis. The algorithm works by dividing the signal into two channels (below and above 1000 Hz), computing a generalized autocorrelation of the low-channel signal and of the envelope of the high channel, and summing the two autocorrelation functions. An autocorrelation enhancement process is used to remove peaks resulting from harmonic multiples of the fundamental frequencies. This algorithm has been used as the basis of some recent papers in audio MIR.
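
A rough sketch of the two-channel summary autocorrelation (using FFT-domain masking in place of the paper's filters and omitting the enhancement step):

    import numpy as np

    def summary_acf(x, sr, split_hz=1000.0, smooth_hz=800.0):
        spectrum = np.fft.rfft(x)
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
        low = np.fft.irfft(np.where(freqs <= split_hz, spectrum, 0), n=len(x))
        high = np.fft.irfft(np.where(freqs > split_hz, spectrum, 0), n=len(x))
        # Envelope of the high channel: half-wave rectify, then low-pass.
        env_spec = np.fft.rfft(np.maximum(high, 0.0))
        envelope = np.fft.irfft(np.where(freqs <= smooth_hz, env_spec, 0), n=len(x))
        def acf(v):
            v = v - np.mean(v)
            return np.correlate(v, v, mode="full")[len(v) - 1:]
        # Peaks of the summed function indicate candidate pitch periods (in samples).
        return acf(low) + acf(envelope)

In the paper, the enhancement step that removes peaks at harmonic multiples would be applied to this summed function before picking the pitch candidates.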
[110] George Tzanetakis and Perry Cook. Multifeature audio segmentation for browsing and annotation. In Proc. Workshop on applications of signal processing to audio and acoustics WASPAA, New Paltz, NY, 1999. IEEE.

Typically music signals contain a variety of different sound textures (some examples are instruments, voice, guitar solo, strings). Although classification methods can be used to detect sound texture segmentation boundaries, this approach requires building a model for each class of interest, something which is difficult if not impossible to do for arbitrary musical signals. In this paper, the authors describe a segmentation methodology based on multiple features that works by directly detecting abrupt changes in the feature vector trajectory without requiring classification. The methodology, applied with a specific set of audio features, is evaluated with a number of user experiments, and a prototype audio browsing and annotation tool based on segmentation combined with existing classification techniques is described.
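
A minimal sketch of the change-detection idea (with two illustrative features and a mean-plus-k-standard-deviations threshold, not the paper's feature set or methodology):

    import numpy as np

    def frame_features(frame, sr):
        spec = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)
        rms = np.sqrt(np.mean(frame ** 2))
        return np.array([centroid, rms])

    def boundaries(signal, sr, frame=2048, hop=1024, k=2.0):
        feats = np.array([frame_features(signal[i:i + frame], sr)
                          for i in range(0, len(signal) - frame, hop)])
        # Normalize each feature, then look at the distance between successive frames.
        feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-12)
        dist = np.linalg.norm(np.diff(feats, axis=0), axis=1)
        threshold = dist.mean() + k * dist.std()
        return [(i + 1) * hop / sr for i in np.where(dist > threshold)[0]]

    sr = 22050
    x = np.concatenate([np.random.randn(sr),
                        np.sin(2 * np.pi * 440 * np.arange(sr) / sr)])
    print(boundaries(x, sr))     # a boundary should appear near 1.0 second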
[111] George Tzanetakis and Perry Cook. Marsyas: A framework for audio analysis. Organised Sound, 4(3), 2000.

Although many of the ideas and building blocks used in audio MIR systems are common to various types of analysis, typically each application is developed from scratch. In this paper, the authors describe Marsyas, a software framework that attempts to abstract the necessary architecture and building blocks for audio MIR. The system is designed to be extensible and has been used in a variety of audio MIR systems and applications. The architecture and the main components of the system are described in the paper, as well as concrete implementations of audio MIR algorithms from the literature.
[112] George Tzanetakis and Perry Cook. Sound analysis using MPEG compressed audio. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, Istanbul, 2000. IEEE.

In order to compress audio signals, the MPEG audio compression standard performs a time-frequency analysis during encoding. The results of this analysis can be used to compute audio features directly without having to decompress the audio signal. In this paper, the derivation of such features from compressed data is described and experimental results are provided that show that the proposed features perform only slightly worse than traditional features based on STFT analysis for the task of music/speech classification and segmentation.
[113] George Tzanetakis and Perry Cook. Marsyas3D: a prototype audio browser-editor using a large scale immersive visual and audio display. In Proc. Int. Conf. on Auditory Display (ICAD), Espoo, Finland, August 2001.

In this paper, a prototype audio browsing and editing environment that supports classification, segmentation, browsing and similarity retrieval is described. The system runs on a large-scale immersive visual and audio display, providing a possible glimpse of what audio MIR systems of the future might look like.
[114] George Tzanetakis and Perry Cook. Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing, July 2002.

Musical genres are categorical labels created and used by humans to characterize and structure large collections of music. Automatic musical genre classification is an important component of audio MIR systems and provides a good way to evaluate features for representing musical content. Although the boundaries between genres are inherently fuzzy, there are certain statistical properties related to the rhythm, harmony and instrumentation that characterize each genre. In this paper, three feature sets for representing timbral texture, pitch content and rhythm are proposed and evaluated in a 10-genre classification experiment, achieving 60 percent classification accuracy. As a comparison, a similar experiment conducted using human subjects achieved 70 percent classification accuracy.
[115] George Tzanetakis, Georg Essl, and Perry Cook. Audio Analysis using the Discrete Wavelet Transform. In Proc. Conf. in Acoustics and Music Theory Applications. WSES, September 2001.

The Discrete Wavelet Transform (DWT) is a relatively new time-frequency analysis technique that has been developed as an alternative to the Short Time Fourier Transform (STFT). In this paper, the authors explore the use of the DWT for deriving audio classification features and for automatically extracting the beat of audio signals.
[116] Keith van Rijsbergen. Information retrieval. Butterworths, London, 2nd edition, 1979.

Text information retrieval forms the basis of the majority of symbolic MIR algorithms. This classic textbook (available for free on the web) describes the main ideas and algorithms behind text information retrieval. Of special interest to audio MIR researchers are the sections about retrieval evaluation, where concepts such as precision and recall are introduced and explained.
[117] S. Wake and T. Asahi. Sound Retrieval with Intuitive Verbal Expressions. In Proc. Int. Conf. on Auditory Display (ICAD), Glasgow, Scotland, 1997.

This paper describes a user experiment exploring how people represent sounds, and then uses the results to build a user interface for audio retrieval. Almost all the representations were verbal descriptions that can be classified into three categories: description of the sound itself, description of the sounding situation, and description of the sound impression. The retrieval method adopts three keyword types (onomatopoeia, sound source, and adjective) based on the results of the user study. This paper is particularly interesting for the design of audio retrieval interfaces as it shows that, although humans use different ways to characterize sounds, these typically fall into specific categories.
[118] Ye Wang and Miikka Vilermo. A compressed domain beat detector using MP3 audio bitstreams. In Proc. ACM Multimedia, Ottawa, Canada, 2001.

The idea of compressed-domain processing is to analyze compressed signals directly, before or during decompression. For audio signals this typically refers to signals compressed using the MPEG audio compression format (the well-known mp3 files). In this paper, an algorithm for beat detection that operates directly on compressed mp3 files is proposed. The algorithm is used to conceal packet-loss errors by using beat information to locate a segment similar to the one lost, which works well for popular music with a strong repeating rhythmic structure. The information utilized by the beat detector is the mp3 window type (long, long-to-short, short, short-to-long) and the MDCT coefficients decoded from the mp3 bitstream. For the MDCT-based detection, windows of coefficients are used and beat candidates are obtained by thresholding the energy in each band. An inter-onset interval (IOI) histogram is employed to select the correct inter-beat interval (IBI). The output is the beat positions, the IBI, and a confidence score. Only 3 subbands are actually used (1 low, 2 high) because the middle ones do not give reliable beat information. The method is tested on 6 popular songs, with correct detection in 4 out of 6; the algorithm did not work well when there was no beat pattern or the beat pattern was complex.
[119] Matt Welsh, Nikita Borisov, Jason Hill, Rob von Behren, and Alec Woo. Querying large collections of music for similarity. Technical Report UCB/CSD-00-1096, U.C. Berkeley Computer Science Division, 1999.

The authors describe a system for music similarity retrieval. A detailed set of features (1248 features per song) representing frequency, amplitude, and tempo information is extracted, and a nearest neighbor algorithm is then used for similarity retrieval. The system is evaluated on a database of 7000 songs. The paper provides detailed evaluation results and a comparison of the effectiveness of the different feature sets.
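
The retrieval step itself is straightforward; a minimal sketch of nearest-neighbour retrieval over a 7000-song, 1248-feature table (random vectors standing in for the extracted features):

    import numpy as np

    def nearest_songs(query_vec, database_vecs, k=5):
        # Euclidean distance from the query to every song, smallest first.
        d = np.linalg.norm(database_vecs - query_vec, axis=1)
        return np.argsort(d)[:k]

    rng = np.random.default_rng(0)
    database = rng.normal(size=(7000, 1248))    # 7000 songs x 1248 features, as in the paper
    query = database[42] + 0.01 * rng.normal(size=1248)
    print(nearest_songs(query, database))       # index 42 should come first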
[120] Brian Whitman, Gary Flake, and Steve Lawrence. Artist Detection in Music with Minnowmatch. In Proc. Workshop on Neural Networks for Signal Processing, pages 559-568, Falmouth, Massachusetts, September 2001. IEEE.

One important problem in audio MIR is artist detection or classification, which is to identify the name of the artist given an audio recording. In this paper, a system for performing artist detection based on neural networks (NN) and support vector machines (SVM) is proposed. Accuracy of 91 percent is reported for 32 songs from 5 artists and 70 percent for 50 songs from 10 artists. The artists were chosen in order to make the problem challenging. Scaling problems with neural networks are reduced by a pre-classification step using multiple support vector machines.
[121] Lynn Wilcox, Francine Chen, Don Kimber, and Vijay Balasubramanian. Segmentation of speech using speaker identification. In Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), volume I, pages 161-164, Adelaide, Australia, 1994.

A method for segmenting spoken audio documents by speaker is presented in this paper. A hidden Markov model (HMM) with states corresponding to each speaker is used simultaneously for classification and segmentation. If data labeled by speaker is not available, agglomerative clustering is used to approximately segment the conversational speech and the result is used to train the HMM. Segmentation accuracy using the agglomerative clustering initialization matches the accuracy obtained when initializing with speaker-labeled data. Similar methods using clustering and HMMs have been proposed for the segmentation of musical signals.
[122] Erling Wold, Thom Blum, Douglas Keislar, and James Wheaton. Content-based classification, search and retrieval of audio. IEEE Multimedia, 3(2):27-36, 1996.

This paper is one of the most frequently cited papers in audio information retrieval and the dataset used for the experiments has been used in a variety of subsequent studies. The authors describe a system for classification and retrieval of audio signals based on perceptual and acoustic features. Users can search and retrieve sounds based on specific features, combinations of them, or previously learned classes. In addition, query-by-example similarity retrieval is also supported. For evaluation, a database consisting of 400 sound files, mainly isolated sounds from 1 to 15 seconds long, was used to develop and evaluate the system. Interesting aspects of this paper are a relatively complete description of a full audio retrieval system and good ideas for applications of and motivations behind such a system.
[123] Cheng Yang. MACS: Music Audio Characteristic Sequence Indexing for Similarity Retrieval. In Proc. Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, 2001. IEEE.

An interesting problem in audio MIR is the identification of different performances of the same piece of music. The algorithm described in this paper is based on identifying local peaks in signal power and extracting a spectral vector near each peak. Nearby peaks are grouped together to form characteristic sequences which are subsequently used for indexing, employing a hashing scheme known as Locality-Sensitive Hashing. The algorithm is evaluated on a set of 120 pieces for the following cases: same digital copy; same analog source, different digital copy; same instrumental performance, different vocals; same score, different performances (possibly at different tempi); same melody, otherwise different. The results are good for the first four cases. More work is required for correctly identifying pieces that have the same melody but a different instrumental background.
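
A minimal sketch of the hashing idea (random-projection LSH, which is one common family and not necessarily the exact scheme used in the paper):

    import numpy as np
    from collections import defaultdict

    class RandomProjectionLSH:
        def __init__(self, dim, n_bits=16, seed=0):
            rng = np.random.default_rng(seed)
            self.planes = rng.normal(size=(n_bits, dim))   # random hyperplanes
            self.buckets = defaultdict(list)

        def _key(self, v):
            # Hash key: on which side of each hyperplane the vector falls.
            return tuple((self.planes @ v > 0).astype(int))

        def add(self, v, label):
            self.buckets[self._key(v)].append(label)

        def query(self, v):
            # Candidate matches are the items that share the same bucket.
            return self.buckets.get(self._key(v), [])

    lsh = RandomProjectionLSH(dim=64)
    rng = np.random.default_rng(1)
    vectors = rng.normal(size=(1000, 64))
    for i, v in enumerate(vectors):
        lsh.add(v, i)
    print(lsh.query(vectors[10] + 0.001 * rng.normal(size=64)))   # likely contains 10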
[124] Cheng Yang. Music Database Retrieval based on Spectral Similarity. In Proc. Int. Symposium on Music Information Retrieval (Poster) (ISMIR), Bloomington, Indiana, 2001.

This poster presents the same spectral-similarity indexing algorithm as the previous entry; see the annotation of [123] above.
[125] Tong Zhang and Jay Kuo. Audio Content Analysis for online Audiovisual Data Segmentation and Classification. IEEE Transactions on Speech and Audio Processing, 9(4):441-457, May 2001.

A heuristic rule-based system for the segmentation and classification of audio signals from movies or TV programs, based on the time-varying properties of simple features, is proposed in this paper. Signals are classified into the two broad groups of music and non-music, which are further subdivided into harmonic environmental sound, pure music, song, speech with music, and environmental sound with music (music), and pure speech and non-harmonic environmental sound (non-music).

