Music Understanding

See also Computer Accompaniment and Beat Tracking

This page is divided into several categories:

General Overviews

These are all overviews of my work in this area. The IAKTA/LIST paper is the most up-to-date, although it is actually quite dated. Since then, I've worked on Query-by-Humming, polyphonic score alignment, music search by polyphonic alignment, music structure analysis, and beat tracking informed by music structure.

Dannenberg, ``Music Understanding,'' 1987/1988 Computer Science Research Review, Carnegie Mellon School of Computer Science, pp. 19-28.

[Postscript Version.] [Adobe Acrobat (PDF) Version.]

Dannenberg, ``Recent Work In Real-Time Music Understanding By Computer,'' Music, Language, Speech, and Brain, Wenner-Gren International Symposium Series, Sundberg, Nord, and Carlson, ed., Macmillan, 1991, pp. 194-202.

Postscript Version.

Dannenberg, ``Computerbegleitung und Musikverstehen,'' in Neue Musiktechnologie, Bernd Enders, ed., Schott, 1993, Mainz, pp. 241-252.

Dannenberg, ``Recent Work in Music Understanding,'' in Proceedings of the 11th Annual Symposium on Small Computers in the Arts, Philadelphia: SCAN, (November 1991), pp. 9-14.

Postscript Version.

Dannenberg, ``Music Understanding and the Future of Computer Music,'' Contemporary Music Review, (to appear).

Dannenberg, ``Music Understanding by Computer,'' in IAKTA/LIST International Workshop on Knowledge Technology in the Arts Proceedings, International Association of Knowledge Technology in the Arts, Inc. in cooperation with Laboratories of Image Information Science and Technology, Osaka Japan, pp. 41-56 (September 16, 1993).

ABSTRACT. Music Understanding refers to the recognition or identification of structure and pattern in musical information. Music understanding projects initiated by the author are discussed. In the first, Computer Accompaniment, the goal is to follow a performer in a score. Knowledge of the position in the score as a function of time can be used to synchronize an accompaniment to the live performer and automatically adjust to tempo variations. In the second project, it is shown that statistical methods can be used to recognize the location of an improviser in a cyclic chord progression such as the 12-bar blues. The third project, Beat Tracking, attempts to identify musical beats using note-onset times from a live performance. Parallel search techniques are used to consider several hypotheses simultaneously, and both timing and higher-level musical knowledge are integrated to evaluate the hypotheses. The fourth project, the Piano Tutor, identifies student performance errors and offers advice. The fifth project studies human tempo tracking with the goal of improving the naturalness of automated accompaniment systems.

[Postscript Version.] [Adobe Acrobat (PDF) Version.]

Style Classification

This work is about getting a computer music system to listen to a performance and determine aspects of its style.

Dannenberg, Thom, and Watson, ``A Machine Learning Approach to Musical Style Recognition,'' in 1997 International Computer Music Conference, International Computer Music Association (September 1997), pp. 344-347.

ABSTRACT: Much of the work on perception and understanding of music by computers has focused on low-level perceptual features such as pitch and tempo. Our work demonstrates that machine learning can be used to build effective style classifiers for interactive performance systems. We also present an analysis explaining why these techniques work so well when hand-coded approaches have consistently failed. We also describe a reliable real-time performance style classifier.

[Postscript Version.] [Adobe Acrobat (PDF) Version.]

Han, Rho, Dannenberg, and Hwang, ``SMERS: Music Emotion Recognition Using Support Vector Regression'' in Proceedings of the 10th International Conference on Music Information Retrieval (ISMIR 2009), (October 2009), pp. 651-656.

ABSTRACT: Music emotion plays an important role in music retrieval, mood detection and other music-related applications. Many issues for music emotion recognition have been addressed by different disciplines such as physiology, psychology, cognitive science and musicology. We present a support vector regression (SVR) based music emotion recognition system. The recognition process consists of three steps: (i) seven distinct features are extracted from music; (ii) those features are mapped into eleven emotion categories on Thayer's two-dimensional emotion model; (iii) two regression functions are trained using SVR and then arousal and valence values are predicted. We have tested our SVR-based emotion classifier in both Cartesian and polar coordinate systems empirically. The results indicate the SVR classifier in the polar representation produces satisfactory results which reach 94.55% accuracy, superior to the SVR (in Cartesian) and other machine learning classification algorithms such as SVM and GMM.

[Adobe Acrobat (PDF) Version.]

Music Information Retrieval

This work is mostly focussed on retrieval from melodic databases using a sung or hummed query as the search key. This raises many issues relating to melodic similarity, music representation, and pitch recognition. Many of the melodic similarity techniques are related to earlier work in Computer Accompaniment.

Dannenberg, Foote, Tzanetakis, and Weare, ``Panel: New Directions in Music Information Retrieval,'' in Proceedings of the 2001 International Computer Music Conference, International Computer Music Association, (September 2001), pp. 52-59.

Mazzoni and Dannenberg, ``Melody Matching Directly from Audio,'' in ISMIR 2001 2nd Annual International Symposium on Music Information Retrieval, Bloomington: Indiana University, (2001), pp. 73-82.

ABSTRACT. In this paper we explore a technique for content-based music retrieval using a continuous pitch contour derived from a recording of the audio query instead of a quantization of the query into discrete notes. Our system determines the pitch for each unit of time in the query and then uses a time-warping algorithm to match this string of pitches against songs in a database of MIDI files. This technique, while much slower at matching, is usually far more accurate than techniques based on discrete notes. It would be an ideal technique to use to provide the final ranking of candidate results produced by a faster but less robust matching algorithm.

[Acrobat (PDF) Version]
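
To make the time-warping idea concrete, here is a minimal sketch, not the code from the paper. It assumes the query and each song have already been reduced to one pitch value (in semitones) per time frame, it uses plain dynamic time warping with an absolute-difference cost, and it ignores transposition and tempo normalization. The names `dtw_distance` and `rank_songs` and the toy database are hypothetical.

```python
def dtw_distance(query, target):
    """Dynamic time warping cost between two pitch contours.

    Both arguments are sequences of pitches in semitones, one per frame.
    This is an illustrative sketch, not the system described in the paper.
    """
    n, m = len(query), len(target)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning query[:i] with target[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = abs(query[i - 1] - target[j - 1])
            cost[i][j] = frame_cost + min(
                cost[i - 1][j],      # query frame repeats a target frame
                cost[i][j - 1],      # target frame repeats a query frame
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def rank_songs(query, database):
    """Return song names ordered from best to worst match."""
    return sorted(database, key=lambda name: dtw_distance(query, database[name]))


# Toy data (assumed): a slow rising query against two short melodies.
query = [60, 60, 62, 62, 64, 64]
database = {"rising": [60, 62, 64], "falling": [60, 59, 57]}
best_match = rank_songs(query, database)[0]
```

The quadratic cost of filling the table for every song is one reason a frame-based matcher like this is slow, which is why it suits re-ranking a short candidate list better than a full database scan.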

Birmingham, Dannenberg, Wakefield, Bartsch, Bykowski, Mazzoni, Meek, Mellody, and Rand, ``MUSART: Music Retrieval via Aural Queries,'' in ISMIR 2001 2nd Annual International Symposium on Music Information Retrieval, Bloomington: Indiana University, (2001), pp. 73-82.

Dannenberg, ``Music Information Retrieval as Music Understanding,'' in ISMIR 2001 2nd Annual International Symposium on Music Information Retrieval, Bloomington: Indiana University, (2001), pp. 139-142.

ABSTRACT. Much of the difficulty in Music Information Retrieval can be traced to problems of good music representations, understanding music structure, and adequate models of music perception. In short, the central problem of Music Information Retrieval is Music Understanding, a topic that also forms the basis for much of the work in the fields of Computer Music and Music Perception. It is important for all of these fields to communicate and share results. With this goal in mind, the author's work on Music Understanding in interactive systems, including computer accompaniment and style recognition, is discussed.

[Acrobat (PDF) Version]

Hu and Dannenberg, ``A Comparison of Melodic Database Retrieval Techniques Using Sung Queries,'' in Joint Conference on Digital Libraries, New York: ACM Press, (2002), pp. 301-307.

ABSTRACT. Query-by-humming systems search a database of music for good matches to a sung, hummed, or whistled melody. Errors in transcription and variations in pitch and tempo can cause substantial mismatch between queries and targets. Thus, algorithms for measuring melodic similarity in query-by-humming systems should be robust. We compare several variations of search algorithms in an effort to improve search precision. In particular, we describe a new frame-based algorithm that significantly outperforms note-by-note algorithms in tests using sung queries and a database of MIDI-encoded music.

[Acrobat (PDF) Version]

Hu, Dannenberg, and Lewis. ``A Probabilistic Model of Melodic Similarity,'' in Proceedings of the 2002 International Computer Music Conference. San Francisco: International Computer Music Association, (2002), pp. 509-15.

ABSTRACT. Melodic similarity is an important concept for music databases, musicological studies, and interactive music systems. Dynamic programming is commonly used to compare melodies, often with a distance function based on pitch differences measured in semitones. This approach computes an "edit distance" as a measure of melodic dissimilarity. The problem can also be viewed in probabilistic terms: What is the probability that a melody is a "mutation" of another melody, given a table of mutation probabilities? We explain this approach and demonstrate how it can be used to search a database of melodies. Our experiments show that the probabilistic model performs better than a typical "edit distance" comparison.

[Acrobat (PDF) Version]
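
For readers unfamiliar with the "edit distance" baseline mentioned in the abstract, the following is a minimal sketch of the standard dynamic programming recurrence. It is not the paper's implementation. It assumes melodies are lists of MIDI pitch numbers, charges the semitone difference for a substitution, and uses a fixed `gap_cost` for an inserted or deleted note; the function name and the cost values are hypothetical.

```python
def melodic_edit_distance(a, b, gap_cost=2.0):
    """Edit distance between two melodies given as lists of MIDI pitches.

    Substituting one note for another costs their difference in semitones;
    inserting or deleting a note costs gap_cost (an assumed value).
    """
    n, m = len(a), len(b)
    # dist[i][j] is the cheapest way to turn a[:i] into b[:j].
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap_cost
    for j in range(1, m + 1):
        dist[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = dist[i - 1][j - 1] + abs(a[i - 1] - b[j - 1])
            delete = dist[i - 1][j] + gap_cost
            insert = dist[i][j - 1] + gap_cost
            dist[i][j] = min(substitute, delete, insert)
    return dist[n][m]


# Toy melodies (assumed): one altered note, then one dropped note.
one_semitone_off = melodic_edit_distance([60, 62, 64], [60, 62, 65])
one_note_missing = melodic_edit_distance([60, 62, 64], [60, 64])
```

The costs here are chosen by hand, which is exactly the weakness the probabilistic view addresses: it replaces ad hoc costs with a table of mutation probabilities.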

Dannenberg, Birmingham, Tzanetakis, Meek, Hu, and Pardo. “The MUSART testbed for query-by-humming evaluation,” in ISMIR 2003: Proceedings of the Fourth International Conference on Music Information Retrieval, Baltimore: Johns Hopkins University, (2003), pp. 41-50.

A slightly expanded and revised version of this paper (not online) is published in Computer Music Journal:

Dannenberg, Birmingham, Tzanetakis, Meek, Hu, and Pardo, “The MUSART Testbed for Query-By-Humming Evaluation,” Computer Music Journal, 28(2) (Summer 2004), pp. 34-48.

ABSTRACT: Evaluating music information retrieval systems is acknowledged to be a difficult problem. We have created a database and a software testbed for the systematic evaluation of various query-by-humming (QBH) search systems. As might be expected, different queries and different databases lead to wide variations in observed search precision. “Natural” queries from two sources led to lower performance than that typically reported in the QBH literature. These results point out the importance of careful measurement and objective comparisons to study retrieval algorithms. This study compares search algorithms based on note-interval matching with dynamic programming, fixed-frame melodic contour matching with dynamic time warping, and a hidden Markov model. An examination of scaling trends is encouraging: precision falls off very slowly as the database size increases. This trend is simple to compute and could be useful to predict performance on larger databases.

[Acrobat (PDF) Version]

Birmingham, Dannenberg, and Pardo, “Query by Humming With the VocalSearch System,” Communications of the ACM, 49(8) (August 2006), pp. 49-52.

ABSTRACT: Don't know the composer, performer, or title? Let the system match the theme you know to the song you want. When one wishes to find a piece of music through Apple Computer's iTunes or at the local public library, the usual approach is to enter some textual information (metadata) about the piece (such as composer, performer, or title) into a search engine. However, when one knows the music, but not its metadata, standard search engines are not an option. One might instead hum or whistle a portion of the piece, providing a query for a search engine based on content (the melody) rather than on metadata. Systems able to find a song based on a sung, hummed, or whistled melody are called query by humming, or QBH, even though humming is not always the input.

[Acrobat (PDF) Version]

Dannenberg, Birmingham, Pardo, Hu, Meek, Tzanetakis, “A Comparative Evaluation of Search Techniques for Query-by-Humming Using the MUSART Testbed,” Journal of the American Society for Information Science and Technology, 58(5) (March 2007), pp. 687-701.

ABSTRACT. Query-by-Humming systems offer content-based searching for melodies and require no special musical training or knowledge. Many such systems have been built, but there has not been much useful evaluation and comparison in the literature due to the lack of shared databases and queries. The MUSART project testbed allows various search algorithms to be compared using a shared framework that automatically runs experiments and summarizes results. Using this testbed, we compared algorithms based on string alignment, melodic contour matching, a hidden Markov model, n-grams, and CubyHum. Retrieval performance is very sensitive to distance functions and the representation of pitch and rhythm, which raises questions about some previously published conclusions. Some algorithms are particularly sensitive to the quality of queries. Our queries, which are taken from human subjects in a fairly realistic setting, are quite difficult, especially for n-gram models. Finally, simulations on query-by-humming performance as a function of database size indicate that retrieval performance falls only slowly as the database size increases.

[Acrobat (PDF) Version]

Structural Analysis

Using similarity and repetition to guide them, listeners can discover structure in music. This research aims to build music listening models that, starting with audio such as CD recordings, find patterns and generate explanations of the music. Explanations include analyses of structure, e.g. an "AABA" form, as well as other relationships.

Dannenberg, ``Listening to `Naima': An Automated Structural Analysis of Music from Recorded Audio,'' in Proceedings of the 2002 International Computer Music Conference. San Francisco: International Computer Music Association, (2002), pp. 28-34.

ABSTRACT. A model of music listening has been automated. A program takes digital audio as input, for example from a compact disc, and outputs an explanation of the music in terms of repeated sections and the implied structure. For example, when the program constructs an analysis of John Coltrane's "Naima," it generates a description that relates to the AABA form and notices that the initial AA is omitted the second time. The algorithms are presented and results with two other input songs are also described. This work suggests that music listening is based on the detection of relationships and that relatively simple analyses can successfully recover interesting musical structure.

[Acrobat (PDF) Version]

Dannenberg and Hu, ``Discovering Musical Structure in Audio Recordings,'' in Music and Artificial Intelligence: Second International Conference, C. Anagnostopoulou, M. Ferrand, A. Smaill, eds., Lecture notes in computer science; Vol 2445: Lecture notes in artificial intelligence, Berlin: Springer Verlag, (2002), pp. 43-57.

ABSTRACT. Music is often described in terms of the structure of repeated phrases. For example, many songs have the form AABA, where each letter represents an instance of a phrase. This research aims to construct descriptions or explanations of music in this form, using only audio recordings as input. A system of programs is described that transcribes the melody of a recording, identifies similar segments, clusters these segments to form patterns, and then constructs an explanation of the music in terms of these patterns. Additional work using spectral information rather than melodic transcription is also described. Examples of successful machine “listening” and music analysis are presented.

[Acrobat (PDF) Version]

Dannenberg and Hu, “Pattern Discovery Techniques for Music Audio,” in ISMIR 2002 Conference Proceedings: Third International Conference on Music Information Retrieval, M. Fingerhut, ed., Paris: IRCAM, (2002), pp. 63-70.

ABSTRACT. Human listeners are able to recognize structure in music through the perception of repetition and other relationships within a piece of music. This work aims to automate the task of music analysis. Music is “explained” in terms of embedded relationships, especially repetition of segments or phrases. The steps in this process are the transcription of audio into a representation with a similarity or distance metric, the search for similar segments, forming clusters of similar segments, and explaining music in terms of these clusters. Several transcription methods are considered: monophonic pitch estimation, chroma (spectral) representation, and polyphonic transcription followed by harmonic analysis. Also, several algorithms that search for similar segments are described. These techniques can be used to perform an analysis of musical structure, as illustrated by examples.

[Acrobat (PDF) Version]

Dannenberg and Hu, “Pattern Discovery Techniques for Music Audio,” Journal of New Music Research, (June 2003), pp. 153-164.

Our ISMIR 2002 paper (listed above) was selected from the conference papers for publication in JNMR. The JNMR version is slightly expanded and revised.

ABSTRACT. Human listeners are able to recognize structure in music through the perception of repetition and other relationships within a piece of music. This work aims to automate the task of music analysis. Music is “explained” in terms of embedded relationships, especially repetition of segments or phrases. The steps in this process are the transcription of audio into a representation with a similarity or distance metric, the search for similar segments, forming clusters of similar segments, and explaining music in terms of these clusters. Several pre-existing signal analysis methods have been used: monophonic pitch estimation, chroma (spectral) representation, and polyphonic transcription followed by harmonic analysis. Also, several algorithms that search for similar segments are described. Experience with these various approaches suggests that there are many ways to recover structure from music audio. Examples are offered using classical, jazz, and rock music.

[Acrobat (PDF) Version]

Dannenberg and Goto, ``Music Structure Analysis from Acoustic Signals,'' in Handbook of Signal Processing in Acoustics, Vol 1, Springer Verlag, 2009, pp. 305-331.

This book chapter attempts to summarize various techniques and approaches.

ABSTRACT. Music is full of structure, including sections, sequences of distinct musical textures, and the repetition of phrases or entire sections. The analysis of music audio relies upon feature vectors that convey information about music texture or pitch content. Texture generally refers to the average spectral shape and statistical fluctuation, often reflecting the set of sounding instruments, e.g. strings, vocal, or drums. Pitch content reflects melody and harmony, which is often independent of texture. Structure is found in several ways. Segment boundaries can be detected by observing marked changes in locally averaged texture. Similar sections of music can be detected by clustering segments with similar average textures. The repetition of a sequence of music often marks a logical segment. Repeated phrases and hierarchical structures can be discovered by finding similar sequences of feature vectors within a piece of music. Structure analysis can be used to construct music summaries and to assist music browsing.

[Acrobat (PDF) Version]
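
As a rough illustration of "finding similar sequences of feature vectors within a piece of music," the sketch below compares every pair of frames with cosine similarity and reports diagonal runs of similar frames, which correspond to a passage that recurs later. It is only one simple way to do this and is not taken from the chapter. The function names, the threshold, and the toy one-hot "frames" are all assumptions; real input would be something like 12-dimensional chroma vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def find_repetitions(frames, min_length=3, threshold=0.95):
    """Return (start_a, start_b, length) for each repeated run of frames.

    A run is a stretch where frame i resembles frame i + lag for
    consecutive i, i.e. a diagonal stripe in the self-similarity matrix.
    """
    n = len(frames)
    similar = [
        [cosine_similarity(frames[i], frames[j]) >= threshold for j in range(n)]
        for i in range(n)
    ]
    repeats = []
    for lag in range(1, n):
        run_start = None
        # One extra step past the end acts as a sentinel that closes open runs.
        for i in range(n - lag + 1):
            hit = i < n - lag and similar[i][i + lag]
            if hit and run_start is None:
                run_start = i
            elif not hit and run_start is not None:
                length = i - run_start
                if length >= min_length:
                    repeats.append((run_start, run_start + lag, length))
                run_start = None
    return repeats


# Toy input (assumed): frames labeled A B C X A B C, encoded as one-hot vectors.
A, B, C, X = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
repeats = find_repetitions([A, B, C, X, A, B, C])
```

With real audio features the matches are noisy, so practical systems smooth the matrix or tolerate gaps in a run; this sketch does neither.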

Music Alignment

Music alignment is a capability that forms a bridge between signals and symbols. For example, by aligning an audio recording with a MIDI file, you obtain a transcription of the audio. By aligning two audio recordings, you can detect differences in tempo and interpretation. Computer accompaniment also relies on alignment. The papers listed here exploit some of the techniques introduced for Computer Accompaniment, but explore other applications and the possibility of working with polyphonic music.

Hu, Dannenberg, and Tzanetakis. ``Polyphonic Audio Matching and Alignment for Music Retrieval,'' in 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New York: IEEE (2003), pp. 185-188.

ABSTRACT. We describe a method that aligns polyphonic audio recordings of music to symbolic score information in standard MIDI files without the difficult process of polyphonic transcription. By using this method, we can search through a MIDI database to find the MIDI file corresponding to a polyphonic audio recording.

We must have run out of space for a longer abstract. This paper covers two interesting experiments. One compares different features for alignment and concludes that the chromagram is better than multiple pitch estimation, spectra, and mel cepstra. The paper also includes an experiment where the quality of match is used to search for MIDI files that match audio. It works, but not very reliably.

[Acrobat (PDF) Version]

Dannenberg and Hu. ``Polyphonic Audio Matching for Score Following and Intelligent Audio Editors,'' in Proceedings of the 2003 International Computer Music Conference, San Francisco: International Computer Music Association, (2003), pp. 27-34.

This paper was actually submitted before the WASPAA paper, so it does not have some results on comparing different distance metrics. Instead, this paper stresses some different applications, one being the possibility of intelligent audio editors that align audio to symbolic notation or MIDI files to help with search, indexing, aligning multiple takes of live recordings, etc.

ABSTRACT. Getting computers to understand and process audio recordings in terms of their musical content is a difficult challenge. We describe a method in which general, polyphonic audio recordings of music can be aligned to symbolic score information in standard MIDI files. Because of the difficulties of polyphonic transcription, we convert MIDI to audio and perform matching directly on acoustic features. Polyphonic audio matching can be used for polyphonic score following, building intelligent editors that understand the content of recorded audio, and the analysis of expressive performance.

[Acrobat (PDF) Version]

Dannenberg and Raphael, ``Music Score Alignment and Computer Accompaniment,'' Communications of the ACM, 49(8) (August 2006), pp. 38-43.

Dannenberg and Hu, “Bootstrap Learning for Accurate Onset Detection,” Machine Learning 65(2-3) (December 2006), pp. 457-471.

ABSTRACT: Supervised learning models have been applied to create good onset detection systems for musical audio signals. However, this always requires a large set of labeled training examples, and hand-labeling is quite tedious and time consuming. In this paper, we present a bootstrap learning approach to train an accurate note onset detection model. Audio alignment techniques are first used to find the correspondence between a symbolic music representation (such as MIDI data) and an acoustic recording. This alignment provides an initial estimate of note boundaries which can be used to train an onset detector. Once trained, the detector can be used to refine the initial set of note boundaries and training can be repeated. This iterative training process eliminates the need for hand-labeled audio. Tests show that this training method can improve an onset detector initially trained on synthetic data.

[Acrobat (PDF) Version]

See also:
Concatenative Synthesis Using Score-Aligned Transcriptions, a synthesis technique where alignment is used to build a dictionary mapping time slices of MIDI files to units of audio, which are selected and concatenated to "resynthesize" other MIDI files.

Remixing Stereo Music with Score-Informed Source Separation, where alignment is used to help with source separation, with the goal of editing individual instruments within a stereo audio mix.

Bootstrap Learning for Accurate Onset Detection, which uses alignment to find note onsets, which are then used as training data for automatic onset detection.