Dynamic Programming for Music Search

Roger B. Dannenberg

Introduction

Dynamic programming techniques provide a powerful and flexible means to compare music sequences. I was the first person to apply these techniques to music, specifically for real-time computer accompaniment, where the problem is to efficiently match performed notes to a score in spite of errors in performance.

These notes demonstrate the approach with a concrete example, and describe variations.

In all cases, we are considering discrete sequences of symbols, e.g. “How similar is ABCXDefG to AYBCDEZFG?” We might say these are similar because they both contain ABCDG in that order, or if we consider e similar to E and f similar to F, we could say both contain letters similar to ABCDEFG. For musical applications, the symbols can be an encoding of musical pitches, chords, intervals, inter-onset intervals, etc.

Edit Distance Algorithm

A simple measure of similarity is the following: What is the smallest number of insertions and deletions needed to convert string A into string B? This number is known as the edit distance. (We’ll see later that edit distance can be parameterized, so this is just the simplest case.)

Notice that an insertion in string A is equivalent to a deletion in string B; either can account for a one-character difference. The naive way to compute edit distance is to generate all possible deletions from both string A and string B, find the longest match, and count the deletions needed to reach it from each string. E.g. if string A is ABC and string B is AC, we compute all variations of string A: ABC, AB, AC, BC, A, B, C (and the empty string), and all variations of string B: AC, A, C (and the empty string), and look for the longest match, AC, giving an edit distance of 1. But this is exponential in time: If a string has length N, each element is either deleted or not, so we have 2^N cases to check!
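
To make the blow-up concrete, here is a small brute-force sketch in Python (the function names are my own invention, and this is for illustration only; it really does enumerate all 2^N deletion variants):

    from itertools import combinations

    def all_deletions(s):
        """Every string obtainable from s by deleting characters: 2^len(s) cases."""
        return {"".join(kept) for n in range(len(s) + 1)
                for kept in combinations(s, n)}

    def naive_edit_distance(a, b):
        """Find the longest string common to both sets of deletion variants,
        then count the deletions each input needs to reach it."""
        common = all_deletions(a) & all_deletions(b)
        longest = max(common, key=len)  # common always contains the empty string
        return (len(a) - len(longest)) + (len(b) - len(longest))

    print(naive_edit_distance("ABC", "AC"))  # prints 1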

Dynamic programming builds a solution incrementally. The main principles that allow a faster-than-exponential solution are that we are only interested in the best solution, and that we can build upon partial solutions because edit distance is additive.

The edit distance problem is closely related to the Longest Common Subsequence (LCS) problem, where you want to find an ordered sequence that shares the most elements with both A and B. The difference is that edit distance adds up penalties for edits (so fewer edits are better), whereas LCS adds up matches between A and B (so more is better).

Intuition

Here is a quick preview of how this can work. Imagine that we are computing the edit distance between AXBCDEF and ABCYDEF. Since the last characters of AXBCDEF and ABCYDEF match, we can reduce the problem to finding the edit distance between AXBCDE and ABCYDE. Now we've reduced the problem to a simpler one. The key to dynamic programming is that we will solve all simpler problems first, and, surprisingly, there are only N^2 of them, not an exponential number.

This gives us a recursive way to compute edit distance. Let's let α and β represent arbitrary strings, so αX is a string ending in X, and βY is another string ending in Y. Here's where you have to really pay attention and think hard because it is the heart of dynamic programming:
The edit distance from αX to βY is the minimum of:
1. the edit distance from α to βY, plus 1 (delete X),
2. the edit distance from αX to β, plus 1 (delete Y), and
3. the edit distance from α to β, plus 0 if X matches Y, or plus 2 if not (delete both X and Y).

This recursion still takes exponential time if implemented naively. We introduce a simple trick: store the edit distance between every pair of prefixes of the two strings so we compute each one only once. There are N^2 pairs of prefixes, so the total work (and space) will be N^2, assuming we are comparing two strings of length N. (Yes, it's N×M for strings of lengths N and M.)
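
Here is one way to express that idea in Python, a sketch using functools.lru_cache to remember each prefix-pair distance the first time it is computed (the names are mine; insertions and deletions cost 1, so a mismatched pair costs 2):

    from functools import lru_cache

    def edit_distance_memo(a, b):
        @lru_cache(maxsize=None)        # memoize: each (i, j) is computed once
        def d(i, j):                    # edit distance between a[:i] and b[:j]
            if i == 0:
                return j                # delete all j characters of b[:j]
            if j == 0:
                return i                # delete all i characters of a[:i]
            return min(d(i - 1, j) + 1,                 # delete a[i-1]
                       d(i, j - 1) + 1,                 # delete b[j-1]
                       d(i - 1, j - 1) + (0 if a[i - 1] == b[j - 1] else 2))
        return d(len(a), len(b))

    print(edit_distance_memo("AXBCDEF", "ABCYDEF"))  # prints 2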

The Data Structure

Finally, the algorithm will compute a matrix like the following, with the first string (AXBCDEF) labeling the columns and the second string (ABCYDEF) labeling the rows.

      A  X  B  C  D  E  F
   A  0  1  2  3  4  5  6
   B  1  2  1  2  3  4  5
   C  2  3  2  1  2  3  4
   Y  3  4  3  2  3  4  5
   D  4  5  4  3  2  3  4
   E  5  6  5  4  3  2  3
   F  6  7  6  5  4  3  2

Each cell of the matrix is an edit distance between two prefixes of the input strings. E.g. the edit distance from prefix AXB of the first string to prefix ABC of the second string is 2, shown in brackets below:

      A  X  B  C  D  E  F
   A  0  1  2  3  4  5  6
   B  1  2  1  2  3  4  5
   C  2  3 [2] 1  2  3  4
   Y  3  4  3  2  3  4  5
   D  4  5  4  3  2  3  4
   E  5  6  5  4  3  2  3
   F  6  7  6  5  4  3  2

Notice that the bottom-right element of the matrix is the edit distance between the two complete strings (here, 2). This is the final output of the algorithm.

The Core Algorithm

The core and inner loop of the algorithm computes each cell of the matrix. Here is some pseudo-code showing the calculation for each row r and column c, where string A labels the columns and string B labels the rows:

M[r, c] = min(M[r-1, c] + 1,    # delete B[r] (equivalently, insert into A)
              M[r, c-1] + 1,    # delete A[c]
              M[r-1, c-1] + (2 if A[c] != B[r] else 0))  # match, or delete both
You should see if you can execute this code in your head to obtain the bracketed 2 in the table above. Then search for examples in the table where the minimum value comes from the other terms of the expression.

One final detail is the “boundary condition” for the first row and column. Assuming zero-based indexing, when we compute row 0 or column 0, this pseudo-code will try to access column and row indices of -1. This would fail in a direct implementation. Intuitively, a row or column of -1 corresponds to the edit distance between a string and the empty string. The edit distance is just the length of the non-empty string, since we need to delete each element to obtain a matching empty string.

In my implementation of the pseudo-code, I replace direct access to array M with a function call that tests for the special cases of row < 0 and col < 0:

def mget(r, c):
    if r >= 0 and c >= 0:
        return M[r][c]
    elif r == -1:
        return c + 1  # add 1 because index is zero-based, string length is one greater
    else:  # must be c == -1
        return r + 1
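
Putting the boundary handling together with the core loop, a complete implementation might look like the following sketch (the surrounding function is just my packaging of the pseudo-code above):

    def edit_distance(A, B):
        """Fill the whole matrix row by row; the answer is the bottom-right cell."""
        M = [[0] * len(A) for _ in range(len(B))]

        def mget(r, c):
            if r >= 0 and c >= 0:
                return M[r][c]
            elif r == -1:
                return c + 1      # distance from a prefix of A to the empty string
            else:                 # must be c == -1
                return r + 1      # distance from a prefix of B to the empty string

        for r in range(len(B)):
            for c in range(len(A)):
                M[r][c] = min(mget(r - 1, c) + 1,
                              mget(r, c - 1) + 1,
                              mget(r - 1, c - 1) + (2 if A[c] != B[r] else 0))
        return M[len(B) - 1][len(A) - 1]

    print(edit_distance("AXBCDEF", "ABCYDEF"))  # prints 2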

Recap

You should now understand how dynamic programming can be used to compute the edit distance between two strings efficiently (in N^2 time). The core idea is to calculate the edit distance between each pair of string prefixes, using previously computed edit distances between smaller prefixes to make the problem fast and simple.

Variations

Dynamic programming is actually a flexible framework that can do much more than compute edit distance. Here are some extensions.

Replacement

In addition to deletion from string A or deletion from string B (equivalent to insertion into string A), we could add another editing operation: replacement of one character with another. If the replacement cost is 1, then we can modify our core calculation to the following:
M[r, c] = min(M[r-1, c] + 1, 
              M[r, c-1] + 1,
              M[r-1, c-1] + (1 if A[c] != B[r] else 0))
We simply replaced 2 (deletion from string A + deletion from string B, total cost 2) with 1 (replacement to make A[c] match B[r], total cost 1).

Costs

The previous section on replacement suggests that we might “play” with other costs. We could assign costs to editing operations, with insertion cost CI, deletion cost CD, and replacement cost CR:
M[r, c] = min(M[r-1, c] + CI, 
              M[r, c-1] + CD,
              M[r-1, c-1] + (CR if A[c] != B[r] else 0))
For a real example, if string A represented pitches extracted from audio, and there was a high likelihood of detecting repetitions (e.g. instead of the true melody ABCD you might expect the transcription to be AABBBCCDDDD), then you could assign a very low (or zero) deletion cost CD to reduce the penalty for these transcription errors.
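
As a sketch, the weighted version is a small change to the memoized recursion shown earlier (CI, CD, and CR as in the pseudo-code; the default values are just placeholders):

    from functools import lru_cache

    def weighted_edit_distance(A, B, CI=1, CD=1, CR=1):
        """Weighted edit distance following the pseudo-code above. Setting
        CR = CI + CD recovers the original insert/delete-only distance."""
        @lru_cache(maxsize=None)
        def d(r, c):       # B[:r+1] vs. A[:c+1]; index -1 means the empty prefix
            if r == -1:
                return (c + 1) * CD       # delete every remaining character of A
            if c == -1:
                return (r + 1) * CI       # insert every remaining character of B
            return min(d(r - 1, c) + CI,
                       d(r, c - 1) + CD,
                       d(r - 1, c - 1) + (CR if A[c] != B[r] else 0))
        return d(len(B) - 1, len(A) - 1)

    # With CD = 0, the repeated transcription notes cost nothing to skip:
    print(weighted_edit_distance("AABBBCCDDDD", "ABCD", CD=0))  # prints 0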

Replacement Cost As a Function of Data

A further extension is to make the replacement cost a function of A[c] and B[r]. In practice, we might decide that singing a nearby pitch should have a lower replacement cost than singing one far off, or that singing an octave off (12 semitones) should have a lower cost than some non-octave error. This is a powerful technique for situations where exact matches are not very likely.
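
For example, a pitch-aware replacement cost might look like the sketch below, where the specific numbers are invented for illustration (in practice they should be tuned, or estimated from data as described next). The diagonal term of the recursion then becomes M[r-1, c-1] + replacement_cost(A[c], B[r]).

    def replacement_cost(p, q):
        """Illustrative cost for hearing MIDI pitch q where the score has pitch p:
        exact matches are free, octave errors are cheap, and other wrong notes
        cost more the farther off they are (capped)."""
        semitones = abs(p - q)
        if semitones == 0:
            return 0.0
        if semitones % 12 == 0:
            return 0.5                        # an octave off: mildly penalized
        return 1.0 + 0.25 * min(semitones, 6) # nearby errors cost less than far ones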

Probabilistic Interpretation

What does cost really mean? How should we determine costs? There’s a fascinating connection between edit distance and probability. As you know, edit distance is the sum of costs assigned to different editing operations. Imagine a probabilistic model where we assume the strings are actually identical, but one is mutated into the other by editing operations applied at random. For example, if the question is “Did Melonie sing song A?”, we could assume she did, but with some random lapses of memory and skill, and then ask “How likely is it that those random errors would turn song A into the performance we heard?” Then, if we knew the probabilities of insertions, deletions, and replacements, we could form the product of the probabilities of all these changes. But products are not what dynamic programming computes.

Fortunately, there is a simple trick: use the negative log of each probability as its cost. Costs then add where probabilities would multiply, so the minimal edit distance gives us the negative log of the probability of the most likely random mutation from string A to string B. It’s a bit technical, and arguably not the perfect answer you want, but in practice this is a very useful interpretation of the numbers and provides a nice basis for estimating costs. For example, you can actually use real-life data to estimate replacement costs, e.g. what's the likelihood someone will sing a half-step off vs. a whole-step off, etc.
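
Concretely, given estimated error probabilities (the numbers below are invented for illustration), each cost is just the negative log of the corresponding probability:

    import math

    # Invented probabilities of each kind of performance error:
    P_INSERT  = 0.05    # an extra note appears
    P_DELETE  = 0.05    # a note is skipped
    P_REPLACE = 0.02    # a wrong note is sung

    CI = -math.log(P_INSERT)     # about 3.0
    CD = -math.log(P_DELETE)     # about 3.0
    CR = -math.log(P_REPLACE)    # about 3.9; rarer errors get larger costs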

Matching Short Queries

Now consider the problem of matching a short query or excerpt to a larger string. In other words, what is the edit distance between string A (the query) and any substring of string B? It might seem like you need to consider all N^2 substrings of string B, but once again, dynamic programming offers a nice trick.

Recall that we had to deal with the “boundary condition” problem in our core algorithm description. Instead of accessing column -1 (which did not exist), we interpreted -1 as meaning the cost of deleting every character up to the current row, or the cost of matching a prefix to the empty string.

To allow for matching a query to any substring of string B, we can simply modify the cost returned when we access column -1. We want zero cost to skip a prefix of B and then start matching, so we can simply return 0 as the cost whenever we need to access M[r, -1].

We also want to ignore arbitrary suffixes of string B. To do this, we simply scan down the last column and pick the lowest number (edit distance) as our final result. That way, we do not consider the additional insertion costs charged against us for trying to match the remainder (suffix) of string B.
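
Here is a sketch combining both changes, free skipping of any prefix of B and a final scan down the last column (the function name and default costs are mine):

    from functools import lru_cache

    def best_substring_match(A, B, CI=1, CD=1, CR=2):
        """Edit distance from query A to the best-matching substring of B."""
        @lru_cache(maxsize=None)
        def d(r, c):
            if c == -1:
                return 0                  # skipping any prefix of B is free
            if r == -1:
                return (c + 1) * CD       # query characters still cost to delete
            return min(d(r - 1, c) + CI,
                       d(r, c - 1) + CD,
                       d(r - 1, c - 1) + (CR if A[c] != B[r] else 0))

        # Ignore any suffix of B: the answer is the best value in the last column.
        return min(d(r, len(A) - 1) for r in range(len(B)))

    print(best_substring_match("BCD", "AXBCDEF"))  # prints 0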

Dynamic Time Warping

Sometimes, sequences do not come as strings of discrete symbols or events. Consider stock market data, which is essentially a continuous function of time, or pitch contours, which vary continuously. To use these techniques on continuous data, we can simply “chop up” the continuous data into samples representing equal time periods (e.g. every day for stocks, every 200ms for melody, etc.). Then we can apply these dynamic programming techniques. If the data has continuous values, e.g. stock prices or the frequency of a sung pitch, we can use continuous distance functions to calculate replacement costs.

This approach is called Dynamic Time Warping. It is a special case of dynamic programming, and all of the techniques described above can be applied.
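
Here is a minimal DTW sketch over sampled numeric sequences, assuming absolute difference as the local cost (both the cost function and the step pattern are the simplest possible choices, just for illustration):

    from functools import lru_cache

    def dtw_distance(x, y):
        """Dynamic time warping between two sequences of samples."""
        @lru_cache(maxsize=None)
        def d(i, j):
            if i == -1 and j == -1:
                return 0.0
            if i == -1 or j == -1:
                return float("inf")       # warping paths must start at the beginning
            cost = abs(x[i] - y[j])       # local "replacement" cost
            return cost + min(d(i - 1, j),      # x advances while y[j] repeats
                              d(i, j - 1),      # y advances while x[i] repeats
                              d(i - 1, j - 1))  # both advance
        return d(len(x) - 1, len(y) - 1)

    print(dtw_distance([1, 2, 3, 3], [1, 2, 2, 3]))  # prints 0.0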

Summary

Dynamic programming is a flexible framework for sequence matching. It has been especially effective in a number of music applications.