Consider the following sound. It is just a sequence of notes.
Now consider the following sound. It is a set of chords composed of the notes from the previous example.
Here is another example composed of the same notes (plus a few other notes).
Consider the following example. It comprises a number of phonemes.
This sound is a complete, recognizable sentence built only from the phonemes in the above example.
In both of the above examples, a more complex sound has been built from a number of simpler building blocks -- music from notes, speech from phonemes.
In this work we attempt to derive (and manipulate) these building blocks directly from data. In particular, we will work with sounds, but more generally we will work with other types of data as well.
Representing Sounds
In order to learn the building blocks properly, we need to represent sounds appropriately. We impose two requirements here.
Building blocks must combine additively (cumulatively): the presence of two notes does not distort either note.
Building blocks must combine purely constructively: notes do not cancel. Adding one note will not diminish another.
We note that spectral representations satisfy these requirements: the power spectra of two uncorrelated signals add. We will therefore represent signals spectrographically, as a sequence of power spectral vectors estimated from (overlapping) segments of signal. This converts a time signal such as the one below
to an image such as this
Here, the horizontal axis represents time and the vertical axis represents frequency. The colour at any time and frequency represents the energy at that time and frequency (red is more, blue is less, and anything else is in between; in reality it is a numeric value -- a real number -- that is rendered as colour in these pictures).
The spectrogram of any signal can be derived by a short-time Fourier transform (STFT) of the signal. The STFT typically returns a complex matrix of numbers (where the columns represent time and the rows represent frequencies); however, we retain only the magnitude of these numbers and do not operate on the phase. Thus, each signal is converted to a matrix that can be rendered as a two-dimensional image such as the one above.
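The conversion just described can be sketched in a few lines of Python. This is only a minimal illustration, not the code used to produce the pictures on this page: the Hann window, the frame length of 256 samples, the hop of 128 samples, the 8 kHz sampling rate, the 1 kHz test tone and the function name magnitude_spectrogram are all assumptions made for the example.

```python
import numpy as np

FRAME_LEN = 256      # assumed analysis frame length, in samples
HOP = 128            # assumed hop between frames (50% overlap)
SAMPLE_RATE = 8000   # assumed sampling rate, in Hz


def magnitude_spectrogram(signal, frame_len=FRAME_LEN, hop=HOP):
    """Return a matrix of STFT magnitudes (rows = frequency, columns = time)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into overlapping, windowed segments, one per column.
    frames = np.stack(
        [signal[i * hop:i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    stft = np.fft.rfft(frames, axis=0)  # complex matrix: frequency x time
    return np.abs(stft)                 # keep the magnitude, discard the phase


# One second of a 1 kHz sine wave as a stand-in for a real recording.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 1000 * t)

spec = magnitude_spectrogram(tone)
peak_bin = int(np.argmax(spec.mean(axis=1)))
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(spec.shape, peak_hz)
```

Every entry of the resulting matrix is non-negative, which is what lets us treat it as an energy-like quantity; squaring the magnitudes would give the corresponding power values.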
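Conventional methods of learning building blocks
The spectrographic representation of the speech signal is just a matrix. There are a number of techniques, such as SVD, PCA and ICA, that can derive "bases", which can be viewed as linearly combinable building blocks for the data. Unfortunately, these do not fit the requirements we have posited above.
Here, for example, is the spectrogram of a mixture of two intermittent tones. The two obvious building blocks are the tones themselves.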
The right panel shows the two bases discovered using PCA. The vertical axis is frequency in this panel. Both bases contain both frequency components -- PCA is unable to learn the two separate frequency components. Worse still, the second basis is observed to have negative values at one of the frequencies, while the other has a positive value at the same frequency. Any combination of the two will result in cancellation, which conflicts with our requirement of pure additivity. In any case, what does it mean to say our building block has negative energy at any frequency? The bases are physically meaningless.
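This behaviour can be reproduced on a toy example. The sketch below does not use the data behind the figure: it builds an idealised spectrogram by hand, with two tones occupying hypothetical frequency bins 1 and 4 and switching on and off over eight frames, and computes PCA through an SVD of the mean-subtracted matrix. The bin positions, the on/off patterns and the amplitudes are all made up for illustration.

```python
import numpy as np

N_BINS = 6
LOW, HIGH = 1, 4  # hypothetical frequency bins occupied by the two tones

# Made-up on/off patterns: the tones are intermittent and sometimes overlap.
low_on = np.array([1, 1, 0, 0, 1, 1, 0, 1], dtype=float)
high_on = np.array([0, 1, 0, 0, 1, 1, 1, 1], dtype=float)

# Idealised magnitude spectrogram (rows = frequency bins, columns = frames).
spectrogram = np.zeros((N_BINS, low_on.size))
spectrogram[LOW] = 2.0 * low_on    # the low tone is assumed to be louder
spectrogram[HIGH] = 1.0 * high_on

# PCA: subtract the mean spectrum, then take the leading left singular
# vectors of the centred matrix as the bases.
centered = spectrogram - spectrogram.mean(axis=1, keepdims=True)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
bases = u[:, :2]

for k in range(2):
    print(f"basis {k}: low bin = {bases[LOW, k]:+.3f}, "
          f"high bin = {bases[HIGH, k]:+.3f}")
```

The overall sign of each singular vector is arbitrary, so the printed signs may flip from one linear algebra library to another. In this toy setup, however, each basis has non-zero weight at both bins, and one of the two bases has weights of opposite sign at the two bins, so neither basis corresponds to a single tone.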