KNOWLEDGE TRANSFER FROM WEAKLY LABELED AUDIO USING CONVOLUTIONAL NEURAL NETWORK FOR SOUND EVENTS AND SCENES

Authors: Anurag Kumar, Maksim Khadkevich, Christian Fügen

## ESC-50 Dataset

ESC-50 [4] is a sound event dataset covering 50 sound event classes. The list of sound events can be found here.
• The dataset consists of 2,000 recordings, each 5 seconds long.
• It comes pre-divided into 5 folds.
• In each run, 4 of the 5 folds form the training set and the remaining fold is used for testing. This is repeated for all 5 choices of test fold, and the average accuracy is reported.
• The training set is used for network adaptation as well as for training linear SVMs.
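The fold protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `cross_validate` and `majority_baseline` are hypothetical names, the toy data assumes balanced folds, and the scorer is a trivial placeholder standing in for the network adaptation and linear SVM training, which are not reproduced here.

```python
from collections import Counter
from statistics import mean


def cross_validate(folds, train_and_score):
    """Hold out each fold once; return (mean accuracy, per-fold accuracies)."""
    accuracies = []
    for test_fold in sorted(folds):
        # Training set: every recording that is not in the held-out fold.
        train = [rec for fold_id in sorted(folds) if fold_id != test_fold
                 for rec in folds[fold_id]]
        accuracies.append(train_and_score(train, folds[test_fold]))
    return mean(accuracies), accuracies


def majority_baseline(train, test):
    """Placeholder scorer (NOT the SVM): always predict the most common training label."""
    most_common_label = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == most_common_label for _, label in test) / len(test)


# Toy stand-in for the dataset: 5 folds x 50 classes x 8 clips = 2,000 recordings.
# Balanced folds are an assumption made for this demo.
folds = {
    fold_id: [(f"clip_{fold_id}_{cls}_{i}", cls) for cls in range(50) for i in range(8)]
    for fold_id in range(1, 6)
}

mean_accuracy, fold_accuracies = cross_validate(folds, majority_baseline)
print(len(fold_accuracies), mean_accuracy)
```

Swapping `majority_baseline` for a function that fits a classifier on `train` and scores it on `test` gives the reported quantity: one accuracy per held-out fold, averaged over the 5 folds.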

## ESC-50 Results

• A comparison of our proposed method with state-of-the-art methods is shown in the paper
• Our method not only outperforms previous methods by a considerable margin but also outperforms human accuracy on this dataset
• Even with representations obtained directly from $$\mathcal{N}_S$$, that is, without any task-adaptive training, we obtain an average accuracy of 82.8%, compared to 81.3% human accuracy on this dataset.
• The best accuracy of 83.5% is obtained using F1 representations (with $$max()$$ mapping) from $$\mathcal{N}_T^{I}$$ and $$\mathcal{N}_T^{II}$$
• Class-wise results: below we show confusion matrices for two cases. Class-wise confusion matrices for all cases are available here. The file name identifies the representation used, e.g. esc50.NT_III.F1.max.png means the $$\mathcal{N}_T^{III}$$ network, F1 representations, and the $$max()$$ function to map segment-level representations to full-recording-level representations. The figure files here might be visually more pleasing. All numbers have been rounded to 2 decimal places.
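As a rough illustration of the $$max()$$ mapping mentioned above, the sketch below collapses a set of segment-level feature vectors into one recording-level vector by taking the element-wise maximum. The function name `max_over_segments` and the toy values are assumptions made for this example; the real segment representations come from the network and have far more dimensions.

```python
def max_over_segments(segment_features):
    """Map segment-level vectors to one recording-level vector via element-wise max."""
    if not segment_features:
        raise ValueError("need at least one segment")
    # zip(*...) groups the values of each feature dimension across all segments.
    return [max(values) for values in zip(*segment_features)]


# Three hypothetical segments of one recording, each with a 3-dimensional feature.
segments = [
    [0.1, 0.9, 0.3],
    [0.4, 0.2, 0.8],
    [0.0, 0.5, 0.6],
]

recording_feature = max_over_segments(segments)
print(recording_feature)  # [0.4, 0.9, 0.8]
```

The resulting recording-level vector is what a downstream classifier such as a linear SVM would consume, one vector per recording regardless of how many segments it contains.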