KNOWLEDGE TRANSFER FROM WEAKLY LABELED AUDIO USING CONVOLUTIONAL NEURAL NETWORK FOR SOUND EVENTS AND SCENES

Authors: Anurag Kumar, Maksim Khadkevich, Christian Fügen

## DCASE-2016 Dataset

DCASE-2016 [5] is an acoustic scene dataset. It consists of a total of 15 acoustic scenes, namely: Beach, Bus, Cafe/Restaurant, Car, City Center, Forest Path, Grocery Store, Home, Library, Metro Station, Office, Park, Residential Area, Train, Tram.
• The dataset consists of a total of 1,170 recordings, each of 30 seconds duration.
• It comes pre-divided into 4 folds.
• The training set consists of 3 of the 4 folds and the remaining fold is used for testing. This is done all 4 ways and the average accuracy is reported.
• The training set is used for network adaptation as well as for training linear SVMs.
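The 4-fold protocol above can be sketched as follows. This is a minimal illustration, not the paper's code: the linear SVM is stood in for by a simple nearest-centroid classifier so the sketch stays self-contained, and all function names are assumptions.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Stand-in classifier (the paper trains linear SVMs): one centroid per class."""
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(model, X):
    """Assign each row of X to the class with the closest centroid."""
    classes, centroids = model
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

def cross_fold_accuracy(X, y, folds):
    """Train on 3 folds, test on the held-out fold, for each of the 4 folds
    in turn, and return the average accuracy.

    X: (n_recordings, d) recording-level features; y: scene labels;
    folds: pre-defined fold id per recording, as shipped with DCASE-2016.
    """
    accs = []
    for f in np.unique(folds):
        train, test = folds != f, folds == f
        model = nearest_centroid_fit(X[train], y[train])
        accs.append((nearest_centroid_predict(model, X[test]) == y[test]).mean())
    return float(np.mean(accs))
```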

## DCASE-16 Results

• Acoustic Scenes possess complex acoustic characteristics. Often, they are themselves composed of several sound events meshed together in a complex manner.
• A comparison of our proposed method with the baseline method is shown in the paper.
• Our method outperforms the baseline by over 4%.
• The best accuracy of 76.6% is obtained using F1 representations (with $$max()$$ mapping), from $$\mathcal{N}_T^{III}$$.
• Class-wise results: below we show the confusion matrix for the best-performing case. Class-wise confusion matrices for all cases are available here. The file name encodes the representation used, e.g., dcase16.NT_III.F1.max.png denotes the $$\mathcal{N}_T^{III}$$ network, F1 representations, and the $$max()$$ function used to map segment-level representations to the full-recording-level representation. The figure files here might be visually more pleasing. All numbers have been rounded to 2 decimal places.
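As a concrete illustration of the segment-to-recording mapping named above, here is a minimal sketch: each 30-second recording yields a stack of segment-level representation vectors, which are collapsed into a single recording-level vector by an element-wise $$max()$$. The function name and the mean() alternative are assumptions for illustration, not the paper's API.

```python
import numpy as np

def recording_representation(segment_reps, mapping="max"):
    """Collapse segment-level representations into one recording-level vector.

    segment_reps: (n_segments, d) array, one row per segment of the recording.
    mapping: "max" (element-wise maximum over segments, as in the best
    DCASE-16 run above) or "mean" (a common alternative pooling).
    """
    if mapping == "max":
        return segment_reps.max(axis=0)
    if mapping == "mean":
        return segment_reps.mean(axis=0)
    raise ValueError(f"unknown mapping: {mapping}")
```

The recording-level vectors produced this way are what the linear SVMs are trained on.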