KNOWLEDGE TRANSFER FROM WEAKLY LABELED AUDIO USING CONVOLUTIONAL NEURAL NETWORK FOR SOUND EVENTS AND SCENES

Authors: Anurag Kumar, Maksim Khadkevich, Christian Fügen

## Semantic Understanding

### 1. Visualizations

We try to draw some semantic inferences from our proposed methods. We first check whether the learned representations carry semantic meaning. The figures below show the well-known t-SNE visualization of the learned representations for the ESC-50 dataset.

The visualization shows that the representations are able to capture semantic meaning. To understand this further, we look at the t-SNE visualization for the higher semantic categories in the ESC-50 dataset. These higher semantic categories are: Animal; Natural Soundscapes and Water Sounds; Human, non-speech sounds; Interior/domestic sounds; and Exterior/Urban sounds. The Left Panel in Fig. 2 shows the t-SNE visualization grouped according to these 5 categories. One can see that even for these higher semantic categories the representations are nicely grouped together, meaning the representations capture higher-level semantics as well. The Right Panel shows t-SNE visualizations for acoustic scenes.
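The grouping of fine-grained classes into the 5 higher categories can be sketched as follows. This is only an illustrative sketch, not the code used for the figures: it assumes the ESC-50 metadata convention in which the 50 integer class targets are arranged in consecutive blocks of ten per category, and the names `COARSE_CATEGORIES`, `coarse_category`, and `group_by_coarse` are hypothetical helpers introduced here.

```python
from collections import defaultdict

# Category names as listed in the text. ASSUMPTION: ESC-50 class targets
# 0-9, 10-19, ..., 40-49 map to these categories in this order.
COARSE_CATEGORIES = [
    "Animal",
    "Natural Soundscapes and Water Sounds",
    "Human, non-speech sounds",
    "Interior/domestic sounds",
    "Exterior/Urban sounds",
]


def coarse_category(target):
    """Map a fine-grained class target (0..49) to its higher category."""
    if not 0 <= target < 50:
        raise ValueError(f"target must be in [0, 50), got {target}")
    return COARSE_CATEGORIES[target // 10]


def group_by_coarse(targets):
    """Return {category: [indices of examples in that category]}."""
    groups = defaultdict(list)
    for index, target in enumerate(targets):
        groups[coarse_category(target)].append(index)
    return dict(groups)
```

The resulting index groups could then be used to color the points of a 2-D embedding by higher category rather than by fine-grained class.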

### 2. Acoustic Scenes and Sound Events Relations

Our proposed methods for transfer learning can be used for drawing semantic relations as well. In particular, we look into relations between acoustic scenes and sound events. Acoustic scenes are often composed of multiple sound events, and in general there are sound events one expects to find in a given environment or scene. For example, in a Park one expects to find sounds such as Bird Song, Crow, and Wind Noise. These scene-sound event relations can be helpful not only for the acoustic scene classification task but for audio-based artificial intelligence in general. They can be used to obtain higher-level semantic information, which can further help in human-computer interaction and in the development of audio-based context-aware systems.

The source model $$\mathcal{N}_S$$ has been trained for 527 sound events, a fairly large number of sounds. As explained in the paper, the target-specific adaptations of the network ($$\mathcal{N}_T^{II}$$ and $$\mathcal{N}_T^{III}$$) pass through the source label space before going into the target label space. Hence, one can argue that during adaptive training the network learns a mapping from the sound events in F2 to the target label space in the final layer. This can be useful for establishing relations between labels in the target space and the sound events in the layer before it.

We show this characteristic for scene-sound event relations. For each acoustic scene in DCASE-2016, we try to establish which sound events from Audioset are related to it; in other words, we try to establish the sound events that might frequently occur in that acoustic scene or environment.

To establish these relations, we look at the sound events that are maximally activated for inputs of a given scene. The neurons of F2 essentially represent the sound events in Audioset, so their activations can help us understand whether a sound event is expected to occur in a scene or not. For an input of a given scene, we list the events (that is, the outputs of the F2 layer) that were maximally activated, considering the top 5 maximally activated events. Since F2 produces outputs at segment level (~1.5 seconds in our case), these activations are noted at segment level. We do this for all recordings of a given scene and note the frequency with which each event occurs in the top 5 lists. We then take the sound events with the highest frequencies (those among the top 10). These sound events are often maximally activated for the given scene, and hence we expect them to be related to the acoustic scene. In other words, one can expect to find these sound events in the corresponding acoustic scene. The Table below shows examples of these often maximally activated events.
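The counting procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the segment-level F2 activations of each recording are already available as plain lists of numbers, and the function name `frequent_top_events` and its parameters are hypothetical.

```python
from collections import Counter
import heapq


def frequent_top_events(recordings, event_names, k=5, n_report=10):
    """Find events that are often among the k most activated F2 outputs.

    recordings : list of recordings of ONE scene; each recording is a list
                 of segments; each segment is a list of F2 activations
                 (one value per sound event).  ASSUMED input layout.
    event_names: event label for each F2 output index.
    Returns up to n_report (event, count) pairs, most frequent first.
    """
    counts = Counter()
    for segments in recordings:
        for activations in segments:
            # Indices of the k largest activations in this segment.
            top = heapq.nlargest(
                k, range(len(activations)), key=activations.__getitem__
            )
            counts.update(event_names[i] for i in top)
    return counts.most_common(n_report)


# Toy data: 4 hypothetical events, 2 recordings, top-2 instead of top-5.
toy_events = ["Bird", "Crow", "Speech", "Siren"]
toy_recordings = [
    [[0.9, 0.7, 0.1, 0.0], [0.8, 0.2, 0.3, 0.1]],
    [[0.6, 0.5, 0.4, 0.1]],
]
print(frequent_top_events(toy_recordings, toy_events, k=2))
```

In the toy data, Bird appears in the top 2 of all three segments, Crow in two, and Speech in one, so the events are returned in that order. With real activations one would call the function once per scene with `k=5` and keep the 10 most frequent events.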

Note that most of these relations are meaningful, implying that the proposed transfer learning approaches are actually able to relate the sound events to the target labels, which are acoustic scenes in the present case. This gives a general framework for discovering relationships between source and target tasks. One can potentially apply the same methods to other audio tasks, such as emotion recognition in audio, and see if such relationships can be established automatically. This method of automatically learning relationships using transfer learning can further be used for developing sound ontologies as well.

| Acoustic Scenes | Related Sound Events (Frequent Highly Activated Events) |
| --- | --- |
| Cafe | Speech, Chuckle-Chortle, Dishes-pots-and-pans, Snicker, Television, Throat Clearing, Singing |
| City Center | Applause, Siren, Emergency Vehicle, Ambulance |
| Forest Path | Stream, Boat-Water Vehicle, Squish, Frog, Pour, Clatter |
| Grocery Store | Shuffle, Singing, Speech, Music, Siren, Ice-cream truck |
| Home | Speech, Baby Cry, Finger Snapping, Dishes-pots-and-pans, Cutlery, Sink (filling or washing), Throat Clearing, Writing, Squish |
| Beach | Pour, Stream, Applause, Splash-Splatter, Gush, Wind Noise, Squish, Boat-Water Vehicle |
| Library | Finger Snapping, Squish, Fart, Speech, Snort, Singing |
| Metro Station | Speech, Squish, Singing, Siren, Music |
| Office | Finger Snapping, Snort, Cutlery, Speech, Chirp tone, Shuffling cards |
| Residential Area | Applause, Clatter, Bird, Siren, Emergency Vehicle |
| Park | Bird, Crow, Stream, Rustle, Bee wasp, Wind Noise, Frog |
| Bus | Bus, Motor Vehicle, Car, Vibration, Rustle |
| Car | Car, Motor Vehicle, Rumble, Music, Rustle |
| Train | Heavy Engine, Siren, Car, Motor Vehicle, Music, Rustle |
| Tram | Car, Motor Vehicle, Speech, Throbbing, Rumble |