Our goal is to detect and identify sound objects, such as car horns or dog barks, in audio. Our system, called SOLAR
(Sound Object Localization And Retrieval), is, to our knowledge, the first capable of finding a large variety of sounds in audio data from movies and other complex audio environments. Our approach is to perform a windowed scan over the audio data and classify each window using a cascade of boosted decision tree classifiers. See the presentations section for a good overview of our system. This work is performed by Derek Hoiem, Yan Ke, and Rahul Sukthankar and is supported by Intel Research Pittsburgh.
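The windowed scan with a rejection cascade can be sketched as follows. This is an illustrative sketch, not the SOLAR implementation: the window length, hop size, per-stage classifiers, and stage thresholds are hypothetical placeholders standing in for the boosted decision tree stages described above.

```python
def sliding_windows(num_samples, win_len, hop):
    """Yield (start, end) sample indices for each analysis window."""
    start = 0
    while start + win_len <= num_samples:
        yield start, start + win_len
        start += hop

def cascade_score(window, stages):
    """Run one window through the cascade stages; reject early as soon
    as a stage's classifier score falls below that stage's threshold."""
    score = 0.0
    for classify, threshold in stages:
        score = classify(window)
        if score < threshold:
            return None  # rejected: most windows exit cheaply here
    return score  # survived every stage: report as a detection

def scan(audio, win_len, hop, stages):
    """Scan the audio and return (window_start, score) detections."""
    detections = []
    for start, end in sliding_windows(len(audio), win_len, hop):
        score = cascade_score(audio[start:end], stages)
        if score is not None:
            detections.append((start, score))
    return detections
```

The point of the cascade is efficiency: early stages are cheap and discard the overwhelming majority of windows, so the more expensive later stages run on only a small fraction of the audio.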
The detection rate of our system was based on the confidences assigned to audio clips containing the sound objects, and the false positive rate was based on the number of detections in movie audio that did not contain the sound object. We report ROC curve results in terms of the detection rate and the number of false positives per hour. Depending on the length of the sound object (ranging from 0.5 seconds to 1.5 seconds), an hour of audio contains 20,000 to 60,000 possible sound locations; therefore, one false positive per hour is equivalent to a false positive rate of between 1.7x10^-5 and 5.0x10^-5. We used 10-40 original object clips for testing the detection rate, blending and embedding these clips into movie audio data at various locations, and 45 minutes of audio per sound object for determining the false positive rate. "Stage 1" and "Stage 3" in the plots refer to the results after using 1 stage and 3 stages of the classifier cascade. We do not describe the use of the classifier cascade in the paper, since we determined the improvement in results to be too small to justify the additional complication to the algorithm; the paper reports "Stage 1" results.
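The conversion from false positives per hour to a per-window false positive rate follows directly from the window counts quoted above; the helper name below is our own, but the 20,000 and 60,000 figures come from the text.

```python
def fp_rate_per_window(fp_per_hour, windows_per_hour):
    """Convert false positives per hour into a per-window rate,
    given how many candidate sound locations an hour contains."""
    return fp_per_hour / windows_per_hour

# One false positive per hour at the window counts quoted in the text:
rate_long = fp_rate_per_window(1, 20_000)   # longer (1.5 s) objects: 5.0e-05
rate_short = fp_rate_per_window(1, 60_000)  # shorter (0.5 s) objects: ~1.7e-05
```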
Examples of Retrieved Sounds
We used SOLAR to find various types of sounds in audio data from movies. For many of our sounds, we were not able to obtain movie audio in which those sounds occurred, but we searched for them anyway to see whether the top hits were qualitatively similar. In each directory, we list the top twenty-five sounds (ranked by classifier confidence) from roughly 45 minutes to 1 hour of audio per sound object. The audio data is mostly from the movies Lord of the Rings: The Fellowship of the Ring, Austin Powers, and Moulin Rouge. The results have not been censored for content. Explosions, gun shots, male laughs, female laughs, screams, and sword clashes were present in the audio data. Door closings, meows, and telephone rings were not present or were very rare. Car horns, dog barks, door bells, laser guns, and light sabers did not occur in the clips we tested. When the sound does not occur, it is impossible for the system to find it (so all of the top twenty-five hits will be false positives), but it is often interesting to see how qualitatively similar the retrieved sounds are.
We use a diverse feature set capable of discriminating between any of a large variety of sounds and all other sounds. We partially describe the process of computing the feature representation in our paper (submitted to ICASSP 2005) but, due to limited space, could not give a complete description. The directory linked below contains two Matlab source files used to compute the features. The "extract_global_features_window.m" file is headed by documentation that describes the different types of features; the code can be browsed for details on any particular feature type.
D. Hoiem, Y. Ke, and R. Sukthankar, "SOLAR: Sound Object Localization and Retrieval in Complex Audio Environments", ICASSP 2005.