Automatically detecting violent scenes in videos has great potential in
several applications, such as movie selection or recommendation for
children. The annual MediaEval evaluation introduced this task in 2011, to
promote research on this topic.
In MediaEval 2012, we participated in this challenging task and explored several interesting issues with a particular
focus on novel features. Our approach achieved top performance in mAP@20 (mean average precision over top 20
detected shots) and runner-up in mAP@100, among all 35 submissions.
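As a rough illustration of the metric, average precision over the top-k ranked shots can be computed as below. This is a minimal sketch; the function name is ours, and the official MediaEval scorer may normalize differently (e.g., by the total number of relevant shots):

```python
def average_precision_at_k(ranked_labels, k):
    """AP over the top-k ranked shots: the mean of precision@i taken
    at every rank i where a relevant (violent) shot appears.

    ranked_labels: binary relevance labels, sorted by detection score.
    """
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_labels[:k], start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

mAP@20 is then the mean of this value over all test videos.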
Our system framework is shown below, where the circled numbers indicate the submitted result runs. We extract a diverse set of audio-visual features and use Chi-Square Kernel SVM for violent scene detection.
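A minimal sketch of the chi-square kernel underlying our SVMs is given below. The exponential form and the gamma heuristic (mean chi-square distance) are common choices but are assumptions here, not details taken from our system:

```python
import numpy as np

def chi_square_kernel(X, Y):
    """Exponential chi-square kernel for non-negative histogram features
    (e.g., BoW vectors): K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)).

    X: (n, d) array, Y: (m, d) array; returns an (n, m) kernel matrix.
    """
    D = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        num = (x - Y) ** 2
        den = x + Y + 1e-10          # guard against empty histogram bins
        D[i] = np.sum(num / den, axis=1)
    # heuristic: set gamma from the mean non-zero chi-square distance
    gamma = 1.0 / np.mean(D[D > 0]) if np.any(D > 0) else 1.0
    return np.exp(-gamma * D)
```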
Trajectory-based Features: Dense local patch trajectories are first extracted, from which we generate motion representations called TrajMF by modeling the relative locations and motions between trajectory pairs. These features serve as a strong baseline in our system. Please refer to our ECCV 2012 paper for more details.
SIFT: This is based on standard SIFT local features and the well-known bag of visual words (BoW) method.
Spatial-Temporal Interest Points (STIP): STIP captures a space-time volume in which video pixel values have large variations in both space and time. This feature is also converted to BoW histograms, using a vocabulary of 4000 codewords.
MFCC: The MFCCs are the only audio feature in our framework, which are computed for every 32ms time-window with 50% overlap. Again, the BoW is used to convert a set of MFCCs from each video shot into fixed dimensional vectors, using a vocabulary of 4000 audio codewords.
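The BoW conversion shared by the SIFT, STIP, and MFCC features can be sketched as follows. This is a toy hard-assignment quantizer; the function name and the L1 normalization are illustrative choices, not the exact pipeline:

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors (n, d) against a codebook (k, d) of
    codewords and return an L1-normalized k-bin histogram."""
    # squared Euclidean distance from each descriptor to each codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                       # hard assignment
    hist = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)
```

For SIFT/STIP/MFCC the codebook would hold 4000 codewords (learned, e.g., by k-means on training descriptors), yielding one 4000-dimensional vector per shot.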
Concept-based Feature: Different from the low-level features described above, the concept-based feature consists of mid-level indicators, where each dimension is the prediction output of a semantic concept detector. Ten concepts are provided in MediaEval 2012, covering violence-related topics such as "presence of blood", "fights", "presence of fire", "gunshots", etc. We use the above low-level features to train SVM detectors for each of the concepts, and generate a concept-based representation of 10 dimensions.
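The construction of the concept-based representation can be sketched as follows. The helper name is hypothetical, an RBF kernel stands in for the chi-square kernel for brevity, and only four of the ten concept names are listed:

```python
import numpy as np
from sklearn.svm import SVC

# subset of the 10 MediaEval concepts, for illustration only
CONCEPTS = ["blood", "fights", "fire", "gunshots"]

def concept_representation(X_train, concept_labels, X_test):
    """Train one binary SVM per concept on low-level features, then use
    the per-concept prediction scores as a mid-level descriptor
    (one dimension per concept)."""
    scores = []
    for name in CONCEPTS:
        clf = SVC(kernel="rbf")   # the system uses chi-square kernels
        clf.fit(X_train, concept_labels[name])
        scores.append(clf.decision_function(X_test))
    return np.stack(scores, axis=1)   # shape: (n_test, n_concepts)
```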
It is well-known that temporal structure is useful for video content
analysis. There exist complex methods in the modeling of temporal
information, such as the use of graphical models. In this task, we opt for a
simple but very efficient temporal smoothing method, which takes the shots
immediately before and after a target shot into account. Two smoothing
choices are explored:
Feature Smoothing: This uses the averaged feature over a three-shot window to represent the shot in the middle of the window. Classification/Detection is performed on the smoothed features.
Score Smoothing: Different from feature smoothing, score smoothing uses features from each single shot for classification, and then smooths (averages) the prediction scores over three-shot windows.
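The two smoothing choices can be sketched as below. The edge-padding at the boundaries of a shot sequence is an assumption for illustration:

```python
import numpy as np

def feature_smoothing(features):
    """Replace each shot's feature vector by the average over a
    three-shot window centered on it; classification then runs on
    the smoothed features. features: (n_shots, dim)."""
    f = np.asarray(features, dtype=float)
    padded = np.pad(f, ((1, 1), (0, 0)), mode="edge")  # repeat boundary shots
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def score_smoothing(scores):
    """Classify each shot on its own features, then average the
    per-shot prediction scores over three-shot windows."""
    s = np.asarray(scores, dtype=float)
    padded = np.pad(s, 1, mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
```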
As indicated in the above framework, we submitted five runs based on
different feature combinations and/or smoothing choices. Our baseline run 5
uses the seven trajectory-based features, and run 4 includes three
additional features (SIFT, STIP and MFCC). Run 3 further combines the
concept-based feature. Run 2 and run 1 are generated by the two temporal
smoothing methods respectively, using the same feature set as run 4. In all
the submitted runs, kernel-level early fusion (mean of the
individual-feature kernels) is used to combine multiple features.
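Kernel-level early fusion can be sketched with scikit-learn's precomputed-kernel interface. The linear kernels on random data below stand in for the per-feature chi-square kernels, and the data and labels are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

def fuse_kernels(kernels):
    """Kernel-level early fusion: average the per-feature kernel matrices."""
    return np.mean(np.stack(kernels), axis=0)

# toy usage: two feature types over the same 12 shots (synthetic data)
rng = np.random.RandomState(0)
X1, X2 = rng.rand(12, 6), rng.rand(12, 4)
K1 = X1 @ X1.T        # stand-ins for the per-feature chi-square kernels
K2 = X2 @ X2.T
K = fuse_kernels([K1, K2])

y = np.array([0, 1] * 6)                 # synthetic violent/non-violent labels
clf = SVC(kernel="precomputed").fit(K, y)
pred = clf.predict(K)   # training-set predictions, just to show the API
```

With a precomputed kernel, adding a new feature type only requires averaging in one more kernel matrix; the SVM itself is unchanged.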
The figure on the right shows the performance of all 35 official submissions, where our run 1 produces the highest mAP@20 (0.736). Our run 5, which uses only the seven trajectory-based features, already shows very competitive results (mAP@20=0.656, mAP@100=0.539). The three additional features used in run 4 are not very helpful, and the concept-based scores (run 3) do not further improve the results. Comparing the two temporal smoothing choices, score smoothing (run 1) is significantly better than feature smoothing (run 2).
[See more details in our notebook paper]