Conference papers

Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection

Abstract : In video understanding, most cross-modal knowledge distillation (KD) methods are tailored for classification tasks and focus on discriminative representations of trimmed videos. However, action detection requires not only categorizing actions but also localizing them in untrimmed videos. Transferring knowledge about temporal relations is therefore critical for this task, yet it is missing from previous cross-modal KD frameworks. To this end, we aim to learn an augmented RGB representation for action detection, taking advantage of additional modalities at training time through KD. We propose a KD framework consisting of two levels of distillation. On the one hand, atomic-level distillation encourages the RGB student to learn the sub-representation of the actions from the teacher in a contrastive manner. On the other hand, sequence-level distillation encourages the student to learn temporal knowledge from the teacher, which consists of transferring the Global Contextual Relations and the Action Boundary Saliency. The result is an Augmented-RGB stream that achieves performance competitive with the two-stream network while using only RGB at inference time. Extensive experimental analysis shows that our proposed distillation framework is generic and outperforms other popular cross-modal distillation methods on the action detection task.
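
To make the idea of learning from the teacher "in a contrastive manner" concrete, the following is a minimal sketch of a generic InfoNCE-style distillation loss between per-segment student and teacher features. It is not the authors' implementation: the function names, the feature shapes, the temperature value, and the choice of InfoNCE itself are assumptions made for illustration, and the sequence-level terms (Global Contextual Relations, Action Boundary Saliency) are not covered.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def contrastive_distillation_loss(student, teacher, temperature=0.1):
    """Hypothetical InfoNCE-style loss over T temporal segments.

    student, teacher: arrays of shape (T, D). Segment i of the student is
    pulled toward segment i of the teacher (positive pair) and pushed away
    from the teacher's other segments (negatives).
    """
    s = l2_normalize(student)
    t = l2_normalize(teacher)
    logits = s @ t.T / temperature                      # (T, T) similarities
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))           # positives on the diagonal

# Toy features standing in for RGB-student and extra-modality-teacher outputs.
rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(8, 16))
aligned_student = teacher_feats + 0.01 * rng.normal(size=(8, 16))
misaligned_student = teacher_feats[::-1]                # segments in wrong order

loss_aligned = contrastive_distillation_loss(aligned_student, teacher_feats)
loss_misaligned = contrastive_distillation_loss(misaligned_student, teacher_feats)
```

In this toy setup the loss is lower when each student segment matches its own teacher segment than when the segments are mismatched, which is the behaviour a contrastive distillation objective is meant to reward.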

https://hal.archives-ouvertes.fr/hal-03314575
Contributor : Rui Dai
Submitted on : Thursday, August 5, 2021 - 10:22:35 AM
Last modification on : Tuesday, October 19, 2021 - 12:58:56 PM
Long-term archiving on : Saturday, November 6, 2021 - 6:13:30 PM

File

Dai_ICCV21.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-03314575, version 1

Citation

Rui Dai, Srijan Das, François Bremond. Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection. ICCV 2021 - IEEE/CVF International Conference on Computer Vision, Oct 2021, Montreal, Canada. ⟨hal-03314575⟩
