Conference papers

MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection

Abstract : Action detection is a significant and challenging task, especially in densely-labelled datasets of untrimmed videos. Such data consist of complex temporal relations including composite or co-occurring actions. To detect actions in these complex settings, it is critical to capture both short-term and long-term temporal information efficiently. To this end, we propose a novel 'ConvTransformer' network for action detection: MS-TCT. This network comprises three main components: (1) a Temporal Encoder module which explores global and local temporal relations at multiple temporal resolutions, (2) a Temporal Scale Mixer module which effectively fuses multi-scale features, creating a unified feature representation, and (3) a Classification module which learns a center-relative position of each action instance in time, and predicts frame-level classification scores. Our experimental results on multiple challenging datasets, namely Charades, TSU and MultiTHUMOS, validate the effectiveness of the proposed method, which outperforms the state-of-the-art methods on all three datasets.
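
To illustrate the center-relative position idea mentioned in the abstract, the following minimal Python sketch assigns each frame a score that peaks at the temporal center of an action instance and decays toward its boundaries. The Gaussian shape, the width rule, and the function name `center_relative_targets` are assumptions made for illustration only; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def center_relative_targets(num_frames, instances):
    """Return one score per frame for a single action class.

    instances: list of (start, end) frame indices, inclusive.
    Hypothetical helper: the Gaussian form and the width rule below
    are assumptions, not details taken from the paper's abstract.
    """
    targets = [0.0] * num_frames
    for start, end in instances:
        center = (start + end) / 2.0
        # Assumed width: half the instance duration, at least one frame.
        sigma = max((end - start) / 2.0, 1.0)
        for t in range(num_frames):
            response = math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))
            # Overlapping instances keep the strongest response per frame.
            targets[t] = max(targets[t], response)
    return targets


# One instance spanning frames 2..6 in a 10-frame clip.
scores = center_relative_targets(10, [(2, 6)])
```

Under these assumptions, `scores` holds one value per frame, with the maximum at frame 4 (the instance center) and smaller values for frames farther from it.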

https://hal.inria.fr/hal-03682969
Contributor : Rui DAI
Submitted on : Tuesday, May 31, 2022 - 2:14:29 PM
Last modification on : Wednesday, June 1, 2022 - 3:33:42 AM

File

CVPR2022_MSTCT.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : hal-03682969, version 1

Citation

Rui Dai, Srijan Das, Kumara Kahatapitiya, Michael Ryoo, Francois F Bremond. MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection. CVPR - Conference on Computer Vision and Pattern Recognition, Jun 2022, New Orleans, United States. ⟨hal-03682969⟩
