# Spatio-temporal action detection with MMAction2
Welcome to MMAction2! This tutorial shows how to use MMAction2 for spatio-temporal action detection. Using the MultiSports dataset as an example, it provides a complete step-by-step guide that covers:
- Prepare spatio-temporal action detection dataset
- Train detection model
- Prepare AVA format dataset
- Train spatio-temporal action detection model


## 0. Install MMAction2 and MMDetection

In [6]:
%pip install -U openmim
!mim install mmengine
!mim install mmcv
!mim install mmdet

!git clone https://github.com/open-mmlab/mmaction2.git

%cd mmaction2
%pip install -v -e .
%cd projects/stad_tutorial

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openmim
 Downloading openmim-0.3.7-py2.py3-none-any.whl (51 kB)
[2K [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.3/51.3 kB[0m [31m5.0 MB/s[0m eta [36m0:00:00[0m
Collecting colorama (from openmim)
 Downloading colorama-0.4.6-py2.py3-none-any.whl (25 kB)
Collecting model-index (from openmim)
 Downloading model_index-0.1.11-py3-none-any.whl (34 kB)
Collecting ordered-set (from model-index->openmim)
 Downloading ordered_set-4.1.0-py3-none-any.whl (7.6 kB)
Installing collected packages: ordered-set, colorama, model-index, openmim
Successfully installed colorama-0.4.6 model-index-0.1.11 openmim-0.3.7 ordered-set-4.1.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.openmmlab.com/mmcv/dist/cu118/torch2.0.0/index.html
Collecting mmengine
 Downloading mmengine-0.7.4-py3-none-any.whl 

## 1. Prepare spatio-temporal action detection dataset

Object detection only needs bounding box annotations, whereas spatio-temporal action detection must localize actions in both space and time, so it requires more complex tube annotations. Taking the MultiSports dataset as an example, the `gttubes` field provides all the target action annotations in a video. The following is an annotation fragment:

```
'gttubes': {
    'aerobic_gymnastics/v_aqMgwPExjD0_c001':  # video_key
    {
        10:  # label index
        [
            array([[ 377., 904., 316., 1016., 584.],  # 1st tube of class 10
                   [ 378., 882., 315., 1016., 579.],  # shape (n, 5): n frames, each row is (frame idx, x1, y1, x2, y2)
                   ...
                   [ 398., 861., 304., 954., 549.]], dtype=float32),

            array([[ 399., 881., 308., 955., 542.],  # 2nd tube of class 10
                   [ 400., 862., 303., 988., 539.],
                   [ 401., 853., 292., 1000., 535.],
                   ...]),
            ...

        ],
        9:  # label index
        [
            array(...),  # 1st tube of class 9
            array(...),  # 2nd tube of class 9
            ...
        ],
        ...
    }
}
```
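To make the nesting concrete, here is a minimal sketch that walks a `gttubes`-style dictionary and summarizes each tube. The data is a small synthetic stand-in, not the real annotation file: the video key and the third box row are made up for illustration, and `summarize_tubes` is a hypothetical helper name.

```python
import numpy as np

# Synthetic stand-in for the 'gttubes' field (video key and some rows are illustrative).
gttubes = {
    'aerobic_gymnastics/v_example_c001': {
        10: [
            np.array([[377., 904., 316., 1016., 584.],
                      [378., 882., 315., 1016., 579.],
                      [379., 870., 310., 1010., 570.]], dtype=np.float32),
            np.array([[399., 881., 308., 955., 542.],
                      [400., 862., 303., 988., 539.]], dtype=np.float32),
        ],
        9: [
            np.array([[120., 100., 50., 300., 400.]], dtype=np.float32),
        ],
    }
}


def summarize_tubes(gttubes):
    """Return one (video, label, first_frame, last_frame, n_boxes) row per tube."""
    rows = []
    for video_key, tubes_by_label in gttubes.items():
        for label_idx, tubes in sorted(tubes_by_label.items()):
            for tube in tubes:
                # Column 0 holds the frame index; columns 1-4 hold x1, y1, x2, y2.
                frames = tube[:, 0].astype(int)
                rows.append((video_key, label_idx, int(frames.min()),
                             int(frames.max()), len(tube)))
    return rows


tube_rows = summarize_tubes(gttubes)
for row in tube_rows:
    print(row)
```

Each tube is therefore one action instance: a label, a contiguous run of frames, and one box per frame.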

The annotation file must also provide several other fields. The complete ground truth file contains the following information:

```
{
    'labels':  # label list
        ['aerobic push up', 'aerobic explosive push up', ...],
    'train_videos':  # training video list
        [
            [
                'aerobic_gymnastics/v_aqMgwPExjD0_c001',
                'aerobic_gymnastics/v_yaKOumdXwbU_c019',
                ...
            ]
        ],
    'test_videos':  # test video list
        [
            [
                'aerobic_gymnastics/v_crsi07chcV8_c004',
                'aerobic_gymnastics/v_dFYr67eNMwA_c005',
                ...
            ]
        ],
    'n_frames':  # dict providing the number of frames of each video
        {
            'aerobic_gymnastics/v_crsi07chcV8_c004': 725,
            'aerobic_gymnastics/v_dFYr67eNMwA_c005': 750,
            ...
        },
    'resolution':  # dict providing the resolution of each video
        {
            'aerobic_gymnastics/v_crsi07chcV8_c004': (720, 1280),
            'aerobic_gymnastics/v_dFYr67eNMwA_c005': (720, 1280),
            ...
        },
    'gttubes':  # dict providing the bounding boxes of each tube
        {
            ...  # refer to the description above
        }
}
```
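Before converting a custom dataset, it can help to sanity-check the ground truth dictionary. The sketch below is one possible check, not part of MMAction2: `check_ground_truth` is a hypothetical helper, the key names are passed as parameters because they should be verified against your own file, and the resolution tuple is assumed to be `(height, width)`.

```python
import numpy as np


def check_ground_truth(gt, tubes_key='gttubes', frames_key='n_frames'):
    """Collect human-readable problems found in a MultiSports-style dict."""
    problems = []
    # train_videos / test_videos are lists of splits, each a list of video keys.
    splits = gt['train_videos'] + gt['test_videos']
    videos = [video for split in splits for video in split]
    for video in videos:
        n_frames = gt[frames_key].get(video)
        resolution = gt['resolution'].get(video)
        if n_frames is None or resolution is None:
            problems.append(f'{video}: missing frame count or resolution')
            continue
        height, width = resolution  # assumption: stored as (height, width)
        for label_idx, tubes in gt[tubes_key].get(video, {}).items():
            if not 0 <= label_idx < len(gt['labels']):
                problems.append(f'{video}: unknown label index {label_idx}')
            for tube in tubes:
                if tube[:, 0].max() > n_frames:
                    problems.append(
                        f'{video}: label {label_idx} tube exceeds {n_frames} frames')
                if tube[:, [1, 3]].max() > width or tube[:, [2, 4]].max() > height:
                    problems.append(
                        f'{video}: label {label_idx} box outside {width}x{height}')
    return problems


# Synthetic example; the second video is deliberately broken (frame 60 > 50 frames).
gt = {
    'labels': ['aerobic push up', 'aerobic explosive push up'],
    'train_videos': [['demo/v_train_c001']],
    'test_videos': [['demo/v_test_c001']],
    'n_frames': {'demo/v_train_c001': 100, 'demo/v_test_c001': 50},
    'resolution': {'demo/v_train_c001': (720, 1280),
                   'demo/v_test_c001': (720, 1280)},
    'gttubes': {
        'demo/v_train_c001': {
            0: [np.array([[1., 10., 20., 200., 300.]], dtype=np.float32)]},
        'demo/v_test_c001': {
            1: [np.array([[60., 10., 20., 200., 300.]], dtype=np.float32)]},
    },
}
problems = check_ground_truth(gt)
print(problems)
```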

The subsequent experiments are based on MultiSports-tiny, a small subset of videos extracted from MultiSports for demonstration purposes.

In [7]:
# Download dataset
!wget -P data -c https://download.openmmlab.com/mmaction/v1.0/projects/stad_tutorial/multisports-tiny.tar
!tar -xvf data/multisports-tiny.tar --strip 1 -C data
!apt-get -q install tree
!tree data

--2023-06-15 06:00:15-- https://download.openmmlab.com/mmaction/v1.0/projects/stad_tutorial/multisports-tiny.tar
Resolving download.openmmlab.com (download.openmmlab.com)... 163.181.82.215, 163.181.82.216, 163.181.82.218, ...
Connecting to download.openmmlab.com (download.openmmlab.com)|163.181.82.215|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 82780160 (79M) [application/x-tar]
Saving to: ‘data/multisports-tiny.tar’


2023-06-15 06:01:00 (1.78 MB/s) - ‘data/multisports-tiny.tar’ saved [82780160/82780160]

multisports-tiny/multisports/
multisports-tiny/multisports/test/
multisports-tiny/multisports/test/aerobic_gymnastics/
multisports-tiny/multisports/test/aerobic_gymnastics/v_7G_IpU0FxLU_c001.mp4
multisports-tiny/multisports/annotations/
multisports-tiny/multisports/annotations/multisports_GT.pkl
multisports-tiny/multisports/trainval/
multisports-tiny/multisports/trainval/aerobic_gymnastics/
multisports-tiny/multisports/trainval/aerobic_gymnastics/v__wAgw

## 2. Train detection model

In the SlowOnly + Det paradigm, we first train a person detector and then predict actions based on its detection results. In this section, we train a detection model with the MMDetection library, using the annotation format described in the previous section.

### 2.1 Build detection dataset annotation (COCO format)

Based on the annotation information of the spatio-temporal action detection dataset, we can build a COCO format detection dataset for training the detection model. We provide a script to convert MultiSports format annotations. If you need to convert from other formats, refer to the [custom dataset](https://mmdetection.readthedocs.io/zh_CN/latest/advanced_guides/customize_dataset.html) document provided by MMDetection.
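The core of such a conversion can be sketched as follows. This is an illustrative approximation and not the code of the provided script: `tubes_to_coco` is a hypothetical helper, the image file naming scheme is invented, every box is mapped to a single `person` category with id 1, and the resolution is assumed to be `(height, width)`. The one COCO-specific detail worth noting is that boxes are stored as `[x, y, width, height]` rather than `(x1, y1, x2, y2)`.

```python
import numpy as np


def tubes_to_coco(gttubes, resolution, category_name='person'):
    """Flatten action tubes into a COCO-style detection dict (one class)."""
    images, annotations = [], []
    image_ids = {}  # (video_key, frame_idx) -> image id
    for video_key, tubes_by_label in gttubes.items():
        height, width = resolution[video_key]  # assumption: (height, width)
        for tubes in tubes_by_label.values():
            for tube in tubes:
                for frame_idx, x1, y1, x2, y2 in tube.tolist():
                    key = (video_key, int(frame_idx))
                    if key not in image_ids:
                        image_ids[key] = len(image_ids) + 1
                        images.append({
                            'id': image_ids[key],
                            # Hypothetical file naming scheme.
                            'file_name': f'{video_key}/{int(frame_idx):05d}.jpg',
                            'height': height,
                            'width': width,
                        })
                    box_w, box_h = x2 - x1, y2 - y1
                    annotations.append({
                        'id': len(annotations) + 1,
                        'image_id': image_ids[key],
                        'category_id': 1,
                        'bbox': [x1, y1, box_w, box_h],  # COCO uses xywh
                        'area': box_w * box_h,
                        'iscrowd': 0,
                    })
    categories = [{'id': 1, 'name': category_name}]
    return {'images': images, 'annotations': annotations,
            'categories': categories}


# Two tubes with different action labels that share frame 5.
demo_tubes = {
    'demo/v_c001': {
        0: [np.array([[5., 100., 50., 200., 250.],
                      [6., 110., 50., 210., 250.]], dtype=np.float32)],
        3: [np.array([[5., 400., 60., 500., 300.]], dtype=np.float32)],
    }
}
coco = tubes_to_coco(demo_tubes, {'demo/v_c001': (720, 1280)})
print(len(coco['images']), len(coco['annotations']))
```

Action labels are discarded on purpose here, since the detector only needs to find people; the action classes are handled later by the spatio-temporal model.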

In [8]:
!python tools/generate_mmdet_anno.py data/multisports/annotations/multisports_GT.pkl data/multisports/annotations/multisports_det_anno.json
!tree data/multisports/annotations

data/multisports/annotations
├── multisports_det_anno_train.json
├── multisports_det_anno_val.json
└── multisports_GT.pkl

0 directories, 3 files


In [9]:
!python tools/generate_rgb.py

Will generate 3 rgb dir for aerobic_gymnastics.
Generate v__wAgwttPYaQ_c003 rgb dir successfully.
Generate v__wAgwttPYaQ_c002 rgb dir successfully.
Generate v__wAgwttPYaQ_c001 rgb dir successfully.


### 2.2 Modify config file

We use faster-rcnn_r50-caffe_fpn_ms-1x_coco as the base configuration and modify it to train on the MultiSports dataset. The following parts need to be modified:
- Number of model categories
- Learning rate adjustment strategy
- Optimizer configuration
- Dataset/annotation file path
- Evaluator configuration
- Pre-trained model

For more detailed tutorials, please refer to the [prepare configuration file](https://mmdetection.readthedocs.io/zh_CN/latest/user_guides/train.html#id9) document provided by MMDetection.
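A child config only needs to list the keys it changes; everything else is inherited from the file named in `_base_`. The sketch below models that behavior with a plain recursive dictionary merge. It is a simplified illustration and not MMEngine's actual implementation; `merge_config` is a hypothetical helper and the `base` dictionary is a small stand-in for the real base config.

```python
def merge_config(base, override):
    """Recursively merge `override` into a copy of `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested dicts are merged key by key instead of being replaced.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Stand-in for the base config (only a few illustrative fields).
base = {
    'model': {'roi_head': {'bbox_head': {'type': 'Shared2FCBBoxHead',
                                         'num_classes': 80}}},
    'train_cfg': {'type': 'EpochBasedTrainLoop', 'max_epochs': 12,
                  'val_interval': 1},
}
# The child config only overrides what differs for the person detector.
override = {
    'model': {'roi_head': {'bbox_head': {'num_classes': 1}}},
    'train_cfg': {'max_epochs': 2},
}

merged = merge_config(base, override)
print(merged['model']['roi_head']['bbox_head'])
print(merged['train_cfg'])
```

This is why a one-line `model = dict(roi_head=dict(bbox_head=dict(num_classes=1)))` is enough to change the number of classes without repeating the rest of the model definition.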

In [10]:
!cat configs/faster-rcnn_r50-caffe_fpn_ms-1x_coco_ms_person.py

# Copyright (c) OpenMMLab. All rights reserved.
_base_ = './faster-rcnn_r50-caffe_fpn_ms-1x_coco.py'
model = dict(roi_head=dict(bbox_head=dict(num_classes=1)))

# take 2 epochs as an example
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=2, val_interval=1)

# learning rate
param_scheduler = [
    dict(type='ConstantLR', factor=1.0, by_epoch=False, begin=0, end=500)
]

# optimizer
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.0050, momentum=0.9, weight_decay=0.0001))

dataset_type = 'CocoDataset'
# modify metainfo
metainfo = {
    'classes': ('person', ),
    'palette': [
        (220, 20, 60),
    ]
}

# specify metainfo, dataset path
data_root = 'data/multisports/'

train_dataloader = dict(
    dataset=dict(
        data_root=data_root,
        ann_file='annotations/multisports_det_anno_train.json',
        data_prefix=dict(img='rawframes/'),
        metainfo=metainfo))

val_dataloader = dict(
    dataset=dict(
        data_root=data_root,
        ann_file='annotations/multisports_det_anno_val.json',
        data_pref

### 2.3 Train detection model

By using MIM, you can directly train MMDetection models in the current directory. Here is the simplest example of training on a single GPU. For more training commands, please refer to the MIM [tutorial](https://github.com/open-mmlab/mim#command).

In [11]:
!mim train mmdet configs/faster-rcnn_r50-caffe_fpn_ms-1x_coco_ms_person.py \
 --work-dir work_dirs/det_model

Training command is /usr/bin/python3 /usr/local/lib/python3.10/dist-packages/mmdet/.mim/tools/train.py configs/faster-rcnn_r50-caffe_fpn_ms-1x_coco_ms_person.py --launcher none --work-dir work_dirs/det_model. 
06/15 06:02:09 - mmengine - INFO - 
------------------------------------------------------------
System environment:
 sys.platform: linux
 Python: 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0]
 CUDA available: True
 numpy_random_seed: 503128501
 GPU 0: Tesla T4
 CUDA_HOME: /usr/local/cuda
 NVCC: Cuda compilation tools, release 11.8, V11.8.89
 GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
 PyTorch: 2.0.1+cu118
 PyTorch compiling details: PyTorch built with:
 - GCC 9.3
 - C++ Version: 201703
 - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
 - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
 - OpenMP 201511 (a.k.a. OpenMP 4.5)
 - LAPACK is enabled (usua

### 2.4 Generating Proposal BBoxes

When training the spatio-temporal action detection model, we rely on proposals generated by the detection model rather than on annotated detection boxes. We therefore use the trained detection model to run inference on the entire dataset and convert the resulting proposals into the format required for subsequent training.

#### 2.4.1 Converting the Dataset to COCO Format

We provide a script to convert the MultiSports dataset into an annotation format without ground truth, which is used for inference.

In [12]:
!echo 'person' > data/multisports/annotations/label_map.txt
!python tools/images2coco.py \
 data/multisports/rawframes \
 data/multisports/annotations/label_map.txt \
 ms_infer_anno.json

[>>] 2350/2350, 2053.0 task/s, elapsed: 1s, ETA: 0s
save json file: data/multisports/rawframes/../annotations/ms_infer_anno.json


#### 2.4.2 Inference for Generating Proposal Files


Inference with MMDetection models is also based on MIM. For more testing commands, please refer to the MIM [tutorial](https://github.com/open-mmlab/mim#command).

After inference is completed, the results are saved in `data/multisports/annotations/ms_det_proposals.pkl`.

In [13]:
!mim test mmdet configs/faster-rcnn_r50-caffe_fpn_ms-1x_coco_ms_person.py \
 --checkpoint work_dirs/det_model/epoch_2.pth \
 --out data/multisports/annotations/ms_det_proposals.pkl

Testing command is /usr/bin/python3 /usr/local/lib/python3.10/dist-packages/mmdet/.mim/tools/test.py configs/faster-rcnn_r50-caffe_fpn_ms-1x_coco_ms_person.py work_dirs/det_model/epoch_2.pth --launcher none --out data/multisports/annotations/ms_det_proposals.pkl. 
06/15 06:05:16 - mmengine - INFO - 
------------------------------------------------------------
System environment:
 sys.platform: linux
 Python: 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0]
 CUDA available: True
 numpy_random_seed: 1289054678
 GPU 0: Tesla T4
 CUDA_HOME: /usr/local/cuda
 NVCC: Cuda compilation tools, release 11.8, V11.8.89
 GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
 PyTorch: 2.0.1+cu118
 PyTorch compiling details: PyTorch built with:
 - GCC 9.3
 - C++ Version: 201703
 - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
 - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
 - Open

## 3. Training the Spatio-temporal Action Detection Model

### 3.1 Converting Annotation and Proposal Files

The provided annotation files and the proposal files generated by MMDetection need to be converted to the AVA format required for training the spatio-temporal action detection model. We provide scripts that generate files in this format.
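As a rough picture of what a proposal conversion involves, the sketch below turns per-frame detections into a dictionary of score-filtered, normalized boxes. It rests on assumptions that are not stated in this tutorial: that the target format keys each entry by a `video,frame` string and stores boxes as `[x1, y1, x2, y2, score]` with coordinates scaled to `[0, 1]`. The key layout, the `to_proposals` helper, and the `score_thr` default are all hypothetical; the provided conversion script is the authority on the real format.

```python
import numpy as np


def to_proposals(detections, resolution, score_thr=0.5):
    """Convert pixel-space detections into normalized per-frame proposals."""
    proposals = {}
    for (video_key, frame_idx), dets in detections.items():
        height, width = resolution[video_key]  # assumption: (height, width)
        dets = np.asarray(dets, dtype=np.float32).reshape(-1, 5)
        # Drop low-confidence detections.
        dets = dets[dets[:, 4] >= score_thr]
        # Scale x by width and y by height; leave the score column untouched.
        scale = np.array([width, height, width, height, 1.0], dtype=np.float32)
        # Hypothetical key layout: "<video>,<zero-padded frame index>".
        proposals[f'{video_key},{frame_idx:04d}'] = dets / scale
    return proposals


# Synthetic detections for one frame: (x1, y1, x2, y2, score) in pixels.
detections = {
    ('demo/v_c001', 5): [[320., 180., 640., 360., 0.9],
                         [0., 0., 50., 50., 0.2]],
}
proposals = to_proposals(detections, {'demo/v_c001': (720, 1280)})
for key, boxes in proposals.items():
    print(key, boxes)
```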

In [14]:
# Convert annotation files
!python ../../tools/data/multisports/parse_anno.py

# Convert proposal files
!python tools/convert_proposals.py

!tree data/multisports/annotations

loading test result...
[>>] 2350/2350, 3799.7 task/s, elapsed: 1s, ETA: 0s
data/multisports/annotations
├── label_map.txt
├── ms_det_proposals.pkl
├── ms_infer_anno.json
├── multisports_det_anno_train.json
├── multisports_det_anno_val.json
├── multisports_GT.pkl
├── multisports_proposals_train.pkl
├── multisports_proposals_val.pkl
├── multisports_train.csv
└── multisports_val.csv

0 directories, 10 files


### 3.2 Training the Spatio-temporal Action Detection Model

MMAction2 already supports training on the MultiSports dataset. You just need to modify the path to the proposal file. For detailed configurations, please refer to the [config](configs/slowonly_k400_multisports.py) file. Since the training data is limited, the configuration uses a pre-trained model trained on the complete MultiSports dataset. When training with a custom dataset, you don't need to specify the `load_from` configuration.

In [15]:
# Train the model using MIM
!mim train mmaction2 configs/slowonly_k400_multisports.py \
 --work-dir work_dirs/stad_model/

Training command is /usr/bin/python3 /content/mmaction2/mmaction/.mim/tools/train.py configs/slowonly_k400_multisports.py --launcher none --work-dir work_dirs/stad_model/. 
06/15 06:10:18 - mmengine - INFO - 
------------------------------------------------------------
System environment:
 sys.platform: linux
 Python: 3.10.12 (main, Jun 7 2023, 12:45:35) [GCC 9.4.0]
 CUDA available: True
 numpy_random_seed: 1735696538
 GPU 0: Tesla T4
 CUDA_HOME: /usr/local/cuda
 NVCC: Cuda compilation tools, release 11.8, V11.8.89
 GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
 PyTorch: 2.0.1+cu118
 PyTorch compiling details: PyTorch built with:
 - GCC 9.3
 - C++ Version: 201703
 - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
 - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
 - OpenMP 201511 (a.k.a. OpenMP 4.5)
 - LAPACK is enabled (usually provided by MKL)
 - NNPACK is en

## 4. Inference with the Spatio-temporal Action Detection Model

After training the detection model and the spatio-temporal action detection model, we can run the spatio-temporal action detection demo for inference and visualize the model's predictions.

Since the tutorial uses a limited training dataset, the model's performance is not optimal, so a pre-trained model is used for visualization.
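The demo command that follows passes two score thresholds, `--det-score-thr` and `--action-score-thr`. The sketch below shows one plausible reading of how such thresholds interact in a two-stage pipeline: person boxes below the detection threshold are dropped first, and for each remaining person only action labels above the action threshold are kept. This is an illustration under that assumption, not the demo script's code; `filter_predictions` is a hypothetical helper and the scores are made up.

```python
def filter_predictions(person_boxes, action_scores, labels,
                       det_score_thr=0.85, action_score_thr=0.8):
    """Keep confident persons, then keep confident actions per person."""
    results = []
    for box, scores in zip(person_boxes, action_scores):
        # Stage 1: discard person detections with a low detector score.
        if box[4] < det_score_thr:
            continue
        # Stage 2: keep only action classes with a high enough score.
        actions = [(labels[i], score) for i, score in enumerate(scores)
                   if score >= action_score_thr]
        results.append((box[:4], actions))
    return results


labels = ['aerobic push up', 'aerobic explosive push up']
# Each person box is (x1, y1, x2, y2, detection score).
person_boxes = [[10., 20., 110., 220., 0.95],
                [300., 40., 380., 200., 0.50]]
# One row of per-class action scores for each person box.
action_scores = [[0.10, 0.92],
                 [0.99, 0.30]]

kept = filter_predictions(person_boxes, action_scores, labels)
print(kept)
```

Under this reading, the second person is removed at the first stage even though one of its action scores is high, which is why a weak detector limits what the action model can show.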

In [16]:
!python ../../demo/demo_spatiotemporal_det.py \
 data/multisports/test/aerobic_gymnastics/v_7G_IpU0FxLU_c001.mp4 \
 data/demo_spatiotemporal_det.mp4 \
 --config configs/slowonly_k400_multisports.py \
 --checkpoint https://download.openmmlab.com/mmaction/v1.0/detection/slowonly/slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-8e_multisports-rgb/slowonly_kinetics400-pretrained-r50_8xb16-4x16x1-8e_multisports-rgb_20230320-a1ca5e76.pth \
 --det-config configs/faster-rcnn_r50-caffe_fpn_ms-1x_coco_ms_person.py \
 --det-checkpoint work_dirs/det_model/epoch_2.pth \
 --det-score-thr 0.85 \
 --action-score-thr 0.8 \
 --label-map ../../tools/data/multisports/label_map.txt \
 --predict-stepsize 8 \
 --output-stepsize 1 \
 --output-fps 24

ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4732:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5220:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2642:(snd_pcm_open_noupdate) Unknown PCM default

In [None]:
# Show Video
import moviepy.editor
moviepy.editor.ipython_display("data/demo_spatiotemporal_det.mp4")