MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation

Zhicheng Zhang, Pancheng Zhao, Eunil Park, Jufeng Yang


TL;DR: We present MART, an MAE-style method for learning robust affective representations of videos by exploiting the complementary sentiment and intrinsic emotion among temporal segments.

This repository contains the official implementation of our CVPR 2024 work. The PyTorch code for MART is released; more details can be found in our paper.

Publication

MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation
Zhicheng Zhang, Pancheng Zhao, Eunil Park, Jufeng Yang
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[PDF] [Poster] [Project Page] [Github]


ABSTRACT

Limited training data is a long-standing problem for video emotion analysis (VEA). Existing works leverage the power of large-scale image datasets for transferring while failing to extract the temporal correlation of affective cues in the video. Inspired by psychology research and empirical theory, we verify that the degree of emotion may vary in different segments of the video, thus introducing the sentiment complementary and emotion intrinsic among temporal segments. We propose an MAE-style method for learning robust affective representation of videos via masking, termed MART. First, we extract the affective cues of the lexicon and verify the extracted one by computing its matching score with video content. The hierarchical verification strategy is proposed, in terms of sentiment and emotion, to identify the matched cues alongside the temporal dimension. Then, with the verified cues, we propose masked affective modeling to recover temporal emotion distribution. We present temporal affective complementary learning that pulls the complementary part and pushes the intrinsic part of masked multimodal features, for learning robust affective representation. Under the constraint of affective complementary, we leverage cross-modal attention among features to mask the video and recover the degree of emotion among segments. Extensive experiments on five benchmark datasets demonstrate the superiority of our method in video sentiment analysis, video emotion recognition, multimodal sentiment analysis, and multimodal emotion recognition.

DEPENDENCIES

Recommended Environment

  • CUDA 11.1
  • Python 3.6
  • PyTorch 1.8.0

You can prepare the environment by running the following command.

We provide a frozen conda environment, env, that can be created directly from the provided file.

conda env create -f ./env.yaml
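
If you want to confirm that the created environment matches the recommended versions, a minimal sanity check, assuming torch is importable inside the new environment, could look like the sketch below (the script name check_env.py is hypothetical and not part of the repository):

# check_env.py -- hypothetical helper, not part of the repository
import torch

print("PyTorch:", torch.__version__)          # expected: 1.8.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA (build):", torch.version.cuda)    # expected: 11.1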

SCRIPTS

Preparation

Dataset: We preprocess the datasets via the scripts provided in tools.

Dataset Structure: The processed datasets are organized in the following structure.

VAA_VideoEmotion8
├── imgs
│   ├── Anger
│   ├── Anticipation
│   ├── Disgust
│   ├── Fear
│   ├── Joy
│   ├── Sadness
│   ├── Surprise
│   └── Trust
├── mp3
│   ├── Anger
│   ├── Anticipation
│   ├── Disgust
│   ├── Fear
│   ├── Joy
│   ├── Sadness
│   ├── Surprise
│   └── Trust
├── srt
│   ├── Anger
│   ├── Anticipation
│   ├── Disgust
│   ├── Fear
│   ├── Joy
│   ├── Sadness
│   ├── Surprise
│   └── Trust
└── ve8.json
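
As a quick way to confirm the layout above, a small sketch could walk the expected folders; it is a hypothetical helper assuming the VAA_VideoEmotion8 root shown here, not part of the released scripts:

# check_layout.py -- hypothetical check for the structure above
import os

ROOT = "VAA_VideoEmotion8"  # adjust to your dataset root
MODALITIES = ["imgs", "mp3", "srt"]
CLASSES = ["Anger", "Anticipation", "Disgust", "Fear",
           "Joy", "Sadness", "Surprise", "Trust"]

for modality in MODALITIES:
    for cls in CLASSES:
        path = os.path.join(ROOT, modality, cls)
        print(path, "ok" if os.path.isdir(path) else "MISSING")

print("annotation file present:", os.path.isfile(os.path.join(ROOT, "ve8.json")))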

Pre-trained Model: Download the pretrained models from [google drive/baidu netdisk].

Place the audioset_10_10_0.4593.pth at './models/ast/pretrained_models'

Place the vit_base_patch16_224.pth at './models/mbt/pretrained_models/vit_base_patch16_224'
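
To verify that the checkpoints sit where the code expects them, a minimal check run from the repository root might look like the sketch below; the exact file name inside the vit_base_patch16_224 folder is an assumption:

# check_pretrained.py -- hypothetical check for the paths listed above
import os
import torch

CHECKPOINTS = [
    "./models/ast/pretrained_models/audioset_10_10_0.4593.pth",
    # assumed file name inside the vit_base_patch16_224 directory
    "./models/mbt/pretrained_models/vit_base_patch16_224/vit_base_patch16_224.pth",
]

for ckpt in CHECKPOINTS:
    if not os.path.isfile(ckpt):
        print("MISSING:", ckpt)
        continue
    state = torch.load(ckpt, map_location="cpu")  # no GPU needed to inspect
    n = len(state) if isinstance(state, dict) else "n/a"
    print("ok:", ckpt, f"({n} entries)")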

Run

You can train and evaluate the model by running the script below.

More options such as epoch, milestone, and learning_rate can be configured; please refer to opts.

sh run.sh
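
For intuition on how the epoch, milestone, and learning_rate options typically interact, here is a small PyTorch sketch of a milestone-based schedule; the model, milestones, and learning rate below are illustrative placeholders, not the values defined in opts:

# lr_milestones_demo.py -- illustrative only; real values come from opts
import torch

model = torch.nn.Linear(10, 2)                          # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# the learning rate is multiplied by gamma at each milestone epoch
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    # ... one epoch of training would go here ...
    optimizer.step()                                    # placeholder update
    scheduler.step()
    if epoch in (29, 30, 59, 60):
        print(epoch, optimizer.param_groups[0]["lr"])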

REFERENCE

We referenced the repositories below for the code.

CITATION

If you find this repo useful in your project or research, please consider citing the relevant publication.

@inproceedings{Zhang_2024_CVPR,
  title={Mart: Masked affective representation learning via masked temporal distribution distillation},
  author={Zhang, Zhicheng and Zhao, Pancheng and Park, Eunil and Yang, Jufeng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}
