# CMCIR

**Repository Path**: ydqzZ/CMCIR

## Basic Information

- **Project Name**: CMCIR
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-01-31
- **Last Updated**: 2024-01-31

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# CMCIR

Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

For more details, please refer to our paper [Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering](https://arxiv.org/abs/2207.12647).

[Explanation in Chinese](https://mp.weixin.qq.com/s/RRVIACXRLA0-nePQO5bY6g)

### Abstract

Existing visual question answering methods often suffer from cross-modal spurious correlations and oversimplified event-level reasoning processes that fail to capture the event temporality, causality, and dynamics spanning the video. In this work, to address the task of event-level visual question answering, we propose a framework for cross-modal causal relational reasoning. In particular, a set of causal intervention operations is introduced to discover the underlying causal structures across visual and linguistic modalities. Our framework, named Cross-Modal Causal RelatIonal Reasoning (CMCIR), involves three modules: i) a Causality-aware Visual-Linguistic Reasoning (CVLR) module for collaboratively disentangling the visual and linguistic spurious correlations via front-door and back-door causal interventions; ii) a Spatial-Temporal Transformer (STT) module for capturing the fine-grained interactions between visual and linguistic semantics; and iii) a Visual-Linguistic Feature Fusion (VLFF) module for adaptively learning global semantic-aware visual-linguistic representations. Extensive experiments on four event-level datasets demonstrate the superiority of our CMCIR in discovering visual-linguistic causal structures and achieving robust event-level visual question answering.

### Model

![Image](Images/CMCIR.gif)

Figure 1: Framework of our proposed CMCIR.

### Experimental Results

![Image](Images/SUTD.png)

Figure 2: Results on the SUTD-TrafficQA dataset.

![Image](Images/TGIF.png)

Figure 3: Results on the TGIF-QA dataset.

![Image](Images/MSVD.png)

Figure 4: Results on the MSVD-QA dataset.

![Image](Images/MSRVTT.png)

Figure 5: Results on the MSRVTT-QA dataset.

### Requirements

- Python 3.7
- numpy
- PyTorch
- [pytorch-geometric](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html)

### Datasets

We conduct our experiments on the large-scale event-level urban dataset [SUTD-TrafficQA](https://sutdcv.github.io/SUTD-TrafficQA/#/) and three benchmark real-world datasets: [TGIF-QA](https://github.com/YunseokJANG/tgif-qa), [MSVD-QA](https://github.com/xudejing/video-question-answering), and [MSRVTT-QA](https://github.com/xudejing/video-question-answering). The preprocessing steps are the same as the official ones; please refer to the dataset pages for more details.

### Setups

1. Download the [SUTD-TrafficQA](https://sutdcv.github.io/SUTD-TrafficQA/#/), [TGIF-QA](https://github.com/YunseokJANG/tgif-qa), [MSVD-QA](https://github.com/xudejing/video-question-answering) and [MSRVTT-QA](https://github.com/xudejing/video-question-answering) datasets.
2. Edit the absolute paths in `preprocess/preprocess_features.py` and `preprocess/preprocess_questions.py` according to where your data is located.
3. Install the dependencies.

## Experiments with SUTD-TrafficQA

We refer to the [official SUTD-TrafficQA code](https://github.com/SUTDCV/SUTD-TrafficQA) for preprocessing.

### Preprocess Linguistic Features

1. Download the [GloVe pretrained 300d word vectors](http://nlp.stanford.edu/data/glove.840B.300d.zip) to `/data/glove/` and process them into a pickle file (a sketch of this conversion follows after this list):

```
python txt2pickle.py
```

2. Preprocess the train/val/test questions:

```
python 1_preprocess_questions_oie.py --mode train
python 1_preprocess_questions_oie.py --mode test
```
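`txt2pickle.py` itself is not documented in this repository. As a rough guide, here is a minimal sketch of such a conversion, assuming the goal is a pickled `{word: vector}` dictionary built from the extracted GloVe text file; the file names and output path below are assumptions, not the repository's actual interface.

```
# Hypothetical sketch of a GloVe txt-to-pickle conversion; paths are assumptions.
import pickle
import numpy as np

GLOVE_TXT = "/data/glove/glove.840B.300d.txt"   # extracted from the downloaded zip
OUT_PICKLE = "/data/glove/glove.840B.300d.pkl"  # assumed output location

embeddings = {}
with open(GLOVE_TXT, "r", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        # Some tokens in the 840B release contain spaces, so split from the right:
        # the last 300 fields are the vector, everything before them is the token.
        word = " ".join(parts[:-300])
        embeddings[word] = np.asarray(parts[-300:], dtype=np.float32)

with open(OUT_PICKLE, "wb") as f:
    pickle.dump(embeddings, f, protocol=pickle.HIGHEST_PROTOCOL)
print(f"Saved {len(embeddings)} vectors to {OUT_PICKLE}")
```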
### Preprocess Visual Features

1. To extract appearance features with the Swin or ResNet-101 model, download the Swin [pretrained model](https://github.com/microsoft/Swin-Transformer) (`swin_large_patch4_window7_224_22k.pth`) and place it in `configs/`:

```
python 1_preprocess_features_appearance.py --model Swin --question_type none
```

or

```
python 1_preprocess_features_appearance.py --model resnet101 --question_type none
```

2. To extract motion features with the Swin or ResNeXt-101 model, download the Swin3D [pretrained model](https://github.com/microsoft/Swin-Transformer) (`swin_base_patch244_window877_kinetics600_22k.pth`) and place it in `configs/`, or download the ResNeXt-101 [pretrained model](https://drive.google.com/drive/folders/1zvl89AgFAApbH0At-gMuZSeQB_LpNP-M) (`resnext-101-kinetics.pth`) and place it in `data/preprocess/pretrained/`:

```
python 1_preprocess_features_motion.py --model Swin --question_type none
```

or

```
python 1_preprocess_features_motion.py --model resnext101 --question_type none
```

### Visual K-means Clustering

1. To extract training appearance features with the Swin or ResNet-101 model:

```
python 1_preprocess_features_appearance_train.py --model Swin --question_type none
```

or

```
python 1_preprocess_features_appearance_train.py --model resnet101 --question_type none
```

2. To extract training motion features with the Swin or ResNeXt-101 model:

```
python 1_preprocess_features_motion_train.py --model Swin --question_type none
```

or

```
python 1_preprocess_features_motion_train.py --model resnext101 --question_type none
```

3. Run k-means clustering (a sketch of this step follows after this list):

```
python k_means.py
```

Edit the absolute paths according to where your data is located.
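`k_means.py` is likewise undocumented here. Below is a minimal sketch of what this clustering step typically looks like, assuming the extracted training features are stored as a NumPy array and that the resulting cluster centers serve as the visual dictionary used by the causal intervention modules; the file names and cluster count are assumptions for illustration.

```
# Hypothetical sketch of the k-means step; paths and cluster count are assumptions.
import numpy as np
from sklearn.cluster import KMeans

# Assumed (N, D) array of extracted training features.
features = np.load("data/sutd-traffic/appearance_train_features.npy")
features = features.reshape(-1, features.shape[-1])

kmeans = KMeans(n_clusters=512, n_init=10, random_state=0).fit(features)

# The cluster centers act as a compact dictionary over the training features.
np.save("data/sutd-traffic/appearance_cluster_centers.npy", kmeans.cluster_centers_)
```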
### Training and Testing

```
python train_SUTD.py
```

## Experiments with TGIF-QA

Depending on the task, choose `question_type` from four options: `action`, `transition`, `count`, or `frameqa`.

### Preprocess Linguistic Features

1. Preprocess the train/val/test questions:

```
python 1_preprocess_questions_oie_tgif.py --mode train --question_type {question_type}
python 1_preprocess_questions_oie_tgif.py --mode test --question_type {question_type}
```

### Preprocess Visual Features

1. To extract appearance features with the Swin or ResNet-101 model:

```
python 1_preprocess_features_appearance_tgif_total.py --model Swin --question_type {question_type}
```

or

```
python 1_preprocess_features_appearance_tgif_total.py --model resnet101 --question_type {question_type}
```

2. To extract motion features with the Swin or ResNeXt-101 model:

```
python 1_preprocess_features_motion_tgif_total.py --model Swin --question_type {question_type}
```

or

```
python 1_preprocess_features_motion_tgif_total.py --model resnext101 --question_type {question_type}
```

### Visual K-means Clustering

1. To extract training appearance features with the Swin or ResNet-101 model:

```
python 1_preprocess_features_appearance_tgif.py --model Swin --question_type {question_type}
```

or

```
python 1_preprocess_features_appearance_tgif.py --model resnet101 --question_type {question_type}
```

2. To extract training motion features with the Swin or ResNeXt-101 model:

```
python 1_preprocess_features_motion_tgif.py --model Swin --question_type {question_type}
```

or

```
python 1_preprocess_features_motion_tgif.py --model resnext101 --question_type {question_type}
```

3. Run k-means clustering:

```
python k_means.py
```

Edit the absolute paths according to where your data is located.

### Training and Testing

```
python train_TGIF_Action.py
python train_TGIF_Transition.py
python train_TGIF_Count.py
python train_TGIF_FrameQA.py
```

## Experiments with MSVD-QA/MSRVTT-QA

### Preprocess Linguistic Features

1. Preprocess the train/val/test questions:

```
python 1_preprocess_questions_oie_msvd.py --mode train
python 1_preprocess_questions_oie_msvd.py --mode test
```

or

```
python 1_preprocess_questions_oie_msrvtt.py --mode train
python 1_preprocess_questions_oie_msrvtt.py --mode test
```

### Preprocess Visual Features

1. To extract appearance features with the Swin or ResNet-101 model:

```
python 1_preprocess_features_appearance_msvd.py --model Swin --question_type none
python 1_preprocess_features_appearance_msrvtt.py --model Swin --question_type none
```

or

```
python 1_preprocess_features_appearance_msvd.py --model resnet101 --question_type none
python 1_preprocess_features_appearance_msrvtt.py --model resnet101 --question_type none
```

2. To extract motion features with the Swin or ResNeXt-101 model:

```
python 1_preprocess_features_motion_msvd.py --model Swin --question_type none
python 1_preprocess_features_motion_msrvtt.py --model Swin --question_type none
```

or

```
python 1_preprocess_features_motion_msvd.py --model resnext101 --question_type none
python 1_preprocess_features_motion_msrvtt.py --model resnext101 --question_type none
```

### Visual K-means Clustering

1. To extract training appearance features with the Swin or ResNet-101 model:

```
python 1_preprocess_features_appearance_msvd_train.py --model Swin --question_type none
python 1_preprocess_features_appearance_msrvtt_train.py --model Swin --question_type none
```

or

```
python 1_preprocess_features_appearance_msvd_train.py --model resnet101 --question_type none
python 1_preprocess_features_appearance_msrvtt_train.py --model resnet101 --question_type none
```

2. To extract training motion features with the Swin or ResNeXt-101 model:

```
python 1_preprocess_features_motion_msvd_train.py --model Swin --question_type none
python 1_preprocess_features_motion_msrvtt_train.py --model Swin --question_type none
```

or

```
python 1_preprocess_features_motion_msvd_train.py --model resnext101 --question_type none
python 1_preprocess_features_motion_msrvtt_train.py --model resnext101 --question_type none
```

3. Run k-means clustering:

```
python k_means.py
```

Edit the absolute paths according to where your data is located.

### Training and Testing

```
python train_MSVD.py
python train_MSRVTT.py
```

### Citation

If you use this code for your research, please cite our paper:

```
@article{CMCIR,
  title={Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering},
  author={Liu, Yang and Li, Guanbin and Lin, Liang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2023},
  doi={10.1109/TPAMI.2023.3284038}
}

@article{liu2022cross,
  title={Cross-modal causal relational reasoning for event-level visual question answering},
  author={Liu, Yang and Li, Guanbin and Lin, Liang},
  journal={arXiv preprint arXiv:2207.12647},
  year={2022}
}
```

If you have any questions about this code, feel free to reach out at liuy856@mail.sysu.edu.cn.