# MultiRD

**Repository Path**: thunlp/MultiRD

## Basic Information

- **Project Name**: MultiRD
- **Description**: Code and data of the AAAI-20 paper "Multi-channel Reverse Dictionary Model"
- **Primary Language**: Python
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-05-29
- **Last Updated**: 2020-12-19

## README

# MultiRD
Code and data of the AAAI-20 paper "**Multi-channel Reverse Dictionary Model**" [[pdf](https://arxiv.org/pdf/1912.08441.pdf)]

## Requirements
* Python 3.x
* PyTorch 1.x
* Other requirements: numpy, tqdm, nltk, gensim, thulac

## Quick Start
Download the code and data from [Google Drive](https://drive.google.com/drive/folders/1jeyPE8iGdGUSVJe_6Smr_NzoWfR52f4g?usp=sharing) or [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/d/ec29131d38fd4ca2a6ca/). The code there is the same as the code in this repository.

Unzip `data.zip` (once under the English path and once under the Chinese path). The files under `EnglishReverseDictionary` and `ChineseReverseDictionary` should then be laid out as follows:

```
ReverseDictionary
|- EnglishReverseDictionary
|  |- data
|  |  |- data_train.json
|  |  |- data_dev.json
|  |  |- data_test_500_rand1_seen.json
|  |  |- data_test_500_rand1_unseen.json
|  |  |- data_defi_c.json           [definitions of the target words in 200 descriptions]
|  |  |- data_desc_c.json           [testset of 200 descriptions]
|  |  |- vec_inuse.json             [Only embeddings used in this model are included.]
|  |  |- lexname_all.txt
|  |  |- root_affix_freq.txt
|  |  |- sememes_all.txt
|  |  |- target_words.txt
|  |- code
|     |- main.py
|     |- model.py
|     |- data.py
|     |- evaluate.py
|     |- evaluate_result.py
|     |- analyse_result.py
|     |- result_analysis_En_1200.py
|- ChineseReverseDictionary
|  |- data
|  |  |- Cx.json                    [x=1,2,3,4]
|  |  |- description_sense.json     [train & dev dataset]
|  |  |- description_idio_locu.json [testset of Question]
|  |  |- description_byHand.json    [testset of description]
|  |  |- hownet.json
|  |  |- sememe.json
|  |  |- word_cilinClass.json
|  |  |- word_index.json
|  |  |- word_vector.npy            [Only embeddings used in this model are included.]
|  |- code
|     |- main.py
|     |- model.py
|     |- data.py
|     |- evaluate.py
|     |- evaluate_result.py
|- PrepareYourOwnDataset
   |- <See below.>
```

### Train English Model
Run this command from the `code` directory:
```bash
python main.py -b [batch_size] -e [epoch_num] -g [gpu_num] -sd [random_seed] -f [freq_mor] -m [rsl, r, s, l, b] -v
```
In `-m [rsl, r, s, l, b]`:

- `-m r` uses morpheme information (roots and affixes). You can filter morphemes with `-f`, usually set to 15~35;
- `-m s` uses the sememe predictor;
- `-m l` uses WordNet lexnames, i.e. word category information (lexical name and POS tag);
- `-m b` uses no additional information, only the basic BiLSTM model;
- `-m rsl` uses all of the above, which is our multi-channel model.

`-e` is usually set to 10~20;

`-g` indicates which GPU to use;

`-v` shows a progress bar.
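
To make the flag semantics above concrete, here is a minimal sketch of how such a command line could be parsed. It is not the repository's `main.py`: the long option names, the default values and the channel labels are all hypothetical, and only the short flags and the mode letters come from the description above.

```python
import argparse

# Letters accepted by -m for the English model, as listed above.
# The channel names on the right are illustrative labels only.
CHANNELS = {"r": "morpheme", "s": "sememe", "l": "lexname"}


def parse_mode(mode):
    """Return the set of extra channels enabled by a -m value such as 'rsl'."""
    if mode == "b":  # basic BiLSTM, no extra channels
        return set()
    unknown = set(mode) - set(CHANNELS)
    if not mode or unknown:
        raise ValueError(f"invalid mode: {mode!r}")
    return {CHANNELS[letter] for letter in mode}


def build_parser():
    """Hypothetical parser mirroring the documented flags (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Train the English model (sketch)")
    parser.add_argument("-b", "--batch_size", type=int, default=128)
    parser.add_argument("-e", "--epoch_num", type=int, default=10)
    parser.add_argument("-g", "--gpu_num", type=int, default=0)
    parser.add_argument("-sd", "--seed", type=int, default=0)
    parser.add_argument("-f", "--freq_mor", type=int, default=20)
    parser.add_argument("-m", "--mode", type=str, default="b")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


# Example: the full multi-channel model, 15 epochs, morpheme filter of 20.
args = build_parser().parse_args(["-m", "rsl", "-e", "15", "-f", "20", "-v"])
channels = parse_mode(args.mode)
```

Treating the mode as a string of single-letter switches keeps combinations such as `rs` or `sl` expressible without adding more flags.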


After training, you will get two new files, `xxx_label_list.json` and `xxx_pred_list.json`, where `xxx` is the mode you set with `-m`. For example, `-m rsl` produces `rsl_label_list.json` and `rsl_pred_list.json`.

#### Evaluation
Run this command from the `code` directory:
```bash
python evaluate_result.py -m [mode]
```
Here, `mode` is the same as above.

You will then get the `median rank`, `accuracy@1/10/100` and `rank variance` results on 3 test sets: **seen**, **unseen** and **description**.
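
For readers who want to reproduce these metrics on their own predictions, the following is a minimal sketch of how they can be computed from the rank of the correct word in each ranked prediction list. It does not read the `xxx_label_list.json` / `xxx_pred_list.json` files, whose internal layout is not described here. The function names are hypothetical, ranks are assumed to be 0-based, and "rank variance" is taken literally as the population variance; if the standard deviation is wanted instead, `statistics.pstdev` can be substituted.

```python
import statistics


def rank_of(label, ranked_candidates):
    """0-based position of the gold word in the model's ranked list."""
    try:
        return ranked_candidates.index(label)
    except ValueError:
        return len(ranked_candidates)  # gold word missing: worst possible rank


def evaluate_ranks(ranks, ks=(1, 10, 100)):
    """Compute median rank, rank variance and accuracy@k from 0-based ranks."""
    result = {
        "median_rank": statistics.median(ranks),
        "rank_variance": statistics.pvariance(ranks),  # assumption: population variance
    }
    for k in ks:
        # A prediction counts as correct at k if the gold word is in the top k.
        result[f"accuracy@{k}"] = sum(r < k for r in ranks) / len(ranks)
    return result


# Example with four test descriptions.
metrics = evaluate_ranks([0, 3, 12, 150])
```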


You can evaluate model performance with prior knowledge:

```bash
python analyse_result.py
python result_analysis_En_1200.py -m [mode]
```

### Train Chinese Model
Run this command from the `code` directory:
```bash
python main.py -b [batch_size] -e [epoch_num] -g [gpu_num] -sd [random_seed] -u/-s -m [CPsc, C, P, s, c, b] -v
```
Unlike English model training, `-u` or `-s` selects the **Unseen** or **Seen** test mode. In fact, there is no need to use the test mode on the Seen Definition test set.

In `-m [CPsc, C, P, s, c, b]`:

- `-m C` uses Cilin word category information (we use 4 word classes in Cilin);
- `-m P` uses the POS predictor;
- `-m s` uses the sememe predictor;
- `-m c` uses the morpheme predictor, where morphemes are Chinese characters;
- `-m b` uses no additional information, only the basic BiLSTM model;
- `-m CPsc` uses all of the above, which is our multi-channel model.

`-e`, `-g` and `-v` are the same as in English model training.

#### Evaluation

```bash
python evaluate_result.py -m [mode]
```
Here, `mode` is the prefix of `xxx_label_list.json`.

You will then get the `median rank`, `accuracy@1/10/100` and `rank variance` results on 4 test sets: **seen**, **unseen**, **Description** and **Question**.


You can evaluate model performance with prior knowledge:
```bash
python result_analysis_Ch.py -m [mode]
```

## Prepare Your Own Data

The `PrepareYourOwnDataset` directory contains some scripts for reference. The data formats are described below so that you can build your own dataset.
```
ReverseDictionary
|- EnglishReverseDictionary
|- ChineseReverseDictionary
|- PrepareYourOwnDataset
   |- proc_allFeatures.py
   |- get_wordnet_lexname.py
   |- get_wordnet_500sample.py
   |- process_googleVec_checkAllData.py
   |- readHowNet_to_word_sememe.py
   |- wordnik_get_defi.py
   |- check_root_affix.py
```
### Data Formats
The `data_xxx.json` files are in JSON format. An example entry:
```
{
     "word": "fatalism",
     "lexnames": [
         "noun.cognition"
     ],
     "root_affix": [
         "fatal",
         "ism"
     ],
     "sememes": [
         "knowledge",
         "believe",
         "experience",
         "Fate"
     ],
     "definitions": "the doctrine that all events are predetermined by fate and are therefore unalterable"
}
```
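
When building your own dataset, it can help to check each entry against the fields shown above before training. The following is a minimal sketch: the function names are hypothetical, and it assumes that a `data_xxx.json` file holds a JSON array of such entries, which is not stated explicitly here.

```python
import json

# Fields that hold lists of strings in the example entry above.
LIST_FIELDS = ("lexnames", "root_affix", "sememes")


def validate_entry(entry):
    """Return a list of problems found in one entry (empty means it looks fine)."""
    problems = []
    if not isinstance(entry.get("word"), str) or not entry.get("word"):
        problems.append("word must be a non-empty string")
    if not isinstance(entry.get("definitions"), str):
        problems.append("definitions must be a string")
    for field in LIST_FIELDS:
        value = entry.get(field)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{field} must be a list of strings")
    return problems


def load_entries(path):
    """Load a data file, assuming it contains a JSON array of entries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


sample = json.loads("""
{
    "word": "fatalism",
    "lexnames": ["noun.cognition"],
    "root_affix": ["fatal", "ism"],
    "sememes": ["knowledge", "believe", "experience", "Fate"],
    "definitions": "the doctrine that all events are predetermined by fate and are therefore unalterable"
}
""")
issues = validate_entry(sample)
```
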
- Word embeddings are in `vec_inuse.json`, which covers all target words and all words in the definitions; only words that are actually used are included. The format is `{word: [vector], ...}`.
- `lexname_all.txt` contains all 45 lexnames from WordNet.
- `sememes_all.txt` contains 1400 sememes from HowNet.
- Morphemes (roots and affixes) are in `root_affix_freq.txt`, which lists each morpheme and its count, separated by spaces.
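
As an illustration of the `root_affix_freq.txt` layout, here is a minimal sketch that parses such lines and filters morphemes by frequency. It rests on two assumptions that are not confirmed here: that each line is `morpheme count` in that order, and that the `-f` option acts as a minimum-count threshold. The function name is hypothetical.

```python
def load_morphemes(lines, min_freq):
    """Parse 'morpheme count' lines and keep morphemes with count >= min_freq."""
    kept = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        morpheme, count = parts[0], int(parts[1])
        if count >= min_freq:
            kept[morpheme] = count
    return kept


# Example with in-memory lines; with a real file, pass the open file object:
#     with open("root_affix_freq.txt", encoding="utf-8") as f:
#         morphemes = load_morphemes(f, min_freq=20)
sample_lines = ["ism 120\n", "fatal 18\n", "\n", "un 300\n"]
morphemes = load_morphemes(sample_lines, min_freq=20)
```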

### Download and Process Data
In English experiments, we use the Description dataset from [(Hill et al. 2016)](https://arxiv.org/pdf/1504.00548.pdf). 

Word embeddings are from [GoogleNews-vectors-negative300](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing). 
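
Since `vec_inuse.json` only contains the embeddings that are actually used, a file of that shape can be produced by filtering the full embedding table down to your vocabulary. The following is a minimal sketch, not the repository's preprocessing script: the function name and the local file names are assumptions. The gensim call in the comments, `KeyedVectors.load_word2vec_format(..., binary=True)`, is the usual way to load the GoogleNews binary file.

```python
import json


def select_vectors(vectors, vocabulary):
    """Keep only the vectors for words in `vocabulary`.

    `vectors` is any mapping-like object supporting `word in vectors` and
    `vectors[word]`, e.g. a gensim KeyedVectors instance or a plain dict.
    Returns (kept, missing): a {word: [floats]} dict and the words not found.
    """
    kept, missing = {}, []
    for word in vocabulary:
        if word in vectors:
            kept[word] = [float(x) for x in vectors[word]]  # JSON-serializable
        else:
            missing.append(word)
    return kept, missing


# Typical use with gensim (the local file names are assumptions):
# from gensim.models import KeyedVectors
# kv = KeyedVectors.load_word2vec_format(
#     "GoogleNews-vectors-negative300.bin", binary=True)
# kept, missing = select_vectors(kv, ["fatalism", "doctrine", "fate"])
# with open("vec_inuse.json", "w", encoding="utf-8") as f:
#     json.dump(kept, f)

# Small in-memory example with 2-dimensional toy vectors.
toy = {"fate": [0.1, 0.2], "doctrine": [0.3, 0.4]}
kept, missing = select_vectors(toy, ["fate", "fatalism"])
```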

Sememes can be obtained using [OpenHowNet](https://github.com/thunlp/OpenHowNet). 

Lexnames come from WordNet and are easy to obtain with NLTK.
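
A minimal sketch of the NLTK lookup: each WordNet synset exposes its lexname through `Synset.lexname()`, so the lexnames of a word can be collected from `wordnet.synsets(word)`. The function name and the injectable `synsets_fn` parameter are hypothetical conveniences; the real lookup requires the WordNet corpus to be installed, for example via `nltk.download("wordnet")`.

```python
def lexnames_for(word, synsets_fn=None):
    """Return the distinct WordNet lexnames of `word`, in first-seen order."""
    if synsets_fn is None:
        # Imported lazily so the helper can be exercised without NLTK installed.
        from nltk.corpus import wordnet as wn
        synsets_fn = wn.synsets
    names = []
    for synset in synsets_fn(word):
        name = synset.lexname()  # e.g. "noun.cognition"
        if name not in names:
            names.append(name)
    return names


# With NLTK and the WordNet corpus available:
#     lexnames_for("fatalism")
```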

We obtain morphemes with the [Morfessor tool](https://morfessor.readthedocs.io/en/latest/). The training dataset is from [morpho.aalto.fi](http://morpho.aalto.fi/events/morphochallenge2010/datasets.shtml). You should train a Morfessor model first, and then use it to segment the target words into their roots and affixes.

```bash
morfessor-train --encoding=ISO_8859-15 --traindata-list --logfile=log.log -s model.bin -d ones wordlist-2010.eng
morfessor-segment -l ../morfessor_data/model.bin target_words.txt -o word_root_affix.txt
```
Unfortunately, the morphemes obtained by this method are not accurate. We recommend using a standard root-affix dictionary instead.
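
If you do use the `morfessor-segment` output, it needs to be turned into a word-to-morphemes mapping for the `root_affix` field. The following is a minimal sketch that assumes the default output layout of one word per line with its morphs separated by spaces (for example `fatal ism`). The function name is hypothetical, and the sketch should be adapted if a different output format was requested.

```python
def parse_segmentation(lines):
    """Map each word to its list of morphs from space-separated segment lines."""
    result = {}
    for line in lines:
        morphs = line.split()
        if morphs:  # ignore blank lines
            # The original word is the concatenation of its morphs.
            result["".join(morphs)] = morphs
    return result


# Example with in-memory lines; with a real file, pass the open file object:
#     with open("word_root_affix.txt", encoding="utf-8") as f:
#         word_to_morphs = parse_segmentation(f)
word_to_morphs = parse_segmentation(["fatal ism\n", "un alter able\n", "\n"])
```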


## Cite
If you use any code or data, please cite this paper:

```
@article{zhang2019multi,
    title={Multi-channel Reverse Dictionary Model},
    author={Zhang, Lei and Qi, Fanchao and Liu, Zhiyuan and Wang, Yasheng and Liu, Qun and Sun, Maosong},
    journal={arXiv preprint arXiv:1912.08441},
    year={2019}
}
```

## Contact
You can visit our [online reverse dictionary website](https://wantwords.thunlp.org/), which uses optimized methods and datasets that have not been updated in this repository. Feel free to open an issue if you have any questions.