# SpERT_chinese

**Repository Path**: labofnlp/SpERT_chinese

## Basic Information

- **Project Name**: SpERT_chinese
- **Description**: A good example for NER
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-01-07
- **Last Updated**: 2024-01-07

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# SpERT_chinese

Chinese relation extraction based on the paper SpERT: "Span-based Entity and Relation Transformer", jointly extracting entities, entity types, and relation types.

Original paper: https://arxiv.org/abs/1909.07755 (published at ECAI 2020)

Original code: https://github.com/lavis-nlp/spert

![alt text](http://deepca.cs.hs-rm.de/img/deepca/spert.png)

## Setup

### Requirements

- Required
  - Python 3.5+
  - PyTorch (tested with version 1.4.0)
  - transformers (+sentencepiece, e.g. with 'pip install transformers[sentencepiece]', tested with version 4.1.1)
  - scikit-learn (tested with version 0.24.0)
  - tqdm (tested with version 4.55.1)
  - numpy (tested with version 1.17.4)
- Optional
  - jinja2 (tested with version 2.10.3) - if installed, used to export relation extraction examples
  - tensorboardX (tested with version 1.6) - if installed, used to save training process to tensorboard
  - spacy (tested with version 3.0.1) - if installed, used to tokenize sentences for prediction

```
pip install transformers==4.1.1
pip install tensorboardX
pip install tqdm
pip install jinja2
pip install spacy==3.3.1
```

Additionally, download https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-3.3.0/zh_core_web_sm-3.3.0.tar.gz and install it with `pip install zh_core_web_sm-3.3.0.tar.gz`.

You also need to download chinese-bert-wwm-ext from Hugging Face and place it under model_hub/chinese-bert-wwm-ext/.

### Getting the data

The data used here is the information extraction dataset from the LUGE benchmark, which can be downloaded at [千言(LUGE)| 全面的中文开源数据集合](https://www.luge.ai/#/luge/dataDetail?id=5). Download and unzip it to obtain duie_train.json, duie_dev.json, and duie_schema.json, place them under data/duie/, and then run the process.py in that directory to generate:

```
train.json                    # training set
dev.json                      # validation set; if there is a test set, a test.json can be generated the same way
duie_prediction_example.json  # prediction examples
duie_types.json               # the stored entity types and relation types
entity_types.txt              # not actually used, only kept for manual inspection
relation_types.txt            # not actually used, only kept for manual inspection
```

train.json and dev.json contain data in the following format:

```python
[
    {"tokens": ["这", "件", "婚", "事", "原", "本", "与", "陈", "国", "峻", "无", "关", ",", "但", "陈", "国", "峻", "却", "“", "欲", "求", "配", "而", "无", "由", ",", "夜", "间", "乃", "潜", "入", "天", "城", "公", "主", "所", "居", "通", "之"], "entities": [{"type": "人物", "start": 8, "end": 10}, {"type": "人物", "start": 31, "end": 35}], "relations": [{"type": "丈夫", "tail": 0, "head": 1}, {"type": "妻子", "head": 0, "tail": 1}]},
    ......
]
```

Note that the head and tail fields in relations are indices into the same document's entities list.
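To make the indexing concrete, here is a minimal sketch (not part of this repository) that loads train.json and resolves each relation's head/tail indices back to entity spans; it assumes the usual spert convention that end is exclusive:

```python
import json

# Minimal sketch: resolve the head/tail indices of each relation
# back to the corresponding entity spans and their surface text.
with open("data/duie/train.json", encoding="utf-8") as f:
    docs = json.load(f)

doc = docs[0]
tokens = doc["tokens"]
for rel in doc["relations"]:
    head = doc["entities"][rel["head"]]  # entity referenced by rel["head"]
    tail = doc["entities"][rel["tail"]]  # entity referenced by rel["tail"]
    head_text = "".join(tokens[head["start"]:head["end"]])  # end-exclusive span
    tail_text = "".join(tokens[tail["start"]:tail["end"]])
    print(rel["type"], head["type"], head_text, "->", tail["type"], tail_text)
```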
duie_types.json has the following format:

```python
{"entities": {"行政区": {"short": "行政区", "verbose": "行政区"}, "人物": {"short": "人物", "verbose": "人物"}, "气候": {"short": "气候", "verbose": "气候"}, "文学作品": {"short": "文学作品", "verbose": "文学作品"}, "Text": {"short": "Text", "verbose": "Text"}, "学科专业": {"short": "学科专业", "verbose": "学科专业"}, "作品": {"short": "作品", "verbose": "作品"}, "奖项": {"short": "奖项", "verbose": "奖项"}, "国家": {"short": "国家", "verbose": "国家"}, "电视综艺": {"short": "电视综艺", "verbose": "电视综艺"}, "影视作品": {"short": "影视作品", "verbose": "影视作品"}, "企业": {"short": "企业", "verbose": "企业"}, "语言": {"short": "语言", "verbose": "语言"}, "歌曲": {"short": "歌曲", "verbose": "歌曲"}, "Date": {"short": "Date", "verbose": "Date"}, "企业/品牌": {"short": "企业/品牌", "verbose": "企业/品牌"}, "地点": {"short": "地点", "verbose": "地点"}, "Number": {"short": "Number", "verbose": "Number"}, "图书作品": {"short": "图书作品", "verbose": "图书作品"}, "景点": {"short": "景点", "verbose": "景点"}, "城市": {"short": "城市", "verbose": "城市"}, "学校": {"short": "学校", "verbose": "学校"}, "音乐专辑": {"short": "音乐专辑", "verbose": "音乐专辑"}, "机构": {"short": "机构", "verbose": "机构"}},
"relations": {"编剧": {"short": "编剧", "verbose": "编剧", "symmetric": false}, "修业年限": {"short": "修业年限", "verbose": "修业年限", "symmetric": false}, "毕业院校": {"short": "毕业院校", "verbose": "毕业院校", "symmetric": false}, "气候": {"short": "气候", "verbose": "气候", "symmetric": false}, "配音": {"short": "配音", "verbose": "配音", "symmetric": false}, "注册资本": {"short": "注册资本", "verbose": "注册资本", "symmetric": false}, "成立日期": {"short": "成立日期", "verbose": "成立日期", "symmetric": false}, "父亲": {"short": "父亲", "verbose": "父亲", "symmetric": false}, "面积": {"short": "面积", "verbose": "面积", "symmetric": false}, "专业代码": {"short": "专业代码", "verbose": "专业代码", "symmetric": false}, "作者": {"short": "作者", "verbose": "作者", "symmetric": false}, "首都": {"short": "首都", "verbose": "首都", "symmetric": false}, "丈夫": {"short": "丈夫", "verbose": "丈夫", "symmetric": false}, "嘉宾": {"short": "嘉宾", "verbose": "嘉宾", "symmetric": false}, "官方语言": {"short": "官方语言", "verbose": "官方语言", "symmetric": false}, "作曲": {"short": "作曲", "verbose": "作曲", "symmetric": false}, "号": {"short": "号", "verbose": "号", "symmetric": false}, "票房": {"short": "票房", "verbose": "票房", "symmetric": false}, "简称": {"short": "简称", "verbose": "简称", "symmetric": false}, "母亲": {"short": "母亲", "verbose": "母亲", "symmetric": false}, "制片人": {"short": "制片人", "verbose": "制片人", "symmetric": false}, "导演": {"short": "导演", "verbose": "导演", "symmetric": false}, "歌手": {"short": "歌手", "verbose": "歌手", "symmetric": false}, "改编自": {"short": "改编自", "verbose": "改编自", "symmetric": false}, "海拔": {"short": "海拔", "verbose": "海拔", "symmetric": false}, "占地面积": {"short": "占地面积", "verbose": "占地面积", "symmetric": false}, "出品公司": {"short": "出品公司", "verbose": "出品公司", "symmetric": false}, "上映时间": {"short": "上映时间", "verbose": "上映时间", "symmetric": false}, "所在城市": {"short": "所在城市", "verbose": "所在城市", "symmetric": false}, "主持人": {"short": "主持人", "verbose": "主持人", "symmetric": false}, "作词": {"short": "作词", "verbose": "作词", "symmetric": false}, "人口数量": {"short": "人口数量", "verbose": "人口数量", "symmetric": false}, "祖籍": {"short": "祖籍", "verbose": "祖籍", "symmetric": false}, "校长": {"short": "校长", "verbose": "校长", "symmetric": false}, "朝代": {"short": "朝代", "verbose": "朝代", "symmetric": false}, "主题曲": {"short": "主题曲", "verbose": "主题曲", "symmetric": false}, "获奖": {"short": "获奖", "verbose": "获奖", "symmetric": false}, "代言人": {"short": "代言人", "verbose": "代言人", "symmetric": false}, "主演": {"short": "主演", "verbose": "主演", "symmetric": false}, "所属专辑": {"short": "所属专辑", "verbose": "所属专辑", "symmetric": false}, "饰演": {"short": "饰演", "verbose": "饰演", "symmetric": false}, "董事长": {"short": "董事长", "verbose": "董事长", "symmetric": false}, "主角": {"short": "主角", "verbose": "主角", "symmetric": false}, "妻子": {"short": "妻子", "verbose": "妻子", "symmetric": false}, "总部地点": {"short": "总部地点", "verbose": "总部地点", "symmetric": false}, "国籍": {"short": "国籍", "verbose": "国籍", "symmetric": false}, "创始人": {"short": "创始人", "verbose": "创始人", "symmetric": false}, "邮政编码": {"short": "邮政编码", "verbose": "邮政编码", "symmetric": false}}}
```
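As a quick sanity check (again a sketch rather than repository code), the type counts in this file line up with the label mappings printed in the training log further below: index 0 is reserved for the "none" classes, so the 24 entity types and 48 relation types become counts of 25 and 49:

```python
import json

# Minimal sketch: rebuild the label-to-index maps implied by the training log.
# json.load preserves file order, so 行政区=1, 人物=2, ... as in the log.
with open("data/duie/duie_types.json", encoding="utf-8") as f:
    types = json.load(f)

entity2idx = {"No Entity": 0}
for name in types["entities"]:
    entity2idx[name] = len(entity2idx)

relation2idx = {"No Relation": 0}
for name in types["relations"]:
    relation2idx[name] = len(relation2idx)

print(len(entity2idx), len(relation2idx))  # 25 and 49, matching the log
```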
"symmetric": false}, "所属专辑": {"short": "所属专辑", "verbose": "所属专辑", "symmetric": false}, "饰演": {"short": "饰演", "verbose": "饰演", "symmetric": false}, "董事长": {"short": "董事长", "verbose": "董事长", "symmetric": false}, "主角": {"short": "主角", "verbose": "主角", "symmetric": false}, "妻子": {"short": "妻子", "verbose": "妻子", "symmetric": false}, "总部地点": {"short": "总部地点", "verbose": "总部地点", "symmetric": false}, "国籍": {"short": "国籍", "verbose": "国籍", "symmetric": false}, "创始人": {"short": "创始人", "verbose": "创始人", "symmetric": false}, "邮政编码": {"short": "邮政编码", "verbose": "邮政编码", "symmetric": false}}} ``` ## 例子 **(1)** 在duie上使用训练集进行训练, 在验证集上进行评估。需要注意的是,这里我只使用了训练集的10000条数据和验证集的10000条数据训练了1个epoch。 ``` python ./spert.py train --config configs/duie_train.conf ``` ```python -------------------------------------------------- Config: {'label': 'duie_train', 'model_type': 'spert', 'model_path': 'model_hub/chinese-bert-wwm-ext', 'tokenizer_path': 'model_hub/chinese-bert-wwm-ext', 'train_path': 'data/duie/train.json', 'valid_path': 'data/duie/dev.json', 'types_path': 'data/duie/duie_types.json', 'train_batch_size': '2', 'eval_batch_size': '1', 'neg_entity_count': '100', 'neg_relation_count': '100', 'epochs': '1', 'lr': '5e-5', 'lr_warmup': '0.1', 'weight_decay': '0.01', 'max_grad_norm': '1.0', 'rel_filter_threshold': '0.4', 'size_embedding': '25', 'prop_drop': '0.1', 'max_span_size': '20', 'store_predictions': 'true', 'store_examples': 'true', 'sampling_processes': '2', 'max_pairs': '1000', 'final_eval': 'true', 'log_path': 'data/log/', 'save_path': 'data/save/'} Repeat 1 times -------------------------------------------------- Iteration 0 -------------------------------------------------- 2022-11-17 06:48:16,488 [MainThread ] [INFO ] Datasets: data/duie/train.json, data/duie/dev.json 2022-11-17 06:48:16,489 [MainThread ] [INFO ] Model type: spert Parse dataset 'train': 100% 10000/10000 [00:52<00:00, 189.61it/s] Parse dataset 'valid': 100% 10000/10000 [00:52<00:00, 191.25it/s] 2022-11-17 06:50:02,108 [MainThread ] [INFO ] Relation type count: 49 2022-11-17 06:50:02,108 [MainThread ] [INFO ] Entity type count: 25 2022-11-17 06:50:02,108 [MainThread ] [INFO ] Entities: 2022-11-17 06:50:02,108 [MainThread ] [INFO ] No Entity=0 2022-11-17 06:50:02,108 [MainThread ] [INFO ] 行政区=1 2022-11-17 06:50:02,109 [MainThread ] [INFO ] 人物=2 2022-11-17 06:50:02,109 [MainThread ] [INFO ] 气候=3 2022-11-17 06:50:02,109 [MainThread ] [INFO ] 文学作品=4 2022-11-17 06:50:02,109 [MainThread ] [INFO ] Text=5 2022-11-17 06:50:02,109 [MainThread ] [INFO ] 学科专业=6 2022-11-17 06:50:02,109 [MainThread ] [INFO ] 作品=7 2022-11-17 06:50:02,109 [MainThread ] [INFO ] 奖项=8 2022-11-17 06:50:02,109 [MainThread ] [INFO ] 国家=9 2022-11-17 06:50:02,109 [MainThread ] [INFO ] 电视综艺=10 2022-11-17 06:50:02,110 [MainThread ] [INFO ] 影视作品=11 2022-11-17 06:50:02,110 [MainThread ] [INFO ] 企业=12 2022-11-17 06:50:02,110 [MainThread ] [INFO ] 语言=13 2022-11-17 06:50:02,110 [MainThread ] [INFO ] 歌曲=14 2022-11-17 06:50:02,110 [MainThread ] [INFO ] Date=15 2022-11-17 06:50:02,110 [MainThread ] [INFO ] 企业/品牌=16 2022-11-17 06:50:02,110 [MainThread ] [INFO ] 地点=17 2022-11-17 06:50:02,110 [MainThread ] [INFO ] Number=18 2022-11-17 06:50:02,111 [MainThread ] [INFO ] 图书作品=19 2022-11-17 06:50:02,111 [MainThread ] [INFO ] 景点=20 2022-11-17 06:50:02,111 [MainThread ] [INFO ] 城市=21 2022-11-17 06:50:02,111 [MainThread ] [INFO ] 学校=22 2022-11-17 06:50:02,111 [MainThread ] [INFO ] 音乐专辑=23 2022-11-17 06:50:02,111 [MainThread ] [INFO ] 机构=24 2022-11-17 06:50:02,111 [MainThread ] [INFO ] Relations: 
```python
Repeat 1 times
--------------------------------------------------
Iteration 0
--------------------------------------------------
2022-11-17 06:48:16,488 [MainThread ] [INFO ] Datasets: data/duie/train.json, data/duie/dev.json
2022-11-17 06:48:16,489 [MainThread ] [INFO ] Model type: spert
Parse dataset 'train': 100% 10000/10000 [00:52<00:00, 189.61it/s]
Parse dataset 'valid': 100% 10000/10000 [00:52<00:00, 191.25it/s]
2022-11-17 06:50:02,108 [MainThread ] [INFO ] Relation type count: 49
2022-11-17 06:50:02,108 [MainThread ] [INFO ] Entity type count: 25
2022-11-17 06:50:02,108 [MainThread ] [INFO ] Entities:
2022-11-17 06:50:02,108 [MainThread ] [INFO ] No Entity=0
2022-11-17 06:50:02,108 [MainThread ] [INFO ] 行政区=1
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 人物=2
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 气候=3
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 文学作品=4
2022-11-17 06:50:02,109 [MainThread ] [INFO ] Text=5
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 学科专业=6
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 作品=7
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 奖项=8
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 国家=9
2022-11-17 06:50:02,109 [MainThread ] [INFO ] 电视综艺=10
2022-11-17 06:50:02,110 [MainThread ] [INFO ] 影视作品=11
2022-11-17 06:50:02,110 [MainThread ] [INFO ] 企业=12
2022-11-17 06:50:02,110 [MainThread ] [INFO ] 语言=13
2022-11-17 06:50:02,110 [MainThread ] [INFO ] 歌曲=14
2022-11-17 06:50:02,110 [MainThread ] [INFO ] Date=15
2022-11-17 06:50:02,110 [MainThread ] [INFO ] 企业/品牌=16
2022-11-17 06:50:02,110 [MainThread ] [INFO ] 地点=17
2022-11-17 06:50:02,110 [MainThread ] [INFO ] Number=18
2022-11-17 06:50:02,111 [MainThread ] [INFO ] 图书作品=19
2022-11-17 06:50:02,111 [MainThread ] [INFO ] 景点=20
2022-11-17 06:50:02,111 [MainThread ] [INFO ] 城市=21
2022-11-17 06:50:02,111 [MainThread ] [INFO ] 学校=22
2022-11-17 06:50:02,111 [MainThread ] [INFO ] 音乐专辑=23
2022-11-17 06:50:02,111 [MainThread ] [INFO ] 机构=24
2022-11-17 06:50:02,111 [MainThread ] [INFO ] Relations:
2022-11-17 06:50:02,111 [MainThread ] [INFO ] No Relation=0
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 编剧=1
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 修业年限=2
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 毕业院校=3
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 气候=4
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 配音=5
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 注册资本=6
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 成立日期=7
2022-11-17 06:50:02,112 [MainThread ] [INFO ] 父亲=8
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 面积=9
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 专业代码=10
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 作者=11
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 首都=12
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 丈夫=13
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 嘉宾=14
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 官方语言=15
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 作曲=16
2022-11-17 06:50:02,113 [MainThread ] [INFO ] 号=17
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 票房=18
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 简称=19
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 母亲=20
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 制片人=21
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 导演=22
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 歌手=23
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 改编自=24
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 海拔=25
2022-11-17 06:50:02,114 [MainThread ] [INFO ] 占地面积=26
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 出品公司=27
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 上映时间=28
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 所在城市=29
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 主持人=30
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 作词=31
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 人口数量=32
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 祖籍=33
2022-11-17 06:50:02,115 [MainThread ] [INFO ] 校长=34
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 朝代=35
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 主题曲=36
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 获奖=37
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 代言人=38
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 主演=39
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 所属专辑=40
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 饰演=41
2022-11-17 06:50:02,116 [MainThread ] [INFO ] 董事长=42
2022-11-17 06:50:02,117 [MainThread ] [INFO ] 主角=43
2022-11-17 06:50:02,117 [MainThread ] [INFO ] 妻子=44
2022-11-17 06:50:02,117 [MainThread ] [INFO ] 总部地点=45
2022-11-17 06:50:02,117 [MainThread ] [INFO ] 国籍=46
2022-11-17 06:50:02,117 [MainThread ] [INFO ] 创始人=47
2022-11-17 06:50:02,117 [MainThread ] [INFO ] 邮政编码=48
2022-11-17 06:50:02,117 [MainThread ] [INFO ] Dataset: train
2022-11-17 06:50:02,117 [MainThread ] [INFO ] Document count: 10000
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Relation count: 18119
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Entity count: 28033
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Dataset: valid
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Document count: 10000
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Relation count: 18223
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Entity count: 28071
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Updates per epoch: 5000
2022-11-17 06:50:02,118 [MainThread ] [INFO ] Updates total: 5000
```
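Before training starts, transformers prints the checkpoint messages shown below. They are expected: only the BERT encoder weights come from chinese-bert-wwm-ext, while the task-specific heads named in the log (entity_classifier, rel_classifier, size_embeddings) are freshly initialized and learned during training. Schematically, following the SpERT paper rather than this repository's exact code (class and argument names here are hypothetical), the entity head scores each candidate span from a max-pooled span embedding, a span-width ("size") embedding, and the [CLS] context:

```python
import torch
import torch.nn as nn

class EntityHeadSketch(nn.Module):
    """Paper-level sketch of SpERT's entity classifier (hypothetical names).

    Sizes follow the config above: size_embedding=25, max_span_size=20,
    25 entity types (including 'No Entity'); hidden=768 for BERT-base.
    """

    def __init__(self, hidden=768, size_dim=25, num_entity_types=25, max_span_size=20):
        super().__init__()
        self.size_embeddings = nn.Embedding(max_span_size + 1, size_dim)
        self.entity_classifier = nn.Linear(2 * hidden + size_dim, num_entity_types)

    def forward(self, token_embs, cls_emb, start, end):
        # token_embs: (seq_len, hidden) BERT outputs; the span is end-exclusive
        span = token_embs[start:end].max(dim=0).values           # max-pool over span tokens
        size = self.size_embeddings(torch.tensor(end - start))   # span-width embedding
        return self.entity_classifier(torch.cat([span, size, cls_emb]))
```

The relation head combines the two span representations with a max-pooled context between the entities in the same spirit; see the paper for the exact formulation.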
```python
Some weights of the model checkpoint at model_hub/chinese-bert-wwm-ext were not used when initializing SpERT: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing SpERT from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing SpERT from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of SpERT were not initialized from the model checkpoint at model_hub/chinese-bert-wwm-ext and are newly initialized: ['rel_classifier.weight', 'rel_classifier.bias', 'entity_classifier.weight', 'entity_classifier.bias', 'size_embeddings.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2022-11-17 06:50:07,261 [MainThread ] [INFO ] Train epoch: 0
Train epoch 0: 100% 5000/5000 [09:01<00:00, 9.24it/s]
2022-11-17 06:59:08,476 [MainThread ] [INFO ] Evaluate: valid
Evaluate epoch 1: 0% 0/10000 [00:00
```

## References

> [lavis-nlp/spert: PyTorch code for SpERT: Span-based Entity and Relation Transformer (github.com)](https://github.com/lavis-nlp/spert)
>
> [SpERT: "Span-based Entity and Relation Transformer"](https://arxiv.org/abs/1909.07755)