# ShortMedKG

**Repository Path**: wingter/ShortMedKG

## Basic Information

- **Project Name**: ShortMedKG
- **Description**: A text-to-graph pipeline for automatically constructing Knowledge Graphs (KGs) from Chinese medical corpora, along with an out-of-the-box KG built for everyday medical QA.
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-29
- **Last Updated**: 2026-04-30

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Chinese Medical Knowledge Graph Construction and Question Answering System

A system for entity recognition, relation extraction, knowledge graph construction, and retrieval-augmented question answering over Chinese medical text. The project integrates Chinese medical datasets such as CMeEE-V2, CMeIE-V2, and CMID, and builds a complete pipeline from medical text to structured triples, a Neo4j knowledge graph, and a Web-based QA interface.

![python](https://img.shields.io/badge/python-3.8%2B-blue)![pytorch](https://img.shields.io/badge/PyTorch-BERT-red)![neo4j](https://img.shields.io/badge/Neo4j-KG-green)

## Overview

This project focuses on automatic construction of a Chinese medical knowledge graph and its downstream QA applications. It mainly addresses entity recognition, relation extraction, graph import, and graph-based question answering from medical text. The system first uses BERT to identify medical entities such as diseases, symptoms, drugs, examinations, body parts, and populations, then uses a relation extraction model to predict medical relations between entity pairs, and finally imports the extracted triples into Neo4j. It also supports knowledge graph question answering through both command-line and local Web interfaces.

The project provides three core assets:

1. **A CMeEE-V2-based dataset with additional `pop` (population) labels**: the population labels were added on top of the original annotations and annotated manually in Label Studio.

2. **A Chinese medical knowledge extraction pipeline**: covering NER, RE, rule matching, triple extraction, CSV export, and Neo4j import.

3. **A KG+LLM question answering framework**: generating Cypher queries through entity recognition, relation classification, and rule matching, and optionally calling a large language model to generate natural language answers.
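
The query-generation step of the third asset can be illustrated with a minimal sketch: a keyword rule picks a relation, and the recognized entity plus that relation are turned into a parameterized Cypher query. The keyword table, the `Entity` label, the `RELATED` relationship type, and the `name` / `relation` properties are assumptions made for illustration; the project's actual rules and graph schema may differ.

```python
from typing import Dict, Optional, Tuple

# Hypothetical keyword -> relation-name rules; the real rule set may differ.
RELATION_RULES: Dict[str, str] = {
    "症状": "临床表现",      # "symptom" -> clinical manifestation
    "吃什么药": "药物治疗",  # "which drug" -> drug treatment
    "并发症": "并发症",      # complication
}


def match_relation(question: str) -> Optional[str]:
    """Return the relation of the first rule keyword found in the question."""
    for keyword, relation in RELATION_RULES.items():
        if keyword in question:
            return relation
    return None


def build_cypher(entity: str, relation: str) -> Tuple[str, Dict[str, str]]:
    """Build a query for the tail entities linked to `entity` by `relation`."""
    # Both values are passed as parameters, so no user text is
    # interpolated into the query string itself.
    query = (
        "MATCH (h:Entity {name: $name})"
        "-[r:RELATED {relation: $relation}]->(t:Entity) "
        "RETURN t.name AS answer LIMIT 20"
    )
    return query, {"name": entity, "relation": relation}


question = "糖尿病有哪些症状"
relation = match_relation(question)
if relation is not None:
    # In the real pipeline the entity would come from the NER model.
    query, params = build_cypher("糖尿病", relation)
    print(query)
    print(params)
```

The returned rows could then be shown directly or, optionally, passed to a large language model as context for the final answer.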

## Key Features

- **Medical entity recognition**: a BERT-based multi-head NER model that recognizes both core medical entities and population-related entities.
- **Medical relation extraction**: using `[E1] [/E1] [E2] [/E2]` to mark entity positions, and predicting 44 medical relation types plus `no_relation` with BERT.
- **Knowledge graph construction**: exporting predicted triples into Neo4j-importable `nodes.csv` and `edges.csv`.
- **Knowledge graph question answering**: supporting command-line QA and local Web QA, with optional LLM-based final answer generation after querying Neo4j.
- **Evaluation scripts**: standalone evaluation scripts for NER, RE, and QA that make the experiments easier to reproduce.
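
The entity-marker input format used for relation extraction can be sketched as follows. This is a minimal illustration, assuming character-offset spans with an exclusive end and markers inserted without extra spaces; the function name `mark_entities` is hypothetical and the project's preprocessing may differ in detail.

```python
from typing import Tuple

Span = Tuple[int, int]  # [start, end) character offsets


def mark_entities(text: str, head: Span, tail: Span) -> str:
    """Wrap the head span in [E1]...[/E1] and the tail span in [E2]...[/E2]."""
    # Order the two spans by position so the text can be rebuilt left to right.
    (s1, e1), tag1 = head, "E1"
    (s2, e2), tag2 = tail, "E2"
    if (s2, e2) < (s1, e1):
        (s1, e1), tag1, (s2, e2), tag2 = (s2, e2), tag2, (s1, e1), tag1
    if s2 < e1:
        raise ValueError("overlapping entity spans are not handled in this sketch")
    return (
        text[:s1]
        + f"[{tag1}]" + text[s1:e1] + f"[/{tag1}]"
        + text[e1:s2]
        + f"[{tag2}]" + text[s2:e2] + f"[/{tag2}]"
        + text[e2:]
    )


sentence = "糖尿病可引起视网膜病变"
print(mark_entities(sentence, head=(0, 3), tail=(6, 11)))
```

The marked sentence is what a BERT classifier would receive when predicting one of the relation types or `no_relation` for the pair.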

## System Architecture

The system adopts a pipeline design of "data processing - model training - knowledge extraction - graph construction - question answering service".

| Component    | Responsibility                                               |
| ------------ | ------------------------------------------------------------ |
| NER Model    | Recognizes medical entities, including diseases, symptoms, drugs, body parts, examinations, and populations. |
| RE Model     | Performs relation classification on candidate entity pairs and outputs medical SPO triples. |
| Intent Model | Recognizes user query intent and serves only as a supplement to rule-based relation matching. |
| KG Builder   | Deduplicates and filters triples, then exports them as Neo4j nodes and edges. |
| QA Engine    | Extracts entities and relations from questions, generates Cypher queries, and returns graph results. |
| Web Chat     | Provides a local QA interface based on Python HTTPServer.    |

### Entity Types

| Type  | Description                                            |
| ----- | ------------------------------------------------------ |
| `dis` | Disease                                                |
| `sym` | Symptom / Clinical manifestation                       |
| `dru` | Drug                                                   |
| `pro` | Medical procedure / treatment or examination operation |
| `bod` | Body part                                              |
| `mic` | Microorganism                                          |
| `ite` | Examination item                                       |
| `equ` | Medical equipment                                      |
| `dep` | Department                                             |
| `pop` | Population                                             |

The refined NER dataset currently contains the following entity counts: `bod` 31,467, `dis` 25,699, `sym` 22,415, `pro` 13,007, `dru` 5,945, `ite` 5,749, `mic` 2,964, `pop` 2,938, `equ` 1,053, and `dep` 431.

### Relation Types

The relation extraction task currently covers 44 medical relation types, such as clinical manifestation, cause, prevention, drug treatment, surgical treatment, auxiliary treatment, radiotherapy, chemotherapy, laboratory examination, imaging examination, department, complications, lesion site, transmission route, high-risk factors, differential diagnosis, and prognosis.

## Measurement Environment

| Item                | Version / Configuration                        |
| ------------------- | ---------------------------------------------- |
| OS                  | Windows 11                                     |
| GPU                 | NVIDIA GeForce RTX 3060 Laptop GPU             |
| NVIDIA Driver       | 581.83                                         |
| CUDA                | 11.3 (PyTorch CUDA), driver supports CUDA 13.0 |
| Neo4j Python Driver | 5.28.3                                         |

## Installation

We recommend creating an isolated environment with Conda or venv.

```bash
conda create -n medical-kg python=3.8
conda activate medical-kg
pip install torch transformers scikit-learn numpy pandas tqdm matplotlib datasets neo4j openai
```

## Quick Start

### 1. Train NER

```bash
python ner_train.py
```

Evaluate NER:

```bash
python ner_eval.py --dev_file data/CMeEE-V2/CMeEE-V2_dev_pop_refine.json --model_dir ner_model
```
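
For readers unfamiliar with NER scoring, the sketch below shows strict entity-level micro precision, recall, and F1, where a prediction counts only if start, end, and type all match. Whether `ner_eval.py` reports exactly this metric is an assumption; the function name and the `(start, end, type)` tuple layout are illustrative.

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, type)


def micro_prf(
    gold: List[Set[Entity]], pred: List[Set[Entity]]
) -> Tuple[float, float, float]:
    """Strict-match micro precision, recall, and F1 over all sentences."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # exact span+type matches
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [{(0, 3, "dis"), (6, 11, "sym")}]
pred = [{(0, 3, "dis"), (6, 10, "sym")}]  # second span has a wrong boundary
print(micro_prf(gold, pred))
```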

### 2. Train Relation Extraction

```bash
python re_train.py
```

Evaluate RE:

```bash
python re_eval.py --model_path best_re_model.pt --eval_file data/CMeIE-V2/CMeIE-V2_dev.jsonl
```

### 3. Train Intent Classifier

```bash
python intent_train.py \
  --train data/CMID-master/CMID.json \
  --model_name_or_path bert-base-chinese \
  --mapping cmid_intent_mapping.json \
  --out cmid_intent_model
```

### 4. Extract Triples

```bash
python extract_triples.py \
  --input input_v4.jsonl \
  --output predict_to_kg_v3.jsonl \
  --ner_model ner_model \
  --re_model best_re_model.pt \
  --re_model_dir re_model \
  --bert_path bert-base-chinese
```
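
Between NER and RE, the recognized entities of a sentence have to be combined into candidate pairs for classification. The sketch below shows one plausible way to enumerate them; the dictionary keys, the function name `candidate_pairs`, and the optional restriction on head types are assumptions, not the behavior of `extract_triples.py`.

```python
from itertools import permutations
from typing import Dict, Iterable, List, Optional, Tuple


def candidate_pairs(
    entities: List[Dict],
    head_types: Optional[Iterable[str]] = None,
) -> List[Tuple[Dict, Dict]]:
    """Enumerate ordered (head, tail) pairs of distinct entities."""
    allowed = set(head_types) if head_types is not None else None
    pairs = []
    for head, tail in permutations(entities, 2):
        # Optionally keep only pairs whose head has an allowed entity type.
        if allowed is not None and head["type"] not in allowed:
            continue
        pairs.append((head, tail))
    return pairs


entities = [
    {"text": "糖尿病", "type": "dis"},
    {"text": "多饮", "type": "sym"},
    {"text": "胰岛素", "type": "dru"},
]
for head, tail in candidate_pairs(entities, head_types={"dis"}):
    print(head["text"], "->", tail["text"])
```

Each pair would then be marked in the sentence and scored by the RE model; pairs predicted as `no_relation` produce no triple.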

### 5. Export Neo4j CSV

```bash
python export_neo4j_csv.py \
  --input predict_to_kg_v3.jsonl \
  --nodes-output neo4j_import/nodes.csv \
  --edges-output neo4j_import/edges.csv \
  --score-threshold 0.8
```
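
The filtering and deduplication behind this export step can be sketched as below. The triple field names, the CSV column names (`id`, `name`, `type`, `start_id`, `end_id`, `relation`, `score`), and the rule that scores below the threshold are dropped are assumptions for illustration; the files written by `export_neo4j_csv.py` may use a different layout.

```python
import csv
import io
from typing import Dict, Iterable, TextIO, Tuple


def export_csv(
    triples: Iterable[Dict],
    nodes_out: TextIO,
    edges_out: TextIO,
    score_threshold: float = 0.8,
) -> None:
    """Filter, deduplicate, and write triples as node and edge CSV rows."""
    nodes: Dict[Tuple[str, str], str] = {}          # (name, type) -> node id
    edges: Dict[Tuple[str, str, str], float] = {}   # (head, rel, tail) -> score

    def node_id(name: str, etype: str) -> str:
        return nodes.setdefault((name, etype), f"n{len(nodes)}")

    for t in triples:
        if t["relation"] == "no_relation" or t["score"] < score_threshold:
            continue
        key = (
            node_id(t["head"], t["head_type"]),
            t["relation"],
            node_id(t["tail"], t["tail_type"]),
        )
        # Keep the highest score when the same triple appears more than once.
        edges[key] = max(edges.get(key, 0.0), t["score"])

    node_writer = csv.writer(nodes_out)
    node_writer.writerow(["id", "name", "type"])
    for (name, etype), nid in nodes.items():
        node_writer.writerow([nid, name, etype])

    edge_writer = csv.writer(edges_out)
    edge_writer.writerow(["start_id", "end_id", "relation", "score"])
    for (head, rel, tail), score in edges.items():
        edge_writer.writerow([head, tail, rel, score])


triples = [
    {"head": "糖尿病", "head_type": "dis", "relation": "临床表现",
     "tail": "多饮", "tail_type": "sym", "score": 0.95},
    {"head": "糖尿病", "head_type": "dis", "relation": "临床表现",
     "tail": "多饮", "tail_type": "sym", "score": 0.85},
    {"head": "糖尿病", "head_type": "dis", "relation": "并发症",
     "tail": "感冒", "tail_type": "dis", "score": 0.50},
]
nodes_buf, edges_buf = io.StringIO(), io.StringIO()
export_csv(triples, nodes_buf, edges_buf)
print(nodes_buf.getvalue())
print(edges_buf.getvalue())
```

When writing to real files, open them with `newline=""` and `encoding="utf-8"` so the `csv` module controls line endings and Chinese text is preserved.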

After placing `nodes.csv` and `edges.csv` into the Neo4j import directory, you can import them using Neo4j Admin Import or Cypher `LOAD CSV`. The exact import method depends on your Neo4j version and deployment setup.
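
As one possible `LOAD CSV` route, the sketch below runs two import statements through the Neo4j Python driver. It assumes the CSV columns `id`, `name`, `type` and `start_id`, `end_id`, `relation`, `score`, a single `Entity` label, and a generic `RELATED` relationship type that stores the relation name as a property (plain `LOAD CSV` cannot set the relationship type from a column value). None of these choices are confirmed by the project, and the connection details are placeholders.

```python
NODES_QUERY = """
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
MERGE (n:Entity {id: row.id})
SET n.name = row.name, n.type = row.type
"""

EDGES_QUERY = """
LOAD CSV WITH HEADERS FROM 'file:///edges.csv' AS row
MATCH (a:Entity {id: row.start_id})
MATCH (b:Entity {id: row.end_id})
MERGE (a)-[r:RELATED {relation: row.relation}]->(b)
SET r.score = toFloat(row.score)
"""


def import_csv(uri: str, user: str, password: str) -> None:
    """Load nodes first, then edges, from the Neo4j import directory."""
    # Imported here so the queries can be inspected without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        driver.execute_query(NODES_QUERY)
        driver.execute_query(EDGES_QUERY)
```

A call such as `import_csv("bolt://localhost:7687", "neo4j", "<password>")` would run the import against a local instance. For large graphs, Neo4j Admin Import is usually the faster option.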

### 6. Command-line QA

```bash
python qa_query.py --question "your medical question"
```

### 7. Web Chat

```bash
python web_chat.py --host 127.0.0.1 --port 8000
```
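
The Web Chat component is described as being built on Python's `HTTPServer`. A minimal sketch of such a service is shown below, assuming a JSON `POST` body with a `question` field and a JSON response with an `answer` field. The handler, the request format, and the `answer_question` placeholder are hypothetical and do not reproduce `web_chat.py`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def answer_question(question: str) -> str:
    """Placeholder for the QA engine (hypothetical); it only echoes the input."""
    return f"received: {question}"


class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        # Read the JSON request body, e.g. {"question": "..."}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(
            {"answer": answer_question(payload.get("question", ""))},
            ensure_ascii=False,
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        # Silence the default per-request logging in this sketch.
        pass


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Start the server and block until interrupted."""
    HTTPServer((host, port), ChatHandler).serve_forever()
```

Calling `serve()` starts the sketch on `127.0.0.1:8000`, matching the host and port used in the command above.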

Evaluate QA:

```bash
python qa_eval.py \
  --input qa/project-23.json \
  --output qa_eval_results.jsonl \
  --summary_path qa_eval_summary.json
```

## Benchmark Results

### NER

<img src="./README.assets/image-20260423173301863.png" style="zoom:67%;" />

### RE

<img src="./README.assets/image-20260423171730987.png" style="zoom:67%;" />

## Data and Model Availability

This project uses or references the following public Chinese medical NLP resources:

- CMeEE-V2: Chinese medical named entity recognition dataset
- CMeIE-V2: Chinese medical relation extraction dataset
- CMID: Chinese Medical Intent Dataset

The NER, relation extraction, and intent classification modules in this project all use `bert-base-chinese` as the backbone encoder.

Please download the pretrained model from Hugging Face and place it in a `bert-base-chinese/` folder under the project root directory:

- https://huggingface.co/google-bert/bert-base-chinese

## Contribution

Issues and pull requests are welcome to improve data processing, model training, graph construction, and QA performance. Before submitting code, please ensure that:

1. It does not contain sensitive keys or private personal data.
2. It can run correctly with the corresponding training, evaluation, or inference commands.
3. Any newly added data complies with the original dataset licenses.

## References

The relation extraction module in this project was inspired by the data processing workflow and modeling ideas of the CMeIE-V2-NER project:

- CMeIE-V2-NER: https://github.com/wubingheng111/CMeIE-V2-NER

## Acknowledgements

**List of contributors:**

- Wentai Wu, JNU

- HaoMin Ye, JNU