# bert_offline

**Repository Path**: liusure/bert_offline

## Basic Information

- **Project Name**: bert_offline
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-03-15
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

### BERT offline

BERT offline is a simple but efficient way to use BERT's output as word embeddings for downstream tasks such as text classification and sequence labeling. BERT's last output layer is dumped as a NumPy array and is not trained during downstream training. This outperforms typical word-vector models while significantly reducing GPU memory use and inference time: a bidirectional-LSTM binary text classifier built on these embeddings reaches about 93% validation accuracy on SST-2 and runs at about 300 examples/s during inference on a laptop `1050ti(4g)` GPU.

`bert.npz`: BERT's output, used as the `embedding` layer of the downstream task (a sketch of loading it as a frozen `tf.keras` embedding layer appears at the end of this README)

`vocab.txt`: BERT's vocabulary

`text_classify.py`: a text-classification example based on `tensorflow2.0`; run it with:

```Bash
python text_classify.py --data_dir=./ --output_dir=./model/ --vocab_file=./vocab.txt --train_batch_size=32 --num_train_epochs=10 --max_seq_length=256
```

`train.tsv, dev_matched.tsv`: training and validation data, one example per line in the format:

```plaintext
I am groot	negative
```

with `\t` as the delimiter.

ps: token IDs are looked up with `list.index()`, which is slow; if speed matters, switch to a `dict` or adapt BERT's own `tokenizer` for fast vocabulary lookup (a minimal `dict`-based sketch is included at the end of this README).

[Download `bert_embeddings.npz`](https://pan.baidu.com/s/1WuR4Rv6HnXn3K4cGZ8GT0Q), extraction code: `9ya8`
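For illustration, here is a minimal sketch of wiring the dumped embeddings into a frozen `tf.keras` embedding layer in front of a bidirectional LSTM, as the README describes. The array key `"embeddings"` inside `bert.npz`, the LSTM width, and the sigmoid head are assumptions made for this sketch; consult `text_classify.py` for the actual model.

```python
import numpy as np
import tensorflow as tf

# Load the dumped BERT output; the key "embeddings" is an assumption --
# inspect the archive (np.load("bert.npz").files) for the real key.
embedding_matrix = np.load("bert.npz")["embeddings"]
vocab_size, hidden_size = embedding_matrix.shape

model = tf.keras.Sequential([
    # Frozen embedding layer: BERT's dumped output is used as-is and is
    # excluded from gradient updates (trainable=False).
    tf.keras.layers.Embedding(
        vocab_size,
        hidden_size,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False,
        mask_zero=True,
    ),
    # Bidirectional LSTM encoder over the frozen embeddings (width assumed).
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    # Binary classification head (positive vs. negative).
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Because `trainable=False` keeps the large embedding matrix out of gradient updates, only the LSTM and dense weights carry optimizer state, which is where the memory and speed savings described above come from.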
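And here is the kind of `dict`-based lookup the ps above suggests, as a hedged sketch: `list.index()` scans the whole vocabulary (O(V) per token), while a dict lookup is O(1) on average. The `[UNK]` fallback is an assumed convention (standard BERT vocabularies contain an `[UNK]` entry), not something this repository confirms.

```python
# Build the id lookup table once, instead of calling vocab.index(token)
# for every token during tokenization.
with open("vocab.txt", encoding="utf-8") as f:
    token_to_id = {token.rstrip("\n"): i for i, token in enumerate(f)}

def tokens_to_ids(tokens):
    # Fall back to [UNK] for out-of-vocabulary tokens (assumed convention).
    unk_id = token_to_id.get("[UNK]", 0)
    return [token_to_id.get(t, unk_id) for t in tokens]
```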