This is a RoBERTa-wwm base distilled model, distilled from RoBERTa-wwm-base by RoBERTa-wwm-large.
This is a Chinese RoBERTa-wwm distilled model, which was distilled from roberta-ext-wwm-large. The large model is from this GitHub repository; thanks to its author for the contribution.
This model was trained following this paper, which was published by Hugging Face.
To train this model, I used baike_qa2019, news2016_zh, webtext_2019, and wiki_zh. These datasets can be found in this GitHub repository.
The models can currently be downloaded only from BaiduYun; the links are below.
Model | BaiduYun |
---|---|
Roberta-wwm-ext-base-distill, Chinese | Tensorflow |
Roberta-wwm-ext-large-3layers-distill, Chinese | Tensorflow 26hu |
Roberta-wwm-ext-large-6layers-distill, Chinese | Tensorflow seou |
I trained this model in two steps:
1. I used the roberta_ext_wwm_large model to compute the output for every token of every example.
2. I used those outputs to train the student model, which was initialized from the roberta_ext_wwm_base pretrained weights.
Each sentence was masked in 5 different fixed ways; I did not use dynamic masking.
Each example has at most 20 masked tokens.
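As a rough illustration of static masking, the sketch below produces a fixed number of masked copies of one token sequence up front. It is a simplified stand-in, not the logic of `create_pretraining_data.py`: it masks individual tokens rather than whole words, and it always substitutes `[MASK]`. The function name `static_masks` is hypothetical; the default values mirror the `dupe_factor`, `masked_lm_prob`, `max_predictions_per_seq`, and `random_seed` flags used in the commands in this README.

```python
import random

MASK = "[MASK]"

def static_masks(tokens, dupe_factor=5, masked_lm_prob=0.15,
                 max_predictions=20, seed=12345):
    """Return `dupe_factor` masked copies of `tokens`, generated once up front.

    Hypothetical helper: token-level masking only, no whole-word grouping.
    """
    rng = random.Random(seed)
    # Mask about 15% of the tokens, at least 1, never more than the cap.
    wanted = max(1, round(len(tokens) * masked_lm_prob))
    num_to_mask = min(max_predictions, len(tokens), wanted)
    copies = []
    for _ in range(dupe_factor):
        # Each copy draws its own set of positions, so the masks differ.
        positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
        masked = list(tokens)
        for pos in positions:
            masked[pos] = MASK
        copies.append((masked, positions))
    return copies

if __name__ == "__main__":
    for masked, positions in static_masks(list("abcdefghijklmnopqrst")):
        print(positions, "".join(t if t != MASK else "_" for t in masked))
```

Because the copies are written out once, the model sees the same 5 maskings of a sentence in every epoch, which is what distinguishes this from dynamic masking.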
Loss: I used two loss functions in training, cross entropy and cosine loss, and added them together. I think adding another loss function could bring a big improvement, but I did not have the resources to try it because my free Google TPU quota expired.
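The combined loss can be sketched for a single masked position as follows. This is a minimal pure-Python illustration under assumptions the text does not confirm: the cross entropy is taken against the teacher's softmax output as soft targets, the cosine loss is `1 - cosine similarity`, and both are computed on the same pair of vectors. All function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_total = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_total for x in logits]

def soft_cross_entropy(student_logits, teacher_logits):
    """Cross entropy of the student distribution against teacher soft targets."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits)]
    student_log_probs = log_softmax(student_logits)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))

def cosine_loss(student_vec, teacher_vec, eps=1e-8):
    """1 - cosine similarity; 0 when the two vectors point the same way."""
    dot = sum(s * t for s, t in zip(student_vec, teacher_vec))
    norm_s = math.sqrt(sum(s * s for s in student_vec))
    norm_t = math.sqrt(sum(t * t for t in teacher_vec))
    return 1.0 - dot / max(norm_s * norm_t, eps)

def distill_loss(student_logits, teacher_logits):
    """Sum of the two terms (assumed unweighted, as the text says 'add them together')."""
    return (soft_cross_entropy(student_logits, teacher_logits)
            + cosine_loss(student_logits, teacher_logits))

if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    print(distill_loss([1.0, 2.0, 3.0], teacher))  # student agrees with teacher
    print(distill_loss([3.0, 2.0, 1.0], teacher))  # student disagrees, larger loss
```

In a real training run the per-position values would be averaged over all masked positions in the batch.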
Other Parameters
Model | batch size | learning rate | training steps | warmup steps |
---|---|---|---|---|
Roberta-wwm-ext-base-distill, Chinese | 384 | 5e-5 | 1M | 20K |
Roberta-wwm-ext-large-3layers-distill, Chinese | 128 | 3e-5 | 3M | 2.5K |
Roberta-wwm-ext-large-6layers-distill, Chinese | 512 | 8e-5 | 1M | 5K |
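To show how the learning rate and warmup steps interact, here is a small sketch of a linear-warmup, linear-decay schedule. It assumes the training script follows the schedule of the original BERT training code; this README does not state which schedule was used, so treat it as an illustration only. The function name is hypothetical and the defaults are the Roberta-wwm-ext-base-distill settings.

```python
def learning_rate(step, base_lr=5e-5, num_train_steps=1_000_000,
                  num_warmup_steps=20_000):
    """Learning rate at `step` under linear warmup then linear decay to zero.

    Assumed schedule, not confirmed by the README.
    """
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 towards base_lr.
        return base_lr * step / num_warmup_steps
    # Decay: fall linearly so that the rate reaches 0 at num_train_steps.
    remaining = max(0, num_train_steps - step)
    return base_lr * remaining / num_train_steps

if __name__ == "__main__":
    for step in (0, 10_000, 20_000, 500_000, 1_000_000):
        print(step, learning_rate(step))
```

Under this assumption the rate peaks just below the nominal value at the end of warmup, since the decay is already measured from step 0.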
Each task was run only once; the results are below.
Model | AFQMC | CMNLI | TNEWS |
---|---|---|---|
Roberta-wwm-ext-base, Chinese | 74.04% | 80.51% | 56.94% |
Roberta-wwm-ext-base-distill, Chinese | 74.44% | 81.1% | 57.6% |
Roberta-wwm-ext-large-3layers-distill, Chinese | 68.8% | 75.5% | 55.7% |
Roberta-wwm-ext-large-6layers-distill, Chinese | 72% | 79.3% | 56.7% |
Model | LCQMC dev | LCQMC test |
---|---|---|
Roberta-wwm-ext-base, Chinese | 89% | 86.5% |
Roberta-wwm-ext-base-distill, Chinese | 89% | 87.2% |
Roberta-wwm-ext-large-3layers-distill, Chinese | 85.1% | 86% |
Roberta-wwm-ext-large-6layers-distill, Chinese | 87.7% | 86.7% |
Model | CMRC2018 dev (F1/EM) |
---|---|
Roberta-wwm-ext-base, Chinese | 84.72%/65.24% |
Roberta-wwm-ext-base-distill, Chinese | 85.2%/65.20% |
Roberta-wwm-ext-large-3layers-distill, Chinese | 78.5%/57.4% |
Roberta-wwm-ext-large-6layers-distill, Chinese | 82.6%/61.7% |
You may notice that the baseline scores here differ from those reported in this GitHub repository. I don't know why. I ran each task with the original base model and got the scores above, then ran the distilled model with the same parameters and got the scores above. Perhaps my parameters differ from theirs.
Still, as you can see, under the same conditions the base distilled model scores at or above the original base model on almost every metric.
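The comparison can be checked directly from the tables. The snippet below copies the Roberta-wwm-ext-base and Roberta-wwm-ext-base-distill rows and prints the difference per metric in percentage points; the dictionary names are only for illustration.

```python
# Scores copied from the result tables above (percent).
base = {
    "AFQMC": 74.04, "CMNLI": 80.51, "TNEWS": 56.94,
    "LCQMC dev": 89.0, "LCQMC test": 86.5,
    "CMRC2018 F1": 84.72, "CMRC2018 EM": 65.24,
}
distilled = {
    "AFQMC": 74.44, "CMNLI": 81.1, "TNEWS": 57.6,
    "LCQMC dev": 89.0, "LCQMC test": 87.2,
    "CMRC2018 F1": 85.2, "CMRC2018 EM": 65.20,
}

# Difference in percentage points, rounded to hide float noise.
deltas = {task: round(distilled[task] - base[task], 2) for task in base}

if __name__ == "__main__":
    for task, delta in deltas.items():
        print(f"{task}: {delta:+.2f}")
```

The distilled model is ahead on five of the seven metrics, ties on LCQMC dev, and is 0.04 points behind on CMRC2018 EM. Since every task was run only once, differences this small may be within run-to-run variation.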
export DATA_DIR=YOUR_DATA_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR
export VOCAB_FILE=YOUR_VOCAB_FILE
python create_pretraining_data.py \
--input_dir=$DATA_DIR \
--output_dir=$OUTPUT_DIR \
--vocab_file=$VOCAB_FILE \
--do_whole_word_mask=True \
--ramdom_next=True \
--max_seq_length=512 \
--max_predictions_per_seq=20 \
--random_seed=12345 \
--dupe_factor=5 \
--masked_lm_prob=0.15 \
--doc_stride=256 \
--max_workers=2 \
--short_seq_prob=0.1
export TF_RECORDS=YOUR_PRETRAINING_TF_RECORDS
export TEACHER_MODEL=YOUR_TEACHER_MODEL_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR
python create_teacher_output_data.py \
--bert_config_file=$TEACHER_MODEL/bert_config.json \
--input_file=$TF_RECORDS \
--output_dir=$OUTPUT_DIR \
--truncation_factor=128 \
--init_checkpoint=$TEACHER_MODEL/bert_model.ckpt \
--max_seq_length=512 \
--max_predictions_per_seq=20 \
--predict_batch_size=64
export TF_RECORDS=YOUR_TEACHER_OUTPUT_TF_RECORDS
export STUDENT_MODEL_DIR=YOUR_STUDENT_MODEL_DIR
export OUTPUT_DIR=YOUR_OUTPUT_DIR
python run_distill.py \
--bert_config_file=$STUDENT_MODEL_DIR/bert_config.json \
--input_file=$TF_RECORDS \
--output_dir=$OUTPUT_DIR \
--init_checkpoint=$STUDENT_MODEL_DIR/bert_model.ckpt \
--truncation_factor=128 \
--max_seq_length=512 \
--max_predictions_per_seq=20 \
--do_train=True \
--do_eval=True \
--train_batch_size=384 \
--eval_batch_size=1024 \
--num_train_steps=1000000 \
--num_warmup_steps=20000
The purpose of publishing this model is to verify the feasibility of this distillation method.
As you can see, this distillation method can improve accuracy.
Thanks to TFRC for providing the TPU support!