# Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data
This is the code base for weakly supervised NER.

We provide a three-stage framework, with one script per stage (see the scripts below):

1. Stage I: domain continual pre-training (`roberta_mlm_pretrain.sh`)
2. Stage II: noise-aware weakly supervised self-training (`weak_weighted_selftrain.sh`)
3. Stage III: fine-tuning on the strongly labeled data (`finetune.sh`)

In this code base, we provide the basic building blocks, which allow arbitrary combinations of the different stages. We also provide example scripts for reproducing our results on BioMedical NER.

See details in the arXiv paper.
## BioMedical NER

| Method (F1) | BC5CDR-chem | BC5CDR-disease | NCBI-disease |
| --- | --- | --- | --- |
| BERT | 89.99 | 79.92 | 85.87 |
| bioBERT | 92.85 | 84.70 | 89.13 |
| PubMedBERT | 93.33 | 85.62 | 87.82 |
| Ours | 94.17 | 90.69 | 92.28 |

See more in `bio_script/README.md`.
## Requirements

```
pytorch==1.6.0
transformers==3.3.1
allennlp==1.1.0
flashtool==0.0.10
ray==0.8.7
```

Install the requirements:

```
pip install -r requirements.txt
```
(If `allennlp` and `transformers` are incompatible, install `allennlp` first and then update `transformers`. Since we only use a few small functions of `allennlp`, it should work fine.)
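A minimal sketch of that workaround, assuming the pinned versions from `requirements.txt` above:

```shell
# Fallback if `pip install -r requirements.txt` fails on the
# allennlp/transformers pin conflict: install allennlp first, then move
# transformers to the pinned version (versions from requirements.txt).
pip install allennlp==1.1.0
pip install transformers==3.3.1
```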
## Code Structure

```
├── bert-ner    # Python code for training NER models
│   └── ...
└── bio_script  # Shell scripts for training BioMedical NER models
    └── ...
```
See examples in `bio_script`.

## Training Scripts

Here we explain the hyperparameters used in the scripts in `./bio_script`.

### Scripts
- `roberta_mlm_pretrain.sh`
- `weak_weighted_selftrain.sh`
- `finetune.sh`
### Hyperparameters
- `GPUID`: the GPU(s) to train on. It can also be specified on the command line, e.g., `xxx.sh 0,1,2,3`.
- `MASTER_PORT`: automatically constructed (to avoid conflicts) for distributed training.
- `DISTRIBUTE_GPU`: whether to use distributed training.
- `PROJECT_ROOT`: automatically detected; the root path of the project folder.
- `DATA_DIR`: the directory of the training data, containing `train.txt`, `test.txt`, `dev.txt`, `labels.txt`, `weak_train.txt` (weak data), and `aug_train.txt` (optional).
- `USE_DA`: whether to augment the training data, i.e., combine `train.txt` + `aug_train.txt` in `DATA_DIR` for training.
- `BERT_MODEL`: the model backbone, e.g., `roberta-large`. See transformers for details.
- `BERT_CKP`: see `BERT_MODEL_PATH`.
- `BERT_MODEL_PATH`: the path of the model checkpoint to load as the initialization. Usually used together with `BERT_CKP`.
- `LOSSFUNC`: `nll` is the normal loss function; `corrected_nll` is the noise-aware risk (i.e., it adds a weighted log-unlikelihood regularization: `wei*nll + (1-wei)*null`).
- `MAX_WEIGHT`: the maximum weight of a sample in the loss.
- `MAX_LENGTH`: the maximum sentence length.
- `BATCH_SIZE`: the batch size per GPU.
- `NUM_EPOCHS`: the number of training epochs.
- `LR`: the learning rate.
- `WARMUP`: the number of learning-rate warmup steps.
- `SAVE_STEPS`: how often to save the model.
- `EVAL_STEPS`: how often to evaluate on the validation set.
- `SEED`: the random seed.
- `OUTPUT_DIR`: the directory for saving the model and code. Some parameters are automatically appended to the path.
  - `roberta_mlm_pretrain.sh`: it is better to manually check where you want to save the model.
  - `finetune.sh`: saved in `${BERT_MODEL_PATH}/finetune_xxxx`.
  - `weak_weighted_selftrain.sh`: saved in `${BERT_MODEL_PATH}/selftrain/${FBA_RULE}_xxxx` (see `FBA_RULE` below).

There are some additional parameters that need to be set for weakly supervised learning (`weak_weighted_selftrain.sh`):
- `WEAK_RULE`: what kind of weakly supervised data to use. See the Weakly Supervised Data Refinement Script section for details.
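As a reading aid for `LOSSFUNC=corrected_nll`, here is a minimal PyTorch sketch of the weighted combination described above. The function name and the exact form of the unlikelihood term are assumptions for illustration, not the repository's implementation (see `bert-ner` for that):

```python
import torch
import torch.nn.functional as F

def corrected_nll_loss(logits, labels, weight):
    """Noise-aware risk sketch: weight * NLL + (1 - weight) * negative
    log-unlikelihood of the (possibly noisy) label.

    logits: (batch, num_labels); labels: (batch,); weight: (batch,) in [0, 1].
    """
    log_p = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_p, labels, reduction="none")      # -log p(y)
    # unlikelihood term: -log(1 - p(y)), clamped for numerical stability
    p_y = log_p.gather(-1, labels.unsqueeze(-1)).squeeze(-1).exp()
    nul = -torch.log((1.0 - p_y).clamp_min(1e-8))
    return (weight * nll + (1.0 - weight) * nul).mean()
```

With `weight = 1` everywhere this reduces to the standard `nll` objective; smaller weights shift a sample's contribution toward the unlikelihood regularizer.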
## Profiling Script

### Scripts

- `profile.sh`
The profiling script uses the same entry point as the training scripts (`bert-ner/run_ner.py`) but only runs evaluation.

### Hyperparameters

Basically the same as in the training scripts.
- `PROFILE_FILE`: can be `train`, `dev`, `test`, or a specific path to a `txt` data file. E.g., to use the weak data:

  ```
  PROFILE_FILE=weak_train_100.txt
  PROFILE_FILE=$DATA_DIR/$PROFILE_FILE
  ```

- `OUTPUT_DIR`: the output will be saved in `OUTPUT_DIR=${BERT_MODEL_PATH}/predict/profile`.
## Weakly Supervised Data Refinement Script

### Scripts

- `profile2refinedweakdata.sh`

### Hyperparameters
- `BERT_CKP`: see `BERT_MODEL_PATH`.
- `BERT_MODEL_PATH`: the path of the model checkpoint to load as the initialization. Usually used together with `BERT_CKP`.
- `WEI_RULE`: the rule for generating the weight of each weak sample.
  - `uni`: all weights are 1
  - `avgaccu`: confidence estimate for new labels generated by `all_overwrite`
  - `avgaccu_weak_non_O_promote`: confidence estimate for new labels generated by `non_O_overwrite`
- `PRED_RULE`: the rule for generating new weak labels.
  - `non_O_overwrite`: non-entity ('O') labels are overwritten by the prediction
  - `all_overwrite`: all tokens use the prediction, i.e., self-training
  - `no`: use the original weak labels
  - `non_O_overwrite_all_overwrite_over_accu_xx`: `non_O_overwrite` + if the confidence is higher than `xx`, all tokens use the prediction as the new label

The generated data will be saved in `${BERT_MODEL_PATH}/predict/weak_${PRED_RULE}-WEI_${WEI_RULE}`. The `WEAK_RULE` specified in `weak_weighted_selftrain.sh` is essentially the name of the folder `weak_${PRED_RULE}-WEI_${WEI_RULE}`.
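To make the `PRED_RULE` options concrete, here is a hedged Python sketch of the refinement logic for one sentence. The function and argument names are hypothetical; the actual implementation lives in `bert-ner/profile2refinedweakdata.py`:

```python
def refine_weak_labels(weak_labels, predictions, pred_rule="non_O_overwrite",
                       confidence=None, threshold=0.9):
    """Sketch of the PRED_RULE options for one sentence.

    weak_labels / predictions: per-token tag lists (e.g., BIO tags).
    confidence / threshold: only used by the *_over_accu_xx rule, where
    `xx` plays the role of `threshold` (assumed reading).
    """
    if pred_rule == "no":
        # keep the original weak labels
        return list(weak_labels)
    if pred_rule == "all_overwrite":
        # all tokens use the model prediction, i.e., plain self-training
        return list(predictions)
    if pred_rule == "non_O_overwrite":
        # only non-entity ('O') weak labels are overwritten by predictions
        return [p if w == "O" else w for w, p in zip(weak_labels, predictions)]
    if pred_rule == "non_O_overwrite_all_overwrite_over_accu":
        # non_O_overwrite, but fall back to all-prediction labels when the
        # model's confidence exceeds the threshold
        if confidence is not None and confidence > threshold:
            return list(predictions)
        return [p if w == "O" else w for w, p in zip(weak_labels, predictions)]
    raise ValueError(f"unknown PRED_RULE: {pred_rule}")
```

For example, with weak labels `["O", "B-Chemical", "O"]` and predictions `["B-Disease", "O", "O"]`, `non_O_overwrite` keeps the weak entity and fills in the rest, yielding `["B-Disease", "B-Chemical", "O"]`.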
To run the whole refinement and fine-tuning process:

1. Set `BERT_CKP` appropriately.
2. Run `./bio_script/profile.sh` for the dev set and the weak set.
3. Run `./bio_script/profile2refinedweakdata.sh`. You can use different rules to generate the weight for each sample (`WEI_RULE`) and different rules to refine the weak labels (`PRED_RULE`). See more details in `./bert-ner/profile2refinedweakdata.py`.
4. Run `./bio_script/weak_weighted_selftrain.sh`.
5. Set `BERT_CKP` appropriately.
6. Run `./bio_script/finetune.sh`.
## Citation

```
@inproceedings{Jiang2021NamedER,
  title={Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data},
  author={Haoming Jiang and Danqing Zhang and Tianyue Cao and Bing Yin and T. Zhao},
  booktitle={ACL/IJCNLP},
  year={2021}
}
```
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.