[ACM-WSDM] 3rd place solution at WSDM Cup 2019, Fake News Classification on Kaggle.
This is the 3rd place solution to the ACM International Conference on Web Search and Data Mining (WSDM) Cup 2019, a challenge on fake news detection and sentence-pair modeling.
Clone this project.
Download the dataset from the corresponding competition on Kaggle and extract it under the directory zake7749/data/dataset:
|-- dataset
|-- sample_submission.csv
|-- test.csv
`-- train.csv
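Before running the notebooks, it can help to confirm that the three files landed where the tree above shows. The following is a minimal sketch and not part of the repository: the helper name `missing_files` is hypothetical, and the usage lines assume the script is run from the directory that contains zake7749/.

```python
from pathlib import Path

# File names taken from the dataset tree above.
EXPECTED_FILES = ("sample_submission.csv", "test.csv", "train.csv")


def missing_files(dataset_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumption: the current working directory is the project root.
    missing = missing_files("zake7749/data/dataset")
    print("missing files:", missing if missing else "none")
```

An empty list means the layout matches the tree.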
We use two open-source pretrained word embeddings in this competition: Tencent_AILab_ChineseEmbedding.txt and sgns.merge.bigram.
Put both embeddings under the folder zake7749/data/wordvec/
|-- wordvec
|-- Tencent_AILab_ChineseEmbedding.txt
`-- sgns.merge.bigram
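Both embedding files are large, so reading only the first line is a cheap way to check that a download is intact. The sketch below assumes each file uses the word2vec text format, where the first line holds the vocabulary size and the vector dimension. That format is an assumption here, not something the steps above state, and the function name `read_embedding_header` is hypothetical.

```python
def read_embedding_header(path, encoding="utf-8"):
    """Return (vocab_size, dim) from the first line of a word2vec-style text file."""
    with open(path, encoding=encoding, errors="ignore") as f:
        fields = f.readline().split()
    try:
        # A valid header is exactly two integers: "<vocab_size> <dim>".
        vocab_size, dim = (int(field) for field in fields)
    except ValueError:
        raise ValueError(
            f"{path}: first line is not a 'vocab_size dim' header"
        ) from None
    return vocab_size, dim
```

If the call raises `ValueError`, the file either lacks a header line or is not in this format.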
The notebooks are under the folder zake7749/code
Stage 1.1. Preprocessing-on-word-level.ipynb
Stage 1.2. Preprocessing-on-char-level.ipynb
These notebooks generate 8 cleaned datasets under zake7749/data/processed_dataset.
.
|-- engineered_chars_test.csv
|-- engineered_chars_train.csv
|-- engineered_words_test.csv
|-- engineered_words_train.csv
|-- processed_chars_test.csv
|-- processed_chars_train.csv
|-- processed_words_test.csv
`-- processed_words_train.csv
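The 8 file names above follow a regular pattern: {engineered, processed} x {chars, words} x {test, train}. The sketch below builds that list and reports which files are still missing after the two preprocessing notebooks have run. It is illustrative only, and both function names are hypothetical.

```python
from itertools import product
from pathlib import Path


def expected_processed_files():
    """Build the 8 names: {engineered, processed} x {chars, words} x {test, train}."""
    return [
        f"{kind}_{level}_{split}.csv"
        for kind, level, split in product(
            ("engineered", "processed"), ("chars", "words"), ("test", "train")
        )
    ]


def missing_processed(processed_dir):
    """Return the expected processed files that are absent from processed_dir."""
    root = Path(processed_dir)
    return [name for name in expected_processed_files() if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumption: the current working directory is the project root.
    print(missing_processed("zake7749/data/processed_dataset"))
```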
Execute Stage 1.3. Train-char-embeddings, which outputs 3 char embeddings under zake7749/data/wordvec/
|-- wordvec
|-- Tencent_AILab_ChineseEmbedding.txt
|-- fasttext-50-win3.vec
|-- sgns.merge.bigram
|-- zh-wordvec-50-cbow-windowsize50.vec
`-- zh-wordvec-50-skipgram-windowsize7.vec
Stage 2. First-Level-with-char-level.ipynb
Stage 2. First-Level-with-word-level.ipynb
Stage 3.1. First-level-ensemble-ridge-regression
Stage 3.2. First-level-ensemble-with-LGBM-each-side
Stage 3.3. First-level-ensemble-with-LGBM
Stage 3.4. First-level-ensemble-with-NN
Stage 3.5. Second-level-ensemble
hanshan/bert/train_wsdm.sh
zake7749/bert/data/probs_to_preds.py
Stage 3.6. Bagging-with-BERT
Note: Please change the path of sec_stacking_df to the corresponding file.
Stage 4.1. Fine-tune-word-level-models.ipynb
Stage 4.2. Fine-tune-char-level-models.ipynb
hanshan/prep_pseudo_labels.py
hanshan/bert/train_wsdm_pl.sh
Stage 5.1. First-level-fine-tuned-ensemble-ridge-regression.ipynb
Stage 5.2. First-level-fine-tuned-ensemble-withNN.ipynb
Stage 5.3. First-level-fine-tuned-ensemble-with-LGBM.ipynb
Stage 5.4. Second-level-fine-tuned-ensemble.ipynb
Stage 9. High-Ground.ipynb
Stage 42. Final Answer.ipynb
The final prediction final_answer.csv is generated under the folder zake7749/data/high_ground/
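The notebook names carry their run order in the stage number, but a plain alphabetical sort would place Stage 42 before Stage 5.1. The sketch below parses the stage number into a sortable key and builds a `jupyter nbconvert` command for executing one notebook in place. It is an assumption about how one might script the run, not the authors' workflow; the names `stage_key` and `nbconvert_command` are hypothetical. The shell and Python scripts listed between the notebooks (for example hanshan/bert/train_wsdm.sh) are not covered and still have to be run at their listed positions.

```python
import re

# Matches "Stage 3.1." as well as "Stage 9." at the start of a name.
STAGE_PATTERN = re.compile(r"Stage (\d+)(?:\.(\d+))?\.")


def stage_key(name):
    """Map 'Stage 3.1. Foo' to (3, 1) and 'Stage 9. Bar' to (9, 0)."""
    match = STAGE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a stage name: {name!r}")
    major, minor = match.groups()
    return int(major), int(minor or 0)


def nbconvert_command(notebook_path):
    """Build a command that executes a notebook and overwrites it with the result."""
    return [
        "jupyter", "nbconvert", "--to", "notebook",
        "--execute", "--inplace", notebook_path,
    ]


if __name__ == "__main__":
    names = [
        "Stage 42. Final Answer.ipynb",
        "Stage 5.1. First-level-fine-tuned-ensemble-ridge-regression.ipynb",
        "Stage 9. High-Ground.ipynb",
    ]
    for name in sorted(names, key=stage_key):
        # Print only; pass the list to subprocess.run(..., check=True) to execute.
        print(" ".join(nbconvert_command(name)))
```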