
Commit 7f931fc

Update README.md
1 parent 6aa4ba5 commit 7f931fc

1 file changed: +14, -11 lines changed

minilm/README.md

@@ -9,7 +9,7 @@
 We release the **uncased** **12**-layer and **6**-layer MiniLM models with **384** hidden size distilled from an in-house pre-trained [UniLM v2](/unilm) model in BERT-Base size. We also release **uncased** **6**-layer MiniLM model with **768** hidden size distilled from [BERT-Base](https://github.com/google-research/bert). The models use the same WordPiece vocabulary as BERT.
 
 The links to the pre-trained models:
-- MiniLMv1-L12-H384-uncased: 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base
+- [MiniLMv1-L12-H384-uncased](https://1drv.ms/u/s!AjHn0yEmKG8qixAYyu2Fvq5ulnU7?e=DFApTA): 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base
 - MiniLMv1-L6-H384-uncased: 6-layer, 384-hidden, 12-heads, 22M parameters, 5.3x faster than BERT-Base
 - MiniLMv1-L6-H768-uncased: 6-layer, 768-hidden, 12-heads, 66M parameters, 2.0x faster than BERT-Base
 
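Since the released checkpoints are BERT-shaped (BERT WordPiece vocabulary, a BERT-style config JSON, and a plain PyTorch state dict, as the file names in the commands below suggest), they can usually be loaded with the stock BERT classes from Hugging Face `transformers`. A minimal sketch under those assumptions, for a recent `transformers` version; the local path is a placeholder and `strict=False` hedges against extra task-head keys in the checkpoint:

```python
# Hypothetical loading sketch for the released MiniLM checkpoint.
# Assumption: the downloaded archive contains minilm-l12-h384-uncased.bin,
# minilm-l12-h384-uncased-config.json, and vocab.txt (names taken from the diff).
import torch
from transformers import BertConfig, BertModel, BertTokenizer

model_path = "/path/to/minilm-l12-h384-uncased"  # placeholder

config = BertConfig.from_json_file(f"{model_path}/minilm-l12-h384-uncased-config.json")
tokenizer = BertTokenizer(vocab_file=f"{model_path}/vocab.txt", do_lower_case=True)

model = BertModel(config)
state_dict = torch.load(f"{model_path}/minilm-l12-h384-uncased.bin", map_location="cpu")
model.load_state_dict(state_dict, strict=False)  # tolerate extra head weights, if any
model.eval()

inputs = tokenizer("MiniLM is a distilled Transformer encoder.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
print(hidden.shape)  # expected: (1, sequence_length, 384) for the L12xH384 model
```
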
@@ -24,7 +24,7 @@ We present the dev results on SQuAD 2.0 and several GLUE benchmark tasks.
 | **MiniLM-L12xH384** | 33M | 81.7 | 85.7 | 93.0 | 91.5 | 58.5 | 73.3 | 89.5 | 91.3 |
 | **MiniLM-L6xH384** | 22M | 75.6 | 83.3 | 91.5 | 90.5 | 47.5 | 68.8 | 88.9 | 90.6 |
 
-This example code fine-tunes **6**-layer MiniLM on SQuAD 2.0 dataset.
+This example code fine-tunes **12**-layer MiniLM on SQuAD 2.0 dataset.
 
 ```bash
 # run fine-tuning on SQuAD 2.0
@@ -35,13 +35,13 @@ MODEL_PATH=/{path_of_pre-trained_model}/
 export CUDA_VISIBLE_DEVICES=0,1,2,3
 python -m torch.distributed.launch --nproc_per_node=4 ./examples/run_squad.py --model_type bert \
 --output_dir ${OUTPUT_DIR} --data_dir ${DATA_DIR} \
---model_name_or_path ${MODEL_PATH}/minilm-l6-h384-uncased.bin --tokenizer_name ${MODEL_PATH}/vocab.txt \
---config_name ${MODEL_PATH}/minilm-l6-h384-uncased-config.json \
+--model_name_or_path ${MODEL_PATH}/minilm-l12-h384-uncased.bin --tokenizer_name ${MODEL_PATH}/vocab.txt \
+--config_name ${MODEL_PATH}/minilm-l12-h384-uncased-config.json \
 --do_train --do_eval --do_lower_case \
 --train_file train-v2.0.json --predict_file dev-v2.0.json \
---learning_rate 5e-5 --num_train_epochs 5 \
+--learning_rate 4e-5 --num_train_epochs 4 \
 --max_seq_length 384 --doc_stride 128 \
---per_gpu_eval_batch_size=8 --per_gpu_train_batch_size=8 --save_steps 5000 \
+--per_gpu_eval_batch_size=12 --per_gpu_train_batch_size=12 --save_steps 5000 \
 --version_2_with_negative
 ```
 
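With the updated hyperparameters, the command above trains on 4 GPUs at a per-GPU batch size of 12, i.e. an effective batch size of 4 × 12 = 48 (previously 4 × 8 = 32). After training, the `transformers` SQuAD example normally saves the model and tokenizer into `${OUTPUT_DIR}`, so the result can be smoke-tested with the question-answering pipeline. A minimal sketch, assuming a recent `transformers` version and that the output directory holds the saved config, weights, and vocabulary; the question/context strings are made up for illustration:

```python
# Hypothetical smoke test of the checkpoint written by run_squad.py.
from transformers import pipeline

output_dir = "/path/to/output_dir"  # placeholder: the ${OUTPUT_DIR} used above

qa = pipeline("question-answering", model=output_dir, tokenizer=output_dir)

result = qa(
    question="How many parameters does the 12-layer MiniLM have?",
    context="MiniLM-L12xH384 has 12 layers, a hidden size of 384, and 33M parameters.",
)
print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': '33M'}
```
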
@@ -58,40 +58,43 @@ Following [UniLM](/unilm-v1), MiniLM can be fine-tuned as a sequence-to-sequence
 | **MiniLM-L12xH384** | 33M | 40.43 | 17.72 | 32.60 |
 | **MiniLM-L6xH384** | 22M | 38.79 | 16.39 | 31.10 |
 
-This example code fine-tunes **6**-layer MiniLM on XSum dataset.
+This example code fine-tunes **12**-layer MiniLM on XSum dataset.
 
 ```bash
 # run fine-tuning on XSum
 TRAIN_FILE=/your/path/to/train.json
 CACHED_FEATURE_FILE=/your/path/to/xsum_train.uncased.features.pt
 OUTPUT_DIR=/your/path/to/save_checkpoints
 CACHE_DIR=/your/path/to/transformer_package_cache
+MODEL_PATH=/your/path/to/pre_trained_model/
 
 export CUDA_VISIBLE_DEVICES=0,1,2,3
 python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
 --train_file ${TRAIN_FILE} --cached_train_features_file ${CACHED_FEATURE_FILE} \
 --output_dir ${OUTPUT_DIR} \
---model_type minilm --model_name_or_path minilm-l6-h384-uncased \
+--model_type bert --model_name_or_path ${MODEL_PATH}/minilm-l12-h384-uncased.bin \
+--tokenizer_name ${MODEL_PATH}/minilm-l12-h384-uncased-vocab-nlg.txt --config_name ${MODEL_PATH}/minilm-l12-h384-uncased-config.json \
 --do_lower_case --fp16 --fp16_opt_level O2 \
 --max_source_seq_length 464 --max_target_seq_length 48 \
 --per_gpu_train_batch_size 16 --gradient_accumulation_steps 1 \
---learning_rate 1.5e-4 --num_warmup_steps 500 --num_training_steps 120000 --cache_dir ${CACHE_DIR}
+--learning_rate 1e-4 --num_warmup_steps 500 --num_training_steps 108000 --cache_dir ${CACHE_DIR}
 ```
 
 ```bash
 # run decoding on XSum
 MODEL_PATH=/your/path/to/model_checkpoint
+VOCAB_PATH=/your/path/to/vocab_file
 SPLIT=validation
 INPUT_JSON=/your/path/to/${SPLIT}.json
 
 export CUDA_VISIBLE_DEVICES=0
 export OMP_NUM_THREADS=4
 export MKL_NUM_THREADS=4
 python decode_seq2seq.py \
---fp16 --model_type minilm --tokenizer_name minilm-l6-h384-uncased \
+--fp16 --model_type bert --tokenizer_name ${VOCAB_PATH}/minilm-l12-h384-uncased-vocab-nlg.txt \
 --input_file ${INPUT_JSON} --split $SPLIT --do_lower_case \
 --model_path ${MODEL_PATH} --max_seq_length 512 --max_tgt_length 48 --batch_size 32 --beam_size 5 \
---length_penalty 0 --forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "."
+--length_penalty 0 --forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "." --need_score_traces
 ```
 
 ### Abstractive Summarization - [CNN / Daily Mail](https://github.com/harvardnlp/sent-summary)
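For the XSum example above, decoding writes the generated summaries (and, with `--need_score_traces`, per-hypothesis score traces); the ROUGE-1/2/L numbers in the table come from scoring such outputs against the reference summaries. The snippet below is an illustration only, not the repository's evaluation script: it assumes plain-text files with one hypothesis/reference per line (hypothetical file names) and uses the third-party `rouge-score` package as a stand-in scorer:

```python
# Illustrative ROUGE scoring of decoded XSum summaries against references.
# NOTE: stand-in evaluation using the `rouge-score` package, not the script
# that produced the numbers reported in the README table.
from rouge_score import rouge_scorer

hyp_file = "/your/path/to/xsum.validation.hypotheses.txt"  # hypothetical file name
ref_file = "/your/path/to/xsum.validation.references.txt"  # hypothetical file name

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

with open(hyp_file) as hf, open(ref_file) as rf:
    pairs = list(zip(hf.read().splitlines(), rf.read().splitlines()))

totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
for hyp, ref in pairs:
    scores = scorer.score(ref, hyp)  # rouge_score convention: score(target, prediction)
    for key in totals:
        totals[key] += scores[key].fmeasure

for key, total in totals.items():
    print(f"{key}: {100.0 * total / len(pairs):.2f}")  # average F1 in percent
```
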
