We release the **uncased** **12**-layer and **6**-layer MiniLM models with a **384** hidden size, distilled from an in-house pre-trained [UniLM v2](/unilm) model of BERT-Base size. We also release an **uncased** **6**-layer MiniLM model with a **768** hidden size distilled from [BERT-Base](https://github.com/google-research/bert). The models use the same WordPiece vocabulary as BERT.
Links to the pre-trained models:
- [MiniLMv1-L12-H384-uncased](https://1drv.ms/u/s!AjHn0yEmKG8qixAYyu2Fvq5ulnU7?e=DFApTA): 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base
- MiniLMv1-L6-H384-uncased: 6-layer, 384-hidden, 12-heads, 22M parameters, 5.3x faster than BERT-Base
- MiniLMv1-L6-H768-uncased: 6-layer, 768-hidden, 12-heads, 66M parameters, 2.0x faster than BERT-Base
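The checkpoints above ship as a PyTorch weights file plus a BERT-style config and WordPiece vocabulary, so they can be loaded with the standard BERT classes from the `transformers` library. Below is a minimal loading sketch, assuming (hypothetically) that the downloaded files have been renamed to the default names `from_pretrained` expects (`pytorch_model.bin`, `config.json`, `vocab.txt`) and placed in one local directory:

```python
# Minimal loading sketch (assumption: the downloaded weights, config and vocab
# were renamed to pytorch_model.bin / config.json / vocab.txt and placed in a
# local directory, e.g. ./minilm-l12-h384-uncased/).
from transformers import BertModel, BertTokenizer

MODEL_DIR = "./minilm-l12-h384-uncased"  # hypothetical local path

tokenizer = BertTokenizer.from_pretrained(MODEL_DIR, do_lower_case=True)
model = BertModel.from_pretrained(MODEL_DIR)

# Encode a sentence; the H384 models produce 384-dimensional hidden states.
input_ids = tokenizer.encode("MiniLM is a small and fast Transformer model.",
                             return_tensors="pt")
outputs = model(input_ids)
print(outputs[0].shape)  # (1, sequence_length, 384)
```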
We present the dev results on SQuAD 2.0 and several GLUE benchmark tasks.

| Model | #Param | SQuAD 2.0 | MNLI-m | SST-2 | QNLI | CoLA | RTE | MRPC | QQP |
|-------|--------|-----------|--------|-------|------|------|-----|------|-----|
| **MiniLM-L12xH384** | 33M | 81.7 | 85.7 | 93.0 | 91.5 | 58.5 | 73.3 | 89.5 | 91.3 |
| **MiniLM-L6xH384** | 22M | 75.6 | 83.3 | 91.5 | 90.5 | 47.5 | 68.8 | 88.9 | 90.6 |
This example code fine-tunes the **12**-layer MiniLM on the SQuAD 2.0 dataset.
```bash
# run fine-tuning on SQuAD 2.0
DATA_DIR=/{path_of_data}/
OUTPUT_DIR=/{path_of_fine-tuned_model}/
MODEL_PATH=/{path_of_pre-trained_model}/

export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m torch.distributed.launch --nproc_per_node=4 ./examples/run_squad.py --model_type bert \
 --output_dir ${OUTPUT_DIR} --data_dir ${DATA_DIR} \
 --model_name_or_path ${MODEL_PATH}/minilm-l12-h384-uncased.bin --tokenizer_name ${MODEL_PATH}/vocab.txt \
 --config_name ${MODEL_PATH}/minilm-l12-h384-uncased-config.json \
 --do_train --do_eval --do_lower_case \
 --train_file train-v2.0.json --predict_file dev-v2.0.json \
 --learning_rate 4e-5 --num_train_epochs 4 \
 --max_seq_length 384 --doc_stride 128 \
 --per_gpu_eval_batch_size=12 --per_gpu_train_batch_size=12 --save_steps 5000 \
 --version_2_with_negative
```
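After fine-tuning, `run_squad.py` saves the fine-tuned model and tokenizer to `${OUTPUT_DIR}`. As a quick sanity check (not part of the original instructions), the saved checkpoint can be queried with the `transformers` question-answering pipeline; the sketch below assumes the output directory contains the usual saved `config.json`, `vocab.txt`, and `pytorch_model.bin`:

```python
# Sketch: querying the fine-tuned SQuAD 2.0 checkpoint with the transformers
# question-answering pipeline (assumption: OUTPUT_DIR is the directory written
# by run_squad.py above).
from transformers import pipeline

OUTPUT_DIR = "/{path_of_fine-tuned_model}/"  # hypothetical, matches ${OUTPUT_DIR}

qa = pipeline("question-answering", model=OUTPUT_DIR, tokenizer=OUTPUT_DIR)

result = qa(
    question="What hidden size do the released MiniLM models use?",
    context="The released MiniLM models use a hidden size of 384 and are "
            "distilled from a teacher of BERT-Base size.",
)
print(result["answer"], result["score"])
```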
Following [UniLM](/unilm-v1), MiniLM can be fine-tuned as a sequence-to-sequence model for abstractive summarization.

### Abstractive Summarization - [XSum](https://github.com/EdinburghNLP/XSum)

| Model | #Param | ROUGE-1 | ROUGE-2 | ROUGE-L |
|-------|--------|---------|---------|---------|
| **MiniLM-L12xH384** | 33M | 40.43 | 17.72 | 32.60 |
| **MiniLM-L6xH384** | 22M | 38.79 | 16.39 | 31.10 |
This example code fine-tunes the **12**-layer MiniLM on the XSum dataset.
```bash
# run fine-tuning on XSum
TRAIN_FILE=/your/path/to/train.json
CACHED_FEATURE_FILE=/your/path/to/xsum_train.uncased.features.pt
OUTPUT_DIR=/your/path/to/save_checkpoints
CACHE_DIR=/your/path/to/transformer_package_cache
MODEL_PATH=/your/path/to/pre_trained_model/

export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m torch.distributed.launch --nproc_per_node=4 run_seq2seq.py \
 --train_file ${TRAIN_FILE} --cached_train_features_file ${CACHED_FEATURE_FILE} \
 --output_dir ${OUTPUT_DIR} \
 --model_type bert --model_name_or_path ${MODEL_PATH}/minilm-l12-h384-uncased.bin \
 --tokenizer_name ${MODEL_PATH}/minilm-l12-h384-uncased-vocab-nlg.txt --config_name ${MODEL_PATH}/minilm-l12-h384-uncased-config.json \
 --do_lower_case --fp16 --fp16_opt_level O2 \
 --max_source_seq_length 464 --max_target_seq_length 48 \
 --per_gpu_train_batch_size 16 --gradient_accumulation_steps 1 \
 --learning_rate 1e-4 --num_warmup_steps 500 --num_training_steps 108000 --cache_dir ${CACHE_DIR}
```

```bash
# run decoding on XSum
MODEL_PATH=/your/path/to/model_checkpoint
VOCAB_PATH=/your/path/to/vocab_file
SPLIT=validation
INPUT_JSON=/your/path/to/${SPLIT}.json

export CUDA_VISIBLE_DEVICES=0
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python decode_seq2seq.py \
 --fp16 --model_type bert --tokenizer_name ${VOCAB_PATH}/minilm-l12-h384-uncased-vocab-nlg.txt \
 --input_file ${INPUT_JSON} --split $SPLIT --do_lower_case \
 --model_path ${MODEL_PATH} --max_seq_length 512 --max_tgt_length 48 --batch_size 32 --beam_size 5 \
 --length_penalty 0 --forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "." --need_score_traces
```
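The table above reports ROUGE-1/2/L for XSum. As a rough, unofficial check, the decoded summaries can be scored against references with the `rouge-score` package (`pip install rouge-score`); the file names below are hypothetical and assume one summary per line:

```python
# Rough ROUGE sketch with the rouge-score package (not the repository's own
# evaluation flow). Assumption: decoded and reference summaries are stored as
# parallel text files with one summary per line.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

with open("xsum_decoded.txt") as f_pred, open("xsum_references.txt") as f_ref:
    predictions = [line.strip() for line in f_pred]
    references = [line.strip() for line in f_ref]

# Average F1 over the dataset for each ROUGE variant.
totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
for ref, pred in zip(references, predictions):
    scores = scorer.score(ref, pred)
    for key in totals:
        totals[key] += scores[key].fmeasure

for key, total in totals.items():
    print(f"{key}: {100 * total / len(predictions):.2f}")
```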
### Abstractive Summarization - [CNN / Daily Mail](https://github.com/harvardnlp/sent-summary)