
Commit 2b380cb (1 parent 60727af)
Author: Alex Kharlamov
GPT-FAST-MIXTRAL-MOE integration

4 files changed, +127 -0 lines changed
README.md
@@ -0,0 +1,96 @@
## Mixtral-MOE

We will be using [Mixtral-MOE](https://huggingface.co/docs/transformers/en/model_doc/mixtral).

It features:
* 8 experts per MLP
* 45 billion parameters
* compute required is the same as that of a 14 billion parameter model
* Sliding Window Attention
* GQA
* Byte-fallback BPE tokenizer

As a low-level framework we will be using [GPT fast](https://github.com/pytorch-labs/gpt-fast).

#### Prerequisites

- PyTorch 2.3
- CUDA >= 11.8

`cd` to the example folder `examples/large_models/gpt_fast_mixtral_moe`

Install dependencies:
```
git clone https://github.com/pytorch-labs/gpt-fast/
pip install sentencepiece huggingface_hub
```
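
Optionally, sanity-check the installed PyTorch and CUDA versions before continuing. This quick check is not part of the original example, just a convenience:
```
# prints the PyTorch version and the CUDA version it was built against
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```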

### Step 1: Download and convert the weights

Currently supported models:
```
mistralai/Mixtral-8x7B-v0.1
```
Prepare weights:
```
export MODEL_REPO=mistralai/Mixtral-8x7B-v0.1
huggingface-cli login
python gpt-fast/mixtral-moe/scripts/download.py --repo_id $MODEL_REPO
python gpt-fast/mixtral-moe/scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/$MODEL_REPO
```
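
Assuming the gpt-fast scripts behave as in the commands above, the converted checkpoint is written under `checkpoints/$MODEL_REPO/` (the quantize step below expects `model.pth` there). A quick way to verify:
```
# model.pth should be present after conversion
ls checkpoints/$MODEL_REPO
```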

### Step 1.5: Quantize the model to int8

To speed up model loading and inference even further, we can optionally quantize the model to int8. Please see the [blog post](https://pytorch.org/blog/accelerating-generative-ai-2/) for details on the potential accuracy loss.

```
python gpt-fast/mixtral-moe/quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
```

The quantized model will be written to `checkpoints/$MODEL_REPO/model_int8.pth`.

From here on we will use the quantized version because of its lower memory requirements, but you are free to use the original model. To switch, change the checkpoint filename in the [`model_config.yaml`](./model_config.yaml) file.
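
For example, to point the config back at the unquantized checkpoint you can edit `converted_ckpt_dir` by hand, or use an illustrative one-liner (GNU sed syntax, not part of the original example):
```
# swap the int8 checkpoint for the original full-precision one in the config
sed -i 's/model_int8.pth/model.pth/' model_config.yaml
```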

### Step 2: Generate model archive
At this stage we are creating the model archive, which includes the configuration of our model in [model_config.yaml](./model_config.yaml).
This is also the point where we need to decide whether to deploy the model on a single GPU or on multiple GPUs.
For the single GPU case we can use the default configuration found in [model_config.yaml](./model_config.yaml).
All configs enable the current prototype feature FxGraphCache by setting `fx_graph_cache` to *true*.
This feature stores the TorchInductor output in a cache to speed up `torch.compile` times when rerunning the handler.
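
For reference, when running gpt-fast outside of TorchServe the same TorchInductor cache can typically be enabled through an environment variable; this is an assumption about recent PyTorch builds and is not required by this example:
```
# enable TorchInductor's FX graph cache globally (assumed env var for recent PyTorch)
export TORCHINDUCTOR_FX_GRAPH_CACHE=1
```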

Please proceed with the [TorchServe installation](https://github.com/pytorch/serve/blob/master/README.md) in order to have `torch-model-archiver` available.

```
torch-model-archiver --model-name gpt_fast_mixtral_moe --version 1.0 --handler ../gpt_fast/handler.py --config-file model_config.yaml --extra-files "gpt-fast/mixtral-moe/generate.py,gpt-fast/mixtral-moe/model.py,gpt-fast/mixtral-moe/quantize.py,gpt-fast/mixtral-moe/tp.py" --archive-format no-archive
mv checkpoints gpt_fast_mixtral_moe/
```
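
Because `--archive-format no-archive` is used, the archiver produces a plain directory named `gpt_fast_mixtral_moe` rather than a packed `.mar` file. You can inspect its contents with:
```
# should show the handler sources, model_config.yaml and the moved checkpoints folder
ls gpt_fast_mixtral_moe
```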

If we want to use the tensor-parallel variant and split the model over multiple GPUs, we need to set the desired degree of tensor parallelism in [model_config_tp.yaml](./model_config_tp.yaml) and use that configuration when creating the archive:
```
torch-model-archiver --model-name gpt_fast_mixtral_moe --version 1.0 --handler ../gpt_fast/handler.py --config-file model_config_tp.yaml --extra-files "gpt-fast/mixtral-moe/generate.py,gpt-fast/mixtral-moe/model.py,gpt-fast/mixtral-moe/quantize.py,gpt-fast/mixtral-moe/tp.py" --archive-format no-archive
mv checkpoints gpt_fast_mixtral_moe/
```
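
The provided `model_config_tp.yaml` sets `nproc-per-node: 4`, so the machine needs at least four visible GPUs for the tensor-parallel run. A quick check:
```
# list the GPUs visible to the driver
nvidia-smi -L
```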

### Step 3: Add the model archive to model store

```
mkdir model_store
mv gpt_fast_mixtral_moe model_store
```

### Step 4: Start TorchServe

```
torchserve --start --ncs --model-store model_store --models gpt_fast_mixtral_moe
```
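
Once the workers come up, you can confirm the model is registered through the TorchServe management API (default port 8081):
```
# returns the worker status and configuration of the registered model
curl http://localhost:8081/models/gpt_fast_mixtral_moe
```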

### Step 5: Run inference

```
curl "http://localhost:8080/predictions/gpt_fast_mixtral_moe" -T request.json
# Returns: Paris, is one of the most visited cities in the world. It is a city of romance, art, culture, and fashion. Paris is home to some of the most iconic landmarks in the world, including the Eiffel Tower
```
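
The prompt and generation length come from [request.json](./request.json) (shown at the end of this diff). To try a different prompt, send an ad-hoc payload of the same shape, for example:
```
# write a hypothetical payload with a different prompt and send it the same way
echo '{"prompt": "The theory of relativity states", "max_new_tokens": 50}' > custom_request.json
curl "http://localhost:8080/predictions/gpt_fast_mixtral_moe" -T custom_request.json
```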
model_config.yaml
@@ -0,0 +1,11 @@
#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
deviceType: "gpu"
handler:
    converted_ckpt_dir: "checkpoints/mistralai/Mixtral-8x7B-v0.1/model_int8.pth"
    max_new_tokens: 50
    compile: true
    fx_graph_cache: True
model_config_tp.yaml
@@ -0,0 +1,16 @@
#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
parallelType: "tp"
deviceType: "gpu"
torchrun:
    nproc-per-node: 4
handler:
    profile: true
    converted_ckpt_dir: "checkpoints/mistralai/Mixtral-8x7B-v0.1/model_int8.pth"
    max_new_tokens: 50
    compile: true
    stream: false
    fx_graph_cache: True
request.json
@@ -0,0 +1,4 @@
1+
{
2+
"prompt": "The capital of France",
3+
"max_new_tokens": 50
4+
}
