Commit f433a66

Merge branch 'master' into dependabot/pip/requirements/requests-2.32.0
2 parents: 5fd63e8 + b891309

File tree: 9 files changed, +149 -29 lines

.github/workflows/regression_tests_cpu_binaries.yml (+14 -19)
```
@@ -16,7 +16,7 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        os: [ubuntu-20.04, macOS-latest]
+        os: [ubuntu-20.04, macos-latest]
         python-version: ["3.8", "3.9", "3.10"]
         binaries: ["pypi", "conda"]
         exclude:
@@ -31,38 +31,33 @@ jobs:
         with:
           submodules: recursive
       - name: Setup conda with Python ${{ matrix.python-version }}
-        if: matrix.os == 'macos-14'
         uses: conda-incubator/setup-miniconda@v3
         with:
           auto-update-conda: true
           channels: anaconda, conda-forge
           python-version: ${{ matrix.python-version }}
-      - name: Setup conda with Python ${{ matrix.python-version }}
-        if: matrix.os != 'macos-14'
-        uses: s-weigand/setup-conda@v1
-        with:
-          update-conda: true
-          python-version: ${{ matrix.python-version }}
-          conda-channels: anaconda, conda-forge
       - name: Setup Java 17
         uses: actions/setup-java@v3
         with:
           distribution: 'zulu'
           java-version: '17'
       - name: Checkout TorchServe
         uses: actions/checkout@v3
-      - name: Run install dependencies and regression test
-        if: matrix.os == 'macos-14'
-        shell: bash -el {0}
-        run: |
-          conda info
-          python ts_scripts/install_dependencies.py --environment=dev
-          python test/regression_tests.py --binaries --${{ matrix.binaries }} --nightly
       - name: Install dependencies
-        if: matrix.os != 'macos-14'
+        shell: bash -el {0}
         run: |
+          echo "=====CHECK ENV AND PYTHON VERSION===="
+          conda info --envs
+          python --version
+          echo "=====RUN INSTALL DEPENDENCIES===="
           python ts_scripts/install_dependencies.py --environment=dev
-      - name: Validate Torchserve CPU Regression
-        if: matrix.os != 'macos-14'
+      - name: Torchserve Regression Tests
+        shell: bash -el {0}
+        env:
+          TS_MAC_ARM64_CPU_ONLY: ${{ matrix.os == 'macos-latest' && 'True' || 'False' }}
         run: |
+          echo "=====CHECK ENV AND PYTHON VERSION===="
+          conda info --envs
+          python --version
+          echo "=====RUN REGRESSION TESTS===="
           python test/regression_tests.py --binaries --${{ matrix.binaries }} --nightly
```
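The new `Torchserve Regression Tests` step exports `TS_MAC_ARM64_CPU_ONLY`, which evaluates to `'True'` only on the `macos-latest` (Apple Silicon) runner. The workflow itself does not show how the variable is consumed; purely as a hypothetical sketch, a downstream script could gate GPU-specific work on it like this (not the actual TorchServe code):

```python
import os

# Hypothetical consumer of the TS_MAC_ARM64_CPU_ONLY variable set by the workflow;
# the real logic lives in TorchServe's install/test scripts and may differ.
mac_arm64_cpu_only = os.environ.get("TS_MAC_ARM64_CPU_ONLY", "False") == "True"

if mac_arm64_cpu_only:
    print("Apple Silicon CPU-only runner: skipping CUDA-specific steps")
else:
    print("Running the full regression matrix")
```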

.github/workflows/regression_tests_gpu_binaries.yml (+4 -9)
```
@@ -39,12 +39,7 @@ jobs:
         with:
           python-version: ${{ matrix.python-version }}
           architecture: x64
-      - name: Setup Conda
-        uses: s-weigand/setup-conda@v1
-        with:
-          update-conda: true
-          python-version: ${{ matrix.python-version }}
-          conda-channels: anaconda, conda-forge
+      - run: python --version
       - run: conda --version
       - name: Setup Java 17
         uses: actions/setup-java@v3
@@ -53,17 +48,17 @@
           java-version: '17'
       - name: Install dependencies
         shell: bash -el {0}
-        run: |
+        run: |
           echo "=====CHECK ENV AND PYTHON VERSION===="
           /home/ubuntu/actions-runner/_work/serve/serve/3/condabin/conda info --envs
           python --version
           echo "=====RUN INSTALL DEPENDENCIES===="
           python ts_scripts/install_dependencies.py --environment=dev --cuda=cu121
       - name: Torchserve Regression Tests
-        shell: bash -el {0}
+        shell: bash -el {0}
         run: |
           echo "=====CHECK ENV AND PYTHON VERSION===="
           /home/ubuntu/actions-runner/_work/serve/serve/3/condabin/conda info --envs
           python --version
           echo "=====RUN REGRESSION TESTS===="
-          python test/regression_tests.py --binaries --${{ matrix.binaries }} --nightly
+          python test/regression_tests.py --binaries --${{ matrix.binaries }} --nightly
```

The three `-`/`+` pairs with identical visible text are whitespace-only changes; the text itself is unchanged.
New file, +96 lines (the README for the gpt_fast_mixtral_moe example):
## Mixtral-MOE

We will be using [Mixtral-MOE](https://huggingface.co/docs/transformers/en/model_doc/mixtral).

It features:

* 8 experts per MLP
* 45 billion parameters
* compute required is the same as that of a 14 billion parameter model
* Sliding Window Attention
* GQA
* Byte-fallback BPE tokenizer

As a low-level framework we will be using [GPT fast](https://github.com/pytorch-labs/gpt-fast).

#### Pre-requisites

- PyTorch 2.3
- CUDA >= 11.8

`cd` to the example folder `examples/large_models/gpt_fast_mixtral_moe`.

Install dependencies:
```
git clone https://github.com/pytorch-labs/gpt-fast/
pip install sentencepiece huggingface_hub
```

### Step 1: Download and convert the weights

Currently supported models:
```
mistralai/Mixtral-8x7B-v0.1
```
Prepare the weights:
```
export MODEL_REPO=mistralai/Mixtral-8x7B-v0.1
huggingface-cli login
python gpt-fast/mixtral-moe/scripts/download.py --repo_id $MODEL_REPO
python gpt-fast/mixtral-moe/scripts/convert_hf_checkpoint.py --checkpoint_dir checkpoints/$MODEL_REPO
```
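Both helper scripts come from gpt-fast. As a rough, hypothetical equivalent of the download step only (the conversion still needs `convert_hf_checkpoint.py`), you could fetch the weights directly with `huggingface_hub`, which the install step above already pulls in, assuming you have accepted the model's terms on the Hub:

```python
from huggingface_hub import snapshot_download

# Hypothetical stand-in for gpt-fast's download.py: pull the raw Mixtral weights
# into the checkpoints/ layout that convert_hf_checkpoint.py expects.
snapshot_download(
    repo_id="mistralai/Mixtral-8x7B-v0.1",
    local_dir="checkpoints/mistralai/Mixtral-8x7B-v0.1",
)
```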
### Step 1.5: Quantize the model to int8

To speed up model loading and inference even further, we can optionally quantize the model to int8. Please see the [blog post](https://pytorch.org/blog/accelerating-generative-ai-2/) for details on the potential accuracy loss.

```
python gpt-fast/mixtral-moe/quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
```

The quantized model will show up as `checkpoints/$MODEL_REPO/model_int8.pth`.

From here on we will use the quantized version because of its lower memory requirements, but you are free to use the original model. To switch, exchange the checkpoint filename in the [`model_config.yaml`](./model_config.yaml) file.
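For reference, the checkpoint is selected by the `converted_ckpt_dir` entry in the handler section of `model_config.yaml` (the full file appears later in this commit); swapping between the quantized and original weights is just a matter of pointing that entry at the other file:

```yaml
handler:
    # default in this example: the int8-quantized checkpoint
    converted_ckpt_dir: "checkpoints/mistralai/Mixtral-8x7B-v0.1/model_int8.pth"
    # to use the unquantized weights instead, point at the original checkpoint:
    # converted_ckpt_dir: "checkpoints/mistralai/Mixtral-8x7B-v0.1/model.pth"
```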
### Step 2: Generate model archive

At this stage we are creating the model archive, which includes the configuration of our model in [model_config.yaml](./model_config.yaml).
It's also the point where we need to decide whether to deploy the model on a single GPU or on multiple GPUs.
For the single-GPU case we can use the default configuration found in [model_config.yaml](./model_config.yaml).
All configs enable the current prototyping feature FxGraphCache by setting `fx_graph_cache` to *true*.
This feature stores the TorchInductor output in a cache to speed up `torch.compile` times when rerunning the handler.
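As a rough illustration only (not the example handler's actual code), enabling the FX graph cache programmatically before compiling looks roughly like this in PyTorch 2.x:

```python
import torch
import torch._inductor.config as inductor_config

# Sketch of what the fx_graph_cache flag amounts to: persist TorchInductor's
# compiled output so repeated torch.compile runs of the handler start faster.
inductor_config.fx_graph_cache = True

model = torch.nn.Linear(8, 8)          # stand-in for the loaded Mixtral-MOE module
compiled = torch.compile(model)
print(compiled(torch.randn(2, 8)).shape)
```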
Please proceed with the [TorchServe installation](https://github.com/pytorch/serve/blob/master/README.md) in order to have `torch-model-archiver` available.

```
torch-model-archiver --model-name gpt_fast_mixtral_moe --version 1.0 --handler ../gpt_fast/handler.py --config-file model_config.yaml --extra-files "gpt-fast/mixtral-moe/generate.py,gpt-fast/mixtral-moe/model.py,gpt-fast/mixtral-moe/quantize.py,gpt-fast/mixtral-moe/tp.py" --archive-format no-archive
mv checkpoints gpt_fast_mixtral_moe/
```

If we want to use the tensor parallel variant and split the model over multiple GPUs, we need to set the desired degree of tensor parallelism in [model_config_tp.yaml](./model_config_tp.yaml) and use that configuration when creating the archive:
```
torch-model-archiver --model-name gpt_fast_mixtral_moe --version 1.0 --handler ../gpt_fast/handler.py --config-file model_config_tp.yaml --extra-files "gpt-fast/mixtral-moe/generate.py,gpt-fast/mixtral-moe/model.py,gpt-fast/mixtral-moe/quantize.py,gpt-fast/mixtral-moe/tp.py" --archive-format no-archive
mv checkpoints gpt_fast_mixtral_moe/
```
### Step 3: Add the model archive to the model store

```
mkdir model_store
mv gpt_fast_mixtral_moe model_store
```

### Step 4: Start TorchServe

```
torchserve --start --ncs --model-store model_store --models gpt_fast_mixtral_moe
```

### Step 5: Run inference

```
curl "http://localhost:8080/predictions/gpt_fast_mixtral_moe" -T request.json
# Returns: Paris, is one of the most visited cities in the world. It is a city of romance, art, culture, and fashion. Paris is home to some of the most iconic landmarks in the world, including the Eiffel Tower
```
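The same request can also be sent from Python; this is just a convenience sketch with the `requests` library (not part of the example), mirroring the curl call above against the default inference port 8080:

```python
import requests

# Send the contents of request.json to the TorchServe inference API,
# equivalent to the curl -T call above.
with open("request.json", "rb") as f:
    response = requests.post(
        "http://localhost:8080/predictions/gpt_fast_mixtral_moe", data=f
    )
print(response.text)
```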
New file, +11 lines (the single-GPU `model_config.yaml` referenced above):

```
#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
deviceType: "gpu"
handler:
    converted_ckpt_dir: "checkpoints/mistralai/Mixtral-8x7B-v0.1/model_int8.pth"
    max_new_tokens: 50
    compile: true
    fx_graph_cache: True
```
New file, +16 lines (the tensor-parallel `model_config_tp.yaml` referenced above):

```
#frontend settings
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 200
responseTimeout: 300
parallelType: "tp"
deviceType: "gpu"
torchrun:
    nproc-per-node: 4
handler:
    profile: true
    converted_ckpt_dir: "checkpoints/mistralai/Mixtral-8x7B-v0.1/model_int8.pth"
    max_new_tokens: 50
    compile: true
    stream: false
    fx_graph_cache: True
```
New file, +4 lines (the `request.json` payload used in Step 5):

```
{
  "prompt": "The capital of France",
  "max_new_tokens": 50
}
```

requirements/common.txt (+1)
```
@@ -5,3 +5,4 @@ packaging==23.2
 pynvml==11.5.0
 pyyaml==6.0
 ninja==1.11.1.1
+setuptools==69.5.1
```

test/pytest/test_continuous_batching.py (+3)
```
@@ -94,6 +94,7 @@ def register_model(mar_file_path, model_store, torchserve):
     test_utils.unregister_model(model_name)
 
 
+@pytest.mark.skip(reason="Skipping this test for now")
 def test_echo_stream_inference(model_name_and_stdout):
     model_name, _ = model_name_and_stdout
     responses = []
@@ -145,6 +146,7 @@ def test_echo_stream_inference(model_name_and_stdout):
     assert all_predictions[3] == "When travelling to NYC, I was able to"
 
 
+@pytest.mark.skip(reason="Skipping this test for now")
 def test_decoding_stage(monkeypatch):
     monkeypatch.syspath_prepend((CURR_FILE_PATH / "test_data" / "streaming"))
 
@@ -211,6 +213,7 @@ def test_decoding_stage(monkeypatch):
     assert ctx.cache["id2"]["encoded"]["attention_mask"].size()[-1] == 11
 
 
+@pytest.mark.skip(reason="Skipping this test for now")
 def test_closed_connection(model_name_and_stdout):
     model_name, stdout = model_name_and_stdout
 
```

ts_scripts/install_dependencies.py (-1)
```
@@ -146,7 +146,6 @@ def install_python_packages(self, cuda_version, requirements_file_path, nightly)
         else:
             self.install_torch_packages(cuda_version)
 
-        os.system(f"{sys.executable} -m pip install -U pip setuptools")
         # developer.txt also installs packages from common.txt
         os.system(f"{sys.executable} -m pip install -U -r {requirements_file_path}")
 
```
