
Commit f74f85c

lxning and mreso authored
Refactor inf2 streamer handler (#3035)
* add inf2 streamer base handler
* fmt
* updated model-config
* remove batch_size
* fixed config
* fixed config
* fixed config
* updated model-config
* update doc
* add notebook
* update docker cmd
* update config for neuron sdk2.18.1
* update continuous batch notebook for neuron sdk2.18.1
* delete out dated files
* add neuron_cc_flag in model-config.yaml
* support demo streaming response
* support demo streaming response
* support demo streaming response
* support demo streaming response
* support demo streaming response
* support demo streaming response
* support n_positions setting
* mv base_neuronx_microbatching_handler to example dir

---------

Co-authored-by: Matthias Reso <[email protected]>
1 parent 4ec7518 commit f74f85c

15 files changed: +433 −323 lines changed

examples/large_models/inferentia2/llama2/config.properties

-1
This file was deleted.

examples/large_models/inferentia2/llama2/continuous_batching/Readme.md

+1 −1

@@ -15,7 +15,7 @@ This example can also be extended to support Mistral without code changes. Custo
 | mistral | mistral.model.MistralForSampling |


-The batch size in [model-config.yaml](model-config.yaml) indicates the maximum number of requests torchserve will aggregate and send to the custom handler within the batch delay. It is the batch size used for the Inf2 model compilation.
+The `batchSize` in [model-config.yaml](model-config.yaml) indicates the maximum number of requests torchserve will aggregate and send to the custom handler within the batch delay. It is the batch size used for the Inf2 model compilation.
 Since compilation batch size can influence compile time and also constrained by the Inf2 instance type, this is chosen to be a relatively smaller value, say 4.

 `inf2-llama-2-continuous-batching.ipynb` is the notebook example.
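For readers unfamiliar with how `batchSize` plays out at request time, the sketch below (not part of this commit) sends several prompts concurrently, so that requests arriving within the batch delay can be aggregated into a single batch on the server. It assumes a local TorchServe endpoint with a model registered under the name `llama-2-70b` and a plain-text prompt payload; both are illustrative assumptions.

```python
# Illustrative client only -- not part of this commit. Assumes TorchServe is
# reachable on localhost:8080 and a model named "llama-2-70b" is registered.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/predictions/llama-2-70b"
PROMPTS = [
    "Today the weather is really nice and I am planning on ",
    "My favorite recipe for dinner is ",
    "The capital of France is ",
    "A short poem about the ocean: ",
]


def predict(prompt: str) -> str:
    # Each call is an independent HTTP request; requests that land within the
    # batch delay can be aggregated server-side, up to batchSize at a time.
    response = requests.post(URL, data=prompt, timeout=300)
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    # Fire the prompts concurrently so TorchServe has a chance to batch them.
    with ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
        for prompt, completion in zip(PROMPTS, pool.map(predict, PROMPTS)):
            print(f"{prompt!r} -> {completion!r}")
```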

examples/large_models/inferentia2/llama2/continuous_batching/inf2-llama-2-continuous-batching.ipynb

+12 −24

@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "source": [
 "## TorchServe Continuous Batching Serve Llama-2-70B on Inferentia-2\n",
-"This notebook demonstrates TorchServe continuous batching serving Llama-2-70b on Inferentia-2 `inf2.48xlarge` with DLAMI: Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20231226"
+"This notebook demonstrates TorchServe continuous batching serving Llama-2-70b on Inferentia-2 `inf2.48xlarge` with Neuron DLAMI Deep Learning AMI Neuron (Ubuntu 22.04) 20240401 and Neuron DLC [public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.18.1-ubuntu20.04](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-inference-neuronx)"
 ],
 "metadata": {
 "collapsed": false
@@ -13,8 +13,7 @@
 {
 "cell_type": "markdown",
 "source": [
-"### Installation\n",
-"Note: This section can be skipped once Neuron DLC 2.16 with TorchServe latest version is released."
+"### Installation"
 ],
 "metadata": {
 "collapsed": false
@@ -25,26 +24,17 @@
 "execution_count": null,
 "outputs": [],
 "source": [
-"# Install Python venv\n",
-"!sudo apt-get install -y python3.9-venv g++\n",
+"# Activate Transformers NeuronX (PyTorch 2.1) Python venv\n",
+"!source /opt/aws_neuronx_venv_transformers_neuronx/bin/activate\n",
 "\n",
-"# Create Python venv\n",
-"!python3.9 -m venv aws_neuron_venv_pytorch\n",
-"\n",
-"# Activate Python venv\n",
-"!source aws_neuron_venv_pytorch/bin/activate\n",
-"!python -m pip install -U pip\n",
+"# Install torch-model-archiver\n",
+"!pip install torch-model-archiver\n",
 "\n",
 "# Clone Torchserve git repository\n",
 "!git clone https://github.com/pytorch/serve.git\n",
 "\n",
 "# Install dependencies, now all commands run under serve dir\n",
-"!cd serve\n",
-"!git checkout feat/inf2_cb\n",
-"!python ts_scripts/install_dependencies.py --neuronx --environment=dev\n",
-"\n",
-"# Install torchserve and torch-model-archiver\n",
-"python ts_scripts/install_from_src.py"
+"!cd serve"
 ],
 "metadata": {
 "collapsed": false
@@ -53,10 +43,7 @@
 {
 "cell_type": "markdown",
 "source": [
-"### Create model artifacts\n",
-"\n",
-"Note: run `mv model/models--meta-llama--Llama-2-70b-hf/snapshots/90052941a64de02075ca800b09fcea1bdaacb939/model.safetensors.index.json model/models--meta-llama--Llama-2-70b-hf/snapshots/90052941a64de02075ca800b09fcea1bdaacb939/model.safetensors.index.json.bkp`\n",
-" if neuron sdk does not support safetensors"
+"### Create model artifacts"
 ],
 "metadata": {
 "collapsed": false
@@ -68,8 +55,9 @@
 "outputs": [],
 "source": [
 "# login in Hugginface hub\n",
+"!pip install --upgrade huggingface_hub\n",
 "!huggingface-cli login --token $HUGGINGFACE_TOKEN\n",
-"!python examples/large_models/utils/Download_model.py --model_path model --model_name meta-llama/Llama-2-13b-hf --use_auth_token True\n",
+"!python examples/large_models/utils/Download_model.py --model_path model --model_name meta-llama/Llama-2-70b-hf --use_auth_token True\n",
 "\n",
 "# Create TorchServe model artifacts\n",
 "!torch-model-archiver --model-name llama-2-70b --version 1.0 --handler ts/torch_handler/distributed/base_neuronx_continuous_batching_handler.py -r examples/large_models/inferentia2/llama2/requirements.txt --config-file examples/large_models/inferentia2/llama2/continuous_batching/model-config.yaml --archive-format no-archive\n",
@@ -85,7 +73,7 @@
 {
 "cell_type": "markdown",
 "source": [
-"### Start TorchServe"
+"### Start docker"
 ],
 "metadata": {
 "collapsed": false
@@ -96,7 +84,7 @@
 "execution_count": null,
 "outputs": [],
 "source": [
-"torchserve --ncs --start --model-store model_store --models llama-2-70b --ts-config examples/large_models/inferentia2/llama2/config.properties"
+"!docker run --rm -it -v /home/ubuntu/serve/model_store/:/opt/ml/model -v /home/ubuntu/serve/:/home/model-server/serve --device /dev/neuron0:/dev/neuron0 --device /dev/neuron1:/dev/neuron1 --device /dev/neuron2:/dev/neuron2 --device /dev/neuron3:/dev/neuron3 --device /dev/neuron4:/dev/neuron4 --device /dev/neuron5:/dev/neuron5 --device /dev/neuron6:/dev/neuron6 --device /dev/neuron7:/dev/neuron7 --device /dev/neuron8:/dev/neuron8 --device /dev/neuron9:/dev/neuron9 --device /dev/neuron10:/dev/neuron10 --device /dev/neuron11:/dev/neuron11 -p 127.0.0.1:8080:8080 -p 127.0.0.1:8081:8081 -p 127.0.0.1:8082:8082 -p 127.0.0.1:7070:7070 -p 127.0.0.1:7071:7071 -e TS_INSTALL_PY_DEP_PER_MODEL=true public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.18.1-ubuntu20.04"
 ],
 "metadata": {
 "collapsed": false

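The notebook diff stops at starting the container; registration of the archived model happens in later cells that are not shown here. A rough sketch of that step against TorchServe's management API (port 8081, as published by the docker command above) could look like the following; the model name, `initial_workers` value, and polling interval are assumptions for illustration.

```python
# Illustrative sketch only -- the actual registration cells are not part of this diff.
# Assumes the management API is reachable on localhost:8081 and the mounted model
# store contains a folder named "llama-2-70b".
import time

import requests

MANAGEMENT = "http://localhost:8081"

# Register the model; TorchServe picks up handler settings from model-config.yaml.
resp = requests.post(
    f"{MANAGEMENT}/models",
    params={"url": "llama-2-70b", "initial_workers": 1},  # initial_workers is assumed
)
resp.raise_for_status()
print(resp.json())

# Poll until the worker reports READY; Neuron compilation or AOT artifact loading
# can take several minutes on the first load.
while True:
    detail = requests.get(f"{MANAGEMENT}/models/llama-2-70b").json()[0]
    workers = detail.get("workers", [])
    if workers and all(w.get("status") == "READY" for w in workers):
        break
    time.sleep(30)
print("llama-2-70b is ready for inference")
```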
examples/large_models/inferentia2/llama2/continuous_batching/model-config.yaml

+2

@@ -12,6 +12,8 @@ handler:
 model_module_prefix: "transformers_neuronx"
 model_class_name: "llama.model.LlamaForSampling"
 tokenizer_class_name: "transformers.LlamaTokenizer"
+# see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/torch/transformers-neuronx/index.html#known-issues-and-limitations for llama2-13b
+# neuron_cc_flag: "-O1 --model-type=transformer --enable-mixed-precision-accumulation --enable-saturate-infinity"
 amp: "bf16"
 tp_degree: 24
 max_length: 256

examples/large_models/inferentia2/llama2/continuous_batching/requirements.txt

-2
This file was deleted.
@@ -1,3 +1,2 @@
-transformers
-tokenizers
-sentencepiece
+transformers==4.36.2
+sentencepiece==0.1.99

examples/large_models/inferentia2/llama2/streamer/Readme.md

+13 −103

@@ -4,110 +4,20 @@ This document briefs on serving the [Llama 2](https://huggingface.co/meta-llama)
 
 Inferentia2 uses [Neuron SDK](https://aws.amazon.com/machine-learning/neuron/) which is built on top of PyTorch XLA stack. For large model inference [`transformers-neuronx`](https://github.com/aws-neuron/transformers-neuronx) package is used that takes care of model partitioning and running inference.
 
-**Note**: To run the model on an Inf2 instance, the model gets compiled as a preprocessing step. As part of the compilation process, to generate the model graph, a specific batch size is used. Following this, when running inference, we need to pass input which matches the batch size that was used during compilation. Model compilation and input padding to match compiled model batch size is taken care of by the [custom handler](inf2_handler.py) in this example.
+This example can also be extended to support Mistral without code changes. Customers only set the following items in model-config.yaml. For example:
+* model_path: "model/models--meta-llama--Llama-2-70b-hf/snapshots/90052941a64de02075ca800b09fcea1bdaacb939"
+* model_checkpoint_dir: "llama-2-70b-split"
+* model_module_prefix: "transformers_neuronx"
+* model_class_name: "llama.model.LlamaForSampling"
+* tokenizer_class_name: "transformers.LlamaTokenizer"
 
-The batch size and micro batch size configurations are present in [model-config.yaml](model-config.yaml). The batch size indicates the maximum number of requests torchserve will aggregate and send to the custom handler within the batch delay.
-The batch size is chosen to be a relatively large value, say 16 since micro batching enables running the preprocess(tokenization) and inference steps in parallel on the micro batches. The micro batch size is the batch size used for the Inf2 model compilation.
-Since compilation batch size can influence compile time and also constrained by the Inf2 instance type, this is chosen to be a relatively smaller value, say 4.
-
-This example also demonstrates the utilization of neuronx cache to store inf2 model compilation artifacts using the `NEURONX_CACHE` and `NEURONX_DUMP_TO` environment variables in the custom handler.
-When the model is loaded for the first time, the model is compiled for the configured micro batch size and the compilation artifacts are saved to the neuronx cache.
-On subsequent model load, the compilation artifacts in the neuronx cache serves as `Ahead of Time(AOT)` compilation artifacts and significantly reduces the model load time.
-For convenience, the compiled model artifacts for this example are made available on the Torchserve model zoo: `s3://torchserve/mar_files/llama-2-13b-neuronx-b4`\
-Instructions on how to use the AOT compiled model artifacts is shown below.
-
-### Step 1: Inf2 instance
-
-Get an Inf2 instance(Note: This example was tested on instance type:`inf2.24xlarge`), ssh to it, make sure to use the following DLAMI as it comes with PyTorch and necessary packages for AWS Neuron SDK pre-installed.
-DLAMI Name: ` Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230720 Amazon Machine Image (AMI)` or higher.
-
-**Note**: The `inf2.24xlarge` instance consists of 6 neuron chips with 2 neuron cores each. The total accelerator memory is 192GB.
-Based on the configuration used in [model-config.yaml](model-config.yaml), with `tp_degree` set to 6, 3 of the 6 neuron chips are used, i.e 6 neuron cores.
-On loading the model, the accelerator memory consumed is 38.1GB (12.7GB per chip).
-
-### Step 2: Package Installations
-
-Follow the steps below to complete package installations
-
-```bash
-sudo apt-get update
-sudo apt-get upgrade
-
-# Activate Python venv
-source /opt/aws_neuron_venv_pytorch/bin/activate
-
-# Clone Torchserve git repository
-git clone https://github.com/pytorch/serve.git
-cd serve
-
-# Install dependencies
-python ts_scripts/install_dependencies.py --neuronx --environment=dev
-
-# Install torchserve and torch-model-archiver
-python ts_scripts/install_from_src.py
-
-# Navigate to `examples/large_models/inferentia2/llama2` directory
-cd examples/large_models/inferentia2/llama2/
-
-# Install additional necessary packages
-python -m pip install -r requirements.txt
-```
-
-### Step 3: Save the model artifacts compatible with `transformers-neuronx`
-In order to use the pre-compiled model artifacts, copy them from the model zoo using the command shown below and skip to **Step 5**
-```bash
-aws s3 cp s3://torchserve/mar_files/llama-2-13b-neuronx-b4/ llama-2-13b --recursive
-```
+| Model | Model Class |
+| :--- | :----: |
+| llama | llama.model.LlamaForSampling |
+| mistral | mistral.model.MistralForSampling |
 
-In order to download and compile the Llama2 model from scratch for support on Inf2:\
-Request access to the Llama2 model\
-https://huggingface.co/meta-llama/Llama-2-13b-hf
 
-Login to Huggingface
-```bash
-huggingface-cli login
-```
-
-Run the `inf2_save_split_checkpoints.py` script
-```bash
-python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
-```
-
-
-### Step 4: Package model artifacts
-
-```bash
-torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py -r requirements.txt --config-file model-config.yaml --archive-format no-archive
-mv llama-2-13b-split llama-2-13b
-```
-
-### Step 5: Add the model artifacts to model store
-
-```bash
-mkdir model_store
-mv llama-2-13b model_store
-```
-
-### Step 6: Start torchserve
-
-```bash
-torchserve --ncs --start --model-store model_store --ts-config config.properties
-```
-
-### Step 7: Register model
-
-```bash
-curl -X POST "http://localhost:8081/models?url=llama-2-13b"
-```
-
-### Step 8: Run inference
-
-```bash
-python test_stream_response.py
-```
-
-### Step 9: Stop torchserve
+The `batchSize` in [model-config.yaml](model-config.yaml) indicates the maximum number of requests torchserve will aggregate and send to the custom handler within the batch delay. `micro_batch_size` is the batch size used for the Inf2 model compilation.
+Since compilation batch size can influence compile time and also constrained by the Inf2 instance type, this is chosen to be a relatively smaller value, say 4.
 
-```bash
-torchserve --stop
-```
+`inf2-llama-2-micro-batching.ipynb` is the notebook example.
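The new Readme delegates the step-by-step instructions to the micro-batching notebook, while the removed steps ended with `python test_stream_response.py` to consume the streaming response. A minimal client sketch in that spirit is shown below; the endpoint, model name, and plain-text payload are assumptions rather than part of this commit.

```python
# Illustrative streaming client -- not part of this commit. Assumes TorchServe is
# running locally with a streaming-capable model registered as "llama-2-13b".
import requests

response = requests.post(
    "http://localhost:8080/predictions/llama-2-13b",
    data="Today the weather is really nice and I am planning on ",
    stream=True,  # keep the connection open and read the chunked response
)
response.raise_for_status()

# Each chunk carries the text generated since the previous chunk; print it as it arrives.
for chunk in response.iter_content(chunk_size=None):
    print(chunk.decode("utf-8"), end="", flush=True)
print()
```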
