
Commit 192785a
Merge branch 'docker_aarch' of https://github.com/pytorch/serve into docker_aarch
2 parents a3079a4 + a45eb7c

20 files changed: +450 -185 lines changed

README.md

+13-2
@@ -62,12 +62,23 @@ Refer to [torchserve docker](docker/README.md) for details.
 
 ### 🤖 Quick Start LLM Deployment
 
+#### VLLM Engine
 ```bash
 # Make sure to install torchserve with pip or conda as described above and login with `huggingface-cli login`
-python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth
+python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct --disable_token_auth
 
 # Try it out
-curl -X POST -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model/1.0/v1/completions"
+curl -X POST -d '{"model":"meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model/1.0/v1/completions"
+```
+
+#### TRT-LLM Engine
+```bash
+# Make sure to install torchserve with python venv as described above and login with `huggingface-cli login`
+# pip install -U --use-deprecated=legacy-resolver -r requirements/trt_llm.txt
+python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct --engine trt_llm --disable_token_auth
+
+# Try it out
+curl -X POST -d '{"prompt":"count from 1 to 9 in french ", "max_tokens": 100}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
 ```
 
 ### 🚢 Quick Start LLM Deployment with Docker
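The curl call in the VLLM quick start above is a plain HTTP POST against the OpenAI-style completions route, so it can just as easily be sent from Python. A minimal sketch (not part of the diff), assuming TorchServe is listening on the default inference port 8080 and the launcher registered the model under the name `model` as in the example:

```python
# Minimal sketch: the quick-start completions request sent with Python's
# requests library instead of curl. Assumes TorchServe runs on localhost:8080
# and the vLLM launcher registered the model under the name "model".
import requests

payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Hello, my name is",
    "max_tokens": 200,
}
resp = requests.post(
    "http://localhost:8080/predictions/model/1.0/v1/completions",
    json=payload,
    timeout=120,
)
print(resp.text)
```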

examples/large_models/trt_llm/llama/README.md

+18-15
@@ -4,19 +4,19 @@
 
 ## Pre-requisites
 
-TRT-LLM requires Python 3.10
+- TRT-LLM requires Python 3.10
+- TRT-LLM works well with python venv (vs conda)
 This example is tested with CUDA 12.1
 Once TorchServe is installed, install TensorRT-LLM using the following.
-This will downgrade the versions of PyTorch & Triton but this doesn't cause any issue.
 
 ```
-pip install tensorrt_llm==0.10.0 --extra-index-url https://pypi.nvidia.com
-pip install tensorrt-cu12==10.1.0
+pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
+pip install transformers>=4.44.2
 python -c "import tensorrt_llm"
 ```
 shows
 ```
-[TensorRT-LLM] TensorRT-LLM version: 0.10.0
+[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024090300
 ```
 
 ## Download model from HuggingFace
@@ -26,29 +26,32 @@ huggingface-cli login
 huggingface-cli login --token $HUGGINGFACE_TOKEN
 ```
 ```
-python ../../utils/Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3-8B-Instruct
+python ../../utils/Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3.1-8B-Instruct --use_auth_token True
 ```
 
 ## Create TensorRT-LLM Engine
 Clone TensorRT-LLM which will be used to create the TensorRT-LLM Engine
 
 ```
-git clone -b v0.10.0 https://github.com/NVIDIA/TensorRT-LLM.git
+git clone https://github.com/NVIDIA/TensorRT-LLM.git
 ```
 
 Compile the model into a TensorRT engine with model weights and a model definition written in the TensorRT-LLM Python API.
 
 ```
-python TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/ --output_dir ./tllm_checkpoint_1gpu_bf16 --dtype bfloat16
+python TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f/ --output_dir ./tllm_checkpoint_1gpu_bf16 --dtype bfloat16
 ```
+
 ```
-trtllm-build --checkpoint_dir tllm_checkpoint_1gpu_bf16 --gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --output_dir ./llama-3-8b-engine
+trtllm-build --checkpoint_dir tllm_checkpoint_1gpu_bf16 --gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --max_batch_size 4 --output_dir ./llama-3.1-8b-engine
 ```
+If you have enough GPU memory, you can try increasing the `max_batch_size`
 
 You can test if TensorRT-LLM Engine has been compiled correctly by running the following
 ```
-python TensorRT-LLM/examples/run.py --engine_dir ./llama-3-8b-engine --max_output_len 100 --tokenizer_dir model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/ --input_text "How do I count to nine in French?"
+python TensorRT-LLM/examples/run.py --engine_dir ./llama-3.1-8b-engine --max_output_len 100 --tokenizer_dir model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f/ --input_text "How do I count to nine in French?"
 ```
+If you are running into OOM, try reducing `kv_cache_free_gpu_memory_fraction`
 
 You should see an output as follows
 ```
@@ -70,17 +73,17 @@ That's it! You can now count to nine in French. Just remember that the numbers o
 
 ```
 mkdir model_store
-torch-model-archiver --model-name llama3-8b --version 1.0 --handler trt_llm_handler.py --config-file model-config.yaml --archive-format no-archive --export-path model_store -f
-mv model model_store/llama3-8b/.
-mv llama-3-8b-engine model_store/llama3-8b/.
+torch-model-archiver --model-name llama3.1-8b --version 1.0 --handler trt_llm_handler --config-file model-config.yaml --archive-format no-archive --export-path model_store -f
+mv model model_store/llama3.1-8b/.
+mv llama-3.1-8b-engine model_store/llama3.1-8b/.
 ```
 
 ## Start TorchServe
 ```
-torchserve --start --ncs --model-store model_store --models llama3-8b --disable-token-auth
+torchserve --start --ncs --model-store model_store --models llama3.1-8b --disable-token-auth
 ```
 
 ## Run Inference
 ```
-python ../../utils/test_llm_streaming_response.py -o 50 -t 2 -n 4 -m llama3-8b --prompt-text "@prompt.json" --prompt-json
+python ../../utils/test_llm_streaming_response.py -o 50 -t 2 -n 4 -m llama3.1-8b --prompt-text "@prompt.json" --prompt-json
 ```
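The Run Inference step drives the `llama3.1-8b` endpoint through `test_llm_streaming_response.py`. As a rough sketch only (assuming TorchServe is up on the default port 8080 and this example's `prompt.json` sits in the working directory), an equivalent streaming request can be sent directly with Python's `requests` library:

```python
# Rough sketch: stream the generated text from the llama3.1-8b endpoint.
# Assumes TorchServe runs on localhost:8080 and prompt.json sets
# "streaming": true so the response arrives as chunked text.
import json
import requests

with open("prompt.json") as f:
    payload = json.load(f)

with requests.post(
    "http://localhost:8080/predictions/llama3.1-8b",
    json=payload,
    stream=True,
) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None):
        if chunk:
            print(chunk.decode("utf-8"), end="", flush=True)
```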

examples/large_models/trt_llm/llama/model-config.yaml

+4-3
@@ -7,6 +7,7 @@ deviceType: "gpu"
 asyncCommunication: true
 
 handler:
-    tokenizer_dir: "model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/"
-    trt_llm_engine_config:
-        engine_dir: "llama-3-8b-engine"
+    tokenizer_dir: "model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f/"
+    engine_dir: "llama-3.1-8b-engine"
+    kv_cache_config:
+        free_gpu_memory_fraction: 0.1
examples/large_models/trt_llm/llama/prompt.json

+2-1
@@ -1,3 +1,4 @@
 {"prompt": "How is the climate in San Francisco?",
 "temperature":0.5,
-"max_new_tokens": 200}
+"max_tokens": 400,
+"streaming": true}

examples/large_models/trt_llm/llama/trt_llm_handler.py

-118
This file was deleted.
examples/large_models/trt_llm/lora/README.md

+83
@@ -0,0 +1,83 @@
+# Llama TensorRT-LLM Engine + LoRA model integration with TorchServe
+
+[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) provides users with an option to build TensorRT engines for LLMs that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
+
+## Pre-requisites
+
+- TRT-LLM requires Python 3.10
+- TRT-LLM works well with python venv (vs conda)
+This example is tested with CUDA 12.1
+Once TorchServe is installed, install TensorRT-LLM using the following.
+
+```
+pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
+pip install transformers>=4.44.2
+python -c "import tensorrt_llm"
+```
+shows
+```
+[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024090300
+```
+
+## Download Base model & LoRA adapter from Hugging Face
+```
+huggingface-cli login
+# or using an environment variable
+huggingface-cli login --token $HUGGINGFACE_TOKEN
+```
+```
+python ../../utils/Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3.1-8B-Instruct --use_auth_token True
+python ../../utils/Download_model.py --model_path model --model_name llama-duo/llama3.1-8b-summarize-gpt4o-128k --use_auth_token True
+```
+
+## Create TensorRT-LLM Engine
+Clone TensorRT-LLM which will be used to create the TensorRT-LLM Engine
+
+```
+git clone https://github.com/NVIDIA/TensorRT-LLM.git
+```
+
+Compile the model into a TensorRT engine with model weights and a model definition written in the TensorRT-LLM Python API.
+
+```
+python TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f --output_dir ./tllm_checkpoint_1gpu_bf16 --dtype bfloat16
+```
+
+```
+trtllm-build --checkpoint_dir tllm_checkpoint_1gpu_bf16 --gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --output_dir ./llama-3.1-8b-engine-lora --max_batch_size 4 --lora_dir model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/4ba83353f24fa38946625c8cc49bf21c80a22825 --lora_plugin bfloat16
+```
+If you have enough GPU memory, you can try increasing the `max_batch_size`
+
+You can test if TensorRT-LLM Engine has been compiled correctly by running the following
+```
+python TensorRT-LLM/examples/run.py --engine_dir ./llama-3.1-8b-engine-lora --max_output_len 100 --tokenizer_dir model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/4ba83353f24fa38946625c8cc49bf21c80a22825 --input_text "Amanda: I baked cookies. Do you want some?\nJerry: Sure \nAmanda: I will bring you tomorrow :-)\n\nSummarize the dialog:" --lora_dir model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/4ba83353f24fa38946625c8cc49bf21c80a22825 --kv_cache_free_gpu_memory_fraction 0.3 --use_py_session
+```
+If you are running into OOM, try reducing `kv_cache_free_gpu_memory_fraction`
+
+You should see an output as follows
+```
+Input [Text 0]: "<|begin_of_text|>Amanda: I baked cookies. Do you want some?\nJerry: Sure \nAmanda: I will bring you tomorrow :-)\n\nSummarize the dialog:"
+Output [Text 0 Beam 0]: " Amanda offered Jerry cookies and said she would bring them to him tomorrow.
+Amanda offered Jerry cookies and said she would bring them to him tomorrow.
+The dialogue is between Amanda and Jerry. Amanda offers Jerry cookies and says she will bring them to him tomorrow. The dialogue is a simple exchange between two people, with no complex plot or themes. The tone is casual and friendly. The dialogue is a good example of a short, everyday conversation.
+The dialogue is a good example of a short,"
+```
+
+## Create model archive
+
+```
+mkdir model_store
+torch-model-archiver --model-name llama3.1-8b --version 1.0 --handler trt_llm_handler --config-file model-config.yaml --archive-format no-archive --export-path model_store -f
+mv model model_store/llama3.1-8b/.
+mv llama-3.1-8b-engine-lora model_store/llama3.1-8b/.
+```
+
+## Start TorchServe
+```
+torchserve --start --ncs --model-store model_store --models llama3.1-8b --disable-token-auth
+```
+
+## Run Inference
+```
+python ../../utils/test_llm_streaming_response.py -o 50 -t 2 -n 4 -m llama3.1-8b --prompt-text "@prompt.json" --prompt-json
+```
examples/large_models/trt_llm/lora/model-config.yaml

+13
@@ -0,0 +1,13 @@
+# TorchServe frontend parameters
+minWorkers: 1
+maxWorkers: 1
+maxBatchDelay: 100
+responseTimeout: 1200
+deviceType: "gpu"
+asyncCommunication: true
+
+handler:
+    tokenizer_dir: "model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/4ba83353f24fa38946625c8cc49bf21c80a22825"
+    engine_dir: "llama-3.1-8b-engine-lora"
+    kv_cache_config:
+        free_gpu_memory_fraction: 0.1
examples/large_models/trt_llm/lora/prompt.json

+4
@@ -0,0 +1,4 @@
+{"prompt": "Amanda: I baked cookies. Do you want some?\nJerry: Sure \nAmanda: I will bring you tomorrow :-)\n\nSummarize the dialog:",
+"temperature":0.0,
+"max_new_tokens": 100,
+"streaming": true}

examples/large_models/vllm/llama3/model-config.yaml

+1-1
@@ -2,7 +2,7 @@
 minWorkers: 1
 maxWorkers: 1
 maxBatchDelay: 100
-responseTimeout: 1200
+startupTimeout: 1200
 deviceType: "gpu"
 asyncCommunication: true
 
examples/large_models/vllm/lora/Readme.md

+1-1
@@ -55,7 +55,7 @@ The vllm integration uses an OpenAI compatible interface which lets you perform
 
 Curl:
 ```bash
-curl --header "Content-Type: application/json" --request POST --data @prompt.json http://localhost:8080/predictions/llama-8b-lora/1.0/v1
+curl --header "Content-Type: application/json" --request POST --data @prompt.json http://localhost:8080/predictions/llama-8b-lora/1.0/v1/completions
 ```
 
 Python + Request:

examples/large_models/vllm/lora/model-config.yaml

+1-1
@@ -2,7 +2,7 @@
 minWorkers: 1
 maxWorkers: 1
 maxBatchDelay: 100
-responseTimeout: 1200
+startupTimeout: 1200
 deviceType: "gpu"
 asyncCommunication: true
 

examples/large_models/vllm/mistral/model-config.yaml

+1-1
@@ -2,7 +2,7 @@
 minWorkers: 1
 maxWorkers: 1
 maxBatchDelay: 100
-responseTimeout: 1200
+startupTimeout: 1200
 deviceType: "gpu"
 asyncCommunication: true
 

+1-1
@@ -1 +1 @@
-vllm==0.5.0
+vllm==0.6.1.post2
