
Commit 8f3aadd

Merge branch 'master' of github.com:m10an/TrochServe into load_models_all_targz

2 parents 8b391bc + d993070 · commit 8f3aadd

31 files changed: +488 −207 lines

README.md (+13 −2)

````diff
@@ -62,12 +62,23 @@ Refer to [torchserve docker](docker/README.md) for details.
 
 ### 🤖 Quick Start LLM Deployment
 
+#### VLLM Engine
 ```bash
 # Make sure to install torchserve with pip or conda as described above and login with `huggingface-cli login`
-python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3-8B-Instruct --disable_token_auth
+python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct --disable_token_auth
 
 # Try it out
-curl -X POST -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model/1.0/v1/completions"
+curl -X POST -d '{"model":"meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model/1.0/v1/completions"
+```
+
+#### TRT-LLM Engine
+```bash
+# Make sure to install torchserve with python venv as described above and login with `huggingface-cli login`
+# pip install -U --use-deprecated=legacy-resolver -r requirements/trt_llm.txt
+python -m ts.llm_launcher --model_id meta-llama/Meta-Llama-3.1-8B-Instruct --engine trt_llm --disable_token_auth
+
+# Try it out
+curl -X POST -d '{"prompt":"count from 1 to 9 in french ", "max_tokens": 100}' --header "Content-Type: application/json" "http://localhost:8080/predictions/model"
 ```
 
 ### 🚢 Quick Start LLM Deployment with Docker
````
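For reference, a hedged Python sketch of the "Try it out" request against the vLLM engine's OpenAI-style completions endpoint added above. It assumes a local TorchServe instance started with the launcher command from the diff (model registered as `model`, version `1.0`, inference port 8080) and uses the third-party `requests` package, which is not part of this commit:

```python
# Sketch only: mirrors the curl example from the README diff above.
import requests

payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "Hello, my name is",
    "max_tokens": 200,
}
resp = requests.post(
    "http://localhost:8080/predictions/model/1.0/v1/completions",
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```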

benchmarks/utils/system_under_test.py (+26)

````diff
@@ -113,6 +113,7 @@ def start(self):
         execute("torchserve --stop", wait=True)
         click.secho("*Setting up model store...", fg="green")
         self._prepare_local_dependency()
+        self._clear_neuron_cache_if_exists()
         click.secho("*Starting local Torchserve instance...", fg="green")
 
         ts_cmd = (
@@ -141,6 +142,31 @@ def start(self):
             if "Model server started" in str(line).strip():
                 break
 
+    def _clear_neuron_cache_if_exists(self):
+        cache_dir = "/var/tmp/neuron-compile-cache/"
+
+        # Check if the directory exists
+        if os.path.exists(cache_dir) and os.path.isdir(cache_dir):
+            click.secho(
+                f"Directory {cache_dir} exists. Clearing contents...", fg="green"
+            )
+
+            # Remove the directory contents
+            for filename in os.listdir(cache_dir):
+                file_path = os.path.join(cache_dir, filename)
+                try:
+                    if os.path.isfile(file_path) or os.path.islink(file_path):
+                        os.unlink(file_path)
+                    elif os.path.isdir(file_path):
+                        shutil.rmtree(file_path)
+                except Exception as e:
+                    click.secho(f"Failed to delete {file_path}. Reason: {e}", fg="red")
+            click.secho(f"Cache cleared: {cache_dir}", fg="green")
+        else:
+            click.secho(
+                f"Directory {cache_dir} does not exist. No action taken.", fg="green"
+            )
+
     def stop(self):
         click.secho("*Terminating Torchserve instance...", fg="green")
         execute("torchserve --stop", wait=True)
````

docker/Dockerfile (+2 −2)

````diff
@@ -73,7 +73,7 @@ COPY ./ serve
 RUN \
     if echo "$LOCAL_CHANGES" | grep -q "false"; then \
         rm -rf serve;\
-        git clone --recursive $REPO_URL -b $BRANCH_NAME; \
+        git clone --recursive $REPO_URL -b $BRANCH_NAME serve; \
     fi
 
 
@@ -238,7 +238,7 @@ COPY ./ serve
 RUN \
     if echo "$LOCAL_CHANGES" | grep -q "false"; then \
         rm -rf serve;\
-        git clone --recursive $REPO_URL -b $BRANCH_NAME; \
+        git clone --recursive $REPO_URL -b $BRANCH_NAME serve; \
     fi
 
 COPY --from=compile-image /home/venv /home/venv
````

examples/large_models/trt_llm/llama/README.md (+18 −15)

````diff
@@ -4,19 +4,19 @@
 
 ## Pre-requisites
 
-TRT-LLM requires Python 3.10
+- TRT-LLM requires Python 3.10
+- TRT-LLM works well with python venv (vs conda)
 This example is tested with CUDA 12.1
 Once TorchServe is installed, install TensorRT-LLM using the following.
-This will downgrade the versions of PyTorch & Triton but this doesn't cause any issue.
 
 ```
-pip install tensorrt_llm==0.10.0 --extra-index-url https://pypi.nvidia.com
-pip install tensorrt-cu12==10.1.0
+pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
+pip install transformers>=4.44.2
 python -c "import tensorrt_llm"
 ```
 shows
 ```
-[TensorRT-LLM] TensorRT-LLM version: 0.10.0
+[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024090300
 ```
 
 ## Download model from HuggingFace
@@ -26,29 +26,32 @@ huggingface-cli login
 huggingface-cli login --token $HUGGINGFACE_TOKEN
 ```
 ```
-python ../../utils/Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3-8B-Instruct
+python ../../utils/Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3.1-8B-Instruct --use_auth_token True
 ```
 
 ## Create TensorRT-LLM Engine
 Clone TensorRT-LLM which will be used to create the TensorRT-LLM Engine
 
 ```
-git clone -b v0.10.0 https://github.com/NVIDIA/TensorRT-LLM.git
+git clone https://github.com/NVIDIA/TensorRT-LLM.git
 ```
 
 Compile the model into a TensorRT engine with model weights and a model definition written in the TensorRT-LLM Python API.
 
 ```
-python TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/ --output_dir ./tllm_checkpoint_1gpu_bf16 --dtype bfloat16
+python TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f/ --output_dir ./tllm_checkpoint_1gpu_bf16 --dtype bfloat16
 ```
+
 ```
-trtllm-build --checkpoint_dir tllm_checkpoint_1gpu_bf16 --gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --output_dir ./llama-3-8b-engine
+trtllm-build --checkpoint_dir tllm_checkpoint_1gpu_bf16 --gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --max_batch_size 4 --output_dir ./llama-3.1-8b-engine
 ```
+If you have enough GPU memory, you can try increasing the `max_batch_size`
 
 You can test if TensorRT-LLM Engine has been compiled correctly by running the following
 ```
-python TensorRT-LLM/examples/run.py --engine_dir ./llama-3-8b-engine --max_output_len 100 --tokenizer_dir model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/ --input_text "How do I count to nine in French?"
+python TensorRT-LLM/examples/run.py --engine_dir ./llama-3.1-8b-engine --max_output_len 100 --tokenizer_dir model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f/ --input_text "How do I count to nine in French?"
 ```
+If you are running into OOM, try reducing `kv_cache_free_gpu_memory_fraction`
 
 You should see an output as follows
 ```
@@ -70,17 +73,17 @@ That's it! You can now count to nine in French. Just remember that the numbers o
 
 ```
 mkdir model_store
-torch-model-archiver --model-name llama3-8b --version 1.0 --handler trt_llm_handler.py --config-file model-config.yaml --archive-format no-archive --export-path model_store -f
-mv model model_store/llama3-8b/.
-mv llama-3-8b-engine model_store/llama3-8b/.
+torch-model-archiver --model-name llama3.1-8b --version 1.0 --handler trt_llm_handler --config-file model-config.yaml --archive-format no-archive --export-path model_store -f
+mv model model_store/llama3.1-8b/.
+mv llama-3.1-8b-engine model_store/llama3.1-8b/.
 ```
 
 ## Start TorchServe
 ```
-torchserve --start --ncs --model-store model_store --models llama3-8b --disable-token-auth
+torchserve --start --ncs --model-store model_store --models llama3.1-8b --disable-token-auth
 ```
 
 ## Run Inference
 ```
-python ../../utils/test_llm_streaming_response.py -o 50 -t 2 -n 4 -m llama3-8b --prompt-text "@prompt.json" --prompt-json
+python ../../utils/test_llm_streaming_response.py -o 50 -t 2 -n 4 -m llama3.1-8b --prompt-text "@prompt.json" --prompt-json
 ```
````

examples/large_models/trt_llm/llama/model-config.yaml (+4 −3)

````diff
@@ -7,6 +7,7 @@ deviceType: "gpu"
 asyncCommunication: true
 
 handler:
-    tokenizer_dir: "model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e1945c40cd546c78e41f1151f4db032b271faeaa/"
-    trt_llm_engine_config:
-        engine_dir: "llama-3-8b-engine"
+    tokenizer_dir: "model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f/"
+    engine_dir: "llama-3.1-8b-engine"
+    kv_cache_config:
+        free_gpu_memory_fraction: 0.1
````
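The change flattens the handler section: `engine_dir` moves out of the removed `trt_llm_engine_config` block and sits directly under `handler`, next to a new `kv_cache_config` group. A small sketch of how the resulting fragment parses (PyYAML is used here only for illustration and is not part of this commit):

```python
# Sketch only: inspect the handler section as it looks after this change.
import yaml  # PyYAML

fragment = """
handler:
    tokenizer_dir: "model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f/"
    engine_dir: "llama-3.1-8b-engine"
    kv_cache_config:
        free_gpu_memory_fraction: 0.1
"""
handler_cfg = yaml.safe_load(fragment)["handler"]
assert handler_cfg["engine_dir"] == "llama-3.1-8b-engine"
assert handler_cfg["kv_cache_config"]["free_gpu_memory_fraction"] == 0.1
```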
Example prompt JSON (+2 −1; the file name was not captured in this view, but the hunk matches the payload referenced as `@prompt.json` in the READMEs above)

````diff
@@ -1,3 +1,4 @@
 {"prompt": "How is the climate in San Francisco?",
 "temperature":0.5,
-"max_new_tokens": 200}
+"max_tokens": 400,
+"streaming": true}
````

examples/large_models/trt_llm/llama/trt_llm_handler.py (−118)

This file was deleted.
New file (+83; the file path was not captured in this view): README for the Llama TensorRT-LLM Engine + LoRA example

````diff
@@ -0,0 +1,83 @@
+# Llama TensorRT-LLM Engine + LoRA model integration with TorchServe
+
+[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) provides users with an option to build TensorRT engines for LLMs that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
+
+## Pre-requisites
+
+- TRT-LLM requires Python 3.10
+- TRT-LLM works well with python venv (vs conda)
+This example is tested with CUDA 12.1
+Once TorchServe is installed, install TensorRT-LLM using the following.
+
+```
+pip install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
+pip install transformers>=4.44.2
+python -c "import tensorrt_llm"
+```
+shows
+```
+[TensorRT-LLM] TensorRT-LLM version: 0.13.0.dev2024090300
+```
+
+## Download Base model & LoRA adapter from Hugging Face
+```
+huggingface-cli login
+# or using an environment variable
+huggingface-cli login --token $HUGGINGFACE_TOKEN
+```
+```
+python ../../utils/Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3.1-8B-Instruct --use_auth_token True
+python ../../utils/Download_model.py --model_path model --model_name llama-duo/llama3.1-8b-summarize-gpt4o-128k --use_auth_token True
+```
+
+## Create TensorRT-LLM Engine
+Clone TensorRT-LLM which will be used to create the TensorRT-LLM Engine
+
+```
+git clone https://github.com/NVIDIA/TensorRT-LLM.git
+```
+
+Compile the model into a TensorRT engine with model weights and a model definition written in the TensorRT-LLM Python API.
+
+```
+python TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f --output_dir ./tllm_checkpoint_1gpu_bf16 --dtype bfloat16
+```
+
+```
+trtllm-build --checkpoint_dir tllm_checkpoint_1gpu_bf16 --gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --output_dir ./llama-3.1-8b-engine-lora --max_batch_size 4 --lora_dir model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/4ba83353f24fa38946625c8cc49bf21c80a22825 --lora_plugin bfloat16
+```
+If you have enough GPU memory, you can try increasing the `max_batch_size`
+
+You can test if TensorRT-LLM Engine has been compiled correctly by running the following
+```
+python TensorRT-LLM/examples/run.py --engine_dir ./llama-3.1-8b-engine-lora --max_output_len 100 --tokenizer_dir model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/4ba83353f24fa38946625c8cc49bf21c80a22825 --input_text "Amanda: I baked cookies. Do you want some?\nJerry: Sure \nAmanda: I will bring you tomorrow :-)\n\nSummarize the dialog:" --lora_dir model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/4ba83353f24fa38946625c8cc49bf21c80a22825 --kv_cache_free_gpu_memory_fraction 0.3 --use_py_session
+```
+If you are running into OOM, try reducing `kv_cache_free_gpu_memory_fraction`
+
+You should see an output as follows
+```
+Input [Text 0]: "<|begin_of_text|>Amanda: I baked cookies. Do you want some?\nJerry: Sure \nAmanda: I will bring you tomorrow :-)\n\nSummarize the dialog:"
+Output [Text 0 Beam 0]: " Amanda offered Jerry cookies and said she would bring them to him tomorrow.
+Amanda offered Jerry cookies and said she would bring them to him tomorrow.
+The dialogue is between Amanda and Jerry. Amanda offers Jerry cookies and says she will bring them to him tomorrow. The dialogue is a simple exchange between two people, with no complex plot or themes. The tone is casual and friendly. The dialogue is a good example of a short, everyday conversation.
+The dialogue is a good example of a short,"
+```
+
+## Create model archive
+
+```
+mkdir model_store
+torch-model-archiver --model-name llama3.1-8b --version 1.0 --handler trt_llm_handler --config-file model-config.yaml --archive-format no-archive --export-path model_store -f
+mv model model_store/llama3.1-8b/.
+mv llama-3.1-8b-engine-lora model_store/llama3.1-8b/.
+```
+
+## Start TorchServe
+```
+torchserve --start --ncs --model-store model_store --models llama3.1-8b --disable-token-auth
+```
+
+## Run Inference
+```
+python ../../utils/test_llm_streaming_response.py -o 50 -t 2 -n 4 -m llama3.1-8b --prompt-text "@prompt.json" --prompt-json
+```
````
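Before running the streaming test client, it can help to confirm that the archive registered correctly. A hedged sketch using TorchServe's management API (the default management port 8081 is an assumption of a stock configuration; this check is not part of the commit):

```python
# Sketch only: verify the llama3.1-8b model is registered and has live workers.
import requests

resp = requests.get("http://localhost:8081/models/llama3.1-8b", timeout=30)
resp.raise_for_status()
print(resp.json())  # model version and per-worker status
```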
