curl -X POST -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json""http://localhost:8080/predictions/model/1.0/v1/completions"
71
+
curl -X POST -d '{"model":"meta-llama/Meta-Llama-3.1-8B-Instruct", "prompt":"Hello, my name is", "max_tokens": 200}' --header "Content-Type: application/json""http://localhost:8080/predictions/model/1.0/v1/completions"
72
+
```
#### TRT-LLM Engine
```bash
# Make sure to install TorchServe in a Python venv as described above and log in with `huggingface-cli login`
curl -X POST -d '{"prompt":"count from 1 to 9 in french ", "max_tokens": 100}' --header "Content-Type: application/json""http://localhost:8080/predictions/model"
If you have enough GPU memory, you can try increasing the `max_batch_size`
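`max_batch_size` is fixed at engine build time, so increasing it means rebuilding the engine. The original build command is not shown in this excerpt; as a sketch, assuming a converted checkpoint in `./llama-3.1-8b-ckpt` (directory names here are illustrative), a rebuild with a larger batch size would look like:

```bash
# Rebuild the engine with a larger max_batch_size (checkpoint/output dirs are assumptions)
trtllm-build --checkpoint_dir ./llama-3.1-8b-ckpt \
             --gemm_plugin bfloat16 \
             --max_batch_size 8 \
             --output_dir ./llama-3.1-8b-engine
```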
You can test whether the TensorRT-LLM engine has been compiled correctly by running the following:
```
python TensorRT-LLM/examples/run.py --engine_dir ./llama-3.1-8b-engine --max_output_len 100 --tokenizer_dir model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f/ --input_text "How do I count to nine in French?"
```
If you are running into OOM, try reducing `kv_cache_free_gpu_memory_fraction`
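For example, you can pass a smaller fraction to the same test command; `--kv_cache_free_gpu_memory_fraction` caps how much of the remaining GPU memory the KV cache may claim (the value `0.2` below is just an illustrative starting point):

```bash
# Re-run the test, letting the KV cache use at most 20% of free GPU memory
python TensorRT-LLM/examples/run.py --engine_dir ./llama-3.1-8b-engine --max_output_len 100 --tokenizer_dir model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/5206a32e0bd3067aef1ce90f5528ade7d866253f/ --input_text "How do I count to nine in French?" --kv_cache_free_gpu_memory_fraction 0.2
```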
You should see an output like the following:
```
That's it! You can now count to nine in French. Just remember that the numbers o
```
# Llama TensorRT-LLM Engine + LoRA model integration with TorchServe
[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) lets users build TensorRT engines for LLMs that incorporate state-of-the-art optimizations for efficient inference on NVIDIA GPUs.
## Pre-requisites
- TRT-LLM requires Python 3.10
- TRT-LLM works well with python venv (vs conda)
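A minimal venv setup consistent with these prerequisites might look like the following (the venv directory name and the pip-based TorchServe install are assumptions, not from the original, which may instead use the repo's `install_dependencies.py` script):

```bash
# Create and activate a Python 3.10 virtual environment
python3.10 -m venv trtllm-venv
source trtllm-venv/bin/activate
# Install TorchServe and the model archiver into the venv
pip install torchserve torch-model-archiver
```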
This example is tested with CUDA 12.1
Once TorchServe is installed, install TensorRT-LLM using the following.
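The exact install command is elided in this excerpt. TensorRT-LLM wheels are published on NVIDIA's PyPI index, so a representative (unpinned) install would be:

```bash
# Install TensorRT-LLM from NVIDIA's package index (version pin omitted; an assumption)
pip install tensorrt_llm --extra-index-url https://pypi.nvidia.com
```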
If you have enough GPU memory, you can try increasing the `max_batch_size`
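A representative `trtllm-build --max_batch_size` invocation is sketched in the engine section earlier in this document; the same flag applies when rebuilding the LoRA engine.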
You can test whether the TensorRT-LLM engine has been compiled correctly by running the following:
```
python TensorRT-LLM/examples/run.py --engine_dir ./llama-3.1-8b-engine-lora --max_output_len 100 --tokenizer_dir model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/4ba83353f24fa38946625c8cc49bf21c80a22825 --input_text "Amanda: I baked cookies. Do you want some?\nJerry: Sure \nAmanda: I will bring you tomorrow :-)\n\nSummarize the dialog:" --lora_dir model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/4ba83353f24fa38946625c8cc49bf21c80a22825 --kv_cache_free_gpu_memory_fraction 0.3 --use_py_session
```
If you are running into OOM, try reducing `kv_cache_free_gpu_memory_fraction`
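For example, the test command above already passes `--kv_cache_free_gpu_memory_fraction 0.3`; lowering that value further (say, to `0.2`) leaves more GPU memory for weights and activations.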
You should see an output like the following:
```
Input [Text 0]: "<|begin_of_text|>Amanda: I baked cookies. Do you want some?\nJerry: Sure \nAmanda: I will bring you tomorrow :-)\n\nSummarize the dialog:"
Output [Text 0 Beam 0]: " Amanda offered Jerry cookies and said she would bring them to him tomorrow.
Amanda offered Jerry cookies and said she would bring them to him tomorrow.
The dialogue is between Amanda and Jerry. Amanda offers Jerry cookies and says she will bring them to him tomorrow. The dialogue is a simple exchange between two people, with no complex plot or themes. The tone is casual and friendly. The dialogue is a good example of a short, everyday conversation.
```