Asynchronous worker communication and vllm integration (#3146)
* Added dummy async comm worker thread
* First version of async worker in frontend running
* [WIP]Running async worker but requests get corrupted if parallel
* First version running with thread feeding + async predict
* shorten vllm test time
* Added AsyncVLLMEngine
* Extend vllm test with multiple possible prompts
* Batch size =1 and remove stream in test
* Switched vllm examples to async comm and added llama3 example
* Fix typo
* Corrected java file formatting
* Cleanup and silent chatty debug message
* Added multi-gpu support to vllm examples
* fix java format
* Remove debugging messages
* Fix async comm worker test
* Added cl_socket to fixture
* Added multi worker note to vllm example readme
* Disable tests
* Enable async worker comm test
* Debug CI
* Fix python version <= 3.9 issue in async worker
* Renamed async worker test
* Update frontend/server/src/main/java/org/pytorch/serve/wlm/AsyncBatchAggregator.java
Remove job from jobs_in_backend on error
Co-authored-by: Naman Nandan <[email protected]>
* Unskip vllm example test
* Clean up async worker code
* Safely remove jobs from jobs_in_backend
* Let worker die if one of the threads in async service dies
* Add description of parallelLevel and parallelType=custom to docs/large_model_inference.md
* Added description of parallelLevel to model-archiver readme.md
* fix typo + added words
* Fix skip condition for vllm example test
---------
Co-authored-by: Naman Nandan <[email protected]>
docs/large_model_inference.md (+27 -3)
This document explains how TorchServe supports large model serving; here, large model refers to models that cannot fit into one GPU and therefore need to be split into multiple partitions over multiple GPUs.
This page is split into the following sections:

- [How it works](#how-it-works)
- [Large Model Inference with vLLM](#pippy-pytorch-native-solution-for-large-model-inference)
- [Large Model Inference with PiPPy](#pippy-pytorch-native-solution-for-large-model-inference)
- [Large Model Inference with Deep Speed](#deepspeed)
- [Deep Speed MII](#deepspeed-mii)
## How it works?

For GPU inference of smaller models, TorchServe executes a single process per worker, which gets assigned a single GPU.
For large model inference, the model needs to be split over multiple GPUs.
There are different modes to achieve this split, which usually include pipeline parallel (PP), tensor parallel (TP), or a combination of these.
Which mode is selected and how the split is implemented depends on the implementation in the utilized framework.
TorchServe allows users to utilize any framework for their model deployment and tries to accommodate the needs of the frameworks through flexible configurations.
Some frameworks require a separate process for each of the GPUs (PiPPy, Deep Speed), while others require a single process that gets assigned all GPUs (vLLM).
In case multiple processes are required, TorchServe utilizes [torchrun](https://pytorch.org/docs/stable/elastic/run.html) to set up the distributed environment for the worker.
During the setup, `torchrun` will start a new process for each GPU assigned to the worker.
Whether torchrun is utilized or not depends on the parameter parallelType, which can be set in the `model-config.yaml` to one of the following options:

* `pp` - for pipeline parallel
* `tp` - for tensor parallel
* `pptp` - for pipeline + tensor parallel
* `custom`

The first three options set up the environment using torchrun, while the "custom" option leaves the way of parallelization to the user and assigns the GPUs of a worker to a single process.
The number of assigned GPUs is determined either by the number of processes started by torchrun, i.e. configured through nproc-per-node, OR by the parameter parallelLevel.
This means that the parameter parallelLevel should NOT be set if nproc-per-node is set, and vice versa.

By default, TorchServe uses a round-robin algorithm to assign GPUs to a worker on a host.
For large model inference, the GPUs assigned to each worker are automatically calculated based on the number of GPUs specified in the model_config.yaml.
CUDA_VISIBLE_DEVICES is set based on this number.

For instance, suppose there are eight GPUs on a node and one worker needs 4 GPUs (i.e., nproc-per-node=4 OR parallelLevel=4).
In this case, TorchServe would assign CUDA_VISIBLE_DEVICES="0,1,2,3" to worker1 and CUDA_VISIBLE_DEVICES="4,5,6,7" to worker2.

In addition to this default behavior, TorchServe provides the flexibility for users to specify GPUs for a worker. For instance, if the user sets "deviceIds: [2,3,4,5]" in the [model config YAML file](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/model-archiver/README.md?plain=1#L164), and nproc-per-node (OR parallelLevel) is set to 2, then TorchServe would assign CUDA_VISIBLE_DEVICES="2,3" to worker1 and CUDA_VISIBLE_DEVICES="4,5" to worker2.
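
For illustration, the two ways of assigning GPUs described above might look roughly as follows in `model-config.yaml`. This is a minimal sketch: the nesting of `nproc-per-node` under a `torchrun` section is assumed from TorchServe's existing model config examples, and all values are placeholders.

```yaml
# Option A: torchrun-based parallelism - one process per assigned GPU
parallelType: "tp"        # or "pp" / "pptp"
torchrun:
    nproc-per-node: 4     # the worker gets 4 GPUs

# Option B: custom parallelism - a single process gets all assigned GPUs
#parallelType: "custom"
#parallelLevel: 4         # do NOT set together with nproc-per-node
#deviceIds: [2, 3, 4, 5]  # optional: pin the worker to specific GPUs
```
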
Using PiPPy integration as an example, the image below illustrates the internals of the TorchServe large model inference.
For an example using vLLM see [this example](../examples/large_models/vllm/).
examples/large_models/vllm/Readme.md

This folder contains multiple demonstrations showcasing the integration of [vLLM Engine](https://github.com/vllm-project/vllm) with TorchServe, running inference with continuous batching.
vLLM achieves high throughput using PagedAttention. More details can be found [here](https://vllm.ai/).
The vLLM integration uses our new asynchronous worker communication mode, which decouples the communication between frontend and backend from running the actual inference.
By using this new feature, TorchServe is capable of feeding incoming requests into the vLLM engine while asynchronously running the engine in the backend.
As long as a single request is inside the engine, it will continue to run and asynchronously stream out the results until the request is finished.
New requests are added to the engine in a continuous fashion, similar to the continuous batching mode shown in other examples.
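
To illustrate the pattern in isolation, here is a minimal standalone sketch using vLLM's async engine API directly. This is not the TorchServe handler code from this example; the model name, prompts, and sampling values are placeholders.

```python
# Minimal sketch: feed prompts into vLLM's AsyncLLMEngine and stream results as
# they become available, while the engine keeps batching requests continuously.
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


async def stream_request(engine: AsyncLLMEngine, prompt: str, request_id: str):
    params = SamplingParams(max_tokens=50, temperature=0.8)  # placeholder values
    # generate() returns an async generator; each iteration yields the output
    # produced so far, so results can be streamed before the request finishes.
    async for output in engine.generate(prompt, params, request_id):
        print(request_id, output.outputs[0].text)


async def main():
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="facebook/opt-125m")  # placeholder model
    )
    # Requests are added independently; the engine batches them internally.
    await asyncio.gather(
        stream_request(engine, "The capital of France is", "req-0"),
        stream_request(engine, "A robot may not injure", "req-1"),
    )


if __name__ == "__main__":
    asyncio.run(main())
```

Because `generate()` yields intermediate outputs, partial results can be streamed back while the engine keeps admitting new requests.
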
For all examples, distributed inference can be enabled by following the instructions [here](./Readme.md#distributed-inference).

vLLM [SamplingParams](https://github.com/vllm-project/vllm/blob/258a2c58d08fc7a242556120877a89404861fbce/vllm/sampling_params.py#L27) is defined in the JSON format, for example, [prompt.json](lora/prompt.json).
### Distributed Inference

All examples can easily be distributed over multiple GPUs by enabling tensor parallelism in vLLM.
To enable distributed inference, the following additions need to be made to the model-config.yaml of the examples, where 4 is the desired number of GPUs to use for the inference:

```yaml
# TorchServe frontend parameters
...
parallelType: "custom"
parallelLevel: 4

handler:
    ...
    vllm_engine_config:
        ...
        tensor_parallel_size: 4
```

### Multi-worker Note:

While this example in theory works with multiple workers, it would distribute the incoming requests in a round-robin fashion, which might lead to suboptimal worker/hardware utilization.
It is therefore advised to use only a single worker per engine and utilize tensor parallelism to distribute the model over multiple GPUs as described in the previous section.
This will result in better hardware utilization and inference performance.
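
For instance, a single-worker setup combined with the tensor parallel settings from the previous section could be sketched roughly like this (minWorkers/maxWorkers are standard TorchServe frontend parameters; the values are placeholders):

```yaml
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1        # one worker per vLLM engine
parallelType: "custom"
parallelLevel: 4     # combine with tensor_parallel_size: 4 as shown above
```
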
examples/large_models/vllm/llama3/Readme.md

# Example showing inference with vLLM on Llama 3 model

This is an example showing how to integrate [vLLM](https://github.com/vllm-project/vllm) with TorchServe and run inference on model `meta-llama/Meta-Llama-3-8B-Instruct` with continuous batching.
This example supports distributed inference by following [these instructions](../Readme.md#distributed-inference).

### Step 1: Download Model from HuggingFace

Login with a HuggingFace account

```bash
huggingface-cli login
# or using an environment variable
huggingface-cli login --token $HUGGINGFACE_TOKEN
```

```bash
python ../../utils/Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3-8B-Instruct --use_auth_token True
```

### Step 2: Generate model artifacts

Add the downloaded path to "model_path:" in `model-config.yaml` and run the following.
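
As a rough sketch of where that path goes, following the `handler`/`vllm_engine_config` layout from the distributed inference snippet above (the snapshot path and the engine option are placeholders, not the example's actual values):

```yaml
handler:
    model_path: "model/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/<snapshot-hash>"  # placeholder path produced by Step 1
    vllm_engine_config:
        max_model_len: 512    # placeholder engine option
```
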
examples/large_models/vllm/lora/Readme.md (+1)

# Example showing inference with vLLM on LoRA model

This is an example showing how to integrate [vLLM](https://github.com/vllm-project/vllm) with TorchServe and run inference on model `Llama-2-7b-hf` + LoRA model `llama-2-7b-sql-lora-test` with continuous batching.
This example supports distributed inference by following [these instructions](../Readme.md#distributed-inference).
examples/large_models/vllm/mistral/Readme.md (+1)

# Example showing inference with vLLM on Mistral model

This is an example showing how to integrate [vLLM](https://github.com/vllm-project/vllm) with TorchServe and run inference on model `mistralai/Mistral-7B-v0.1` with continuous batching.
This example supports distributed inference by following [these instructions](../Readme.md#distributed-inference).