## Contents of this Document
* [Introduction](#introduction)
* [Conclusion](#conclusion)
## Introduction
Before jumping into this document, please go over the following docs:
1. [What is TorchServe?](../README.md)
1. [What is custom service code?](custom_service.md)
## Batch Inference with TorchServe using ResNet-152 model
To support batching of inference requests, TorchServe needs the following:
1. TorchServe Model Configuration: TorchServe provides the means to configure "Max Batch Size" and "Max Batch Delay" through the "POST /models" API.
   TorchServe needs to know the maximum batch size that the model can handle and the maximum delay that TorchServe should wait for to form this request batch.
2. Model Handler code: TorchServe requires the model handler to handle a batch of inference requests.
For the full working code of a custom model handler with batch processing, refer to [resnet152_handler.py](../examples/image_classifier/resnet_152_batch/resnet152_handler.py).
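The sketch below is a minimal, simplified illustration (not the actual resnet152_handler.py) of what batch-aware handler code looks like: when batching is enabled, TorchServe hands the handler a *list* of up to `batch_size` requests, and the handler must return one response per request in the same order. The class name, weight-file name, and preprocessing choices are assumptions made for this sketch only.

```python
# Minimal sketch of a batch-aware handler (illustrative only; see
# resnet152_handler.py above for the real implementation).
import io

import torch
from PIL import Image
from torchvision import transforms


class BatchImageClassifier:
    def __init__(self):
        self.model = None
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
        ])

    def initialize(self, context):
        # Load the serialized model once per worker. The file name
        # "resnet152.pt" is an assumption for this sketch.
        model_dir = context.system_properties.get("model_dir")
        self.model = torch.jit.load(f"{model_dir}/resnet152.pt")
        self.model.eval()

    def handle(self, data, context):
        # `data` is the list of up to batch_size requests aggregated by the frontend.
        images = []
        for row in data:
            payload = row.get("data") or row.get("body")
            images.append(self.transform(Image.open(io.BytesIO(payload))))
        batch = torch.stack(images)  # one forward pass for the whole batch
        with torch.no_grad():
            predictions = self.model(batch).argmax(dim=1)
        # Return exactly one result per incoming request, in order.
        return [str(p.item()) for p in predictions]
```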
### TorchServe Model Configuration
To configure TorchServe to use the batching feature, you would have to provide the batch configuration information through the [**POST /models** API](management_api.md#register-a-model).
The configuration that we are interested in is the following:
1. `batch_size`: This is the maximum batch size that a model is expected to handle.
2. `max_batch_delay`: This is the maximum delay time TorchServe waits to receive `batch_size` number of requests. If TorchServe doesn't receive `batch_size` number of requests
before this timer times out, it sends whatever requests were received to the model `handler`.
Let's look at an example using this configuration:
```bash
# The following command will register a model "resnet-152.mar" and configure TorchServe to use a batch_size of 8 and a max batch delay of 50 milliseconds.
curl -X POST "localhost:8081/models?url=resnet-152.mar&batch_size=8&max_batch_delay=50"
```
These configurations are used both in TorchServe and in the model's custom service code (a.k.a. the handler code). TorchServe associates the batch-related configuration with each model. The frontend then tries to aggregate `batch_size` requests and send them to the backend.
## Demo to configure TorchServe with a batch-supported model
In this section, let's bring up the model server and launch the Resnet-152 model, which has been built to handle a batch of requests.
### Pre-requisites
Follow the main [Readme](../README.md) and install all the required packages, including `torchserve`.
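For reference, a typical pip-based install looks like the sketch below; the steps and versions in the main Readme are the authoritative source.

```bash
# Install TorchServe, the model archiver, and torch/torchvision for the example model.
pip install torch torchvision torchserve torch-model-archiver
```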
### Loading Resnet-152 which handles batch inferences
* Start the model server. In this example, we are starting the model server to run on inference port 8080 and management port 8081.
```text
$ cat config.properties
...
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
...
$ torchserve --start --model-store model_store
```
Note: This example assumes that the resnet-152.mar file is available in the TorchServe model_store. For more details on creating the resnet-152 mar file and serving it on TorchServe, refer to the [resnet152 image classification example](../examples/image_classifier/resnet_152_batch/README.md).
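As a rough sketch (the linked example README is the authoritative reference), the mar file is produced with `torch-model-archiver`; the weight-file and path names below are assumptions made for illustration.

```bash
# Package the model definition, its weights, and the batch-aware handler into resnet-152.mar
torch-model-archiver --model-name resnet-152 \
    --version 1.0 \
    --model-file examples/image_classifier/resnet_152_batch/model.py \
    --serialized-file resnet152-weights.pth \
    --handler examples/image_classifier/resnet_152_batch/resnet152_handler.py \
    --export-path model_store
```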
* Verify that TorchServe is up and running
```text
$ curl localhost:8080/ping
{
"status": "Healthy"
}
```
* Now let's launch the resnet-152 model, which we have built to handle batch inference. Since this is an example, we are going to launch 1 worker which handles a batch size of 8
with a max-batch-delay of 10ms.
```text
$ curl -X POST "localhost:8081/models?url=resnet-152.mar&batch_size=8&max_batch_delay=10&initial_workers=1"
$ curl -X POST localhost:8080/predictions/resnet-152 -T kitten.jpg
{
  "probability": 0.7148938179016113,
  "class": "n02123045 tabby, tabby cat"
},
{
  "probability": 0.22877725958824158,
  "class": "n02123159 tiger cat"
},
{
  "probability": 0.04032370448112488,
  "class": "n02124075 Egyptian cat"
},
{
  "probability": 0.00837081391364336,
  "class": "n02127052 lynx, catamount"
},
{
  "probability": 0.0006728120497427881,
  "class": "n02129604 tiger, Panthera tigris"
}
```
* Now that we have the service up and running, we can run performance tests with the same kitten image as follows. There are multiple tools to measure the performance of web servers. We will use
[apache-bench](https://httpd.apache.org/docs/2.4/programs/ab.html) to run our performance tests. We chose `apache-bench` for our tests because of its ease of installation and ease of running tests.
Before running this test, we need to first install `apache-bench` on our system. Since we were running this on an Ubuntu host, we installed apache-bench as follows.
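On Ubuntu, `apache-bench` ships with the `apache2-utils` package. The exact command used in the original test is not reproduced here, but an `ab` invocation consistent with the numbers described below would look roughly like this:

```bash
# Install apache-bench on Ubuntu
sudo apt-get install apache2-utils

# 10,000 total requests, 1000 concurrent, POSTing the kitten image to the prediction endpoint
ab -k -n 10000 -c 1000 -p kitten.jpg -T "image/jpeg" http://localhost:8080/predictions/resnet-152
```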
The above test simulates TorchServe receiving 1000 concurrent requests at once and a total of 10,000 requests. All of these requests are directed to the endpoint "localhost:8080/predictions/resnet-152", which assumes
that resnet-152 is already registered and scaled up on TorchServe. We did this registration and scaling up in the steps above.
## Conclusion
The takeaway from the experiments is that batching is a very useful feature. In cases where the services receive a heavy load of requests or each request has high I/O, it's advantageous
to batch the requests. This allows for maximally utilizing the compute resources, especially GPU compute, which is more often than not more expensive. But customers should
carefully choose the `batch_size` and `max_batch_delay` values that suit their use case.
```bash
curl -X POST "localhost:8081/models?model_name=resnet152&url=resnet-152-batch.mar&batch_size=4&max_batch_delay=5000&initial_workers=3&synchronous=true"
```
The above commands will create the mar file and register the resnet152 model with TorchServe with the following configuration:
- model_name: resnet152
- batch_size: 4
- max_batch_delay: 5000 ms
- workers: 3
To test batch inference, execute the following commands within the specified max_batch_delay time:
```bash
curl -X POST http://127.0.0.1:8080/predictions/resnet152 -T serve/examples/image_classifier/resnet_152_batch/images/croco.jpg &
curl -X POST http://127.0.0.1:8080/predictions/resnet152 -T serve/examples/image_classifier/resnet_152_batch/images/dog.jpg &
curl -X POST http://127.0.0.1:8080/predictions/resnet152 -T serve/examples/image_classifier/resnet_152_batch/images/kitten.jpg &
```
#### TorchScript example using Resnet152 image classifier:
* Save the Resnet152-batch model as an executable script module or a traced script:
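A minimal sketch of that step, assuming the standard torchvision ResNet-152 weights (tracing shown here; `torch.jit.script` works similarly). The output file name is an assumption for illustration.

```python
# Trace ResNet-152 with an example input and save it as a TorchScript module
# that can then be packaged into the batch-example mar file.
import torch
import torchvision

model = torchvision.models.resnet152(pretrained=True)
model.eval()

example_input = torch.randn(1, 3, 224, 224)           # single RGB 224x224 image
traced_model = torch.jit.trace(model, example_input)   # or: torch.jit.script(model)
traced_model.save("resnet-152-batch.pt")               # serialized file name is an assumption
```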