
Commit 5f5904b

agunapal and mreso authored
Update performance documentation (#3159)
* Update performance documentation
* Update performance_checklist.md: add inference_mode decorator
* Update performance_checklist.md

Co-authored-by: Matthias Reso <[email protected]>
1 parent f2dd94b commit 5f5904b

2 files changed: 6 additions, 2 deletions


docs/performance_checklist.md (+2)
@@ -21,6 +21,8 @@ As this [example](https://colab.research.google.com/drive/1NMaLS8PG0eYhbd8IxQAaj
Start model inference optimization only after other factors, the “low-hanging fruit”, have been extensively evaluated and addressed.

+ - Using the `with torch.inference_mode()` context before calling the forward pass on your model, or the `@torch.inference_mode()` decorator on your `inference()` method, improves inference performance. This is achieved by [disabling](https://pytorch.org/docs/stable/generated/torch.autograd.grad_mode.inference_mode.html) view tracking and version counter bumps.
+
- Use fp16 for GPU inference. The speed will most likely more than double on newer GPUs with tensor cores, with negligible accuracy degradation. Technically fp16 is a type of quantization, but since it seldom suffers from loss of accuracy for inference, it should always be explored. As shown in this [article](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#abstract), use of fp16 offers a speed up in large neural network applications.

- Use model quantization (i.e. int8) for CPU inference. Explore different quantization options: dynamic quantization, static quantization, and quantization aware training, as well as tools such as Intel Neural Compressor that provide more sophisticated quantization methods. It is worth noting that quantization comes with some loss in accuracy and might not always offer a significant speed up on some hardware, so it might not always be the right approach.
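For reference, the inference_mode and fp16 bullets above combine naturally in code. A minimal sketch, using a toy `nn.Sequential` as a stand-in for a real model and falling back to CPU/float32 when no GPU is available:

```python
import torch
import torch.nn as nn

# Toy stand-in for a real model; any eval-mode nn.Module works the same way.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    model = model.half()  # fp16 weights for GPU inference
model = model.to(device)

@torch.inference_mode()  # disables view tracking and version counter bumps for this call
def inference(batch: torch.Tensor) -> torch.Tensor:
    # Match the model's device and dtype (fp16 on GPU, fp32 on CPU).
    batch = batch.to(device=device, dtype=next(model.parameters()).dtype)
    return model(batch)

out = inference(torch.randn(32, 128))
```

For the CPU quantization bullet, dynamic quantization is the lowest-effort of the listed options. A hedged sketch using `torch.ao.quantization.quantize_dynamic` on the same toy model (static quantization and quantization aware training need calibration or training and are not shown):

```python
import torch
import torch.nn as nn

model_fp32 = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Dynamic quantization: weights are stored as int8 and activations are
# quantized on the fly at inference time; only nn.Linear modules are converted here.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    out = model_int8(torch.randn(4, 128))
```

As the checklist notes, both the accuracy impact and the actual speed up should be measured on the target hardware before adopting either option.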

docs/performance_guide.md (+4, -2)
@@ -15,6 +15,8 @@ Starting with PyTorch 2.0, `torch.compile` provides out of the box speed up ( ~1
Models which have been fully optimized with `torch.compile` show performance improvements up to 10x

+ When using smaller batch sizes, using `mode="reduce-overhead"` with `torch.compile` can give improved performance as it makes use of CUDA graphs
+
You can find all the examples of `torch.compile` with TorchServe [here](https://github.com/pytorch/serve/tree/master/examples/pt2)

Details regarding `torch.compile` GenAI examples can be found in this [link](https://github.com/pytorch/serve/tree/master/examples/pt2#torchcompile-genai-examples)
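As a concrete point of reference for the `mode="reduce-overhead"` line added above, a minimal sketch outside of TorchServe, assuming a CUDA device and a toy model:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval().cuda()

# "reduce-overhead" captures CUDA graphs to remove per-iteration launch
# overhead, which matters most when batches are small.
compiled_model = torch.compile(model, mode="reduce-overhead")

with torch.inference_mode():
    for _ in range(3):  # the first calls trigger compilation and graph warm-up
        out = compiled_model(torch.randn(8, 128, device="cuda"))
```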
@@ -30,13 +32,13 @@ At a high level what TorchServe allows you to do is
To use ONNX with GPU on TorchServe Docker, we need to build an image with [NVIDIA CUDA runtime](https://github.com/NVIDIA/nvidia-docker/wiki/CUDA) as the base image as shown [here](https://github.com/pytorch/serve/blob/master/docker/README.md#create-torchserve-docker-image)

- <h4>TensorRT<h4>
+ <h4>TensorRT</h4>

TorchServe also supports models optimized via TensorRT. To leverage the TensorRT runtime you can convert your model by [following these instructions](https://github.com/pytorch/TensorRT) and once you're done you'll have serialized weights which you can load with [`torch.jit.load()`](https://pytorch.org/TensorRT/getting_started/getting_started_with_python_api.html#getting-started-with-python-api).

After a conversion there is no difference in how PyTorch treats a Torchscript model vs a TensorRT model.
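For the TensorRT paragraph above, a conversion that produces serialized weights loadable with `torch.jit.load()` might look like the sketch below. It assumes the `torch_tensorrt` package from the linked repository and its TorchScript frontend (`ir="ts"`); exact argument names can vary between releases, so treat this as illustrative rather than the project's prescribed workflow.

```python
import torch
import torch.nn as nn
import torch_tensorrt  # https://github.com/pytorch/TensorRT

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval().cuda()

# Compile through the TorchScript frontend so the result can be serialized
# and reloaded with the regular torch.jit APIs.
trt_module = torch_tensorrt.compile(
    model,
    ir="ts",
    inputs=[torch_tensorrt.Input((8, 128))],
    enabled_precisions={torch.half},
)
torch.jit.save(trt_module, "model_trt.ts")

# Later, e.g. inside a TorchServe handler, the weights load like any TorchScript model.
trt_loaded = torch.jit.load("model_trt.ts").cuda()
with torch.inference_mode():
    out = trt_loaded(torch.randn(8, 128, device="cuda"))
```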

- <h4>Better Transformer<h4>
+ <h4>Better Transformer</h4>

Better Transformer from PyTorch implements a backwards-compatible fast path of `torch.nn.TransformerEncoder` for Transformer Encoder inference and does not require model authors to modify their models. BetterTransformer improvements can exceed 2x in speedup and throughput for many common execution scenarios.
You can find more information on Better Transformer [here](https://pytorch.org/blog/a-better-transformer-for-fast-transformer-encoder-inference/) and [here](https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers#speed-up-inference-with-better-transformer).
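As background on how the Better Transformer fastpath is reached without model changes, a minimal sketch with a stock `torch.nn.TransformerEncoder`: the fused path is used automatically when the module is in eval mode and autograd is disabled, and PyTorch falls back to the standard implementation when the eligibility conditions are not met.

```python
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6).eval()

src = torch.rand(32, 64, 512)  # (batch, sequence, embedding)
with torch.inference_mode():
    # Dispatches to the fused fastpath kernels when eligible; otherwise
    # falls back to the standard implementation.
    out = encoder(src)
```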

0 commit comments