TorchServe is a flexible and easy-to-use tool for serving and scaling PyTorch models in production.

```bash
docker pull pytorch/torchserve-nightly
```
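
As a quick sketch of running the pulled image (the port mappings below are the default inference, management, and metrics ports; adjust them, and mount a model store, to match your setup):

```bash
# Run the nightly image and publish the default API ports
# (8080 inference, 8081 management, 8082 metrics)
docker run --rm -it -p 8080:8080 -p 8081:8081 -p 8082:8082 pytorch/torchserve-nightly
```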

Refer to [torchserve docker](docker/README.md) for details.
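
For a minimal local sketch without Docker (assuming `torchserve` and `torch-model-archiver` are installed; `my_model` and `my_model.pt` are illustrative placeholders, and `image_classifier` is one of the built-in handlers), you can package a trained model and serve it like this:

```bash
# Create a local model store and package the trained model into a .mar archive
mkdir -p model_store
torch-model-archiver --model-name my_model --version 1.0 \
  --serialized-file my_model.pt \
  --handler image_classifier \
  --export-path model_store

# Start TorchServe and load the archive from the model store
torchserve --start --ncs --model-store model_store --models my_model=my_model.mar

# Stop the server when done
torchserve --stop
```

By default the Inference API listens on port 8080, the Management API on 8081, and the Metrics API on 8082.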

## ⚡ Why TorchServe

* Write once, run anywhere: on-prem or in the cloud, with inference on CPUs, GPUs, AWS Inf1/Inf2/Trn1, Google Cloud TPUs, and [Nvidia MPS](docs/nvidia_mps.md)
* [Model Management API](docs/management_api.md): multi-model management with optimized worker-to-model allocation
* [Inference API](docs/inference_api.md): REST and gRPC support for batched inference (see the REST sketch after this list)
* [TorchServe Workflows](examples/Workflows/README.md): deploy complex DAGs with multiple interdependent models
* Default way to serve PyTorch models in
  * [Sagemaker](https://aws.amazon.com/blogs/machine-learning/serving-pytorch-models-in-production-with-the-amazon-sagemaker-native-torchserve-integration/)
  * [Vertex AI](https://cloud.google.com/blog/topics/developers-practitioners/pytorch-google-cloud-how-deploy-pytorch-models-vertex-ai)
  * [Kubernetes](kubernetes) with support for [autoscaling](kubernetes#session-affinity-with-multiple-torchserve-pods), session affinity, and monitoring using Grafana; works on-prem and on AWS EKS, Google GKE, and Azure AKS
  * [Kserve](https://kserve.github.io/website/0.8/modelserving/v1beta1/torchserve/): supports both v1 and v2 API, with [autoscaling and canary deployments](kubernetes/kserve/README.md#autoscaling) for A/B testing
  * [Kubeflow](https://v0-5.kubeflow.org/docs/components/pytorchserving/)
  * [MLflow](https://github.com/mlflow/mlflow-torchserve)
* Export your model for optimized inference: TorchScript out of the box, [PyTorch Compiler](examples/pt2/README.md) preview, [ORT and ONNX](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md), [IPEX](https://github.com/pytorch/serve/tree/master/examples/intel_extension_for_pytorch), [TensorRT](https://github.com/pytorch/serve/blob/master/docs/performance_guide.md), [FasterTransformer](https://github.com/pytorch/serve/tree/master/examples/FasterTransformer_HuggingFace_Bert), and FlashAttention (Better Transformers)
* [Performance Guide](docs/performance_guide.md): built-in support to optimize, benchmark, and profile PyTorch and TorchServe performance
* [Expressive handlers](CONTRIBUTING.md): an expressive handler architecture that makes it trivial to support inference for your use case, with [many handlers supported out of the box](https://github.com/pytorch/serve/tree/master/ts/torch_handler)
* [Metrics API](docs/metrics.md): out-of-the-box support for system-level metrics, [Prometheus exports](https://github.com/pytorch/serve/tree/master/examples/custom_metrics), and custom metrics (see the metrics sketch after this list)
* [Large Model Inference Guide](docs/large_model_inference.md): support for GenAI and LLMs, including
  * [SOTA GenAI performance](https://github.com/pytorch/serve/tree/master/examples/pt2#torchcompile-genai-examples) using `torch.compile`
  * Fast kernels with FlashAttention v2, continuous batching, and streaming response
  * PyTorch [Tensor Parallel](examples/large_models/tp_llama) preview, [Pipeline Parallel](examples/large_models/Huggingface_pippy)
  * Microsoft [DeepSpeed](examples/large_models/deepspeed), [DeepSpeed-Mii](examples/large_models/deepspeed_mii)
  * Hugging Face [Accelerate](examples/large_models/Huggingface_accelerate), [Diffusers](examples/diffusers)
  * Running large models on AWS [Sagemaker](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-tutorials-torchserve.html) and [Inferentia2](https://pytorch.org/blog/high-performance-llama/)
  * Running [Llama 2 Chatbot locally on Mac](examples/LLM/llama2)
* Monitoring using Grafana and [Datadog](https://www.datadoghq.com/blog/ai-integrations/#model-serving-and-deployment-vertex-ai-amazon-sagemaker-torchserve)
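
As referenced in the Inference API item above, a minimal REST sketch against a locally running TorchServe (default ports; `my_model.mar` and `kitten.jpg` are placeholders, and the archive is assumed to already sit in the configured model store):

```bash
# Register a model archive and spin up one worker via the Management API (port 8081)
curl -X POST "http://localhost:8081/models?url=my_model.mar&initial_workers=1"

# List the models currently registered
curl http://localhost:8081/models

# Run a prediction via the Inference API (port 8080)
curl http://localhost:8080/predictions/my_model -T kitten.jpg
```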
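
For the Metrics API item, a quick sketch of scraping the Prometheus-format endpoint (port 8082 by default):

```bash
# Fetch system- and model-level metrics in Prometheus format
curl http://127.0.0.1:8082/metrics
```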

## 📰 News

* [High performance Llama 2 deployments with AWS Inferentia2 using TorchServe](https://pytorch.org/blog/high-performance-llama/)
* [Naver Case Study: Transition From High-Cost GPUs to Intel CPUs and oneAPI powered Software with performance](https://pytorch.org/blog/ml-model-server-resource-saving/)
* [Run multiple generative AI models on GPU using Amazon SageMaker multi-model endpoints with TorchServe and save up to 75% in inference costs](https://pytorch.org/blog/amazon-sagemaker-w-torchserve/)
* [Deploying your Generative AI model in only four steps with Vertex AI and PyTorch](https://cloud.google.com/blog/products/ai-machine-learning/get-your-genai-model-going-in-four-easy-steps)
* [PyTorch Model Serving on Google Cloud TPU v5](https://cloud.google.com/tpu/docs/v5e-inference#pytorch-model-inference-and-serving)
* [Monitoring using Datadog](https://www.datadoghq.com/blog/ai-integrations/#model-serving-and-deployment-vertex-ai-amazon-sagemaker-torchserve)