
Commit 70c500c

Add AMD documentation
1 parent 5e180e6 commit 70c500c

8 files changed: +138 −46 lines

CONTRIBUTING.md

+18 −25
@@ -11,18 +11,7 @@ Your contributions will fall into two categories:
 - Search for your issue here: https://github.com/pytorch/serve/issues (look for the "good first issue" tag if you're a first time contributor)
 - Pick an issue and comment on the task that you want to work on this feature.
 - To ensure your changes doesn't break any of the existing features run the sanity suite as follows from serve directory:
-- Install dependencies (if not already installed)
-  For CPU
-
-  ```bash
-  python ts_scripts/install_dependencies.py --environment=dev
-  ```
-
-  For GPU
-  ```bash
-  python ts_scripts/install_dependencies.py --environment=dev --cuda=cu121
-  ```
-  > Supported cuda versions as cu121, cu118, cu117, cu116, cu113, cu111, cu102, cu101, cu92
+- [Install dependencies](#Install-TorchServe-for-development) (if not already installed)
 - Install `pre-commit` to your Git flow:
 ```bash
 pre-commit install
@@ -60,26 +49,30 @@ pytest -k test/pytest/test_mnist_template.py
 
 If you plan to develop with TorchServe and change some source code, you must install it from source code.
 
-Ensure that you have `python3` installed, and the user has access to the site-packages or `~/.local/bin` is added to the `PATH` environment variable.
+1. Clone the repository, including third-party modules, with `git clone --recurse-submodules --remote-submodules [email protected]:pytorch/serve.git`
+2. Ensure that you have `python3` installed, and the user has access to the site-packages or `~/.local/bin` is added to the `PATH` environment variable.
+3. Run the following script from the top of the source directory. NOTE: This script force re-installs `torchserve`, `torch-model-archiver` and `torch-workflow-archiver` if existing installations are found
 
-Run the following script from the top of the source directory.
+#### For Debian Based Systems/MacOS
 
-NOTE: This script force re-installs `torchserve`, `torch-model-archiver` and `torch-workflow-archiver` if existing installations are found
+```
+python ./ts_scripts/install_dependencies.py --environment=dev
+python ./ts_scripts/install_from_src.py --environment=dev
+```
+##### Installing Dependencies for Accelerator Support
+Use the optional `--rocm` or `--cuda` flag with `install_dependencies.py` for installing accelerator specific dependencies.
 
-#### For Debian Based Systems/ MacOS
-
-```
-python ./ts_scripts/install_dependencies.py --environment=dev
-python ./ts_scripts/install_from_src.py --environment=dev
-```
+Possible values are
+- rocm: `rocm61`, `rocm60`
+- cuda: `cu111`, `cu102`, `cu101`, `cu92`
 
-Use `--cuda` flag with `install_dependencies.py` for installing cuda version specific dependencies. Possible values are `cu111`, `cu102`, `cu101`, `cu92`
+For example `python ./ts_scripts/install_dependencies.py --environment=dev --rocm=rocm61`
 
-#### For Windows
+#### For Windows
 
-Refer to the documentation [here](docs/torchserve_on_win_native.md).
+Refer to the documentation [here](docs/torchserve_on_win_native.md).
 
-For information about the model archiver, see [detailed documentation](model-archiver/README.md).
+For information about the model archiver, see [detailed documentation](model-archiver/README.md).
 
 ### What to Contribute?

README.md

+8 −2
@@ -22,7 +22,10 @@ curl http://127.0.0.1:8080/predictions/bert -T input.txt
 
 ```bash
 # Install dependencies
-# cuda is optional
+python ./ts_scripts/install_dependencies.py
+
+# Include dependencies for accelerator support with the relevant optional flags
+python ./ts_scripts/install_dependencies.py --rocm=rocm61
 python ./ts_scripts/install_dependencies.py --cuda=cu121
 
 # Latest release
@@ -36,7 +39,10 @@ pip install torchserve-nightly torch-model-archiver-nightly torch-workflow-archi
 
 ```bash
 # Install dependencies
-# cuda is optional
+python ./ts_scripts/install_dependencies.py
+
+# Include dependencies for accelerator support with the relevant optional flags
+python ./ts_scripts/install_dependencies.py --rocm=rocm61
 python ./ts_scripts/install_dependencies.py --cuda=cu121
 
 # Latest release

docs/contents.rst

+6 −2
@@ -16,9 +16,7 @@
    model_zoo
    request_envelopes
    server
-   nvidia_mps
    snapshot
-   intel_extension_for_pytorch <https://github.com/pytorch/serve/tree/master/examples/intel_extension_for_pytorch>
    torchserve_on_win_native
    torchserve_on_wsl
    use_cases
@@ -27,6 +25,12 @@
    Security
    FAQs
 
+.. toctree::
+   :maxdepth: 0
+   :caption: Hardware Support:
+
+   hardware_support/hardware_support
+
 .. toctree::
    :maxdepth: 0
    :caption: Service APIs:

docs/hardware_support/amd_support.md

+81 (new file)
@@ -0,0 +1,81 @@
# AMD Support

TorchServe can be run on any combination of operating system and device that is
[supported by ROCm](https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility.html).

## Supported Versions of ROCm

The current stable `major.patch` version of ROCm and the previous patch version will be supported. For example, versions `N.2` and `N.1`, where `N` is the current major version.

## Installation

- Make sure you have **python >= 3.8 installed** on your system.
- Clone the repo:

  ```bash
  git clone [email protected]:pytorch/serve.git
  ```

- `cd` into the cloned folder:

  ```bash
  cd serve
  ```

- Create a virtual environment for Python:

  ```bash
  python -m venv venv
  ```

- Activate the virtual environment. If you use another shell (fish, csh, PowerShell), source the corresponding activation script from `venv/bin/`:

  ```bash
  source venv/bin/activate
  ```

- Install the dependencies needed for ROCm support:

  ```bash
  python ./ts_scripts/install_dependencies.py --rocm=rocm61
  python ./ts_scripts/install_from_src.py
  ```

- Enable `amd-smi` in the Python virtual environment:

  ```bash
  sudo chown -R $USER:$USER /opt/rocm/share/amd_smi/
  pip install -e /opt/rocm/share/amd_smi/
  ```
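
As a quick sanity check, you can exercise both interfaces. This is a sketch, not part of the install steps proper: it assumes the editable `amdsmi` install above succeeded and that the `amd-smi` CLI shipped with ROCm is on your `PATH`.

```bash
# Sketch: check that the amd-smi CLI and the amdsmi Python bindings both see your accelerators
amd-smi list   # should print one entry per AMD accelerator
python -c "import amdsmi; amdsmi.amdsmi_init(); print(len(amdsmi.amdsmi_get_processor_handles()), 'accelerator(s) visible'); amdsmi.amdsmi_shut_down()"
```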

### Selecting Accelerators Using `HIP_VISIBLE_DEVICES`

If you have multiple accelerators on the system where you are running TorchServe, you can select which accelerators should be visible to TorchServe
by setting the environment variable `HIP_VISIBLE_DEVICES` to a string of 0-indexed, comma-separated integers representing the ids of the accelerators.

If you have 8 accelerators but only want TorchServe to see the last four of them, run `export HIP_VISIBLE_DEVICES=4,5,6,7`.

> ℹ️ **Not setting** `HIP_VISIBLE_DEVICES` will cause TorchServe to use all available accelerators on the system it is running on.

> ⚠️ You can run into trouble if you set `HIP_VISIBLE_DEVICES` to an empty string,
> e.g. `export HIP_VISIBLE_DEVICES=` or `export HIP_VISIBLE_DEVICES=""`.
> Use `unset HIP_VISIBLE_DEVICES` if you want to remove its effect.

> ⚠️ Setting both `CUDA_VISIBLE_DEVICES` and `HIP_VISIBLE_DEVICES` may cause unintended behaviour and should be avoided.
> Doing so may cause an exception in the future.
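
One way to sanity-check the setting, shown here as a sketch: ROCm builds of PyTorch report HIP devices through the `torch.cuda` API, so the count below reflects what a worker process will see.

```bash
# Sketch: confirm how many accelerators a ROCm build of PyTorch will see
export HIP_VISIBLE_DEVICES=4,5,6,7
python -c "import torch; print(torch.cuda.device_count())"  # expect 4 on an 8-accelerator system
```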

## Docker

**In Development**

`Dockerfile.rocm` provides preliminary ROCm support for TorchServe.

Building and running `dev-image`:

```bash
docker build --file docker/Dockerfile.rocm --target dev-image \
  -t torch-serve-dev-image-rocm --build-arg USE_ROCM_VERSION=rocm62 \
  --build-arg BUILD_FROM_SRC=true .

docker run -it --rm --device=/dev/kfd --device=/dev/dri torch-serve-dev-image-rocm bash
```
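(`/dev/kfd` is the ROCm compute driver interface and `/dev/dri` holds the GPU render nodes; passing both through is the standard way to expose AMD accelerators to a container.)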
## Example Usage

After installing TorchServe with the required dependencies for ROCm, you should be ready to serve your model.

For a simple example, refer to `serve/examples/image_classifier/mnist/`.
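
As a rough end-to-end sketch of that example (the file names and flags below assume the MNIST example layout in the `serve` repository; adjust paths to your checkout):

```bash
# Sketch: archive and serve the bundled MNIST example, then send one test image
torch-model-archiver --model-name mnist --version 1.0 \
  --model-file examples/image_classifier/mnist/mnist.py \
  --serialized-file examples/image_classifier/mnist/mnist_cnn.pt \
  --handler examples/image_classifier/mnist/mnist_handler.py
mkdir -p model_store && mv mnist.mar model_store/

torchserve --start --ncs --model-store model_store --models mnist=mnist.mar
curl http://127.0.0.1:8080/predictions/mnist -T examples/image_classifier/mnist/test_data/0.png
torchserve --stop
```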

docs/apple_silicon_support.md → docs/hardware_support/apple_silicon_support.md (renamed)

+17 −17 (whitespace-only changes: trailing spaces stripped; changed lines are shown once below)
@@ -1,19 +1,19 @@
 # Apple Silicon Support
 
 ## What is supported
 * TorchServe CI jobs now include M1 hardware in order to ensure support, [documentation](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories) on github M1 hardware.
     - [Regression Tests](https://github.com/pytorch/serve/blob/master/.github/workflows/regression_tests_cpu.yml)
     - [Regression binaries Test](https://github.com/pytorch/serve/blob/master/.github/workflows/regression_tests_cpu_binaries.yml)
 * For [Docker](https://docs.docker.com/desktop/install/mac-install/) ensure Docker for Apple silicon is installed then follow [setup steps](https://github.com/pytorch/serve/tree/master/docker)
 
 ## Experimental Support
 
 * For GPU jobs on Apple Silicon, [MPS](https://pytorch.org/docs/master/notes/mps.html) is now auto detected and enabled. To prevent TorchServe from using MPS, users have to set `deviceType: "cpu"` in model-config.yaml.
 * This is an experimental feature and NOT ALL models are guaranteed to work.
 * Number of GPUs now reports GPUs on Apple Silicon
 
 ### Testing
 * [Pytests](https://github.com/pytorch/serve/tree/master/test/pytest/test_device_config.py) that checks for MPS on MacOS M1 devices
 * Models that have been tested and work: Resnet-18, Densenet161, Alexnet
 * Models that have been tested and DO NOT work: MNIST

@@ -31,10 +31,10 @@ Config file: N/A
 Inference address: http://127.0.0.1:8080
 Management address: http://127.0.0.1:8081
 Metrics address: http://127.0.0.1:8082
 Model Store:
 Initial Models: resnet-18=resnet-18.mar
 Log dir:
 Metrics dir:
 Netty threads: 0
 Netty client threads: 0
 Default workers per model: 16
@@ -48,7 +48,7 @@ Custom python dependency for model allowed: false
 Enable metrics API: true
 Metrics mode: LOG
 Disable system metrics: false
 Workflow Store:
 CPP log config: N/A
 Model config: N/A
 024-04-08T14:18:02,380 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
@@ -69,17 +69,17 @@ serve % curl http://127.0.0.1:8080/predictions/resnet-18 -T ./examples/image_cla
6969
}
7070
...
7171
```
72-
#### Conda Example
72+
#### Conda Example
7373

7474
```
75-
(myenv) serve % pip list | grep torch
75+
(myenv) serve % pip list | grep torch
7676
torch 2.2.1
7777
torchaudio 2.2.1
7878
torchdata 0.7.1
7979
torchtext 0.17.1
8080
torchvision 0.17.1
8181
(myenv3) serve % conda install -c pytorch-nightly torchserve torch-model-archiver torch-workflow-archiver
82-
(myenv3) serve % pip list | grep torch
82+
(myenv3) serve % pip list | grep torch
8383
torch 2.2.1
8484
torch-model-archiver 0.10.0b20240312
8585
torch-workflow-archiver 0.2.12b20240312
@@ -119,11 +119,11 @@ System metrics command: default
 2024-03-12T15:58:54,702 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: densenet161, count: 10
 Model server started.
 ...
 (myenv3) serve % curl http://127.0.0.1:8080/predictions/densenet161 -T examples/image_classifier/kitten.jpg
 {
   "tabby": 0.46661922335624695,
   "tiger_cat": 0.46449029445648193,
   "Egyptian_cat": 0.0661405548453331,
   "lynx": 0.001292439759708941,
   "plastic_bag": 0.00022909720428287983
 }
docs/hardware_support/hardware_support.rst

+8 (new file)
@@ -0,0 +1,8 @@
.. toctree::
   :caption: Hardware Support:

   amd_support
   apple_silicon_support
   linux_aarch64
   nvidia_mps
   Intel Extension for PyTorch <https://github.com/pytorch/serve/tree/master/examples/intel_extension_for_pytorch>
docs/linux_aarch64.md → docs/hardware_support/linux_aarch64.md

File renamed without changes.

docs/nvidia_mps.md → docs/hardware_support/nvidia_mps.md

File renamed without changes.
