This is an example showing how to integrate vLLM with TorchServe and run inference on the meta-llama/Meta-Llama-3.1-8B-Instruct model with continuous batching.
This example supports distributed inference by following this instruction.
To leverage the power of vLLM, we first need to install it with pip in our development environment:
python -m pip install -r ../requirements.txt
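To confirm the install succeeded in the current environment (an optional sanity check), we can ask pip for the package metadata:
# Show the installed vLLM package and its version
python -m pip show vllm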
For later deployments, we can make vLLM part of the deployment environment by adding the requirements.txt when building the model archive in step 2 (see here for details), or we can make it part of a Docker image as shown here.
Log in with a HuggingFace account:
huggingface-cli login
# or using an environment variable
huggingface-cli login --token $HUGGINGFACE_TOKEN
python ../../utils/Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3.1-8B-Instruct --use_auth_token True
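Download_model.py stores the weights under the given --model_path. Assuming it uses the standard HuggingFace cache layout, the snapshot directory needed in the next step can be located like this (the hash-named folder depends on the downloaded revision):
# List the downloaded snapshot folder(s); the full path to the hash-named folder is what goes into model_path
ls model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/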
Add the downloaded path to "model_path:" in model-config.yaml
and run the following.
torch-model-archiver --model-name llama3-8b --version 1.0 --handler vllm_handler --config-file model-config.yaml --archive-format no-archive
mv model llama3-8b
mkdir model_store
mv llama3-8b model_store
torchserve --start --ncs --ts-config ../config.properties --model-store model_store --models llama3-8b --disable-token-auth --enable-model-api
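Once the server is up, we can check that the model was registered by querying TorchServe's management API (this assumes the default management port 8081; adjust it if ../config.properties overrides it):
# List the models currently registered with TorchServe
curl http://localhost:8081/models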
Run a text completion:
python ../../utils/test_llm_streaming_response.py -m llama3-8b -o 50 -t 2 -n 4 --prompt-text "@prompt.json" --prompt-json --openai-api
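The --prompt-json flag tells the test script to read the request body from the referenced JSON file. As a hedged illustration, assuming the OpenAI-style completion schema implied by --openai-api, a request file could look like the following (the prompt.json shipped with this example may differ; my_prompt.json is just a hypothetical name):
# Write a hypothetical OpenAI-style completion request; "model" should match the registered model name
cat > my_prompt.json <<'EOF'
{
  "model": "llama3-8b",
  "prompt": "A robot may not injure a human being",
  "max_tokens": 50
}
EOF
It can then be passed to the script via --prompt-text "@my_prompt.json".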
Or use the chat interface:
python ../../utils/test_llm_streaming_response.py -m llama3-8b -o 50 -t 2 -n 4 --prompt-text "@chat.json" --prompt-json --openai-api --demo-streaming --api-endpoint "v1/chat/completions"
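For the chat endpoint, here is a hedged example of a request body, assuming the OpenAI chat completions schema served at v1/chat/completions (the chat.json shipped with this example may differ; my_chat.json is a hypothetical name):
# Write a hypothetical OpenAI-style chat request with a system and a user message
cat > my_chat.json <<'EOF'
{
  "model": "llama3-8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the three laws of robotics?"}
  ]
}
EOF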