This is an example showing how to integrate vLLM with TorchServe and run inference on the meta-llama/Meta-Llama-3.1-8B-Instruct model with continuous batching.
This example supports distributed inference by following this instruction.
To leverage the power of vLLM, we first need to install it with pip in our development environment:
python -m pip install -r ../requirements.txt
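To confirm the install succeeded in the current environment (an optional sanity check), we can ask pip for the package metadata:
# Show the installed vLLM package and its version
python -m pip show vllm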
For later deployments, we can make vLLM part of the deployment environment by adding the requirements.txt when building the model archive in step 2 (see here for details), or we can make it part of a Docker image as shown here.
Log in with a HuggingFace account:
huggingface-cli login
# or using an environment variable
huggingface-cli login --token $HUGGINGFACE_TOKEN
python ../../utils/Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3.1-8B-Instruct --use_auth_token True
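Download_model.py stores the weights under the given --model_path. Assuming it uses the standard HuggingFace cache layout, the snapshot directory needed in the next step can be located like this (the hash-named folder depends on the downloaded revision):
# List the downloaded snapshot folder(s); the full path to the hash-named folder is what goes into model_path
ls model/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/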
Add the downloaded path to "model_path:" in model-config.yaml
and run the following.
torch-model-archiver --model-name llama3-8b --version 1.0 --handler vllm_handler --config-file model-config.yaml --archive-format no-archive
mv model llama3-8b
mkdir model_store
mv llama3-8b model_store
torchserve --start --ncs --ts-config ../config.properties --model-store model_store --models llama3-8b --disable-token-auth --enable-model-api
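Once the server is up, we can check that the model was registered by querying TorchServe's management API (this assumes the default management port 8081; adjust it if ../config.properties overrides it):
# List the models currently registered with TorchServe
curl http://localhost:8081/models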
Run a text completion:
python ../../utils/test_llm_streaming_response.py -m llama3-8b -o 50 -t 2 -n 4 --prompt-text "@prompt.json" --prompt-json --openai-api
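The --prompt-json flag tells the test script to read the request body from the referenced JSON file. As a hedged illustration, assuming the OpenAI-style completion schema implied by --openai-api, a request file could look like the following (the prompt.json shipped with this example may differ; my_prompt.json is just a hypothetical name):
# Write a hypothetical OpenAI-style completion request; "model" should match the registered model name
cat > my_prompt.json <<'EOF'
{
  "model": "llama3-8b",
  "prompt": "A robot may not injure a human being",
  "max_tokens": 50
}
EOF
It can then be passed to the script via --prompt-text "@my_prompt.json".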
Or use the chat interface:
python ../../utils/test_llm_streaming_response.py -m llama3-8b -o 50 -t 2 -n 4 --prompt-text "@chat.json" --prompt-json --openai-api --demo-streaming --api-endpoint "v1/chat/completions"
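For the chat endpoint, here is a hedged example of a request body, assuming the OpenAI chat completions schema served at v1/chat/completions (the chat.json shipped with this example may differ; my_chat.json is a hypothetical name):
# Write a hypothetical OpenAI-style chat request with a system and a user message
cat > my_chat.json <<'EOF'
{
  "model": "llama3-8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What are the three laws of robotics?"}
  ]
}
EOF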