removed hf token from cpu based example #464

Merged · 7 commits · Mar 19, 2025
64 changes: 38 additions & 26 deletions config/manifests/vllm/cpu-deployment.yaml
@@ -26,16 +26,11 @@ spec:
- "--max-loras"
- "4"
- "--lora-modules"
- '{"name": "tweet-summary-0", "path": "/adapters/ai-blond/Qwen-Qwen2.5-Coder-1.5B-Instruct-lora_0"}'
- '{"name": "tweet-summary-1", "path": "/adapters/ai-blond/Qwen-Qwen2.5-Coder-1.5B-Instruct-lora_1"}'
- '{"name": "tweet-summary-0", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations", "base_model_name": "Qwen/Qwen2.5-1.5B"}'
- '{"name": "tweet-summary-1", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations", "base_model_name": "Qwen/Qwen2.5-1.5B"}'
env:
- name: PORT
value: "8000"
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
- name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
value: "true"
- name: VLLM_CPU_KVCACHE_SPACE
@@ -64,6 +59,13 @@ spec:
periodSeconds: 5
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: "12"
memory: "9000Mi"
requests:
cpu: "12"
memory: "9000Mi"
Contributor

Do we need the adapter-loader initContainer? We removed that for the gpu deployment.

Contributor Author

When removing the init container, I'm getting this error about a missing adapter:

```
INFO 03-13 07:42:00 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 4.57x
INFO 03-13 07:42:02 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 2.30 seconds
INFO 03-13 07:42:02 api_server.py:756] Using supplied chat template:
INFO 03-13 07:42:02 api_server.py:756] None
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 879, in run_server
    await init_app_state(engine_client, model_config, app.state, args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 765, in init_app_state
    await state.openai_serving_models.init_static_loras()
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_models.py", line 96, in init_static_loras
    raise ValueError(load_result.message)
ValueError: Loading lora tweet-summary-0 failed: No adapter found for /adapters/hub/models--ai-blond--Qwen-Qwen2.5-Coder-1.5B-Instruct-lora/snapshots/9cde18d8ed964b0519fb481cca6acd936b2ca811
```

Contributor

The solution is to have the adapters in the flags above point directly to HF. Right now they point to the volume that is created and populated by the sidecar, which is not necessary.
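
For illustration, a minimal sketch of what this suggestion looks like in the container args, using the same adapter values that appear in this PR's final diff:

```yaml
# Sketch: point each LoRA module directly at a Hugging Face repo
# instead of the locally mounted /adapters path.
- "--lora-modules"
- '{"name": "tweet-summary-0", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations", "base_model_name": "Qwen/Qwen2.5-1.5B"}'
- '{"name": "tweet-summary-1", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations", "base_model_name": "Qwen/Qwen2.5-1.5B"}'
```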

Contributor

Please see the gpu deployment as an example.

Contributor Author

@ahg-g good pointer. I found adapters on Hugging Face and tested that it works without the init container.
I will push this change soon.
Thanks :)

Contributor

This is missing the lora-syncer sidecar and configmap; without them, the lora rollout guide wouldn't work. Please see the gpu-deployment.yaml file and try to mirror it.
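
A condensed sketch of the pattern being referenced (the complete version is in `gpu-deployment.yaml` and in this PR's final diff): a `lora-adapter-syncer` sidecar that reads its rollout config from a mounted ConfigMap.

```yaml
# Condensed sketch only; see gpu-deployment.yaml and this PR's diff for the full definition.
initContainers:
  - name: lora-adapter-syncer
    image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
    restartPolicy: Always              # runs as a native sidecar
    env:
      - name: DYNAMIC_LORA_ROLLOUT_CONFIG
        value: "/config/configmap.yaml"
    volumeMounts:                      # do not use subPath; dynamic ConfigMap updates don't work with it
      - name: config-volume
        mountPath: /config
volumes:
  - name: config-volume
    configMap:
      name: vllm-qwen-adapters
```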

Contributor Author

I'm not sure I understood what exactly wouldn't work, but I've added the configmap and sidecar per your request.
I've deployed this deployment and verified that I can call the OpenAI API endpoints and get responses.
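
One way to spot-check this, assuming the deployment name below and that the pod serves on port 8000 (both are assumptions; adjust to match the manifest):

```bash
# Hypothetical verification; vLLM exposes OpenAI-compatible endpoints.
kubectl port-forward deployment/vllm-llama2-7b 8000:8000 &

# The base model and the LoRA adapters registered via --lora-modules should be listed.
curl -s localhost:8000/v1/models

# Send a test completion against one of the adapters.
curl -s localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "tweet-summary-1", "prompt": "Write a tweet about Kubernetes.", "max_tokens": 50}'
```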

Contributor Author

OK, I've read the "Adapter Rollout" readme file. Got it.
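
For example, rolling out an additional adapter would amount to adding one more entry under `ensureExist` in the ConfigMap from this PR; a hypothetical sketch:

```yaml
# Hypothetical rollout: the lora-syncer sidecar watches this file and
# registers any newly listed adapter with the running vLLM server.
vLLMLoRAConfig:
  name: vllm-llama2-7b
  port: 8000
  ensureExist:
    models:
      - base-model: Qwen/Qwen2.5-1.5B
        id: tweet-summary-1
        source: SriSanth2345/Qwen-1.5B-Tweet-Generations
      - base-model: Qwen/Qwen2.5-1.5B
        id: tweet-summary-2          # new, hypothetical adapter id
        source: SriSanth2345/Qwen-1.5B-Tweet-Generations
```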

volumeMounts:
- mountPath: /data
name: data
@@ -72,26 +74,18 @@ spec:
- name: adapters
mountPath: "/adapters"
initContainers:
- name: adapter-loader
image: ghcr.io/tomatillo-and-multiverse/adapter-puller:demo
command: ["python"]
args:
- ./pull_adapters.py
- --adapter
- ai-blond/Qwen-Qwen2.5-Coder-1.5B-Instruct-lora
- --duplicate-count
- "4"
- name: lora-adapter-syncer
tty: true
stdin: true
image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
restartPolicy: Always
imagePullPolicy: Always
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
- name: HF_HOME
value: /adapters
volumeMounts:
- name: adapters
mountPath: "/adapters"
- name: DYNAMIC_LORA_ROLLOUT_CONFIG
value: "/config/configmap.yaml"
volumeMounts: # DO NOT USE subPath, dynamic configmap updates don't work on subPaths
- name: config-volume
mountPath: /config
restartPolicy: Always
schedulerName: default-scheduler
terminationGracePeriodSeconds: 30
@@ -103,3 +97,21 @@ spec:
medium: Memory
- name: adapters
emptyDir: {}
- name: config-volume
configMap:
name: vllm-qwen-adapters
---
apiVersion: v1
kind: ConfigMap
metadata:
name: vllm-qwen-adapters
data:
configmap.yaml: |
vLLMLoRAConfig:
name: vllm-llama2-7b
port: 8000
ensureExist:
models:
- base-model: Qwen/Qwen2.5-1.5B
id: tweet-summary-1
source: SriSanth2345/Qwen-1.5B-Tweet-Generations
15 changes: 11 additions & 4 deletions site-src/guides/index.md
@@ -5,7 +5,7 @@ This quickstart guide is intended for engineers familiar with k8s and model serv
## **Prerequisites**
- Envoy Gateway [v1.3.0](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher
- A cluster with:
- Support for services of typs `LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running).
- Support for services of type `LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running).
For example, with Kind, you can follow [these steps](https://kind.sigs.k8s.io/docs/user/loadbalancer).
- Support for [sidecar containers](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) (enabled by default since Kubernetes v1.29)
to run the model server deployment.
@@ -20,14 +20,15 @@ This quickstart guide is intended for engineers familiar with k8s and model serv
Requirements: a Hugging Face access token that grants access to the model [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf).

1. CPU-based model server (not using GPUs).
Requirements: a Hugging Face access token that grants access to the model [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct).
The sample uses the model [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct).

Choose one of these options and follow the steps below. Please do not deploy both, as the deployments have the same name and will override each other.

=== "GPU-Based Model Server"

For this setup, you will need 3 GPUs to run the sample model server. Adjust the number of replicas in `./config/manifests/vllm/gpu-deployment.yaml` as needed.
Create a Hugging Face secret to download the model [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf). Ensure that the token grants access to this model.

Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway.
```bash
kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Llama2
@@ -36,10 +37,16 @@ This quickstart guide is intended for engineers familiar with k8s and model serv

=== "CPU-Based Model Server"

Create a Hugging Face secret to download the model [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct). Ensure that the token grants access to this model.
This setup uses the official `vllm-cpu` image, which, according to the vLLM documentation, can run vLLM on x86 CPU platforms.
For this setup, we use approximately 9.5GB of memory and 12 CPUs for each replica.
While it is possible to deploy the model server with fewer resources, this is not recommended. For example, in our tests, loading the model with 8GB of memory and 1 CPU was possible, but it took almost 3.5 minutes and inference requests took an unreasonably long time.
In general, there is a tradeoff between the memory and CPU allocated to the pods and the performance: the more memory and CPU we allocate, the better the performance.
After trying multiple configurations, we chose 9.5GB of memory and 12 CPUs per replica for this sample, which gives reasonable response times. You can increase those numbers and may get even better response times.
To modify the allocated resources, adjust the numbers in `./config/manifests/vllm/cpu-deployment.yaml` as needed; the relevant stanza is shown below.
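
For reference, these numbers live in the per-replica `resources` stanza of that manifest (values shown match this PR):

```yaml
# Per-replica resources in cpu-deployment.yaml; tune for your hardware.
resources:
  limits:
    cpu: "12"
    memory: "9000Mi"
  requests:
    cpu: "12"
    memory: "9000Mi"
```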

Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway.
```bash
kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Qwen
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/cpu-deployment.yaml
```
