removed hf token from cpu based example #464

Merged
7 commits merged into kubernetes-sigs:main on Mar 19, 2025

Conversation

nirrozenbaum
Contributor

Trying to run the example without configuring the HF token seems to work, so this part can be removed.

CC @danehans

@k8s-ci-robot k8s-ci-robot requested review from danehans and Jeffwan March 9, 2025 10:41
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 9, 2025

netlify bot commented Mar 9, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 057d176
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67da8ae3c6c6f90008b2fd54
😎 Deploy Preview https://deploy-preview-464--gateway-api-inference-extension.netlify.app

@nirrozenbaum nirrozenbaum mentioned this pull request Mar 9, 2025
@ahg-g
Contributor

ahg-g commented Mar 9, 2025

I tried the deployment-cpu.yaml; the container crash-loops and I get the following error:

":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 03-09 20:41:57 cpu.py:39] Cannot use None backend on CPU.
INFO 03-09 20:41:57 cpu.py:40] Using Torch SDPA backend.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

@nirrozenbaum
Contributor Author

nirrozenbaum commented Mar 10, 2025

@ahg-g I pushed a commit with cpu and memory requirements.
I've done a lot of trial and error and this setup seems to give reasonable response times. Of course, the more memory and CPU we allocate, the better the performance, but these values seem good enough to me.
BTW, we can reduce the number of replicas from 3 to 2 to lower the requirements, but no doubt it has to be a strong cluster, not a small one.

    memory: "9000Mi"
  requests:
    cpu: "12"
    memory: "9000Mi"
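
For context, a minimal sketch of the full resources stanza this snippet appears to come from, assuming the limits cpu mirrors the requests value (that line is not visible in the snippet above):

resources:
  limits:
    cpu: "12"          # assumption: this line is not shown in the snippet above
    memory: "9000Mi"
  requests:
    cpu: "12"
    memory: "9000Mi"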
Contributor

Do we need the adapter-loader initContainer? We removed that for the GPU deployment.

Contributor Author

When removing the init container, I'm getting this error about a missing adapter:

INFO 03-13 07:42:00 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 4.57x
INFO 03-13 07:42:02 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 2.30 seconds
INFO 03-13 07:42:02 api_server.py:756] Using supplied chat template:
INFO 03-13 07:42:02 api_server.py:756] None
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 879, in run_server
    await init_app_state(engine_client, model_config, app.state, args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 765, in init_app_state
    await state.openai_serving_models.init_static_loras()
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_models.py", line 96, in init_static_loras
    raise ValueError(load_result.message)
ValueError: Loading lora tweet-summary-0 failed: No adapter found for /adapters/hub/models--ai-blond--Qwen-Qwen2.5-Coder-1.5B-Instruct-lora/snapshots/9cde18d8ed964b0519fb481cca6acd936b2ca811

Contributor

The solution is to have the adapters in the flags above point directly to HF; right now they point to the volume that is created and populated by the sidecar, which is not necessary.
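
A minimal sketch of what that could look like in the vLLM container args, using the name=path form of --lora-modules and taking the adapter name and HF repo from the failing snapshot path in the error above; gpu-deployment.yaml in this repo is the authoritative reference:

args:
  - "--enable-lora"
  - "--lora-modules"
  # assumption: adapter name and Hugging Face repo are inferred from the error message above
  - "tweet-summary-0=ai-blond/Qwen-Qwen2.5-Coder-1.5B-Instruct-lora"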

Contributor

Please see the GPU deployment as an example.

Contributor Author

@ahg-g good pointer. I found adapters on Hugging Face and tested that it works without the init container.
I will push this change soon.
Thanks :)

Contributor

This is missing the lora-syncer sidecar and configmap; otherwise the LoRA rollout guide wouldn't work. Please see the gpu-deployment.yaml file and try to mirror it.
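
A rough sketch of the shape being requested, mounting an adapter ConfigMap into a syncer sidecar; the container name, image, mount path, and ConfigMap name below are placeholders, and gpu-deployment.yaml remains the authoritative reference for the real values and config schema:

containers:
- name: lora-adapter-syncer          # assumption: sidecar name used for illustration only
  image: "<lora-syncer image from gpu-deployment.yaml>"   # placeholder, not a real tag
  volumeMounts:
  - name: config-volume
    mountPath: /config               # assumption: mount path for illustration only
volumes:
- name: config-volume
  configMap:
    name: vllm-cpu-adapters          # assumption: ConfigMap name for illustration only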

Contributor Author

I'm not sure I got what exactly wouldn't work, but I've added the configmap and sidecar per your request.
I've deployed this deployment and verified that I can call the OpenAI API endpoints and get responses.

Contributor Author

Ok, I've read the "Adapter Rollout" readme file. Got it.

Contributor

@ahg-g ahg-g left a comment

I am still getting the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 13, 2025
@nirrozenbaum
Copy link
Contributor Author

nirrozenbaum commented Mar 13, 2025

I am still getting the following error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.

That's interesting. I'm failing to understand from this log what the issue could be.
@kfswain @danehans @liu-cong can you guys try to deploy the cpu-deployment from this PR (which contains cpu and memory specifications) in your cluster and report whether it works for you?

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 13, 2025
@nirrozenbaum
Contributor Author

@ahg-g I've confirmed that this deployment works on two additional clusters.
Can you please give it another try and verify that you used the deployment version from this PR?
In addition, it would be very helpful if someone else could try to run it.

@nirrozenbaum
Contributor Author

And BTW, on another cluster I used, I was able to get reasonable response times for inference requests using only 4 CPUs per pod.

@ahg-g
Contributor

ahg-g commented Mar 17, 2025

Update: this worked for me, including the removal of the HF token. As I mentioned in the other PR, the issue was the CPU model, I was using AMD, but when switching to intel it worked. We need to document that somewhere or make the image multi-arch. Thanks @nirrozenbaum!

@nirrozenbaum
Contributor Author

Update: this worked for me, including the removal of the HF token. As I mentioned in the other PR, the issue was the CPU model, I was using AMD, but when switching to intel it worked. We need to document that somewhere or make the image multi-arch. Thanks @nirrozenbaum!

@ahg-g I was checking the vllm-cpu image arch just now.
The first lines of the Dockerfile are:

# This vLLM Dockerfile is used to construct image that can build and run vLLM on x86 CPU platform.

I will add this line to the documentation.

@nirrozenbaum
Contributor Author

@ahg-g the cpu Dockerfile uses ubuntu:22.04 as the base image.
If I'm not mistaken, in that case the cpu image inherits the multi-arch support of the base image.


@danehans
Contributor

xref #527 for using tabs to separate CPU/GPU-based deployment modes.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 18, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 19, 2025
@ahg-g
Contributor

ahg-g commented Mar 19, 2025

/lgtm
/approve

@nirrozenbaum I think the issue is not x86 per se (both AMD and Intel CPUs are x86); it is probably related to some special Intel instructions that vllm perhaps uses.
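
If that is the cause, one way to avoid the crash loop is to schedule the CPU deployment only onto nodes that expose the needed instruction set. A minimal sketch, assuming node-feature-discovery is installed on the cluster and that AVX-512 is the relevant requirement of the vLLM CPU backend (neither is part of this PR):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        # label published by node-feature-discovery when the node CPU supports AVX-512F
        - key: feature.node.kubernetes.io/cpu-cpuid.AVX512F
          operator: In
          values: ["true"]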

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 19, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, nirrozenbaum

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 19, 2025
@k8s-ci-robot k8s-ci-robot merged commit ad29b48 into kubernetes-sigs:main Mar 19, 2025
8 checks passed