removed hf token from cpu based example #464
Conversation
✅ Deploy Preview for gateway-api-inference-extension ready!
Force-pushed from 8650b30 to e2c381b
I tried the deployment-cpu.yaml; the container crashloops and I get the following error:
[error output collapsed in the original comment]
@ahg-g I pushed a commit with cpu and memory requirements.
memory: "9000Mi" | ||
requests: | ||
cpu: "12" | ||
memory: "9000Mi" |
do we need the adapter-loader initContainer? we removed that for the gpu deployment
when removing the init container I'm getting this error about a missing adapter:
INFO 03-13 07:42:00 executor_base.py:115] Maximum concurrency for 32768 tokens per request: 4.57x
INFO 03-13 07:42:02 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 2.30 seconds
INFO 03-13 07:42:02 api_server.py:756] Using supplied chat template:
INFO 03-13 07:42:02 api_server.py:756] None
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 879, in run_server
await init_app_state(engine_client, model_config, app.state, args)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 765, in init_app_state
await state.openai_serving_models.init_static_loras()
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_models.py", line 96, in init_static_loras
raise ValueError(load_result.message)
ValueError: Loading lora tweet-summary-0 failed: No adapter found for /adapters/hub/models--ai-blond--Qwen-Qwen2.5-Coder-1.5B-Instruct-lora/snapshots/9cde18d8ed964b0519fb481cca6acd936b2ca811
The solution is to have the adapters in the flags above point directly to HF; right now they point to the volume that is created and populated by the sidecar, which is not necessary.
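A minimal sketch of what that could look like, assuming vLLM's `--lora-modules name=path` syntax and the adapter repo id taken from the error above; the base model shown here is only a placeholder, and the authoritative flags are the ones in gpu-deployment.yaml:

```yaml
# Sketch only: point the LoRA module straight at the Hugging Face repo id
# instead of the /adapters volume populated by the init container.
args:
- "--model"
- "Qwen/Qwen2.5-1.5B-Instruct"   # placeholder base model, not taken from this PR
- "--port"
- "8000"
- "--enable-lora"
- "--lora-modules"
- "tweet-summary-0=ai-blond/Qwen-Qwen2.5-Coder-1.5B-Instruct-lora"
```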
Pls see the gpu deployment as an example.
@ahg-g good pointer. I found adapters on Hugging Face and tested that it's working without the init container.
Will push this change soon.
Thanks :)
This is missing the lora-syncer sidecar and configmap, otherwise the lora rollout guide wouldn't work. Please see the gpu-deployment.yaml file and try to mirror it.
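For orientation, a rough sketch of the wiring being asked for; the image, env var, and ConfigMap names below are assumptions, and the authoritative version is the one in gpu-deployment.yaml:

```yaml
# Sketch only: mirror the lora-syncer sidecar and its ConfigMap from gpu-deployment.yaml.
# Image, env var, and ConfigMap names here are assumptions, not taken from this PR.
containers:
- name: lora-adapter-syncer
  image: <lora-syncer-image>               # copy the exact image from gpu-deployment.yaml
  env:
  - name: DYNAMIC_LORA_ROLLOUT_CONFIG      # assumed env var pointing at the mounted config
    value: "/config/configmap.yaml"
  volumeMounts:
  - name: config-volume
    mountPath: /config
volumes:
- name: config-volume
  configMap:
    name: vllm-qwen-adapters               # hypothetical ConfigMap name
```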
I'm not sure I got what exactly wouldn't work, but I've added the configmap and sidecar per your request.
I've deployed this and verified I can call the OpenAI API endpoints and get responses.
ok, I've read the "Adapter Rollout" readme file. got it.
I am still getting the following error:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in
uvloop.run(run_server(args))
File "/usr/local/lib/python3.10/dist-packages/uvloop/init.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.10/dist-packages/uvloop/init.py", line 61, in wrapper
return await main
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
That's interesting. I'm failing to understand from this log what the issue could be.
Force-pushed from fbf81a2 to c74605d
@ahg-g I've confirmed that this deployment is working on two additional clusters.
And btw, on another cluster I was able to get reasonable response times for inference requests using only 4 CPUs per pod.
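If that holds up, a lower-footprint variant of the resources stanza might look like this (a sketch only; these values were not committed in this PR):

```yaml
# Sketch only: a smaller CPU request than the committed 12-CPU example.
resources:
  requests:
    cpu: "4"
    memory: "9000Mi"   # memory value reused from the snippet above, not re-validated here
  limits:
    cpu: "4"
    memory: "9000Mi"
```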
Update: this worked for me, including the removal of the HF token. As I mentioned in the other PR, the issue was the CPU model: I was using AMD, but when switching to Intel it worked. We need to document that somewhere or make the image multi-arch. Thanks @nirrozenbaum!
@ahg-g I was checking the vllm-cpu image arch just now.
I will add this line to the documentation.
@ahg-g the cpu Dockerfile is using: [screenshot not captured]
Force-pushed from c74605d to cb83786
xref #527 for using tabs to separate CPU/GPU-based deployment modes.
Force-pushed from cb83786 to 1d2bedc
/lgtm
@nirrozenbaum I think the issue is not x86 per se (both AMD and Intel CPUs are x86); it is probably related to some special Intel instructions that perhaps vllm uses.
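One way to document and enforce this until the image is multi-arch (or the exact instruction requirement, possibly AVX-512, is confirmed) would be node affinity. `kubernetes.io/arch` is a well-known Kubernetes label; the CPU-feature label below is purely hypothetical and would have to be applied to nodes by the cluster operator:

```yaml
# Sketch only: schedule the CPU deployment onto nodes known to have the
# required CPU features. The feature label is hypothetical.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: example.com/cpu-feature.avx512f   # hypothetical label for the needed instruction set
          operator: In
          values: ["true"]
```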
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g, nirrozenbaum.
trying to run the example without configuring the HF token seems to work, so this part can be removed.
CC @danehans
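For context, the part being removed is presumably the Hugging Face token wiring in the CPU example, something along these lines (a sketch; secret and key names are illustrative, not taken from this excerpt):

```yaml
# Sketch only: the kind of HF token env wiring this PR drops from the CPU example.
# Secret and key names are illustrative.
env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-token
      key: token
```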