Loading large Huggingface models with constrained resources using accelerate

This document describes how to serve large Hugging Face models with limited resources using accelerate. This option is activated with low_cpu_mem_usage=True. The model is first created on the meta device (with empty weights), and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint).
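As a minimal sketch of what this looks like at the transformers level (the dtype choice is an assumption for illustration; in this example the actual loading is done by custom_handler.py):

```python
# Minimal sketch of low-memory loading; torch_dtype=torch.float16 is an
# assumption for illustration, and the actual loading in this example is
# performed by custom_handler.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-7b1"

# With low_cpu_mem_usage=True the model is instantiated on the meta device
# with empty weights, and checkpoint shards are loaded into it one by one.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```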

Step 1: Download model

Log in to the Hugging Face Hub by running the command below:

huggingface-cli login

Paste the access token generated on the Hugging Face Hub when prompted, then download the model:

python Download_model.py --model_name bigscience/bloom-7b1

The script prints the path where the model is downloaded, as shown below.

model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/

The downloaded model is around 14 GB.
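For reference, the download script is conceptually similar to the hedged sketch below, built on huggingface_hub.snapshot_download; the cache_dir value is an assumption, and the repository's Download_model.py is the authoritative version:

```python
# Hedged sketch of what Download_model.py roughly does; cache_dir is an
# assumption chosen to match the model/ prefix of the path printed above.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="bigscience/bloom-7b1",  # corresponds to --model_name
    cache_dir="model",
)
print(path)  # e.g. the snapshots/ path shown above
```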

Step 2: Compress downloaded model

NOTE: Install the zip CLI tool if it is not already available.

Navigate to the path printed by the above script. In this example it is:

cd model/models--bigscience-bloom-7b1/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/
zip -r /home/ubuntu/serve/examples/Huggingface_Largemodels/model.zip *
cd -

Step 3: Generate MAR file

Navigate back up to the Huggingface_Largemodels directory.

torch-model-archiver --model-name bloom --version 1.0 --handler custom_handler.py --extra-files model.zip,setup_config.json -r requirements.txt

Note: Modify setup_config.json as needed before generating the MAR file; it is packaged via --extra-files so the handler can read it at load time.
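A hypothetical sketch of how a handler can pick up these settings; the key names below are assumptions for illustration, so match them to what custom_handler.py actually expects:

```python
# Hypothetical sketch; the keys "model_name" and "low_cpu_mem_usage" are
# assumptions for illustration, not the confirmed schema of this example.
import json

with open("setup_config.json") as f:
    setup_config = json.load(f)

model_name = setup_config.get("model_name", "bigscience/bloom-7b1")
low_cpu_mem_usage = setup_config.get("low_cpu_mem_usage", True)
```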

Step 4: Add the MAR file to the model store

mkdir model_store
mv bloom.mar model_store

Step 5: Start TorchServe

Update config.properties and start TorchServe. A minimal config.properties could look like the sketch below; every value shown is an assumption to adapt to your deployment:
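```properties
# Hedged sketch of config.properties; all values are assumptions.
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
# install the model's requirements.txt dependencies inside the worker
install_py_dep_per_model=true
# large models load and respond slowly; raise the timeout accordingly
default_response_timeout=300
```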

torchserve --start --ncs --ts-config config.properties --disable-token-auth --enable-model-api

Step 6: Run inference

curl -v "http://localhost:8080/predictions/bloom" -T sample_text.txt