
Commit d2d8966

Merge branch 'master' into fix-docker-regression
2 parents: c03db10 + bebab5a

19 files changed: +88 -235 lines

.github/workflows/doc-automation.yml

+3

@@ -7,6 +7,9 @@ on:
 jobs:
   build_docs_job:
     runs-on: ubuntu-20.04
+    permissions:
+      # Grant write permission here so that the doc can be pushed to gh-pages branch
+      contents: write
     steps:
       - name: Setup Python 3.9
         uses: actions/setup-python@v5

examples/large_models/Huggingface_accelerate/Download_model.py

+1

@@ -47,5 +47,6 @@ def hf_model(model_str):
     revision=args.revision,
     cache_dir=args.model_path,
     use_auth_token=True,
+    ignore_patterns=["original/*"],
 )
 print(f"Files for '{args.model_name}' is downloaded to '{snapshot_path}'")
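The new `ignore_patterns` argument makes the script's `snapshot_download` call skip the `original/*` files (the consolidated Meta-format checkpoints shipped alongside the HF-format shards). A minimal standalone sketch of an equivalent call; the repo id and cache directory below are illustrative assumptions, not the script's actual arguments:

```python
# Sketch of a snapshot_download call using the new ignore_patterns filter.
# Assumptions: Meta-Llama-3-70B-Instruct as repo_id, "model" as cache_dir, prior `huggingface-cli login`.
from huggingface_hub import snapshot_download

snapshot_path = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-70B-Instruct",
    cache_dir="model",
    use_auth_token=True,             # relies on the token stored by huggingface-cli login
    ignore_patterns=["original/*"],  # skip the original/ consolidated weights, keep only the HF shards
)
print(f"Snapshot downloaded to {snapshot_path}")
```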

examples/large_models/Huggingface_accelerate/llama2/Readme.md → examples/large_models/Huggingface_accelerate/llama/Readme.md

+12 -12

@@ -1,10 +1,10 @@
-# Loading meta-llama/Llama-2-70b-chat-hf on AWS EC2 g5.24xlarge using accelerate
+# Loading meta-llama/Meta-Llama-3-70B-Instruct on AWS EC2 g5.24xlarge using accelerate
 
-This document briefs on serving large HG models with limited resource using accelerate. This option can be activated with `low_cpu_mem_usage=True`. The model is first created on the Meta device (with empty weights) and the state dict is then loaded inside it (shard by shard in the case of a sharded checkpoint).
+This document briefs on serving large HF models with limited resources using accelerate. This option can be activated with `low_cpu_mem_usage=True`. The model is first created on the Meta device (with empty weights) and the state dict is then loaded inside it (shard by shard in the case of a sharded checkpoint). This example uses Meta Llama 3, but it works with Llama 2 as well by replacing the model identifier.
 
 ### Step 1: Download model Permission
 
-Follow [this instruction](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) to get permission
+Follow [these instructions](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) to get permission
 
 Login with a Hugging Face account
 ```
@@ -14,44 +14,44 @@ huggingface-cli login --token $HUGGINGFACE_TOKEN
 ```
 
 ```bash
-python ../Download_model.py --model_path model --model_name meta-llama/Llama-2-70b-chat-hf
+python ../Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3-70B-Instruct
 ```
-Model will be saved in the following path, `model/models--meta-llama--Llama-2-70b-chat-hf`.
+Model will be saved in the following path, `model/models--meta-llama--Meta-Llama-3-70B-Instruct`.
 
 ### Step 2: Generate MAR file
 
 Add the downloaded path to " model_path:" in `model-config.yaml` and run the following.
 
 ```bash
-torch-model-archiver --model-name llama2-70b-chat --version 1.0 --handler custom_handler.py --config-file model-config.yaml -r requirements.txt --archive-format no-archive
+torch-model-archiver --model-name llama3-70b-instruct --version 1.0 --handler custom_handler.py --config-file model-config.yaml -r requirements.txt --archive-format no-archive
 ```
 
-If you are using conda, and notice issues with mpi4py, you would need to install openmpi-mpicc using the following
+If you are using conda, and notice issues with mpi4py, you can install it with
 
 ```
-conda install -c conda-forge openmpi-mpicc
+conda install mpi4py
 ```
 
 ### Step 3: Add the mar file to model store
 
 ```bash
 mkdir model_store
-mv llama2-70b-chat model_store
-mv model model_store/llama2-70b-chat
+mv llama3-70b-instruct model_store
+mv model model_store/llama3-70b-instruct
 ```
 
 ### Step 3: Start torchserve
 
 Update config.properties and start torchserve
 
 ```bash
-torchserve --start --ncs --ts-config config.properties --model-store model_store --models llama2-70b-chat
+torchserve --start --ncs --ts-config config.properties --model-store model_store --models llama3-70b-instruct
 ```
 
 ### Step 4: Run inference
 
 ```bash
-curl -v "http://localhost:8080/predictions/llama2-70b-chat" -T sample_text.txt
+curl -v "http://localhost:8080/predictions/llama3-70b-instruct" -T sample_text.txt
 ```
 
 results in the following output
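The readme's note on `low_cpu_mem_usage=True` refers to the accelerate-backed loading path in `transformers`: the model skeleton is built on the meta device and the checkpoint shards are then streamed into it. A minimal standalone sketch of that loading call; the local path placeholder, device map, and dtype below are assumptions mirroring the handler's settings:

```python
# Sketch of the accelerate-backed loading described in the readme:
# low_cpu_mem_usage=True instantiates the model with empty weights on the meta device,
# then loads the checkpoint shard by shard; device_map spreads layers across available GPUs.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model/models--meta-llama--Meta-Llama-3-70B-Instruct/snapshots/<snapshot-hash>",  # placeholder local path from Step 1
    low_cpu_mem_usage=True,
    device_map="balanced",
    torch_dtype=torch.float16,
)
```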

examples/large_models/Huggingface_accelerate/llama2/custom_handler_code.py → examples/large_models/Huggingface_accelerate/llama/custom_handler.py

+39 -44

@@ -1,9 +1,10 @@
 import logging
 from abc import ABC
+from typing import Dict
 
 import torch
 import transformers
-from transformers import AutoModelForCausalLM, AutoTokenizer
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
 
 from ts.context import Context
 from ts.torch_handler.base_handler import BaseHandler
@@ -39,26 +40,30 @@ def initialize(self, ctx: Context):
         seed = int(ctx.model_yaml_config["handler"]["manual_seed"])
         torch.manual_seed(seed)
 
-        logger.info("Model %s loading tokenizer", ctx.model_name)
+        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
+        self.tokenizer.pad_token = self.tokenizer.eos_token
+        self.tokenizer.padding_side = "left"
+        logger.info("Model %s loaded tokenizer successfully", ctx.model_name)
+
+        if self.tokenizer.vocab_size >= 128000:
+            quant_config = BitsAndBytesConfig(
+                load_in_4bit=True,
+                bnb_4bit_use_double_quant=True,
+                bnb_4bit_quant_type="nf4",
+                bnb_4bit_compute_dtype=torch.bfloat16,
+            )
+        else:
+            quant_config = BitsAndBytesConfig(load_in_8bit=True)
+
         self.model = AutoModelForCausalLM.from_pretrained(
             model_path,
             device_map="balanced",
             low_cpu_mem_usage=True,
             torch_dtype=torch.float16,
-            load_in_8bit=True,
+            quantization_config=quant_config,
             trust_remote_code=True,
         )
-        if ctx.model_yaml_config["handler"]["fast_kernels"]:
-            from optimum.bettertransformer import BetterTransformer
-
-            try:
-                self.model = BetterTransformer.transform(self.model)
-            except RuntimeError as error:
-                logger.warning(
-                    "HuggingFace Optimum is not supporting this model,for the list of supported models, please refer to this doc,https://huggingface.co/docs/optimum/bettertransformer/overview"
-                )
-        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
-
+        self.device = next(iter(self.model.parameters())).device
         logger.info("Model %s loaded successfully", ctx.model_name)
         self.initialized = True
 
@@ -72,38 +77,31 @@ def preprocess(self, requests):
             tuple: A tuple with two tensors: the batch of input ids and the batch of
             attention masks.
         """
-        input_texts = [data.get("data") or data.get("body") for data in requests]
-        input_ids_batch, attention_mask_batch = [], []
-        for input_text in input_texts:
-            input_ids, attention_mask = self.encode_input_text(input_text)
-            input_ids_batch.append(input_ids)
-            attention_mask_batch.append(attention_mask)
-        input_ids_batch = torch.cat(input_ids_batch, dim=0).to(self.model.device)
-        attention_mask_batch = torch.cat(attention_mask_batch, dim=0).to(self.device)
-        return input_ids_batch, attention_mask_batch
-
-    def encode_input_text(self, input_text):
+        input_texts = [self.preprocess_requests(r) for r in requests]
+
+        logger.info("Received texts: '%s'", input_texts)
+        inputs = self.tokenizer(
+            input_texts,
+            max_length=self.max_length,
+            padding=True,
+            add_special_tokens=True,
+            return_tensors="pt",
+            truncation=True,
+        ).to(self.device)
+        return inputs
+
+    def preprocess_requests(self, request: Dict):
         """
-        Encodes a single input text using the tokenizer.
+        Preprocess request
         Args:
-            input_text (str): The input text to be encoded.
+            request (Dict): Request to be decoded.
         Returns:
-            tuple: A tuple with two tensors: the encoded input ids and the attention mask.
+            str: Decoded input text
         """
+        input_text = request.get("data") or request.get("body")
         if isinstance(input_text, (bytes, bytearray)):
            input_text = input_text.decode("utf-8")
-        logger.info("Received text: '%s'", input_text)
-        inputs = self.tokenizer.encode_plus(
-            input_text,
-            max_length=self.max_length,
-            padding=False,
-            add_special_tokens=True,
-            return_tensors="pt",
-            truncation=True,
-        )
-        input_ids = inputs["input_ids"]
-        attention_mask = inputs["attention_mask"]
-        return input_ids, attention_mask
+        return input_text
 
@@ -115,11 +113,8 @@ def inference(self, input_batch):
         Returns:
             list: A list of strings with the predicted values for each input text in the batch.
         """
-        input_ids_batch, attention_mask_batch = input_batch
-        input_ids_batch = input_ids_batch.to(self.device)
         outputs = self.model.generate(
-            input_ids_batch,
-            attention_mask=attention_mask_batch,
+            **input_batch,
             max_length=self.max_new_tokens,
         )
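The rewritten handler tokenizes the whole batch in one call (left padding, `pad_token = eos_token`) and passes the tokenizer output straight into `generate`, while picking a 4-bit NF4 `BitsAndBytesConfig` when the vocabulary size indicates a Llama 3 tokenizer (128k tokens) and 8-bit otherwise. A minimal sketch of that batched flow outside TorchServe; the model id, prompts, and length limits below are illustrative assumptions:

```python
# Standalone sketch of the handler's new batched preprocess + generate flow.
# Assumptions: Meta-Llama-3-70B-Instruct as the checkpoint; max_length=50 mirrors model-config.yaml.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers define no pad token of their own
tokenizer.padding_side = "left"            # left padding keeps generation aligned for decoder-only models

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)

texts = ["what is the recipe of mayonnaise?", "tell me a joke"]
inputs = tokenizer(
    texts, padding=True, truncation=True, max_length=50, return_tensors="pt"
).to(model.device)

# generate() consumes input_ids and attention_mask directly from the tokenizer's BatchEncoding
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```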

examples/large_models/Huggingface_accelerate/llama2/model-config.yaml → examples/large_models/Huggingface_accelerate/llama/model-config.yaml

+2 -3

@@ -6,9 +6,8 @@ responseTimeout: 1200
 deviceType: "gpu"
 
 handler:
-    model_name: "meta-llama/Llama-2-70b-chat-hf"
-    model_path: "model/models--meta-llama--Llama-2-70b-chat-hf/snapshots/9ff8b00464fc439a64bb374769dec3dd627be1c2"
+    model_name: "meta-llama/Meta-Llama-3-70B-Instruct"
+    model_path: "model/models--meta-llama--Meta-Llama-3-70B-Instruct/snapshots/5fcb2901844dde3111159f24205b71c25900ffbd"
     max_length: 50
     max_new_tokens: 50
     manual_seed: 40
-    fast_kernels: True
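Inside the handler these values arrive through the TorchServe context as `ctx.model_yaml_config["handler"]`. A minimal sketch of reading the same keys directly from the file (a plain PyYAML parse is an assumption here, used only for illustration outside TorchServe):

```python
# Sketch: the handler-facing keys of model-config.yaml, parsed with PyYAML.
# Inside TorchServe the same mapping is exposed to the handler as ctx.model_yaml_config.
import yaml

with open("model-config.yaml") as f:
    config = yaml.safe_load(f)

handler_cfg = config["handler"]
print(handler_cfg["model_path"])      # local snapshot directory produced by Download_model.py
print(handler_cfg["max_new_tokens"])  # generation length limit the handler passes to generate()
```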
examples/large_models/Huggingface_accelerate/llama/sample_text.txt

+1

@@ -0,0 +1 @@
+what is the recipe of mayonnaise?

examples/large_models/Huggingface_accelerate/llama2/custom_handler.py

-146
This file was deleted.

examples/large_models/Huggingface_accelerate/llama2/sample_text.txt

-1
This file was deleted.

requirements/developer.txt

+1 -1

@@ -18,5 +18,5 @@ torchpippy==0.1.1
 intel_extension_for_pytorch==2.2.0; sys_platform != 'win32' and sys_platform != 'darwin'
 onnxruntime==1.17.1
 googleapis-common-protos
-onnx==1.14.1
+onnx==1.16.0
 orjson

requirements/torch_cu118_linux.txt

+4 -4

@@ -1,7 +1,7 @@
 #pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117
 --extra-index-url https://download.pytorch.org/whl/cu118
 -r torch_common.txt
-torch==2.2.1+cu118; sys_platform == 'linux'
-torchvision==0.17.1+cu118; sys_platform == 'linux'
-torchtext==0.17.1; sys_platform == 'linux'
-torchaudio==2.2.1+cu118; sys_platform == 'linux'
+torch==2.3.0+cu118; sys_platform == 'linux'
+torchvision==0.18.0+cu118; sys_platform == 'linux'
+torchtext==0.18.0; sys_platform == 'linux'
+torchaudio==2.3.0+cu118; sys_platform == 'linux'
