[Feature] Are there any ways to reduce GPU memory usage? When running inference on the 128k NeedleBench, I still run out of memory on 8 A100s #1131
Unanswered
1518630367 asked this question in Q&A
Replies: 2 comments
-
You can try LMDeploy
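For reference, a minimal sketch of what an LMDeploy-backed OpenCompass config could look like. The TurboMindModel wrapper and the engine_config keys (tp, session_len, cache_max_entry_count) are assumptions based on the OpenCompass and LMDeploy APIs and may differ across versions; lowering cache_max_entry_count is LMDeploy's knob for shrinking the KV-cache pool.

# Hedged sketch: an LMDeploy (TurboMind) backend config for OpenCompass.
# engine_config keys are assumed to match LMDeploy's TurbomindEngineConfig.
from opencompass.models import TurboMindModel

models = [
    dict(
        type=TurboMindModel,
        abbr="llama-3-8b-instruct-turbomind",
        path="/opt/218/models/Meta-Llama-3-8B-Instruct-NTK",
        engine_config=dict(
            tp=8,                       # shard the model across all 8 A100s
            session_len=131072,         # allow the full 128k context
            cache_max_entry_count=0.5,  # fraction of free VRAM reserved for KV cache
        ),
        gen_config=dict(max_new_tokens=128),
        max_out_len=128,
        max_seq_len=131072,
        batch_size=1,
        run_cfg=dict(num_gpus=8, num_procs=1),
    )
]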
-
You can try adding model_kwargs=dict(tensor_parallel_size=2, gpu_memory_utilization=0.7) to the model dict in the config file; after I added it, it ran normally.
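That suggestion targets the vLLM backend (tensor_parallel_size and gpu_memory_utilization are vllm.LLM engine arguments), which the question below already imports but does not use. A minimal sketch, under the assumption that OpenCompass's VLLM wrapper forwards model_kwargs to vllm.LLM; the path, chat template, and sequence lengths are copied from the question.

# Hedged sketch: the questioner's config switched to the vLLM backend.
# model_kwargs are assumed to be forwarded to vllm.LLM.
from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr="llama-3-8b-instruct-vllm",
        path="/opt/218/models/Meta-Llama-3-8B-Instruct-NTK",
        model_kwargs=dict(
            tensor_parallel_size=8,      # one shard per A100
            gpu_memory_utilization=0.7,  # cap vLLM's VRAM pool, leaving headroom
            max_model_len=122880,        # match max_seq_len below
        ),
        meta_template=_meta_template,    # same chat template as in the question
        max_out_len=128,
        max_seq_len=122880,
        batch_size=1,
        run_cfg=dict(num_gpus=8, num_procs=1),
    )
]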
Describe the feature
Below is my parameter configuration:
from opencompass.models import HuggingFaceCausalLM
from opencompass.models import VLLM  # imported but unused; the config below uses the HF backend

_meta_template = dict(
    round=[
        dict(role="HUMAN", begin="<|start_header_id|>user<|end_header_id|>\n\n", end="<|eot_id|>"),
        dict(role="BOT", begin="<|start_header_id|>assistant<|end_header_id|>\n\n", end="<|eot_id|>", generate=True),
    ],
)

models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr="llama-3-8b-instruct-hf",
        path="/opt/218/models/Meta-Llama-3-8B-Instruct-NTK",
        model_kwargs=dict(device_map="auto"),
        tokenizer_kwargs=dict(
            padding_side="left",
            truncation_side="left",
            use_fast=False,
        ),
        meta_template=_meta_template,
        max_out_len=128,
        max_seq_len=122880,  # ~120k-token context window
        batch_size=1,
        run_cfg=dict(num_gpus=7, num_procs=1),
        generation_kwargs={"eos_token_id": [128001, 128009]},  # <|end_of_text|>, <|eot_id|>
        # batch_padding=True,
    )
]
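As a rough sanity check on why this setup runs out of memory: at fp16 the KV cache for Llama-3-8B (32 layers, 8 grouped-query KV heads, head dim 128) is only about 15 GiB at this length, but plain HuggingFace generation without a memory-efficient attention kernel materializes seq x seq score matrices, and those dominate. A back-of-the-envelope sketch (estimates only; real usage depends on the attention implementation):

# Rough fp16 memory figures for Llama-3-8B at seq_len = 122880.
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
seq = 122880

kv_cache = 2 * layers * kv_heads * head_dim * fp16_bytes * seq  # K and V for every token
attn_scores = seq * seq * fp16_bytes                            # one head's score matrix, naive attention

print(f"KV cache: {kv_cache / 2**30:.1f} GiB per sequence")  # ~15.0 GiB
print(f"Score matrix: {attn_scores / 2**30:.1f} GiB")        # ~28.1 GiB per head per layer

This is why both replies point at inference engines: vLLM and LMDeploy use paged KV caches and fused attention kernels that never build the full score matrix.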
Will you implement it?