Model is hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4
As soon as I run the TGI benchmarking tool (text-generation-benchmark) with the desired input length for our use case and a batch size of 2, I get a CUDA out-of-memory error and the TGI server stops.
When starting TGI without --max-batch-total-tokens, the logs showed a max batch total tokens of 134902 available. That is why I came up with a config like --max-batch-size 7 --max-batch-total-tokens 125000.
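For illustration, the launch command looks roughly like the sketch below. Only --max-batch-size and --max-batch-total-tokens are the values I actually chose; --num-shard 4 is inferred from the four shard ranks in the logs, and the remaining flags are assumptions rather than the exact command:

```sh
# Sketch of the TGI launch (assumed, not verbatim).
# Only --max-batch-size and --max-batch-total-tokens are the chosen values;
# --num-shard 4 is inferred from shard ranks 0-3 in the logs below.
text-generation-launcher \
  --model-id hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantize awq \
  --num-shard 4 \
  --max-batch-size 7 \
  --max-batch-total-tokens 125000
```

The startup and warm-up logs showing the 134902 figure (i.e. the run without --max-batch-total-tokens):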
INFO text_generation_launcher: Using prefill chunking = True
INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
INFO shard-manager: text_generation_launcher: Shard ready in 148.169397034s rank=3
INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
INFO shard-manager: text_generation_launcher: Shard ready in 150.767169532s rank=2
INFO shard-manager: text_generation_launcher: Shard ready in 150.800946226s rank=0
INFO shard-manager: text_generation_launcher: Shard ready in 150.812528352s rank=1
INFO text_generation_launcher: Starting Webserver
INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
INFO text_generation_launcher: Using optimized Triton indexing kernels.
INFO text_generation_launcher: KV-cache blocks: 134902, size: 1
INFO text_generation_launcher: Cuda Graphs are enabled for sizes [32, 16, 8, 4, 2, 1]
INFO text_generation_router_v3: backends/v3/src/lib.rs:137: Setting max batch total tokens to 134902
WARN text_generation_router_v3::backend: backends/v3/src/backend.rs:39: Model supports prefill chunking. waiting_served_ratio and max_waiting_tokens will be ignored.
INFO text_generation_router_v3: backends/v3/src/lib.rs:166: Using backend V3
INFO text_generation_router::server: router/src/server.rs:1873: Using config Some(Llama)
This is what the GPU memory looks like after server startup (screenshot omitted):
I then run the benchmarking tool and get the OOM error.
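A sketch of the benchmark invocation, with assumed flag values: the batch size of 2 matches the prefill size in the errors below, the sequence length of 6667 matches the request size discussed under "Expected behavior", and the remaining flags are illustrative only, not the exact command that was used:

```sh
# Sketch only: flag values are assumptions, not the verbatim command.
# --batch-size 2        -> matches prefill{size=2} in the errors below
# --sequence-length 6667 -> matches the request size under "Expected behavior"
text-generation-benchmark \
  --tokenizer-name hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --batch-size 2 \
  --sequence-length 6667 \
  --decode-length 128
```

The resulting errors on all four GPUs: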
2025-01-24T14:15:44.533276Z ERROR prefill{id=0 size=2}:prefill{id=0 size=2}: text_generation_client: backends/client/src/lib.rs:46: Server error: CUDA out of memory. Tried to allocate 182.00 MiB. GPU 1 has a total capacity of 22.06 GiB of which 103.12 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.40 GiB is allocated by PyTorch, with 24.46 MiB allocated in private pools (e.g., CUDA Graphs), and 41.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-01-24T14:15:44.535857Z ERROR prefill{id=0 size=2}:prefill{id=0 size=2}: text_generation_client: backends/client/src/lib.rs:46: Server error: CUDA out of memory. Tried to allocate 182.00 MiB. GPU 3 has a total capacity of 22.06 GiB of which 103.12 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.40 GiB is allocated by PyTorch, with 24.46 MiB allocated in private pools (e.g., CUDA Graphs), and 41.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-01-24T14:15:44.536587Z ERROR prefill{id=0 size=2}:prefill{id=0 size=2}: text_generation_client: backends/client/src/lib.rs:46: Server error: CUDA out of memory. Tried to allocate 182.00 MiB. GPU 0 has a total capacity of 22.06 GiB of which 103.12 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.40 GiB is allocated by PyTorch, with 24.46 MiB allocated in private pools (e.g., CUDA Graphs), and 41.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-01-24T14:15:44.536907Z ERROR prefill{id=0 size=2}:prefill{id=0 size=2}: text_generation_client: backends/client/src/lib.rs:46: Server error: CUDA out of memory. Tried to allocate 182.00 MiB. GPU 2 has a total capacity of 22.06 GiB of which 103.12 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.40 GiB is allocated by PyTorch, with 24.46 MiB allocated in private pools (e.g., CUDA Graphs), and 41.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Expected behavior
Because of the reported "Setting max batch total tokens to 134902", I expect the TGI server to be able to handle requests of 6667 tokens in batches of 2, i.e. 13334 tokens in total, well below that limit. Is that not the case? What am I missing here?
Is it possible that the benchmarking tool is doing something weird?
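For reference, a quick back-of-the-envelope check of that expectation (plain arithmetic, nothing TGI-specific):

```sh
# Tokens required by a batch of 2 prompts of 6667 tokens each,
# versus the advertised max batch total tokens of 134902.
echo $(( 2 * 6667 ))            # 13334 tokens needed
echo $(( 134902 - 2 * 6667 ))   # 121568 tokens of apparent headroom
```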
Thank you!