Issue with bulk ingestion: a few query nodes hitting the maximum memory limit #28682
Hi team, we are doing bulk ingestion of around 1 billion vectors for some scale testing. Below are the configurations we used. Milvus details: version 2.3.1, Milvus cluster deployed on Kubernetes, using external S3 for indexing and external Kafka (AWS MSK). (We are able to reproduce the same issue on version 2.3.3.)

We referred to the Milvus sizing tool for this setup; screenshot attached below. Based on that, we scaled up to 54 query nodes with 12-core CPU / 64 GB memory each and 6 data nodes with 8-core CPU / 16 GB memory each. Note that we took a buffer and provisioned more nodes than the tool suggested. We are hitting memory limits and seeing interruptions in ingestion, with the error below on the query nodes:

The error above states that the memory quota is not enough, but I can see plenty of free capacity in Kubernetes. One query node reaches 95% memory consumption and throws the errors above, while all the other query nodes sit at around 62% memory consumption.

When we scaled up to more query nodes (54 to 60), after about 5 minutes I noticed the memory utilization on the over-utilized node dropping, and ingestion continued.
The primary key is not auto-generated: did you assign a different, unique value to 'reviewer_id' for each entity when you called insert() to do the bulk ingestion?
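For reference, here is a minimal pymilvus sketch of what unique primary keys look like at insert time. The collection name "reviews", the 768-dimension "embedding" field, and the connection parameters are assumptions for illustration; only the 'reviewer_id' field name comes from this thread.

```python
import random
from pymilvus import connections, Collection

# Assumed schema: reviewer_id (Int64 primary key, auto_id=False) and
# embedding (FloatVector, dim=768) in a collection named "reviews".
connections.connect(host="localhost", port="19530")
collection = Collection("reviews")

batch_size = 10_000
start_id = 0  # advance this between batches so primary keys never repeat

# Column-based insert: one list per schema field, in schema order.
reviewer_ids = list(range(start_id, start_id + batch_size))  # all values unique
embeddings = [[random.random() for _ in range(768)] for _ in reviewer_ids]

collection.insert([reviewer_ids, embeddings])
collection.flush()  # flush occasionally, not after every small batch
```

(If generating unique IDs on the client side is inconvenient, declaring the primary key field with auto_id=True lets Milvus assign unique IDs itself, in which case the ID column is omitted from insert().)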
If shard_num = 4, there will be 4 "data channels" for the collection, and 4 query nodes act as "leaders", each consuming data from one channel.
When you insert data, the proxy node hashes each "reviewer_id" to an integer value. That hash value is taken modulo 4 to determine which channel the entity belongs to.
If all the "reviewer_id" values are the same, all of the data is consumed by the same channel. That means only one channel is busy while the others are idle.
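As an illustration only (this is not Milvus's internal hash function, just a sketch of the hash-then-modulo idea), unique keys spread across all shards while a repeated key always lands on the same one:

```python
SHARD_NUM = 4  # matches shard_num = 4 above

def channel_of(reviewer_id: int) -> int:
    # Stand-in for the proxy's hashing; Milvus uses its own hash internally.
    return hash(reviewer_id) % SHARD_NUM

# Unique keys land on all 4 channels, so all shard leaders share the load:
print({channel_of(i) for i in range(10_000)})   # {0, 1, 2, 3}

# A single repeated key maps to one channel, so one query node takes all the load:
print({channel_of(42) for _ in range(10_000)})  # always the same single channel
```

That kind of skew would concentrate the ingestion load on the one query node leading that channel, which matches the 95% vs. 62% memory imbalance described above.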