Symptom
We recently ran into an issue where, on busy clusters, the process occasionally emits the following log and then crashes:
2024/11/21 13:12:09.026240 [warning] [log.cpp:1204 thread:2200151] [BUtilLogSink] Attempted to access entry 110267310 outside of log, first_log_index: 110266091 last_log_index: 110267306
RCA
We were able to find something related, but we do not completely understand the root cause yet.
Here are some of our observations:
The replicator thread tries to look up a log entry that is outside the local storage's log index range:
2024/11/21 13:12:09.026240 [warning] [log.cpp:1204 thread:2200151] [BUtilLogSink] Attempted to access entry 110267310 outside of log, first_log_index: 110266091 last_log_index: 110267306
This log shows that the leader's log index range is [110266091, 110267306], while the replicator tried to replicate log index 110267310 to the follower, which is larger than the last index of the leader's log.
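For context, the warning corresponds to a bounds check of roughly the following shape. This is a simplified sketch, not braft's actual log.cpp code; the InMemoryLog type and its fields are made up for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <deque>

// Simplified stand-in for the log storage that the replicator consults.
// braft's real check lives in log.cpp and reports the same three numbers.
struct InMemoryLog {
    int64_t first_log_index = 1;   // oldest entry still kept after compaction
    std::deque<int64_t> terms;     // term of each entry, starting at first_log_index

    int64_t last_log_index() const {
        return first_log_index + static_cast<int64_t>(terms.size()) - 1;
    }

    // Returns the term of `index`, or 0 when the index is outside the range.
    int64_t get_term(int64_t index) const {
        if (index < first_log_index || index > last_log_index()) {
            std::fprintf(stderr,
                         "Attempted to access entry %lld outside of log, "
                         "first_log_index: %lld last_log_index: %lld\n",
                         (long long)index, (long long)first_log_index,
                         (long long)last_log_index());
            return 0;  // callers treat this as "entry not found"
        }
        return terms[index - first_log_index];
    }
};
```

In our crash, the requested index (110267310) fails the `index > last_log_index()` branch even though it is below the applied index, which is what makes the warning surprising.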
The not-found log entry was already applied on the leader.
2024/11/21 13:12:08.674570 [info] [state_machine_manager_impl.cpp:547 thread:2200145] Applying operation with index 110267309 term 69 ...
2024/11/21 13:12:08.674621 [info] [state_machine_manager_impl.cpp:547 thread:2200145] Applying operation with index 110267310 term 69 ...
2024/11/21 13:12:08.715403 [info] [state_machine_manager_impl.cpp:547 thread:2200151] Applying operation with index 110267311 term 69 ...
...
2024/11/21 13:12:09.024454 [info] [state_machine_manager_impl.cpp:547 thread:2200151] Applying operation with index 110267316 term 69 ...
The logs show that right before the crash, the leader had committed and applied entries all the way to 110267316. In other words, 110267310 should already have been committed and applied on the leader.
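To spell out the contradiction with arithmetic (a minimal sketch; the values simply mirror the indexes quoted from the logs above):

```cpp
#include <cassert>
#include <cstdint>

int main() {
    // Indexes observed in the logs right before the crash.
    const int64_t last_log_index = 110267306;  // from the warning line
    const int64_t applied_index  = 110267316;  // last "Applying operation" line
    const int64_t missing_index  = 110267310;  // entry the replicator failed to find

    // The missing entry is below the applied index, so it must have been
    // committed and stored at some point...
    assert(missing_index <= applied_index);

    // ...yet the log storage claims its last index is below it, which breaks
    // the usual invariant applied_index <= last_log_index on a leader.
    assert(missing_index > last_log_index);
    assert(applied_index > last_log_index);  // the invariant violation itself
    return 0;
}
```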
No leadership changes
We verified that there was no leadership change before the crash, so it is unlikely to be an issue related to leader election or leadership changes.
Snapshot taken
The logs show that a snapshot was taken shortly before the braft crash. The deleted log index matches the start_log_index in log_manager (110266512), and the created snapshot's index matches the last applied index (110267314).
2024/11/21 13:12:08.715694 [info] [state_machine_manager_impl.cpp:547 thread:2200151] Applying operation with index 110267314 term 69 ...
2024/11/21 13:12:09.008249 [info] [snapshot.cpp:638 thread:2200151] [BUtilLogSink] Deleting /mnt/u001/northguard/i004/metadata/686af3bb-0995-e7a2-20da-cbe0047a887c/data/snapshot/snapshot_00000000000110267314
2024/11/21 13:12:09.008264 [info] [snapshot.cpp:644 thread:2200151] [BUtilLogSink] Renaming /mnt/u001/northguard/i004/metadata/686af3bb-0995-e7a2-20da-cbe0047a887c/data/snapshot/temp to /mnt/u001/northguard/i004/metadata/686af3bb-0995-e7a2-20da-cbe0047a887c/data/snapshot/snapshot_00000000000110267314
2024/11/21 13:12:09.008280 [info] [snapshot.cpp:516 thread:2200151] [BUtilLogSink] Deleting /mnt/u001/northguard/i004/metadata/686af3bb-0995-e7a2-20da-cbe0047a887c/data/snapshot/snapshot_00000000000110266511
So the log compaction/truncation should only clear logs before index 110266512, which should not affect the not-found index 110267310.
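Under that expectation, compaction after the snapshot only advances the first kept index and never shrinks the tail of the log. A minimal sketch of that assumption, using a truncate_prefix(first_index_kept)-style operation (braft's log storage exposes a similar call; the LogRange type here is made up):

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a log's index range and prefix truncation after a snapshot.
struct LogRange {
    int64_t first_log_index;
    int64_t last_log_index;

    // Keep everything from first_index_kept onwards; never touch the tail.
    void truncate_prefix(int64_t first_index_kept) {
        if (first_index_kept > first_log_index) {
            first_log_index = first_index_kept;
        }
    }
};

int main() {
    // The leader had applied entries up to 110267316, so the log should
    // contain at least up to that index before compaction.
    LogRange log{110266091, 110267316};

    // The snapshot at 110267314 replaced the one at 110266511, so compaction
    // is expected to keep entries from 110266512 onwards.
    log.truncate_prefix(110266512);

    // The not-found index should comfortably survive this compaction.
    const int64_t missing_index = 110267310;
    assert(missing_index >= log.first_log_index);
    assert(missing_index <= log.last_log_index);
    return 0;
}
```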
Has anyone seen similar issues before? We would like to hear any suggestions on how to debug this. Thanks!
Chinese summary (translated to English)
We recently ran into a problem that only seems to occur when the braft threads are oversaturated: the braft replicator thread fails to find the term for the current index on the local node. We have logs showing that this index has already been committed and applied both on this node and on the other followers, so in principle it should not be missing. Has anyone run into a similar problem? Thanks!