Process crashes: Check failed: prev_log_index < _options.log_manager->first_log_index() #479
CCisGG commented Dec 4, 2024

Symptom

We recently ran into an issue where, on busy clusters, the process occasionally outputs this log and then crashes:

2024/11/21 13:12:09.049593 [critical] [replicator.cpp:536 thread:2200151] [BUtilLogSink] Check failed: prev_log_index < _options.log_manager->first_log_index() (110267310 vs 110266512). 
#0 0x0000011848e2 braft::Replicator::_fill_common_fields()
#1 0x000001183ee7 braft::Replicator::_send_entries()
#2 0x000001180509 braft::Replicator::_continue_sending()
#3 0x000001153b32 braft::LogManager::run_on_new_log()
#4 0x0000011cc63d bthread::TaskGroup::task_runner()
#5 0x00000134a0a1 bthread_make_fcontext
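
For context, the failing check lives in braft's Replicator::_fill_common_fields. Below is a minimal, self-contained sketch (not braft's actual source; FakeLogManager is invented for illustration) of the invariant the CHECK encodes: when get_term() cannot resolve an index, the replicator assumes the entry must have been compacted away, i.e. it must sit before first_log_index. In this incident the unresolved index sits above last_log_index instead, so the assumption is violated and the process aborts.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>

// Toy stand-in for braft's LogManager, invented for illustration only.
struct FakeLogManager {
    int64_t first_log_index;  // oldest index still kept after compaction
    int64_t last_log_index;   // newest index appended so far

    // Mimics get_term(): returns 0 when the index is outside [first, last],
    // which is when the "Attempted to access entry ... outside of log"
    // warning is printed.
    int64_t get_term(int64_t index) const {
        if (index < first_log_index || index > last_log_index) {
            return 0;
        }
        return 69;  // term taken from the incident logs, for illustration
    }
};

int main() {
    // Values taken from the incident logs.
    FakeLogManager lm{/*first=*/110266512, /*last=*/110267306};
    const int64_t prev_log_index = 110267310;

    const int64_t prev_log_term = lm.get_term(prev_log_index);
    if (prev_log_term == 0 && prev_log_index != 0) {
        // The replicator's invariant: an unresolvable index can only mean the
        // entry was compacted, so it must be below first_log_index. Here it is
        // above last_log_index instead, so the check fails and the process dies.
        assert(prev_log_index < lm.first_log_index &&
               "Check failed: prev_log_index < first_log_index");
    }
    std::cout << "prev_log_term=" << prev_log_term << "\n";
    return 0;
}
```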

RCA

We were able to find some related evidence, but we do not completely understand the root cause yet.

Here are some of our observations:

  1. The replicator thread tries to look up a log entry that is outside the local storage's log index range:

2024/11/21 13:12:09.026240 [warning] [log.cpp:1204 thread:2200151] [BUtilLogSink] Attempted to access entry 110267310 outside of log, first_log_index: 110266091 last_log_index: 110267306

This log shows that the leader's log index range is [110266091, 110267306], while the follower had replicated log index 110267310, which is larger than the last index of the leader's log.

  2. The not-found log entry was already applied on the leader:
2024/11/21 13:12:08.674570 [info] [state_machine_manager_impl.cpp:547 thread:2200145] Applying operation with index 110267309 term 69 ...
2024/11/21 13:12:08.674621 [info] [state_machine_manager_impl.cpp:547 thread:2200145] Applying operation with index 110267310 term 69 ...
2024/11/21 13:12:08.715403 [info] [state_machine_manager_impl.cpp:547 thread:2200151] Applying operation with index 110267311 term 69 ...
...
2024/11/21 13:12:09.024454 [info] [state_machine_manager_impl.cpp:547 thread:2200151] Applying operation with index 110267316 term 69 ...

The logs show that right before the crash, the leader had committed and applied entries all the way up to 110267316. In other words, 110267310 should already have been committed and applied on the leader.

  3. No leadership changes
    We verified that there was no leadership change before the crash, so this is unlikely to be an issue related to leader election or leadership transfer.

  4. Snapshot taken
    Logs show that a snapshot was taken shortly before the braft crash. The deleted log index matches the start_log_index in log_manager (110266512), and the created snapshot index matches the last applied index (110267314).

2024/11/21 13:12:08.715694 [info] [state_machine_manager_impl.cpp:547 thread:2200151] Applying operation with index 110267314 term 69 ...
2024/11/21 13:12:09.008249 [info] [snapshot.cpp:638 thread:2200151] [BUtilLogSink] Deleting /mnt/u001/northguard/i004/metadata/686af3bb-0995-e7a2-20da-cbe0047a887c/data/snapshot/snapshot_00000000000110267314
2024/11/21 13:12:09.008264 [info] [snapshot.cpp:644 thread:2200151] [BUtilLogSink] Renaming /mnt/u001/northguard/i004/metadata/686af3bb-0995-e7a2-20da-cbe0047a887c/data/snapshot/temp to /mnt/u001/northguard/i004/metadata/686af3bb-0995-e7a2-20da-cbe0047a887c/data/snapshot/snapshot_00000000000110267314
2024/11/21 13:12:09.008280 [info] [snapshot.cpp:516 thread:2200151] [BUtilLogSink] Deleting /mnt/u001/northguard/i004/metadata/686af3bb-0995-e7a2-20da-cbe0047a887c/data/snapshot/snapshot_00000000000110266511

So the log compaction/truncation should only clear logs before index 110266512, which should not affect the missing index 110267310.
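
As a quick sanity check on that boundary arithmetic, here is a minimal sketch (assuming the usual braft behavior that setting a new snapshot lets the log manager discard entries up to the previous snapshot's index, so first_log_index becomes that index plus one):

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>

int main() {
    // Values from the incident logs.
    const int64_t prev_snapshot_index = 110266511;  // snapshot_...110266511 (deleted)
    const int64_t new_snapshot_index  = 110267314;  // snapshot_...110267314 (created)
    const int64_t missing_index       = 110267310;  // index the replicator could not resolve

    // Assumed compaction rule: entries up to prev_snapshot_index may be dropped.
    const int64_t first_log_index_after_truncation = prev_snapshot_index + 1;
    assert(first_log_index_after_truncation == 110266512);  // matches the CHECK message

    // The missing index lies well above the truncation point, so compaction
    // alone cannot explain why the replicator could not find it. It is also
    // covered by the new snapshot, since it was applied before the snapshot
    // was taken.
    assert(missing_index >= first_log_index_after_truncation);
    assert(missing_index <= new_snapshot_index);

    std::cout << "first_log_index after truncation: "
              << first_log_index_after_truncation << "\n";
    return 0;
}
```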

Has anyone seen similar issues before? We would like to hear any suggestions on how to debug this. Thanks!

Chinese summary (translated to English)

We recently ran into an issue that seems to occur only when the braft threads are saturated: the braft replicator thread cannot find the term of the current index on the local machine. We have logs showing that this index has already been committed and applied both on the local machine and on the other followers, so in theory it should not be missing. Has anyone encountered a similar problem? Thanks!
