Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash before integrity hash on stop #1872

Open
ericpassmore opened this issue Nov 7, 2023 · 16 comments
Open

Crash before integrity hash on stop #1872

ericpassmore opened this issue Nov 7, 2023 · 16 comments
Assignees
Labels
bug Something isn't working 👍 lgtm
Milestone

Comments

@ericpassmore
Copy link
Contributor

Nodoes terminates before integrity hash is reported. The command line reports segmentation fault(core dump) but I'm unable to find the core dump on the host.

Running nodeos 5.0.0-rc2 , the config will sync with peers and terminate at end block. Snapshot starts at block num 323784127 and end block is 323956925. sync-config.ini may be found here. Snapshot is v6 from eos nation.

nodeos   --snapshot "${NODEOS_DIR}"/snapshot/snapshot.bin  \
--data-dir "${NODEOS_DIR}"/data/      \
--config "${CONFIG_DIR}"/sync-config.ini      \
--terminate-at-block ${END_BLOCK}      \
--integrity-hash-on-start      \
--integrity-hash-on-stop

Last lines of the log file

info  2023-11-07T06:02:05.463 net-0     net_plugin.cpp:2253           recv_handshake       ] ["eosn-eos-seed171:9876 - 47a4e43" - 1 209.249.216.152:9876] handshake lib 340293504, head 340293840, head id 63c09107414eba99.. sync 1, head 323956819, lib 323956493
info  2023-11-07T06:02:05.463 net-0     net_plugin.cpp:1572           operator()           ] ["eosn-eos-seed171:9876 - 47a4e43" - 1 209.249.216.152:9876] Sending handshake generation 1430, lib 323956493, head 323956819, id a3dcd867d009105e
info  2023-11-07T06:02:07.667 nodeos    producer_plugin.cpp:1754      start_block          ] Reached configured maximum block 323956925; terminating
error 2023-11-07T06:02:07.667 nodeos    producer_plugin.cpp:2459      schedule_production_ ] Failed to start a pending block, will try again later
info  2023-11-07T06:02:07.667 net-0     net_plugin.cpp:1453           _close               ] ["peer1.eosphere.io:9876 - 48fdafa" - 3 198.50.156.24:9876] closing
info  2023-11-07T06:02:07.676 nodeos    resource_monitor_plugi:140    plugin_shutdown      ] shutdown...
info  2023-11-07T06:02:07.676 nodeos    resource_monitor_plugi:147    plugin_shutdown      ] exit shutdown
info  2023-11-07T06:02:07.676 nodeos    net_plugin.cpp:4390           plugin_shutdown      ] shutdown..
info  2023-11-07T06:02:07.676 nodeos    net_plugin.cpp:4578           close_all            ] close all 4 connections
info  2023-11-07T06:02:07.677 nodeos    net_plugin.cpp:4394           plugin_shutdown      ] exit shutdown
info  2023-11-07T06:02:07.677 nodeos    producer_plugin.cpp:1400      plugin_shutdown      ] exit shutdown
info  2023-11-07T06:02:07.677 nodeos    http_plugin.cpp:515           plugin_shutdown      ] exit shutdown

Next steps, going to run this with a much smaller sync span, and try with nodeos v4.0.4.

@ericpassmore ericpassmore added the bug Something isn't working label Nov 7, 2023
@enf-ci-bot enf-ci-bot moved this to Todo in Team Backlog Nov 7, 2023
@ericpassmore
Copy link
Contributor Author

ericpassmore commented Nov 7, 2023

No luck with other options. Testing with smaller sync span, and tested on 4.0.4. Still crashing without logging ending integrity hash.

@heifner
Copy link
Member

heifner commented Nov 8, 2023

I tried this with release/5.0 using your config.ini and it didn't segfault.

@ericpassmore
Copy link
Contributor Author

Closing this issue. First I am unable to get a core dump. After changing ulimit its reporting segmentation fault, and no core dump. Second due to lack of root privileges I have unpacked nodoes locally by expanding the deb. Therefore this isn't a standard install.

@github-project-automation github-project-automation bot moved this from Todo to Done in Team Backlog Nov 9, 2023
@ericpassmore
Copy link
Contributor Author

reopening to continue investigation, will attach a debugger.

@ericpassmore
Copy link
Contributor Author

ericpassmore commented Dec 13, 2023

Reliably getting errors with 5.0.0-rc2. Attached gdb and included a stack trace from the treads at the time of nodeos crash.

Leap 5.0.0-rc2 Stack-Trce-Stop-Hash-Failure.txt
Leap 5.0.0-rc2 Nodes-Log-Stop-Hash-Failure.log

For Comparison here is the nodeos log lines without the --integrity-hash-on-stop option, and a successful exit.
Leap 5.0.0-rc2 OK-Log-Without-Stop-Hash-Option.log

@ericpassmore
Copy link
Contributor Author

Stack trace from Leap 5.0.0-rc3 slightly different
Leap 5.0.0-rc3 Stack-Trce-Stop-Hash-Failure.txt

@ericpassmore
Copy link
Contributor Author

Stack trace from Leap 4.0.5.
Nodeos-4.0.5-Stack-Trce-Stop-Hash-Failure.txt

@heifner
Copy link
Member

heifner commented Dec 13, 2023

clear_expired_input_transactions is calling a method on controller, but controller is being destroyed at this point. Seems like we should move the calculate_integrity_hash() out of the ~controller_impl() destructor and into chain_plugin shutdown.

@heifner heifner reopened this Dec 13, 2023
@github-project-automation github-project-automation bot moved this from Done to Todo in Team Backlog Dec 13, 2023
@greg7mdp
Copy link
Contributor

greg7mdp commented Dec 13, 2023

clear_expired_input_transactions is calling a method on controller, but controller is being destroyed at this point. Seems like we should move the calculate_integrity_hash() out of the ~controller_impl() destructor and into chain_plugin shutdown.

Or maybe we add a bool parameter to add_to_snapshot(.., at_shutdown = false) so that when called at shutdown we don't call clear_expired_input_transactions?

It is kinda weird that the fact this log statement:
ilog( "chain database stopped with hash: ${hash}", ("hash", calculate_integrity_hash()) ); in ~controller_impl() has the side effect of writing to the snapshot and clearing expired transactions.

@heifner
Copy link
Member

heifner commented Dec 13, 2023

clear_expired_input_transactions is required.

Yeah, if you enable warn or error level logging then the integrity hash generation on start/stop doesn't work. We should fix that also.

@bhazzard
Copy link

Summary: The issue comes into play when outputting the integrity hash on start and stop...

--integrity-hash-on-start   
--integrity-hash-on-stop

It needs to be fixed in affected versions, and the feature was added in 3.2.

Fix in 3.2.6, 4.0.6, and 5.0.x, and main/6

High priority for 5.0 because it would block chicken dance.

It should be fixed in 3.2.x and 4.0.x, but won't on its own necessitate a release.

@bhazzard bhazzard added 👍 lgtm and removed triage labels Dec 14, 2023
@bhazzard bhazzard added this to the Leap v3.2.6 milestone Dec 14, 2023
@ericpassmore
Copy link
Contributor Author

Chicken dance is not blocked. There is a work around already in place.

@spoonincode spoonincode moved this from Todo to In Progress in Team Backlog Dec 18, 2023
@spoonincode
Copy link
Member

fwiw so far I have been unable to reproduce the crash nor can I get any tooling to report an invalid access. That's somewhat a problem in creating a test case to show the problem is fixed. Wonder what I'm missing.

clear_expired_input_transactions is calling a method on controller, but controller is being destroyed at this point.

controller or controller_impl? The only actions that have been taken on controller_impl at that time are,

~controller_impl() {
thread_pool.stop();
pending.reset();
//only log this not just if configured to, but also if initialization made it to the point we'd log the startup too
if(okay_to_print_integrity_hash_on_stop && conf.integrity_hash_on_stop)
ilog( "chain database stopped with hash: ${hash}", ("hash", calculate_integrity_hash()) );
}

and pending.reset(); is pretty much a noop since it was just done anyways,
controller::~controller() {
my->abort_block();

I'm not immediately seeing anything in clear_expired_input_transactions() that accesses something now invalid? All the members of controller_impl are still intact other then pending which is well handled as part of is_building_block().

Yeah, if you enable warn or error level logging then the integrity hash generation on start/stop doesn't work. We should fix that also.

What are you suggesting? There isn't really anything to do other then log it.

@spoonincode
Copy link
Member

What are you suggesting? There isn't really anything to do other then log it.

Maybe it gets its own non-default logger?

@heifner
Copy link
Member

heifner commented Jan 17, 2024

self in controller_impl::clear_expired_input_transactions is controller& maybe that is not valid at this point.

What are you suggesting? There isn't really anything to do other then log it.

Maybe it gets its own non-default logger?

I mean we need this:

      if(okay_to_print_integrity_hash_on_stop && conf.integrity_hash_on_stop) {
         auto hash = calculate_integrity_hash();
         fc::logger::get(DEFAULT_LOGGER)).log( FC_LOG_MESSAGE( info, "chain database stopped with hash: ${hash}", ("hash", hash) ) );
      }

Otherwise if a user has default logger configured for warn or error they will not get output even if they ask for --integrity-hash-on-stop.

@ericpassmore
Copy link
Contributor Author

Note:start
group: STABILITY
category: BUG
summary: Fixes bug and prints final hash when specifying the option --integrity-hash-on-stop. Previously a race condition in the shutdown logic prevented the hash from being printed.
Note:end

@BenjaminGormanPMP BenjaminGormanPMP moved this from In Progress to Todo in Team Backlog Jan 25, 2024
@bhazzard bhazzard modified the milestones: Leap v3.2.6, Leap v3.2.7 Feb 27, 2024
@ericpassmore ericpassmore modified the milestones: Leap v3.2.7, Leap v6.0.1 Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working 👍 lgtm
Projects
Archived in project
Development

No branches or pull requests

6 participants