Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upon restart p2p connectback or sync are very hard plus rpc usually timoeut #836

Closed
jun0tpyrc opened this issue Sep 30, 2024 · 21 comments · Fixed by #921
Closed

Upon restart p2p connectback or sync are very hard plus rpc usually timoeut #836

jun0tpyrc opened this issue Sep 30, 2024 · 21 comments · Fixed by #921
Assignees
Labels
👍 lgtm more-info More information requested from issue reporter

Comments

@jun0tpyrc
Copy link

v1.0.1

nodes were powerful enough staying in sync before reboot - but post reboot looks everything change

hard to connect , keep doing handshake

info  2024-09-30T09:56:43.473 net-3     net_plugin.cpp:4863           connection_monitor   ] p2p client connections: 0/150, peer connections: 26/26, block producer peers: 0
info  2024-09-30T09:56:45.948 net-1     net_plugin.cpp:2308           recv_handshake       ] ["0.0.0.0:9876:blk - 25c26e4" - 5 95.217.34.40:9876] handshake lib 396920011, fhead 396920013,
 id d3cd63d3f73b1b42.. sync 1, fhead 396904273, lib 396903272
info  2024-09-30T09:56:45.948 net-1     net_plugin.cpp:1604           operator()           ] ["0.0.0.0:9876:blk - 25c26e4" - 5 95.217.34.40:9876] Sending handshake generation 2059, lib 39
6903272, fhead 396904273, id 0e0150127f1cae4e
info  2024-09-30T09:56:46.359 net-2     net_plugin.cpp:2308           recv_handshake       ] ["eosn-eos-seed172:9876 - 216c673" - 9 209.249.216.152:9876] handshake lib 396920012, fhead 39
6920014, id a7ee3f4876e3821e.. sync 1, fhead 396904273, lib 396903272
info  2024-09-30T09:56:46.359 net-2     net_plugin.cpp:1604           operator()           ] ["eosn-eos-seed172:9876 - 216c673" - 9 209.249.216.152:9876] Sending handshake generation 2591
, lib 396903272, fhead 396904273, id 0e0150127f1cae4e

for hours
(p2p port already open to world)

@jun0tpyrc
Copy link
Author

we have 3 or 4 nodes hitting this . If reimport from almost brand new snaphosts - still hard to connect

@alecalve
Copy link

I am encountering the same issues

@heifner
Copy link
Member

heifner commented Sep 30, 2024

See #829. The handshake messages are harmless, but they do make it hard to see what the real issue is if anything. Can you provide your complete log?

@heifner heifner added the more-info More information requested from issue reporter label Sep 30, 2024
@FoieringSerjo
Copy link

Same here.

The sync is very slow, and when it reaches specific blocks, the node cannot establish a connection at all.

Uploading CleanShot 2024-09-30 at 22.38.32@2x.png…

@spoonincode
Copy link
Member

In my experience this is what logs can look like when nodeos is exceptionally slow applying blocks due to low memory or poor disk performance. Can you run top and see if nodeos is almost always listed with its status as D (instead of R or S, which would be more normal)?

@FoieringSerjo
Copy link

FoieringSerjo commented Sep 30, 2024

In my experience this is what logs can look like when nodeos is exceptionally slow applying blocks due to low memory or poor disk performance. Can you run top and see if nodeos is almost always listed with its status as D (instead of R or S, which would be more normal)?

Indeed always list with status D

CleanShot 2024-09-30 at 23 51 21@2x

However, I definitely have more than enough memory. it uses 2GB, and I have 64GB.

I will check the disk; however, it's hard to believe that this is the root cause; the node worked perfectly until the new version was upgraded.

@jun0tpyrc
Copy link
Author

In my experience this is what logs can look like when nodeos is exceptionally slow applying blocks due to low memory or poor disk performance. Can you run top and see if nodeos is almost always listed with its status as D (instead of R or S, which would be more normal)?

Indeed always list with status D

CleanShot 2024-09-30 at 23 51 21@2x

However, I definitely have more than enough memory. it uses 2GB, and I have 64GB.

I will check the disk; however, it's hard to believe that this is the root cause; the node worked perfectly until the new version was upgraded.

yes , it's D even stuck
same mem-spec
nodes were perfectly in sync and feed the hyperion indexer fail BEFORE reboot -
but after reboot they output not much logs useful other than p2p /handshake and not quite serving rpc

@jun0tpyrc
Copy link
Author

jun0tpyrc commented Oct 1, 2024

last 100000lines of logs of one of the stuck node , should include the process which i restarted it few times and tried various param change to see if helpful

last100000log-eos.txt

@heifner
Copy link
Member

heifner commented Oct 1, 2024

Thanks for the log.

Unfortunately you are hitting: #830 which will be fixed in 1.0.2.
As a work-around you can remove any peers that you are failing to connect to:

18.234.6.119:80
fullnode.eoslaomao.com
p2p.meet.one

info  2024-09-30T11:05:52.788 net-2     net_plugin.cpp:2833           operator()           ] connection failed to fn001.eossv.org:443, Element not found
info  2024-09-30T11:05:52.788 net-2     net_plugin.cpp:2833           operator()           ] connection failed to seed2.greymass.com:9876, Element not found
info  2024-09-30T11:05:52.788 net-2     net_plugin.cpp:2833           operator()           ] connection failed to seed1.greymass.com:9876, Element not found
info  2024-09-30T11:05:52.788 net-2     net_plugin.cpp:2833           operator()           ] connection failed to node1.eoscannon.io:59876, Element not found
info  2024-09-30T11:05:52.788 net-2     net_plugin.cpp:2833           operator()           ] connection failed to node.eosflare.io:1883, Element not found
info  2024-09-30T11:05:52.788 net-0     net_plugin.cpp:2833           operator()           ] connection failed to publicnode.cypherglass.com:9876, Element not found
info  2024-09-30T11:05:52.788 net-1     net_plugin.cpp:2833           operator()           ] connection failed to bp.eosbeijing.one:8080, Element not found
info  2024-09-30T11:05:52.788 net-3     net_plugin.cpp:2833           operator()           ] connection failed to peer2.eosphere.io:9876, Element not found

etc.

Until 1.0.2, just remove all but one, two, or at most three good peers.

@heifner
Copy link
Member

heifner commented Oct 1, 2024

@jun0tpyrc
Copy link
Author

jun0tpyrc commented Oct 1, 2024

even with 1 - 3 peers it's still pretty easy to hit same issues at our side after few attempts

For example , for using snapshot x single peer that works previously to rework (did not come up in error logs before) , may still encounter something like below

info  2024-10-01T08:06:40.385 net-3     net_plugin.cpp:2017           set_state            ] old state lib catchup becoming in sync
info  2024-10-01T08:07:09.662 net-0     net_plugin.cpp:4863           connection_monitor   ] p2p client connections: 0/150, peer connections: 1/1, block producer peers: 0
info  2024-10-01T08:07:39.662 net-3     net_plugin.cpp:4863           connection_monitor   ] p2p client connections: 0/150, peer connections: 1/1, block producer peers: 0
info  2024-10-01T08:07:39.900 net-2     net_plugin.cpp:1604           operator()           ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] Sending handshake generation 1, lib 396068749, fhead 396068749, id 9ed73bed1562926a
info  2024-10-01T08:07:40.138 net-0     net_plugin.cpp:1604           operator()           ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] Sending handshake generation 2, lib 396068749, fhead 396068749, id 9ed73bed1562926a
info  2024-10-01T08:07:40.138 net-3     net_plugin.cpp:2017           set_state            ] old state in sync becoming lib catchup
info  2024-10-01T08:07:40.138 net-3     net_plugin.cpp:2117           request_next_chunk   ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] Catching up with chain, our last req is 0, theirs is 397073291, next expected 396068750, fhead 396068749
info  2024-10-01T08:07:40.138 net-3     net_plugin.cpp:2156           operator()           ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] requesting range 396068750 to 396069249, fhead 396068749, lib 396068749
info  2024-10-01T08:07:40.138 net-0     net_plugin.cpp:3438           handle_message       ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] Local network version: 9
info  2024-10-01T08:07:40.138 net-0     net_plugin.cpp:2308           recv_handshake       ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] handshake lib 397073291, fhead 397073293, id 41da1e90867fe683.. sync 1, fhead 396068749, lib 396068749
warn  2024-10-01T08:07:40.376 net-3     net_plugin.cpp:3506           handle_message       ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] received go_away_message, reason = some other non-fatal condition, possibly unknown block
info  2024-10-01T08:07:40.376 net-3     net_plugin.cpp:1463           _close               ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] closing
warn  2024-10-01T08:07:40.376 net-3     net_plugin.cpp:2139           request_next_chunk   ] Unable to continue syncing at this time
info  2024-10-01T08:07:40.376 net-3     net_plugin.cpp:2017           set_state            ] old state lib catchup becoming in sync

turn out it won't finish the import->catchup flow forever and if one turning off at this stage - node got corrupted and need rerun the import flow

===
This would be a big obstacle for who need to maintain private nodes stably & continuously behind dapp/exchange like our usage. Feel it's risky to reboot things with current version & even fresh rework may not be seamless as before

@blocknodehub
Copy link

even with 1 - 3 peers it's still pretty easy to hit same issues at our side after few attempts

For example , for using snapshot x single peer that works previously to rework (did not come up in error logs before) , may still encounter something like below

info  2024-10-01T08:06:40.385 net-3     net_plugin.cpp:2017           set_state            ] old state lib catchup becoming in sync
info  2024-10-01T08:07:09.662 net-0     net_plugin.cpp:4863           connection_monitor   ] p2p client connections: 0/150, peer connections: 1/1, block producer peers: 0
info  2024-10-01T08:07:39.662 net-3     net_plugin.cpp:4863           connection_monitor   ] p2p client connections: 0/150, peer connections: 1/1, block producer peers: 0
info  2024-10-01T08:07:39.900 net-2     net_plugin.cpp:1604           operator()           ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] Sending handshake generation 1, lib 396068749, fhead 396068749, id 9ed73bed1562926a
info  2024-10-01T08:07:40.138 net-0     net_plugin.cpp:1604           operator()           ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] Sending handshake generation 2, lib 396068749, fhead 396068749, id 9ed73bed1562926a
info  2024-10-01T08:07:40.138 net-3     net_plugin.cpp:2017           set_state            ] old state in sync becoming lib catchup
info  2024-10-01T08:07:40.138 net-3     net_plugin.cpp:2117           request_next_chunk   ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] Catching up with chain, our last req is 0, theirs is 397073291, next expected 396068750, fhead 396068749
info  2024-10-01T08:07:40.138 net-3     net_plugin.cpp:2156           operator()           ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] requesting range 396068750 to 396069249, fhead 396068749, lib 396068749
info  2024-10-01T08:07:40.138 net-0     net_plugin.cpp:3438           handle_message       ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] Local network version: 9
info  2024-10-01T08:07:40.138 net-0     net_plugin.cpp:2308           recv_handshake       ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] handshake lib 397073291, fhead 397073293, id 41da1e90867fe683.. sync 1, fhead 396068749, lib 396068749
warn  2024-10-01T08:07:40.376 net-3     net_plugin.cpp:3506           handle_message       ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] received go_away_message, reason = some other non-fatal condition, possibly unknown block
info  2024-10-01T08:07:40.376 net-3     net_plugin.cpp:1463           _close               ] ["peer1.eosphere.io:9876 - 922061b" - 1 198.50.156.24:9876] closing
warn  2024-10-01T08:07:40.376 net-3     net_plugin.cpp:2139           request_next_chunk   ] Unable to continue syncing at this time
info  2024-10-01T08:07:40.376 net-3     net_plugin.cpp:2017           set_state            ] old state lib catchup becoming in sync

turn out it won't finish the import->catchup flow forever and if one turning off at this stage - node got corrupted and need rerun the import flow

=== This would be a big obstacle for who need to maintain private nodes stably & continuously behind dapp/exchange like our usage. Feel it's risky to reboot things with current version & even fresh rework may not be seamless as before

our nodes have the same problem, become very unstable since upgrade to spring

@heifner
Copy link
Member

heifner commented Oct 1, 2024

Please DM me complete logs and config.ini. Would like to try and reproduce.

@the-smooth-operator
Copy link

we are having a similar issue. Nodes are struggling to process blocks, RPC API stops responding. Snapshots take a long time to process and, once processed, the node doesn't start syncing up blocks.

We have noticed a huge increase in data per block on Friday and again today.

What is the recommended configuration for running a node that can keep up? Looking for advice on modifications to config.ini like number of net/http threads etc...

@the-smooth-operator
Copy link

node got corrupted and need rerun the import flow

this corruption seems to happen because the process is so busy that it doesn't catch SIGTERM so it can't finish gracefully

@heifner
Copy link
Member

heifner commented Oct 1, 2024

It will respond to SIGTERM, it just takes awhile. When I tried stopping my syncing node it took 9 minutes to respond to SIGTERM.

@the-smooth-operator
Copy link

@heifner could you share a config.ini for a high-performing (i.e z1d.6xlarge) RPC node?

@heifner
Copy link
Member

heifner commented Oct 1, 2024

@heifner could you share a config.ini for a high-performing (i.e z1d.6xlarge) RPC node?

If not also being used as a P2P relay node for transactions:
p2p-accept-transactions=false

Since you have plenty of RAM you can run with:
chain-state-db-size-mb = 180000
database-map-mode = heap
The startup/shutdown is slower, but runs faster. You could also use:
database-map-mode = mapped_private
which provide fast startup, slower shutdown than default but does not use disk IO like default.

You also have plenty of cores, so:
net-threads = 6
chain-threads = 6

Keep your peer list small, just a small number of reliable peers.

If you are not processing read-only-transactions and not running a public node:
read-only-threads = 0
Although, I don't think this makes much if any difference.

If not running a public node, and not using:
If you are not using /v1/chain/get_block or /v1/chain/get_block_info or /v1/chain/get_block_header_state or /v1/chain/get_raw_block or /v1/chain/get_block_header then you can run without a block log:
block-log-retain-blocks = 0

@jun0tpyrc
Copy link
Author

jun0tpyrc commented Oct 2, 2024

We examined p2p-accept-transactions=false is the killer config here to fix p2p issue during initial import of recent snapshot - smooth until reaching chain tip

The rpc-thread timeout may still occur after importing multiple day's data , haven't examined completely for catching-up with an older snapshot, which i need to do for eliminating data gap - a much minor issue then

It will respond to SIGTERM, it just takes awhile. When I tried stopping my syncing node it took 9 minutes to respond to SIGTERM.

Wonder if this graceful-stop time could be improved (thread priority/race ?!) , in old version even in moderate config machine it just needs within 1 or at most 2 minute i recall

@heifner
Copy link
Member

heifner commented Oct 2, 2024

Wonder if this graceful-stop time could be improved (thread priority/race ?!) , in old version even in moderate config machine it just needs within 1 or at most 2 minute i recall

Yes, we have a plan to fix it: #284

@bhazzard bhazzard added 👍 lgtm and removed triage labels Nov 7, 2024
@heifner
Copy link
Member

heifner commented Nov 7, 2024

Fixed by #284 and released in 1.0.3.

@heifner heifner closed this as completed Nov 7, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in Team Backlog Nov 7, 2024
@arhag arhag linked a pull request Nov 7, 2024 that will close this issue
@heifner heifner self-assigned this Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
👍 lgtm more-info More information requested from issue reporter
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

9 participants