Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using warp sync on parachain triggers OOM #5053

Open
2 tasks done
Tracked by #56
Dinonard opened this issue Jul 17, 2024 · 23 comments
Open
2 tasks done
Tracked by #56

Using warp sync on parachain triggers OOM #5053

Dinonard opened this issue Jul 17, 2024 · 23 comments
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@Dinonard
Copy link
Contributor

Dinonard commented Jul 17, 2024

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

As the title suggest, using warp sync on parachain causes OOM, crashing the client.

We've had this problem on Astar for a few months, and have recently uplifted to polkadot-sdk version v1.9.0 but are still seeing the problem. There are no outstanding traces in the log, it just explodes at some point.

There's an issue opened in our repo, AstarNetwork/Astar#1110, with steps to reproduce as well as images of resource consumption before the crash.

We haven't been able to find similar issues or discussion related to the topic.

Steps to reproduce

Run latest Astar node as described in the linked issue.

@Dinonard Dinonard added I10-unconfirmed Issue might be valid, but it's not yet known. I2-bug The node fails to follow expected behavior. labels Jul 17, 2024
@bkchr
Copy link
Member

bkchr commented Jul 22, 2024

Hey,
This is basically a duplicate of: #4

The problem is that right now we keep the entire state in memory while downloading. For chains that have a big state (not sure what the state size of Astar is), this can lead to OOM on machines with not enough main memory.

@bkchr bkchr closed this as completed Jul 22, 2024
@Dinonard
Copy link
Contributor Author

Hey @bkchr,
I'm not sure this is the same as the issue you linked - the memory consumption suddenly spikes, it doesn't grow slowly over the time.

Please check the image here.
There are huge spikes that happen abruptly.

@bkchr
Copy link
Member

bkchr commented Jul 22, 2024

Hmm yeah. I still think it is related to having everything in memory.

Did you try to run the node with gdb? To get some stacktrace when it OOMs?

@Dinonard
Copy link
Contributor Author

We haven't but I'll ask our devops team to do that and I'll post the traces here.

@bkchr
Copy link
Member

bkchr commented Jul 22, 2024

Ty.

@bkchr bkchr reopened this Jul 22, 2024
@Dinonard
Copy link
Contributor Author

Does this help?
threads.log

We've used this command to generate it:
sudo gdb --batch --quiet -ex "thread apply all bt full" -p <...>

@bkchr
Copy link
Member

bkchr commented Jul 24, 2024

@Dinonard but this is not from the point when OOMs. You need to run the node with gdb all the time attached.

@Dinonard
Copy link
Contributor Author

TBH I checked the logs myself and couldn't see anything wrong, but AFAIK this was run immediately when the node started. Let me get back to you.

@Dinonard
Copy link
Contributor Author

I've run it on the same server as before with gdb now properly encapsulating the entire service.
And it's weird:
Thread 12 "tokio-runtime-w" received signal SIGPIPE, Broken pipe.

gdb_dump.log


Sorry about the missing symbols but I haven't used GDB with any Rust app before.
I figured that building with debug flag (-g) would include the debug info (and symbols?) into the binary itself but it doesn't seem to have helped.

@liuchengxu
Copy link
Contributor

Just to share my experiences on the state sync with a chain having a huge state. The subcoin node crashed the first time when importing the downloaded state at a certain height. And then I reran it with the same command syncing the state at the same height, with gdb attached. Unfortunately (or fortunately :P), it didn't crash with gdb. I observed that this successful state importing almost absorbed my entire memory (my machine has 128GiB of memory) at the peak.

2024-08-19 16:35:01.747  INFO tokio-runtime-worker sync: State sync is complete (5438 MiB), restarting block sync.

liuchengxu added a commit to subcoin-project/Grants-Program that referenced this issue Aug 22, 2024
In light of recent developments, it has become evident that fully syncing
to the tip of the Bitcoin network and enabling new nodes to perform fast
sync to the latest Bitcoin state is more challenging than initially anticipated,
caused by the huge state of UTXO set (over 12GiB). As a result, I propose
adjusting the delivery goal for this milestone.

The most significant known blocker is paritytech/polkadot-sdk#4.
Other underlying issues may also contribute to the difficulty. Recent experiments
have shown that fast sync from around block height 580,000 is currently infeasible,
succeeding only on machines with 128GiB of memory (paritytech/polkadot-sdk#5053 (comment)),
which is impractical for most users. Nevertheless, we have successfully demonstrated that
decentralized fast sync is possible within a prototype implementation.

While syncing to the Bitcoin network's tip remains a future target, addressing
the existing technical challenges will require substantial R&D efforts.
We remain committed to exploring potential solutions, including architectural
changes and contributing to resolving issue paritytech/polkadot-sdk#4,
Noc2 pushed a commit to w3f/Grants-Program that referenced this issue Aug 23, 2024
In light of recent developments, it has become evident that fully syncing
to the tip of the Bitcoin network and enabling new nodes to perform fast
sync to the latest Bitcoin state is more challenging than initially anticipated,
caused by the huge state of UTXO set (over 12GiB). As a result, I propose
adjusting the delivery goal for this milestone.

The most significant known blocker is paritytech/polkadot-sdk#4.
Other underlying issues may also contribute to the difficulty. Recent experiments
have shown that fast sync from around block height 580,000 is currently infeasible,
succeeding only on machines with 128GiB of memory (paritytech/polkadot-sdk#5053 (comment)),
which is impractical for most users. Nevertheless, we have successfully demonstrated that
decentralized fast sync is possible within a prototype implementation.

While syncing to the Bitcoin network's tip remains a future target, addressing
the existing technical challenges will require substantial R&D efforts.
We remain committed to exploring potential solutions, including architectural
changes and contributing to resolving issue paritytech/polkadot-sdk#4,
@liuchengxu
Copy link
Contributor

diff --git a/substrate/primitives/trie/src/lib.rs b/substrate/primitives/trie/src/lib.rs
index ef6b6a5743..e0a2cf3b30 100644
--- a/substrate/primitives/trie/src/lib.rs
+++ b/substrate/primitives/trie/src/lib.rs
@@ -296,23 +296,30 @@ where
        V: Borrow<[u8]>,
        DB: hash_db::HashDB<L::Hash, trie_db::DBValue>,
 {
-       {
+       // {
                let mut trie = TrieDBMutBuilder::<L>::from_existing(db, &mut root)
                        .with_optional_cache(cache)
                        .with_optional_recorder(recorder)
                        .build();

+               tracing::info!("====================== Collecting delta");
                let mut delta = delta.into_iter().collect::<Vec<_>>();
+               tracing::info!("====================== Finished Collecting delta: {}", delta.len());
                delta.sort_by(|l, r| l.0.borrow().cmp(r.0.borrow()));
+               tracing::info!("====================== Sorted delta");

-               for (key, change) in delta {
+               tracing::info!("====================== Starting to write trie, mem usage: {:.2?}GiB", memory_stats::memory_stats().map(|usage| usage.physical_mem as f64 / 1024.0 / 1024.0 / 1024.0));
+               for (index, (key, change)) in delta.into_iter().enumerate() {
                        match change.borrow() {
                                Some(val) => trie.insert(key.borrow(), val.borrow())?,
                                None => trie.remove(key.borrow())?,
                        };
                }
-       }
+               tracing::info!("====================== Finished writing delta to trie, mem usage: {:.2?}GiB", memory_stats::memory_stats().map(|usage| usage.physical_mem as f64 / 1024.0 / 1024.0 / 1024.0));
+               drop(trie);
+       // }

+       tracing::info!("====================== End of delta_trie_root, mem usage: {:.2?}GiB", memory_stats::memory_stats().map(|usage| usage.physical_mem as f64 / 1024.0 / 1024.0 / 1024.0));
        Ok(root)
 }
2024-08-29 10:52:53 ====================== Collecting delta
2024-08-29 10:52:55 ====================== Finished Collecting delta: 104674943
2024-08-29 10:52:56 ====================== Sorted delta
2024-08-29 10:52:57 ====================== Starting to write trie, mem usage: Some(26.23)GiB
2024-08-29 10:54:09 ====================== Finished writing delta to trie, mem usage: Some(76.36)GiB
2024-08-29 11:00:28 ====================== End of delta_trie_root, mem usage: Some(90.53)GiB

I added some logging for the memory usage in the block import pipeline. It turned out that https://github.com/subcoin-project/polkadot-sdk/blob/13ca1b64692b05b699f49e729d0522ed4be730b9/substrate/primitives/trie/src/lib.rs#L285 is the culprit. The memory usage surged from 26 GiB to 76 GiB after the trie was built. Importing the same state does not always succeed, if it crashes due to OOM, Finished writing delta to trie... would be printed and then it crashes, End of delta_trie_root won't be printed. Any clue for the further investigation and potential fix? @bkchr

@liuchengxu
Copy link
Contributor

Constructing the entire trie from the state at a specific height in memory seems to be the primary cause of the OOM issue. This is a critical design flaw, in my opinion, especially since the chain state will continue to grow over time. While Polkadot may be fine for now, it will inevitably face the same problem in the long run if we don't address this. Please prioritize pushing this issue forward. @bkchr

@bkchr
Copy link
Member

bkchr commented Sep 3, 2024

Constructing the entire trie from the state at a specific height in memory seems to be the primary cause of the OOM issue.

Yeah sounds reasonable. I need to think about on how to improve this, but yeah this should not happen ;)

@liuchengxu
Copy link
Contributor

Hey @bkchr, I understand this is a non-trivial issue, but I wanted to highlight that it’s a critical blocker for the Subcoin fast sync feature. I'm eager to collaborate closely with the Parity team to help push this forward. Let me know how I can contribute!

@bkchr
Copy link
Member

bkchr commented Sep 24, 2024

Yeah that would be nice!

I looked briefly into it. I still want to solve this with #4 together. I have the following rough plan:

  1. Support updating the state of a block. This means instead of importing the entire state with reset_storage in db/lib.rs we are supporting to add new keys to the same state and recalculate the state root. Currently the state root etc is build here. Instead of doing it there, we should forward the key/value to the db layer and then we insert them there to the trie. Updating the trie on the "fly".
  2. Instead of collecting the intermediate key/value pairs in the state sync handler in memory, we send them to the db directly. This way we should solve both problems.
  3. When state sync is finished, we import the header and ensure that the state root matches.

@liuchengxu Do you think that you could start looking into this? I think starting with the db part should be doable in parallel and can be its own pr.

@liuchengxu
Copy link
Contributor

@bkchr This makes sense to me, I'll look into the part of updating the state directly using the new keys.

@liuchengxu
Copy link
Contributor

liuchengxu commented Nov 27, 2024

With #5956, importing a state of 5 GB into the node on a machine with 32 GB RAM from my local testing is possible. @Dinonard It would be very helpful if you could also try it out with Astar.

@Dinonard
Copy link
Contributor Author

@liuchengxu I can get back to you, but it probably cannot be very soon. We're on stable2407, so adding this might require a bit more time to integrate.

@liuchengxu
Copy link
Contributor

@Dinonard #5956 is relatively small if the tests and formatting changes are excluded. If you have time, the quickest way to test it with Astar is probably to manually pick #5956 into a polkadot-sdk fork based on stable2407 and compile an Astar node with that fork, which shouldn't take a lot of time.

@Dinonard
Copy link
Contributor Author

I cannot cherry-pick a PR 🙂. I just manually added your code but it starts falling apart around different places (with stable2407). Will try to find time to debug in the coming days.

@liuchengxu
Copy link
Contributor

Hmm, the code changes in #5956 should be pretty isolated and are not supposed to have any impact on other functions. Just in case, I have also manually applied #5956 into this branch which is based on the latest stable2407, please try it with Astar. @Dinonard

@Dinonard
Copy link
Contributor Author

@liuchengxu we've tested your fix, the node didn't crash!
Great work! 💪

I'm attaching HW usage report too (took around 15 minutes for the sync).
The RAM usage trend was upwards, not sure how it would have fared in case of a longer chain.
Screenshot 2024-11-29 at 14 33 45

@liuchengxu
Copy link
Contributor

@Dinonard Thanks for the test! This is not the final version. As the entire state is still loaded in memory, there is room for further improvement. We'll see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.
Projects
Status: backlog
Development

No branches or pull requests

3 participants