[release/2.5] ModuleTracker: Add explicit garbage collection #1661
base: release/2.5
When running an FSDP model with FlopCounterMode, we see a memory leak that comes from the ModuleTracker class. Even though ModuleTracker keeps only weak references to the operators, the tensors/operators are not freed after the backward pass. To force these tensors/operators to be freed after the backward pass, I added an explicit garbage collection call in the post-forward hook. (cherry picked from commit 63dc40d)
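For readers unfamiliar with why a weak reference alone may not be enough, here is a minimal standard-library sketch. It is not the code from this PR. It assumes the leaked objects are kept alive by a reference cycle, which the description does not confirm, and `Node`, `make_cycle`, and `post_forward_hook` are hypothetical names used only for illustration.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for a tensor/operator caught in a reference cycle."""

    def __init__(self):
        # Self-reference: reference counting alone can never free this object.
        self.me = self


def make_cycle():
    """Create a cyclic object and hand back only a weak reference to it."""
    node = Node()
    return weakref.ref(node)


def post_forward_hook():
    """Hypothetical hook body: run the cycle collector explicitly."""
    gc.collect()


gc.disable()  # keep the automatic collector from running mid-demo
try:
    ref = make_cycle()
    alive_before = ref() is not None  # the cycle still keeps the object alive
    post_forward_hook()
    alive_after = ref() is not None   # the explicit collection has freed it
finally:
    gc.enable()

print(alive_before, alive_after)
```

On CPython the weak reference stays live until the collector runs, so holding only weak references does not by itself guarantee prompt release. Calling `gc.collect()` on every hook invocation trades some runtime for that earlier release.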
We have not yet decided on cherry-picks into 2.5, so I want to hold off on merging this PR.
Jenkins build for 328b01c325248ec56b34544b0f7ef3be914a5bb5 commit finished as FAILURE
Jenkins build for 328b01c325248ec56b34544b0f7ef3be914a5bb5 commit finished as FAILURE Detected error during Pytorch building:
When running an FSDP model with FlopCounterMode, we see a memory leak that comes from the ModuleTracker class. Even though ModuleTracker keeps only weak references to the operators, the tensors/operators are not freed after the backward pass. To force these tensors/operators to be freed after the forward pass, I added an explicit garbage collection call in the post-forward hook.
(cherry picked from commit 63dc40d)
Fixes #ISSUE_NUMBER