
[BUG]: Memory issue in version 1.0.0? #764

Closed
GoldenGoldy opened this issue Dec 4, 2024 · 19 comments
Labels: bug, priority: high

@GoldenGoldy commented Dec 4, 2024

What happened?

I first reported "juliacall.JuliaError: TaskFailedException" errors in #759. Having done further tests, I strongly suspect those were actually two separate issues. The domain error that occurred with sin and other "unsafe" functions was, I believe, indeed solved by the fix applied for that ticket.

However, I keep getting crashes. I tested without sin, and it still crashed. Furthermore, I applied the fix from #759 and made sure Julia recompiled the relevant package, but the crashes continued after that.

Looking further into the log files, when scrolling up a bit from the stacktrace, I get:

run: error: hpcslurm-computenodeset-1: task 13: Out Of Memory
Worker 15 terminated.

or similar; the task number and worker number differ for each crash.

And the julia-xxx-xxxxxx-0000.out log says:

slurmstepd: error: Detected 1 oom_kill event in StepId=10.0. Some of the step tasks have been OOM Killed.
slurmstepd: error: *** STEP 10.0 ON hpcslurm-computenodeset-1 FAILED (non-zero exit code or other failure mode) ***
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused

Could it be that v1.0.0 is more prone to such memory issues? I never had issues like this before the upgrade.

The crashes always occur in roughly the same timeframe, which would be consistent with a memory issue, since one would expect memory to "boil over" after roughly the same amount of time. In my case, that is each time somewhere between 8 and 11 hours, on a VM with 240 GB of RAM. That used to be more than enough; if anything, before the upgrade to v1.0.0 memory usage was usually very low, and I was planning to switch to VMs with less RAM to avoid unnecessary costs.

Here is a memory usage graph of a run started this morning. It can be seen that memory usage climbs quite steeply. And while there seems to be some garbage collection or other cleanup process, it doesn't make much of a dent, and memory usage then continues to climb. Note that the steeply climbing line is the usage space (applications), while the flat lines are kernel and disk data.
PySR_memory_usage

And when looking at which processes consume the memory, the top users are all Julia workers. See the screenshot below, where the heap size is also visible.
PySR_memory_procs

Might this memory issue be due to changes in v1.0.0?
And/or is there an easy fix, such as assigning a different amount of memory to the processes or somehow encouraging more aggressive garbage collection?

I saw something similar in #490 but it's my understanding that was fixed.

I tried the "heap_size_hint_in_bytes" parameter, but it does not seem to solve the issue; see the comment with the screenshot added to this ticket.

Version

v1.0.0

Operating System

Linux

Package Manager

pip

Interface

Script (i.e., python my_script.py)

Relevant log output

run: error: hpcslurm-computenodeset-1: task 13: Out Of Memory
Worker 15 terminated.


slurmstepd: error: Detected 1 oom_kill event in StepId=10.0. Some of the step tasks have been OOM Killed.
slurmstepd: error: *** STEP 10.0 ON hpcslurm-computenodeset-1 FAILED (non-zero exit code or other failure mode) ***
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused

Extra Info

I was running in distributed mode (cluster_manager='slurm'), with 30 CPU cores. The dataset has around 2500 records, but only two features. It's unfortunately not possible to share the full Python script I'm using, but here are the main parameters used when calling PySRRegressor:

niterations=10000000,
binary_operators=["+", "-", "*", "/"],
unary_operators=["exp", "sin", "square", "cube", "sqrt"],
procs=30, populations=450,
cluster_manager='slurm',
ncycles_per_iteration=20000,
batching=False,
weight_optimize=0.35,
parsimony=1,
adaptive_parsimony_scaling=1000,
maxsize=35,
parallelism='multiprocessing',
bumper=False
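
Put together, the call looks roughly like the following sketch (X and y here are just placeholders with the same shape as my real data, which I can't share):

import numpy as np
from pysr import PySRRegressor

# Placeholder data: ~2500 rows, 2 features, like the real dataset
X = np.random.randn(2500, 2)
y = np.random.randn(2500)

model = PySRRegressor(
    niterations=10000000,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["exp", "sin", "square", "cube", "sqrt"],
    procs=30,
    populations=450,
    cluster_manager="slurm",
    ncycles_per_iteration=20000,
    batching=False,
    weight_optimize=0.35,
    parsimony=1,
    adaptive_parsimony_scaling=1000,
    maxsize=35,
    parallelism="multiprocessing",
    bumper=False,
)
model.fit(X, y)
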
GoldenGoldy added the bug label on Dec 4, 2024
@GoldenGoldy (Author)

Looking at it again, I suppose I should give the "heap_size_hint_in_bytes" parameter a try, which I will do now. Still, it's strange that this was never necessary before. I'll see if it fixes the issue!

@GoldenGoldy (Author)

I tried heap_size_hint_in_bytes=1500000000 (1.5 GB), but so far that doesn't help; see the screenshot below. Below the graph you can see that the requested heap size is reflected for each of the processes; however, the lines in the graph make it clear that memory usage for each process grows well beyond that limit, so heap_size_hint_in_bytes=1500000000 doesn't actually change anything.
Any thoughts?

PySR_memory_usage_heap_size

@MilesCranmer (Owner)

Can you share your PySRRegressor settings and script? The more details the better.

@MilesCranmer (Owner)

Oh, sorry, I just saw it, and the fact that you can't share more details.

In that case, can you describe your dataset size and maybe any other details of the system?

Are those all the PySR parameters, or do you have other ones, like the logger?

@MilesCranmer (Owner) commented Dec 4, 2024

Can you also:

  1. See which parameters are most strongly correlated with this memory leakage?
  2. Does it occur with parallelism="multithreading"? And parallelism="serial"?
  3. This figure looks good:

image

for a heap size hint of 1.5 GiB, it seems like Julia is correctly doing aggressive garbage collection when it gets close to that limit. (Edit: Or, wait, it looks like 8 GiB in this screenshot?)

So, perhaps you could try other heap size hints? Maybe like 150 MiB?
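
For example, since the hint is given in bytes, something like this (just a sketch, with your other settings left unchanged):

from pysr import PySRRegressor

model = PySRRegressor(
    heap_size_hint_in_bytes=150 * 1024**2,  # 150 MiB; could also compare 500 MiB and 1.5 GiB
    # ... plus your other settings, unchanged
)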

These are just to help get me more diagnostics on the issue.

Thanks!

@MilesCranmer (Owner)

Other ideas:

  • Can you try forcing Julia 1.10 instead of Julia 1.11? I have noticed some garbage collection issues on the latest Julia, which I wonder might be related. You can do this with juliapkg:
import juliapkg
juliapkg.require_julia("~1.10")

# THEN, import pysr:
import pysr

This will modify the version constraint to one that is compatible with both PySR and your new requirement of [1.10.0, 1.11).
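
If you want to double-check which Julia actually gets launched, something like this should print the resolved version (a sketch; PySR 1.x runs on juliacall, so its Main module is available once pysr is imported):

import juliapkg
juliapkg.require_julia("~1.10")

import pysr  # initializes Julia via juliacall
from juliacall import Main as jl

print(jl.seval("VERSION"))  # should report a 1.10.x version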

@GoldenGoldy (Author)

(quoting MilesCranmer's questions above about correlated parameters, other parallelism settings, and heap size hints)

Thanks for the feedback!
I haven't yet tried whether it also occurs with other parallelism settings, but I was planning to do that as a next step, to see whether the system becomes "usable" again with, for example, parallelism="multithreading". Let me try that, though I might first force Julia 1.10, as suggested in the other comment, to see if that fixes the issue.

The graph you referred to above, though, was not from the run where the heap size hint was set to 1.5 GiB; in fact, that was before I had set any heap size hint at all. See this comment:
#764 (comment)
for what happened when I did set the heap size hint, which wasn't much, it seems.

Also, I was thinking of setting bumper=True to see if that makes any difference.

@GoldenGoldy (Author)

And the only other parameter I'm currently passing is:
nested_constraints = {
    "sin": {"sin": 0},
    "square": {"square": 1, "cube": 1, "exp": 0, "sqrt": 1},
    "cube": {"square": 1, "cube": 1, "exp": 0, "sqrt": 1},
    "exp": {"square": 1, "cube": 1, "exp": 0, "sqrt": 1},
    "sqrt": {"square": 1, "cube": 1, "exp": 1, "sqrt": 1},
}

@GoldenGoldy (Author)

The dataset has around 2500 records and only two features in this case, so it's not exactly huge.

@MilesCranmer (Owner)

I think the garbage collector for Julia 1.11 is a bit buggy because they rewrote it to be parallel and likely haven't solved all the issues yet (I have seen some other issues with it, like JuliaLang/julia#56735).

So I am really curious to hear whether simply switching to Julia 1.10 is enough to fix it. If it is, I think I might try defaulting to 1.10 until the new Julia version is more stable.

We can likely make use of JuliaPy/pyjuliapkg#29 to have a "recommended" version.

@GoldenGoldy (Author)

Thanks for the additional insights, and indeed, changing the Julia version to 1.10 makes all the difference!

First, I tried staying on Julia 1.11 but with bumper=True. This seems to slow down the rate at which memory fills up; however, it is still happening, and the task is destined to crash at some point. And this is with heap_size_hint_in_bytes=1500000000 (1.5 GB):

Total memory usage
PySR_memory_usage_heap_size_Julia1_11_bumper

Memory usage per process
PySR_memory_usage_heap_size_Julia1_11_bumper_detail

Then I forced Julia 1.10. Initially, I did that in combination with a lower heap size hint, heap_size_hint_in_bytes=500000000, and the issues seemed to go away immediately:

Memory usage per process
PySR_memory_usage_heap_size_Julia1_10

Then I tried forcing Julia 1.10 again, but now without any heap_size_hint at all. And indeed, with Julia 1.10 it happily works, even though no heap size is specified at all:

Total memory usage
PySR_memory_usage_Julia1_10

Memory usage per process
PySR_memory_usage_Julia1_10_detail

So, with Julia 1.10, memory usage remains almost constant over a whole night.

@MilesCranmer (Owner)

Thanks, that is really interesting and useful. I raised an issue in Julia: JuliaLang/julia#56759 as it seems to be a core language issue rather than PySR specific.

In the meantime, if you have time, do you think you could try with parallelism="multithreading"? If that also has the memory leak, then I think I should just pin PySR to Julia 1.10 until they fix the bugs in 1.11.
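
That is, just swapping the parallelism setting (a minimal sketch, with the other settings left out):

from pysr import PySRRegressor

model = PySRRegressor(
    parallelism="multithreading",  # instead of parallelism="multiprocessing"
)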

@GoldenGoldy (Author)

Thanks!

Sure, let me try running with parallelism="multithreading", I'll post some details on that later today hopefully.

@GoldenGoldy (Author)

I wanted to do multiple additional tests but lacked the time. However, I did manage to do one test with parallelism="multithreading". Again no issues on Julia 1.10, while memory usage continues to grow on Julia 1.11.

In both cases I forced the Julia version, using:

import juliapkg
juliapkg.require_julia("~1.10")
from pysr import PySRRegressor

or

import juliapkg
juliapkg.require_julia("~1.11")
from pysr import PySRRegressor

Julia 1.10
PySR_memory_usage_force_Julia1_10_multithreading

Julia 1.11
PySR_memory_usage_force_Julia1_11_multithreading

@GoldenGoldy (Author)

One last observation I thought I'd add: the issue sometimes seems to take a bit longer to surface with multiprocessing. There may be a period during which memory growth is non-existent or very limited and the lines in the graph are more or less flat. See below for an example where this lasted about half an hour, after which the lines curl up and memory keeps growing out of control. I've also seen a case where the flat lines lasted a bit over an hour, after which the issue still occurred and memory usage suddenly started growing fast again. With multithreading the issue seems to occur faster and even more aggressively, at least in the tests I've done, but with multiprocessing the issue also seems to occur every time in the end. In all cases where I had the issue, it was with Julia 1.11.

PySR_memory_usage_force_Julia1_11_multiprocessing_detail

@MilesCranmer (Owner)

Thanks! It sounds like the Julia team is looking into this: JuliaLang/julia#56759. It seems to be a real bug in Julia as far as I can tell. (I'm considering whether to push a change that defaults PySR to Julia 1.10 until this is solved.)

@MilesCranmer (Owner)

In the meantime, if you are up for running some more experiments, I think the most useful signal would be to know which hyperparameter makes the memory blow up the fastest.

@GoldenGoldy (Author)

Ok, I ran a few more experiments. Because it would be very time-consuming to change each parameter and check the result each time, I thought I'd just run with the default parameters. So, the following is the result of running on the same dataset as before, but now with all parameters at their default values according to the PySR API documentation, except for setting niterations=10000000. And just to stress: this is using multithreading. Result:

PySR_memory_usage_force_Julia1_11_multithreading_default_par

The memory usage still increases very fast, perhaps slightly more slowly than before, although that might also be random variation. It doesn't look like any of the parameters that were changed from their previous values back to their defaults makes a huge difference.

Next, I wanted to exclude any impact from the particular data that I'm using, the script that I'm running, and so on. So I also tried running a script in Julia directly, removing Python and anything custom of mine entirely. I took the script you gave here:
JuliaLang/julia#56759

using SymbolicRegression
X = randn(Float32, 5, 10_000)
y = randn(Float32, 10_000)
options = SymbolicRegression.Options(binary_operators=[+, *, /, -], unary_operators=[cos, exp])
hall_of_fame = equation_search(X, y; niterations=1000000000, options=options)

and ran it for a short while, here is the result:
Julia_memory_usage_example_script

There is still an increase in memory usage but it is very modest now.

Then I amended the Julia script to bring the data dimensions more in line with the dataset I was using when I ran into the memory issue. Amended script:

using SymbolicRegression
X = randn(Float32, 2, 2500)
y = randn(Float32, 2500)
options = SymbolicRegression.Options(binary_operators=[+, *, /, -], unary_operators=[cos, exp])
hall_of_fame = equation_search(X, y; niterations=1000000000, options=options)

and ran that for a short while, now we see a completely different result:

Julia_memory_usage_example_script_changed_dim

That's a huge growth in memory usage.

I repeated the last two experiments just to be sure, in the same order:
Julia_memory_usage_example_script (2)

Julia_memory_usage_example_script_changed_dim (2)

Note that for all experiments running Julia directly, I used "--threads auto" to make sure the available threads are actually used (see the invocation below); without it, only one thread seemed to be used.
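
For reference, the invocation was roughly the following (the script file name here is just a placeholder):

julia --threads=auto memory_test.jl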

So, the data dimensions seem to play a huge role?
And maybe one key difference is simply how fast each script goes through the iterations: the second (amended) script reports an ETA (in days) that is roughly 20 times shorter than the unchanged example script.

Hopefully this helps with reproducing and fixing the issue?

@MilesCranmer (Owner) commented Dec 13, 2024

This was fixed in JuliaLang/julia#56801. The next Julia version should include this patch: JuliaLang/julia#56741. Thanks again for helping figure this out!

MilesCranmer pinned this issue Dec 19, 2024