
[BUG]: Memory issue in version 1.0.0? #764

Closed
GoldenGoldy opened this issue Dec 4, 2024 · 19 comments
Labels: bug, priority: high

@GoldenGoldy commented Dec 4, 2024

What happened?

I first reported "juliacall.JuliaError: TaskFailedException" errors in #759. Having done further tests, I strongly suspect those were actually two separate issues. The domain error that occurred with sin and other "unsafe" functions was, I believe, indeed solved by the fix applied for that ticket.

However, I keep getting crashes. I tested without sin, and it still crashed. Furthermore, I applied the fix from #759 and made sure Julia recompiled the relevant package, but the crashes continued after that.

Looking further into the log files, when scrolling up a bit from the stacktrace, I get:

run: error: hpcslurm-computenodeset-1: task 13: Out Of Memory
Worker 15 terminated.

or similar; the task number and worker number differ for each crash.

And the julia-xxx-xxxxxx-0000.out log says:

slurmstepd: error: Detected 1 oom_kill event in StepId=10.0. Some of the step tasks have been OOM Killed.
slurmstepd: error: *** STEP 10.0 ON hpcslurm-computenodeset-1 FAILED (non-zero exit code or other failure mode) ***
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused

Could it be that v1.0.0 is more prone to such memory issues? I never had issues like this before the upgrade.

The crashes always occur in roughly the same timeframe, which would be consistent with a memory issue, since one would expect memory to "boil over" after roughly the same amount of time. In my case, that is each time somewhere between 8 and 11 hours, on a VM with 240 GB of RAM. That used to be more than enough; if anything, before the upgrade to v1.0.0 memory usage was usually very low, and I was planning to switch to VMs with less RAM to avoid unnecessary costs.

Here is a memory usage graph of a run started this morning. It can be seen that memory usage climbs quite steeply. And while there seems to be some garbage collection or other cleanup process, it doesn't make much of a dent, and memory usage then continues to climb. Note that the steeply climbing line is the usage space (applications), while the flat lines are kernel and disk data.
PySR_memory_usage

And when looking at which processes consume the memory, the top users are all Julia workers. See the screenshot below, where the heap size is also visible.
PySR_memory_procs

Might this memory issue be due to changes in v1.0.0?
And/or is there an easy fix, such as assigning a different amount of memory to the processes or somehow encouraging more aggressive garbage collection?

I saw something similar in #490 but it's my understanding that was fixed.

I tried the "heap_size_hint_in_bytes" parameter, but it does not seem to solve the issue; see the comment with the screenshot added to this ticket.

Version

v1.0.0

Operating System

Linux

Package Manager

pip

Interface

Script (i.e., python my_script.py)

Relevant log output

run: error: hpcslurm-computenodeset-1: task 13: Out Of Memory
Worker 15 terminated.


slurmstepd: error: Detected 1 oom_kill event in StepId=10.0. Some of the step tasks have been OOM Killed.
slurmstepd: error: *** STEP 10.0 ON hpcslurm-computenodeset-1 FAILED (non-zero exit code or other failure mode) ***
slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Connection refused

Extra Info

I was running in distributed mode (cluster_manager='slurm'), with 30 CPU cores. The dataset has around 2500 records, but only two features. It's unfortunately not possible to share the full Python script I'm using, but here are the main parameters used when calling PySRRegressor:

niterations=10000000,
binary_operators=["+", "-", "*", "/"],
unary_operators=["exp", "sin", "square", "cube", "sqrt"],
procs=30, populations=450,
cluster_manager='slurm',
ncycles_per_iteration=20000,
batching=False,
weight_optimize=0.35,
parsimony=1,
adaptive_parsimony_scaling=1000,
maxsize=35,
parallelism='multiprocessing',
bumper=False
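
Put together, the call looks roughly like the following sketch (X and y here are just placeholders with the same shape as my real data, which I can't share):

import numpy as np
from pysr import PySRRegressor

# Placeholder data: ~2500 rows, 2 features, like the real dataset
X = np.random.randn(2500, 2)
y = np.random.randn(2500)

model = PySRRegressor(
    niterations=10000000,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["exp", "sin", "square", "cube", "sqrt"],
    procs=30,
    populations=450,
    cluster_manager="slurm",
    ncycles_per_iteration=20000,
    batching=False,
    weight_optimize=0.35,
    parsimony=1,
    adaptive_parsimony_scaling=1000,
    maxsize=35,
    parallelism="multiprocessing",
    bumper=False,
)
model.fit(X, y)
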
GoldenGoldy added the bug label on Dec 4, 2024
@GoldenGoldy (Author)

Looking at it again, I suppose I should give the "heap_size_hint_in_bytes" parameter a try, which I will do now. Still, it's strange that this was never necessary before. I'll see if it fixes the issue!

@GoldenGoldy (Author)

I tried heap_size_hint_in_bytes=1500000000 (1.5 GB), but so far that doesn't help; see the screenshot below. Below the graph you can see that the requested heap size is reflected for each of the processes; however, the lines in the graph make it clear that memory usage for each process grows well beyond that limit, so heap_size_hint_in_bytes=1500000000 doesn't actually change anything.
Any thoughts?

PySR_memory_usage_heap_size

@MilesCranmer (Owner)

Can you share your PySRRegressor settings and script? The more details the better.

@MilesCranmer (Owner)

Oh, sorry, I just saw it, and the fact that you can't share more details.

In that case, can you describe your dataset size and maybe any other details of the system?

Are those all the PySR parameters, or do you have other ones, like the logger?

@MilesCranmer (Owner) commented Dec 4, 2024

Can you also:

  1. See which parameters are most strongly correlated with this memory leakage?
  2. Does it occur with parallelism="multithreading"? And parallelism="serial"?
  3. This figure looks good:

image

for a heap size hint of 1.5 GiB, it seems like Julia is correctly doing aggressive garbage collection when it gets close to that limit. (Edit: Or, wait, it looks like 8 GiB in this screenshot?)

So, perhaps you could try other heap size hints? Maybe like 150 MiB?
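
For example, since the hint is given in bytes, something like this (just a sketch, with your other settings left unchanged):

from pysr import PySRRegressor

model = PySRRegressor(
    heap_size_hint_in_bytes=150 * 1024**2,  # 150 MiB; could also compare 500 MiB and 1.5 GiB
    # ... plus your other settings, unchanged
)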

These are just to help get me more diagnostics on the issue.

Thanks!

@MilesCranmer (Owner)

Other ideas:

  • Can you try forcing Julia 1.10 instead of Julia 1.11? I have noticed some garbage collection issues on the latest Julia, which I wonder might be related. You can do this with juliapkg:
import juliapkg
juliapkg.require_julia("~1.10")

# THEN, import pysr:
import pysr

This will modify the version constraint to one that is compatible with both PySR and your new requirement of [1.10.0, 1.11).
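
If you want to double-check which Julia actually gets launched, something like this should print the resolved version (a sketch; PySR 1.x runs on juliacall, so its Main module is available once pysr is imported):

import juliapkg
juliapkg.require_julia("~1.10")

import pysr  # initializes Julia via juliacall
from juliacall import Main as jl

print(jl.seval("VERSION"))  # should report a 1.10.x version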

@GoldenGoldy (Author)

(quoting MilesCranmer's questions above about correlated parameters, other parallelism settings, and heap size hints)

Thanks for the feedback!
I haven't yet tried whether it also occurs with other parallelism settings, but I was planning to do that as a next step, to see whether the system becomes "usable" again with, for example, parallelism="multithreading". Let me try that, though I might first force Julia 1.10, as suggested in the other comment, to see if that fixes the issue.

The graph you referred to above, though, was not from the run where the heap size hint was set to 1.5 GiB; in fact, that was before I had set any heap size hint at all. See this comment:
#764 (comment)
for what happened when I did set the heap size hint, which wasn't much, it seems.

Also, I was thinking of setting bumper=True to see if that makes any difference.

@GoldenGoldy (Author)

And the only other parameter I'm currently passing is:
nested_constraints = {
    "sin": {"sin": 0},
    "square": {"square": 1, "cube": 1, "exp": 0, "sqrt": 1},
    "cube": {"square": 1, "cube": 1, "exp": 0, "sqrt": 1},
    "exp": {"square": 1, "cube": 1, "exp": 0, "sqrt": 1},
    "sqrt": {"square": 1, "cube": 1, "exp": 1, "sqrt": 1},
}

@GoldenGoldy (Author)

The dataset has around 2500 records and only two features in this case, so it's not exactly huge.

@MilesCranmer (Owner)

I think the garbage collector for Julia 1.11 is a bit buggy because they rewrote it to be parallel and likely haven't solved all the issues yet (I have seen some other issues with it, like JuliaLang/julia#56735).

So I am really curious to hear whether simply switching to Julia 1.10 is enough to fix it. If it is, I think I might try defaulting to 1.10 until the new Julia version is more stable.

We can likely make use of JuliaPy/pyjuliapkg#29 to have a "recommended" version.

@GoldenGoldy (Author)

Thanks for the additional insights, and indeed, changing the Julia version to 1.10 makes all the difference!

First, I tried staying on Julia 1.11 but with bumper=True. This seems to slow down the rate at which memory fills up; however, it is still happening, and the task is destined to crash at some point. And this is with heap_size_hint_in_bytes=1500000000 (1.5 GB):

Total memory usage
PySR_memory_usage_heap_size_Julia1_11_bumper

Memory usage per process
PySR_memory_usage_heap_size_Julia1_11_bumper_detail

Then I forced Julia 1.10. Initially, I did that in combination with a lower heap size hint, heap_size_hint_in_bytes=500000000, and the issues seemed to go away immediately:

Memory usage per process
PySR_memory_usage_heap_size_Julia1_10

Then I tried forcing Julia 1.10 again, but now without any heap_size_hint at all. And indeed, with Julia 1.10 it happily works, even though no heap size is specified at all:

Total memory usage
PySR_memory_usage_Julia1_10

Memory usage per process
PySR_memory_usage_Julia1_10_detail

So, with Julia 1.10, memory usage remains almost constant over a whole night.

@MilesCranmer (Owner)

Thanks, that is really interesting and useful. I raised an issue in Julia: JuliaLang/julia#56759 as it seems to be a core language issue rather than PySR specific.

In the meantime, if you have time, do you think you could try with parallelism="multithreading"? If that also has the memory leak, then I think I should just pin PySR to Julia 1.10 until they fix the bugs in 1.11.
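
That is, just swapping the parallelism setting (a minimal sketch, with the other settings left out):

from pysr import PySRRegressor

model = PySRRegressor(
    parallelism="multithreading",  # instead of parallelism="multiprocessing"
)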

@GoldenGoldy (Author)

Thanks!

Sure, let me try running with parallelism="multithreading", I'll post some details on that later today hopefully.

@GoldenGoldy (Author)

I wanted to do multiple additional tests but lacked the time. However, I did manage to do one test with parallelism="multithreading". Again no issues on Julia 1.10, while memory usage continues to grow on Julia 1.11.

In both cases I forced the Julia version, using:

import juliapkg
juliapkg.require_julia("~1.10")
from pysr import PySRRegressor

or

import juliapkg
juliapkg.require_julia("~1.11")
from pysr import PySRRegressor

Julia 1.10
PySR_memory_usage_force_Julia1_10_multithreading

Julia 1.11
PySR_memory_usage_force_Julia1_11_multithreading

@GoldenGoldy (Author)

One last observation I thought I'd add: the issue sometimes seems to take a bit longer to surface with multiprocessing. There may be a period during which memory growth is non-existent or very limited and the lines in the graph are more or less flat. See below for an example where this lasted about half an hour, after which the lines curl up and memory keeps growing out of control. I've also seen a case where the flat lines lasted a bit over an hour, after which the issue still occurred and memory usage suddenly started growing fast again. With multithreading the issue seems to occur faster and even more aggressively, at least in the tests I've done, but with multiprocessing the issue also seems to occur every time in the end. In all cases where I had the issue, it was with Julia 1.11.

PySR_memory_usage_force_Julia1_11_multiprocessing_detail

@MilesCranmer (Owner)

Thanks! It sounds like the Julia team is looking into this: JuliaLang/julia#56759. It seems to be a real bug in Julia as far as I can tell. (I'm considering whether to push a change that defaults PySR to Julia 1.10 until this is solved.)

@MilesCranmer (Owner)

In the meantime, if you are up for running some more experiments, I think the most useful signal would be to know which hyperparameter makes the memory blow up the fastest.

@GoldenGoldy (Author)

Ok, I ran a few more experiments. Because it would be very time-consuming to change each parameter and check the result each time, I thought I'd just run with the default parameters. So, the following is the result of running on the same dataset as before, but now with all parameters at their default values according to the PySR API documentation, except for setting niterations=10000000. And just to stress: this is using multithreading. Result:

PySR_memory_usage_force_Julia1_11_multithreading_default_par

The memory usage still increases very fast, perhaps slightly more slowly than before, although that might also be random variation. It doesn't look like any of the parameters that were changed from their previous values back to their defaults makes a huge difference.

Next, I wanted to exclude any impact from the particular data that I'm using, the script that I'm running, and so on. So I also tried running a script in Julia directly, removing Python and anything custom of mine entirely. I took the script you gave here:
JuliaLang/julia#56759

using SymbolicRegression
X = randn(Float32, 5, 10_000)
y = randn(Float32, 10_000)
options = SymbolicRegression.Options(binary_operators=[+, *, /, -], unary_operators=[cos, exp])
hall_of_fame = equation_search(X, y; niterations=1000000000, options=options)

and ran it for a short while, here is the result:
Julia_memory_usage_example_script

There is still an increase in memory usage but it is very modest now.

Then I amended the Julia script to bring the data dimensions more in line with the dataset I was using when I ran into the memory issue. Amended script:

using SymbolicRegression
X = randn(Float32, 2, 2500)
y = randn(Float32, 2500)
options = SymbolicRegression.Options(binary_operators=[+, *, /, -], unary_operators=[cos, exp])
hall_of_fame = equation_search(X, y; niterations=1000000000, options=options)

and ran that for a short while, now we see a completely different result:

Julia_memory_usage_example_script_changed_dim

That's a huge growth in memory usage.

I repeated the last two experiments just to be sure, in the same order:
Julia_memory_usage_example_script (2)

Julia_memory_usage_example_script_changed_dim (2)

Note that for all experiments running Julia directly, I used "--threads auto" to make sure the available threads are actually used (see the invocation below); without it, only one thread seemed to be used.
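
For reference, the invocation was roughly the following (the script file name here is just a placeholder):

julia --threads=auto memory_test.jl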

So, the data dimensions seem to play a huge role?
And maybe one key difference is simply how fast each script goes through the iterations: the second (amended) script reports an ETA (in days) that is roughly 20 times shorter than the unchanged example script.

Hopefully this helps with reproducing and fixing the issue?

@MilesCranmer (Owner) commented Dec 13, 2024

This was fixed in JuliaLang/julia#56801. The next Julia version should include this patch: JuliaLang/julia#56741. Thanks again for helping figure this out!

MilesCranmer pinned this issue Dec 19, 2024