[BUG]: Memory issue in version 1.0.0? #764
Comments
Looking at it again, I suppose I should give the "heap_size_hint_in_bytes" parameter a try, which I will do now. It's still strange that this was never necessary before. I'll see if it fixes the issue!
I tried using the "heap_size_hint_in_bytes" parameter, but it does not seem to solve the issue.
Can you share your PySRRegressor settings and script? The more details the better.
Oh sorry, I just saw your reply and the fact that you can't share more details. In that case, can you explain your dataset size, and maybe any other details of the system? Are those all the PySR parameters, or do you have other ones, like the logger?
Can you also:
For a heap size hint of 1.5 GiB, it seems like Julia is correctly doing aggressive garbage collection when it gets close to that limit. (Edit: or, wait, it looks like 8 GiB in this screenshot?) So perhaps you could try other heap size hints, maybe something like 150 MiB? These are just to help get me more diagnostics on the issue. Thanks!
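For reference, here is a minimal sketch of how such a hint could be passed from the Python side; the 150 MiB value is only the illustrative figure suggested above, not a recommendation:

    from pysr import PySRRegressor

    # Ask each Julia worker to aim for a ~150 MiB heap, so that garbage
    # collection becomes more aggressive near that limit.
    model = PySRRegressor(
        heap_size_hint_in_bytes=150 * 1024**2,  # 150 MiB
    )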
Other ideas:
import juliapkg
juliapkg.require_julia("~1.10")
# THEN, import pysr:
import pysr

This will modify the version constraint to one that is compatible with both PySR and your new requirement of [1.10.0, 1.11).
Thanks for the feedback! The graph you referred to above, though, was not from the situation where the heap size hint was set to 1.5 GiB; in fact, that was from when I hadn't set any heap size hint at all yet. See my earlier comment above. Also, I was thinking of trying a lower heap size hint next.
And the only other parameter I'm currently passing is:
The dataset has around 2500 records and only two features in this case, so it's not exactly huge.
I think the garbage collector in Julia 1.11 is a bit buggy: they rewrote it to be parallel and likely haven't solved all the issues yet (I have seen some other issues with it, like JuliaLang/julia#56735). So I am really curious to hear whether simply switching to Julia 1.10 is enough to fix it. If that is true, I think I might default to 1.10 in the future until the new Julia version is more stable. We can likely make use of JuliaPy/pyjuliapkg#29 to have a "recommended" version.
Thanks for the additional insights, and indeed changing the Julia version to 1.10 makes all the difference!

First, I tried again with Julia 1.11, but with bumper=True. This seems to slow down the rate at which the memory fills up; however, it is still happening, and the task is destined to crash at some point.

Then, I forced Julia 1.10. Initially, I did that in combination with a lower value for the heap size hint.

Then, I tried it again, still forcing Julia 1.10, but now without any heap_size_hint at all. And indeed, with Julia 1.10 it happily works, even though no heap size is specified at all. So, with Julia 1.10, the memory usage remains almost constant during a whole night.
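As a quick sketch of the two things tried in this comment (enabling the bumper option, and pinning Julia to 1.10 via juliapkg before importing PySR, as shown earlier in the thread); other settings are omitted:

    import juliapkg

    # Pin the Julia runtime to 1.10.x before PySR initializes Julia.
    juliapkg.require_julia("~1.10")

    from pysr import PySRRegressor  # import only after setting the constraint

    # bumper=True was the Julia 1.11 workaround attempt; it only slowed the
    # memory growth down, whereas Julia 1.10 avoided the issue entirely.
    model = PySRRegressor(bumper=True)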
Thanks, that is really interesting and useful. I raised an issue in Julia, JuliaLang/julia#56759, as it seems to be a core language issue rather than PySR-specific. In the meantime, if you have time, do you think you can try a couple of other configurations?
Thanks! Sure, let me try running with those.
I wanted to do multiple additional tests but lacked the time. However, I did manage to do one more test. In both cases I forced the Julia version using the juliapkg.require_julia approach described above.
One last observation I thought I'd add: the issue sometimes seems to take a bit longer to surface with multiprocessing. There may be a period where memory usage growth is non-existent or very limited and the lines in the graph are more or less flat. See below an example where this lasted about half an hour, after which the lines curl up and memory keeps growing out of control. I've also seen a case where the flat lines lasted for a bit over an hour, after which the issue still occurred and memory usage suddenly started growing fast again. With multithreading the issue seems to occur faster and more aggressively, at least in the tests that I've done, but with multiprocessing the issue also seems to occur every time in the end. In all cases where I had the issue, it was with Julia 1.11.
Thanks! It sounds like the Julia team is looking into this: JuliaLang/julia#56759. It seems to be a real bug in Julia as far as I can tell. (I'm considering whether to push a change that defaults PySR to Julia 1.10 until this is solved.)
In the meantime, if you are up for running some more experiments, I think the most useful signal would be knowing which hyperparameter makes the memory blow up the fastest.
Ok, I ran a few more experiments. Because it would be very time consuming to change each of the parameters and check the result each time, I thought I'd just run with the default parameters. So, the following is the result of running on the same dataset as before, but now with all the parameters at their default values according to the PySR API documentation, except for one setting. The memory usage still increases very fast, perhaps slightly slower than before, although that might also be random variation. It doesn't look like any of the parameters that were changed from their previous values to their defaults make a huge difference.

Next, I wanted to exclude any impact from the particular data that I'm using, the script that I'm running, and so on. So, I also tried running a script in Julia directly, completely removing Python and anything custom that I did. I took the script you gave here:
and ran it for a short while; here is the result: there is still an increase in memory usage, but it is very modest now.

Then, I amended the Julia script to bring the data dimensions more in line with the dataset I was using each time I ran into the memory issue. Amended script:
and ran that for a short while; now we see a completely different result: a huge growth in memory usage. I repeated the last two experiments, just to be sure, in the same order.

Note that for all experiments running Julia directly I used "--threads auto" to try and make sure the available threads are being used; if I didn't add this, it seemed that only one thread was being used.

So, the data dimensions seem to play a huge role? Hopefully this helps with reproducing and fixing the issue!
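The Julia scripts referred to above are not reproduced in this copy of the thread. As a rough Python-side stand-in, a stress test with the same data dimensions (about 2500 rows, two features) might look like the sketch below; all values are placeholders:

    import numpy as np
    from pysr import PySRRegressor

    # Synthetic data matching the reported dimensions: ~2500 rows, 2 features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2500, 2))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2

    # Long-running search with (mostly) default settings, intended only for
    # watching memory usage over time, not for finding a good model.
    model = PySRRegressor(niterations=10_000)
    model.fit(X, y)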
This was fixed in JuliaLang/julia#56801. The next Julia version should include this patch: JuliaLang/julia#56741. Thanks again for helping figure this out!
What happened?
I first reported "juliacall.JuliaError: TaskFailedException" errors in #759. Having done further tests, I strongly suspect those were actually two separate issues. The domain error that was occurring with sin or other "unsafe" functions was, I believe, indeed solved by the fix applied for that ticket.

However, I keep getting crashes. I tested without using sin, and it still crashed. Furthermore, I applied the fix from #759 and made sure Julia re-compiled the relevant package etc., but I still have crashes after that.

Looking further into the log files, when scrolling up a bit from the stacktrace, I get:
Or something similar, as the task number and worker number differ for each crash.
And the julia-xxx-xxxxxx-0000.out log says:
Can it be that v1.0.0 is more likely to have such memory issues? I never had issues like this before the upgrade.
The crashes always occur in roughly the same timeframe, which would be consistent with a memory issue, because one would expect it to "boil over" at roughly the same time. In my case, that is each time somewhere between 8 and 11 hours. And this is using a VM with 240 GB of RAM. This used to be more than enough; if anything, before the upgrade to v1.0.0 memory usage was usually very low, and I was planning to switch to VMs with less RAM to avoid unnecessary costs.
Here is a memory usage graph of a run started this morning. It can be seen that memory usage climbs quite steeply. And while there seems to be some garbage collection or other cleanup process, it doesn't make much of a dent, and then memory usage continues to climb. Note that the line climbing steeply is the one for user space (applications), while the ones that remain flat are kernel and disk data.
And when looking at which processes consume the memory, the top users are all Julia workers. See the screenshot below, where the heap size is also visible.
Might this memory issue be due to changes in v1.0.0?
And/or is there an easy fix, such as assigning a different amount of memory to the processes or somehow encouraging better garbage collection?
I saw something similar in #490 but it's my understanding that was fixed.
I tried using the "heap_size_hint_in_bytes" parameter, but it does not seem to solve the issue; see the comment with screenshot added to this ticket.
Version
v1.0.0
Operating System
Linux
Package Manager
pip
Interface
Script (i.e., python my_script.py)

Relevant log output
Extra Info
I was running in distributed mode (cluster_manager='slurm'), with 30 CPU cores. The dataset has around 2500 records, but only two features. It's unfortunately not possible to share the full Python script I'm using, but here are the main parameters used when calling PySRRegressor:
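The actual parameter listing did not survive in this copy of the issue. Purely as a hypothetical placeholder based on details mentioned elsewhere in the thread (Slurm, 30 cores, and the heap size hint tried later), the call may have looked roughly like:

    from pysr import PySRRegressor

    # Hypothetical reconstruction, NOT the author's actual settings.
    model = PySRRegressor(
        cluster_manager="slurm",  # distributed mode, per the Extra Info above
        procs=30,                 # 30 CPU cores mentioned above
        # heap_size_hint_in_bytes=...,  # added later while debugging (value not shared)
    )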