SparkOutOfMemoryError happens when running CometColumnarExchange #886
Comments
While for other native comet operators, a dedicated GreedyMemoryPool sized
@Kontinuation @viirya I am trying to reproduce this issue now, but I am not sure if it is already resolved by #988?
I think so.
I think it is not. This is related to the Java implementation of comet columnar shuffle (
Oh, this is a separate issue.
I have not been able to reproduce this issue yet. I am using the same Comet commit and so far have tested on a single node cluster with these configs:
The query completes:
I am going to test on a two-node k8s cluster next.
I do wonder if the issue is related to specifying
I could not reproduce this issue in k8s either. Here is the spark-submit command that I used.
@Kontinuation Do you have any suggestions for how I can reproduce this issue?
I now see that I missed
I can reproduce this now.
Describe the bug
We easily run into this problem when running queries with spark.comet.exec.shuffle.mode=jvm. We've observed it not only on our own workloads but also on TPC-H benchmarks: the above-mentioned exception occurs when running TPC-H query 5 on Parquet files at scale factor 1000.
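For context, JVM (columnar) shuffle is typically enabled with a configuration along these lines. This is a representative sketch rather than our exact settings, and the key names should be checked against the Comet version in use:

```properties
# Load the Comet plugin and its shuffle manager
spark.plugins=org.apache.spark.CometPlugin
spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager
# Enable Comet execution and its shuffle, selecting the JVM (columnar) shuffle implementation
spark.comet.enabled=true
spark.comet.exec.enabled=true
spark.comet.exec.shuffle.enabled=true
spark.comet.exec.shuffle.mode=jvm
```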
When we disabled the Comet shuffle manager and fell back to Spark's own shuffle exchange, all TPC-H queries finished successfully.
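For comparison, that fallback amounts to turning Comet's shuffle off while keeping the rest of Comet enabled (again a sketch; verify the key against your Comet version):

```properties
# Disable Comet shuffle so exchanges go through Spark's built-in shuffle
spark.comet.exec.shuffle.enabled=false
# (optionally also drop the spark.shuffle.manager override entirely)
```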
Steps to reproduce
Run TPC-H query 5 on a Spark cluster. The detailed environment and Spark configurations are listed under Additional context; the standard form of the query is shown below for reference.
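This is the standard formulation of TPC-H query 5 (the substitution parameters below are the usual validation values, not necessarily the exact ones used in this run). Its six-way join produces large shuffles, which is where CometColumnarExchange comes into play:

```sql
SELECT
  n_name,
  SUM(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer, orders, lineitem, supplier, nation, region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_suppkey = s_suppkey
  AND c_nationkey = s_nationkey
  AND s_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'
  AND o_orderdate >= DATE '1994-01-01'
  AND o_orderdate < DATE '1995-01-01'
GROUP BY n_name
ORDER BY revenue DESC;
```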
Expected behavior
All TPC-H queries should finish successfully.
Additional context
The problem was observed on a self-deployed Kubernetes (k8s) Spark cluster on AWS.
Here are the relevant Spark configurations: