
[Bug] Different batch_size values return different row counts #45247

Open
3 tasks done
DachuanXUAN opened this issue Dec 10, 2024 · 1 comment

Comments

@DachuanXUAN

Search before asking

  • I had searched in the issues and found no similar issues.

Version

2.1.5

What's Wrong?

The SQL is a simple SELECT ... INTO OUTFILE export to S3:

SELECT file_id,
       cast(data_start_time as String) as data_start_time,
       cast(data_end_time as String) as data_end_time,
       device_type...
FROM t_table_name
WHERE data_start_time >= '2024-11-28 00:00:00.000'
  AND data_start_time < '2024-11-28 01:00:00.000'
ORDER BY file_id, data_start_time
INTO OUTFILE "s3://xxx/xxx/2024_11_28_00/part_v2_"
FORMAT AS PARQUET
PROPERTIES(
    "s3.endpoint" = "http://xxx.com/",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "xxx",
    "max_file_size" = "120MB"
);

With batch_size set to 100,000:
set batch_size=100000;
+------------+-----------+-----------+------------------------------------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL |
+------------+-----------+-----------+------------------------------------------------------------------------------------------------+
| 5 | 14602938 | 671738439 | s3://xxx/2024_11_28_00/part_v2_66d16409bc2a4b37-9905759434b51248_* |
+------------+-----------+-----------+------------------------------------------------------------------------------------------------+

With batch_size set to the default value:
set batch_size=4096;
+------------+-----------+------------+------------------------------------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL |
+------------+-----------+------------+------------------------------------------------------------------------------------------------+
| 10 | 29803106 | 1316012670 | s3://xxx/2024_11_28_00/part_v2_2cef1e9dba2a4749-a9a688a90843fa53_* |
+------------+-----------+------------+------------------------------------------------------------------------------------------------+

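The row count with the default batch_size should be the correct one. With a larger batch_size, rows go missing (14,602,938 instead of 29,803,106 above). With an even larger batch_size, the SQL hangs and I cannot tell why.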
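
One way to confirm which TotalRows value is correct would be to count the matching rows directly. This is a minimal sketch, assuming the same table name and time filter as the export statement above; it has not been run as part of this report.

```sql
-- Count the rows in the same time window as the export.
-- If the export is complete, TotalRows should equal this count.
SELECT count(*)
FROM t_table_name
WHERE data_start_time >= '2024-11-28 00:00:00.000'
  AND data_start_time < '2024-11-28 01:00:00.000';
```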

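The reason for setting batch_size is that, when exporting Parquet, I want to control the number of rows per block and so reduce the number of blocks in each Parquet file. As far as I can tell, setting batch_size is the only way to control this.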

What You Expected?

skip

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@DachuanXUAN
Author

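The threshold where the problem appears is 65535: when batch_size is less than or equal to this value the query result is correct, and when it is greater the result is wrong.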
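
A possible way to reproduce the threshold, sketched under the assumption that the export statement from the issue body is re-run unchanged after each SET (the comments mark where it would go):

```sql
-- At the threshold: reported to return the correct TotalRows.
SET batch_size = 65535;
-- run the same SELECT ... INTO OUTFILE statement here and note TotalRows

-- One above the threshold: reported to return a wrong TotalRows.
SET batch_size = 65536;
-- run the same statement again and compare TotalRows with the first run
```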
