[Bug] Returned row count differs when batch_size differs #45247

DachuanXUAN · 2024-12-10T09:04:02Z

Search before asking

I had searched in the issues and found no similar issues.

Version

2.1.5

What's Wrong?

The SQL is a simple SELECT ... INTO OUTFILE export to S3:

SELECT file_id,
       cast(data_start_time as String) as data_start_time,
       cast(data_end_time as String) as data_end_time,
       device_type...
FROM t_table_name
WHERE data_start_time >= '2024-11-28 00:00:00.000'
  AND data_start_time < '2024-11-28 01:00:00.000'
ORDER BY file_id, data_start_time
INTO OUTFILE "s3://xxx/xxx/2024_11_28_00/part_v2_"
FORMAT AS PARQUET
PROPERTIES(
    "s3.endpoint" = "http://xxx.com/",
    "s3.access_key" = "xxx",
    "s3.secret_key" = "xxx",
    "s3.region" = "xxx",
    "max_file_size" = "120MB"
);

With batch_size set to 100,000:
set batch_size=100000;
+------------+-----------+-----------+------------------------------------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL |
+------------+-----------+-----------+------------------------------------------------------------------------------------------------+
| 5 | 14602938 | 671738439 | s3://xxx/2024_11_28_00/part_v2_66d16409bc2a4b37-9905759434b51248_* |
+------------+-----------+-----------+------------------------------------------------------------------------------------------------+

With batch_size set to the default value:
set batch_size=4096;
+------------+-----------+------------+------------------------------------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL |
+------------+-----------+------------+------------------------------------------------------------------------------------------------+
| 10 | 29803106 | 1316012670 | s3://xxx/2024_11_28_00/part_v2_2cef1e9dba2a4749-a9a688a90843fa53_* |
+------------+-----------+------------+------------------------------------------------------------------------------------------------+

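The row count returned with the default batch_size should be the correct one. With a larger batch_size, rows go missing. With an even larger batch_size, the SQL hangs and no cause is visible.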
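To make the gap between the two result tables concrete, here is a minimal Python sketch that computes how many rows the batch_size=100000 run is missing. It assumes, as this report does, that the batch_size=4096 count is the correct reference. The variable names are illustrative only.

```python
# Row counts copied from the two OUTFILE result tables in this report.
rows_default_batch = 29803106  # TotalRows with batch_size=4096
rows_large_batch = 14602938    # TotalRows with batch_size=100000

# Assumption: the default-batch run is the correct reference count.
missing_rows = rows_default_batch - rows_large_batch
missing_fraction = missing_rows / rows_default_batch

print(f"missing rows: {missing_rows}")
print(f"missing fraction: {missing_fraction:.2%}")
```

Roughly half of the rows are absent from the larger-batch export, which matches the file count dropping from 10 to 5.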

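The reason for setting batch_size is that, when exporting Parquet, I want to control the number of rows per block so that the Parquet files contain fewer blocks. Apart from setting batch_size, there is no other way to control this number.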
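The motivation can be illustrated with a rough estimate of how many blocks each setting would produce. This is only a sketch: it assumes that every batch becomes exactly one Parquet block, as the report implies, and it ignores the extra blocks created when output is split across files by max_file_size. The helper name `estimate_block_count` is hypothetical.

```python
def estimate_block_count(total_rows: int, rows_per_block: int) -> int:
    """Estimate the block count if each block holds at most rows_per_block rows."""
    if rows_per_block <= 0:
        raise ValueError("rows_per_block must be positive")
    # Integer ceiling division, avoiding floating-point rounding.
    return (total_rows + rows_per_block - 1) // rows_per_block


total_rows = 29803106  # TotalRows from the batch_size=4096 run in this report

for batch_size in (4096, 65535, 100000):
    blocks = estimate_block_count(total_rows, batch_size)
    print(f"batch_size={batch_size}: about {blocks} blocks")
```

Under these assumptions, raising batch_size from 4096 to 65535 would cut the block count by more than an order of magnitude, which is why a larger value is attractive for this export.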

What You Expected?

skip

How to Reproduce?

No response

Anything Else?

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

DachuanXUAN · 2024-12-10T09:05:04Z

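The threshold at which the problem appears is 65535: when batch_size is less than or equal to this value, the query result is correct; when it is greater, the result is wrong.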
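The reported threshold, 65535, equals 2^16 - 1, the largest value an unsigned 16-bit integer can hold. Whether that is related to the root cause is not established in this report. To reproduce the boundary, a minimal Python sketch like the following can generate the `set batch_size` statements for values on both sides of the threshold. The helper name `boundary_statements` is hypothetical.

```python
def boundary_statements(threshold: int, margin: int = 1) -> list[str]:
    """Return SET statements for batch sizes just below, at, and above the threshold."""
    sizes = range(threshold - margin, threshold + margin + 1)
    return [f"set batch_size={size};" for size in sizes]


threshold = 65535  # threshold reported in the comment above
assert threshold == 2**16 - 1  # largest unsigned 16-bit value

for statement in boundary_statements(threshold):
    print(statement)
```

Running the export once per generated statement and comparing TotalRows against a plain count of the same time range would confirm exactly where the result starts to diverge.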
