When we have both array aggregates and unnesting on the same array, we currently generate very inefficient SQL.
Example:
This generates SQL like:
WITH "__huey_data_foundation" AS (
SELECT unnest( case len( coalesce( "__huey_data_foundation"."records", []) ) when 0 then [ NULL ] else "__huey_data_foundation"."records" end ) AS "records.unnest()"
,list_aggregate( "__huey_data_foundation"."records", 'COUNT' ) AS "records count"
,list_aggregate( "__huey_data_foundation"."records", 'ENTROPY' ) AS "records entropy"
FROM read_json_auto( 'C:\roland\projects\QuaQuery\files\BWActivityMarch2024.json' ) AS "__huey_data_foundation"
)
SELECT "cte1"."records.unnest()" AS "records.unnest() elements"
,"cte1"."records count"
,"cte1"."records entropy"
,COUNT(*) OVER () AS "COUNT(*) OVER ()"
FROM __huey_data_foundation AS "cte1"
GROUP BY "cte1"."records.unnest()"
,"cte1"."records count"
,"cte1"."records entropy"
ORDER BY "cte1"."records.unnest()" ASC NULLS FIRST
,"cte1"."records count" ASC NULLS FIRST
,"cte1"."records entropy" ASC NULLS FIRST
LIMIT 100 OFFSET 0
The problem is that DuckDB appears to execute the aggregate once for each element unnested from the array.
It would be smarter to calculate the array aggregates only once, and only then do the unnesting, merely duplicating the already-computed aggregate values across the unnested rows.
Right now we have to write the SQL like this to achieve that:
WITH "__huey_data_foundation" AS (
SELECT "__huey_data_foundation"."records" AS "records"
,list_aggregate( "__huey_data_foundation"."records", 'COUNT' ) AS "records count"
,list_aggregate( "__huey_data_foundation"."records", 'ENTROPY' ) AS "records entropy"
FROM read_json_auto( 'C:\roland\projects\QuaQuery\files\BWActivityMarch2024.json' ) AS "__huey_data_foundation"
), "unnesting_stage" AS (
SELECT unnest( case len( coalesce( "cte1"."records", []) ) when 0 then [ NULL ] else "cte1"."records" end ) AS "records.unnest() elements"
,"cte1"."records count"
,"cte1"."records entropy"
FROM __huey_data_foundation AS "cte1"
)
SELECT "unnesting_stage"."records.unnest() elements"
,"unnesting_stage"."records count"
,"unnesting_stage"."records entropy"
,COUNT(*) OVER () AS "COUNT(*) OVER ()"
FROM "unnesting_stage" AS "unnesting_stage"
GROUP BY "records.unnest() elements"
,"unnesting_stage"."records count"
,"unnesting_stage"."records entropy"
ORDER BY "records.unnest() elements" ASC NULLS FIRST
,"unnesting_stage"."records count" ASC NULLS FIRST
,"unnesting_stage"."records entropy" ASC NULLS FIRST
LIMIT 100 OFFSET 0
The difference is huge. The first query doesn't complete and runs out of memory; the second one takes about as long as only unnesting the array (i.e. the cost of calculating the aggregates separately is negligible compared to the unnesting).
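For reference, here is a minimal sketch of both query shapes against synthetic data rather than the original JSON file; the table name t, the row and list sizes, and the single 'COUNT' aggregate are made up purely for illustration:

-- synthetic data: 1000 rows, each holding a list of roughly 10000 elements
CREATE OR REPLACE TABLE t AS
SELECT range( 1, 10000 ) AS "records"
FROM range( 1, 1000 );

-- shape of the currently generated query: aggregate and unnest in the same SELECT
SELECT unnest( "records" ) AS "records.unnest()"
,list_aggregate( "records", 'COUNT' ) AS "records count"
FROM t;

-- shape of the workaround: aggregate first, unnest in a separate stage
WITH "agg" AS (
SELECT "records"
,list_aggregate( "records", 'COUNT' ) AS "records count"
FROM t
)
SELECT unnest( "agg"."records" ) AS "records.unnest() elements"
,"agg"."records count"
FROM "agg";

Running both shapes against data of roughly this size should make the difference observable; adjust the row and list counts as needed.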
There's a DuckDB issue for this: duckdb/duckdb#15341
Let's hold off for a bit and see whether this gets picked up by DuckDB; otherwise we can apply our workaround.
If you're using huey and are affected by this, please leave a comment below to let me know. We can always decide to fast-track our workaround if necessary.