
Array Aggregates: push down prior to unnesting stage #295

Open
rpbouman opened this issue Dec 12, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@rpbouman
Owner

When we have both array aggregates and unnesting on the same array, we currently generate very inefficient SQL.
Example:

[screenshot omitted: huey query combining array aggregates and unnesting on the same "records" column]

This generates SQL like:

WITH "__huey_data_foundation" AS (
SELECT unnest( case len( coalesce( "__huey_data_foundation"."records", []) ) when 0 then [ NULL ] else "__huey_data_foundation"."records" end ) AS "records.unnest()"
,list_aggregate( "__huey_data_foundation"."records", 'COUNT' ) AS "records count"
,list_aggregate( "__huey_data_foundation"."records", 'ENTROPY' ) AS "records entropy"
FROM read_json_auto( 'C:\roland\projects\QuaQuery\files\BWActivityMarch2024.json' ) AS "__huey_data_foundation"
)
SELECT "cte1"."records.unnest()" AS "records.unnest() elements"
,"cte1"."records count"
,"cte1"."records entropy"
,COUNT(*) OVER () AS "COUNT(*) OVER ()"
FROM __huey_data_foundation AS "cte1"
GROUP BY "cte1"."records.unnest()"
,"cte1"."records count"
,"cte1"."records entropy"
ORDER BY "cte1"."records.unnest()" ASC NULLS FIRST
,"cte1"."records count" ASC NULLS FIRST
,"cte1"."records entropy" ASC NULLS FIRST
LIMIT 100 OFFSET 0

The problem is that duckdb appears to execute the aggregate for each element unnested from the array.

It would be smarter to calculate the array aggregates only once, and only then do the unnesting, simply duplicating the aggregation result for each unnested element.

Right now we have to write this to achieve that:

WITH "__huey_data_foundation" AS (
  SELECT "__huey_data_foundation"."records" AS "records"
  ,list_aggregate( "__huey_data_foundation"."records", 'COUNT' ) AS "records count"
  ,list_aggregate( "__huey_data_foundation"."records", 'ENTROPY' ) AS "records entropy"
  FROM read_json_auto( 'C:\roland\projects\QuaQuery\files\BWActivityMarch2024.json' ) AS "__huey_data_foundation"
), "unnesting_stage" AS (
  SELECT unnest( case len( coalesce( "cte1"."records", []) ) when 0 then [ NULL ] else "cte1"."records" end ) AS "records.unnest() elements"
  ,"cte1"."records count"
  ,"cte1"."records entropy"
  FROM __huey_data_foundation AS "cte1"
)
SELECT "unnesting_stage"."records.unnest() elements"
,"unnesting_stage"."records count"
,"unnesting_stage"."records entropy"
,COUNT(*) OVER () AS "COUNT(*) OVER ()"
FROM "unnesting_stage" AS "unnesting_stage"
GROUP BY "records.unnest() elements"
,"unnesting_stage"."records count"
,"unnesting_stage"."records entropy"
ORDER BY "records.unnest() elements" ASC NULLS FIRST
,"unnesting_stage"."records count" ASC NULLS FIRST
,"unnesting_stage"."records entropy" ASC NULLS FIRST
LIMIT 100 OFFSET 0

The difference is huge: the first query doesn't complete and runs out of memory, while the second one takes about as long as just unnesting the array (i.e. the cost of calculating the aggregates separately is negligible compared to the unnesting).
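For anyone who wants to try this without the original JSON file, here is a minimal sketch against generated data. The row count, list size, and all names ("repro_data", "aggregating_stage", etc.) are made up for illustration; only the shape of the two statements matters. The first statement mirrors the shape of the generated SQL, the second the shape of the workaround:

-- shape of the generated SQL: unnest() and list_aggregate() over the same list in one SELECT
WITH "repro_data" AS (
  -- 10,000 rows, each carrying a 1,000-element list
  SELECT range(1, 1001) AS "records"
  FROM range(10000) AS t(i)
)
SELECT unnest( "repro_data"."records" ) AS "records.unnest()"
,list_aggregate( "repro_data"."records", 'COUNT' ) AS "records count"
FROM "repro_data";

-- shape of the workaround: compute the list aggregate in its own CTE first, then unnest
WITH "repro_data" AS (
  SELECT range(1, 1001) AS "records"
  FROM range(10000) AS t(i)
), "aggregating_stage" AS (
  SELECT "repro_data"."records" AS "records"
  ,list_aggregate( "repro_data"."records", 'COUNT' ) AS "records count"
  FROM "repro_data"
)
SELECT unnest( "aggregating_stage"."records" ) AS "records.unnest() elements"
,"aggregating_stage"."records count"
FROM "aggregating_stage";

If the push-down works as intended, the second statement evaluates list_aggregate() once per source row rather than once per unnested element.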

@rpbouman added the enhancement (New feature or request) label on Dec 12, 2024
@rpbouman (Owner, Author) commented Dec 14, 2024

There's a duckdb issue for this: duckdb/duckdb#15341.
Let's hold off for a bit and see if it gets picked up by duckdb; otherwise we can apply our workaround.

If you're using huey and are affected by this, please leave a comment below to let me know. We can always decide to fast-track our workaround if necessary.
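To check whether a given duckdb version still shows the problem, one option is to profile the slow shape with EXPLAIN ANALYZE and look at the per-operator timings; a sketch, reusing the made-up "repro_data" data from the example above in the issue description:

EXPLAIN ANALYZE
WITH "repro_data" AS (
  -- 10,000 rows, each carrying a 1,000-element list
  SELECT range(1, 1001) AS "records"
  FROM range(10000) AS t(i)
)
SELECT unnest( "repro_data"."records" ) AS "records.unnest()"
,list_aggregate( "repro_data"."records", 'COUNT' ) AS "records count"
FROM "repro_data";

If list_aggregate() is still re-evaluated per unnested element, the operator that computes it should dominate the timings.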
