
Array Aggregates: push down prior to unnesting stage #295

Open
rpbouman opened this issue Dec 12, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@rpbouman
Owner

When we have both array aggregates and unnesting on the same array, we currently generate very inefficient SQL.
Example:

[screenshot omitted: huey query combining array aggregates and unnesting on the same "records" column]

This generates SQL like:

WITH "__huey_data_foundation" AS (
SELECT unnest( case len( coalesce( "__huey_data_foundation"."records", []) ) when 0 then [ NULL ] else "__huey_data_foundation"."records" end ) AS "records.unnest()"
,list_aggregate( "__huey_data_foundation"."records", 'COUNT' ) AS "records count"
,list_aggregate( "__huey_data_foundation"."records", 'ENTROPY' ) AS "records entropy"
FROM read_json_auto( 'C:\roland\projects\QuaQuery\files\BWActivityMarch2024.json' ) AS "__huey_data_foundation"
)
SELECT "cte1"."records.unnest()" AS "records.unnest() elements"
,"cte1"."records count"
,"cte1"."records entropy"
,COUNT(*) OVER () AS "COUNT(*) OVER ()"
FROM __huey_data_foundation AS "cte1"
GROUP BY "cte1"."records.unnest()"
,"cte1"."records count"
,"cte1"."records entropy"
ORDER BY "cte1"."records.unnest()" ASC NULLS FIRST
,"cte1"."records count" ASC NULLS FIRST
,"cte1"."records entropy" ASC NULLS FIRST
LIMIT 100 OFFSET 0

The problem is that duckdb appears to execute the aggregate for each element unnested from the array.

It would be smarter to calculate the array aggregates only once, and only then do the unnesting, simply duplicating the aggregation result for each unnested element.

Right now we have to write this to achieve that:

WITH "__huey_data_foundation" AS (
  SELECT "__huey_data_foundation"."records" AS "records"
  ,list_aggregate( "__huey_data_foundation"."records", 'COUNT' ) AS "records count"
  ,list_aggregate( "__huey_data_foundation"."records", 'ENTROPY' ) AS "records entropy"
  FROM read_json_auto( 'C:\roland\projects\QuaQuery\files\BWActivityMarch2024.json' ) AS "__huey_data_foundation"
), "unnesting_stage" AS (
  SELECT unnest( case len( coalesce( "cte1"."records", []) ) when 0 then [ NULL ] else "cte1"."records" end ) AS "records.unnest() elements"
  ,"cte1"."records count"
  ,"cte1"."records entropy"
  FROM __huey_data_foundation AS "cte1"
)
SELECT "unnesting_stage"."records.unnest() elements"
,"unnesting_stage"."records count"
,"unnesting_stage"."records entropy"
,COUNT(*) OVER () AS "COUNT(*) OVER ()"
FROM "unnesting_stage" AS "unnesting_stage"
GROUP BY "records.unnest() elements"
,"unnesting_stage"."records count"
,"unnesting_stage"."records entropy"
ORDER BY "records.unnest() elements" ASC NULLS FIRST
,"unnesting_stage"."records count" ASC NULLS FIRST
,"unnesting_stage"."records entropy" ASC NULLS FIRST
LIMIT 100 OFFSET 0

The difference is huge: the first query doesn't complete and runs out of memory, while the second one takes about as long as just unnesting the array (i.e. the cost of calculating the aggregates separately is negligible compared to the unnesting).
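For anyone who wants to try this without the original JSON file, here is a minimal sketch against generated data. The row count, list size, and all names ("repro_data", "aggregating_stage", etc.) are made up for illustration; only the shape of the two statements matters. The first statement mirrors the shape of the generated SQL, the second the shape of the workaround:

-- shape of the generated SQL: unnest() and list_aggregate() over the same list in one SELECT
WITH "repro_data" AS (
  -- 10,000 rows, each carrying a 1,000-element list
  SELECT range(1, 1001) AS "records"
  FROM range(10000) AS t(i)
)
SELECT unnest( "repro_data"."records" ) AS "records.unnest()"
,list_aggregate( "repro_data"."records", 'COUNT' ) AS "records count"
FROM "repro_data";

-- shape of the workaround: compute the list aggregate in its own CTE first, then unnest
WITH "repro_data" AS (
  SELECT range(1, 1001) AS "records"
  FROM range(10000) AS t(i)
), "aggregating_stage" AS (
  SELECT "repro_data"."records" AS "records"
  ,list_aggregate( "repro_data"."records", 'COUNT' ) AS "records count"
  FROM "repro_data"
)
SELECT unnest( "aggregating_stage"."records" ) AS "records.unnest() elements"
,"aggregating_stage"."records count"
FROM "aggregating_stage";

If the push-down works as intended, the second statement evaluates list_aggregate() once per source row rather than once per unnested element.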

@rpbouman added the enhancement (New feature or request) label on Dec 12, 2024
@rpbouman (Owner, Author) commented Dec 14, 2024

There's a duckdb issue for this: duckdb/duckdb#15341.
Let's hold off for a bit and see if it gets picked up by duckdb; otherwise we can apply our workaround.

If you're using huey and are affected by this, please leave a comment below to let me know. We can always decide to fast-track our workaround if necessary.
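To check whether a given duckdb version still shows the problem, one option is to profile the slow shape with EXPLAIN ANALYZE and look at the per-operator timings; a sketch, reusing the made-up "repro_data" data from the example above in the issue description:

EXPLAIN ANALYZE
WITH "repro_data" AS (
  -- 10,000 rows, each carrying a 1,000-element list
  SELECT range(1, 1001) AS "records"
  FROM range(10000) AS t(i)
)
SELECT unnest( "repro_data"."records" ) AS "records.unnest()"
,list_aggregate( "repro_data"."records", 'COUNT' ) AS "records count"
FROM "repro_data";

If list_aggregate() is still re-evaluated per unnested element, the operator that computes it should dominate the timings.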
