-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
how to handle staging models that depend on multiple sources in fct_staging_directories #111
Comments
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days. |
Just to give more context on this... In
Curious what should / could be a solution here? |
A possible solution I came up with would be to create another CTE that identifies staging models that reference more than one source, and then do a
Would your expectation be that the current suggested path still appears for the staging model (in SQL --Add this to get all the stage models with count of parent source. Next step will be to count which ones have more than one source parent
staging_by_parent_source_count as (
select
count(distinct parent_source_name) as resource_count
,child as resource_name
,child_resource_type as resource_type
,child_model_type as model_type
,child_file_path as current_file_path
from staging_models
group by child, child_resource_type, child_model_type, child_file_path
),
--Added this CTE to listagg() the multuple suggested paths and advise the user to split the source file into those two places.
multiple_sources_staging_to_split as (
select
staging_models.child as resource_name,
staging_models.child_resource_type as resource_type,
staging_models.child_model_type as model_type,
staging_models.child_file_path as current_file_path,
'More than one source. Split into separate staging models in: ' ||
listagg('models/' || 'staging' || '/' || staging_models.parent_source_name || '/', ' AND ')
within group(order by 'models/' || 'staging' || '/' || staging_models.parent_source_name || '/') as change_file_path_to
from staging_models
join staging_by_parent_source_count on staging_models.child = staging_by_parent_source_count.resource_name and
staging_models.child_resource_type = staging_by_parent_source_count.resource_type and
staging_models.child_model_type = staging_by_parent_source_count.model_type and
staging_models.child_file_path = staging_by_parent_source_count.current_file_path
where staging_by_parent_source_count.resource_count > 1
group by staging_models.child, staging_models.child_resource_type, staging_models.child_model_type, staging_models.child_file_path
),
unioned as (
select * from inappropriate_subdirectories_staging
union all
select * from innappropriate_subdirectories_non_staging_models
union all --Added union all to append these results
select * from multiple_sources_staging_to_split
)
select * from unioned; I also found a small typo in the comments at the top of the model's query: Where the second comment line says, "should be in nested in" there's an extra "in". -- For staging models: The files should be in nested in the staging folder in a subfolder that matches their source parent's name. |
@graciegoheen I found this issue last year and had put down my thoughts, but since I focused on the too_many_joins change first I hadn't come back to this until now. This is another issue I would like to take on. I've edited my original notes for clarity, and I want to get your input on how you would like to see the error messages. |
Hey! Thanks for being down to tackle this one.
I like this suggestion. Here's what I'm thinking... Scenario 1
we should only flag that Scenario 2
we should flag that What do you think? Any scenarios I missed? |
Happy to help! Yeah, that sounds good. The only other scenario I can think of would be a combination of the two, but I think individually handling each scenario type will give the correct suggestions in the end. Let me put those two checks together and make sure they're good. Then I'll create something that breaks both best practices with a single model, to see what the output looks like. |
There's an issue when I try to build the model with my new CTEs. It looks like duckdb doesn't support
My environment has duckdb version 0.9.2, dbt-duckdb version 1.7.1 and dbt-core version 1.7.5 . Do you know if there's a similar function to listagg on duckdb? I can look into other commands that would work to bring multiple warnings into a single string. I also just posted on the dbt Slack #db-duckdb channel, to ask if anyone has suggestions. |
Thanks, @graciegoheen! I also played around with the list_agg and you're right, the "within group" ordering bit is what was throwing off duckdb. I commented that out and it built fine. I'll work out how to use the macro. I'll take a look at the docs for the macro and materializing the test model as a table. |
I'm running into an issue trying to get the In the CTE before the one I'm adding, I create the paths into a column called
This compiles to:
There is a comma missing after the delimiter The second issue is, if I add the comma manually in the compiled SQL and run that in Duckdb, it tells me Duckdb doesn't support three variables in the
If I remove the
|
I figured out the problem from the comment above. I was putting this into the model: ,'More than one source. Split into separate staging models in: ' ||
{{ dbt.listagg(measure="list_agg_string", delimiter_text="' AND '", order_by_clause="list_agg_string") }} as change_file_path_to but the I got the model built and next I need to work on the integration tests. |
@graciegoheen Hi Grace. I've figured out the logic and now I've moved onto the Integration Tests (wish me luck 🤣). It looks like there were changes to the test I could figure out how to add changes that would violate the new updated check by adding a new staging model with multiple sources from different directories. I'd either have to update an existing model, which would cause other rules' seed files to also require updates, or create a new model, which would require changes in the seed files for tests that count percentage of a condition over total models (i.e., Any thoughts on how best to proceed? Should I create a PR to see which parts of the integration tests fail, set it to draft, and then work on getting the values to match? |
No description provided.
The text was updated successfully, but these errors were encountered: