Skip to content

Customer

Tim Kirschke edited this page Apr 25, 2023 · 1 revision

Source Attributes

   

Name Data type Classification
C_CUSTKEY Number Business Key
C_NATIONKEY Number Foreign Key, reference to Nation.N_NATIONKEY
C_ACCTBAL Number Descriptive Attribute, non-privacy related
C_MKTSEGMENT Varchar Descriptive Attribute, non-privacy related
C_COMMENT Varchar Descriptive Attribute, non-privacy related
C_NAME Varchar Descriptive Attribute, privacy related
C_ADDRESS Varchar Descriptive Attribute, privacy related
C_PHONE Varchar Descriptive Attribute, privacy related

DV Objects

  • customer_h (Hub) with the attribute C_CUSTKEY as its Business Key
  • customer_n_s (Satellite) with the non-privacy payload attributes C_ACCTBAL, C_MKTSEGMENT and C_COMMENT
  • customer_p_s (Satellite) with the privacy payload attributes C_NAME, C_ADDRESS and C_PHONE

Stage Model

{{ config(materialized='view', 
            schema='Stages') }}

{%- set yaml_metadata -%}
source_model:
    'TPC-H_SF1': 'Customer'
hashed_columns: 
    hk_customer_h:
        - c_custkey
    hk_nation_h:
        - c_nationkey
    hk_customer_nation_l:
        - c_custkey
        - c_nationkey
    hd_customer_p_s:
        is_hashdiff: true
        columns:
            - c_name
            - c_address
            - c_phone
    hd_customer_n_s:
        is_hashdiff: true
        columns:
            - c_acctbal
            - c_mktsegment
            - c_comment
ldts: "SYSDATE()"
rsrc: '!TPC_H_SF1.Customer'
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.stage(source_model=metadata_dict['source_model'],
                    ldts=metadata_dict['ldts'],
                    rsrc=metadata_dict['rsrc'],
                    hashed_columns=metadata_dict['hashed_columns'],
                    derived_columns=none,
                    missing_columns=none,
                    prejoined_columns=none,
                    include_source_columns=true) }}
                 

The staging model for Customer generates the Hub Hashkey, and both Hashdiffs for the Customer Satellites. Additionally, the reference to Nation requires to also generate a Hub Hashkey for Nation out of the referred Nation Business Key C_NATIONKEY, and a Link Hashkey out of both the Business Key of Customer, and the Business Key of Nation.

Data Vault Models

Hub Customer_h

{{ config(materialized='incremental',
          schema='Core') }}

{%- set yaml_metadata -%}
source_models: 
    stg_customer:
        rsrc_static: 'TPC_H_SF1.Customer'
    stg_orders:
        hk_column: hk_customer_h
        bk_columns: o_custkey
        rsrc_static: 'TPC_H_SF1.Orders'
hashkey: hk_customer_h
business_keys: c_custkey
{%- endset -%}      

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ datavault4dbt.hub(source_models=metadata_dict['source_models'],
                     hashkey=metadata_dict['hashkey'],
                     business_keys=metadata_dict['business_keys']) }}

The Hub for Customer is a multi source entity, since Nation gets referenced by Orders. Therefore you can find both stages (Customer and Orders) specified as the source model for customer_h. For each source models the rsrc_static attribute is set to the static value that is defined in each stage models.

For the second source model, stg_orders, the parameter bk_columns is used to specify the name of the input column, that equals the Business Key of Customer, within that stage, which would be O_CUSTKEY.

Note that bk_columns does not need to be defined for stg_customer, because it would equal the term defined under the top-level parameter business_keys.

Satellite Customer_n0_s (Version 0) - non-privacy

{{ config(materialized='incremental',
          schema='Core') }}

{%- set yaml_metadata -%}
source_model: stg_customer
parent_hashkey: hk_customer_h
src_hashdiff: hd_customer_n_s
src_payload:
    - c_acctbal
    - c_mktsegment
    - c_comment
{%- endset -%}      

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ datavault4dbt.sat_v0(source_model=metadata_dict['source_model'],
                     parent_hashkey=metadata_dict['parent_hashkey'],
                     src_hashdiff=metadata_dict['src_hashdiff'],
                     src_payload=metadata_dict['src_payload']) }}

Since the satellite of Customer is attached to the Customer Hub, it must include the Customer Hub Hashkey. Additionally the name of the previously calculated Hashdiff column (of the non-privacy attributes) is specified. The definition of the source model points the macro to the previously created stage model.

Note that the defined payload matches the input column configuration for the non-privacy hashdiff column.

Satellite Customer_n_s (Version 1) - non-privacy

{{ config(materialized='view',
          schema='Core') }}

{%- set yaml_metadata -%}
sat_v0: customer_n0_s
hashkey: hk_customer_h
hashdiff: hd_customer_n_s
add_is_current_flag: true
{%- endset -%}      

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ datavault4dbt.sat_v1(sat_v0=metadata_dict['sat_v0'],
                     hashkey=metadata_dict['hashkey'],
                     hashdiff=metadata_dict['hashdiff'],
                     add_is_current_flag=metadata_dict['add_is_current_flag']) }}

The Satellite v1 is materialized as a View which is required to enable an Insert-Only approach. It is configured to use the previously created Version 0 Satellite as a base. The hashkey and hashdiff configuration matches what we defined within the Satellite v0 model.

Note that the Satellite is configured to additionally add a boolean column "is_current". The name of this column is controlled by Global variables and can be configured to match users naming conventions.

Satellite Customer_p0_s (Version 0) - privacy

{{ config(materialized='incremental',
          schema='Core') }}

{%- set yaml_metadata -%}
source_model: stg_customer
parent_hashkey: hk_customer_h
src_hashdiff: hd_customer_p_s
src_payload:
    - c_name
    - c_address
    - c_phone
{%- endset -%}      

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ datavault4dbt.sat_v0(source_model=metadata_dict['source_model'],
                     parent_hashkey=metadata_dict['parent_hashkey'],
                     src_hashdiff=metadata_dict['src_hashdiff'],
                     src_payload=metadata_dict['src_payload']) }}

Since the satellite of Customer is attached to the Customer Hub, it must include the Customer Hub Hashkey. Additionally the name of the previously calculated Hashdiff column (of the privacy attributes) is specified. The definition of the source model points the macro to the previously created stage model.

Note that the defined payload matches the input column configuration for the privacy hashdiff column.

Satellite Customer_p_s (Version 1) - privacy

{{ config(materialized='view',
          schema='Core') }}

{%- set yaml_metadata -%}
sat_v0: customer_p0_s
hashkey: hk_customer_h
hashdiff: hd_customer_p_s
add_is_current_flag: true
{%- endset -%}      

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ datavault4dbt.sat_v1(sat_v0=metadata_dict['sat_v0'],
                     hashkey=metadata_dict['hashkey'],
                     hashdiff=metadata_dict['hashdiff'],
                     add_is_current_flag=metadata_dict['add_is_current_flag']) }}

The Satellite v1 is materialized as a View which is required to enable an Insert-Only approach. It is configured to use the previously created Version 0 Satellite as a base. The hashkey and hashdiff configuration matches what we defined within the Satellite v0 model.

Note that the Satellite is configured to additionally add a boolean column "is_current". The name of this column is controlled by Global variables and can be configured to match users naming conventions.

Record Tracking Satellite Customer_rts

{{ config(materialized='incremental',
          schema='Core') }}

{%- set yaml_metadata -%}
tracked_hashkey: hk_customer_h
source_models:
    stg_customer:
        rsrc_static: 'TPC_H_SF1.Customer'
    stg_orders:
        hk_column: hk_customer_h
        rsrc_static: 'TPC_H_SF1.Orders'
{%- endset -%}    

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.rec_track_sat(tracked_hashkey=metadata_dict['tracked_hashkey'],
                                source_models=metadata_dict['source_models']) }}

The Record Tracking Satellite is used to store information about appearances of Business Keys. It is attached to the Customer Hub, and should therefore match the source configuration of the Hub. That's why both sources of the Hub, Customer and Orders, are also considered for Record Tracking. The Record Tracking Satellite is configured to keep track of the Hub Hashkey hk_customer_h.

Link Customer_Nation_l

{{ config(materialized='incremental',
          schema='Core') }}

{%- set yaml_metadata -%}
source_models: stg_customer
link_hashkey: hk_customer_nation_l
foreign_hashkeys: 
    - hk_customer_h
    - hk_nation_h
{%- endset -%}      

{% set metadata_dict = fromyaml(yaml_metadata) %}

{{ datavault4dbt.link(source_models=metadata_dict['source_models'],
                     link_hashkey=metadata_dict['link_hashkey'],
                     foreign_hashkeys=metadata_dict['foreign_hashkeys']) }}

The source data for Customer includes a reference to Nation via C_NATIONKEY. In the stage model for Customer, we defined both Hub Hashkeys (Customer and Nation) and the Link Hashkey out of the Business Key of both Objects. We forward those column names into the respective parameters of the link macro, link_hashkey and foreign_hashkeys.

PIT Customer_bp

{{ config(materialized='incremental',
          schema='Core',
          post_hook="{{ datavault4dbt.clean_up_pit('snap_v1') }}") }}

{%- set yaml_metadata -%}
tracked_entity: customer_h
hashkey: 'hk_customer_h'
sat_names:
    - customer_n_s
    - customer_p_s
snapshot_relation: 'snap_v1'
snapshot_trigger_column: 'is_active'
dimension_key: 'hk_customer_d'
{%- endset -%}    

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.pit(pit_type=metadata_dict['pit_type'],
                                tracked_entity=metadata_dict['tracked_entity'],
                                hashkey=metadata_dict['hashkey'],
                                sat_names=metadata_dict['sat_names'],
                                snapshot_relation=metadata_dict['snapshot_relation'],
                                snapshot_trigger_column=metadata_dict['snapshot_trigger_column'],
                                dimension_key=metadata_dict['dimension_key'],
                                custom_rsrc=metadata_dict['custom_rsrc']) }}

A PIT is built on top of the Customer Raw Vault model, and pre-calculates the corresponding Satellite entries per snapshot date and Hub Hashkey.

The post hook of this model is set to call the PIT clean up macro, which deletes snapshot entries, that are no longer active. It is set to use the Snapshot Control v1 View, which holds the dynamic logarithmic snapshot logic.

The PIT is configured to track the Customer Hub, and more specifically, the Hub Hashkey hk_customer_h inside it. As satellites to track, both the non-privacy and the privacy Satellite for Customer are specified under sat_names.

At last, we tell the PIT how the snapshot relation (which needs to be created first!) is called, and which column in it should trigger the snapshot activation. Additionally, we tell the macro how to call the dimension key column, a surrogate hash value calculated out of the tracked hashkey and the snapshot date.