Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update hash_standardization.sql - handle case with an empty string #110

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dpolishchuk68
Copy link
Collaborator

@dpolishchuk68 dpolishchuk68 commented Aug 16, 2023

Wenn Business Key is an empty string ( '' ) or a concatenation of empty strings, a ghost record hash value should be returned ( '00000000000000000000000000000000' ).

Below is a code snippet for testing hash keys in Snowflake with one and two business keys. Please use SQL formatter to make it readable. This text editor doesn't save new lines and indentations.
with countries as ( select * from ( VALUES ('DE', '') ,('GB', '') ,('UA', 'US') ,('US', '') ,('RO', NULL) ,(NULL,'DE') ,('', '') ,(NULL, NULL) ,(NULL, '') ,('', NULL) ) AS t(country_id, leadcountry_id) ) SELECT COUNTRY_ID AS COUNTRY_ID_1, LEADCOUNTRY_ID AS LEADCOUNTRY_ID_2, NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(NULLIF(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)),''), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^') AS Correct_value_2_before_hashing, NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^') AS wrong_value_2_before_hashing, IFNULL(LOWER(MD5(NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^'))), '00000000000000000000000000000000') AS wrong_hash, IFNULL(LOWER(MD5(NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(NULLIF(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)),''), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^'))), '00000000000000000000000000000000') AS correct_hash, NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("COUNTRY_ID" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^'),'||', IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^||^^') AS wrong_value_before_hashing_concat, NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(NULLIF(TRIM(CAST("COUNTRY_ID" AS STRING)), ''), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^'),'||', IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(NULLIF(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)), ''), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^||^^') AS correct_value_before_hashing_concat, IFNULL(LOWER(MD5(NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(NULLIF(TRIM(CAST("COUNTRY_ID" AS STRING)), ''), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^'),'||', IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(NULLIF(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)), ''), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^||^^'))), '00000000000000000000000000000000') AS correct_hash_concat, IFNULL(LOWER(MD5(NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("COUNTRY_ID" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^'),'||', IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^||^^'))), '00000000000000000000000000000000') AS wrong_hash_concat FROM countries;

Wenn Business Key is an empty string ( '' ), a ghost records hash value should be returned ( 0000... ).
@tkirschke
Copy link
Member

Hi @dpolishchuk68 and thanks for creating this PR!

When I understand your proposed solution correctly, NULL inputs and empty strings should be treated the same by the Hash Algorithm.
But this would cover the fact that these are two different data assumptions:

  • NULL assuming that there was no value delivered for a specific column
  • An empty string representing an actual input of this empty string

I agree, that especially when talking about Business Keys, also empty string should be handled somehow, to mark that there is something wrong with this data. But imho the better solution here would be to insert the error key instead.

But when doing that, I would like to have that configurable. As in configuring for each Hashkey (or even for each Hashkey component), which columns are mandatory and which are not. Based on that configuration, when a mandatory column is an empty string, or NULL, the error key should be inserted. But when a column is not set as mandatory, only NULL values should be replaced be the unknown key.

This will definetely require many bigger changes in the way of defining Hashkeys, and of course also in the background macros.
I will talk to our experts (happy to include you here!) to come up with a solution!

Best regards
Tim

@bschlottfeldt
Copy link
Contributor

bschlottfeldt commented Sep 12, 2023

I was researching a bit on the topic and I found this comment in Salesforce OBJ-1359 from @tta-scalefree : "According to what we teach in bootcamp/training, empty strings and null values should be treated equally during hashing operations."
So maybe we should fix it...
Trung please let me know your thoughts :)

@tkirschke
Copy link
Member

tkirschke commented Sep 12, 2023

Hi @bschlottfeldt!

@tta-scalefree @dpolishchuk68 @molschimke and I already had a deep discussion about this topic last week. The two most important decisions we made were:

  • New global parameter "Hashkey_treat_empty_string_as_null" with default value 'true'
  • New global parameter "Hashdiff_treat_empty_string_as_null" with default value 'false'

Besides that, we also made a few decisions regarding the Hashing in general. We want to parameterize all three control characters with global variables:

  • Column Separator (right now its '||') with new default value of '|'
  • Quoting Character with default '"' (as is now)
  • NULL Replacement String with default '^^' (as is now)

Additionally we have to ensure that all three are properly replaced by the escaped version. Here the order is important: The Quoting character needs to be replaced first. Followed by the Column Separator and the NULL Replacement String.

These changes will have a huge impact on the hashing behaviour, because Fake Deltas will be generated. Therefore we add another global variable (only until the next major release):

  • 'use_legacy_hash' with the default value 'false'. Users can set this to true to avoid the changes mentioned above. BUT we need a disclaimer that this needs to be activated at some point!

There isn't a clear timeline yet, but if you have sparetime, feel free to start working on these things!

@bschlottfeldt
Copy link
Contributor

If they are optional parameters why do we need to disclaim they need to be activated ?

@bschlottfeldt
Copy link
Contributor

The amount of hashkeys that I would have to manually change in the project because of this is absurd (specially the double pipe operator replacement with single pipe, seems very unnecessary) . We should always keep the option to use the legacy hash if users dont want to make major changes manually.

@tkirschke
Copy link
Member

tkirschke commented Sep 12, 2023

Hi @bschlottfeldt,
Let me try to address your concerns:

If they are optional parameters why do we need to disclaim they need to be activated ?

We need to disclaim that when they install this right away without modifying these global parameters, fake deltas will be generated. Maybe we switch to opt-in for the new hashing for the first release, thats up for discussion.

The amount of hashkeys that I would have to manually change in the project because of this is absurd .

You would just need to set the two global parameters "treat_empty_string_as_null" to false, and change the three new global parameters for the control characters of the hashing to the current values to achieve exactly the same hash output.

(specially the double pipe operator replacement with single pipe, seems very unnecessary)

@molschimke suggested switching to single character pipes because a backslash for escaping something only affects the one character after it. To be configurable, this must be a one-character thing, to later in the code only escape this one-character string with \<col_separator>

Since datavault4dbt is still in a pre-release phase, we want to conduct these changes and set them as the default behavior before switching to a production state at one point.

@bschlottfeldt
Copy link
Contributor

Ahh okay, thanks for explaining @tkirschke .
Sounds good :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants