-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update hash_standardization.sql - handle case with an empty string #110
base: main
Are you sure you want to change the base?
Update hash_standardization.sql - handle case with an empty string #110
Conversation
Wenn Business Key is an empty string ( '' ), a ghost records hash value should be returned ( 0000... ).
Hi @dpolishchuk68 and thanks for creating this PR! When I understand your proposed solution correctly, NULL inputs and empty strings should be treated the same by the Hash Algorithm.
I agree, that especially when talking about Business Keys, also empty string should be handled somehow, to mark that there is something wrong with this data. But imho the better solution here would be to insert the error key instead. But when doing that, I would like to have that configurable. As in configuring for each Hashkey (or even for each Hashkey component), which columns are mandatory and which are not. Based on that configuration, when a mandatory column is an empty string, or NULL, the error key should be inserted. But when a column is not set as mandatory, only NULL values should be replaced be the unknown key. This will definetely require many bigger changes in the way of defining Hashkeys, and of course also in the background macros. Best regards |
I was researching a bit on the topic and I found this comment in Salesforce OBJ-1359 from @tta-scalefree : "According to what we teach in bootcamp/training, empty strings and null values should be treated equally during hashing operations." |
Hi @bschlottfeldt! @tta-scalefree @dpolishchuk68 @molschimke and I already had a deep discussion about this topic last week. The two most important decisions we made were:
Besides that, we also made a few decisions regarding the Hashing in general. We want to parameterize all three control characters with global variables:
Additionally we have to ensure that all three are properly replaced by the escaped version. Here the order is important: The Quoting character needs to be replaced first. Followed by the Column Separator and the NULL Replacement String. These changes will have a huge impact on the hashing behaviour, because Fake Deltas will be generated. Therefore we add another global variable (only until the next major release):
There isn't a clear timeline yet, but if you have sparetime, feel free to start working on these things! |
If they are optional parameters why do we need to disclaim they need to be activated ? |
The amount of hashkeys that I would have to manually change in the project because of this is absurd (specially the double pipe operator replacement with single pipe, seems very unnecessary) . We should always keep the option to use the legacy hash if users dont want to make major changes manually. |
Hi @bschlottfeldt,
We need to disclaim that when they install this right away without modifying these global parameters, fake deltas will be generated. Maybe we switch to opt-in for the new hashing for the first release, thats up for discussion.
You would just need to set the two global parameters "treat_empty_string_as_null" to false, and change the three new global parameters for the control characters of the hashing to the current values to achieve exactly the same hash output.
@molschimke suggested switching to single character pipes because a backslash for escaping something only affects the one character after it. To be configurable, this must be a one-character thing, to later in the code only escape this one-character string with Since datavault4dbt is still in a pre-release phase, we want to conduct these changes and set them as the default behavior before switching to a production state at one point. |
Ahh okay, thanks for explaining @tkirschke . |
Wenn Business Key is an empty string ( '' ) or a concatenation of empty strings, a ghost record hash value should be returned ( '00000000000000000000000000000000' ).
Below is a code snippet for testing hash keys in Snowflake with one and two business keys. Please use SQL formatter to make it readable. This text editor doesn't save new lines and indentations.
with countries as ( select * from ( VALUES ('DE', '') ,('GB', '') ,('UA', 'US') ,('US', '') ,('RO', NULL) ,(NULL,'DE') ,('', '') ,(NULL, NULL) ,(NULL, '') ,('', NULL) ) AS t(country_id, leadcountry_id) ) SELECT COUNTRY_ID AS COUNTRY_ID_1, LEADCOUNTRY_ID AS LEADCOUNTRY_ID_2, NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(NULLIF(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)),''), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^') AS Correct_value_2_before_hashing, NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^') AS wrong_value_2_before_hashing, IFNULL(LOWER(MD5(NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^'))), '00000000000000000000000000000000') AS wrong_hash, IFNULL(LOWER(MD5(NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(NULLIF(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)),''), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^'))), '00000000000000000000000000000000') AS correct_hash, NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("COUNTRY_ID" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^'),'||', IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^||^^') AS wrong_value_before_hashing_concat, NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(NULLIF(TRIM(CAST("COUNTRY_ID" AS STRING)), ''), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^'),'||', IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(NULLIF(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)), ''), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^||^^') AS correct_value_before_hashing_concat, IFNULL(LOWER(MD5(NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(NULLIF(TRIM(CAST("COUNTRY_ID" AS STRING)), ''), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^'),'||', IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(NULLIF(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)), ''), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^||^^'))), '00000000000000000000000000000000') AS correct_hash_concat, IFNULL(LOWER(MD5(NULLIF(CAST(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(CONCAT( IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("COUNTRY_ID" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^'),'||', IFNULL((CONCAT('\"', REPLACE(REPLACE(REPLACE(TRIM(CAST("LEADCOUNTRY_ID" AS STRING)), '\\', '\\\\'), '"', '\"'), '^^', '--'), '\"')), '^^') ), '\n', '') , '\t', '') , '\v', '') , '\r', '') AS STRING), '^^||^^'))), '00000000000000000000000000000000') AS wrong_hash_concat FROM countries;