Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1367643: When/Otherwise throws a KeyError depending on the order of the select vs where clause #1519

Closed
samuelsongsr opened this issue May 6, 2024 · 1 comment
Labels
bug Something isn't working local testing Local Testing issues/PRs needs triage Initial RCA is required

Comments

@samuelsongsr
Copy link

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    Python 3.10.4 (main, May 26 2022, 13:33:07) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]

  2. What are the Snowpark Python and pandas versions in the environment?

    pandas==2.2.1
    snowflake-snowpark-python==1.15.0

  3. What did you do?
    I'm trying to test a when().otherwise() operation and it throws a KeyError exception if the where() clause is placed before the select() clause.

import snowflake.snowpark as sp
import snowflake.snowpark.functions as sf
import snowflake.snowpark.types as st


mock_session = sp.Session.builder.config("local_testing", True).create()

inputs = mock_session.create_dataframe([
    sp.Row(ID=1, FIRST_NAME="Homer", LAST_NAME="Simpson"),
    sp.Row(ID=2, FIRST_NAME="Marge", LAST_NAME="Simpson"),
    sp.Row(ID=3, FIRST_NAME="Clark", LAST_NAME="Kent"),
    sp.Row(ID=4, FIRST_NAME="Bruce", LAST_NAME="Wayne")
])

# Note: if you place the select(...) before the where(...) call, this works.

df = inputs.where(
    (sf.col("ID") < 10) &
    (sf.col("FIRST_NAME").in_("Marge", "Clark"))
).select(
    sf.col("ID"),
    sf.col("FIRST_NAME"),
    sf.when(
        sf.col("LAST_NAME").in_("Kent", "Bruce"),
        sf.lit("---REDACTED---").astype(st.StringType())
    ).otherwise(
        sf.col("LAST_NAME")
    ).alias("LAST_NAME")
)

results = df.collect()

KeyError: [2] not in index

As noted in the comment, changing the order of operations by calling select(...) before where(...) works. However, this won't work for all use cases since there are cases where a select doesn't include a column that's filtered on. For example, given the inputs above:

# works in Snowpark connected to our Snowflake instance, does not work under test.
df = inputs.select(
    sf.col("FIRST_NAME"),
    sf.col("LAST_NAME")
).where(
    (sf.col("ID") == 1)
)

results = df.collect()

KeyError: '"ID"'
  1. What did you expect to see?

I expected two results:

[Row(ID=2, FIRST_NAME='Marge', LAST_NAME='Simpson'),
 Row(ID=3, FIRST_NAME='Clark', LAST_NAME='---REDACTED---')]
@samuelsongsr samuelsongsr added bug Something isn't working local testing Local Testing issues/PRs needs triage Initial RCA is required labels May 6, 2024
@github-actions github-actions bot changed the title When/Otherwise throws a KeyError depending on the order of the select vs where clause SNOW-1367643: When/Otherwise throws a KeyError depending on the order of the select vs where clause May 6, 2024
@samuelsongsr
Copy link
Author

updated to 1.20.0 and I can't reproduce the bug

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working local testing Local Testing issues/PRs needs triage Initial RCA is required
Projects
None yet
Development

No branches or pull requests

1 participant