
SNOW-886649: write_pandas inserts datetime64[ns] to Snowflake as an Invalid Date #991

Open
dwelden opened this issue Aug 4, 2023 · 8 comments
Labels: bug (Something isn't working), needs triage (Initial RCA is required)

Comments


dwelden commented Aug 4, 2023

Please answer these questions before submitting your issue. Thanks!

  1. What version of Python are you using?

    Python 3.10.5

  2. What operating system and processor architecture are you using?

    Windows-10-10.0.19044-SP0

  3. What are the component versions in the environment (pip freeze)?
    pipfreeze.txt

  4. What did you do?

import pandas as pd
from snowflake.snowpark import Session

# connection_parameters is a placeholder for the account/credentials dict
session = Session.builder.configs(connection_parameters).create()

df = pd.read_sas(source_file, format='sas7bdat', encoding='cp1252')
session.write_pandas(df, target_table, overwrite=overwrite)

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152130 entries, 0 to 152129
Data columns (total 5 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   ProductKey          152130 non-null  object
 1   DistributionCenter  152130 non-null  object
 2   DATE                152130 non-null  datetime64[ns]
 3   Discount            152130 non-null  float64
 4   Revenue             151830 non-null  float64
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 5.8+ MB

>>> df
                                           ProductKey DistributionCenter       DATE  Discount  Revenue
0       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-02-10       0.0      0.0
1       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-02-17       0.0      0.0
2       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-02-24       0.0      0.0
3       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-03-03       0.0      0.0
4       Javier's Deep Cleansing Hair Shampoo 2.718 dl               Cary 2013-03-10       0.0      0.0
...                                               ...                ...        ...       ...      ...
152125            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-02-15       0.0  27027.0
152126            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-02-22       0.0  28124.0
152127            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-03-01       0.0  25574.0
152128            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-03-08       0.0  25812.0
152129            Sujatha's Sensitve skin Lotion 8 dl            Toronto 2015-03-15       0.0  25532.0

[152130 rows x 5 columns]
  5. What did you expect to see?

Expected the pandas datetime column to be written to Snowflake as TIMESTAMP_NTZ. Instead, the dates do not load properly:

PRODUCTKEY  DISTRIBUTIONCENTER  DATE          DISCOUNT  REVENUE
Cleanser    Cary                Invalid Date  0         0
Shampoo     Cary                Invalid Date  0         0
...         ...                 ...           ...       ...
  6. Can you set logging to DEBUG and collect the logs?

sas2snow2.log

dwelden added the bug (Something isn't working) and needs triage (Initial RCA is required) labels Aug 4, 2023
github-actions bot changed the title to "SNOW-886649: write_pandas inserts datetime64[ns] to Snowflake as an Invalid Date" Aug 4, 2023

dwelden commented Aug 4, 2023

The data set used is DCSKINPRODUCT.sas7bdat from SAS-Visual-Forecasting---sample-data-sets

DCSKINPRODUCT.zip

-- Query Snowflake to examine the invalid dates
select to_varchar("DATE") from dcskinproduct limit 20;

43113053-09-24 17:00:00.000
43132219-01-30 16:00:00.000
43151384-06-05 17:00:00.000
43170549-10-11 17:00:00.000
43189715-02-16 16:00:00.000
43208880-06-22 17:00:00.000
43228045-10-28 17:00:00.000
43247211-03-05 16:00:00.000
43266376-07-10 17:00:00.000
43285541-11-15 16:00:00.000
43304707-03-23 17:00:00.000
43323872-07-27 17:00:00.000
43343037-12-02 16:00:00.000
43362203-04-09 17:00:00.000
43381368-08-13 17:00:00.000
43400533-12-19 16:00:00.000
43419699-04-25 17:00:00.000
43438864-08-30 17:00:00.000
43458030-01-05 16:00:00.000
43477195-05-13 17:00:00.000

The datetime64[ns] dtype is stored internally as an int64 count of nanoseconds since 1970-01-01. Given the year values Snowflake shows here, my suspicion is that write_pandas is interpreting the value as if it were milliseconds.

>>> df['DATE'][0], df['DATE'].astype('int64')[0]
(Timestamp('2013-02-10 00:00:00'), 1360454400000000000)

-- Interpreting the int64 as milliseconds reproduces the corrupt value:
select to_varchar(timestampadd('milliseconds', 1360454400000000000, '1970-01-01'::timestamp));
43113053-09-25 00:00:00.000

-- Interpreting it as nanoseconds yields the correct timestamp:
select to_varchar(timestampadd('nanoseconds', 1360454400000000000, '1970-01-01'::timestamp));
2013-02-10 00:00:00.000
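
The same check can be done purely in pandas (a minimal sketch; the raw int64 below is the value printed above):

import pandas as pd

raw = 1360454400000000000
# unit='ns' mirrors timestampadd('nanoseconds', ...) and gives the right date
print(pd.to_datetime(raw, unit='ns'))  # 2013-02-10 00:00:00
# unit='ms' would place the date ~43 million years out, overflowing pandas'
# datetime64[ns] range (OutOfBoundsDatetime) -- the same year Snowflake shows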


dwelden commented Aug 4, 2023

For now, I have the following workaround in place, but this should not be required.

    # Map datetime columns to string (needed because of bug in write_pandas)
    from pandas.api.types import is_datetime64_dtype

    for column in df.columns.to_list():
        if is_datetime64_dtype(df[column]):
            df[column] = df[column].dt.strftime('%Y-%m-%d %H:%M:%S.%f')

ericpettengill commented:

Linking snowflake-connector-python#600 and the comment there. The new release of snowflake-connector-python adds a use_logical_type parameter.
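
A minimal sketch of the connector-level call (assuming a snowflake-connector-python release that includes the parameter; conn is a placeholder snowflake.connector connection and the table name is illustrative):

from snowflake.connector.pandas_tools import write_pandas

# use_logical_type=True makes the staged Parquet carry logical timestamp
# types, so COPY INTO interprets nanosecond values correctly
success, num_chunks, num_rows, _ = write_pandas(
    conn,
    df,
    "DCSKINPRODUCT",
    use_logical_type=True,
)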

sfc-gh-aalam (Contributor) commented:

Closing this issue, as use_logical_type is working as expected. Please open a new issue if you run into further problems.


dwelden commented Nov 14, 2023

When will this parameter be available in Snowpark? The new parameter is not available in snowflake.snowpark.Session.write_pandas in Snowpark for Python version 1.9.0 or 1.10.0:

>>> session.write_pandas(df, target_table, use_logical_type=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Session.write_pandas() got an unexpected keyword argument 'use_logical_type'
>>> from snowflake.snowpark.version import VERSION as snowpark_version
>>> print(snowpark_version)
(1, 10, 0)

sfc-gh-aalam (Contributor) commented:

It is coming in the next release.
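
(A sketch of the expected call once the parameter ships, assuming the release after 1.10.0 accepts it; session, df, and target_table are the objects from the reproduction above:)

from snowflake.snowpark.version import VERSION as snowpark_version

# guard: only pass the flag on releases expected to support it
if snowpark_version > (1, 10, 0):
    session.write_pandas(df, target_table, use_logical_type=True)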

sfc-gh-aalam reopened this Nov 14, 2023
sfc-gh-aalam (Contributor) commented:

My bad. I thought this was a snowflake-connector-python issue.


ameryisafreeelf commented Jul 31, 2024

Bump: I'm currently running into related issues. Happy to open a new issue, but I assume this one is still open for a good reason. Thanks in advance for any eyes here.

Edit:
Python deps

  • Python 3.10.13
  • pandas 2.2.2
  • snowflake 0.6.0
  • snowflake-connector-python 3.7.1
  • snowflake-snowpark-python 1.13.0

I'm successfully passing use_logical_type to write_pandas(), but at the Snowflake level this seems to allow the COPY from staged Parquet into Snowflake to write junk into tables. Most notably, I have the following case:

  • I'm taking a datetime.datetime that gets written to Parquet with use_logical_type. I believe this enables Parquet-specific encoding (under the hood, storing the timestamp as an integer offset from the Unix epoch).
  • The COPY command from Parquet to Snowflake results in Invalid Date being written to a column of type TIMESTAMP_NTZ(9).

As a user, this is hard to triage because the staged Parquet files are difficult to peek into (see the pyarrow sketch below). In my opinion, the write_pandas command should not be able to write something like Invalid Date to Snowflake in a TIMESTAMP column. It would be nice if Snowflake couldn't store that to begin with, but I assume there's a good reason for that, and it's a layer deeper than the issue at hand.

Is there a good workaround for this? My current fix is just to avoid using use_logical_type altogether.
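
For what it's worth, one way to peek at the encoding is to write the same DataFrame to a local Parquet file and inspect its schema with pyarrow (a minimal sketch, assuming pyarrow is installed and using a toy frame in place of the real data):

import pandas as pd
import pyarrow.parquet as pq

# toy stand-in for the real DataFrame; write_pandas stages files like this
df = pd.DataFrame({"DATE": pd.to_datetime(["2013-02-10"])})
df.to_parquet("sample.parquet")

# the field type shows which timestamp unit / logical type was encoded,
# which is what Snowflake's COPY must interpret
print(pq.read_schema("sample.parquet").field("DATE").type)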
