[BUG] using `python -m cudf.pandas` and calling `hasattr` converts NA to NaN #17666

MarcoGorelli · 2025-01-01T12:05:28Z

Describe the bug

Here's a complete reproduction: https://colab.research.google.com/drive/1E2bWuCZhuMK_t_aevsWQhbUysSF8hsHt?usp=sharing

src = """
import pandas as pd
df = pd.DataFrame({
    "a": ["a", "a", "b", "b", "b"],
    "b": [1, 2, None, 5, 3],
    "c": [5, 4, 3, 2, 1],
})
print(df)
print(hasattr(df, 'foobar'))
print(df)
"""
with open('f.py', 'w', encoding='utf-8') as fd:
  fd.write(src)

If I then run

%%bash
python -m cudf.pandas f.py

then I get

   a     b  c
0  a   1.0  5
1  a   2.0  4
2  b  <NA>  3
3  b   5.0  2
4  b   3.0  1
False
   a    b  c
0  a  1.0  5
1  a  2.0  4
2  b  NaN  3
3  b  5.0  2
4  b  3.0  1

Spotted in Narwhals

Expected behavior
Calling `hasattr` should not change the contents of the DataFrame.


mroeschke · 2025-01-02T19:46:23Z

Thanks for the report.

Possibly more simply: once the repr of the pandas DataFrame is accessed, it "overrides" the repr of the cudf DataFrame object.

In [1]: %load_ext cudf.pandas

In [2]: import pandas as pd
   ...: df = pd.DataFrame({
   ...:     "a": ["a", "a", "b", "b", "b"],
   ...:     "b": [1, 2, None, 5, 3],
   ...:     "c": [5, 4, 3, 2, 1],
   ...: })

In [3]: df
Out[3]: 
   a     b  c
0  a   1.0  5
1  a   2.0  4
2  b  <NA>  3
3  b   5.0  2
4  b   3.0  1

In [4]: df._fsproxy_slow
Out[4]: 
   a    b  c
0  a  1.0  5
1  a  2.0  4
2  b  NaN  3
3  b  5.0  2
4  b  3.0  1

In [5]: df
Out[5]: 
   a    b  c
0  a  1.0  5
1  a  2.0  4
2  b  NaN  3
3  b  5.0  2
4  b  3.0  1

MarcoGorelli · 2025-01-02T22:36:12Z

I'll check tomorrow, but I think it was actually affecting results (e.g. df['b'].cumsum())

MarcoGorelli · 2025-01-02T23:40:39Z

Yup, here's a repro which better demonstrates the issue:

src = """
import pandas as pd
df = pd.DataFrame({
    "a": ["a", "a", "b", "b", "b"],
    "b": [1, 2, None, 5, 3],
    "c": [5, 4, 3, 2, 1],
})
print(df)
print(df.groupby('a')['b'].cumsum())
print(hasattr(df, 'foobar'))
print(df)
print(df.groupby('a')['b'].cumsum())
"""
with open('f.py', 'w', encoding='utf-8') as fd:
  fd.write(src)

The output is

   a     b  c
0  a   1.0  5
1  a   2.0  4
2  b  <NA>  3
3  b   5.0  2
4  b   3.0  1
0     1.0
1     3.0
2    <NA>
3     5.0
4     8.0
Name: b, dtype: float64
False
   a    b  c
0  a  1.0  5
1  a  2.0  4
2  b  NaN  3
3  b  5.0  2
4  b  3.0  1
0    1.0
1    3.0
2    NaN
3    NaN
4    NaN
Name: b, dtype: float64

So, we go from

0     1.0
1     3.0
2    <NA>
3     5.0
4     8.0
Name: b, dtype: float64

to

0    1.0
1    3.0
2    NaN
3    NaN
4    NaN
Name: b, dtype: float64
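One way to read the difference between the two outputs, sketched in plain Python rather than cudf: it assumes the cumulative sum skips a masked null but treats NaN as an ordinary float, which is consistent with the output above but is not confirmed in this thread. The helper `cumsum_skip_null` is hypothetical and only models group "b" of the repro.

```python
import math


def cumsum_skip_null(values):
    """Running sum where None (a masked null) is skipped and echoed back as None."""
    total = 0.0
    out = []
    for v in values:
        if v is None:
            # A null stays null and leaves the running total untouched.
            out.append(None)
        else:
            # NaN is a regular float here, so it poisons the total from now on.
            total += v
            out.append(total)
    return out


# Group "b" from the repro, before and after the null became a NaN.
with_null = [None, 5.0, 3.0]
with_nan = [math.nan, 5.0, 3.0]

print(cumsum_skip_null(with_null))  # [None, 5.0, 8.0]
print(cumsum_skip_null(with_nan))   # [nan, nan, nan]
```

Under that assumption, rows 2 to 4 change from `<NA>, 5.0, 8.0` to `NaN, NaN, NaN` purely because the missing value changed representation, not because the data changed.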

mroeschke · 2025-01-02T23:51:52Z

Ah OK, thanks for the additional repro.

I think that when repr-ing with `print`, the cudf.pandas df undergoes a cudf to pandas to cudf round trip, and for column `b` we hit a dtype mismatch on that round trip, e.g.

In [2]: import cudf

In [3]: cudf.DataFrame([1, None]).dtypes
Out[3]: 
0    int64
dtype: object

In [4]: cudf.DataFrame.from_pandas(cudf.DataFrame([1, None]).to_pandas()).dtypes
Out[4]: 
0    float64
dtype: object

galipremsagar · 2025-01-02T23:55:40Z

This is because of the `nan_as_null` parameter, which is applied during the round trip. I'm working on a fix.
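A rough model of how a `nan_as_null`-style switch could decide whether a null survives the round trip. This is a plain-Python sketch, not cudf code: `to_pandas_like` and `from_pandas_like` are hypothetical stand-ins for the two conversion steps, and the column is modelled as a list in which `None` marks a masked null.

```python
import math


def to_pandas_like(column):
    """Hypothetical GPU-to-CPU step: masked nulls are represented as NaN."""
    return [math.nan if v is None else float(v) for v in column]


def from_pandas_like(values, nan_as_null):
    """Hypothetical CPU-to-GPU step, with a flag modelled on nan_as_null."""
    if nan_as_null:
        # NaN is mapped back to a masked null.
        return [None if math.isnan(v) else v for v in values]
    # NaN is kept as a plain float value, so the null is lost.
    return list(values)


original = [1.0, 2.0, None, 5.0, 3.0]  # column "b" from the repro
host = to_pandas_like(original)

restored = from_pandas_like(host, nan_as_null=True)
lossy = from_pandas_like(host, nan_as_null=False)

print(restored == original)   # True: the null comes back
print(math.isnan(lossy[2]))   # True: the null has turned into NaN
```

In this toy model the flag alone determines whether the value at index 2 returns as a null or as NaN; which value the real round trip uses, and where, is not stated in the thread.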

MarcoGorelli · 2025-01-03T09:21:38Z

sure, thanks

No objections to fixing it like this, but I think falling back to pandas after a simple `hasattr` check is going to disappoint your users: `hasattr` checks are extremely common in all kinds of libraries (pymc, scikit-learn, ...).

Falling back to pandas just for the sake of raising an error message (which gets discarded by `hasattr` anyway) seems worse than raising an `AttributeError` with a marginally different message from pandas'.

EDIT: I've made a separate issue about this: #17678
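To illustrate the trade-off, here is a toy fast/slow proxy in plain Python. It is not the cudf.pandas implementation: `ToyProxy`, `CheapCheckProxy`, `Fast`, and `Slow` are all hypothetical names, and the `fallbacks` counter stands in for the cost of converting to the slow object. The first proxy converts on every failed lookup; the second raises `AttributeError` directly when neither type knows the name.

```python
class Fast:
    def head(self):
        return "fast head"


class Slow:
    def head(self):
        return "slow head"

    def legacy_only(self):
        return "slow only"


class ToyProxy:
    """Falls back to the slow object whenever the fast one lacks an attribute."""

    def __init__(self, fast, slow_factory):
        self._fast = fast
        self._slow_factory = slow_factory
        self.fallbacks = 0

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup on the proxy fails.
        try:
            return getattr(self._fast, name)
        except AttributeError:
            self.fallbacks += 1  # models the expensive conversion
            return getattr(self._slow_factory(), name)


class CheapCheckProxy(ToyProxy):
    """Raises immediately if neither the fast object nor the slow type has the name."""

    def __init__(self, fast, slow_factory, slow_type):
        super().__init__(fast, slow_factory)
        self._slow_type = slow_type

    def __getattr__(self, name):
        if not hasattr(self._fast, name) and not hasattr(self._slow_type, name):
            raise AttributeError(name)  # no conversion needed
        return super().__getattr__(name)


eager = ToyProxy(Fast(), Slow)
cheap = CheapCheckProxy(Fast(), Slow, Slow)

print(hasattr(eager, "foobar"), eager.fallbacks)  # False 1
print(hasattr(cheap, "foobar"), cheap.fallbacks)  # False 0
```

Both proxies give the same `hasattr` answer, since `hasattr` swallows the `AttributeError` either way; only the first one pays for a fallback to get there.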

MarcoGorelli added the bug Something isn't working label Jan 1, 2025

MarcoGorelli changed the title ~~[BUG] using python -m cudf.pandas and using hasattr converts NA to NaN~~ [BUG] using python -m cudf.pandas and calling hasattr converts NA to NaN Jan 1, 2025

galipremsagar self-assigned this Jan 2, 2025

mroeschke added the cudf.pandas Issues specific to cudf.pandas label Jan 2, 2025

github-project-automation bot added this to cuDF Python Jan 2, 2025

github-project-automation bot moved this to Todo in cuDF Python Jan 2, 2025

galipremsagar mentioned this issue Jan 3, 2025

typecast all nulls to nans for float #17676

Closed

3 tasks

GPUtester moved this from Todo to In Progress in cuDF Python Jan 3, 2025

galipremsagar linked a pull request Jan 3, 2025 that will close this issue

convert all nulls to nans in a specific scenario #17677

Draft

3 tasks

This was referenced Jan 3, 2025

[FEA] Don't fallback to pandas after simpe hasattr check #17678

Open

[FEA] Add narwhals tests to our test suite #17662

Open
