BUG: Fix multi-index on columns with bool level values does not roundtrip through parquet #60519

Open
wants to merge 6 commits into
base: main
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -709,6 +709,7 @@ I/O
- Bug in :meth:`read_stata` where the missing code for double was not recognised for format versions 105 and prior (:issue:`58149`)
- Bug in :meth:`set_option` where setting the pandas option ``display.html.use_mathjax`` to ``False`` has no effect (:issue:`59884`)
- Bug in :meth:`to_excel` where :class:`MultiIndex` columns would be merged to a single row when ``merge_cells=False`` is passed (:issue:`60274`)
- Bug in :meth:`read_parquet` raising ``ValueError`` when reading back, with the ``pyarrow`` engine, a file written with ``pyarrow`` whose columns are a :class:`MultiIndex` containing a level with bool values (:issue:`60508`)

Period
^^^^^^
8 changes: 8 additions & 0 deletions pandas/core/dtypes/astype.py
@@ -125,6 +125,14 @@ def _astype_nansafe(
        )
        raise ValueError(msg)

    if arr.dtype == object and dtype == bool:
        # If the dtype is bool and the array is object, we need to replace
        # the False and True of the object type in the ndarray with the
        # bool type to ensure that the type conversion is correct
        arr[arr == "False"] = np.False_
        arr[arr == "True"] = np.True_
        return arr.astype(dtype, copy=copy)
Comment on lines +128 to +134
Member

I don't think we should be changing the behavior of astype(bool) to special case certain values.

Contributor Author

@sunlight798 sunlight798 Dec 8, 2024

I got it. I think the method you mentioned in 60508 is quite reasonable. Thank you for your review.


    if copy or arr.dtype == object or dtype == object:
        # Explicit copy, or required since NumPy can't view from / to object.
        return arr.astype(dtype, copy=True)
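
To make the review concern concrete, here is a minimal sketch using plain NumPy only. The helper `patched_astype_bool` is a hypothetical stand-in that copies the three added lines from the hunk above; it is not the pandas function itself. It illustrates two things: an object-to-bool cast normally follows Python truthiness, so the string `"False"` becomes `True`, and the special-cased version both changes that result and overwrites elements of the caller's array.

```python
import numpy as np

# Object array holding the *strings* "True" and "False" plus an empty string.
labels = np.array(["True", "False", ""], dtype=object)

# Plain NumPy cast: object -> bool uses Python truthiness, so every
# non-empty string (including "False") maps to True.
plain = labels.astype(bool)
print(plain.tolist())  # [True, True, False]


def patched_astype_bool(arr):
    """Hypothetical stand-in mirroring the lines added in the PR."""
    arr[arr == "False"] = np.False_
    arr[arr == "True"] = np.True_
    return arr.astype(bool)


before = labels.copy()
patched = patched_astype_bool(labels)
print(patched.tolist())  # [True, False, False]

# The input array was modified in place: the strings were replaced.
print(isinstance(before[1], str), isinstance(labels[1], str))  # True False
```

Under these assumptions, the same input yields different results before and after the patch, which is the behavior change the reviewer objects to.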
12 changes: 12 additions & 0 deletions pandas/tests/io/test_parquet.py
@@ -1468,3 +1468,15 @@ def test_invalid_dtype_backend(self, engine):
            df.to_parquet(path)
            with pytest.raises(ValueError, match=msg):
                read_parquet(path, dtype_backend="numpy")

    def test_bool_multiIndex_roundtrip_through_parquet(self, pa):
        # GH 60508
        df = pd.DataFrame(
            [[1, 2], [4, 5]],
            columns=pd.MultiIndex.from_tuples([(True, 'B'), (False, 'C')]),
        )
        with tm.ensure_clean("test.parquet") as path:
            df.to_parquet(path, engine=pa)

            result = read_parquet(path, engine=pa)
        tm.assert_frame_equal(result, df)