Fix type of a batch returned by make_batch_reader when TransformSpec's function returns a column with all values being None #750

selitvin · 2022-04-08T06:49:21Z

Resolves #744

We implemented a Unischema->Pyarrow-schema conversion and explicitly
set the pyarrow schema when converting a pandas dataframe returned by
transform spec function into a pyarrow table. This way, pyarrow does not
have to guess the type of data from the data itself (which it obviously
could not do before, since all values were None).

oby1 · 2022-04-12T19:18:23Z

petastorm/tests/test_parquet_reader.py

@@ -209,6 +209,33 @@ def preproc_fn1(x):
               (3, 4, 5) == sample['tensor_col_2'].shape[1:]


+@pytest.mark.parametrize('null_column_dtype', [np.float64, np.unicode_])
+@pytest.mark.parametrize('reader_factory', _D)
+def test_transform_spec_returns_all_none_values(scalar_dataset, null_column_dtype, reader_factory):


These test cases assume the transform_spec func is creating the null values. In the more common case, there are missing values in fields unedited by the transform_spec. I believe this solution already addresses both cases but it would be good to demonstrate this in the tests either with an additional test case or additional non-edited fields in this test case.

arhan-gunel · 2022-07-22T18:31:10Z

petastorm/tests/test_parquet_reader.py

+@pytest.mark.parametrize('reader_factory', _D)
+def test_transform_spec_returns_all_none_values_in_a_list_field(scalar_dataset, reader_factory):
+    def fill_id_with_nones(x):
+        return pd.DataFrame({'int_fixed_size_list': [[None for _ in range(3)]] * len(x)})


We also ran into the NoneType issue with lists of strings. Consider adding string types to the test as well.

The NoneType problem also occurs when only some of the values in the list are None, e.g. ['a', 'b', None]. What about a test_transform_spec_returns_some_none_values_in_a_list_field?

CLAassistant · 2023-02-16T07:46:23Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Yevgeni Litvin seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

selitvin changed the title ~~Fix type of a batch returned by make_batch_reader when TransformSpec's function sets an entire column to None.~~ Fix type of a batch returned by make_batch_reader when TransformSpec's function returns a column with all values being None · Apr 8, 2022

selitvin closed this Apr 9, 2022

oby1 mentioned this pull request Apr 10, 2022

Use of transform_spec in make_batch_reader leads to tensorflow error when column is missing values #744

Open

selitvin reopened this Apr 12, 2022

oby1 reviewed Apr 12, 2022

arhan-gunel reviewed Jul 22, 2022

Yevgeni Litvin added 5 commits September 13, 2022 22:13

Handle timestamp type correctly.

02bc27c

Test fix

bb3dbec

Fix failing "test_transform_spec_returns_all_none_values" test.

9b2bb69

The test tests properly columns with scalars only. Will need to verify correct behavior with columns that are lists separately. Will extend the tests in the following commits.

Add tests: without transform_spec and with list of strings with some, all elements being None. Change the type of numpy dtype to np.object instead of np.unicode_

1fcf22f

selitvin force-pushed the selitvin/fix_all_none_return_by_transformer branch from 5d28818 to 1fcf22f · September 14, 2022 05:58

Yevgeni Litvin added 2 commits September 14, 2022 12:54

Type fix

6ef18b3

Mypy error ignore

ad9defb
