to_sparse failed for Value with ragged_rank > 1 read from parquet file #69

SamJia · 2022-08-02T06:00:04Z

Current behavior

When hb reads nested lists with ragged_rank > 1, the resulting Value cannot be converted to a SparseTensor by hb.data.to_sparse.

For example:
dense_feature is one of the features read by hb.data.ParquetDataset, and to_sparse does not work for it.

Moreover, if I swap the order of the two nested_row_splits, to_sparse works.

So maybe the order of the nested_row_splits produced when reading the parquet file is incorrect?

Expected behavior

The Value read from a parquet file can be converted to a SparseTensor.

System information

GPU model and memory: No
OS Platform: Ubuntu
Docker version: No
GCC/CUDA/cuDNN version: 7.4/No/No
Python/conda version: 3.6.13/4.13.0
TensorFlow/PyTorch version: 1.14.0

Code to reproduce

import tensorflow as tf
import hybridbackend.tensorflow as hb
dataset = hb.data.ParquetDataset("test2.zstd.parquet", batch_size=1)
dataset = dataset.apply(hb.data.to_sparse())
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
sess = tf.Session()
vals = sess.run(next_element)

# One more simple demo:
import numpy as np  # needed for np.array below
import tensorflow as tf
import hybridbackend.tensorflow as hb
val = hb.data.dataframe.DataFrame.Value(
    values=np.array([1, 2, 3, 4, 5]),
    nested_row_splits=(np.array([0, 1, 3, 4, 5]), np.array([0, 2, 4])))
sess = tf.Session()
sess.run(val.to_sparse())

Willing to contribute

Yes

2sin18 · 2022-08-02T06:04:10Z

Thanks for your report, I will look into it.

SamJia · 2022-08-03T06:30:43Z

An example to create a parquet dataset file and reproduce the error:

# Create parquet file
import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([[[1], [2, 3]], [[4], [5]]], pa.list_(pa.list_(pa.int64())))
table = pa.Table.from_arrays([arr], ['test'])
pq.write_table(table, 'test.zstd.parquet', compression='ZSTD')

# Reading the parquet file
import tensorflow as tf
import hybridbackend.tensorflow as hb

dataset = hb.data.ParquetDataset("test.zstd.parquet", batch_size=2)
dataset = dataset.apply(hb.data.to_sparse())
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()  
sess = tf.Session()
vals = sess.run(next_element)
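
To make the expected nesting explicit, here is a minimal pure-Python sketch (no pyarrow or HybridBackend calls) that derives the flat values and the two row-splits arrays from the same nested list written to the parquet file above. The function name `nested_list_to_row_splits` is hypothetical and only for illustration; it assumes exactly two levels of nesting and returns the splits outermost first, which is the order TensorFlow uses for `nested_row_splits`.

```python
def nested_list_to_row_splits(rows):
    """Flatten a list of lists of lists into values plus (outer, inner) row splits."""
    outer = [0]   # splits over the inner rows, one entry per top-level row
    inner = [0]   # splits over the flat values, one entry per inner row
    values = []
    for row in rows:
        for sub in row:
            values.extend(sub)
            inner.append(len(values))
        # After finishing a top-level row, record how many inner rows exist so far.
        outer.append(len(inner) - 1)
    return values, (outer, inner)


values, (outer, inner) = nested_list_to_row_splits([[[1], [2, 3]], [[4], [5]]])
print(values)  # [1, 2, 3, 4, 5]
print(outer)   # [0, 2, 4]
print(inner)   # [0, 1, 3, 4, 5]
```

For this data the outer splits are [0, 2, 4] and the inner splits are [0, 1, 3, 4, 5], i.e. the reverse of the order shown in the simple demo in the original report.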

DelightRun · 2023-05-29T11:18:07Z

This error still seems to exist in 0.8.0.

2sin18 · 2023-05-29T11:53:33Z

@DelightRun Could you try the latest commit?

DelightRun · 2023-05-30T02:03:44Z

I use your pre-built v0.8.0 wheel package with TensorFlow 1.15.0.
It's not very convenient for me to compile from source (I use this in our prod env, which has several limits).

However, the problem seems to be that nested_row_splits needs to be reversed:

WRONG CODE

# Splits passed in the order they are currently produced (innermost first)
import numpy as np  # needed for np.array below
import tensorflow as tf
import hybridbackend.tensorflow as hb
val = hb.data.dataframe.DataFrame.Value(
    values=np.array([1, 2, 3, 4, 5]),
    nested_row_splits=(np.array([0, 1, 3, 4, 5]), np.array([0, 2, 4])))
sess = tf.Session()
sess.run(val.to_sparse())

RIGHT CODE

# Same splits, reversed with [::-1] so the outermost level comes first
import numpy as np  # needed for np.array below
import tensorflow as tf
import hybridbackend.tensorflow as hb
val = hb.data.dataframe.DataFrame.Value(
    values=np.array([1, 2, 3, 4, 5]),
    nested_row_splits=(np.array([0, 1, 3, 4, 5]), np.array([0, 2, 4]))[::-1])
sess = tf.Session()
sess.run(val.to_sparse())
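
A quick way to see why only the reversed order is consistent is to check the splits against each other. The following is a minimal pure-Python sketch, not part of HybridBackend; the helper name `check_nested_row_splits` is hypothetical. It assumes the TensorFlow convention that `nested_row_splits` is listed outermost first, so the last entry of each level must equal the number of rows in the next level, and the last entry of the innermost level must equal the number of values.

```python
def check_nested_row_splits(nested_row_splits, num_values):
    """Return True if the splits are consistent when read outermost first."""
    expected_rows = None
    for splits in nested_row_splits:
        splits = list(splits)
        # Every row-splits array starts at 0 and never decreases.
        if not splits or splits[0] != 0:
            return False
        if any(a > b for a, b in zip(splits, splits[1:])):
            return False
        # This level must have exactly as many rows as the outer level points into.
        if expected_rows is not None and len(splits) - 1 != expected_rows:
            return False
        expected_rows = splits[-1]
    # The innermost level must cover all flat values.
    return expected_rows == num_values


reported = ([0, 1, 3, 4, 5], [0, 2, 4])
print(check_nested_row_splits(reported, 5))        # False
print(check_nested_row_splits(reported[::-1], 5))  # True
```

The helper only uses `len`, indexing, and iteration, so NumPy arrays such as those in the demos should work as inputs too.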

2sin18 · 2023-05-30T02:39:49Z

You are right. The issue has been fixed, but the fix might not have been released for your platform. Which Python version, CUDA version (or CPU-only), and TensorFlow version do you use? I plan to release v1.0 in the next few days.

DelightRun · 2023-05-30T04:08:28Z

Python == 3.6
TensorFlow == 1.15.5
CPU-Only

HybridBackend is installed via pip from this wheel: https://files.pythonhosted.org/packages/11/de/1408b520b9e4eed382ee068aba170d6089d3731f0d51ce3e898b0cb2aef6/hybridbackend_tf115_cpu-0.8.0-cp36-cp36m-manylinux_2_24_x86_64.whl

DelightRun · 2023-05-30T10:05:27Z

I tried the latest commit (compiled via Docker) and still get this error.
RaggedTensor with rank >= 2 seems pretty buggy.

francktcheng · 2023-05-30T10:59:20Z

Hi @DelightRun, I tried your previous demo (adjusted for the current API) with the latest commit (4486ba1):

# Create parquet file
import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([[[1], [2, 3]], [[4], [5]]], pa.list_(pa.list_(pa.int64())))
table = pa.Table.from_arrays([arr], ['test'])
pq.write_table(table, './test.zstd.parquet', compression='ZSTD')

# Reading the parquet file
import tensorflow as tf
import hybridbackend.tensorflow as hb

dataset = hb.data.ParquetDataset("./test.zstd.parquet", batch_size=2)
dataset = dataset.apply(hb.data.parse())
next_element = tf.data.make_one_shot_iterator(dataset).get_next()
sess = tf.Session()
vals = sess.run(next_element)
print(vals)

The output is

{'test': SparseTensorValue(indices=array([[0, 0, 0],
       [0, 1, 0],
       [0, 1, 1],
       [1, 0, 0],
       [1, 1, 0]]), values=array([1, 2, 3, 4, 5]), dense_shape=array([2, 2, 2]))}

This looks correct. Could you reproduce this result?
My environment is:
python == 3.6
tensorflow == 1.15.5
hybridbackend == 1.0.0 (cpu-only)
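
As an independent cross-check of the SparseTensorValue above, the same indices and dense shape can be derived by hand from the row splits. This is a minimal pure-Python sketch that does not call HybridBackend or TensorFlow; the function name `ragged_to_sparse` is hypothetical, and it assumes the splits are given outermost first.

```python
def ragged_to_sparse(values, nested_row_splits):
    """Compute COO indices, values and dense shape from outermost-first row splits."""
    splits = [list(s) for s in nested_row_splits]
    depth = len(splits)
    indices = []
    max_len = [0] * depth  # longest row seen at each ragged level

    def walk(level, row, prefix):
        start, stop = splits[level][row], splits[level][row + 1]
        max_len[level] = max(max_len[level], stop - start)
        for pos, child in enumerate(range(start, stop)):
            if level + 1 == depth:
                # Innermost level: child is a position in the flat values.
                indices.append(prefix + [pos])
            else:
                # Otherwise child is a row of the next, more deeply nested level.
                walk(level + 1, child, prefix + [pos])

    num_rows = len(splits[0]) - 1
    for row in range(num_rows):
        walk(0, row, [row])
    return indices, list(values), [num_rows] + max_len


indices, vals, dense_shape = ragged_to_sparse(
    [1, 2, 3, 4, 5], ([0, 2, 4], [0, 1, 3, 4, 5]))
print(indices)      # [[0, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 1, 0]]
print(dense_shape)  # [2, 2, 2]
```

With the outer splits first, the sketch yields the same indices and dense_shape as the output reported above.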

2sin18 self-assigned this Aug 2, 2022

2sin18 added the bug label Aug 2, 2022

This was referenced Aug 3, 2022

[Data] Fix support for nested list #70

Merged

[Data] Fix nested list support #71

Merged

2sin18 closed this as completed in #71 Aug 3, 2022

2sin18 reopened this May 30, 2023

2sin18 assigned francktcheng May 30, 2023
