to_sparse failed for Value with ragged_rank > 1 read from parquet file #69

Open
SamJia opened this issue Aug 2, 2022 · 9 comments · Fixed by #71

Comments

@SamJia
Collaborator

SamJia commented Aug 2, 2022

Current behavior

When hb reads nested lists with ragged_rank > 1, the resulting Value cannot be converted to a SparseTensor by hb.data.to_sparse.

For example:
dense_feature is one of the features read by hb.data.ParquetDataset, and to_sparse does not work for it.
image

Moreover, if I swap the order of the two nested_row_splits, to_sparse succeeds.

image

So maybe the order of the nested_row_splits is incorrect when reading a parquet file?
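A minimal sketch of how the ordering can be checked, assuming the TensorFlow RaggedTensor convention in which nested_row_splits are listed outermost-first. The helper name is_outermost_first is hypothetical and not part of HybridBackend; it only tests whether each level's last split matches the number of rows at the next level.

```python
def is_outermost_first(nested_row_splits, num_values):
    """Return True if the splits chain correctly when read outermost-first.

    Assumption: each level's last split equals the number of rows at the
    next (more inner) level, and the innermost level's last split equals
    the number of flat values.
    """
    splits = [list(s) for s in nested_row_splits]
    # Number of items each level must cover: rows of the next level, then values.
    expected = [len(s) - 1 for s in splits[1:]] + [num_values]
    return all(bool(s) and s[0] == 0 and s[-1] == n
               for s, n in zip(splits, expected))


inner = [0, 1, 3, 4, 5]  # 4 inner lists over 5 values
outer = [0, 2, 4]        # 2 rows over 4 inner lists

print(is_outermost_first((inner, outer), 5))  # order from this report -> False
print(is_outermost_first((outer, inner), 5))  # swapped order -> True
```

Under that assumption, the order in the report is innermost-first, which would be consistent with the swap making to_sparse work.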

Expected behavior

The Value read from a parquet file can be converted to a SparseTensor.

System information

  • GPU model and memory: No
  • OS Platform: Ubuntu
  • Docker version: No
  • GCC/CUDA/cuDNN version: 7.4/No/No
  • Python/conda version: 3.6.13/4.13.0
  • TensorFlow/PyTorch version: 1.14.0

Code to reproduce

import tensorflow as tf
import hybridbackend.tensorflow as hb
dataset = hb.data.ParquetDataset("test2.zstd.parquet", batch_size=1)
dataset = dataset.apply(hb.data.to_sparse())
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
sess = tf.Session()
vals = sess.run(next_element)

# One more simple demo:
import numpy as np
import tensorflow as tf
import hybridbackend.tensorflow as hb
val = hb.data.dataframe.DataFrame.Value(values = np.array([1,2,3,4,5]), nested_row_splits=(np.array([0,1,3,4,5]), np.array([0,2,4])))
sess = tf.Session()
sess.run(val.to_sparse())

Willing to contribute

Yes

@2sin18
Collaborator

2sin18 commented Aug 2, 2022

Thanks for your report, I will look into it.

2sin18 self-assigned this Aug 2, 2022
2sin18 added the bug label Aug 2, 2022
@SamJia
Collaborator Author

SamJia commented Aug 3, 2022

An example to create a parquet dataset file and reproduce the error:

# Create parquet file
import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([[[1], [2, 3]], [[4], [5]]], pa.list_(pa.list_(pa.int64())))
table = pa.Table.from_arrays([arr], ['test'])
pq.write_table(table, 'test.zstd.parquet', compression='ZSTD')

# Reading the parquet file
import tensorflow as tf
import hybridbackend.tensorflow as hb

dataset = hb.data.ParquetDataset("test.zstd.parquet", batch_size=2)
dataset = dataset.apply(hb.data.to_sparse())
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()  
sess = tf.Session()
vals = sess.run(next_element)

@DelightRun

It seems this error still exists in 0.8.0

@2sin18
Collaborator

2sin18 commented May 29, 2023

@DelightRun Could you try the latest commit ?

@DelightRun

> @DelightRun Could you try the latest commit ?

I am using your pre-built v0.8.0 wheel package with TensorFlow 1.15.0.
Compiling from source is not very convenient for me (I use this in our prod env, which has several restrictions).

However, it seems the problem is that nested_row_splits needs to be reversed:

WRONG CODE

# One more simple demo:
import numpy as np
import tensorflow as tf
import hybridbackend.tensorflow as hb
# nested_row_splits in the original order: to_sparse fails
val = hb.data.dataframe.DataFrame.Value(values = np.array([1,2,3,4,5]), nested_row_splits=(np.array([0,1,3,4,5]), np.array([0,2,4])))
sess = tf.Session()
sess.run(val.to_sparse())

RIGHT CODE

# One more simple demo:
import numpy as np
import tensorflow as tf
import hybridbackend.tensorflow as hb
# Same splits, with the tuple reversed via [::-1]: to_sparse works
val = hb.data.dataframe.DataFrame.Value(values = np.array([1,2,3,4,5]), nested_row_splits=(np.array([0,1,3,4,5]), np.array([0,2,4]))[::-1])
sess = tf.Session()
sess.run(val.to_sparse())
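To illustrate why the reversed tuple is the consistent one, here is a small pure-Python sketch that rebuilds the nested lists from flat values and row splits. It assumes the splits are given outermost-first (the TensorFlow RaggedTensor convention); to_nested_lists is a hypothetical helper for illustration, not a HybridBackend API.

```python
def to_nested_lists(values, nested_row_splits):
    """Rebuild nested Python lists; splits are assumed outermost-first."""
    rows = list(values)
    # Apply the innermost splits first, then wrap with each outer level.
    for splits in reversed(list(nested_row_splits)):
        splits = list(splits)
        if splits[-1] != len(rows):
            raise ValueError(
                "splits end at %d but there are %d items" % (splits[-1], len(rows)))
        rows = [rows[a:b] for a, b in zip(splits[:-1], splits[1:])]
    return rows


values = [1, 2, 3, 4, 5]
reversed_order = ([0, 2, 4], [0, 1, 3, 4, 5])  # same as the [::-1] tuple above
print(to_nested_lists(values, reversed_order))
# [[[1], [2, 3]], [[4], [5]]]
```

With the original order, ([0, 1, 3, 4, 5], [0, 2, 4]), this sketch raises ValueError because the first splits array ends at 5 while the next level has only 2 rows.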

@2sin18
Collaborator

2sin18 commented May 30, 2023

You are right. The issue has been fixed, but the fix might not have been released for your platform yet. Which Python version, CUDA version (or CPU-only), and TensorFlow version do you use? I plan to release v1.0 within the next few days.

@DelightRun

> @DelightRun Could you try the latest commit ?

I tried the latest commit (compiled via docker), and the error still occurs.
RaggedTensor with rank >= 2 seems pretty buggy.

@francktcheng
Collaborator

Hi @DelightRun, I tried your previous demo (with the API calls adjusted accordingly) on the latest commit (4486ba1):

# Create parquet file
import pyarrow as pa
import pyarrow.parquet as pq

arr = pa.array([[[1], [2, 3]], [[4], [5]]], pa.list_(pa.list_(pa.int64())))
table = pa.Table.from_arrays([arr], ['test'])
pq.write_table(table, './test.zstd.parquet', compression='ZSTD')

# Reading the parquet file
import tensorflow as tf
import hybridbackend.tensorflow as hb

dataset = hb.data.ParquetDataset("./test.zstd.parquet", batch_size=2)
dataset = dataset.apply(hb.data.parse())
next_element = tf.data.make_one_shot_iterator(dataset).get_next()
sess = tf.Session()
vals = sess.run(next_element)
print(vals)

The output is

{'test': SparseTensorValue(indices=array([[0, 0, 0],
       [0, 1, 0],
       [0, 1, 1],
       [1, 0, 0],
       [1, 1, 0]]), values=array([1, 2, 3, 4, 5]), dense_shape=array([2, 2, 2]))}

This looks correct. Could you reproduce this result?
My env is:
python == 3.6
tensorflow == 1.15.5
hybridbackend == 1.0.0 (cpu-only)
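
As a cross-check that needs neither TensorFlow nor HybridBackend, the sketch below derives the COO indices directly from the nested list written to the parquet file. nested_to_coo is a hypothetical helper for illustration, and it assumes dense_shape is the bounding shape with no trailing empty rows.

```python
def nested_to_coo(nested):
    """Return (indices, values, dense_shape) for a nested list of scalars."""
    indices, values = [], []

    def walk(node, prefix):
        # Recurse through lists, recording the index path of each scalar.
        if isinstance(node, list):
            for i, child in enumerate(node):
                walk(child, prefix + [i])
        else:
            indices.append(prefix)
            values.append(node)

    walk(nested, [])
    rank = len(indices[0]) if indices else 0
    # Bounding shape: largest index seen along each dimension, plus one.
    dense_shape = [max(idx[d] for idx in indices) + 1 for d in range(rank)]
    return indices, values, dense_shape


indices, values, dense_shape = nested_to_coo([[[1], [2, 3]], [[4], [5]]])
print(indices)      # [[0, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 1, 0]]
print(values)       # [1, 2, 3, 4, 5]
print(dense_shape)  # [2, 2, 2]
```

These match the indices, values, and dense_shape in the SparseTensorValue printed above.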
