Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Auto infer schema (including fields shape) from the first row #512

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

WeichenXu123
Copy link
Collaborator

@WeichenXu123 WeichenXu123 commented Mar 23, 2020

What issues does the PR addresses ?

There're 2 issues in make_batch_reader, one is critical and another is less critical but a pain point.

(Critical) Inferring schema in make_batch_reader cannot infer fields' shape information

Because there's no shape information, when make tensorflow dataset from the reader, if we make some tensorflow dataset operations, such as unroll, batch, and reshape field, error may occur. Tensorflow graph operator depends on field shape information heavily.

(Pain point) The TransformSpec need to specify edit/removed fields manually

We hope user can only provide a transform function, and petastorm can automatically infer the result schema from the output pandas dataframe of the transform function.

The approach in the PR

Add a method ArrowReaderWorker. infer_schema_from_first_row which can read a row first and infer the schema from the row. So that we can infer the accurate shape information.
Add a param infer_schema_from_first_row into make_batch_reader (default off, so won't break API behavior)

Limitations:

  • for all rows (before applying predicates), require all values in each field non-nullable and having the same shape.

Test

Unit test to be added. But it is ready for first review.

Example code

import os
import pandas as pd
import sys
import numpy as np
from pyspark.sql.functions import pandas_udf
import tensorflow as tf

from petastorm import make_batch_reader
from petastorm.transform import TransformSpec
from petastorm.spark import make_spark_converter
spark.conf.set('petastorm.spark.converter.parentCacheDirUrl', 'file:/tmp/converter')

data_url = 'file:/tmp/0001'
data_path = '/tmp/t0001'

@pandas_udf('array<float>')
def gen_array(v):
  return v.map(lambda x: np.random.rand(10))

df1 = spark.range(10).withColumn('v', gen_array('id')).repartition(2)
cv1 = make_spark_converter(df1)

# we can auto infer one-dim array shape
with cv1.make_tf_dataset(batch_size=4, num_epochs=1) as dataset:
	iter = dataset.make_one_shot_iterator()
	next_op = iter.get_next()
	with tf.Session() as sess:
		for i in range(3):
			batch = sess.run(next_op)
			print(batch)


def preproc_fn(x):
  # reshape column 'v' to (2, 5) shape.
  x2 = pd.DataFrame({'v': x['v'].map(lambda x: x.reshape((2, 5))), 'id': x['id'] + 10000})
  return x2

# now we can auto infer multi-dim array shape.
with cv1.make_tf_dataset(batch_size=4, preprocess_fn=preproc_fn, num_epochs=1) as dataset:
	iter = dataset.make_one_shot_iterator()
	next_op = iter.get_next()
	with tf.Session() as sess:
		for i in range(3):
			batch = sess.run(next_op)
			print(batch)

@WeichenXu123 WeichenXu123 changed the title [WIP] Auto infer schema from first row [WIP] Auto infer schema (including fields shape) from the first row Mar 23, 2020
@codecov
Copy link

codecov bot commented Mar 23, 2020

Codecov Report

Merging #512 into master will decrease coverage by 0.16%.
The diff coverage is 72.91%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #512      +/-   ##
==========================================
- Coverage   86.02%   85.86%   -0.17%     
==========================================
  Files          81       81              
  Lines        4402     4442      +40     
  Branches      704      713       +9     
==========================================
+ Hits         3787     3814      +27     
- Misses        504      511       +7     
- Partials      111      117       +6
Impacted Files Coverage Δ
petastorm/tf_utils.py 80.91% <ø> (ø) ⬆️
petastorm/spark/spark_dataset_converter.py 87.5% <25%> (-3.13%) ⬇️
petastorm/reader.py 90.32% <77.77%> (-0.68%) ⬇️
petastorm/arrow_reader_worker.py 90.34% <83.87%> (-1.66%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b70510...529cb83. Read the comment docs.

Copy link
Collaborator

@liangz1 liangz1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Convenient feature that would simplify the schema issue! I left a couple of questions.

@@ -168,6 +170,38 @@ def process(self, piece_index, worker_predicate, shuffle_row_drop_partition):
if all_cols:
self.publish_func(all_cols)

def infer_schema_from_first_row(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'm not sure whether the partition[0] necessarily contains the "first" row? Could the partitions be out of order? If so, we may call it infer_schema_from_a_row.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I read the first row in the index-0 row-groups. But index-0 row-groups may be non-deterministic ? Not sure. infer_schema_from_a_row sounds good.


if 'transform_spec' in petastorm_reader_kwargs or \
'infer_schema_from_first_row' in petastorm_reader_kwargs:
raise ValueError('User cannot set transform_spec and infer_schema_from_first_row '
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also allow users to use transform_spec&infer_schema_from_first_row? Keeping transform_spec would make it consistent with the rest of the petastorm library.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the param preprocess_fn should cover the functionality of transform_spec, and it is easier to use (can auto inferring result schema), so I forbid the two params.

liangz1 added a commit to liangz1/petastorm that referenced this pull request Mar 24, 2020
@WeichenXu123
Copy link
Collaborator Author

I create a simple PR to address issue 1, #517
We can merge that one first.
This PR could be a long-term work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants