-
Notifications
You must be signed in to change notification settings - Fork 284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Auto infer schema (including fields shape) from the first row #512
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #512 +/- ##
==========================================
- Coverage 86.02% 85.86% -0.17%
==========================================
Files 81 81
Lines 4402 4442 +40
Branches 704 713 +9
==========================================
+ Hits 3787 3814 +27
- Misses 504 511 +7
- Partials 111 117 +6
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A Convenient feature that would simplify the schema issue! I left a couple of questions.
petastorm/arrow_reader_worker.py
Outdated
@@ -168,6 +170,38 @@ def process(self, piece_index, worker_predicate, shuffle_row_drop_partition): | |||
if all_cols: | |||
self.publish_func(all_cols) | |||
|
|||
def infer_schema_from_first_row(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'm not sure whether the partition[0] necessarily contains the "first" row? Could the partitions be out of order? If so, we may call it infer_schema_from_a_row
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I read the first row in the index-0 row-groups. But index-0 row-groups may be non-deterministic ? Not sure. infer_schema_from_a_row
sounds good.
|
||
if 'transform_spec' in petastorm_reader_kwargs or \ | ||
'infer_schema_from_first_row' in petastorm_reader_kwargs: | ||
raise ValueError('User cannot set transform_spec and infer_schema_from_first_row ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we also allow users to use transform_spec
&infer_schema_from_first_row
? Keeping transform_spec
would make it consistent with the rest of the petastorm library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the param preprocess_fn
should cover the functionality of transform_spec
, and it is easier to use (can auto inferring result schema), so I forbid the two params.
I create a simple PR to address issue 1, #517 |
What issues does the PR addresses ?
There're 2 issues in
make_batch_reader
, one is critical and another is less critical but a pain point.(Critical) Inferring schema in
make_batch_reader
cannot infer fields' shape informationBecause there's no shape information, when make tensorflow dataset from the reader, if we make some tensorflow dataset operations, such as unroll, batch, and reshape field, error may occur. Tensorflow graph operator depends on field shape information heavily.
(Pain point) The
TransformSpec
need to specify edit/removed fields manuallyWe hope user can only provide a transform function, and petastorm can automatically infer the result schema from the output pandas dataframe of the transform function.
The approach in the PR
Add a method
ArrowReaderWorker. infer_schema_from_first_row
which can read a row first and infer the schema from the row. So that we can infer the accurate shape information.Add a param
infer_schema_from_first_row
intomake_batch_reader
(default off, so won't break API behavior)Limitations:
Test
Unit test to be added. But it is ready for first review.
Example code