Add new builtin blocktype for CSV file to table parsing #631
base: main
Conversation
I'm unsure whether this should be a builtin blocktype or a composite blocktype. This is more a general discussion that we should have before merging this. @joluj @rhazn
My hunch is that we went too fine-granular until now, leading to a lot of blocks that each do one very small thing, which is not usable for users. We have composite blocks, but they haven't appeared in the docs yet.
If we don't look at performance, having very fine-granular blocks and composing them into composite ones is a good way to go (if users can actually find them in the docs).
If we look at performance, I see a point in introducing built-in block types that do multiple things in one go. Leaving out small steps and letting a library do it from start to finish can be way more efficient.
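The efficiency argument above can be illustrated outside Jayvee. A hedged Python sketch (stdlib `csv` only; the data and function names are made up for illustration): the staged version materializes each intermediate representation, while the fused version lets the library stream from text to rows in one pass.

```python
import csv
import io

RAW = "name,price\nwidget,9.50\ngadget,3.25\n"

def staged_parse(text: str) -> list[list[str]]:
    # Each small step materializes an intermediate result:
    # whole text -> list of lines -> list of cell lists.
    lines = text.splitlines()
    return [line.split(",") for line in lines]

def fused_parse(text: str) -> list[list[str]]:
    # One library call does the whole job, streaming rows
    # without hand-rolled intermediate stages.
    return list(csv.reader(io.StringIO(text)))

# For simple inputs both agree; the fused version additionally
# handles quoting correctly and avoids intermediate allocations.
print(staged_parse(RAW) == fused_parse(RAW))
```

The design tension is exactly the one described above: the staged version maps cleanly onto small composable blocks, while the fused version is what a performance-minded builtin would do internally.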
I'd feel quite strongly about a composite block; I feel it aligns much more with what we are aiming for (not caring about performance but building a cohesive language). As soon as we can move into JV code, I'd do it and try to limit custom implementations in the interpreter as much as possible.
Force-pushed from 9e25175 to 78f650f.
I generally agree with having fine-grained blocks. However, I'd go with a builtin block here. When you think about CSV, you usually don't think about "Read it line by line, then extract the columns, then parse it into a table". This is not only bad for performance (due to its current implementation), but also just wrong. Consider this example:
This is a valid CSV file which contains line breaks inside quoted fields. It will never be parseable by a line-by-line approach.
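The example file itself did not survive this rendering, but the failure mode can be reproduced with any CSV value containing a quoted line break. A minimal sketch using Python's stdlib `csv` module (not the project's implementation; the data is illustrative):

```python
import csv
import io

# A valid CSV per RFC 4180: the second field of the data row
# contains a line break inside a quoted value.
RAW = 'name,description\nwidget,"first line\nsecond line"\n'

# Line-by-line splitting sees three "rows", cutting the quoted
# value apart mid-record.
naive_rows = [line.split(",") for line in RAW.splitlines()]
print(len(naive_rows))  # 3

# A CSV-aware parser yields the correct two records, with the
# embedded newline preserved inside the field.
csv_rows = list(csv.reader(io.StringIO(RAW)))
print(len(csv_rows))  # 2
```

This is why a parser that consumes the whole file at once can handle inputs that a "split into lines first" pipeline structurally cannot.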
How come it will never be parseable by our current approach? Isn't it just straightforward?

I think that is an excellent reason for a composite block as well. If we do it this way, your example will parse without error.
I'll summarize the discussion at this point. The things everyone here seems to agree on:
Problem 1 can be solved in two ways:
I believe this is primarily a language design question, not necessarily an execution question (although both topics are somewhat connected).

Problem 2 is indeed a problem, one that we should fix in the current implementation as well. There is no value in having two implementations leading to different results; that would only cause even more confusion.

Now, coming to the execution side of things: I believe this is where you were coming from, because you saw that it makes sense in the Polaris implementation. Pulling execution steps together for performance is a nice optimization to explore (esp. for Johannes' dissertation).

TL;DR: this is my recommendation on how to go forward:
I agree with that. How do we progress then? Should we merge everything but postpone the changes related to the new block type?

I think the tests can be reused to validate the existing implementation of the CSVExtractor / CSVFileInterpreter.
I wouldn't wait for some future optimization; I don't think performance should be something we're worried about at this stage. I'd continue exactly as Georg suggested (in points 1 and 2). I took a crack at number 3 in #632.
This PR adds a new builtin block type `CSVToTableInterpreter`, combining `TextFileInterpreter`, `CSVInterpreter` and `TableInterpreter` for the common use case of parsing CSV data into a table without modifying it.

This includes making some methods of `TableInterpreterExecutor` public in order to share code with `CSVToTableInterpreterExecutor`.

Pipelines can now look like this: