
Add new builtin blocktype for CSV file to table parsing #631

Open · TungstnBallon wants to merge 7 commits into main

Conversation

TungstnBallon (Contributor)

This PR adds a new builtin block type CSVToTableInterpreter, combining TextFileInterpreter, CSVInterpreter, and TableInterpreter for the common use case of parsing CSV data into a table without modifying it.

This includes making some methods of TableInterpreterExecutor public in order to share code with CSVToTableInterpreterExecutor.

Pipelines can now look like this:

pipeline CarsPipeline {
  CarsExtractor
    -> CarsInterpreter
    // ... TableTransformer blocks
    -> CarsLoader;

  block CarsExtractor oftype HttpExtractor {
    url: "https://gist.githubusercontent.com/noamross/e5d3e859aa0c794be10b/raw/b999fb4425b54c63cab088c0ce2c0d6ce961a563/cars.csv";
  }

  block CarsInterpreter oftype CSVToTableInterpreter {
    enclosing: '"';
    header: true;
    columns: [
      "" oftype text,
      "mpg" oftype decimal,
      "cyl" oftype integer,
      "disp" oftype decimal,
      "hp" oftype integer,
      "drat" oftype decimal,
      "wt" oftype decimal,
      "qsec" oftype decimal,
      "vs" oftype integer,
      "am" oftype integer,
      "gear" oftype integer,
      "carb" oftype integer
    ];
  }

  // ...

  block CarsLoader oftype SQLiteLoader {
    table: "Cars";
    file: "./cars.sqlite";
  }
}

TungstnBallon requested a review from joluj on December 17, 2024
TungstnBallon self-assigned this on December 17, 2024
@georg-schwarz (Member) left a comment

I'm unsure whether this should be a builtin blocktype or a composite blocktype. This is a more general discussion that we should have before merging, @joluj @rhazn.

My hunch is that we went too fine-grained until now, leading to a lot of blocks that each do only one very small thing, which is not usable for users. We have composite blocks, but they haven't appeared in the docs yet.

If we don't look at performance, having very fine-grained blocks and composing them into composite ones is a good way to go (provided users can actually find them in the docs).

If we look at performance, I see a point in introducing built-in block types that do multiple things in one go. Skipping the small intermediate steps and letting a library handle everything from start to finish can be far more efficient.

@rhazn (Contributor) commented Dec 19, 2024

I feel quite strongly that this should be a composite block; it aligns much more with what we are aiming for (not caring about performance, but building a cohesive language). As soon as we can move this into Jayvee code, I'd do it and try to limit custom implementations in the interpreter as much as possible.

@joluj (Contributor) commented Dec 20, 2024

I generally agree with having fine-grained blocks. However, I'd go with a builtin block here.

When you think about CSV, you usually don't think about "read it line by line, then extract the columns, then parse it into a table". This is not only bad for performance (due to the current implementation), but also just wrong. Consider this example:

Name,Address,Notes
"John Doe","123 Elm St, Springfield, IL","Met at conference
Likes coffee"
"Jane Smith","456 Oak Ave, Lincoln, NE","Follow up next week
Prefers email"

This is a valid CSV file that contains line breaks inside the Notes column (a naive line split would cut the first record after "Met at conference"). Such a file will never be parseable with our current approach, so I'd strongly vote for a builtin block for CSVs to enable their proper use.

@rhazn (Contributor) commented Dec 20, 2024

How come it will never be parseable with our current approach? Isn't it just a straightforward TextFileInterpreter feeding into a CSVInterpreter? Splitting text files into lines does not need to happen before the CSVInterpreter; we can restore the file into one string again and parse it with a library, the same way this PR does, and have it work correctly in the CSVInterpreter as well.

I think that is an excellent argument for a composite block as well. If we go the builtin route as implemented here, your example will parse without error with a CSVToTableInterpreter but will error with a TextFileInterpreter -> CSVInterpreter -> TableInterpreter pipeline. Isn't that very unintuitive?
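
For comparison, the fine-grained chain for the cars example above would look something like this (a sketch; it carries over only the enclosing, header, and columns properties from the PR description):

pipeline CarsPipeline {
  CarsExtractor
    -> CarsTextFileInterpreter
    -> CarsCSVInterpreter
    -> CarsTableInterpreter
    -> CarsLoader;

  block CarsTextFileInterpreter oftype TextFileInterpreter { }

  block CarsCSVInterpreter oftype CSVInterpreter {
    enclosing: '"';
  }

  block CarsTableInterpreter oftype TableInterpreter {
    header: true;
    columns: [
      // same column list as in the PR description
    ];
  }

  // CarsExtractor and CarsLoader as in the PR description
}

Fixing the line-break handling in this chain would make both spellings behave identically.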

@georg-schwarz (Member) commented Dec 20, 2024

I'll summarize the discussion at this point. The things everyone here seems to agree on:

  1. Chaining too many blocks leads to confusion and bloated Jayvee code
  2. Our current implementation seems to have an issue parsing CSVs with line breaks in the middle of cell values

Problem 1 can be solved in two ways:

  1. Create more coarse-grained block types.
  2. Use composite block types.
    Obviously, that means composite block types should be properly documented, and their documentation needs to be generated into the hosted documentation page (which would be a follow-up issue).

I believe this is primarily a language design question, not necessarily an execution question (although the two topics are somewhat connected).
From a language design perspective, there is beauty in defining the smallest units and then building things on top of them, which approach 2 embodies. So, purely from this perspective, I'd go with approach 2.
With multiple built-in block types, I would also fear competing implementations drifting out of sync over time (e.g., a new property being added to only one implementation by mistake).

Problem 2 is indeed a problem, one that we should fix in the current implementation as well. There is no value in two implementations producing different results; that would only cause more confusion.

Now, coming to the execution side of things: I believe this is where you were coming from, because you saw that it makes sense in the Polaris implementation. Pulling execution steps together for performance is a nice optimization to explore (especially for Johannes' dissertation).
However, I'd keep that decision detached from language design (if we can).
Off the top of my head, I'd rather optimize execution based on the created execution graph and combine steps there, rather than in the language itself.

TL;DR: this is my recommendation on how to go forward:

  1. Use the insights gained to refactor the built-in TextFileInterpreter or CsvFileInterpreter to deal with line breaks in cell values
  2. Implement the functionality of the CSVToTableInterpreter block type as a composite block type (see the sketch after this list)
  3. Future PR: list composite block types next to the builtin ones in the documentation
  4. Future: explore how to optimize chains of block types by more efficient implementations based on the execution graph
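
To make recommendation 2 concrete, the composite block type could look roughly like this. This is a sketch only: the port types, the property forwarding, and in particular whether the columns collection can be passed through as a property are assumptions that would need checking against the current composite block support.

composite blocktype CSVToTableInterpreter {
  // Properties mirroring the builtin proposed in this PR (assumed names and defaults)
  property enclosing oftype text: '';
  property header oftype boolean: true;

  input inputPort oftype File;
  output outputPort oftype Table;

  inputPort
    -> FileTextInterpreter
    -> FileCSVInterpreter
    -> DataTableInterpreter
    -> outputPort;

  block FileTextInterpreter oftype TextFileInterpreter { }

  block FileCSVInterpreter oftype CSVInterpreter {
    enclosing: enclosing;
  }

  block DataTableInterpreter oftype TableInterpreter {
    header: header;
    // columns would have to be forwarded here as well; whether a
    // collection-typed property is supported is the open question
  }
}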

@joluj (Contributor) commented Dec 20, 2024

I agree with that. How do we progress then?

Should we merge everything else but postpone the changes related to the new CSVToTableInterpreterExecutor until we tackle point 4 (exploring how to optimize chains of block types via more efficient implementations based on the execution graph)?

I think the tests can be reused to validate the existing implementation of the CSVExtractor / CSVFileInterpreter.

@rhazn (Contributor) commented Dec 20, 2024

I wouldn't wait for some future optimization; I don't think performance should be something we're worried about at this stage. I'd continue exactly as Georg suggested (points 1 and 2).

I took a crack at number 3 in #632.
