
Add new builtin blocktype for CSV file to table parsing #631

Open · TungstnBallon wants to merge 7 commits into main

Conversation

TungstnBallon (Contributor)

This PR adds a new builtin block type CSVToTableInterpreter, combining TextFileInterpreter, CSVInterpreter, and TableInterpreter for the common use case of parsing CSV data into a table without modifying it.

This includes making some methods of TableInterpreterExecutor public in order to share code with CSVToTableInterpreterExecutor.

Pipelines can now look like this:

pipeline CarsPipeline {
  CarsExtractor
    -> CarsInterpreter
    // ... TableTransformer blocks
    -> CarsLoader;

  block CarsExtractor oftype HttpExtractor {
    url: "https://gist.githubusercontent.com/noamross/e5d3e859aa0c794be10b/raw/b999fb4425b54c63cab088c0ce2c0d6ce961a563/cars.csv";
  }

  block CarsInterpreter oftype CSVToTableInterpreter {
    enclosing: '"';
    header: true;
    columns: [
      "" oftype text,
      "mpg" oftype decimal,
      "cyl" oftype integer,
      "disp" oftype decimal,
      "hp" oftype integer,
      "drat" oftype decimal,
      "wt" oftype decimal,
      "qsec" oftype decimal,
      "vs" oftype integer,
      "am" oftype integer,
      "gear" oftype integer,
      "carb" oftype integer
    ];
  }

  // ...

  block CarsLoader oftype SQLiteLoader {
    table: "Cars";
    file: "./cars.sqlite";
  }
}

TungstnBallon requested a review from joluj on December 17, 2024
TungstnBallon self-assigned this on December 17, 2024
@georg-schwarz (Member) left a comment

I'm unsure whether this should be a builtin blocktype or a composite blocktype. This is a more general discussion that we should have before merging, @joluj @rhazn.

My hunch is that we went too fine-grained until now, leading to a lot of blocks that each do only one very small thing, which is not usable for users. We have composite blocks, but they haven't appeared in the docs yet.

If we don't look at performance, having very fine-grained blocks and composing them into composite ones is a good way to go (provided users can actually find them in the docs).

If we look at performance, I see a point in introducing built-in block types that do multiple things in one go. Skipping the small intermediate steps and letting a library handle everything from start to finish can be far more efficient.

@rhazn (Contributor) commented Dec 19, 2024

I feel quite strongly that this should be a composite block; it aligns much more with what we are aiming for (not caring about performance, but building a cohesive language). As soon as we can move this into Jayvee code, I'd do it and try to limit custom implementations in the interpreter as much as possible.

@joluj (Contributor) commented Dec 20, 2024

I generally agree with having fine-grained blocks. However, I'd go with a builtin block here.

When you think about CSV, you usually don't think about "read it line by line, then extract the columns, then parse it into a table". This is not only bad for performance (due to the current implementation), but also just wrong. Consider this example:

Name,Address,Notes
"John Doe","123 Elm St, Springfield, IL","Met at conference
Likes coffee"
"Jane Smith","456 Oak Ave, Lincoln, NE","Follow up next week
Prefers email"

This is a valid CSV file that contains line breaks inside the Notes column (a naive line split would cut the first record after "Met at conference"). Such a file will never be parseable with our current approach, so I'd strongly vote for a builtin block for CSVs to enable their proper use.

@rhazn (Contributor) commented Dec 20, 2024

How come it will never be parseable with our current approach? Isn't it just a straightforward TextFileInterpreter feeding into a CSVInterpreter? Splitting text files into lines does not need to happen before the CSVInterpreter; we can restore the file into one string again and parse it with a library, the same way this PR does, and have it work correctly in the CSVInterpreter as well.

I think that is an excellent argument for a composite block as well. If we go the builtin route as implemented here, your example will parse without error with a CSVToTableInterpreter but will error with a TextFileInterpreter -> CSVInterpreter -> TableInterpreter pipeline. Isn't that very unintuitive?
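
For comparison, the fine-grained chain for the cars example above would look something like this (a sketch; it carries over only the enclosing, header, and columns properties from the PR description):

pipeline CarsPipeline {
  CarsExtractor
    -> CarsTextFileInterpreter
    -> CarsCSVInterpreter
    -> CarsTableInterpreter
    -> CarsLoader;

  block CarsTextFileInterpreter oftype TextFileInterpreter { }

  block CarsCSVInterpreter oftype CSVInterpreter {
    enclosing: '"';
  }

  block CarsTableInterpreter oftype TableInterpreter {
    header: true;
    columns: [
      // same column list as in the PR description
    ];
  }

  // CarsExtractor and CarsLoader as in the PR description
}

Fixing the line-break handling in this chain would make both spellings behave identically.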

@georg-schwarz (Member) commented Dec 20, 2024

I'll summarize the discussion at this point. The things everyone here seems to agree on:

  1. Chaining too many blocks leads to confusion and bloated Jayvee code
  2. Our current implementation seems to have an issue parsing CSVs with line breaks in the middle of cell values

Problem 1 can be solved in two ways:

  1. Create more coarse-grained block types.
  2. Use composite block types.
    Obviously, that means composite block types should be properly documented, and their documentation needs to be generated into the hosted documentation page (which would be a follow-up issue).

I believe this is primarily a language design question, not necessarily an execution question (although the two topics are somewhat connected).
From a language design perspective, there is beauty in defining the smallest units and then building things on top of them, which approach 2 embodies. So, purely from this perspective, I'd go with approach 2.
With multiple built-in block types, I would also fear competing implementations drifting out of sync over time (e.g., a new property being added to only one implementation by mistake).

Problem 2 is indeed a problem, one that we should fix in the current implementation as well. There is no value in two implementations producing different results; that would only cause more confusion.

Now, coming to the execution side of things: I believe this is where you were coming from, because you saw that it makes sense in the Polaris implementation. Pulling execution steps together for performance is a nice optimization to explore (especially for Johannes' dissertation).
However, I'd keep that decision detached from language design (if we can).
Off the top of my head, I'd rather optimize execution based on the created execution graph and combine steps there, rather than in the language itself.

TL;DR: this is my recommendation on how to go forward:

  1. Use the insights gained to refactor the built-in TextFileInterpreter or CsvFileInterpreter to deal with line breaks in cell values
  2. Implement the functionality of the CSVToTableInterpreter block type as a composite block type (see the sketch after this list)
  3. Future PR: list composite block types next to the builtin ones in the documentation
  4. Future: explore how to optimize chains of block types by more efficient implementations based on the execution graph
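
To make recommendation 2 concrete, the composite block type could look roughly like this. This is a sketch only: the port types, the property forwarding, and in particular whether the columns collection can be passed through as a property are assumptions that would need checking against the current composite block support.

composite blocktype CSVToTableInterpreter {
  // Properties mirroring the builtin proposed in this PR (assumed names and defaults)
  property enclosing oftype text: '';
  property header oftype boolean: true;

  input inputPort oftype File;
  output outputPort oftype Table;

  inputPort
    -> FileTextInterpreter
    -> FileCSVInterpreter
    -> DataTableInterpreter
    -> outputPort;

  block FileTextInterpreter oftype TextFileInterpreter { }

  block FileCSVInterpreter oftype CSVInterpreter {
    enclosing: enclosing;
  }

  block DataTableInterpreter oftype TableInterpreter {
    header: header;
    // columns would have to be forwarded here as well; whether a
    // collection-typed property is supported is the open question
  }
}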

@joluj (Contributor) commented Dec 20, 2024

I agree with that. How do we progress then?

Should we merge everything else but postpone the changes related to the new CSVToTableInterpreterExecutor until we tackle point 4 (exploring how to optimize chains of block types via more efficient implementations based on the execution graph)?

I think the tests can be reused to validate the existing implementation of the CSVExtractor / CSVFileInterpreter.

@rhazn (Contributor) commented Dec 20, 2024

I wouldn't wait for some future optimization; I don't think performance should be something we're worried about at this stage. I'd continue exactly as Georg suggested (points 1 and 2).

I took a crack at number 3 in #632.
