-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: materialize the result to handle non-deterministic functions #24
base: main
Are you sure you want to change the base?
Conversation
Any interaction of UDF with watermark? |
I think we should give user way to config on the UDF such as
|
This proposal inspires me that, if we're going to fix a bug for our built-in expression in a version update, it also breaks the determinism, even if the expression itself is mathematically pure. 😄 If we're for the sake of serious consistency and determinism, we should even version-tag the built-in expressions and keep the buggy one when fixing, in this case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just came up with a problem: IIRC currently UDF is just an expression, which can be used in other places beside Project, e.g., Filter. Should we make them occur only in Project? Besides the determinism problem, there are also optimizations like risingwavelabs/risingwave#8703
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, https://github.com/risingwavelabs/rfcs/pull/12/files#diff-135315ddd7b7929c8daf439f35043ba3c168ce620ab72c56740a6055d15e8f9fR141 refused to add a new operator, but it seems we eventually need to add some 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Firstly, I need to clarify that the "UDF" and "non-deterministic expression" are independent concepts. PROC_TIME()
and RAND()
are built-in expressions but they are non-deterministic. A user-defined function can also be deterministic if a user makes a contract with us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And materialized Project is just used for non-deterministic expression on an "updatable stream". We can use a normal project on the append-only stream for non-deterministic expression too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I totally agree to refuse to add a new operator for UDF or async expression but not for non-deterministic expression. In another word, if we did add a "UDFProject", now we maybe have to accept StreamProject
, StreamAsyncProject
, StreamMaterializeProject
, StreamAsyncMaterializeProject
😅 🥵
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, after thinking a while I feel that although conceptually different, practically they cannot be so cleanly separated 😅:
Since we want to "always materializing UDF" (except for rare cases and used with caution), MaterializeProject
would become the main executor for UDFs. Optimizations for UDF in Project
like risingwavelabs/risingwave#8703 would be in vain. On the other hand, do you think we should optimize MaterializeProject
for async execution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I'm wrong again😇😇😇. I forgot append-only stream. But the points might still apply
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all the optimization for async execution should be done behind the expression framework 🤔 Well there also are some executor-level optimization. For example, some users might not care about the order of the records on the append-only stream and we might reorder them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all the optimization for async execution should be done behind the expression framework
Is that enough? If so we can only tweak the performance for expr on one chunk, but not amoung chunks, like buffered
in https://github.com/risingwavelabs/risingwave/pull/8703/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, actually I don’t mind adding many different executor variants in the future, just like the TopN variants 😅
Let’s conclude the discussion.
|
||
## Design | ||
|
||
We will introduce a `MaterializeProject`, it will materialize some **partial** columns of result with the stream key of the input as the primiary key. When an `Insert` operation comes, it will compute the result and materialize some columns. when `Update` or `Delete` comes, it will lookup its state to replace the old value of the operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Find a problem: not all expressions can be extracted to a Project
easily, e.g., in FILTER
clause. 🤡
This is mentioned in #12 as a reason against a dedicated operator for UDF.
No description provided.