refactor: refactor source executor (part 1) #15103
Conversation
Current dependencies on/for this PR:
This stack of pull requests is managed by Graphite.
self.stream_source_core.latest_split_info.get_mut(id).map(
    |origin_split| {
        origin_split.update_in_place(offset.clone())?;
        Ok::<_, anyhow::Error>((id.clone(), origin_split.clone()))
    },
)
})
.try_collect()?;
self.stream_source_core.state_cache.extend(state);
self.stream_source_core
    .updated_splits_in_epoch
    .extend(state);
We can see they are updated in the same way on data chunks. Plan to unify them later.
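To make the "updated in the same way on data chunks" point concrete, here is a minimal sketch, not the executor's real code: plain `HashMap<String, String>` maps of split id to offset stand in for the actual split types, and `Core`/`on_chunk` are hypothetical names used only for this example.

```rust
use std::collections::HashMap;

/// Simplified stand-ins for the executor's two maps; the real code tracks
/// split id -> SplitImpl, here we just keep split id -> latest offset.
struct Core {
    latest_split_info: HashMap<String, String>,
    // Per the discussion above, this one only holds splits touched since
    // the last barrier.
    updated_splits_in_epoch: HashMap<String, String>,
}

impl Core {
    /// On every data chunk, the freshly observed offsets are written into
    /// both maps in exactly the same way, which is what makes a later
    /// unification plausible.
    fn on_chunk(&mut self, offsets: &[(String, String)]) {
        for (split_id, offset) in offsets {
            self.latest_split_info
                .insert(split_id.clone(), offset.clone());
            self.updated_splits_in_epoch
                .insert(split_id.clone(), offset.clone());
        }
    }
}

fn main() {
    let mut core = Core {
        latest_split_info: HashMap::new(),
        updated_splits_in_epoch: HashMap::new(),
    };
    core.on_chunk(&[("p0".to_string(), "42".to_string())]);
    assert_eq!(core.latest_split_info["p0"], "42");
    assert_eq!(core.updated_splits_in_epoch["p0"], "42");
}
```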
// fetch the newest offset, either it's in cache (before barrier)
// or in state table (just after barrier)
let target_state = if core.state_cache.is_empty() {
    for ele in &mut *split_info {
        if let Some(recover_state) = core
            .split_state_store
            .try_recover_from_state_store(ele)
            .await?
        {
            *ele = recover_state;
        }
    }
    split_info.to_owned()
} else {
    core.state_cache
        .values()
        .map(|split_impl| split_impl.to_owned())
        .collect_vec()
};
Actually this is a bug: when `state_cache` is non-empty, we shouldn't use it to rebuild the stream reader, since it only contains splits updated in this epoch.
However, this is hard to trigger. In my local testing I found that `rebuild_stream_reader_from_error` isn't triggered even when external Kafka/PG is killed...
> Actually this is a bug: when `state_cache` is non-empty, we shouldn't use it to rebuild the stream reader, since it only contains splits updated in this epoch.
The source attempts to recover from the latest successful offset, so we always recover from the cache.
The case you mentioned above occurs when some partitions have no new data in one epoch, which implies a partitioning imbalance, and this can hardly happen in MQs.
> In my local testing I found that `rebuild_stream_reader_from_error` isn't triggered even when external Kafka/PG is killed...
The Kafka SDK handles broker timeouts internally and keeps trying to reconnect. For Kinesis, the logic is triggered when a network issue happens. 😇
> when some partitions have no new data in one epoch, which implies a partitioning imbalance, and this can hardly happen in MQs
I think it's not that unlikely. Imagine the source is idle for a while, and then only one message is produced. It can also happen if the user's key is imbalanced according to their business logic.
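A small self-contained illustration of that scenario, again with plain maps standing in for the real split types (all names here are made up for the example): the idle split never enters the per-epoch cache, so rebuilding the reader from the cache alone would lose it.

```rust
use std::collections::HashMap;

fn main() {
    // Splits assigned to this actor, with their last committed offsets.
    let assigned = HashMap::from([("p0", "100"), ("p1", "7")]);

    // Only `p0` received data in this epoch, so only `p0` lands in the
    // per-epoch cache (the map previously named `state_cache`).
    let updated_in_epoch = HashMap::from([("p0", "120")]);

    // Rebuilding the reader from the per-epoch cache alone would drop `p1`,
    // even though it is still assigned; it was merely idle in this epoch.
    let rebuilt_from_cache: Vec<&str> = updated_in_epoch.keys().copied().collect();
    assert!(!rebuilt_from_cache.contains(&"p1"));
    assert!(assigned.contains_key("p1"));
}
```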
if let Some(target_state) = &target_state {
    latest_split_info = target_state.clone();
}
I think the local variable `latest_split_info` can be replaced by the field.
LGTM, PTAL @shanicky
let target_state = core.latest_split_info.values().cloned().collect();
I have some correctness concerns here. It seems you are resetting the process to the offset at which the last successful barrier came, taking that time as T0.
But later an error occurs at time T1 and requires rebuilding the source internally. The logic here may lead to reading the data from T0 to T1 twice.
I think the expected logic here is taking a union of both `state_cache` and `split_info` to make sure every assigned split is reset to its latest offset.
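A minimal sketch of the suggested union, under the same simplification of maps from split id to offset (`union_splits` is a hypothetical helper, not an existing function): start from the full recovered assignment and overlay the fresher per-epoch offsets.

```rust
use std::collections::HashMap;

/// Hypothetical helper: start from the offsets recovered for every assigned
/// split, then overwrite them with the fresher offsets cached in this epoch.
fn union_splits(
    split_info: &HashMap<String, String>,
    state_cache: &HashMap<String, String>,
) -> HashMap<String, String> {
    let mut target = split_info.clone();
    // `extend` overwrites existing keys, so the cached (newer) offsets win.
    target.extend(state_cache.iter().map(|(k, v)| (k.clone(), v.clone())));
    target
}

fn main() {
    let split_info = HashMap::from([
        ("p0".to_string(), "100".to_string()),
        ("p1".to_string(), "7".to_string()),
    ]);
    let state_cache = HashMap::from([("p0".to_string(), "120".to_string())]);

    let target = union_splits(&split_info, &state_cache);
    assert_eq!(target["p0"], "120"); // fresher offset from the cache wins
    assert_eq!(target["p1"], "7"); // the idle split is still included
}
```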
> It seems you are resetting the process to the offset at which the last successful barrier came
No. `latest_split_info` is also updated on every message chunk, so its offset is always up to date.
rest lgtm
thanks for finding bugs.
// state cache may be stale
for existing_split_id in core.stream_source_splits.keys() {
// Checks dropped splits
for existing_split_id in core.latest_split_info.keys() {
Also delete removed items in `latest_split_info` here, because it is seen as the ground truth of split assignment when doing recovery internally.
Actually it is updated later in `persist_state_and_clear_cache` using `target_state`. This is how it worked previously, and this PR doesn't intend to change it; #15104 just changes this.
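For illustration only, here is a sketch of what pruning dropped splits from the ground-truth map could look like, using plain maps and a hypothetical `prune_dropped_splits` helper rather than the executor's real API.

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative only: drop state for splits that are no longer assigned, so
/// the map used as the ground truth for internal recovery stays accurate.
fn prune_dropped_splits(
    latest_split_info: &mut HashMap<String, String>,
    newly_assigned: &HashSet<String>,
) {
    latest_split_info.retain(|split_id, _| newly_assigned.contains(split_id));
}

fn main() {
    let mut latest_split_info = HashMap::from([
        ("p0".to_string(), "120".to_string()),
        ("p1".to_string(), "7".to_string()),
    ]);
    // `p1` has been moved away from this actor in the new assignment.
    let newly_assigned = HashSet::from(["p0".to_string()]);

    prune_dropped_splits(&mut latest_split_info, &newly_assigned);
    assert!(!latest_split_info.contains_key("p1"));
    assert!(latest_split_info.contains_key("p0"));
}
```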
This reverts commit 3ce0996.
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Rename `stream_source_splits` and `state_cache` according to how they are updated. They are very similar and can be wrongly used; actually only one of them may be enough. This PR only does the renaming first to make it easy to review.
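A rough before/after sketch of the renaming; everything except the four field names is a simplified stand-in, not the real `StreamSourceCore` definition.

```rust
use std::collections::HashMap;

// Before: the two field names don't say how the maps are maintained.
#[allow(dead_code)]
struct StreamSourceCoreBefore {
    stream_source_splits: HashMap<String, String>,
    state_cache: HashMap<String, String>,
}

// After: the names describe how each map is updated.
// `latest_split_info`: kept up to date on every chunk for all assigned splits.
// `updated_splits_in_epoch`: only splits touched since the last barrier.
#[allow(dead_code)]
struct StreamSourceCoreAfter {
    latest_split_info: HashMap<String, String>,
    updated_splits_in_epoch: HashMap<String, String>,
}

fn main() {
    // Nothing to run; the sketch only contrasts the field names.
}
```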
Checklist
`./risedev check` (or alias, `./risedev c`)
Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.