feat: batching telemetry event request avoid too many requests #20000

tabVersion · 2025-01-02T09:38:42Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

tabVersion · 2025-01-02T09:47:14Z

Related changes on the server side: https://github.com/risingwavelabs/telemetry-backend/pull/45

tabVersion · 2025-01-02T09:50:31Z

proto/telemetry.proto

-  reserved 2;
-  reserved "META_BACKEND_ETCD";


The telemetry service still receives requests from legacy clusters, so it can still get reports from a cluster that uses the etcd backend.
I believe the field was changed by mistake in #18621.
This is not a breaking change to the proto file because there have been no updates on the server side since then.

hzxa21 · 2025-01-03T03:32:25Z

src/common/telemetry_event/src/lib.rs

-            .unwrap_or_else(|e| tracing::info!("{}", e))
-    });
+
+    TELEMETRY_EVENT_REPORT_STASH.blocking_lock().push(event);


Given that we call blocking_lock here, why not just use a non-async mutex (parking_lot::Mutex) instead of the tokio async mutex?

After switching to an unbounded channel, there seems to be no need for the sync Mutex.
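
The channel-based approach mentioned in the reply above can be sketched roughly as follows. This is only an illustration, not the PR's code: it uses `std::sync::mpsc` as a stand-in for a tokio unbounded channel, and `Event`, `new_event_channel`, `report_event`, and `drain_batch` are hypothetical names.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for the telemetry event payload.
#[derive(Debug, Clone, PartialEq)]
pub struct Event(pub String);

/// Create the unbounded channel shared by producers and the reporter.
pub fn new_event_channel() -> (Sender<Event>, Receiver<Event>) {
    channel()
}

/// Producer side: callable from non-async code and never blocks,
/// because the channel is unbounded. No mutex is involved.
pub fn report_event(tx: &Sender<Event>, event: Event) {
    // A send error only means the receiver was dropped; the event is discarded.
    let _ = tx.send(event);
}

/// Consumer side: collect everything currently queued into one batch
/// without waiting for new events.
pub fn drain_batch(rx: &Receiver<Event>) -> Vec<Event> {
    rx.try_iter().collect()
}
```

With this shape, the producer only needs a cloned sender, and the single reporting task owns the receiver, so there is nothing left for a lock to protect.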

hzxa21 · 2025-01-03T03:40:02Z

src/common/telemetry_event/src/lib.rs

+        .unwrap_or_else(|e| tracing::debug!("{}", e));
+}
+
+pub const TELEMETRY_EVENT_REPORT_INTERVAL: u64 = 10; // 10 seconds


Is the size of the events accumulated in 10s controllable? If not, I am concerned that doing only time-based batching can cause memory pressure / OOM on the node. We might need to consider adding size-based batching as well.

It makes sense. Let me fix.
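
A minimal sketch of combining time-based and size-based batching, as suggested above. It is an assumption-laden illustration rather than the PR's implementation: `Batcher`, `push`, and `tick` are hypothetical names, and the caller passes the current time in explicitly so the logic is easy to test.

```rust
use std::time::{Duration, Instant};

/// Hypothetical batcher: flushes when the buffer reaches `max_len`
/// or when `interval` has elapsed since the last flush.
pub struct Batcher<T> {
    buf: Vec<T>,
    max_len: usize,
    interval: Duration,
    last_flush: Instant,
}

impl<T> Batcher<T> {
    pub fn new(max_len: usize, interval: Duration, now: Instant) -> Self {
        Self { buf: Vec::new(), max_len, interval, last_flush: now }
    }

    /// Add an item. Returns a full batch if the size limit was reached,
    /// which bounds the memory held between reports.
    pub fn push(&mut self, item: T, now: Instant) -> Option<Vec<T>> {
        self.buf.push(item);
        if self.buf.len() >= self.max_len {
            Some(self.flush(now))
        } else {
            None
        }
    }

    /// Called periodically. Returns the pending batch if the interval
    /// has elapsed and there is something to send.
    pub fn tick(&mut self, now: Instant) -> Option<Vec<T>> {
        if !self.buf.is_empty() && now.duration_since(self.last_flush) >= self.interval {
            Some(self.flush(now))
        } else {
            None
        }
    }

    /// Hand out the buffered items and reset the timer.
    fn flush(&mut self, now: Instant) -> Vec<T> {
        self.last_flush = now;
        std::mem::take(&mut self.buf)
    }
}
```

Note that the size limit here counts events, not bytes; if individual events can be large, a byte-based limit may be a better fit.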

hzxa21 · 2025-01-03T07:10:29Z

src/common/src/telemetry/report.rs

+                    if let Some(event) = event {
+                        TELEMETRY_EVENT_REPORT_STASH.lock().await.push(event);
+                    }
+                    if TELEMETRY_EVENT_REPORT_STASH.lock().await.len() >= TELEMETRY_EVENT_REPORT_STASH_SIZE {


Guarding TELEMETRY_EVENT_REPORT_STASH with a lock is unnecessary because do_telemetry_event_report is only called from this tokio task, so there is no real contention on it. We can define a local variable for the message batch and pass it to do_telemetry_event_report directly.

let mut batch = Vec::new();
loop {
    ....
    tokio::select! {
        ...
        ... => {
            ...
            do_telemetry_event_report(std::mem::take(&mut batch)).await;
        }
        ...
    }
}

My intention was to move all the event-report logic into the telemetry_event crate, and defining a local variable here may mix the two telemetry objects together.
But yes, a local variable works better here.
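
To make the local-variable idea concrete, here is a small synchronous sketch. It is not the PR's code: `report_batch` is a hypothetical stand-in for `do_telemetry_event_report`, and a plain `for` loop replaces the `tokio::select!` loop. The point is that the batch is owned by the single loop, so no lock is needed, and `std::mem::take` hands the accumulated events to the reporter while leaving an empty `Vec` behind.

```rust
/// Hypothetical stand-in for `do_telemetry_event_report`:
/// takes ownership of a batch and returns how many events it received.
pub fn report_batch(batch: Vec<String>) -> usize {
    batch.len()
}

/// Accumulate events in a local batch and report whenever it reaches
/// `max_batch` items. Returns the size of every reported batch.
pub fn run_loop(events: impl IntoIterator<Item = String>, max_batch: usize) -> Vec<usize> {
    // Owned by this function only, so no Mutex is required.
    let mut batch = Vec::new();
    let mut reported = Vec::new();

    for event in events {
        batch.push(event);
        if batch.len() >= max_batch {
            // `take` moves the contents out and resets `batch` to empty.
            reported.push(report_batch(std::mem::take(&mut batch)));
        }
    }

    // Report whatever is left once the input ends.
    if !batch.is_empty() {
        reported.push(report_batch(batch));
    }
    reported
}
```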

hzxa21

Rest LGTM

batch message

14ab498

github-actions bot added the type/feature label Jan 2, 2025

restore telemetry etcd backend

9731e9c

tabVersion commented Jan 2, 2025

batching

5c6578f

hzxa21 reviewed Jan 3, 2025

tabversion and others added 2 commits January 3, 2025 14:55

batch size trigger report

015af02

Merge branch 'main' into tab/batching-telemetry-request

15de8b9

tabVersion marked this pull request as ready for review January 3, 2025 06:58

tabVersion requested a review from hzxa21 January 3, 2025 06:59

hzxa21 reviewed Jan 3, 2025

graphite-app bot requested a review from a team January 3, 2025 07:18

tabversion added 2 commits January 3, 2025 15:35

fix

3ed7e87

fmt

9ffe508

hzxa21 approved these changes Jan 3, 2025

tabVersion added 2 commits January 3, 2025 20:15

fix

a50d227

Merge branch 'main' into tab/batching-telemetry-request

e793e2c

graphite-app bot requested a review from a team January 3, 2025 12:35

fix dylint

b0274f9
