feat: batching telemetry event request avoid too many requests #20000
base: main
Conversation
Related changes on the server side: https://github.com/risingwavelabs/telemetry-backend/pull/45
reserved 2;
reserved "META_BACKEND_ETCD";
The telemetry service still receives requests from legacy clusters, so it is still possible to get reports from an etcd-backend cluster. I believe the field was changed by mistake in #18621.
There is no breaking change in the proto file because there have been no updates on the server side since then.
.unwrap_or_else(|e| tracing::info!("{}", e))
});

TELEMETRY_EVENT_REPORT_STASH.blocking_lock().push(event);
Given that we call `blocking_lock` here, why not just use a non-async mutex (`parking_lot::Mutex`) instead of the tokio async mutex?
After switching to an unbounded channel, there seems to be no need for a sync Mutex at all.
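As a rough illustration of the channel-based approach, here is a minimal sketch of the producer side; `TELEMETRY_EVENT_TX`, `TelemetryEvent`, and `report_event` are assumed names for this example, not the identifiers used in the PR:

```rust
use std::sync::OnceLock;

use tokio::sync::mpsc;

// Hypothetical event type standing in for the real telemetry event struct.
pub struct TelemetryEvent {
    pub name: String,
}

// Global sender installed once at startup; the name is assumed for this sketch.
static TELEMETRY_EVENT_TX: OnceLock<mpsc::UnboundedSender<TelemetryEvent>> = OnceLock::new();

// Producer side: no mutex needed, just push the event into the unbounded channel.
// Sending on an unbounded channel never blocks, so this is safe to call from both
// sync and async contexts.
pub fn report_event(event: TelemetryEvent) {
    if let Some(tx) = TELEMETRY_EVENT_TX.get() {
        // The only failure mode is a dropped receiver (the reporting task has exited).
        let _ = tx.send(event);
    }
}
```

The receiving half of the channel would then be owned by the reporting task discussed further down.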
.unwrap_or_else(|e| tracing::debug!("{}", e));
}

pub const TELEMETRY_EVENT_REPORT_INTERVAL: u64 = 10; // 10 seconds
Is the size of the events accumulated in 10s bounded? If not, I am concerned that time-based batching alone can cause memory pressure / OOM on the node. We might need to add size-based batching as well.
Makes sense, let me fix it.
src/common/src/telemetry/report.rs (Outdated)
if let Some(event) = event {
    TELEMETRY_EVENT_REPORT_STASH.lock().await.push(event);
}
if TELEMETRY_EVENT_REPORT_STASH.lock().await.len() >= TELEMETRY_EVENT_REPORT_STASH_SIZE {
Locking `TELEMETRY_EVENT_REPORT_STASH` is unnecessary because `do_telemetry_event_report` is called only from this tokio task, so there is no real contention on it. We can define a local variable for the message batch and pass it to `do_telemetry_event_report` directly:
let mut batch = Vec::new();
loop {
    ...
    tokio::select! {
        ...
        ... => {
            ...
            do_telemetry_event_report(std::mem::take(&mut batch)).await;
        }
        ...
    }
}
My intention is to move all of the event-report logic into the `telemetry_event` crate, and defining a local variable here might mix up the two telemetry objects.
But yes, a local variable works better here.
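To make the suggested loop concrete, here is a minimal sketch that combines the 10-second interval with a size-based flush (also addressing the earlier OOM concern). It reuses the hypothetical `TelemetryEvent` type from the sketch above; the stubbed `do_telemetry_event_report` and the stash-size constant are assumptions for illustration, not the actual code in the PR:

```rust
use tokio::sync::mpsc;
use tokio::time::{interval, Duration};

// Assumed constants for this sketch; the real values live in the PR.
const TELEMETRY_EVENT_REPORT_INTERVAL: u64 = 10; // seconds
const TELEMETRY_EVENT_REPORT_STASH_SIZE: usize = 100; // flush early above this size

// Stub standing in for the real reporting call.
async fn do_telemetry_event_report(batch: Vec<TelemetryEvent>) {
    let _ = batch; // send the batch to the telemetry backend here
}

async fn event_report_loop(mut rx: mpsc::UnboundedReceiver<TelemetryEvent>) {
    // Local batch owned by this task only, so no locking is needed.
    let mut batch = Vec::new();
    let mut tick = interval(Duration::from_secs(TELEMETRY_EVENT_REPORT_INTERVAL));
    loop {
        tokio::select! {
            // Time-based flush: report whatever has accumulated each interval.
            _ = tick.tick() => {
                if !batch.is_empty() {
                    do_telemetry_event_report(std::mem::take(&mut batch)).await;
                }
            }
            // Collect incoming events; flush early if the batch grows too large.
            event = rx.recv() => {
                match event {
                    Some(event) => {
                        batch.push(event);
                        if batch.len() >= TELEMETRY_EVENT_REPORT_STASH_SIZE {
                            do_telemetry_event_report(std::mem::take(&mut batch)).await;
                        }
                    }
                    // Channel closed: flush the remainder and exit.
                    None => {
                        if !batch.is_empty() {
                            do_telemetry_event_report(std::mem::take(&mut batch)).await;
                        }
                        break;
                    }
                }
            }
        }
    }
}
```

`std::mem::take` hands the accumulated events to the report call and leaves an empty `Vec` behind, so the batch never needs to be shared or locked.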
Rest LGTM
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Checklist
Documentation
Release note