feat: batching telemetry event request avoid too many requests #20000

Open
wants to merge 10 commits into
base: main
from

Conversation

tabVersion
Contributor

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • My PR contains critical fixes that are necessary to be merged into the latest release.

Documentation

  • My PR needs documentation updates.
Release note

@tabVersion
Contributor Author

Related changes on the server side: https://github.com/risingwavelabs/telemetry-backend/pull/45

Comment on lines -10 to -11
reserved 2;
reserved "META_BACKEND_ETCD";
Contributor Author

The telemetry service still receives requests from legacy clusters, so it is still possible to get reports from a cluster with an etcd backend.
I believe the field was changed by mistake in #18621.
This is not a breaking change in the proto file because there have been no updates on the server side since then.

.unwrap_or_else(|e| tracing::info!("{}", e))
});

TELEMETRY_EVENT_REPORT_STASH.blocking_lock().push(event);
Collaborator

Given that we call blocking_lock here, why not just use a non-async mutex (parking_lot::Mutex) instead of the tokio async mutex?

Contributor Author

After switching to an unbounded channel, there seems to be no need for the sync Mutex.
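
For illustration, a minimal sketch of the channel-based approach mentioned above. It uses `std::sync::mpsc::channel` (which is unbounded) as a stand-in for tokio's unbounded channel, so it is not the code in this PR. `Event`, `report_event`, and `drain_pending` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, Sender};

/// Hypothetical stand-in for the real telemetry event type.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub name: String,
}

/// Producer side: callable from synchronous code.
/// Sending on an unbounded channel does not block, so no lock is required.
pub fn report_event(tx: &Sender<Event>, name: &str) {
    // A send error only means the receiver is gone; telemetry is
    // best-effort in this sketch, so the event is silently dropped.
    let _ = tx.send(Event {
        name: name.to_string(),
    });
}

/// Consumer side: collect everything currently queued without blocking.
pub fn drain_pending(rx: &Receiver<Event>) -> Vec<Event> {
    rx.try_iter().collect()
}
```

The receiver is owned by a single consumer, so the shared stash and its mutex are not needed in this shape.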

.unwrap_or_else(|e| tracing::debug!("{}", e));
}

pub const TELEMETRY_EVENT_REPORT_INTERVAL: u64 = 10; // 10 seconds
Collaborator

Is the size of the events accumulated in 10s controllable? If not, I am concerned that doing only time-based batching can cause memory pressure / OOM on the node. We might need to consider adding size-based batching as well.

Contributor Author

That makes sense. Let me fix it.
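
A rough sketch of combining size-based and time-based batching as suggested above. It is written against `std::sync::mpsc` rather than tokio, and `next_batch`, `max_len`, and `interval` are hypothetical names, not identifiers from this PR.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Collect one batch from `rx`. Returns as soon as one of these holds:
/// - `max_len` items have been collected (size-based flush),
/// - `interval` has elapsed since the call started (time-based flush),
/// - all senders have been dropped.
pub fn next_batch<T>(rx: &Receiver<T>, max_len: usize, interval: Duration) -> Vec<T> {
    let deadline = Instant::now() + interval;
    let mut batch = Vec::new();
    while batch.len() < max_len {
        // Time left until the deadline; zero once the deadline has passed.
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    batch
}
```

This caps the size of each batch at `max_len`. The channel itself is still unbounded, so if the consumer falls behind, queued events can keep growing; a bounded channel or dropping events would be needed to cap that as well.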

@tabVersion tabVersion marked this pull request as ready for review January 3, 2025 06:58
@tabVersion tabVersion requested a review from hzxa21 January 3, 2025 06:59
if let Some(event) = event {
TELEMETRY_EVENT_REPORT_STASH.lock().await.push(event);
}
if TELEMETRY_EVENT_REPORT_STASH.lock().await.len() >= TELEMETRY_EVENT_REPORT_STASH_SIZE {
Collaborator

TELEMETRY_EVENT_REPORT_STASH with locking is unnecessary because do_telemetry_event_report is called in this tokio task only and there is no real contention on it. We can define a local var for the message batch and pass it to do_telemetry_event_report directly.

let mut batch = Vec::new();

loop {
  ....
  tokio::select! {
    ...
    ... => {
     ...
     do_telemetry_event_report(std::mem::take(&mut batch)).await;
    }
    ...
  }
}

Contributor Author

My intention was to move all event reporting logic into the telemetry_event crate; defining a local variable here may mix up the two telemetry objects.
But yes, a local variable works better here.
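
To make the ownership pattern from the suggestion concrete, here is a small synchronous sketch. The `Reporter` type and `run` function are hypothetical stand-ins for `do_telemetry_event_report` and the tokio loop; only `std::mem::take` comes from the suggestion itself.

```rust
/// Hypothetical reporter that records each batch it is handed.
pub struct Reporter {
    pub sent: Vec<Vec<String>>,
}

impl Reporter {
    /// Takes the batch by value, so the caller keeps no reference to it.
    pub fn report(&mut self, batch: Vec<String>) {
        if !batch.is_empty() {
            self.sent.push(batch);
        }
    }
}

/// Accumulate events in a local `Vec` and flush whenever it reaches `max_len`.
pub fn run(events: impl IntoIterator<Item = String>, max_len: usize, reporter: &mut Reporter) {
    let mut batch = Vec::new();
    for event in events {
        batch.push(event);
        if batch.len() >= max_len {
            // `mem::take` moves the contents out and leaves an empty Vec
            // behind, so `batch` can be reused on the next iteration.
            reporter.report(std::mem::take(&mut batch));
        }
    }
    // Flush whatever is left once the input ends.
    reporter.report(batch);
}
```

Because the batch is owned by the single loop that fills and flushes it, no lock is involved.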

@graphite-app graphite-app bot requested a review from a team January 3, 2025 07:18
tabversion added 2 commits January 3, 2025 15:35
Collaborator

@hzxa21 hzxa21 left a comment

Rest LGTM

@graphite-app graphite-app bot requested a review from a team January 3, 2025 12:35