Spark TTFB PoC #208

Open · bajtos opened this issue Dec 4, 2024 · 12 comments

bajtos commented Dec 4, 2024

We want Spark to measure another service level indicator: time to first byte (TTFB).

The goal of this task is to create a proof-of-concept solution in a few days' time. The focus should be on covering the entire vertical, from recording TTFB to showing TTFB in the dashboards, and on documenting the possible next steps based on what's missing in the PoC.

Notes:

  • Let's record TTFB only when the retrieval was successful.
  • Checker nodes are already collecting the timing information and submitting it in the measurement fields startAt and endAt (see the sketch after this list).
  • We cannot trust any of the individual TTFB values reported by the checker nodes. We need to find a way to derive a somewhat trustworthy value from a bunch of untrustworthy measurements.
  • Let's create lightweight documentation describing how to interpret the TTFB values shown in our dashboards. This will be among the first questions people will ask us.
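
For illustration, here's a minimal sketch of how a single TTFB value can be derived from a measurement's timing fields. Note that firstByteAt is an assumption on my side; if checkers report only startAt and endAt, we'll need a new field for the first-byte timestamp.

```js
// Hedged sketch of deriving one checker's TTFB from the timing data it
// already reports. The issue names the fields startAt and endAt; the
// firstByteAt field here is an assumption - if checkers don't report a
// first-byte timestamp, this derivation needs a new field.
const measurementTtfbMs = (m) =>
  new Date(m.firstByteAt).getTime() - new Date(m.startAt).getTime()

// Example: a measurement whose first byte arrived 300 ms after the start.
console.log(measurementTtfbMs({
  startAt: '2024-12-04T10:00:00.000Z',
  firstByteAt: '2024-12-04T10:00:00.300Z'
})) // 300
```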
bajtos commented Dec 4, 2024

Re: We cannot trust any of the individual TTFB values reported by the checker nodes.

To make it more difficult for checker nodes to cheat and influence the TTFB indicator we produce, I propose the following implementation for the PoC scope:

  1. For each Retrieval Task, if there is a majority that agrees the result was successful, take the TTFB values from all majority retrieval results and calculate the p50 (median) value.

    This gives us pretty high confidence in the correctness of the reported value. We are already using the honest majority assumption. If somebody wanted to influence the p50 value, they would have to control more than 50% of the network, which is very unlikely given the assumption.

  2. The next step is to figure out how to aggregate these medians.

    • If the Spark network performs two retrieval tasks for the same miner in one round, two TTFB values are produced by this round.
    • We also want to aggregate per-round per-miner values to per-day granularity for presentation via spark-stats in the dashboards.

    To keep things simple, I propose the following:

    • In the database, keep a list of p50 TTFB values (one list item per Retrieval Task)
    • In the presentation layer (spark-stats or the Observable dashboard data fetcher), let's use the same p50 algorithm for aggregation.

    This should work reasonably well with our current percentile implementation:

    • p50 of [10, 30] is 20
    • p50 of [10, 20, 30] is 20
    • p50 of [10, 19, 30] is 19
    • p50 of [19, 21, 22, 23] is 21.5

    I know this is not statistically correct, but I argue it's a good start and we can easily improve this part later.
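
For reference, here is a minimal sketch of a p50 calculation using linear interpolation between ranks; it reproduces all four examples above. This is an illustration only, not necessarily the exact percentile implementation we use today.

```js
// Percentile via linear interpolation between the two nearest ranks.
// A sketch only - the actual spark-evaluate implementation may differ.
const getPercentile = (values, p) => {
  const sorted = [...values].sort((a, b) => a - b)
  const rank = ((sorted.length - 1) * p) / 100
  const lo = Math.floor(rank)
  const hi = Math.ceil(rank)
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo)
}

console.log(getPercentile([10, 30], 50)) // 20
console.log(getPercentile([10, 20, 30], 50)) // 20
console.log(getPercentile([10, 19, 30], 50)) // 19
console.log(getPercentile([19, 21, 22, 23], 50)) // 21.5
```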

@patrickwoodhead You are our math wizard, what do you think? Can you propose a better aggregation function?

Inputs: a list of p50 TTFB values, one value per retrieval task (committee)
Desired output: one value representing "the overall p50 TTFB value".

patrickwoodhead commented Dec 4, 2024

Someone can influence the p50 value without controlling more than 50% of the committee. Let's assume the truthful responses are [19, 21, 22, 23, 25]. Then if I control the checkers that reported the fastest two measurements, I can instead report slow measurements to make the responses [22, 23, 25, 100, 100].

This would make the median TTFB 25 instead of 22. So it has been influenced, but it still remains one of the truthful values under the honest majority assumption. I think this is a good enough starting point.
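
A quick script to double-check the numbers (the median helper is just for illustration):

```js
// Demonstrates the attack above: two colluding checkers replace the two
// fastest truthful measurements with slow ones and shift the median.
const median = (values) => {
  const sorted = [...values].sort((a, b) => a - b)
  const mid = (sorted.length - 1) / 2
  return (sorted[Math.floor(mid)] + sorted[Math.ceil(mid)]) / 2
}

console.log(median([19, 21, 22, 23, 25])) // 22 - truthful median
console.log(median([22, 23, 25, 100, 100])) // 25 - shifted, yet still a truthful value
```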

I think the aggregation logic makes sense too. Keeping a value per task seems like the right approach, and then we can aggregate over tasks to get the values for miners, clients, allocators, etc.

The one thing this doesn't account for is the geolocation of checkers and the time it takes to move the bytes over the internet. We discussed subtracting the round-trip time of the TCP handshake from the round-trip time to the first byte to understand the server-side TTFB, right?

bajtos commented Dec 5, 2024

> Someone can influence the p50 value without controlling more than 50% of the committee. Let's assume the truthful responses are [19, 21, 22, 23, 25]. Then if I control the checkers that reported the fastest two measurements, I can instead report slow measurements to make the responses [22, 23, 25, 100, 100].
>
> This would make the median TTFB 25 instead of 22. So it has been influenced, but it still remains one of the truthful values under the honest majority assumption. I think this is a good enough starting point.
>
> I think the aggregation logic makes sense too. Keeping a value per task seems like the right approach, and then we can aggregate over tasks to get the values for miners, clients, allocators, etc.

SGTM 👍🏻

> The one thing this doesn't account for is the geolocation of checkers and the time it takes to move the bytes over the internet. We discussed subtracting the round-trip time of the TCP handshake from the round-trip time to the first byte to understand the server-side TTFB, right?

Yes, we don't account for geolocation and the related latency.

IMO, fixing this is beyond the PoC/MVP scope, but geolocation-based latency is definitely something to document and plan to improve soon.

For context:

  • Right now, Spark observes a p50 TTFB value of around 900 ms.
  • A realistic round trip between Europe and Australia takes ~350-400 ms.

Since we are measuring p50, if our network is distributed uniformly enough that we have a similar number of nodes far away and close to the SP being measured, then I think the p50 value should not be affected too much.

I guess this depends on the interpretation. We can say that the PoC version is measuring the worldwide TTFB median.

Later, we can improve this to measure "TTFB on the provider's server". But this creates an attack vector where the SP sets up a very slow internet connection (e.g. adds 500 ms of latency at the TCP/IP or HTTPS/TLS level). Such a retrieval service won't be very useful to users, as they always have to wait more than 500 ms for the first byte, but Spark would report a much lower server-side TTFB because it subtracts those 500 ms.
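
To illustrate with made-up numbers: suppose an honest setup has a 100 ms handshake RTT and a 900 ms TTFB, and the SP then injects 500 ms of latency below the HTTP layer. Users wait 500 ms longer, but the derived server-side value doesn't change:

```js
// Hypothetical numbers only. Subtracting the TCP handshake round trip
// hides any latency the SP injects below the HTTP layer.
const serverSideTtfb = (ttfbMs, handshakeRttMs) => ttfbMs - handshakeRttMs

console.log(serverSideTtfb(900, 100)) // 800 - honest connection
console.log(serverSideTtfb(1400, 600)) // 800 - after injecting 500 ms; users wait longer, metric unchanged
```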

Clearly, more research is needed here.

pyropy commented Dec 18, 2024

Should this be considered a separate task that checkers are rewarded for (e.g. evaluated separately from retrievability, but using the same reported data), or should it just be part of the current evaluation process as an additional metric?

bajtos commented Dec 18, 2024

> Should this be considered a separate task that checkers are rewarded for (e.g. evaluated separately from retrievability, but using the same reported data), or should it just be part of the current evaluation process as an additional metric?

As a guiding principle for this PoC, we want to spend as little effort as possible to ship the TTFB metric (network-wide and per-miner).

My idea was to extract TTFB as an additional step in the current evaluation process.

Also, note that Spark checker nodes will not perform any additional task for this; they are already reporting enough data in their current measurements.

pyropy commented Dec 18, 2024

> My idea was to extract TTFB as an additional step in the current evaluation process.

I just want to confirm: are we discussing the Committee evaluation process? Should committees be responsible for calculating the p50 for TTFB, or should we calculate the p50 from all reported values?

juliangruber commented

Let's start with the TTFB calculation based on current committee results, even though committees don't necessarily agree on the TTFB value. It's a PoC and we can iterate from there.

bajtos commented Dec 19, 2024

See #208 (comment)

> For each Retrieval Task, if there is a majority that agrees the result was successful, take the TTFB values from all majority retrieval results and calculate the p50 (median) value.

A committee is defined as "all measurements submitted for one Retrieval Task that pass the fraud detection step", so yes, the TTFB can be extracted from individual committees.

See this loop for an example showing how to extract per-committee data:

https://github.com/filecoin-station/spark-evaluate/blob/607d836f6a60b9b22c57254d6df041dcd4c1cf5f/lib/public-stats.js#L69-L75

Also take a look at updateDailyDealsStats() later in the same file; it aggregates statistics per (day, miner_id, client_id) tuple.
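
As a rough sketch (field and helper names like firstByteAt and recordCommitteeTtfb are illustrative, not the actual spark-evaluate API), the extraction could look like this:

```js
// Illustrative sketch of extracting a per-committee p50 TTFB, modelled
// on the loop linked above. Field and helper names are assumptions.
const p50 = (values) => {
  const sorted = [...values].sort((a, b) => a - b)
  const mid = (sorted.length - 1) / 2
  return (sorted[Math.floor(mid)] + sorted[Math.ceil(mid)]) / 2
}

for (const committee of committees) {
  const successful = committee.measurements.filter(
    (m) => m.retrievalResult === 'OK'
  )
  // Record TTFB only when the majority agrees the retrieval succeeded
  if (successful.length * 2 <= committee.measurements.length) continue
  const ttfbValues = successful.map(
    (m) => new Date(m.firstByteAt).getTime() - new Date(m.startAt).getTime()
  )
  recordCommitteeTtfb(committee.retrievalTask, p50(ttfbValues))
}
```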

#208 (comment)

> I think the aggregation logic makes sense too. Keeping a value per task seems like the right approach, and then we can aggregate over tasks to get the values for miners, clients, allocators, etc.

To get per-task granularity, we cannot use the (day, miner_id, client_id) structure. Instead, we need something like (day, round_number, miner_id, payload_cid, client_ids). (Note that a single retrieval task (miner_id, payload_cid) can be linked to deals from multiple clients - that's why we need the client_ids array.)
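
For example, a per-task record could be shaped like this (all values are made up for illustration):

```js
// Hypothetical per-task record. client_ids is an array because one
// retrieval task (miner_id, payload_cid) can map to deals from
// multiple clients.
const perTaskTtfb = {
  day: '2024-12-19',
  round_number: 12345,
  miner_id: 'f01234',
  payload_cid: 'bafy...',
  client_ids: ['f0101', 'f0202'],
  ttfb_p50: 450 // milliseconds
}
```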

juliangruber commented

@bajtos the question made sense because committees are built on retrievability, not TTFB. It could make sense to also build them on TTFB values/buckets, for the sake of this metric. Since this is a PoC, that's not necessary.

bajtos commented Dec 19, 2024

> @bajtos the question made sense because committees are built on retrievability, not TTFB. It could make sense to also build them on TTFB values/buckets, for the sake of this metric.

Makes sense 👍🏻

> Since this is a PoC, that's not necessary.

Agreed.

I also realised that in my comment #208 (comment), I may be adding complexity that's not necessary for this PoC.

I would like the result of this PoC to provide the following two metrics:

  • Network-wide TTFB and how it evolves on a daily basis. (E.g., on 2024-12-19, the TTFB was 450ms; it improved to 300ms on 2025-05-19.)

    • What retrieval latency can be expected from the Filecoin network?
  • Per-miner TTFB and how it evolves on a daily basis.

    • Client: If TTFB is an important metric to me, which Storage Provider should I choose to store my data?
    • SP: How good is my TTFB compared to others? Is it improving over time?

Per-client and per-allocator TTFB can be added later after we validate there is demand for that.

pyropy commented Dec 19, 2024

@bajtos I guess if that's the case, then we should just extend the existing metrics (updating the daily deals stats)?

bajtos commented Dec 22, 2024

> @bajtos I guess if that's the case, then we should just extend the existing metrics (updating the daily deals stats)?

Yes, essentially. Let's add a new table to track TTFB, though.
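
A hypothetical sketch of what that could look like, mirroring the updateDailyDealsStats() approach; the table and column names are illustrative, not an agreed schema:

```js
// Illustrative only - not an agreed schema. Appends the per-task p50
// TTFB values recorded for (day, miner_id); the presentation layer can
// then compute a p50 over this list.
const updateDailyTtfbStats = async (pgClient, day, minerId, p50Values) => {
  await pgClient.query(`
    INSERT INTO daily_miner_ttfb (day, miner_id, ttfb_p50_values)
    VALUES ($1, $2, $3)
    ON CONFLICT (day, miner_id)
    DO UPDATE SET ttfb_p50_values = daily_miner_ttfb.ttfb_p50_values || $3
  `, [day, minerId, p50Values])
}
```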
