
Question About Reported Results #67

Open
rafaelpadilla opened this issue Dec 19, 2024 · 1 comment

Comments

@rafaelpadilla

Hi there! 👋

First off, amazing job putting together a leaderboard with so many models! 🙌 It’s such a valuable resource for the community to compare performance easily—thank you for making this effort!

I did notice a couple of things that seemed a bit off, and I was hoping to get some clarification:

1️⃣ Some of the results on the leaderboard seem quite different from those published in other sources, like this and the RT-DETR paper.

2️⃣ Additionally, I’m curious about how the validation set is being used in each evaluated model. If it’s influencing training (e.g., for early stopping), it might make the validation set less ideal as a benchmark for the leaderboard.

Would you mind shedding some light on these points? I'm asking so I can better understand the methodology and align expectations. The race to push mAP higher is incredibly competitive among the models, and even a difference in the last decimal place can matter when comparing them! 💡😊

Keep up the awesome work! 🚀

@onuralpszr
Collaborator

Hi @rafaelpadilla 👋,

First, I’d like to mention that all of our metric tests are run transparently in our repository, so everyone can see exactly how we evaluate models. Our focus has always been on using the original models and repositories so that our results stay as close as possible to the reported benchmarks.

Here are a few key details about our evaluation process:

We only evaluate models with their published pre-trained weights. We do not re-train or fine-tune them, staying consistent with the original implementations.

When validating our results, we cross-check against each repository's own benchmarks. For instance, with RT-DETR we referenced their official implementation. Even so, differences in scores can arise from variations in how the metric is calculated; for example, RT-DETR most likely uses "COCO metrics", which could account for the discrepancies you noticed.
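
For reference, here is a minimal sketch (an illustration, not code from our repository) of how a COCO-style mAP is typically computed with pycocotools; the file names below are placeholders. A leaderboard that computes mAP with a different implementation, or with different thresholds and limits, can legitimately report slightly different numbers for the same predictions.

```python
# Minimal sketch of a COCO-style mAP computation with pycocotools.
# "instances_val2017.json" and "predictions.json" are placeholder paths;
# predictions must be in the standard COCO results JSON format.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val2017.json")       # ground-truth annotations
coco_dt = coco_gt.loadRes("predictions.json")  # model detections

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()     # per-image, per-category matching
evaluator.accumulate()   # aggregate precision/recall curves
evaluator.summarize()    # prints the 12 standard COCO metrics

map_50_95 = evaluator.stats[0]  # AP @ IoU=0.50:0.95, the headline mAP
print(f"mAP 50:95 = {map_50_95:.4f}")
```

Details such as IoU thresholds, area ranges, and the maximum number of detections per image all feed into this number, so two metric implementations can disagree in the last decimal places even on identical predictions.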

Let me know if you have any questions or need further clarification!
