
Question About Reported Results #67

Open
rafaelpadilla opened this issue Dec 19, 2024 · 1 comment

Comments

@rafaelpadilla

Hi there! 👋

First off, amazing job putting together a leaderboard with so many models! 🙌 It’s such a valuable resource for the community to compare performance easily—thank you for making this effort!

I did notice a couple of things that seemed a bit off, and I was hoping to get some clarification:

1️⃣ Some of the results on the leaderboard seem quite different from those published in other sources, like this and the RT-DETR paper.

2️⃣ Additionally, I’m curious about how the validation set is being used in each evaluated model. If it’s influencing training (e.g., for early stopping), it might make the validation set less ideal as a benchmark for the leaderboard.

Would you mind shedding some light on these points? I'm asking so I can better understand the methodology and align expectations. The race to push mAP higher is incredibly competitive among the models, and even a difference in the last decimal place can matter when comparing them! 💡😊

Keep up the awesome work! 🚀

@onuralpszr
Collaborator

Hi @rafaelpadilla 👋,

First, I’d like to mention that all of our metric tests are run transparently in our repository, so everyone can see exactly how we evaluate models. Our focus has always been on using the original models and repositories so that our results stay as close as possible to the reported benchmarks.

Here are a few key details about our evaluation process:

We only evaluate models with their published pre-trained weights. We do not re-train or fine-tune them, staying consistent with the original implementations.

When validating our results, we cross-check against each repository's own benchmarks. For instance, with RT-DETR we referenced their official implementation. Even so, differences in scores can arise from variations in how the metric is calculated; for example, RT-DETR most likely uses "COCO metrics", which could account for the discrepancies you noticed.
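
For reference, here is a minimal sketch (an illustration, not code from our repository) of how a COCO-style mAP is typically computed with pycocotools; the file names below are placeholders. A leaderboard that computes mAP with a different implementation, or with different thresholds and limits, can legitimately report slightly different numbers for the same predictions.

```python
# Minimal sketch of a COCO-style mAP computation with pycocotools.
# "instances_val2017.json" and "predictions.json" are placeholder paths;
# predictions must be in the standard COCO results JSON format.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("instances_val2017.json")       # ground-truth annotations
coco_dt = coco_gt.loadRes("predictions.json")  # model detections

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()     # per-image, per-category matching
evaluator.accumulate()   # aggregate precision/recall curves
evaluator.summarize()    # prints the 12 standard COCO metrics

map_50_95 = evaluator.stats[0]  # AP @ IoU=0.50:0.95, the headline mAP
print(f"mAP 50:95 = {map_50_95:.4f}")
```

Details such as IoU thresholds, area ranges, and the maximum number of detections per image all feed into this number, so two metric implementations can disagree in the last decimal places even on identical predictions.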

Let me know if you have any questions or need further clarification!
