counterexamples out of range in nn4sys2023 #4
Comments
Thanks @huanzhang12 and @KaidiXu, this helps. Based on this, I guess the situation looks like this: NNV reported a counterexample that was within a small tolerance of the possible inputs, although other tools said the instance was unsat. What is the suggestion for what to do in this case? It seems like both SAT and UNSAT are possible answers. Something else that could happen with tolerances would be if the outputs looked like this:
Also: any idea if something similar happened with the traffic signs recognition 2023 benchmark? The input/output errors there look much larger than the tolerance. The rules were a bit ambiguous on this point... I had to modify things to state when a relative and when an absolute tolerance was used and updated the tolerances, trying to make it so that no tool was penalized for the outputs not matching the inputs when executing with onnxruntime, but I didn't anticipate it could affect the sat/unsat verification result. The latest wording is: "However, there will be no penalty as long as the execution of the network using onnxruntime matches the witness up to a small numerical precision (mismatch is less than 1e-3 relative error), and the constraints are met up to 1e-4 absolute error." The original wording from back in April was: "However, there will be no penalty as long as the execution of the network using onnxruntime matches the witness up to a small numerical precision (mismatch is less than 1e-4), and the constraints are met up to this same precision."
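For concreteness, the two checks in this rule wording could look roughly like the sketch below. The function names, array arguments, and the exact way the relative error is applied are assumptions for illustration only, not the actual scoring code:

```python
import numpy as np

def witness_matches(onnx_output, witness_output, rel_tol=1e-3):
    """Hypothetical check: running the network in onnxruntime should reproduce
    the witness output up to a relative error (1e-3 in the 2023 rule text)."""
    denom = np.maximum(np.abs(witness_output), np.finfo(float).tiny)
    return bool(np.all(np.abs(onnx_output - witness_output) / denom < rel_tol))

def constraints_met(x, lower, upper, abs_tol=1e-4):
    """Hypothetical check: input box constraints are accepted up to an absolute
    error (1e-4 in the 2023 rule text). With bounds on the order of 1e-3, this
    slack is larger than the width of the valid range itself."""
    return bool(np.all(x >= lower - abs_tol) and np.all(x <= upper + abs_tol))
```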
@stanleybak Thank you for getting back to me and looking into the issue! I read the latest rules, and they do say 1e-4 absolute error on input constraints, so those adversarial examples are valid. But tools that give "unsat" results should not be considered wrong either, because with such a large absolute error tolerance both "unsat" and "sat" are correct answers; with a smaller tolerance the result would be "unsat". I feel relative error makes more sense in this setting (since x is not normalized), and we can discuss this next year.

The second issue was a bug in the scoring script. When multiple tools find counter-examples, as long as one tool finds a valid counter-example, the script assumes the counter-examples found by all other tools are also correct, even if they are actually "incorrect" due to the numerical issue above. This is indeed a bug that should be fixed, otherwise the penalty will be incorrect (lower) for tools that report wrong counter-examples.

```python
for ce_valid_res in ce_results.values():
    if ce_valid_res == CounterexampleResult.CORRECT:
        valid_ce = True
        break
```

And here is where the penalty is assigned:

```python
elif penalize_no_ce and res == "violated" and not ce_results[tool_name]:  # in 2022 also had: num_holds > 0
    # Rule: If a witness is not provided when “sat” is produced, the tool will be assessed a penalty.
    score = Settings.PENALTY_INCORRECT
# some code here omitted
elif res == "violated" and not valid_ce:
    # incorrect witness
    score = Settings.PENALTY_INCORRECT
```

If I understand the script correctly, the first `for` loop sets `valid_ce` to `True` as long as any tool's counterexample is correct, so in the second `elif` a tool whose own counterexample is invalid is not penalized. The proposed fix is to change the second `elif` to:

```python
elif res == "violated" and (not valid_ce or ce_results[tool_name] != CounterexampleResult.CORRECT):
```

For this benchmark, the results will change dramatically after fixing this bug.
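To see the effect of the proposed fix, here is a self-contained sketch. The tool names, the penalty value, and the simplified `CounterexampleResult` enum are made up for illustration; only the fixed condition itself comes from the proposal above:

```python
from enum import Enum

PENALTY_INCORRECT = -150  # placeholder value, not the real Settings constant

class CounterexampleResult(str, Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"

def score_tool(tool_name, res, ce_results, fixed=False):
    """Simplified per-tool scoring: only the 'violated' branch is modeled."""
    # valid_ce is True if *any* tool produced a correct counterexample
    valid_ce = any(r == CounterexampleResult.CORRECT for r in ce_results.values())
    if res == "violated":
        if fixed:
            # proposed fix: also require this tool's own witness to be correct
            if not valid_ce or ce_results[tool_name] != CounterexampleResult.CORRECT:
                return PENALTY_INCORRECT
        else:
            # original logic: no penalty as long as some tool's witness is valid
            if not valid_ce:
                return PENALTY_INCORRECT
    return 10  # nominal score for a counted result

ce_results = {"tool_a": CounterexampleResult.CORRECT,
              "tool_b": CounterexampleResult.INCORRECT}
print(score_tool("tool_b", "violated", ce_results, fixed=False))  # 10: invalid witness not penalized
print(score_tool("tool_b", "violated", ce_results, fixed=True))   # -150: penalized after the fix
```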
Hello @huanzhang12 @stanleybak @KaidiXu, I looked into the results for the 2023 nn4sys benchmark after the competition, and I found that all the counterexamples we (NNV) generated were wrong (only for this benchmark), as there was a parsing error for this type of vnnlib specification. I reported our errors to Christopher in case we need to recompute the scores, but I had missed this issue.
In the VNN-COMP 2023 nn4sys benchmark, many tools got “incorrect” results. Since the majority of this benchmark was the same as in 2021 and 2022, these “incorrect” results look quite suspicious, so @KaidiXu and I looked into why they happened. We found that the counterexamples provided by one tool violate the constraints on x, but this is not detected by the evaluation script because it uses an absolute error threshold. Other tools verified these instances to be safe and were mistakenly counted as “incorrect”.
Below are the two instances where our tool (alpha-beta-CROWN) was counted as “incorrect”:
In lindex_deep_lindex_1000.counterexample, the counterexample has x = 0.008319346369826038; however, the actual range of x given by the vnnlib is 0.0082591306418180 <= x <= 0.0082597080618143 (see line 200 after unzipping the vnnlib file).
In lindex_lindex_1000.counterexample, the counterexample has x = 0.008319346369826038; however, the actual range of x defined by the vnnlib is 0.0083549842238426 <= x <= 0.0083560608327389 (see line 358 after unzipping the vnnlib file).
In these two counterexamples, the input x is clearly outside of the valid range. However, the evaluation script uses an absolute threshold of 1e-4 to check the bounds of x, which is insufficient here because x is very close to 0, so violations that are large relative to the range were not detected. I think the counterexample checking script should use relative error to determine whether x is out of range; this matters precisely when x is very close to 0, as in this case. We hope that in the final report the competition organizers can rerun the evaluation script for this benchmark and double-check these counterexamples with a corrected tolerance.
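As a concrete illustration (a standalone sketch, not the evaluation script itself; the 1e-3 relative tolerance below is just an example value), the first counterexample above passes the 1e-4 absolute check even though it lies above the upper bound by far more than the width of the valid range, while a relative check rejects it:

```python
x = 0.008319346369826038                         # x from lindex_deep_lindex_1000.counterexample
lo, hi = 0.0082591306418180, 0.0082597080618143  # valid range from the vnnlib file

# absolute tolerance used by the 2023 evaluation script
abs_tol = 1e-4
print(lo - abs_tol <= x <= hi + abs_tol)   # True: accepted, the violation is not detected

# an example relative check (tolerance scaled by the bound magnitude) rejects it
rel_tol = 1e-3
print(lo * (1 - rel_tol) <= x <= hi * (1 + rel_tol))  # False: x is about 0.7% above hi
```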