Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit an errors metric and have a default error rate script abort threshold #877

Open
na-- opened this issue Dec 18, 2018 · 10 comments
Open

Comments

@na--
Copy link
Member

na-- commented Dec 18, 2018

It would be very useful if we resurrect the currently unused errors metric and also add a default threshold that would abort the script if the error rate exceeds a certain configurable value.

This issue is a summary of an idea that has been discussed in various other issues and PRs before:

@na--
Copy link
Member Author

na-- commented Apr 18, 2019

All exceptions during an iteration will add to the errors metric, but we probably should offer users more flexibility by extending the fail() function to support custom error tags. Something like this: fail("error message", { tags: {action: "customer-add", cause: "cannot-login" } }.

And ideally, we should improve the UX of fail() errors. Currently, running this script:

import { fail } from "k6";

export default function () {
   fail("test");
}

will produce and error like this:

ERRO[0000] GoError: test
    at native
    at script.js:4:18(4) 

whereas something like Fail "test" at script.js:4:18(4) or script error ... or user error would be much better - normal users don't need to know k6 is written in Go... If there's an use case, we might even add a way to specify extra data in fail() besides the metric tags and error message, though I'm not sure what a good use case would be for that.

@na-- na-- mentioned this issue May 30, 2019
39 tasks
@na-- na-- added this to the v1.0.0 milestone Aug 27, 2019
@fknshwow
Copy link

@na-- the referenced issue #735 was closed in favor of this one. Will this issue cover the following?

When a check fails or fail() is called, a failure exit code is returned.
This would enable us to fail a build pipeline in the event the check criteria is not met without setting a threshold.

@na-- na-- modified the milestones: v1.0.0, v0.27.0 Nov 14, 2019
@na--
Copy link
Member Author

na-- commented Nov 14, 2019

@lukecusolito, sorry for the late response, I somehow missed your question until I just now looked at this issue for another reason 😞

To answer your question, a single check or a fail() likely won't abort the whole script. Rather, if a certain percentage of iterations are aborted with an error (either fail() or a thrown JS error), k6 would abort the whole script.

As shown in the script example from #735 (comment), you can sort of implement this even now, with a few lines of code, using custom metrics and thresholds. And using the same approach of custom metrics and thresholds, you can also implement a script abort if only a single check or iteration fails, if that's what your use case is.

This issue is focused on actually emitting these iteration errors as a metric, and having some sensible default threshold that automatically abort the test if too many iterations result in an error. After all, if 50% of your iterations are interrupted by an error, your load test probably isn't very meaningful...

@na--
Copy link
Member Author

na-- commented Aug 13, 2020

Thinking a bit more about this, and considering that #1007 added a dropped_iterations metric (#1529, details), maybe this metric shouldn't be called errors, but something like interrupted_iterations metric instead? And have a cause tag? And be used for 4 things:

  • normal JS exceptions, having a cause: error tag
  • calls from fail(), with cause: fail or cause: error (not sure, depends on how we want to do the threshold), and potentially other extra tags, as described above
  • and iterations for which goja's execution was interrupted by the executor because the gracefulStop or gracefulRampDown were insufficient for the iteration to finish completely, with cause: interrupted or something like that.
  • when someone interrupts the whole test run with Ctrl+C or something like that, any currently running iterations should also probably emit interrupted_iterations{cause: interrupted}: 1? Though that's the least important case probably...

Another reason for that is that it seems good UX for us to update the UI counter for interrupted iterations (i.e. the ZZ in the running (MMmSS.Ss), 0/XX VUs, YY complete and ZZ interrupted iterations text above the progress bars) to also include iterations that were prematurely interrupted because of an exception or fail()?

Finally, regarding thresholds... I'm not sure if we should add a default one, considering there might not be a way to users to overwrite it or remove it currently? This needs to be checked, but regardless, if we even want the possibility of adding a sensible threshold based on interrupted_iterations{cause: error} (that's why I want to classify both exceptions and fail() calls as cause: error), we need to make the new metric a Rate instead of a Counter.

Unfortunately, that means k6 will emit more noise for each iteration (both iterations: 1 and interrupted_iterations: 0)... Which is far from ideal, but unfortunately we currently can't have a threshold referencing 2 metrics (i.e. something like interrupted_iterations / iterations < 0.05 is impossible 😞 ).

So, I'm not sure if we should have a Rate or a Counter. Rate will allow us better thresholds now (e.g. interrupted_iterations{cause:error}: ["rate < 0.05"]), and its downsides can be somewhat ameliorated by #1321 when performance matters. Counter will be lighter for metrics processing, but won't allow us to set a sensible threshold until the thresholds are overhauled (#1441, #1443 and connected issues)... @sniku @mstoykov @imiric, what do you think?

@mstoykov
Copy link
Contributor

I like the idea of being able to differentiate between the different reasons k6 has interrupted an iteration.

I did want to see if we can just add it on top of the current Iterations metric ... but

  1. We don't emit it if the context is Done which is basically all the graceful* variants
  2. It will become harder to use
    Which probably means that this is not a good idea :(

I also think that Rate is a better fit at least given the current shortcomings of the threshold api ...

@na--
Copy link
Member Author

na-- commented Aug 13, 2020

We don't emit it if the context is Done which is basically all the graceful* variants

Even though it will be a minor breaking change, since we don't emit iterations: 1 for interrupted-by-executor iterations currently, there's something to be said for also not emitting it for interrupted-by-error iterations as well, especially if we have a new interrupted_iterations metric that will take it into account.

@na--
Copy link
Member Author

na-- commented Aug 13, 2020

Somewhat connected issue that can probably be implemented at the same time as this one: #1250

@imiric
Copy link
Contributor

imiric commented Dec 7, 2020

Hey, I started exploring this issue, but it's a bit difficult to gather precise requirements because of the amount of related issues and discussions. So far I'm mostly working based on this comment.

One question so far: #1007 also introduced ExecutionState.interruptedIterationsCount:

https://github.com/loadimpact/k6/blob/420dd4161c280bdf0398af5e58f73c939057695a/lib/execution.go#L218-L223

This is not a metric, but an internal counter incremented by specific executors, and is shown in the test run UI (X complete and Y interrupted iterations).

At first glance it seems that the proposed interrupted_iterations metric should serve the same purpose, so questions about reconciling this:

  1. Right now it's simple enough for fail() to emit an interrupted_iterations metric, but it seems impossible for the js/modules/k6 package to access the ExecutionState, we can only get the lib.State from the context.
    lib.State and lib.ExecutionState have some conceptual overlap (they both have lib.Options for example), so could we consider a merger of these into a single struct?

  2. If there's an interrupted_iterations metric, can we remove the internal interruptedIterationsCount and have executors emit this metric instead? Also, removing it from the UI as well, so I guess only showing X complete iterations?
    This should be simpler than a refactor to merge both states or expose ExecutionState elsewhere, but it won't be shown in real-time anymore and will only appear in the end-of-test summary.

@na--
Copy link
Member Author

na-- commented Dec 7, 2020

At first glance it seems that the proposed interrupted_iterations metric should serve the same purpose

This is somewhat true, but the devil, as always, is in the details... The ExecutionState.interruptedIterationsCount atomic is only meant for the live progressbar UI, to give a better idea of what the test run is doing in real time. The proposed interrupted_iterations aims to expose that same information to the metrics subsystem, as discrete events every time an iteration is interrupted, so the data is available in the thresholds, end-of-test summary, external outputs, etc.

It'd be difficult to drop ExecutionState.interruptedIterationsCount even after the interrupted_iterations metric is implemented, since we'd have to add something listens for interrupted_iterations data points in real time and displays them in the UI... It's certainly possible to do, but it will be much more inefficient than the current solution and a bit of a spaghettification of the code. It's similar to the progressbar counters for a shared-iterations executor and the iterations metric - we both emit an iterations data point, and we increment a counter, since those are two conceptually different systems.

And while currently throw Error() and fail() don't count towards interruptedIterationsCount, they should. And connected to that,

Right now it's simple enough for fail() to emit an interrupted_iterations metric, but it seems impossible for the js/modules/k6 package to access the ExecutionState, we can only get the lib.State from the context

The easiest way to detect failed iterations would be in getIterationRunner(). And you should be able to "throw" a custom struct error type from fail(), in order to distinguish it from normal JS exceptions, and to also attach extra metric tags in fail(). Then you should be able to type assert for it in getIterationRunner().

lib.State and lib.ExecutionState have some conceptual overlap (they both have lib.Options for example), so could we consider a merger of these into a single struct?

They don't really overlap all that much. A better name for lib.State would probably have been lib.VUState, since each VU has its own instance of it, whereas there is only one lib.ExecutionState and it's used by the local execution scheduler. So I don't see a way we can merge them. lib.State is definitely overdue for a big refactoring, so it's more usable by extensions and even k6 code modules, but that's another topic for the future...

2. If there's an interrupted_iterations metric, can we remove the internal interruptedIterationsCount and have executors emit this metric instead? Also, removing it from the UI as well, so I guess only showing X complete iterations?
This should be simpler than a refactor to merge both states or expose ExecutionState elsewhere, but it won't be shown in real-time anymore and will only appear in the end-of-test summary.

As I mentioned above, unless I am very much mistaken, you should be able to update both the atomic and emit the metric from a single place across all executors, the getIterationRunner() helper. So, given how small the upkeep for having the extra real-time information is, I don't see a need to remove it...

imiric pushed a commit that referenced this issue Dec 14, 2020
This avoids outputting "GoError" as mentioned in
#877 (comment)
imiric pushed a commit that referenced this issue Dec 15, 2020
This avoids outputting "GoError" as mentioned in
#877 (comment)
imiric pushed a commit that referenced this issue Dec 17, 2020
This avoids outputting "GoError" as mentioned in
#877 (comment)
@imiric imiric modified the milestones: v0.30.0, v0.31.0 Jan 13, 2021
@na-- na-- modified the milestones: v0.31.0, v0.32.0 Feb 24, 2021
@na-- na-- modified the milestones: v0.32.0, v1.0.0 Apr 14, 2021
imiric pushed a commit that referenced this issue Jun 7, 2021
This avoids outputting "GoError" as mentioned in
#877 (comment)
imiric pushed a commit that referenced this issue Jun 8, 2021
This avoids outputting "GoError" as mentioned in
#877 (comment)
@na--
Copy link
Member Author

na-- commented Jun 15, 2022

For future reference, #1769 was an old attempt to implement this that was blocked by #1321 and became stale.

@na-- na-- modified the milestones: v1.0.0, TBD Nov 9, 2022
harrytwigg pushed a commit to APITeamLimited/globe-test that referenced this issue Jan 11, 2023
This avoids outputting "GoError" as mentioned in
grafana/k6#877 (comment)
@codebien codebien removed this from the TBD milestone Sep 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants