Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error description when NMEA message truncated #53

Merged
merged 2 commits into from
Mar 12, 2020

Conversation

idg10
Copy link
Contributor

@idg10 idg10 commented Mar 10, 2020

Resolves #45

@idg10 idg10 self-assigned this Mar 10, 2020
@HowardvanRooijen
Copy link
Contributor

Can we have an ADR for this change please?

@idg10 idg10 mentioned this pull request Mar 11, 2020
@idg10
Copy link
Contributor Author

idg10 commented Mar 11, 2020

Can we have an ADR for this change please?

@HowardvanRooijen it's not an architectural change. Is "fixed a bug" an appropriate use for an ADR?

The design was and still is that we detect when messages do not conform to the standard. There was one particular case we failed to detect, and as a result we threw an error later on as a result of attempting an out of bounds read from a ReadOnlySpan<byte>. This error provided no information to the caller, and that is not how this library is supposed to work.

That's why the issue associated with this PR was labelled as a bug. There is no architecture or design change here; this change properly aligns the implementation with the design.

I've added #54 to track the fact that we have no ADRs, because adding one ADR that provides just the background to this bugfix is a separate bit of work.

@HowardvanRooijen
Copy link
Contributor

We need to capture the new knowledge we have gained, or that our existing assumptions were flawed by omission of more real world test data.

But we need to capture this knowledge somewhere.

Doesn't have to be part of this PR, but it might help future consumers of this library, who may take the code, run it and say "well that's rubbish, it doesn't work for my dataset"...

@idg10
Copy link
Contributor Author

idg10 commented Mar 11, 2020

I don't agree that this demonstrates that existing assumptions were flawed. We have always presumed that problems of exactly this kind would occur, and we already had several test cases based on that assumption. This PR adds one more test based on that same assumption as the others that were already there.

All code contains bugs, and the more widely you use a component, the more those bugs will come out. I fully expect more issues of this kind to come up in the future because it seems unlikely that we've found them all yet. And when that happens, it won't necessitate a change in assumptions. We knew this would probably happen despite our best efforts to avoid it. It has happened. In all liklihood it will happen again.

(Conversely #47 genuinely is driven by a change in assumptions, and will warrant an ADR.)

@mwadams
Copy link

mwadams commented Mar 11, 2020

I don't agree that this demonstrates that existing assumptions were flawed. We have always presumed that problems of exactly this kind would occur, and we already had several test cases based on that assumption. This PR adds one more test based on that same assumption as the others that were already there.

All code contains bugs, and the more widely you use a component, the more those bugs will come out. I fully expect more issues of this kind to come up in the future because it seems unlikely that we've found them all yet. And when that happens, it won't necessitate a change in assumptions. We knew this would probably happen despite our best efforts to avoid it. It has happened. In all liklihood it will happen again.

(Conversely #47 genuinely is driven by a change in assumptions, and will warrant an ADR.)

I think Howard is saying "we should consider how we document the existence of this class of problem so that people know it when they see it" - and a note on the "write the ADRs" issue might be the right place to do it rather than on this PR.

@idg10
Copy link
Contributor Author

idg10 commented Mar 11, 2020

I've created #55 to add a retro-ADR on error reporting strategy.

In my earlier comments I was trying to work out whether I'd missed something, because I'm being asked to capture some information but I don't know what that information is. I don't believe I have any new insights as a result of discovering this bug, so I'm not sure what there is here to capture. When it comes to root cause analysis, I would say the problem is that we have access to a limited amount of data for testing purposes. The number of ways a message could be malformed is limitless, so we will never have enough test data to provide every possible example, so from time to time we will encounter failure modes we did not anticipate. (We could ask one more "why?": why are we reliant on example data to discover the ways that messages can be malformed? We don't strictly have to be: formal analysis could enable us to do a better job of anticipating possibilities; if we used a formal-grammar-driven implementation then this kind of error might never have emerged. But the cost-benefit ratio does not look good, particularly since the basic failure mode is "bad messages of this form will not be processed"; the impact of this bug is limited purely to the diagnostic power of the error message.)

we should consider how we document the existence of this class of problem so that people know it when they see it

I've been trying to think how I could do that. A bug in one place in the code resulted in a confusing exception being thrown in a later part of the code. That's a very broad class of problem, so we might think we could narrow it down to the cases where the bug was in the parser code that checks that the basic structure is OK and the crash occurred in the code that interprets what's there. But we're still also in that broad category of errors (undetected mistake, time passes, eventual failure) and these by their nature produce failure modes that are hard to predict. This can make it impossible to provide a guide for inferring the nature of the problem from the symptom: it's like trying to work out what specific mistake Wile E. Coyote made when the only available evidence is the cloud of dust rising out of the ravine.

This is exactly why the code goes to some lengths to check that the basic structure is OK. (And our users have reported to us that in most cases the error messages have in fact provided sufficient information, and that this was an exception.) But with bugs like this where there's a type of problem we did not anticipate, I don't know how we could prepare people to "know it when they see it" because I'm unable to predict exactly what the failure will look like. (It'll look exactly the same as any other bug in that broad category where an undetected mistake results in an exception later on.) I didn't know what the problem was until I stepped through the code.

@idg10 idg10 assigned carmeleve and unassigned carmeleve Mar 12, 2020
@idg10 idg10 requested a review from carmeleve March 12, 2020 10:12
@idg10 idg10 merged commit 7c8f818 into master Mar 12, 2020
@idg10 idg10 deleted the feature/45-check-truncation branch March 12, 2020 10:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NmeaLineParser does not check for message truncation if payload partially present
4 participants