False positive mixed_case_recommended_field for numeric values #1402

MKuranowski · 2023-04-28T00:53:02Z

Describe the bug

The mixed_case_recommended_field warning is always triggered when a field contains digits, like route_short_name=A1 or trip_short_name=101.

Characters which are not letters should not be considered at all when comparing case.

Steps/Code to Reproduce

Validate a file which uses numbers in fields like route_short_name or trip_short_name.

Expected Results

No mixed_case_recommended_field warning for those fileds.

Actual Results

{
      "code": "mixed_case_recommended_field",
      "severity": "WARNING",
      "totalNotices": 360,
      "sampleNotices": [
        {
          "filename": "routes.txt",
          "fieldName": "route_short_name",
          "fieldValue": "A1",
          "csvRowNumber": 2
        },
        {
          "filename": "routes.txt",
          "fieldName": "route_short_name",
          "fieldValue": "ZA12",
          "csvRowNumber": 3
        },
        {
          "filename": "trips.txt",
          "fieldName": "trip_short_name",
          "fieldValue": "301",
          "csvRowNumber": 2
        },
        {
          "filename": "trips.txt",
          "fieldName": "trip_short_name",
          "fieldValue": "5301",
          "csvRowNumber": 3
        },
        {
          "filename": "trips.txt",
          "fieldName": "trip_short_name",
          "fieldValue": "103",
          "csvRowNumber": 4
        },
        ...
}

Screenshots

No response

Files used

https://mkuran.pl/gtfs/wkd.zip

Validator version

4.1.0

Operating system

Debian 12

Java version

openjdk version "17.0.6" 2023-01-17

Additional notes

No response

The text was updated successfully, but these errors were encountered:

welcome · 2023-04-28T00:53:04Z

Thanks for opening your first issue in this project! If you haven't already, you can join our slack and join the #gtfs-validators channel to meet our awesome community. Come say hi 👋!

Welcome to the community and thank you for your engagement in open source! 🎉

briandonahue · 2023-04-28T14:55:35Z

@isabelle-dr How do we want to approach this? It looks like in some cases when numbers are present, we don't want to throw the warning:
A1
103B
1205

In those cases we could skip the mixed case validation altogether, but how about longer strings? Should we have a length minimum or number of letters in a string before mixed case is considered? Thinking about strings like:

MY ROUTE 100
route 1 boulevard

etc

isabelle-dr · 2023-05-01T17:56:33Z

@briandonahue thanks for having a look at this.
After giving this some thought, here is a proposal:
We could modify the logic so we only run this check if the field value contains at least one " " character and if it doesn't contain only numbers.
This means that the following values would be valid: ZA12, 1205, 1205 1.

cc @tzujenchanmbd

briandonahue · 2023-05-02T13:59:08Z

@isabelle-dr I had a couple thoughts. 1) I think the current implementation has another flaw - I believe it only checks for the 26 letters of the basic alphabet /[a-zA-Z]/ which will not match characters such as àâçéèêëîïôûùüÿñæœ (and probably others).

Given the above, I think maybe a better check would be:
- the value contains more than one "letter" (defined as non-numeric, and non-white-space characters)
- AND the original value is the same as the value after being converted to all upper OR all lower case (this will of course not affect numbers and white space, but should handle other characters more efficiently than a regex)

We may still get some false positives if there are values such as AA-103 or B-34-C or D-E etc. But it is a recommendation/warning not an error, and it shoudl cut down significantly on false positives for numeric values or single-letter + numbers.

// if more than one non-numeric, non-whitespace character
if((value.length() - countNumericChars(value) - countWhiteSpaceChars(value)) > 1) &&
// AND the value is all upper OR lower case characters
(value.toLowerCase() == value || value.toUpperCase() == value){
    notices.add(new MixedFieldNotice());
}

isabelle-dr · 2023-05-02T15:34:00Z

I like this idea!
We would still get the notice for something like ZA12, right?
Should we also add the following condition so that we differentiate acronyms (such as ZA12) from the natural text (such as Route 1B)?

the field value contains at least one " " character

briandonahue · 2023-05-02T15:47:40Z

I like this idea! We would still get the notice for something like ZA12, right? Should we also add the following condition so that we differentiate acronyms (such as ZA12) from the natural text (such as Route 1B)?

the field value contains at least one " " character

@isabelle-dr I thought about this, but I think we want to keep it for long words with no spaces such as Lancaster?

Another approach would be to validate any values with strings of more than 1 letter, regardless of spaces. In this case A 101 and 342B would be ignored but RTE 101, RTE23 and route 1B would be flagged. If we don't want something like RTE23 to be flagged, we could be more strict and ignore strings of mixed letters and numbers (but no spaces). So A12387abi would be ignored but ROUTE A2348B would be flagged.

briandonahue · 2023-05-02T16:33:10Z

After further thought, I think the simplest logic that will cover most cases is to validate any value that has more than one consecutive letter, regardless of numbers or spaces. It will flag things like 123ABC but that seems like it would be uncommon, and it's only a warning... we could also re-visit if users complain, and add some more complex logic, to ignore mixed number/letter strings or something.

Pseudo-code for simpler logic:

// any consecutive non-numeric, non-space characters (2 or more sequential letters)
if (/[0-9 ]{2,}/.test(value)  &&
// AND the value is all upper OR lower case characters
(value.toLowerCase() == value || value.toUpperCase() == value)){
    notices.add(new MixedFieldNotice());
}

isabelle-dr · 2023-05-02T21:35:28Z

These are very good points, thank you for giving it some thought @briandonahue!
What you shared in the last message sounds great 👍

briandonahue · 2023-05-03T15:52:00Z

@isabelle-dr I've got this working (modified to allow non-English characters, etc) but I noticed the original issue lists ZA12 as a potential false positive. Thoughts? Should we leave it as a warning that can be ignored, or should we consider some larger allowable number of characters mixed with numbers in a single string 😅

MKuranowski · 2023-05-04T01:17:44Z

more than one consecutive letter

Why not more than two?

Two-letter codes for routes are almost too common. Apart from the example from Warsaw Commuter Rail, rapid suburban railway services in Berlin use them (RE and RB), Singapore MTR (EW, NS, NE, CC, DT and TE), Taipei Metro (BL and BR), rail services in Tokyo (too many to list), and many more.

I'd even go as far as too say that 3-letter codes are also relatively common, see Paris RER or Gemarny's ICE. Asian networks also use acronyms like MRT, LRT pretty often.

The posted pseudocode earlier is arguably even worse. It flags things without letters at all, or flags a very common numbering scheme where a number is prefixed or suffixed with a single letter.

The idea of value.toLowerCase() == value || value.toUpperCase() == value makes no sense if value contains characters which don't have the notion of case - which is most of characters. Digits, and pretty much every other alphabet other than Latin, Greek or Cyrylic will always return true on such a test.

MKuranowski · 2023-05-04T01:44:47Z

A much more reliable strategy would be to rely on Unicode character properties.

Catching 3 consecutive upper case letters can be done with something like /\p{Lu}{3,}/u.test(value).
3 consecutive lower case letters are more tricky, as something like /\p{Ll}{3,}/u.test(value) also catches things like Orange. I think something like (?<!\p{LC})\p{Ll}{3,}/u.test(value) (3 consecutive lower case letters not preceded by any other cased letter) works: https://regex101.com/r/GVPMmO/1.

briandonahue · 2023-05-04T14:58:54Z

@MKuranowski Thank you for the feedback! I realized some of this during my testing, and changed the approach in the draft PR. Here is a link to the current implementation, it is generated so a little hard to read, but it ends up being this:

      if (value.matches(".*\\p{L}{2}.*") && 
          (value.toLowerCase() == value || value.toUpperCase() == value)) {
      // add notice
      }

We can adjust the requirement such that we only check sequential letters of 4 or more, or whatever we choose. I can also add better test cases like the ones you included above and any other suggestions.

I simplified the check to compare upper cased and lower cased values to the original, let me know if you see any flaw with that approach.

MKuranowski · 2023-05-05T02:18:07Z

The main problem with ZA12 is still not solved though; for that 2 consecutive letters with the same case should be allowed. As I mentioned earlier, there are a lot of systems which use 2-letter codes for routes; so I don't think this should be an issue.

The v.toLowerCase() == v || v.toUpperCase() == v is not able to catch mixed-case situations (like route 1B), and, more importantly, still breaks for non-case scripts (급행12 or 東西線). The second situation could be mitigating to only matching on cased letters (\p{LC}), not just any letter from any script.

My regex flags route 1B correctly, but then has false positives on Another good value and Avenue des Champs-Élysées. Those contain all-lower-case words (good or des). I don't think it's possible to correctly categorize all 3 strings in a straightforward way, though.

v.matches(".*\\p{LC}{3}.*") && (v.toLowerCase() == v || v.toUpperCase() == v) seems to correctly categorize all mine and already-existing test cases, with the sole exception of route 1B.

briandonahue · 2023-05-08T00:29:18Z

@MKuranowski Another possibility would be to match all words/tokens greater than 3 cased characters, (no numbers, no non-case script characters), and then check that each conforms to some rules (off the top of my head, no all-caps, and at least some with initial caps?). I don't know though... at some point this is just a warning / suggestion and I think as long as we catch the bulk of the fields that should be corrected, users can ignore the warning for fields that are falsely flagged.

Also, as @isabelle-dr mentions here, in the future rules may be configurable so users that are dealing with a lot of false flags for this warning can opt to disable it.

@davidgamez @bdferris-v2 @isabelle-dr curious what your thoughts are.

briandonahue · 2023-05-16T19:49:11Z

@MKuranowski @davidgamez

This is the compromise I've come up with so far...

Split text into letter-only tokens (split on white-space, numbers, and punctuation).
Ignore any tokens that are 3 characters or less
Any tokens that are > 3 chars, must start with an uppper case letter and the rest of the token must be lower case (Title Case -ish, but TIL that Title Case means different things in different alphabets)

This validates almost every case presented as desired with a couple exceptions:
Another good value will get flagged, it should be Another Good Value

MixedCase will get flagged - I'm not sure how common an upper case letter is in the middle of a token without punctuation or other characters, but this seemed like a good general rule.

Find the current test set here

and the generated Java logic looks like this:

String[] tokens = value.split("[^\\p{L}]+");
boolean isValid = true;
for (String token : tokens) {
  if (!((token.length() <= 3) || 
      (Character.isUpperCase(token.charAt(0)) && 
       token.substring(1).toLowerCase().equals(token.substring(1))))) {
    isValid = false;
    break;
  }
}

laurentg · 2023-06-01T10:05:40Z

Why not dropping altogether the mixed-case policy for route_short_name and trip_short_name? There is already a length warning on route_short_name for example, and given the data I've seen I think it's hard to determine easily and reliably if upper or lower case are OK or not, especially using non-Latin charsets (where the concept of "lowercase" is sometimes unknown).

Anyway the current state of this rule is raising way too much warnings to be useful, something has to be done.

isabelle-dr · 2023-06-02T13:36:46Z

@laurentg thanks for the ping and reminder that these false positives should be removed quickly.

Before making the decision to remove this rule, I would like to see how the logic implemented by @briandonahue in PR#1430 is doing.
I suggest we do:

Look at for how many datasets from the Mobility Database this notice is disappearing with PR#1430.
Look at the list of remaining datasets that have this notice, after PR#1430.
Among them, what is the estimated % of false positives? (approx, no need for an exact number)
If more than ~5%, remove this notice from the validator until we find a better logic.

briandonahue · 2023-06-02T16:01:14Z

@isabelle-dr Do we have an existing tool/process for querying the validation results for the Mobility Dataset? I'm not sure how to go about that other than pulling down results from a test run on Github and using a JSON CLI query tool to try to query the JSON results?

szjozsef · 2023-06-07T13:45:20Z

Hello,
besides the false positives on the _short_name fields, I also have issue with this :

{
    "filename": "stops.txt",
    "fieldName": "stop_name",
    "fieldValue": "\u00cenv\u0103\u021b\u0103torilor",
    "csvRowNumber": 436
}

which is an unicode-escaped string for: Învățătorilor, which has the correct mixed case format.

isabelle-dr · 2023-06-10T16:51:48Z

@briandonahue I am not sure your question is still relevant, but here is a vague answer to it 😅.
We've been doing this ad-hoc, I believe a little differently every time.

Since we built the acceptance tests, we've been leveraging them by doing things like this, and then running analytics on the acceptance test report with python & pandas in a Google Colab.
Then, Brian opened this PR to avoid doing that, WIP.

MKuranowski added bug Something isn't working (crash, a rule has a problem) status: Needs triage Applied to all new issues labels Apr 28, 2023

briandonahue self-assigned this Apr 28, 2023

isabelle-dr added status: Needs discussion We need a discussion on requirements before calling this issue ready and removed status: Needs triage Applied to all new issues labels May 1, 2023

briandonahue mentioned this issue May 3, 2023

fix: Mixed Case false positives #1430

Merged

4 tasks

isabelle-dr added this to the 4.0.1 milestone May 25, 2023

isabelle-dr added this to Schedule Validator Q2 backlog May 25, 2023

isabelle-dr moved this to In Progress in Schedule Validator Q2 backlog May 29, 2023

davidgamez closed this as completed in #1430 Jun 11, 2023

github-project-automation bot moved this from In Progress to Done in Schedule Validator Q2 backlog Jun 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

False positive mixed_case_recommended_field for numeric values #1402

False positive mixed_case_recommended_field for numeric values #1402

MKuranowski commented Apr 28, 2023

welcome bot commented Apr 28, 2023

briandonahue commented Apr 28, 2023 •

edited

Loading

isabelle-dr commented May 1, 2023

briandonahue commented May 2, 2023

isabelle-dr commented May 2, 2023 •

edited

Loading

briandonahue commented May 2, 2023

briandonahue commented May 2, 2023 •

edited

Loading

isabelle-dr commented May 2, 2023

briandonahue commented May 3, 2023

MKuranowski commented May 4, 2023 •

edited

Loading

MKuranowski commented May 4, 2023

briandonahue commented May 4, 2023

MKuranowski commented May 5, 2023

briandonahue commented May 8, 2023

briandonahue commented May 16, 2023

laurentg commented Jun 1, 2023

isabelle-dr commented Jun 2, 2023

briandonahue commented Jun 2, 2023

szjozsef commented Jun 7, 2023

isabelle-dr commented Jun 10, 2023

False positive mixed_case_recommended_field for numeric values #1402

False positive mixed_case_recommended_field for numeric values #1402

Comments

MKuranowski commented Apr 28, 2023

Describe the bug

Steps/Code to Reproduce

Expected Results

Actual Results

Screenshots

Files used

Validator version

Operating system

Java version

Additional notes

welcome bot commented Apr 28, 2023

briandonahue commented Apr 28, 2023 • edited Loading

isabelle-dr commented May 1, 2023

briandonahue commented May 2, 2023

isabelle-dr commented May 2, 2023 • edited Loading

briandonahue commented May 2, 2023

briandonahue commented May 2, 2023 • edited Loading

isabelle-dr commented May 2, 2023

briandonahue commented May 3, 2023

MKuranowski commented May 4, 2023 • edited Loading

MKuranowski commented May 4, 2023

briandonahue commented May 4, 2023

MKuranowski commented May 5, 2023

briandonahue commented May 8, 2023

briandonahue commented May 16, 2023

laurentg commented Jun 1, 2023

isabelle-dr commented Jun 2, 2023

briandonahue commented Jun 2, 2023

szjozsef commented Jun 7, 2023

isabelle-dr commented Jun 10, 2023

briandonahue commented Apr 28, 2023 •

edited

Loading

isabelle-dr commented May 2, 2023 •

edited

Loading

briandonahue commented May 2, 2023 •

edited

Loading

MKuranowski commented May 4, 2023 •

edited

Loading