-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
False positive mixed_case_recommended_field for numeric values #1402
Comments
Thanks for opening your first issue in this project! If you haven't already, you can join our slack and join the #gtfs-validators channel to meet our awesome community. Come say hi 👋! |
@isabelle-dr How do we want to approach this? It looks like in some cases when numbers are present, we don't want to throw the warning: In those cases we could skip the mixed case validation altogether, but how about longer strings? Should we have a length minimum or number of letters in a string before mixed case is considered? Thinking about strings like:
etc |
@briandonahue thanks for having a look at this. |
@isabelle-dr I had a couple thoughts. 1) I think the current implementation has another flaw - I believe it only checks for the 26 letters of the basic alphabet
We may still get some false positives if there are values such as // if more than one non-numeric, non-whitespace character
if((value.length() - countNumericChars(value) - countWhiteSpaceChars(value)) > 1) &&
// AND the value is all upper OR lower case characters
(value.toLowerCase() == value || value.toUpperCase() == value){
notices.add(new MixedFieldNotice());
} |
I like this idea!
|
@isabelle-dr I thought about this, but I think we want to keep it for long words with no spaces such as Another approach would be to validate any values with strings of more than 1 letter, regardless of spaces. In this case |
After further thought, I think the simplest logic that will cover most cases is to validate any value that has more than one consecutive letter, regardless of numbers or spaces. It will flag things like Pseudo-code for simpler logic: // any consecutive non-numeric, non-space characters (2 or more sequential letters)
if (/[0-9 ]{2,}/.test(value) &&
// AND the value is all upper OR lower case characters
(value.toLowerCase() == value || value.toUpperCase() == value)){
notices.add(new MixedFieldNotice());
} |
These are very good points, thank you for giving it some thought @briandonahue! |
@isabelle-dr I've got this working (modified to allow non-English characters, etc) but I noticed the original issue lists |
Why not more than two? Two-letter codes for routes are almost too common. Apart from the example from Warsaw Commuter Rail, rapid suburban railway services in Berlin use them (RE and RB), Singapore MTR (EW, NS, NE, CC, DT and TE), Taipei Metro (BL and BR), rail services in Tokyo (too many to list), and many more. I'd even go as far as too say that 3-letter codes are also relatively common, see Paris RER or Gemarny's ICE. Asian networks also use acronyms like MRT, LRT pretty often. The posted pseudocode earlier is arguably even worse. It flags things without letters at all, or flags a very common numbering scheme where a number is prefixed or suffixed with a single letter. The idea of |
A much more reliable strategy would be to rely on Unicode character properties. Catching 3 consecutive upper case letters can be done with something like |
@MKuranowski Thank you for the feedback! I realized some of this during my testing, and changed the approach in the draft PR. Here is a link to the current implementation, it is generated so a little hard to read, but it ends up being this: if (value.matches(".*\\p{L}{2}.*") &&
(value.toLowerCase() == value || value.toUpperCase() == value)) {
// add notice
} We can adjust the requirement such that we only check sequential letters of 4 or more, or whatever we choose. I can also add better test cases like the ones you included above and any other suggestions. I simplified the check to compare upper cased and lower cased values to the original, let me know if you see any flaw with that approach. |
The main problem with The My regex flags
|
@MKuranowski Another possibility would be to match all words/tokens greater than 3 cased characters, (no numbers, no non-case script characters), and then check that each conforms to some rules (off the top of my head, no all-caps, and at least some with initial caps?). I don't know though... at some point this is just a warning / suggestion and I think as long as we catch the bulk of the fields that should be corrected, users can ignore the warning for fields that are falsely flagged. Also, as @isabelle-dr mentions here, in the future rules may be configurable so users that are dealing with a lot of false flags for this warning can opt to disable it. @davidgamez @bdferris-v2 @isabelle-dr curious what your thoughts are. |
This is the compromise I've come up with so far...
This validates almost every case presented as desired with a couple exceptions:
Find the current test set here and the generated Java logic looks like this:
|
Why not dropping altogether the mixed-case policy for route_short_name and trip_short_name? There is already a length warning on route_short_name for example, and given the data I've seen I think it's hard to determine easily and reliably if upper or lower case are OK or not, especially using non-Latin charsets (where the concept of "lowercase" is sometimes unknown). Anyway the current state of this rule is raising way too much warnings to be useful, something has to be done. |
@laurentg thanks for the ping and reminder that these false positives should be removed quickly. Before making the decision to remove this rule, I would like to see how the logic implemented by @briandonahue in PR#1430 is doing.
|
@isabelle-dr Do we have an existing tool/process for querying the validation results for the Mobility Dataset? I'm not sure how to go about that other than pulling down results from a test run on Github and using a JSON CLI query tool to try to query the JSON results? |
Hello, {
"filename": "stops.txt",
"fieldName": "stop_name",
"fieldValue": "\u00cenv\u0103\u021b\u0103torilor",
"csvRowNumber": 436
} which is an unicode-escaped string for: Învățătorilor, which has the correct mixed case format. |
@briandonahue I am not sure your question is still relevant, but here is a vague answer to it 😅. Since we built the acceptance tests, we've been leveraging them by doing things like this, and then running analytics on the acceptance test report with python & pandas in a Google Colab. |
Describe the bug
The
mixed_case_recommended_field
warning is always triggered when a field contains digits, like route_short_name=A1
or trip_short_name=101
.Characters which are not letters should not be considered at all when comparing case.
Steps/Code to Reproduce
Validate a file which uses numbers in fields like
route_short_name
ortrip_short_name
.Expected Results
No
mixed_case_recommended_field
warning for those fileds.Actual Results
Screenshots
No response
Files used
https://mkuran.pl/gtfs/wkd.zip
Validator version
4.1.0
Operating system
Debian 12
Java version
openjdk version "17.0.6" 2023-01-17
Additional notes
No response
The text was updated successfully, but these errors were encountered: