po2json: Make sure that --fuzzy and --removeuntranslated can be used together. #4257

rv0 · 2021-02-02T15:38:15Z

Currently when you combine the options --fuzzy and --removeuntranslated together, the fuzzy translations are also removed.
Both of these options work fine on their own, the problem lies in the check itself, that currently ignores the fuzzy parameter and will skip any fuzzy strings when --removeuntranslated is passed on.

The following PR will improve the logic in the "skip" check, to make sure that:

When --removeuntranslated is passed, the string must be empty or missing to be skipped
or
When --fuzzy is passed, the string will be included

rv0 · 2021-02-02T16:19:54Z

I made a mistake, will fix :)

…uzzy items

rv0 · 2021-02-02T16:49:25Z

Not sure how to fix/handle/interpret the flake8 test that is failing.

Also, would love to add a unittest for these changes, but I haven't figured out yet how to run them

nijel

Your changes break several po2json tests, can you please look at it?
The flake8 issues can be easier addressed by running black code formatter, see http://docs.translatehouse.org/projects/translate-toolkit/en/latest/developers/styleguide.html
Testing is documented at http://docs.translatehouse.org/projects/translate-toolkit/en/latest/developers/testing.html

rv0 · 2021-02-02T19:43:37Z

@nijel
Thanks for providing me the information, got unit tests to work eventually and noticed the flaws of my initial patch.

However, I don't think the test cases reflect the expected behaviour, so I changed them too. This might be radical so allow me to explain the logic used,

given that we start with po input:

#: .foo
msgid "foo"
msgstr "oof"

#: .bar
#, fuzzy
msgid "bar"
msgstr "rab"

#: .baz
#, fuzzy
msgid "baz"
msgstr ""

#: .qux
msgid "qux"
msgstr ""
"""

And json file:

{
    "foo": "foo",
    "bar": "bar",
    "baz": "baz",
    "qux": "qux"
}

I adapted the following tests:

include_fuzzy=false and remove_untranslated=false

This was the expected output:

{
    "foo": "oof",
     "bar": "bar",
     "baz": "baz",
     "qux": "qux"
}

As both bar and baz are fuzzy, they should be removed if include_fuzzy is false:

{
    "foo": "oof",
     "qux": "qux"
}

include_fuzzy=true and remove_untranslated=true

This was the expected output:

{
    "foo": "oof"
}

As bar is fuzzy and translated, imo it should also be included:

{
    "foo": "oof",
     "bar": "bar",
}

In closing..

These changes seem logical to me, however doing a search in the issue queue I see they have caused confusion to others in the past, some examples:
#3456
#3550

Outside of the scope of this change however, I also think the code could be simplified a lot by adding an "allow_fuzzy=False" parameter to the global is_translated function, or even better just removing the fuzzy check from that function all together (because it's an assumption that is only clear when you read the source code), "not translated" and "fuzzy" mean two completely different things to me, but then again there may be conceptual or historical reasons behind all this that I'm completely missing.

nijel · 2021-02-03T17:48:12Z

Outside of the scope of this change however, I also think the code could be simplified a lot by adding an "allow_fuzzy=False" parameter to the global is_translated function, or even better just removing the fuzzy check from that function all together (because it's an assumption that is only clear when you read the source code), "not translated" and "fuzzy" mean two completely different things to me, but then again there may be conceptual or historical reasons behind all this that I'm completely missing.

This depends on how you interpret "is translated". Right now it means "the translation is completed", which definitely should exclude fuzzy (and other needing review states, as in xliff). I don't think it's reasonable to change the semantics as that could easily silently break third-party code. So, the unit is either untranslated, fuzzy or translated (or approved, but that's out of scope here).

rv0 · 2021-02-03T18:20:32Z

Outside of the scope of this change however, I also think the code could be simplified a lot by adding an "allow_fuzzy=False" parameter to the global is_translated function, or even better just removing the fuzzy check from that function all together (because it's an assumption that is only clear when you read the source code), "not translated" and "fuzzy" mean two completely different things to me, but then again there may be conceptual or historical reasons behind all this that I'm completely missing.

This depends on how you interpret "is translated". Right now it means "the translation is completed", which definitely should exclude fuzzy (and other needing review states, as in xliff). I don't think it's reasonable to change the semantics as that could easily silently break third-party code. So, the unit is either untranslated, fuzzy or translated (or approved, but that's out of scope here).

There's not really any indication that it means that a translation is completed, the comments in code say:

Indicates whether this unit is translated.

As for the semantics, the GNU documentation does more or less say a fuzzy entry is a translated entry, for must purposes, only from the viewpoint of the translator one that needs additional review.
I think the fact that the difference is not clear is potentially a source of confusion and bugs.

In any case, the changes in my PR are made with the idea in mind that a fuzzy translation is a translation. Any comments on that?

nijel · 2021-02-03T19:28:06Z

I have no clue about the original intents behind this behavior, I'm just describing how it behaves. Most of the things in the translate-toolkit are modelled around XLIFF, so it might be also the reason for this behavior.

Another issue with --removeuntranslated and --no-fuzzy is that each convertor implements this on its own and the behavior is not consistent. This is described in issues you've referenced.

include_fuzzy=false and remove_untranslated=false
As both bar and baz are fuzzy, they should be removed if include_fuzzy is false:

The question is whether they should be removed (what you implement) or source used instead (what is current implementation). po2prop and po2dtd both include source string in this case.

include_fuzzy=true and remove_untranslated=true
As bar is fuzzy and translated, imo it should also be included:

po2prop doesn't include fuzzy in this case and po2dtd does.

rv0 · 2021-02-03T20:02:09Z

I understand. So how to continue from here?

In my use case, I need to make sure no source strings end up in the localized json, as the strings are used with speech synthesis and this will make for example an English word to be spoken with a French voice. Not very good ;) If I don't include the source string, the application will automatically fall back to the English word and speak it with an English voice. That's fine.
On the other hand some words are machine translated and thus fuzzy, I want to include these in the localized json as in most cases the translation is good or good enough (using generic short sentences and machine learning driven translations is surprisingly effective nowadays).

I'd be happy to perform additional work on this PR if needed, I think my use-case is not that uncommon so a solution would benefit others, but I'm going to need some guidance on the approach to take. Perhaps additional parameters should be added?

nijel · 2021-02-03T21:19:37Z

In the end the "correct" behavior for the first case depends on usage of the resulting file - whether all keys should be present there or not. And I think --removeuntranslated is there to indicate this and in case it's not present, then all keys should be present in the output, even with the source strings. Therefore, I'd prefer reverting this part of the PR.

For fuzzy inclusion, I think the behavior you've made makes more sense - when you explicitly specify --fuzzy you want these strings to be present (as --nofuzzy is the default behavior). Even with --removeuntranslated.

Don't get me wrong, I just don't want to merge change which will end up breaking existing workflows and this easily can do that. The perfect solution would be to have a single logic used by all the convertors (see #3573).

rv0 · 2021-02-03T21:29:27Z

Does any of the other converters use some sort of parameter to disable the inclusion of source strings? If so I can add that as well (either here or in another PR)

nijel · 2021-02-03T21:54:41Z

I think --removeuntranslated is really the parameter which controls this, but it's supported only by few converters.

Wim Vanheste added 2 commits February 2, 2021 16:33

Make sure that --fuzzy and --removeuntranslated can be used together.

8ef7f3a

po2json formatting fix hanging indent

a338f6e

Take into account that inputunit.istranslated() is always false for f…

c74fd86

…uzzy items

nijel reviewed Feb 2, 2021

View reviewed changes

rv0 marked this pull request as draft February 2, 2021 18:31

Fix po2json logic and adapt unit tests to expected result

18fb47b

rv0 marked this pull request as ready for review February 2, 2021 19:43

nijel mentioned this pull request Apr 22, 2021

po2prop not working as expected #4322

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

po2json: Make sure that --fuzzy and --removeuntranslated can be used together. #4257

po2json: Make sure that --fuzzy and --removeuntranslated can be used together. #4257

rv0 commented Feb 2, 2021

rv0 commented Feb 2, 2021

rv0 commented Feb 2, 2021

nijel left a comment

rv0 commented Feb 2, 2021 •

edited

Loading

nijel commented Feb 3, 2021

rv0 commented Feb 3, 2021

nijel commented Feb 3, 2021

rv0 commented Feb 3, 2021

nijel commented Feb 3, 2021

rv0 commented Feb 3, 2021

nijel commented Feb 3, 2021

po2json: Make sure that --fuzzy and --removeuntranslated can be used together. #4257

Are you sure you want to change the base?

po2json: Make sure that --fuzzy and --removeuntranslated can be used together. #4257

Conversation

rv0 commented Feb 2, 2021

rv0 commented Feb 2, 2021

rv0 commented Feb 2, 2021

nijel left a comment

Choose a reason for hiding this comment

rv0 commented Feb 2, 2021 • edited Loading

include_fuzzy=false and remove_untranslated=false

include_fuzzy=true and remove_untranslated=true

In closing..

nijel commented Feb 3, 2021

rv0 commented Feb 3, 2021

nijel commented Feb 3, 2021

rv0 commented Feb 3, 2021

nijel commented Feb 3, 2021

rv0 commented Feb 3, 2021

nijel commented Feb 3, 2021

rv0 commented Feb 2, 2021 •

edited

Loading