CLDR-18002 Update population and likely subtags for MU, TK, ZM and SL #4104

conradarcturus · 2024-10-02T23:52:12Z

There are 4 manual overrides in GenerateLikelySubtags.java that conflict with other data: for MU, SL, TK, and ZM.

For each country, the local language (mfe, kri, tkl, and bem) is spoken by far more people than English is, even if English is the main language of instruction. Education and literacy in each country is low enough that the local languages should be considered the dominant ones.

I was able to find censuses listing language characteristics for MU, TK and ZM. I wasn't able to find data for SL, but I removed its override as well.

To regenerate the data, use this command: `mvn package -DskipTests=true && java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData && java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags && java -jar tools/cldr-code/target/cldr-code.jar GenerateTestData`

CLDR-18002

This PR completes the ticket.

ALLOW_MANY_COMMITS=true

jira-pull-request-webhook · 2024-10-25T17:40:59Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-10-29T17:35:02Z

Notice: the branch changed across the force-push!

common/supplemental/supplementalData.xml is different
common/testData/localeIdentifiers/likelySubtags.txt is different
common/testData/localeIdentifiers/localeDisplayName.txt is now changed in the branch
tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java is different
tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

macchiati · 2024-10-29T23:08:05Z

I'm generally ok with this, but for the following:

> Education and literacy in each country is low enough that the local languages should be considered the dominant ones.

I think that would clearly be the case for voice UIs. Not so clear for text UIs. We should discuss more about how to cleanly segment those. For example, we might have a separate set of likely subtags / locale matching for voice than for text.

jira-pull-request-webhook · 2024-10-30T14:33:43Z

Notice: the branch changed across the force-push!

common/supplemental/likelySubtags.xml is different
common/supplemental/supplementalData.xml is different
common/testData/localeIdentifiers/likelySubtags.txt is different
common/testData/localeIdentifiers/localeDisplayName.txt is no longer changed in the branch
tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

conradarcturus · 2024-10-30T15:41:30Z

> we might have a separate set of likely subtags / locale matching for voice than for text.

@macchiati I think it's a great idea to differentiate likely subtags for voice and text content. We could perhaps also make a policy for macrolanguages, e.g. the Arabic, Chinese, and Fulah dialects. Most text would be best classified as just zh/ar/ff. Spoken content, however, differs significantly among the constituent dialects (yue, cmn, apc, ary, fuv, ...). Do we have any initiatives to get CLDR/Unicode to work better for spoken content?
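To make the text/voice distinction concrete, here is a minimal sketch of what a modality-aware classification could look like. Everything in it (the `ModalitySketch` class, the `Modality` enum, the `classify` method) is hypothetical and is not an existing CLDR API; the only data used is the macrolanguage membership of the codes mentioned above.

```java
import java.util.Map;

/** Hypothetical sketch: collapse dialects to their macrolanguage for text, keep them for voice. */
final class ModalitySketch {
    enum Modality { TEXT, VOICE }

    // Macrolanguage membership for the dialect codes mentioned in the comment above.
    static final Map<String, String> MACROLANGUAGE = Map.of(
            "yue", "zh",
            "cmn", "zh",
            "apc", "ar",
            "ary", "ar",
            "fuv", "ff");

    /** Returns the code to tag content with, given its language and modality. */
    static String classify(String languageCode, Modality modality) {
        if (modality == Modality.VOICE) {
            // Spoken content keeps the specific dialect code.
            return languageCode;
        }
        // Written content falls back to the macrolanguage when one is known.
        return MACROLANGUAGE.getOrDefault(languageCode, languageCode);
    }

    public static void main(String[] args) {
        System.out.println(classify("yue", Modality.TEXT));
        System.out.println(classify("yue", Modality.VOICE));
    }
}
```

Under these assumptions, `yue` text would be tagged `zh` while `yue` speech would stay `yue`; codes without a known macrolanguage pass through unchanged in both modes.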

macchiati · 2024-10-30T17:02:52Z

Inflections and RBNF play a role, but no organized initiatives yet. We have made room for separate tagging, eg

(intending to allow for a 'voice' in the future.)

jira-pull-request-webhook · 2024-11-11T16:28:06Z

Notice: the branch changed across the force-push!

common/supplemental/likelySubtags.xml is different
common/supplemental/supplementalData.xml is different
common/testData/localeIdentifiers/likelySubtags.txt is different
tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java is different
tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

There are 4 manual overrides in GenerateLikelySubtags.java that conflict with other data: for MU, SL, TK, and ZM.

For each country, the local language (mfe, kri, tkl, and bem) is spoken by far more people than English is, even if English is the main language of instruction. Education and literacy in each country is low enough that the local languages should be considered the dominant ones.

I was able to find censuses listing language characteristics for MU, TK and ZM. I wasn't able to find data for SL, but I removed its override as well.

To regenerate the data, use this command: `mvn package -DskipTests=true && java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData && java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags && java -jar tools/cldr-code/target/cldr-code.jar GenerateTestData`

CLDR-18002 Actually make local languages the default matches

The prior change didn't quite work because und_MU was defaulting to en_Latn_MU. This fixes it to go to mfe, and does the same for the other languages. The problem is that English is official in these countries, so there's a mismatch.

CLDR-18002 Style fix

`mvn --file=tools/pom.xml spotless:apply`

CLDR-18002 Default to English since it's official
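The mismatch described in these commit messages can be illustrated with a small sketch. It assumes a simplified rule in which official status may outrank speaker population; this is not the actual GenerateLikelySubtags algorithm, and the class, method names, and speaker weights are all hypothetical placeholders rather than census figures.

```java
import java.util.List;

/** Hypothetical sketch of official-status versus population when picking a default language. */
final class DefaultLanguageSketch {
    static final class Candidate {
        final String language;
        final long speakers; // placeholder relative weight, NOT census data
        final boolean official;

        Candidate(String language, long speakers, boolean official) {
            this.language = language;
            this.speakers = speakers;
            this.official = official;
        }
    }

    /** Picks the default language for a region under the assumed rule. */
    static String pickDefault(List<Candidate> candidates, boolean preferOfficial) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (best == null || outranks(c, best, preferOfficial)) {
                best = c;
            }
        }
        return best == null ? null : best.language;
    }

    /** True if a should replace b as the current best candidate. */
    static boolean outranks(Candidate a, Candidate b, boolean preferOfficial) {
        if (preferOfficial && a.official != b.official) {
            return a.official; // official status wins before population is compared
        }
        return a.speakers > b.speakers;
    }

    public static void main(String[] args) {
        // Placeholder weights: mfe is assumed to be far more widely spoken than en.
        List<Candidate> mu = List.of(
                new Candidate("en", 1, true),
                new Candidate("mfe", 9, false));
        System.out.println(pickDefault(mu, true));  // official status preferred
        System.out.println(pickDefault(mu, false)); // population only
    }
}
```

With the official-status preference switched on, the sketch picks `en` even though `mfe` has the larger weight, which mirrors the und_MU behavior the second commit ran into and the last commit accepted.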

jira-pull-request-webhook · 2024-11-19T19:17:52Z

Notice: the branch changed across the force-push!

common/supplemental/likelySubtags.xml is different
common/supplemental/supplementalData.xml is different
common/testData/localeIdentifiers/likelySubtags.txt is different
tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

macchiati · 2024-11-20T21:54:29Z

tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java

@@ -433,15 +433,11 @@ public static void main(String[] args) throws IOException {
                                {"und_Latn_PH", "fil_Latn_PH"},
                                {"und_ML", "bm_Latn_ML"},
                                {"und_Latn_ML", "bm_Latn_ML"},
-                                {"und_MU", "mfe_Latn_MU"},


The more we can delete from this list, the better!
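As a rough illustration of why a shorter list is better: each pair in the list pins a result regardless of what the data says, so every entry is a place where the output can drift from the population data. The sketch below is not the real GenerateLikelySubtags code; `OverrideSketch`, `resolve`, and the `derived` map are hypothetical, and the derived values are placeholders except for en_Latn_MU, which is the value the commit message reports for und_MU.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: manual overrides take precedence over data-derived mappings. */
final class OverrideSketch {
    // Pairs copied from the diff context above; the und_MU pair is omitted,
    // as it is the line this change deletes.
    static final String[][] MANUAL_OVERRIDES = {
        {"und_Latn_PH", "fil_Latn_PH"},
        {"und_ML", "bm_Latn_ML"},
        {"und_Latn_ML", "bm_Latn_ML"},
    };

    /** Returns the override target if one exists, otherwise the data-derived value. */
    static String resolve(String source, Map<String, String> derived) {
        for (String[] pair : MANUAL_OVERRIDES) {
            if (pair[0].equals(source)) {
                return pair[1];
            }
        }
        return derived.get(source);
    }

    public static void main(String[] args) {
        Map<String, String> derived = new HashMap<>();
        derived.put("und_MU", "en_Latn_MU"); // value reported in the commit message
        derived.put("und_ML", "xx_Latn_ML"); // placeholder, not real data

        System.out.println(resolve("und_MU", derived)); // no override: data decides
        System.out.println(resolve("und_ML", derived)); // override wins over data
    }
}
```

Once the und_MU pair is gone, the result for und_MU comes entirely from the derived data, so later corrections to the population table take effect without another code change.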

conradarcturus requested review from macchiati and srl295 October 2, 2024 23:52

github-actions bot assigned conradarcturus Oct 2, 2024

srl295 deleted the branch main October 25, 2024 16:35

srl295 closed this Oct 25, 2024

srl295 reopened this Oct 25, 2024

srl295 added the ddl DDL-SC specific work label Oct 25, 2024

srl295 changed the base branch from _ddl/v47 to main October 25, 2024 17:40

srl295 force-pushed the CLDR-18002-Update-pop-MU-TK-ZM branch from 1ee81e8 to 89ade9a Compare October 25, 2024 17:40

conradarcturus force-pushed the CLDR-18002-Update-pop-MU-TK-ZM branch from 89ade9a to 91049e8 Compare October 29, 2024 17:34

conradarcturus force-pushed the CLDR-18002-Update-pop-MU-TK-ZM branch from 91049e8 to 32878b6 Compare October 30, 2024 14:33

conradarcturus force-pushed the CLDR-18002-Update-pop-MU-TK-ZM branch from 32878b6 to cc75524 Compare November 11, 2024 16:28

conradarcturus force-pushed the CLDR-18002-Update-pop-MU-TK-ZM branch from cc75524 to aa56507 Compare November 19, 2024 19:17

macchiati reviewed Nov 20, 2024


macchiati approved these changes Nov 20, 2024


conradarcturus merged commit a82d6ca into main Nov 21, 2024
16 checks passed

AEApple deleted the CLDR-18002-Update-pop-MU-TK-ZM branch November 22, 2024 18:22
