Globaldata CLDR locales set #3538
Comments
I want to include at least |
If size isn't a problem I'm also in favor of |
I second @robertbastian’s comment about standard proliferation:
I have no opinion as to the actual set (for tiny sets of locales whose sole purpose is a good coverage of I18N issues I might have something to say, but as far as I can tell this is not the use case here). |
What defines the CLDR sets is not the usage or the need but rather the amount of data that happens to be collected for a particular locale. When a locale is in "modern", it is a stronger reflection of the well-connectedness of influencers in that locale than it is of whether that locale is a good choice for being a default locale. |
Do you have a proposal? I'm happy to use a smaller pre-existing set, I just don't want to define a new one. I don't see a big problem with locales being I also don't mind including all CLDR locales, including |
According to territoryInfo.json, there are 487 language-region pairs that are "official", "official_regional", or "de_facto_official":
>>> import json
>>> ti = json.load(open("Downloads/territoryInfo.json"))
>>> official_locales = [(lang, region) for (region,regionInfo) in ti["supplemental"]["territoryInfo"].items() for (lang,langInfo) in regionInfo.get("languagePopulation", {}).items() if langInfo.get("_officialStatus", None) is not None]
>>> ["%s_%s" % (lang, region) for (lang, region) in official_locales]
['ca_AD', 'ar_AE', 'fa_AF', 'ps_AF', 'tk_AF', 'uz_Arab_AF', 'en_AG', 'en_AI', 'sq_AL', 'hy_AM', 'pt_AO', 'es_AR', 'en_AS', 'sm_AS', 'de_AT', 'hr_AT', 'hu_AT', 'sl_AT', 'en_AU', 'nl_AW', 'pap_AW', 'sv_AX', 'az_AZ', 'az_Cyrl_AZ', 'bs_BA', 'bs_Cyrl_BA', 'hr_BA', 'sr_BA', 'sr_Latn_BA', 'en_BB', 'bn_BD', 'de_BE', 'fr_BE', 'nl_BE', 'fr_BF', 'bg_BG', 'ar_BH', 'en_BI', 'fr_BI', 'rn_BI', 'fr_BJ', 'fr_BL', 'en_BM', 'ms_BN', 'ms_Arab_BN', 'ay_BO', 'es_BO', 'qu_BO', 'nl_BQ', 'pt_BR', 'vec_BR', 'en_BS', 'dz_BT', 'en_BW', 'tn_BW', 'be_BY', 'ru_BY', 'en_BZ', 'chp_CA', 'cr_CA', 'den_CA', 'dgr_CA', 'en_CA', 'fr_CA', 'gwi_CA', 'iu_CA', 'iu_Latn_CA', 'en_CC', 'fr_CD', 'kg_CD', 'ln_CD', 'lua_CD', 'sw_CD', 'fr_CF', 'sg_CF', 'fr_CG', 'de_CH', 'fr_CH', 'gsw_CH', 'it_CH', 'rm_CH', 'fr_CI', 'en_CK', 'es_CL', 'en_CM', 'fr_CM', 'bo_CN', 'ko_CN', 'mn_Mong_CN', 'ug_CN', 'za_CN', 'zh_CN', 'es_CO', 'en_CQ', 'es_CR', 'es_CU', 'pt_CV', 'nl_CW', 'pap_CW', 'en_CX', 'el_CY', 'tr_CY', 'cs_CZ', 'de_DE', 'frr_DE', 'en_DG', 'ar_DJ', 'fr_DJ', 'da_DK', 'de_DK', 'kl_DK', 'en_DM', 'es_DO', 'ar_DZ', 'fr_DZ', 'es_EA', 'es_EC', 'qu_EC', 'et_EE', 'ar_EG', 'ar_EH', 'ar_ER', 'en_ER', 'ti_ER', 'ast_ES', 'ca_ES', 'es_ES', 'eu_ES', 'gl_ES', 'oc_ES', 'am_ET', 'fi_FI', 'sms_FI', 'sv_FI', 'en_FJ', 'fj_FJ', 'hif_FJ', 'en_FK', 'en_FM', 'fo_FO', 'fr_FR', 'fr_GA', 'cy_GB', 'en_GB', 'ga_GB', 'gd_GB', 'en_GD', 'ab_GE', 'ka_GE', 'os_GE', 'fr_GF', 'en_GG', 'ak_GH', 'ee_GH', 'en_GH', 'gaa_GH', 'en_GI', 'kl_GL', 'en_GM', 'fr_GN', 'fr_GP', 
'es_GQ', 'fr_GQ', 'pt_GQ', 'el_GR', 'es_GT', 'quc_GT', 'ch_GU', 'en_GU', 'pt_GW', 'en_GY', 'en_HK', 'zh_Hant_HK', 'es_HN', 'hr_HR', 'it_HR', 'vec_HR', 'fr_HT', 'ht_HT', 'hu_HU', 'es_IC', 'id_ID', 'en_IE', 'ga_IE', 'ar_IL', 'he_IL', 'en_IM', 'gv_IM', 'as_IN', 'bn_IN', 'en_IN', 'gu_IN', 'hi_IN', 'kha_IN', 'kn_IN', 'kok_IN', 'ks_IN', 'mai_IN', 'ml_IN', 'mr_IN', 'ne_IN', 'or_IN', 'pa_IN', 'sa_IN', 'sat_IN', 'sd_IN', 'sd_Deva_IN', 'ta_IN', 'te_IN', 'ur_IN', 'en_IO', 'ar_IQ', 'az_Arab_IQ', 'ckb_IQ', 'fa_IR', 'is_IS', 'fr_IT', 'it_IT', 'vec_IT', 'en_JE', 'en_JM', 'ar_JO', 'ja_JP', 'en_KE', 'sw_KE', 'ky_KG', 'ru_KG', 'km_KH', 'en_KI', 'gil_KI', 'ar_KM', 'fr_KM', 'wni_KM', 'zdj_KM', 'en_KN', 'ko_KP', 'ko_KR', 'ar_KW', 'en_KY', 'kk_KZ', 'ru_KZ', 'lo_LA', 'ar_LB', 'en_LC', 'de_LI', 'gsw_LI', 'si_LK', 'ta_LK', 'en_LR', 'en_LS', 'st_LS', 'lt_LT', 'de_LU', 'fr_LU', 'lb_LU', 'lv_LV', 'ar_LY', 'ar_MA', 'fr_MA', 'tzm_MA', 'fr_MC', 'ro_MD', 'sr_Latn_ME', 'fr_MF', 'en_MG', 'fr_MG', 'mg_MG', 'en_MH', 'mh_MH', 'mk_MK', 'sq_MK', 'fr_ML', 'my_MM', 'mn_MN', 'pt_MO', 'zh_Hant_MO', 'en_MP', 'fr_MQ', 'ar_MR', 'en_MS', 'en_MT', 'mt_MT', 'en_MU', 'fr_MU', 'dv_MV', 'en_MW', 'ny_MW', 'es_MX', 'vec_MX', 'ms_MY', 'pt_MZ', 'en_NA', 'fr_NC', 'fr_NE', 'en_NF', 'en_NG', 'yo_NG', 'es_NI', 'fy_NL', 'nl_NL', 'nb_NO', 'nn_NO', 'no_NO', 'se_NO', 'ne_NP', 'en_NR', 'na_NR', 'en_NU', 'niu_NU', 'en_NZ', 'mi_NZ', 'ar_OM', 'es_PA', 'es_PE', 'qu_PE', 'fr_PF', 'ty_PF', 'en_PG', 'ho_PG', 'tpi_PG', 'ceb_PH', 'en_PH', 'fil_PH', 'hil_PH', 'ilo_PH', 'mdh_PH', 'pag_PH', 'tsg_PH', 'war_PH', 'en_PK', 'ur_PK', 'csb_PL', 'de_PL', 'lt_PL', 'pl_PL', 'fr_PM', 'en_PN', 'en_PR', 'es_PR', 'ar_PS', 'pt_PT', 'en_PW', 'pau_PW', 'es_PY', 'gn_PY', 'ar_QA', 'fr_RE', 'ro_RO', 'hr_RS', 'hu_RS', 'ro_RS', 'sk_RS', 'sr_RS', 'sr_Latn_RS', 'uk_RS', 'ady_RU', 'av_RU', 'az_Cyrl_RU', 'ba_RU', 'ce_RU', 'inh_RU', 'kbd_RU', 'koi_RU', 'krc_RU', 'kum_RU', 'kv_RU', 'lbe_RU', 'lez_RU', 'mdf_RU', 'myv_RU', 'ru_RU', 'sah_RU', 'tt_RU', 'tyv_RU', 'udm_RU', 
'en_RW', 'fr_RW', 'rw_RW', 'ar_SA', 'en_SB', 'en_SC', 'fr_SC', 'ar_SD', 'en_SD', 'fi_SE', 'sv_SE', 'en_SG', 'ms_SG', 'ta_SG', 'zh_SG', 'en_SH', 'sl_SI', 'vec_SI', 'nb_SJ', 'sk_SK', 'en_SL', 'it_SM', 'bjt_SN', 'bsc_SN', 'dyo_SN', 'ff_SN', 'fr_SN', 'knf_SN', 'mey_SN', 'mfv_SN', 'sav_SN', 'snf_SN', 'srr_SN', 'tnr_SN', 'wo_SN', 'ar_SO', 'so_SO', 'nl_SR', 'en_SS', 'pt_ST', 'es_SV', 'en_SX', 'nl_SX', 'ar_SY', 'fr_SY', 'en_SZ', 'ss_SZ', 'en_TC', 'ar_TD', 'fr_TD', 'fr_TG', 'th_TH', 'tg_TJ', 'en_TK', 'tkl_TK', 'pt_TL', 'tet_TL', 'tk_TM', 'ar_TN', 'fr_TN', 'en_TO', 'to_TO', 'tr_TR', 'en_TT', 'en_TV', 'tvl_TV', 'zh_Hant_TW', 'en_TZ', 'sw_TZ', 'ru_UA', 'uk_UA', 'en_UG', 'sw_UG', 'en_UM', 'en_US', 'es_US', 'haw_US', 'es_UY', 'uz_UZ', 'uz_Cyrl_UZ', 'it_VA', 'en_VC', 'es_VE', 'en_VG', 'en_VI', 'vi_VN', 'bi_VU', 'en_VU', 'fr_VU', 'fr_WF', 'en_WS', 'sm_WS', 'sq_XK', 'sr_XK', 'sr_Latn_XK', 'ar_YE', 'fr_YT', 'af_ZA', 'en_ZA', 'nr_ZA', 'nso_ZA', 'ss_ZA', 'st_ZA', 'tn_ZA', 'ts_ZA', 've_ZA', 'xh_ZA', 'zu_ZA', 'en_ZM', 'en_ZW', 'nd_ZW', 'sn_ZW']
There are currently 93 modern locales in coverageLevels.json:
>>> cl = json.load(open("Downloads/coverageLevels.json"))
>>> modern = [x for (x,y) in cl["coverageLevels"].items() if y=="modern"]
>>> modern
['af', 'am', 'ar', 'as', 'az', 'be', 'bg', 'bn', 'bs', 'ca', 'cs', 'cy', 'da', 'de', 'el', 'en', 'es', 'et', 'eu', 'fa', 'fi', 'fil', 'fr', 'ga', 'gd', 'gl', 'gu', 'ha', 'he', 'hi', 'hi-Latn', 'hr', 'hu', 'hy', 'id', 'ig', 'is', 'it', 'ja', 'jv', 'ka', 'kk', 'km', 'kn', 'ko', 'kok', 'ky', 'lo', 'lt', 'lv', 'mk', 'ml', 'mn', 'mr', 'ms', 'my', 'ne', 'nl', 'nn', 'no', 'or', 'pa', 'pcm', 'pl', 'ps', 'pt', 'ro', 'ru', 'sd', 'si', 'sk', 'sl', 'so', 'sq', 'sr', 'sr-Latn', 'sv', 'sw', 'ta', 'te', 'th', 'tk', 'tr', 'uk', 'ur', 'uz', 'vi', 'yo', 'yue', 'yue-Hans', 'zh', 'zh-Hant', 'zu']
There are 7 languages that are modern but not official: 'pcm', 'yue', 'hi-Latn', 'ha', 'ig', 'yue-Hans', 'jv'
There are 132 that are official but not modern: 'lb', 'bjt', 'tvl', 'dz', 
'bs-Cyrl', 'kg', 'mi', 'ln', 'fj', 'gil', 'xh', 'kv', 'ce', 'mfv', 'ak', 'sa', 'az-Arab', 'kum', 'dv', 'koi', 'den', 'st', 'wo', 'bsc', 'dyo', 'ks', 'vec', 'fy', 'pag', 'gsw', 'lua', 'ht', 'sg', 'ba', 'ilo', 'za', 'mey', 'gv', 'pap', 'qu', 'inh', 'ss', 'av', 'fo', 'kha', 'kbd', 'ti', 'ee', 'tt', 'nr', 'mh', 'mdf', 'tnr', 'cr', 'uz-Arab', 'mn-Mong', 'oc', 'ts', 'sat', 'bi', 'tg', 'bo', 'ny', 'csb', 'udm', 'sah', 'lez', 'hil', 'sn', 'war', 'sms', 'az-Cyrl', 'ho', 'dgr', 'tyv', 'tet', 'na', 'myv', 've', 'gn', 'chp', 'tsg', 'ty', 'tn', 'zdj', 'ay', 'frr', 'mai', 'rw', 'sd-Deva', 'iu-Latn', 'gaa', 'pau', 'srr', 'ms-Arab', 'quc', 'mdh', 'ug', 'mt', 'to', 'nd', 'ady', 'sm', 'ab', 'ast', 'ff', 'ceb', 'niu', 'haw', 'lbe', 'iu', 'nb', 'kl', 'krc', 'mg', 'ckb', 'tzm', 'tkl', 'wni', 'rn', 'tpi', 'se', 'gwi', 'hif', 'snf', 'uz-Cyrl', 'knf', 'nso', 'os', 'sav', 'rm', 'ch' Also note that coverage levels does not seem to consider region, only language and script. Perhaps we take the union of modern and official? |
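The comparison above (modern but not official, official but not modern, and their union) can be sketched with plain set operations. This is a minimal sketch, not the script actually used in the thread: the function names are hypothetical, the two dictionaries are toy stand-ins shaped like territoryInfo.json and coverageLevels.json, and the underscore-to-hyphen normalization is an assumption based on the two files using different separators in the output above.

```python
def official_languages(territory_info):
    """Collect language(-script) codes that have any _officialStatus, dropping the region."""
    langs = set()
    regions = territory_info["supplemental"]["territoryInfo"]
    for region_info in regions.values():
        for lang, lang_info in region_info.get("languagePopulation", {}).items():
            if lang_info.get("_officialStatus") is not None:
                # Assumption: normalize "uz_Arab" to "uz-Arab" so it can be
                # compared against the hyphenated keys in coverageLevels.json.
                langs.add(lang.replace("_", "-"))
    return langs


def languages_at_level(coverage_levels, levels=("modern",)):
    """Return the locale keys whose coverage level is one of `levels`."""
    return {loc for loc, level in coverage_levels["coverageLevels"].items()
            if level in levels}


# Toy data only; the real files would be loaded with json.load().
territory_info = {"supplemental": {"territoryInfo": {
    "LU": {"languagePopulation": {
        "lb": {"_officialStatus": "official"},
        "fr": {"_officialStatus": "official"},
        "pt": {},
    }},
    "AF": {"languagePopulation": {
        "uz_Arab": {"_officialStatus": "official_regional"},
    }},
}}}
coverage_levels = {"coverageLevels": {
    "fr": "modern", "pcm": "modern", "pt": "modern", "lb": "basic",
}}

official = official_languages(territory_info)
modern = languages_at_level(coverage_levels)

modern_not_official = sorted(modern - official)
official_not_modern = sorted(official - modern)
union = sorted(modern | official)
```

Passing `levels=("modern", "moderate", "basic")` to `languages_at_level` would give the wider comparison discussed in the next comment.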
Languages that are modern/moderate/basic but not official: 'brx', 'su', 'jv', 'dsb', 'br', 'hsb', 'chr', 'bgc', 'pcm', 'cv', 'yue', 'bho', 'kea', 'ig', 'ff-Adlm', 'hi-Latn', 'ha', 'mni', 'kgp', 'sc', 'yrl', 'doi', 'ks-Deva', 'yue-Hans', 'raj', 'ia' Languages that are official but not modern/moderate/basic: 'lb', 'bjt', 'tvl', 'dz', 'kg', 'ln', 'fj', 'gil', 'kv', 'ce', 'mfv', 'ak', 'az-Arab', 'kum', 'dv', 'koi', 'den', 'st', 'bsc', 'dyo', 'vec', 'fy', 'pag', 'gsw', 'lua', 'ht', 'sg', 'ba', 'ilo', 'za', 'mey', 'gv', 'pap', 'inh', 'ss', 'av', 'kha', 'kbd', 'ee', 'nr', 'mh', 'mdf', 'tnr', 'cr', 'uz-Arab', 'mn-Mong', 'oc', 'ts', 'bi', 'bo', 'ny', 'csb', 'udm', 'sah', 'lez', 'hil', 'sn', 'war', 'sms', 'az-Cyrl', 'ho', 'dgr', 'tyv', 'tet', 'na', 'myv', 've', 'gn', 'chp', 'tsg', 'ty', 'tn', 'zdj', 'ay', 'frr', 'rw', 'iu-Latn', 'gaa', 'pau', 'srr', 'ms-Arab', 'quc', 'mdh', 'ug', 'mt', 'nd', 'ady', 'sm', 'ab', 'ff', 'niu', 'haw', 'lbe', 'iu', 'nb', 'kl', 'krc', 'mg', 'ckb', 'tzm', 'tkl', 'wni', 'rn', 'tpi', 'se', 'gwi', 'hif', 'snf', 'knf', 'nso', 'os', 'sav', 'ch' Above list but with regions: 'sm-AS', 'pap-AW', 'rn-BI', 'ay-BO', 'vec-BR', 'dz-BT', 'tn-BW', 'chp-CA', 'cr-CA', 'den-CA', 'dgr-CA', 'gwi-CA', 'iu-CA', 'kg-CD', 'ln-CD', 'lua-CD', 'sg-CF', 'gsw-CH', 'bo-CN', 'ug-CN', 'za-CN', 'pap-CW', 'frr-DE', 'kl-DK', 'oc-ES', 'sms-FI', 'fj-FJ', 'hif-FJ', 'ab-GE', 'os-GE', 'ak-GH', 'ee-GH', 'gaa-GH', 'kl-GL', 'quc-GT', 'ch-GU', 'vec-HR', 'ht-HT', 'gv-IM', 'kha-IN', 'ckb-IQ', 'vec-IT', 'gil-KI', 'wni-KM', 'zdj-KM', 'gsw-LI', 'st-LS', 'lb-LU', 'tzm-MA', 'mg-MG', 'mh-MH', 'mt-MT', 'dv-MV', 'ny-MW', 'vec-MX', 'fy-NL', 'nb-NO', 'se-NO', 'na-NR', 'niu-NU', 'ty-PF', 'ho-PG', 'tpi-PG', 'hil-PH', 'ilo-PH', 'mdh-PH', 'pag-PH', 'tsg-PH', 'war-PH', 'csb-PL', 'pau-PW', 'gn-PY', 'ady-RU', 'av-RU', 'ba-RU', 'ce-RU', 'inh-RU', 'kbd-RU', 'koi-RU', 'krc-RU', 'kum-RU', 'kv-RU', 'lbe-RU', 'lez-RU', 'mdf-RU', 'myv-RU', 'sah-RU', 'tyv-RU', 'udm-RU', 'rw-RW', 'vec-SI', 'nb-SJ', 'bjt-SN', 'bsc-SN', 
'dyo-SN', 'ff-SN', 'knf-SN', 'mey-SN', 'mfv-SN', 'sav-SN', 'snf-SN', 'srr-SN', 'tnr-SN', 'ss-SZ', 'tkl-TK', 'tet-TL', 'tvl-TV', 'haw-US', 'bi-VU', 'sm-WS', 'nr-ZA', 'nso-ZA', 'ss-ZA', 'st-ZA', 'tn-ZA', 'ts-ZA', 've-ZA', 'nd-ZW', 'sn-ZW' |
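One way to get from a language-level list back to a list "with regions" is to filter the official (language, region) pairs by whether the language has any coverage level. The sketch below is illustrative only: `uncovered_official_locales` is a hypothetical helper, the input pairs are a hand-picked toy sample, and the exact matching rule used for the list above is not stated in the thread.

```python
def uncovered_official_locales(official_pairs, covered_languages):
    """Return hyphenated locale tags for official pairs whose language is not covered.

    official_pairs: iterable of (language, region) tuples, e.g. ("gsw", "CH").
    covered_languages: set of language(-script) codes that have a coverage level.
    """
    tags = []
    for lang, region in official_pairs:
        lang_tag = lang.replace("_", "-")  # assumption: normalize separators
        if lang_tag not in covered_languages:
            tags.append((region, f"{lang_tag}-{region}"))
    # Sort by region first, then by tag, matching the ordering of the list above.
    return [tag for _, tag in sorted(tags)]


# Toy sample, not the full CLDR data.
official_pairs = [("lb", "LU"), ("de", "CH"), ("gsw", "CH"), ("sm", "WS"), ("sm", "AS")]
covered = {"de"}

result = uncovered_official_locales(official_pairs, covered)
```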
There are only 38 languages that are moderate/basic (less than half as many as are modern). So I think there's not a good reason to exclude them. One can argue that it simply does not make a lot of sense to include locales that don't have good coverage. However, even sub-basic locales often have some coverage. In terms of regional variants, we discussed a few options, which can be flags:
I think we can start with 1 and 2, adding 3 later if someone asks for it.
One issue is that we seem to lose our current testdata locale Chakma (ccp). We mainly include it to test supplementary code points. The only language in modern/moderate/basic that seems to use supplementary code points is ff-Adlm. In all, here are the locales with supplementary code points in their exemplar sets: 'rhg-Rohg', 'ff-Adlm-GW', 'ff-Adlm-NE', 'ccp-IN', 'ff-Adlm-GM', 'ff-Adlm-GH', 'ccp', 'hnj-Hmnp', 'rhg', 'ff-Adlm-SN', 'ff-Adlm-NG', 'hnj', 'ff-Adlm-MR', 'en-Dsrt', 'ff-Adlm-CM', 'en-Shaw', 'ff-Adlm-BF', 'rhg-Rohg-BD', 'ff-Adlm-LR', 'osa', 'ff-Adlm', 'ff-Adlm-SL' |
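A check like the one described, finding locales whose exemplar characters lie above U+FFFF (outside the Basic Multilingual Plane), could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, and the `samples` strings are a few hand-picked characters from the Chakma and Adlam blocks rather than real CLDR exemplar sets.

```python
def supplementary_chars(text):
    """Return the characters in `text` whose code point is above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


def uses_supplementary(exemplars):
    """True if any exemplar character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in exemplars)


# Toy exemplar strings, not actual CLDR data.
samples = {
    "ccp": "\U00011103\U00011104",      # two characters from the Chakma block
    "ff-Adlm": "\U0001E900\U0001E901",  # two characters from the Adlam block
    "en": "abc",                        # BMP only
}

flagged = sorted(loc for loc, chars in samples.items() if uses_supplementary(chars))
```

Python strings iterate by code point, so `ord(ch) > 0xFFFF` is enough here; in a language with UTF-16 strings the same check would need to look for surrogate pairs instead.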
Conclusions:
LGTM: @Manishearth @sffc @eggrobin @robertbastian Furthermore:
|
@macchiati also approved of the above scheme. |
What set of locales should we use in globaldata?
Discuss with:
Optional:
Plan to add this to an upcoming ICU-TC call.