BCP 47 Changes (DRAFT)

With the release of the new version of BCP 47, we need to make various changes in Unicode CLDR and LDML. CLDR 1.7 already includes modifications anticipating the release (see BCP 47 Tag Conversion in the spec, and the original design proposal), but more changes are needed.

Formula

We need to take another look at which languages we show in the Survey Tool for translation, because the new version of BCP 47 is very large, with around 7,000 languages. Showing all of them would hurt both the usability of the tool for most translators and its performance, so we need a formula for picking which languages to show by default.

For feedback on this document, please file a Reply under http://www.unicode.org/cldr/bugs/locale-bugs?findid=1977. For discussion of issues, please send email to cldr-users@unicode.org.

Draft Formula

A. We show a language code X for translation if any of the following conditions are true:

  1. X is a qualified language**, and has at least 100K speakers, and at least one of the following is true:
    1. X has official status* in any country
    2. X exceeds a threshold population† of literate users worldwide: 10M
    3. X exceeds a threshold population† in some country Z: 1M and 1/3 of Z’s population†.
  2. X has non-draft minimal language coverage‡ in CLDR itself.
  3. Only for translation in locale Y: X is a qualified language** that already has a translation in CLDR data in Y.
  4. X is an exception explicitly approved by the committee, either in root, or in some language Y.
    1. Current examples: Latin, Sanskrit

If a translator finds that X is needed for translation in language Y, then a bug can be filed. If we find the volume is high, we may need to add some way for a translator to add a language in the Survey Tool.
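The conditions in A can be sketched as a predicate over per-language data. This is only an illustration of the draft formula: the `LanguageInfo` record, its field names, and the way exceptions are stored are assumptions made for the example, not existing CLDR structures.

```python
from dataclasses import dataclass, field

MIN_SPEAKERS = 100_000         # "at least 100K speakers"
WORLD_THRESHOLD = 10_000_000   # condition 1.2
COUNTRY_THRESHOLD = 1_000_000  # condition 1.3, absolute part
COUNTRY_SHARE = 1 / 3          # condition 1.3, relative part


@dataclass
class LanguageInfo:
    """Hypothetical record; the field names are illustrative only."""
    code: str
    qualified: bool            # not a collection, ancient, historic, or extinct
    speakers: int
    literate_world: int        # literate users worldwide
    official_somewhere: bool   # official status in any country
    # country -> (literate users of X in Z, literate population of Z)
    by_country: dict = field(default_factory=dict)
    minimal_coverage: bool = False                    # non-draft minimal coverage
    translated_in: set = field(default_factory=set)   # locales already translating X
    exception_in: set = field(default_factory=set)    # "root" or specific locales


def meets_population_rule(lang):
    """Condition 1: qualified, 100K+ speakers, and one of 1.1 to 1.3."""
    if not (lang.qualified and lang.speakers >= MIN_SPEAKERS):
        return False
    if lang.official_somewhere:                       # 1.1
        return True
    if lang.literate_world > WORLD_THRESHOLD:         # 1.2
        return True
    return any(                                       # 1.3
        users > COUNTRY_THRESHOLD and users > COUNTRY_SHARE * total
        for users, total in lang.by_country.values()
    )


def show_for_translation(lang, locale):
    """True if language X should be offered for translation in locale Y."""
    if meets_population_rule(lang):                          # condition 1
        return True
    if lang.minimal_coverage:                                # condition 2
        return True
    if lang.qualified and locale in lang.translated_in:      # condition 3
        return True
    # Condition 4: committee-approved exception, in root or in this locale.
    return "root" in lang.exception_in or locale in lang.exception_in
```

With this shape, Latin would be modeled as an unqualified language whose `exception_in` contains `"root"`, so it is shown in every locale.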

B. We show a script code S for translation if and only if it is one of the scripts used by one of the languages shown.
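Rule B is a simple derivation from the shown languages. A minimal sketch, assuming a hypothetical mapping from language code to the script codes it is written in (the sample data is illustrative only):

```python
def scripts_to_show(shown_languages, scripts_by_language):
    """Return the script codes used by at least one shown language."""
    shown = set()
    for language in shown_languages:
        shown.update(scripts_by_language.get(language, ()))
    return shown


# Illustrative data, not taken from CLDR.
SCRIPTS_BY_LANGUAGE = {
    "sr": {"Cyrl", "Latn"},
    "el": {"Grek"},
    "cop": {"Copt"},
}

# Coptic is not among the shown languages here, so Copt is not offered.
print(sorted(scripts_to_show({"sr", "el"}, SCRIPTS_BY_LANGUAGE)))
```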

Notes

** qualified language: any language except collections (other than macrolanguages with predominant forms) and ancient, historic, and extinct languages: see Scope and Types. Some of the excluded languages could be added as exceptions as needed.

‡ minimal coverage - see Coverage Levels - at a non-draft level.

* official status means official, de facto official, official regional, or de facto official regional.

† population means literate 14-day active users (well, theoretically: we can only get an approximation of that), based on CLDR figures. Our concern is with written language, not spoken, so we don’t focus on variants that don’t have much written usage, and the population figures we want are those for the literate population. For this reason and others, we don’t rely on the Ethnologue figures. See also Picking the Right Language Code.

Please review the generated lists in Filtered Scripts and Languages. A spreadsheet with some details is at http://spreadsheets.google.com/pub?key=rORMJfeNEUR37PlS8HIa_rQ. The first column is the language, the second is the world population of the language (literate), and the remaining columns are the reasons (data for 1.1, 1.2, 1.3 from the above).

Known issues:

Survey Tool Changes

The above would only require a small tool change: the main change is that the approved list from #1 and #2 would be in CODE_FALLBACK, and nothing else would. Languages would get #3 cases by virtue of there being a translated tag already in the language, even if Root doesn’t have anything (because it is not in CODE_FALLBACK). Thus if the locale doesn’t already contain a translation for, say, Ancient Greek, it would not show up in the survey tool.
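The resulting behavior can be sketched as a set union. This is only an illustration of the rule described above, with hypothetical names and sample codes; it does not reflect the Survey Tool’s actual code.

```python
def languages_shown(code_fallback, translated_in_locale):
    """Codes shown in a locale: the approved list plus codes it already translates."""
    return set(code_fallback) | set(translated_in_locale)


# Illustrative data: 'grc' (Ancient Greek) is not on the approved list.
CODE_FALLBACK = {"en", "fr", "la"}

with_grc = languages_shown(CODE_FALLBACK, {"en", "grc"})
without_grc = languages_shown(CODE_FALLBACK, {"en"})

print("grc" in with_grc)     # the locale already translates Ancient Greek
print("grc" in without_grc)  # no existing translation, so it stays hidden
```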

We would add the lists to the supplemental metadata for access by the tools. The Coverage tool and spec also need to be aligned with the above.

Other Changes

We also need to make other changes to the spec for the new version of BCP 47. In particular, for those macrolanguages with an encompassed language that is a “predominant form”, CLDR treats the predominant form and the macrolanguage as aliases. See Locale Field Definitions in the spec. We need to flesh that table out to include all of the macrolanguages that are in the Included Languages, such as Azerbaijani. Here is a start at that (but still just a draft). The first part of this list is from a draft of BCP 47bis. The last three are codes that are in the current (2006) version of BCP 47.

Macrolanguage Table

Macrolanguage | Encompassed Language | Comments
Arabic ‘ar’ | Standard Arabic ‘arb’ |
Konkani ‘kok’ | Konkani (individual language) ‘knn’ |
Malay ‘ms’ | Standard Malay ‘zsm’ |
Swahili ‘sw’ | Swahili (individual language) ‘swh’ |
Uzbek ‘uz’ | Northern Uzbek ‘uzn’ |
Chinese ‘zh’ | Mandarin Chinese ‘cmn’ |
Norwegian ‘no’ | Norwegian Bokmål ‘nob’ = ‘nb’ | To regularize, we may want to switch in CLDR from nb as the ‘norm’ to no.
Serbo-Croatian ‘sh’ | | This is a complex situation, and we’ll probably leave as is.
Kurdish ‘ku’ | Northern Kurdish ‘kmr’? | We probably want to change the default content locale to ku-Latn.
Akan ‘ak’ | Twi ‘tw’ and Fanti ‘fat’ | This appears to be a mistake in ISO 639. See: ISO 636 Deprecation Requests.
Persian ‘fa’ (fas) | Western Farsi ‘pes’ and Dari ‘prs’ | This appears to be a mistake in ISO 639. See: ISO 636 Deprecation Requests.

These would also go into the <alias> element of the supplemental metadata. We may add more such aliases over time, as we find new predominant forms. Note that we still need to offer both aliases for translation in many cases. For example, we want to show both ‘no’ and ‘nb’.
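As a rough illustration of how a tool might consume such aliases, the sketch below maps each predominant form to its macrolanguage code, following the direction used elsewhere in this document (cmn => zh). It covers only the rows of the table that have no open questions; the dictionary and function names are hypothetical, and the real data would live in the supplemental metadata.

```python
# Predominant form -> macrolanguage code, from the unambiguous table rows.
# Norwegian, Serbo-Croatian, Kurdish, Akan, and Persian are left out on
# purpose, because the table marks them as undecided or problematic.
PREDOMINANT_FORM_ALIASES = {
    "arb": "ar",
    "knn": "kok",
    "zsm": "ms",
    "swh": "sw",
    "uzn": "uz",
    "cmn": "zh",
}


def resolve_alias(language):
    """Return the code CLDR would use for the given language subtag."""
    return PREDOMINANT_FORM_ALIASES.get(language, language)


print(resolve_alias("cmn"))  # mapped to the macrolanguage code
print(resolve_alias("nb"))   # not in the table, returned unchanged
```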

Lenient Parsing

There are many circumstances where we get less than perfect language identifiers coming in. I think we should have some guidelines on how to handle them. Here are the possibilities:

  1. case / hyphen insensitivity
  2. map valid non-canonical forms to their canonical equivalents (zh-cmn, cmn => zh)
  3. map certain common invalid forms to their canonical equivalents:
    1. UK => GB
    2. eng => en // and other illegal 3-letter 639 codes that correspond to 2-letter codes
    3. 840 => US // other numeric region codes that correspond to 2-letter codes
  4. map away extlangs. Formally, en-yue is valid (this slipped by us in doing BCP 47), and canonicalizes in BCP 47 to yue, the same as zh-yue does. In any event, the simplest thing for us to do when there is a syntactic extlang is:
    1. Verify that the base language and extlang are both valid language subtags
    2. Remove the base language
    3. This avoids having to store which languages are also extlangs, and what their prefixes are.

People have to do #1. We should recommend #2, and make it easy to support #3.
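The four steps can be sketched together as one function. This is a minimal illustration, not the implementation behind the demo: the alias tables hold only the examples mentioned above, `VALID_LANGUAGES` is a tiny stand-in for the subtag registry, and all names are hypothetical.

```python
import re

# Illustrative samples only; real tables would come from the supplemental data.
LANGUAGE_ALIASES = {"cmn": "zh", "eng": "en"}       # steps 2 and 3.2
REGION_ALIASES = {"UK": "GB", "840": "US"}          # steps 3.1 and 3.3
VALID_LANGUAGES = {"en", "zh", "yue", "cmn", "sr"}  # stand-in for the registry


def lenient_parse(tag):
    """Normalize a possibly sloppy language identifier."""
    # Step 1: ignore case, and accept '_' as well as '-'.
    subtags = re.split(r"[-_]", tag.strip().lower())
    language, rest = subtags[0], subtags[1:]

    # Step 4: a 3-letter subtag right after the language is a syntactic extlang.
    if rest and len(rest[0]) == 3 and rest[0].isalpha():
        extlang = rest[0]
        if language not in VALID_LANGUAGES or extlang not in VALID_LANGUAGES:
            raise ValueError(f"invalid language or extlang in {tag!r}")
        language, rest = extlang, rest[1:]  # drop the base language

    # Steps 2 and 3.2: map non-canonical and 3-letter forms.
    language = LANGUAGE_ALIASES.get(language, language)

    result = [language]
    for index, subtag in enumerate(rest):
        if len(subtag) == 1:
            # Extension or private-use singleton: keep the remainder as is.
            result.extend(rest[index:])
            break
        if len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.title())  # script, e.g. Latn
        elif (len(subtag) == 2 and subtag.isalpha()) or (
            len(subtag) == 3 and subtag.isdigit()
        ):
            region = subtag.upper()
            result.append(REGION_ALIASES.get(region, region))  # steps 3.1, 3.3
        else:
            result.append(subtag)  # variants stay lowercase
    return "-".join(result)


for sample in ["EN_uk", "eng-840", "zh-cmn", "en-yue", "sr_latn_rs"]:
    print(sample, "=>", lenient_parse(sample))
```

Because the extlang step only checks that both subtags are valid languages, the sketch needs no table of extlangs and their prefixes, which is the simplification described in step 4.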

See demo at http://unicode.org/cldr/utility/languageid.jsp

Also, we should consider modifying the canonical form of language identifiers so as to have lowercase variants (with the exception of some set of grandfathered codes). The following are generated by GenerateMaximalLocales, plus 7 hand modifications for the last line.

Filtered Scripts and Languages

The following script/language names would be included (/excluded) from default translation. For the method used to get this list, see Formula.

The languages are listed in the format Abkhazian [ab]-OR, where [xx] is the code, and “OR” is the abbreviated “best” status in some territory: Unknown, Official Regional, Official Minority, De facto official, Official.

Included Script Names: 41+

Excluded Script Names:

Included Languages: 202

Excluded Languages: 299