CLDR 38 Release Note

See Key to Header Links

Overview

Unicode CLDR provides an update to the key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.

CLDR v38 focused on enhancing the support for existing locales: support for units of measurement in inflected languages (phase 1), and annotations (names and search keywords) for about 400 more non-emoji symbols as well as for Emoji v13.1. This version also has substantially higher coverage for (in order of completeness): Norwegian Nynorsk, Hausa, Igbo, Breton, Quechua, Yoruba, Fulah (Adlam script), Chakma, Asturian, Sanskrit, and Dogri.

The units of measurement additions allow APIs to support everything from simple unitIDs such as meter to compound unitIDs such as cubic-meter-per-square-second or acre-feet-per-day. Examples of such APIs include the following:

getUnitPattern(unitId, locale, width, pluralCategory, caseVariant) — to get the localized, inflected pattern for a simple or compound unit of measurement, appropriate for a position in a sentence or phrase with the appropriate pluralCategory and grammatical case (nominative, accusative, genitive, etc).

getUnitGender(unitId, locale) — to get the gender for a unit of measurement, so that other parts of a sentence or phrase can be modified to agree with that gender.
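
As a rough illustration of how such APIs might look, the following Python sketch implements both lookups over a small in-memory table. The function names mirror the ones above, but the table layout, the German sample patterns, and the fallback order are assumptions made for this example, not CLDR data structures or specified behavior.

```python
# Hypothetical sample data for illustration only; real values come from CLDR locale files.
UNIT_DATA = {
    ("meter", "de"): {
        "gender": "masculine",
        "patterns": {
            # (width, pluralCategory, caseVariant) -> pattern
            ("long", "one", "nominative"): "{0} Meter",
            ("long", "other", "nominative"): "{0} Meter",
            ("long", "other", "dative"): "{0} Metern",
        },
    },
}


def get_unit_pattern(unit_id, locale, width, plural_category, case_variant="nominative"):
    """Return the pattern for a unit, falling back when an exact entry is missing.

    The fallback order (exact match, then nominative case, then the 'other'
    plural category in nominative) is an assumption of this sketch.
    """
    patterns = UNIT_DATA[(unit_id, locale)]["patterns"]
    candidates = (
        (width, plural_category, case_variant),
        (width, plural_category, "nominative"),
        (width, "other", "nominative"),
    )
    for key in candidates:
        if key in patterns:
            return patterns[key]
    raise KeyError(f"no pattern for {unit_id!r} in {locale!r}")


def get_unit_gender(unit_id, locale):
    """Return the grammatical gender recorded for the unit."""
    return UNIT_DATA[(unit_id, locale)]["gender"]


# Usage: build a dative plural phrase and read the gender for agreement.
phrase = get_unit_pattern("meter", "de", "long", "other", "dative").format(3)
gender = get_unit_gender("meter", "de")
```

A caller would use the returned gender to pick agreeing articles or adjectives elsewhere in the sentence.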

The Survey Tool has improved performance and now offers structured forum requests to improve coordination among translators. We would like to thank the 393 language experts who contributed to this release.

There are some changes that affect existing specifications and data: for example, the plural rules for French changed to add a new category; the specification for using aliases is more rigorous, and some alias data has changed — along with the specification for handling locale identifier canonicalization. For more information, see Migration.

The overall changes to the data items were:

  • Added: 155,131

  • Deleted: 33,805

  • Changed: 45,895

Data Changes

The following summarizes the changes to the data for this version of CLDR.

  • 13.1 Emoji and Unicode Symbols

      • Added names & search keywords for Emoji 13.1 and enhancements to existing emoji annotation data.

      • Added approximately 400 non-emoji Unicode symbols such as punctuation and currency symbols.

      • Added 2 character labels: superscript {0} and subscript {0}.

      • Aside from the CLDR target locales, emoji annotations and keywords expanded in Hausa (ha), Igbo (ig), Kalaallisut (kl), Luxembourgish (lb), Maori (mi), Manipuri (mni), Maltese (mt), Punjabi [Arabic] (pa_Arab), Kinyarwanda (rw), Tajik (tg), Tigrinya (ti), Uyghur (ug), Wolof (wo), Xhosa (xh), Yoruba (yo), with minor expansions in a few other languages.

  • Compact decimals and Units

      • Added 14 new units.

      • Added new binary prefixes.

      • Added new operand 'c' (with a synonym 'e') for languages like French (CLDR-12010)

  • Higher Coverage Levels

      • Modern: Norwegian Nynorsk

      • Moderate++: Hausa, Igbo, Breton, Quechua, Yoruba — made significant improvements, but didn't make it quite to Modern

      • Moderate: Fulah (Adlam), Chakma, Asturian

      • Basic+: Wolof, Tajik, Maori, Luxembourgish, Uyghur, Tigrinya — made significant improvements, but didn't get near to Moderate

      • Basic: Sanskrit, Dogri

  • Unit Inflections

      • Completed phase 1. The full goal is to add full case and gender support for formatted units. During phase 1, a limited number of locales (see below) and units of measurement were handled, so that we can work kinks out of the process before expanding to all units for all locales (where we can get the grammatical structure).

      • Case & Gender: Polish (pl), Russian (ru), German (de), Hindi (hi) (in rough order of complexity)

      • Gender Only: Dutch (nl), Norwegian Bokmål (nb), Danish (da), Swedish (sv), French (fr), Italian (it), Portuguese (pt), Spanish (es)

  • Performance & Quality

      • Made substantial improvements in Survey Tool performance, lowering cost for translation.

      • Made substantial improvement in quality, using structured Forum topics to allow translators to collaborate more effectively.

      • Improved detection of translator errors.

  • ICU support

      • Improvements to CLDR API, providing a limited, stable API for extracting CLDR data.

      • Added approximatelySign for number formatting.

  • Unicode locale identifiers and BCP 47

      • Added a new -u locale extension keyword -dx, used to specify scripts to exclude from dictionary break (for word and line break)

      • Added a new short timezone identifier: tz-glgoh

      • Revamped the language, script, region, and variant alias data to improve replacement of deprecated codes.

For access to the draft data, see the git tag above. For more details see the Delta tickets above.
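
To illustrate the new -dx keyword listed above, the following Python sketch extracts the excluded scripts from a locale identifier. It assumes that the keyword value is a sequence of script-code subtags following the dx key; the function name is hypothetical, and the parsing is deliberately minimal rather than a full locale identifier parser.

```python
def dictionary_break_exclusions(locale_id):
    """Return the script subtags listed under the -u-dx keyword, if any.

    Assumption of this sketch: the dx value is one or more script-code
    subtags, and every two-character subtag inside the -u extension is a key.
    """
    subtags = locale_id.lower().replace("_", "-").split("-")
    if "u" not in subtags:
        return []
    scripts = []
    in_dx = False
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            # A singleton starts another extension, so the -u extension is over.
            break
        if len(subtag) == 2:
            # A two-character subtag is a keyword key.
            in_dx = subtag == "dx"
        elif in_dx:
            scripts.append(subtag)
    return scripts


excluded = dictionary_break_exclusions("ja-u-ca-japanese-dx-hani-kana")
```

Here `excluded` holds the lowercase script subtags found after the dx key; a segmenter could skip dictionary-based breaking for text in those scripts.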

JSON Data Changes

JSON data now includes data for plural ranges, grammatical inflections, typographical labels, and annotations. If you are making use of JSON data, please join the [cldr-users] mailing list where we would like to hear your feedback.

CLDR JSON data for v38 is available; see https://github.com/unicode-org/cldr-json

Specification Changes

The largest changes were the following:

  • To make the canonicalization of locale identifiers clear and unambiguous, provided major restructuring of the specification for canonicalization. (This was done in concert with fixes to the alias data to work better with the specification.) See Migration and Annex C. LocaleId Canonicalization for more details.

  • To allow for overriding dictionary-based segmentation breaks, added the Unicode Dictionary Break Exclusion Identifier, with the new key “dx”.

  • For picking the correct units of measurement for locales, defined the userPreferences skeleton more precisely.

  • For accurate plural categories in compact numbers, added the 'c' operand to plural rules to provide formatting for languages such as French. (CLDR-12010)

  • To support inflected units of measurement (phase 1), added specifications for the new elements listed under Structure Changes and an algorithm for how to construct grammatical unit names (simple or compound).

For more detailed specification changes, see the Spec above, and look at the Modifications section.
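
The effect of the 'c' operand on message selection can be sketched in Python as follows. The rule function is a simplified stand-in modeled on the French rule and should be treated as an assumption (the authoritative rule is in the CLDR plural rules data); the function and variable names are hypothetical. The second function shows the fallback to the 'other' message that the Migration section relies on.

```python
def french_plural_category(i, v=0, c=0):
    """Simplified stand-in for a French plural rule that uses the 'c' operand.

    i: integer value of the number
    v: number of visible fraction digits
    c: compact decimal exponent (for example, 6 for "1 million")
    """
    if i in (0, 1):
        return "one"
    if c not in range(0, 6) or (i % 1_000_000 == 0 and v == 0):
        return "many"
    return "other"


def select_message(messages, category):
    """Pick the message for a category, falling back to 'other' when it is missing."""
    return messages.get(category, messages["other"])


# Messages written before the 'many' category existed still work via the fallback.
old_messages = {"one": "{0} jour", "other": "{0} jours"}
category = french_plural_category(2_000_000, c=6)
message = select_message(old_messages, category)
```

Because `old_messages` has no 'many' entry, the lookup returns the 'other' message, which is why existing translations keep working after the new category is introduced.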

Structure Changes

  • Added additional structure for unit inflections

    • New elements:

      • minimalPairs adds new elements caseMinimalPairs and genderMinimalPairs

      • unit adds a new element gender

      • grammaticalData adds new elements grammaticalDerivations, deriveCompound, and deriveComponent

    • New attributes for existing elements:

      • unitPattern adds a new attribute case

      • grammaticalCase, grammaticalGender, grammaticalDefiniteness add a new attribute scope

      • compoundUnitPattern1 adds new attributes case and gender

      • compoundUnitPattern adds a new attribute case

  • Number symbols adds approximatelySign element

  • Some additional attribute value constraints are added

    • for example, characterLabelPattern@type now allows for superscript and subscript values, indicated by the notation ⟪… strokes⟫➠⟪… strokes, subscript, superscript⟫ in Delta DTDs

    • some of these constraints are expanded due to new structure, while others are

For more details, see the Delta DTDs above.
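
To make the new structure concrete, the Python sketch below parses a hand-written LDML-style fragment that uses the new gender element and the new case attribute on unitPattern. The fragment is illustrative rather than copied from CLDR data, and treating a missing case attribute as nominative is an assumption of this example.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only; not taken from a CLDR locale file.
LDML_FRAGMENT = """
<unit type="length-meter">
  <gender>masculine</gender>
  <unitPattern count="one">{0} Meter</unitPattern>
  <unitPattern count="other">{0} Meter</unitPattern>
  <unitPattern count="other" case="dative">{0} Metern</unitPattern>
</unit>
"""

unit = ET.fromstring(LDML_FRAGMENT)

# The new <gender> child element of <unit>.
unit_gender = unit.findtext("gender")

# Index patterns by (count, case); a missing case is treated as nominative here.
unit_patterns = {
    (p.get("count"), p.get("case", "nominative")): p.text
    for p in unit.findall("unitPattern")
}
```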

Chart Changes

  • All charts are updated for the new data; for example, Romance Annotations shows the new non-emoji symbols and punctuation for Romance languages.

  • The DTD Deltas chart has a more compact representation for changes in attribute constraints, making the changes easier to see.

  • The new Grammatical Forms Charts show the new grammatical forms for units.

Growth

The following chart shows the growth of CLDR locale-specific data over time. It does not include the non-locale specific data, nor locale-specific data that is not collected via the Survey Tool. It is thus restricted to data items in /main and /annotations directories.

The % values are percent of the current measure of Modern coverage. That level is notched up each release, so previous releases had many locales that were at Modern coverage as assessed at the time of their release. There is one line per year, even though there were multiple releases in most years.

See also the Locale Coverage Data v38. For details of the changes, see delta_summary.tsv and locale-growth.tsv.

Migration

  • The plural rules for French changed to add a new category, 'many', using the new operand 'c' (with a synonym 'e'). It should affect only compact number handling.

    • Important: according to the spec, when there is no message for a plural category, the message for 'other' should be returned. As long as implementations observe this policy, migration to this should work without problems.

  • <languageMatches type="written"> was deprecated some time ago, and now has been removed. Clients should use <languageMatches type="written_new"> (recognizing that there are some syntax changes). CLDR-13245

  • The following locales have been moved in the folder structures. CLDR-14080

    • Seed → Common: Sanskrit (sa)

    • Common → Seed: Church Slavic (cu), Volapük (vo), Prussian (prg)

  • The specification for using aliases is more rigorous, and some alias data has changed. Programs using this data may need modification:

    • The specification processes the rules in a certain order, so the file order needs to be maintained.

    • The specification now explicitly takes multiple passes (though that can be optimized by implementations)

    • Various variantAliases are replaced by languageAliases where they require more context to be properly handled (the former specification did not handle variant aliases correctly).

      • AALAND ⇒ AX is replaced by und_aaland ⇒ und_AX

      • arevmda ⇒ hyw is replaced by two rules: hy_arevmda ⇒ hyw & und_arevmda ⇒ und

    • Some spurious aliases have been removed, where they are not properly aliases but rather partial duplications of more complete information:

      • Those covered by the parent locale data and/or likely subtag data, such as az_AZ ⇒ az_Latn_AZ

      • Those covered by canonicalization of extlang subtags, such as zh_wuu ⇒ wuu

  • Changes to the download files:

    • cldr-tools-*.zip no longer contains a built cldr.jar; use the separate cldr-tools-*.jar instead.

      • As of v38.1, cldr-tools-*.zip is no longer included at all. You can download or check out the source tree directly from GitHub.

    • cldr-tools-*.jar is a standalone .jar file containing the CLDR tools and all needed dependencies.

    • There is a new "hashes/" subdirectory which contains GPG signatures and SHA-512 sums.
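
The alias changes in the Migration list above (ordered rules, multiple passes) can be sketched in Python as follows. This is a heavily simplified illustration, not the algorithm from the specification: it uses only the three example rules given above, matches on language and variants only, and the function names and the pass limit are assumptions of the sketch.

```python
# Example rules from the Migration list, kept in order because order matters.
ALIAS_RULES = [
    ("und-aaland", "und-AX"),
    ("hy-arevmda", "hyw"),
    ("und-arevmda", "und"),
]


def parse(tag):
    """Split a tag into language, script, region, and variants (simplified)."""
    parts = tag.replace("_", "-").split("-")
    loc = {"language": parts[0].lower(), "script": None, "region": None, "variants": []}
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            loc["script"] = part.title()
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            loc["region"] = part.upper()
        else:
            loc["variants"].append(part.lower())
    return loc


def unparse(loc):
    """Join the fields back into a tag, with variants in alphabetical order."""
    parts = [loc["language"], loc["script"], loc["region"], *sorted(loc["variants"])]
    return "-".join(p for p in parts if p)


def matches(rule, loc):
    """A rule matches if its language is 'und' or equal, and its variants are all present."""
    return rule["language"] in ("und", loc["language"]) and all(
        v in loc["variants"] for v in rule["variants"]
    )


def apply_rule(rule, repl, loc):
    """Replace the language unless the replacement is 'und'; fill an empty region; swap variants."""
    out = dict(loc)
    if repl["language"] != "und":
        out["language"] = repl["language"]
    if out["region"] is None:
        out["region"] = repl["region"]
    out["variants"] = [v for v in loc["variants"] if v not in rule["variants"]] + repl["variants"]
    return out


def canonicalize(tag, rules=ALIAS_RULES, max_passes=10):
    """Apply the first matching rule per pass, repeating until no rule matches."""
    loc = parse(tag)
    parsed_rules = [(parse(source), parse(target)) for source, target in rules]
    for _ in range(max_passes):
        for rule, repl in parsed_rules:
            if matches(rule, loc):
                loc = apply_rule(rule, repl, loc)
                break
        else:
            break  # no rule matched in this pass, so the result is stable
    return unparse(loc)


western_armenian = canonicalize("hy-arevmda")
aland = canonicalize("sv-aaland")
reordered = canonicalize("hy-arevmda", rules=list(reversed(ALIAS_RULES)))
```

With the rules in the given order, the more specific hy_arevmda rule fires before the und_arevmda rule; reversing the list makes the generic rule fire first and yields a different result, which is why the order needs to be preserved.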

External Data Version

Known Issues

  1. The Transform charts have been disabled until the generating code could be fixed. [CLDR-11019]

  2. The JSON-format data for CLDR 38 currently omits the data from the CLDR common/supplemental files grammaticalFeatures.xml and units.xml. These are all new items in CLDR 37 except for the <unitPreferenceData>, which was formerly in supplementalData.xml. This will be addressed as soon as possible. [CLDR-13730]

  3. Hebrew compact number formatting scrambles text if embedded in an RTL message. [CLDR-14256]

  4. There are a number of fixes needed in the LDML specification.

    1. CLDR-14272 The documentation of @targets and @scope in grammaticalFeatures is missing; see the ticket for the missing text.

    2. CLDR-14312 replacement in subdivisionAlias in common/supplemental/supplementalMetadata.xml contains alpha{2}

    3. CLDR-14318 Should not remove "true" of tfield in UTS35 Appendix A

    4. CLDR-14319 Remove wrong/duplicated example below "Territory Exception" in UTS35 Appendix A

    5. CLDR-14320 "Put all <keywords, tfields> pairs into alphabetical order" is wrong in Appendix A of UTS35

    6. CLDR-13894 Need to use variantAlias replacement in BCP 47 Language Tag to Unicode BCP 47 Locale Identifier

    7. CLDR-14244 Document special 'alt' inheritance

CLDR 38.1

This dot release makes a very small number of incremental additions to version 38 to address the specific bugs listed in Δ38.1. The data changes are summarized in 38.1/delta/index.html. CLDR v38.1 is also included in ICU 68.2.

Migration note for CLDR 38.1:

  • As of v38.1, cldr-tools-*.zip is no longer included in the download files. You can download or check out the source tree directly from GitHub.

Acknowledgments

Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.

The Unicode Terms of Use apply to CLDR data; in particular, see Exhibit 1.

For web pages with different views of CLDR data, see http://cldr.unicode.org/index/charts.