CLDR 31 Release Note

Overview

Unicode CLDR 31 provides an update to the key building blocks for software supporting the world's languages. This data is used by all major software systems for their software internationalization and localization, adapting software to the conventions of different languages for common software tasks.

Some of the improvements in the release are:

  • Canonical codes (See Migration)

    • The subdivision codes have all been changed to the bcp47 format.

    • The locales in the language-territory population data are in canonical format.

    • The timezone ID for GMT has been split from UTC.

    • There is a mechanism for identifying hybrid locales, such as Hinglish.

  • Emoji 5.0

    • Short names and keywords have been updated for English. (Data for other languages is to be gathered in the next cycle.)

    • Collation (sorting) adds the new 5.0 Emoji characters and sequences, and some fixes for Emoji 4.0 characters and sequences.

    • For Emoji usage, subdivision names for Scotland, Wales, and England have been added for 65 languages.

      • [31.0.1] Added full list of derived names #10126, and fixed some collisions in derived names #10127.

For changes that may affect migration to this version, see Migration.

Other structural additions and changes

  • Codes now use canonical form, as described above.

  • New structure for lenient parsing

  • New structure for minimal pairs (for plurals)

  • New language-matching structure for matching groups of countries

  • The literacyPercent for a region is broken out from writingPercent

  • For DTD changes, see DTD Deltas

For more information, see Spec Modifications.

Other data additions and changes

  • New timezone IDs (long form and bcp47 form).

  • New currency code BYN.

  • Minimal pairs for plural rules.

  • New data for lenient parsing

  • Enhanced Language Matching data (new elements and attributes)

  • Updated Windows keyboards

  • <fields> data fleshed out for era, weekday, dayperiod, and zone, and new <fields> data added for weekOfMonth, dayOfYear, weekdayOfMonth.

  • A pseudo-locale generation tool.

  • A number of additions to exemplar characters, such as for Arabic and Farsi

    • Some improvements to the Zawgyi-to-Unicode transform, and other transforms.

  • Collation data updated for Unihan 9.0 and for Emoji 5.0

  • New unit type "length-point"

    • [31.0.1] Fixed inconsistent names in Czechia #10122, and some negative current subpatterns for compact decimal formatting #10131

    • [31.0.1] Fixed collation charts #10139

For more information, see detailed delta charts.

Growth

The following gives the total overview of the change in data items in CLDR. This release did not have a data-submission cycle, so the changes reflect cleanup and bug fixes.

* The measurement of the number of items reflects the different ways that the information is represented. A single data field (element or attribute value) may result in multiple data items. For example, plural rules may be shared by multiple languages, and a single data field contains all the languages to which those rules apply. Sometimes a changed item appears as a deletion+addition, and sequences of items (such as sort order) are not counted as different even if the order changes.

For more details, see the Delta Data charts.

JSON data

  • No structural changes for this release, just updated to match XML data.

Survey Tool

  • No changes in the Survey Tool in this release.

Specification changes

For details, see Spec Modifications.

Migration

  • Code changes

    • The subdivision codes have all been changed to the bcp47 format, e.g., "usca" instead of "US-CA". This affects supplemental containment and subdivisions, and translations in subdivisions/en.xml, etc. See Part 6, Sec 2.2 [#9942].

    • The locales in the language-territory population tables have been changed to the canonical format, dropping the script where it is the default. So "ku_Latn" changes to "ku".

    • The exemplar/ locale data file names have also been changed to be the canonical format, dropping the script where it is the default.

  • Plural rules

    • The Portuguese plural rules have changed so that all (and only) integers and decimal fractions < 2 are singular.

  • Timezones

    • The GMT timezone has been split from the UTC timezone.

    • New timezone bcp47 codes have been added.

  • Language/Region data

    • The new literacyPercent attribute for supplemental <languagePopulation> has been broken out from writingPercent, the latter now only being used to reflect primarily-spoken languages. [#9421]

    • A new format for language matching is provided. To allow time for implementations to change over, the old data is retained, and the new data is marked as "written-new".

    • Languages "hr" and "sr" are no longer a short distance apart, for political reasons.

  • Other

    • The primary names for CZ changed from "Czech Republic" to "Czechia", with the longer name now the alternate.
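
Three of the migration items above are mechanical enough to sketch in code. The following Python is only an illustrative sketch, not part of CLDR or of any library: the function names and the DEFAULT_SCRIPT table are hypothetical, the table is a two-entry excerpt rather than real CLDR data, and the Portuguese rule is read here as "the integer part is 0 or 1", with the sign ignored.

```python
# Hypothetical excerpt of default scripts, for illustration only.
# A real implementation would take this from CLDR's own data.
DEFAULT_SCRIPT = {"ku": "Latn", "en": "Latn"}


def subdivision_to_bcp47(code: str) -> str:
    """Convert a code such as "US-CA" to the new form "usca"."""
    return code.replace("-", "").lower()


def drop_default_script(locale: str) -> str:
    """Drop the script subtag when it is the default for the language."""
    parts = locale.split("_")
    if len(parts) >= 2 and DEFAULT_SCRIPT.get(parts[0]) == parts[1]:
        del parts[1]
    return "_".join(parts)


def pt_plural_category(number: str) -> str:
    """Return "one" when the integer part of the number is 0 or 1.

    The number is passed as a string so that decimal digits are kept as
    written; the sign is ignored (an assumption of this sketch).
    """
    integer_part = number.lstrip("+-").split(".")[0]
    return "one" if int(integer_part or "0") in (0, 1) else "other"
```

For instance, subdivision_to_bcp47("US-CA") gives "usca", drop_default_script("ku_Latn") gives "ku", and pt_plural_category("1.5") gives "one" while pt_plural_category("2") gives "other". Code that stores old-style identifiers or caches plural categories for Portuguese would need a one-time migration along these lines.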

Known Issues

“Week of” structure

The structure and intended usage for the “week x of y” patterns is still being refined and may change. This applies especially to dateFormatItems such as the following:

<dateFormatItem id="MMMMW" count=...>'week' W 'of' MMM</dateFormatItem>

<dateFormatItem id="yw" count=...>'week' w 'of' y</dateFormatItem>

Areas of discussion include the use of the count attribute and the use of ordinal vs. cardinal numbers. For more information see [#9801].

Non-unique emoji short names (fixed in 31.0.1)

Some of the emoji names are not unique. Fixes are being gathered but were not ready in time for the release. See [#10116], [#10127].

Chinese stroke collation

Since CLDR 30, Chinese stroke collation has been missing entries for several basic characters. CLDR 32 reverts the stroke collation data to the CLDR 29 version; a complete fix for the underlying problem is targeted for CLDR 33. See #10497, #10642.

Others

See tickets for v31.0.1.

Acknowledgments

Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.

Key

    • The Release Note contains a general description of the contents of the release, and any relevant notes about the release.

    • The Data link points to a set of zip files containing the contents of the release (the files are complete in themselves, and do not require files from earlier releases -- for the structure of the zip file, see Repository Organization).

    • The Spec is the version of UTS #35: LDML that corresponds to the release.

    • The Delta document points to a list of all the bug fixes and features in the release, which can be used to get the precise corresponding file changes using BugDiffs.

    • The SVN Tag can be used to get the files via Repository Access.

The Unicode Terms of Use apply to CLDR data; in particular, see Exhibit 1.

For web pages with different views of CLDR data, see http://cldr.unicode.org/index/charts.