Unicode CLDR 31 provides an update to the key building blocks for software supporting the world's languages. This data is used by all major software systems for their software internationalization and localization, adapting software to the conventions of different languages for common software tasks.
Some of the improvements in the release are:
- Canonical codes (see Migration)
  - The subdivision codes have been changed to all have the bcp47 format.
  - The locales in the language-territory population data are in canonical format.
  - The timezone ID for GMT has been split from UTC.
  - There is a mechanism for identifying hybrid locales, such as Hinglish.
- Emoji 5.0
  - Short names and keywords have been updated for English. (Data for other languages to be gathered in the next cycle.)
  - Collation (sorting) adds the new 5.0 Emoji characters and sequences, and some fixes for Emoji 4.0 characters and sequences.
  - For Emoji usage, subdivision names for Scotland, Wales, and England have been added for 65 languages.
  - [31.0.1] Added full list of derived names #10126, and fixed some collisions in derived names #10127.
For changes that may affect migration to this version, see Migration.
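As a rough illustration of the two canonical-code changes listed above, the following Python sketch converts a subdivision code to the bcp47 form and drops a default script from a locale. The examples "US-CA" to "usca" and "ku_Latn" to "ku" come from the Migration notes; the function names and the `DEFAULT_SCRIPTS` table are hypothetical, and a real implementation would take its default-script data from CLDR rather than a hard-coded dictionary.

```python
def to_bcp47_subdivision(code: str) -> str:
    """Convert a code such as "US-CA" to the bcp47 form "usca".

    Assumption: the conversion is simply "remove the hyphen and lowercase".
    """
    return code.replace("-", "").lower()


# Hypothetical excerpt of default-script data, for illustration only.
DEFAULT_SCRIPTS = {"ku": "Latn"}


def drop_default_script(locale: str) -> str:
    """Drop the script subtag when it is the default script for the language."""
    parts = locale.split("_")
    if len(parts) >= 2 and DEFAULT_SCRIPTS.get(parts[0]) == parts[1]:
        del parts[1]
    return "_".join(parts)


if __name__ == "__main__":
    print(to_bcp47_subdivision("US-CA"))
    print(drop_default_script("ku_Latn"))
```

Code that stores the older forms as keys (for example in caches or lookup tables) would need a one-time conversion along these lines when moving to the new data.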
Other structural additions and changes
- Codes now use canonical form, as described above.
- New structure for lenient parsing
- New structure for minimal pairs (for plurals)
- New language-matching structure for matching groups of countries
- The literacyPercent for a region is broken out from writingPercent
- For DTD changes, see DTD Deltas
Other data additions and changes
- New timezone IDs (long form and bcp47 form).
- New currency code BYN.
- Minimal pairs for plural rules.
- New data for lenient parsing
- Enhanced Language Matching data (new elements and attributes)
- Updated Windows keyboards
- <fields> data fleshed out for era, weekday, dayperiod, and zone, and new <fields> data added for weekOfMonth, dayOfYear, weekdayOfMonth.
- A pseudo-locale generation tool.
- A number of additions to exemplar characters, such as for Arabic and Farsi
- The ar-015 locale for Arabic with ASCII digits.
- Some improvements to the Zawgyi-to-Unicode transform, and other transforms.
- Collation data updated for Unihan 9.0 and for Emoji 5.0
- New unit type "length-point"
- [31.0.1] Fixed inconsistent names in Czechia #10122, and some negative currency subpatterns for compact decimal formatting #10131
- [31.0.1] Fixed collation charts #10139
The following gives the total overview of the change in data items in CLDR. This release did not have a data-submission cycle, so the changes reflect cleanup and bug fixes.
* The measurement of the number of items reflects the different ways that the information is represented. A single data field (element or attribute value) may result in multiple data items. For example, plural rules may be shared by multiple languages, and a single data field contains all the languages to which those rules apply. Sometimes a changed item appears as a deletion+addition, and sequences of items (such as sort order) are not counted as different even if the order changes.
- No structural changes for this release, just updated to match XML data.
- No changes in the Survey Tool in this release.
| Section | Change |
|---|---|
| Part 1, Sec 3.7 | New table of -t- keys. |
| Part 1, Sec 3.10.2 | Description of new hybrid locale identifiers |
| Part 1, Sec 4.4 | Description of new structures for enhanced language matching |
| Part 1, Sec 6.2 | Improved the emoji grapheme break rule extension GB11′ |
| Part 2, Sec 3.6 | Description of new parseLenient element |
| Part 2, Sec 6 | New unit added for typographic point |
| Part 2, Sec 14.1 | Clarified construction of emoji annotations |
| Part 3, Sec 2.4.1 | Clarified use of ‘0’ in compact decimal patterns |
| Part 4, Sec 3 | New <field> attributes |
| Part 4, Sec 4.3 | Clarified use of “week of” patterns |
| Part 4, Sec 8 | Restructured date-time table. |
| Part 6, Sec 2.2 | Subdivision containment, documenting the change in usage of the subgroup element attributes type and contains |
| Part 6, Sec 2.3 | Supplemental territory info: added literacy percent for language population |
- Code changes
  - The subdivision codes have all been changed to the bcp47 format, e.g., "usca" instead of "US-CA". This affects supplemental containment and subdivisions, and translations in subdivisions/en.xml, etc. See Part 6, Sec 2.2 [#9942]
  - The locales in the language-territory population tables have been changed to the canonical format, dropping the script where it is the default. So "ku_Latn" changes to "ku".
  - The exemplar/ locale data file names have also been changed to the canonical format, dropping the script where it is the default.
- Plural rules
  - The Portuguese plural rules have changed so that all (and only) integers and decimal fractions < 2 are singular.
- The GMT timezone has been split from the UTC timezone.
- New timezone bcp47 codes have been added.
- Language/Region data
  - The new literacyPercent attribute for supplemental <languagePopulation> has been broken out from writingPercent, the latter now only being used to reflect primarily-spoken languages. [#9421]
  - A new format for language matching is provided. To allow time for implementations to change over, the old data is retained, and the new data is marked as "written-new".
  - Languages "hr" and "sr" are no longer a short distance apart, for political reasons.
  - The primary names for CZ changed from "Czech Republic" to "Czechia", with the longer name now the alternate.
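The Portuguese plural change listed above can be illustrated with a small Python sketch. It implements only the statement in these notes (integers and decimal fractions below 2 are singular) and assumes that the sign of the number is ignored; the function name is hypothetical, and the authoritative rule is the one in the CLDR plural rules data.

```python
from decimal import Decimal


def pt_plural_category(number: str) -> str:
    """Return "one" or "other" for a number given as a decimal string.

    Assumption: the absolute value is used, so "-1" is treated like "1".
    """
    value = abs(Decimal(number))
    return "one" if value < 2 else "other"


if __name__ == "__main__":
    for sample in ["0", "1", "1.5", "2", "2.0", "10"]:
        print(sample, pt_plural_category(sample))
```

Implementations that previously treated only 1 as singular for Portuguese may see different message selection for values such as 0 and 1.5 after migrating.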
“Week of” structure
The structure and intended usage for the “week x of y” patterns is still being refined and may change. This applies especially to dateFormatItems such as the following:
<dateFormatItem id="MMMMW" count=...>'week' W 'of' MMM</dateFormatItem>
<dateFormatItem id="yw" count=...>'week' w 'of' y</dateFormatItem>
Areas of discussion include the use of the count attribute and the use of ordinal vs. cardinal numbers. For more information see [#9801].
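To make the shape of these patterns concrete, here is a minimal Python sketch that splits a pattern such as `'week' W 'of' MMM` into quoted literal text and pattern fields. It relies only on the general date-pattern convention that text inside single quotes is literal and that a doubled single quote stands for one quote character; the function names are hypothetical, and the sketch does not compute week numbers or handle the count attribute, both of which are part of what is still under discussion.

```python
def _add_literal(tokens, text):
    """Append literal text, merging it with a preceding literal token."""
    if tokens and tokens[-1][0] == "literal":
        tokens[-1] = ("literal", tokens[-1][1] + text)
    else:
        tokens.append(("literal", text))


def _is_field_letter(ch):
    """Pattern fields are runs of the same ASCII letter."""
    return ch.isascii() and ch.isalpha()


def tokenize_pattern(pattern):
    """Split a date pattern into ("literal", text) and ("field", letters) tokens."""
    tokens = []
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "'":
            if i + 1 < n and pattern[i + 1] == "'":
                _add_literal(tokens, "'")  # '' outside quotes is one quote
                i += 2
                continue
            i += 1  # skip the opening quote
            while i < n:
                if pattern[i] == "'":
                    if i + 1 < n and pattern[i + 1] == "'":
                        _add_literal(tokens, "'")  # '' inside quotes
                        i += 2
                        continue
                    i += 1  # skip the closing quote
                    break
                _add_literal(tokens, pattern[i])
                i += 1
        elif _is_field_letter(ch):
            j = i
            while j < n and pattern[j] == ch:
                j += 1
            tokens.append(("field", pattern[i:j]))
            i = j
        else:
            _add_literal(tokens, ch)  # spaces and punctuation are literal
            i += 1
    return tokens


if __name__ == "__main__":
    print(tokenize_pattern("'week' W 'of' MMM"))
```

A formatter would then replace each field token with a locale-specific value and copy the literal tokens through unchanged.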
Non-unique emoji short names (fixed in 31.0.1)
Some of the emoji names are not unique. Fixes are being gathered, but are not in time for the release. See [#10116], [#10127].
Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.
- The Release Note contains a general description of the contents of the release, and any relevant notes about the release.
- The Data link points to a set of zip files containing the contents of the release (the files are complete in themselves, and do not require files from earlier releases -- for the structure of the zip file, see
- The Spec is the version of UTS #35: LDML that corresponds to the release.
- The Delta document points to a list of all the bug fixes and features in the release, which can be used to get the precise corresponding file changes using BugDiffs.
- The SVN Tag can be used to get the files via Repository Access.
- For more details see CLDR Releases (Downloads).