CLDR 31 Release Note
Unicode CLDR 31 provides an update to the key building blocks for software supporting the world's languages. This data is used by all major software systems for internationalization and localization, adapting software to the conventions of different languages for common software tasks.
Some of the improvements in the release are:
Canonical codes (See Migration)
The subdivision codes have been changed to all have the bcp47 format.
The locales in the language-territory population data are in canonical format.
The timezone ID for GMT has been split from UTC.
There is a mechanism for identifying hybrid locales, such as Hinglish.
Short names and keywords have been updated for English. (Data for other languages is to be gathered in the next cycle.)
Collation (sorting) adds the new Emoji 5.0 characters and sequences, and includes some fixes for Emoji 4.0 characters and sequences.
For Emoji usage, subdivision names for Scotland, Wales, and England have been added for 65 languages.
For changes that may affect migration to this version, see Migration.
Other structural additions and changes
Codes now use canonical form, as described above.
New structure for lenient parsing
New structure for minimal pairs (for plurals)
New language-matching structure for matching groups of countries
The literacyPercent for a region is broken out from writingPercent
For DTD changes, see DTD Deltas
For more information, see Spec Modifications.
Other data additions and changes
New timezone IDs (long form and bcp47 form).
New currency code BYR.
Minimal pairs for plural rules.
New data for lenient parsing
Enhanced Language Matching data (new elements and attributes)
Updated Windows keyboards
<fields> data fleshed out for era, weekday, dayperiod, and zone, and new <fields> data added for weekOfMonth, dayOfYear, weekdayOfMonth.
A pseudo-locale generation tool.
A number of additions to exemplar characters, such as for Arabic and Farsi
Some improvements to the Zawgyi-to-Unicode transform, and other transforms.
Collation data updated for Unihan 9.0 and for Emoji 5.0
New unit type "length-point"
For more information, see detailed delta charts.
The following gives the total overview of the change in data items in CLDR. This release did not have a data-submission cycle, so the changes reflect cleanup and bug fixes.
* The measurement of the number of items reflects the different ways that the information is represented. A single data field (element or attribute value) may result in multiple data items. For example, plural rules may be shared by multiple languages, and a single data field contains all the languages to which those rules apply. Sometimes a changed item appears as a deletion+addition, and sequences of items (such as sort order) are not counted as different even if the order changes.
For more details, see the Delta Data charts.
No structural changes for this release, just updated to match XML data.
No changes in the Survey Tool this release.
For details, see Spec Modifications.
The subdivision codes have been changed to all use the bcp47 format, e.g. "usca" instead of "US-CA". This affects supplemental containment and subdivisions, and translations in subdivisions/en.xml, etc. See Part 6, Sec 2.2 [#9942].
The locales in the language-territory population tables have been changed to be the canonical format, dropping the script where it is the default. For example, "ku_Latn" changes to "ku".
The exemplar/ locale data file names have also been changed to be the canonical format, dropping the script where it is the default.
The Portuguese plural rules have changed so that all (and only) integers and decimal fractions < 2 are singular.
The GMT timezone has been split from the UTC timezone.
New timezone bcp47 codes have been added.
The new literacyPercent attribute for supplemental <languagePopulation> has been broken out from writingPercent, the latter now only being used to reflect primarily-spoken languages. [#9421]
A new format for language matching is provided. To allow time for implementations to change over, the old data is retained, and the new data is marked as "written-new".
Languages "hr" and "sr" are no longer a short distance apart, for political reasons.
The primary names for CZ changed from "Czech Republic" to "Czechia", with the longer name now the alternate.
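Two of the migration items above are mechanical enough to sketch in code: the subdivision code format change and the revised Portuguese plural rule. The Python below is a minimal illustration, not CLDR tooling. The function names `to_bcp47_subdivision` and `pt_plural_category` are hypothetical. The conversion assumes that lowercasing and removing the hyphen is sufficient, which matches the "US-CA" to "usca" example but has not been checked against every code. The plural function follows the prose description only (absolute values below 2 are singular) rather than parsing the rule from the CLDR data files.

```python
from decimal import Decimal


def to_bcp47_subdivision(old_code: str) -> str:
    """Convert an old-style subdivision code such as "US-CA" to bcp47 form.

    Assumption: dropping the hyphen and lowercasing is all that is needed.
    """
    return old_code.replace("-", "").lower()


def pt_plural_category(number: str) -> str:
    """Return "one" or "other" for a decimal string, per the description above.

    Assumption: the sign is ignored, so "-1.5" is treated like "1.5".
    """
    value = abs(Decimal(number))
    return "one" if value < 2 else "other"


if __name__ == "__main__":
    print(to_bcp47_subdivision("US-CA"))  # usca
    for sample in ["0", "1", "1.5", "2", "2.5"]:
        print(sample, pt_plural_category(sample))
```

Code that stores subdivision codes in the old format could run a conversion like this once during migration. For plural selection, it is generally safer to read the rules from the CLDR data than to hard-code them.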
“Week of” structure
The structure and intended usage for the “week x of y” patterns is still being refined and may change. This applies especially to dateFormatItems such as the following:
<dateFormatItem id="MMMMW" count=...>'week' W 'of' MMM</dateFormatItem>
<dateFormatItem id="yw" count=...>'week' w 'of' y</dateFormatItem>
Areas of discussion include the use of the count attribute and the use of ordinal vs. cardinal numbers. For more information see [#9801].
Non-unique emoji short names (fixed in 31.0.1)
Chinese stroke collation
Since CLDR 30, Chinese stroke collation has been missing entries for several basic characters. CLDR 32 reverts the stroke collation data to the CLDR 29 version; a complete fix for the underlying problem is targeted for CLDR 33. See #10497, #10642.
See tickets for v31.0.1.
Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.
The Release Note contains a general description of the contents of the release, and any relevant notes about the release.
The Data link points to a set of zip files containing the contents of the release (the files are complete in themselves, and do not require files from earlier releases -- for the structure of the zip file, see Repository Organization).
The Spec is the version of UTS #35: LDML that corresponds to the release.
The Delta document points to a list of all the bug fixes and features in the release, which can be used to get the precise corresponding file changes using BugDiffs.
The SVN Tag can be used to get the files via Repository Access.
For more details see CLDR Releases (Downloads).
For web pages with different views of CLDR data, see http://cldr.unicode.org/index/charts.