CLDR 38 Release Note
Overview
Unicode CLDR provides an update to the key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.
CLDR v38 focused on enhancing the support for existing locales: support for units of measurement in inflected languages (phase 1), and annotations (names and search keywords) for about 400 more non-emoji symbols and for Emoji v13.1. In this version, there is also substantially higher coverage for (in order of completeness): Norwegian Nynorsk, Hausa, Igbo, Breton, Quechua, Yoruba, Fulah (Adlam script), Chakma, Asturian, Sanskrit, and Dogri.
The units of measurement additions make it possible to support APIs that accept anything from simple unitIDs such as meter to compound unitIDs such as cubic-meter-per-square-second or acre-feet-per-day. Examples of such APIs include the following:
getUnitPattern(unitId, locale, width, pluralCategory, caseVariant) — to get the localized, inflected pattern for a simple or compound unit of measurement, appropriate for a position in a sentence or phrase with the appropriate pluralCategory and grammatical case (nominative, accusative, genitive, etc).
getUnitGender(unitId, locale) — to get the gender for a unit of measurement, so that other parts of a sentence or phrase can be modified to agree with that gender.
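As a rough illustration of how the two calls above might behave, here is a minimal Python sketch. The function names are snake_case renderings of the names above; the tiny lookup tables, the German sample strings, and the fallback order (exact form, then nominative, then 'other') are assumptions made for this example and are not taken from CLDR data or from any particular library.

```python
# Hypothetical, hand-written sample data; a real implementation would load
# these values from CLDR rather than hard-coding them.
UNIT_GENDER = {("de", "meter"): "masculine"}

# Key: (locale, unit_id, width) -> {(plural_category, case): pattern}
UNIT_PATTERNS = {
    ("de", "meter", "long"): {
        ("one", "nominative"): "{0} Meter",
        ("one", "genitive"): "{0} Meters",
        ("other", "nominative"): "{0} Meter",
        ("other", "dative"): "{0} Metern",
    },
}


def get_unit_pattern(unit_id, locale, width, plural_category, case_variant):
    """Return the pattern for a unit, falling back when a form is missing."""
    forms = UNIT_PATTERNS[(locale, unit_id, width)]
    # Assumed fallback order: the exact form, then the nominative for the
    # same plural category, then the 'other' category in the nominative.
    for key in (
        (plural_category, case_variant),
        (plural_category, "nominative"),
        ("other", "nominative"),
    ):
        if key in forms:
            return forms[key]
    raise KeyError(f"no pattern for {unit_id!r} in {locale!r}")


def get_unit_gender(unit_id, locale):
    """Return the grammatical gender recorded for a unit in a locale."""
    return UNIT_GENDER[(locale, unit_id)]


if __name__ == "__main__":
    pattern = get_unit_pattern("meter", "de", "long", "other", "dative")
    print(pattern.format(3))
    print(get_unit_gender("meter", "de"))
```

A caller would typically use the gender to choose an agreeing article or adjective elsewhere in the sentence, and the case variant to match the unit's grammatical role.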
The Survey Tool performs better and now offers structured forum requests to improve coordination among translators. We would like to thank the 393 language experts who contributed to this release.
There are some changes that affect existing specifications and data: for example, the plural rules for French changed to add a new category; the specification for using aliases is more rigorous, and some alias data has changed — along with the specification for handling locale identifier canonicalization. For more information, see Migration.
The overall changes to the data items were:
Added: 155,131
Deleted: 33,805
Changed: 45,895
Data Changes
The following summarizes the changes to the data for this version of CLDR.
Emoji 13.1 and Unicode Symbols
Added names & search keywords for Emoji 13.1 and enhancements to existing emoji annotation data.
Added approximately 400 non-emoji Unicode symbols such as punctuation and currency symbols.
Added 2 character labels: superscript {0} and subscript {0}.
Aside from the CLDR target locales, emoji annotations and keywords expanded in Hausa (ha), Igbo (ig), Kalaallisut (kl), Luxembourgish (lb), Maori (mi), Manipuri (mni), Maltese (mt), Punjabi [Arabic] (pa_Arab), Kinyarwanda (rw), Tajik (tg), Tigrinya (ti), Uyghur (ug), Wolof (wo), Xhosa (xh), Yoruba (yo), with minor expansions in a few other languages.
Compact decimals and Units
Added 14 new units.
Added new binary prefixes.
Added a new plural rule operand 'c' (with a synonym 'e') for languages like French. (CLDR-12010)
Higher Coverage Levels
Modern: Norwegian Nynorsk
Moderate++: Hausa, Igbo, Breton, Quechua, Yoruba — made significant improvements, but didn't make it quite to Modern
Moderate: Fulah (Adlam), Chakma, Asturian
Basic+: Wolof, Tajik, Maori, Luxembourgish, Uyghur, Tigrinya — made significant improvements, but didn't get near to Moderate
Basic: Sanskrit, Dogri
Unit Inflections
Completed phase 1. The full goal is to add full case and gender support for formatted units. During phase 1, a limited number of locales (see below) and units of measurement are being handled, so that we can work kinks out of the process before expanding to all units for all locales (where we can get the grammatical structure).
Case & Gender: Polish (pl), Russian (ru), German (de), Hindi (hi) (in rough order of complexity)
Gender Only: Dutch (nl), Norwegian Bokmål (nb), Danish (da), Swedish (sv), French (fr), Italian (it), Portuguese (pt), Spanish (es)
Performance & Quality
Made substantial improvements in Survey Tool performance, lowering cost for translation.
Made substantial improvement in quality, using structured Forum topics to allow translators to collaborate more effectively.
Improved detection of translator errors.
ICU support
Improvements to CLDR API, providing a limited, stable API for extracting CLDR data.
Added approximatelySign for number formatting.
Unicode locale identifiers and BCP 47
Added a new -u locale extension keyword -dx, used to specify scripts to exclude from dictionary-based breaking (for word and line breaks).
Added a new short timezone identifier: tz-glgoh
Revamped the language, script, region, and variant alias data to improve replacement of deprecated codes.
For access to the draft data, see the git tag above. For more details see the Delta tickets above.
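To make the -dx keyword mentioned above more concrete, the following Python sketch pulls the keywords out of the -u- extension of a locale identifier. It is a simplified parser written for this note, not the full algorithm from the specification: it ignores attributes, private-use sequences, and validation, and the sample identifier in it is invented for illustration.

```python
def unicode_extension_keywords(locale_id):
    """Return {key: [type subtags]} from the -u- extension of a locale id.

    Simplified: no validation, and -x- private-use subtags are not handled.
    """
    subtags = locale_id.lower().replace("_", "-").split("-")
    if "u" not in subtags:
        return {}
    keywords = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            break  # another singleton starts a different extension
        if len(subtag) == 2:
            key = subtag  # two-character subtags are keys
            keywords[key] = []
        elif key is not None:
            keywords[key].append(subtag)  # longer subtags belong to the key
        # Subtags before the first key are attributes; ignored in this sketch.
    return keywords


if __name__ == "__main__":
    # Invented example: exclude Thai script from dictionary-based breaking.
    print(unicode_extension_keywords("th-u-dx-thai-nu-latn"))
```

A segmentation layer could then consult the list under "dx" to decide which scripts to skip when applying dictionary-based breaks.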
JSON Data Changes
JSON data now includes data for plural ranges, grammatical inflections, typographical labels, and annotations. If you are making use of JSON data, please join the [cldr-users] mailing list where we would like to hear your feedback.
CLDR JSON data for v38 is available; see https://github.com/unicode-org/cldr-json
Specification Changes
The largest changes were the following:
To make the canonicalization of locale identifiers clear and unambiguous, the specification for canonicalization was substantially restructured. (This was done in concert with fixes to the alias data to work better with the specification.) See Migration and Annex C. LocaleId Canonicalization for more details.
To allow for overriding dictionary-based segmentation breaks, added the Unicode Dictionary Break Exclusion Identifier, with the new key “dx”.
For picking the correct units of measurement for locales, defined the userPreferences skeleton more precisely.
For accurate plural categories in compact numbers, added the 'c' operand to plural rules to provide formatting for languages such as French. (CLDR-12010)
To support inflected units of measurement (phase 1), added specifications for the new elements listed under Structure Changes and an algorithm for how to construct grammatical unit names (simple or compound).
For more detailed specification changes, see the Spec above, and look at the Modifications section.
Structure Changes
Added additional structure for unit inflections
New elements:
minimalPairs adds new elements caseMinimalPairs and genderMinimalPairs
unit adds a new element gender
grammaticalData adds new elements grammaticalDerivations, deriveCompound, and deriveComponent
New attributes for existing elements:
unitPattern adds a new attribute case
grammaticalCase, grammaticalGender, grammaticalDefiniteness add a new attribute scope
compoundUnitPattern1 adds new attributes case and gender
compoundUnitPattern adds a new attribute case
Number symbols adds approximatelySign element
Some additional attribute value constraints are added
for example, characterLabelPattern@type now allows for superscript and subscript values, indicated by the notation ⟪… strokes⟫➠⟪… strokes, subscript, superscript⟫ in Delta DTDs
some of these constraints are expanded due to new structure, while others are
For more details, see the Delta DTDs above.
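The new gender element and the case attribute on unitPattern can be read with any XML parser. The sketch below uses Python's standard library on a small fragment written for this example. The fragment follows the element and attribute names listed above but is not copied from a CLDR data file, and treating a missing case attribute as the nominative is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment written for this example; not copied from CLDR.
SAMPLE = """
<unit type="length-meter">
  <gender>masculine</gender>
  <unitPattern count="one">{0} Meter</unitPattern>
  <unitPattern count="one" case="genitive">{0} Meters</unitPattern>
  <unitPattern count="other">{0} Meter</unitPattern>
  <unitPattern count="other" case="dative">{0} Metern</unitPattern>
</unit>
"""


def read_unit(xml_text):
    """Return (gender, {(count, case): pattern}) for one <unit> element."""
    unit = ET.fromstring(xml_text)
    gender = unit.findtext("gender")
    patterns = {}
    for element in unit.findall("unitPattern"):
        # Assumption: a pattern without a case attribute is the nominative.
        case = element.get("case", "nominative")
        patterns[(element.get("count"), case)] = element.text
    return gender, patterns


if __name__ == "__main__":
    gender, patterns = read_unit(SAMPLE)
    print(gender)
    for (count, case), pattern in sorted(patterns.items()):
        print(count, case, pattern)
```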
Chart Changes
All charts are updated for the new data; for example, Romance Annotations shows the new non-emoji symbols and punctuation for Romance languages.
The DTD Deltas chart has a more compact representation for changes in attribute constraints, making the changes easier to see.
The new Grammatical Forms Charts show the new grammatical forms for units.
Growth
The following chart shows the growth of CLDR locale-specific data over time. It does not include the non-locale specific data, nor locale-specific data that is not collected via the Survey Tool. It is thus restricted to data items in /main and /annotations directories.
The % values are percent of the current measure of Modern coverage. That level is notched up each release, so previous releases had many locales that were at Modern coverage as assessed at the time of their release. There is one line per year, even though there were multiple releases in most years.
See also the Locale Coverage Data v38. For details of the changes, see delta_summary.tsv and locale-growth.tsv.
Migration
The plural rules for French changed to add a new category, 'many', using the new operand 'c' (with a synonym 'e'). It should only have effect on compact number handling.
Important: according to the spec, when there is no message for a plural category, the message for 'other' should be returned. As long as implementations observe this policy, migration to this should work without problems.
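The fallback policy described above can be sketched in a few lines of Python. The function name and the French sample messages are made up for this illustration; only the rule itself (use 'other' when a category has no message) comes from the text above.

```python
def select_message(messages, plural_category):
    """Pick the message for a plural category, defaulting to 'other'."""
    # 'other' is required in every message set, so it is a safe default.
    return messages.get(plural_category, messages["other"])


if __name__ == "__main__":
    # A catalog written before French gained the 'many' category.
    legacy = {"one": "{0} jour", "other": "{0} jours"}
    # A catalog updated to supply a dedicated 'many' message.
    updated = {"one": "{0} jour", "many": "{0} de jours", "other": "{0} jours"}

    print(select_message(legacy, "many"))
    print(select_message(updated, "many"))
```

With this policy, message catalogs that predate the new category keep working unchanged, and translators can add a 'many' message later where it is needed.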
<languageMatches type="written"> was deprecated some time ago, and now has been removed. Clients should use <languageMatches type="written_new"> (recognizing that there are some syntax changes). CLDR-13245
The following locales have been moved in the folder structures. CLDR-14080
Seed → Common: Sanskrit (sa)
Common → Seed: Church Slavic (cu), Volapük (vo), Prussian (prg)
The specification for using aliases is more rigorous, and some alias data has changed. Programs using this data may need modification:
The specification processes the rules in a certain order, so the file order needs to be maintained.
The specification now explicitly takes multiple passes (though that can be optimized by implementations)
Various variantAliases are replaced by languageAliases where they require more context to be properly handled (the former specification did not handle variant aliases correctly).
AALAND ⇒ AX is replaced by und_aaland ⇒ und_AX
arevmda ⇒ hyw is replaced by two rules: hy_arevmda ⇒ hyw & und_arevmda ⇒ und
Some spurious aliases have been removed, where they are not properly aliases but rather partial duplications of more complete information:
Those covered by the parent locale data and/or likely subtag data, such as az_AZ ⇒ az_Latn_AZ
Those covered by canonicalization of extlang subtags, such as zh_wuu ⇒ wuu
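The following Python sketch is a toy model of why rule order and multiple passes matter for the alias rules quoted above. It is deliberately much simpler than the algorithm in the specification: it models only language, region, and variants; the Tag class and helper names are hypothetical; and keeping an existing region in preference to the replacement's region is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    language: str = "und"
    region: str = ""
    variants: frozenset = frozenset()


# (type, replacement) pairs in the order they must be tried. The three rules
# are the examples quoted in the text; the more specific rule comes first.
RULES = [
    (Tag("hy", variants=frozenset({"arevmda"})), Tag("hyw")),
    (Tag("und", variants=frozenset({"arevmda"})), Tag("und")),
    (Tag("und", variants=frozenset({"aaland"})), Tag("und", region="AX")),
]


def matches(rule_type, tag):
    """'und' and empty fields in the rule type match anything."""
    return (
        rule_type.language in ("und", tag.language)
        and rule_type.region in ("", tag.region)
        and rule_type.variants <= tag.variants
    )


def apply_rule(rule_type, replacement, tag):
    language = tag.language if replacement.language == "und" else replacement.language
    # Assumption: an existing region is kept; otherwise take the replacement's.
    region = tag.region or replacement.region
    variants = (tag.variants - rule_type.variants) | replacement.variants
    return Tag(language, region, variants)


def canonicalize(tag):
    """Apply the first matching rule, then start over, until none match."""
    changed = True
    while changed:
        changed = False
        for rule_type, replacement in RULES:
            if matches(rule_type, tag):
                tag = apply_rule(rule_type, replacement, tag)
                changed = True
                break
    return tag


if __name__ == "__main__":
    print(canonicalize(Tag("hy", variants=frozenset({"arevmda"}))))
    print(canonicalize(Tag("sv", variants=frozenset({"aaland"}))))
```

If the first two rules were swapped, hy with the arevmda variant would lose the variant but stay hy instead of becoming hyw, which is why the file order has to be preserved.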
Changes to the download files:
cldr-tools-*.zip no longer contains a built cldr.jar; use the separate cldr-tools-*.jar instead.
As of v38.1 and later, cldr-tools-*.zip is no longer included at all. You can download or check out the source tree directly from GitHub.
cldr-tools-*.jar is a standalone .jar file containing the CLDR tools and all needed dependencies.
There is a new "hashes/" subdirectory which contains GPG signatures and SHA-512 sums.
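One way to check a downloaded file against a published SHA-512 sum is sketched below in Python. The helper names are hypothetical, and the sketch assumes you have already copied the hexadecimal digest for your file out of the hashes/ directory; it does not parse the sum files themselves or verify the GPG signatures.

```python
import hashlib


def sha512_hex(path, chunk_size=1 << 20):
    """Compute the SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sum(path, published_hex):
    """Compare a file's digest with a published hex digest, ignoring case."""
    return sha512_hex(path) == published_hex.strip().lower()
```

Reading in chunks keeps memory use flat for large archives; a mismatch means the download should be discarded and fetched again.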
External Data Version
There is a new file, properties/external_data_versions.tsv that supplies information on which versions of external data were used in CLDR.
TimeZone (TZData) is up to date with IANA TZ database 2020c. See details in CLDR-14224 and CLDR-13739.
Known Issues
The Transform charts have been disabled until the generating code can be fixed. [CLDR-11019]
The JSON-format data for CLDR 38 currently omits the data from the CLDR common/supplemental files grammaticalFeatures.xml and units.xml. These are all new items in CLDR 37 except for the <unitPreferenceData>, which was formerly in supplementalData.xml. This will be addressed as soon as possible. [CLDR-13730]
Hebrew compact number formatting scrambles text if embedded in an RTL message. [CLDR-14256]
There are a number of fixes needed in the LDML specification.
CLDR-14272 The documentation of @targets and @scope in grammaticalFeatures is missing; see the ticket for the missing text.
CLDR-14312 replacement in subdivisionAlias in common/supplemental/supplementalMetadata.xml contains alpha{2}
CLDR-14318 Should not remove "true" of tfield in UTS35 Appendix A
CLDR-14319 Remove wrong/duplicated example below "Territory Exception" in UTS35 Appendix A
CLDR-14320 "Put all <keywords, tfields> pairs into alphabetical order" is wrong in Appendix A of UTS35
CLDR-13894 Need to use variantAlias replacement in BCP 47 Language Tag to Unicode BCP 47 Locale Identifier
CLDR-14244 Document special 'alt' inheritance
CLDR 38.1
This dot release makes a very small number of incremental additions to version 38 to address the specific bugs listed in Δ38.1. The data changes are summarized in 38.1/delta/index.html. CLDR v38.1 is also included in ICU 68.2.
Migration note for CLDR 38.1:
As of v38.1 and later, cldr-tools-*.zip is no longer included in the download files. You can download or check out the source tree directly from GitHub.
Acknowledgments
Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.
The Unicode Terms of Use apply to CLDR data; in particular, see Exhibit 1.
For web pages with different views of CLDR data, see http://cldr.unicode.org/index/charts.