CLDR 44 Release Note
This version is currently at alpha. For production use, see the latest release.
Overview
Unicode CLDR provides key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.
In CLDR 44, the focus is on:
Formatting Person Names. Added further enhancements (data and structure) for formatting people's names. For more information on why this feature is being added and what it does, see Background.
Emoji 15.1 Support. Added short names, keywords, and sort-order for the new Unicode 15.1 emoji.
Unicode 15.1 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.
Digitally disadvantaged language coverage. Work began to improve DDL coverage, with the following DDL locales now having higher coverage levels:
Modern: Cherokee, Lower Sorbian, Upper Sorbian
Moderate: Anii, Interlingua, Kurdish, Māori, Venetian
Basic: Esperanto, Interlingue, Kangri, Kuvi, Kuvi (Devanagari), Kuvi (Odia), Kuvi (Telugu), Ligurian, Lombard, Low German, Luxembourgish, Makhuwa, Maltese, N’Ko, Occitan, Prussian, Silesian, Swampy Cree, Syriac, Toki Pona, Uyghur, Western Frisian, Yakut, Zhuang
Locale Coverage Status
The coverage status determines how well languages are supported on laptops, phones, and other computing devices. In particular, qualifying at a Basic level is typically a requirement for being even selectable on phones as one of the user's languages. Note that for each language there are typically multiple locales, so 90 languages at Modern coverage corresponds to more than 350 locales at that coverage.
Below is the coverage in this release:
Data Changes
The following is a summary of the DTD changes, which reflect changes in the structure. The relevant ones are described more fully in the data changes.
LDML
characterLabels - characterLabelPattern addition of 'facing-left' and 'facing-right' to support Unicode 15.1 emoji that can face different directions.
contextTransformUsage - many more values allowed for the type attribute (previously it only supported a subset of the documented values)
dateFormatItem and intervalFormatItem - many more skeletons allowed for the id attribute, for example EEEEd, GyMEEEEd, GyMMMEEEEd, GyMMMMEd …
territory - added two alternative names for the territory: British Indian Ocean Territory or Chagos Archipelago
personNames
Added two new parameter defaults for length and formality. These allow users to set the most customary values used in their language for common usage.
Added a new field nativeSpaceReplacement. This can be used in languages that don't normally use spaces between words.
Supplemental Data
convertUnit/systems - additional unit systems have been added, for finer-grained distinctions.
unitQuantity/descriptions - descriptions can be added for unit quantities (such as length, area, etc.)
BCP47
key/types - allow for an IANA parameter for timezones, so that the current 'canonical' timezone can be identified and used.
The Islamic calendars are now described as Hijri, and their names may also have changed in particular locales.
The new iana attribute provides the current canonical IANA timezone ID, where that is unclear. [TBD Yoshito to refine]
New locales were added, including en_ID and es_JP, plus many locales at a Basic level.
Fixes
There was a fix made for the Zanb script, which was mistakenly categorized as special instead of regular.
There was a fix made to the BCP47 Latin↔︎ASCII transliterator ID.
Units
The gasoline-energy-density unit (used in miles per gallon of gasoline equivalent for electric vehicles) and the pint-imperial unit (used in the UK) were added.
The unit of wind speed, Beaufort, was added for translation in locales where it is used.
Remaining SI units were added. Because these are primarily of use in scientific fields, they are not translated.
A few traditional English units were added, such as chain and fortnight. These were also not translated.
Many traditional Japanese units were added. These were not translated, outside of Japanese and English.
Many units have more refined (and sometimes corrected) unit systems.
The new SI prefixes for powers of 10 have generally been added: quetta (10^30), ronna (10^27), ronto (10^-27), and quecto (10^-30). In some non-Latin-script languages there are not yet standard names for these, and in those the prefixes are left with Latin characters.
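As a rough illustration of the four powers of 10 listed above, the following Python sketch maps each prefix name to its exponent and scales a value to the base unit. The prefix names are those adopted for the SI in 2022; the dictionary and the helper function are hypothetical and are not CLDR data structures.

```python
# Hypothetical lookup table: SI prefix name -> power of ten.
NEW_SI_PREFIXES = {
    "quetta": 30,
    "ronna": 27,
    "ronto": -27,
    "quecto": -30,
}

def to_base_unit(value: float, prefix: str) -> float:
    """Scale a value expressed with one of the new prefixes to the base unit."""
    return value * 10.0 ** NEW_SI_PREFIXES[prefix]

if __name__ == "__main__":
    # Two ronnagrams expressed in grams.
    print(to_base_unit(2, "ronna"))
```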
Likely Subtags — general cleanup
Addition of data donated by SIL for determining the most likely script and region for languages.
Addition of more und_ mappings. These provide for getting a default language if only the script, region, or both are known. These are, however, of limited use, so implementations may want to filter them out.
Removal of mappings for macroregion codes, such as und_002, which are of very limited utility.
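For implementations that choose to filter out the und_ mappings, a minimal sketch in Python might look like the following. The table is a small illustrative sample rather than the full CLDR data, and the function name is hypothetical. Keeping the bare "und" entry as a last-resort fallback is an assumption of this sketch, not a CLDR recommendation.

```python
# Illustrative sample of likely-subtag mappings (not the full CLDR data).
SAMPLE_LIKELY_SUBTAGS = {
    "und": "en_Latn_US",
    "und_Latn": "en_Latn_US",
    "und_DE": "de_Latn_DE",
    "de": "de_Latn_DE",
    "sr": "sr_Cyrl_RS",
}

def drop_und_mappings(mappings: dict[str, str]) -> dict[str, str]:
    """Return a copy without the und_ entries, keeping the bare "und" entry."""
    return {key: value for key, value in mappings.items()
            if not key.startswith("und_")}

filtered = drop_und_mappings(SAMPLE_LIKELY_SUBTAGS)
print(sorted(filtered))  # ['de', 'sr', 'und']
```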
Language Containment Groups
Additional mappings have been added.
Plural rules — have been added for blo.
Preferred hour formats — have changed substantially for many Latin American countries
Locale Changes
There were general changes to fix the lenient parsing set for $. (The previous format for entering Unicode characters led to not escaping $; the new format is more forgiving.)
Many locales will have changed the name for the code IO to be names like "Chagos-Archipel". There are two alternates, so implementations can use the name that works best for them.
The Islamic calendar names have often been changed in English and many locales to use more descriptive names like "Hijri calendar".
Some flexible date formats may use different spacing.
Sierra Leone changed their currency — the new names are available, and the old names have an appended date range.
The Kyrgyzstan narrow symbol is now used. (Note: CLDR holds off on using new Unicode characters for currencies for a few cycles, to allow system fonts to catch up.)
There was a concerted effort to fix the Person Name Formatting data for a number of locales.
There was a concerted effort to fix the names of certain units of measurement for many locales.
The new Unicode 15.1 emoji had names and search keywords added.
Many languages added search keywords for symbols like ◉, ⋂, ⊆.
Languages made improvements to other items as needed per language.
File Changes
(Aside from locale files)
Additions:
New XSD files in /common/dtd/.
These correspond to the DTDs, but do not carry the extra validity annotations.
ldml.xsd, ldmlBCP47.xsd, ldmlSupplemental.xsd, xml.xsd
New Test Data files in /common/testData/
localeIdentifiers/likelySubtags.txt
personNameTest/_header.txt, _readme.txt, chr.txt, sw_KE.txt, tg.txt, ti.txt, wo.txt
transforms/und-t-und-latn-d0-ascii.txt (changed name)
Removals:
Files with insufficient data:
/common/testData/personNameTest/br.txt, brx.txt, gaa.txt, ks_Deva.txt, lij.txt, pcm.txt, sat.txt, syr.txt, to.txt, tt.txt, xh.txt
Old format keyboard:
/keyboards/
JSON Data Changes
Specification Changes
###TBD - Will be added by Spec Beta on Oct 4
Growth
The following chart shows the growth of CLDR locale-specific data over time. It is restricted to data items in /main and /annotations directories, so it does not include the non-locale-specific data; nor does it include corrections (which typically outnumber new items). The % values are percent of the current measure of Modern coverage. That level increases each release, so previous releases had many locales that were at Modern coverage as assessed at the time of their release. There is one line per year, even though there were multiple releases in most years.
There were relatively few additions this cycle; the focus was on improving quality, and such changes do not show up in the chart.
Migration
Unit systems provide information about general usage of units of measure. For example, "knot" is in the customary US and UK systems, but is also acceptable for use with SI.
Implementations using the unit systems will find that some units have changed systems (either to be finer-grained, or to incorporate corrections).
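Because a unit can belong to more than one system, code that consumes this data typically treats the systems as a set per unit. The Python sketch below shows one way to model that. The table contents and the system identifiers (ussystem, uksystem, si_acceptable, si, metric) are assumptions chosen to match the "knot" description above; they are not copied from the CLDR data files.

```python
# Hypothetical unit -> systems table; identifiers are assumed for illustration.
SAMPLE_UNIT_SYSTEMS = {
    "knot": {"ussystem", "uksystem", "si_acceptable"},
    "meter": {"si", "metric"},
}

def is_in_system(unit: str, system: str) -> bool:
    """Report whether a unit is listed under the given system."""
    return system in SAMPLE_UNIT_SYSTEMS.get(unit, set())

def units_in_system(system: str) -> list[str]:
    """List all units in the sample table that belong to a system."""
    return sorted(unit for unit, systems in SAMPLE_UNIT_SYSTEMS.items()
                  if system in systems)

print(is_in_system("knot", "si_acceptable"))  # True
print(units_in_system("uksystem"))            # ['knot']
```

Code that previously assumed one system per unit, or that hard-coded a system name, is the kind of code most likely to need review after this change.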
LikelySubtags are used to find the most likely missing subtags in a locale identifier, and also the minimal form. Thus "de" (German) expands to "de-Latn-DE" (German written in Latin script as used in Germany), and all of ("de-Latn-DE", "de-Latn", "de-DE") minimize to "de".
The algorithm for lookup has changed slightly (favoring script over region), and there have been data changes: most macroregions are gone (such as mapping from und-003) and some other und mappings. There remain some xx-YYY-001 results for artificial languages.
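The expansion and minimization described above can be sketched in Python with a toy table. This is a simplified illustration, not the full LDML algorithm: the table holds only two sample entries, the und_ lookups are omitted, and the lookup order (script-based keys before region-based keys) is an assumption based on the "favoring script over region" description. All function names are hypothetical.

```python
# Toy likely-subtags table; a real implementation would load the CLDR data.
LIKELY = {
    "de": ("de", "Latn", "DE"),
    "en": ("en", "Latn", "US"),
}

def parse(tag):
    """Split a simple language[-script][-region] tag into its three parts."""
    parts = tag.replace("_", "-").split("-")
    lang, script, region = parts[0].lower(), "", ""
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part.title()
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            region = part.upper()
    return lang, script, region

def join(lang, script, region):
    """Join the non-empty parts back into a hyphenated tag."""
    return "-".join(p for p in (lang, script, region) if p)

def maximize(tag):
    """Fill in missing script and region from the first matching table entry."""
    lang, script, region = parse(tag)
    # Assumed lookup order: script-based keys are tried before region-based keys.
    for key in (join(lang, script, region), join(lang, script, ""),
                join(lang, "", region), lang):
        if key in LIKELY:
            _, likely_script, likely_region = LIKELY[key]
            return join(lang, script or likely_script, region or likely_region)
    return join(lang, script, region)

def minimize(tag):
    """Return the shortest trial tag that expands to the same maximized tag."""
    full = maximize(tag)
    lang, script, region = parse(full)
    for trial in (lang, join(lang, "", region), join(lang, script, "")):
        if maximize(trial) == full:
            return trial
    return full

print(maximize("de"))  # de-Latn-DE
print([minimize(t) for t in ("de-Latn-DE", "de-Latn", "de-DE")])  # ['de', 'de', 'de']
```

A tag whose region differs from the likely one, such as "de-CH", keeps that region when minimized, because dropping it would change the expanded result.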
Preferred hour formats indicate the preferred form for a locale: 11 PM vs 23:00 vs 11 in the evening.
These have changed substantially for many Latin American countries.
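To show what a preferred hour format controls, the Python sketch below formats the same time with a 12-hour or 24-hour clock depending on a per-region preference. The table is a two-entry sample for illustration only, the symbols "h" (12-hour) and "H" (24-hour) follow LDML date pattern syntax, and the function name and the fallback to "H" for unknown regions are assumptions of this sketch.

```python
from datetime import time

# Sample region -> preferred hour symbol values, for illustration only.
SAMPLE_PREFERRED_HOUR = {"US": "h", "DE": "H"}

def format_hour(t: time, region: str) -> str:
    """Format a time using the region's preferred hour cycle."""
    symbol = SAMPLE_PREFERRED_HOUR.get(region, "H")  # assumed fallback
    if symbol == "h":
        hour12 = t.hour % 12 or 12
        suffix = "AM" if t.hour < 12 else "PM"
        return f"{hour12} {suffix}"
    return f"{t.hour:02d}:{t.minute:02d}"

print(format_hour(time(23, 0), "US"))  # 11 PM
print(format_hour(time(23, 0), "DE"))  # 23:00
```

An implementation that caches or hard-codes the hour cycle per region would produce different output after a data change like this one, so such caches are worth refreshing when upgrading.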
###TBD — more items will be added
Known Issues
###TBD
Acknowledgments
Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.
The Unicode Terms of Use apply to CLDR data; in particular, see Exhibit 1.
For web pages with different views of CLDR data, see http://cldr.unicode.org/index/charts.