CLDR 44 Release Note
Unicode CLDR provides key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.
In CLDR 44, the focus is on:
Locale Coverage Status
The coverage status determines how well languages are supported on laptops, phones, and other computing devices. In particular, qualifying at a Basic level is typically a requirement for being even selectable on phones as one of the user's languages. Note that for each language there are typically multiple locales, so 90 languages at Modern coverage corresponds to more than 350 locales at that coverage.
Below is the coverage in this release:
The following is a summary of the DTD changes, which reflect changes in the structure. The relevant ones are described more fully in the data changes.
characterLabels - characterLabelPattern addition of 'facing-left' and 'facing-right' to support Unicode 15.1 emoji that can face different directions.
contextTransformUsage - many more values allowed for the type attribute (previously it only supported a subset of the documented values)
dateFormatItem and intervalFormatItem - many more skeletons allowed for the id attribute, for example EEEEd, GyMEEEEd, GyMMMEEEEd, GyMMMMEd …
territory - added two alternative names for the territory: British Indian Ocean Territory or Chagos Archipelago
Added two new parameter defaults for length and formality. These allow users to set the most customary values used in their language for common usage.
Added a new field nativeSpaceReplacement. This can be used in languages that don't normally use spaces between words.
convertUnit/systems - additional unit systems have been added, for finer-grained distinctions.
unitQuantity/descriptions - descriptions can be added for unit quantities (such as length, area, etc.)
key/types - allow for an IANA parameter for timezones, so that the current 'canonical' timezone can be identified and used.
The Islamic calendars are now described as Hijri, and their names may also have changed in particular locales.
The new iana attribute provides the current canonical IANA timezone ID, where that is unclear.
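As a rough illustration of how the iana attribute might be consumed, the sketch below parses a small inline XML fragment shaped like the BCP47 timezone data and returns the canonical ID, falling back to the first alias when no iana attribute is present. The fragment, its attribute values, the fallback rule, and the helper name canonical_iana_id are all assumptions made for illustration; they are not taken from the release data.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: the element layout and values are assumptions,
# not a copy of the CLDR timezone data file.
SAMPLE = """
<keyword>
  <key name="tz">
    <type name="inccu" alias="Asia/Calcutta Asia/Kolkata" iana="Asia/Kolkata"/>
    <type name="deber" alias="Europe/Berlin"/>
  </key>
</keyword>
"""


def canonical_iana_id(root, short_id):
    """Return the canonical IANA ID for a short timezone ID (hypothetical helper).

    Prefers the explicit iana attribute; otherwise falls back to the first
    alias, which is an assumption of this sketch.
    """
    for type_el in root.iter("type"):
        if type_el.get("name") == short_id:
            iana = type_el.get("iana")
            if iana:
                return iana
            aliases = (type_el.get("alias") or "").split()
            return aliases[0] if aliases else None
    return None


root = ET.fromstring(SAMPLE)
print(canonical_iana_id(root, "inccu"))
print(canonical_iana_id(root, "deber"))
```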
New locales were added, including en_ID and es_JP, plus many locales at a Basic level.
There was a fix made for the Zanb script, which was mistakenly categorized as special instead of regular.
There was a fix made to the BCP47 Latin↔︎ASCII transliterator ID.
The gasoline-energy-density unit (used in miles per gallon of gasoline equivalent for electric vehicles) and the pint-imperial (used in the UK), plus many Japanese traditional units were added.
The unit of wind speed, Beaufort, was added for translation in locales where it is used.
Remaining SI units were added. Because these are primarily of use in scientific fields, they are not translated.
A few traditional English units were added, such as chain and fortnight. These were also not translated.
Many traditional Japanese units were added. These were not translated, outside of Japanese and English.
Many units have more refined (and sometimes corrected) unit systems.
The new SI prefixes for powers of 10 have generally been added: 30, 27, -27, -30. In some non-Latin-script languages there are not yet standard names for these, and in those the prefixes are left with Latin characters.
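For reference, the four powers of 10 listed above correspond to the SI prefixes quetta, ronna, ronto, and quecto. The following is a minimal sketch of composing prefixed names and symbols from such a table; the table name and the helper functions are hypothetical, and the English names shown are the SI ones rather than CLDR locale data.

```python
# SI prefixes for the four powers of 10 named in the note.
# Keys are exponents; values are (English prefix name, symbol).
NEW_SI_PREFIXES = {
    30: ("quetta", "Q"),
    27: ("ronna", "R"),
    -27: ("ronto", "r"),
    -30: ("quecto", "q"),
}


def prefixed_name(power, base_name):
    """Compose a prefixed unit name, e.g. power 30 + 'gram' (hypothetical helper)."""
    name, _symbol = NEW_SI_PREFIXES[power]
    return name + base_name


def prefixed_symbol(power, base_symbol):
    """Compose a prefixed unit symbol, e.g. power -30 + 'm' (hypothetical helper)."""
    _name, symbol = NEW_SI_PREFIXES[power]
    return symbol + base_symbol


print(prefixed_name(30, "gram"), prefixed_symbol(30, "g"))
print(prefixed_name(-30, "meter"), prefixed_symbol(-30, "m"))
```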
Likely Subtags — general cleanup
Addition of data donated by SIL for determining the most likely script and region for languages.
Addition of more und_ mappings. These provide for getting a default language if only the script, region, or both are known. These are, however, of limited usage, so implementations may want to filter them out.
Removal of macroregion codes, such as und_002, which are of very limited utility.
Language Containment Groups
Additional mappings have been added.
Plural rules have been added for blo.
Preferred hour formats have changed substantially for many Latin American countries.
There were general changes to fix the lenient parsing set for $. (The previous format for entering Unicode characters led to not escaping $; the new format is more forgiving.)
Many locales have changed the name for the code IO to names like "Chagos-Archipel". There are two alternates, so implementations can use the name that works best for them.
The Islamic calendar names have often been changed in English and many locales to use more descriptive names like "Hijri calendar".
Some flexible date formats may use different spacing.
Sierra Leone changed their currency — the new names are available, and the old names have an appended date range.
The Kyrgyzstan narrow symbol is now used. (Note: CLDR holds off on using new Unicode characters for currencies for a few cycles, to allow system fonts to catch up.)
There was a concerted effort to fix the Person Name Formatting data for a number of locales.
There was a concerted effort to fix the names of certain units of measurement for many locales.
The new Unicode 15.1 emoji had names and search keywords added.
Many languages added search keywords for symbols like ◉, ⋂, ⊆
Languages made improvements to other items as needed per language.
(Aside from locale files)
New XSD files in /common/dtd/.
These correspond to the DTDs, but do not carry the extra validity annotations.
ldml.xsd, ldmlBCP47.xsd, ldmlSupplemental.xsd, xml.xsd
New Test Data files in /common/testData/
personNameTest/_header.txt, _readme.txt, chr.txt, sw_KE.txt, tg.txt, ti.txt, wo.txt
transforms/und-t-und-latn-d0-ascii.txt (changed name)
Files with insufficient data:
/common/testData/personNameTest/br.txt, brx.txt, gaa.txt, ks_Deva.txt, lij.txt, pcm.txt, sat.txt, syr.txt, to.txt, tt.txt, xh.txt
Old format keyboard:
JSON Data Changes
Unit systems provide information about general usage of units of measure. For example, "knot" is in the customary US and UK systems, but is also acceptable for use with SI.
Implementations using the unit systems will find that some units have changed systems (either to be finer-grained, or to incorporate corrections).
LikelySubtags are used to find the most likely missing subtags in a locale identifier, and also the minimal form. Thus "de" (German) expands to "de-Latn-DE" (German written in Latin script as used in Germany), and all of ("de-Latn-DE", "de-Latn", "de-DE") minimize to "de".
The algorithm for lookup has changed slightly (favoring script over region), and there have been data changes: most macroregions are gone (such as mapping from und-003) and some other und mappings. There remain some xx-YYY-001 results for artificial languages.
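The expand and minimize behavior described above can be sketched with a toy table. This is a simplified illustration, not the CLDR algorithm: the LIKELY table holds a handful of made-up entries, the lookup order (script tried before region) is one reading of the note, and the function names parse, fmt, maximize, and minimize are hypothetical.

```python
# Toy likely-subtags table: (language, script, region) -> full triple.
# Entries are illustrative assumptions, not the shipped CLDR data.
LIKELY = {
    ("de", None, None): ("de", "Latn", "DE"),
    ("zh", None, None): ("zh", "Hans", "CN"),
    ("zh", None, "TW"): ("zh", "Hant", "TW"),
    ("zh", "Hant", None): ("zh", "Hant", "TW"),
}


def parse(tag):
    """Split a simple tag into (language, script, region); variants are ignored."""
    parts = tag.replace("_", "-").split("-")
    lang, script, region = parts[0].lower(), None, None
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            script = part.title()
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            region = part.upper()
    return lang, script, region


def fmt(lang, script, region):
    """Join the subtags that are present with hyphens."""
    return "-".join(x for x in (lang, script, region) if x)


def maximize(tag):
    """Fill in missing subtags, trying script-based keys before region-based ones."""
    lang, script, region = parse(tag)
    candidates = [
        (lang, script, region),
        (lang, script, None),
        (lang, None, region),
        (lang, None, None),
    ]
    for key in candidates:
        hit = LIKELY.get(key)
        if hit:
            # Keep the subtags the caller supplied; fill only the gaps.
            return fmt(lang, script or hit[1], region or hit[2])
    return fmt(lang, script, region)


def minimize(tag):
    """Return the shortest trial form that maximizes to the same full tag."""
    full = maximize(tag)
    lang, script, region = parse(full)
    for trial in (fmt(lang, None, None), fmt(lang, None, region), fmt(lang, script, None)):
        if maximize(trial) == full:
            return trial
    return full


print(maximize("de"))
print([minimize(t) for t in ("de-Latn-DE", "de-Latn", "de-DE")])
print(minimize("zh-Hant-TW"))
```

With this toy table, "zh-Hant-TW" minimizes to "zh-TW" rather than "zh", because "zh" alone maximizes to a different script and region.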
Preferred hour formats indicate the preferred form for a locale: 11 PM vs 23:00 vs 11 in the evening.
These have changed substantially for many Latin American countries.
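To make the distinction concrete, here is a minimal sketch of rendering the same time under a 12-hour or a 24-hour preference. It assumes the preference is expressed with the date pattern letters h (12-hour) and H (24-hour); the function name format_time and the English AM/PM markers are illustrative choices, and no per-country data is implied.

```python
def format_time(hour, minute, preferred):
    """Render a time for a preferred hour symbol (hypothetical helper).

    'H' gives a 24-hour clock; 'h' gives a 12-hour clock with AM/PM.
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour or minute out of range")
    if preferred == "H":
        return f"{hour:02d}:{minute:02d}"
    if preferred == "h":
        suffix = "AM" if hour < 12 else "PM"
        hour12 = hour % 12 or 12  # 0 and 12 both display as 12
        return f"{hour12}:{minute:02d} {suffix}"
    raise ValueError(f"unsupported hour symbol: {preferred!r}")


print(format_time(23, 0, "h"))
print(format_time(23, 0, "H"))
```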
###TBD — more items will be added