CLDR 40 Release Note

Overview

Unicode CLDR provides key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.

In CLDR v40, the focus is on:

Grammatical features (gender and case)

In many languages, forming grammatical phrases requires dealing with grammatical gender and case. Without that, it can sound as bad as "on top of 3 hours" instead of "in 3 hours". The overall goal for CLDR is to supply building blocks so that implementations of advanced message formatting can handle gender and case. See also: Inflection Points.

  • Phase 1 (v39) of grammatical features included just 12 locales (da, de, es, fr, hi, it, nl, no, pl, pt, ru, sv) for all units of measurement.

  • Phase 2 (v40) has expanded the number of locales by 29 (am, ar, bn, ca, cs, el, fi, gu, he, hr, hu, hy, is, kn, lt, lv, ml, mr, nb, pa, ro, si, sk, sl, sr, ta, te, uk, ur), but for a more restricted number of units.

  • Phase 3 (v41) will further expand the units.

Emoji v14 names and search keywords

CLDR supplies short names and search keywords for the new emoji, so that implementations can build on them to provide, for example, type-ahead in keyboards.

Modernized Survey Tool front end

The Survey Tool is used to gather all the data for locales. Its outdated JavaScript infrastructure was modernized to make it easier to add enhancements (such as the split-screen dashboard) and to fix bugs.

Specification Improvements

The LDML specification has some important fixes and clarifications for Locale Identifiers, Dates, and Units of Measurement.

Approximately 140,000 data items were added or changed.

Data Changes

  • The datetimeSkeleton element has been added, to show the skeletons corresponding to the 4 'stock' date and time formats.

    • The end goal is to allow implementations to work solely with skeletons instead of patterns.

    • NOTE: while the v40 mechanism handles most locales, it needs further enhancements to generate all of the existing standard formats from skeletons.

  • The grammatical case values have been expanded to include: elative, illative, partitive, terminative, and translative

  • The numberingSystem values have been expanded to include TANGSA digits (new in Unicode 14.0)

  • The keyboard DTDs are undergoing a number of changes to expand and simplify the structure.

  • Metazones now have short identifiers.

    • These share the same broad namespace as timezone identifiers, but are required to have 4 letters so that there is a clean syntactic distinction between the IDs.

    • The mapping between short metazone identifiers and the long names is in a new <metazoneIds> element in metaZones.xml.

    • Short timezone IDs can no longer be 4 letters long: the only such identifier, gaza, has been deprecated in favor of a longer ID.
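
The 4-letter rule above gives a purely syntactic way to tell the two kinds of short ID apart. The following is a minimal sketch of that test, not CLDR code: the function name classify_short_id and the sample inputs are hypothetical, and it assumes short IDs are plain ASCII strings.

```python
def classify_short_id(short_id: str) -> str:
    """Return "metazone" for an ID of exactly 4 ASCII letters, else "timezone"."""
    is_four_letters = (
        len(short_id) == 4 and short_id.isascii() and short_id.isalpha()
    )
    return "metazone" if is_four_letters else "timezone"


# Sample inputs are illustrative only.
for sample in ("abcd", "uslax", "ab12"):
    print(sample, "->", classify_short_id(sample))
```

Note that the deprecated timezone ID gaza would be reported as a metazone by this test, so code that still has to read pre-v40 data may need a special case for it.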

Segmentation Changes

  • The standard segmentation data for line break has been updated to Unicode 14.0 rules, allowing for increased emoji compatibility over versions.

  • Unicode language and script identifiers have been updated

  • Language Group data has been updated

  • Grammatical Case & Gender data has been updated

  • The Yukon timezone has been revived, deleting the old zones that belonged to it during 1970–1983. Now, beginning November 2020, it is used for America/Dawson and America/Whitehorse.

  • To support timezone data tz2021b, “Pacific/Kanton” is added as an alias for zone “Pacific/Enderbury”

  • The ‘many’ plural category has been introduced into es, it, pt, and pt_PT. As in French, this is used for large numbers.

  • The default first day of the week is now Monday for Australia, and the default day period preferences for times have changed for CN (to “HH”) and TW (to “hb”, i.e. 12-hour with day periods instead of AM/PM).

  • There are new day period rules for: kgp, nn, yrl

  • The transform for Hani→Latn has been updated for Unicode 14.0, and the transform for tk_Cyrl→tk/BGN has some bug fixes.

Locale Changes

  • Gender and case data is available for more units and in more locales. The additional locales have that data for a subset of units.

  • Short names and keywords have been added for Unicode 14.0 emoji.

  • Names are present for many more symbol and punctuation characters

    • such as ⇇ in German, “gepaarte Pfeile nach links”, or in French “paire de flèches vers la gauche”

  • There are corrections and changes to data for many locales

    • such as the name for { in German, which is now “öffnende geschweifte Klammer”

    • English names for some emoji were updated, e.g. “Dizzy Face” → “Face With X Eyes”, “Hugging Face” → “Smiling Face With Open Hands”.

    • ar_AE now defaults to ‘latn’ digits

File Changes

  • locales moved from seed to common: sc, kgp

  • locales added to common: dsb, hsb, yrl

  • locales added to seed: ab, bal, hnj, tpi

  • keyboards added to und/: kgp, yrl, sat; keyboards retracted: sat-Olck

JSON Data Changes

  • A new package, cldr-bcp47, has been added, with BCP47 metadata — CLDR-14571

  • The JSON data now uses fully valid Unicode BCP 47 locale syntax for filenames and locale-related data items. — CLDR-14642
    For example:

    • "root" in v39 is now "und" in v40.

    • "ca-ES-VALENCIA" in v39 is now "ca-ES-valencia" in v40

    • "en-US-POSIX" in v39 is now removed in v40.

  • The new date/time skeleton data has been added, using a structure parallel to dateFormats and timeFormats. — CLDR-15113
    For example:

      • "dateFormats": { "full": "EEEE d MMMM y", …}

      • "dateSkeletons": { "full": "yMMMMEEEEd", … } // new

      • "timeFormats": { "full": "HH.mm.ss zzzz", … }

      • "timeSkeletons": { "full": "HHmmsszzzz", … } // new

  • Duplicate keys in data have been corrected. In the grammaticalFeatures data and elsewhere, these are to be corrected in a future release — CLDR-14717

    • As a result, the grammaticalGenderFeatures.json file was removed from cldr-core and the grammatical gender data was correctly folded into grammaticalFeatures.json. (Both of these files were corrupted by duplicate keys in v39).

  • The package metadata and readme files have been improved: see https://github.com/unicode-org/cldr-json.
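
Code that looks up JSON files by their v39 locale names may need a small translation step. The sketch below covers only the three cases listed above; the function name v40_locale_name is hypothetical, and the assumption that every variant subtag is lowercased (as in ca-ES-valencia) is a generalization from that single example rather than something this note states.

```python
from typing import Optional

# Names this release note lists as removed in v40 (only one is given).
REMOVED_IN_V40 = {"en-US-POSIX"}


def is_variant(subtag: str) -> bool:
    """BCP 47 variant shape: 5-8 alphanumerics, or 4 characters starting with a digit."""
    if not (subtag.isascii() and subtag.isalnum()):
        return False
    return 5 <= len(subtag) <= 8 or (len(subtag) == 4 and subtag[0].isdigit())


def v40_locale_name(v39_name: str) -> Optional[str]:
    """Map a v39 JSON locale name to its v40 form, or None if it was removed."""
    if v39_name in REMOVED_IN_V40:
        return None
    if v39_name == "root":
        return "und"
    head, *rest = v39_name.split("-")
    # Assumption: variants are lowercased; other subtags are left untouched.
    return "-".join([head] + [s.lower() if is_variant(s) else s for s in rest])


for name in ("root", "ca-ES-VALENCIA", "en-US-POSIX", "sr-Cyrl-BA"):
    print(name, "->", v40_locale_name(name))
```

Returning None for a removed name leaves it to the caller to decide how to handle locales that no longer have JSON data.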

Specification Changes

Locale Identifiers

Dates

    • In Elements months, days, ... and months (in Date Field Symbol Table), improved the description of the distinctions between stand-alone and format forms. In the former section, also mentioned that these forms are not intended to be used for grammatical context outside the date format itself. — CLDR-15083

    • In Element dateFormats, described the existing numbers attribute, as well as the new datetimeSkeleton element added — CLDR-13425.

    • In Element intervalFormats, clarified that when determining the repeating field of an interval pattern, standalone and format fields are considered equivalent. — CLDR-14971

    • In Time Data, noted that the region attribute may also specify locales. — CLDR-15069

    • In Metazone Names, noted that CLDR metazone IDs may be the same as the aliases for some TZIDs. — CLDR-15023
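
In the JSON examples earlier in this note, each stock pattern and its new skeleton carry the same fields, with the pattern adding literal text and ordering. The sketch below checks that relationship for a pattern and skeleton pair. It is an illustration only: the helper names pattern_fields and same_fields are hypothetical, and the equal-fields property is assumed from the two examples shown, not claimed for every locale.

```python
import re
from collections import Counter


def pattern_fields(pattern: str) -> Counter:
    """Count field runs such as "MMMM" or "y", ignoring literal text."""
    # Remove text quoted with apostrophes (including the '' escape).
    unquoted = re.sub(r"'[^']*'", "", pattern)
    # A field is a run of one repeated ASCII letter.
    return Counter(m.group(0) for m in re.finditer(r"([A-Za-z])\1*", unquoted))


def same_fields(pattern: str, skeleton: str) -> bool:
    """True if the pattern and the skeleton contain exactly the same fields."""
    return pattern_fields(pattern) == pattern_fields(skeleton)


# The two pairs quoted in the JSON Data Changes section.
print(same_fields("EEEE d MMMM y", "yMMMMEEEEd"))
print(same_fields("HH.mm.ss zzzz", "HHmmsszzzz"))
```

A check like this could be used as a sanity test when moving an implementation from stored patterns to skeletons, keeping in mind the note above that the v40 mechanism does not yet reproduce every existing standard format.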

Units of Measurement

    • Allowed for generative number prefixes, such as gallons-per-100-miles. — CLDR-14751

    • Added support for formatting currencies in units, such as curr-eur-per-square-meter. — CLDR-14676

    • Cleaned up the text and the EBNF in Unit Identifiers. — CLDR-15035
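
To make the two new identifier shapes concrete, here is a rough sketch that pulls apart the example identifiers quoted above. It is a naive string split, not an implementation of the EBNF in Unit Identifiers; the UnitParts type and split_unit_id function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnitParts:
    numerator: str
    denominator: Optional[str]         # None when there is no "-per-" part
    denominator_factor: Optional[int]  # e.g. 100 in "...-per-100-miles"
    currency: Optional[str]            # e.g. "eur" for a "curr-eur" numerator


def split_unit_id(unit_id: str) -> UnitParts:
    """Split an identifier at "-per-" and pick out a number prefix or currency."""
    numerator, sep, denominator = unit_id.partition("-per-")
    factor = None
    if sep:
        first, dash, remainder = denominator.partition("-")
        if first.isdigit() and dash:
            factor, denominator = int(first), remainder
    currency = numerator[len("curr-"):] if numerator.startswith("curr-") else None
    return UnitParts(numerator, denominator if sep else None, factor, currency)


print(split_unit_id("gallons-per-100-miles"))
print(split_unit_id("curr-eur-per-square-meter"))
```

A real parser would follow the specification's grammar and validate each component against the known unit and currency codes.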

Growth

The chart below shows the growth over time, with the additions from the latest release in the top blue section.

  • The vertical axis is the ratio compared to current Modern coverage, with the top line being 100% complete.

  • The horizontal axis is the number of locales (clipped to 200). The most complete locales are sorted to the left.

  • Hovering over a line shows some additional information.

Migration

  • Addition of the ‘many’ plural category for Italian, Portuguese, and Spanish; this is used for certain large numbers. Implementations that are sensitive to the addition of new plural categories may need code or data changes.

  • Short timezone IDs can no longer be 4 letters long: the only such identifier, gaza, has been deprecated in favor of a longer ID.

  • The ar_AE locale data now defaults to ‘latn’ digits (e.g., 0, 1, ... 9) in numbers.

  • The default time cycle for CN changes from h to HH, and the default time cycle for TW changes from ha (12-hour with AM/PM) to hb (12-hour with day periods)

  • The keyboard structure is undergoing a number of changes to expand and simplify it; these changes are not complete in this version.

  • Some of the JSON changes may require code changes. See the "JSON Data Changes" section, above.

  • Note: the CLDR git repository has renamed its default branch to main, as per Renaming of the git branch. Local clones of the CLDR git repository may need to be updated.
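
For the ‘many’ plural category mentioned at the top of this list, one defensive approach is to fall back to the ‘other’ form whenever a message has no ‘many’ form yet. The sketch below illustrates that fallback. The category rule in assumed_category is a simplified assumption for plain non-negative integers (treating exact multiples of one million as ‘many’), not the CLDR rule itself; the function names and message strings are likewise hypothetical.

```python
def assumed_category(n: int) -> str:
    """Hypothetical, simplified category rule for non-negative integers only."""
    if n == 1:
        return "one"
    if n != 0 and n % 1_000_000 == 0:
        return "many"
    return "other"


def select_message(forms: dict, n: int) -> str:
    """Pick the form for n, falling back to "other" if the category is missing."""
    template = forms.get(assumed_category(n), forms["other"])
    return template.format(n=n)


# Illustrative Spanish strings; the first set predates the 'many' category.
legacy_forms = {"one": "{n} día", "other": "{n} días"}
updated_forms = {**legacy_forms, "many": "{n} de días"}

print(select_message(legacy_forms, 1_000_000))   # falls back to "other"
print(select_message(updated_forms, 1_000_000))  # uses the "many" form
```

With this kind of fallback, existing translations keep working after the upgrade, and the new form is picked up as soon as translators supply it.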

Known Issues

  • CLDR-14332 - Some links pointing to the CLDR data in the CLDR LDML are broken or stale.

  • CLDR-15238 - Some emoji annotations are missing in certain locales.

  • Some of the programmatic fixes for charts are delayed until after release.

Acknowledgments

Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.

The Unicode Terms of Use apply to CLDR data; in particular, see Exhibit 1.

For web pages with different views of CLDR data, see http://cldr.unicode.org/index/charts.