CLDR 40 Release Note

See Key to Header Links

This version is currently Beta (for data and spec). See the latest release for the production version. The main data is available for review, but the JSON data is not yet ready for review. Some data may change if showstopper bugs are found. The planned schedule is:

  • Oct 27Release

Overview

Unicode CLDR provides key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.

In CLDR v40, the focus is on:

Grammatical features (gender and case) for units of measurement in additional locales

    • In many languages, forming grammatical phrases requires dealing with grammatical gender and case. Without that, it can sound as bad as "on top of 3 hours" instead of "in 3 hours"

    • Phase 1 (v39) of grammatical features included just 12 locales (da, de, es, fr, hi, it, nl, no, pl, pt, ru, sv).

    • Phase 2 (v40) has expanded the number of locales by 29 (am, ar, bn, ca, cs, el, fi, gu, he, hr, hu, hy, is, kn, lt, lv, ml, mr, nb, pa, ro, si, sk, sl, sr, ta, te, uk, ur), but for a more restricted number of units.

Emoji v14 names and search keywords

    • These supply short names and search keywords for the new emoji, so that implementations can build on them to provide, for example, type-ahead in keyboards

Modernized Survey Tool front end.

    • The Survey Tool is used to gather all the data for locales. The outmoded Javascript infrastructure (very difficult to enhance or even fix bugs) was modernized.

Specification Improvements

    • Notably in the areas of Locale Identifiers, Dates, and Units of Measurement

Data Changes

  • The datetimeSkeleton has been added, to show the skeletons corresponding to the 4 'stock' date and time formats. This allows implementations to work solely with skeletons instead of patterns.

  • The grammatical case values have been expanded to include: elative, illative, partitive, terminative, and translative

  • The numberingSystem values have been expanded to include TANGSA digits (new in Unicode 14.0)

  • The keyboard DTDs are undergoing a number of changes to expand and simplify the structure.

  • Metazones now have short identifiers. These share the same broad namespace as timezone identifiers, but are required to have 4 letters so that there is a clean syntactic distinction between the IDs. The mapping beween short metazone identifiers and the long names is in a new <metazoneIds> element in metaZones.xml.

    • Timezone IDs now cannot have 4-letter IDs (the only such timezone identifier (gaza) has been deprecated in favor of a longer ID).

Segmentation

  • The standard segmentation data for line break has been updated to Unicode 14.0 rules, allowing for increased emoji compatibility over versions. ###[TBD]

  • Unicode language and script identifiers have been updated

  • Language Group data has been updated

  • Grammatical Case & Gender data has been updated

  • The Yukon timezone has been revived, deleting the old zones that belonged to it during 1970– 1983. Now, beginning November 2020, it is used for America/Dawson and America/Whitehorse.

  • The 'many' plural category has been introduced into es, it, pt, and pt_PT. Like French, this is used for large numbers. ###[NOTE: cover in migration section]

  • The default first day of the week is now Monday for Australia, and the default day period preferences for times have changed for CN and TW.

  • There are new day period rules for: kgp, nn, yrl

  • The transforms for Hani→Latn has been updated for Unicode 14.0 and tk_Cyrl→tk/BGN has some bug fixes

Locale Changes ###(Sample Link)

  • Gender and case data is available for more units and in more locales. The additional locales have that data for a subset of units.###[TBD]

  • Short names and keywords have been added for Unicode 14.0 emoji.

  • Names are present for many more symbol and punctuation characters

    • such as ⇇ in German, “gepaarte Pfeile nach links”, or in French “paire de flèches vers la gauche”

  • There are corrections to data for many locales

    • such the name for { in German, which is now “öffnende geschweifte Klammer”

    • English emoji names updated ###[TBD]

File Changes

  • locales moved from seed to common: sc, kgp

  • locales added to common: dsb, hsb, yrl

  • keyboards added to und/: kgp, yrl, sat; keyboards retracted: sat-Olck

JSON Data Changes

  • A new package, cldr-bcp47, has been added, with BCP47 metadata — CLDR-14571

  • cldr-json now uses bcp47 filenames and data items. "root" is now "und", and ca_ES_VALENCIA is "ca-ES-valencia" The en_US_POSIX locale has been dropped from cldr-json. — CLDR-14642

  • The new "datetimeSkeleton" data has been added alongside existing dateFormats. This is an interim format and is subject to change. — CLDR-15113

    • Example: "dateFormats": { "full": "EEEE, MMMM d, y G", "full-datetimeSkeleton": "GyMMMMEEEEd", … }

  • Known Issue: There are known duplicate keys in the grammaticalFeatures data and elsewhere, these are to be corrected in a future release — CLDR-14717

Specification Changes

  • Locale Identifiers

  • Dates

    • In Elements months, days, ... and months (in Date Field Symbol Table), improved the description of the distinctions between stand-alone and format forms. In the former section, also mention that these forms are not intended to be used for grammatical context outside the date format itself. — CLDR-15083

    • In Element dateFormats, described the existing numbers attribute, as well as the new datetimeSkeleton element added — CLDR-13425.

    • In Element intervalFormats, clarified that when determining the repeating field of an interval pattern, standalone and format fields are considered equivalent. — CLDR-14971

    • In Time Data, noted that the region attribute may also specify locales. — CLDR-15069

    • In Metazone Names, noted that CLDR metazone IDs may be the same as the aliases for some TZIDs.CLDR-15023

  • Units of Measurement

    • Allowed for generative number prefixes, such as gallons-per-100-milesCLDR-14751

    • Added support for formatting currencies in units, such as curr-eur-per-square-meter. — CLDR-14676

    • Cleaned up the text and the EBNF in Unit Identifiers. — CLDR-15035

Chart Changes

  • [TBD]

Growth

The chart at the end of this page shows the growth over time, with the additions from the latest release in the top blue section.

  • The vertical axis is the ratio compared to current Modern coverage, with the top line being 100% complete.

  • The horizontal axis is the number of locales (clipped to 200). The most complete locales are sorted to the left.

  • Hovering over a line shows some additional information.

Migration

  • [TBD expand and reword the following]

  • Addition of plural categories for Italian, Spanish & Portuguese

  • Short timezone IDs now cannot have 4-letter IDs (the only such timezone identifier (gaza) has been deprecated in favor of a longer ID).

  • The keyboard structure is undergoing a number of changes to expand and simplify the structure that are not complete in this version.

  • See cldr-json comments as well.

  • As a note, the CLDR git repository has renamed its main branch to main, see this page.

Known Issues

  • CLDR-14332 - Some links pointing to the CLDR data in the CLDR LDML are broken or stale.

  • [TBD]

Acknowledgments

Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.

The Unicode Terms of Use apply to CLDR data; in particular, see Exhibit 1.

For web pages with different views of CLDR data, see http://cldr.unicode.org/index/charts.