CLDR 42 Release Note

This version is currently at Beta — for production use, see the latest release.


This document is draft, and will continue to be updated until the release.

Feedback can be filed with CLDR Tickets; migration issues are especially important.

Overview

Unicode CLDR provides key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.

In CLDR 42, the focus is on:

  1. Locale coverage. The following locales now have higher coverage levels:

    1. Modern: Igbo (ig), Yoruba (yo)

    2. Moderate: Chuvash (cv), Xhosa (xh)

    3. Basic: Haryanvi (bgc), Bhojpuri (bho), Rajasthani (raj), Tigrinya (ti)

  2. Formatting Person Names. Added data and structure for formatting people's names. For more information on why this feature is being added and what it does, see Background.

  3. Emoji 15.0 Support. Added short names, keywords, and sort-order for the new Unicode 15.0 emoji.

  4. Coverage, Phase 2. Added additional language names and other items to the Modern coverage level, for more consistency (and utility) across platforms.

  5. Unicode 15.0 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.

Locale Status

CLDR v42 Language Count

Data Changes

There were two areas of focus for this release: the formatting of Person Names, and the upgrade of Modern to include many more languages.

  • Person name formatting added multiple elements and attributes to provide the needed structure.

  • Date-time formatting added "atTime" for languages where a different format is used for the particular time of an event (as opposed to combining a date with a time range, showing a wall clock time, or combining a relative date with an absolute time).

  • Date-time interval formatting added more formats for timezones, where the v (generic) and z (specific) formats change the way the rest of the time looks.

  • Currency format additions

    • Two new alt values for pattern elements used for currencyFormat elements:

      • alt="alphaNextToNumber": A pattern to use when the currencySymbol would result in letter characters being adjacent to the numeric value; typically this adds a no-break space between the currency symbol and numeric value, if the standard currencyFormat pattern does not already have a space. This provides an improved alternative to the currencySpacing patterns.

      • alt="noCurrency": A pattern to use when currency-style formats are desired but without the actual symbol (as in a table of currency values all of the same currency).

    • For the currencyFormats element, a new element currencyPatternAppendISO containing a pattern that shows how to append an ISO currency symbol (¤¤) to a currency pattern using a standard currency symbol (¤); this is needed for certain types of currency display.

  • A DTD annotation for @TECHPREVIEW was added, indicating that an element (and its attributes) are a tech preview, and may change.

  • For more information, see dtd_deltas.html

  • A new -u extension key is added to provide a preferred unit of measurement for temperature: Celsius, Fahrenheit, and Kelvin. (An effort has also been started to provide syntax for other unit preferences in future releases.)

  • Two new digit settings are available, corresponding to new Unicode 15.0 scripts: Kawi and Nag Mundari.

  • A new short timezone ID is available, tz-uaiev, for Europe/Kyiv

  • For more information, see delta/bcp47.html

  • A new NameOrder element provides default ordering for languages (surnameFirst vs givenFirst).

  • Locales

    • Due to changes in ISO 639, a number of language codes have been deprecated, and some added.

    • Default content locales, likely subtags, and language data have been added.

  • Units

    • New machine-readable data is supplied for the structure of unit IDs, with the unitIdComponent element.

    • The length of a light-year has been adjusted to the IAU value (which uses a Julian year of 365.25 days).

    • The unit ID metric-ton has been deprecated in favor of tonne (see Migration).

  • Transform names for Ethiopic have been changed (with the old names being deprecated and aliased to the new names).

  • Dates and times

    • The Yukon metazone has been un-deprecated

    • 13 locales have the islamic calendar added (thus requiring localization)

    • The first day of the week is now Monday in CN (CLDR-11510)

    • Day-periods are added for hi_Latn, and adjusted for mr to only have evening1 (see Migration)

  • Plurals

    • Hebrew has a category removed ('many'), while mt, vec, and ast have categories added. (see Migration)

    • Some rules have been tweaked.

  • The currency SLE is now an official tender.

  • For more information, see delta/supplemental-data.html
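
The currency pattern variants described above can be illustrated with a minimal sketch. It assumes hypothetical pattern strings (they are illustrative, not copied from any CLDR locale file) and a deliberately simplified formatter that only substitutes the ¤ placeholder and a pre-formatted number. The names PATTERNS, choose_pattern, and format_currency are invented for this example; a real implementation would rely on a full number-pattern engine.

```python
# Hypothetical pattern data for one locale. The keys mirror the alt values
# named in the release note; the pattern strings themselves are illustrative.
PATTERNS = {
    "standard": "¤#,##0.00",
    "alphaNextToNumber": "¤\u00a0#,##0.00",  # no-break space after the symbol
    "noCurrency": "#,##0.00",
}


def choose_pattern(symbol: str, show_symbol: bool = True) -> str:
    """Pick a pattern variant for a symbol that is placed before the number."""
    if not show_symbol:
        return PATTERNS["noCurrency"]
    # With a prefix symbol, its last character is the one touching the number.
    if symbol and symbol[-1].isalpha():
        return PATTERNS["alphaNextToNumber"]
    return PATTERNS["standard"]


def format_currency(amount: float, symbol: str, show_symbol: bool = True) -> str:
    """Simplified formatting: fixed two decimals and comma grouping only."""
    pattern = choose_pattern(symbol, show_symbol)
    number = f"{amount:,.2f}"
    return pattern.replace("#,##0.00", number).replace("¤", symbol)


print(format_currency(1234.5, "$"))    # $1,234.50
print(format_currency(1234.5, "USD"))  # USD followed by a no-break space, then 1,234.50
print(format_currency(1234.5, "USD", show_symbol=False))  # 1,234.50
```

The sketch only covers a symbol placed before the number; for a suffix symbol the check would look at the first character of the symbol instead.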

Locale Changes

  • Coverage and general data

    • Modern coverage was increased by adding:

      • a number of additional languages, such as Kwakʼwala [Add more]

      • atTime patterns

      • the quarter unit (quarter of a year)

      • 31 emoji short names and search keywords.

      • patterns and other data for formatting person names

        • sample names were also added, but their use is primarily internal

      • For a sample of added items see delta/sk.html, and for additional islamic calendar data, delta/ms.html

    • New languages at basic: bgc, bho, raj

    • Large-scale normalization of different kinds of spaces (see Migration)

    • The currency formats for Arabic and Hebrew were improved to provide more consistent layout for different contexts (right-to-left, neutral) and different types of currency symbols.

    • For more information, see delta/index.html

  • Subdivision translations

    • Most subdivision names will have draft="provisional". These are derived from Wikidata, and are not further curated

    • Exceptions are curated names: the names in English, and the names in many other languages for 3 subdivisions of GB. The latter are the only subdivisions used for emoji flags.

File Changes

  • The following were promoted from seed to common:

    • Annotations and Casing: oc.xml

    • Main: cv.xml, cv_RU.xml, oc.xml, oc_FR.xml, sms.xml, sms_FI.xml

  • New Files in main: annotations/ff.xml, annotations/ff_Adlm.xml, collation/fy.xml

  • New files in main/common: ann.xml, ann_NG.xml, bgc.xml, bgc_IN.xml, bho.xml, bho_IN.xml, frr.xml, frr_DE.xml, mdf.xml, mdf_RU.xml, oc_ES.xml, pis.xml, pis_SB.xml, raj.xml, raj_IN.xml, tok.xml, tok_001.xml

  • New files in main/rbnf: kk.xml

  • New in common/segments: fi.xml, sv.xml

  • A number of Ethiopic transliterator files were renamed, see CLDR-15351

  • For more information, see file-cldr-41-vs-42-txt

JSON Data Changes

  • JSON data is available:
    https://github.com/unicode-org/cldr-json/releases/tag/42.0.0-BETA1

  • New or Changed Data:

    • TECH PREVIEW data for person names - CLDR-15414
      New data in cldr-core,
      also new packages cldr-person-names-full and cldr-person-names-modern

    • coverageLevels.json data in cldr-core - CLDR-15624

    • Additional currency data - CLDR-15958
      besides standard and accounting, new patterns:
      standard-alphaNextToNumber, standard-noCurrency,
      accounting-alphaNextToNumber, accounting-noCurrency

  • Fixes:

Background

Formatting people’s names

Software needs to be able to format people's names, such as John Smith or 宮崎駿. The data is typically drawn from a database, where a name record will have fields for the parts of people’s names, such as a given field with a value of “Maria”, and a surname field value of “Schmidt”.

There are many complications in dealing with the variety of different ways this needs to be done across languages:

  • People may have a different number of names, depending on their culture — they might have only one name (“Zendaya”), only two (“Albert Einstein”), or three or more. So the software needs to handle missing or extra name fields gracefully.

  • People may have multiple words in a particular name field, e.g., “Mary Beth” as a given name, or “van Berg” as a surname.

  • Some languages, such as Spanish, have two surnames (where each can be composed of multiple words).

  • The ordering of name fields can be different across languages, as well as the spacing (or lack thereof) and punctuation.

    • The ordering can be complicated; take the ordering of surname vs given name. Sometimes a language X will display names from a language Y with a different order than names in language X. And that can even happen when language Y uses the same order as language X.

  • Name formatting needs to be adapted to different circumstances, such as a need to be presented shorter or longer; in a formal or informal context; when talking about someone or talking to someone; as a monogram (such as “JFK”); or in a sorted list (such as “Smith, John”).

CLDR has added structured patterns that enable implementations to format available name fields for a given language. The formatting for a name can vary according to the available name fields, the language of the name and of the viewer, and various input settings.
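
As a rough illustration of pattern-based name formatting with missing fields, consider the sketch below. The patterns, field names, and the format_name helper are hypothetical simplifications invented for this example; the actual CLDR data has more parameters (such as formality and usage) and field modifiers, and Part 8 of the specification defines the real algorithm.

```python
import re

# Hypothetical, simplified patterns keyed by (order, length).
# These are not taken from CLDR data.
PATTERNS = {
    ("givenFirst", "long"): "{title} {given} {given2} {surname}",
    ("givenFirst", "short"): "{given} {surname}",
    ("surnameFirst", "long"): "{surname} {given} {given2}",
    ("sorting", "long"): "{surname}, {given} {given2}",
}


def format_name(fields: dict, order: str = "givenFirst", length: str = "long") -> str:
    """Fill a pattern from a name record, tolerating missing fields."""
    pattern = PATTERNS[(order, length)]
    # Replace each {field} placeholder; absent fields become empty strings.
    text = re.sub(r"\{(\w+)\}", lambda m: fields.get(m.group(1), ""), pattern)
    # Collapse the gaps left by missing fields.
    text = re.sub(r"\s+", " ", text).strip()
    # Drop a separator left dangling at either end (for example ", Zendaya").
    return text.strip(", ")


record = {"given": "Maria", "surname": "Schmidt"}
print(format_name(record))                                   # Maria Schmidt
print(format_name(record, order="sorting"))                  # Schmidt, Maria
print(format_name({"given": "Zendaya"}, order="sorting"))    # Zendaya
```

The point of the sketch is the graceful handling of absent fields: the same pattern serves a one-name record and a record with several name fields.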

The new Person Name formatting data has a tech preview status. The CLDR committee is requesting feedback on the data and structure so that it can be refined and enhanced in the next release. ICU will also be offering a tech preview API in its next release. Other clients of CLDR are encouraged to try out the new data and structure, and to send feedback to the CLDR committee in the next few months.

Specification Changes

TBD: update from LDML Modifications before release

The following are the main changes in the specification:

  • Locales

    • Updated the description of guidelines and invariants for Parent Locales data.

  • Currency Formats and Currencies

    • Described the new alt="alphaNextToNumber" and alt="noCurrency" variants for patterns used with currencyFormat elements

    • Described the new currencyPatternAppendISO element under currencyFormats

    • Discouraged the use of the old currencySpacing element (and its subelements) in favor of the alt="alphaNextToNumber" variant

  • Dates

    • Element dateTimeFormat

      • Described the new dateTimeFormat type="atTime" pattern and when to use it versus the standard dateTimeFormat pattern.

    • Matching Skeletons

      • Provided more detailed recommendations on matching pattern field length to field length in the requested skeleton.

  • Plurals

  • Units of measurement

    • Unit Preferences

      • Added a new subsection to specify the interaction of the unit Preferences data with the locale keys mu, ms, and rg, and the base locale.

    • Unit Elements, Unit_Conversion

      • For simpler and cleaner parsing, added a new element (unitIdComponent) and restructured the EBNF for parsing unit identifiers.

      • As part of this work, the identifier metric-ton was deprecated in favor of tonne. As usual, the older identifier remains for compatibility, and is aliased to the new one.

  • Person Names

    • Added a new Part 8, Person Names.
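
The interaction of the mu, ms, and rg locale keys with the base locale, mentioned under Unit Preferences above, might be sketched as follows for temperature. Several things here are assumptions rather than statements from this release note: the precedence order (mu, then ms, then rg, then the region of the base locale), the mu value spellings (including the truncated "fahrenhe"), the mapping from ms values to regions, and the tiny REGION_TEMPERATURE table, which is hypothetical sample data. All function names are invented for this example; the specification is the authoritative source.

```python
# Assumed mu values mapped to unit names (subtags are limited to 8 characters).
MU_VALUES = {"celsius": "celsius", "fahrenhe": "fahrenheit", "kelvin": "kelvin"}
# Assumed mapping from measurement-system values to a representative region.
MS_REGION = {"metric": "001", "ussystem": "US", "uksystem": "GB"}
# Hypothetical sample preferences; real data lives in the CLDR unit preferences.
REGION_TEMPERATURE = {"US": "fahrenheit", "001": "celsius"}


def parse_u_keywords(locale_id: str) -> dict:
    """Return the key/value pairs of the -u- extension, lowercased."""
    subtags = locale_id.lower().replace("_", "-").split("-")
    if "u" not in subtags:
        return {}
    keywords, key = {}, None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:      # another singleton ends the -u- extension
            break
        if len(subtag) == 2:      # a two-character subtag starts a new key
            key = subtag
            keywords[key] = []
        elif key is not None:     # longer subtags belong to the current key
            keywords[key].append(subtag)
    return {k: "-".join(v) for k, v in keywords.items()}


def region_of(locale_id: str) -> str:
    """Return the region subtag of the base locale, or 001 if there is none."""
    for subtag in locale_id.replace("_", "-").split("-")[1:]:
        if len(subtag) == 1:
            break
        if (len(subtag) == 2 and subtag.isalpha()) or (len(subtag) == 3 and subtag.isdigit()):
            return subtag.upper()
    return "001"


def temperature_unit(locale_id: str) -> str:
    """Resolve the preferred temperature unit under the assumed precedence."""
    keywords = parse_u_keywords(locale_id)
    if keywords.get("mu") in MU_VALUES:
        return MU_VALUES[keywords["mu"]]
    if keywords.get("ms") in MS_REGION:
        region = MS_REGION[keywords["ms"]]
    elif "rg" in keywords:
        region = keywords["rg"][:2].upper()   # e.g. "uszzzz" -> "US"
    else:
        region = region_of(locale_id)
    return REGION_TEMPERATURE.get(region, REGION_TEMPERATURE["001"])


print(temperature_unit("en-US"))               # fahrenheit
print(temperature_unit("en-US-u-mu-celsius"))  # celsius
print(temperature_unit("en-GB-u-rg-uszzzz"))   # fahrenheit
```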

Growth

The following chart shows the growth of CLDR locale-specific data over time. It is restricted to data items in /main and /annotations directories, so it does not include the non-locale-specific data. The % values are percent of the current measure of Modern coverage. That level is notched up each release, so previous releases had many locales that were at Modern coverage as assessed at the time of their release. There is one line per year, even though there were multiple releases in most years.

The detailed information on changes between the v41 and v42 releases is at v42 delta_summary.tsv: look at the TOTAL line for the overall counts of Added/Changed/Deleted. See v42 locale-growth.tsv for the detailed figures behind the chart.

CLDR v42 Growth

Migration

  • Data normalization. There was an extensive normalization of different kinds of spaces (normal, non-breaking, thin, etc.) for consistency of behavior - CLDR-14032

    • May impact tests of golden data

    • Reinforces the need to be lenient with spaces in parsing

  • Plural rules.

    • Additions. Added 'many' category for Asturian, Catalan. Implementations should handle these changes as they did for French and Spanish. They only affect messages with large numbers. Robust implementations will gracefully fall back to the 'other' category if a previously translated message doesn't have a new category; unfortunately, some implementations do not follow that practice.

    • Removals. The 'many' plural category for Hebrew (CLDR-14634) was removed; it is unnecessary in modern practice. Such changes usually do not affect migration.

    • Changes. There were a few changes to the rules that affect how numbers are assigned to categories. Such changes usually do not affect migration.

  • Unit Identifiers. The metric measurement unit ID was changed from 'metric-ton' to 'tonne'. The old ID is still valid, but deprecated and aliased to the new unit ID. So as long as an implementation handles aliases, there should be no migration issues.

  • Subdivisions. Other than three subdivisions of GB, country subdivisions will be marked as 'provisional'. This provides a better indication of their status.
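
Three of the migration points above (lenient handling of spaces, falling back to the 'other' plural category, and resolving deprecated unit IDs through aliases) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the set of space characters treated as equivalent, the single-entry alias table, and all function names are choices made for this example, not CLDR data or API.

```python
import re

# Space-like characters treated as equivalent in this sketch:
# SPACE, NO-BREAK SPACE, THIN SPACE, NARROW NO-BREAK SPACE.
SPACE_RUN = re.compile("[\u0020\u00a0\u2009\u202f]+")

# Hypothetical alias table with the one alias named in the release note.
UNIT_ALIASES = {"metric-ton": "tonne"}


def normalize_spaces(text: str) -> str:
    """Map every run of space-like characters to a single plain space."""
    return SPACE_RUN.sub(" ", text)


def lenient_equal(a: str, b: str) -> bool:
    """Compare strings while ignoring which kind of space was used."""
    return normalize_spaces(a) == normalize_spaces(b)


def select_message(messages: dict, category: str) -> str:
    """Pick the message for a plural category, falling back to 'other'."""
    return messages.get(category, messages["other"])


def canonical_unit(unit_id: str) -> str:
    """Resolve a deprecated simple unit ID to its replacement, if any."""
    return UNIT_ALIASES.get(unit_id, unit_id)


print(lenient_equal("12:30\u202fPM", "12:30 PM"))                        # True
print(select_message({"one": "1 file", "other": "many files"}, "many"))  # many files
print(canonical_unit("metric-ton"))                                      # tonne
```

Note that canonical_unit only handles a whole simple unit ID; compound identifiers that contain a deprecated component would need to be split into components first.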

Known Issues

  • Work post-data-beta

    • All of the specification work will be done

    • There was a problem in generating the date/time verification charts (CLDR-15517), whereby interval formats with "B" fail. So in those charts in a few locales, 3 lines will have "n/a" instead of the right value. For an example, see Albanian.

Acknowledgments

Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.

The Unicode Terms of Use apply to CLDR data; in particular, see Exhibit 1.

For web pages with different views of CLDR data, see http://cldr.unicode.org/index/charts.