CLDR 42 Release Note

Overview

Unicode CLDR provides key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.

In CLDR 42, the focus is on:

  1. Locale coverage. The following locales now have higher coverage levels:

    1. Modern: Igbo (ig), Yoruba (yo)

    2. Moderate: Chuvash (cv), Xhosa (xh)

    3. Basic: Haryanvi (bgc), Bhojpuri (bho), Rajasthani (raj), Tigrinya (ti)

  2. Formatting Person Names. Added data and structure for formatting people's names. For more information on why this feature is being added and what it does, see Background.

  3. Emoji 15.0 Support. Added short names, keywords, and sort-order for the new Unicode 15.0 emoji.

  4. Coverage, Phase 2. Added additional language names and other items to the Modern coverage level, for more consistency (and utility) across platforms.

  5. Unicode 15.0 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.

Locale Status

CLDR v42 Language Count

Data Changes

There were two areas of focus for this release: the formatting of Personal Names, and the upgrade of Modern to include many more languages.

  • Person name formatting added multiple elements and attributes to provide the needed structure.

  • Date-time formatting added "atTime" for languages where a different format is used for the particular time of an event (as opposed to combining a date with a time range, showing a wall clock time, or combining a relative date with an absolute time).

  • Date-time interval formatting added more formats for timezones, where the v (generic) and z (specific) formats change the way the rest of the time looks.

  • Currency format additions

    • Two new alt values for pattern elements used for currencyFormat elements:

      • alt="alphaNextToNumber": A pattern to use when the currencySymbol would result in letter characters being adjacent to the numeric value; typically this adds a no-break space between the currency symbol and numeric value, if the standard currencyFormat pattern does not already have a space. This provides an improved alternative to the currencySpacing patterns.

      • alt="noCurrency": A pattern to use when currency-style formats are desired but without the actual symbol (as in a table of currency values all of the same currency).

    • For the currencyFormats element, a new element currencyPatternAppendISO containing a pattern that shows how to append an ISO currency symbol (¤¤) to a currency pattern using a standard currency symbol (¤); this is needed for certain types of currency display.

  • A DTD annotation for @TECHPREVIEW was added, indicating that an element (and its attributes) are a tech preview, and may change.

  • For more information, see dtd_deltas.html

  • A new -u extension key is added to provide a preferred unit of measurement for temperature: Celsius, Fahrenheit, and Kelvin. (An effort has also been started to provide syntax for other unit preferences in future releases.)

  • Two new digit settings are available, corresponding to new Unicode 15.0 scripts: Kawi and Nag Mundari.

  • A new short timezone ID is available, tz-uaiev, for Europe/Kyiv

  • For more information, see delta/bcp47.html

  • A new NameOrder element provides default ordering for languages (surnameFirst vs givenFirst).

  • Locales

    • Due to changes in ISO 639, a number of language codes have been deprecated, and some added.

    • Default content locales, likely subtags, and language data have been added.

  • Units

    • New machine-readable data is supplied for the structure of unit IDs, with the unitIdComponent element.

    • The length of a light-year has been adjusted to the IAU value (which uses a Julian year of 365.25 days).

    • The unit ID for metric-ton has been deprecated in favor of tonne. (see Migration)

  • Transform names for Ethiopic have been changed (with the old names being deprecated and aliased to the new names).

  • Dates and times

    • The Yukon metazone has been un-deprecated

    • 13 locales have the islamic calendar added (thus requiring localization)

    • The first day of the week is now Monday in CN (CLDR-11510)

    • Day-periods are added for hi_Latn, and adjusted for mr to only have evening1 (see Migration)

  • Plurals

    • Hebrew has a category removed ('many'), while mt, vec, and ast have categories added. (see Migration)

    • Maltese now has the category 'two' CLDR-14665

    • Plural rules were added for Asturian CLDR-13972

    • Catalan now has the category 'many' CLDR-15599

    • Some rules have been tweaked.

  • Currencies

    • For Sierra Leone, the new currency SLE is now an official tender; the older currency SLL ceases to be legal tender after 2023-03-31.

    • For Croatia, EUR becomes legal tender on 2023-01-01, and the old currency HRK ceases to be legal tender after 2023-01-15.

  • Timezones

    • Added support for time zone data 2022e. For 2022e: after 2022-10-27, Asia/Amman and Asia/Damascus are removed from metazone Europe_Eastern, with no replacement metazone.

  • For more information, see delta/supplemental-data.html
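
As an illustration of the alt="alphaNextToNumber" variant listed under the currency format additions above, the following Python sketch picks between a standard pattern and its alphaNextToNumber variant. The function name, the dictionary layout, and the sample patterns are assumptions made for this example; they are not CLDR data or an existing library API, and the sketch ignores patterns with separate negative subpatterns.

```python
import unicodedata

# Hypothetical sample patterns; real values come from a locale's
# currencyFormat data.
SAMPLE_PATTERNS = {
    "standard": "¤#,##0.00",
    "alphaNextToNumber": "¤\u00a0#,##0.00",  # variant with a no-break space
}


def choose_currency_pattern(symbol: str, patterns: dict) -> str:
    """Return the alphaNextToNumber variant when a letter would touch the number."""
    standard = patterns["standard"]
    variant = patterns.get("alphaNextToNumber")
    if variant is None or not symbol:
        return standard
    # Find the character of the symbol that sits next to the numeric part.
    # Assumption: the currency sign is at one edge of the pattern.
    if standard.startswith("¤"):
        edge = symbol[-1]
    elif standard.endswith("¤"):
        edge = symbol[0]
    else:
        return standard
    # Unicode general categories starting with "L" are letters.
    is_letter = unicodedata.category(edge).startswith("L")
    return variant if is_letter else standard
```

With these sample patterns, a symbol such as "USD" selects the variant with the no-break space, while "$" or "US$" keeps the standard pattern, because the character next to the number is not a letter.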

Locale Changes

  • Coverage and general data

    • Modern coverage was increased by adding:

      • display names for a number of additional languages such as Cajun French, Kwakʼwala, Rohingya; display names for some additional scripts, calendars, collation and number system types.

      • atTime patterns

      • the quarter unit (quarter of a year)

      • 31 emoji short names and search keywords.

      • patterns and other data for formatting person names

        • sample names were also added, but their use is primarily internal

      • For a sample of added items see delta/sk.html, and for additional islamic calendar data, delta/ms.html

    • New languages at basic: bgc, bho, raj

    • Large-scale normalization of different kinds of spaces (see Migration)

    • The currency formats for Arabic and Hebrew were improved to provide more consistent layout for different contexts (right-to-left, neutral) and different types of currency symbols.

    • For more information, see delta/index.html

  • Subdivision translations

    • Most subdivision names will have draft="provisional". These are derived from Wikidata, and are not further curated.

    • Exceptions are curated names: the names in English, and the names in many other languages for 3 subdivisions of GB. The latter are the only subdivisions used for emoji flags.
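
Because the space normalization mentioned above can change the exact space characters in formatted output, tests that compare against stored strings may benefit from a lenient comparison. The Python sketch below folds a few space variants before comparing; the helper names and the particular set of characters are assumptions for illustration, not a list defined by CLDR.

```python
import re

# Space-like characters treated as equivalent for comparison purposes.
# This set (space, no-break space, thin space, narrow no-break space)
# is an assumption chosen for the example.
_SPACE_VARIANTS = re.compile("[\u0020\u00a0\u2009\u202f]+")


def fold_spaces(text: str) -> str:
    """Replace each run of space variants with a single ASCII space."""
    return _SPACE_VARIANTS.sub(" ", text)


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring which kind of space is used."""
    return fold_spaces(a) == fold_spaces(b)
```

A comparison such as loosely_equal("10\u00a0km", "10 km") then succeeds, while strings that differ in whether a space is present at all still compare as different.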

File Changes

  • The following were promoted from seed to common:

    • Annotations and Casing: oc.xml

    • Main: cv.xml, cv_RU.xml, oc.xml, oc_FR.xml, sms.xml, sms_FI.xml

  • New Files in main: annotations/ff.xml, annotations/ff_Adlm.xml, collation/fy.xml

  • New files in main/common: ann.xml, ann_NG.xml, bgc.xml, bgc_IN.xml, bho.xml, bho_IN.xml, frr.xml, frr_DE.xml, mdf.xml, mdf_RU.xml, oc_ES.xml, pis.xml, pis_SB.xml, raj.xml, raj_IN.xml, tok.xml, tok_001.xml

  • New files in main/rbnf: kk.xml

  • New in common/segments: fi.xml, sv.xml

  • A number of Ethiopic transliterator files were renamed, see CLDR-15351

  • For more information, see file-cldr-41-vs-42-txt

JSON Data Changes

  • JSON data is available:
    https://github.com/unicode-org/cldr-json/releases/tag/42.0.0

  • New or Changed Data:

    • TECH PREVIEW data for person names - CLDR-15414
      New data in cldr-core,
      also new packages cldr-person-names-full and cldr-person-names-modern

    • coverageLevels.json data in cldr-core - CLDR-15624

    • Additional currency data - CLDR-15958
      besides standard and accounting, new patterns:
      standard-alphaNextToNumber, standard-noCurrency,
      accounting-alphaNextToNumber, accounting-noCurrency

    • new 'atTime' data CLDR-16032

  • Fixes:

Background

Formatting people’s names

Software needs to be able to format people's names, such as John Smith or 宮崎駿. The data is typically drawn from a database, where a name record will have fields for the parts of people’s names, such as a given field with a value of “Maria”, and a surname field value of “Schmidt”.

There are many complications in dealing with the variety of different ways this needs to be done across languages:

  • People may have a different number of names, depending on their culture — they might have only one name (“Zendaya”), only two (“Albert Einstein”), or three or more. So the software needs to handle missing or extra name fields gracefully.

  • People may have multiple words in a particular name field, e.g., “Mary Beth” as a given name, or “van Berg” as a surname.

  • Some languages, such as Spanish, have two surnames (where each can be composed of multiple words).

  • The ordering of name fields can be different across languages, as well as the spacing (or lack thereof) and punctuation.

    • The ordering can be complicated; take the ordering of surname vs given name. Sometimes a language X will display names from a language Y with a different order than names in language X. And that can even happen when language Y uses the same order as language X.

  • Name formatting needs to be adapted to different circumstances: shorter or longer presentation; a formal or informal context; talking about someone versus talking to someone; a monogram (such as “JFK”); or a sorted list (such as “Smith, John”).

CLDR has added structured patterns that enable implementations to format the available name fields for a given language. The formatting for a name can vary according to the available name fields, the language of the name and of the viewer, and various input settings.
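
To make the idea of pattern-based name formatting concrete, here is a minimal Python sketch that fills a pattern from whatever name fields are available and tidies up after missing ones. The placeholder syntax, the field names, and the function are hypothetical and heavily simplified; this is not the CLDR person name algorithm, which is specified in LDML.

```python
import re


def format_name(pattern: str, fields: dict) -> str:
    """Fill {field} placeholders, dropping those that have no value."""

    def substitute(match):
        # Missing or empty fields contribute nothing to the output.
        return fields.get(match.group(1)) or ""

    filled = re.sub(r"\{(\w+)\}", substitute, pattern)
    # Collapse whitespace left behind by empty fields.
    filled = re.sub(r"\s+", " ", filled).strip()
    # Remove a comma or space orphaned at either end by a missing field.
    return filled.strip(", ")
```

For example, the pattern "{given} {given2} {surname}" with only given and surname values yields "Maria Schmidt", and the sorting-style pattern "{surname}, {given}" yields "Smith, John" when both fields are present and just "Zendaya" when only a given name exists.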

The new Person Name formatting data has tech preview status. The CLDR committee is requesting feedback on the data and structure so that it can be refined and enhanced in the next release. ICU will also be offering a tech preview API in its next release. Other clients of CLDR are encouraged to try out the new data and structure, and to send feedback to the CLDR committee in the next few months.

Specification Changes

The following are the main changes in the specification:

  • Locales

    • Updated the description of guidelines and invariants for Parent Locales data.

  • Hybrid locales

    • In Hybrid Locale Identifiers, clarified how hybrid locales work, with better examples, especially regarding the default script for hybrid locales: hi-t-en-h0-hybrid implies Devanagari script, whereas hi-Latn-t-en-h0-hybrid specifies Latin script.

  • Currency Formats and Currencies

    • Described the new alt="alphaNextToNumber" and alt="noCurrency" variants for patterns used with currencyFormat elements

    • Described the new currencyPatternAppendISO element under currencyFormats

    • Discouraged the use of the old currencySpacing element (and its subelements) in favor of the alt="alphaNextToNumber" variant

  • Dates

    • Element dateTimeFormat

      • Described the new dateTimeFormat type="atTime" pattern and when to use it versus the standard dateTimeFormat pattern.

    • Matching Skeletons

      • Provided more detailed recommendations on matching pattern field length to field length in the requested skeleton.

  • Plurals

  • Units of measurement

    • Unit Preferences

      • Added a new subsection to specify the interaction of the unit Preferences data with the locale keys mu, ms, and rg, and the base locale.

    • Unit Elements, Unit_Conversion

      • For simpler and cleaner parsing, added a new element (unitIdComponent) and restructured the EBNF for parsing unit identifiers.

      • As part of this work, the identifier metric-ton was deprecated in favor of tonne. As usual, the older identifier remains for compatibility, and is aliased to the new one.

  • Person Names

    • Added a new Part 8, Person Names.
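
The aliasing of the deprecated metric-ton identifier to tonne, described under Unit Elements above, can be pictured with a small Python sketch. The alias table and function name are hypothetical, and the handling of compound identifiers (splitting on "-per-") is a simplification for illustration; a real implementation would load aliases from CLDR data and follow the unit identifier grammar in the specification.

```python
# Hypothetical alias table; a real implementation would load unit
# aliases from CLDR supplemental data rather than hard-coding them.
UNIT_ALIASES = {"metric-ton": "tonne"}


def canonical_unit_id(unit_id: str) -> str:
    """Map deprecated unit identifiers to their replacements."""
    # Simplification: treat "-per-" as the only compound separator and
    # resolve each side independently.
    parts = unit_id.split("-per-")
    resolved = [UNIT_ALIASES.get(part, part) for part in parts]
    return "-per-".join(resolved)
```

Identifiers without an alias pass through unchanged, so callers can apply the function to every unit ID they receive.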

Growth

The following chart shows the growth of CLDR locale-specific data over time. It is restricted to data items in /main and /annotations directories, so it does not include the non-locale-specific data. The % values are percent of the current measure of Modern coverage. That level is notched up each release, so previous releases had many locales that were at Modern coverage as assessed at the time of their release. There is one line per year, even though there were multiple releases in most years.

The detailed information on changes between the v41 and v42 releases is in v42 delta_summary.tsv: look at the TOTAL line for the overall counts of Added/Changed/Deleted. See v42 locale-growth.tsv for the detailed figures behind the chart.

CLDR v42 Growth

Migration

  • Data normalization. There was an extensive normalization of different kinds of spaces (normal, non-breaking, thin, etc.) for consistency of behavior - CLDR-14032

    • May impact tests of golden data

    • Reinforces the need to be lenient with spaces in parsing

  • Plural rules.

    • Additions. Added 'many' category for Asturian, Catalan. Implementations should handle these changes as they did for French and Spanish. They only affect messages with large numbers. Robust implementations will gracefully fall back to the 'other' category if a previously translated message doesn't have a new category; unfortunately, some implementations do not follow that practice. Maltese now has the 'two' category.

    • Removals. The 'many' plural category for Hebrew (CLDR-14634) was removed; it is unnecessary in modern practice. Such changes usually do not affect migration.

    • Changes. There were a few changes to the rules that affect how numbers are assigned to categories. Such changes usually do not affect migration.

  • Compact number formats

    • Compact decimal and currency formats are now allowed for values up to 10000000000000000000 (and these are used in locale data for ja).

  • Unit Identifiers. The metric measurement unit ID changed from 'metric-ton' to 'tonne'. The old ID is still valid, but deprecated and aliased to the new unit ID. So as long as an implementation handles aliases, there should be no migration issues.

  • Subdivisions. Other than three subdivisions of GB, country subdivisions will be marked as 'provisional'. This provides a better indication of their status.

  • Coverage

    • The CLDR-TC plans to combine the 'seed' directory into the 'common' directory in the future (CLDR-6396). To prepare for this, your application should make use of the common/properties/coverageLevels.txt file in order to determine the completeness of a locale, rather than whether a locale file is in 'common' or 'seed'.

  • Currencies

    • Croatia changes to use Euro starting in January 2023.

  • Keyboards

    • The Keyboard-SC is working on a major revamp of the Keyboard specification, planned for release in late 2022. “Keyboard 3.0” has a very different goal than the original format, and therefore existing keyboard files are not expected to interoperate with new implementations. For this reason, an entirely new DTD will be created. See CLDR-15034 for the latest status or to give feedback.
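
Returning to the plural rule additions earlier in this section: the fallback to the 'other' category that robust implementations perform can be sketched in a few lines of Python. The function name and the message dictionary layout are assumptions for illustration, and the plural category is taken as an input rather than computed from the CLDR plural rules.

```python
def select_message(messages: dict, category: str) -> str:
    """Return the message for a plural category, falling back to 'other'."""
    if category in messages:
        return messages[category]
    # A message translated before a category was added will not contain
    # it, so fall back to the 'other' form, which is always expected.
    return messages["other"]


# Hypothetical message translated before the 'many' category existed
# for its language.
OLD_MESSAGE = {"one": "{n} file", "other": "{n} files"}
```

With this approach, select_message(OLD_MESSAGE, "many") returns the 'other' form instead of failing, so previously translated messages keep working after a category is added.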

Known Issues

    • In the cldr-staging files with github tag “release-42”, the DTD Deltas chart for CLDR 42 is missing the changes for v42. The online versions have been updated and include the changes for v42. [CLDR-16097]

Upcoming changes

    • Display name for region TR

      • In CLDR, the English standard display name for region TR is “Turkey”, with alt="variant" form “Türkiye”. In an upcoming version (possibly as early as CLDR 43), these will switch, so the standard form will become “Türkiye”.

Acknowledgments

Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.

The Unicode Terms of Use apply to CLDR data; in particular, see Exhibit 1.

For web pages with different views of CLDR data, see http://cldr.unicode.org/index/charts.