CLDR 41 Release Note

Overview

Unicode CLDR provides key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.


CLDR v41 is a limited-submission release. Most work was on tooling, with only specified updates to the data, namely Phase 3 of the grammatical units of measurement project. The required grammar data for the Modern coverage level increased, with 40 locales adding an average of 4% new data each. Ukrainian grew the most, by 15.6%.


The tooling changes are targeted at the v42 general submission release. They include a number of features and improvements such as progress meter widgets in the Survey Tool.


Finally, the Basic level has been modified to make it easier to onboard new languages, and easier for implementations to filter locale data based on coverage levels.

The following table shows the number of Languages/Locales in this version. (See the v41 Locale Coverage table for more information.)

Beyond the member organizations of the Unicode Consortium, many dedicated communities and individuals regularly contribute to updating their locales, including:

  • Modern: Cherokee, Cantonese, Sorbian (Lower), Scottish Gaelic, Sorbian (Upper)

  • Moderate: Asturian [nearly Modern], Breton, Faroese, Fulah (Adlam), Kaingang, Nheengatu, Quechua, Sardinian

  • Basic: Bosnian (Cyrillic), Interlingua, Kabuverdianu, Māori, Romansh, Tajik, Tatar, Tongan, Uzbek (Cyrillic), Wolof

Data Changes

Because this is a limited-submission release, the data changes are limited. The focus for data this release was on Phase 3 of the project for providing grammatical information for units of measurement, with more locales reaching a modern coverage level, plus Phase 1 of a project to revamp Coverage levels.

  1. There are no DTD changes in this release.

Locale Changes

  1. Inflected Unit Data. The inflected unit data allows formatted units to adapt to the context, particularly grammatical case, required for many languages. Locales at a modern level — where CLDR has grammatical feature data — now provide grammatical inflections for the common metric units (and a subset also provide grammatical inflections for common US/UK units). (Example: Armenian)

  2. Minimal pairs. The minimal pairs show how translated material needs to adapt to context (plural category, grammatical case, etc.). The minimal pairs for grammatical features have been reviewed and in many cases corrected. (Example: Hindi)

  3. Hindi (Latin). There have been substantial additions made to hi_Latn.xml. Note that based on user expectations, hi_Latn incorporates a large amount of English, and can also be referred to as "Hinglish". That is, it is assumed to be content more formally identified as hi-Latn-t-en-h0-hybrid.

  4. Sublocales. There is a new sublocale: en_MV.xml

  5. Transliteration. Fourteen new transforms (and associated test files) have been added for the Ethiopic script and languages written in it. Note: the file names are not necessarily the best representation of the content; they may change in v42. Thanks to Daniel Yacob for his contributions of this data.

  6. Other. There are additional small changes to a number of locales (See charts)
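
As a rough illustration of the inflected unit data in item 1, the sketch below selects a unit pattern by grammatical case and plural category. The table is hand-written German sample data for illustration only, not copied from CLDR. The names PATTERNS, plural_category, and format_unit are hypothetical, and both the simplified one/other plural rule and the fallback to nominative are assumptions of this sketch.

```python
# Hand-written German patterns for "day"; illustrative only, not CLDR data.
PATTERNS = {
    ("nominative", "one"): "{0} Tag",
    ("nominative", "other"): "{0} Tage",
    ("dative", "one"): "{0} Tag",
    ("dative", "other"): "{0} Tagen",
}


def plural_category(n):
    """Simplified one/other plural rule for integers (assumption of this sketch)."""
    return "one" if n == 1 else "other"


def format_unit(n, case="nominative"):
    """Pick the pattern for the requested case, falling back to nominative."""
    category = plural_category(n)
    pattern = PATTERNS.get((case, category)) or PATTERNS[("nominative", category)]
    return pattern.replace("{0}", str(n))


print(format_unit(3))                    # 3 Tage
print("mit", format_unit(3, "dative"))   # mit 3 Tagen
```

A caller embedding the unit after a preposition that governs the dative would request that case, while a standalone label would use the nominative default.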

  1. Coverage Levels. The Basic level has been modified to make it easier to onboard new languages, and easier for implementations to filter locale data based on coverage levels.

    • The target is locales that are "selectable" in a UI, and have at least the very basic functions for formatting dates, times, and numbers. They also need to have the name of the locale and the regions where it is official in the native language, for construction of locale names. (The target for the Moderate level is a higher level of formatting for "document content", such as the content in a spreadsheet, while the target for the Modern level is the highest level of coverage, for locales requiring full functionality.)

    • There is a new machine-readable property file (coverageLevels.txt) that provides the levels for any locales that meet the requirements for Basic and above. That way implementations can more easily filter locales by the specific coverage level they want to use.

    • The Locale Coverage chart has also been revised to make it easier to use, and the associated TSV file (locale-coverage.tsv) has been updated.

  2. BCP47. The lw-phrase key-value pair has been added, to indicate a request to 'Prioritize keeping natural phrases (of multiple words) together when breaking, used in short text like title and headline'. Some new -t- extension values have been added for transforms.

  3. Unicode. Recent Unicode script values have been added: Kawi and Nagm. (chart)

  4. Units. The system values add 'metric' to certain units commonly used in many metric countries, even though they are not metric units, such as 'hour' or 'arc-minute'. The grammatical features for units in certain locales have been refined: adding to some locales (e.g., dative and locative to Czech) and removing from some locales (e.g., accusative, dative, etc. from Malayalam). (chart)

  5. Week Data. Weekend start/end data has changed for certain locales. The default time cycle (h/H) has been made explicit for some locales (instead of just inheriting from World). (chart)

  6. Language Info. Language population data and likely subtags have been added for some indigenous Canadian languages. (chart)
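
The locale identifiers mentioned above (hi-Latn-t-en-h0-hybrid and the lw-phrase key-value pair) carry their extra information in singleton extensions. The following is a minimal sketch of separating a tag into its base subtags and extensions. The function name split_extensions is hypothetical, the example tag en-u-lw-phrase assumes that lw is carried in the -u- extension, and the code does no validation, so it is not a full BCP 47 parser.

```python
def split_extensions(tag):
    """Split a language tag into base subtags and singleton extensions.

    Simplified sketch: subtags are not validated, and private use ("x")
    is treated like any other singleton.
    """
    base, extensions, current = [], {}, None
    for subtag in tag.lower().split("-"):
        if len(subtag) == 1 and base:
            # A single-character subtag after the base starts an extension.
            current = subtag
            extensions[current] = []
        elif current is None:
            base.append(subtag)
        else:
            extensions[current].append(subtag)
    return base, extensions


print(split_extensions("hi-Latn-t-en-h0-hybrid"))
# (['hi', 'latn'], {'t': ['en', 'h0', 'hybrid']})
print(split_extensions("en-u-lw-phrase"))
# (['en'], {'u': ['lw', 'phrase']})
```

Production code would normally rely on an existing locale library rather than hand-rolled splitting; the sketch only shows where the -t- and -u- data sits within the tag.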

File Changes

  1. The following files moved from Seed to Common: hi_Latn.xml, hi_Latn_IN.xml, ks_Deva.xml, ks_Deva_IN.xml

  2. The new file /common/properties/coverageLevels.txt contains locales that meet coverage levels Modern, Moderate, or Basic. This allows implementations to easily filter to their desired coverage level.

  3. New files for transform rules and tests are added for Ethiopic.
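
As a sketch of the filtering that coverageLevels.txt is meant to enable, the code below reads level assignments and keeps the locales at or above a chosen level. It assumes a semicolon-separated "locale ; level ; name" layout with "#" comments, which should be checked against the actual file. The inline sample and the names read_levels and locales_at_least are hypothetical.

```python
import io

# Inline stand-in for coverageLevels.txt; the layout is an assumption.
SAMPLE = """\
# locale ; level ; name
chr ; modern ; Cherokee
br ; moderate ; Breton
wo ; basic ; Wolof
"""

LEVEL_ORDER = {"basic": 0, "moderate": 1, "modern": 2}


def read_levels(lines):
    """Return {locale: level}, skipping comments, blanks, and unknown levels."""
    levels = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) >= 2 and fields[1] in LEVEL_ORDER:
            levels[fields[0]] = fields[1]
    return levels


def locales_at_least(levels, minimum):
    """Return the sorted locales whose level is at or above `minimum`."""
    threshold = LEVEL_ORDER[minimum]
    return sorted(loc for loc, lvl in levels.items() if LEVEL_ORDER[lvl] >= threshold)


levels = read_levels(io.StringIO(SAMPLE))
print(locales_at_least(levels, "moderate"))  # ['br', 'chr']
```

To run this against a real checkout, the StringIO sample would be replaced by an open file handle for common/properties/coverageLevels.txt.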

JSON Data Changes

  • There are no significant changes, but be aware of the packaging changes from v40.

Specification Changes

The following are the main changes in the specification:

Tooling Changes

Survey Tool

  • Improved ability for translators to see their progress:

    • Progress Meters — There are new progress meters for translators.

    • Dashboard — Now always shows Error and Missing items. In addition, adds Abstained items.

Developer

  • Commit Checker. Allows irrelevant commits to be excluded, reducing review time.

  • Line Numbers — The checks on data values now report XML file line numbers, making it easier to find and fix errors.

Migration

  1. There is an increasing amount of cross-language inheritance, which may require some code changes. CLDR-15378

Upcoming Changes

  1. Subdivision names. The subdivision names are being deprecated, with the exception of the English names and the names in other languages for the three subdivisions of GB used in the RGI Emoji (England, Scotland, Wales).

    • The deprecated data had been collected by merging in data from Wikipedia and ISO, but did not undergo any substantial vetting beyond that, due to resourcing constraints.

    • The deprecated data remains in v41, but the plan is to remove it from v42.

  2. Seed directory. Locale files have been separated into two directories: seed and common. The seed locales were those that (roughly) didn't satisfy the Basic level. Starting in v42, the plan is to have all locale files in the common directory, and to deprecate the seed directory.

  3. JDK11. CLDR will update to JDK11 in v42. CLDR-14311

  4. Transform Names. Several new Ethiopic transforms were added in CLDR 41. Some of these have names that do not follow normal conventions; these will be renamed in an upcoming CLDR release. (CLDR-15351)

  5. Coverage Levels. Phase 2 of the coverage level project will move a number of items from Comprehensive into Modern.

  6. Person Names. In Phase 1 of the Person Name Formatting project, the infrastructure for gathering data for formatting people's names in different locales will be added, and data will be gathered for a select number of locales.

  7. Keyboards. The Keyboard-SC is working on a major revamp of the Keyboard specification, planned for release in late 2022. “Keyboard 3.0” has a very different goal than the original format, and therefore existing keyboard files are not expected to interoperate with new implementations. For this reason, an entirely new DTD will be created.

Growth

The following shows the growth of CLDR data per year, represented as an area chart.

  • Each area represents the incremental increase in data during that year, as a percentage of current Modern coverage

  • For year 2022 there is a small amount of data so far (top area), because the main cycle for submission will not be done until September. That area shows how certain locales were fleshed out as a result of the focus on completing the inflected units in v41.

  • Data before 2015 is suppressed, so the lowest area represents the data in 2016.

  • Hovering over the top line of the area shows the percentages.

Known Issues

This section will contain issues that arise after the data, code, or spec has been frozen.

  • Several new Ethiopic transforms were added in CLDR 41. Some of these have names that do not follow normal conventions; these will be renamed in an upcoming CLDR release. (CLDR-15351)

  • Some of the new exhaustive tests are failing (CLDR-15486). However, the failures do not appear to be caused by problems in the data, and are likely due to an issue in the test code.

  • There was a problem in generating the date/time verification charts (CLDR-15517), whereby interval formats with "B" fail. As a result, in a few locales three lines in those charts show "n/a" instead of the correct value. For an example, see Albanian.

  • The Yukon metazone is missing the short ID (CLDR-15518). This will only affect users of those short IDs: they can patch metaZone.xml to add: <metazoneId shortId="yuko" longId="Yukon" />
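
The metaZone.xml patch described in the last item could be scripted along the following lines. This is a sketch only: the inline XML is a tiny stand-in with a placeholder entry, the real file's layout may differ, and the helper name add_metazone_id is hypothetical. Rather than assuming a fixed path, the code looks for whichever element already holds metazoneId children.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for metaZone.xml with a placeholder entry (not real data).
SAMPLE = """<supplementalData>
  <metaZones>
    <metazoneIds>
      <metazoneId shortId="exam" longId="Example"/>
    </metazoneIds>
  </metaZones>
</supplementalData>"""


def add_metazone_id(root, short_id, long_id):
    """Append a metazoneId beside existing ones; return True if one was added."""
    for parent in root.iter():
        siblings = parent.findall("metazoneId")
        if not siblings:
            continue
        if any(s.get("longId") == long_id for s in siblings):
            return False  # already present, nothing to do
        ET.SubElement(parent, "metazoneId", shortId=short_id, longId=long_id)
        return True
    raise ValueError("no existing metazoneId elements found")


root = ET.fromstring(SAMPLE)
print(add_metazone_id(root, "yuko", "Yukon"))  # True
print(add_metazone_id(root, "yuko", "Yukon"))  # False
```

Note that ElementTree does not keep the DOCTYPE declaration when a parsed file is written back, so for a one-off fix a manual edit of the file may be simpler.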

Acknowledgments

Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.

The Unicode Terms of Use apply to CLDR data; in particular, see Exhibit 1.

For web pages with different views of CLDR data, see http://cldr.unicode.org/index/charts.