CLDR 41 Release Note
Unicode CLDR provides key building blocks for software supporting the world's languages. CLDR data is used by all major software systems (including all mobile phones) for their software internationalization and localization, adapting software to the conventions of different languages.
The following table shows the number of Languages/Locales in this version. (See the v41 Locale Coverage table for more information.)
Beyond the member organizations of the Unicode Consortium, many dedicated communities and individuals regularly contribute to updating their locales, including:
Modern: Cherokee, Cantonese, Sorbian (Lower), Scottish Gaelic, Sorbian (Upper)
Moderate: Asturian [nearly Modern], Breton, Faroese, Fulah (Adlam), Kaingang, Nheengatu, Quechua, Sardinian
Basic: Bosnian (Cyrillic), Interlingua, Kabuverdianu, Māori, Romansh, Tajik, Tatar, Tongan, Uzbek (Cyrillic), Wolof
Because this is a limited-submission release, the data changes are limited. The focus for data in this release was on Phase 3 of the project for providing grammatical information for units of measurement, with more locales reaching a modern coverage level, plus Phase 1 of a project to revamp Coverage levels.
There are no DTD changes in this release.
Inflected Unit Data. The inflected unit data allows formatted units to adapt to the context, particularly grammatical case, required for many languages. Locales at a modern level — where CLDR has grammatical feature data — now provide grammatical inflections for the common metric units (and a subset also provide grammatical inflections for common US/UK units). (Example: Armenian)
Minimal pairs. The minimal pairs show how translated material needs to adapt to context (plural category, grammatical case, etc.). The minimal pairs for grammatical features have been reviewed and in many cases corrected. (Example: Hindi)
Hindi (Latin). There have been substantial additions made to hi_Latn.xml. Note that based on user expectations, hi_Latn incorporates a large amount of English, and can also be referred to as "Hinglish". That is, it is assumed to be content more formally identified as hi-Latn-t-en-h0-hybrid.
Sublocales. There is a new sublocale: en_MV.xml
Transliteration. Fourteen new transforms (and associated test files) have been added for the Ethiopic script and languages written in it. Note: the file names are not necessarily the best representation of the content; they may change in v42. Thanks to Daniel Yacob for his contributions of this data.
Other. There are additional small changes to a number of locales (see charts).
Coverage Levels. The Basic level has been modified to make it easier to onboard new languages, and easier for implementations to filter locale data based on coverage levels.
The target for the Basic level is locales that are "selectable" in a UI, and have at least the very basic functions for formatting dates, times, and numbers. They also need to have the name of the locale and the regions where it is official in the native language, for construction of locale names. (The target for the Moderate level is a higher level of formatting for "document content", such as the content in a spreadsheet, while the target for the Modern level is the highest level of coverage, for locales requiring full functionality.)
There is a new machine-readable property file (coverageLevels.txt) that provides the levels for any locales that meet the requirements for Basic and above. That way implementations can more easily filter locales by the specific coverage level they want to use.
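The filtering described above can be sketched as follows. This is a minimal illustration, not the CLDR tooling: it assumes each data line has the form "locale ; level ; name" with "#" starting a comment, and the sample text, LEVEL_ORDER, parse_coverage_levels, and locales_at_least are hypothetical names introduced here. Check the header of the shipped coverageLevels.txt for the actual layout before relying on it.

```python
# Hypothetical sample in the ASSUMED layout "locale ; level ; name".
# The real coverageLevels.txt may differ; read its header comments first.
SAMPLE = """\
# locale ; level ; name   (assumed layout)
chr ; modern ; Cherokee
ast ; moderate ; Asturian
wo ; basic ; Wolof
"""

# Ordering of the three levels named in the release note, lowest first.
LEVEL_ORDER = {"basic": 0, "moderate": 1, "modern": 2}


def parse_coverage_levels(text):
    """Return {locale: level} for every well-formed data line in text."""
    levels = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # not a data line in the assumed layout
        locale, level = fields[0], fields[1].lower()
        if level in LEVEL_ORDER:
            levels[locale] = level
    return levels


def locales_at_least(levels, minimum):
    """Return the sorted locales whose level is minimum or higher."""
    threshold = LEVEL_ORDER[minimum]
    return sorted(
        locale for locale, level in levels.items()
        if LEVEL_ORDER[level] >= threshold
    )


levels = parse_coverage_levels(SAMPLE)
print(locales_at_least(levels, "moderate"))  # ['ast', 'chr']
```

An implementation that ships only Modern data would call locales_at_least(levels, "modern") and skip every other locale file.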
BCP47. The lw-phrase key-value pair has been added, to indicate a request to 'Prioritize keeping natural phrases (of multiple words) together when breaking, used in short text like title and headline'. Some new -t- extension values have been added for transforms.
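As an illustration of how an implementation might detect the new value, the following minimal sketch extracts the -u- extension keywords from a locale identifier. The tag ja-u-lw-phrase is only an example, the function name unicode_keywords is hypothetical, and the code handles well-formed tags only; it is not a full BCP 47 parser.

```python
def unicode_keywords(tag):
    """Return {key: value} for the -u- extension of a well-formed tag.

    Simplified sketch: keys are the two-character subtags after the "u"
    singleton, and the longer subtags that follow a key form its value.
    """
    subtags = tag.lower().replace("_", "-").split("-")
    try:
        start = subtags.index("u")
    except ValueError:
        return {}  # no -u- extension present

    collected = {}
    key = None
    for subtag in subtags[start + 1:]:
        if len(subtag) == 1:
            break  # another singleton starts a different extension
        if len(subtag) == 2:
            key = subtag
            collected[key] = []
        elif key is not None:
            collected[key].append(subtag)
        # Subtags before the first key are attributes; ignored here.

    # A key with no value subtags is treated as "true".
    return {k: "-".join(v) or "true" for k, v in collected.items()}


keywords = unicode_keywords("ja-u-lw-phrase")
print(keywords.get("lw"))  # phrase
```

A line-breaking implementation could check keywords.get("lw") == "phrase" and fall back to its default behavior for any other value.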
Unicode. Recent Unicode script values have been added: Kawi and Nagm. (chart)
Units. The system values add 'metric' to certain units commonly used in many metric countries, even though they are not metric units, such as 'hour' or 'arc-minute'. The grammatical features for units in certain locales have been refined: adding to some locales (e.g., dative and locative to Czech) and removing from some locales (e.g., accusative, dative, etc. from Malayalam). (chart)
Week Data. Weekend start/end data has changed for certain locales. The default time cycle (h/H) has been made explicit for some locales (instead of just inheriting from World). (chart)
Language Info. Language population data and likely subtags have been added for some indigenous Canadian languages. (chart)
The following files moved from Seed to Common: hi_Latn.xml, hi_Latn_IN.xml, ks_Deva.xml, ks_Deva_IN.xml
The new file /common/properties/coverageLevels.txt contains locales that meet coverage levels Modern, Moderate, or Basic. This allows implementations to easily filter to their desired coverage level.
New files for transform rules and tests are added for Ethiopic.
JSON Data Changes
There are no significant changes, but be aware of the packaging changes from v40.
The following are the main changes in the specification:
There is an increasing amount of cross-language inheritance, which may require some code changes. CLDR-15378
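To show why code changes may be needed, here is a minimal sketch of a lookup chain that consults an explicit parent table before falling back to truncation. The table contents below are hypothetical and chosen only for illustration; the real mappings come from the CLDR supplemental data. The names EXPLICIT_PARENTS and inheritance_chain are likewise assumptions, not part of any CLDR API.

```python
# HYPOTHETICAL parent table for illustration only; real values must be
# read from the CLDR supplemental data, not copied from this sketch.
EXPLICIT_PARENTS = {
    "hi_Latn": "en_IN",
    "en_IN": "en_001",
}


def inheritance_chain(locale, explicit_parents):
    """Return the lookup chain from locale up to "root".

    An explicit parent, when present, takes priority over simply
    truncating the last underscore-separated subtag.
    """
    chain = [locale]
    seen = {locale}
    current = locale
    while current != "root":
        if current in explicit_parents:
            parent = explicit_parents[current]
        elif "_" in current:
            parent = current.rsplit("_", 1)[0]
        else:
            parent = "root"
        if parent in seen:
            raise ValueError(f"inheritance cycle at {parent!r}")
        chain.append(parent)
        seen.add(parent)
        current = parent
    return chain


print(inheritance_chain("hi_Latn_IN", EXPLICIT_PARENTS))
# ['hi_Latn_IN', 'hi_Latn', 'en_IN', 'en_001', 'en', 'root']
```

Code that assumes the language subtag never changes along the chain (for example, by truncating the identifier directly) would skip the en_IN, en_001, and en steps in this hypothetical case.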
Subdivision names. The subdivision names are being deprecated, with the exception of the English names and the names in other languages for the three subdivisions of GB used in the RGI Emoji (England, Scotland, Wales).
The deprecated data had been collected by merging in data from Wikipedia and ISO, but did not undergo any substantial vetting beyond that, due to resourcing constraints.
The deprecated data remains in v41, but the plan is to remove it from v42.
Seed directory. Locale files have been separated into two directories: seed and common. The seed locales were those that (roughly) didn't satisfy the Basic level. Starting in v42, the plan is to have all locale files in the common directory, and to deprecate the seed directory.
JDK11. CLDR will update to JDK11 in v42. CLDR-14311
The following shows the growth of CLDR data per year, represented as an area chart.
Each area represents the incremental increase in data during that year, as a percentage of current Modern coverage.
For year 2022 there is a small amount of data so far (top area █), because the main cycle for submission will not be done until September. That area shows how certain locales were fleshed out as a result of the focus on completing the inflected units in v41.
Data before 2015 is suppressed, so the lowest area (█) represents the data in 2016.
Hovering over the top line of the area shows the percentages.
This section will contain issues that arise after the data, code, or spec has been frozen.
Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.