CLDR 33 Release Note

Overview

Unicode CLDR 33 provides an update to the key building blocks for software supporting the world's languages. This data is used by all major software systems for internationalization and localization, adapting software to the conventions of different languages for common software tasks.

This release had a limited submission phase. The focus was on improvements to emoji keywords and to the Odia and Assamese locales, addition of typographic names data, and improvements to the structure for specifying keyboard layouts.

Improvements in this release include:

  • Structure

      • New structure for typographicNames translations (such as terms for Bold, Italic, ...), with data for 33 locales.

      • The structure for specifying keyboard layouts was significantly enhanced, with many new elements and attributes, and expanded syntax for some preëxisting attribute values. See spec for details: Keyboards.

  • Additional Translations/Data

      • Annotations (emoji keywords) for a limited set of locales had a full review (ar, en_GB, de, es, ja, ru).

      • Two additional locales (Odia, Assamese) were brought up to Modern coverage level; some missing items were added in other locales.

      • New typographicNames data added, with translations in 33 locales.

      • Added 4 new transforms: fa-fa_FONIPA, ha-ha_NE, nv-nv_FONIPA, vec-vec_FONIPA.

      • Added number spellout (RBNF) rules for sw (Swahili), ff (Fulfulde/Fula), qu (Quechua), lb (Luxembourgish), ccp (Chakma), su (Sundanese).

  • Property files

      • The emoji property data file ExtendedPictographic.txt has been removed from CLDR data, since the contents are now part of the UTS #51 “Unicode Emoji” data file: emoji-data.txt.

      • labels.txt was added for emoji categories and subcategories.

  • Code Updates

      • Addition of new currency code MRU for Mauritania; replaces MRO.

      • Updating of currency display names and narrow symbol for São Tomé & Príncipe Dobra (use standard names for STN, names showing older year range for STD).

      • Subdivisions (including all new codes for China).

      • Update timezone mappings for tzdata 2018c.

  • Bug fixes

For information on structural changes, see Spec Modifications.

For changes that may affect migration to this version, see Migration.

Charts

The charts have been updated for the v33 data. The Delta Data will show a number of changes in annotations that are due to the elimination of redundant keywords: see Growth.

New tab-separated-value (TSV) files will also be added to CLDR 33, so that the information can be loaded into spreadsheets rather than scraped from the charts. Currently these files cover only a subset of the charts:

    1. by_type.tsv

    2. delta.tsv: locale data, with inheritance

    3. delta_supp.tsv: supplemental (non-locale) data

    4. delta_summary.tsv: statistics on items 2 and 3
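As a rough illustration of working with such files outside a spreadsheet, the following sketch parses tab-separated text with Python's standard csv module. The column names and the sample row are hypothetical; the actual layout of by_type.tsv and the delta files is not described here and may differ.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text into a list of rows (lists of strings)."""
    # QUOTE_NONE keeps any quotation marks in the data as literal characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip completely empty lines.
    return [row for row in reader if row]


# Hypothetical sample content; the real columns of delta.tsv are an assumption.
sample = "locale\tpath\told\tnew\nmk\t//ldml/example\ta\tb\n"
rows = read_tsv(sample)
header, records = rows[0], rows[1:]
```

For a file on disk, the same reader can be given a file object opened with newline="" and encoding="utf-8" in place of the io.StringIO wrapper.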

Survey Tool

  • When collecting data for emoji names and annotations, the Survey Tool now has the capability to display its own images for emoji that may not yet be displayable on the user’s system.

Other data additions and changes

Some of the fixes and additions include:

  • Locale data:

      • Added English name for sr_ME, “Montenegrin”.

      • The cardinal (plural) rules for Macedonian (mk) have been changed so that one➞other for {11}.

      • New seed locale for scn (Sicilian), with plural rules.

      • Added exemplar characters for ha_NE (distinct from ha), nv (Navajo), cho (Choctaw).

  • Supplemental data:

      • Adjusted the territory containment data for some regions near the South Pole, following changes in UN M49, so several of these now have new containing regions.

      • Updated the <territoryInfo> GDP data for various regions.

For more information on these and other bug fixes, see the detailed delta charts and the list of bug fixes.

Growth

Because v33 was not a data submission release, the chart for growth differs little from that of the CLDR 32 Release Note. Here are the overall statistics:

The following files showed the largest number of raw changes:

  • annotations/as.xml, main/as.xml, annotations/ru.xml, main/br.xml, annotations/or.xml, annotations/br.xml, annotations/ga.xml

Two changes affected the statistics:

  • The keywords (in annotations) are being treated as sets for counting purposes.

    • So old:{a | b | c} → new:{a | c | d | e} counts as one deletion and 2 additions.

  • The keywords have also had some redundancies removed: if a keyword consisted entirely of other keywords, it was removed.

    • So old:{a, a b, b} → new:{a, b}.
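The two counting changes above can be sketched as follows. This is a minimal illustration only, not the code used to produce the charts; the function names are hypothetical, and it assumes that a keyword "consists of other keywords" when every space-separated word in it is itself another keyword in the same set.

```python
def count_changes(old, new):
    """Count deletions and additions, treating keyword lists as sets."""
    old_set, new_set = set(old), set(new)
    deletions = len(old_set - new_set)
    additions = len(new_set - old_set)
    return deletions, additions


def remove_redundant(keywords):
    """Drop keywords made up entirely of other keywords in the same list."""
    all_keywords = set(keywords)
    kept = []
    for keyword in keywords:
        others = all_keywords - {keyword}
        words = keyword.split()
        # Assumption: redundancy is judged word by word, split on whitespace.
        if words and all(word in others for word in words):
            continue
        kept.append(keyword)
    return kept


changes = count_changes(["a", "b", "c"], ["a", "c", "d", "e"])
cleaned = remove_redundant(["a", "a b", "b"])
```

With the examples from the text, changes is (1, 2), that is, one deletion and two additions, and cleaned is ["a", "b"].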

Migration

  • Plurals: ordinal and cardinal rules have been added for scn. The cardinal (plural) rules for Macedonian (mk) have been changed so that one➞other for {11}. These changes should not cause migration issues.

  • The emoji property data file ExtendedPictographic.txt has been removed from CLDR data, since the contents are now part of the UTS #51 “Unicode Emoji” data file: emoji-data.txt.

  • Adjusted the territory containment data for some regions near the South Pole, following changes in UN M49, so several of these now have new containing regions.
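To make the Macedonian plural change concrete, here is a small sketch for non-negative integers only. The modulo conditions are an assumption written in the style of LDML plural rules; this note states only that 11 moves from "one" to "other", so consult the plural rules chart for the authoritative rule, including its treatment of fractions.

```python
def mk_cardinal_before(n: int) -> str:
    """Assumed earlier behavior: integers ending in 1 select 'one'."""
    return "one" if n % 10 == 1 else "other"


def mk_cardinal_after(n: int) -> str:
    """Assumed new behavior: as before, except that 11 selects 'other'."""
    # The "n % 100 != 11" condition is an assumption about how 11 is excluded.
    return "one" if n % 10 == 1 and n % 100 != 11 else "other"


before_11 = mk_cardinal_before(11)
after_11 = mk_cardinal_after(11)
```

Code that selects Macedonian messages by plural category needs no structural change, since both categories already existed; only the category chosen for the affected number differs.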

Known Issues

    1. New macroregions

      • UN M.49 now includes Sark (680) but ISO rejected the proposed ISO 3166-1 code, so it is not included.

    2. “Week of” structure

      • The structure and intended usage for the “week x of y” patterns is still being refined and may change. This applies especially to dateFormatItems such as the following:

      • <dateFormatItem id="MMMMW" count=...>'week' W 'of' MMM</dateFormatItem>

      • <dateFormatItem id="yw" count=...>'week' w 'of' y</dateFormatItem>

      • Areas of discussion include the use of the count attribute and the use of ordinal vs. cardinal numbers. For more information see [#9801].

    3. Subdivision Names

      • The draft subdivision names were imported from wikidata. Names that had characters outside of the language's exemplars were excluded for now. Names that would cause collisions were allowed, but marked with superscripted numbers. The goal is to clean up these names over time.

    4. Chinese stroke collation

      • In CLDR 30 and 31, Chinese stroke collation was missing entries for several basic characters. CLDR 32 reverted the stroke collation data to the CLDR 29 version; a complete fix for the underlying problem is targeted for CLDR 34. See #10497, #10642.
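For readers unfamiliar with the “week x of y” patterns mentioned under Known Issues, the sketch below splits such a pattern into literal text and date fields. It is a simplified illustration, not a full LDML pattern parser: it assumes that text inside single quotes is literal, that two adjacent single quotes stand for one literal quote, and that a run of the same ASCII letter is a single field. The function name is hypothetical.

```python
def tokenize(pattern):
    """Split a date pattern into ("literal", text) and ("field", letters) tokens."""
    tokens = []
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "'":
            if i + 1 < n and pattern[i + 1] == "'":
                # Two quotes outside a quoted run: one literal quote.
                tokens.append(("literal", "'"))
                i += 2
                continue
            # Quoted literal: read up to the closing quote.
            j, buf = i + 1, []
            while j < n:
                if pattern[j] == "'":
                    if j + 1 < n and pattern[j + 1] == "'":
                        buf.append("'")  # doubled quote inside a quoted run
                        j += 2
                        continue
                    break
                buf.append(pattern[j])
                j += 1
            tokens.append(("literal", "".join(buf)))
            i = j + 1
        elif ch.isascii() and ch.isalpha():
            # A run of the same letter forms one field, e.g. "MMM".
            j = i
            while j < n and pattern[j] == ch:
                j += 1
            tokens.append(("field", pattern[i:j]))
            i = j
        else:
            # Spaces and punctuation are literal text.
            tokens.append(("literal", ch))
            i += 1
    return tokens


month_week = tokenize("'week' W 'of' MMM")
year_week = tokenize("'week' w 'of' y")
month_week_fields = [text for kind, text in month_week if kind == "field"]
year_week_fields = [text for kind, text in year_week if kind == "field"]
```

For the two patterns shown in the “Week of” item, the fields are W and MMM in the first and w and y in the second; everything else is literal text that a translator would adapt.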

Acknowledgments

Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing.

Key to Header Links

The Unicode Terms of Use apply to CLDR data; in particular, see Exhibit 1.

For web pages with different views of CLDR data, see http://cldr.unicode.org/index/charts.