CLDR 33.1

Overview

Unicode CLDR 33.1 is an update to CLDR 33 that focuses on Unicode 11.0 support. Improvements in this release include:

  • Data

      • Updates to Unicode 11.0

      • Adds annotations (names and keywords) for Unicode 11.0 emoji, and improves existing annotations.

      • Updates Chinese collation stroke order from Unicode 7.0 to Unicode 11.0, after tooling bug fixes

  • Structure*

      • No changes. The DTD Δs and DTD Diffs links above point to v33.

  • Specification*

      • There is no LDML 33.1 document. Instead, only amendments to v33 are provided, as described below in Specification Amendments.

  • Charts*

For more details, see the list of bug fixes.

Specification Amendments

There is no new version of the LDML specification. Instead, the following are amendments to LDML 33. The changed text is indicated below by green highlighting.

14.1 Synthesizing Sequence Names

  1. If sequence is an emoji flag sequence, look up the territory name in CLDR for the corresponding ASCII characters. Set suffixName to that, and prefixName to the characterLabel for "flag", and go to step 10.

    • For example, "🇵🇫" has the regional indicator symbols PF and would map to “Flagge: Französisch-Polynesien” in German.

  2. If sequence is an emoji tag sequence, look up the subdivision name in CLDR for the corresponding ASCII characters and compose as for emoji flag sequence.

    • For example, "🏴󠁧󠁢󠁳󠁣󠁴󠁿" has TAG characters gbsct and would map to “Flagge: Schottland” in German.

  3. If sequence is a keycap sequence or 🔟, use the characterLabel for "keycap" as the prefixName and set the suffix to be the ASCII characters in the sequence (or "10" in the case of 🔟), then go to step 8.

    • For example, "#⃣" would map to "Taste: #" in German.

  4. If sequence contains any emoji modifiers or hair components, move them (in order) into suffix, removing them from sequence.

    • For example, "👨🏿‍🦰" would map to "Mann: dunkle Hautfarbe, rotes Haar".

  5. Transform sequence and append to prefixName, by successively getting names for the longest subsequences, skipping any singleton ZWJ characters. If there is more than one name, use the listPatterns for "unit-short" to link them. This uses the patterns for "2", "start", "middle", and "end".

The /annotationsDerived/ folder has the available composed names, pre-built.
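
The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the CLDR or ICU implementation: the lookup tables (TERRITORY, SUBDIVISION, LABEL, NAMES, LIST_PATTERNS) are tiny hypothetical stand-ins for data that would normally be read from the CLDR locale files, the list pattern values are assumed rather than copied from CLDR, and joining prefixName and suffixName as "prefix: suffix" is inferred from the German examples above.

```python
# Hypothetical sample data; real values come from the CLDR locale files.
TERRITORY = {"PF": "Französisch-Polynesien"}
SUBDIVISION = {"gbsct": "Schottland"}
LABEL = {"flag": "Flagge", "keycap": "Taste"}
NAMES = {
    "\U0001F468": "Mann",              # man
    "\U0001F3FF": "dunkle Hautfarbe",  # dark skin tone modifier
    "\U0001F9B0": "rotes Haar",        # red hair component
}
# Assumed "unit-short" list patterns (placeholder values).
LIST_PATTERNS = {"2": "{0}, {1}", "start": "{0}, {1}",
                 "middle": "{0}, {1}", "end": "{0}, {1}"}

ZWJ, KEYCAP, KEYCAP_TEN = "\u200D", "\u20E3", "\U0001F51F"
BLACK_FLAG, CANCEL_TAG = "\U0001F3F4", "\U000E007F"

def is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def is_tag(ch):
    return 0xE0020 <= ord(ch) <= 0xE007E

def is_modifier_or_hair(ch):
    cp = ord(ch)
    return 0x1F3FB <= cp <= 0x1F3FF or 0x1F9B0 <= cp <= 0x1F9B3

def join_names(names, patterns=LIST_PATTERNS):
    """Link names using the "2", "start", "middle", and "end" patterns."""
    if len(names) < 2:
        return names[0] if names else ""
    if len(names) == 2:
        return patterns["2"].format(names[0], names[1])
    result = patterns["end"].format(names[-2], names[-1])
    for name in reversed(names[1:-2]):
        result = patterns["middle"].format(name, result)
    return patterns["start"].format(names[0], result)

def names_for(seq):
    """Names of the longest known subsequences, skipping singleton ZWJs."""
    out, i = [], 0
    while i < len(seq):
        if seq[i] == ZWJ:
            i += 1
            continue
        for j in range(len(seq), i, -1):
            if seq[i:j] in NAMES:
                out.append(NAMES[seq[i:j]])
                i = j
                break
        else:
            i += 1  # simplification: silently skip characters with no name
    return out

def synthesize_name(seq):
    if seq and all(is_regional_indicator(c) for c in seq):
        # Flag sequence: regional indicators -> ASCII territory code.
        code = "".join(chr(ord(c) - 0x1F1E6 + ord("A")) for c in seq)
        prefix, suffix = LABEL["flag"], TERRITORY[code]
    elif seq.startswith(BLACK_FLAG) and seq.endswith(CANCEL_TAG):
        # Tag sequence: TAG characters -> ASCII subdivision code.
        code = "".join(chr(ord(c) - 0xE0000) for c in seq if is_tag(c))
        prefix, suffix = LABEL["flag"], SUBDIVISION[code]
    elif seq == KEYCAP_TEN or seq.endswith(KEYCAP):
        prefix = LABEL["keycap"]
        suffix = "10" if seq == KEYCAP_TEN else "".join(
            c for c in seq if c.isascii())
    else:
        # Move modifiers and hair components (in order) into the suffix.
        moved = "".join(c for c in seq if is_modifier_or_hair(c))
        rest = "".join(c for c in seq if not is_modifier_or_hair(c))
        prefix = join_names(names_for(rest))
        suffix = join_names(names_for(moved))
    return f"{prefix}: {suffix}" if prefix and suffix else prefix or suffix

print(synthesize_name("\U0001F1F5\U0001F1EB"))
print(synthesize_name("\U0001F468\U0001F3FF\u200D\U0001F9B0"))
```

With the sample tables, the two calls reproduce the German flag and red-hair examples given in the steps. A production implementation would more likely read the pre-built names from /annotationsDerived/ than recompute them.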

Migration

  • Updates the German AM/PM strings to follow the English ones, which matches the most common expectations of users of 12-hour formats.

Known Issues

  1. Some of the main CLDR locales are missing a few Unicode 11.0 annotations (should be fixed in v34): #11193

  2. The segmentation rules have not been updated for changes in Unicode 11.0. This does not affect ICU, since the rules there were changed manually. Implementers who use the segmentation rules independently of ICU may wish to patch their v33.1 versions with the data in #11203. The changes include simplifying the break rules for emoji and not breaking within strings of white space.

  3. The ICU4J libraries included in v33.1 were not updated to ICU 62.0. There are no known problems with using ICU 61.0, but implementers may want to update their copies to ICU 62.0.

Acknowledgments

Many people have made significant contributions to CLDR and LDML; see the Acknowledgments page for a full listing. The contributors to v33.1 will appear on the page later, when v34 is released.

Key to Header Links

The Unicode Terms of Use apply to CLDR data; in particular, see Exhibit 1.

For web pages with different views of CLDR data, see http://cldr.unicode.org/index/charts.