BCP47 Syntax Mapping

In the current LDML specification, a Unicode Locale Identifier is composed of a Unicode Language Identifier plus optional locale extensions. A Unicode Language Identifier is fully compatible with a BCP47 language tag, but the syntax of locale extensions (“@” key “=” type (“;” key “=” type)* ) is not. LDML attempts to define a systematic mapping, but the current definition may truncate a key or type value to 8 characters (and/or remove “-” from some type values) because of the syntax restrictions on BCP47 language subtags. The current definition relies on the BCP47 private use mechanism, but we want to make locale extensions formal (by writing a new RFC to reserve a singleton letter for this usage), so that we avoid conflicts with other private use values and allow software developers to write a parser for Unicode locale extensions with confidence.

BCP 47 is undergoing a revision, which should be completed soon.

Once we define a formal representation of Unicode locale extensions in BCP47 syntax, we no longer have any good reason to use the @key1=type1;key2=type2… syntax for Unicode Locale Identifiers other than backward compatibility. This document proposes that we retire the proprietary syntax and migrate to the new syntax, which is fully supported by BCP47 language tags.

There are several options for representing keyword key/type pairs in BCP47 syntax. The examples in the following proposal assume that the letter “u” is reserved for Unicode locale extensions; however, any of the possible extension singletons could be used: [0-9 a-w y z].

The table below shows the locale extension keys/values currently defined by the LDML specification.

Key/Type Definitions

key type Description
collation standard The default ordering for each language. For root it is [ UCA ] order; for each other locale it is the same as UCA ordering except for appropriate modifications to certain characters for that language. The following are additional choices for certain locales; they only have effect in those locales.
  phonebook For a phonebook-style ordering (used in German).
  pinyin Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin. (used in Chinese)
  traditional For a traditional-style sort (as in Spanish)
  stroke Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese)
  direct Hindi variant
  posix A “C”-based locale. (no longer in CLDR data)
  big5han Pinyin ordering for Latin, big5 charset ordering for CJK characters. (used in Chinese)
  gb2312han Pinyin ordering for Latin, gb2312han charset ordering for CJK characters. (used in Chinese)
  unihan Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters. (used in Chinese)
calendar gregorian (default)
  islamic (alias: arabic) Astronomical Arabic
  chinese Traditional Chinese calendar
  islamic-civil (alias: civil-arabic) Civil (algorithmic) Arabic calendar
  hebrew Traditional Hebrew Calendar
  japanese Imperial Calendar (same as Gregorian except for the year, with one era for each Emperor)
  buddhist (alias: thai-buddhist) Thai Buddhist Calendar (same as Gregorian except for the year)
  persian Persian Calendar
  coptic Coptic Calendar
  ethiopic Ethiopic Calendar
(For information on the calendar algorithms associated with the data used with the above types, see [ Calendars ].)
collation parameters:

  colStrength
  colAlternate
  colBackwards
  colNormalization
  colCaseLevel
  colCaseFirst
  colHiraganaQuaternary
  colNumeric
  variableTop
Associated values as defined in: 5.14.1 <collation>. Semantics as defined in: 5.14.1 <collation>.
currency (also known as a Unicode currency code) ISO 4217 code, plus others in common use Currency value identified by ISO 4217 code, plus others in common use. Also uses XXX as Unknown or Invalid Currency. See Appendix K: Valid Attribute Values and also [ Data Formats ].
time zone (also known as a Unicode time zone code) TZID, plus the value Etc/Unknown Identification for time zone according to the TZ Database, plus the value Etc/Unknown. Unicode LDML supports all of the time zone IDs by mapping all equivalent time zone IDs to a canonical ID for translation. This canonical time zone ID is not the same as the zone.tab time zone ID found in [ Olson ]. For more information, see Section 5.9.2 Time Zone Names, Appendix F: Date Format Patterns, and Appendix J: Time Zone Display Names.

Collation Parameters

Attribute Options Basic Example XML Example Description
strength primary (1), secondary (2), tertiary (3), quaternary (4), identical (5) [strength 1] strength = “primary” Sets the default strength for comparison, as described in the UCA.
alternate non-ignorable, shifted [alternate non-ignorable] alternate = “non-ignorable” Sets alternate handling for variable weights, as described in the UCA.
backwards on, off [backwards 2] backwards = “on” Sets the comparison for the second level to be backwards (“French”), as described in the UCA.
normalization on, off [normalization on] normalization = “off” If on, then the normal UCA algorithm is used. If off, then all strings that are in [ FCD ] will sort correctly, but others will not necessarily sort correctly. So it should only be set to off if the strings to be compared are in FCD.
caseLevel on, off [caseLevel on] caseLevel = “off” If set to on, a level consisting only of case characteristics will be inserted in front of the tertiary level. To ignore accents but take case into account, set strength to primary and case level to on.
caseFirst upper, lower, off [caseFirst off] caseFirst = “off” If set to upper, causes upper case to sort before lower case. If set to lower, lower case will sort before upper case. Useful for locales that have already supported ordering but require a different order of cases. Affects case and tertiary levels.
hiraganaQuaternary on, off [hiraganaQ on] hiraganaQuaternary = “on” Controls special treatment of Hiragana code points on the quaternary level. If turned on, Hiragana code points will get lower values than all the other non-variable code points. The strength must be greater than or equal to quaternary for this attribute to take effect.
numeric on, off [numeric on] numeric = “on” If set to on, any sequence of Decimal Digits (General_Category = Nd in the [ UCD ]) is sorted at a primary level with its numeric value. For example, “A-21” < “A-123”.
variableTop uXXuYYYY & \u00XX\uYYYY < [variable top] variableTop = “uXXuYYYY” The parameter value is an encoded Unicode string, with code points in hex, leading zeros removed, and ‘u’ inserted between successive elements. Sets the default value for the variable top. All the code points with primary strengths less than variable top will be considered variable, and thus affected by the alternate handling.
match-boundaries none, whole-character, whole-word n/a match-boundaries = “whole-word” The meaning is according to the descriptions in UTS #10 Searching.
match-style minimal, medial, maximal n/a match-style = “medial” The meaning is according to the descriptions in UTS #10 Searching.

1. Proposed BCP47 subtag syntax

This document proposes the syntax described by the BNF below.

locale-extensions = locale-singleton "-" extension *("-" extension)

extension = key "-" type

locale-singleton = "u"

key = 2alphanum

type = 3*8alphanum

alphanum = (ALPHA / DIGIT)

Example:

en-US-u-ca-islamicc-co-phonebk

This corresponds to the former syntax:

en-US@calendar=islamic-civil;collation=phonebook
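As an illustration of the grammar above, the following is a minimal Python sketch of a parser for the proposed extension syntax. The function name parse_locale_extensions is hypothetical, the singleton is assumed to be “u” as in the examples, and the sketch assumes that everything after the singleton consists of key/type pairs (it does not handle other BCP47 extensions or private use subtags that might follow).

```python
import re

# Grammar from the proposal: key = 2alphanum, type = 3*8alphanum.
KEY_RE = re.compile(r"[A-Za-z0-9]{2}")
TYPE_RE = re.compile(r"[A-Za-z0-9]{3,8}")
SINGLETON = "u"  # assumed singleton letter, as in the proposal's examples


def parse_locale_extensions(tag):
    """Split a tag into (language_part, {key: type}) per the proposed BNF.

    Hypothetical helper: assumes every subtag after the singleton belongs
    to the Unicode locale extension.
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if SINGLETON not in lowered:
        return tag, {}
    idx = lowered.index(SINGLETON)
    rest = lowered[idx + 1:]
    # The BNF requires at least one extension, each being key "-" type.
    if not rest or len(rest) % 2 != 0:
        raise ValueError("expected one or more key-type pairs after the singleton")
    extensions = {}
    for key, type_ in zip(rest[0::2], rest[1::2]):
        if not KEY_RE.fullmatch(key):
            raise ValueError(f"invalid key: {key!r}")
        if not TYPE_RE.fullmatch(type_):
            raise ValueError(f"invalid type: {type_!r}")
        extensions[key] = type_
    return "-".join(subtags[:idx]), extensions


print(parse_locale_extensions("en-US-u-ca-islamicc-co-phonebk"))
```

Because keys are always 2 characters and types are 3 to 8 characters, a parser can tell them apart by length alone, which is what the two regular expressions rely on.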

Current Proposed
collation co
calendar ca
currency cu
numbers nu
time zone tz
colStrength ks
colAlternate ka
colBackwards kb
colNormalization kk
colCaseLevel kc
colCaseFirst kf
colHiraganaQuaternary kh
colNumeric kn
variableTop kv

2. Keys

Key names, and only key names, are always exactly 2 characters long, and types (values) are always longer than 2 characters. This proposal defines new canonical key names (see the Current/Proposed key table above).

The motivation is to reduce string size and to ensure that keys and values do not overlap syntactically.

3. Types

3.1 Collation

3.1.1 Collation (co) types

Current Proposed
big5han big5han
digits-after digitaft
direct direct
gb2312han gb2312
phonebook phonebk
pinyin pinyin
reformed reformed
standard standard
stroke stroke
traditional trad

3.1.2 Collation Strength (ks) types

Current Proposed
primary level1
secondary level2
tertiary level3
quaternary level4
identical identic

3.1.3 Collation Alternate (ka) types

Current Proposed
non-ignorable noignore
secondary level2
shifted shifted

3.1.4 Collation Backwards (kb) / Normalization (kk) / Case Level (kc) / Hiragana Quaternary (kh) / Numeric (kn) types

Current Proposed
yes true
no false

3.1.5 Collation Case First (kf) types

Current Proposed
upper upper
lower lower
no false

3.1.6 Collation Variable Top (kv) type

The variable top parameter is specified as an encoded Unicode string in the format uXXuYYYY (code points in hex). No changes are required.
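The uXXuYYYY encoding can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions rather than statements of the specification: each code point is prefixed with “u” (following the uXXuYYYY pattern), and hex digits are emitted in upper case.

```python
def encode_variable_top(text):
    """Encode a string as 'u' + hex code point per character, no leading zeros.

    Assumption: upper-case hex digits; the proposal does not state a case.
    """
    return "".join("u" + format(ord(ch), "X") for ch in text)


def decode_variable_top(value):
    """Reverse of encode_variable_top; accepts either case."""
    head, *hex_parts = value.lower().split("u")
    if head or not hex_parts:
        raise ValueError(f"not in uXXuYYYY form: {value!r}")
    return "".join(chr(int(h, 16)) for h in hex_parts)


encoded = encode_variable_top("\u00e0\u4e00")
print(encoded)
print(decode_variable_top(encoded) == "\u00e0\u4e00")
```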

3.2 Calendar (ca)

Current Proposed
buddhist buddhist
coptic coptic
ethiopic ethiopic
ethiopic-amete-alem ethiopaa
chinese chinese
gregorian gregory
hebrew hebrew
indian indian
islamic islamic
islamic-civil islamicc
japanese japanese
persian persian
roc roc
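Using the key table and the collation and calendar type tables above, a conversion from the former @key=type syntax to the proposed syntax could look like the sketch below. The function name to_bcp47 and the dictionaries are hypothetical, only a subset of keys and types is included, and types that are unchanged by the proposal are passed through as-is.

```python
# Key names from the proposal's Current/Proposed key table (subset).
KEY_MAP = {"collation": "co", "calendar": "ca", "currency": "cu", "numbers": "nu"}

# Renamed types from the collation (3.1.1) and calendar (3.2) tables (subset).
TYPE_MAP = {
    "co": {
        "digits-after": "digitaft",
        "gb2312han": "gb2312",
        "phonebook": "phonebk",
        "traditional": "trad",
    },
    "ca": {
        "ethiopic-amete-alem": "ethiopaa",
        "gregorian": "gregory",
        "islamic-civil": "islamicc",
    },
}


def to_bcp47(locale_id, singleton="u"):
    """Convert 'lang@key=type;key=type' to 'lang-u-kk-type-kk-type'.

    Hypothetical helper: raises KeyError for keys outside this sketch's subset.
    """
    base, sep, keywords = locale_id.partition("@")
    if not sep or not keywords:
        return base
    parts = []
    for pair in keywords.split(";"):
        key, _, value = pair.partition("=")
        short_key = KEY_MAP[key]
        # Types not listed in TYPE_MAP are unchanged by the proposal.
        short_type = TYPE_MAP.get(short_key, {}).get(value, value)
        parts += [short_key, short_type]
    return "-".join([base, singleton] + parts)


print(to_bcp47("en-US@calendar=islamic-civil;collation=phonebook"))
```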

3.3 Currency (cu) types

An ISO 4217 code (3-letter alphabetic) is used for currency. No changes are required.

3.4 Number System (nu) types

The current CVS snapshot implementation uses CSS3 names. This proposal changes all type names to script codes, with one exception (arabext).

Current (CVS snapshot) Proposed
arabic-indic arab
bengali beng
cambodian khmr
decimal latn
devanagari deva
gujarati gujr
gurmukhi guru
hebrew hebr
kannada knda
lao laoo
malayalam mlym
mongolian mong
myanmar mymr
oriya orya
persian arabext
telugu telu
thai thai

3.5 Time Zone (tz) types

CLDR uses Olson tzids. These IDs are usually of the form <continent>+”/”+<exemplar city> and are relatively long. To satisfy the syntax requirement discussed in this document, we need to map these IDs uniquely to relatively short IDs. The UN LOCODE is designed to assign unique location codes, and it satisfies most of the requirements. A LOCODE consists of a 2-letter ISO country code and a 3-letter location code. This proposal suggests that the 5-letter LOCODE be used as the short time zone ID if the exemplar city has an exact match in the LOCODE repertoire. Some Olson tzids do not have a direct mapping in LOCODE. In this case, we assign our own codes to them, using 3-4 or 6-8 letter codes to distinguish them from LOCODEs. For the Olson tzids Etc/GMT*, this proposal suggests “UTC” + [“E” | “W”] + nn (hour offset); for example, UTCE01 means 1 hour east of UTC (Etc/GMT-1). The proposed short ID list is attached to this document.
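The Etc/GMT* rule can be sketched in Python as follows. The function name etc_gmt_short_id is hypothetical. The sketch covers only IDs with an explicit nonzero offset; how plain Etc/GMT or a zero offset should be represented is not stated above, so those inputs are rejected here.

```python
import re

_ETC_GMT_RE = re.compile(r"Etc/GMT([+-])(\d{1,2})")


def etc_gmt_short_id(tzid):
    """Map an Olson 'Etc/GMT±n' ID to 'UTC' + E/W + two-digit hour offset."""
    match = _ETC_GMT_RE.fullmatch(tzid)
    if match is None:
        raise ValueError(f"not an Etc/GMT offset ID: {tzid!r}")
    sign, hours = match.group(1), int(match.group(2))
    if hours == 0:
        # Assumption: zero offsets are out of scope for this sketch.
        raise ValueError("zero offset is not covered by this sketch")
    # Olson Etc/GMT IDs use inverted signs: "Etc/GMT-1" is 1 hour EAST of UTC.
    direction = "E" if sign == "-" else "W"
    return f"UTC{direction}{hours:02d}"


print(etc_gmt_short_id("Etc/GMT-1"))
```

The resulting IDs are 6 characters long, so they fall in the 6-8 letter range and cannot collide with a 5-letter LOCODE.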