BCP47 Syntax Mapping

In the current LDML specification, a Unicode Locale Identifier is composed of a Unicode Language Identifier plus optional locale extensions.  A Unicode Language Identifier is fully compatible with a BCP47 language tag, but the syntax of locale extensions ("@" key "=" type (";" key "=" type)* ) is not.  LDML attempts to define a systematic mapping, but the current definition may truncate a key or type value to 8 characters (and/or remove "-" from some type values) because of the syntax restriction on BCP47 language subtags.  The current definition also relies on BCP47 private use features.  We want to make locale extensions formal by writing a new RFC that reserves a singleton letter for this purpose, so that we avoid conflicts with other private use values and allow software developers to write a parser for Unicode locale extensions with confidence.

BCP 47 is undergoing a revision, which should be completed soon.
Once we define a formal representation of Unicode locale extensions in BCP47 syntax, we no longer have any good reason to use the @key1=type1;key2=type2... syntax for Unicode Locale Identifiers other than backward compatibility.  This document proposes that we retire the proprietary syntax and migrate fully to the new syntax, which is fully supported by BCP47 language tags.

There are several options for representing keyword key/type pairs in BCP47 syntax.  The examples in the following proposal assume that the letter "u" is reserved for the Unicode locale extensions; however, any of the possible extension singletons could be used: [0-9 a-w y z].

The table below shows the locale extension keys/values currently defined by the LDML specification.

Key/Type Definitions


key type Description
collation standard The default ordering for each language. For root it is [UCA] order; for each other locale it is the same as UCA ordering except for appropriate modifications to certain characters for that language. The following are additional choices for certain locales; they only have effect in those locales.
phonebook For a phonebook-style ordering (used in German).
pinyin Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin. (used in Chinese)
traditional For a traditional-style sort (as in Spanish)
stroke Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese)
direct Hindi variant
posix A "C"-based locale.  (no longer in CLDR data)
big5han Pinyin ordering for Latin, big5 charset ordering for CJK characters. (used in Chinese)
gb2312han Pinyin ordering for Latin, gb2312han charset ordering for CJK characters. (used in Chinese)
unihan Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters. (used in Chinese)
calendar  gregorian  (default)
islamic (alias: arabic)  Astronomical Arabic
chinese  Traditional Chinese calendar
islamic-civil (alias: civil-arabic)  Civil (algorithmic) Arabic calendar
hebrew  Traditional Hebrew Calendar
japanese  Imperial Calendar (same as Gregorian except for the year, with one era for each Emperor)
buddhist (alias: thai-buddhist)  Thai Buddhist Calendar (same as Gregorian except for the year)
persian  Persian Calendar
coptic  Coptic Calendar
ethiopic  Ethiopic Calendar
(For information on the calendar algorithms associated with the data used with the above types, see [Calendars].)
collation parameters: colStrength, colAlternate, colBackwards, colNormalization, colCaseLevel, colCaseFirst, colHiraganaQuaternary, colNumeric, variableTop  Associated values as defined in: 5.14.1 <collation>  Semantics as defined in: 5.14.1 <collation>
currency (also known as a Unicode currency code)  ISO 4217 code, plus others in common use  Currency value identified by ISO 4217 code, plus others in common use. Also uses XXX as Unknown or Invalid Currency. See Appendix K: Valid Attribute Values and also [Data Formats].

time zone (also known as a Unicode time zone code)  TZID, plus the value Etc/Unknown  Identification for time zone according to the TZ Database, plus the value Etc/Unknown.

Unicode LDML supports all of the time zone IDs by mapping all equivalent time zone IDs to a canonical ID for translation. This canonical time zone ID is not the same as the zone.tab time zone ID found in [Olson].

For more information, see Section 5.9.2 Time Zone Names, Appendix F: Date Format Patterns, and Appendix J: Time Zone Display Names.



Collation Parameters


Attribute  Options  Basic Example  XML Example  Description
strength  primary (1), secondary (2), tertiary (3), quaternary (4), identical (5)  [strength 1]  strength = "primary"  Sets the default strength for comparison, as described in the UCA.
alternate  non-ignorable, shifted  [alternate non-ignorable]  alternate = "non-ignorable"  Sets alternate handling for variable weights, as described in the UCA.
backwards  on, off  [backwards 2]  backwards = "on"  Sets the comparison for the second level to be backwards ("French"), as described in the UCA.
normalization  on, off  [normalization on]  normalization = "off"  If on, then the normal UCA algorithm is used. If off, then all strings that are in [FCD] will sort correctly, but others will not necessarily sort correctly. So it should only be set off if the strings to be compared are in FCD.
caseLevel  on, off  [caseLevel on]  caseLevel = "off"  If set to on, a level consisting only of case characteristics will be inserted in front of the tertiary level. To ignore accents but take case into account, set strength to primary and case level to on.
caseFirst  upper, lower, off  [caseFirst off]  caseFirst = "off"  If set to upper, causes upper case to sort before lower case. If set to lower, lower case will sort before upper case. Useful for locales that have already supported ordering but require different order of cases. Affects case and tertiary levels.
hiraganaQuaternary  on, off  [hiraganaQ on]  hiraganaQuaternary = "on"  Controls special treatment of Hiragana code points on the quaternary level. If turned on, Hiragana code points will get lower values than all the other non-variable code points. The strength must be greater than or equal to quaternary for this attribute to take effect.
numeric  on, off  [numeric on]  numeric = "on"  If set to on, any sequence of Decimal Digits (General_Category = Nd in the [UCD]) is sorted at a primary level with its numeric value. For example, "A-21" < "A-123".
variableTop  uXXuYYYY  & \u00XX\uYYYY < [variable top]  variableTop = "uXXuYYYY"  The parameter value is an encoded Unicode string, with code points in hex, leading zeros removed, and 'u' inserted between successive elements. Sets the default value for the variable top. All the code points with primary strengths less than variable top will be considered variable, and thus affected by the alternate handling.
match-boundaries  none, whole-character, whole-word  n/a  match-boundaries = "whole-word"  The meaning is according to the descriptions in UTS #10 Searching.
match-style  minimal, medial, maximal  n/a  match-style = "medial"  The meaning is according to the descriptions in UTS #10 Searching.

1. Proposed BCP47 subtag syntax


This document proposes the syntax described by the BNF below.

locale-extensions   = locale-singleton "-" extension *("-" extension)
extension           = key "-" type

locale-singleton    = "u"

key                 = 2alphanum
type                = 3*8alphanum

alphanum            = (ALPHA / DIGIT)
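The grammar above can be checked mechanically.  The following is a minimal Python sketch, assuming the singleton is "u" as in this proposal; the names EXTENSIONS_RE and is_valid_locale_extensions are illustrative only and not part of any specification.  It validates only the locale-extensions part, not the language tag that precedes it.

```python
import re

# locale-extensions = locale-singleton "-" extension *("-" extension)
# extension         = key "-" type
# key               = 2alphanum
# type              = 3*8alphanum
EXTENSIONS_RE = re.compile(r"u(?:-[A-Za-z0-9]{2}-[A-Za-z0-9]{3,8})+")


def is_valid_locale_extensions(text: str) -> bool:
    """Return True if text matches the locale-extensions production."""
    return EXTENSIONS_RE.fullmatch(text) is not None
```

Under this sketch, "u-ca-islamicc-co-phonebk" is accepted, while "u-ca-islamic-civil" is rejected because "civil" appears where a 2-character key is expected.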

Example:

en-US-u-ca-islamicc-co-phonebk

This corresponds to the former syntax:

en-US@calendar=islamic-civil;collation=phonebook

2. Keys


Key names, and only key names, are always exactly 2 characters long, and types (values) are always longer than 2 characters.  This proposal defines the new canonical key names below.

 Current  Proposed
 collation   co
 calendar  ca
 currency  cu
 numbers  nu
 time zone  tz
 colStrength ks
 colAlternate ka
 colBackwards kb
 colNormalization kk
 colCaseLevel kc
 colCaseFirst kf
 colHiraganaQuaternary kh
 colNumeric kn
 variableTop kv

The motivation is to reduce string size and to make sure that keys and values don't overlap syntactically.
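Because keys are exactly 2 characters long and types are 3 to 8 characters long, the subtags that follow the singleton can be split into key/type pairs by length alone.  The following Python sketch illustrates the idea.  It assumes the "u" singleton, one type per key, and that nothing else follows the locale extensions in the tag; parse_locale_extensions is a hypothetical helper name.  It lowercases the tag before splitting, which is an assumption based on BCP47 tags being case-insensitive.

```python
def parse_locale_extensions(tag: str) -> dict[str, str]:
    """Split the subtags after the "u" singleton into a {key: type} dict."""
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return {}  # no locale extensions present
    result: dict[str, str] = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if key is None and len(subtag) == 2:
            key = subtag  # a 2-character subtag is a key
        elif key is not None and 3 <= len(subtag) <= 8:
            result[key] = subtag  # a 3-8 character subtag is the key's type
            key = None
        else:
            raise ValueError(f"unexpected subtag {subtag!r}")
    if key is not None:
        raise ValueError(f"key {key!r} has no type")
    return result
```

For example, parse_locale_extensions("en-US-u-ca-islamicc-co-phonebk") yields the pairs ca/islamicc and co/phonebk.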

3. Types


3.1 Collation


3.1.1 Collation (co) types

 Current Proposed
 big5han big5han
 digits-after digitaft
 direct direct
 gb2312han gb2312
 phonebook phonebk
 pinyin pinyin
 reformed reformed
 standard standard
 stroke stroke
 traditional trad

3.1.2 Collation Strength (ks) types

 Current Proposed
 primary level1
 secondary level2
 tertiary level3
 quaternary level4
 identical identic

3.1.3 Collation Alternate (ka) types

 Current Proposed
 non-ignorable noignore
 shifted shifted

3.1.4 Collation Backwards (kb) / Normalization (kk) / Case Level (kc) / Hiragana Quaternary (kh) / Numeric (kn) types

 Current Proposed
 yes true
 no false

3.1.5 Collation Case First (kf) types

 Current Proposed
 upper upper
 lower lower
 no false

3.1.6 Collation Variable Top (kv) type

The variable top parameter is specified by a code point in the format uXXuYYYY.  No changes are required.
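As a rough illustration of the uXXuYYYY format, the sketch below encodes a string following the description in the collation parameter table (code points in hex, leading zeros removed).  It assumes that every code point is prefixed with "u", as the pattern uXXuYYYY suggests, and that hex digits are written in upper case; neither detail is confirmed by this proposal.  The function names are hypothetical.

```python
def encode_variable_top(text: str) -> str:
    """Encode each character as "u" followed by its hex code point."""
    # Assumption: upper-case hex digits, no leading zeros.
    return "".join(f"u{ord(ch):X}" for ch in text)


def decode_variable_top(value: str) -> str:
    """Inverse of encode_variable_top."""
    if not value.startswith("u"):
        raise ValueError("variable top value must start with 'u'")
    # Hex digits never contain "u", so "u" is a safe separator.
    return "".join(chr(int(part, 16)) for part in value[1:].split("u"))
```

Note that the proposed grammar limits a type to 8 characters, so only short values fit without further changes.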

3.2 Calendar (ca)

 Current Proposed
 buddhist buddhist
 coptic coptic
 ethiopic ethiopic
 ethiopic-amete-alem ethiopaa
 chinese chinese
 gregorian gregory
 hebrew hebrew
 indian indian
 islamic islamic
 islamic-civil islamicc
 japanese japanese
 persian persian
 roc roc
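With the key names from section 2 and the collation and calendar types above, an identifier in the former keyword syntax can be rewritten mechanically.  The following Python sketch shows one way to do this.  It covers only the collation and calendar keys, lists only the types whose names change, and passes all other types through unchanged; KEY_MAP, TYPE_MAP and to_bcp47 are illustrative names, not part of this proposal.

```python
# Old key name -> proposed 2-letter key (subset of the table in section 2).
KEY_MAP = {"collation": "co", "calendar": "ca"}

# Old type -> proposed type, listing only the types that change.
TYPE_MAP = {
    "co": {
        "digits-after": "digitaft",
        "gb2312han": "gb2312",
        "phonebook": "phonebk",
        "traditional": "trad",
    },
    "ca": {
        "ethiopic-amete-alem": "ethiopaa",
        "gregorian": "gregory",
        "islamic-civil": "islamicc",
    },
}


def to_bcp47(locale_id: str) -> str:
    """Convert "lang@key=type;key=type" to the proposed "lang-u-key-type" form."""
    base, _, keywords = locale_id.partition("@")
    if not keywords:
        return base
    subtags = [base, "u"]
    for pair in keywords.split(";"):
        old_key, _, old_type = pair.partition("=")
        key = KEY_MAP[old_key]  # raises KeyError for keys outside this sketch
        subtags += [key, TYPE_MAP[key].get(old_type, old_type)]
    return "-".join(subtags)
```

Applied to the example from section 1, to_bcp47("en-US@calendar=islamic-civil;collation=phonebook") produces "en-US-u-ca-islamicc-co-phonebk".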

3.3 Currency (cu) types

The ISO 4217 code (3-letter alpha) is used for currency.  No changes are required.

3.4 Number System (nu) types

The current CVS snapshot implementation uses CSS3 names.  This proposal changes all type names to script codes, with one exception (arabext).

 Current (CVS snapshot)  Proposed
 arabic-indic arab
 bengali beng
 cambodian khmr
 decimal latn
 devanagari deva
 gujarati gujr
 gurmukhi guru
 hebrew hebr
 kannada knda
 lao laoo
 malayalam mlym
 mongolian mong
 myanmar mymr
 oriya orya
 persian arabext
 telugu telu
 thai thai

3.5 Time Zone (tz) types

CLDR uses Olson tzids.  These IDs are usually formed as <continent>+"/"+<exemplar city> and are relatively long.  To satisfy the syntax requirements discussed in this document, we need to map these IDs uniquely to relatively short IDs.  The UN LOCODE is designed to assign unique location codes, and it satisfies most of the requirements.  A LOCODE consists of a 2-letter ISO country code and a 3-letter location code.  This proposal suggests that the 5-letter LOCODE be used as the short time zone ID if the exemplar city has an exact match in the LOCODE repertoire.  Some Olson tzids do not have a direct mapping in LOCODE.  In this case, we assign our own codes to them, using 3-4 or 6-8 letter codes to distinguish them from LOCODEs.  For the Olson tzids Etc/GMT*, this proposal suggests "UTC" + ["E" | "W"] + nn (hour offset); for example, UTCE01 means 1 hour east of UTC (Etc/GMT-1).  The proposed short ID list is attached to this document -> http://sites.google.com/site/cldr/development/design-proposals/bcp47-mapping/shorttz.txt?attredirects=0
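The rule for Etc/GMT* IDs can be sketched in Python as follows.  Note that Olson Etc/GMT IDs use the POSIX sign convention, so "Etc/GMT-1" is one hour east of UTC, matching the UTCE01 example above.  The function name is hypothetical, and since the proposal text does not say how a zero offset should be written, this sketch rejects it rather than guess.

```python
import re

_ETC_GMT_RE = re.compile(r"Etc/GMT([+-])(\d{1,2})")


def short_id_for_etc_gmt(tzid: str) -> str:
    """Map an Olson "Etc/GMT+n" or "Etc/GMT-n" ID to "UTC" + E/W + nn."""
    match = _ETC_GMT_RE.fullmatch(tzid)
    if match is None:
        raise ValueError(f"not an Etc/GMT offset ID: {tzid!r}")
    sign, hours = match.group(1), int(match.group(2))
    if hours == 0:
        # Assumption: zero offset is not covered by the rule, so reject it.
        raise ValueError("zero offset is not covered by this sketch")
    # POSIX convention: "-" in the Olson ID means east of UTC.
    direction = "E" if sign == "-" else "W"
    return f"UTC{direction}{hours:02d}"
```

For instance, "Etc/GMT+5" (five hours west of UTC) maps to "UTCW05" under this rule.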

Attachment: shorttz.txt (14k), Yoshito Umaoka, Mar 16, 2009, 7:23 PM