BCP47 Syntax Mapping

In the current LDML specification, a Unicode Locale Identifier is composed of a Unicode Language Identifier plus optional locale extensions. A Unicode Language Identifier is fully compatible with a BCP47 language tag, but the syntax of locale extensions (“@” key “=” type (“;” key “=” type)* ) is not. LDML attempts to define a systematic mapping, but because of the BCP47 subtag syntax restrictions the current definition may truncate a key or type value to 8 characters (and/or remove “-” from some type values). The current definition relies on BCP47 private use features, but we want to make locale extensions formal by writing a new RFC that reserves a singleton letter for this usage. That avoids conflicts with other private use values and lets software developers write a parser for Unicode locale extensions with confidence.

BCP 47 is undergoing a revision which should be done soon:

Once we define a formal representation of Unicode locale extensions in BCP47 syntax, we no longer have any good reason to use the @key1=type1;key2=type2… syntax for Unicode Locale Identifiers other than backward compatibility. This document proposes that we retire the proprietary syntax and migrate fully to the new syntax supported by BCP47 language tags.

There are several options for representing keyword key/type pairs in BCP47 syntax. The examples in the following proposal assume that the letter “u” is reserved for Unicode locale extensions; however, we could use any of the possible extension singletons: [0-9 a-w y z].

The table below shows the locale extension keys/values currently defined by the LDML specification.

Key/Type Definitions

key	type	Description
collation	standard	The default ordering for each language. For root it is [ UCA ] order; for each other locale it is the same as UCA ordering except for appropriate modifications to certain characters for that language. The following are additional choices for certain locales; they only have effect in those locales.
	phonebook	For a phonebook-style ordering (used in German).
	pinyin	Pinyin ordering for Latin and for CJK characters; that is, an ordering for CJK characters based on a character-by-character transliteration into a pinyin. (used in Chinese)
	traditional	For a traditional-style sort (as in Spanish)
	stroke	Pinyin ordering for Latin, stroke order for CJK characters (used in Chinese)
	direct	Hindi variant
	posix	A “C”-based locale. (no longer in CLDR data)
	big5han	Pinyin ordering for Latin, big5 charset ordering for CJK characters. (used in Chinese)
	gb2312han	Pinyin ordering for Latin, gb2312han charset ordering for CJK characters. (used in Chinese)
	unihan	Pinyin ordering for Latin, Unihan radical-stroke ordering for CJK characters. (used in Chinese)
calendar (For information on the calendar algorithms associated with the data used with the above types, see [ Calendars ].)	gregorian	(default)
	islamic alias: arabic	Astronomical Arabic
	chinese	Traditional Chinese calendar
	islamic-civil alias: civil-arabic	Civil (algorithmic) Arabic calendar
	hebrew	Traditional Hebrew Calendar
	japanese	Imperial Calendar (same as Gregorian except for the year, with one era for each Emperor)
	buddhist alias: thai-buddhist	Thai Buddhist Calendar (same as Gregorian except for the year)
	persian	Persian Calendar
	coptic	Coptic Calendar
	ethiopic	Ethiopic Calendar
collation parameters: colStrength colAlternate colBackwards colNormalization colCaseLevel colCaseFirst colHiraganaQuaternary colNumeric variableTop	Associated values as defined in: 5.14.1 <collation>	Semantics as defined in: 5.14.1 <collation>
currency (also known as a Unicode currency code )	ISO 4217 code, plus others in common use	Currency value identified by ISO 4217 code, plus others in common use. Also uses XXX as Unknown or Invalid Currency . See Appendix K: Valid Attribute Values and also [ Data Formats ]
time zone (also known as a Unicode time zone code )	TZID, plus the value: Etc/Unknown	Identification for time zone according to the TZ Database, plus the value Etc/Unknown . Unicode LDML supports all of the time zone IDs by mapping all equivalent time zone IDs to a canonical ID for translation. This canonical time zone ID is not the same as the zone.tab time zone ID found in [ Olson ]. For more information, see Section 5.9.2 Time Zone Names , Appendix F: Date Format Patterns , and Appendix J: Time Zone Display Names .

Collation Parameters

Attribute	Options	Basic Example	XML Example	Description
strength	primary (1) secondary (2) tertiary (3) quaternary (4) identical (5)	[strength 1]	strength = “ primary “	Sets the default strength for comparison, as described in the UCA.
alternate	non-ignorable shifted	[alternate non-ignorable]	alternate = “ non-ignorable “	Sets alternate handling for variable weights, as described in UCA
backwards	on off	[backwards 2]	backwards = “ on “	Sets the comparison for the second level to be backwards (“French”), as described in UCA
normalization	on off	[normalization on]	normalization = “ off “	If on , then the normal UCA algorithm is used. If off , then all strings that are in [ FCD ] will sort correctly, but others will not necessarily sort correctly. So it should only be set off if the strings to be compared are in FCD.
caseLevel	on off	[caseLevel on]	caseLevel = “ off “	If set to on, a level consisting only of case characteristics will be inserted in front of tertiary level. To ignore accents but take cases into account, set strength to primary and case level to on .
caseFirst	upper lower off	[caseFirst off]	caseFirst = “ off “	If set to upper , causes upper case to sort before lower case. If set to lower , lower case will sort before upper case. Useful for locales that have already supported ordering but require different order of cases. Affects case and tertiary levels.
hiraganaQuaternary	on off	[hiraganaQ on]	hiraganaQuaternary = “ on “	Controls special treatment of Hiragana code points on quaternary level. If turned on , Hiragana code points will get lower values than all the other non-variable code points. The strength must be greater than or equal to quaternary if you want this attribute to take effect.
numeric	on off	[numeric on]	numeric = “ on “	If set to on , any sequence of Decimal Digits (General_Category = Nd in the [ UCD ]) is sorted at a primary level with its numeric value. For example, “A-21” < “A-123”.
variableTop	uXXuYYYY	& \u00XX\uYYYY < [variable top]	variableTop = “uXXuYYYY”	The parameter value is an encoded Unicode string, with code points in hex, leading zeros removed, and ‘u’ inserted between successive elements. Sets the default value for the variable top. All the code points with primary strengths less than variable top will be considered variable, and thus affected by the alternate handling.
match-boundaries	none whole-character whole-word	n/a	match-boundaries = “whole-word”	The meaning is according to the descriptions in UTS #10 Searching .
match-style	minimal medial maximal	n/a	match-style = “medial”	The meaning is according to the descriptions in UTS #10 Searching .

1. Proposed BCP47 subtag syntax

This document proposes the syntax described by the BNF below.

locale-extensions = locale-singleton “-“ extension *(“-“ extension)

extension = key “-“ type

locale-singleton = “u”

key = 2alphanum

type = 3*8alphanum

alphanum = (ALPHA / DIGIT)

Example:

en-US-u-ca-islamicc-co-phonebk

This corresponds to the former syntax:

en-US@calendar=islamic-civil;collation=phonebook
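
The BNF above is small enough to check with a regular expression. The following is a minimal sketch, not a reference implementation: the function name parse_locale_extensions is hypothetical, it only handles the extension part of a tag (starting at the singleton “u”), and it assumes that subtags are case-insensitive and can be normalized to lowercase.

```python
import re

# Grammar from the BNF above: key = 2alphanum, type = 3*8alphanum.
_KEY = r"[A-Za-z0-9]{2}"
_TYPE = r"[A-Za-z0-9]{3,8}"
# locale-extensions = "u" followed by one or more "-key-type" pairs.
# Assumption: matching is case-insensitive.
_EXTENSIONS = re.compile(rf"u(?:-{_KEY}-{_TYPE})+", re.IGNORECASE)
_PAIR = re.compile(rf"-({_KEY})-({_TYPE})")


def parse_locale_extensions(ext):
    """Parse a string such as 'u-ca-islamicc-co-phonebk' into a dict.

    Raises ValueError if the string does not follow the proposed grammar.
    """
    if not _EXTENSIONS.fullmatch(ext):
        raise ValueError(f"not a valid locale-extensions string: {ext!r}")
    # Skip the singleton, then collect each key/type pair in lowercase.
    return {key.lower(): typ.lower() for key, typ in _PAIR.findall(ext[1:])}
```

For the example above, parse_locale_extensions("u-ca-islamicc-co-phonebk") yields a dictionary that maps "ca" to "islamicc" and "co" to "phonebk".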

Current	Proposed
collation	co
calendar	ca
currency	cu
numbers	nu
time zone	tz
colStrength	ks
colAlternate	ka
colBackwards	kb
colNormalization	kk
colCaseLevel	kc
colCaseFirst	kf
colHiraganaQuaternary	kh
colNumeric	kn
variableTop	kv

2. Keys

Key names, and only key names, are always exactly 2 characters long, and types (values) are always longer than 2 characters. This proposal defines the new canonical key names listed in the table above.

The motivation is to reduce string size and to make sure that keys and values don’t overlap syntactically.
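
Because keys and types never share a length, a reader of the tag can tell them apart without any lookup table. The sketch below illustrates this under the proposed grammar. KEY_MAP is copied from the Current/Proposed key table, and the helper name classify_subtag is hypothetical.

```python
# Canonical key names, copied from the Current/Proposed key table.
KEY_MAP = {
    "collation": "co",
    "calendar": "ca",
    "currency": "cu",
    "numbers": "nu",
    "time zone": "tz",
    "colStrength": "ks",
    "colAlternate": "ka",
    "colBackwards": "kb",
    "colNormalization": "kk",
    "colCaseLevel": "kc",
    "colCaseFirst": "kf",
    "colHiraganaQuaternary": "kh",
    "colNumeric": "kn",
    "variableTop": "kv",
}


def classify_subtag(subtag):
    """Return 'key', 'type', or 'invalid' from character set and length."""
    if not (subtag.isascii() and subtag.isalnum()):
        return "invalid"
    if len(subtag) == 2:
        return "key"  # key = 2alphanum
    if 3 <= len(subtag) <= 8:
        return "type"  # type = 3*8alphanum
    return "invalid"
```

Every proposed key in KEY_MAP classifies as a key, while a value such as "phonebk" classifies as a type and an old-style value such as "islamic-civil" is rejected.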

3. Types

3.1 Collation

3.1.1 Collation (co) types

Current	Proposed
big5han	big5han
digits-after	digitaft
direct	direct
gb2312han	gb2312
phonebook	phonebk
pinyin	pinyin
reformed	reformed
standard	standard
stroke	stroke
traditional	trad

3.1.2 Collation Strength (ks) types

Current	Proposed
primary	level1
secondary	level2
tertiary	level3
quaternary	level4
identical	identic

3.1.3 Collation Alternate (ka) types

Current	Proposed
non-ignorable	noignore
secondary	level2
shifted	shifted

3.1.4 Collation Backwards (kb) / Normalization (kk) / Case Level (kc) / Hiragana Quaternary (kh) / Numeric (kn) types

Current	Proposed
yes	true
no	false

3.1.5 Collation Case First (kf) types

Current	Proposed
upper	upper
lower	lower
no	false

3.1.6 Collation Variable Top (kv) type

The variable top parameter is specified by a code point in the format uXXuYYYY. No changes are required.
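
As an illustration of the uXXuYYYY format, the following sketch encodes and decodes a variable top value. It assumes that every code point is written as “u” followed by its hexadecimal value with leading zeros removed, and that hex digits are uppercase; neither detail is fixed by this document. The function names are hypothetical.

```python
def encode_variable_top(text):
    """Encode each code point as 'u' + hex digits without leading zeros."""
    # Assumption: uppercase hex digits, one 'u' prefix per code point.
    return "".join("u" + format(ord(ch), "X") for ch in text)


def decode_variable_top(value):
    """Inverse of encode_variable_top for well-formed input."""
    # 'u' is not a hex digit, so it is a safe separator to split on.
    return "".join(chr(int(part, 16)) for part in value.split("u")[1:])
```

Under these assumptions, a single space (U+0020) encodes as u20, and decoding that value gives back the original character.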

3.2 Calendar (ca)

Current	Proposed
buddhist	buddhist
coptic	coptic
ethiopic	ethiopic
ethiopic-amete-alem	ethiopaa
chinese	chinese
gregorian	gregory
hebrew	hebrew
indian	indian
islamic	islamic
islamic-civil	islamicc
japanese	japanese
persian	persian
roc	roc
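
The key and type tables can be combined into a converter from the former @key=type syntax to the proposed syntax. This is a rough sketch under several assumptions: the function name to_bcp47 is hypothetical, only the calendar and collation keys are covered, the type maps list only some of the entries whose names change, and any other type is passed through unchanged as long as it fits the 3*8alphanum rule.

```python
import re

# Subset of the key table (assumption: only these two keys are handled).
KEY_MAP = {"calendar": "ca", "collation": "co"}

# Some of the types whose names change; unlisted types are assumed unchanged.
TYPE_MAP = {
    "ca": {
        "gregorian": "gregory",
        "islamic-civil": "islamicc",
        "ethiopic-amete-alem": "ethiopaa",
    },
    "co": {
        "phonebook": "phonebk",
        "traditional": "trad",
        "gb2312han": "gb2312",
        "digits-after": "digitaft",
    },
}

_TYPE = re.compile(r"[A-Za-z0-9]{3,8}")  # type = 3*8alphanum


def to_bcp47(old_id):
    """Convert 'en-US@calendar=islamic-civil' style IDs to the new syntax."""
    base, _, keywords = old_id.partition("@")
    if not keywords:
        return base
    subtags = []
    for pair in keywords.split(";"):
        key, _, value = pair.partition("=")
        short_key = KEY_MAP[key]  # KeyError for keys outside this sketch
        short_type = TYPE_MAP[short_key].get(value, value)
        if not _TYPE.fullmatch(short_type):
            raise ValueError(f"type {value!r} has no valid short form")
        subtags += [short_key, short_type]
    return "-".join([base, "u"] + subtags)
```

With these tables, to_bcp47("en-US@calendar=islamic-civil;collation=phonebook") produces en-US-u-ca-islamicc-co-phonebk, matching the example in section 1.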

3.3 Currency (cu) types

ISO4217 code (3-letter alpha) is used for currency. No changes required.

3.4 Number System (nu) types

The current CVS snapshot implementation uses CSS3 names. This proposal changes all type names to script codes, with one exception (arabext).

Current (CVS snapshot)	Proposed
arabic-indic	arab
bengali	beng
cambodian	khmr
decimal	latn
devanagari	deva
gujarati	gujr
gurmukhi	guru
hebrew	hebr
kannada	knda
lao	laoo
malayalam	mlym
mongolian	mong
myanmar	mymr
oriya	orya
persian	arabext
telugu	telu
thai	thai

3.5 Time Zone (tz) types

CLDR uses Olson tzids. These IDs are usually formed as <continent>+”/”+<exemplar city> and are relatively long. To satisfy the syntax requirements discussed in this document, we need to map these IDs uniquely to relatively short IDs. The UN LOCODE is designed to assign unique location codes, and it satisfies most of the requirements. A LOCODE consists of a 2-letter ISO country code and a 3-letter location code. This proposal suggests that the 5-letter LOCODE be used as the short time zone ID if the exemplar city has an exact match in the LOCODE repertoire. Some Olson tzids do not have a direct mapping in LOCODE. In that case, we assign our own codes to them, using 3-4 or 6-8 letter codes to distinguish them from LOCODEs. For the Olson tzids Etc/GMT*, this proposal suggests “UTC” + [“E” | “W”] + nn (hour offset); for example, UTCE01 means 1 hour east of UTC (Etc/GMT-1). The proposed short ID list is attached to this document.
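
The Etc/GMT* rule can be sketched as a small function. The name etc_gmt_short_id is hypothetical, and the sketch only covers signed, non-zero hour offsets; how Etc/GMT itself or a zero offset should be written is not stated here, so those inputs return None. Note that the sign in Etc/GMT IDs is inverted, so Etc/GMT-1 lies east of UTC, as in the example above.

```python
import re

_ETC_GMT = re.compile(r"Etc/GMT([+-])(\d{1,2})")


def etc_gmt_short_id(tzid):
    """Map 'Etc/GMT-1' style IDs to 'UTCE01' style short IDs, else None."""
    match = _ETC_GMT.fullmatch(tzid)
    if match is None:
        return None
    sign, hours = match.group(1), int(match.group(2))
    if hours == 0:
        return None  # zero offset is not covered by this sketch
    # The sign is inverted in Etc/GMT IDs: '-' means east of UTC.
    direction = "E" if sign == "-" else "W"
    return f"UTC{direction}{hours:02d}"
```

The results are 6 alphanumeric characters long, so they satisfy the 3*8alphanum type rule and cannot be confused with a 5-letter LOCODE.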